The Cost of Poor Data in AI Systems
Industry: Recruitment / HR
Problem: AI matching of candidates to jobs
Root Cause: "Resume Parsing" is harder than it looks.
The Theory
We built a system to match 50,000 candidate resumes to open job descriptions. The plan: embed each resume into a vector database, embed each job description the same way, and pair them with semantic search.
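For concreteness, here is a minimal sketch of that plan in Python. The post doesn't name the stack, so the sentence-transformers model and the in-memory score matrix standing in for the vector DB are assumptions, not the production setup.

```python
# Minimal sketch of the original matching plan. The model name and the
# in-memory score matrix (standing in for a vector DB) are assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

resumes = [
    "Senior Java developer, 8 years building Spring microservices.",
    "Barista and shift manager, latte art, POS systems, scheduling.",
]
jobs = [
    "Coffee Shop Manager: run daily operations of a busy cafe.",
    "Backend Engineer: Java, microservices, AWS.",
]

# Embed both sides. Normalized vectors let a plain dot product
# act as cosine similarity.
resume_vecs = model.encode(resumes, normalize_embeddings=True)
job_vecs = model.encode(jobs, normalize_embeddings=True)

# Semantic search: score every resume against every job, take the best.
scores = resume_vecs @ job_vecs.T
for i, j in enumerate(scores.argmax(axis=1)):
    print(f"resume {i} -> job {j} (score {scores[i, j]:.2f})")
```

With clean text on both sides, this is genuinely all the model layer needs to be, which is the point of the story.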
The Reality of Data
The client assured us their data was "clean." In reality:
- 30% of resumes were scanned images (PDFs of photographed pages). OCR failed on them, so the extracted text was empty.
- 20% of resumes were in non-English languages that the embedding model handled poorly.
- Job descriptions were inconsistent. Some listed fields like "Salary: Competitive" that carry no usable signal. (All three problems were detectable up front; see the audit sketch below.)
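A short audit script run before any modeling would have caught these issues. This is a sketch under assumptions: it presumes the text has already been extracted from each file, and it uses the langdetect package as one possible language check.

```python
# A pre-modeling data audit, sketched under assumptions: text extraction
# from PDF/DOCX has already happened, and langdetect is one way to
# flag non-English files.
from langdetect import detect, LangDetectException

MIN_WORDS = 50  # the same threshold the eventual fix settled on

def audit_resume(text: str) -> list[str]:
    """Return data-quality flags for one extracted resume text."""
    flags = []
    if len(text.split()) < MIN_WORDS:
        # Scanned-image PDFs come back from failed OCR nearly empty.
        flags.append("empty_or_failed_ocr")
        return flags
    try:
        if detect(text) != "en":
            # The embedding model handled non-English text poorly.
            flags.append("non_english")
    except LangDetectException:
        flags.append("language_undetectable")
    return flags
```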
The Cost
Because the input data was garbage, the AI matches were garbage. A Senior Java Developer was matched with a "Coffee Shop Manager" role because the resume mentioned "Java" (the island) and "Manager."
Financial Impact:
- Two months of engineering time wasted tuning the model when the real problem was the data.
- Loss of trust from the recruiters.
The Fix: Data Pre-Processing Pipeline
We stopped the AI project. We built a "Data Sanitation" pipeline instead (sketched in code after the list):
- Filter: Reject any resume < 50 words.
- Standardize: Convert all formats to Markdown.
- Enrich: Use a cheaper LLM to extract "Years of Experience" and "Primary Tech Stack" into structured JSON before embedding.
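Here is what those three stages can look like in Python. The `to_markdown` converter is a hypothetical stand-in (the post only says "convert all formats to Markdown"), and gpt-4o-mini is an assumed choice for the "cheaper LLM"; the OpenAI JSON-mode call is one way to force structured output.

```python
# Sketch of the three-stage pipeline. `to_markdown` is a hypothetical
# converter, and "gpt-4o-mini" is an assumed stand-in for "a cheaper LLM".
import json
from openai import OpenAI

client = OpenAI()

def sanitize(raw_text: str, to_markdown) -> dict | None:
    """Run one resume through Filter -> Standardize -> Enrich; None = rejected."""
    # 1. Filter: reject near-empty extractions (the failed-OCR cases).
    if len(raw_text.split()) < 50:
        return None

    # 2. Standardize: caller supplies the format-to-Markdown conversion.
    markdown = to_markdown(raw_text)

    # 3. Enrich: a cheap LLM pulls structured fields out before embedding.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "From this resume, extract years_of_experience (number) and "
                "primary_tech_stack (list of strings) as a JSON object:\n\n"
                + markdown
            ),
        }],
    )
    fields = json.loads(resp.choices[0].message.content)
    return {"markdown": markdown, **fields}
```

The design point of the Enrich stage: once "Years of Experience" is a number in JSON, you can hard-filter on it instead of hoping the embedding distance encodes it.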
Once we cleaned the data, the original (simple) model worked perfectly.
Takeaway: Stop buying bigger models. Start cleaning your CSVs.