The Cost of Poor Data in AI Systems
Industry: Recruitment / HR
Problem: AI matching of candidates to jobs
Root Cause: "Resume Parsing" is harder than it looks.
The Theory
We built a system to match 50,000 candidate resumes to open job descriptions. The plan: embed each resume into a vector database, embed each job description the same way, and pair them with semantic search.
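For concreteness, here is a minimal sketch of that plan in Python. The post doesn't name the stack, so the sentence-transformers model and the in-memory score matrix standing in for the vector DB are assumptions, not the production setup.

```python
# Minimal sketch of the original matching plan. The model name and the
# in-memory score matrix (standing in for a vector DB) are assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

resumes = [
    "Senior Java developer, 8 years building Spring microservices.",
    "Barista and shift manager, latte art, POS systems, scheduling.",
]
jobs = [
    "Coffee Shop Manager: run daily operations of a busy cafe.",
    "Backend Engineer: Java, microservices, AWS.",
]

# Embed both sides. Normalized vectors let a plain dot product
# act as cosine similarity.
resume_vecs = model.encode(resumes, normalize_embeddings=True)
job_vecs = model.encode(jobs, normalize_embeddings=True)

# Semantic search: score every resume against every job, take the best.
scores = resume_vecs @ job_vecs.T
for i, j in enumerate(scores.argmax(axis=1)):
    print(f"resume {i} -> job {j} (score {scores[i, j]:.2f})")
```

With clean text on both sides, this is genuinely all the model layer needs to be, which is the point of the story.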
The Reality of Data
The client assured us their data was "clean." In reality:
- 30% of resumes were scanned images (PDFs of photographed pages). OCR failed on them, so the extracted text was empty.
- 20% of resumes were in non-English languages that the embedding model handled poorly.
- Job descriptions were inconsistent. Some listed fields like "Salary: Competitive" that carry no usable signal. (All three problems were detectable up front; see the audit sketch below.)
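A short audit script run before any modeling would have caught these issues. This is a sketch under assumptions: it presumes the text has already been extracted from each file, and it uses the langdetect package as one possible language check.

```python
# A pre-modeling data audit, sketched under assumptions: text extraction
# from PDF/DOCX has already happened, and langdetect is one way to
# flag non-English files.
from langdetect import detect, LangDetectException

MIN_WORDS = 50  # the same threshold the eventual fix settled on

def audit_resume(text: str) -> list[str]:
    """Return data-quality flags for one extracted resume text."""
    flags = []
    if len(text.split()) < MIN_WORDS:
        # Scanned-image PDFs come back from failed OCR nearly empty.
        flags.append("empty_or_failed_ocr")
        return flags
    try:
        if detect(text) != "en":
            # The embedding model handled non-English text poorly.
            flags.append("non_english")
    except LangDetectException:
        flags.append("language_undetectable")
    return flags
```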
The Cost
Because the input data was garbage, the AI matches were garbage. A Senior Java Developer was matched with a "Coffee Shop Manager" role because the resume mentioned "Java" (the island) and "Manager."
Financial Impact:
- Two months of engineering time wasted tuning the model when the real problem was the data.
- Loss of trust from the recruiters.
The Fix: Data Pre-Processing Pipeline
We stopped the AI project. We built a "Data Sanitation" pipeline instead (sketched in code after the list):
- Filter: Reject any resume < 50 words.
- Standardize: Convert all formats to Markdown.
- Enrich: Use a cheaper LLM to extract "Years of Experience" and "Primary Tech Stack" into structured JSON before embedding.
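Here is what those three stages can look like in Python. The `to_markdown` converter is a hypothetical stand-in (the post only says "convert all formats to Markdown"), and gpt-4o-mini is an assumed choice for the "cheaper LLM"; the OpenAI JSON-mode call is one way to force structured output.

```python
# Sketch of the three-stage pipeline. `to_markdown` is a hypothetical
# converter, and "gpt-4o-mini" is an assumed stand-in for "a cheaper LLM".
import json
from openai import OpenAI

client = OpenAI()

def sanitize(raw_text: str, to_markdown) -> dict | None:
    """Run one resume through Filter -> Standardize -> Enrich; None = rejected."""
    # 1. Filter: reject near-empty extractions (the failed-OCR cases).
    if len(raw_text.split()) < 50:
        return None

    # 2. Standardize: caller supplies the format-to-Markdown conversion.
    markdown = to_markdown(raw_text)

    # 3. Enrich: a cheap LLM pulls structured fields out before embedding.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "From this resume, extract years_of_experience (number) and "
                "primary_tech_stack (list of strings) as a JSON object:\n\n"
                + markdown
            ),
        }],
    )
    fields = json.loads(resp.choices[0].message.content)
    return {"markdown": markdown, **fields}
```

The design point of the Enrich stage: once "Years of Experience" is a number in JSON, you can hard-filter on it instead of hoping the embedding distance encodes it.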
Once we cleaned the data, the original (simple) model worked perfectly.
Takeaway: Stop buying bigger models. Start cleaning your CSVs.