Why we built our own AI evaluation harness
Off-the-shelf evals miss the things that actually matter in production. Here's what we built instead, why, and what it caught in our first month running it against client workflows.
Most off-the-shelf evaluation frameworks measure something close to what you care about, but rarely the exact thing.
The problem with generic evals
Generic evals benchmark against academic datasets that have almost nothing to do with the messy, contextual prompts that real users send to your workflow.
Three issues come up constantly when teams rely on generic evals:
- Off-the-shelf datasets don't reflect your actual prompt distribution
- Pass/fail criteria are coarse, so they miss subtle quality regressions
- You can't easily tie results back to specific client workflows that broke
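To make the point about coarse pass/fail criteria concrete, here is a minimal sketch. The 0 to 1 score scale, the 0.7 threshold, and the two scores are all made up for illustration; none of them come from a real run.

```python
PASS_THRESHOLD = 0.7  # assumed cut-off, for illustration only


def passes(score: float) -> bool:
    """Coarse pass/fail check of the kind a generic eval might report."""
    return score >= PASS_THRESHOLD


def score_drop(before: float, after: float) -> float:
    """Graded view: how far the score moved between two model versions."""
    return round(before - after, 3)


# Hypothetical scores for the same prompt before and after a model update.
before, after = 0.92, 0.74

print(passes(before), passes(after))  # both pass, so pass/fail reports no change
print(score_drop(before, after))      # the graded score exposes the drop
```

Both versions clear the threshold, so a pass/fail report looks identical across the update, while the graded score shows that quality moved.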
What we built instead
Three components
- A prompt set sampled from real client workflows, anonymised and tagged by use case
- A grading rubric specific to each workflow, with different criteria for sales prep vs knowledge retrieval vs summarisation
- A weekly run that flags any output that scores below baseline so we can investigate before the client notices
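The three components can be sketched roughly as follows. This is an illustration under assumptions, not our actual harness: `EvalCase`, `RUBRICS`, `flag_regressions`, the two placeholder grading functions, and the sample data are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    case_id: str
    use_case: str  # tag, e.g. "summarisation"
    prompt: str    # anonymised prompt sampled from a real workflow


# One rubric per use case. Each maps a model output to a score in [0, 1].
# These two checks are placeholders, not real grading criteria.
def grade_summary(output: str) -> float:
    return 1.0 if len(output.split()) <= 50 else 0.5


def grade_retrieval(output: str) -> float:
    return 1.0 if "source:" in output.lower() else 0.0


RUBRICS: Dict[str, Callable[[str], float]] = {
    "summarisation": grade_summary,
    "knowledge_retrieval": grade_retrieval,
}


def flag_regressions(
    cases: List[EvalCase],
    outputs: Dict[str, str],
    baselines: Dict[str, float],
) -> List[str]:
    """Return the ids of cases whose rubric score falls below their baseline."""
    flagged = []
    for case in cases:
        score = RUBRICS[case.use_case](outputs[case.case_id])
        if score < baselines[case.case_id]:
            flagged.append(case.case_id)
    return flagged


# Made-up data standing in for one weekly run.
cases = [
    EvalCase("c1", "summarisation", "Summarise this call transcript."),
    EvalCase("c2", "knowledge_retrieval", "What is our refund policy?"),
]
outputs = {"c1": "Short summary of the call.", "c2": "Refunds take 14 days."}
baselines = {"c1": 1.0, "c2": 1.0}

print(flag_regressions(cases, outputs, baselines))  # ['c2']
```

Keeping the use-case tag on every case is what lets a flagged id be traced back to the workflow it came from.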
What it caught in month one
The most useful regression we caught was subtle: a model update changed how it formatted bullet lists in a way that broke downstream parsing. No academic eval would have flagged it. Our client-specific one did.
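A check for this kind of regression might look like the sketch below. The details are assumed for illustration: we pretend the downstream parser only accepts bullets written as "- item" and that the update switched the marker to "*". The function names and sample outputs are hypothetical.

```python
import re
from typing import List

# Assumption: the downstream parser only recognises bullets written as "- item".
BULLET = re.compile(r"^- (.+)$")


def parse_bullets(text: str) -> List[str]:
    """Mimic a strict downstream parser: keep only lines that match "- item"."""
    items = []
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            items.append(match.group(1))
    return items


def bullet_format_ok(text: str, expected_items: int) -> bool:
    """Rubric check: the output must yield the expected number of parsed bullets."""
    return len(parse_bullets(text)) == expected_items


old_output = "- Pricing\n- Timeline\n- Next steps"
new_output = "* Pricing\n* Timeline\n* Next steps"  # same content, different marker

print(bullet_format_ok(old_output, 3))  # True
print(bullet_format_ok(new_output, 3))  # False
```

The content of the two outputs is the same, which is why a quality-focused benchmark would score them equally; only a check that encodes what the downstream consumer expects tells them apart.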
The takeaway
If you're running AI workflows in production for anyone other than yourself, generic evals are not enough. Build a thin custom harness against your real data. It pays back the first time it catches something you would have missed.