Why Evals Are Non-Negotiable
You can't improve what you don't measure. LLM systems are probabilistic by nature โ small prompt changes can have large, unexpected effects on output quality, cost, and safety. Without a systematic evaluation framework, you're flying blind.
Production evals must connect model behavior directly to business KPIs. We blend automatic metrics with human-in-the-loop spot checks on high-value segments.
Our Evaluation Loop
Step 1: Define Task Rubric
Before writing a single eval, define what "good" means for your task. For a customer support bot, that might be: relevant answer, correct tone, no hallucinated policy details, and response under 150 words. Write explicit pass/fail rules for each dimension.
Step 2: Build a Diverse Eval Set
Generate diverse eval sets with synthetic augmentation. Cover happy paths, edge cases, adversarial inputs, and domain-specific jargon. Aim for 100โ500 examples minimum before going to production.
Step 3: Score Automatically + Human Audit
Use programmatic judges (regex, classifier models, GPT-4 as judge) for the bulk of evaluation. Add human audits on the 10% of cases with the lowest confidence scores and on random samples weekly.
Step 4: Track Cost/Performance Drift Weekly
Log every production request with metadata: model version, input/output tokens, latency, and a quality score. Set up dashboards to track drift over time. Cost surprises usually mean prompt bloat or increased usage โ catch them early.
Metrics We Track
- Relevance score โ does the response address the actual question?
- Faithfulness โ is every claim grounded in the provided context (for RAG)?
- Toxicity โ automated screening for harmful content
- Latency P50/P95/P99 โ user experience depends on tail latencies
- Cost per request โ track against budget weekly
- Task completion rate โ did the user actually get what they needed?
Common Pitfalls
- Eval set that's too narrow โ only covers cases you already handle well
- Using the same model to judge itself (biased results)
- Not versioning eval sets โ can't compare results across model updates
- Optimizing for eval metrics at the expense of real user satisfaction
Tools We Recommend
- PromptFoo โ open-source LLM evaluation framework
- LangSmith โ tracing + evaluation from LangChain team
- RAGAS โ specifically for RAG pipeline evaluation
- Custom rubric + GPT-4 judge โ for nuanced task-specific evaluation