Playbook

Evaluating LLM Systems

From prompts to business metrics — how we run evaluations to ensure quality, safety and cost control in production.

August 2024 · 8 min read
Tags: LLM, Evals, Quality, Safety, Cost tracking

Why Evals Are Non-Negotiable

You can't improve what you don't measure. LLM systems are probabilistic by nature — small prompt changes can have large, unexpected effects on output quality, cost, and safety. Without a systematic evaluation framework, you're flying blind.

Production evals must connect model behavior directly to business KPIs. We blend automatic metrics with human-in-the-loop spot checks on high-value segments.

Our Evaluation Loop

Step 1: Define Task Rubric

Before writing a single eval, define what "good" means for your task. For a customer support bot, that might be: relevant answer, correct tone, no hallucinated policy details, and response under 150 words. Write explicit pass/fail rules for each dimension.
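
A rubric like the one above can be written down as a list of named pass/fail checks. The following is a minimal sketch, not our production code: the rule names, the forbidden policy phrases, and the tone word list are hypothetical placeholders, and relevance is left out because it usually needs a model-based judge rather than a regex.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[str], bool]  # returns True when the response passes


# Hypothetical examples of policy claims the bot must never invent.
FORBIDDEN_CLAIMS = re.compile(r"lifetime warranty|100% refund", re.IGNORECASE)
# Hypothetical examples of words that signal the wrong tone.
RUDE_WORDS = re.compile(r"\b(stupid|obviously)\b", re.IGNORECASE)

RUBRIC = [
    Rule("under_150_words", lambda r: len(r.split()) < 150),
    Rule("no_invented_policy", lambda r: FORBIDDEN_CLAIMS.search(r) is None),
    Rule("polite_tone", lambda r: RUDE_WORDS.search(r) is None),
]


def grade(response: str) -> dict[str, bool]:
    """Return a pass/fail verdict for every rubric dimension."""
    return {rule.name: rule.check(response) for rule in RUBRIC}


def passes(response: str) -> bool:
    """A response passes only if every rule passes."""
    return all(grade(response).values())
```

Keeping one verdict per dimension, rather than a single blended score, makes it easier to see which part of the rubric a prompt change broke.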

Step 2: Build a Diverse Eval Set

Start from real examples and widen them with synthetic augmentation. Cover happy paths, edge cases, adversarial inputs, and domain-specific jargon. Aim for at least 100 examples, and closer to 500 where you can, before going to production.
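
One way to keep an eval set honest is to tag each case with a category and check coverage before every run. The sketch below assumes a hypothetical `EvalCase` record and four category labels taken from the list above; the default minimum of 100 mirrors the lower bound mentioned in this step.

```python
from collections import Counter
from dataclasses import dataclass

# Category labels are assumptions that mirror the coverage list above.
REQUIRED_CATEGORIES = ("happy_path", "edge_case", "adversarial", "domain_jargon")


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    category: str


def coverage_report(cases, min_total=100):
    """Summarize eval set size and per-category coverage."""
    counts = Counter(case.category for case in cases)
    missing = [cat for cat in REQUIRED_CATEGORIES if counts[cat] == 0]
    return {
        "total": len(cases),
        "per_category": {cat: counts[cat] for cat in REQUIRED_CATEGORIES},
        "missing_categories": missing,
        "large_enough": len(cases) >= min_total,
    }
```

A report with a non-empty `missing_categories` list is a signal that the set only covers cases the system already handles well.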

Step 3: Score Automatically + Human Audit

Use programmatic judges (regex checks, classifier models, GPT-4 as a judge) for the bulk of evaluation. Each week, add human audits on the 10% of cases with the lowest judge confidence scores, plus a random sample of the rest.
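
The audit selection can be sketched as follows, assuming each scored case is a `(case_id, confidence)` pair produced by whatever judge you use. The function name, the default random sample size of 20, and the `seed` parameter are illustrative choices, not part of any library.

```python
import math
import random


def select_for_audit(scored, low_conf_percent=10, random_n=20, seed=None):
    """Pick cases for human review.

    scored: list of (case_id, confidence) pairs.
    Returns (lowest-confidence cases, random sample of the remaining cases).
    """
    ranked = sorted(scored, key=lambda pair: pair[1])  # lowest confidence first
    k = math.ceil(len(ranked) * low_conf_percent / 100)
    low_conf = [case_id for case_id, _ in ranked[:k]]
    remaining = [case_id for case_id, _ in ranked[k:]]
    rng = random.Random(seed)  # seedable so an audit batch can be reproduced
    sample = rng.sample(remaining, min(random_n, len(remaining)))
    return low_conf, sample
```

Drawing the random sample from the cases outside the low-confidence bucket avoids auditing the same item twice and gives a view of errors the judge was confident about.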

Step 4: Track Cost/Performance Drift Weekly

Log every production request with metadata: model version, input/output tokens, latency, and a quality score. Set up dashboards to track drift over time. Cost surprises usually mean prompt bloat or increased usage — catch them early.
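
As a rough illustration, the log record and a week-over-week drift check might look like this. The `RequestLog` fields follow the metadata listed above; the 15% threshold and the function names are assumptions for the sketch, and the check assumes last week's mean for each metric is non-zero.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RequestLog:
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    quality_score: float  # 0.0 to 1.0, from whichever judge you use


# Numeric fields to compare week over week.
TRACKED_FIELDS = ("input_tokens", "output_tokens", "latency_ms", "quality_score")


def weekly_drift(last_week, this_week, threshold=0.15):
    """Flag metrics whose weekly mean moved by more than `threshold` (relative)."""
    report = {}
    for field in TRACKED_FIELDS:
        previous = mean(getattr(r, field) for r in last_week)
        current = mean(getattr(r, field) for r in this_week)
        change = (current - previous) / previous  # assumes previous != 0
        report[field] = {"change": change, "flagged": abs(change) > threshold}
    return report
```

A flagged jump in `input_tokens` with stable traffic is the typical signature of prompt bloat.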

Metrics We Track

  • Relevance score — does the response address the actual question?
  • Faithfulness — is every claim grounded in the provided context (for RAG)?
  • Toxicity — automated screening for harmful content
  • Latency P50/P95/P99 — user experience depends on tail latencies
  • Cost per request — track against budget weekly
  • Task completion rate — did the user actually get what they needed?
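
Two of these metrics are plain arithmetic and can be computed directly from request logs. The sketch below uses `statistics.quantiles` from the Python standard library for latency percentiles; the per-1,000-token prices passed to `cost_per_request` are hypothetical parameters, so substitute your provider's actual rates.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """Return P50/P95/P99 from a list of latencies (needs at least 2 values)."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_request(input_tokens, output_tokens,
                     usd_per_1k_input, usd_per_1k_output):
    """Cost of one request in USD, given per-1,000-token prices."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)
```

Reporting P95 and P99 alongside the median matters because a healthy P50 can hide a slow tail that a meaningful share of users still experience.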

Common Pitfalls

  • Eval set that's too narrow — only covers cases you already handle well
  • Using the same model to judge itself (biased results)
  • Not versioning eval sets — can't compare results across model updates
  • Optimizing for eval metrics at the expense of real user satisfaction
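
Two of these pitfalls are cheap to guard against in code. The following sketch derives a version identifier from the eval set's content, so results can be tied to the exact set they were run on, and refuses to run when the judge is the model under test. It assumes each case is a JSON-serializable dict with a unique `"id"` key; the function names are illustrative.

```python
import hashlib
import json


def eval_set_version(cases):
    """Content-based version ID: same cases in any order give the same ID."""
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def check_judge(model_under_test, judge_model):
    """Raise if a model would be grading its own output."""
    if model_under_test == judge_model:
        raise ValueError(
            f"Judge and model under test are both {judge_model!r}; "
            "use a different judge to reduce self-preference bias."
        )
```

Storing the version ID next to every eval result makes it clear when two runs are not comparable because the underlying set changed.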

Tools We Recommend

  • promptfoo: open-source LLM evaluation framework
  • LangSmith: tracing and evaluation from the LangChain team
  • RAGAS: evaluation built specifically for RAG pipelines
  • Custom rubric + GPT-4 judge: for nuanced, task-specific evaluation
