You swapped the system prompt last Tuesday. Outputs look fine. But you don't actually know if they're better or worse — and that's the eval problem in one sentence.

Most teams shipping LLM features run on vibes: skim three outputs, decide it looks reasonable, merge. When someone bumps a model version or rewords a system prompt, nobody knows whether quality moved. The fix is not a platform. The fix is forty lines of Python and five test cases you wrote yourself.

The real problem isn't compute — it's that there's no harness

You'll see headlines about evals being a "compute bottleneck." Ignore them for product work. That framing comes from research labs reproducing MMLU and HellaSwag across hundreds of checkpoints. Your eval question is narrower: does this prompt, on this task, with this model, still meet the bar I set? That has nothing to do with GPU hours.

The actual bottleneck is that most product teams have no harness at all. Anthropic's engineering team puts the cost plainly: with evals, you can qualify a new model in days; without, the same upgrade takes weeks of manual spot-checking, and you still don't trust the result. The gap compounds, too: every regression you catch becomes a new test case, and the harness gets sharper over time. No harness, no compounding.

Braintrust, LangSmith, and friends are useful tools. They're also the wrong first step. A dashboard over zero test fixtures is still zero signal. Build the loop in plain Python first. Add the platform when you've outgrown a JSON file.

Three components, nothing more

A minimal harness has exactly three parts:

  1. Test fixtures — JSON file with input + rubric per case
  2. A scoring function — judge-model call that returns a score and rationale
  3. A pass-rate gate — exit non-zero if the score drops below threshold

Anything else is premature. You don't need a labeled dataset of 500 examples. Five well-chosen cases that cover your known failure modes beat 500 generic ones, because they're about your system.

Here's the fixture format. expected_behavior is a rubric string, not a literal answer — the judge reads it like a checklist.

evals/fixtures.json — five cases is enough to start
[
  {
    "id": "ticket-001-billing-refund",
    "input": "I was charged twice for my June subscription. Please refund.",
    "expected_behavior": "Classifies as 'billing'. Does NOT promise a refund. Does NOT invent a ticket number.",
    "tags": ["billing", "refund"]
  },
  {
    "id": "ticket-002-ambiguous",
    "input": "It's broken again, can you fix it?",
    "expected_behavior": "Classifies as 'needs_clarification'. Asks one specific follow-up question. Does NOT guess a category.",
    "tags": ["ambiguous"]
  }
]

Notice the rubrics name failure modes explicitly ("does NOT promise a refund"). That's deliberate — vague rubrics are why judge models hand out 0.9 to everything.
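
For contrast, here is the kind of rubric that earns reflexive 0.9s, next to a checkable one pulled from the fixture above (the vague version is invented for illustration):

Too vague:  "Responds helpfully and accurately to the customer."
Checkable:  "Classifies as 'billing'. Does NOT promise a refund. Does NOT invent a ticket number."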

Building the harness

The harness loads fixtures, runs your production prompt, hands the output plus rubric to a judge model, collects a score, and gates on the average. One file, one command.

eval_harness.py — wire this into CI
import json
import re
import sys
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-sonnet-4-6-20260218"
THRESHOLD = 0.75

JUDGE_SYSTEM = """You are a strict evaluator. Read the rubric and the output.
Return JSON only: {"score": <float 0.0-1.0>, "rationale": "<one sentence>"}.
Score 1.0 only if every rubric criterion is met. Score 0.0 if any criterion fails."""

def run_feature(user_input: str) -> str:
    # Replace with your actual production call
    resp = client.messages.create(
        model=MODEL, max_tokens=400,
        system="Classify the support ticket and respond.",
        messages=[{"role": "user", "content": user_input}],
    )
    return resp.content[0].text

def parse_verdict(text: str) -> dict:
    # Judges occasionally wrap JSON in ```json fences or add stray prose.
    # Extract the first {...} block and clamp the score; degrade gracefully.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {"score": 0.0, "rationale": f"unparseable judge output: {text[:80]}"}
    try:
        data = json.loads(match.group(0))
        score = max(0.0, min(1.0, float(data.get("score", 0.0))))
        return {"score": score, "rationale": str(data.get("rationale", ""))}
    except (ValueError, TypeError) as e:
        return {"score": 0.0, "rationale": f"parse error: {e}"}

def judge(output: str, rubric: str) -> dict:
    msg = client.messages.create(
        model=MODEL, max_tokens=300, system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content":
            f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{output}"}],
    )
    return parse_verdict(msg.content[0].text)

def main():
    with open("evals/fixtures.json") as f:
        cases = json.load(f)
    if not cases:
        print("No fixtures found.")
        sys.exit(1)
    scores = []
    for c in cases:
        out = run_feature(c["input"])
        verdict = judge(out, c["expected_behavior"])
        print(f"{c['id']}: {verdict['score']:.2f}{verdict['rationale']}")
        scores.append(verdict["score"])
    avg = sum(scores) / len(scores)
    print(f"\nPASS RATE: {avg:.2f} (threshold {THRESHOLD})")
    sys.exit(0 if avg >= THRESHOLD else 1)

if __name__ == "__main__":
    main()

Run it: python eval_harness.py. Add it as a CI step that runs after deploy-to-staging. If the average score drops below THRESHOLD, the build fails and the prompt change doesn't promote.
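
Here's what a failing run might look like (case results invented for illustration); the non-zero exit is what stops the promotion:

ticket-001-billing-refund: 1.00 - Classifies as billing, promises nothing, invents nothing.
ticket-002-ambiguous: 0.00 - Guessed 'hardware' instead of asking a clarifying question.

PASS RATE: 0.50 (threshold 0.75)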

LLM-as-judge is not cheating. String equality and ROUGE break the moment your model paraphrases — a correct answer can score zero. The judge handles paraphrase, partial correctness, and open-ended outputs. That's the whole point.
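
A quick sketch of that failure mode, using a made-up paraphrased output:

expected = "billing"
output = "Sounds like a duplicate charge, so I've filed this under billing and flagged the payments team."

exact = 1.0 if output.strip().lower() == expected else 0.0  # 0.0, even though the classification is right
# The rubric judge checks "classifies as 'billing'" against the same output and passes it.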

Gotchas that will burn you on day one

  1. The same MODEL constant drives both the feature and the judge. Models tend to grade their own output leniently, so point judge() at a different model than run_feature() when you can.
  2. An average over five cases hides a disaster: four perfect scores and one zero still clears 0.75. Read the per-case lines, not just the final number.
  3. Judge scores are not deterministic. A run that lands at 0.74 today and 0.76 tomorrow means the threshold is sitting on the noise, not that the prompt regressed.
  4. Judges don't always return clean JSON, which is why parse_verdict extracts the first {...} block instead of calling json.loads directly. Keep the defensive parsing even when it looks paranoid.

What to do Monday morning

Pick one LLM feature you shipped in the last 30 days. Write five test cases — pick the inputs you remember being nervous about. Use the JSON format above. Drop eval_harness.py into your repo, point it at the fixtures, run it once against the current prompt.

If it passes cleanly, add it to CI as a post-deploy step. If it surfaces a failure, you just found a regression nobody noticed. Either way, you're done flying blind.
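
When it does surface a failure, fold that input back into evals/fixtures.json so the harness compounds the way the earlier section promised. A hypothetical entry (the id, wording, and tags are made up):

{
  "id": "ticket-003-regression-invented-ticket-number",
  "input": "What's the ticket number for the refund you promised me last week?",
  "expected_behavior": "Does NOT invent a ticket number and does NOT confirm that a refund was promised. Asks for the account email or order ID so the ticket can be looked up.",
  "tags": ["regression", "hallucination"]
}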

Five test cases and forty lines of Python is enough to stop guessing whether prompt changes help or hurt. Open your repo Monday at 9 AM, copy the harness, write the five fixtures for whatever feature scares you most — if it catches one silent regression this week, it's earned its place in CI forever.