LLM-as-a-Judge Evaluation

Add calibrated soft judgments to a RAG evaluation trace without letting an LLM override deterministic evidence gates.

policy-answerer-v4-eval proved a hard fact: the current, permitted policy source requires a rollback runbook, not continued deployment. A claim ledger can block unsupported "keep deploying" advice. It can't decide which of two safe replies from a large language model (LLM) is clearer for the engineer.

Consider these two answers to Maya's payment-service incident:

Candidate	Reply	Hard evidence status
`brief`	"Payment-service crossed the rollback threshold; run the rollback runbook under DEP-27."	Supported
`actionable`	"Payment-service crossed the rollback threshold; run the DEP-27 rollback runbook, open an incident note, and page the release lead before retrying."	Supported

Both respect the selected evidence. The remaining question is softer: does the added next step make the second reply more useful without making it wordy or confusing?

An LLM-as-a-judge uses another LLM as an evaluator for quality that can't be fully decided by an exact assertion. It can compare clarity, helpfulness, or tone under a rubric. It must not decide whether restricted context was allowed or whether a policy claim is supported. Those remain deterministic gates.

Zheng et al. found that strong LLM judges could exceed 80% agreement with human preferences on their MT-Bench and Chatbot Arena experiments. The same work reports position bias, verbosity bias, self-enhancement bias (a judge favoring answers it generated itself), and limited reasoning ability [1] (Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, https://arxiv.org/abs/2306.05685). A judge is useful measurement equipment, not ground truth.

Keep facts outside the judge

The boundary matters more than the model name. In a deploy-policy answer pipeline, different questions need different evaluators:

Question	Correct evaluator	Why
Did selected evidence pass access and freshness checks?	Code gate	A soft score must never admit forbidden evidence.
Does the answer advise continued deployment when DEP-27 requires rollback?	Claim-to-source verifier	Policy truth is inspectable.
Which supported answer is clearer and more actionable?	Calibrated judge or human	Reasonable reviewers can compare phrasing.
Is the case sensitive, ambiguous, or outside rubric coverage?	Human reviewer	Uncertainty is part of the decision.

Only the third row, the calibrated judge, is new here; the first two rows carry forward unchanged. The overview below shows the complete contract: deterministic gates decide eligibility, swapped comparisons test preference stability, and calibration plus bias probes decide whether the resulting metric may guide a release.

Figure: Three-stage LLM judge flow (hard gate, anonymous judge, calibration gate). Policy truth decides eligibility first. Pairwise judging ranks only supported replies, and calibration still decides whether that metric can guide release decisions.

Separate retrieval, grounding, and answer relevance

For Retrieval-Augmented Generation (RAG) applications, evaluate three relationships separately:

Context Relevance: Evaluates whether the retrieved context is relevant and sufficient to answer the user's query. This isolates retrieval-quality problems from generation flaws.
Groundedness / Faithfulness: Evaluates whether the generated response is entirely supported by the retrieved context. A low groundedness score suggests the response contains claims that aren't present in the retrieved documents, for example claims drawn from the model's parametric memory.
Answer Relevance: Evaluates whether the final response directly addresses the user's original query. This detects cases where the model generates a factual but unhelpful or off-topic reply.

These measurements help distinguish retriever failures from generator failures. None replaces deterministic authorization, freshness, or claim-support checks.
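As a minimal sketch of how the three measurements can separate retriever failures from generator failures, the cell below maps a set of scores to the stage most likely at fault. The `RagScores` type, the 0-to-1 score scale, the `0.5` threshold, and the stage names are all assumptions made for illustration, not part of the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagScores:
    # Hypothetical 0-to-1 scores produced by separate soft evaluators.
    context_relevance: float  # retrieved context vs. user query
    groundedness: float       # response vs. retrieved context
    answer_relevance: float   # response vs. user query


def likely_failure_stage(scores: RagScores, threshold: float = 0.5) -> str:
    # Check retrieval first: a response can't be grounded in context
    # that was never relevant or sufficient.
    if scores.context_relevance < threshold:
        return "retriever"
    if scores.groundedness < threshold:
        return "generator_unsupported_claims"
    if scores.answer_relevance < threshold:
        return "generator_off_topic"
    return "no_soft_failure_flagged"


examples = {
    "weak_retrieval": RagScores(0.2, 0.9, 0.8),
    "ungrounded": RagScores(0.9, 0.3, 0.9),
    "off_topic": RagScores(0.9, 0.9, 0.1),
    "healthy": RagScores(0.9, 0.9, 0.9),
}

for label, scores in examples.items():
    print(f"{label}: {likely_failure_stage(scores)}")
```

The ordering of the checks is a design choice: a retrieval problem is reported before any generator problem because the later scores are hard to interpret when the context itself was poor.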

Self-preference and same-family judge bias

LLM judges are conditional probability engines, not objective standards. Panickssery et al. found self-preference in several evaluated judge settings: evaluators could recognize and favor their own generations, even without explicit model labels. Effect size varied by model and task, so treat self-preference as a bias to measure rather than a universal ordering rule [2] (LLM Evaluators Recognize and Favor Their Own Generations, https://arxiv.org/abs/2404.13076).

This bias can mask regressions during a model swap or upgrade. Mitigations include:

Anonymize candidates: Strip model-specific markers, templates, or signatures before evaluation.
Cross-model evaluation: Run judges from model families different from the generators and compare their verdicts; a different family is a probe, not automatic neutrality.
Calibrate with humans: Regularly compare the automated judge's scores against a human-graded gold dataset to measure drift.
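The anonymization step can be sketched as a small cleaning function. The two marker patterns below (a leading `[model=...]` or `[prompt=...]` tag and a trailing `Generated by ...` signature) are hypothetical; a real pipeline would list the markers that actually appear in its own outputs.

```python
import re

# Hypothetical marker patterns; replace them with markers seen in your outputs.
MARKER_PATTERNS = (
    # Leading tag such as "[model=some-name] " or "[prompt=v2] ".
    re.compile(r"^\[(?:model|prompt)=[^\]]+\]\s*", re.IGNORECASE),
    # Trailing signature such as " Generated by some-name".
    re.compile(r"\s*Generated by \S+\.?\s*$", re.IGNORECASE),
)


def anonymize(reply: str) -> str:
    """Strip known identity markers so the judge sees only the reply text."""
    cleaned = reply.strip()
    for pattern in MARKER_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    return cleaned.strip()


raw_reply = (
    "[model=policy-answerer-v4] Run the DEP-27 rollback runbook. "
    "Generated by policy-answerer-v4"
)
print(anonymize(raw_reply))
```

Pattern stripping only removes explicit labels. It doesn't remove stylistic fingerprints, which is why the human calibration step above still matters.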

Start with two supported answers

The lab uses an abbreviated hard gate so the boundary is visible in one screen. The previous lesson built the complete evidence-path validator; here we reuse its result and add one unsafe counterexample to prove it still wins over any soft score.

supported-candidates.py

from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerTrace:
    request_id: str
    selected_source_id: str
    selected_version: str
    admissible: bool
    allowed_action: str

trace = AnswerTrace(
    request_id="incident-48291",
    selected_source_id="dep-27-rollback-threshold",
    selected_version="deploy-policy/2026-04-01",
    admissible=True,
    allowed_action="rollback",
)

answers = {
    "brief": "Payment-service crossed the rollback threshold; run the rollback runbook under DEP-27.",
    "actionable": (
        "Payment-service crossed the rollback threshold; run the DEP-27 rollback runbook, "
        "open an incident note, and page the release lead before retrying."
    ),
    "unsafe_continue": "Keep deploying payment-service while you monitor the graph.",
}

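def hard_failures(answer: str, answer_trace: AnswerTrace) -> list[str]:
    """Return every deterministic failure; an empty list means eligible for soft judging."""
    failures: list[str] = []
    lowered = answer.lower()
    # Evidence gate: inadmissible evidence blocks every answer built on it.
    if not answer_trace.admissible:
        failures.append("selected evidence isn't admissible")
    # Abbreviated claim check: substring matching stands in for the
    # complete evidence-path validator from the previous lesson.
    if "keep deploying" in lowered or "continue deploying" in lowered:
        failures.append("answer advises unsupported continued deployment")
    # The answer must name the action the policy source supports.
    if answer_trace.allowed_action not in lowered:
        failures.append("answer omits supported rollback action")
    return failures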

safe_candidates = [
    name for name, answer in answers.items() if not hard_failures(answer, trace)
]

assert safe_candidates == ["brief", "actionable"]
assert hard_failures(answers["unsafe_continue"], trace) == [
    "answer advises unsupported continued deployment",
    "answer omits supported rollback action",
]

print(f"Evidence version: {trace.selected_version}")
print(f"Candidates eligible for soft judging: {safe_candidates}")
print(f"Blocked answer: {hard_failures(answers['unsafe_continue'], trace)[0]}")

Output

Evidence version: deploy-policy/2026-04-01
Candidates eligible for soft judging: ['brief', 'actionable']
Blocked answer: answer advises unsupported continued deployment

If a judge later says unsafe_continue sounds friendlier, the answer stays blocked. That invariant makes the judge safe to experiment with.

Choose the evaluator before writing the rubric

Not every evaluation question should be routed to an LLM. Choose the measurement tool from the decision you need to make.

Two soft-evaluation shapes matter here:

Shape	Question	Best fit	Main control
Pointwise	Does one safe answer satisfy anchored quality criteria?	Monitoring a single output when no direct alternative exists	Calibrate category or score anchors against human labels
Pairwise	Which of two safe answers better satisfies the rubric?	Comparing prompt or model variants on the same case	Swap candidate order, allow ties, and normalize slots back to reply identity

The DEP-27 example uses pairwise judging because brief and actionable are two safe variants of the same answer.

choose-the-evaluator.py

@dataclass(frozen=True)
class EvaluationQuestion:
    name: str
    has_exact_oracle: bool
    compares_two_safe_variants: bool
    requires_policy_owner: bool = False

def choose_evaluator(question: EvaluationQuestion) -> str:
    if question.has_exact_oracle:
        return "deterministic_gate"
    if question.requires_policy_owner:
        return "human_review"
    if question.compares_two_safe_variants:
        return "pairwise_judge_with_calibration"
    return "pointwise_judge_with_calibration"

questions = [
    EvaluationQuestion("rollback authorization", True, False),
    EvaluationQuestion("clearer supported reply", False, True),
    EvaluationQuestion("new exception policy", False, False, True),
]
choices = {item.name: choose_evaluator(item) for item in questions}

assert choices["rollback authorization"] == "deterministic_gate"
assert choices["clearer supported reply"] == "pairwise_judge_with_calibration"
assert choices["new exception policy"] == "human_review"

for name, choice in choices.items():
    print(f"{name}: {choice}")

Output

rollback authorization: deterministic_gate
clearer supported reply: pairwise_judge_with_calibration
new exception policy: human_review

Write a rubric for the remaining question

A vague instruction such as "pick the better answer" lets the evaluator reward length, politeness, or formatting arbitrarily. A rubric should name what remains undecided after hard checks and include anchors for a tie.

Criterion	Better answer	Tie condition	Outside judge scope
Actionability	Gives a useful, low-friction next step	Both give the same useful next step	Whether the rollback threshold was crossed
Clarity	States remedy plainly without internal clutter	Both are equally clear	Whether policy source is current
Concision	Adds useful information without repetition	Difference is stylistic only	Whether continued deployment is allowed

G-Eval studied LLM evaluation with task-specific criteria and a form-filling output design [3] (G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, https://arxiv.org/abs/2303.16634). A criterion and a structured answer are easier to audit than a free-form impression.

The next cell builds the packet that would be sent to a model API. Notice two decisions:

Candidate names are anonymous slots, not model or prompt-version names.
Protected facts are displayed as already validated context, not handed to the judge for re-litigation.

pairwise-judge-packet.py

from dataclasses import asdict

@dataclass(frozen=True)
class Criterion:
    name: str
    question: str
    tie_anchor: str

rubric = (
    Criterion(
        name="actionability",
        question="Does the reply give a safe, useful next action?",
        tie_anchor="Neither answer gives a meaningfully better next action.",
    ),
    Criterion(
        name="clarity",
        question="Is the rollback outcome easy for an engineer to understand?",
        tie_anchor="Both answers communicate the outcome equally clearly.",
    ),
    Criterion(
        name="concision",
        question="Does added wording contribute useful information rather than repetition?",
        tie_anchor="The extra wording doesn't change usefulness.",
    ),
)

def pairwise_packet(first_name: str, second_name: str) -> dict[str, object]:
    assert first_name in safe_candidates and second_name in safe_candidates
    return {
        "case_id": trace.request_id,
        "validated_context": {
            "source_id": trace.selected_source_id,
            "version": trace.selected_version,
            "protected_fact": "The required action is rollback, not continued deployment.",
            "hard_checks": "passed before judging",
        },
        "candidates": {
            "A": answers[first_name],
            "B": answers[second_name],
        },
        "rubric": [asdict(item) for item in rubric],
        "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
    }

packet_ab = pairwise_packet("brief", "actionable")
assert "brief" not in packet_ab["candidates"]
assert "actionable" not in packet_ab["candidates"]

print(f"Context gate: {packet_ab['validated_context']['hard_checks']}")
print(f"Candidate slots: {list(packet_ab['candidates'])}")
print(f"Rubric criteria: {[item['name'] for item in packet_ab['rubric']]}")
print(f"Verdicts: {packet_ab['allowed_verdicts']}")

Output

Context gate: passed before judging
Candidate slots: ['A', 'B']
Rubric criteria: ['actionability', 'clarity', 'concision']
Verdicts: ['A', 'B', 'tie', 'needs_human_review']

In a deployed evaluator, serialize this packet, request structured output from the chosen judge model, and store the raw packet plus parsed verdict. Don't rely on a hidden prompt that can't be reproduced during a regression.
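One way to make a judge request reproducible is to serialize the packet deterministically and store a fingerprint next to it. The sketch below assumes a trimmed stand-in packet; `serialize_packet`, `packet_fingerprint`, the `stored_run` layout, and the judge model name are hypothetical, and no provider API is called.

```python
import hashlib
import json


def serialize_packet(packet: dict[str, object]) -> str:
    # sort_keys and fixed separators make equal packets serialize identically.
    return json.dumps(packet, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def packet_fingerprint(packet: dict[str, object]) -> str:
    # SHA-256 of the canonical text identifies the exact request that was judged.
    return hashlib.sha256(serialize_packet(packet).encode("utf-8")).hexdigest()


# Trimmed stand-in for the packet built by the lab.
example_packet = {
    "case_id": "incident-48291",
    "candidates": {
        "A": "Run the rollback runbook under DEP-27.",
        "B": "Run the DEP-27 rollback runbook and page the release lead.",
    },
    "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
}

stored_run = {
    "packet_sha256": packet_fingerprint(example_packet),
    "packet_json": serialize_packet(example_packet),
    "judge_model": "hypothetical-judge-model",  # placeholder, not a real model name
    "parsed_verdict": None,  # filled in after the judge response is validated
}

print(f"Fingerprint length: {len(stored_run['packet_sha256'])}")
```

During a regression, recomputing the fingerprint from the stored packet confirms that a changed verdict came from the judge and not from a silently edited request.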

Treat the judge output as untrusted data

The judge is another model. Its JSON can be malformed, its evidence can be irrelevant, and its preference can contradict its own rationale. Parse and validate it just as you would validate a tool result from an agent.

validate-judge-result.py

@dataclass(frozen=True)
class JudgeResult:
    order: tuple[str, str]
    preferred_slot: str
    evidence: tuple[str, ...]
    needs_human_review: bool

def parse_judge_result(
    order: tuple[str, str],
    raw: dict[str, object],
) -> JudgeResult:
    verdict = raw.get("verdict")
    allowed = {"A", "B", "tie", "needs_human_review"}
    if not isinstance(verdict, str) or verdict not in allowed:
        raise ValueError(f"unsupported verdict: {verdict}")

    raw_evidence = raw.get("evidence", [])
    if not isinstance(raw_evidence, list) or not all(
        isinstance(item, str) for item in raw_evidence
    ):
        raise ValueError("evidence must be a list of strings")
    evidence = tuple(raw_evidence)
    if verdict in {"A", "B"} and not evidence:
        raise ValueError("decisive verdict requires criterion evidence")

    return JudgeResult(
        order=order,
        preferred_slot=verdict,
        evidence=evidence,
        needs_human_review=verdict == "needs_human_review",
    )

first_pass = parse_judge_result(
    ("brief", "actionable"),
    {
        "verdict": "B",
        "evidence": [
            "B gives the engineer a next action; A stops after the rollback requirement."
        ],
    },
)

assert first_pass.preferred_slot == "B"

try:
    parse_judge_result(
        ("brief", "actionable"),
        {"verdict": "B", "evidence": "B has a next action."},
    )
except ValueError as exc:
    print(f"Malformed fixture blocked: {exc}")
else:
    raise AssertionError("malformed evidence container must be rejected")

print(f"First pass preference slot: {first_pass.preferred_slot}")
print(f"Recorded rationale: {first_pass.evidence[0]}")

Output

Malformed fixture blocked: evidence must be a list of strings
First pass preference slot: B
Recorded rationale: B gives the engineer a next action; A stops after the rollback requirement.

The output above is a stored fixture, not proof that a particular hosted model will agree. The engineering problem is to make an evaluator run observable and testable before plugging in any provider.
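The fixtures above start from Python dictionaries, but a hosted judge returns text. A minimal sketch of the decoding step that would sit in front of a validator is shown below. Routing undecodable responses to `needs_human_review` is one possible policy and an assumption here; raising an error and retrying is another. The `load_judge_json` and `escalation` names are hypothetical.

```python
import json


def escalation() -> dict[str, object]:
    # Fresh dictionary each time so callers can't share a mutable evidence list.
    return {"verdict": "needs_human_review", "evidence": []}


def load_judge_json(raw_text: str) -> dict[str, object]:
    """Decode a judge response, escalating anything that isn't a JSON object."""
    try:
        decoded = json.loads(raw_text)
    except json.JSONDecodeError:
        return escalation()
    if not isinstance(decoded, dict):
        return escalation()
    return decoded


well_formed = load_judge_json('{"verdict": "B", "evidence": ["B adds a next step."]}')
free_text = load_judge_json("I prefer B because it is more helpful.")
wrong_shape = load_judge_json('["B"]')

print(f"Well-formed verdict: {well_formed['verdict']}")
print(f"Free-text verdict: {free_text['verdict']}")
print(f"Wrong-shape verdict: {wrong_shape['verdict']}")
```

The decoded dictionary is still untrusted: it only proves the response was a JSON object, so verdict and evidence checks still have to run afterwards.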

A preference must survive swapping A and B

Pairwise comparison is useful because the evaluator chooses between two concrete alternatives. It also exposes position bias: a judge may prefer the first slot instead of the better reply. Zheng et al. identify this bias in LLM judging [1] (Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, https://arxiv.org/abs/2306.05685), so every pairwise comparison in this lab is run twice with the candidates swapped.

Figure: Order-swap comparison. Pass one picks slot B and pass two picks slot A, both normalizing to the actionable reply for a stable winner, while a slot-following judge picks A in both orders and routes the mismatch to review. Swap the slots, then normalize back to reply identity. If both passes still pick the same reply, the verdict is stable. If they only keep picking slot A, route it to review.

The detail that matters is normalization. A verdict of B in the first pass and A in the swapped pass can represent the same underlying answer.

aggregate-order-swaps.py

def preferred_candidate(result: JudgeResult) -> str | None:
    if result.preferred_slot not in {"A", "B"}:
        return None
    index = 0 if result.preferred_slot == "A" else 1
    return result.order[index]

def aggregate_swaps(first: JudgeResult, swapped: JudgeResult) -> dict[str, object]:
    if first.needs_human_review or swapped.needs_human_review:
        return {"winner": "needs_human_review", "status": "needs_human_review"}
    if first.preferred_slot == "tie" or swapped.preferred_slot == "tie":
        return {"winner": "tie", "status": "tie"}

    first_choice = preferred_candidate(first)
    second_choice = preferred_candidate(swapped)
    if first_choice is not None and first_choice == second_choice:
        return {"winner": first_choice, "status": "stable"}
    return {"winner": "tie", "status": "unstable_after_swap"}

stable_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {
        "verdict": "A",
        "evidence": ["A preserves the safe remedy and supplies a clear next step."],
    },
)
slot_sensitive_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {
        "verdict": "B",
        "evidence": ["B appears in my preferred slot."],
    },
)
tie_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {"verdict": "tie", "evidence": []},
)
review_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {"verdict": "needs_human_review", "evidence": []},
)

stable = aggregate_swaps(first_pass, stable_second_pass)
unstable = aggregate_swaps(first_pass, slot_sensitive_second_pass)
explicit_tie = aggregate_swaps(first_pass, tie_second_pass)
review = aggregate_swaps(first_pass, review_second_pass)

assert stable == {"winner": "actionable", "status": "stable"}
assert unstable == {"winner": "tie", "status": "unstable_after_swap"}
assert explicit_tie == {"winner": "tie", "status": "tie"}
assert review == {"winner": "needs_human_review", "status": "needs_human_review"}

print(f"Stable comparison: {stable}")
print(f"Slot-sensitive comparison: {unstable}")
print(f"Explicit tie: {explicit_tie}")
print(f"Review route: {review}")

Output

Stable comparison: {'winner': 'actionable', 'status': 'stable'}
Slot-sensitive comparison: {'winner': 'tie', 'status': 'unstable_after_swap'}
Explicit tie: {'winner': 'tie', 'status': 'tie'}
Review route: {'winner': 'needs_human_review', 'status': 'needs_human_review'}

Keep those states separate in your report. An explicit tie is a valid rubric outcome, needs_human_review is an escalation, and unstable_after_swap is evidence that slot order changed a decisive preference.
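A short sketch shows why the report should count by status and not by winner. The outcome rows below are hypothetical stored results in the same shape the aggregation step returns; `count_by` is an illustrative helper.

```python
from collections import Counter

# Hypothetical stored outcomes from several swapped comparisons.
comparison_outcomes = [
    {"winner": "actionable", "status": "stable"},
    {"winner": "tie", "status": "unstable_after_swap"},
    {"winner": "tie", "status": "tie"},
    {"winner": "needs_human_review", "status": "needs_human_review"},
    {"winner": "actionable", "status": "stable"},
]


def count_by(results: list[dict[str, str]], key: str) -> dict[str, int]:
    # Tally how often each value of `key` appears across the outcomes.
    return dict(Counter(item[key] for item in results))


by_status = count_by(comparison_outcomes, "status")
by_winner = count_by(comparison_outcomes, "winner")

print(f"By status: {by_status}")
print(f"By winner: {by_winner}")
```

Counting by winner merges the explicit tie with the unstable comparison because both record `tie` as the winner. Counting by status keeps the position-bias signal visible.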

Probe the biases you expect

One clean comparison doesn't establish that a judge is trustworthy. Build probe cases where an undesirable shortcut is easy to observe.

Figure: Bias probes for an LLM judge showing position, padding, identity, and ambiguity checks. Probe the shortcuts you expect. Here slot swapping passes, padding fails, identity stays masked, ambiguity routes to review, and the failed padding probe blocks promotion.

Probe	Controlled change	Suspicious signal	Response
Position	Swap only slots `A` and `B`	Winner follows slot	Record unstable result
Length	Add apologies and repeated policy text, no new help	Padded copy wins	Tighten concision rubric and track length
Identity	Reveal prompt or model labels in one run only	Preference changes	Keep candidates anonymous
Ambiguity	Compare two equally useful rewrites	Forced winner	Permit ties or human review

Length isn't only a hypothetical confounder. Length-Controlled AlpacaEval proposes a regression-based adjustment intended to answer what the preference would have been if the compared answers had equal length [4] (Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, https://arxiv.org/abs/2404.04475). In a local product eval, the smaller first step is to add same-information length probes and report when padding wins.

These fixtures model stored judge returns from two probes. The code doesn't pretend to detect bias from text alone; it asks whether the judge failed a case whose expected behavior you defined in advance.

bias-probe-report.py

@dataclass(frozen=True)
class ProbeResult:
    name: str
    expected_winner: str
    observed_winner: str

padded = (
    answers["brief"]
    + " We sincerely apologize for the inconvenience. "
    + "We appreciate your patience while we coordinate the rollback."
)

probes = [
    ProbeResult(
        name="position_swap",
        expected_winner="actionable",
        observed_winner=str(stable["winner"]),
    ),
    ProbeResult(
        name="same_information_padding",
        expected_winner="brief",
        observed_winner="padded",
    ),
]

failed_probes = [
    probe.name for probe in probes if probe.expected_winner != probe.observed_winner
]

assert "rollback" in padded.lower()
assert failed_probes == ["same_information_padding"]

print(f"Probes run: {len(probes)}")
print(f"Failed probes: {failed_probes}")
print("Action: block metric promotion until padding preference is fixed")

Output

Probes run: 2
Failed probes: ['same_information_padding']
Action: block metric promotion until padding preference is fixed

This is a useful negative result. Releasing a judge because it produced pleasing scores would make the evaluation system worse. A failed probe tells you exactly what to repair.
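Tracking length, as the Length row of the probe table suggests, can start with a simple descriptive rate: how often the longer reply wins among decided comparisons. This is a rough sketch and not the regression-based adjustment from Length-Controlled AlpacaEval. The `LengthObservation` type and the word counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthObservation:
    case_id: str
    winner_words: int  # word count of the reply the judge preferred
    loser_words: int   # word count of the other reply


def longer_wins_rate(rows: list[LengthObservation]) -> float:
    # Equal-length pairs carry no length signal, so leave them out.
    decided = [row for row in rows if row.winner_words != row.loser_words]
    if not decided:
        return 0.0
    return sum(row.winner_words > row.loser_words for row in decided) / len(decided)


observations = [
    LengthObservation("p1", winner_words=30, loser_words=12),
    LengthObservation("p2", winner_words=12, loser_words=30),
    LengthObservation("p3", winner_words=25, loser_words=12),
    LengthObservation("p4", winner_words=40, loser_words=20),
    LengthObservation("p5", winner_words=18, loser_words=18),
]

print(f"Longer reply wins: {longer_wins_rate(observations):.2f}")
```

A high rate isn't proof of verbosity bias, because longer replies can be better. It becomes suspicious when it stays high on same-information padding probes, where the extra words add nothing.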

Calibrate the measurement against people

Hard gates have test oracles. Soft judgments need a labeled calibration set: humans apply the same rubric to a representative sample, then the judge is scored against those labels.

Raw agreement is easy to understand, but can overstate reliability when one label dominates. Cohen's kappa corrects for agreement expected from each rater's label frequencies [5] (A Coefficient of Agreement for Nominal Scales, https://doi.org/10.1177/001316446002000104):

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Here, $p_o$ is observed agreement and $p_e$ is agreement expected from label prevalence. Kappa isn't a universal release threshold. Your baseline is human-human agreement on the same rubric and the same workflow slices.

This tiny calibration set is intentionally too small to approve a real metric. It shows the computation and demonstrates why a promising number alone can't release an evaluator.

calibrate-against-human-labels.py

from collections import Counter

@dataclass(frozen=True)
class LabeledDecision:
    case_id: str
    slice_name: str
    human: str
    judge: str

calibration_rows = [
    LabeledDecision("r1", "rollback", "actionable", "actionable"),
    LabeledDecision("r2", "rollback", "brief", "brief"),
    LabeledDecision("r3", "rollback", "tie", "tie"),
    LabeledDecision("r4", "rollback", "actionable", "actionable"),
    LabeledDecision("r5", "address_change", "brief", "brief"),
    LabeledDecision("r6", "address_change", "tie", "actionable"),
    LabeledDecision("r7", "address_change", "actionable", "brief"),
    LabeledDecision("r8", "address_change", "brief", "brief"),
]

def raw_agreement(rows: list[LabeledDecision]) -> float:
    return sum(row.human == row.judge for row in rows) / len(rows)

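def cohens_kappa(rows: list[LabeledDecision]) -> float:
    labels = {row.human for row in rows} | {row.judge for row in rows}
    total = len(rows)
    human_counts = Counter(row.human for row in rows)
    judge_counts = Counter(row.judge for row in rows)
    observed = raw_agreement(rows)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own label frequencies.
    expected = sum(
        human_counts[label] / total * judge_counts[label] / total
        for label in labels
    )
    # Kappa is undefined when both raters use a single shared label.
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1.0")
    return (observed - expected) / (1.0 - expected)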

agreement = raw_agreement(calibration_rows)
kappa = cohens_kappa(calibration_rows)
assert agreement == 0.75

print(f"Calibration rows: {len(calibration_rows)}")
print(f"Raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.3f}")
print("Release evidence: insufficient sample; collect labeled slices")

Output

Calibration rows: 8
Raw agreement: 0.75
Cohen's kappa: 0.610
Release evidence: insufficient sample; collect labeled slices
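To see why kappa is reported next to raw agreement, consider a hypothetical set where one label dominates and the judge always returns that label. The labels below are invented for illustration, and the two helper functions are standalone versions that take plain lists.

```python
from collections import Counter


def raw_agreement(human: list[str], judge: list[str]) -> float:
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    total = len(human)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    observed = raw_agreement(human, judge)
    # Chance agreement from each rater's own label frequencies.
    expected = sum(
        (human_counts[label] / total) * (judge_counts[label] / total)
        for label in set(human) | set(judge)
    )
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: humans prefer "actionable" nine times out of ten,
# and the judge answers "actionable" every time without discriminating.
human_labels = ["actionable"] * 9 + ["brief"]
judge_labels = ["actionable"] * 10

agreement = raw_agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)

print(f"Raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement is 0.90, yet chance agreement is also 0.90, so kappa is 0: the judge matched people only because one label dominates.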

An aggregate can conceal the exact problem that needs attention. Report the calibration set by workflow slice before allowing the judge metric to guide any experiment.

calibration-by-workflow-slice.py

def agreement_by_slice(rows: list[LabeledDecision]) -> dict[str, float]:
    grouped: dict[str, list[LabeledDecision]] = {}
    for row in rows:
        grouped.setdefault(row.slice_name, []).append(row)
    return {name: raw_agreement(items) for name, items in grouped.items()}

slice_agreement = agreement_by_slice(calibration_rows)
weak_slices = [
    name for name, score in slice_agreement.items() if score < 0.75
]

assert slice_agreement["rollback"] == 1.0
assert slice_agreement["address_change"] == 0.5
assert weak_slices == ["address_change"]

for name, score in slice_agreement.items():
    print(f"{name}: agreement={score:.2f}")
print(f"Slices requiring review: {weak_slices}")

Output

rollback: agreement=1.00
address_change: agreement=0.50
Slices requiring review: ['address_change']

For an actual evaluation program:

Freeze a rubric and collect human labels for easy wins, real ties, and known failures.
Include workflow slices such as rollback, access review, and address change.
Record human-human agreement before comparing the judge to people.
Re-run calibration after prompt, judge-model, rubric, or traffic-distribution changes.
Escalate slices where agreement or bias probes fail, even if aggregate agreement looks healthy.
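The human-human baseline in the checklist above can be sketched as a comparison of two agreement figures. The second annotator's labels are hypothetical, as are the `agreement` helper and the variable names; the first annotator and judge columns reuse the label pattern from the calibration rows.

```python
def agreement(first: list[str], second: list[str]) -> float:
    # Share of cases where two label lists agree, position by position.
    if len(first) != len(second) or not first:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


human_1 = ["actionable", "brief", "tie", "actionable", "brief", "tie", "actionable", "brief"]
# Hypothetical second annotator who disagrees with the first on one case.
human_2 = ["actionable", "brief", "tie", "actionable", "brief", "actionable", "actionable", "brief"]
judge = ["actionable", "brief", "tie", "actionable", "brief", "actionable", "brief", "brief"]

human_baseline = agreement(human_1, human_2)
judge_vs_human = agreement(judge, human_1)
gap = human_baseline - judge_vs_human

print(f"Human-human agreement: {human_baseline:.3f}")
print(f"Judge-human agreement: {judge_vs_human:.3f}")
print(f"Gap to human baseline: {gap:.3f}")
```

The baseline sets the ceiling: if two people applying the same rubric agree only some of the time, a judge can't be expected to agree with either of them more often than that.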

Conversation quality still needs the trace

Once a developer conversation has multiple turns, a fluent final reply can conceal a bad evidence path. A judge packet should include relevant conversation turns, selected evidence identifiers, hard-gate outcomes, and the safe candidates being compared.

Figure: Conversation judging flow where the same developer turns split into two evidence traces: current deploy policy passes admissibility and reaches anonymous soft judging, while stale policy is blocked before any soft score. The same turns can route differently once evidence versions diverge. Current policy reaches soft judging; stale policy stops before any semantic score is produced.

The next cell blocks a conversation before semantic judging if its trace isn't admissible. This is the same contract as the single-turn example, applied to a fuller packet.

trace-aware-conversation-packet.py

@dataclass(frozen=True)
class ConversationBundle:
    turns: tuple[str, ...]
    answer_trace: AnswerTrace
    candidate_names: tuple[str, str]

def route_bundle(bundle: ConversationBundle) -> str:
    if not bundle.answer_trace.admissible:
        return "blocked_before_judge"
    for name in bundle.candidate_names:
        if hard_failures(answers[name], bundle.answer_trace):
            return "blocked_before_judge"
    return "ready_for_soft_judge"

safe_bundle = ConversationBundle(
    turns=(
        "Engineer: Payment-service crossed the rollback threshold.",
        "Maya: I found the DEP-27 rollback policy.",
        "Engineer: What should I do before retrying the deploy?",
    ),
    answer_trace=trace,
    candidate_names=("brief", "actionable"),
)
stale_bundle = ConversationBundle(
    turns=safe_bundle.turns,
    answer_trace=AnswerTrace(
        request_id=trace.request_id,
        selected_source_id=trace.selected_source_id,
        selected_version="deploy-policy/2025-01-01",
        admissible=False,
        allowed_action="rollback",
    ),
    candidate_names=("brief", "actionable"),
)

assert route_bundle(safe_bundle) == "ready_for_soft_judge"
assert route_bundle(stale_bundle) == "blocked_before_judge"

print(f"Current policy bundle: {route_bundle(safe_bundle)}")
print(f"Stale policy bundle: {route_bundle(stale_bundle)}")

Output

Current policy bundle: ready_for_soft_judge
Stale policy bundle: blocked_before_judge

Use judges offline before letting them guide changes

LLM judging is usually most defensible as an offline experiment metric: compare prompt versions or model releases over a frozen dataset, investigate disagreements, and let humans approve consequential changes. It's rarely a sound basis for a real-time policy decision affecting one engineer.

Define an explicit promotion contract. The numbers below are illustrative requirements for this lab, not universal industry thresholds:

Release evidence	Lab requirement	Current lab state
Every candidate passed deterministic policy gates	Required	Pass
Labeled calibration rows	At least 50	8
Known bias probes	All pass	Length probe fails
Human review path	Required	Defined

judge-metric-promotion-gate.py

@dataclass(frozen=True)
class MetricPromotion:
    hard_gate_passed: bool
    calibration_count: int
    minimum_calibration_count: int
    failed_bias_probes: tuple[str, ...]
    has_human_review_path: bool

def promotion_failures(promotion: MetricPromotion) -> list[str]:
    failures: list[str] = []
    if not promotion.hard_gate_passed:
        failures.append("hard policy checks failed")
    if promotion.calibration_count < promotion.minimum_calibration_count:
        failures.append("calibration set is too small")
    if promotion.failed_bias_probes:
        failures.append("judge failed a bias probe")
    if not promotion.has_human_review_path:
        failures.append("human escalation path is missing")
    return failures

promotion = MetricPromotion(
    hard_gate_passed=True,
    calibration_count=len(calibration_rows),
    minimum_calibration_count=50,
    failed_bias_probes=tuple(failed_probes),
    has_human_review_path=True,
)
failures = promotion_failures(promotion)

assert failures == [
    "calibration set is too small",
    "judge failed a bias probe",
]

print("Metric promotion: BLOCKED")
for failure in failures:
    print(f"- {failure}")
print("Next work: label more cases and repair length sensitivity")

Output

Metric promotion: BLOCKED
- calibration set is too small
- judge failed a bias probe
Next work: label more cases and repair length sensitivity

A blocked promotion is the correct result. The lab has produced a useful candidate preference, but it hasn't established that its judge deserves to influence prompt selection across real developer workflows.

A practical evaluation report

When you implement this pattern in a real project, store a report with these sections:

Report section	Evidence to retain	Decision it supports
Hard-gate results	Source IDs, versions, claim failures	Which answers are ineligible
Rubric contract	Criteria, anchors, allowed verdicts	What the judge was asked to measure
Raw judge runs	Both slot orders and rationale snippets	Whether preference is reproducible
Bias probes	Position, length, identity, tie cases	Whether known shortcuts remain
Calibration	Human labels, per-slice agreement, kappa	Whether metric matches reviewers
Promotion decision	Failed requirements and owner	Whether new metric may guide release

The scientist's habit is to evaluate the evaluator. A judge score is one observation; a calibrated, stress-tested metric with recorded failure modes is evidence.

Mastery check

Mastery outcomes

Skill	Evidence from the lab
Separate exact policy truth from soft quality	Unsupported continue-deploy answer fails deterministic checks before judging.
Build a reproducible judge request	Packet keeps candidates anonymous, records rubric anchors, and requests structured verdicts.
Treat judge output as untrusted data	Parser rejects malformed evidence and preserves ties plus escalation.
Detect slot and verbosity shortcuts	Order swaps normalize candidate identity; probes fail when padded wording wins.
Calibrate before promotion	Human labels, per-slice agreement, Cohen's kappa, and explicit promotion requirements keep a demo from becoming a release metric.
Preserve trace provenance	Conversation packet carries policy identity, version, and hard-gate outcomes into offline review.

Evaluation rubric

Keeps policy truth in deterministic gates and sends only supported answers to the judge
Parses structured verdicts as untrusted data and preserves ties, escalation, and swap instability separately
Probes position and verbosity shortcuts before promoting the metric
Compares judge decisions with human labels by workflow slice
Blocks metric promotion when calibration, trace, or bias-probe evidence is inadequate

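Follow-up questions

Why shouldn't a high judge score release the keep-deploying answer?
The judge selects B when candidates are (brief, actionable) and A when candidates are (actionable, brief). Is that stable?
One swapped run returns tie. Should the report mark the pair as unstable_after_swap?
Your judge agrees with human labels on six of eight rows and has kappa 0.610. Can you approve its use across support workflows?
Why include the selected policy version in a conversation judge packet after hard checks already passed?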

Common pitfalls

The judge is asked to authorize policy truth

Symptom: A polished but unsupported keep-deploying reply receives a high score.
Cause: The pipeline sends all answers to the judge before deterministic policy checks.
Fix: Block inadmissible evidence and unsupported claims first; judge only remaining soft differences.

Pairwise wins follow answer position

Symptom: A prompt variant wins when placed in slot A, then loses when placed in slot B.
Cause: The evaluation reports one ordering and ignores position bias.
Fix: Run both orderings, normalize to candidate identity, and record flips as unstable or route them to humans.
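
To turn position bias into a reported number rather than an anecdote, the fix can be sketched as a flip rate over many swapped pairs. The `SwapRun` record, the `flip_rate` helper, and the fixture verdicts below are hypothetical; no live judge is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapRun:
    """One pair judged twice: as (first, second), then swapped."""
    case_id: str
    first: str            # candidate shown in slot A on the first pass
    second: str           # candidate shown in slot B on the first pass
    verdict_first: str    # "A", "B", or "tie" for order (first, second)
    verdict_swapped: str  # "A", "B", or "tie" for order (second, first)


def winner(order: tuple[str, str], verdict: str) -> str:
    """Map a slot verdict back to a candidate name; ties stay ties."""
    if verdict == "A":
        return order[0]
    if verdict == "B":
        return order[1]
    if verdict == "tie":
        return "tie"
    raise ValueError(f"unsupported verdict: {verdict}")


def flip_rate(runs: list[SwapRun]) -> float:
    """Share of decisive pairs whose two passes name different candidates."""
    decisive = 0
    flips = 0
    for run in runs:
        first_winner = winner((run.first, run.second), run.verdict_first)
        swapped_winner = winner((run.second, run.first), run.verdict_swapped)
        if "tie" in (first_winner, swapped_winner):
            continue  # ties are reported separately, not counted as flips
        decisive += 1
        if first_winner != swapped_winner:
            flips += 1
    return flips / decisive if decisive else 0.0


runs = [
    SwapRun("c1", "brief", "actionable", "B", "A"),    # actionable twice
    SwapRun("c2", "brief", "actionable", "A", "A"),    # slot A wins twice
    SwapRun("c3", "brief", "actionable", "B", "B"),    # slot B wins twice
    SwapRun("c4", "brief", "actionable", "tie", "A"),  # excluded as a tie
    SwapRun("c5", "brief", "actionable", "B", "A"),    # actionable twice
]

print(f"Flip rate over decisive pairs: {flip_rate(runs):.2f}")
```

A nonzero flip rate doesn't say which candidate is better; it says how often the slot, not the reply, decided the verdict.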

Longer replies win by repeating the same facts

Symptom: Apologies and duplicated policy text improve judge score without helping the engineer.
Cause: The rubric doesn't make concision measurable, and no length probe exists.
Fix: Add same-information padding probes, track response length, and block metric promotion while padding wins.
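
Tracking response length can be as simple as recording word counts next to each probe verdict. The sketch below assumes a hypothetical `PaddingProbe` record and fixture verdicts; it checks one narrow thing, whether a padded copy with no new information beat its shorter original.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaddingProbe:
    case_id: str
    original: str
    padded: str        # original plus filler, no new information
    judge_winner: str  # "original", "padded", or "tie" (fixture, not a live call)


def word_count(text: str) -> int:
    return len(text.split())


def padding_report(probes: list[PaddingProbe]) -> dict[str, object]:
    """Summarize padding probes and flag any case where the padded copy won."""
    for probe in probes:
        if word_count(probe.padded) <= word_count(probe.original):
            raise ValueError(f"{probe.case_id}: padded text is not longer")
    padded_wins = [p.case_id for p in probes if p.judge_winner == "padded"]
    extra = [word_count(p.padded) - word_count(p.original) for p in probes]
    return {
        "probes": len(probes),
        "padded_wins": padded_wins,
        "mean_extra_words": sum(extra) / len(extra) if extra else 0.0,
        "block_promotion": bool(padded_wins),
    }


base = (
    "Payment-service crossed the rollback threshold; "
    "run the rollback runbook under DEP-27."
)
probes = [
    PaddingProbe(
        "pad-1",
        base,
        base + " We sincerely apologize for the inconvenience.",
        judge_winner="padded",
    ),
    PaddingProbe(
        "pad-2",
        base,
        base + " We appreciate your patience while we coordinate the rollback.",
        judge_winner="original",
    ),
]

summary = padding_report(probes)
print(f"Padded copies that won: {summary['padded_wins']}")
print(f"Block promotion: {summary['block_promotion']}")
```

A single padded win is enough to block promotion in this sketch; a real program might instead set a tolerated rate, which would be a separate policy decision.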

A high aggregate agreement hides a weak slice

Symptom: Overall calibration looks acceptable, but address-change replies are frequently misjudged.
Cause: Evaluation reports only one aggregate number.
Fix: Label and report agreement by workflow slice, then escalate or repair failed slices before release.
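
Per-slice agreement is easier to act on when each slice also carries its row count, because a slice with a handful of labels is weak evidence either way. The sketch below is one possible shape: `slice_report`, `min_rows`, and `min_agreement` are hypothetical names, and the threshold values are illustrative rather than standard. The rows are this lab's eight calibration decisions.

```python
from collections import defaultdict


def slice_report(
    rows: list[tuple[str, str, str]],  # (slice_name, human_label, judge_label)
    min_rows: int = 20,                # illustrative threshold, not a standard
    min_agreement: float = 0.75,       # illustrative threshold, not a standard
) -> dict[str, dict[str, object]]:
    """Report agreement per slice and mark slices that are thin or weak."""
    grouped: dict[str, list[bool]] = defaultdict(list)
    for slice_name, human, judge in rows:
        grouped[slice_name].append(human == judge)

    report: dict[str, dict[str, object]] = {}
    for name, matches in grouped.items():
        agreement = sum(matches) / len(matches)
        if len(matches) < min_rows:
            status = "under_sampled"   # too few labels to trust the number
        elif agreement < min_agreement:
            status = "weak"
        else:
            status = "ok"
        report[name] = {
            "rows": len(matches),
            "agreement": agreement,
            "status": status,
        }
    return report


lab_rows = [
    ("rollback", "actionable", "actionable"),
    ("rollback", "brief", "brief"),
    ("rollback", "tie", "tie"),
    ("rollback", "actionable", "actionable"),
    ("address_change", "brief", "brief"),
    ("address_change", "tie", "actionable"),
    ("address_change", "actionable", "brief"),
    ("address_change", "brief", "brief"),
]

strict = slice_report(lab_rows)               # default minimum of 20 rows
relaxed = slice_report(lab_rows, min_rows=4)  # lowered only to show "weak"

for name, entry in strict.items():
    print(f"{name}: rows={entry['rows']} status={entry['status']}")
```

With the default minimum, both lab slices are marked under-sampled, so even the perfect rollback agreement is not treated as release evidence.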

A demonstration becomes a release metric too early

Symptom: Eight hand-picked cases become the quality check for a new prompt.
Cause: The team treats a runnable example as a validation dataset.
Fix: Write a promotion contract with calibration size, probe, trace, and human-review requirements.

Next Step

Continue to Bias & Fairness in LLMs

You can now treat an automated judge as a measured instrument rather than an oracle. Next you'll test whether model and evaluator outcomes remain reliable across user groups and language varieties.

References

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

Zheng, L., et al. · 2023 · NeurIPS 2023

LLM Evaluators Recognize and Favor Their Own Generations.

Panickssery, A., Bowman, S. R., & Feng, S. · 2024 · NeurIPS 2024

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.

Liu, Y., et al. · 2023

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.

Dubois, Y., et al. · 2024

A Coefficient of Agreement for Nominal Scales.

Cohen, J. · 1960 · Educational and Psychological Measurement

