Add calibrated soft judgments to a RAG evaluation trace without letting an LLM override deterministic evidence gates.
The previous lesson ended with policy-answerer-v4-eval proving a hard fact: the current, permitted policy source authorizes a replacement, not an immediate refund. A claim ledger can block an unsupported refund promise. It can't decide which of two safe replies from a large language model (LLM) is clearer for the customer.
Consider these two answers to Luna's refurbished-laptop case:
| Candidate | Reply | Hard evidence status |
|---|---|---|
| brief | "Your refurbished laptop qualifies for a replacement under RPL-14." | Supported |
| actionable | "Your refurbished laptop qualifies for a replacement under RPL-14. Reply to confirm you'd like to proceed with the replacement." | Supported |
Both respect the selected evidence. The remaining question is softer: does the added next step make the second reply more useful without making it wordy or confusing?
An LLM-as-a-judge uses another LLM as an evaluator for quality that can't be fully decided by an exact assertion. It can compare clarity, helpfulness, or tone under a rubric. It must not decide whether restricted context was allowed or whether a policy claim is supported. Those remain deterministic checks.
Zheng et al. found that strong LLM judges could exceed 80% agreement with human preferences on their MT-Bench and Chatbot Arena experiments. The same work reports position, verbosity, self-enhancement, and reasoning limitations. A judge is useful measurement equipment, not ground truth.[1]
The boundary matters more than the model name. In a customer-support answer pipeline, different questions need different evaluators:
| Question | Correct evaluator | Why |
|---|---|---|
| Did selected evidence pass access and freshness checks? | Code gate | A soft score must never admit forbidden evidence. |
| Does the answer promise a refund not supported by RPL-14? | Claim-to-source verifier | Policy truth is inspectable. |
| Which supported answer is clearer and more actionable? | Calibrated judge or human | Reasonable reviewers can compare phrasing. |
| Is the case sensitive, ambiguous, or outside rubric coverage? | Human review | Uncertainty is part of the decision. |
This lesson builds only the third layer, while carrying the first two layers forward.
Figure 1: Semantic judging starts after admissibility and claim support have passed. A judge score never reopens a blocked evidence path.
The lab uses an abbreviated hard gate so the boundary is visible in one screen. The previous lesson built the complete evidence-path validator; here we reuse its result and add one unsafe counterexample to show that the hard gate still overrides any soft score.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerTrace:
    request_id: str
    selected_source_id: str
    selected_version: str
    admissible: bool
    allowed_remedy: str

trace = AnswerTrace(
    request_id="ticket-48291",
    selected_source_id="eu-refurb-v2-rule",
    selected_version="eu-electronics/2026-04-01",
    admissible=True,
    allowed_remedy="replacement",
)

answers = {
    "brief": "Your refurbished laptop qualifies for a replacement under RPL-14.",
    "actionable": (
        "Your refurbished laptop qualifies for a replacement under RPL-14. "
        "Reply to confirm you'd like to proceed with the replacement."
    ),
    "unsafe_refund": "Your refurbished laptop qualifies for an immediate refund.",
}

def hard_failures(answer: str, answer_trace: AnswerTrace) -> list[str]:
    failures: list[str] = []
    lowered = answer.lower()
    if not answer_trace.admissible:
        failures.append("selected evidence isn't admissible")
    # Mentioning a refund fails unless the selected source authorizes one.
    if "refund" in lowered and answer_trace.allowed_remedy != "refund":
        failures.append("answer promises an unsupported refund")
    # The supported remedy must be named in the reply.
    if answer_trace.allowed_remedy not in lowered:
        failures.append("answer omits supported replacement remedy")
    return failures

safe_candidates = [
    name for name, answer in answers.items() if not hard_failures(answer, trace)
]

assert safe_candidates == ["brief", "actionable"]
assert hard_failures(answers["unsafe_refund"], trace) == [
    "answer promises an unsupported refund",
    "answer omits supported replacement remedy",
]

print(f"Evidence version: {trace.selected_version}")
print(f"Candidates eligible for soft judging: {safe_candidates}")
print(f"Blocked answer: {hard_failures(answers['unsafe_refund'], trace)[0]}")
```

```text
Evidence version: eu-electronics/2026-04-01
Candidates eligible for soft judging: ['brief', 'actionable']
Blocked answer: answer promises an unsupported refund
```

If a judge later says unsafe_refund sounds friendlier, the answer stays blocked. That invariant makes the judge safe to experiment with.
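One way to make that invariant explicit is a small combiner that consults the judge's preference only among answers with no hard failures. The sketch below is self-contained and illustrative: `final_choice` and `gate_results` are hypothetical names, and routing a blocked preference to human review is an assumption, not something the lesson's validator prescribes.

```python
def final_choice(
    hard_failures_by_answer: dict[str, list[str]],
    judge_preference: str,
) -> str:
    """Pick an answer; a judge preference never revives a blocked answer."""
    eligible = [
        name for name, failures in hard_failures_by_answer.items() if not failures
    ]
    if judge_preference in eligible:
        return judge_preference
    # The judge preferred a blocked or unknown answer: ignore the soft score
    # and escalate (assumed policy for this sketch).
    return "needs_human_review"


gate_results = {
    "brief": [],
    "actionable": [],
    "unsafe_refund": ["answer promises an unsupported refund"],
}

print(final_choice(gate_results, "actionable"))
print(final_choice(gate_results, "unsafe_refund"))
```

The second call shows the point of the design: no value of `judge_preference` can select an answer that carries a hard failure.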
Not every evaluation question should be routed to an LLM. Choose the measurement tool from the decision you need to make.
```python
@dataclass(frozen=True)
class EvaluationQuestion:
    name: str
    has_exact_oracle: bool
    compares_two_safe_variants: bool
    requires_policy_owner: bool = False

def choose_evaluator(question: EvaluationQuestion) -> str:
    # Exact oracles always win; a judge is only considered afterwards.
    if question.has_exact_oracle:
        return "deterministic_gate"
    if question.requires_policy_owner:
        return "human_review"
    if question.compares_two_safe_variants:
        return "pairwise_judge_with_calibration"
    return "pointwise_judge_with_calibration"

questions = [
    EvaluationQuestion("refund authorization", True, False),
    EvaluationQuestion("clearer supported reply", False, True),
    EvaluationQuestion("new exception policy", False, False, True),
]
choices = {item.name: choose_evaluator(item) for item in questions}

assert choices["refund authorization"] == "deterministic_gate"
assert choices["clearer supported reply"] == "pairwise_judge_with_calibration"
assert choices["new exception policy"] == "human_review"

for name, choice in choices.items():
    print(f"{name}: {choice}")
```

```text
refund authorization: deterministic_gate
clearer supported reply: pairwise_judge_with_calibration
new exception policy: human_review
```

A vague instruction such as "pick the better answer" lets the evaluator reward length, politeness, or formatting arbitrarily. A rubric should name what remains undecided after hard checks and include anchors for a tie.
| Criterion | Better answer | Tie condition | Outside judge scope |
|---|---|---|---|
| Actionability | Gives a useful, low-friction next step | Both give the same useful next step | Whether customer is eligible |
| Clarity | States remedy plainly without internal clutter | Both are equally clear | Whether policy source is current |
| Concision | Adds useful information without repetition | Difference is stylistic only | Whether a refund is authorized |
G-Eval studied LLM evaluation with task-specific criteria and a form-filling output design. The practical lesson here is modest: a criterion and a structured answer are easier to audit than a free-form impression.[2]
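To illustrate why a structured answer is easier to audit, here is a minimal sketch of a per-criterion form check. It is not G-Eval's method and not the verdict format used later in this lab; the form layout, `CRITERIA`, `ALLOWED`, and `check_form` are all hypothetical names chosen for the example.

```python
ALLOWED = {"A", "B", "tie"}
CRITERIA = ("actionability", "clarity", "concision")


def check_form(form: dict[str, str]) -> list[str]:
    """Return audit problems for a per-criterion form (hypothetical format)."""
    problems = [f"missing criterion: {name}" for name in CRITERIA if name not in form]
    problems += [
        f"invalid verdict for {name}: {verdict}"
        for name, verdict in form.items()
        if name in CRITERIA and verdict not in ALLOWED
    ]
    problems += [f"unknown criterion: {name}" for name in form if name not in CRITERIA]
    return problems


filled = {"actionability": "B", "clarity": "tie", "concision": "A"}
vague = {"actionability": "B is nicer overall"}

print(check_form(filled))
print(check_form(vague))
```

A free-form impression such as the `vague` entry fails mechanically, which is exactly the property a reviewer wants before reading any rationale.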
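The next cell builds the packet that would be sent to a model API. Notice two decisions: the candidates travel under anonymous slots A and B, and the hard-gate outcome is passed in as validated context rather than left for the judge to decide.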
```python
from dataclasses import asdict

@dataclass(frozen=True)
class Criterion:
    name: str
    question: str
    tie_anchor: str

rubric = (
    Criterion(
        name="actionability",
        question="Does the reply give a safe, useful next action?",
        tie_anchor="Neither answer gives a meaningfully better next action.",
    ),
    Criterion(
        name="clarity",
        question="Is the replacement outcome easy for a customer to understand?",
        tie_anchor="Both answers communicate the outcome equally clearly.",
    ),
    Criterion(
        name="concision",
        question="Does added wording contribute useful information rather than repetition?",
        tie_anchor="The extra wording doesn't change usefulness.",
    ),
)

def pairwise_packet(first_name: str, second_name: str) -> dict[str, object]:
    # Only answers that already passed the hard gate may be compared.
    assert first_name in safe_candidates and second_name in safe_candidates
    return {
        "case_id": trace.request_id,
        "validated_context": {
            "source_id": trace.selected_source_id,
            "version": trace.selected_version,
            "protected_fact": "The permitted remedy is replacement, not refund.",
            "hard_checks": "passed before judging",
        },
        # Candidate names are replaced by anonymous slots.
        "candidates": {
            "A": answers[first_name],
            "B": answers[second_name],
        },
        "rubric": [asdict(item) for item in rubric],
        "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
    }

packet_ab = pairwise_packet("brief", "actionable")
assert "brief" not in packet_ab["candidates"]
assert "actionable" not in packet_ab["candidates"]

print(f"Context gate: {packet_ab['validated_context']['hard_checks']}")
print(f"Candidate slots: {list(packet_ab['candidates'])}")
print(f"Rubric criteria: {[item['name'] for item in packet_ab['rubric']]}")
print(f"Verdicts: {packet_ab['allowed_verdicts']}")
```

```text
Context gate: passed before judging
Candidate slots: ['A', 'B']
Rubric criteria: ['actionability', 'clarity', 'concision']
Verdicts: ['A', 'B', 'tie', 'needs_human_review']
```

In a deployed evaluator, serialize this packet, request a verdict from the chosen judge model, and store the raw packet plus parsed verdict. Don't rely on a hidden prompt that can't be reproduced during a regression.
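A minimal sketch of that storage step, using only the standard library: serialize the packet deterministically, fingerprint it, and keep the raw response next to it. The record schema and the name `packet_record` are assumptions for illustration; no provider API is called here.

```python
import hashlib
import json


def packet_record(packet: dict[str, object], raw_response: str) -> dict[str, str]:
    """Build a storable record for one judge call (hypothetical schema)."""
    # sort_keys gives a stable serialization, so equal packets hash identically
    # regardless of dict insertion order.
    serialized = json.dumps(packet, sort_keys=True, ensure_ascii=False)
    return {
        "packet_sha256": hashlib.sha256(serialized.encode("utf-8")).hexdigest(),
        "packet_json": serialized,
        "raw_response": raw_response,
    }


example_packet = {
    "case_id": "ticket-48291",
    "candidates": {"A": "reply one", "B": "reply two"},
    "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
}
# Same content, different key order.
reordered_packet = {
    "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
    "candidates": {"B": "reply two", "A": "reply one"},
    "case_id": "ticket-48291",
}

record = packet_record(example_packet, '{"verdict": "tie", "evidence": []}')
same_record = packet_record(reordered_packet, '{"verdict": "tie", "evidence": []}')
print(record["packet_sha256"] == same_record["packet_sha256"])
```

Storing the fingerprint alongside the parsed verdict lets a later regression confirm that a changed score came from a changed packet rather than from an unrecorded prompt edit.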
The judge is another model. Its JSON can be malformed, its evidence can be irrelevant, and its preference can contradict its own rationale. Parse its output and validate it just as you would validate a tool result from an agent.
```python
@dataclass(frozen=True)
class JudgeResult:
    order: tuple[str, str]
    preferred_slot: str
    evidence: tuple[str, ...]
    needs_human_review: bool

def parse_judge_result(
    order: tuple[str, str],
    raw: dict[str, object],
) -> JudgeResult:
    verdict = raw.get("verdict")
    allowed = {"A", "B", "tie", "needs_human_review"}
    if not isinstance(verdict, str) or verdict not in allowed:
        raise ValueError(f"unsupported verdict: {verdict}")

    raw_evidence = raw.get("evidence", [])
    if not isinstance(raw_evidence, list) or not all(
        isinstance(item, str) for item in raw_evidence
    ):
        raise ValueError("evidence must be a list of strings")
    evidence = tuple(raw_evidence)
    # A decisive verdict without any rationale can't be audited.
    if verdict in {"A", "B"} and not evidence:
        raise ValueError("decisive verdict requires criterion evidence")

    return JudgeResult(
        order=order,
        preferred_slot=verdict,
        evidence=evidence,
        needs_human_review=verdict == "needs_human_review",
    )

first_pass = parse_judge_result(
    ("brief", "actionable"),
    {
        "verdict": "B",
        "evidence": [
            "B gives the customer a next action; A stops after eligibility."
        ],
    },
)

assert first_pass.preferred_slot == "B"

try:
    parse_judge_result(
        ("brief", "actionable"),
        {"verdict": "B", "evidence": "B has a next action."},
    )
except ValueError as exc:
    print(f"Malformed fixture blocked: {exc}")
else:
    raise AssertionError("malformed evidence container must be rejected")

print(f"First pass preference slot: {first_pass.preferred_slot}")
print(f"Recorded rationale: {first_pass.evidence[0]}")
```

```text
Malformed fixture blocked: evidence must be a list of strings
First pass preference slot: B
Recorded rationale: B gives the customer a next action; A stops after eligibility.
```

The output above is a stored fixture, not proof that a particular hosted model will agree. The engineering problem is to make an evaluator run observable and testable before plugging in any provider.
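One way to keep the run testable before choosing a provider is to treat the judge as a plain callable and replay stored text through it. The sketch below assumes the judge returns raw JSON text; `JudgeFn`, `fixture_judge`, and `decode_response` are hypothetical names, and the decoded dict would then go through a validator such as `parse_judge_result`.

```python
import json
from typing import Callable

# A judge is modeled as any callable from serialized packet text to raw response text.
JudgeFn = Callable[[str], str]


def fixture_judge(responses: dict[str, str]) -> JudgeFn:
    """Return a stub judge that replays stored text keyed by case_id."""
    def call(packet_json: str) -> str:
        case_id = json.loads(packet_json)["case_id"]
        return responses[case_id]
    return call


def decode_response(raw_text: str) -> dict[str, object]:
    """Decode raw judge text into a dict, raising ValueError on bad input."""
    try:
        decoded = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned invalid JSON: {exc.msg}") from exc
    if not isinstance(decoded, dict):
        raise ValueError("judge response must be a JSON object")
    return decoded


judge = fixture_judge({
    "ticket-48291": '{"verdict": "B", "evidence": ["B gives a next action."]}',
    "ticket-broken": '{"verdict": "B", "evidence": [',  # truncated on purpose
})

good = decode_response(judge(json.dumps({"case_id": "ticket-48291"})))
print(good["verdict"])

try:
    decode_response(judge(json.dumps({"case_id": "ticket-broken"})))
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Swapping `fixture_judge` for a function that calls a hosted model would leave the decoding and validation path unchanged, which is what makes the fixtures useful as regression tests.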
Pairwise comparison is useful because the evaluator chooses between two concrete alternatives. It also exposes position bias: a judge may prefer the first slot instead of the better reply. Zheng et al. identify this bias in LLM judging, so every pairwise comparison in this lab is run twice with the candidates swapped.[1]
The crucial detail is normalization. A verdict of B in the first pass and A in the swapped pass can represent the same underlying answer.
```python
def preferred_candidate(result: JudgeResult) -> str | None:
    # Map the anonymous slot back to the candidate name for this ordering.
    if result.preferred_slot not in {"A", "B"}:
        return None
    index = 0 if result.preferred_slot == "A" else 1
    return result.order[index]

def aggregate_swaps(first: JudgeResult, swapped: JudgeResult) -> dict[str, object]:
    if first.needs_human_review or swapped.needs_human_review:
        return {"winner": "needs_human_review", "status": "needs_human_review"}
    if first.preferred_slot == "tie" or swapped.preferred_slot == "tie":
        return {"winner": "tie", "status": "tie"}

    first_choice = preferred_candidate(first)
    second_choice = preferred_candidate(swapped)
    if first_choice is not None and first_choice == second_choice:
        return {"winner": first_choice, "status": "stable"}
    return {"winner": "tie", "status": "unstable_after_swap"}

stable_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {
        "verdict": "A",
        "evidence": ["A preserves the safe remedy and supplies a clear next step."],
    },
)
slot_sensitive_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {
        "verdict": "B",
        "evidence": ["B appears in my preferred slot."],
    },
)
tie_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {"verdict": "tie", "evidence": []},
)
review_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {"verdict": "needs_human_review", "evidence": []},
)

stable = aggregate_swaps(first_pass, stable_second_pass)
unstable = aggregate_swaps(first_pass, slot_sensitive_second_pass)
explicit_tie = aggregate_swaps(first_pass, tie_second_pass)
review = aggregate_swaps(first_pass, review_second_pass)

assert stable == {"winner": "actionable", "status": "stable"}
assert unstable == {"winner": "tie", "status": "unstable_after_swap"}
assert explicit_tie == {"winner": "tie", "status": "tie"}
assert review == {"winner": "needs_human_review", "status": "needs_human_review"}

print(f"Stable comparison: {stable}")
print(f"Slot-sensitive comparison: {unstable}")
print(f"Explicit tie: {explicit_tie}")
print(f"Review route: {review}")
```

```text
Stable comparison: {'winner': 'actionable', 'status': 'stable'}
Slot-sensitive comparison: {'winner': 'tie', 'status': 'unstable_after_swap'}
Explicit tie: {'winner': 'tie', 'status': 'tie'}
Review route: {'winner': 'needs_human_review', 'status': 'needs_human_review'}
```

Keep those states separate in your report. An explicit tie is a valid rubric outcome, needs_human_review is an escalation, and unstable_after_swap is evidence that slot order changed a decisive preference.
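A report can keep the states apart with a simple tally. The sketch below uses hypothetical outcome rows shaped like the dicts returned by `aggregate_swaps`; the flip-rate definition (flips divided by pairs in which both passes were decisive) is an assumption for illustration.

```python
from collections import Counter

# Hypothetical aggregated outcomes, one per swap-checked comparison.
outcomes = [
    {"winner": "actionable", "status": "stable"},
    {"winner": "tie", "status": "unstable_after_swap"},
    {"winner": "tie", "status": "tie"},
    {"winner": "needs_human_review", "status": "needs_human_review"},
    {"winner": "actionable", "status": "stable"},
]

status_counts = Counter(item["status"] for item in outcomes)

# "stable" and "unstable_after_swap" are the only states where both passes
# returned a decisive slot, so they form the denominator for the flip rate.
decisive_pairs = status_counts["stable"] + status_counts["unstable_after_swap"]
flip_rate = (
    status_counts["unstable_after_swap"] / decisive_pairs if decisive_pairs else 0.0
)

for status in ("stable", "tie", "unstable_after_swap", "needs_human_review"):
    print(f"{status}: {status_counts[status]}")
print(f"Flip rate among decisive pairs: {flip_rate:.2f}")
```

Folding unstable results into the tie count would hide position sensitivity, which is why the two are counted separately even though both report a winner of tie.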
One clean comparison doesn't establish that a judge is trustworthy. Build probe cases where an undesirable shortcut is easy to observe.
| Probe | Controlled change | Suspicious signal | Response |
|---|---|---|---|
| Position | Swap only slots A and B | Winner follows slot | Record unstable result |
| Length | Add apologies and repeated policy text, no new help | Padded copy wins | Tighten concision rubric and track length |
| Identity | Reveal prompt or model labels in one run only | Preference changes | Keep candidates anonymous |
| Ambiguity | Compare two equally useful rewrites | Forced winner | Permit ties or human review |
Length is not only a hypothetical confounder. Length-Controlled AlpacaEval proposes a regression-based adjustment intended to answer what preference would have been if compared answers had equal length.[3] In a local product eval, the smaller first step is to add same-information length probes and report when padding wins.
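Tracking length can start very simply. The sketch below counts whitespace-separated words and flags a same-information probe when the winning reply is much longer than the losing one. The helper names and the 1.5 ratio threshold are illustrative assumptions, not values from the cited work.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough length measure."""
    return len(text.split())


def padding_win(winner: str, loser: str, max_ratio: float = 1.5) -> bool:
    """Flag a probe where the winner is much longer than the loser.

    max_ratio is an arbitrary illustrative threshold; only meaningful when
    both replies carry the same information.
    """
    return word_count(winner) / max(word_count(loser), 1) > max_ratio


brief_reply = "Your refurbished laptop qualifies for a replacement under RPL-14."
padded_reply = (
    brief_reply
    + " We sincerely apologize for the inconvenience. "
    + "We appreciate your patience while we process your replacement."
)

print(f"Brief words: {word_count(brief_reply)}")
print(f"Padded words: {word_count(padded_reply)}")
print(f"Padding win flagged: {padding_win(padded_reply, brief_reply)}")
```

A flag like this doesn't prove verbosity bias on its own; it marks the cases worth reporting when the judge prefers the longer reply despite identical information.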
The following fixtures model stored judge returns from two probes. The code doesn't pretend to detect bias from text alone; it asks whether the judge failed a case whose expected behavior you defined in advance.
```python
@dataclass(frozen=True)
class ProbeResult:
    name: str
    expected_winner: str
    observed_winner: str

# Same information as the brief reply, plus apologies that add no help.
padded = (
    answers["brief"]
    + " We sincerely apologize for the inconvenience. "
    + "We appreciate your patience while we process your replacement."
)

probes = [
    ProbeResult(
        name="position_swap",
        expected_winner="actionable",
        observed_winner=str(stable["winner"]),
    ),
    ProbeResult(
        name="same_information_padding",
        expected_winner="brief",
        observed_winner="padded",
    ),
]

failed_probes = [
    probe.name for probe in probes if probe.expected_winner != probe.observed_winner
]

assert "replacement" in padded.lower()
assert failed_probes == ["same_information_padding"]

print(f"Probes run: {len(probes)}")
print(f"Failed probes: {failed_probes}")
print("Action: block metric promotion until padding preference is fixed")
```

```text
Probes run: 2
Failed probes: ['same_information_padding']
Action: block metric promotion until padding preference is fixed
```

This is a valuable negative result. Releasing a judge because it produced pleasing scores would make the evaluation system worse. A failed probe tells you exactly what to repair.
Hard gates have test oracles. Soft judgments need a labeled calibration set: humans apply the same rubric to a representative sample, then the judge is scored against those labels.
Raw agreement is easy to understand, but can overstate reliability when one label dominates. Cohen's kappa corrects for agreement expected from each rater's label frequencies:[4]

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Here, $p_o$ is observed agreement and $p_e$ is agreement expected from label prevalence. Kappa isn't a universal release threshold. Your baseline is human-human agreement on the same rubric and the same workflow slices.
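To see how a dominant label inflates raw agreement, consider a hypothetical skewed sample in which humans choose "A" nine times out of ten and the judge answers "A" every time. The sketch below is self-contained; the data and the name `agreement_and_kappa` are invented for illustration, and returning 0.0 when chance agreement is total is a convention chosen here because kappa is undefined in that case.

```python
from collections import Counter


def agreement_and_kappa(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for (human, judge) label pairs."""
    total = len(pairs)
    observed = sum(human == judge for human, judge in pairs) / total
    human_counts = Counter(human for human, _ in pairs)
    judge_counts = Counter(judge for _, judge in pairs)
    expected = sum(
        (human_counts[label] / total) * (judge_counts[label] / total)
        for label in set(human_counts) | set(judge_counts)
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports 0.0 by convention.
        return observed, 0.0
    return observed, (observed - expected) / (1.0 - expected)


# Nine cases where the human said "A", one where the human said "B";
# the judge says "A" in all ten.
skewed = [("A", "A")] * 9 + [("B", "A")]
skewed_agreement, skewed_kappa = agreement_and_kappa(skewed)

print(f"Raw agreement: {skewed_agreement:.2f}")
print(f"Cohen's kappa: {skewed_kappa:.2f}")
```

Raw agreement is 0.90 while kappa is 0.00, because a judge that always answers "A" matches the humans exactly as often as label prevalence alone predicts.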
The tiny calibration set in the next cell is intentionally too small to approve a real metric. It shows the computation and demonstrates why a promising number alone can't release an evaluator.
```python
from collections import Counter

@dataclass(frozen=True)
class LabeledDecision:
    case_id: str
    slice_name: str
    human: str
    judge: str

calibration_rows = [
    LabeledDecision("r1", "replacement", "actionable", "actionable"),
    LabeledDecision("r2", "replacement", "brief", "brief"),
    LabeledDecision("r3", "replacement", "tie", "tie"),
    LabeledDecision("r4", "replacement", "actionable", "actionable"),
    LabeledDecision("r5", "address_change", "brief", "brief"),
    LabeledDecision("r6", "address_change", "tie", "actionable"),
    LabeledDecision("r7", "address_change", "actionable", "brief"),
    LabeledDecision("r8", "address_change", "brief", "brief"),
]

def raw_agreement(rows: list[LabeledDecision]) -> float:
    return sum(row.human == row.judge for row in rows) / len(rows)

def cohens_kappa(rows: list[LabeledDecision]) -> float:
    labels = {row.human for row in rows} | {row.judge for row in rows}
    total = len(rows)
    human_counts = Counter(row.human for row in rows)
    judge_counts = Counter(row.judge for row in rows)
    observed = raw_agreement(rows)
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    expected = sum(
        human_counts[label] / total * judge_counts[label] / total
        for label in labels
    )
    return (observed - expected) / (1.0 - expected)

agreement = raw_agreement(calibration_rows)
kappa = cohens_kappa(calibration_rows)
assert agreement == 0.75

print(f"Calibration rows: {len(calibration_rows)}")
print(f"Raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.3f}")
print("Release evidence: insufficient sample; collect labeled slices")
```

```text
Calibration rows: 8
Raw agreement: 0.75
Cohen's kappa: 0.610
Release evidence: insufficient sample; collect labeled slices
```

An aggregate can conceal the exact problem that requires attention. Report the calibration set by workflow slice before allowing the judge metric to guide any experiment.
```python
def agreement_by_slice(rows: list[LabeledDecision]) -> dict[str, float]:
    grouped: dict[str, list[LabeledDecision]] = {}
    for row in rows:
        grouped.setdefault(row.slice_name, []).append(row)
    return {name: raw_agreement(items) for name, items in grouped.items()}

slice_agreement = agreement_by_slice(calibration_rows)
weak_slices = [
    name for name, score in slice_agreement.items() if score < 0.75
]

assert slice_agreement["replacement"] == 1.0
assert slice_agreement["address_change"] == 0.5
assert weak_slices == ["address_change"]

for name, score in slice_agreement.items():
    print(f"{name}: agreement={score:.2f}")
print(f"Slices requiring review: {weak_slices}")
```

```text
replacement: agreement=1.00
address_change: agreement=0.50
Slices requiring review: ['address_change']
```

For an actual evaluation program:
Once a support conversation has multiple turns, a fluent final reply can conceal a bad evidence path. A judge packet should include relevant conversation turns, selected evidence identifiers, hard-gate outcomes, and the safe candidates being compared.
The next cell blocks a conversation before semantic judging if its trace isn't admissible. This is the same contract as the single-turn example, applied to a fuller packet.
```python
@dataclass(frozen=True)
class ConversationBundle:
    turns: tuple[str, ...]
    answer_trace: AnswerTrace
    candidate_names: tuple[str, str]

def route_bundle(bundle: ConversationBundle) -> str:
    # Inadmissible evidence blocks the whole bundle before any judge call.
    if not bundle.answer_trace.admissible:
        return "blocked_before_judge"
    for name in bundle.candidate_names:
        if hard_failures(answers[name], bundle.answer_trace):
            return "blocked_before_judge"
    return "ready_for_soft_judge"

safe_bundle = ConversationBundle(
    turns=(
        "Customer: My refurbished laptop failed after delivery.",
        "Luna: I found the EU refurbished-device policy.",
        "Customer: What remedy can I receive?",
    ),
    answer_trace=trace,
    candidate_names=("brief", "actionable"),
)
stale_bundle = ConversationBundle(
    turns=safe_bundle.turns,
    answer_trace=AnswerTrace(
        request_id=trace.request_id,
        selected_source_id=trace.selected_source_id,
        selected_version="eu-electronics/2025-01-01",
        admissible=False,
        allowed_remedy="replacement",
    ),
    candidate_names=("brief", "actionable"),
)

assert route_bundle(safe_bundle) == "ready_for_soft_judge"
assert route_bundle(stale_bundle) == "blocked_before_judge"

print(f"Current policy bundle: {route_bundle(safe_bundle)}")
print(f"Stale policy bundle: {route_bundle(stale_bundle)}")
```

```text
Current policy bundle: ready_for_soft_judge
Stale policy bundle: blocked_before_judge
```

LLM judging is usually most defensible as an offline experiment metric: compare prompt versions or model releases over a frozen dataset, investigate disagreements, and let humans approve consequential changes. It is rarely a good reason to make a real-time policy decision for one customer.
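As a rough sketch of that offline use, the snippet below summarizes swap-checked outcomes for two prompt versions over a frozen dataset. The rows, the labels `prompt_v1` and `prompt_v2`, and the choice to compute the win rate over stable results only are assumptions for illustration.

```python
from collections import Counter

# Hypothetical frozen-dataset results: one aggregated outcome per case.
frozen_results = [
    {"case_id": "c1", "winner": "prompt_v2", "status": "stable"},
    {"case_id": "c2", "winner": "prompt_v1", "status": "stable"},
    {"case_id": "c3", "winner": "tie", "status": "tie"},
    {"case_id": "c4", "winner": "prompt_v2", "status": "stable"},
    {"case_id": "c5", "winner": "tie", "status": "unstable_after_swap"},
]

# Only stable preferences count toward the win rate.
stable_wins = Counter(
    row["winner"] for row in frozen_results if row["status"] == "stable"
)
stable_total = sum(stable_wins.values())
v2_win_rate = stable_wins["prompt_v2"] / stable_total if stable_total else 0.0

# Unstable or escalated cases are listed for a human to inspect.
inspect_cases = [
    row["case_id"]
    for row in frozen_results
    if row["status"] in {"unstable_after_swap", "needs_human_review"}
]

print(f"prompt_v2 stable win rate: {v2_win_rate:.2f} over {stable_total} cases")
print(f"Cases to inspect: {inspect_cases}")
```

The summary informs a human decision about which prompt to ship; it doesn't decide any individual customer's remedy.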
Define an explicit promotion contract. The numbers below are illustrative requirements for this lab, not universal industry thresholds:
| Release evidence | Lab requirement | Current lab state |
|---|---|---|
| Every candidate passed deterministic policy gates | Required | Pass |
| Labeled calibration rows | At least 50 | 8 |
| Known bias probes | All pass | Length probe fails |
| Human review path | Required | Defined |
```python
@dataclass(frozen=True)
class MetricPromotion:
    hard_gate_passed: bool
    calibration_count: int
    minimum_calibration_count: int
    failed_bias_probes: tuple[str, ...]
    has_human_review_path: bool

def promotion_failures(promotion: MetricPromotion) -> list[str]:
    # Collect every unmet requirement instead of stopping at the first one.
    failures: list[str] = []
    if not promotion.hard_gate_passed:
        failures.append("hard policy checks failed")
    if promotion.calibration_count < promotion.minimum_calibration_count:
        failures.append("calibration set is too small")
    if promotion.failed_bias_probes:
        failures.append("judge failed a bias probe")
    if not promotion.has_human_review_path:
        failures.append("human escalation path is missing")
    return failures

promotion = MetricPromotion(
    hard_gate_passed=True,
    calibration_count=len(calibration_rows),
    minimum_calibration_count=50,
    failed_bias_probes=tuple(failed_probes),
    has_human_review_path=True,
)
failures = promotion_failures(promotion)

assert failures == [
    "calibration set is too small",
    "judge failed a bias probe",
]

print("Metric promotion: BLOCKED")
for failure in failures:
    print(f"- {failure}")
print("Next work: label more cases and repair length sensitivity")
```

```text
Metric promotion: BLOCKED
- calibration set is too small
- judge failed a bias probe
Next work: label more cases and repair length sensitivity
```

A blocked promotion is the correct result. The lab has produced a useful candidate preference, but it hasn't established that its judge deserves to influence prompt selection across real customer workflows.
When you implement this pattern in a real project, store a report with these sections:
| Report section | Evidence to retain | Decision it supports |
|---|---|---|
| Hard-gate results | Source IDs, versions, claim failures | Which answers are ineligible |
| Rubric contract | Criteria, anchors, allowed verdicts | What the judge was asked to measure |
| Raw judge runs | Both slot orders and rationale snippets | Whether preference is reproducible |
| Bias probes | Position, length, identity, tie cases | Whether known shortcuts remain |
| Calibration | Human labels, per-slice agreement, kappa | Whether metric matches reviewers |
| Promotion decision | Failed requirements and owner | Whether new metric may guide release |
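One possible shape for such a report is a single dataclass serialized to JSON, so every section is stored together. The class `JudgeEvalReport`, its field names, and the sample values are hypothetical; they mirror the table's sections rather than any established schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class JudgeEvalReport:
    """Hypothetical container with one field per report section."""
    hard_gate_results: dict[str, list[str]]
    rubric_criteria: list[str]
    raw_judge_runs: list[dict[str, object]]
    bias_probe_failures: list[str]
    calibration: dict[str, float]
    promotion_failures: list[str] = field(default_factory=list)

    @property
    def promoted(self) -> bool:
        # The metric is promoted only when no requirement failed.
        return not self.promotion_failures


report = JudgeEvalReport(
    hard_gate_results={"unsafe_refund": ["answer promises an unsupported refund"]},
    rubric_criteria=["actionability", "clarity", "concision"],
    raw_judge_runs=[
        {"order": ["brief", "actionable"], "verdict": "B"},
        {"order": ["actionable", "brief"], "verdict": "A"},
    ],
    bias_probe_failures=["same_information_padding"],
    calibration={"raw_agreement": 0.75, "replacement": 1.0, "address_change": 0.5},
    promotion_failures=["calibration set is too small", "judge failed a bias probe"],
)

# sort_keys keeps the stored file stable across runs for easy diffing.
report_json = json.dumps(asdict(report), indent=2, sort_keys=True)
print(f"Promoted: {report.promoted}")
print(f"Sections stored: {sorted(json.loads(report_json))}")
```

Keeping both slot orders in `raw_judge_runs` means a later reader can recheck stability without rerunning the judge.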
The scientist's habit is to evaluate the evaluator. A judge score is one observation; a calibrated, stress-tested metric with recorded failure modes is evidence.
| Skill | Evidence from the lab |
|---|---|
| Separate exact policy truth from soft quality | Unsafe refund answer fails deterministic checks before judging. |
| Build a reproducible judge request | Packet keeps candidates anonymous, records rubric anchors, and requests structured verdicts. |
| Treat judge output as untrusted data | Parser rejects malformed evidence and preserves ties plus escalation. |
| Detect slot and verbosity shortcuts | Order swaps normalize candidate identity; probes fail when padded wording wins. |
| Calibrate before promotion | Human labels, per-slice agreement, Cohen's kappa, and explicit promotion requirements keep a demo from becoming a release metric. |
| Preserve trace provenance | Conversation packet carries policy identity, version, and hard-gate outcomes into offline review. |
Symptom: A polished but unsupported refund reply receives a high score.
Cause: The pipeline sends all answers to the judge before deterministic policy checks.
Fix: Block inadmissible evidence and unsupported claims first; judge only remaining soft differences.
Symptom: A prompt variant wins when placed in slot A, then loses when placed in slot B.
Cause: The evaluation reports one ordering and ignores position bias.
Fix: Run both orderings, normalize to candidate identity, and record flips as unstable or route them to humans.
Symptom: Apologies and duplicated policy text improve judge score without helping the customer.
Cause: The rubric doesn't make concision measurable, and there is no length probe.
Fix: Add same-information padding probes, track response length, and block metric promotion while padding wins.

Symptom: Overall calibration looks acceptable, but address-change replies are frequently misjudged.
Cause: Evaluation reports only one aggregate number.
Fix: Label and report agreement by workflow slice, then escalate or repair failed slices before release.

Symptom: Eight hand-picked cases become the quality gate for a new prompt.
Cause: The team treats a runnable example as a validation dataset.
Fix: Write a promotion contract with calibration size, probe, trace, and human-review requirements.
[1] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
[2] Liu, Y., et al. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.
[3] Dubois, Y., et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.
[4] Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement.