LeetLLM
© 2026 LeetLLM. All rights reserved.


LLM-as-a-Judge Evaluation

Add calibrated soft judgments to a RAG evaluation trace without letting an LLM override deterministic evidence gates.


The previous lesson ended with policy-answerer-v4-eval proving a hard fact: the current, permitted policy source authorizes a replacement, not an immediate refund. A claim ledger can block an unsupported refund promise. It can't decide which of two safe replies from a large language model (LLM) is clearer for the customer.

Consider these two answers to Luna's refurbished-laptop case:

| Candidate | Reply | Hard evidence status |
| --- | --- | --- |
| brief | "Your refurbished laptop qualifies for a replacement under RPL-14." | Supported |
| actionable | "Your refurbished laptop qualifies for a replacement under RPL-14. Reply to confirm you'd like to proceed with the replacement." | Supported |

Both respect the selected evidence. The remaining question is softer: does the added next step make the second reply more useful without making it wordy or confusing?

An LLM-as-a-judge uses another LLM as an evaluator for quality that can't be fully decided by an exact assertion. It can compare clarity, helpfulness, or tone under a rubric. It must not decide whether restricted context was allowed or whether a policy claim is supported. Those remain deterministic checks.

Zheng et al. found that strong LLM judges could exceed 80% agreement with human preferences on their MT-Bench and Chatbot Arena experiments. The same work reports position, verbosity, self-enhancement, and reasoning limitations. A judge is useful measurement equipment, not ground truth.[1]

Keep facts outside the judge

The boundary matters more than the model name. In a customer-support answer pipeline, different questions need different evaluators:

| Question | Correct evaluator | Why |
| --- | --- | --- |
| Did selected evidence pass access and freshness checks? | Code gate | A soft score must never admit forbidden evidence. |
| Does the answer promise a refund not supported by RPL-14? | Claim-to-source verifier | Policy truth is inspectable. |
| Which supported answer is clearer and more actionable? | Calibrated judge or human | Reasonable reviewers can compare phrasing. |
| Is the case sensitive, ambiguous, or outside rubric coverage? | Human review | Uncertainty is part of the decision. |

This lesson builds only the third layer, while carrying the first two layers forward.

[Diagram: versioned RAG trace, hard evidence gates, fail path, and blocked answer.]

Figure 1: Semantic judging starts after admissibility and claim support have passed. A judge score never reopens a blocked evidence path.

[Diagram: a soft-evaluation pipeline for the RPL-14 replacement reply, with hard evidence gates before anonymized pairwise judging and calibration.]
The selected RPL-14 passes deterministic checks first. Only supported candidate replies enter an anonymized judge comparison, and calibration controls whether that metric may guide a release.

Start with two supported answers

The lab uses an abbreviated hard gate so the boundary is visible in one screen. The previous lesson built the complete evidence-path validator; here we reuse its result and add one unsafe counterexample to prove it still wins over any soft score.

supported-candidates.py

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerTrace:
    request_id: str
    selected_source_id: str
    selected_version: str
    admissible: bool
    allowed_remedy: str


trace = AnswerTrace(
    request_id="ticket-48291",
    selected_source_id="eu-refurb-v2-rule",
    selected_version="eu-electronics/2026-04-01",
    admissible=True,
    allowed_remedy="replacement",
)

answers = {
    "brief": "Your refurbished laptop qualifies for a replacement under RPL-14.",
    "actionable": (
        "Your refurbished laptop qualifies for a replacement under RPL-14. "
        "Reply to confirm you'd like to proceed with the replacement."
    ),
    "unsafe_refund": "Your refurbished laptop qualifies for an immediate refund.",
}


def hard_failures(answer: str, answer_trace: AnswerTrace) -> list[str]:
    failures: list[str] = []
    lowered = answer.lower()
    if not answer_trace.admissible:
        failures.append("selected evidence isn't admissible")
    if "refund" in lowered and answer_trace.allowed_remedy != "refund":
        failures.append("answer promises an unsupported refund")
    if answer_trace.allowed_remedy not in lowered:
        failures.append("answer omits supported replacement remedy")
    return failures


safe_candidates = [
    name for name, answer in answers.items() if not hard_failures(answer, trace)
]

assert safe_candidates == ["brief", "actionable"]
assert hard_failures(answers["unsafe_refund"], trace) == [
    "answer promises an unsupported refund",
    "answer omits supported replacement remedy",
]

print(f"Evidence version: {trace.selected_version}")
print(f"Candidates eligible for soft judging: {safe_candidates}")
print(f"Blocked answer: {hard_failures(answers['unsafe_refund'], trace)[0]}")
```

Output

```text
Evidence version: eu-electronics/2026-04-01
Candidates eligible for soft judging: ['brief', 'actionable']
Blocked answer: answer promises an unsupported refund
```

If a judge later says unsafe_refund sounds friendlier, the answer stays blocked. That invariant makes the judge safe to experiment with.
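One way to make that invariant explicit is to check the hard-gate result before any judge preference is read. The following is a minimal sketch, not part of the lab: `resolve_winner`, the `blocked` mapping, and the `fallback` argument are hypothetical names chosen for illustration.

```python
def resolve_winner(
    judge_preference: str,
    blocked: dict[str, list[str]],
    fallback: str,
) -> str:
    """Return the judge's pick only if it is known and passed every hard gate.

    `blocked` maps a candidate name to its list of hard-gate failures
    (a hypothetical shape that mirrors the lab's hard_failures output).
    """
    if judge_preference not in blocked or blocked[judge_preference]:
        # A soft score never reopens a blocked evidence path.
        return fallback
    return judge_preference


blocked = {
    "brief": [],
    "actionable": [],
    "unsafe_refund": ["answer promises an unsupported refund"],
}

# Even if a judge prefers the friendlier-sounding refund reply, it stays blocked.
print(resolve_winner("unsafe_refund", blocked, fallback="brief"))
print(resolve_winner("actionable", blocked, fallback="brief"))
```

Treating an unknown candidate name as blocked is a deliberate choice in this sketch: the judge's output is not trusted to introduce new candidates.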

Choose the evaluator before writing the rubric

Not every evaluation question should be routed to an LLM. Choose the measurement tool from the decision you need to make.

choose-the-evaluator.py

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationQuestion:
    name: str
    has_exact_oracle: bool
    compares_two_safe_variants: bool
    requires_policy_owner: bool = False


def choose_evaluator(question: EvaluationQuestion) -> str:
    # Order matters: an exact oracle always wins over any soft evaluator.
    if question.has_exact_oracle:
        return "deterministic_gate"
    if question.requires_policy_owner:
        return "human_review"
    if question.compares_two_safe_variants:
        return "pairwise_judge_with_calibration"
    return "pointwise_judge_with_calibration"


questions = [
    EvaluationQuestion("refund authorization", True, False),
    EvaluationQuestion("clearer supported reply", False, True),
    EvaluationQuestion("new exception policy", False, False, True),
]
choices = {item.name: choose_evaluator(item) for item in questions}

assert choices["refund authorization"] == "deterministic_gate"
assert choices["clearer supported reply"] == "pairwise_judge_with_calibration"
assert choices["new exception policy"] == "human_review"

for name, choice in choices.items():
    print(f"{name}: {choice}")
```

Output

```text
refund authorization: deterministic_gate
clearer supported reply: pairwise_judge_with_calibration
new exception policy: human_review
```

Write a rubric for the remaining question

A vague instruction such as "pick the better answer" lets the evaluator reward length, politeness, or formatting arbitrarily. A rubric should name what remains undecided after hard checks and include anchors for a tie.

| Criterion | Better answer | Tie condition | Outside judge scope |
| --- | --- | --- | --- |
| Actionability | Gives a useful, low-friction next step | Both give the same useful next step | Whether customer is eligible |
| Clarity | States remedy plainly without internal clutter | Both are equally clear | Whether policy source is current |
| Concision | Adds useful information without repetition | Difference is stylistic only | Whether a refund is authorized |

G-Eval studied LLM evaluation with task-specific criteria and a form-filling output design. The practical lesson here is modest: a criterion and a structured answer are easier to audit than a free-form impression.[2]

The next cell builds the packet that would be sent to a model API. Notice two decisions:

  1. Candidate names are anonymous slots, not model or prompt-version names.
  2. Protected facts are displayed as already validated context, not handed to the judge for re-litigation.

pairwise-judge-packet.py

```python
from dataclasses import asdict, dataclass

# Continues the lab session: reuses `trace`, `answers`, and `safe_candidates`
# from supported-candidates.py.


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str
    tie_anchor: str


rubric = (
    Criterion(
        name="actionability",
        question="Does the reply give a safe, useful next action?",
        tie_anchor="Neither answer gives a meaningfully better next action.",
    ),
    Criterion(
        name="clarity",
        question="Is the replacement outcome easy for a customer to understand?",
        tie_anchor="Both answers communicate the outcome equally clearly.",
    ),
    Criterion(
        name="concision",
        question="Does added wording contribute useful information rather than repetition?",
        tie_anchor="The extra wording doesn't change usefulness.",
    ),
)


def pairwise_packet(first_name: str, second_name: str) -> dict[str, object]:
    # Only candidates that already passed the hard gates may be compared.
    assert first_name in safe_candidates and second_name in safe_candidates
    return {
        "case_id": trace.request_id,
        "validated_context": {
            "source_id": trace.selected_source_id,
            "version": trace.selected_version,
            "protected_fact": "The permitted remedy is replacement, not refund.",
            "hard_checks": "passed before judging",
        },
        "candidates": {
            "A": answers[first_name],
            "B": answers[second_name],
        },
        "rubric": [asdict(item) for item in rubric],
        "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
    }


packet_ab = pairwise_packet("brief", "actionable")
# Candidate slots are anonymous: the internal names never reach the judge.
assert "brief" not in packet_ab["candidates"]
assert "actionable" not in packet_ab["candidates"]

print(f"Context gate: {packet_ab['validated_context']['hard_checks']}")
print(f"Candidate slots: {list(packet_ab['candidates'])}")
print(f"Rubric criteria: {[item['name'] for item in packet_ab['rubric']]}")
print(f"Verdicts: {packet_ab['allowed_verdicts']}")
```

Output

```text
Context gate: passed before judging
Candidate slots: ['A', 'B']
Rubric criteria: ['actionability', 'clarity', 'concision']
Verdicts: ['A', 'B', 'tie', 'needs_human_review']
```

In a deployed evaluator, serialize this packet, request a structured verdict from the chosen judge model, and store the raw packet plus the parsed verdict. Don't rely on a hidden prompt that can't be reproduced during a regression.
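A reproducible record starts with deterministic serialization. The sketch below uses only the standard library; `packet_fingerprint`, `example_packet`, and `run_record` are hypothetical names, and the packet is a trimmed-down stand-in for the lab's real one.

```python
import hashlib
import json


def packet_fingerprint(packet: dict[str, object]) -> tuple[str, str]:
    """Serialize a packet deterministically and return (json_text, sha256_hex)."""
    # sort_keys and fixed separators keep the bytes stable across runs,
    # regardless of the order in which the dict was built.
    text = json.dumps(
        packet, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


# Hypothetical, trimmed-down packet; the lab's packet carries more fields.
example_packet = {
    "case_id": "ticket-48291",
    "candidates": {"A": "reply one", "B": "reply two"},
    "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
}
# Same content, different construction order.
reordered_packet = {
    "allowed_verdicts": ["A", "B", "tie", "needs_human_review"],
    "candidates": {"B": "reply two", "A": "reply one"},
    "case_id": "ticket-48291",
}

text, digest = packet_fingerprint(example_packet)
_, digest_again = packet_fingerprint(reordered_packet)

# Store the raw text, its digest, and (later) the parsed verdict together.
run_record = {"packet_sha256": digest, "packet_json": text, "verdict": None}
print(f"Same fingerprint after reordering: {digest == digest_again}")
```

Storing the digest beside the verdict lets a later regression confirm that it replayed exactly the same request.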

Treat the judge output as untrusted data

The judge is another model. Its JSON can be malformed, its evidence can be irrelevant, and its preference can contradict its own rationale. Parse and validate it just as you would validate a tool result from an agent.

validate-judge-result.py

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    order: tuple[str, str]
    preferred_slot: str
    evidence: tuple[str, ...]
    needs_human_review: bool


def parse_judge_result(
    order: tuple[str, str],
    raw: dict[str, object],
) -> JudgeResult:
    verdict = raw.get("verdict")
    allowed = {"A", "B", "tie", "needs_human_review"}
    if not isinstance(verdict, str) or verdict not in allowed:
        raise ValueError(f"unsupported verdict: {verdict}")

    raw_evidence = raw.get("evidence", [])
    if not isinstance(raw_evidence, list) or not all(
        isinstance(item, str) for item in raw_evidence
    ):
        raise ValueError("evidence must be a list of strings")
    evidence = tuple(raw_evidence)
    # A decisive verdict without any rationale can't be audited.
    if verdict in {"A", "B"} and not evidence:
        raise ValueError("decisive verdict requires criterion evidence")

    return JudgeResult(
        order=order,
        preferred_slot=verdict,
        evidence=evidence,
        needs_human_review=verdict == "needs_human_review",
    )


first_pass = parse_judge_result(
    ("brief", "actionable"),
    {
        "verdict": "B",
        "evidence": [
            "B gives the customer a next action; A stops after eligibility."
        ],
    },
)

assert first_pass.preferred_slot == "B"

try:
    # Evidence is a bare string here, not a list of strings.
    parse_judge_result(
        ("brief", "actionable"),
        {"verdict": "B", "evidence": "B has a next action."},
    )
except ValueError as exc:
    print(f"Malformed fixture blocked: {exc}")
else:
    raise AssertionError("malformed evidence container must be rejected")

print(f"First pass preference slot: {first_pass.preferred_slot}")
print(f"Recorded rationale: {first_pass.evidence[0]}")
```

Output

```text
Malformed fixture blocked: evidence must be a list of strings
First pass preference slot: B
Recorded rationale: B gives the customer a next action; A stops after eligibility.
```

The output above is a stored fixture, not proof that a particular hosted model will agree. The engineering problem is to make an evaluator run observable and testable before plugging in any provider.
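One way to keep the run testable before choosing a provider is to hide the judge behind a small interface and replay stored fixtures through it. This is a sketch under stated assumptions: `JudgeClient`, `FixtureJudge`, and `run_judge` are hypothetical names, and no real provider API is modeled.

```python
from typing import Protocol


class JudgeClient(Protocol):
    """Hypothetical interface; a hosted provider would sit behind the same method."""

    def judge(self, packet: dict[str, object]) -> dict[str, object]: ...


class FixtureJudge:
    """Replays stored verdicts keyed by case_id and records every request."""

    def __init__(self, fixtures: dict[str, dict[str, object]]) -> None:
        self._fixtures = fixtures
        self.calls: list[dict[str, object]] = []

    def judge(self, packet: dict[str, object]) -> dict[str, object]:
        self.calls.append(packet)  # keep the raw request for later audit
        case_id = str(packet["case_id"])
        if case_id not in self._fixtures:
            # An unknown case escalates instead of inventing a verdict.
            return {"verdict": "needs_human_review", "evidence": []}
        return self._fixtures[case_id]


def run_judge(client: JudgeClient, packet: dict[str, object]) -> str:
    raw = client.judge(packet)
    return str(raw["verdict"])


fixture_judge = FixtureJudge(
    {"ticket-48291": {"verdict": "B", "evidence": ["B gives a next action."]}}
)
known = run_judge(fixture_judge, {"case_id": "ticket-48291"})
unknown = run_judge(fixture_judge, {"case_id": "ticket-00000"})

print(f"Known case verdict: {known}")
print(f"Unknown case verdict: {unknown}")
print(f"Requests recorded: {len(fixture_judge.calls)}")
```

Because `run_judge` depends only on the interface, the same tests can run against fixtures in CI and against a hosted model in a scheduled evaluation.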

A preference must survive swapping A and B

Pairwise comparison is useful because the evaluator chooses between two concrete alternatives. It also exposes position bias: a judge may prefer the first slot instead of the better reply. Zheng et al. identify this bias in LLM judging, so every pairwise comparison in this lab is run twice with the candidates swapped.[1]

[Diagram: the brief and actionable replacement replies judged in both slot orders, with stable agreement selecting the actionable reply and slot-following disagreement routed to review.]
The preferred reply must remain the same after slot order changes. A judge that picks the first slot twice hasn't distinguished answer quality from placement.

The crucial detail is normalization. A verdict of B in the first pass and A in the swapped pass can represent the same underlying answer.

aggregate-order-swaps.py

```python
# Continues the lab session: reuses `JudgeResult`, `parse_judge_result`,
# and `first_pass` from validate-judge-result.py.


def preferred_candidate(result: JudgeResult) -> str | None:
    # Map the anonymous slot back to the candidate name for this run's order.
    if result.preferred_slot not in {"A", "B"}:
        return None
    index = 0 if result.preferred_slot == "A" else 1
    return result.order[index]


def aggregate_swaps(first: JudgeResult, swapped: JudgeResult) -> dict[str, object]:
    if first.needs_human_review or swapped.needs_human_review:
        return {"winner": "needs_human_review", "status": "needs_human_review"}
    if first.preferred_slot == "tie" or swapped.preferred_slot == "tie":
        return {"winner": "tie", "status": "tie"}

    first_choice = preferred_candidate(first)
    second_choice = preferred_candidate(swapped)
    if first_choice is not None and first_choice == second_choice:
        return {"winner": first_choice, "status": "stable"}
    return {"winner": "tie", "status": "unstable_after_swap"}


stable_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {
        "verdict": "A",
        "evidence": ["A preserves the safe remedy and supplies a clear next step."],
    },
)
slot_sensitive_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {
        "verdict": "B",
        "evidence": ["B appears in my preferred slot."],
    },
)
tie_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {"verdict": "tie", "evidence": []},
)
review_second_pass = parse_judge_result(
    ("actionable", "brief"),
    {"verdict": "needs_human_review", "evidence": []},
)

stable = aggregate_swaps(first_pass, stable_second_pass)
unstable = aggregate_swaps(first_pass, slot_sensitive_second_pass)
explicit_tie = aggregate_swaps(first_pass, tie_second_pass)
review = aggregate_swaps(first_pass, review_second_pass)

assert stable == {"winner": "actionable", "status": "stable"}
assert unstable == {"winner": "tie", "status": "unstable_after_swap"}
assert explicit_tie == {"winner": "tie", "status": "tie"}
assert review == {"winner": "needs_human_review", "status": "needs_human_review"}

print(f"Stable comparison: {stable}")
print(f"Slot-sensitive comparison: {unstable}")
print(f"Explicit tie: {explicit_tie}")
print(f"Review route: {review}")
```

Output

```text
Stable comparison: {'winner': 'actionable', 'status': 'stable'}
Slot-sensitive comparison: {'winner': 'tie', 'status': 'unstable_after_swap'}
Explicit tie: {'winner': 'tie', 'status': 'tie'}
Review route: {'winner': 'needs_human_review', 'status': 'needs_human_review'}
```

Keep those states separate in your report. An explicit tie is a valid rubric outcome, needs_human_review is an escalation, and unstable_after_swap is evidence that slot order changed a decisive preference.
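A report can keep the states apart by counting each status on its own line and letting only stable comparisons contribute wins. The sketch below uses hypothetical aggregated results with the same `winner` and `status` keys as the lab; the counts are illustrative, not measured.

```python
from collections import Counter

# Hypothetical aggregated results, shaped like the lab's aggregate_swaps output.
comparisons = [
    {"winner": "actionable", "status": "stable"},
    {"winner": "tie", "status": "tie"},
    {"winner": "tie", "status": "unstable_after_swap"},
    {"winner": "actionable", "status": "stable"},
    {"winner": "needs_human_review", "status": "needs_human_review"},
]

status_counts = Counter(item["status"] for item in comparisons)
total = len(comparisons)

# Share of comparisons where swapping slots flipped a decisive verdict.
unstable_rate = status_counts["unstable_after_swap"] / total

# Only stable comparisons contribute to a candidate's win count.
stable_wins = Counter(
    item["winner"] for item in comparisons if item["status"] == "stable"
)

for status in ("stable", "tie", "unstable_after_swap", "needs_human_review"):
    print(f"{status}: {status_counts[status]}")
print(f"Unstable rate: {unstable_rate:.2f}")
print(f"Stable wins: {dict(stable_wins)}")
```

Folding `unstable_after_swap` into the tie count would hide position bias, which is why it gets its own rate here.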

Probe the biases you expect

One clean comparison doesn't establish that a judge is trustworthy. Build probe cases where an undesirable shortcut is easy to observe.

[Diagram: judge probe matrix for the RPL-14 reply covering slot order, padded verbosity, model identity leakage, and close calls that require human review.]
Bias warnings become engineering work only when each has an executable probe or a routing rule. Store probe failures beside the main score.

| Probe | Controlled change | Suspicious signal | Response |
| --- | --- | --- | --- |
| Position | Swap only slots A and B | Winner follows slot | Record unstable result |
| Length | Add apologies and repeated policy text, no new help | Padded copy wins | Tighten concision rubric and track length |
| Identity | Reveal prompt or model labels in one run only | Preference changes | Keep candidates anonymous |
| Ambiguity | Compare two equally useful rewrites | Forced winner | Permit ties or human review |

Length is not only a hypothetical confounder. Length-Controlled AlpacaEval proposes a regression-based adjustment intended to answer what preference would have been if compared answers had equal length.[3] In a local product eval, the smaller first step is to add same-information length probes and report when padding wins.
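A same-information length probe can be reported with two simple numbers: how often the padded reply wins, and how much longer it is. The sketch below is plain bookkeeping, not the regression-based adjustment from Length-Controlled AlpacaEval; `LengthProbe`, `padding_win_rate`, `word_ratio`, and the stored verdicts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthProbe:
    shorter: str
    longer: str        # same information as `shorter`, plus padding
    judge_winner: str  # "shorter", "longer", or "tie"


def padding_win_rate(probes: list[LengthProbe]) -> float:
    """Share of same-information probes where the padded reply won."""
    return sum(p.judge_winner == "longer" for p in probes) / len(probes)


def word_ratio(probe: LengthProbe) -> float:
    """How many times longer the padded reply is, in whitespace-separated words."""
    return len(probe.longer.split()) / len(probe.shorter.split())


# Hypothetical stored verdicts; a real run would read them from judge logs.
base = "Your laptop qualifies for a replacement."
length_probes = [
    LengthProbe(base, base + " We sincerely apologize for the inconvenience.", "longer"),
    LengthProbe(base, base + " Thank you for your patience.", "tie"),
    LengthProbe(base, base + " We value your business.", "shorter"),
    LengthProbe(base, base + " We apologize again for any trouble.", "longer"),
]

rate = padding_win_rate(length_probes)
print(f"Padding win rate: {rate:.2f}")
print(f"First probe word ratio: {word_ratio(length_probes[0]):.2f}")
```

Since the padded reply adds no information, any sustained padding win rate above zero is worth investigating before the metric is trusted.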

The following fixtures model stored judge returns from two probes. The code doesn't pretend to detect bias from text alone; it asks whether the judge failed a case whose expected behavior you defined in advance.

bias-probe-report.py

```python
from dataclasses import dataclass

# Continues the lab session: reuses `answers` and `stable` from earlier cells.


@dataclass(frozen=True)
class ProbeResult:
    name: str
    expected_winner: str
    observed_winner: str


# Same information as the brief reply, with apologies added and no new help.
padded = (
    answers["brief"]
    + " We sincerely apologize for the inconvenience. "
    + "We appreciate your patience while we process your replacement."
)

probes = [
    ProbeResult(
        name="position_swap",
        expected_winner="actionable",
        observed_winner=str(stable["winner"]),
    ),
    ProbeResult(
        name="same_information_padding",
        expected_winner="brief",
        observed_winner="padded",
    ),
]

failed_probes = [
    probe.name for probe in probes if probe.expected_winner != probe.observed_winner
]

assert "replacement" in padded.lower()
assert failed_probes == ["same_information_padding"]

print(f"Probes run: {len(probes)}")
print(f"Failed probes: {failed_probes}")
print("Action: block metric promotion until padding preference is fixed")
```

Output

```text
Probes run: 2
Failed probes: ['same_information_padding']
Action: block metric promotion until padding preference is fixed
```

This is a valuable negative result. Releasing a judge because it produced pleasing scores would make the evaluation system worse. A failed probe tells you exactly what to repair.

Calibrate the measurement against people

Hard gates have test oracles. Soft judgments need a labeled calibration set: humans apply the same rubric to a representative sample, then the judge is scored against those labels.

Raw agreement is easy to understand, but can overstate reliability when one label dominates. Cohen's kappa corrects for agreement expected from each rater's label frequencies:[4]

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Here, $p_o$ is observed agreement and $p_e$ is agreement expected from label prevalence. Kappa isn't a universal release threshold. Your baseline is human-human agreement on the same rubric and the same workflow slices.

This tiny calibration set is intentionally too small to approve a real metric. It shows the computation and demonstrates why a promising number alone can't release an evaluator.

calibrate-against-human-labels.py

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledDecision:
    case_id: str
    slice_name: str
    human: str
    judge: str


calibration_rows = [
    LabeledDecision("r1", "replacement", "actionable", "actionable"),
    LabeledDecision("r2", "replacement", "brief", "brief"),
    LabeledDecision("r3", "replacement", "tie", "tie"),
    LabeledDecision("r4", "replacement", "actionable", "actionable"),
    LabeledDecision("r5", "address_change", "brief", "brief"),
    LabeledDecision("r6", "address_change", "tie", "actionable"),
    LabeledDecision("r7", "address_change", "actionable", "brief"),
    LabeledDecision("r8", "address_change", "brief", "brief"),
]


def raw_agreement(rows: list[LabeledDecision]) -> float:
    return sum(row.human == row.judge for row in rows) / len(rows)


def cohens_kappa(rows: list[LabeledDecision]) -> float:
    labels = {row.human for row in rows} | {row.judge for row in rows}
    total = len(rows)
    human_counts = Counter(row.human for row in rows)
    judge_counts = Counter(row.judge for row in rows)
    observed = raw_agreement(rows)
    # Chance agreement: product of each rater's label frequency, summed over labels.
    expected = sum(
        human_counts[label] / total * judge_counts[label] / total
        for label in labels
    )
    return (observed - expected) / (1.0 - expected)


agreement = raw_agreement(calibration_rows)
kappa = cohens_kappa(calibration_rows)
assert agreement == 0.75

print(f"Calibration rows: {len(calibration_rows)}")
print(f"Raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.3f}")
print("Release evidence: insufficient sample; collect labeled slices")
```

Output

```text
Calibration rows: 8
Raw agreement: 0.75
Cohen's kappa: 0.610
Release evidence: insufficient sample; collect labeled slices
```
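The reported value can be checked by hand from the eight rows. Humans assigned actionable, brief, and tie 3, 3, and 2 times; the judge assigned them 3, 4, and 1 times; the two agree on 6 of 8 rows. A worked version of the computation, under those counts:

```latex
\begin{aligned}
p_o &= \frac{6}{8} = 0.75 \\
p_e &= \frac{3}{8}\cdot\frac{3}{8} + \frac{3}{8}\cdot\frac{4}{8} + \frac{2}{8}\cdot\frac{1}{8}
     = \frac{23}{64} \approx 0.359 \\
\kappa &= \frac{p_o - p_e}{1 - p_e}
        = \frac{0.75 - 0.359375}{1 - 0.359375}
        = \frac{0.390625}{0.640625} \approx 0.610
\end{aligned}
```

Raw agreement of 0.75 shrinks to about 0.61 once the roughly 36% agreement expected from label frequencies alone is removed.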

An aggregate can conceal the exact problem that requires attention. Report the calibration set by workflow slice before allowing the judge metric to guide any experiment.

calibration-by-workflow-slice.py

```python
# Continues the lab session: reuses `LabeledDecision`, `raw_agreement`,
# and `calibration_rows` from calibrate-against-human-labels.py.


def agreement_by_slice(rows: list[LabeledDecision]) -> dict[str, float]:
    grouped: dict[str, list[LabeledDecision]] = {}
    for row in rows:
        grouped.setdefault(row.slice_name, []).append(row)
    return {name: raw_agreement(items) for name, items in grouped.items()}


slice_agreement = agreement_by_slice(calibration_rows)
# 0.75 is this lab's illustrative cutoff, not a universal threshold.
weak_slices = [
    name for name, score in slice_agreement.items() if score < 0.75
]

assert slice_agreement["replacement"] == 1.0
assert slice_agreement["address_change"] == 0.5
assert weak_slices == ["address_change"]

for name, score in slice_agreement.items():
    print(f"{name}: agreement={score:.2f}")
print(f"Slices requiring review: {weak_slices}")
```

Output

```text
replacement: agreement=1.00
address_change: agreement=0.50
Slices requiring review: ['address_change']
```

For an actual evaluation program:

  1. Freeze a rubric and collect human labels for easy wins, real ties, and known failures.
  2. Include workflow slices such as replacement, delivery delay, and address change.
  3. Record human-human agreement before comparing the judge to people.
  4. Re-run calibration after prompt, judge-model, rubric, or traffic-distribution changes.
  5. Escalate slices where agreement or bias probes fail, even if aggregate agreement looks healthy.
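Step 3 can be sketched by computing kappa twice with the same function: once between two human reviewers, once between the judge and a human. The labels below are hypothetical and far too few for a real baseline; `kappa`, `human_a`, `human_b`, and `judge` are illustrative names.

```python
from collections import Counter


def kappa(first: list[str], second: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same cases in the same order."""
    if not first or len(first) != len(second):
        raise ValueError("raters must label the same non-empty set of cases")
    total = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / total
    first_counts, second_counts = Counter(first), Counter(second)
    expected = sum(
        first_counts[label] / total * second_counts[label] / total
        for label in set(first) | set(second)
    )
    if expected == 1.0:
        # Both raters used a single shared label, so kappa is undefined.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six cases from two reviewers and one judge.
human_a = ["actionable", "brief", "tie", "actionable", "brief", "tie"]
human_b = ["actionable", "brief", "tie", "actionable", "brief", "actionable"]
judge = ["actionable", "brief", "actionable", "actionable", "brief", "brief"]

human_human = kappa(human_a, human_b)
judge_human = kappa(judge, human_a)
judge_below_baseline = judge_human < human_human

print(f"Human-human kappa: {human_human:.2f}")
print(f"Judge-human kappa: {judge_human:.2f}")
print(f"Judge below human baseline: {judge_below_baseline}")
```

If people applying the rubric only reach moderate agreement with each other, a judge can't reasonably be expected to exceed that ceiling, and the rubric itself may need tightening first.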

Conversation quality still needs the trace

Once a support conversation has multiple turns, a fluent final reply can conceal a bad evidence path. A judge packet should include relevant conversation turns, selected evidence identifiers, hard-gate outcomes, and the safe candidates being compared.

[Diagram: evaluation packet for Luna's RPL-14 conversation containing transcript, selected policy version, hard-gate pass, two safe candidates, and soft judge dimensions.]
Soft evaluation doesn't replace trace inspection. The packet tells the judge what has already been validated and gives reviewers enough context to reproduce its comparison.

The next cell blocks a conversation before semantic judging if its trace isn't admissible. This is the same contract as the single-turn example, applied to a fuller packet.

trace-aware-conversation-packet.py

```python
from dataclasses import dataclass

# Continues the lab session: reuses `AnswerTrace`, `trace`, `answers`,
# and `hard_failures` from supported-candidates.py.


@dataclass(frozen=True)
class ConversationBundle:
    turns: tuple[str, ...]
    answer_trace: AnswerTrace
    candidate_names: tuple[str, str]


def route_bundle(bundle: ConversationBundle) -> str:
    # An inadmissible trace blocks the whole bundle before any judging.
    if not bundle.answer_trace.admissible:
        return "blocked_before_judge"
    for name in bundle.candidate_names:
        if hard_failures(answers[name], bundle.answer_trace):
            return "blocked_before_judge"
    return "ready_for_soft_judge"


safe_bundle = ConversationBundle(
    turns=(
        "Customer: My refurbished laptop failed after delivery.",
        "Luna: I found the EU refurbished-device policy.",
        "Customer: What remedy can I receive?",
    ),
    answer_trace=trace,
    candidate_names=("brief", "actionable"),
)
stale_bundle = ConversationBundle(
    turns=safe_bundle.turns,
    answer_trace=AnswerTrace(
        request_id=trace.request_id,
        selected_source_id=trace.selected_source_id,
        selected_version="eu-electronics/2025-01-01",
        admissible=False,
        allowed_remedy="replacement",
    ),
    candidate_names=("brief", "actionable"),
)

assert route_bundle(safe_bundle) == "ready_for_soft_judge"
assert route_bundle(stale_bundle) == "blocked_before_judge"

print(f"Current policy bundle: {route_bundle(safe_bundle)}")
print(f"Stale policy bundle: {route_bundle(stale_bundle)}")
```

Output

```text
Current policy bundle: ready_for_soft_judge
Stale policy bundle: blocked_before_judge
```

Use judges offline before letting them guide changes

LLM judging is usually most defensible as an offline experiment metric: compare prompt versions or model releases over a frozen dataset, investigate disagreements, and let humans approve consequential changes. It is rarely a good reason to make a real-time policy decision for one customer.

Define an explicit promotion contract. The numbers below are illustrative requirements for this lab, not universal industry thresholds:

Release evidenceLab requirementCurrent lab state
Every candidate passed deterministic policy gatesRequiredPass
Labeled calibration rowsAt least 508
Known bias probesAll passLength probe fails
Human review pathRequiredDefined
judge-metric-promotion-gate.py
1@dataclass(frozen=True) 2class MetricPromotion: 3 hard_gate_passed: bool 4 calibration_count: int 5 minimum_calibration_count: int 6 failed_bias_probes: tuple[str, ...] 7 has_human_review_path: bool 8 9def promotion_failures(promotion: MetricPromotion) -> list[str]: 10 failures: list[str] = [] 11 if not promotion.hard_gate_passed: 12 failures.append("hard policy checks failed") 13 if promotion.calibration_count < promotion.minimum_calibration_count: 14 failures.append("calibration set is too small") 15 if promotion.failed_bias_probes: 16 failures.append("judge failed a bias probe") 17 if not promotion.has_human_review_path: 18 failures.append("human escalation path is missing") 19 return failures 20 21promotion = MetricPromotion( 22 hard_gate_passed=True, 23 calibration_count=len(calibration_rows), 24 minimum_calibration_count=50, 25 failed_bias_probes=tuple(failed_probes), 26 has_human_review_path=True, 27) 28failures = promotion_failures(promotion) 29 30assert failures == [ 31 "calibration set is too small", 32 "judge failed a bias probe", 33] 34 35print("Metric promotion: BLOCKED") 36for failure in failures: 37 print(f"- {failure}") 38print("Next work: label more cases and repair length sensitivity")
Output
```text
Metric promotion: BLOCKED
- calibration set is too small
- judge failed a bias probe
Next work: label more cases and repair length sensitivity
```

A blocked promotion is the correct result. The lab has produced a useful candidate preference, but it hasn't established that its judge deserves to influence prompt selection across real customer workflows.

A practical evaluation report

When you implement this pattern in a real project, store a report with these sections:

| Report section | Evidence to retain | Decision it supports |
| --- | --- | --- |
| Hard-gate results | Source IDs, versions, claim failures | Which answers are ineligible |
| Rubric contract | Criteria, anchors, allowed verdicts | What the judge was asked to measure |
| Raw judge runs | Both slot orders and rationale snippets | Whether preference is reproducible |
| Bias probes | Position, length, identity, tie cases | Whether known shortcuts remain |
| Calibration | Human labels, per-slice agreement, kappa | Whether metric matches reviewers |
| Promotion decision | Failed requirements and owner | Whether new metric may guide release |
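One way to keep these sections together is a single serializable record. The sketch below is an assumption about how such a report might be stored: the `EvaluationReport` class, its field names, and the sample values (including the `eval-team` owner) are hypothetical and simply mirror the report sections listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvaluationReport:
    # One field per report section; the contents are free-form in this sketch.
    hard_gate_results: list[dict] = field(default_factory=list)
    rubric_contract: dict = field(default_factory=dict)
    raw_judge_runs: list[dict] = field(default_factory=list)
    bias_probes: dict = field(default_factory=dict)
    calibration: dict = field(default_factory=dict)
    promotion_decision: dict = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Name every section that is still empty, in field order."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical partial report, for illustration only.
report = EvaluationReport(
    rubric_contract={"criteria": ["clarity"], "allowed_verdicts": ["A", "B", "tie"]},
    promotion_decision={
        "failed_requirements": ["calibration set is too small"],
        "owner": "eval-team",
    },
)
missing = report.missing_sections()
print("Missing sections:", missing)
print(json.dumps(asdict(report), indent=2))
```

Checking for empty sections before archiving makes it harder to record a promotion decision without the evidence that justifies it.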

The scientist's habit is to evaluate the evaluator. A judge score is one observation; a calibrated, stress-tested metric with recorded failure modes is evidence.

Mastery check

What you can now do

| Skill | Evidence from the lab |
| --- | --- |
| Separate exact policy truth from soft quality | Unsafe refund answer fails deterministic checks before judging. |
| Build a reproducible judge request | Packet keeps candidates anonymous, records rubric anchors, and requests structured verdicts. |
| Treat judge output as untrusted data | Parser rejects malformed evidence and preserves ties plus escalation. |
| Detect slot and verbosity shortcuts | Order swaps normalize candidate identity; probes fail when padded wording wins. |
| Calibrate before promotion | Human labels, per-slice agreement, Cohen's kappa, and explicit promotion requirements keep a demo from becoming a release metric. |
| Preserve trace provenance | Conversation packet carries policy identity, version, and hard-gate outcomes into offline review. |

Evaluation rubric

  • Keeps policy truth in deterministic gates and sends only supported answers to the judge
  • Parses structured verdicts as untrusted data and preserves ties, escalation, and swap instability separately
  • Probes position and verbosity shortcuts before promoting the metric
  • Compares judge decisions with human labels by workflow slice
  • Blocks metric promotion when calibration, trace, or bias-probe evidence is inadequate

Common pitfalls

The judge is asked to authorize policy truth

Symptom: A polished but unsupported refund reply receives a high score. Cause: The pipeline sends all answers to the judge before deterministic policy checks. Fix: Block inadmissible evidence and unsupported claims first; judge only remaining soft differences.
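The fix can be sketched as a filter that runs before any judge call. The `Candidate` type, the `split_by_hard_gate` helper, the policy version strings, and the sample claims are hypothetical; a real gate would use the lab's own deterministic checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    policy_version: str
    unsupported_claims: tuple[str, ...]  # claims with no backing in policy sources


def split_by_hard_gate(
    candidates: list[Candidate], current_policy_version: str
) -> tuple[list[Candidate], dict[str, list[str]]]:
    """Return (eligible candidates, blocked reasons keyed by candidate id)."""
    eligible: list[Candidate] = []
    blocked: dict[str, list[str]] = {}
    for candidate in candidates:
        reasons = []
        if candidate.policy_version != current_policy_version:
            reasons.append("stale policy bundle")
        if candidate.unsupported_claims:
            reasons.append("unsupported claim")
        if reasons:
            blocked[candidate.candidate_id] = reasons
        else:
            eligible.append(candidate)
    return eligible, blocked


# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("reply-current", "policy-v7", ()),
    Candidate("reply-stale", "policy-v6", ()),
    Candidate("reply-unsupported", "policy-v7", ("refund after 90 days",)),
]
eligible, blocked = split_by_hard_gate(candidates, "policy-v7")
print([c.candidate_id for c in eligible])
print(blocked)
```

Only `eligible` is ever passed to the judge, so a polished reply cannot earn a score for a claim the policy does not support.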

Pairwise wins follow answer position

Symptom: A prompt variant wins when placed in slot A, then loses when placed in slot B. Cause: The evaluation reports one ordering and ignores position bias. Fix: Run both orderings, normalize to candidate identity, and record flips as unstable or route them to humans.
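A minimal sketch of the both-orderings check, assuming the judge returns a slot label (`"A"`, `"B"`, or `"tie"`) for each run. The function names and the `"unstable"` label are illustrative choices rather than the lab's exact interface.

```python
def to_identity(slot_verdict: str, slot_a: str, slot_b: str) -> str:
    """Map a slot-level verdict back to the candidate that occupied the slot."""
    if slot_verdict == "tie":
        return "tie"
    if slot_verdict == "A":
        return slot_a
    if slot_verdict == "B":
        return slot_b
    raise ValueError(f"unexpected verdict: {slot_verdict!r}")


def reconcile_swapped_runs(
    verdict_first: str, verdict_swapped: str, x: str, y: str
) -> str:
    """First run shows x in slot A; the swapped run shows y in slot A."""
    first = to_identity(verdict_first, slot_a=x, slot_b=y)
    swapped = to_identity(verdict_swapped, slot_a=y, slot_b=x)
    # A preference counts only if it survives the swap.
    return first if first == swapped else "unstable"


print(reconcile_swapped_runs("A", "B", "prompt-v1", "prompt-v2"))  # same candidate both times
print(reconcile_swapped_runs("A", "A", "prompt-v1", "prompt-v2"))  # slot A wins both times
```

Counting `"unstable"` outcomes separately from ties keeps position bias visible in the report instead of letting it fold into the win rate.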

Longer replies win by repeating the same facts

Symptom: Apologies and duplicated policy text improve judge score without helping the customer. Cause: The rubric doesn't make concision measurable, and there is no length probe. Fix: Add same-information padding probes, track response length, and block metric promotion while padding wins.
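A same-information padding probe might look like the sketch below. The judge is modeled as a plain callable, and the two stub judges exist only to show a failing and a passing probe; the sample reply text is invented for illustration.

```python
from typing import Callable

# A judge takes (slot A text, slot B text) and returns "A", "B", or "tie".
Judge = Callable[[str, str], str]


def padding_probe_passes(judge: Judge, concise: str, padded: str) -> bool:
    """Pass only if the padded reply never wins, in either slot order."""
    padded_in_b = judge(concise, padded)  # padded wins here if the verdict is "B"
    padded_in_a = judge(padded, concise)  # padded wins here if the verdict is "A"
    return padded_in_b != "B" and padded_in_a != "A"


def length_biased_judge(a: str, b: str) -> str:
    """Stub judge that always prefers the longer text."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


def length_blind_judge(a: str, b: str) -> str:
    """Stub judge that treats same-information replies as tied."""
    return "tie"


# Hypothetical replies: the padded one repeats the same fact plus an apology.
concise = "Your refund request is within the return window."
padded = f"{concise} We are so sorry for the trouble. {concise}"

print(padding_probe_passes(length_biased_judge, concise, padded))
print(padding_probe_passes(length_blind_judge, concise, padded))
```

A failed probe would feed the `failed_bias_probes` field of the promotion contract, which blocks promotion while padding still wins.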

A high aggregate agreement hides a weak slice

Symptom: Overall calibration looks acceptable, but address-change replies are frequently misjudged. Cause: Evaluation reports only one aggregate number. Fix: Label and report agreement by workflow slice, then escalate or repair failed slices before release.
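Per-slice agreement and Cohen's kappa can be computed with the standard library alone. The rows below are invented to show an aggregate that hides a weak slice; the slice names and helper functions are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e) for two raters over nominal labels."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when both raters use one identical label
    return (observed - expected) / (1 - expected)


def agreement_by_slice(rows: list[tuple[str, str, str]]) -> dict[str, float]:
    """rows are (slice, judge_label, human_label); returns raw agreement per slice."""
    hits: dict[str, list[bool]] = defaultdict(list)
    for slice_name, judge_label, human_label in rows:
        hits[slice_name].append(judge_label == human_label)
    return {name: sum(flags) / len(flags) for name, flags in hits.items()}


# Hypothetical calibration rows, for illustration only.
rows = [
    ("refund", "A", "A"), ("refund", "B", "B"),
    ("refund", "A", "A"), ("refund", "B", "B"),
    ("address_change", "A", "B"), ("address_change", "A", "A"),
    ("address_change", "B", "A"), ("address_change", "A", "B"),
]
per_slice = agreement_by_slice(rows)
kappa = cohens_kappa([r[1] for r in rows], [r[2] for r in rows])
print(per_slice)
print(f"overall kappa: {kappa:.2f}")
```

In this toy data the refund slice agrees on every row while the address-change slice agrees on one row in four, a gap that a single aggregate figure would not show.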

A demonstration becomes a release metric too early

Symptom: Eight hand-picked cases become the quality gate for a new prompt. Cause: The team treats a runnable example as a validation dataset. Fix: Write a promotion contract with calibration size, probe, trace, and human-review requirements.

Next Step
Continue to Bias & Fairness in LLMs

You can now treat an automated judge as a measured instrument rather than an oracle. Next you will test whether model and evaluator outcomes remain reliable across customer groups and language varieties.

References

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

Zheng, L., et al. · 2023 · NeurIPS 2023

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.

Liu, Y., et al. · 2023

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.

Dubois, Y., et al. · 2024

A Coefficient of Agreement for Nominal Scales.

Cohen, J. · 1960 · Educational and Psychological Measurement