Build a matched-pair fairness audit for an LLM judge, measure routing gaps, and block release when evidence is too weak.
The previous lesson calibrated a large language model (LLM) judge that scores supported customer replies for clarity and actionability. Calibration on average isn't enough. If that judge sends equivalent requests down different routes for different customer language varieties, it can delay service for some customers even while its overall agreement score looks acceptable.
ShopFlow now wants to auto-serve a supported replacement reply when the judge is confident, and send uncertain replies to a human reviewer. This lesson builds a fairness audit for that release decision. The audit asks one controlled question: when request facts and supported remedy stay fixed, does a change in language variety alter who receives the fast path?
The code below uses invented, labeled fixtures. Its two wording variants are test conditions, not demographic groups and not a claim about any community's speech. A real language-variety audit needs representative data, informed review, privacy controls, and careful group definitions.
A model can cause two broad kinds of harm. Representational harm occurs when output stereotypes, demeans, or erases a group. Allocative harm occurs when system behavior changes access to a benefit or burden, such as whether an eligible customer gets an immediate supported answer or waits for review. This distinction is central in surveys of bias and fairness in LLM systems.[1]
Our running case is allocative. The remedy is already authorized by the selected policy evidence. The new outcome is the route:
| Decision component | Held fixed or measured? | Why it matters |
|---|---|---|
| Policy source and version | Held fixed | A fairness audit can't repair unsupported claims. |
| Replacement eligibility | Held fixed within each matched pair | Each pair should deserve the same answer. |
| Language-variety fixture | Varied within each pair | It's the audit condition. |
| Judge score and route | Measured | Unequal routing is customer impact. |
This question differs from the one in the previous lesson. There, we asked whether a judge agrees with reviewers. Here, we ask whether its errors and routing decisions are unevenly distributed.
Figure 1: Fairness evaluation begins after evidence support. It measures whether a soft judge and router create uneven outcomes across controlled slices.
A matched pair changes one audit condition while preserving task semantics. That is harder than swapping words mechanically. Two prompts belong in a pair only after reviewers agree they describe the same customer facts and should receive the same route.
For this lab, a human reviewer has labeled ten synthetic pairs. In six pairs the reply is supported and clear enough for auto-service. In four, it should go to review because the wording of the proposed reply remains unclear. Both versions in a pair share the same expected outcome.
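The admission rule for a matched pair can be sketched as a small check. This is an illustration under assumptions, not part of the lab's fixtures: the `PairCandidate` type, its field names, and the sample facts are hypothetical. The sketch admits a pair only when both variants carry identical facts and every reviewer assigns both variants the same route.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PairCandidate:  # hypothetical type, for illustration only
    pair_id: str
    formal_facts: dict[str, str]
    conversational_facts: dict[str, str]
    # One (formal_route, conversational_route) judgment per reviewer.
    reviewer_routes: tuple[tuple[str, str], ...]

def is_matched_pair(candidate: PairCandidate) -> bool:
    """Admit a pair only if facts match and all reviewers agree on one route."""
    same_facts = candidate.formal_facts == candidate.conversational_facts
    routes = {route for judgment in candidate.reviewer_routes for route in judgment}
    return same_facts and len(candidate.reviewer_routes) > 0 and len(routes) == 1

# Invented facts; a real audit would use reviewed request records.
facts = {"order_status": "delivered_damaged", "remedy": "replacement"}
accepted = PairCandidate("p1", facts, dict(facts), (("auto_serve", "auto_serve"),) * 2)
rejected = PairCandidate(
    "x1", facts, dict(facts),
    (("auto_serve", "auto_serve"), ("auto_serve", "human_review")),
)
print(is_matched_pair(accepted), is_matched_pair(rejected))
```

A pair that fails this check is not evidence of judge bias; it is a fixture defect and should be repaired or dropped before scoring.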
```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt

@dataclass(frozen=True)
class AuditRow:
    pair_id: str
    variant: str
    expected_auto_serve: bool
    judge_score: float
    channel: str
    evidence_passed: bool = True

THRESHOLD = 0.70

def route(row: AuditRow) -> str:
    if not row.evidence_passed:
        return "blocked_by_evidence"
    return "auto_serve" if row.judge_score >= THRESHOLD else "human_review"

# Synthetic observations: (pair, expected outcome, channel, formal score, conversational score)
observations = [
    ("p1", True, "chat", 0.92, 0.88),
    ("p2", True, "chat", 0.88, 0.73),
    ("p3", True, "chat", 0.84, 0.72),
    ("p4", True, "email", 0.78, 0.67),
    ("p5", True, "email", 0.74, 0.56),
    ("p6", True, "email", 0.66, 0.61),
    ("n1", False, "chat", 0.71, 0.60),
    ("n2", False, "chat", 0.62, 0.55),
    ("n3", False, "email", 0.60, 0.52),
    ("n4", False, "email", 0.68, 0.63),
]

rows: list[AuditRow] = []
for pair_id, expected, channel, formal_score, conversational_score in observations:
    rows.extend([
        AuditRow(pair_id, "formal", expected, formal_score, channel),
        AuditRow(pair_id, "conversational", expected, conversational_score, channel),
    ])

assert len(rows) == 20
assert all(row.evidence_passed for row in rows)
assert {
    row.pair_id: row.expected_auto_serve for row in rows if row.variant == "formal"
} == {
    row.pair_id: row.expected_auto_serve for row in rows if row.variant == "conversational"
}

print("Fixture type: synthetic matched wording audit")
print(f"Matched pairs: {len(observations)}; scored rows: {len(rows)}")
print(f"Routing threshold: {THRESHOLD:.2f}")
```

```text
Fixture type: synthetic matched wording audit
Matched pairs: 10; scored rows: 20
Routing threshold: 0.70
```

The fixture intentionally creates a failure. If every example passed, we could demonstrate arithmetic but not diagnosis.
A pair flip is the simplest warning sign: equivalent requests receive different routes. It doesn't yet prove a population-level disparity, but it tells the team exactly which cases require investigation.
```python
by_pair: dict[str, list[AuditRow]] = defaultdict(list)
for row in rows:
    by_pair[row.pair_id].append(row)

flips: list[tuple[str, str, str]] = []
for pair_id, pair_rows in by_pair.items():
    outcomes = {row.variant: route(row) for row in pair_rows}
    if len(set(outcomes.values())) > 1:
        flips.append((pair_id, outcomes["formal"], outcomes["conversational"]))

assert flips == [
    ("p4", "auto_serve", "human_review"),
    ("p5", "auto_serve", "human_review"),
    ("n1", "auto_serve", "human_review"),
]

print("Flipped matched pairs:")
for pair_id, formal_route, conversational_route in flips:
    print(f"  {pair_id}: formal={formal_route}, conversational={conversational_route}")
```

```text
Flipped matched pairs:
  p4: formal=auto_serve, conversational=human_review
  p5: formal=auto_serve, conversational=human_review
  n1: formal=auto_serve, conversational=human_review
```

Two eligible replies lose the fast path under the conversational condition. One unclear reply gains the fast path under the formal condition. A single approval-rate number can't explain both errors.
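Because the rows are paired, one rough way to see why three flips are a warning rather than proof is an exact sign test on the discordant pairs (the exact form of McNemar's test). This test is an addition for illustration and is not part of the lab's release contract. The function name is hypothetical, and the counts are hard-coded from the flip list above.

```python
from math import comb

def exact_sign_test(formal_only: int, conversational_only: int) -> float:
    """Two-sided exact p-value for discordant pairs under a 50/50 null."""
    discordant = formal_only + conversational_only
    if discordant == 0:
        return 1.0
    larger = max(formal_only, conversational_only)
    upper_tail = sum(comb(discordant, k) for k in range(larger, discordant + 1))
    return min(1.0, 2 * upper_tail / 2 ** discordant)

# All three flips (p4, p5, n1) auto-served the formal variant only;
# none auto-served the conversational variant only.
p_value = exact_sign_test(formal_only=3, conversational_only=0)
print(f"Discordant pairs: 3 vs 0; exact two-sided p={p_value:.2f}")
```

With only three discordant pairs, even a completely one-sided split gives p = 0.25. The flips are still worth investigating one by one, but they can't carry a population claim.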
Four group metrics appear frequently in fairness work. They answer different product questions:
| Metric | Calculation | Question for the router |
|---|---|---|
| Selection rate | Auto-served / all requests | Does one slice receive fast service more often? |
| True positive rate (TPR) | Auto-served / replies reviewers say are ready | Do ready replies receive fast service equally often? |
| False positive rate (FPR) | Auto-served / replies reviewers say need review | Does one slice receive unsafe fast service more often? |
| Calibration | Observed ready rate within each score band | Does a 0.80 score carry the same meaning in every slice? |
Equal opportunity compares TPR across slices. Equalized odds compares both TPR and FPR. Hardt, Price, and Srebro formalized these error-rate criteria for supervised decision systems.[2] For ShopFlow, delayed eligible help is the main harm, so TPR gap is the primary release metric. FPR gap remains a guardrail because faster service isn't a win if it releases unclear replies.
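Written out, the two gaps look like this. The symbols are introduced here for illustration: $\hat{Y}=1$ means the reply was auto-served, $Y=1$ means reviewers marked it ready, and $A$ is the wording condition.

```latex
\begin{aligned}
\mathrm{TPR}_a &= P(\hat{Y}=1 \mid Y=1,\ A=a), &
\mathrm{FPR}_a &= P(\hat{Y}=1 \mid Y=0,\ A=a), \\
\Delta_{\mathrm{TPR}} &= \lvert \mathrm{TPR}_{\text{formal}} - \mathrm{TPR}_{\text{conversational}} \rvert, &
\Delta_{\mathrm{FPR}} &= \lvert \mathrm{FPR}_{\text{formal}} - \mathrm{FPR}_{\text{conversational}} \rvert.
\end{aligned}
```

Equal opportunity asks for $\Delta_{\mathrm{TPR}}$ to be close to zero. Equalized odds asks for both $\Delta_{\mathrm{TPR}}$ and $\Delta_{\mathrm{FPR}}$ to be close to zero.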
```python
@dataclass(frozen=True)
class Rates:
    selection: float
    tpr: float
    fpr: float
    positive_count: int
    negative_count: int

def slice_rates(slice_rows: list[AuditRow]) -> Rates:
    positives = [row for row in slice_rows if row.expected_auto_serve]
    negatives = [row for row in slice_rows if not row.expected_auto_serve]
    selected = [row for row in slice_rows if route(row) == "auto_serve"]
    true_positives = [row for row in positives if route(row) == "auto_serve"]
    false_positives = [row for row in negatives if route(row) == "auto_serve"]
    return Rates(
        selection=len(selected) / len(slice_rows),
        tpr=len(true_positives) / len(positives),
        fpr=len(false_positives) / len(negatives),
        positive_count=len(positives),
        negative_count=len(negatives),
    )

rates = {
    variant: slice_rates([row for row in rows if row.variant == variant])
    for variant in ("formal", "conversational")
}

def gap(metric: str) -> float:
    return abs(getattr(rates["formal"], metric) - getattr(rates["conversational"], metric))

for variant, result in rates.items():
    print(
        f"{variant:14} selection={result.selection:.1%} "
        f"TPR={result.tpr:.1%} FPR={result.fpr:.1%}"
    )
print(f"TPR gap={gap('tpr'):.1%}; FPR gap={gap('fpr'):.1%}")
```

```text
formal         selection=60.0% TPR=83.3% FPR=25.0%
conversational selection=30.0% TPR=50.0% FPR=0.0%
TPR gap=33.3%; FPR gap=25.0%
```

This audit fails in both directions. Among replies reviewers marked ready, the conversational condition is routed to human review more often. Among replies that need review, the formal condition is incorrectly auto-served once.
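The same fixture also shows why an overall agreement score can look acceptable while error types are unevenly distributed. The sketch below is a self-contained illustration: the `agreement` helper is hypothetical, and the scores are copied from the lab's synthetic observations.

```python
THRESHOLD = 0.70  # same routing threshold as the lab

# (expected_auto_serve, formal_score, conversational_score), copied from the fixture
pairs = [
    (True, 0.92, 0.88), (True, 0.88, 0.73), (True, 0.84, 0.72),
    (True, 0.78, 0.67), (True, 0.74, 0.56), (True, 0.66, 0.61),
    (False, 0.71, 0.60), (False, 0.62, 0.55), (False, 0.60, 0.52),
    (False, 0.68, 0.63),
]

def agreement(labeled_scores: list[tuple[bool, float]]) -> float:
    """Share of rows where the thresholded route matches the reviewer label."""
    hits = sum((score >= THRESHOLD) == expected for expected, score in labeled_scores)
    return hits / len(labeled_scores)

formal = [(expected, score) for expected, score, _ in pairs]
conversational = [(expected, score) for expected, _, score in pairs]

print(f"overall agreement={agreement(formal + conversational):.1%}")
print(f"formal agreement={agreement(formal):.1%}")
print(f"conversational agreement={agreement(conversational):.1%}")
```

Agreement differs by ten points between the two conditions (80% versus 70%), while the TPR gap is 33.3 points. Agreement mixes both error types: the formal condition has one missed ready reply and one wrongly auto-served reply, and the conversational condition has three missed ready replies and no wrongly auto-served ones.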
A release contract makes the choice reviewable. Thresholds below are product decisions for this lab, not universal definitions of fairness.
```python
@dataclass(frozen=True)
class FairnessContract:
    primary_metric: str
    max_tpr_gap: float
    max_fpr_gap: float
    min_positive_per_slice: int
    min_negative_per_slice: int

contract = FairnessContract(
    primary_metric="equal_opportunity",
    max_tpr_gap=0.10,
    max_fpr_gap=0.10,
    min_positive_per_slice=50,
    min_negative_per_slice=30,
)

metric_checks = {
    "TPR gap": gap("tpr") <= contract.max_tpr_gap,
    "FPR guardrail": gap("fpr") <= contract.max_fpr_gap,
}

for name, passed in metric_checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
assert not any(metric_checks.values())
```

```text
TPR gap: FAIL
FPR guardrail: FAIL
```

The audit found an actionable regression. It hasn't estimated production disparity. Six ready examples per wording condition are too few for a stable rate, and these fixtures don't identify a population.
One quick way to make that visible is a confidence interval. The Wilson interval below gives a plausible range for each TPR under binomial sampling. It isn't a complete statistical analysis, but it prevents a tiny dataset from looking decisive.
```python
def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    proportion = successes / total
    denominator = 1 + z * z / total
    center = (proportion + z * z / (2 * total)) / denominator
    radius = z * sqrt(
        (proportion * (1 - proportion) + z * z / (4 * total)) / total
    ) / denominator
    return center - radius, center + radius

for variant in ("formal", "conversational"):
    variant_rows = [
        row for row in rows
        if row.variant == variant and row.expected_auto_serve
    ]
    successes = sum(route(row) == "auto_serve" for row in variant_rows)
    low, high = wilson_interval(successes, len(variant_rows))
    print(
        f"{variant:14} TPR={successes}/{len(variant_rows)} "
        f"95% interval=[{low:.1%}, {high:.1%}]"
    )

enough_support = all(
    result.positive_count >= contract.min_positive_per_slice
    and result.negative_count >= contract.min_negative_per_slice
    for result in rates.values()
)
print(f"Minimum slice support: {'PASS' if enough_support else 'FAIL'}")
assert not enough_support
```

```text
formal         TPR=5/6 95% interval=[43.6%, 97.0%]
conversational TPR=3/6 95% interval=[18.8%, 81.2%]
Minimum slice support: FAIL
```

The intervals are wide because the fixture is small. That does not mean ignore the flips. It means use them as regression cases while collecting governed, reviewed evaluation data before making a population claim.
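To see what the contract's minimum support buys, the sketch below recomputes the Wilson interval for the same observed proportion at the lab's six positives and at the contract's fifty. The fifty-row case is hypothetical: it assumes the observed rate stays at 50%, which real data may not show.

```python
from math import sqrt

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    proportion = successes / total
    denominator = 1 + z * z / total
    center = (proportion + z * z / (2 * total)) / denominator
    radius = z * sqrt(
        (proportion * (1 - proportion) + z * z / (4 * total)) / total
    ) / denominator
    return center - radius, center + radius

# Same 50% observed rate at two sample sizes; the n=50 row is hypothetical.
widths: dict[int, float] = {}
for successes, total in [(3, 6), (25, 50)]:
    low, high = wilson_interval(successes, total)
    widths[total] = high - low
    print(f"n={total:2}: interval=[{low:.1%}, {high:.1%}], width={high - low:.1%}")
```

Even at fifty positives per slice the interval still spans more than twenty percentage points. A support floor makes rates reportable; it does not by itself make a 10-point gap threshold statistically decisive.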
The previous lesson used calibration to ask whether a judge agrees with reviewers. A fairness audit asks a stricter question: does a similar score carry similar meaning across slices? With ten rows per condition, score bands are diagnostic only.
```python
def score_band(score: float) -> str:
    if score < 0.70:
        return "below 0.70"
    if score < 0.90:
        return "0.70 to 0.89"
    return "0.90 and above"

calibration_cells: dict[tuple[str, str], list[AuditRow]] = defaultdict(list)
for row in rows:
    calibration_cells[(row.variant, score_band(row.judge_score))].append(row)

for (variant, band), cell in sorted(calibration_cells.items()):
    observed_ready = sum(row.expected_auto_serve for row in cell) / len(cell)
    print(f"{variant:14} {band:14}: n={len(cell)}, ready={observed_ready:.1%}")

assert max(len(cell) for cell in calibration_cells.values()) < 10
print("Calibration decision: insufficient support")
```

```text
conversational 0.70 to 0.89  : n=3, ready=100.0%
conversational below 0.70    : n=7, ready=42.9%
formal         0.70 to 0.89  : n=5, ready=80.0%
formal         0.90 and above: n=1, ready=100.0%
formal         below 0.70    : n=4, ready=25.0%
Calibration decision: insufficient support
```

Single-slice summaries can hide a failure limited to one channel, locale, or accessibility setting. Real systems therefore plan intersectional reports. They must also impose minimum support, because slicing a small audit repeatedly produces unstable numbers and privacy risks.
```python
intersection_counts: dict[tuple[str, str], int] = defaultdict(int)
for row in rows:
    if row.expected_auto_serve:
        intersection_counts[(row.channel, row.variant)] += 1

for (channel, variant), count in sorted(intersection_counts.items()):
    status = "eligible" if count >= contract.min_positive_per_slice else "insufficient"
    print(f"{channel:5} / {variant:14}: n={count}, {status}")

assert all(count == 3 for count in intersection_counts.values())
```

```text
chat  / conversational: n=3, insufficient
chat  / formal        : n=3, insufficient
email / conversational: n=3, insufficient
email / formal        : n=3, insufficient
```

In production, group definitions may involve sensitive attributes. Collect and expose them only under an approved purpose, access controls, privacy review, and any required consent or legal basis. A public dashboard with tiny protected-group cells can create harm while trying to measure it.
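One common safeguard is to suppress rates for cells below the support floor before a report leaves the audit team. The sketch below is illustrative only: `render_cell`, the floor of 50, and the larger cell's counts are assumptions, and a real suppression policy would come from privacy review.

```python
MIN_SUPPORT = 50  # assumed floor; mirrors the lab's min_positive_per_slice

def render_cell(name: str, successes: int, total: int, min_support: int = MIN_SUPPORT) -> str:
    """Return a report line, hiding the rate and count when support is too small."""
    if total < min_support:
        # Hide both rate and count: tiny cells are unstable and can identify people.
        return f"{name}: suppressed (fewer than {min_support} reviewed rows)"
    return f"{name}: TPR={successes / total:.1%} (n={total})"

report = [
    # From the lab's fixture: 0 of 3 ready email/conversational replies were auto-served.
    render_cell("email / conversational", successes=0, total=3),
    # Hypothetical cell large enough to report.
    render_cell("chat / formal", successes=54, total=60),
]
for line in report:
    print(line)
```

Suppressed cells should still appear in the report by name, so readers can see which slices lack evidence instead of assuming they passed.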
The matched pairs share policy evidence and human labels. Their routes diverge only after judge scoring. That localizes this lab's failure to the soft-evaluation and threshold layer. Rewriting customer text into a preferred register would conceal the symptom and ask customers to adapt to the system.
If a later investigation localizes a disparity to training data, counterfactual data augmentation (CDA) is one candidate experiment: add paired examples that alter an identity-related attribute while preserving the intended label. It isn't the first repair for this lab because the observed failure is in a deployed judge and router, not a proven training-set defect. CDA also needs review: careless swaps can change meaning, produce implausible text, or hide the group-specific harms you meant to measure.[1]
```python
def changed_stage(pair_rows: list[AuditRow]) -> str:
    if len({row.evidence_passed for row in pair_rows}) > 1:
        return "evidence_gate"
    if len({route(row) for row in pair_rows}) > 1:
        return "judge_or_route"
    return "no_observed_flip"

attribution = {
    pair_id: changed_stage(pair_rows)
    for pair_id, pair_rows in by_pair.items()
    if pair_id in {pair[0] for pair in flips}
}

print(attribution)
assert set(attribution.values()) == {"judge_or_route"}
```

```text
{'p4': 'judge_or_route', 'p5': 'judge_or_route', 'n1': 'judge_or_route'}
```

A reasonable next experiment is a revised rubric and judge prompt that focus on remedy correctness and actionable next steps rather than writing register. For teaching purposes, the following candidate rerun has equal rates. It still isn't release evidence: it uses the same synthetic cases that exposed the defect.
```python
candidate_scores = {
    "p1": (0.92, 0.91), "p2": (0.86, 0.84), "p3": (0.81, 0.80),
    "p4": (0.77, 0.75), "p5": (0.72, 0.71), "p6": (0.66, 0.65),
    "n1": (0.62, 0.61), "n2": (0.60, 0.58), "n3": (0.55, 0.56),
    "n4": (0.64, 0.62),
}

candidate_rows: list[AuditRow] = []
for pair_id, expected, channel, _, _ in observations:
    formal_score, conversational_score = candidate_scores[pair_id]
    candidate_rows.extend([
        AuditRow(pair_id, "formal", expected, formal_score, channel),
        AuditRow(pair_id, "conversational", expected, conversational_score, channel),
    ])

candidate_rates = {
    variant: slice_rates([row for row in candidate_rows if row.variant == variant])
    for variant in ("formal", "conversational")
}
candidate_tpr_gap = abs(candidate_rates["formal"].tpr - candidate_rates["conversational"].tpr)
candidate_fpr_gap = abs(candidate_rates["formal"].fpr - candidate_rates["conversational"].fpr)

print(f"Candidate TPR gap={candidate_tpr_gap:.1%}; FPR gap={candidate_fpr_gap:.1%}")
print("Interpretation: regression repaired on synthetic pairs, not validated for release")
assert candidate_tpr_gap == 0
assert candidate_fpr_gap == 0
```

```text
Candidate TPR gap=0.0%; FPR gap=0.0%
Interpretation: regression repaired on synthetic pairs, not validated for release
```

Product-specific matched pairs test the actual route customers experience. Public benchmarks provide broader regression coverage:
| Evaluation source | What it tests | Appropriate use here |
|---|---|---|
| Matched ShopFlow pairs | Routing consistency for supported replacement replies | Primary product release audit |
| WEAT / SEAT | Whether word or sentence representations encode tested association patterns | Diagnostic probe when you can inspect the model's embeddings; not a routing outcome measure[3][4] |
| StereoSet | Whether a language model assigns stronger preference to stereotypical than anti-stereotypical continuations in its test contexts | Probability-level stereotype regression probe[5] |
| RealToxicityPrompts / BOLD | Toxic degeneration from prompts and open-ended generation about demographic groups | Generation-level audit set that needs human review and product-specific slices[6][7] |
| BBQ | Whether question answering relies on stereotypes when context is ambiguous or disambiguated | Broad stereotype regression probe[8] |
| Reviewed toxicity slices | Whether a safety evaluator flags language varieties unevenly | Evaluator audit; Sap et al. showed dialect-related false-positive risk in hate-speech detection.[9] |
These benchmarks operate at different layers. WEAT or SEAT can expose associations in representations even when generated outputs look harmless; RealToxicityPrompts or BOLD can expose output harms without explaining which internal representation caused them. Don't treat a benchmark pass as proof that customer routing is fair. A benchmark tests its own prompt distribution and label design. Don't treat one product slice as full safety coverage either. Use both, and keep the limitation attached to every report.
Fairness metrics can conflict. When outcome prevalence differs across groups and predictions aren't perfect, a score calibrated within each group generally can't also equalize false-positive and false-negative rates across groups. Chouldechova demonstrated this incompatibility for risk scoring systems.[10] The engineering response isn't to give up or chase a single universal score. It's to define the customer harm, select a primary metric, monitor important counter-metrics, and document the accepted trade-off.
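The conflict can be made concrete with a confusion-matrix identity that ties the false-positive rate to prevalence, positive predictive value (PPV), and TPR. The sketch below is a simplification under stated assumptions: it uses PPV at a single threshold as a stand-in for calibration, and the prevalence, PPV, and TPR values are invented for illustration.

```python
def implied_fpr(prevalence: float, ppv: float, tpr: float) -> float:
    """FPR forced by prevalence p, PPV, and TPR.

    From PPV = p*TPR / (p*TPR + (1-p)*FPR), solving for FPR gives
    FPR = p/(1-p) * (1-PPV)/PPV * TPR.
    """
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * tpr

# Invented values: both slices share PPV and TPR, but ready replies
# are more common in slice A than in slice B.
shared_ppv, shared_tpr = 0.80, 0.80
fpr_a = implied_fpr(prevalence=0.60, ppv=shared_ppv, tpr=shared_tpr)
fpr_b = implied_fpr(prevalence=0.40, ppv=shared_ppv, tpr=shared_tpr)
print(f"slice A FPR={fpr_a:.1%}; slice B FPR={fpr_b:.1%}")
```

With PPV and TPR held equal, the difference in prevalence alone forces different false-positive rates (about 30% versus 13% here). Equalizing FPR as well would require giving up equal PPV or equal TPR.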
In this lab, the chosen outcome is rapid access to a supported reply. Equal opportunity is primary because it asks whether replies reviewers mark ready reach the fast path similarly. The FPR guardrail prevents a superficial fix that merely auto-serves more unclear replies.
A fairness report must say what was tested, what failed, and why a candidate can't yet ship. That keeps a clean toy rerun from being promoted into an unsupported production claim.
```python
release_requirements = {
    "synthetic_regression_pairs_pass": candidate_tpr_gap <= contract.max_tpr_gap
    and candidate_fpr_gap <= contract.max_fpr_gap,
    "representative_reviewed_slice_set": False,
    "minimum_positive_and_negative_support": False,
    "approved_group_definition_and_privacy_review": False,
    "production_monitoring_owner": False,
}

failures = [
    requirement
    for requirement, passed in release_requirements.items()
    if not passed
]
decision = "APPROVED" if not failures else "BLOCKED"

print(f"Metric promotion: {decision}")
for failure in failures:
    print(f"  missing: {failure}")

assert decision == "BLOCKED"
```

```text
Metric promotion: BLOCKED
  missing: representative_reviewed_slice_set
  missing: minimum_positive_and_negative_support
  missing: approved_group_definition_and_privacy_review
  missing: production_monitoring_owner
```

A blocked result is progress. The team now has reproducible regressions, a primary metric, counter-metric guardrails, a likely failing stage, and explicit evidence still needed before release.
Symptom: A chart claims a real customer group receives worse outcomes, but all values came from hand-written prompts. Cause: Synthetic tests were confused with representative measurement. Fix: Label fixtures clearly, use them for regressions, and require governed real evaluation data for population claims.
Symptom: Routing rates equalize because unclear replies are auto-served more often. Cause: The team watched selection rate without a false-positive guardrail. Fix: Track TPR and FPR together and tie the primary metric to customer harm.
Symptom: The pipeline rewrites one language variety into another before judging. Cause: The system treats customer expression as the defect. Fix: Audit the judge and routing stage with reviewed matched inputs; preserve meaning and customer voice.
Symptom: A parity report marks small intersections as passing. Cause: It omits sample requirements and uncertainty. Fix: Require minimum support, report insufficient cells, and protect sensitive slice data.
Symptom: The team proposes retraining the base model before locating where routes diverge. Cause: Fairness was treated as an abstract model property instead of a system outcome. Fix: Trace evidence, evaluator, threshold, and workflow outcome first, then fix the measured failing layer.
[1] Gallegos, I. O., Rossi, R. A., Barrow, J., et al. (2024). Bias and Fairness in Large Language Models: A Survey.
[2] Hardt, M., Price, E., & Srebro, N. (2016). Equality of Opportunity in Supervised Learning. NeurIPS 2016.
[3] Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science 356(6334).
[4] May, C., Wang, A., Bordia, S., Bowman, S. R., & Rudinger, R. (2019). On measuring social biases in sentence encoders. NAACL 2019.
[5] Nadeem, M., et al. (2020). StereoSet: Measuring stereotypical bias in pretrained language models. ACL 2021.
[6] Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models.
[7] Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., & Gupta, R. (2021). BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. FAccT 2021.
[8] Parrish, A., et al. (2022). BBQ: A Hand-Built Bias Benchmark for Question Answering. ACL 2022.
[9] Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. ACL 2019.
[10] Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data.