Build a trustworthy human-feedback data flywheel: redact traces, write rubrics, measure agreement, select useful examples, prevent leakage, and promote versioned datasets.
CodeAssist's code-review assistant can't pass the governance gate from the last lesson without a trustworthy dataset record. Its team has attack traces, CI failure investigations, and ordinary code-review conversations, but raw logs aren't training data. A few contain private details. Others should remain frozen tests forever. In some rows, both answers are bad, so choosing a "winner" would teach the model the wrong lesson.
Build the missing data pipeline: turn reviewed traces into versioned feedback data while preserving a private evaluation set that can honestly measure whether the assistant improves.
Human feedback isn't one kind of label. A model team commonly needs several different artifacts:
| Artifact | Shape | What it teaches or tests | CodeAssist example |
|---|---|---|---|
| Demonstration | prompt plus approved answer | Supervised fine-tuning (SFT) target behavior | A reviewed explanation of a failing fixture in CI |
| Pointwise assessment | prompt, one answer, anchored label or score | Filtering, slice analysis, or evaluation | Mark an unsafe shell command or score clarity from 1 to 5 |
| Preference pair | prompt, answer A, answer B, choice | Relative behavior for DPO or reward modeling | Prefer the cited, patch-specific answer over a vague answer |
| Evaluation fixture | input, expected checks, never trained on | Whether a new model or agent improved | Injected policy must not trigger secret export or deploy |
| Incident or escalation record | unsafe trace and resolution | New risk investigation and future test design | Both answers disclose private repository notes |
Pointwise assessments need anchored criteria so reviewers interpret a category or score consistently. Preference pairs answer a different question: given the same prompt, which answer is better? InstructGPT used human-written demonstrations and ranked model outputs in its post-training pipeline.[1] Direct Preference Optimization (DPO) uses preference pairs to optimize a policy without first fitting a separate reward model.[2] Those methods don't mean every reviewed log belongs in training. Holdout examples and safety incidents have different jobs.
The first routing rule is strict: a case reserved for evaluation doesn't enter the training queue. It may also be an incident that needs investigation.
1from dataclasses import dataclass
2
3@dataclass(frozen=True)
4class Trace:
5 trace_id: str
6 reserved_for_evaluation: bool
7 unsafe_effect: bool
8 approved_answer_available: bool
9 safe_pair_available: bool
10
11def destinations(trace: Trace) -> list[str]:
12 routes: list[str] = []
13 if trace.reserved_for_evaluation:
14 routes.append("FROZEN_EVALUATION")
15 if trace.unsafe_effect:
16 routes.append("INCIDENT_REVIEW")
17 if not trace.reserved_for_evaluation and not trace.unsafe_effect:
18 if trace.safe_pair_available:
19 routes.append("PREFERENCE_QUEUE")
20 elif trace.approved_answer_available:
21 routes.append("DEMONSTRATION_QUEUE")
22 return routes or ["NEEDS_TRIAGE"]
23
24traces = [
25 Trace("attack-014", True, True, False, False),
26 Trace("handoff-102", False, False, True, True),
27 Trace("leak-008", False, True, False, False),
28]
29
30for trace in traces:
31 print(f"{trace.trace_id}: {destinations(trace)}")1attack-014: ['FROZEN_EVALUATION', 'INCIDENT_REVIEW']
2handoff-102: ['PREFERENCE_QUEUE']
3leak-008: ['INCIDENT_REVIEW']Routing seems administrative until leakage happens. If attack-014 is trained on, the next evaluation no longer answers "does the system generalize to this attack?" It answers "did it remember a case we revealed?"
The governance chapter required minimized evidence. The same rule applies earlier in the feedback pipeline: production text must be cleaned before it reaches a selection service, external reviewer, or annotation interface.
For a code-review trace, preserve what changes the judgment:
Remove what the reviewer doesn't need:
This small example uses stable replacements so two occurrences of the same identifier still match during review.
1import re
2
3PATTERNS = {
4 "EMAIL": r"\b[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}\b",
5 "COMMIT": r"\b[a-f0-9]{12}\b",
6 "ENGINEER": r"\bENG-\d+\b",
7}
8
9def redact(text: str) -> tuple[str, dict[str, str]]:
10 mapping: dict[str, str] = {}
11 token_for_match: dict[tuple[str, str], str] = {}
12 counters = {kind: 0 for kind in PATTERNS}
13
14 def replacement(kind: str):
15 def replace(match: re.Match[str]) -> str:
16 raw_value = match.group(0)
17 key = (kind, raw_value)
18 if key not in token_for_match:
19 counters[kind] += 1
20 token_for_match[key] = f"<{kind}_{counters[kind]}>"
21 mapping[token_for_match[key]] = raw_value
22 return token_for_match[key]
23
24 return replace
25
26 cleaned = text
27 for kind, pattern in PATTERNS.items():
28 cleaned = re.sub(pattern, replacement(kind), cleaned)
29 return cleaned, mapping
30
31raw = "Engineer ENG-918204 emailed [email protected] about commit abc123def456. Contact ENG-918204 only through incident channel."
32review_text, vault_mapping = redact(raw)
33print("review_text:", review_text)
34print("mapping_stored_separately:", sorted(vault_mapping))
35print("engineer_token_occurrences:", review_text.count("<ENGINEER_1>"))
36print("raw_identifier_visible_to_reviewer:", "[email protected]" in review_text)1review_text: Engineer <ENGINEER_1> emailed <EMAIL_1> about commit <COMMIT_1>. Contact <ENGINEER_1> only through incident channel.
2mapping_stored_separately: ['<COMMIT_1>', '<EMAIL_1>', '<ENGINEER_1>']
3engineer_token_occurrences: 2
4raw_identifier_visible_to_reviewer: FalseThe reversible mapping, if it must exist, belongs in a separate access-controlled store. The annotation record needs the redaction version and the cleaned text, not the engineer's identity.
A preference interface is only as good as the decision rule behind it. "Choose the better answer" leaves reviewers to invent their own priorities. For CodeAssist's code-review assistant, write a rubric in descending order:
Safety outranks style. If both candidates break the first rule, record BOTH_BAD and send the case to incident review. Don't create a chosen/rejected training pair from two unsafe answers.
1from dataclasses import dataclass
2
3@dataclass(frozen=True)
4class Candidate:
5 name: str
6 technically_correct: bool
7 unauthorized_command: bool = False
8 exposes_private_note: bool = False
9
10 def safe(self) -> bool:
11 return not self.unauthorized_command and not self.exposes_private_note
12
13def judge_pair(a: Candidate, b: Candidate) -> str:
14 safe = [candidate for candidate in (a, b) if candidate.safe()]
15 if not safe:
16 return "BOTH_BAD_TO_INCIDENT_REVIEW"
17 if len(safe) == 1:
18 return f"CHOOSE_{safe[0].name}"
19 if a.technically_correct != b.technically_correct:
20 return f"CHOOSE_{a.name if a.technically_correct else b.name}"
21 return "TIE_TO_ADJUDICATION"
22
23safe_answer = Candidate("A", technically_correct=True)
24vague_answer = Candidate("B", technically_correct=False)
25leaking_answer = Candidate("C", technically_correct=True, exposes_private_note=True)
26unauthorized_answer = Candidate("D", technically_correct=True, unauthorized_command=True)
27
28print("correct_vs_vague:", judge_pair(safe_answer, vague_answer))
29print("leak_vs_unauthorized:", judge_pair(leaking_answer, unauthorized_answer))1correct_vs_vague: CHOOSE_A
2leak_vs_unauthorized: BOTH_BAD_TO_INCIDENT_REVIEWEach accepted preference record should identify:
| Field | Why it matters |
|---|---|
trace_id and redaction_version | Rebuild the cleaned source without exporting identity |
prompt_template_version and policy_version | Know which instructions and rule text reviewers evaluated |
| Candidate model IDs and generation settings | Reproduce where the answers came from |
rubric_version | Interpret the choice under the criteria used then |
reviewer_id or review group pseudonym | Analyze quality without exposing unnecessary identity |
| Choice, tie, both-bad, escalation reason | Avoid silently turning unsafe pairs into training examples |
| Dataset version after acceptance | Reconstruct exactly which data trained a candidate |
Collect only reviewer attributes needed for a legitimate analysis, protect them with access controls, and define their retention. Bias analysis isn't an excuse to collect personal information without a purpose.
Two trained reviewers can still read a rule differently. Inter-annotator agreement (IAA) measures whether labels are consistent enough for their intended use.
For two reviewers choosing A, B, or Tie, Cohen's kappa compares observed agreement with agreement expected from each reviewer's label frequency.[3] With many reviewers, missing labels, or other data types, Krippendorff's alpha is often a more flexible reliability measure.[4]
The next table has 120 duplicate-reviewed code-review cases:
| Reviewer A \ Reviewer B | A better | B better | Tie | Row total |
|---|---|---|---|---|
| A better | 42 | 3 | 1 | 46 |
| B better | 4 | 38 | 2 | 44 |
| Tie | 2 | 1 | 27 | 30 |
| Column total | 48 | 42 | 30 | 120 |
The diagonal gives observed agreement:
The row and column totals give chance agreement:
Now compute kappa:
1matrix = [
2 [42, 3, 1],
3 [4, 38, 2],
4 [2, 1, 27],
5]
6
7total = sum(sum(row) for row in matrix)
8observed = sum(matrix[i][i] for i in range(len(matrix))) / total
9row_totals = [sum(row) for row in matrix]
10column_totals = [sum(matrix[row][col] for row in range(3)) for col in range(3)]
11chance = sum(row * col for row, col in zip(row_totals, column_totals)) / (total * total)
12kappa = (observed - chance) / (1 - chance)
13
14print(f"observed_agreement: {observed:.3f}")
15print(f"chance_agreement: {chance:.3f}")
16print(f"cohens_kappa: {kappa:.3f}")1observed_agreement: 0.892
2chance_agreement: 0.344
3cohens_kappa: 0.835
A number doesn't determine policy by itself. CodeAssist might set kappa >= 0.65 as a pilot acceptance threshold, alongside slice checks and adjudication of safety-related disagreements. That's a declared internal gate, not a universal standard. A failed batch should trigger diagnosis:
After redaction and routing, CodeAssist still can't review every eligible trace. Active learning selects items for labeling based on a signal that they may be informative. Classic query strategies include uncertainty sampling and representativeness-aware selection.[5]
For candidate preference pairs, suppose a small preference model estimates the probability that candidate A is better. Values near 0.5 mean the model is uncertain about the comparison.
1import math
2
3pairs = {
4 "routine-test-failure": 0.97,
5 "flaky-test-handoff": 0.53,
6 "injected-policy-refusal": 0.49,
7 "missing-fixture-edge": 0.70,
8}
9
10def entropy(probability_a: float) -> float:
11 probability_b = 1 - probability_a
12 return -sum(
13 p * math.log2(p)
14 for p in (probability_a, probability_b)
15 if p > 0
16 )
17
18for trace_id, score in sorted(pairs.items(), key=lambda item: entropy(item[1]), reverse=True):
19 print(f"{trace_id}: p_a={score:.2f} entropy={entropy(score):.3f}")1injected-policy-refusal: p_a=0.49 entropy=1.000
2flaky-test-handoff: p_a=0.53 entropy=0.997
3missing-fixture-edge: p_a=0.70 entropy=0.881
4routine-test-failure: p_a=0.97 entropy=0.194Uncertainty isn't the same as importance. Five uncertain cases might all be paraphrases of the same missing-fixture diagnosis. A diverse batch avoids spending an entire review round on one local cluster.
The miniature selector below uses an already-normalized uncertainty score from 0 to 1, where 1 means most uncertain. That differs from the previous p_a values: a preference probability near 0.5 should map to high uncertainty. The selector starts with the most uncertain item, then scores remaining items using normalized uncertainty plus distance from anything already selected. The two-dimensional coordinates stand in for embeddings so the mechanics stay visible.
1from math import dist
2
3# (trace_id, two-dimensional embedding, normalized uncertainty)
4items = [
5 ("fixture-failure-1", (0.0, 0.0), 0.99),
6 ("fixture-failure-2", (0.2, 0.1), 0.96),
7 ("fixture-failure-3", (-0.2, 0.2), 0.94),
8 ("flaky-test-handoff", (4.5, 4.5), 0.66),
9 ("injection-trace", (-4.2, -4.0), 0.71),
10 ("accessible-language", (4.7, -4.1), 0.60),
11]
12
13def hybrid_select(batch_size: int, weight_uncertainty: float = 0.6) -> list[str]:
14 if not 1 <= batch_size <= len(items):
15 raise ValueError("batch_size must select at least one item and no more than the pool")
16 if not 0 <= weight_uncertainty <= 1:
17 raise ValueError("weight_uncertainty must be between 0 and 1")
18
19 selected = [max(range(len(items)), key=lambda i: items[i][2])]
20 remaining = set(range(len(items))) - set(selected)
21 while len(selected) < batch_size:
22 max_distance = max(
23 min(dist(items[i][1], items[j][1]) for j in selected)
24 for i in remaining
25 ) or 1.0
26
27 def score(i: int) -> float:
28 coverage = min(dist(items[i][1], items[j][1]) for j in selected) / max_distance
29 return weight_uncertainty * items[i][2] + (1 - weight_uncertainty) * coverage
30
31 chosen = max(remaining, key=score)
32 selected.append(chosen)
33 remaining.remove(chosen)
34 return [items[i][0] for i in selected]
35
36uncertainty_only = [name for name, _, _ in sorted(items, key=lambda item: item[2], reverse=True)[:4]]
37print("uncertainty_only:", uncertainty_only)
38print("hybrid:", hybrid_select(4))1uncertainty_only: ['fixture-failure-1', 'fixture-failure-2', 'fixture-failure-3', 'injection-trace']
2hybrid: ['fixture-failure-1', 'flaky-test-handoff', 'injection-trace', 'accessible-language']Core-set methods formalize the coverage intuition by selecting examples far from the already represented set in embedding space.[6] In a real pipeline, validate whether the representation and scoring method find meaningful code-review failure modes. Distance in an embedding space is a heuristic, not proof that an example will improve a model.
An active selector can produce an interesting queue and still fail to improve the system. Measure it against a random-sampling baseline:
The frozen set must be disjoint from demonstrations and preference pairs.
1evaluation_ids = {"attack-014", "handoff-099", "screen-reader-007"}
2demonstration_ids = {"routine-002", "handoff-102"}
3preference_ids = {"handoff-102", "fix-explanation-044", "attack-014"}
4
5def overlap(training_ids: set[str]) -> list[str]:
6 return sorted(evaluation_ids & training_ids)
7
8all_training = demonstration_ids | preference_ids
9leaked = overlap(all_training)
10print("evaluation_ids:", sorted(evaluation_ids))
11print("leaked_training_ids:", leaked)
12print("promotion_allowed:", not leaked)1evaluation_ids: ['attack-014', 'handoff-099', 'screen-reader-007']
2leaked_training_ids: ['attack-014']
3promotion_allowed: FalseThe result correctly blocks this draft dataset because attack-014 was accidentally placed in both the preference set and the frozen evaluation set.
The values below are example experiment results, not a promised advantage for active selection. An experiment log should make the comparison easy to calculate.
1rounds = {
2 "random": {"accepted_labels": 400, "baseline_score": 0.61, "candidate_score": 0.64},
3 "hybrid": {"accepted_labels": 400, "baseline_score": 0.61, "candidate_score": 0.68},
4}
5
6for name, result in rounds.items():
7 gain = result["candidate_score"] - result["baseline_score"]
8 labels_per_point = result["accepted_labels"] / (gain * 100)
9 print(f"{name}: gain={gain:.2f} labels_per_percentage_point={labels_per_point:.1f}")1random: gain=0.03 labels_per_percentage_point=133.3
2hybrid: gain=0.07 labels_per_percentage_point=57.1If the hybrid batch doesn't outperform random sampling on the frozen test set, don't defend it because its selected examples looked clever. Change the selector or return to the simpler baseline.
An LLM judge can prioritize a large candidate pool or flag likely failures before human review. It can also prefer answers because they appear first, are longer, or resemble its own style. The MT-Bench and Chatbot Arena study measured those position, verbosity, and style-similarity biases in model judges.[7]
Start with a held-out set labeled by people under the actual rubric. Swap answer order and check whether a judge reverses its choice.
1human_gold = {
2 "fixture-policy": "A",
3 "handoff-route": "B",
4 "unsafe-disclosure": "B",
5}
6
7judge_original = {
8 "fixture-policy": "A",
9 "handoff-route": "B",
10 "unsafe-disclosure": "B",
11}
12
13judge_swapped_mapped_back = {
14 "fixture-policy": "B",
15 "handoff-route": "B",
16 "unsafe-disclosure": "A",
17}
18
19agreement = sum(judge_original[key] == human_gold[key] for key in human_gold) / len(human_gold)
20flips = sorted(
21 key for key in human_gold
22 if judge_original[key] != judge_swapped_mapped_back[key]
23)
24print(f"agreement_with_humans: {agreement:.2f}")
25print("order_sensitive_cases:", flips)
26print("auto_accept_enabled:", agreement >= 0.9 and not flips)1agreement_with_humans: 1.00
2order_sensitive_cases: ['fixture-policy', 'unsafe-disclosure']
3auto_accept_enabled: FalseThis judge agrees when shown one order and fails the order-swap test. It can help surface cases for review, but it can't accept preference pairs automatically.
Once a batch has passed review, package its provenance as carefully as the model release from the previous lesson. The dataset manifest should tie accepted rows to their controls:
1REQUIRED_FIELDS = {
2 "dataset_version",
3 "source_window",
4 "selection_policy_version",
5 "redaction_version",
6 "reidentification_access_rule",
7 "rubric_version",
8 "reviewer_training_set",
9 "agreement_report",
10 "row_counts",
11 "frozen_evaluation_set",
12 "leakage_check_passed",
13 "both_bad_escalated",
14 "parent_dataset",
15}
16
17manifest = {
18 "dataset_version": "code-review-feedback-v12",
19 "source_window": "2026-05-01/2026-05-15",
20 "selection_policy_version": "hybrid-selector-v3",
21 "redaction_version": "code-review-redactor-v2",
22 "reidentification_access_rule": "privacy-approved-roles-only",
23 "rubric_version": "code-review-rubric-v4",
24 "reviewer_training_set": "code-reviewer-gold-v3",
25 "agreement_report": {"cohens_kappa": 0.835, "policy_gate": 0.65},
26 "row_counts": {"demonstrations": 182, "preferences": 904, "ties": 61, "rejected": 23},
27 "frozen_evaluation_set": "code-review-eval-v5",
28 "leakage_check_passed": True,
29 "both_bad_escalated": 7,
30 "parent_dataset": "code-review-feedback-v11",
31}
32
33missing = sorted(REQUIRED_FIELDS - manifest.keys())
34print("dataset_version:", manifest["dataset_version"])
35print("missing_fields:", missing)
36print("ready_for_promotion:", not missing and manifest["leakage_check_passed"])1dataset_version: code-review-feedback-v12
2missing_fields: []
3ready_for_promotion: TrueA manifest can be complete and the batch still be unsuitable. Promotion should fail for leakage, unresolved sensitive exposure, low agreement under the declared policy, or unsafe pairs that were forced into chosen/rejected labels.
1def promotion_reasons(batch: dict[str, object]) -> list[str]:
2 reasons: list[str] = []
3 if not batch["redaction_passed"]:
4 reasons.append("redaction failed")
5 if batch["leaked_eval_ids"]:
6 reasons.append("evaluation leakage")
7 if batch["cohens_kappa"] < batch["declared_kappa_gate"]:
8 reasons.append("agreement below declared gate")
9 if batch["unsafe_pairs_accepted"]:
10 reasons.append("unsafe pair accepted as preference")
11 return reasons
12
13draft = {
14 "redaction_passed": True,
15 "leaked_eval_ids": ["attack-014"],
16 "cohens_kappa": 0.835,
17 "declared_kappa_gate": 0.65,
18 "unsafe_pairs_accepted": 0,
19}
20repaired = {
21 "redaction_passed": True,
22 "leaked_eval_ids": [],
23 "cohens_kappa": 0.835,
24 "declared_kappa_gate": 0.65,
25 "unsafe_pairs_accepted": 0,
26}
27
28for name, batch in (("draft", draft), ("repaired", repaired)):
29 reasons = promotion_reasons(batch)
30 print(f"{name}_promoted:", not reasons)
31 print(f"{name}_reasons:", reasons)1draft_promoted: False
2draft_reasons: ['evaluation leakage']
3repaired_promoted: True
4repaired_reasons: []The repaired batch can become code-review-feedback-v12. It's now a defensible input to an SFT or DPO experiment, and its untouched code-review-eval-v5 suite can measure the candidate honestly.
Create a small feedback dataset package from the code-review assistant traces in the previous lesson:
code-review-eval-v5: one prompt-injection attempt, one legitimate flaky-test handoff, and one supported accessible interaction path. These IDs must never enter training data.redaction_version with each cleaned row.code-review-rubric-v4 with authorization, technical correctness, actionability, and tone ordered explicitly.BOTH_BAD to incident review.code-review-feedback-v12 manifest and run promotion checks for redaction, leakage, agreement, and unsafe accepted pairs.Your package includes cleaned JSONL-like records, a rubric, a hand-checkable agreement calculation, a selector result, a leakage failure that you repaired, and a manifest that states exactly why the dataset may be used for training.
Place attack-014 into both the frozen set and a preference pair. Your gate should refuse promotion. If it doesn't, the pipeline can no longer tell training progress from memorization.
You're ready to build human-feedback datasets when you can:
Answer every question, then check your score. Score above 75% to mark this lesson complete.
10 questions remaining.
Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
Ouyang, L., et al. · 2022 · NeurIPS 2022
Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Rafailov, R., et al. · 2023
A Coefficient of Agreement for Nominal Scales
Cohen, J. · 1960 · Educational and Psychological Measurement
Computing Krippendorff's Alpha-Reliability
Krippendorff, K. · 2011 · University of Pennsylvania ScholarlyCommons
Active Learning
Settles, B. · 2012 · Synthesis Lectures on Artificial Intelligence and Machine Learning
Active Learning for Convolutional Neural Networks: A Core-Set Approach
Sener, O., Savarese, S. · 2018 · ICLR 2018
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Zheng, L., et al. · 2023 · NeurIPS 2023