LeetLLM
© 2026 LeetLLM. All rights reserved.


Data Labeling and Human Feedback

Build a trustworthy human-feedback data flywheel: redact traces, write rubrics, measure agreement, select useful examples, prevent leakage, and promote versioned datasets.


CodeAssist's code-review assistant can't pass the governance gate from the last lesson without a trustworthy dataset record. Its team has attack traces, CI failure investigations, and ordinary code-review conversations, but raw logs aren't training data. A few contain private details. Others should remain frozen tests forever. In some rows, both answers are bad, so choosing a "winner" would teach the model the wrong lesson.

Build the missing data pipeline: turn reviewed traces into versioned feedback data while preserving a private evaluation set that can honestly measure whether the assistant improves.

Figure: Human feedback data flow that redacts raw traces first, then splits them into a training lane with reviewed versioned data and a locked evaluation lane with frozen IDs that never enter training.
Redact first. Reviewed cases can teach future models. Frozen evaluation IDs stay locked on a separate lane.

Separate records by the job they do

Human feedback isn't one kind of label. A model team commonly needs several different artifacts:

| Artifact | Shape | What it teaches or tests | CodeAssist example |
|---|---|---|---|
| Demonstration | prompt plus approved answer | Supervised fine-tuning (SFT) target behavior | A reviewed explanation of a failing fixture in CI |
| Pointwise assessment | prompt, one answer, anchored label or score | Filtering, slice analysis, or evaluation | Mark an unsafe shell command or score clarity from 1 to 5 |
| Preference pair | prompt, answer A, answer B, choice | Relative behavior for DPO or reward modeling | Prefer the cited, patch-specific answer over a vague answer |
| Evaluation fixture | input, expected checks, never trained on | Whether a new model or agent improved | Injected policy must not trigger secret export or deploy |
| Incident or escalation record | unsafe trace and resolution | New risk investigation and future test design | Both answers disclose private repository notes |

Pointwise assessments need anchored criteria so reviewers interpret a category or score consistently. Preference pairs answer a different question: given the same prompt, which answer is better? InstructGPT used human-written demonstrations and ranked model outputs in its post-training pipeline [1] ("Training Language Models to Follow Instructions with Human Feedback", InstructGPT, https://arxiv.org/abs/2203.02155). Direct Preference Optimization (DPO) uses preference pairs to optimize a policy without first fitting a separate reward model [2] ("Direct Preference Optimization: Your Language Model is Secretly a Reward Model", https://arxiv.org/abs/2305.18290). Those methods don't mean every reviewed log belongs in training. Holdout examples and safety incidents have different jobs.

The first routing rule is strict: a case reserved for evaluation doesn't enter the training queue. The same case may also be an incident that needs investigation, so a trace can have more than one destination.

01-route-raw-traces-by-purpose.py

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trace:
    trace_id: str
    reserved_for_evaluation: bool
    unsafe_effect: bool
    approved_answer_available: bool
    safe_pair_available: bool

def destinations(trace: Trace) -> list[str]:
    routes: list[str] = []
    if trace.reserved_for_evaluation:
        routes.append("FROZEN_EVALUATION")
    if trace.unsafe_effect:
        routes.append("INCIDENT_REVIEW")
    # Training queues are reachable only for traces that are neither frozen nor unsafe.
    if not trace.reserved_for_evaluation and not trace.unsafe_effect:
        if trace.safe_pair_available:
            routes.append("PREFERENCE_QUEUE")
        elif trace.approved_answer_available:
            routes.append("DEMONSTRATION_QUEUE")
    return routes or ["NEEDS_TRIAGE"]

traces = [
    Trace("attack-014", True, True, False, False),
    Trace("handoff-102", False, False, True, True),
    Trace("leak-008", False, True, False, False),
]

for trace in traces:
    print(f"{trace.trace_id}: {destinations(trace)}")
```

Output

```
attack-014: ['FROZEN_EVALUATION', 'INCIDENT_REVIEW']
handoff-102: ['PREFERENCE_QUEUE']
leak-008: ['INCIDENT_REVIEW']
```

Routing seems administrative until leakage happens. If attack-014 is trained on, the next evaluation no longer answers "does the system generalize to this attack?" It answers "did it remember a case we revealed?"

Redact before selection or review

The governance chapter required minimized evidence. The same rule applies earlier in the feedback pipeline: production text must be cleaned before it reaches a selection service, external reviewer, or annotation interface.

For a code-review trace, preserve what changes the judgment:

  • Repository, diff hunk, failing test, and cited policy rule.
  • Whether a command, deploy, or secret access required approval.
  • The assistant's proposed action and final gate decision.
  • A stable pseudonymous case ID for joining records later.

Remove what the reviewer doesn't need:

  • Engineer name, email, phone number, or full commit identifier.
  • Free-form incident notes unrelated to the label.
  • Secrets, tokens, or internal credentials copied into retrieved context.

This small example uses stable replacements so two occurrences of the same identifier still match during review.

02-redact-before-human-review.py

```python
import re

PATTERNS = {
    "EMAIL": r"\b[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}\b",
    "COMMIT": r"\b[a-f0-9]{12}\b",
    "ENGINEER": r"\bENG-\d+\b",
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}
    token_for_match: dict[tuple[str, str], str] = {}
    counters = {kind: 0 for kind in PATTERNS}

    def replacement(kind: str):
        def replace(match: re.Match[str]) -> str:
            raw_value = match.group(0)
            key = (kind, raw_value)
            # Reuse the same token for a repeated value so occurrences still match.
            if key not in token_for_match:
                counters[kind] += 1
                token_for_match[key] = f"<{kind}_{counters[kind]}>"
                mapping[token_for_match[key]] = raw_value
            return token_for_match[key]

        return replace

    cleaned = text
    for kind, pattern in PATTERNS.items():
        cleaned = re.sub(pattern, replacement(kind), cleaned)
    return cleaned, mapping

# Placeholder address: the original value was obscured in the source page.
email = "engineer@example.com"
raw = f"Engineer ENG-918204 emailed {email} about commit abc123def456. Contact ENG-918204 only through incident channel."
review_text, vault_mapping = redact(raw)
print("review_text:", review_text)
print("mapping_stored_separately:", sorted(vault_mapping))
print("engineer_token_occurrences:", review_text.count("<ENGINEER_1>"))
print("raw_identifier_visible_to_reviewer:", email in review_text)
```

Output

```
review_text: Engineer <ENGINEER_1> emailed <EMAIL_1> about commit <COMMIT_1>. Contact <ENGINEER_1> only through incident channel.
mapping_stored_separately: ['<COMMIT_1>', '<EMAIL_1>', '<ENGINEER_1>']
engineer_token_occurrences: 2
raw_identifier_visible_to_reviewer: False
```

The reversible mapping, if it must exist, belongs in a separate access-controlled store. The annotation record needs the redaction version and the cleaned text, not the engineer's identity.
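The stable pseudonymous case ID listed earlier can be derived with a keyed hash instead of a lookup table. The sketch below is one possible approach, not CodeAssist's actual method: the key value, the `case-` prefix, the 16-character truncation, and the sample trace IDs are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical key for illustration only. A real key would live in a secrets
# manager, outside both the source code and the annotation store.
PSEUDONYM_KEY = b"example-key-do-not-use"

def pseudonymous_case_id(raw_trace_id: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Derive a stable case ID that does not expose the raw trace identifier."""
    digest = hmac.new(key, raw_trace_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"case-{digest[:16]}"

first = pseudonymous_case_id("prod-trace-000123")
second = pseudonymous_case_id("prod-trace-000123")
other = pseudonymous_case_id("prod-trace-000124")

print("stable_across_calls:", first == second)
print("distinct_inputs_differ:", first != other)
print("raw_prefix_visible:", "prod-trace" in first)
```

Because the same input and key always give the same ID, records labeled in different rounds can still be joined. Anyone holding the key can recompute an ID from a guessed trace identifier, so the key needs the same access controls as the re-identification mapping.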

Write the rubric before collecting preferences

A preference interface is only as good as the decision rule behind it. "Choose the better answer" leaves reviewers to invent their own priorities. For CodeAssist's code-review assistant, write a rubric in descending order of priority:

  1. Safety and authorization: Reject any answer that reveals protected notes, exports secrets, or runs a privileged command outside the approval rule.
  2. Technical correctness: Prefer an answer that accurately cites the failing test, diff hunk, or repo policy.
  3. Actionability: Prefer a response that states the next safe command, evidence link, or reviewer handoff.
  4. Clarity and tone: Among equally safe and correct answers, prefer direct and respectful language.

Safety outranks style. If both candidates break the first rule, record BOTH_BAD and send the case to incident review. Don't create a chosen/rejected training pair from two unsafe answers.

03-route-pairwise-judgments.py

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    technically_correct: bool
    unauthorized_command: bool = False
    exposes_private_note: bool = False

    def safe(self) -> bool:
        return not self.unauthorized_command and not self.exposes_private_note

def judge_pair(a: Candidate, b: Candidate) -> str:
    # Safety is checked first; correctness only breaks ties between safe answers.
    safe = [candidate for candidate in (a, b) if candidate.safe()]
    if not safe:
        return "BOTH_BAD_TO_INCIDENT_REVIEW"
    if len(safe) == 1:
        return f"CHOOSE_{safe[0].name}"
    if a.technically_correct != b.technically_correct:
        return f"CHOOSE_{a.name if a.technically_correct else b.name}"
    return "TIE_TO_ADJUDICATION"

safe_answer = Candidate("A", technically_correct=True)
vague_answer = Candidate("B", technically_correct=False)
leaking_answer = Candidate("C", technically_correct=True, exposes_private_note=True)
unauthorized_answer = Candidate("D", technically_correct=True, unauthorized_command=True)

print("correct_vs_vague:", judge_pair(safe_answer, vague_answer))
print("leak_vs_unauthorized:", judge_pair(leaking_answer, unauthorized_answer))
```

Output

```
correct_vs_vague: CHOOSE_A
leak_vs_unauthorized: BOTH_BAD_TO_INCIDENT_REVIEW
```
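The example above stops at technical correctness. A rough sketch of how the remaining rubric levels could be applied in order follows, assuming hypothetical boolean fields `actionable` and `clear` that a reviewer fills in. Python compares tuples left to right, which matches a descending-priority rubric once safety has been handled as a hard filter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RubricScore:
    name: str
    safe: bool
    technically_correct: bool
    actionable: bool  # hypothetical field for rubric level 3
    clear: bool       # hypothetical field for rubric level 4

    def priority(self) -> tuple[bool, bool, bool]:
        # Tuples compare left to right: correctness, then actionability, then clarity.
        return (self.technically_correct, self.actionable, self.clear)

def choose(a: RubricScore, b: RubricScore) -> str:
    # Safety is a hard filter, never a tie-breaker.
    safe = [candidate for candidate in (a, b) if candidate.safe]
    if not safe:
        return "BOTH_BAD_TO_INCIDENT_REVIEW"
    if len(safe) == 1:
        return f"CHOOSE_{safe[0].name}"
    if a.priority() == b.priority():
        return "TIE_TO_ADJUDICATION"
    return f"CHOOSE_{a.name if a.priority() > b.priority() else b.name}"

clear_but_no_next_step = RubricScore("A", True, True, actionable=False, clear=True)
blunt_with_next_step = RubricScore("B", True, True, actionable=True, clear=False)
polished_but_unsafe = RubricScore("C", False, True, actionable=True, clear=True)
safe_but_weak = RubricScore("D", True, False, actionable=False, clear=False)

print("actionability_over_clarity:", choose(clear_but_no_next_step, blunt_with_next_step))
print("safety_over_everything:", choose(polished_but_unsafe, safe_but_weak))
```

With these inputs the first comparison selects B and the second selects D, because a lower rubric level never overrides a higher one.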

Store enough provenance to replay the judgment

Each accepted preference record should identify:

| Field | Why it matters |
|---|---|
| trace_id and redaction_version | Rebuild the cleaned source without exporting identity |
| prompt_template_version and policy_version | Know which instructions and rule text reviewers evaluated |
| Candidate model IDs and generation settings | Reproduce where the answers came from |
| rubric_version | Interpret the choice under the criteria used then |
| reviewer_id or review group pseudonym | Analyze quality without exposing unnecessary identity |
| Choice, tie, both-bad, escalation reason | Avoid silently turning unsafe pairs into training examples |
| Dataset version after acceptance | Reconstruct exactly which data trained a candidate |

Collect only reviewer attributes needed for a legitimate analysis, protect them with access controls, and define their retention. Bias analysis isn't an excuse to collect personal information without a purpose.
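The provenance fields above can be carried in one typed record. The sketch below is a possible layout, not a required schema: the field names follow the table, while the outcome strings, the helper names, and every sample value are hypothetical.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional

TRAINABLE_OUTCOMES = {"CHOOSE_A", "CHOOSE_B"}
OPTIONAL_FIELDS = {"escalation_reason", "dataset_version"}

@dataclass(frozen=True)
class PreferenceRecord:
    trace_id: str
    redaction_version: str
    prompt_template_version: str
    policy_version: str
    candidate_model_ids: tuple[str, str]
    generation_settings: str
    rubric_version: str
    reviewer_pseudonym: str
    outcome: str  # CHOOSE_A, CHOOSE_B, TIE, or BOTH_BAD
    escalation_reason: Optional[str] = None
    dataset_version: Optional[str] = None

def missing_provenance(record: PreferenceRecord) -> list[str]:
    """Names of required fields that were left empty."""
    return [
        f.name for f in fields(record)
        if f.name not in OPTIONAL_FIELDS and not getattr(record, f.name)
    ]

def trainable(record: PreferenceRecord) -> bool:
    # Ties and both-bad rows are kept for analysis but never exported as pairs.
    return record.outcome in TRAINABLE_OUTCOMES and not missing_provenance(record)

accepted = PreferenceRecord(
    trace_id="case-7f3a", redaction_version="code-review-redactor-v2",
    prompt_template_version="review-prompt-v3", policy_version="repo-policy-v8",
    candidate_model_ids=("candidate-a", "candidate-b"),
    generation_settings="temperature=0.2", rubric_version="code-review-rubric-v4",
    reviewer_pseudonym="reviewer-group-2", outcome="CHOOSE_A",
)
both_bad = replace(accepted, outcome="BOTH_BAD", escalation_reason="both answers expose private notes")
unversioned = replace(accepted, rubric_version="")

print("accepted_trainable:", trainable(accepted))
print("both_bad_trainable:", trainable(both_bad))
print("unversioned_missing:", missing_provenance(unversioned))
```

Keeping the outcome and escalation reason on the same record as the version fields means a both-bad row can be audited later without ever appearing in a chosen/rejected export.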

Measure agreement, then repair disagreement

Two trained reviewers can still read a rule differently. Inter-annotator agreement (IAA) measures whether labels are consistent enough for their intended use.

For two reviewers choosing A, B, or Tie, Cohen's kappa compares observed agreement with agreement expected from each reviewer's label frequency [3] ("A Coefficient of Agreement for Nominal Scales", https://doi.org/10.1177/001316446002000104). With many reviewers, missing labels, or other data types, Krippendorff's alpha is often a more flexible reliability measure [4] ("Computing Krippendorff's Alpha-Reliability", https://www.asc.upenn.edu/sites/default/files/2021-03/Computing%20Krippendorff%27s%20Alpha-Reliability.pdf).

The next table has 120 duplicate-reviewed code-review cases:

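| Reviewer A \ Reviewer B | A better | B better | Tie | Row total |
|---|---|---|---|---|
| A better | 42 | 3 | 1 | 46 |
| B better | 4 | 38 | 2 | 44 |
| Tie | 2 | 1 | 27 | 30 |
| Column total | 48 | 42 | 30 | 120 |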

The diagonal gives observed agreement:

$$p_o = (42 + 38 + 27) / 120 = 107 / 120 \approx 0.892$$

The row and column totals give chance agreement:

$$p_e = (46 \times 48 + 44 \times 42 + 30 \times 30) / 120^2 \approx 0.344$$

Now compute kappa:

$$\kappa = (p_o - p_e) / (1 - p_e) = (0.892 - 0.344) / (1 - 0.344) \approx 0.835$$
04-compute-cohens-kappa.py

```python
matrix = [
    [42, 3, 1],
    [4, 38, 2],
    [2, 1, 27],
]

total = sum(sum(row) for row in matrix)
# Diagonal cells are the cases where both reviewers chose the same label.
observed = sum(matrix[i][i] for i in range(len(matrix))) / total
row_totals = [sum(row) for row in matrix]
column_totals = [sum(matrix[row][col] for row in range(3)) for col in range(3)]
# Chance agreement multiplies each reviewer's marginal frequency per label.
chance = sum(row * col for row, col in zip(row_totals, column_totals)) / (total * total)
kappa = (observed - chance) / (1 - chance)

print(f"observed_agreement: {observed:.3f}")
print(f"chance_agreement: {chance:.3f}")
print(f"cohens_kappa: {kappa:.3f}")
```

Output

```
observed_agreement: 0.892
chance_agreement: 0.344
cohens_kappa: 0.835
```
Figure: Reviewer agreement gate for preference labels, with 107 agreements on the diagonal, 13 disagreements off the diagonal, Cohen kappa of 0.835, and a declared threshold that decides whether the batch promotes or the rubric is repaired.
107 agreements show stable rubric reading. 13 disagreements show where the next repair pass should focus.
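When a batch has more than two reviewers or some labels are missing, Cohen's kappa no longer applies directly. The sketch below computes Krippendorff's alpha for nominal labels using the coincidence-matrix formulation. It is a minimal illustration rather than a replacement for a maintained statistics library, and the five sample units are made up.

```python
from collections import Counter
from itertools import permutations
from typing import Optional

def krippendorff_alpha_nominal(units: list[list[Optional[str]]]) -> float:
    """Alpha for nominal labels; each unit lists one label per reviewer, None if missing."""
    coincidences: Counter = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        if len(present) < 2:
            continue  # a unit with fewer than two labels cannot show agreement
        weight = 1 / (len(present) - 1)
        # Every ordered pair of labels from two different reviewers counts once.
        for first, second in permutations(present, 2):
            coincidences[(first, second)] += weight

    label_totals: Counter = Counter()
    for (first, _), count in coincidences.items():
        label_totals[first] += count
    n = sum(label_totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(c for (a, b), c in coincidences.items() if a != b) / n
    expected = sum(
        label_totals[a] * label_totals[b]
        for a in label_totals for b in label_totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value is used")
    return 1 - observed / expected

# Hypothetical labels from three reviewers; None marks a skipped case.
sample_units = [
    ["A", "A", "A"],
    ["A", "B", None],
    ["B", "B", "B"],
    ["Tie", "Tie", None],
    ["A", None, None],
]
alpha = krippendorff_alpha_nominal(sample_units)
print(f"krippendorff_alpha: {alpha:.3f}")
```

For these units the pairable labels total 10, the observed disagreement is 2/10, and the expected disagreement is 64/90, so alpha is 1 - 0.2 / (64/90) = 0.71875. The last unit has a single label and is skipped.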

A number doesn't determine policy by itself. CodeAssist might set kappa >= 0.65 as a pilot acceptance threshold, alongside slice checks and adjudication of safety-related disagreements. That's a declared internal gate, not a universal standard. A failed batch should trigger diagnosis:

  • Reviewers disagree about safe command eligibility: clarify the policy citation rule and add gold examples.
  • Reviewers disagree about tone after matching on correctness: split style preference from policy correctness.
  • Disagreements cluster in accessible-language or translated responses: involve the affected language or accessibility expertise before promotion.
  • One reviewer's gold-item misses are unusual after rubric repair: retrain or suspend that reviewer's labels.
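Diagnosis usually starts by asking which label pairs the reviewers confuse. The sketch below reuses the 3x3 agreement matrix from this section and ranks the off-diagonal cells. Folding both directions of a confusion into one count is a simplifying assumption; a real analysis might keep them separate and add slice information.

```python
labels = ["A better", "B better", "Tie"]
matrix = [
    [42, 3, 1],
    [4, 38, 2],
    [2, 1, 27],
]

# Combine both directions of each confusion, for example A/B and B/A.
confusions = {}
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        confusions[(labels[i], labels[j])] = matrix[i][j] + matrix[j][i]

ranked = sorted(confusions.items(), key=lambda item: item[1], reverse=True)
for (first, second), count in ranked:
    print(f"{first} vs {second}: {count}")
print("total_disagreements:", sum(confusions.values()))
```

Here 7 of the 13 disagreements are direct A-versus-B reversals, which points at the decision rule itself more than at uncertainty about when to declare a tie.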

Select cases that might teach something new

After redaction and routing, CodeAssist still can't review every eligible trace. Active learning selects items for labeling based on a signal that they may be informative. Classic query strategies include uncertainty sampling and representativeness-aware selection [5] ("Active Learning", https://doi.org/10.2200/S00429ED1V01Y201207AIM018).

For candidate preference pairs, suppose a small preference model estimates the probability that candidate A is better. Values near 0.5 mean the model is uncertain about the comparison.

05-rank-uncertain-preference-pairs.py

```python
import math

pairs = {
    "routine-test-failure": 0.97,
    "flaky-test-handoff": 0.53,
    "injected-policy-refusal": 0.49,
    "missing-fixture-edge": 0.70,
}

def entropy(probability_a: float) -> float:
    # Binary entropy in bits: 1.0 at p=0.5, approaching 0 as p nears 0 or 1.
    probability_b = 1 - probability_a
    return -sum(
        p * math.log2(p)
        for p in (probability_a, probability_b)
        if p > 0
    )

for trace_id, score in sorted(pairs.items(), key=lambda item: entropy(item[1]), reverse=True):
    print(f"{trace_id}: p_a={score:.2f} entropy={entropy(score):.3f}")
```

Output

```
injected-policy-refusal: p_a=0.49 entropy=1.000
flaky-test-handoff: p_a=0.53 entropy=0.997
missing-fixture-edge: p_a=0.70 entropy=0.881
routine-test-failure: p_a=0.97 entropy=0.194
```

Uncertainty isn't the same as importance. Five uncertain cases might all be paraphrases of the same missing-fixture diagnosis. A diverse batch avoids spending an entire review round on one local cluster.

Figure: The same six candidate traces produce two different batches of four: uncertainty-only selection takes three near-duplicate fixture-failure cases plus one injection case, while hybrid selection covers fixture, handoff, injection, and accessible-language regions.
Hybrid selection mixes a difficulty signal with coverage so a review batch doesn't collapse into near-duplicates.

The miniature selector below uses an already-normalized uncertainty score from 0 to 1, where 1 means most uncertain. That differs from the previous p_a values: a preference probability near 0.5 should map to high uncertainty. The selector starts with the most uncertain item, then scores remaining items using normalized uncertainty plus distance from anything already selected. The two-dimensional coordinates stand in for embeddings so the mechanics stay visible.

06-select-a-diverse-review-batch.py

```python
from math import dist

# (trace_id, two-dimensional embedding, normalized uncertainty)
items = [
    ("fixture-failure-1", (0.0, 0.0), 0.99),
    ("fixture-failure-2", (0.2, 0.1), 0.96),
    ("fixture-failure-3", (-0.2, 0.2), 0.94),
    ("flaky-test-handoff", (4.5, 4.5), 0.66),
    ("injection-trace", (-4.2, -4.0), 0.71),
    ("accessible-language", (4.7, -4.1), 0.60),
]

def hybrid_select(batch_size: int, weight_uncertainty: float = 0.6) -> list[str]:
    if not 1 <= batch_size <= len(items):
        raise ValueError("batch_size must select at least one item and no more than the pool")
    if not 0 <= weight_uncertainty <= 1:
        raise ValueError("weight_uncertainty must be between 0 and 1")

    # Seed the batch with the single most uncertain item.
    selected = [max(range(len(items)), key=lambda i: items[i][2])]
    remaining = set(range(len(items))) - set(selected)
    while len(selected) < batch_size:
        # Normalize coverage by the largest nearest-selected distance in this round.
        max_distance = max(
            min(dist(items[i][1], items[j][1]) for j in selected)
            for i in remaining
        ) or 1.0

        def score(i: int) -> float:
            coverage = min(dist(items[i][1], items[j][1]) for j in selected) / max_distance
            return weight_uncertainty * items[i][2] + (1 - weight_uncertainty) * coverage

        chosen = max(remaining, key=score)
        selected.append(chosen)
        remaining.remove(chosen)
    return [items[i][0] for i in selected]

uncertainty_only = [name for name, _, _ in sorted(items, key=lambda item: item[2], reverse=True)[:4]]
print("uncertainty_only:", uncertainty_only)
print("hybrid:", hybrid_select(4))
```

Output

```
uncertainty_only: ['fixture-failure-1', 'fixture-failure-2', 'fixture-failure-3', 'injection-trace']
hybrid: ['fixture-failure-1', 'flaky-test-handoff', 'injection-trace', 'accessible-language']
```

Core-set methods formalize the coverage intuition by selecting examples far from the already represented set in embedding space [6] ("Active Learning for Convolutional Neural Networks: A Core-Set Approach", https://arxiv.org/abs/1708.00489). In a real pipeline, validate whether the representation and scoring method find meaningful code-review failure modes. Distance in an embedding space is a heuristic, not proof that an example will improve a model.
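The selector above expects a normalized uncertainty score, while the earlier preference model emits p_a. One simple mapping, offered here as an assumption and not as the lesson's prescribed method, is 1 - |2·p_a - 1|, which gives 1 at p_a = 0.5 and 0 at either extreme. The function name `normalized_uncertainty` is hypothetical.

```python
def normalized_uncertainty(probability_a: float) -> float:
    """Map a preference probability to [0, 1], where 1 means most uncertain."""
    if not 0 <= probability_a <= 1:
        raise ValueError("probability_a must be between 0 and 1")
    return 1 - abs(2 * probability_a - 1)

for p_a in (0.97, 0.70, 0.53, 0.49):
    print(f"p_a={p_a:.2f} uncertainty={normalized_uncertainty(p_a):.2f}")
```

Binary entropy in bits is another option, since it also ranges from 0 to 1 for two outcomes. The linear mapping spreads mid-range probabilities more evenly, and which behavior is preferable depends on how the score is weighted against coverage.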

Prove selection helped without contaminating evaluation

An active selector can produce an interesting queue and still fail to improve the system. Measure it against a random-sampling baseline:

  1. Freeze a private evaluation set before selecting training data.
  2. Draw equal-sized candidate batches, one random and one selected by the proposed policy.
  3. Label both with the same rubric and quality controls.
  4. Train comparable candidates.
  5. Evaluate both candidates on the same frozen set, including safety and accessibility slices.
  6. Compare downstream gain per accepted label and total review cost.

The frozen set must be disjoint from demonstrations and preference pairs.

07-block-training-evaluation-leakage.py

```python
evaluation_ids = {"attack-014", "handoff-099", "screen-reader-007"}
demonstration_ids = {"routine-002", "handoff-102"}
preference_ids = {"handoff-102", "fix-explanation-044", "attack-014"}

def overlap(training_ids: set[str]) -> list[str]:
    # Any ID present in both sets is a leak from evaluation into training.
    return sorted(evaluation_ids & training_ids)

all_training = demonstration_ids | preference_ids
leaked = overlap(all_training)
print("evaluation_ids:", sorted(evaluation_ids))
print("leaked_training_ids:", leaked)
print("promotion_allowed:", not leaked)
```

Output

```
evaluation_ids: ['attack-014', 'handoff-099', 'screen-reader-007']
leaked_training_ids: ['attack-014']
promotion_allowed: False
```

The result correctly blocks this draft dataset because attack-014 was accidentally placed in both the preference set and the frozen evaluation set.
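The same exclusion can be applied earlier, when drawing the two comparison batches described in the steps above. In this rough sketch the pool, its uncertainty scores, and the function name `draw_comparison_batches` are hypothetical, and plain uncertainty ranking stands in for whatever selection policy is under test.

```python
import random

def draw_comparison_batches(
    pool: dict[str, float],
    evaluation_ids: set[str],
    batch_size: int,
    seed: int = 0,
) -> tuple[list[str], list[str]]:
    """Return (random_batch, selected_batch) of equal size, excluding frozen IDs."""
    eligible = {tid: score for tid, score in pool.items() if tid not in evaluation_ids}
    if batch_size > len(eligible):
        raise ValueError("batch_size exceeds the number of eligible traces")
    rng = random.Random(seed)  # fixed seed so the draw can be replayed
    random_batch = sorted(rng.sample(sorted(eligible), batch_size))
    # Stand-in policy: take the most uncertain eligible traces.
    selected_batch = sorted(eligible, key=lambda tid: eligible[tid], reverse=True)[:batch_size]
    return random_batch, selected_batch

# Hypothetical pool: trace_id -> normalized uncertainty.
candidate_pool = {
    "attack-014": 0.95,
    "routine-002": 0.10,
    "handoff-102": 0.55,
    "fix-explanation-044": 0.80,
    "fixture-edge-031": 0.70,
    "screen-reader-007": 0.90,
}
frozen_ids = {"attack-014", "handoff-099", "screen-reader-007"}

random_batch, selected_batch = draw_comparison_batches(candidate_pool, frozen_ids, batch_size=2)
print("random_batch:", random_batch)
print("selected_batch:", selected_batch)
```

Filtering before sampling means neither batch can contain a frozen ID, so the later leakage check acts as a second line of defense instead of the only one.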

Measure label efficiency, not label volume

The values below are example experiment results, not a promised advantage for active selection. An experiment log should make the comparison easy to calculate.

08-compare-label-efficiency.py

```python
rounds = {
    "random": {"accepted_labels": 400, "baseline_score": 0.61, "candidate_score": 0.64},
    "hybrid": {"accepted_labels": 400, "baseline_score": 0.61, "candidate_score": 0.68},
}

for name, result in rounds.items():
    gain = result["candidate_score"] - result["baseline_score"]
    # Labels spent per percentage point of gain; lower is more efficient.
    labels_per_point = result["accepted_labels"] / (gain * 100)
    print(f"{name}: gain={gain:.2f} labels_per_percentage_point={labels_per_point:.1f}")
```

Output

```
random: gain=0.03 labels_per_percentage_point=133.3
hybrid: gain=0.07 labels_per_percentage_point=57.1
```

If the hybrid batch doesn't outperform random sampling on the frozen evaluation set, don't defend it because its selected examples looked clever. Change the selector or return to the simpler baseline.
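A single pair of scores can differ by chance, especially on a small frozen set. One way to check is a paired bootstrap over evaluation cases, sketched below with made-up per-case pass/fail results for 100 cases (64 passes for the random-batch candidate, 68 for the hybrid-batch candidate). The function name, resample count, and 95% level are assumptions.

```python
import random

def paired_bootstrap_interval(
    random_correct: list[int],
    hybrid_correct: list[int],
    resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float]:
    """Approximate 95% percentile interval for (hybrid - random) accuracy."""
    if len(random_correct) != len(hybrid_correct):
        raise ValueError("both candidates must be scored on the same cases")
    rng = random.Random(seed)
    n = len(random_correct)
    differences = []
    for _ in range(resamples):
        # Resample case indices with replacement, keeping each case's pair together.
        indices = [rng.randrange(n) for _ in range(n)]
        differences.append(sum(hybrid_correct[i] - random_correct[i] for i in indices) / n)
    differences.sort()
    return differences[int(0.025 * resamples)], differences[int(0.975 * resamples) - 1]

# Hypothetical results: 60 cases both pass, 4 only random, 8 only hybrid, 28 neither.
random_results = [1] * 60 + [1] * 4 + [0] * 8 + [0] * 28
hybrid_results = [1] * 60 + [0] * 4 + [1] * 8 + [0] * 28

lower, upper = paired_bootstrap_interval(random_results, hybrid_results)
point = (sum(hybrid_results) - sum(random_results)) / len(random_results)
print(f"point_difference: {point:.2f}")
print(f"interval_95: [{lower:.3f}, {upper:.3f}]")
print("interval_excludes_zero:", lower > 0 or upper < 0)
```

If the interval includes zero, the observed gain may not be reliable at this evaluation-set size, and the label-efficiency comparison should be read with that caveat.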

Treat model judges as assistants, not ground truth

An LLM judge can prioritize a large candidate pool or flag likely failures before human review. It can also prefer answers because they appear first, are longer, or resemble its own style. The MT-Bench and Chatbot Arena study examined these as position bias, verbosity bias, and self-enhancement bias in model judges [7] ("Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", https://arxiv.org/abs/2306.05685).

Start with a held-out set labeled by people under the actual rubric. Swap answer order and check whether a judge reverses its choice.

09-detect-order-sensitive-judge-labels.py

```python
human_gold = {
    "fixture-policy": "A",
    "handoff-route": "B",
    "unsafe-disclosure": "B",
}

judge_original = {
    "fixture-policy": "A",
    "handoff-route": "B",
    "unsafe-disclosure": "B",
}

# Verdicts from the swapped-order run, translated back to the original A/B labels.
judge_swapped_mapped_back = {
    "fixture-policy": "B",
    "handoff-route": "B",
    "unsafe-disclosure": "A",
}

agreement = sum(judge_original[key] == human_gold[key] for key in human_gold) / len(human_gold)
flips = sorted(
    key for key in human_gold
    if judge_original[key] != judge_swapped_mapped_back[key]
)
print(f"agreement_with_humans: {agreement:.2f}")
print("order_sensitive_cases:", flips)
print("auto_accept_enabled:", agreement >= 0.9 and not flips)
```

Output

```
agreement_with_humans: 1.00
order_sensitive_cases: ['fixture-policy', 'unsafe-disclosure']
auto_accept_enabled: False
```

This judge agrees with the human labels in the original order but fails the order-swap test on two of three cases. It can help surface cases for review, but it can't accept preference pairs automatically.
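The `judge_swapped_mapped_back` values above assume that verdicts from the swapped run were already translated to the original A/B labels. A small sketch of that translation follows, using two hypothetical stand-in judges in place of a real model call: one that always picks the first answer, and one that picks whichever answer names a test.

```python
from typing import Callable

# A judge sees (first_answer, second_answer) and returns "FIRST" or "SECOND".
Judge = Callable[[str, str], str]

def judge_both_orders(judge: Judge, answer_a: str, answer_b: str) -> tuple[str, str]:
    original = {"FIRST": "A", "SECOND": "B"}[judge(answer_a, answer_b)]
    # In the swapped call, answer B is shown first, so FIRST maps back to B.
    swapped = {"FIRST": "B", "SECOND": "A"}[judge(answer_b, answer_a)]
    return original, swapped

def order_sensitive(judge: Judge, answer_a: str, answer_b: str) -> bool:
    original, swapped = judge_both_orders(judge, answer_a, answer_b)
    return original != swapped

def position_biased_judge(first: str, second: str) -> str:
    return "FIRST"  # hypothetical judge that ignores content entirely

def cites_test_judge(first: str, second: str) -> str:
    return "FIRST" if "test_" in first else "SECOND"  # hypothetical content rule

answer_a = "test_login fails because the fixture file is missing."
answer_b = "Something is wrong with CI."

print("position_biased:", judge_both_orders(position_biased_judge, answer_a, answer_b))
print("cites_test:", judge_both_orders(cites_test_judge, answer_a, answer_b))
print("position_biased_flips:", order_sensitive(position_biased_judge, answer_a, answer_b))
```

Getting the mapping direction wrong would make a consistent judge look order-sensitive and a biased one look stable, so it is worth testing with a known position-biased stand-in like the one here.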

Promote a versioned feedback dataset

Once a batch has passed review, package its provenance as carefully as the model release from the previous lesson. The dataset manifest should tie accepted rows to their controls:

  • Source window and selection policy version.
  • Redaction version and access rule for any re-identification mapping.
  • Rubric version, reviewer training set, and agreement report.
  • Counts for demonstrations, preferences, ties, both-bad escalations, and rejected rows.
  • Frozen evaluation set ID and leakage-check result.
  • Parent dataset and immutable output version.
10-validate-a-dataset-manifest.py

```python
REQUIRED_FIELDS = {
    "dataset_version",
    "source_window",
    "selection_policy_version",
    "redaction_version",
    "reidentification_access_rule",
    "rubric_version",
    "reviewer_training_set",
    "agreement_report",
    "row_counts",
    "frozen_evaluation_set",
    "leakage_check_passed",
    "both_bad_escalated",
    "parent_dataset",
}

manifest = {
    "dataset_version": "code-review-feedback-v12",
    "source_window": "2026-05-01/2026-05-15",
    "selection_policy_version": "hybrid-selector-v3",
    "redaction_version": "code-review-redactor-v2",
    "reidentification_access_rule": "privacy-approved-roles-only",
    "rubric_version": "code-review-rubric-v4",
    "reviewer_training_set": "code-reviewer-gold-v3",
    "agreement_report": {"cohens_kappa": 0.835, "policy_gate": 0.65},
    "row_counts": {"demonstrations": 182, "preferences": 904, "ties": 61, "rejected": 23},
    "frozen_evaluation_set": "code-review-eval-v5",
    "leakage_check_passed": True,
    "both_bad_escalated": 7,
    "parent_dataset": "code-review-feedback-v11",
}

# A complete manifest is necessary but not sufficient; the leakage check must also pass.
missing = sorted(REQUIRED_FIELDS - manifest.keys())
print("dataset_version:", manifest["dataset_version"])
print("missing_fields:", missing)
print("ready_for_promotion:", not missing and manifest["leakage_check_passed"])
```

Output

```
dataset_version: code-review-feedback-v12
missing_fields: []
ready_for_promotion: True
```
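An "immutable output version" is easier to trust when the manifest also records a fingerprint of the accepted rows. The sketch below hashes a canonical JSON form of each row. The function name `content_sha256`, the idea of storing the digest in the manifest, and the sample rows are assumptions, not part of the manifest shown above.

```python
import hashlib
import json

def content_sha256(rows: list[dict]) -> str:
    """Order-independent fingerprint of a list of JSON-serializable rows."""
    canonical = [json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows]
    # Sorting the serialized rows makes the digest independent of row order.
    payload = "\n".join(sorted(canonical)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical accepted rows.
rows = [
    {"trace_id": "case-01", "outcome": "CHOOSE_A", "rubric_version": "code-review-rubric-v4"},
    {"trace_id": "case-02", "outcome": "CHOOSE_B", "rubric_version": "code-review-rubric-v4"},
]
reordered = list(reversed(rows))
edited = [dict(rows[0], outcome="CHOOSE_B"), rows[1]]

print("same_rows_same_hash:", content_sha256(rows) == content_sha256(reordered))
print("edited_label_changes_hash:", content_sha256(rows) != content_sha256(edited))
```

A training job can recompute the digest before it starts and refuse to run if the value differs from the one recorded at promotion time.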

Gate the full data release

A manifest can be complete and the batch still be unsuitable. Promotion should fail for leakage, unresolved sensitive exposure, low agreement under the declared policy, or unsafe pairs that were forced into chosen/rejected labels.

11-gate-feedback-dataset-promotion.py

```python
def promotion_reasons(batch: dict[str, object]) -> list[str]:
    # Collect every blocking reason so one run reports all failures.
    reasons: list[str] = []
    if not batch["redaction_passed"]:
        reasons.append("redaction failed")
    if batch["leaked_eval_ids"]:
        reasons.append("evaluation leakage")
    if batch["cohens_kappa"] < batch["declared_kappa_gate"]:
        reasons.append("agreement below declared gate")
    if batch["unsafe_pairs_accepted"]:
        reasons.append("unsafe pair accepted as preference")
    return reasons

draft = {
    "redaction_passed": True,
    "leaked_eval_ids": ["attack-014"],
    "cohens_kappa": 0.835,
    "declared_kappa_gate": 0.65,
    "unsafe_pairs_accepted": 0,
}
repaired = {
    "redaction_passed": True,
    "leaked_eval_ids": [],
    "cohens_kappa": 0.835,
    "declared_kappa_gate": 0.65,
    "unsafe_pairs_accepted": 0,
}

for name, batch in (("draft", draft), ("repaired", repaired)):
    reasons = promotion_reasons(batch)
    print(f"{name}_promoted:", not reasons)
    print(f"{name}_reasons:", reasons)
```

Output

```
draft_promoted: False
draft_reasons: ['evaluation leakage']
repaired_promoted: True
repaired_reasons: []
```

The repaired batch can become code-review-feedback-v12. It's now a defensible input to an SFT or DPO experiment, and its untouched code-review-eval-v5 suite can measure the candidate honestly.

Build the missing CodeAssist artifact

Create a small feedback dataset package from the code-review assistant traces in the previous lesson:

  1. Reserve at least three traces for code-review-eval-v5: one prompt-injection attempt, one legitimate flaky-test handoff, and one supported accessible interaction path. These IDs must never enter training data.
  2. Redact candidate review traces and store a redaction_version with each cleaned row.
  3. Write code-review-rubric-v4 with authorization, technical correctness, actionability, and tone ordered explicitly.
  4. Produce demonstrations for corrected answers and preference pairs only when at least one safe winner exists. Route BOTH_BAD to incident review.
  5. Duplicate-review a small sample and calculate observed agreement and Cohen's kappa. Record your declared acceptance policy and what you'll inspect even when the gate passes.
  6. Run a random batch and a hybrid-selected batch of equal size. Compare candidate results against the frozen evaluation set.
  7. Write the code-review-feedback-v12 manifest and run promotion checks for redaction, leakage, agreement, and unsafe accepted pairs.
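For step 5, the agreement calculation can be kept small enough to check by hand. The following is a minimal sketch, assuming two reviewers and one label per case. The function name `cohens_kappa` and the ten sample labels are illustrative and are not taken from the lesson's data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> tuple[float, float, float]:
    """Return (observed agreement, chance agreement, kappa) for two reviewers."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of cases where both reviewers chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of both marginal rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    p_e = sum((count_a[lab] / n) * (count_b[lab] / n) for lab in all_labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical duplicate-review sample: 8 of 10 cases agree.
reviewer_a = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
reviewer_b = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]

p_o, p_e, kappa = cohens_kappa(reviewer_a, reviewer_b)
print("observed_agreement:", round(p_o, 3))
print("chance_agreement:", round(p_e, 3))
print("cohens_kappa:", round(kappa, 3))
```

In this sample, observed agreement is 0.8, chance agreement is 0.5, and kappa is 0.6, so 80% raw agreement would still fall below a declared gate of 0.65. Record the gate before running the calculation, not after.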

A strong submission

Your package includes cleaned JSONL-like records, a rubric, a hand-checkable agreement calculation, a selector result, a leakage failure that you repaired, and a manifest that states exactly why the dataset may be used for training.

A useful failure

Place attack-014 into both the frozen set and a preference pair. Your gate should refuse promotion. If it doesn't, the pipeline can no longer tell training progress from memorization.
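One way to produce the `leaked_eval_ids` value that the promotion gate reads is a set intersection between the frozen evaluation IDs and every trace ID in the training records. This is a minimal sketch, assuming each record carries a `trace_id` key. The function name `find_leaked_eval_ids` and every ID other than attack-014 are hypothetical.

```python
def find_leaked_eval_ids(
    frozen_eval_ids: set[str], training_records: list[dict[str, str]]
) -> list[str]:
    """Return the sorted evaluation IDs that also appear in training records."""
    training_ids = {record["trace_id"] for record in training_records}
    return sorted(frozen_eval_ids & training_ids)


# Hypothetical frozen set; only attack-014 comes from the lesson.
frozen_eval_ids = {"attack-014", "flaky-handoff-003", "accessible-path-007"}
preference_pairs = [
    {"trace_id": "attack-014", "choice": "A"},  # deliberate leak for the exercise
    {"trace_id": "fixture-failure-021", "choice": "B"},
]

leaked = find_leaked_eval_ids(frozen_eval_ids, preference_pairs)
print("leaked_eval_ids:", leaked)
print("promoted:", not leaked)
```

With the deliberate leak in place this prints `['attack-014']` and `promoted: False`. Removing the pair and rerunning should yield an empty list.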

Mastery check

You're ready to build human-feedback datasets when you can:

  • Explain why demonstrations, pointwise assessments, preference pairs, frozen evaluations, and incident traces have different jobs.
  • Redact a trace before selection and keep the mapping separate from reviewer data.
  • Write a rubric where authorization and privacy outrank tone.
  • Compute Cohen's kappa from a small matrix and explain why agreement isn't correctness.
  • Select uncertain and diverse cases without claiming the heuristic guarantees improvement.
  • Compare an active-selection batch with a random baseline on an untouched evaluation set.
  • Detect a judge that changes labels when answer order changes.
  • Block dataset promotion for leakage or unsafe accepted pairs.
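Selecting uncertain and diverse cases can be sketched as a greedy pass that takes the most uncertain case from each embedding cluster before revisiting any cluster. The sketch below assumes cluster labels were computed beforehand. The `cluster` field, the function name `select_hybrid_batch`, and the pool layout are assumptions for illustration, and the heuristic does not guarantee improvement over random sampling.

```python
def select_hybrid_batch(pool: list[dict], batch_size: int) -> list[str]:
    """Greedy pick: highest uncertainty first, at most one case per cluster,
    then fill any remaining slots by uncertainty alone."""
    ranked = sorted(pool, key=lambda case: case["uncertainty"], reverse=True)
    chosen: list[dict] = []
    seen_clusters: set[str] = set()
    for case in ranked:
        if len(chosen) == batch_size:
            break
        if case["cluster"] not in seen_clusters:
            chosen.append(case)
            seen_clusters.add(case["cluster"])
    # Fill leftover slots when there are fewer clusters than slots.
    for case in ranked:
        if len(chosen) == batch_size:
            break
        if case not in chosen:
            chosen.append(case)
    return [case["id"] for case in chosen]


# Hypothetical pool: three near-duplicates share one cluster.
pool = [
    {"id": "fixture-failure-1", "cluster": "fixture-failure", "uncertainty": 0.99},
    {"id": "fixture-failure-2", "cluster": "fixture-failure", "uncertainty": 0.96},
    {"id": "fixture-failure-3", "cluster": "fixture-failure", "uncertainty": 0.94},
    {"id": "injection-trace", "cluster": "injection", "uncertainty": 0.71},
    {"id": "flaky-test-handoff", "cluster": "handoff", "uncertainty": 0.66},
    {"id": "accessible-language", "cluster": "accessibility", "uncertainty": 0.60},
]

print(select_hybrid_batch(pool, 4))
```

A pure top-four-by-uncertainty rule would spend three of the four labels on paraphrases of the same fixture failure. The greedy pass spends one label there and the rest on distinct cases, which is the trade-off to test against a random batch of the same size.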

Evaluation rubric

  • Foundational: Routes trace types correctly, redacts identifiers, and explains each feedback artifact.
  • Intermediate: Produces safe pairwise records, agreement evidence, and a versioned dataset manifest.
  • Applied: Implements hybrid selection and measures it against random sampling without evaluation leakage.
  • Advanced: Integrates privacy, disagreement repair, judge calibration, and promotion gates into a reproducible feedback program.

Common pitfalls

  • Symptom: The next model scores unusually well only on reviewed incidents. Cause: Holdout traces leaked into training data. Fix: Reserve evaluation IDs first and block overlap during dataset promotion.
  • Symptom: Reviewers choose a fluent answer that runs an unauthorized production command. Cause: Tone was valued before policy and authorization. Fix: Put safety first in the rubric and route both-bad pairs to incident review.
  • Symptom: Kappa is low across multiple reviewers. Cause: Guidelines are unclear, task slices differ, or reviewer training is weak. Fix: Inspect disagreement clusters before blaming reviewers.
  • Symptom: An active queue costs more but doesn't improve the frozen suite. Cause: Uncertainty or embedding distance isn't correlated with downstream value. Fix: Compare against random selection and replace a failed heuristic.
  • Symptom: A model judge shrinks human review while changing choices under response swaps. Cause: Judge bias wasn't calibrated. Fix: Keep it advisory until human-gold and order-swap tests pass.
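The order-swap test in the last pitfall can be sketched as follows. The judge is modeled as a hypothetical callable that sees a prompt and two answers and returns "first" or "second". Both stub judges and the sample cases are illustrative stand-ins, not real model calls.

```python
from typing import Callable

# Assumed judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def order_swap_consistency(judge: Judge, cases: list[tuple[str, str, str]]) -> float:
    """Share of cases where the judge prefers the same answer in both orders."""
    if not cases:
        raise ValueError("need at least one case")
    consistent = 0
    for prompt, answer_a, answer_b in cases:
        original = "A" if judge(prompt, answer_a, answer_b) == "first" else "B"
        # Swap the presentation order, then map the verdict back to A/B.
        swapped = "B" if judge(prompt, answer_b, answer_a) == "first" else "A"
        consistent += original == swapped
    return consistent / len(cases)


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    """Stub that always prefers whichever answer is shown first."""
    return "first"


def longer_answer_judge(prompt: str, first: str, second: str) -> str:
    """Stub that prefers the longer answer regardless of position."""
    return "first" if len(first) >= len(second) else "second"


cases = [
    ("Review this diff", "Looks fine.", "The null check on line 12 is missing."),
    ("Can I deploy?", "Yes, run it now.", "Not without approval from the on-call owner."),
]

print("position_biased:", order_swap_consistency(position_biased_judge, cases))
print("longer_answer:", order_swap_consistency(longer_answer_judge, cases))
```

The position-biased stub scores 0.0 and the length-based stub scores 1.0. A perfect swap score only shows that the judge ignores position. The length-based stub is consistent yet still biased, which is why the human-gold comparison has to pass as well.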

Mastery Check


1. Given this trace: reserved_for_evaluation=True, unsafe_effect=True, approved_answer_available=False, safe_pair_available=False. Which destinations should it receive before any training step?
2. CodeAssist redacts this trace: 'Engineer ENG-918204 emailed [email protected] about commit abc123def456. Contact ENG-918204 only through incident channel.' What should the review and selection record contain?
3. Under a code-review assistant rubric, safety and authorization outrank technical correctness, actionability, and tone. Candidate A politely cites the policy but exposes protected incident notes. Candidate B politely cites the policy but proposes a production command outside the approval rule. What label should reviewers record?
4. An accepted preference record stores trace_id, redaction_version, prompt_template_version, policy_version, candidate model IDs and settings, rubric_version, reviewer pseudonym, choice or escalation reason, and dataset version. Why store this provenance?
5. Duplicate review of 120 cases has observed agreement p_o = 0.892 and chance agreement p_e = 0.344. With a declared acceptance gate of kappa >= 0.65, which conclusion follows?
6. An eligible pool has three fixture-failure paraphrases in one embedding cluster with uncertainty scores 0.99, 0.96, and 0.94. It also has flaky-test-handoff 0.66, injection-trace 0.71, and accessible-language 0.60 far from that cluster. Why might a hybrid selector pick fixture-failure-1 plus the three far-away cases for a batch of four instead of the four highest uncertainty scores?
7. An experiment labels equal-sized random and hybrid-selected batches with the same rubric. Both candidates are evaluated on the frozen set. Random improves from 0.61 to 0.64 with 400 labels; hybrid improves from 0.61 to 0.68 with 400 labels. What should the log conclude?
8. An LLM judge matches human gold on the original answer order for three preference cases: A, B, B. After swapping answer order and mapping choices back, its labels are B, B, A. What should CodeAssist do with this judge?
9. CodeAssist is about to promote code-review-feedback-v12. Its manifest is complete, redaction passed, Cohen's kappa is 0.835 against a declared 0.65 gate, no unsafe pairs were accepted, but attack-014 appears in both the preference set and code-review-eval-v5. What should the promotion gate do?
10. CodeAssist has four reviewed records: (1) a prompt plus an approved answer for SFT, (2) one answer scored 1 to 5 for clarity, (3) one prompt with two safe answers and a reviewer choice, and (4) an injected-policy input reserved to test future candidates. Which mapping is correct?


Next Step
Continue to Evaluating AI Agents

You can now create reliable human-feedback records and frozen test fixtures; next you'll score complete agent trajectories, including tool effects, recovery, cost, and safety.

References

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

Ouyang, L., et al. · 2022 · NeurIPS 2022

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov, R., et al. · 2023

A Coefficient of Agreement for Nominal Scales

Cohen, J. · 1960 · Educational and Psychological Measurement

Computing Krippendorff's Alpha-Reliability

Krippendorff, K. · 2011 · University of Pennsylvania ScholarlyCommons

Active Learning

Settles, B. · 2012 · Synthesis Lectures on Artificial Intelligence and Machine Learning

Active Learning for Convolutional Neural Networks: A Core-Set Approach

Sener, O., Savarese, S. · 2018 · ICLR 2018

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, L., et al. · 2023 · NeurIPS 2023