LeetLLM

© 2026 LeetLLM. All rights reserved.

Capstone: Fine-Tuned Classifier

Train and gate a support-ticket encoder that exports exact-receipt evidence and safe intake decisions to a production agent.


The evaluation dashboard in the previous capstone made one rule visible: a release is blocked by the failures that matter, even when its aggregate score looks good. Now apply that rule to a learned model.

Your product already has a document QA service that answers policy questions with permitted citations. The next product will be an agent that can draft a support response and request approval. This capstone builds the intake classifier in between:

Given a new refund-support ticket, should it enter human_review_now, or may it enter the guarded agent workflow?

Consider this ticket:

    FastShip delivered my tablet after the return window closed. The portal rejected my return. Please make an exception.

Missing this exception is not the same as sending one harmless package-status ticket to a human. The classifier therefore does routing, not authorization. It never decides refund eligibility, sends a reply, or moves money. It protects the agent entry point.

Figure: Six held-out support-ticket scores from encoder_v1 on a zero-to-one axis. Routine return-policy and delivery-status tickets score 0.18 and 0.23. The required return-window exception scores 0.42, between intake bundle v2 threshold 0.35 and intake bundle v1 threshold 0.50. Account takeover scores 0.64, angry routine delivery status 0.71, and damaged-package exception 0.89. Bundle v1 is held because threshold 0.50 misses the return-window exception. Bundle v2 routes all three required positives to human review, accepts one routine false positive, and is eligible only for shadow traffic.
Train a classifier only after the label is defined, then export a versioned routing contract only after its frozen validation receipt and required risk slices pass.

By the end, you will have:

  • a versioned label guide and split rule
  • a baseline whose failure justifies learning contextual representations
  • a small PyTorch encoder training loop that exposes the mechanics
  • a practical Hugging Face fine-tuning recipe for a pretrained encoder
  • held-out threshold and slice gates
  • a serving bundle the production-agent capstone can consume

BERT popularized the pattern of adapting a pretrained bidirectional encoder to a downstream classification task with a small output layer.[1] Hugging Face exposes the same production-oriented path through sequence-classification models and training utilities.[2] The important discipline here is not selecting a fashionable checkpoint. It is proving that a trained classifier satisfies a routing contract.

Define The Routing Contract

Use one binary label:

| Label | Route | Meaning |
| --- | --- | --- |
| 1 | human_review_now | Money, access, policy-exception, or time-critical risk requires a person before any agent workflow. |
| 0 | guarded_agent | A routine request may enter the agent, which still follows retrieval evidence and approval rules. |

That second row matters. Label 0 does not mean "approve automatically." It means "safe to enter an automation path with its own controls."

Diagram: Ticket, Score, Threshold?, and the point where both routes meet.

The model produces a score. The pinned threshold turns that score into an intake route. Either route preserves the agent's downstream evidence and approval boundaries.

Connect The Portfolio Artifacts

| Capstone artifact | What it already proved | What this classifier needs from it |
| --- | --- | --- |
| Document QA product | An answer must cite permitted policy evidence. | Routine tickets sent to the agent can ask policy questions safely. |
| Evaluation dashboard | Hard failures block a release even when averages improve. | Missed urgent slices must block a classifier release. |
| Fine-tuned classifier | Ticket becomes a risk route under a pinned threshold. | Export the intake contract for the production agent. |

Start with labeled fixtures that can be reviewed before any model is trained.

label-contract.py

    label_version = "escalation-policy-v1"
    fixtures = [
        {
            "id": "damaged_package_exception",
            "text": "The tablet arrived cracked, but the return portal says final sale.",
            "gold_route": "human_review_now",
            "reason": "damaged_package_exception",
        },
        {
            "id": "return_window_exception",
            "text": "FastShip delivered after the return window closed. Please allow an exception.",
            "gold_route": "human_review_now",
            "reason": "return_window_exception",
        },
        {
            "id": "account_takeover",
            "text": "Someone changed my delivery address and I can't sign in.",
            "gold_route": "human_review_now",
            "reason": "account_takeover",
        },
        {
            "id": "routine_delivery_status",
            "text": "Where is my package?",
            "gold_route": "guarded_agent",
            "reason": "routine_delivery_status",
        },
        {
            "id": "return_policy_question",
            "text": "What is the return policy for damaged tablets?",
            "gold_route": "guarded_agent",
            "reason": "return_policy_question",
        },
    ]

    # Every fixture must use a known route, and the human-review fixtures
    # must cover exactly the three high-risk reasons.
    allowed_routes = {"human_review_now", "guarded_agent"}
    assert all(row["gold_route"] in allowed_routes for row in fixtures)
    assert {row["reason"] for row in fixtures if row["gold_route"] == "human_review_now"} == {
        "damaged_package_exception",
        "return_window_exception",
        "account_takeover",
    }

    print("label guide:", label_version)
    for row in fixtures:
        print(f"{row['id']}: {row['gold_route']} ({row['reason']})")

Output

    label guide: escalation-policy-v1
    damaged_package_exception: human_review_now (damaged_package_exception)
    return_window_exception: human_review_now (return_window_exception)
    account_takeover: human_review_now (account_takeover)
    routine_delivery_status: guarded_agent (routine_delivery_status)
    return_policy_question: guarded_agent (return_policy_question)

A label guide should also name ambiguous cases. A frustrated tone alone is not a reason for immediate human routing. A calm message requesting a return-window exception may be. Have support operations adjudicate disagreements and store that policy version alongside any dataset generated from it.
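One way to make adjudication concrete is to label a sample twice and queue every disagreement for support operations. The sketch below is a minimal illustration under stated assumptions: the ticket IDs, the two reviewers, and their labels are hypothetical, not real annotation data.

```python
# Hypothetical double-labeled sample under escalation-policy-v1.
# Ticket IDs and reviewer labels are illustrative only.
annotations = [
    {"ticket": "a01", "reviewer_a": "human_review_now", "reviewer_b": "human_review_now"},
    {"ticket": "a02", "reviewer_a": "guarded_agent", "reviewer_b": "human_review_now"},
    {"ticket": "a03", "reviewer_a": "guarded_agent", "reviewer_b": "guarded_agent"},
    {"ticket": "a04", "reviewer_a": "human_review_now", "reviewer_b": "guarded_agent"},
    {"ticket": "a05", "reviewer_a": "guarded_agent", "reviewer_b": "guarded_agent"},
]

# Any row where the two reviewers disagree goes to adjudication.
adjudication_queue = [
    row["ticket"] for row in annotations if row["reviewer_a"] != row["reviewer_b"]
]
disagreement_rate = len(adjudication_queue) / len(annotations)

print("needs adjudication:", adjudication_queue)
print(f"disagreement rate: {disagreement_rate:.2f}")
```

On this made-up sample, two of five tickets need adjudication. A high rate on real data usually points at an unclear guide rather than careless reviewers, so resolve the guide before training on the labels.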

Build An Honest Dataset

This lesson uses a hypothetical teaching dataset, not measured production traffic:

| Field | Teaching dataset choice |
| --- | --- |
| Source | Synthetic refund-support tickets modeled after return, delivery, and account-access requests |
| Label version | escalation-policy-v1 |
| Exclusions | Spam, internal QA, tickets without customer text |
| High-risk slices | damaged_package_exception, return_window_exception, account_takeover |
| Split policy | Earlier accounts train; later, unseen accounts validate and test |
| Release rule | No missed positive in required validation slices |

In a real data card, replace every teaching assumption with provenance, consent and retention policy, labeling instructions, class balance, disagreement rate, and known coverage gaps.

Split Before Learning

Suppose multiple messages from the same customer account share product names, carrier wording, or a prior return incident. Randomly putting one thread into training and another into validation makes the classifier look better than it will on a new account. Group by account first, then hold out later traffic.

grouped-time-split.py

    records = [
        {"ticket": "t01", "account": "acme", "month": 1, "split": "train"},
        {"ticket": "t02", "account": "acme", "month": 2, "split": "train"},
        {"ticket": "t03", "account": "north", "month": 2, "split": "train"},
        {"ticket": "t04", "account": "north", "month": 2, "split": "train"},
        {"ticket": "v01", "account": "summit", "month": 3, "split": "validation"},
        {"ticket": "v02", "account": "summit", "month": 3, "split": "validation"},
        {"ticket": "x01", "account": "harbor", "month": 4, "split": "test"},
        {"ticket": "x02", "account": "harbor", "month": 4, "split": "test"},
    ]

    # Collect the set of accounts in each split, then require that no account
    # appears in more than one split.
    accounts = {
        split: {row["account"] for row in records if row["split"] == split}
        for split in ("train", "validation", "test")
    }
    assert accounts["train"].isdisjoint(accounts["validation"])
    assert accounts["train"].isdisjoint(accounts["test"])
    assert accounts["validation"].isdisjoint(accounts["test"])

    for split in ("train", "validation", "test"):
        count = sum(row["split"] == split for row in records)
        names = ",".join(sorted(accounts[split]))
        print(f"{split}: {count} tickets, accounts={names}")

Output

    train: 4 tickets, accounts=acme,north
    validation: 2 tickets, accounts=summit
    test: 2 tickets, accounts=harbor
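The records above carry a precomputed split column. One possible way to derive that column is sketched here. The function name `assign_splits`, the cutoff months, and the rule for late tickets are assumptions made for illustration, not part of the lesson's pipeline.

```python
# Hypothetical tickets; account names and integer months are illustrative.
tickets = [
    {"ticket": "t01", "account": "acme", "month": 1},
    {"ticket": "t02", "account": "acme", "month": 3},
    {"ticket": "t03", "account": "north", "month": 2},
    {"ticket": "v01", "account": "summit", "month": 3},
    {"ticket": "x01", "account": "harbor", "month": 4},
]

def assign_splits(rows, validation_from, test_from):
    """Place each account in exactly one split, keyed by its first-seen month."""
    first_seen = {}
    for row in rows:
        account = row["account"]
        first_seen[account] = min(row["month"], first_seen.get(account, row["month"]))

    splits = {"train": [], "validation": [], "test": [], "dropped": []}
    for row in rows:
        start = first_seen[row["account"]]
        if start >= test_from:
            name = "test"
        elif start >= validation_from:
            name = "validation"
        else:
            name = "train"
        # Assumption: a ticket that arrives after its split's window closes is
        # dropped, so training never sees traffic from a held-out period.
        if name == "train" and row["month"] >= validation_from:
            name = "dropped"
        elif name == "validation" and row["month"] >= test_from:
            name = "dropped"
        splits[name].append(row["ticket"])
    return splits

splits = assign_splits(tickets, validation_from=3, test_from=4)
for name, ids in splits.items():
    print(f"{name}: {ids}")
```

Here `t02` is dropped: it belongs to a training account but arrives during the validation window. Dropping it is one conservative choice; a real project might instead archive such rows for later analysis.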

Use validation data to pick features, checkpoints, and the threshold. Open the test set once, for the final report, after those choices are frozen. A small, curated fixture set can reveal specific failures, but it can't establish broad production performance by itself.

Establish A Cheap Baseline

A transformer has to earn its complexity. Start with a deterministic rule or a bag-of-words logistic-regression baseline. If a simple baseline satisfies the routing gate on representative held-out data, shipping a neural model may add maintenance without adding safety.

Here the rule catches an explicit damaged-package phrase and misses an indirect return-window exception. That failure gives a contextual model a job worth doing.

keyword-baseline.py

    validation = [
        ("explicit_damage_exception", "Tablet arrived cracked and the return portal denied it.", 1),
        ("indirect_return_exception", "FastShip delivered after the return window closed.", 1),
        ("routine_delivery_status", "Where is my package?", 0),
        ("routine_policy", "Where can I read the return policy?", 0),
    ]
    terms = ("arrived cracked", "can't sign in", "seller exception")

    # Predict human review (1) when any trigger phrase appears in the text.
    predictions = []
    for fixture_id, text, label in validation:
        predicted = int(any(term in text.lower() for term in terms))
        predictions.append((fixture_id, label, predicted))

    positives = sum(label for _, label, _ in predictions)
    true_positives = sum(label == 1 and predicted == 1 for _, label, predicted in predictions)
    missed = [fixture_id for fixture_id, label, predicted in predictions if label == 1 and predicted == 0]

    print(f"positive recall: {true_positives}/{positives}")
    print("missed positive:", missed[0])
    print("next experiment: learn context beyond trigger words")

Output

    positive recall: 1/2
    missed positive: indirect_return_exception
    next experiment: learn context beyond trigger words

A TF-IDF plus logistic-regression baseline is a useful next step when the dataset grows: it learns weighted text features while remaining cheap to inspect and deploy.[3] Don't claim a fine-tuned encoder is cheaper, faster, or more accurate than prompting, or than this baseline, until you measure those candidates on the same task and hardware.
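To show what such a baseline computes, here is a standard-library sketch of TF-IDF features feeding a logistic regression trained by batch gradient descent. In practice you would more likely reach for a library such as scikit-learn (`TfidfVectorizer` with `LogisticRegression`). The training rows, the smoothed idf formula without length normalization, the learning rate, and the step count are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training rows: 1 = human_review_now, 0 = guarded_agent.
train_rows = [
    ("tablet arrived cracked and the portal denied my return", 1),
    ("carrier delivered after the return window closed", 1),
    ("someone changed my address and i cannot sign in", 1),
    ("where is my package", 0),
    ("please resend my return label", 0),
    ("where can i read the return policy", 0),
]

def tokenize(text):
    return text.lower().split()

# Fit document frequencies on training text only, then a smoothed idf.
docs = [tokenize(text) for text, _ in train_rows]
document_frequency = Counter(token for doc in docs for token in set(doc))
idf = {
    token: math.log((1 + len(docs)) / (1 + count)) + 1.0
    for token, count in document_frequency.items()
}

def featurize(text):
    """Term count times idf; tokens unseen in training are ignored."""
    counts = Counter(token for token in tokenize(text) if token in idf)
    return {token: count * idf[token] for token, count in counts.items()}

def human_review_score(features, weights, bias):
    logit = bias + sum(weights[token] * value for token, value in features.items())
    return 1.0 / (1.0 + math.exp(-logit))

weights = defaultdict(float)
bias = 0.0
features = [featurize(text) for text, _ in train_rows]
labels = [label for _, label in train_rows]
learning_rate = 0.1

for _ in range(200):
    # Batch gradient of the mean log loss.
    grad_w = defaultdict(float)
    grad_b = 0.0
    for x, y in zip(features, labels):
        error = human_review_score(x, weights, bias) - y
        grad_b += error
        for token, value in x.items():
            grad_w[token] += error * value
    for token, gradient in grad_w.items():
        weights[token] -= learning_rate * gradient / len(labels)
    bias -= learning_rate * grad_b / len(labels)

# The weights stay inspectable: list the terms pushing hardest toward review.
top_terms = sorted(weights, key=weights.get, reverse=True)[:3]
print("strongest human-review terms:", top_terms)
for text in ("screen arrived cracked", "where is my package"):
    score = human_review_score(featurize(text), weights, bias)
    print(f"{text!r}: score={score:.2f}")
```

Fitting six training rows proves nothing about held-out tickets. The point is that every learned weight maps to a visible token, which is what makes this kind of baseline cheap to audit.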

From Logits To A Route

An encoder classifier emits logits: one unnormalized score for each class. Softmax converts two logits into class scores that sum to one. A separate decision threshold converts the positive-class score into the business route.

Use numerically stable softmax by subtracting the largest logit before exponentiating.

stable-softmax.py

    from math import exp

    # Each example is (fixture_id, [routine_logit, human_review_logit]).
    examples = [
        ("return_window_closed", [-1.0, 2.0]),
        ("delivery_status", [1.0, -1.0]),
    ]

    def softmax(logits: list[float]) -> list[float]:
        # Subtracting the largest logit keeps every exponent at or below zero.
        offset = max(logits)
        values = [exp(value - offset) for value in logits]
        total = sum(values)
        return [value / total for value in values]

    for fixture_id, logits in examples:
        routine, human = softmax(logits)
        print(f"{fixture_id}: human_review_score={human:.2f}, routine_score={routine:.2f}")

Output

    return_window_closed: human_review_score=0.95, routine_score=0.05
    delivery_status: human_review_score=0.12, routine_score=0.88
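To see why the subtraction matters, compare a naive softmax with the stable form on deliberately extreme logits. The logit values below are hypothetical and far larger than a trained classifier would normally emit. They are chosen only to trigger the failure.

```python
from math import exp

def naive_softmax(logits):
    # Exponentiates raw logits; large values exceed the float range.
    values = [exp(value) for value in logits]
    total = sum(values)
    return [value / total for value in values]

def stable_softmax(logits):
    # Shifting by the maximum leaves the result unchanged mathematically
    # but keeps every exponent at or below zero.
    offset = max(logits)
    values = [exp(value - offset) for value in logits]
    total = sum(values)
    return [value / total for value in values]

large_logits = [1000.0, 1002.0]  # hypothetical extreme logits

try:
    naive_softmax(large_logits)
    naive_result = "ok"
except OverflowError:
    naive_result = "overflow"

routine, human = stable_softmax(large_logits)
print("naive:", naive_result)
print(f"stable: human_review_score={human:.2f}, routine_score={routine:.2f}")
```

The naive version raises `OverflowError` because `exp(1000.0)` doesn't fit in a float. The stable version depends only on the two-point gap between the logits and returns 0.88 and 0.12.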

Call 0.95 a score until held-out reliability evidence supports calling it a probability. A bounded output is not proof of calibration.

Train The Mechanics In PyTorch

Before reaching for a downloaded checkpoint, make the trainable path concrete. The sandbox below builds a tiny word-embedding encoder, averages token vectors, attaches a classification head, and updates every parameter with cross-entropy loss. This is training a text classifier from scratch, not fine-tuning BERT. It exists so you can inspect the mechanics before replacing random embeddings with pretrained representations.

tiny-encoder-training-loop.py

    import torch
    from torch import nn

    torch.manual_seed(7)

    training_rows = [
        ("cracked tablet return portal denied", 1),
        ("carrier delivered after return window", 1),
        ("delivery address changed cannot sign in", 1),
        ("where is package", 0),
        ("resend return label", 0),
        ("show return policy", 0),
    ]
    # Build a whitespace vocabulary; index 0 is reserved for padding.
    vocabulary = {"<pad>": 0}
    for text, _ in training_rows:
        for token in text.split():
            vocabulary.setdefault(token, len(vocabulary))

    # Right-pad every row to the longest row so the batch is rectangular.
    max_length = max(len(text.split()) for text, _ in training_rows)
    input_ids = []
    for text, _ in training_rows:
        ids = [vocabulary[token] for token in text.split()]
        input_ids.append(ids + [0] * (max_length - len(ids)))
    inputs = torch.tensor(input_ids)
    labels = torch.tensor([label for _, label in training_rows])

    class TinyEncoderClassifier(nn.Module):
        def __init__(self, vocab_size: int) -> None:
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, 8, padding_idx=0)
            self.head = nn.Linear(8, 2)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # Mean-pool token vectors while ignoring padding positions.
            mask = (token_ids != 0).unsqueeze(-1)
            embedded = self.embedding(token_ids) * mask
            pooled = embedded.sum(dim=1) / mask.sum(dim=1).clamp_min(1)
            return self.head(pooled)

    model = TinyEncoderClassifier(len(vocabulary))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
    initial_loss = criterion(model(inputs), labels).item()
    original_head = model.head.weight.detach().clone()

    for _ in range(80):
        loss = criterion(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    final_logits = model(inputs)
    final_loss = criterion(final_logits, labels).item()
    print("logit shape:", tuple(final_logits.shape))
    print("loss decreased:", final_loss < initial_loss)
    print("classification head updated:", not torch.equal(original_head, model.head.weight))

Output

    logit shape: (6, 2)
    loss decreased: True
    classification head updated: True

The sandbox can memorize six tiny training examples. That is not deployment evidence. It proves that text becomes tokens, tokens become pooled representations, a head produces logits, and gradients update those weights. Held-out evaluation remains the release evidence.

Fine-Tune A Pretrained Encoder

For the capstone artifact, replace random embeddings with a pretrained encoder such as DistilBERT and fine-tune it for two labels. BERT-style encoders read context on both sides of a token and are designed to be adapted to tasks such as sequence classification.[1] In the Transformers task interface, AutoModelForSequenceClassification supplies the classification head while Trainer runs optimization and evaluation.[2]

The following is a project script rather than a marked copy-runnable cell: it requires your local JSONL splits, a downloaded checkpoint, and the transformers, datasets, and evaluate packages. Each JSON Lines (JSONL) row needs text and integer label fields, where 0 means guarded_agent and 1 means human_review_now.

train_distilbert.py

    import evaluate
    import numpy as np
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    checkpoint = "distilbert/distilbert-base-uncased"
    # The test split is loaded and tokenized but not evaluated in this script.
    dataset = load_dataset(
        "json",
        data_files={
            "train": "data/train.jsonl",
            "validation": "data/validation.jsonl",
            "test": "data/test.jsonl",
        },
    )
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=256)

    encoded = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=2,
        id2label={0: "guarded_agent", 1: "human_review_now"},
        label2id={"guarded_agent": 0, "human_review_now": 1},
    )
    metric = evaluate.load("f1")

    def compute_metrics(prediction):
        # Argmax over two logits is an implicit 0.5 threshold; the release
        # threshold is chosen separately on validation scores.
        logits, labels = prediction.predictions, prediction.label_ids
        predicted = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predicted, references=labels, average="binary")

    # eval_strategy and processing_class are the names in recent Transformers
    # releases; older releases used evaluation_strategy and tokenizer.
    arguments = TrainingArguments(
        output_dir="artifacts/encoder_v1",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
    )
    trainer = Trainer(
        model=model,
        args=arguments,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],
        processing_class=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate(encoded["validation"]))
    trainer.save_model("artifacts/encoder_v1/model")
    tokenizer.save_pretrained("artifacts/encoder_v1/model")

Two boundaries keep this honest:

  1. Fine-tuning doesn't replace label review. It fits the guide it receives, including guide mistakes.
  2. Highest validation F1 doesn't automatically ship. The candidate still has to pass the high-risk slice gate and serving contract.
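The JSONL contract described earlier (a text field plus an integer label of 0 or 1) can be checked before any training run. The following is a minimal standard-library sketch. The function name `validate_jsonl_lines`, the problem codes, and the sample lines are hypothetical, and a real pipeline would read the lines from the split files.

```python
import json

ALLOWED_LABELS = {0, 1}  # 0 = guarded_agent, 1 = human_review_now

def validate_jsonl_lines(lines):
    """Return (line_number, problem) pairs for rows that break the contract."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid_json"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not_an_object"))
            continue
        text = row.get("text")
        label = row.get("label")
        if not isinstance(text, str) or not text.strip():
            problems.append((number, "missing_text"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(label, bool) or not isinstance(label, int) or label not in ALLOWED_LABELS:
            problems.append((number, "bad_label"))
    return problems

# Hypothetical sample: two valid rows followed by three broken ones.
sample = [
    '{"text": "Where is my package?", "label": 0}',
    '{"text": "FastShip delivered after the return window closed.", "label": 1}',
    '{"text": "", "label": 0}',
    '{"text": "Someone changed my delivery address.", "label": "human_review_now"}',
    '{"text": "Resend my return label"',
]

problems = validate_jsonl_lines(sample)
print(problems)
```

The check flags the empty text on line 3, the string label on line 4, and the truncated JSON on line 5. Failing the run on any problem is safer than letting a loader silently coerce or skip rows.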

Select A Threshold On Validation Rows

Don't accept a library default threshold without checking its error tradeoff. Suppose encoder_v1 exports these validation scores for human_review_now:

Figure: Threshold sweep over six frozen classifier fixtures. At threshold 0.35 all three required positives route to human review, one angry routine status ticket is a false positive, and recall is 100 percent, so the bundle is selected for shadow traffic. At threshold 0.50 the return-window exception is missed and recall falls to 67 percent. At threshold 0.75 both return-window and account-takeover fixtures are missed and recall falls to 33 percent, even though false positives reach zero.
A threshold is a release policy: high-risk tickets go to a person, while admitted routine tickets may enter the guarded agent.

| Fixture | Slice | Gold | Score |
| --- | --- | --- | --- |
| damaged_package_exception | damaged_package_exception | 1 | 0.89 |
| return_window_exception | return_window_exception | 1 | 0.42 |
| account_takeover | account_takeover | 1 | 0.64 |
| angry_delivery_status | routine_delivery_status | 0 | 0.71 |
| routine_delivery_status | routine_delivery_status | 0 | 0.23 |
| return_policy_question | return_policy_question | 0 | 0.18 |

threshold-sweep.py

    # Each row is (fixture_id, gold_label, human_review_score).
    rows = [
        ("damaged_package_exception", 1, 0.89),
        ("return_window_exception", 1, 0.42),
        ("account_takeover", 1, 0.64),
        ("angry_delivery_status", 0, 0.71),
        ("routine_delivery_status", 0, 0.23),
        ("return_policy_question", 0, 0.18),
    ]

    def summarize(threshold: float) -> tuple[int, int, int, float]:
        tp = fp = fn = 0
        for _, gold, score in rows:
            # A score at or above the threshold routes to human review.
            predicted = int(score >= threshold)
            tp += predicted == 1 and gold == 1
            fp += predicted == 1 and gold == 0
            fn += predicted == 0 and gold == 1
        recall = tp / (tp + fn)
        return tp, fp, fn, recall

    for threshold in (0.35, 0.50, 0.75):
        tp, fp, fn, recall = summarize(threshold)
        print(f"threshold={threshold:.2f} tp={tp} fp={fp} fn={fn} recall={recall:.2f}")

Output

    threshold=0.35 tp=3 fp=1 fn=0 recall=1.00
    threshold=0.50 tp=2 fp=1 fn=1 recall=0.67
    threshold=0.75 tp=1 fp=0 fn=2 recall=0.33

On these fixtures, threshold 0.35 is the only candidate shown that avoids a missed escalation. It sends one angry but routine delivery-status ticket to human review. That tradeoff is acceptable only if queue capacity allows it and the same rule survives a larger held-out dataset.

Don't report 1.00 recall from three positive fixtures as proof of production reliability. It is evidence for these validation fixtures, not for unseen traffic.
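The sweep checks three hand-picked thresholds. A complementary view is to compute the highest threshold that would still catch every positive fixture, then see how much margin the pinned value leaves. This is a sketch under the same six scores; the names `ceiling` and `review_count` are illustrative.

```python
# Same six validation fixtures: (fixture_id, gold_label, human_review_score).
rows = [
    ("damaged_package_exception", 1, 0.89),
    ("return_window_exception", 1, 0.42),
    ("account_takeover", 1, 0.64),
    ("angry_delivery_status", 0, 0.71),
    ("routine_delivery_status", 0, 0.23),
    ("return_policy_question", 0, 0.18),
]

def review_count(threshold):
    """Number of fixtures routed to human review at this threshold."""
    return sum(score >= threshold for _, _, score in rows)

# With the rule "score >= threshold routes to human review", any threshold
# above the lowest-scoring positive would miss that fixture.
ceiling = min(score for _, gold, score in rows if gold == 1)
chosen = 0.35  # the threshold pinned in this lesson

print(f"zero-miss ceiling: {ceiling:.2f}")
print(f"margin below ceiling: {ceiling - chosen:.2f}")
print(f"human-review share at {chosen:.2f}: {review_count(chosen)}/{len(rows)}")
```

The ceiling is 0.42, the pinned 0.35 sits 0.07 below it, and four of six fixtures go to human review. Setting the threshold exactly at the ceiling would fit these fixtures with no slack, so the size of the margin is a judgment call to revisit on a larger held-out set.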

Make The Hard Gate Executable

The dashboard capstone compared runs only when dataset, grader, corpus, and fixture inventory matched. Carry that discipline here. A classifier comparison receipt pins its dataset, label guide, split manifest, model, and exact validation fixture IDs before scores are aggregated.

intake_bundle_v1 is blocked because its encoder_v1 score for return_window_exception falls below the initially proposed threshold. intake_bundle_v2 keeps the same trained encoder but pins a reviewed threshold, then becomes eligible for shadow traffic, not immediate autonomous routing. A threshold change versions the serving bundle, not the model weights.

exact-receipt-release-gate.py

    from collections import Counter

    EXPECTED_IDENTITY = {
        "dataset_version": "refund-intake-validation-v1",
        "label_version": "escalation-policy-v1",
        "split_manifest_version": "refund-intake-split-2026-05",
        "model_version": "encoder_v1",
    }
    EXPECTED_FIXTURES = {
        "damaged_package_exception",
        "return_window_exception",
        "account_takeover",
        "angry_delivery_status",
        "routine_delivery_status",
        "return_policy_question",
    }
    REQUIRED_SLICES = {
        "damaged_package_exception",
        "return_window_exception",
        "account_takeover",
    }
    validation_rows = [
        {"fixture_id": "damaged_package_exception", "slice": "damaged_package_exception", "gold": 1, "score": 0.89},
        {"fixture_id": "return_window_exception", "slice": "return_window_exception", "gold": 1, "score": 0.42},
        {"fixture_id": "account_takeover", "slice": "account_takeover", "gold": 1, "score": 0.64},
        {"fixture_id": "angry_delivery_status", "slice": "routine_delivery_status", "gold": 0, "score": 0.71},
        {"fixture_id": "routine_delivery_status", "slice": "routine_delivery_status", "gold": 0, "score": 0.23},
        {"fixture_id": "return_policy_question", "slice": "return_policy_question", "gold": 0, "score": 0.18},
    ]

    def receipt(threshold: float, rows: list[dict[str, object]], **overrides: str) -> dict[str, object]:
        return {**EXPECTED_IDENTITY, **overrides, "threshold": threshold, "rows": rows}

    def release_decision(run: dict[str, object]) -> tuple[str, str]:
        # 1. Identity: every pinned version must match before scores are read.
        for field, expected in EXPECTED_IDENTITY.items():
            if run[field] != expected:
                return "hold", f"{field}:{run[field]}"
        # 2. Inventory: the fixture set must match exactly, with no duplicates.
        rows = run["rows"]
        fixture_ids = [str(row["fixture_id"]) for row in rows]
        counts = Counter(fixture_ids)
        missing = sorted(EXPECTED_FIXTURES - set(fixture_ids))
        unexpected = sorted(set(fixture_ids) - EXPECTED_FIXTURES)
        duplicated = sorted(fixture_id for fixture_id, count in counts.items() if count != 1)
        if missing:
            return "hold", f"missing:{','.join(missing)}"
        if unexpected:
            return "hold", f"unexpected:{','.join(unexpected)}"
        if duplicated:
            return "hold", f"duplicate:{','.join(duplicated)}"
        # 3. Hard gate: no required-slice positive may score below the threshold.
        misses = [
            str(row["fixture_id"])
            for row in rows
            if row["slice"] in REQUIRED_SLICES
            and row["gold"] == 1
            and row["score"] < run["threshold"]
        ]
        if misses:
            return "hold", f"missed:{','.join(misses)}"
        return "eligible_for_shadow", "exact_receipt_pass"

    runs = {
        "intake_bundle_v1": receipt(0.50, validation_rows),
        "intake_bundle_v2": receipt(0.35, validation_rows),
        "intake_bundle_incomplete": receipt(
            0.35,
            [row for row in validation_rows if row["fixture_id"] != "account_takeover"],
        ),
        "intake_bundle_padded": receipt(
            0.35,
            [*validation_rows, {"fixture_id": "easy_status_extra", "slice": "routine_delivery_status", "gold": 0, "score": 0.02}],
        ),
        "intake_bundle_duplicated": receipt(0.35, [*validation_rows, validation_rows[-1]]),
        "intake_bundle_drifted": receipt(0.35, validation_rows, dataset_version="refund-intake-validation-v2"),
    }
    for bundle_version, run in runs.items():
        decision, reason = release_decision(run)
        print(f"{bundle_version}: decision={decision} reason={reason}")

Output

    intake_bundle_v1: decision=hold reason=missed:return_window_exception
    intake_bundle_v2: decision=eligible_for_shadow reason=exact_receipt_pass
    intake_bundle_incomplete: decision=hold reason=missing:account_takeover
    intake_bundle_padded: decision=hold reason=unexpected:easy_status_extra
    intake_bundle_duplicated: decision=hold reason=duplicate:return_policy_question
    intake_bundle_drifted: decision=hold reason=dataset_version:refund-intake-validation-v2

Lowering the threshold is not automatically the right fix. If false positives overwhelm reviewers, improve labels or model behavior instead. For this teaching receipt, the important move is refusing to hide a required-slice miss inside an average or to pad the comparison with easy rows.
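Because a threshold change versions the serving bundle rather than the model weights, it can help to give each bundle a deterministic fingerprint. The sketch below is one possible approach, not the lesson's release tooling. The `fingerprint` helper, the manifest layout, and the 12-character truncation are assumptions.

```python
import hashlib
import json

def fingerprint(manifest):
    """Short, deterministic digest of a manifest (hypothetical helper)."""
    # Sorted keys and fixed separators make the JSON text canonical.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

shared = {
    "label_version": "escalation-policy-v1",
    "dataset_version": "refund-intake-validation-v1",
    "split_manifest_version": "refund-intake-split-2026-05",
    "model_version": "encoder_v1",
}
bundle_v1 = {**shared, "bundle_version": "intake_bundle_v1", "threshold": 0.50}
bundle_v2 = {**shared, "bundle_version": "intake_bundle_v2", "threshold": 0.35}

same_model = bundle_v1["model_version"] == bundle_v2["model_version"]
same_bundle = fingerprint(bundle_v1) == fingerprint(bundle_v2)

print("same model weights version:", same_model)
print("same bundle fingerprint:", same_bundle)
print("v2 fingerprint:", fingerprint(bundle_v2))
```

Both bundles point at `encoder_v1`, yet their fingerprints differ because the threshold and bundle name are part of the hashed manifest. A downstream consumer that pins the fingerprint will notice a threshold change even when the weights are byte-identical.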

Audit Scores Without Overclaiming Calibration

A threshold can be useful even when the score is not a well-calibrated probability. Calibration asks whether tickets scored near 0.70, across enough held-out examples, need human review about 70 percent of the time. Modern neural classifiers can be miscalibrated; temperature scaling is one common post-training adjustment evaluated on held-out data.[4]

Six fixtures are enough to demonstrate a calculation, not to justify customer-facing percentages:

calibration-diagnostic.py
1scored_rows = [ 2 (0.89, 1), 3 (0.64, 1), 4 (0.42, 1), 5 (0.71, 0), 6 (0.23, 0), 7 (0.18, 0), 8] 9buckets = { 10 "low [0.0,0.4)": [(score, gold) for score, gold in scored_rows if score < 0.4], 11 "middle [0.4,0.7)": [(score, gold) for score, gold in scored_rows if 0.4 <= score < 0.7], 12 "high [0.7,1.0]": [(score, gold) for score, gold in scored_rows if score >= 0.7], 13} 14 15for name, bucket in buckets.items(): 16 average_score = sum(score for score, _ in bucket) / len(bucket) 17 observed_rate = sum(gold for _, gold in bucket) / len(bucket) 18 print(f"{name}: n={len(bucket)}, score={average_score:.2f}, observed={observed_rate:.2f}") 19print("decision: diagnostic only; collect more held-out labels")
Output

    low [0.0,0.4): n=2, score=0.21, observed=0.00
    middle [0.4,0.7): n=2, score=0.53, observed=1.00
    high [0.7,1.0]: n=2, score=0.80, observed=0.50
    decision: diagnostic only; collect more held-out labels

The high bucket exposes why calibrated language matters: an average score of 0.80 with one positive among two tickets is not evidence that 0.80 means an 80 percent escalation rate. Keep displaying a routing action, not a probability, until a larger retained set supports probability claims.
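One way to see how little two rows can support is to put an interval around the observed rate. The sketch below uses the Wilson score interval, a standard choice for small binomial samples; the function name `wilson_interval` is illustrative and not part of the lesson's code.

```python
from math import sqrt

def wilson_interval(positives: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial rate (z=1.96 is roughly 95 percent)."""
    if n == 0:
        return 0.0, 1.0  # no rows means no evidence either way
    rate = positives / n
    denom = 1.0 + z * z / n
    center = (rate + z * z / (2 * n)) / denom
    half_width = z * sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# High bucket from the diagnostic: one positive among two tickets.
low, high = wilson_interval(positives=1, n=2)
print(f"observed=0.50, interval=({low:.2f}, {high:.2f})")
```

For one positive in two rows the interval runs from roughly 0.09 to 0.91, so the bucket is compatible with almost any escalation rate. That width, more than the point estimate, is the argument for collecting more held-out labels.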

Package The Serving Contract

The shipped product is not only model weights. It is a pinned bundle:

| Component | Versioned value |
| --- | --- |
| Label guide | escalation-policy-v1 |
| Validation receipt | refund-intake-validation-v1, six exact fixtures |
| Dataset split | refund-intake-split-2026-05 |
| Tokenizer and model | encoder_v1 |
| Serving release | intake_bundle_v2 |
| Input policy | refund-intake-input-v1: trim outer whitespace, preserve punctuation, maximum 256 tokens |
| Threshold | 0.35, chosen on frozen validation fixtures |
| Failure fallback | human_review_now |
| Downstream consumer | refund_agent_v2 |
Pinned classifier intake contract for intake_bundle_v2. The manifest versions escalation-policy-v1, refund-intake-validation-v1 with six exact fixtures, refund-intake-split-2026-05, encoder_v1, the 256-token input policy, threshold 0.35, human-review fallback, and refund_agent_v2. Blank input, model failure, and invalid scores fail closed to human review; valid scores at or above 0.35 also route to a person, while lower valid scores may enter the guarded agent. The agent accepts r-104, blocks r-105 because its route is human review, and blocks r-106 because its bundle provenance is stale. Rollback restores the entire contract rather than model weights alone.
Deploy the threshold and fallback with the model. A rollback that restores only weights doesn't restore behavior.

The endpoint should fail closed. A blank ticket, unavailable model, or malformed score is a reason to request human review rather than send uncertain traffic into the agent.

serving-fallback.py

    from math import isfinite

    bundle = {
        "bundle_version": "intake_bundle_v2",
        "label_version": "escalation-policy-v1",
        "validation_receipt_version": "refund-intake-validation-v1",
        "split_manifest_version": "refund-intake-split-2026-05",
        "tokenizer_version": "encoder_v1",
        "model_version": "encoder_v1",
        "input_policy_version": "refund-intake-input-v1",
        "max_length": 256,
        "threshold": 0.35,
        "fallback_route": "human_review_now",
    }

    def route_ticket(text: str, score: float | None) -> tuple[str, str]:
        if not text.strip():
            return bundle["fallback_route"], "empty_text"
        if score is None:
            return bundle["fallback_route"], "model_unavailable"
        if not isfinite(score) or not 0.0 <= score <= 1.0:
            return bundle["fallback_route"], "invalid_score"
        if score >= bundle["threshold"]:
            return "human_review_now", "threshold"
        return "guarded_agent", "below_threshold"

    checks = [
        ("Carrier delivered after the return window", 0.72, "human_review_now", "threshold"),
        ("Where is the return policy?", 0.18, "guarded_agent", "below_threshold"),
        ("", 0.04, "human_review_now", "empty_text"),
        ("Where is my package?", None, "human_review_now", "model_unavailable"),
        ("Malformed model score", float("nan"), "human_review_now", "invalid_score"),
    ]
    print("bundle:", bundle["bundle_version"], bundle["validation_receipt_version"])
    for text, score, expected_route, expected_reason in checks:
        route, reason = route_ticket(text, score)
        assert (route, reason) == (expected_route, expected_reason)
        print(f"{route}: {reason}")
Output

    bundle: intake_bundle_v2 refund-intake-validation-v1
    human_review_now: threshold
    guarded_agent: below_threshold
    human_review_now: empty_text
    human_review_now: model_unavailable
    human_review_now: invalid_score
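Restoring the whole contract is easier to verify when the bundle has a single fingerprint. The sketch below hashes a canonical JSON form of the bundle dictionary from serving-fallback.py; the functions `bundle_fingerprint` and `verify_rollback` are hypothetical helpers, not part of the lesson's code.

```python
import hashlib
import json

def bundle_fingerprint(bundle: dict) -> str:
    """Hash a canonical JSON form so any changed field changes the fingerprint."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

pinned = {
    "bundle_version": "intake_bundle_v2",
    "label_version": "escalation-policy-v1",
    "validation_receipt_version": "refund-intake-validation-v1",
    "split_manifest_version": "refund-intake-split-2026-05",
    "tokenizer_version": "encoder_v1",
    "model_version": "encoder_v1",
    "input_policy_version": "refund-intake-input-v1",
    "max_length": 256,
    "threshold": 0.35,
    "fallback_route": "human_review_now",
}
expected = bundle_fingerprint(pinned)

# Simulated partial rollback: weights restored, but a different threshold left in place.
partial_rollback = {**pinned, "threshold": 0.50}

def verify_rollback(candidate: dict, expected_fingerprint: str) -> str:
    """Accept a rollback only when every pinned field matches."""
    if bundle_fingerprint(candidate) == expected_fingerprint:
        return "restored"
    return "hold:bundle_mismatch"

print(verify_rollback(pinned, expected))
print(verify_rollback(partial_rollback, expected))
```

Sorting keys before hashing keeps the fingerprint stable regardless of dictionary order, so a mismatch points to a changed value rather than a formatting difference.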

Hand Off To The Production Agent

Only the guarded_agent route from the pinned bundle reaches the next capstone. That agent will still retrieve permitted policy evidence, draft rather than execute a refund, and ask for approval before any consequential action. The classifier gives it an intake boundary, not extra authority.

agent-intake-artifact.py

    handoff = {
        "producer": "support_ticket_escalation_classifier",
        "bundle_version": "intake_bundle_v2",
        "model_version": "encoder_v1",
        "threshold": 0.35,
        "allowed_agent_route": "guarded_agent",
        "blocked_or_manual_route": "human_review_now",
        "policy_evidence_service": "document_qa_v2",
        "evaluation_report": "classifier_dashboard_run_intake_bundle_v2",
    }

    agent_inputs = [
        {"ticket_id": "r-104", "bundle_version": "intake_bundle_v2", "route": "guarded_agent"},
        {"ticket_id": "r-105", "bundle_version": "intake_bundle_v2", "route": "human_review_now"},
        {"ticket_id": "r-106", "bundle_version": "intake_bundle_v1", "route": "guarded_agent"},
    ]

    def admitted_by_pinned_bundle(row: dict[str, str]) -> bool:
        return (
            row["bundle_version"] == handoff["bundle_version"]
            and row["route"] == handoff["allowed_agent_route"]
        )

    accepted = [
        row["ticket_id"]
        for row in agent_inputs
        if admitted_by_pinned_bundle(row)
    ]
    blocked = [
        row["ticket_id"]
        for row in agent_inputs
        if not admitted_by_pinned_bundle(row)
    ]

    print("agent accepts:", ",".join(accepted))
    print("agent blocked:", ",".join(blocked))
    print("agent authority: draft_with_evidence_and_request_approval")
Output

    agent accepts: r-104
    agent blocked: r-105,r-106
    agent authority: draft_with_evidence_and_request_approval
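The boolean check in agent-intake-artifact.py says which tickets are blocked but not why. A possible variant, sketched below, returns a reason string so traces can separate a stale bundle from a manual route. The function `admission_reason` and its reason labels are assumptions for illustration.

```python
PINNED_BUNDLE = "intake_bundle_v2"
ALLOWED_ROUTE = "guarded_agent"

def admission_reason(row: dict[str, str]) -> str:
    """Explain the admission outcome; provenance is checked before the route."""
    if row.get("bundle_version") != PINNED_BUNDLE:
        return "blocked:stale_bundle"
    if row.get("route") != ALLOWED_ROUTE:
        return "blocked:manual_route"
    return "accepted"

agent_inputs = [
    {"ticket_id": "r-104", "bundle_version": "intake_bundle_v2", "route": "guarded_agent"},
    {"ticket_id": "r-105", "bundle_version": "intake_bundle_v2", "route": "human_review_now"},
    {"ticket_id": "r-106", "bundle_version": "intake_bundle_v1", "route": "guarded_agent"},
]

reasons = {row["ticket_id"]: admission_reason(row) for row in agent_inputs}
for ticket_id, reason in reasons.items():
    print(f"{ticket_id}: {reason}")
```

Checking provenance first means a route produced by an unpinned bundle is never trusted, even when it happens to say guarded_agent. Using `row.get` also makes a missing field block the ticket instead of raising an error.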

This contract creates a portfolio story with traceable boundaries:

  1. Document QA supplies evidence-bound policy answers.
  2. Exact-receipt evaluation rows and gates expose failures before release.
  3. The classifier blocks high-risk intake from automation.
  4. The agent uses only admitted routine intake and still needs approval for consequential actions.

Release Checklist

Don't ship a classifier project as a notebook screenshot. Keep these files or equivalent artifacts together:

| Artifact | Question it answers |
| --- | --- |
| label_guide.md | What does human_review_now mean, including edge cases? |
| data_card.md | Where did rows come from and what traffic is missing? |
| Frozen split manifest | Did account or time leakage contaminate evaluation? |
| Baseline report | Did the encoder solve a measured baseline failure? |
| Training configuration | Which checkpoint, seed, hyperparameters, and label mapping produced the model? |
| Versioned evaluation rows | Did the frozen receipt match exactly, and which required-slice failures passed or blocked release? |
| Serving bundle | Are tokenizer, threshold, fallback, and model restored together? |
| Shadow-monitor plan | Which reviewed production sample can trigger rollback? |

Use shadow traffic before changing live routing. Record scores and proposed routes, retain human decisions, and compare missed-escalation rate by required slice. Roll back to manual review when a required slice misses its agreed floor; don't wait for average accuracy to fall.
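A shadow monitor along these lines could be sketched as follows. The record fields, slice names, and per-slice ceilings are all hypothetical; the agreed floor is expressed here as a maximum missed-escalation rate, which is one possible reading of the rule above.

```python
from collections import defaultdict

REVIEW = "human_review_now"

# Hypothetical shadow records: proposed route next to the retained human decision.
shadow_rows = [
    {"slice": "return_window_exception", "proposed": REVIEW, "human": REVIEW},
    {"slice": "return_window_exception", "proposed": "guarded_agent", "human": REVIEW},
    {"slice": "account_takeover", "proposed": REVIEW, "human": REVIEW},
    {"slice": "account_takeover", "proposed": REVIEW, "human": REVIEW},
    {"slice": "routine_delivery_status", "proposed": "guarded_agent", "human": "guarded_agent"},
]
# Assumed ceilings on missed-escalation rate for each required slice.
MAX_MISS_RATE = {"return_window_exception": 0.0, "account_takeover": 0.0}

def missed_escalation_rates(rows: list[dict[str, str]]) -> dict[str, float]:
    """Per slice: share of human-escalated tickets the classifier would have automated."""
    needed: dict[str, int] = defaultdict(int)
    missed: dict[str, int] = defaultdict(int)
    for row in rows:
        if row["human"] != REVIEW:
            continue  # only tickets a person escalated can count as missed
        needed[row["slice"]] += 1
        if row["proposed"] != REVIEW:
            missed[row["slice"]] += 1
    return {name: missed[name] / count for name, count in needed.items()}

def shadow_decision(rows: list[dict[str, str]], ceilings: dict[str, float]) -> tuple[str, str]:
    rates = missed_escalation_rates(rows)
    unmeasured = sorted(name for name in ceilings if name not in rates)
    if unmeasured:
        return "keep_collecting", f"no_reviewed_positives:{','.join(unmeasured)}"
    breaches = sorted(name for name, ceiling in ceilings.items() if rates[name] > ceiling)
    if breaches:
        return "rollback_to_manual_review", f"breach:{','.join(breaches)}"
    return "continue_shadow", "within_agreed_floor"

rates = missed_escalation_rates(shadow_rows)
decision, reason = shadow_decision(shadow_rows, MAX_MISS_RATE)
print(decision, reason)
```

The decision looks only at required slices and never at an overall average, so one missed return-window exception is enough to trigger rollback in this sample. A slice with no reviewed positives is reported separately rather than treated as passing.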

Practice: Break The Intake Contract

Run the relevant cells again after each mutation. Revert each mutation before starting the next.

  1. Inspect intake_bundle_incomplete. Why does it hold even though every remaining score clears threshold 0.35?
  2. Inspect intake_bundle_padded, intake_bundle_duplicated, and intake_bundle_drifted. Why can't easy extras, duplicate rows, or a newer dataset silently change the release decision?
  3. Raise intake_bundle_v2 threshold from 0.35 to 0.50. Which required fixture blocks shadow eligibility?
  4. Change fallback_route to guarded_agent. Which serving-fallback.py assertions fail, and why is that unsafe?
  5. Remove the bundle-version comparison from admitted_by_pinned_bundle. Which stale route artifact now reaches the agent?

Mastery Check

Evaluation Rubric

  • Beginner: Defines the label and explains why the classifier routes intake rather than authorizes action.
  • Applied: Trains a baseline and encoder, chooses threshold from frozen validation rows, and ships tokenizer, threshold, and fallback together.
  • Advanced: Versions evaluation rows, blocks required-slice false negatives, tests calibration cautiously, and supplies a shadow-monitor rollback plan.
  • Research-ready: Challenges split representativeness, label ambiguity, threshold policy, and calibration evidence before claiming generalization.

Common Failure Modes

| Symptom | Cause | Correction |
| --- | --- | --- |
| Accuracy is high while return-window exceptions are missed. | Aggregate metric hides asymmetric cost. | Gate required slices on false negatives and show the rows. |
| Candidate score improves after easy status rows are appended. | Receipt inventory wasn't frozen before aggregation. | Reject missing, unexpected, duplicate, or drifted receipt rows before calculating metrics. |
| Fine-tuning appears to help, but validation contains the same accounts as training. | Thread or account leakage. | Group by account before time holdout and freeze the manifest. |
| A score is described as probability without retained evidence. | Softmax was confused with calibration. | Say score, audit held-out buckets, and calibrate only with adequate labels. |
| Rollback changes model weights but traffic behavior remains unsafe. | Threshold or tokenizer was deployed separately. | Version and restore full serving bundle. |
| Production agent handles a return-window exception. | Classifier route was treated as advisory after a failed gate. | Fail closed to human review and test intake boundary. |
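The account-leakage correction can be illustrated with a group-aware split. The sketch below assigns whole accounts to a split with a deterministic hash, a simpler stand-in for the group-then-time holdout named in the table. The ticket rows, the 20 percent validation share, and the helper `split_for_account` are assumptions.

```python
import hashlib

# Hypothetical tickets; several share an account, which is where leakage starts.
tickets = [
    {"ticket_id": "t-1", "account_id": "acct-17"},
    {"ticket_id": "t-2", "account_id": "acct-17"},
    {"ticket_id": "t-3", "account_id": "acct-42"},
    {"ticket_id": "t-4", "account_id": "acct-08"},
    {"ticket_id": "t-5", "account_id": "acct-42"},
    {"ticket_id": "t-6", "account_id": "acct-91"},
]

def split_for_account(account_id: str, validation_percent: int = 20) -> str:
    """Map an account to one split; the same account always gets the same answer."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "validation" if bucket < validation_percent else "train"

# Freeze the manifest: every ticket inherits its account's split.
manifest = {row["ticket_id"]: split_for_account(row["account_id"]) for row in tickets}

# Leakage check: no account may appear on both sides.
accounts_by_split: dict[str, set[str]] = {"train": set(), "validation": set()}
for row in tickets:
    accounts_by_split[manifest[row["ticket_id"]]].add(row["account_id"])
leaked = accounts_by_split["train"] & accounts_by_split["validation"]

print("leaked accounts:", sorted(leaked))
```

Because the split depends only on the account identifier, re-running the script reproduces the same manifest, and tickets that share product names or carrier wording through one account cannot straddle the boundary. With so few accounts, either split may end up empty; the point is the grouping rule, not the proportions.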

Key Concepts

  • Label contract before checkpoint.
  • Baseline before complexity.
  • PyTorch training mechanics before pretrained fine-tuning tooling.
  • Threshold chosen on frozen validation evidence.
  • Calibration measured, never assumed from softmax output.
  • Frozen receipt identity, exact fixture coverage, and required-slice failures block release.
  • Model, tokenizer, threshold, and fallback form one serving bundle.
  • Classifier route limits the downstream agent's intake, not its approval controls.

Where This Leads Next

You now have a classifier artifact that can be consumed rather than merely discussed: intake_bundle_v2 deploys encoder_v1 with a reviewed threshold, admits routine tickets to a guarded workflow, and blocks risky or failed inputs to human review. The next capstone must respect that boundary while combining it with policy retrieval, approval gates, traces, and executable evaluation.


Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

  1. A ticket is scored below the pinned threshold and routed to guarded_agent. Which downstream behavior is allowed by that route?
  2. A label guide says human_review_now covers money, access, policy-exception, or time-critical risk, and that frustrated tone alone is not enough. Which labeling is consistent with that guide?
  3. A refund dataset contains several tickets from the same customer account. Those tickets share product names, carrier wording, and prior return details. What is unsafe about randomly placing some of that account's tickets in training and others in validation?
  4. A keyword rule catches 'arrived cracked' and 'can't sign in' but misses 'FastShip delivered after the return window closed.' What does that baseline result justify?
  5. A tiny PyTorch classifier builds word IDs, averages learned token embeddings, applies a linear head, and trains with cross-entropy on six rows until loss falls. What conclusion is justified?
  6. Validation scores for positive-class human_review_now give these summaries: threshold 0.35 has TP=3, FP=1, FN=0; 0.50 has TP=2, FP=1, FN=1; 0.75 has TP=1, FP=0, FN=2. The release rule forbids missed positives in required slices, and reviewers can absorb one false positive. Which threshold is eligible?
  7. Four candidate receipts use the same threshold. Which one is comparable to the frozen validation receipt and may proceed to the slice gate?
  8. A calibration diagnostic has a high-score bucket with two held-out rows. Their average score is about 0.80, but only one of the two is actually human_review_now. What is the safest dashboard language?
  9. The serving endpoint receives a nonempty ticket, but the model returns NaN for the human-review score. With the pinned bundle's failure fallback, how should it route the ticket?
  10. A production agent receives three intake artifacts: r-104 has bundle intake_bundle_v2 and route guarded_agent; r-105 has bundle intake_bundle_v2 and route human_review_now; r-106 has bundle intake_bundle_v1 and route guarded_agent. The handoff contract accepts only guarded_agent from pinned bundle intake_bundle_v2. Which ticket may enter the agent?


Next Step
Continue to Capstone: Production Agent

You will make the agent consume `guarded_agent` intake, obtain policy evidence from the document QA artifact, and keep refund actions behind human approval.

References

[1] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

[2] Hugging Face. (2026). Text Classification. Official documentation.

[3] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

[4] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.