Train and gate a support-ticket encoder that exports exact-receipt evidence and safe intake decisions to a production agent.
The evaluation dashboard in the previous capstone made one rule visible: a release is blocked by the failures that matter, even when its aggregate score looks good. Now apply that rule to a learned model.
Your product already has a document QA service that answers policy questions with permitted citations. The next product will be an agent that can draft a support response and request approval. This capstone builds the intake classifier in between:
Given a new refund-support ticket, should it enter human_review_now, or may it enter the guarded agent workflow?
Consider this ticket:
> FastShip delivered my tablet after the return window closed. The portal rejected my return. Please make an exception.

Missing this exception is not the same as sending one harmless package-status ticket to a human. The classifier therefore does routing, not authorization. It never decides refund eligibility, sends a reply, or moves money. It protects the agent entry point.
By the end, you will have:
BERT popularized the pattern of adapting a pretrained bidirectional encoder to a downstream classification task with a small output layer.[1] Hugging Face exposes the same production-oriented path through sequence-classification models and training utilities.[2] The important discipline here is not selecting a fashionable checkpoint. It is proving that a trained classifier satisfies a routing contract.
Use one binary label:
| Label | Route | Meaning |
|---|---|---|
| 1 | human_review_now | Money, access, policy-exception, or time-critical risk requires a person before any agent workflow. |
| 0 | guarded_agent | A routine request may enter the agent, which still follows retrieval evidence and approval rules. |
That second row matters. Label 0 does not mean "approve automatically." It means "safe to enter an automation path with its own controls."
The model produces a score. The pinned threshold turns that score into an intake route. Either route preserves the agent's downstream evidence and approval boundaries.
| Capstone artifact | What it already proved | What this classifier needs from it |
|---|---|---|
| Document QA product | An answer must cite permitted policy evidence. | Routine tickets sent to the agent can ask policy questions safely. |
| Evaluation dashboard | Hard failures block a release even when averages improve. | Missed urgent slices must block a classifier release. |
| Fine-tuned classifier | Ticket becomes a risk route under a pinned threshold. | Export the intake contract for the production agent. |
Start with labeled fixtures that can be reviewed before any model is trained.
```python
label_version = "escalation-policy-v1"
fixtures = [
    {
        "id": "damaged_package_exception",
        "text": "The tablet arrived cracked, but the return portal says final sale.",
        "gold_route": "human_review_now",
        "reason": "damaged_package_exception",
    },
    {
        "id": "return_window_exception",
        "text": "FastShip delivered after the return window closed. Please allow an exception.",
        "gold_route": "human_review_now",
        "reason": "return_window_exception",
    },
    {
        "id": "account_takeover",
        "text": "Someone changed my delivery address and I can't sign in.",
        "gold_route": "human_review_now",
        "reason": "account_takeover",
    },
    {
        "id": "routine_delivery_status",
        "text": "Where is my package?",
        "gold_route": "guarded_agent",
        "reason": "routine_delivery_status",
    },
    {
        "id": "return_policy_question",
        "text": "What is the return policy for damaged tablets?",
        "gold_route": "guarded_agent",
        "reason": "return_policy_question",
    },
]

allowed_routes = {"human_review_now", "guarded_agent"}
assert all(row["gold_route"] in allowed_routes for row in fixtures)
assert {row["reason"] for row in fixtures if row["gold_route"] == "human_review_now"} == {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
}

print("label guide:", label_version)
for row in fixtures:
    print(f"{row['id']}: {row['gold_route']} ({row['reason']})")
```

```text
label guide: escalation-policy-v1
damaged_package_exception: human_review_now (damaged_package_exception)
return_window_exception: human_review_now (return_window_exception)
account_takeover: human_review_now (account_takeover)
routine_delivery_status: guarded_agent (routine_delivery_status)
return_policy_question: guarded_agent (return_policy_question)
```

A label guide should also name ambiguous cases. A frustrated tone alone is not a reason for immediate human routing. A calm message requesting a return-window exception may be. Have support operations adjudicate disagreements and store that policy version alongside any dataset generated from it.
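One way to make the adjudication step concrete is to double-label a sample and queue every disagreement for support operations. The sketch below is illustrative only: the fixture IDs, annotator names, and queue fields are hypothetical, not part of the lesson's dataset.

```python
# Hypothetical double-annotation sample; IDs and annotator names are illustrative.
label_version = "escalation-policy-v1"
annotations = {
    "calm_return_window_request": {"annotator_a": "human_review_now", "annotator_b": "guarded_agent"},
    "angry_delivery_status": {"annotator_a": "guarded_agent", "annotator_b": "human_review_now"},
    "account_takeover": {"annotator_a": "human_review_now", "annotator_b": "human_review_now"},
    "routine_delivery_status": {"annotator_a": "guarded_agent", "annotator_b": "guarded_agent"},
}

# A fixture is disputed when its annotators did not all choose the same route.
disputed = sorted(
    fixture_id
    for fixture_id, votes in annotations.items()
    if len(set(votes.values())) > 1
)
disagreement_rate = len(disputed) / len(annotations)

# Each queued item carries the guide version so the ruling can be traced later.
adjudication_queue = [
    {"fixture_id": fixture_id, "label_version": label_version, "votes": annotations[fixture_id]}
    for fixture_id in disputed
]

print(f"disagreement rate: {disagreement_rate:.2f}")
for item in adjudication_queue:
    print("needs adjudication:", item["fixture_id"])
```

The disagreement rate from a sample like this is the kind of number a real data card would report next to the label version.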
This lesson uses a hypothetical teaching dataset, not measured production traffic:
| Field | Teaching dataset choice |
|---|---|
| Source | Synthetic refund-support tickets modeled after return, delivery, and account-access requests |
| Label version | escalation-policy-v1 |
| Exclusions | Spam, internal QA, tickets without customer text |
| High-risk slices | damaged_package_exception, return_window_exception, account_takeover |
| Split policy | Earlier accounts train; later, unseen accounts validate and test |
| Release rule | No missed positive in required validation slices |
In a real data card, replace every teaching assumption with provenance, consent and retention policy, labeling instructions, class balance, disagreement rate, and known coverage gaps.
Suppose multiple messages from the same customer account share product names, carrier wording, or a prior return incident. Randomly putting one thread into training and another into validation makes the classifier look better than it will on a new account. Group by account first, then hold out later traffic.
```python
records = [
    {"ticket": "t01", "account": "acme", "month": 1, "split": "train"},
    {"ticket": "t02", "account": "acme", "month": 2, "split": "train"},
    {"ticket": "t03", "account": "north", "month": 2, "split": "train"},
    {"ticket": "t04", "account": "north", "month": 2, "split": "train"},
    {"ticket": "v01", "account": "summit", "month": 3, "split": "validation"},
    {"ticket": "v02", "account": "summit", "month": 3, "split": "validation"},
    {"ticket": "x01", "account": "harbor", "month": 4, "split": "test"},
    {"ticket": "x02", "account": "harbor", "month": 4, "split": "test"},
]

accounts = {
    split: {row["account"] for row in records if row["split"] == split}
    for split in ("train", "validation", "test")
}
assert accounts["train"].isdisjoint(accounts["validation"])
assert accounts["train"].isdisjoint(accounts["test"])
assert accounts["validation"].isdisjoint(accounts["test"])

for split in ("train", "validation", "test"):
    count = sum(row["split"] == split for row in records)
    names = ",".join(sorted(accounts[split]))
    print(f"{split}: {count} tickets, accounts={names}")
```

```text
train: 4 tickets, accounts=acme,north
validation: 2 tickets, accounts=summit
test: 2 tickets, accounts=harbor
```

Use validation data to pick features, checkpoints, and threshold. Open the test set once for the final report after choices are frozen. A small, curated fixture set can reveal specific failures, but it can't establish broad production performance by itself.
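The check above verifies a split that already exists. A minimal sketch of producing one, assuming each account is assigned by the month it first appears, could look like this. The function name `assign_splits`, the cutoff months, and the sample tickets are hypothetical.

```python
# Hypothetical tickets without a split yet.
tickets = [
    {"ticket": "t01", "account": "acme", "month": 1},
    {"ticket": "t02", "account": "acme", "month": 3},
    {"ticket": "t03", "account": "north", "month": 2},
    {"ticket": "v01", "account": "summit", "month": 3},
    {"ticket": "x01", "account": "harbor", "month": 4},
]

def assign_splits(rows: list[dict], validation_from: int, test_from: int) -> list[dict]:
    """Group by account first, then place each account by its first-seen month."""
    first_seen: dict[str, int] = {}
    for row in rows:
        account = row["account"]
        first_seen[account] = min(first_seen.get(account, row["month"]), row["month"])

    def split_for(month: int) -> str:
        if month >= test_from:
            return "test"
        if month >= validation_from:
            return "validation"
        return "train"

    # Every ticket inherits its account's split, so no account straddles two splits.
    return [{**row, "split": split_for(first_seen[row["account"]])} for row in rows]

assigned = assign_splits(tickets, validation_from=3, test_from=4)
for row in assigned:
    print(f"{row['ticket']}: {row['account']} -> {row['split']}")
```

Note the design choice: `t02` stays in train because its account was first seen in month 1, even though the ticket itself arrived in month 3. A stricter time holdout would drop such late tickets from training; which rule applies belongs in the frozen split manifest.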
A transformer has to earn its complexity. Start with a deterministic rule or a bag-of-words logistic-regression baseline. If a simple baseline satisfies the routing gate on representative held-out data, shipping a neural model may add maintenance without adding safety.
Here the rule catches an explicit damaged-package phrase and misses an indirect return-window exception. That failure gives a contextual model a job worth doing.
```python
validation = [
    ("explicit_damage_exception", "Tablet arrived cracked and the return portal denied it.", 1),
    ("indirect_return_exception", "FastShip delivered after the return window closed.", 1),
    ("routine_delivery_status", "Where is my package?", 0),
    ("routine_policy", "Where can I read the return policy?", 0),
]
terms = ("arrived cracked", "can't sign in", "seller exception")

predictions = []
for fixture_id, text, label in validation:
    predicted = int(any(term in text.lower() for term in terms))
    predictions.append((fixture_id, label, predicted))

positives = sum(label for _, label, _ in predictions)
true_positives = sum(label == 1 and predicted == 1 for _, label, predicted in predictions)
missed = [fixture_id for fixture_id, label, predicted in predictions if label == 1 and predicted == 0]

print(f"positive recall: {true_positives}/{positives}")
print("missed positive:", missed[0])
print("next experiment: learn context beyond trigger words")
```

```text
positive recall: 1/2
missed positive: indirect_return_exception
next experiment: learn context beyond trigger words
```

A TF-IDF plus logistic-regression baseline is a useful next step when the dataset grows: it learns weighted text features while remaining cheap to inspect and deploy.[3] Don't claim a fine-tuned encoder is cheaper, faster, or more accurate than prompting, or than this baseline, until you measure those candidates on the same task and hardware.
An encoder classifier emits logits: one unnormalized score for each class. Softmax converts two logits into class scores that sum to one. A separate decision threshold converts the positive-class score into the business route.
Use numerically stable softmax by subtracting the largest logit before exponentiating.
```python
from math import exp

examples = [
    ("return_window_closed", [-1.0, 2.0]),
    ("delivery_status", [1.0, -1.0]),
]

def softmax(logits: list[float]) -> list[float]:
    offset = max(logits)
    values = [exp(value - offset) for value in logits]
    total = sum(values)
    return [value / total for value in values]

for fixture_id, logits in examples:
    routine, human = softmax(logits)
    print(f"{fixture_id}: human_review_score={human:.2f}, routine_score={routine:.2f}")
```

```text
return_window_closed: human_review_score=0.95, routine_score=0.05
delivery_status: human_review_score=0.12, routine_score=0.88
```

Call 0.95 a score until held-out reliability evidence supports calling it a probability. A bounded output is not proof of calibration.
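The first printed score can be checked by hand. Subtracting the largest logit $m$ leaves the result unchanged because the common factor $e^{-m}$ cancels between numerator and denominator. The symbols $z$, $m$, and $s$ are notation introduced here for illustration.

$$
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}, \qquad m = \max_j z_j
$$

For the logits $z = (-1, 2)$, with $m = 2$:

$$
s_{\text{human}} = \frac{e^{0}}{e^{-3} + e^{0}} \approx \frac{1}{1 + 0.0498} \approx 0.953, \qquad
s_{\text{routine}} = \frac{e^{-3}}{e^{-3} + e^{0}} \approx 0.047
$$

These round to the 0.95 and 0.05 shown in the output.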
Before reaching for a downloaded checkpoint, make the trainable path concrete. The sandbox below builds a tiny word-embedding encoder, averages token vectors, attaches a classification head, and updates every parameter with cross-entropy loss. This is training a text classifier from scratch, not fine-tuning BERT. It exists so you can inspect the mechanics before replacing random embeddings with pretrained representations.
```python
import torch
from torch import nn

torch.manual_seed(7)

training_rows = [
    ("cracked tablet return portal denied", 1),
    ("carrier delivered after return window", 1),
    ("delivery address changed cannot sign in", 1),
    ("where is package", 0),
    ("resend return label", 0),
    ("show return policy", 0),
]
vocabulary = {"<pad>": 0}
for text, _ in training_rows:
    for token in text.split():
        vocabulary.setdefault(token, len(vocabulary))

max_length = max(len(text.split()) for text, _ in training_rows)
input_ids = []
for text, _ in training_rows:
    ids = [vocabulary[token] for token in text.split()]
    input_ids.append(ids + [0] * (max_length - len(ids)))
inputs = torch.tensor(input_ids)
labels = torch.tensor([label for _, label in training_rows])

class TinyEncoderClassifier(nn.Module):
    def __init__(self, vocab_size: int) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 8, padding_idx=0)
        self.head = nn.Linear(8, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        mask = (token_ids != 0).unsqueeze(-1)
        embedded = self.embedding(token_ids) * mask
        pooled = embedded.sum(dim=1) / mask.sum(dim=1).clamp_min(1)
        return self.head(pooled)

model = TinyEncoderClassifier(len(vocabulary))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
initial_loss = criterion(model(inputs), labels).item()
original_head = model.head.weight.detach().clone()

for _ in range(80):
    loss = criterion(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_logits = model(inputs)
final_loss = criterion(final_logits, labels).item()
print("logit shape:", tuple(final_logits.shape))
print("loss decreased:", final_loss < initial_loss)
print("classification head updated:", not torch.equal(original_head, model.head.weight))
```

```text
logit shape: (6, 2)
loss decreased: True
classification head updated: True
```

The sandbox can memorize six tiny training examples. That is not deployment evidence. It proves that text becomes tokens, tokens become pooled representations, a head produces logits, and gradients update those weights. Held-out evaluation remains the release evidence.
For the capstone artifact, replace random embeddings with a pretrained encoder such as DistilBERT and fine-tune it for two labels. BERT-style encoders read context on both sides of a token and are designed to be adapted to tasks such as sequence classification.[1] In the Transformers task interface, AutoModelForSequenceClassification supplies the classification head while Trainer runs optimization and evaluation.[2]
The following is a project script rather than a marked copy-runnable cell: it requires your local JSONL splits, a downloaded checkpoint, and the transformers, datasets, and evaluate packages. Each JSON Lines (JSONL) row needs text and integer label fields, where 0 means guarded_agent and 1 means human_review_now.
```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert/distilbert-base-uncased"
dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",
        "validation": "data/validation.jsonl",
        "test": "data/test.jsonl",
    },
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "guarded_agent", 1: "human_review_now"},
    label2id={"guarded_agent": 0, "human_review_now": 1},
)
metric = evaluate.load("f1")

def compute_metrics(prediction):
    logits, labels = prediction.predictions, prediction.label_ids
    predicted = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predicted, references=labels, average="binary")

arguments = TrainingArguments(
    output_dir="artifacts/encoder_v1",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    processing_class=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate(encoded["validation"]))
trainer.save_model("artifacts/encoder_v1/model")
tokenizer.save_pretrained("artifacts/encoder_v1/model")
```

Two boundaries keep this honest:
Don't accept a library default threshold without checking its error tradeoff. Suppose encoder_v1 exports these validation scores for human_review_now:
| Fixture | Slice | Gold | Score |
|---|---|---|---|
| damaged_package_exception | damaged_package_exception | 1 | 0.89 |
| return_window_exception | return_window_exception | 1 | 0.42 |
| account_takeover | account_takeover | 1 | 0.64 |
| angry_delivery_status | routine_delivery_status | 0 | 0.71 |
| routine_delivery_status | routine_delivery_status | 0 | 0.23 |
| return_policy_question | return_policy_question | 0 | 0.18 |
```python
rows = [
    ("damaged_package_exception", 1, 0.89),
    ("return_window_exception", 1, 0.42),
    ("account_takeover", 1, 0.64),
    ("angry_delivery_status", 0, 0.71),
    ("routine_delivery_status", 0, 0.23),
    ("return_policy_question", 0, 0.18),
]

def summarize(threshold: float) -> tuple[int, int, int, float]:
    tp = fp = fn = 0
    for _, gold, score in rows:
        predicted = int(score >= threshold)
        tp += predicted == 1 and gold == 1
        fp += predicted == 1 and gold == 0
        fn += predicted == 0 and gold == 1
    recall = tp / (tp + fn)
    return tp, fp, fn, recall

for threshold in (0.35, 0.50, 0.75):
    tp, fp, fn, recall = summarize(threshold)
    print(f"threshold={threshold:.2f} tp={tp} fp={fp} fn={fn} recall={recall:.2f}")
```

```text
threshold=0.35 tp=3 fp=1 fn=0 recall=1.00
threshold=0.50 tp=2 fp=1 fn=1 recall=0.67
threshold=0.75 tp=1 fp=0 fn=2 recall=0.33
```

On these fixtures, threshold 0.35 is the only candidate shown that avoids a missed escalation. It sends one angry but routine delivery-status ticket to human review. That tradeoff is acceptable only if queue capacity allows it and the same rule survives a larger held-out dataset.
Don't report 1.00 recall from three positive fixtures as proof of production reliability. It is evidence about these six validation fixtures, not about unseen traffic.
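Instead of comparing three hand-picked candidates, the observed scores themselves can serve as candidate thresholds. The sketch below, using the same six validation rows, finds the highest threshold that still misses no positive. The helper name `confusion` is hypothetical.

```python
rows = [
    ("damaged_package_exception", 1, 0.89),
    ("return_window_exception", 1, 0.42),
    ("account_takeover", 1, 0.64),
    ("angry_delivery_status", 0, 0.71),
    ("routine_delivery_status", 0, 0.23),
    ("return_policy_question", 0, 0.18),
]

def confusion(threshold: float) -> tuple[int, int]:
    """Return (false positives, false negatives) when score >= threshold means human review."""
    fp = sum(gold == 0 and score >= threshold for _, gold, score in rows)
    fn = sum(gold == 1 and score < threshold for _, gold, score in rows)
    return fp, fn

# Every distinct observed score is a point where the confusion counts can change.
candidates = sorted({score for _, _, score in rows})
zero_miss = [threshold for threshold in candidates if confusion(threshold)[1] == 0]
ceiling = max(zero_miss)

fp, fn = confusion(ceiling)
print(f"highest zero-miss threshold: {ceiling:.2f} (fp={fp}, fn={fn})")
```

On these rows the ceiling equals the lowest positive score, 0.42, so any threshold at or below it avoids a miss, and 0.35 sits under that ceiling with the same single false positive. How much margin to leave below the ceiling is a judgment that needs a larger held-out set.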
The dashboard capstone compared runs only when dataset, grader, corpus, and fixture inventory matched. Carry that discipline here. A classifier comparison receipt pins its dataset, label guide, split manifest, model, and exact validation fixture IDs before scores are aggregated.
intake_bundle_v1 is blocked because its encoder_v1 score for return_window_exception falls below the initially proposed threshold. intake_bundle_v2 keeps the same trained encoder but pins a reviewed threshold, then becomes eligible for shadow traffic, not immediate autonomous routing. A threshold change versions the serving bundle, not the model weights.
```python
from collections import Counter

EXPECTED_IDENTITY = {
    "dataset_version": "refund-intake-validation-v1",
    "label_version": "escalation-policy-v1",
    "split_manifest_version": "refund-intake-split-2026-05",
    "model_version": "encoder_v1",
}
EXPECTED_FIXTURES = {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
    "angry_delivery_status",
    "routine_delivery_status",
    "return_policy_question",
}
REQUIRED_SLICES = {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
}
validation_rows = [
    {"fixture_id": "damaged_package_exception", "slice": "damaged_package_exception", "gold": 1, "score": 0.89},
    {"fixture_id": "return_window_exception", "slice": "return_window_exception", "gold": 1, "score": 0.42},
    {"fixture_id": "account_takeover", "slice": "account_takeover", "gold": 1, "score": 0.64},
    {"fixture_id": "angry_delivery_status", "slice": "routine_delivery_status", "gold": 0, "score": 0.71},
    {"fixture_id": "routine_delivery_status", "slice": "routine_delivery_status", "gold": 0, "score": 0.23},
    {"fixture_id": "return_policy_question", "slice": "return_policy_question", "gold": 0, "score": 0.18},
]

def receipt(threshold: float, rows: list[dict[str, object]], **overrides: str) -> dict[str, object]:
    return {**EXPECTED_IDENTITY, **overrides, "threshold": threshold, "rows": rows}

def release_decision(run: dict[str, object]) -> tuple[str, str]:
    for field, expected in EXPECTED_IDENTITY.items():
        if run[field] != expected:
            return "hold", f"{field}:{run[field]}"
    rows = run["rows"]
    fixture_ids = [str(row["fixture_id"]) for row in rows]
    counts = Counter(fixture_ids)
    missing = sorted(EXPECTED_FIXTURES - set(fixture_ids))
    unexpected = sorted(set(fixture_ids) - EXPECTED_FIXTURES)
    duplicated = sorted(fixture_id for fixture_id, count in counts.items() if count != 1)
    if missing:
        return "hold", f"missing:{','.join(missing)}"
    if unexpected:
        return "hold", f"unexpected:{','.join(unexpected)}"
    if duplicated:
        return "hold", f"duplicate:{','.join(duplicated)}"
    misses = [
        str(row["fixture_id"])
        for row in rows
        if row["slice"] in REQUIRED_SLICES
        and row["gold"] == 1
        and row["score"] < run["threshold"]
    ]
    if misses:
        return "hold", f"missed:{','.join(misses)}"
    return "eligible_for_shadow", "exact_receipt_pass"

runs = {
    "intake_bundle_v1": receipt(0.50, validation_rows),
    "intake_bundle_v2": receipt(0.35, validation_rows),
    "intake_bundle_incomplete": receipt(
        0.35,
        [row for row in validation_rows if row["fixture_id"] != "account_takeover"],
    ),
    "intake_bundle_padded": receipt(
        0.35,
        [*validation_rows, {"fixture_id": "easy_status_extra", "slice": "routine_delivery_status", "gold": 0, "score": 0.02}],
    ),
    "intake_bundle_duplicated": receipt(0.35, [*validation_rows, validation_rows[-1]]),
    "intake_bundle_drifted": receipt(0.35, validation_rows, dataset_version="refund-intake-validation-v2"),
}
for bundle_version, run in runs.items():
    decision, reason = release_decision(run)
    print(f"{bundle_version}: decision={decision} reason={reason}")
```

```text
intake_bundle_v1: decision=hold reason=missed:return_window_exception
intake_bundle_v2: decision=eligible_for_shadow reason=exact_receipt_pass
intake_bundle_incomplete: decision=hold reason=missing:account_takeover
intake_bundle_padded: decision=hold reason=unexpected:easy_status_extra
intake_bundle_duplicated: decision=hold reason=duplicate:return_policy_question
intake_bundle_drifted: decision=hold reason=dataset_version:refund-intake-validation-v2
```

Lowering the threshold is not automatically the right fix. If false positives overwhelm reviewers, improve labels or model behavior instead. For this teaching receipt, the important move is refusing to hide a required-slice miss inside an average or to pad the comparison with easy rows.
A threshold can be useful even when the score is not a well-calibrated probability. Calibration asks whether tickets scored near 0.70, across enough held-out examples, need human review about 70 percent of the time. Modern neural classifiers can be miscalibrated; temperature scaling is one common post-training adjustment evaluated on held-out data.[4]
Six fixtures are enough to demonstrate a calculation, not to justify customer-facing percentages:
```python
scored_rows = [
    (0.89, 1),
    (0.64, 1),
    (0.42, 1),
    (0.71, 0),
    (0.23, 0),
    (0.18, 0),
]
buckets = {
    "low [0.0,0.4)": [(score, gold) for score, gold in scored_rows if score < 0.4],
    "middle [0.4,0.7)": [(score, gold) for score, gold in scored_rows if 0.4 <= score < 0.7],
    "high [0.7,1.0]": [(score, gold) for score, gold in scored_rows if score >= 0.7],
}

for name, bucket in buckets.items():
    average_score = sum(score for score, _ in bucket) / len(bucket)
    observed_rate = sum(gold for _, gold in bucket) / len(bucket)
    print(f"{name}: n={len(bucket)}, score={average_score:.2f}, observed={observed_rate:.2f}")
print("decision: diagnostic only; collect more held-out labels")
```

```text
low [0.0,0.4): n=2, score=0.21, observed=0.00
middle [0.4,0.7): n=2, score=0.53, observed=1.00
high [0.7,1.0]: n=2, score=0.80, observed=0.50
decision: diagnostic only; collect more held-out labels
```

The high bucket exposes why calibrated language matters: an average score of 0.80 with one positive among two tickets is not evidence that 0.80 means an 80 percent escalation rate. Keep displaying a routing action until a larger retained set supports probability claims.
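Temperature scaling, the adjustment cited earlier, can be sketched in a few lines: divide the logits by a single scalar before softmax and choose the scalar that minimizes negative log-likelihood on held-out rows.[4] The version below uses a plain grid search instead of a gradient-based fit, and the six logit pairs are hypothetical, so it demonstrates only the mechanics.

```python
from math import exp, log

# Hypothetical held-out rows: ([routine_logit, human_logit], gold_label).
held_out = [
    ([-2.0, 2.0], 1),
    ([-1.5, 1.5], 1),
    ([-2.0, 2.0], 0),  # confident but wrong
    ([2.5, -2.5], 0),
    ([1.0, -1.0], 0),
    ([0.5, -0.5], 1),  # mildly wrong
]

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [value / temperature for value in logits]
    offset = max(scaled)
    values = [exp(value - offset) for value in scaled]
    total = sum(values)
    return [value / total for value in values]

def mean_nll(rows: list[tuple[list[float], int]], temperature: float) -> float:
    """Average negative log-likelihood of the gold label at this temperature."""
    return -sum(log(softmax(logits, temperature)[gold]) for logits, gold in rows) / len(rows)

# Grid search over temperatures from 0.5 to 5.0 in steps of 0.1.
candidates = [0.5 + 0.1 * step for step in range(46)]
best_temperature = min(candidates, key=lambda t: mean_nll(held_out, t))

print(f"NLL at T=1.0: {mean_nll(held_out, 1.0):.3f}")
print(f"best T on grid: {best_temperature:.1f}, NLL: {mean_nll(held_out, best_temperature):.3f}")
```

Dividing by a positive temperature never changes which class scores highest, but it does move every score, so a pinned threshold has to be re-chosen and re-versioned after scaling.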
The shipped product is not only model weights. It is a pinned bundle:
| Component | Versioned value |
|---|---|
| Label guide | escalation-policy-v1 |
| Validation receipt | refund-intake-validation-v1, six exact fixtures |
| Dataset split | refund-intake-split-2026-05 |
| Tokenizer and model | encoder_v1 |
| Serving release | intake_bundle_v2 |
| Input policy | refund-intake-input-v1: trim outer whitespace, preserve punctuation, maximum 256 tokens |
| Threshold | 0.35, chosen on frozen validation fixtures |
| Failure fallback | human_review_now |
| Downstream consumer | refund_agent_v2 |
The endpoint should fail closed. A blank ticket, unavailable model, or malformed score is a reason to request human review rather than send uncertain traffic into the agent.
```python
from math import isfinite

bundle = {
    "bundle_version": "intake_bundle_v2",
    "label_version": "escalation-policy-v1",
    "validation_receipt_version": "refund-intake-validation-v1",
    "split_manifest_version": "refund-intake-split-2026-05",
    "tokenizer_version": "encoder_v1",
    "model_version": "encoder_v1",
    "input_policy_version": "refund-intake-input-v1",
    "max_length": 256,
    "threshold": 0.35,
    "fallback_route": "human_review_now",
}

def route_ticket(text: str, score: float | None) -> tuple[str, str]:
    if not text.strip():
        return bundle["fallback_route"], "empty_text"
    if score is None:
        return bundle["fallback_route"], "model_unavailable"
    if not isfinite(score) or not 0.0 <= score <= 1.0:
        return bundle["fallback_route"], "invalid_score"
    if score >= bundle["threshold"]:
        return "human_review_now", "threshold"
    return "guarded_agent", "below_threshold"

checks = [
    ("Carrier delivered after the return window", 0.72, "human_review_now", "threshold"),
    ("Where is the return policy?", 0.18, "guarded_agent", "below_threshold"),
    ("", 0.04, "human_review_now", "empty_text"),
    ("Where is my package?", None, "human_review_now", "model_unavailable"),
    ("Malformed model score", float("nan"), "human_review_now", "invalid_score"),
]
print("bundle:", bundle["bundle_version"], bundle["validation_receipt_version"])
for text, score, expected_route, expected_reason in checks:
    route, reason = route_ticket(text, score)
    assert (route, reason) == (expected_route, expected_reason)
    print(f"{route}: {reason}")
```

```text
bundle: intake_bundle_v2 refund-intake-validation-v1
human_review_now: threshold
guarded_agent: below_threshold
human_review_now: empty_text
human_review_now: model_unavailable
human_review_now: invalid_score
```

Only the guarded_agent route from the pinned bundle reaches the next capstone. That agent will still retrieve permitted policy evidence, draft rather than execute a refund, and ask for approval before any consequential action. The classifier gives it an intake boundary, not extra authority.
```python
handoff = {
    "producer": "support_ticket_escalation_classifier",
    "bundle_version": "intake_bundle_v2",
    "model_version": "encoder_v1",
    "threshold": 0.35,
    "allowed_agent_route": "guarded_agent",
    "blocked_or_manual_route": "human_review_now",
    "policy_evidence_service": "document_qa_v2",
    "evaluation_report": "classifier_dashboard_run_intake_bundle_v2",
}

agent_inputs = [
    {"ticket_id": "r-104", "bundle_version": "intake_bundle_v2", "route": "guarded_agent"},
    {"ticket_id": "r-105", "bundle_version": "intake_bundle_v2", "route": "human_review_now"},
    {"ticket_id": "r-106", "bundle_version": "intake_bundle_v1", "route": "guarded_agent"},
]

def admitted_by_pinned_bundle(row: dict[str, str]) -> bool:
    return (
        row["bundle_version"] == handoff["bundle_version"]
        and row["route"] == handoff["allowed_agent_route"]
    )

accepted = [
    row["ticket_id"]
    for row in agent_inputs
    if admitted_by_pinned_bundle(row)
]
blocked = [
    row["ticket_id"]
    for row in agent_inputs
    if not admitted_by_pinned_bundle(row)
]

print("agent accepts:", ",".join(accepted))
print("agent blocked:", ",".join(blocked))
print("agent authority: draft_with_evidence_and_request_approval")
```

```text
agent accepts: r-104
agent blocked: r-105,r-106
agent authority: draft_with_evidence_and_request_approval
```

This contract creates a portfolio story with traceable boundaries:
Don't ship a classifier project as a notebook screenshot. Keep these files or equivalent artifacts together:
| Artifact | Question it answers |
|---|---|
| label_guide.md | What does human_review_now mean, including edge cases? |
| data_card.md | Where did rows come from and what traffic is missing? |
| Frozen split manifest | Did account or time leakage contaminate evaluation? |
| Baseline report | Did the encoder solve a measured baseline failure? |
| Training configuration | Which checkpoint, seed, hyperparameters, and label mapping produced the model? |
| Versioned evaluation rows | Did the frozen receipt match exactly, and which required-slice failures passed or blocked release? |
| Serving bundle | Are tokenizer, threshold, fallback, and model restored together? |
| Shadow-monitor plan | Which reviewed production sample can trigger rollback? |
Use shadow traffic before changing live routing. Record scores and proposed routes, retain human decisions, and compare missed-escalation rate by required slice. Roll back to manual review when a required slice misses its agreed floor; don't wait for average accuracy to fall.
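A shadow monitor along these lines could be sketched as follows. The log fields (`slice`, `proposed`, `human`), the zero-miss floor, and the decision strings are assumptions for illustration, not a prescribed schema.

```python
from collections import defaultdict

REQUIRED_SLICES = {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
}
MAX_MISS_RATE = 0.0  # hypothetical agreed floor: no missed escalation tolerated

# Hypothetical reviewed shadow sample: the classifier's proposal next to the human decision.
shadow_log = [
    {"slice": "return_window_exception", "proposed": "human_review_now", "human": "human_review_now"},
    {"slice": "return_window_exception", "proposed": "guarded_agent", "human": "human_review_now"},
    {"slice": "account_takeover", "proposed": "human_review_now", "human": "human_review_now"},
    {"slice": "routine_delivery_status", "proposed": "guarded_agent", "human": "guarded_agent"},
]

def missed_rate_by_slice(log: list[dict[str, str]]) -> dict[str, float]:
    """Share of human-confirmed escalations the classifier would have sent to the agent."""
    positives: dict[str, int] = defaultdict(int)
    misses: dict[str, int] = defaultdict(int)
    for row in log:
        if row["slice"] in REQUIRED_SLICES and row["human"] == "human_review_now":
            positives[row["slice"]] += 1
            misses[row["slice"]] += row["proposed"] != "human_review_now"
    return {name: misses[name] / positives[name] for name in positives}

rates = missed_rate_by_slice(shadow_log)
breaches = sorted(name for name, rate in rates.items() if rate > MAX_MISS_RATE)
unobserved = sorted(REQUIRED_SLICES - set(rates))
decision = "rollback_to_manual_review" if breaches else "continue_shadow"

print("missed-escalation rate by slice:", rates)
print("unobserved required slices:", unobserved)
print("decision:", decision, breaches)
```

A required slice with no reviewed positives is unobserved, not passing, which is why the sketch reports it separately instead of counting it as a zero miss rate.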
Run the relevant cells again after each mutation. Revert one mutation before starting the next.
- Inspect intake_bundle_incomplete. Why does it hold even though every remaining score clears threshold 0.35?
- Inspect intake_bundle_padded, intake_bundle_duplicated, and intake_bundle_drifted. Why can't easy extras, duplicate rows, or a newer dataset silently change the release decision?
- Change the intake_bundle_v2 threshold from 0.35 to 0.50. Which required fixture blocks shadow eligibility?
- Change fallback_route to guarded_agent. Which serving-fallback.py assertions fail, and why is that unsafe?
- Remove the bundle_version check from admitted_by_pinned_bundle. Which stale route artifact now reaches the agent?

| Symptom | Cause | Correction |
|---|---|---|
| Accuracy is high while return-window exceptions are missed. | Aggregate metric hides asymmetric cost. | Gate required slices on false negatives and show the rows. |
| Candidate score improves after easy status rows are appended. | Receipt inventory wasn't frozen before aggregation. | Reject missing, unexpected, duplicate, or drifted receipt rows before calculating metrics. |
| Fine-tuning appears to help, but validation contains the same accounts as training. | Thread or account leakage. | Group by account before time holdout and freeze the manifest. |
| A score is described as probability without retained evidence. | Softmax was confused with calibration. | Say score, audit held-out buckets, and calibrate only with adequate labels. |
| Rollback changes model weights but traffic behavior remains unsafe. | Threshold or tokenizer was deployed separately. | Version and restore full serving bundle. |
| Production agent handles a return-window exception. | Classifier route was treated as advisory after a failed gate. | Fail closed to human review and test intake boundary. |
You now have a classifier artifact that can be consumed rather than merely discussed: intake_bundle_v2 deploys encoder_v1 with a reviewed threshold, admits routine tickets to a guarded workflow, and blocks risky or failed inputs to human review. The next capstone must respect that boundary while combining it with policy retrieval, approval gates, traces, and executable evaluation.
[1] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[2] Hugging Face. (2026). Text Classification. Official documentation.
[3] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.
[4] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.