Capstone: Fine-Tuned Classifier

Train and gate a support-ticket encoder that exports exact-receipt evidence and safe intake decisions to a production agent.

The evaluation dashboard in the previous capstone made one rule visible: a release is blocked by the failures that matter, even when its aggregate score looks good. Now apply that rule to a learned model.

Your product already has a document QA service that answers policy questions with permitted citations. The next product will be an agent that can draft a support response and request approval. This capstone builds the intake classifier in between:

Given a new refund-support ticket, should it enter human_review_now, or may it enter the guarded agent workflow?

Consider this ticket:

FastShip delivered my tablet after the return window closed. The portal rejected my return. Please make an exception.

Missing this exception is not the same as sending one harmless package-status ticket to a human. The classifier therefore does routing, not authorization. It never decides refund eligibility, sends a reply, or moves money. It protects the agent entry point.

Figure: six held-out support-ticket scores from encoder_v1 on a zero-to-one axis. Routine return-policy and delivery-status tickets score 0.18 and 0.23. The required return-window exception scores 0.42, between the intake bundle v2 threshold of 0.35 and the intake bundle v1 threshold of 0.50. Account takeover scores 0.64, angry routine delivery status 0.71, and damaged-package exception 0.89. Bundle v1 is held because threshold 0.50 misses the return-window exception. Bundle v2 routes all three required positives to human review, accepts one routine false positive, and is eligible only for shadow traffic.

Caption: Train a classifier only after the label is defined, then export a versioned routing contract only after its frozen validation receipt and required risk slices pass.

By the end, you will have:

- a versioned label guide and split rule
- a baseline whose failure justifies learning contextual representations
- a small PyTorch encoder training loop that exposes the mechanics
- a practical Hugging Face fine-tuning recipe for a pretrained encoder
- held-out threshold and slice gates
- a serving bundle the production-agent capstone can consume

BERT popularized the pattern of adapting a pretrained bidirectional encoder to a downstream classification task with a small output layer.^[1] Hugging Face exposes the same production-oriented path through sequence-classification models and training utilities.^[2] The important discipline here is not selecting a fashionable checkpoint. It is proving that a trained classifier satisfies a routing contract.

Define The Routing Contract

Use one binary label:

Label	Route	Meaning
`1`	`human_review_now`	Money, access, policy-exception, or time-critical risk requires a person before any agent workflow.
`0`	`guarded_agent`	A routine request may enter the agent, which still follows retrieval evidence and approval rules.

That second row matters. Label 0 does not mean "approve automatically." It means "safe to enter an automation path with its own controls."

Figure: a ticket receives a score, and the threshold check selects a route.

The model produces a score. The pinned threshold turns that score into an intake route. Either route preserves the agent's downstream evidence and approval boundaries.

Connect The Portfolio Artifacts

Capstone artifact	What it already proved	What this classifier needs from it
Document QA product	An answer must cite permitted policy evidence.	Routine tickets sent to the agent can ask policy questions safely.
Evaluation dashboard	Hard failures block a release even when averages improve.	Missed urgent slices must block a classifier release.
Fine-tuned classifier	Ticket becomes a risk route under a pinned threshold.	Export the intake contract for the production agent.

Start with labeled fixtures that can be reviewed before any model is trained.

label-contract.py

label_version = "escalation-policy-v1"
fixtures = [
    {
        "id": "damaged_package_exception",
        "text": "The tablet arrived cracked, but the return portal says final sale.",
        "gold_route": "human_review_now",
        "reason": "damaged_package_exception",
    },
    {
        "id": "return_window_exception",
        "text": "FastShip delivered after the return window closed. Please allow an exception.",
        "gold_route": "human_review_now",
        "reason": "return_window_exception",
    },
    {
        "id": "account_takeover",
        "text": "Someone changed my delivery address and I can't sign in.",
        "gold_route": "human_review_now",
        "reason": "account_takeover",
    },
    {
        "id": "routine_delivery_status",
        "text": "Where is my package?",
        "gold_route": "guarded_agent",
        "reason": "routine_delivery_status",
    },
    {
        "id": "return_policy_question",
        "text": "What is the return policy for damaged tablets?",
        "gold_route": "guarded_agent",
        "reason": "return_policy_question",
    },
]

allowed_routes = {"human_review_now", "guarded_agent"}
assert all(row["gold_route"] in allowed_routes for row in fixtures)
assert {row["reason"] for row in fixtures if row["gold_route"] == "human_review_now"} == {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
}

print("label guide:", label_version)
for row in fixtures:
    print(f"{row['id']}: {row['gold_route']} ({row['reason']})")

Output

label guide: escalation-policy-v1
damaged_package_exception: human_review_now (damaged_package_exception)
return_window_exception: human_review_now (return_window_exception)
account_takeover: human_review_now (account_takeover)
routine_delivery_status: guarded_agent (routine_delivery_status)
return_policy_question: guarded_agent (return_policy_question)

A label guide should also name ambiguous cases. A frustrated tone alone is not a reason for immediate human routing. A calm message requesting a return-window exception may be. Have support operations adjudicate disagreements and store that policy version alongside any dataset generated from it.
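
One way to find the cases that need adjudication is to compare two independent labeling passes before any training run. The sketch below is illustrative only: the annotator fields, the `calm_exception_request` ticket ID, and every label in it are hypothetical and are not part of the lesson's dataset.

```python
# Hypothetical double-labeled rows: two annotators apply the same guide independently.
label_version = "escalation-policy-v1"
double_labeled = [
    {"id": "angry_delivery_status", "annotator_a": "human_review_now", "annotator_b": "guarded_agent"},
    {"id": "return_window_exception", "annotator_a": "human_review_now", "annotator_b": "human_review_now"},
    {"id": "calm_exception_request", "annotator_a": "guarded_agent", "annotator_b": "human_review_now"},
    {"id": "routine_delivery_status", "annotator_a": "guarded_agent", "annotator_b": "guarded_agent"},
]

# Rows where the two passes differ are queued for support operations to adjudicate.
needs_adjudication = [
    row["id"] for row in double_labeled if row["annotator_a"] != row["annotator_b"]
]
disagreement_rate = len(needs_adjudication) / len(double_labeled)

print("label guide:", label_version)
print("needs adjudication:", ",".join(needs_adjudication))
print(f"disagreement rate: {disagreement_rate:.2f}")
```

The adjudicated decision, not either annotator's first pass, is what should be stored with the policy version.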

Build An Honest Dataset

This lesson uses a hypothetical teaching dataset, not measured production traffic:

Field	Teaching dataset choice
Source	Synthetic refund-support tickets modeled after return, delivery, and account-access requests
Label version	`escalation-policy-v1`
Exclusions	Spam, internal QA, tickets without customer text
High-risk slices	`damaged_package_exception`, `return_window_exception`, `account_takeover`
Split policy	Earlier accounts train; later, unseen accounts validate and test
Release rule	No missed positive in required validation slices

In a real data card, replace every teaching assumption with provenance, consent and retention policy, labeling instructions, class balance, disagreement rate, and known coverage gaps.

Split Before Learning

Suppose multiple messages from the same customer account share product names, carrier wording, or a prior return incident. Randomly placing one of those messages in training and another in validation lets the model recognize the account instead of the risk, so the classifier looks better than it will on a new account. Group by account first, then hold out later traffic.

grouped-time-split.py

records = [
    {"ticket": "t01", "account": "acme", "month": 1, "split": "train"},
    {"ticket": "t02", "account": "acme", "month": 2, "split": "train"},
    {"ticket": "t03", "account": "north", "month": 2, "split": "train"},
    {"ticket": "t04", "account": "north", "month": 2, "split": "train"},
    {"ticket": "v01", "account": "summit", "month": 3, "split": "validation"},
    {"ticket": "v02", "account": "summit", "month": 3, "split": "validation"},
    {"ticket": "x01", "account": "harbor", "month": 4, "split": "test"},
    {"ticket": "x02", "account": "harbor", "month": 4, "split": "test"},
]

accounts = {
    split: {row["account"] for row in records if row["split"] == split}
    for split in ("train", "validation", "test")
}
assert accounts["train"].isdisjoint(accounts["validation"])
assert accounts["train"].isdisjoint(accounts["test"])
assert accounts["validation"].isdisjoint(accounts["test"])

for split in ("train", "validation", "test"):
    count = sum(row["split"] == split for row in records)
    names = ",".join(sorted(accounts[split]))
    print(f"{split}: {count} tickets, accounts={names}")

Output

train: 4 tickets, accounts=acme,north
validation: 2 tickets, accounts=summit
test: 2 tickets, accounts=harbor

Use validation data to pick features, checkpoints, and threshold. Open the test set once for the final report after choices are frozen. A small, curated fixture set can reveal specific failures, but it can't establish broad production performance by itself.
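
The records above carry a precomputed split field. A minimal sketch of how such an assignment could be derived, assuming each account is placed by the month it was first seen. The function name `assign_splits` and the cutoff months 3 and 4 are illustrative choices, not part of the lesson's tooling.

```python
# Same tickets as the fixture above: (ticket, account, month).
tickets = [
    ("t01", "acme", 1), ("t02", "acme", 2),
    ("t03", "north", 2), ("t04", "north", 2),
    ("v01", "summit", 3), ("v02", "summit", 3),
    ("x01", "harbor", 4), ("x02", "harbor", 4),
]

def assign_splits(rows, validation_from: int, test_from: int) -> dict[str, str]:
    """Place every ticket of an account in one split, chosen by the account's first month."""
    first_seen: dict[str, int] = {}
    for _, account, month in rows:
        first_seen[account] = min(month, first_seen.get(account, month))

    def split_for(account: str) -> str:
        month = first_seen[account]
        if month >= test_from:
            return "test"
        if month >= validation_from:
            return "validation"
        return "train"

    # Every ticket inherits its account's split, so accounts never straddle splits.
    return {ticket: split_for(account) for ticket, account, _ in rows}

splits = assign_splits(tickets, validation_from=3, test_from=4)
for ticket, account, month in tickets:
    print(f"{ticket}: account={account} month={month} split={splits[ticket]}")
```

Keying on the first-seen month keeps accounts disjoint, but a long-lived account's later tickets stay in train. If strict time ordering matters more, drop those later tickets instead of training on them.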

Establish A Cheap Baseline

A transformer has to earn its complexity. Start with a deterministic rule or a bag-of-words logistic-regression baseline. If a simple baseline satisfies the routing gate on representative held-out data, shipping a neural model may add maintenance without adding safety.

Here the rule catches an explicit damaged-package phrase and misses an indirect return-window exception. That failure gives a contextual model a job worth doing.

keyword-baseline.py

validation = [
    ("explicit_damage_exception", "Tablet arrived cracked and the return portal denied it.", 1),
    ("indirect_return_exception", "FastShip delivered after the return window closed.", 1),
    ("routine_delivery_status", "Where is my package?", 0),
    ("routine_policy", "Where can I read the return policy?", 0),
]
terms = ("arrived cracked", "can't sign in", "seller exception")

predictions = []
for fixture_id, text, label in validation:
    predicted = int(any(term in text.lower() for term in terms))
    predictions.append((fixture_id, label, predicted))

positives = sum(label for _, label, _ in predictions)
true_positives = sum(label == 1 and predicted == 1 for _, label, predicted in predictions)
missed = [fixture_id for fixture_id, label, predicted in predictions if label == 1 and predicted == 0]

print(f"positive recall: {true_positives}/{positives}")
print("missed positive:", missed[0])
print("next experiment: learn context beyond trigger words")

Output

positive recall: 1/2
missed positive: indirect_return_exception
next experiment: learn context beyond trigger words

A TF-IDF plus logistic-regression baseline is a useful next step when the dataset grows: it learns weighted text features while remaining cheap to inspect and deploy.^[3] Don't claim a fine-tuned encoder is cheaper, faster, or more accurate than prompting, or than this baseline, until you measure those candidates on the same task and hardware.
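
To make that baseline concrete, here is a dependency-free sketch of TF-IDF features feeding a logistic regression trained by batch gradient descent. A project would more likely use a library implementation. The toy training rows, the learning rate of 0.3, and the 300 steps are assumptions for illustration, and nothing here is a measured result.

```python
import math
import re
from collections import Counter

# Toy training rows (assumed for illustration): 1 = human_review_now, 0 = guarded_agent.
train = [
    ("cracked tablet return portal denied", 1),
    ("carrier delivered after return window", 1),
    ("delivery address changed cannot sign in", 1),
    ("where is package", 0),
    ("resend return label", 0),
    ("show return policy", 0),
]

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

docs = [tokens(text) for text, _ in train]
vocab = sorted({token for doc in docs for token in doc})
doc_freq = Counter(token for doc in docs for token in set(doc))
# Smoothed inverse document frequency: rarer terms receive larger weights.
idf = {token: math.log((1 + len(docs)) / (1 + doc_freq[token])) + 1.0 for token in vocab}

def featurize(text: str) -> list[float]:
    """L2-normalized TF-IDF vector; tokens outside the vocabulary are ignored."""
    counts = Counter(token for token in tokens(text) if token in idf)
    raw = [counts[token] * idf[token] for token in vocab]
    norm = math.sqrt(sum(value * value for value in raw)) or 1.0
    return [value / norm for value in raw]

def score(weights: list[float], bias: float, features: list[float]) -> float:
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

features = [featurize(text) for text, _ in train]
labels = [label for _, label in train]
weights, bias = [0.0] * len(vocab), 0.0
for _ in range(300):  # plain batch gradient descent on the log loss
    errors = [score(weights, bias, x) - y for x, y in zip(features, labels)]
    for j in range(len(vocab)):
        weights[j] -= 0.3 * sum(e * x[j] for e, x in zip(errors, features))
    bias -= 0.3 * sum(errors)

# The learned weights are directly inspectable, which is the baseline's main appeal.
ranked = sorted(zip(weights, vocab), reverse=True)
print("pushes toward human review:", [term for _, term in ranked[:3]])
print("pushes toward guarded agent:", [term for _, term in ranked[-3:]])
probe = "FastShip delivered after the return window closed."
print(f"probe score: {score(weights, bias, featurize(probe)):.2f}")
```

Fitting six training rows says nothing about held-out behavior. The point is that each term's weight can be read and challenged, which a neural encoder does not offer as directly.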

From Logits To A Route

An encoder classifier emits logits: one unnormalized score for each class. Softmax converts two logits into class scores that sum to one. A separate decision threshold converts the positive-class score into the business route.

Use numerically stable softmax by subtracting the largest logit before exponentiating.

stable-softmax.py

from math import exp

examples = [
    ("return_window_closed", [-1.0, 2.0]),
    ("delivery_status", [1.0, -1.0]),
]

def softmax(logits: list[float]) -> list[float]:
    offset = max(logits)
    values = [exp(value - offset) for value in logits]
    total = sum(values)
    return [value / total for value in values]

for fixture_id, logits in examples:
    routine, human = softmax(logits)
    print(f"{fixture_id}: human_review_score={human:.2f}, routine_score={routine:.2f}")

Output

return_window_closed: human_review_score=0.95, routine_score=0.05
delivery_status: human_review_score=0.12, routine_score=0.88

Call 0.95 a score until held-out reliability evidence supports calling it a probability. A bounded output is not proof of calibration.
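
With exactly two classes, the positive-class softmax score equals a sigmoid of the logit difference, which makes the separation between score and route easy to see. The sketch below assumes the logit order `[routine, human]` used above. The threshold of 0.50 is only a placeholder here, because the lesson selects the real threshold on validation rows.

```python
from math import exp

def human_review_score(logits: list[float]) -> float:
    """Positive-class softmax score for logits ordered [routine, human]."""
    offset = max(logits)
    routine, human = (exp(value - offset) for value in logits)
    return human / (routine + human)

def sigmoid_of_margin(logits: list[float]) -> float:
    """The same score written as a numerically stable sigmoid of (human - routine)."""
    margin = logits[1] - logits[0]
    if margin >= 0:
        return 1.0 / (1.0 + exp(-margin))
    scaled = exp(margin)
    return scaled / (1.0 + scaled)

def route(logits: list[float], threshold: float) -> str:
    """The threshold, not the model, turns a score into a business route."""
    if human_review_score(logits) >= threshold:
        return "human_review_now"
    return "guarded_agent"

PLACEHOLDER_THRESHOLD = 0.50  # Assumption for illustration only.
for fixture_id, logits in [("return_window_closed", [-1.0, 2.0]), ("delivery_status", [1.0, -1.0])]:
    print(f"{fixture_id}: score={human_review_score(logits):.2f} route={route(logits, PLACEHOLDER_THRESHOLD)}")
```

Moving the threshold changes the route without touching the logits, which is why the two are versioned separately later in this lesson.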

Train The Mechanics In PyTorch

Before reaching for a downloaded checkpoint, make the trainable path concrete. The sandbox below builds a tiny word-embedding encoder, averages token vectors, attaches a classification head, and updates every parameter with cross-entropy loss. This is training a text classifier from scratch, not fine-tuning BERT. It exists so you can inspect the mechanics before replacing random embeddings with pretrained representations.

tiny-encoder-training-loop.py

import torch
from torch import nn

torch.manual_seed(7)

training_rows = [
    ("cracked tablet return portal denied", 1),
    ("carrier delivered after return window", 1),
    ("delivery address changed cannot sign in", 1),
    ("where is package", 0),
    ("resend return label", 0),
    ("show return policy", 0),
]
vocabulary = {"<pad>": 0}
for text, _ in training_rows:
    for token in text.split():
        vocabulary.setdefault(token, len(vocabulary))

max_length = max(len(text.split()) for text, _ in training_rows)
input_ids = []
for text, _ in training_rows:
    ids = [vocabulary[token] for token in text.split()]
    input_ids.append(ids + [0] * (max_length - len(ids)))
inputs = torch.tensor(input_ids)
labels = torch.tensor([label for _, label in training_rows])

class TinyEncoderClassifier(nn.Module):
    def __init__(self, vocab_size: int) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 8, padding_idx=0)
        self.head = nn.Linear(8, 2)

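    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Padding positions (id 0) are masked out so they don't dilute the mean.
        mask = (token_ids != 0).unsqueeze(-1)
        embedded = self.embedding(token_ids) * mask
        # Mean-pool over real tokens; the clamp avoids dividing by zero on an all-padding row.
        pooled = embedded.sum(dim=1) / mask.sum(dim=1).clamp_min(1)
        return self.head(pooled)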

model = TinyEncoderClassifier(len(vocabulary))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
initial_loss = criterion(model(inputs), labels).item()
original_head = model.head.weight.detach().clone()

for _ in range(80):
    loss = criterion(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_logits = model(inputs)
final_loss = criterion(final_logits, labels).item()
print("logit shape:", tuple(final_logits.shape))
print("loss decreased:", final_loss < initial_loss)
print("classification head updated:", not torch.equal(original_head, model.head.weight))

Output

logit shape: (6, 2)
loss decreased: True
classification head updated: True

The sandbox can memorize six tiny training examples. That is not deployment evidence. It proves that text becomes tokens, tokens become pooled representations, a head produces logits, and gradients update those weights. Held-out evaluation remains the release evidence.

Fine-Tune A Pretrained Encoder

For the capstone artifact, replace random embeddings with a pretrained encoder such as DistilBERT and fine-tune it for two labels. BERT-style encoders read context on both sides of a token and are designed to be adapted to tasks such as sequence classification.^[1] In the Transformers task interface, AutoModelForSequenceClassification supplies the classification head while Trainer runs optimization and evaluation.^[2]

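The following is a project script rather than a copy-and-run cell: it requires your local JSONL splits, a downloaded checkpoint, and the transformers, datasets, and evaluate packages, along with PyTorch, NumPy, and scikit-learn, which the f1 metric relies on. The eval_strategy and processing_class argument names assume a recent Transformers release; older releases named them evaluation_strategy and tokenizer. Each JSON Lines (JSONL) row needs text and integer label fields, where 0 means guarded_agent and 1 means human_review_now.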

train_distilbert.py

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert/distilbert-base-uncased"
dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",
        "validation": "data/validation.jsonl",
        "test": "data/test.jsonl",
    },
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "guarded_agent", 1: "human_review_now"},
    label2id={"guarded_agent": 0, "human_review_now": 1},
)
metric = evaluate.load("f1")

def compute_metrics(prediction):
    logits, labels = prediction.predictions, prediction.label_ids
    predicted = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predicted, references=labels, average="binary")

arguments = TrainingArguments(
    output_dir="artifacts/encoder_v1",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    processing_class=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate(encoded["validation"]))
trainer.save_model("artifacts/encoder_v1/model")
tokenizer.save_pretrained("artifacts/encoder_v1/model")

Two boundaries keep this honest:

- Fine-tuning doesn't replace label review. It fits the guide it receives, including guide mistakes.
- Highest validation F1 doesn't automatically ship. The candidate still has to pass the high-risk slice gate and serving contract.

Select A Threshold On Validation Rows

Don't accept a library default threshold without checking its error tradeoff. Suppose encoder_v1 exports these validation scores for human_review_now:

Figure: threshold sweep over six frozen classifier fixtures. At threshold 0.35 all three required positives route to human review, one angry routine status ticket is a false positive, and recall is 100 percent, so the bundle is selected for shadow traffic. At threshold 0.50 the return-window exception is missed and recall falls to 67 percent. At threshold 0.75 both return-window and account-takeover fixtures are missed and recall falls to 33 percent, even though false positives reach zero.

Caption: A threshold is a release policy: high-risk tickets go to a person, while admitted routine tickets may enter the guarded agent.

Fixture	Slice	Gold	Score
`damaged_package_exception`	`damaged_package_exception`	1	`0.89`
`return_window_exception`	`return_window_exception`	1	`0.42`
`account_takeover`	`account_takeover`	1	`0.64`
`angry_delivery_status`	`routine_delivery_status`	0	`0.71`
`routine_delivery_status`	`routine_delivery_status`	0	`0.23`
`return_policy_question`	`return_policy_question`	0	`0.18`

threshold-sweep.py

rows = [
    ("damaged_package_exception", 1, 0.89),
    ("return_window_exception", 1, 0.42),
    ("account_takeover", 1, 0.64),
    ("angry_delivery_status", 0, 0.71),
    ("routine_delivery_status", 0, 0.23),
    ("return_policy_question", 0, 0.18),
]

def summarize(threshold: float) -> tuple[int, int, int, float]:
    tp = fp = fn = 0
    for _, gold, score in rows:
        predicted = int(score >= threshold)
        tp += predicted == 1 and gold == 1
        fp += predicted == 1 and gold == 0
        fn += predicted == 0 and gold == 1
    recall = tp / (tp + fn)
    return tp, fp, fn, recall

for threshold in (0.35, 0.50, 0.75):
    tp, fp, fn, recall = summarize(threshold)
    print(f"threshold={threshold:.2f} tp={tp} fp={fp} fn={fn} recall={recall:.2f}")

Output

threshold=0.35 tp=3 fp=1 fn=0 recall=1.00
threshold=0.50 tp=2 fp=1 fn=1 recall=0.67
threshold=0.75 tp=1 fp=0 fn=2 recall=0.33

On these fixtures, threshold 0.35 is the only candidate shown that avoids a missed escalation. It sends one angry but routine delivery-status ticket to human review. That tradeoff is acceptable only if queue capacity allows it and the same rule survives a larger held-out dataset.

Don't report 1.00 recall from three positive fixtures as proof of production reliability. It is evidence for these validation fixtures, not for unseen traffic.
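
A short calculation shows how little three positives can support. If all n positives were caught and tickets are treated as independent draws, a true recall r stays plausible at level α as long as r^n ≥ α, which gives the one-sided lower bound r ≥ α^(1/n). The sketch assumes α = 0.05, and the sample sizes 30 and 300 are illustrative.

```python
def recall_lower_bound(caught_all_of: int, alpha: float = 0.05) -> float:
    """One-sided exact binomial lower bound on recall when every positive was caught."""
    return alpha ** (1.0 / caught_all_of)

for positives in (3, 30, 300):
    bound = recall_lower_bound(positives)
    print(f"positives={positives}: true recall could plausibly be as low as {bound:.2f}")
```

With three positives the bound is about 0.37, so a perfect score on this receipt is compatible with a classifier that misses most escalations in production. The independence assumption is itself optimistic for tickets from related accounts.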

Make The Hard Gate Executable

The dashboard capstone compared runs only when dataset, grader, corpus, and fixture inventory matched. Carry that discipline here. A classifier comparison receipt pins its dataset, label guide, split manifest, model, and exact validation fixture IDs before scores are aggregated.

intake_bundle_v1 is blocked because its encoder_v1 score for return_window_exception falls below the initially proposed threshold. intake_bundle_v2 keeps the same trained encoder but pins a reviewed threshold, then becomes eligible for shadow traffic, not immediate autonomous routing. A threshold change versions the serving bundle, not the model weights.

exact-receipt-release-gate.py

from collections import Counter

EXPECTED_IDENTITY = {
    "dataset_version": "refund-intake-validation-v1",
    "label_version": "escalation-policy-v1",
    "split_manifest_version": "refund-intake-split-2026-05",
    "model_version": "encoder_v1",
}
EXPECTED_FIXTURES = {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
    "angry_delivery_status",
    "routine_delivery_status",
    "return_policy_question",
}
REQUIRED_SLICES = {
    "damaged_package_exception",
    "return_window_exception",
    "account_takeover",
}
validation_rows = [
    {"fixture_id": "damaged_package_exception", "slice": "damaged_package_exception", "gold": 1, "score": 0.89},
    {"fixture_id": "return_window_exception", "slice": "return_window_exception", "gold": 1, "score": 0.42},
    {"fixture_id": "account_takeover", "slice": "account_takeover", "gold": 1, "score": 0.64},
    {"fixture_id": "angry_delivery_status", "slice": "routine_delivery_status", "gold": 0, "score": 0.71},
    {"fixture_id": "routine_delivery_status", "slice": "routine_delivery_status", "gold": 0, "score": 0.23},
    {"fixture_id": "return_policy_question", "slice": "return_policy_question", "gold": 0, "score": 0.18},
]

def receipt(threshold: float, rows: list[dict[str, object]], **overrides: str) -> dict[str, object]:
    return {**EXPECTED_IDENTITY, **overrides, "threshold": threshold, "rows": rows}

def release_decision(run: dict[str, object]) -> tuple[str, str]:
    for field, expected in EXPECTED_IDENTITY.items():
        if run[field] != expected:
            return "hold", f"{field}:{run[field]}"
    rows = run["rows"]
    fixture_ids = [str(row["fixture_id"]) for row in rows]
    counts = Counter(fixture_ids)
    missing = sorted(EXPECTED_FIXTURES - set(fixture_ids))
    unexpected = sorted(set(fixture_ids) - EXPECTED_FIXTURES)
    duplicated = sorted(fixture_id for fixture_id, count in counts.items() if count != 1)
    if missing:
        return "hold", f"missing:{','.join(missing)}"
    if unexpected:
        return "hold", f"unexpected:{','.join(unexpected)}"
    if duplicated:
        return "hold", f"duplicate:{','.join(duplicated)}"
    misses = [
        str(row["fixture_id"])
        for row in rows
        if row["slice"] in REQUIRED_SLICES
        and row["gold"] == 1
        and row["score"] < run["threshold"]
    ]
    if misses:
        return "hold", f"missed:{','.join(misses)}"
    return "eligible_for_shadow", "exact_receipt_pass"

runs = {
    "intake_bundle_v1": receipt(0.50, validation_rows),
    "intake_bundle_v2": receipt(0.35, validation_rows),
    "intake_bundle_incomplete": receipt(
        0.35,
        [row for row in validation_rows if row["fixture_id"] != "account_takeover"],
    ),
    "intake_bundle_padded": receipt(
        0.35,
        [*validation_rows, {"fixture_id": "easy_status_extra", "slice": "routine_delivery_status", "gold": 0, "score": 0.02}],
    ),
    "intake_bundle_duplicated": receipt(0.35, [*validation_rows, validation_rows[-1]]),
    "intake_bundle_drifted": receipt(0.35, validation_rows, dataset_version="refund-intake-validation-v2"),
}
for bundle_version, run in runs.items():
    decision, reason = release_decision(run)
    print(f"{bundle_version}: decision={decision} reason={reason}")

Output

intake_bundle_v1: decision=hold reason=missed:return_window_exception
intake_bundle_v2: decision=eligible_for_shadow reason=exact_receipt_pass
intake_bundle_incomplete: decision=hold reason=missing:account_takeover
intake_bundle_padded: decision=hold reason=unexpected:easy_status_extra
intake_bundle_duplicated: decision=hold reason=duplicate:return_policy_question
intake_bundle_drifted: decision=hold reason=dataset_version:refund-intake-validation-v2

Lowering the threshold is not automatically the right fix. If false positives overwhelm reviewers, improve labels or model behavior instead. For this teaching receipt, the important move is refusing to hide a required-slice miss inside an average or to pad the comparison with easy rows.

Audit Scores Without Overclaiming Calibration

A threshold can be useful even when the score is not a well-calibrated probability. Calibration asks whether tickets scored near 0.70, across enough held-out examples, need human review about 70 percent of the time. Modern neural classifiers can be miscalibrated; temperature scaling is one common post-training adjustment evaluated on held-out data.^[4]

Six fixtures are enough to demonstrate a calculation, not to justify customer-facing percentages:

calibration-diagnostic.py

scored_rows = [
    (0.89, 1),
    (0.64, 1),
    (0.42, 1),
    (0.71, 0),
    (0.23, 0),
    (0.18, 0),
]
buckets = {
    "low [0.0,0.4)": [(score, gold) for score, gold in scored_rows if score < 0.4],
    "middle [0.4,0.7)": [(score, gold) for score, gold in scored_rows if 0.4 <= score < 0.7],
    "high [0.7,1.0]": [(score, gold) for score, gold in scored_rows if score >= 0.7],
}

for name, bucket in buckets.items():
    average_score = sum(score for score, _ in bucket) / len(bucket)
    observed_rate = sum(gold for _, gold in bucket) / len(bucket)
    print(f"{name}: n={len(bucket)}, score={average_score:.2f}, observed={observed_rate:.2f}")
print("decision: diagnostic only; collect more held-out labels")

Output

low [0.0,0.4): n=2, score=0.21, observed=0.00
middle [0.4,0.7): n=2, score=0.53, observed=1.00
high [0.7,1.0]: n=2, score=0.80, observed=0.50
decision: diagnostic only; collect more held-out labels

The high bucket exposes why calibrated language matters: an average score of 0.80 with one positive among two tickets is not evidence that 0.80 means an 80 percent escalation rate. Keep displaying a routing action until a larger retained set supports probability claims.
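
Temperature scaling, mentioned above, divides the logit by a single scalar T before the sigmoid and picks T to minimize negative log-likelihood on held-out rows. The sketch below recovers each logit from its score and grid-searches T. The grid bounds are an assumption, and fitting on the same six fixtures is done only to show the calculation, not to produce a usable temperature.

```python
from math import exp, log

# (score, gold) pairs from the six validation fixtures.
held_out = [(0.89, 1), (0.64, 1), (0.42, 1), (0.71, 0), (0.23, 0), (0.18, 0)]

def logit(probability: float) -> float:
    return log(probability / (1.0 - probability))

def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + exp(-value))

def nll(temperature: float, rows: list[tuple[float, int]]) -> float:
    """Mean negative log-likelihood after dividing each logit by the temperature."""
    total = 0.0
    for score, gold in rows:
        scaled = sigmoid(logit(score) / temperature)
        total -= log(scaled if gold == 1 else 1.0 - scaled)
    return total / len(rows)

# Assumed search grid: T from 0.5 to 3.0 in steps of 0.1 (T = 1.0 leaves scores unchanged).
grid = [step / 10 for step in range(5, 31)]
best_temperature = min(grid, key=lambda t: nll(t, held_out))

print(f"nll at T=1.0: {nll(1.0, held_out):.3f}")
print(f"best T on this grid: {best_temperature:.1f}, nll: {nll(best_temperature, held_out):.3f}")
print("decision: diagnostic only; fit and report on separate, larger held-out sets")
```

Because dividing by a positive T preserves the ordering of scores, temperature scaling changes what a score means without changing which tickets rank above others. Any pinned threshold would need to be re-selected on the rescaled scores.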

Package The Serving Contract

The shipped product is not only model weights. It is a pinned bundle:

Component	Versioned value
Label guide	`escalation-policy-v1`
Validation receipt	`refund-intake-validation-v1`, six exact fixtures
Dataset split	`refund-intake-split-2026-05`
Tokenizer and model	`encoder_v1`
Serving release	`intake_bundle_v2`
Input policy	`refund-intake-input-v1`: trim outer whitespace, preserve punctuation, maximum `256` tokens
Threshold	`0.35`, chosen on frozen validation fixtures
Failure fallback	`human_review_now`
Downstream consumer	`refund_agent_v2`

Figure: pinned classifier intake contract for intake_bundle_v2. The manifest versions escalation-policy-v1, refund-intake-validation-v1 with six exact fixtures, refund-intake-split-2026-05, encoder_v1, the 256-token input policy, threshold 0.35, human-review fallback, and refund_agent_v2. Blank input, model failure, and invalid scores fail closed to human review; valid scores at or above 0.35 also route to a person, while lower valid scores may enter the guarded agent. The agent accepts r-104, blocks r-105 because its route is human review, and blocks r-106 because its bundle provenance is stale. Rollback restores the entire contract rather than model weights alone.

Caption: Deploy the threshold and fallback with the model. A rollback that restores only weights doesn't restore behavior.

The endpoint should fail closed. A blank ticket, unavailable model, or malformed score is a reason to request human review rather than send uncertain traffic into the agent.

serving-fallback.py

from math import isfinite

bundle = {
    "bundle_version": "intake_bundle_v2",
    "label_version": "escalation-policy-v1",
    "validation_receipt_version": "refund-intake-validation-v1",
    "split_manifest_version": "refund-intake-split-2026-05",
    "tokenizer_version": "encoder_v1",
    "model_version": "encoder_v1",
    "input_policy_version": "refund-intake-input-v1",
    "max_length": 256,
    "threshold": 0.35,
    "fallback_route": "human_review_now",
}

def route_ticket(text: str, score: float | None) -> tuple[str, str]:
    if not text.strip():
        return bundle["fallback_route"], "empty_text"
    if score is None:
        return bundle["fallback_route"], "model_unavailable"
    if not isfinite(score) or not 0.0 <= score <= 1.0:
        return bundle["fallback_route"], "invalid_score"
    if score >= bundle["threshold"]:
        return "human_review_now", "threshold"
    return "guarded_agent", "below_threshold"

checks = [
    ("Carrier delivered after the return window", 0.72, "human_review_now", "threshold"),
    ("Where is the return policy?", 0.18, "guarded_agent", "below_threshold"),
    ("", 0.04, "human_review_now", "empty_text"),
    ("Where is my package?", None, "human_review_now", "model_unavailable"),
    ("Malformed model score", float("nan"), "human_review_now", "invalid_score"),
]
print("bundle:", bundle["bundle_version"], bundle["validation_receipt_version"])
for text, score, expected_route, expected_reason in checks:
    route, reason = route_ticket(text, score)
    assert (route, reason) == (expected_route, expected_reason)
    print(f"{route}: {reason}")

Output

bundle: intake_bundle_v2 refund-intake-validation-v1
human_review_now: threshold
guarded_agent: below_threshold
human_review_now: empty_text
human_review_now: model_unavailable
human_review_now: invalid_score
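
One way to make "restore the whole contract" checkable is to fingerprint the behavior-defining fields of the bundle. This is a sketch under assumptions: the `bundle_fingerprint` helper, the choice of SHA-256, and the 12-character truncation are illustrative and not part of the lesson's tooling.

```python
import hashlib
import json

def bundle_fingerprint(contract: dict[str, object]) -> str:
    """Hash a canonical JSON form so key order cannot change the fingerprint."""
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Behavior-defining fields only; the human-readable bundle name is left out on purpose.
contract_v1 = {
    "model_version": "encoder_v1",
    "tokenizer_version": "encoder_v1",
    "input_policy_version": "refund-intake-input-v1",
    "max_length": 256,
    "threshold": 0.50,
    "fallback_route": "human_review_now",
}
contract_v2 = {**contract_v1, "threshold": 0.35}

print("same weights:", contract_v1["model_version"] == contract_v2["model_version"])
print("v1 fingerprint:", bundle_fingerprint(contract_v1))
print("v2 fingerprint:", bundle_fingerprint(contract_v2))
print("same behavior:", bundle_fingerprint(contract_v1) == bundle_fingerprint(contract_v2))
```

The two contracts share model weights but produce different fingerprints, so a deployment or rollback check that compares fingerprints would notice a threshold that was not restored with the model.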

Hand Off To The Production Agent

Only the guarded_agent route from the pinned bundle reaches the next capstone. That agent will still retrieve permitted policy evidence, draft rather than execute a refund, and ask for approval before any consequential action. The classifier gives it an intake boundary, not extra authority.

agent-intake-artifact.py

handoff = {
    "producer": "support_ticket_escalation_classifier",
    "bundle_version": "intake_bundle_v2",
    "model_version": "encoder_v1",
    "threshold": 0.35,
    "allowed_agent_route": "guarded_agent",
    "blocked_or_manual_route": "human_review_now",
    "policy_evidence_service": "document_qa_v2",
    "evaluation_report": "classifier_dashboard_run_intake_bundle_v2",
}

agent_inputs = [
    {"ticket_id": "r-104", "bundle_version": "intake_bundle_v2", "route": "guarded_agent"},
    {"ticket_id": "r-105", "bundle_version": "intake_bundle_v2", "route": "human_review_now"},
    {"ticket_id": "r-106", "bundle_version": "intake_bundle_v1", "route": "guarded_agent"},
]

def admitted_by_pinned_bundle(row: dict[str, str]) -> bool:
    return (
        row["bundle_version"] == handoff["bundle_version"]
        and row["route"] == handoff["allowed_agent_route"]
    )

accepted = [
    row["ticket_id"]
    for row in agent_inputs
    if admitted_by_pinned_bundle(row)
]
blocked = [
    row["ticket_id"]
    for row in agent_inputs
    if not admitted_by_pinned_bundle(row)
]

print("agent accepts:", ",".join(accepted))
print("agent blocked:", ",".join(blocked))
print("agent authority: draft_with_evidence_and_request_approval")

Output

agent accepts: r-104
agent blocked: r-105,r-106
agent authority: draft_with_evidence_and_request_approval

This contract creates a portfolio story with traceable boundaries:

- Document QA supplies evidence-bound policy answers.
- Exact-receipt evaluation rows and gates expose failures before release.
- The classifier blocks high-risk intake from automation.
- The agent uses only admitted routine intake and still needs approval for consequential actions.

Release Checklist

Don't ship a classifier project as a notebook screenshot. Keep these files or equivalent artifacts together:

Artifact	Question it answers
`label_guide.md`	What does `human_review_now` mean, including edge cases?
`data_card.md`	Where did rows come from and what traffic is missing?
Frozen split manifest	Did account or time leakage contaminate evaluation?
Baseline report	Did the encoder solve a measured baseline failure?
Training configuration	Which checkpoint, seed, hyperparameters, and label mapping produced the model?
Versioned evaluation rows	Did the frozen receipt match exactly, and which required-slice checks passed or blocked release?
Serving bundle	Are tokenizer, threshold, fallback, and model restored together?
Shadow-monitor plan	Which reviewed production sample can trigger rollback?

Use shadow traffic before changing live routing. Record scores and proposed routes, retain human decisions, and compare missed-escalation rate by required slice. Roll back to manual review when a required slice misses its agreed floor; don't wait for average accuracy to fall.
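
The rollback rule above can be sketched as a small check over reviewed shadow rows. This is a minimal illustration under stated assumptions: the slice names, the row fields (`slice`, `human_route`, `proposed_route`), and the limits in `max_missed_rate` are hypothetical, and the agreed floor is expressed here as a ceiling on the missed-escalation rate rather than as a recall floor.

```python
from collections import defaultdict

# Hypothetical reviewed shadow sample. Slice names, field names, and limits
# are illustrative assumptions, not values from the lesson.
shadow_rows = [
    {"slice": "return_window_exception", "human_route": "human_review_now", "proposed_route": "human_review_now"},
    {"slice": "return_window_exception", "human_route": "human_review_now", "proposed_route": "guarded_agent"},
    {"slice": "account_takeover", "human_route": "human_review_now", "proposed_route": "human_review_now"},
    {"slice": "routine_status", "human_route": "guarded_agent", "proposed_route": "guarded_agent"},
]

# Highest tolerated missed-escalation rate per required slice (assumed values).
max_missed_rate = {"return_window_exception": 0.0, "account_takeover": 0.0}


def missed_escalation_rate_by_slice(rows):
    """Share of human-escalated tickets that the bundle proposed to automate."""
    positives = defaultdict(int)
    missed = defaultdict(int)
    for row in rows:
        if row["human_route"] != "human_review_now":
            continue  # only reviewer-confirmed escalations can be missed
        positives[row["slice"]] += 1
        if row["proposed_route"] != "human_review_now":
            missed[row["slice"]] += 1
    return {name: missed[name] / positives[name] for name in positives}


def slices_requiring_rollback(rows, limits):
    rates = missed_escalation_rate_by_slice(rows)
    # A required slice with no reviewed positives is unproven, so it also trips.
    return sorted(
        name for name, limit in limits.items()
        if rates.get(name, float("inf")) > limit
    )


tripped = slices_requiring_rollback(shadow_rows, max_missed_rate)
print("rollback to manual review:", bool(tripped), tripped)
```

Note that the check never reads an average over all rows. One missed return-window exception trips the rollback even though three of the four proposed routes match the reviewer.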

Practice: Break The Intake Contract

Run the relevant cells again after each mutation. Revert each mutation before starting the next.

Inspect intake_bundle_incomplete. Why does it hold even though every remaining score clears threshold 0.35?
Inspect intake_bundle_padded, intake_bundle_duplicated, and intake_bundle_drifted. Why can't easy extras, duplicate rows, or a newer dataset silently change the release decision?
Raise intake_bundle_v2 threshold from 0.35 to 0.50. Which required fixture blocks shadow eligibility?
Change fallback_route to guarded_agent. Which serving-fallback.py assertions fail, and why is that unsafe?
Remove the bundle-version comparison from admitted_by_pinned_bundle. Which stale route artifact now reaches the agent?

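Mastery Check

What does the classifier authorize when it returns guarded_agent?
Why train the small PyTorch encoder before showing the pretrained fine-tuning script?
Why is encoder_v1 held even if its average metric looks acceptable?
Why should a score of 0.80 not automatically be displayed as an 80 percent risk?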

Evaluation Rubric

Beginner: Defines the label and explains why the classifier routes intake rather than authorizes action.
Applied: Trains a baseline and encoder, chooses threshold from frozen validation rows, and ships tokenizer, threshold, and fallback together.
Advanced: Versions evaluation rows, blocks required-slice false negatives, tests calibration cautiously, and supplies a shadow-monitor rollback plan.
Research-ready: Challenges split representativeness, label ambiguity, threshold policy, and calibration evidence before claiming generalization.

Common Failure Modes

Symptom	Cause	Correction
Accuracy is high while return-window exceptions are missed.	Aggregate metric hides asymmetric cost.	Gate required slices on false negatives and show the rows.
Candidate score improves after easy status rows are appended.	Receipt inventory wasn't frozen before aggregation.	Reject missing, unexpected, duplicate, or drifted receipt rows before calculating metrics.
Fine-tuning appears to help, but validation contains the same accounts as training.	Thread or account leakage.	Group by account before time holdout and freeze the manifest.
A score is described as probability without retained evidence.	Softmax was confused with calibration.	Say score, audit held-out buckets, and calibrate only with adequate labels.
Rollback changes model weights but traffic behavior remains unsafe.	Threshold or tokenizer was deployed separately.	Version and restore full serving bundle.
Production agent handles a return-window exception.	Classifier route was treated as advisory after a failed gate.	Fail closed to human review and test intake boundary.
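
The receipt correction in the second row can be sketched as an inventory check that runs before any metric is computed. This is an illustration only: the fixture ids, the `content_hash` field, and the shape of `frozen_receipt` are hypothetical stand-ins for whatever the frozen validation receipt actually records.

```python
from collections import Counter

# Hypothetical frozen receipt: fixture id -> content hash recorded at freeze time.
frozen_receipt = {"r-001": "a1", "r-002": "b2", "r-003": "c3"}

# Hypothetical candidate rows: one duplicate, one drifted, one extra, one missing.
candidate_rows = [
    {"fixture_id": "r-001", "content_hash": "a1"},
    {"fixture_id": "r-001", "content_hash": "a1"},
    {"fixture_id": "r-002", "content_hash": "zz"},
    {"fixture_id": "r-004", "content_hash": "d4"},
]


def receipt_problems(frozen, rows):
    """Return every way the candidate rows differ from the frozen receipt."""
    seen = Counter(row["fixture_id"] for row in rows)
    drifted = {
        row["fixture_id"]
        for row in rows
        if row["fixture_id"] in frozen
        and row["content_hash"] != frozen[row["fixture_id"]]
    }
    found = {
        "missing": sorted(set(frozen) - set(seen)),
        "unexpected": sorted(set(seen) - set(frozen)),
        "duplicate": sorted(fid for fid, count in seen.items() if count > 1),
        "drifted": sorted(drifted),
    }
    # Keep only the non-empty problem kinds; an empty dict means an exact match.
    return {kind: fixture_ids for kind, fixture_ids in found.items() if fixture_ids}


problems = receipt_problems(frozen_receipt, candidate_rows)
if problems:
    print("hold release, receipt mismatch:", problems)
else:
    print("receipt matches exactly, metrics may be computed")
```

Running the check first means a padded or duplicated row set is rejected as a whole, so appended easy rows never get the chance to move an aggregate score.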

Key Concepts

Label contract before checkpoint.
Baseline before complexity.
PyTorch training mechanics before pretrained fine-tuning tooling.
Threshold chosen on frozen validation evidence.
Calibration measured, never assumed from softmax output.
Frozen receipt identity, exact fixture coverage, and required-slice failures block release.
Model, tokenizer, threshold, and fallback form one serving bundle.
Classifier route limits the downstream agent's intake, not its approval controls.
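
To make "calibration measured, never assumed" concrete, here is a minimal bucket audit. It uses the six held-out encoder_v1 scores from the figure, with label 1 meaning the ticket belongs in human_review_now. The two-bucket edges and the `min_rows` cutoff of 50 are assumptions chosen for illustration, not values from the lesson.

```python
# Held-out scores from the encoder_v1 figure; label 1 means human_review_now.
held_out = [(0.18, 0), (0.23, 0), (0.42, 1), (0.64, 1), (0.71, 0), (0.89, 1)]


def bucket_audit(scored_rows, edges=(0.0, 0.5, 1.0), min_rows=50):
    """Compare mean score with observed positive rate inside each score bucket."""
    report = []
    for low, high in zip(edges, edges[1:]):
        is_last = high == edges[-1]
        members = [
            (score, label)
            for score, label in scored_rows
            # Buckets are half-open, except the last one, which includes its top edge.
            if low <= score < high or (is_last and score == high)
        ]
        if not members:
            continue
        n = len(members)
        report.append({
            "range": (low, high),
            "n": n,
            "mean_score": sum(score for score, _ in members) / n,
            "positive_rate": sum(label for _, label in members) / n,
            # Assumed cutoff: below it, the bucket cannot support a probability claim.
            "enough_rows": n >= min_rows,
        })
    return report


for bucket in bucket_audit(held_out):
    print(bucket)
```

With three rows per bucket, `enough_rows` is false everywhere, so the audit supports calling the output a score and nothing more. A gap between `mean_score` and `positive_rate` on a sample this small is not evidence of miscalibration either; it only shows what to measure once enough labeled rows exist.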

Where This Leads Next

You now have a classifier artifact that can be consumed rather than merely discussed: intake_bundle_v2 packages encoder_v1 with a reviewed threshold, admits routine tickets to a guarded workflow, and routes risky or failed inputs to human review. The next capstone must respect that boundary while combining it with policy retrieval, approval gates, traces, and executable evaluation.

Next Step

Continue to Capstone: Production Agent

You will make the agent consume `guarded_agent` intake, obtain policy evidence from the document QA artifact, and keep refund actions behind human approval.
