Follow one return-decision assistant from base-model training to post-training, retrieval, serving, evaluation, and the fix chosen after a real failure.
In the previous chapter, you shipped a small app that decides whether damaged order A10234 qualifies under return-policy line P-7. The app validated input, called a model boundary, stored evidence, and rendered a result. One question was deliberately left open: where did the model behind that boundary come from?
This chapter answers that question. A large language model (LLM) product has two connected loops. A model-building loop changes weights through pre-training and post-training. A product loop serves frozen weights with prompts, retrieved evidence, traces, evals, and releases. Good debugging starts by knowing which loop owns the failure.
Key idea: A deployed assistant isn't just a trained model. It's a checkpoint plus current context, serving code, evaluation, and feedback.
The customer sees one answer: "Eligible under P-7: damaged item reported within 30 days." Under that answer are several distinct engineering jobs.
| Part of the system | What enters it | What comes out | Do model weights change? |
|---|---|---|---|
| Pre-training | Filtered sequences of text and code | Base model checkpoint | Yes |
| Supervised fine-tuning (SFT) | Instruction and target-answer demonstrations | Instruction-following checkpoint | Yes |
| Preference or verifier-based post-training | Preferred answers or automatically checked attempts | Better-behaved checkpoint | Yes |
| Retrieval | Current policy documents and order facts | Evidence inserted into a request | No |
| Inference | Checkpoint, prompt, retrieved evidence | Generated response | Normally no |
| Evaluation and release control | Cases, outputs, traces, metrics | Ship, block, or investigate decision | No |
Recipes differ. Some checkpoints receive several post-training passes; some products call a hosted model and never touch training. This map still matters because it separates changing the model from changing what the model sees or how the product runs it.
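The same map can be written as plain data, which makes the "do weights change?" column easy to query. This is only an illustrative sketch: the `STAGES` list and its field names are invented here, and the table's "Normally no" for inference is recorded as `False`.

```python
# Illustrative restatement of the lifecycle table above.
# STAGES and its field names are assumptions for this sketch, not a real API.
STAGES = [
    {"stage": "pre-training", "updates_weights": True},
    {"stage": "supervised fine-tuning", "updates_weights": True},
    {"stage": "preference or verifier-based post-training", "updates_weights": True},
    {"stage": "retrieval", "updates_weights": False},
    {"stage": "inference", "updates_weights": False},  # "normally no" in the table
    {"stage": "evaluation and release control", "updates_weights": False},
]

# The model-building loop changes weights; the product loop serves frozen weights.
model_building_loop = [s["stage"] for s in STAGES if s["updates_weights"]]
product_loop = [s["stage"] for s in STAGES if not s["updates_weights"]]

print("changes weights:", model_building_loop)
print("leaves checkpoint frozen:", product_loop)
```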
Before a model can answer a return question, it must learn patterns in language and code. Pre-training presents token sequences and asks the network to predict each next token. Its parameters are updated to assign more probability to continuations seen in useful data.
The tiny example below doesn't train a neural network. It exposes the signal a neural language model receives: after a context token, which continuation tends to occur?
```python
from collections import Counter, defaultdict

lines = [
    "damaged return eligible",
    "damaged return eligible",
    "damaged return review",
    "late return review",
]

# For every token, count which token follows it in the toy corpus.
next_tokens: dict[str, Counter[str]] = defaultdict(Counter)
for line in lines:
    tokens = line.split()
    for current, following in zip(tokens, tokens[1:]):
        next_tokens[current][following] += 1

for token in ["damaged", "return"]:
    # After "return", "eligible" and "review" are tied at 2;
    # most_common() keeps the one that was counted first.
    choice, count = next_tokens[token].most_common(1)[0]
    print(f"after {token!r}: predict {choice!r} ({count} observations)")
```

```
after 'damaged': predict 'return' (3 observations)
after 'return': predict 'eligible' (2 observations)
```

Seeing frequent transitions isn't enough. Training needs a number that becomes smaller when the model assigns higher probability to the observed next token. Cross-entropy loss supplies that number: for the correct next token with probability p, loss is -log(p).
```python
from math import log

def token_loss(probability_of_correct_token: float) -> float:
    return -log(probability_of_correct_token)

confident_correct = token_loss(0.80)
uncertain_correct = token_loss(0.20)

print(f"p=0.80 -> loss={confident_correct:.3f}")
print(f"p=0.20 -> loss={uncertain_correct:.3f}")
print("lower loss rewards more probability on the observed continuation")
```

```
p=0.80 -> loss=0.223
p=0.20 -> loss=1.609
lower loss rewards more probability on the observed continuation
```

At real scale, researchers choose a model size, data mixture, token budget, and compute budget together. Kaplan et al. measured empirical power-law relationships between language-model loss, model size, dataset size, and compute.[1] Hoffmann et al. later showed that, under their compute budgets and experiments, training a smaller model on substantially more data could outperform a much larger undertrained model.[2] These papers don't prescribe every modern training run, but they explain why dataset and compute planning are first-class work.
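A back-of-envelope sketch can make that planning concrete. It leans on two approximations that are assumptions here, not statements from this chapter: training compute of roughly 6 FLOPs per parameter per token, and a commonly quoted reading of Hoffmann et al. of about 20 training tokens per parameter. Both are rough rules of thumb, and the function names and the 8-billion-parameter example are purely illustrative.

```python
def approx_training_flops(parameters: float, tokens: float) -> float:
    # Assumed rule of thumb: about 6 FLOPs per parameter per training token.
    return 6.0 * parameters * tokens

def rule_of_thumb_tokens(parameters: float, tokens_per_parameter: float = 20.0) -> float:
    # Assumed ratio; real training runs choose this deliberately and often differ.
    return parameters * tokens_per_parameter

parameters = 8e9  # hypothetical 8-billion-parameter model
tokens = rule_of_thumb_tokens(parameters)
flops = approx_training_flops(parameters, tokens)

print(f"token budget: {tokens:.2e} tokens")
print(f"training compute: {flops:.2e} FLOPs")
```

The point is the coupling, not the exact numbers: doubling the model at a fixed compute budget forces the token budget down, which is the trade-off both papers study.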
A base model can continue text impressively and still be a poor assistant. Given this prompt:
```
Damaged return request A10234:
```

it may continue with another plausible request rather than produce a structured eligibility decision. It learned continuation, not your API contract, refusal policy, or evidence requirements.
Supervised fine-tuning (SFT) continues training on demonstrations: an instruction is paired with the answer the model should produce. The model still predicts tokens, but now the useful continuation is an assistant answer shaped for a task.
For the return app, one demonstration might require a JSON decision with cited evidence rather than free-form customer-support prose.
```python
import json

example = {
    "instruction": "Decide return eligibility for A10234 using policy evidence.",
    "input": {"reason": "damaged", "days_since_delivery": 12},
    "target": {
        "eligible": True,
        "policy_id": "P-7",
        "evidence": "Damaged items reported within 30 days are eligible.",
    },
}

print("instruction:", example["instruction"])
print("target:", json.dumps(example["target"], sort_keys=True))
```

```
instruction: Decide return eligibility for A10234 using policy evidence.
target: {"eligible": true, "evidence": "Damaged items reported within 30 days are eligible.", "policy_id": "P-7"}
```

SFT data should make desired behavior measurable. If the application requires eligible, policy_id, and evidence, a demonstration missing those fields teaches the wrong interface.
```python
required_fields = {"eligible", "policy_id", "evidence"}

targets = [
    {"eligible": True, "policy_id": "P-7", "evidence": "within 30 days"},
    {"eligible": True, "evidence": "looks acceptable"},
]

for index, target in enumerate(targets, start=1):
    missing = sorted(required_fields - target.keys())
    verdict = "keep" if not missing else f"reject, missing {missing}"
    print(f"example {index}: {verdict}")
```

```
example 1: keep
example 2: reject, missing ['policy_id']
```

InstructGPT is a concrete published example of this progression: supervised demonstrations followed by preference data and reinforcement learning from human feedback (RLHF) improved instruction-following behavior relative to the underlying GPT-3 models in its evaluation setting.[3]
SFT can teach the model to cite policy. It can't guarantee that P-7 is today's policy. If the return window changes tomorrow, fresh retrieved policy text is usually a better first fix than training a new checkpoint.
After SFT, multiple answers may be valid JSON and still differ in usefulness. One cites the exact policy clause; another guesses a refund action. Post-training supplies a signal that favors the better behavior.
Two families matter for this map:
| Signal type | Example method | Useful when |
|---|---|---|
| A reviewer prefers answer A over answer B | RLHF or Direct Preference Optimization (DPO) | Helpfulness, tone, calibrated refusal, concise explanations |
| A program can check success or failure | Reinforcement Learning with Verifiable Rewards (RLVR) | Code tests, math answers, strict structured constraints |
In InstructGPT's RLHF stage, a learned reward model represented human preferences inside a reinforcement-learning process.[3] DPO instead optimizes preference pairs with a classification loss, without fitting a separate reward model or running an RL optimization loop.[4] DeepSeek-R1 is a published example where large-scale reinforcement learning, including an RL-only variant (DeepSeek-R1-Zero), improved measured reasoning on verifiable tasks.[5] It is a useful pattern to study, not a default choice for every product.
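To show what "a classification loss on preference pairs" can look like, here is a minimal sketch of the DPO objective for a single pair, following the form given by Rafailov et al.[4] The log-probabilities are made-up numbers standing in for summed token log-probabilities of each response under the trained policy and a frozen reference model; `beta` and the function name are illustrative choices, and no parameters are updated here.

```python
from math import exp, log

def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    # How much more the policy likes each answer than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written as log(1 + exp(-logit)).
    return log(1.0 + exp(-logit))

# Hypothetical log-probabilities, not measurements from any model.
untrained = dpo_pair_loss(-12.0, -12.0, -12.0, -12.0)
improved = dpo_pair_loss(-10.0, -14.0, -12.0, -12.0)

print(f"policy equals reference: loss={untrained:.3f}")
print(f"policy favors chosen answer: loss={improved:.3f}")
```

When the policy matches the reference, the loss is log 2 (about 0.693). It falls as the policy raises the chosen answer's probability relative to the rejected one, which is the signal a real pipeline would turn into gradient updates.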
Here is the preference-data idea with ordinary Python objects. This code builds a comparison record; an actual DPO or RLHF pipeline would convert many such records into parameter updates.
```python
pair = {
    "prompt": "Is damaged order A10234 eligible under P-7?",
    "chosen": "Eligible. P-7 covers damage reported within 30 days.",
    "rejected": "Eligible. Your replacement label has already been created.",
    "reason": "chosen cites evidence and avoids claiming an unperformed action",
}

print("chosen:", pair["chosen"])
print("rejected because:", pair["reason"])
```

```
chosen: Eligible. P-7 covers damage reported within 30 days.
rejected because: chosen cites evidence and avoids claiming an unperformed action
```

Verifiable rewards apply only when a checker represents the desired behavior. For this narrow decision, a checker can compare the output with trusted order facts, require policy evidence, and forbid a side effect that the model never executed.
```python
from typing import TypedDict

class TrustedOrder(TypedDict):
    damage_confirmed: bool
    days_since_delivery: int

TRUSTED_ORDER: TrustedOrder = {"damage_confirmed": True, "days_since_delivery": 12}

def verify_decision(output: dict[str, object], order: TrustedOrder) -> int:
    eligible_under_p7 = (
        order["damage_confirmed"] is True
        and order["days_since_delivery"] <= 30
    )
    passes = (
        output.get("eligible") is eligible_under_p7
        and output.get("policy_id") == "P-7"
        and output.get("label_created") is False
    )
    return int(passes)

good = {"eligible": True, "policy_id": "P-7", "label_created": False}
wrong_eligibility = {"eligible": False, "policy_id": "P-7", "label_created": False}
unsafe = {"eligible": True, "policy_id": "P-7", "label_created": True}

print("checked decision reward:", verify_decision(good, TRUSTED_ORDER))
print("wrong eligibility reward:", verify_decision(wrong_eligibility, TRUSTED_ORDER))
print("invented side-effect reward:", verify_decision(unsafe, TRUSTED_ORDER))
```

```
checked decision reward: 1
wrong eligibility reward: 0
invented side-effect reward: 0
```

That verifier is useful only for the part it checks. Here, damage_confirmed is a trusted fixture; free-form customer text couldn't set it safely. The checker says nothing about empathy, ambiguous damage photos, or whether policy P-7 is fair. A weak checker can train a model toward shortcuts, so every verifier needs adversarial tests and human-reviewed cases.
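One possible shape for those adversarial tests is sketched below. The checker logic is copied so the block runs on its own, and every case name and payload is invented for illustration. The first group tries to earn reward with malformed or shortcut outputs; the last case shows a blind spot the checker can't see.

```python
def verify_decision(output: dict, order: dict) -> int:
    # Same rule as the checker above, repeated so this sketch is self-contained.
    eligible_under_p7 = (
        order["damage_confirmed"] is True and order["days_since_delivery"] <= 30
    )
    passes = (
        output.get("eligible") is eligible_under_p7
        and output.get("policy_id") == "P-7"
        and output.get("label_created") is False
    )
    return int(passes)

ORDER = {"damage_confirmed": True, "days_since_delivery": 12}
LATE_ORDER = {"damage_confirmed": True, "days_since_delivery": 45}

# Hypothetical attempts to game the checker; each should earn reward 0.
adversarial_cases = {
    "string instead of boolean": (
        {"eligible": "true", "policy_id": "P-7", "label_created": False}, ORDER),
    "missing side-effect field": (
        {"eligible": True, "policy_id": "P-7"}, ORDER),
    "policy id padded with text": (
        {"eligible": True, "policy_id": "P-7 (probably)", "label_created": False}, ORDER),
    "always-eligible shortcut on a late order": (
        {"eligible": True, "policy_id": "P-7", "label_created": False}, LATE_ORDER),
}

rewards = {
    name: verify_decision(output, order)
    for name, (output, order) in adversarial_cases.items()
}
for name, reward in rewards.items():
    print(f"{name}: reward={reward}")

# Blind spot: the checker never reads the evidence text, so nonsense still passes.
blind_spot = {
    "eligible": True,
    "policy_id": "P-7",
    "label_created": False,
    "evidence": "because the customer sounded upset",
}
blind_spot_reward = verify_decision(blind_spot, ORDER)
print("unsupported evidence text: reward =", blind_spot_reward)
```

The last case is the kind of output that needs a human-reviewed sample or a stricter evidence check before the verifier is used as a training signal.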
Once a checkpoint is deployed, inference is the work of generating responses for requests. Normal inference reads weights; it doesn't update them. Behavior can still change through retrieved evidence, prompt instructions, tool permissions, or decoding and serving settings.
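Decoding settings are one concrete example. The sketch below applies softmax with a temperature, a common sampling parameter, to a fixed set of next-token scores. The logits are invented numbers standing in for what a frozen checkpoint might produce; only the serving-time setting varies.

```python
from math import exp

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Divide scores by the temperature, then normalize into probabilities.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]

# Hypothetical next-token scores from the same frozen checkpoint.
TOKENS = ["eligible", "review", "refund"]
LOGITS = [2.0, 1.0, 0.0]

distributions = {}
for temperature in (0.5, 1.0, 2.0):
    probabilities = softmax_with_temperature(LOGITS, temperature)
    distributions[temperature] = probabilities
    summary = ", ".join(f"{t}={p:.2f}" for t, p in zip(TOKENS, probabilities))
    print(f"temperature {temperature}: {summary}")
```

Lower temperature concentrates probability on the top-scoring token and higher temperature spreads it out, so sampled answers can differ even though no weight was touched.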
The return assistant needs today's policy, so the application retrieves P-7 at request time and combines it with trusted order facts. The miniature function below stands in for the model boundary: it demonstrates that new evidence changes the served answer while the checkpoint identifier stays fixed.
```python
CHECKPOINT = "returns-assistant-v3"

def answer_from_context(policy_text: str, *, days_since_delivery: int) -> str:
    # Stand-in for the model boundary: the answer depends on the evidence supplied.
    if (
        "damaged items within 30 days are eligible" in policy_text.lower()
        and days_since_delivery <= 30
    ):
        return "A10234: eligible under P-7"
    return "A10234: escalate for policy review"

old_policy = "P-7: damaged items within 7 days are eligible."
current_policy = "P-7: damaged items within 30 days are eligible."

print("checkpoint:", CHECKPOINT)
print("old retrieval:", answer_from_context(old_policy, days_since_delivery=12))
print("current retrieval:", answer_from_context(current_policy, days_since_delivery=12))
```

```
checkpoint: returns-assistant-v3
old retrieval: A10234: escalate for policy review
current retrieval: A10234: eligible under P-7
```

Retrieval-Augmented Generation (RAG) was introduced as a model architecture that combines a text generator with a retriever over external knowledge.[6] In products, the practical rule is simple: facts that change often should arrive as request-time evidence, not stale parametric memory.
Serving is constrained by GPU memory and latency. Weights alone require roughly weight bytes = parameter count * bytes per parameter.
An eight-billion-parameter checkpoint stored at two bytes per parameter needs about 14.9 GiB for weights alone. It still needs memory for the key-value (KV) cache, runtime buffers, and concurrent requests.
```python
def gib_for_weights(parameters_billions: float, bytes_per_parameter: float) -> float:
    total_bytes = parameters_billions * 1_000_000_000 * bytes_per_parameter
    return total_bytes / (1024 ** 3)

bf16_weights = gib_for_weights(8, 2)
four_bit_weights = gib_for_weights(8, 0.5)

print(f"8B at 2 bytes/parameter: {bf16_weights:.1f} GiB, weights only")
print(f"8B at 0.5 bytes/parameter: {four_bit_weights:.1f} GiB, weights only")
print("KV cache and runtime buffers still require additional memory.")
```

```
8B at 2 bytes/parameter: 14.9 GiB, weights only
8B at 0.5 bytes/parameter: 3.7 GiB, weights only
KV cache and runtime buffers still require additional memory.
```

The four-bit result is an ideal packed-weight estimate, not a deployment promise. Real quantized formats can add metadata and higher-precision components. Measure the loaded model footprint before choosing hardware.
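The KV cache can be estimated the same way. A minimal sketch, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token: the layer count, head count, and head dimension below are illustrative assumptions, so read the real values from your model's configuration, and expect serving runtimes to add their own overhead.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    # Factor of 2: one key vector and one value vector are cached per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed model shape for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
one_sequence_gib = kv_cache_bytes(8192, LAYERS, KV_HEADS, HEAD_DIM) / (1024 ** 3)
sixteen_sequences_gib = 16 * one_sequence_gib

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"one 8,192-token sequence: {one_sequence_gib:.1f} GiB")
print(f"sixteen concurrent sequences: {sixteen_sequences_gib:.1f} GiB")
```

Under these assumed shapes, sixteen concurrent long requests need about as much memory as the two-byte weights themselves, which is why concurrency and context length belong in capacity planning.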
More context can also be a quality issue, not only a memory issue. Liu et al. observed in their long-context evaluations that models could perform worse when relevant information appeared in the middle of long inputs.[7] Dumping every policy revision into one huge prompt isn't a substitute for retrieval and evaluation.
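One way to keep the prompt focused is to select only the policy revision in effect for the request instead of sending every revision. The sketch below is hypothetical throughout: the revision records, their effective dates, and the `policy_in_effect` helper are invented for illustration, and a real system would read them from a versioned document store.

```python
from datetime import date
from typing import Optional

# Invented revision history for P-7; the dates are placeholders.
POLICY_REVISIONS = [
    {"policy_id": "P-7", "effective": date(2024, 1, 1),
     "text": "Damaged items within 7 days are eligible."},
    {"policy_id": "P-7", "effective": date(2025, 1, 1),
     "text": "Damaged items within 30 days are eligible."},
]

def policy_in_effect(policy_id: str, on: date, revisions: list[dict]) -> Optional[dict]:
    # Keep revisions of this policy that had already taken effect on the given date.
    candidates = [
        revision for revision in revisions
        if revision["policy_id"] == policy_id and revision["effective"] <= on
    ]
    if not candidates:
        return None
    # The most recent of those is the one to place in the request context.
    return max(candidates, key=lambda revision: revision["effective"])

selected = policy_in_effect("P-7", date(2025, 6, 1), POLICY_REVISIONS)
print("evidence for prompt:", selected["text"])
```

Only one short passage reaches the model, and the selected revision can be stored in the trace so an eval can later confirm which version the answer was grounded in.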
The model-building team evaluates checkpoints before promotion. The product team evaluates prompts, retrieval, serving, and releases after integration. Both need cases that look like real work.
For A10234, a golden case isn't "does the answer sound friendly?" It includes trusted order facts, a policy fact, a required decision, required evidence, and forbidden claims.
```python
case = {
    "order_id": "A10234",
    "damage_confirmed": True,
    "days_since_delivery": 12,
    "expected_decision": "eligible",
    "required_policy_id": "P-7",
}

candidate = {
    "decision": "eligible",
    "policy_id": "P-7",
    "label_created": False,
}
passes = (
    case["damage_confirmed"] is True
    and case["days_since_delivery"] <= 30
    and candidate["decision"] == case["expected_decision"]
    and candidate["policy_id"] == case["required_policy_id"]
    and candidate["label_created"] is False
)

print("golden case passes:", passes)
```

```
golden case passes: True
```

Open-ended response quality often needs a rubric rather than exact string matching. LLM judges can help scale rubric-based comparisons, and Zheng et al. studied their agreement and biases in MT-Bench and Chatbot Arena.[8] They aren't a substitute for calibrated human-reviewed cases, especially for policy and safety boundaries.
This small evaluator mixes deterministic policy checks with a human-review queue for answers that pass the rule but may still be confusing.
```python
cases = [
    ("A10234", 12, True, "eligible", "P-7", False, True),
    ("A10235", 45, True, "not_eligible", "P-7", False, True),
    ("A10236", 12, True, "eligible", "P-7", True, False),
    ("A10237", 45, True, "eligible", "P-7", False, False),
    ("A10238", 12, True, "eligible", "P-3", False, False),
]

passed = 0
for order_id, days_since_delivery, damage_confirmed, actual_decision, policy_id, label_created, expected_pass in cases:
    expected_decision = "eligible" if damage_confirmed and days_since_delivery <= 30 else "not_eligible"
    output = {
        "decision": actual_decision,
        "policy_id": policy_id,
        "label_created": label_created,
    }
    checks_pass = (
        output["decision"] == expected_decision
        and output["policy_id"] == "P-7"
        and output["label_created"] is False
    )
    passed += int(checks_pass == expected_pass)
    print(order_id, "checks_pass=", checks_pass, "expected=", expected_pass)

print(f"deterministic checks: {passed}/{len(cases)}")
print("human review remains required for clarity and ambiguous evidence")
```

```
A10234 checks_pass= True expected= True
A10235 checks_pass= True expected= True
A10236 checks_pass= False expected= False
A10237 checks_pass= False expected= False
A10238 checks_pass= False expected= False
deterministic checks: 5/5
human review remains required for clarity and ambiguous evidence
```

When an eval fails, "fine-tune the model" is rarely a complete diagnosis. First ask what must change.
```python
def owner_for_failure(symptom: str) -> str:
    symptom = symptom.lower()
    if "stale policy" in symptom:
        return "retrieval index"
    if "missing policy_id" in symptom:
        return "prompt contract or SFT data"
    if "promises a label" in symptom:
        return "preference data and safety eval"
    if "slow first token" in symptom:
        return "inference serving"
    if "release regression" in symptom:
        return "evaluation gate and rollback"
    return "investigate with a labeled trace"

failures = [
    "A10234 used stale policy P-7",
    "Response missing policy_id",
    "Response promises a label without tool execution",
    "Slow first token after traffic spike",
]

for failure in failures:
    print(owner_for_failure(failure))
```

```
retrieval index
prompt contract or SFT data
preference data and safety eval
inference serving
```

A release gate can block a known regression before the long-term fix ships.
```python
candidate = {
    "policy_cases_passed": 19,
    "policy_cases_total": 20,
    "unsafe_side_effect_claims": 1,
    "p95_ttft_ms": 920,
}

reasons: list[str] = []
if candidate["policy_cases_passed"] != candidate["policy_cases_total"]:
    reasons.append("policy regression")
if candidate["unsafe_side_effect_claims"] > 0:
    reasons.append("invented side effect")
if candidate["p95_ttft_ms"] > 1000:
    reasons.append("latency budget exceeded")

decision = "BLOCK" if reasons else "SHIP"
print(decision, "-", ", ".join(reasons) if reasons else "all gates pass")
```

```
BLOCK - policy regression, invented side effect
```

Traces turn failures into concrete work items. A stored trace can already tell you whether to refresh evidence, revise the prompt, add preference examples, or optimize serving.
```python
trace = {
    "order_id": "A10234",
    "checkpoint": "returns-assistant-v3",
    "policy_version": "P-7-2025",
    "failure": "stale policy",
}

action_by_failure = {
    "stale policy": "refresh retrieval document and rerun policy evals",
    "invalid schema": "tighten output contract and rerun format evals",
    "slow first token": "profile queueing and KV-cache pressure",
}

print("checkpoint remains:", trace["checkpoint"])
print("next action:", action_by_failure[trace["failure"]])
```

```
checkpoint remains: returns-assistant-v3
next action: refresh retrieval document and rerun policy evals
```

The lifecycle gets simpler once you ask a precise question about each symptom.
| Observation in the return assistant | First place to look | Why |
|---|---|---|
| Base model continues a request rather than answering | SFT or choose an instruction checkpoint | Desired assistant format wasn't trained or selected |
| Response is valid but repeatedly overpromises actions | Preference/post-training data and eval rubric | Better behavior must be preferred and checked |
| Answer cites an old return window | Retrieval index and document versioning | Current fact is missing from request context |
| JSON fields disappear after a prompt release | Prompt/schema contract and regression eval | Product interface changed without weight updates |
| Time to first token spikes under load | Inference serving and capacity | Model may be fine while queue or memory is overloaded |
| New checkpoint passes style grading but fails P-7 | Evaluation gate | A pleasant answer can still violate policy |
After completing this chapter, you can turn the snippets into a small repository artifact:
| File | Purpose |
|---|---|
| data/sft_examples.jsonl | Demonstrations that require eligible, policy_id, and evidence |
| data/preference_pairs.jsonl | Chosen and rejected outputs for unsupported action claims |
| evals/policy_cases.jsonl | Held-out P-7 cases with deterministic constraints |
| evaluate.py | Policy verifier and release-gate report |
| trace_router.py | Maps failed traces to retrieval, contract, post-training, or serving work |
| README.md | Explains which fixes update weights and which leave the checkpoint frozen |
It is a small artifact built on the same habit: inspect the failure, then choose the smallest intervention backed by evidence instead of reflexively retraining.
A10234 failure and states why.
Symptom: A team creates preference pairs to teach tomorrow's policy revision.
Cause: It confuses preferred behavior with current evidence.
Fix: Retrieve versioned policy text at request time and evaluate grounded decisions against it.
Symptom: Verifier scores rise while answers become unhelpful or game a formatting rule.
Cause: Programmatic reward observes only a narrow proxy.
Fix: Add adversarial verifier tests, held-out cases, and human-reviewed rubric samples.
Symptom: Assistant produces fluent output that claims a refund or label was executed.
Cause: Eval set checks wording but not side effects.
Fix: Gate release on deterministic constraints tied to recorded tool actions and policy evidence.
Symptom: Latency spike triggers a request for another tuning run.
Cause: Team hasn't separated inference capacity from model behavior.
Fix: Inspect queue time, time to first token, KV-cache pressure, and trace errors before considering training.
Symptom: Correct policy is somewhere in a large context, but answer cites the wrong rule.
Cause: Context availability isn't the same as reliable use of evidence.
Fix: Retrieve focused policy passages, place evidence clearly, and evaluate citation accuracy on held-out cases.
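For the failure above in which the assistant claims a refund or label was executed, a deterministic gate can compare claimed actions with what the tool log actually recorded. This is a sketch under assumptions: the `claimed_actions` field, the tool-log entry shape, and the action names are all hypothetical, not part of any real framework.

```python
def unsupported_claims(response: dict, tool_log: list[dict]) -> list[str]:
    # Actions the tool layer recorded as successfully executed.
    executed = {entry["action"] for entry in tool_log if entry.get("status") == "ok"}
    # Anything the response claims beyond that set is an invented side effect.
    return sorted(set(response.get("claimed_actions", [])) - executed)

# Hypothetical response and tool logs for order A10234.
response = {
    "decision": "eligible",
    "policy_id": "P-7",
    "claimed_actions": ["create_return_label"],
}
empty_log: list[dict] = []
log_with_label = [{"action": "create_return_label", "status": "ok"}]

print("no tool call recorded:", unsupported_claims(response, empty_log))
print("label call recorded:", unsupported_claims(response, log_with_label))
```

A release gate could block any candidate whose eval run yields a non-empty list, which ties the check to recorded actions instead of wording.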
[1] Kaplan et al. (2020). Scaling Laws for Neural Language Models.
[2] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.
[3] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
[4] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
[5] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
[6] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[7] Liu, N.F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL 2023.
[8] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.