The LLM Lifecycle

Follow one return-decision assistant from base-model training to post-training, retrieval, serving, evaluation, and the fix chosen after a real failure.

In the previous chapter, you shipped a small app that decides whether damaged order A10234 qualifies under return-policy line P-7. The app validated input, called a model boundary, stored evidence, and rendered a result. One question was deliberately left open: where did the model behind that boundary come from?

This chapter answers that question. A large language model (LLM) product has two connected loops. A model-building loop changes weights through pre-training and post-training. A product loop serves frozen weights with prompts, retrieved evidence, traces, evals, and releases. Good debugging starts by knowing which loop owns the failure.

Key idea: A deployed assistant isn't just a trained model. It's a checkpoint plus current context, serving code, evaluation, and feedback.

Figure: Two connected loops for the A10234 return assistant. A purple model-building orbit moves through token loss, supervised targets, and preference ranking while weights change from theta zero to theta three. The promoted returns-v3 checkpoint is locked inside a green product orbit where current P-7 evidence and order facts drive inference, traces, evaluation, and release gates without changing weights.
Model building changes the checkpoint through prediction, demonstrations, and behavior signals. Production promotes that checkpoint, freezes its weights, and loops over current evidence, inference, traces, evals, and release decisions.

Two loops, one return decision

The customer sees one answer: "Eligible under P-7: damaged item reported within 30 days." Under that answer are several distinct engineering jobs.

| Part of the system | What enters it | What comes out | Do model weights change? |
| --- | --- | --- | --- |
| Pre-training | Filtered sequences of text and code | Base model checkpoint | Yes |
| Supervised fine-tuning (SFT) | Instruction and target-answer demonstrations | Instruction-following checkpoint | Yes |
| Preference or verifier-based post-training | Preferred answers or automatically checked attempts | Better-behaved checkpoint | Yes |
| Retrieval | Current policy documents and order facts | Evidence inserted into a request | No |
| Inference | Checkpoint, prompt, retrieved evidence | Generated response | Normally no |
| Evaluation and release control | Cases, outputs, traces, metrics | Ship, block, or investigate decision | No |

Recipes differ. Some checkpoints receive several post-training passes; some products call a hosted model and never touch training. This map still matters because it separates changing the model from changing what the model sees or how the product runs it.

Stage 1: pre-training learns prediction

Before a model can answer a return question, it must learn patterns in language and code. Pre-training presents token sequences and asks the network to predict each next token. Its parameters are updated to assign more probability to continuations seen in useful data.

The tiny example below doesn't train a neural network. It exposes the signal a neural language model receives: after a context token, which continuation tends to occur?

observe-next-token-signal.py
from collections import Counter, defaultdict

lines = [
    "damaged return eligible",
    "damaged return eligible",
    "damaged return review",
    "late return review",
]

# Count which token follows each context token.
next_tokens: dict[str, Counter[str]] = defaultdict(Counter)
for line in lines:
    tokens = line.split()
    for current, following in zip(tokens, tokens[1:]):
        next_tokens[current][following] += 1

for token in ["damaged", "return"]:
    # After 'return', 'eligible' and 'review' tie at 2; most_common keeps first-seen order.
    choice, count = next_tokens[token].most_common(1)[0]
    print(f"after {token!r}: predict {choice!r} ({count} observations)")
Output
after 'damaged': predict 'return' (3 observations)
after 'return': predict 'eligible' (2 observations)

Seeing frequent transitions isn't enough. Training needs a number that becomes smaller when the model assigns higher probability to the observed next token. Cross-entropy loss supplies that number: for the correct next token with probability p, loss is -log(p).

compare-next-token-loss.py
from math import log

def token_loss(probability_of_correct_token: float) -> float:
    return -log(probability_of_correct_token)

confident_correct = token_loss(0.80)
uncertain_correct = token_loss(0.20)

print(f"p=0.80 -> loss={confident_correct:.3f}")
print(f"p=0.20 -> loss={uncertain_correct:.3f}")
print("lower loss rewards more probability on the observed continuation")
Output
p=0.80 -> loss=0.223
p=0.20 -> loss=1.609
lower loss rewards more probability on the observed continuation
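A training step usually averages this per-token loss over many positions. The sketch below, which uses made-up probabilities for three observed tokens, shows that average and the single per-token probability it corresponds to. The name `mean_token_loss` and the numbers are illustrative assumptions, not values from a real model.

```python
from math import exp, log

def mean_token_loss(probabilities: list[float]) -> float:
    """Average -log(p) over the observed next tokens of one sequence."""
    return sum(-log(p) for p in probabilities) / len(probabilities)

# Hypothetical probabilities a model assigned to each observed next token.
sequence_probs = [0.80, 0.20, 0.50]
loss = mean_token_loss(sequence_probs)

print(f"mean loss={loss:.3f}")
# exp(-mean loss) is the geometric mean of the per-token probabilities.
print(f"equivalent per-token probability={exp(-loss):.3f}")
```

One low-probability token pulls the average up noticeably, which is why a few badly predicted positions can dominate a short sequence's loss.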

At real scale, researchers choose a model size, data mixture, token budget, and compute budget together. Kaplan et al. measured empirical power-law relationships between language-model loss, model size, dataset size, and compute.[1] Hoffmann et al. later showed that, under their compute budgets and experiments, training a smaller model on substantially more data could outperform a much larger undertrained model.[2] These papers don't prescribe every modern training run, but they explain why dataset and compute planning are first-class work.
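To make that planning concrete, here is a back-of-the-envelope sketch. It assumes two widely quoted approximations: training compute for a dense transformer is roughly 6 FLOPs per parameter per token, and a compute-optimal run in the spirit of Hoffmann et al. uses on the order of 20 tokens per parameter. Both are rough heuristics, and the function name is hypothetical.

```python
def training_flops(parameters: float, tokens: float) -> float:
    """Rough dense-transformer estimate: about 6 FLOPs per parameter per token."""
    return 6 * parameters * tokens

# Rule-of-thumb ratio often quoted from Hoffmann et al.; an assumption, not a law.
TOKENS_PER_PARAMETER = 20

parameters = 8e9
tokens = TOKENS_PER_PARAMETER * parameters
flops = training_flops(parameters, tokens)

print(f"tokens: {tokens:.2e}")
print(f"training FLOPs: {flops:.2e}")
```

Real runs deviate from both numbers, for example by training smaller models on far more tokens to reduce serving cost, so treat the result as an order-of-magnitude estimate.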

What the base model still lacks

A base model can continue text impressively and still be a poor assistant. Given this prompt:

Damaged return request A10234:

it may continue with another plausible request rather than produce a structured eligibility decision. It learned continuation, not your API contract, refusal policy, or evidence requirements.

Stage 2: SFT teaches the interface

Supervised fine-tuning (SFT) continues training on demonstrations: an instruction is paired with the answer the model should produce. The model still predicts tokens, but now the useful continuation is an assistant answer shaped for a task.

For the return app, one demonstration might require a JSON decision with cited evidence rather than free-form customer-support prose.

shape-an-sft-example.py
import json

example = {
    "instruction": "Decide return eligibility for A10234 using policy evidence.",
    "input": {"reason": "damaged", "days_since_delivery": 12},
    "target": {
        "eligible": True,
        "policy_id": "P-7",
        "evidence": "Damaged items reported within 30 days are eligible.",
    },
}

print("instruction:", example["instruction"])
print("target:", json.dumps(example["target"], sort_keys=True))
Output
instruction: Decide return eligibility for A10234 using policy evidence.
target: {"eligible": true, "evidence": "Damaged items reported within 30 days are eligible.", "policy_id": "P-7"}

SFT data should make desired behavior measurable. If the application requires eligible, policy_id, and evidence, a demonstration missing those fields teaches the wrong interface.

audit-sft-target-shapes.py
required_fields = {"eligible", "policy_id", "evidence"}

targets = [
    {"eligible": True, "policy_id": "P-7", "evidence": "within 30 days"},
    {"eligible": True, "evidence": "looks acceptable"},
]

for index, target in enumerate(targets, start=1):
    # Set difference against the dict's keys lists the fields a target lacks.
    missing = sorted(required_fields - target.keys())
    verdict = "keep" if not missing else f"reject, missing {missing}"
    print(f"example {index}: {verdict}")
Output
example 1: keep
example 2: reject, missing ['policy_id']
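Once a demonstration passes the audit, a common way to turn it into a training signal is to compute loss only on the target tokens, so the model is not trained to reproduce the instruction itself. The sketch below illustrates that masking with whitespace "tokens" standing in for a real tokenizer. `build_labels` and the `IGNORE` marker are hypothetical names; training frameworks typically use their own sentinel value for ignored positions.

```python
from typing import Optional

IGNORE = None  # placeholder meaning "no loss at this position"

def build_labels(prompt_tokens: list[str], target_tokens: list[str]) -> list[Optional[str]]:
    """Mask prompt positions so only target tokens contribute to the loss."""
    return [IGNORE] * len(prompt_tokens) + list(target_tokens)

# Whitespace splitting is a stand-in for a real tokenizer.
prompt = "Decide return eligibility for A10234 :".split()
target = '{"eligible": true, "policy_id": "P-7"}'.split()

labels = build_labels(prompt, target)
supervised = [label for label in labels if label is not IGNORE]
print(f"{len(supervised)} of {len(labels)} positions supervised")
```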

InstructGPT is a concrete published example of this progression: supervised demonstrations followed by preference data and reinforcement learning from human feedback (RLHF) improved instruction-following behavior relative to the underlying GPT-3 models in its evaluation setting.[3]

SFT doesn't answer every product question

SFT can teach the model to cite policy. It can't guarantee that P-7 is today's policy. If the return window changes tomorrow, fresh retrieved policy text is usually a better first fix than training a new checkpoint.

Stage 3: post-training chooses better behavior

After SFT, multiple answers may be valid JSON and still differ in usefulness. One cites the exact policy clause; another guesses a refund action. Post-training supplies a signal that favors the better behavior.

Two families matter for this map:

| Signal type | Example method | Useful when |
| --- | --- | --- |
| A reviewer prefers answer A over answer B | RLHF or Direct Preference Optimization (DPO) | Helpfulness, tone, calibrated refusal, concise explanations |
| A program can check success or failure | Reinforcement Learning with Verifiable Rewards (RLVR) | Code tests, math answers, strict structured constraints |

In InstructGPT's RLHF stage, a learned reward model represented human preferences inside a reinforcement-learning process.[3] DPO instead optimizes preference pairs with a classification loss, without fitting a separate reward model or running an RL optimization loop.[4] DeepSeek-R1 is a published example where large-scale reinforcement learning improved measured reasoning on verifiable tasks; its R1-Zero variant applied reinforcement learning without a preceding SFT stage.[5] That is a useful pattern to study, not a default choice for every product.

Figure: Two post-training signal shapes for the A10234 return decision. A relative preference tournament advances answer A, which cites P-7 and avoids claiming a label, over answer B, which invents the action. Beside it, a binary verifier heatmap checks eligibility, policy evidence, and no side effect; A passes all three for reward one, while B fails the final predicate and receives reward zero. Both signals point to a weight update.
Preference data gives a relative signal (`A > B`). A verifier evaluates each observable predicate and combines them into an absolute reward. Both can drive post-training, but only the verifier proves the narrow rules it explicitly checks.

Here is the preference-data idea with ordinary Python objects. This code builds a comparison record; an actual DPO or RLHF pipeline would convert many such records into parameter updates.

record-a-preference-pair.py
pair = {
    "prompt": "Is damaged order A10234 eligible under P-7?",
    "chosen": "Eligible. P-7 covers damage reported within 30 days.",
    "rejected": "Eligible. Your replacement label has already been created.",
    "reason": "chosen cites evidence and avoids claiming an unperformed action",
}

print("chosen:", pair["chosen"])
print("rejected because:", pair["reason"])
Output
chosen: Eligible. P-7 covers damage reported within 30 days.
rejected because: chosen cites evidence and avoids claiming an unperformed action
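To see how one such record could become a number to minimize, the sketch below evaluates the DPO objective for a single pair: the negative log-sigmoid of a scaled margin between how much the policy and a frozen reference model each favor the chosen answer over the rejected one. The log-probabilities and `beta` value are invented for illustration; a real pipeline would obtain them by scoring both answers with both models.

```python
from math import exp, log

def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    sigmoid = 1 / (1 + exp(-margin))
    return -log(sigmoid)

# Hypothetical log-probabilities. Before training, policy and reference agree.
before = dpo_loss(-14.0, -15.0, ref_chosen=-14.0, ref_rejected=-15.0)
# After some updates, the policy favors the chosen answer more than the reference does.
after = dpo_loss(-12.0, -18.0, ref_chosen=-14.0, ref_rejected=-15.0)

print(f"before: {before:.3f}")
print(f"after:  {after:.3f}")
```

The loss falls only when the policy widens the chosen-over-rejected gap relative to the reference, which is how the objective discourages drifting arbitrarily far from the starting checkpoint.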

Verifiable rewards apply only when a checker represents the desired behavior. For this narrow decision, a checker can compare the output with trusted order facts, require policy evidence, and forbid a side effect that the model never executed.

verify-a-policy-decision.py
from typing import TypedDict

class TrustedOrder(TypedDict):
    damage_confirmed: bool
    days_since_delivery: int

TRUSTED_ORDER: TrustedOrder = {"damage_confirmed": True, "days_since_delivery": 12}

def verify_decision(output: dict[str, object], order: TrustedOrder) -> int:
    # Ground truth comes from trusted order facts, never from the model output.
    eligible_under_p7 = (
        order["damage_confirmed"] is True
        and order["days_since_delivery"] <= 30
    )
    # Reward 1 only if the decision, the cited policy, and the side-effect claim all check out.
    passes = (
        output.get("eligible") is eligible_under_p7
        and output.get("policy_id") == "P-7"
        and output.get("label_created") is False
    )
    return int(passes)

good = {"eligible": True, "policy_id": "P-7", "label_created": False}
wrong_eligibility = {"eligible": False, "policy_id": "P-7", "label_created": False}
unsafe = {"eligible": True, "policy_id": "P-7", "label_created": True}

print("checked decision reward:", verify_decision(good, TRUSTED_ORDER))
print("wrong eligibility reward:", verify_decision(wrong_eligibility, TRUSTED_ORDER))
print("invented side-effect reward:", verify_decision(unsafe, TRUSTED_ORDER))
Output
checked decision reward: 1
wrong eligibility reward: 0
invented side-effect reward: 0

That verifier is useful only for the part it checks. Here, damage_confirmed is a trusted fixture; free-form customer text couldn't set it safely. The checker says nothing about empathy, ambiguous damage photos, or whether policy P-7 is fair. A weak checker can train a model toward shortcuts, so every verifier needs adversarial tests and human-reviewed cases.
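A few adversarial tests might look like the sketch below. It restates the same P-7 rule so it runs on its own, then feeds the checker malformed or shortcut outputs that should never earn reward, plus a boundary check on the 30-day window. The specific adversarial cases are illustrative assumptions, not an exhaustive suite.

```python
def verify_decision(output: dict, order: dict) -> int:
    """Same rule as the P-7 checker above, restated so this sketch runs alone."""
    eligible = order["damage_confirmed"] is True and order["days_since_delivery"] <= 30
    return int(
        output.get("eligible") is eligible
        and output.get("policy_id") == "P-7"
        and output.get("label_created") is False
    )

ORDER = {"damage_confirmed": True, "days_since_delivery": 12}

# Hypothetical malformed or shortcut outputs that must never earn reward.
adversarial_outputs = {
    "missing label_created": {"eligible": True, "policy_id": "P-7"},
    "string instead of bool": {"eligible": "true", "policy_id": "P-7", "label_created": False},
    "truthy integer": {"eligible": 1, "policy_id": "P-7", "label_created": False},
    "lowercase policy id": {"eligible": True, "policy_id": "p-7", "label_created": False},
    "empty output": {},
}

for name, output in adversarial_outputs.items():
    print(f"{name}: reward={verify_decision(output, ORDER)}")

# Boundary check: day 30 is inside the window, day 31 is outside.
claim = {"eligible": True, "policy_id": "P-7", "label_created": False}
for days in (30, 31):
    order = {"damage_confirmed": True, "days_since_delivery": days}
    print(f"day {days}: reward={verify_decision(claim, order)}")
```

Every adversarial case should print reward 0; any that prints 1 marks a shortcut the checker would teach.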

Stage 4: inference runs frozen weights

Once a checkpoint is deployed, inference is the work of generating responses for requests. Normal inference reads weights; it doesn't update them. Behavior can still change through retrieved evidence, prompt instructions, tool permissions, or decoding and serving settings.

The return assistant needs today's policy, so the application retrieves P-7 at request time and combines it with trusted order facts. The miniature function below stands in for the model boundary: it demonstrates that new evidence changes the served answer while the checkpoint identifier stays fixed.

change-context-not-weights.py
CHECKPOINT = "returns-assistant-v3"

def answer_from_context(policy_text: str, *, days_since_delivery: int) -> str:
    # Stand-in for the model boundary: the answer depends only on the supplied evidence.
    if (
        "damaged items within 30 days are eligible" in policy_text.lower()
        and days_since_delivery <= 30
    ):
        return "A10234: eligible under P-7"
    return "A10234: escalate for policy review"

old_policy = "P-7: damaged items within 7 days are eligible."
current_policy = "P-7: damaged items within 30 days are eligible."

print("checkpoint:", CHECKPOINT)
print("old retrieval:", answer_from_context(old_policy, days_since_delivery=12))
print("current retrieval:", answer_from_context(current_policy, days_since_delivery=12))
Output
checkpoint: returns-assistant-v3
old retrieval: A10234: escalate for policy review
current retrieval: A10234: eligible under P-7

Retrieval-Augmented Generation (RAG) was introduced as a model architecture that conditions generation on passages retrieved from an external knowledge source.[6] In products, the practical rule is simple: facts that change often should arrive as request-time evidence, not stale parametric memory.
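A minimal sketch of request-time evidence is shown below, assuming policy documents are stored with an effective date and the application selects the newest version in force. The store layout, dates, and function names are hypothetical; a production system would query a versioned index rather than a Python list.

```python
from datetime import date

# Hypothetical versioned policy store; the effective dates are invented.
POLICY_VERSIONS = [
    {"policy_id": "P-7", "effective": date(2024, 1, 1),
     "text": "P-7: damaged items within 7 days are eligible."},
    {"policy_id": "P-7", "effective": date(2025, 6, 1),
     "text": "P-7: damaged items within 30 days are eligible."},
]

def retrieve_policy(policy_id: str, as_of: date) -> dict:
    """Return the newest version of a policy already in force on `as_of`."""
    candidates = [
        doc for doc in POLICY_VERSIONS
        if doc["policy_id"] == policy_id and doc["effective"] <= as_of
    ]
    if not candidates:
        raise LookupError(f"no version of {policy_id} in force on {as_of}")
    return max(candidates, key=lambda doc: doc["effective"])

def build_prompt(policy: dict, order_facts: dict) -> str:
    """Place retrieved evidence and trusted facts into the request text."""
    return (
        f"Policy evidence ({policy['effective']}): {policy['text']}\n"
        f"Trusted order facts: {order_facts}\n"
        "Decide eligibility and cite the policy id."
    )

policy = retrieve_policy("P-7", as_of=date(2025, 7, 1))
print(build_prompt(policy, {"order_id": "A10234", "days_since_delivery": 12}))
```

Recording the selected effective date in the trace makes a stale-policy failure easy to spot later.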

Memory and latency become visible

Serving is constrained by GPU memory and latency. A first estimate for the weights alone is: weight bytes = parameter count × bytes per parameter.

An eight-billion-parameter checkpoint stored at two bytes per parameter needs about 14.9 GiB for weights alone. It still needs memory for the key-value (KV) cache, runtime buffers, and concurrent requests.

estimate-weight-memory.py
def gib_for_weights(parameters_billions: float, bytes_per_parameter: float) -> float:
    total_bytes = parameters_billions * 1_000_000_000 * bytes_per_parameter
    return total_bytes / (1024 ** 3)

bf16_weights = gib_for_weights(8, 2)
four_bit_weights = gib_for_weights(8, 0.5)

print(f"8B at 2 bytes/parameter: {bf16_weights:.1f} GiB, weights only")
print(f"8B at 0.5 bytes/parameter: {four_bit_weights:.1f} GiB, weights only")
print("KV cache and runtime buffers still require additional memory.")
Output
8B at 2 bytes/parameter: 14.9 GiB, weights only
8B at 0.5 bytes/parameter: 3.7 GiB, weights only
KV cache and runtime buffers still require additional memory.

The four-bit result is an ideal packed-weight estimate, not a deployment promise. Real quantized formats can add metadata and higher-precision components. Measure the loaded model footprint before choosing hardware.
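The KV cache can be estimated the same way. The sketch below assumes a standard attention layout in which each layer stores one key vector and one value vector per KV head for every cached token. The layer count, head count, and head dimension are illustrative values for an 8B-class model, not a specification of any particular checkpoint.

```python
def kv_cache_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """Bytes cached per token: keys plus values, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Illustrative configuration (an assumption, not a real model card).
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

context_tokens = 8192
concurrent_requests = 4
total_gib = per_token * context_tokens * concurrent_requests / (1024 ** 3)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"{concurrent_requests} requests x {context_tokens} tokens: {total_gib:.1f} GiB")
```

Under these assumptions the cache grows linearly with both context length and concurrency, so a few long requests can rival the quantized weight footprint.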

More context can also be a quality issue, not only a memory issue. Liu et al. observed in their long-context evaluations that models could perform worse when relevant information appeared in the middle of long inputs.[7] Dumping every policy revision into one huge prompt isn't a substitute for retrieval and evaluation.

Evaluation closes both loops

The model-building team evaluates checkpoints before promotion. The product team evaluates prompts, retrieval, serving, and releases after integration. Both need cases that look like real work.

For A10234, a golden case isn't "does the answer sound friendly?" It includes trusted order facts, a policy fact, a required decision, required evidence, and forbidden claims.

define-a-golden-case.py
case = {
    "order_id": "A10234",
    "damage_confirmed": True,
    "days_since_delivery": 12,
    "expected_decision": "eligible",
    "required_policy_id": "P-7",
}

candidate = {
    "decision": "eligible",
    "policy_id": "P-7",
    "label_created": False,
}
passes = (
    case["damage_confirmed"] is True
    and case["days_since_delivery"] <= 30
    and candidate["decision"] == case["expected_decision"]
    and candidate["policy_id"] == case["required_policy_id"]
    and candidate["label_created"] is False
)

print("golden case passes:", passes)
Output
golden case passes: True

Open-ended response quality often needs a rubric rather than exact string matching. LLM judges can help scale rubric-based comparisons, and Zheng et al. studied their agreement and biases in MT-Bench and Chatbot Arena.[8] They aren't a substitute for calibrated human-reviewed cases, especially for policy and safety boundaries.
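One bias that is cheap to guard against is position bias, where a judge favors whichever answer it sees first. The sketch below asks the judge twice with the answers swapped and accepts a verdict only when both orderings agree. The two judge functions are deliberately simple stubs standing in for a real model call, and every name here is hypothetical.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"

def consistent_preference(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Accept a verdict only if it survives swapping the presentation order."""
    forward = judge(answer_a, answer_b)
    backward = judge(answer_b, answer_a)
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    return winner_forward if winner_forward == winner_backward else "human review"

def position_biased_judge(first: str, second: str) -> str:
    """Stub judge that always prefers the answer shown first."""
    return "first"

def evidence_judge(first: str, second: str) -> str:
    """Stub judge that prefers the answer citing P-7."""
    return "first" if "P-7" in first else "second"

a = "Eligible. P-7 covers damage reported within 30 days."
b = "Eligible. Your replacement label has already been created."

print("biased judge:", consistent_preference(position_biased_judge, a, b))
print("evidence judge:", consistent_preference(evidence_judge, a, b))
```

Disagreements between the two orderings are natural candidates for the human-review queue.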

This small evaluator mixes deterministic policy checks with a human-review queue for answers that pass the rule but may still be confusing.

score-return-evals.py
# Each case: order_id, days_since_delivery, damage_confirmed,
# actual_decision, policy_id, label_created, expected_pass
cases = [
    ("A10234", 12, True, "eligible", "P-7", False, True),
    ("A10235", 45, True, "not_eligible", "P-7", False, True),
    ("A10236", 12, True, "eligible", "P-7", True, False),
    ("A10237", 45, True, "eligible", "P-7", False, False),
    ("A10238", 12, True, "eligible", "P-3", False, False),
]

passed = 0
for (
    order_id,
    days_since_delivery,
    damage_confirmed,
    actual_decision,
    policy_id,
    label_created,
    expected_pass,
) in cases:
    expected_decision = "eligible" if damage_confirmed and days_since_delivery <= 30 else "not_eligible"
    output = {
        "decision": actual_decision,
        "policy_id": policy_id,
        "label_created": label_created,
    }
    checks_pass = (
        output["decision"] == expected_decision
        and output["policy_id"] == "P-7"
        and output["label_created"] is False
    )
    # Count a case when the checker's verdict matches the labeled expectation.
    passed += int(checks_pass == expected_pass)
    print(order_id, "checks_pass=", checks_pass, "expected=", expected_pass)

print(f"deterministic checks: {passed}/{len(cases)}")
print("human review remains required for clarity and ambiguous evidence")
Output
A10234 checks_pass= True expected= True
A10235 checks_pass= True expected= True
A10236 checks_pass= False expected= False
A10237 checks_pass= False expected= False
A10238 checks_pass= False expected= False
deterministic checks: 5/5
human review remains required for clarity and ambiguous evidence

A failure should choose an owner

When an eval fails, "fine-tune the model" is rarely a complete diagnosis. First ask what must change.

Figure: Failure router for trace A10234. Five symptoms fan out along colored paths: stale P-7 goes to retrieval, missing policy_id goes to contract or SFT, invented label goes to behavior plus a gate, slow first-token latency goes to serving, and release regression goes to rollback. Bottom rule says fix context, contracts, serving, or release first, then update weights only after repeated failures with correct context.
Start from the trace, not a preferred solution. Fix current evidence, contracts, serving, or release control at the owning layer; update weights only when repeated failures remain after the model receives the right context.
route-a-lifecycle-failure.py
def owner_for_failure(symptom: str) -> str:
    symptom = symptom.lower()
    if "stale policy" in symptom:
        return "retrieval index"
    if "missing policy_id" in symptom:
        return "prompt contract or SFT data"
    if "promises a label" in symptom:
        return "preference data and safety eval"
    if "slow first token" in symptom:
        return "inference serving"
    if "release regression" in symptom:
        return "evaluation gate and rollback"
    return "investigate with a labeled trace"

failures = [
    "A10234 used stale policy P-7",
    "Response missing policy_id",
    "Response promises a label without tool execution",
    "Slow first token after traffic spike",
]

for failure in failures:
    print(owner_for_failure(failure))
Output
retrieval index
prompt contract or SFT data
preference data and safety eval
inference serving

A release gate can block a known regression before the long-term fix ships.

gate-the-release.py
candidate = {
    "policy_cases_passed": 19,
    "policy_cases_total": 20,
    "unsafe_side_effect_claims": 1,
    "p95_ttft_ms": 920,
}

reasons: list[str] = []
if candidate["policy_cases_passed"] != candidate["policy_cases_total"]:
    reasons.append("policy regression")
if candidate["unsafe_side_effect_claims"] > 0:
    reasons.append("invented side effect")
if candidate["p95_ttft_ms"] > 1000:
    reasons.append("latency budget exceeded")

decision = "BLOCK" if reasons else "SHIP"
print(decision, "-", ", ".join(reasons) if reasons else "all gates pass")
Output
BLOCK - policy regression, invented side effect
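The `p95_ttft_ms` value in that gate has to come from measured traces. As a minimal sketch, the function below computes a nearest-rank percentile over time-to-first-token samples; the samples are invented, and a monitoring system may use a different percentile definition or interpolation.

```python
def percentile_nearest_rank(samples: list[int], percentile: int) -> int:
    """Nearest-rank percentile: the smallest sample covering `percentile`% of the data."""
    ordered = sorted(samples)
    rank = (percentile * len(ordered) + 99) // 100  # integer ceiling of p/100 * n
    return ordered[max(rank, 1) - 1]

# Hypothetical time-to-first-token samples in milliseconds from 20 traces.
ttft_ms = [
    310, 280, 295, 1200, 330, 305, 290, 920, 315, 300,
    285, 325, 340, 298, 302, 311, 289, 307, 318, 296,
]

p50 = percentile_nearest_rank(ttft_ms, 50)
p95 = percentile_nearest_rank(ttft_ms, 95)
print(f"p50={p50} ms, p95={p95} ms, max={max(ttft_ms)} ms")
```

Here the median looks healthy while the tail does not, which is why latency gates are usually set on a high percentile rather than the mean.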

Traces turn failures into concrete work items. A stored trace can already tell you whether to refresh evidence, revise the prompt, add preference examples, or optimize serving.

turn-a-trace-into-an-action.py
trace = {
    "order_id": "A10234",
    "checkpoint": "returns-assistant-v3",
    "policy_version": "P-7-2025",
    "failure": "stale policy",
}

action_by_failure = {
    "stale policy": "refresh retrieval document and rerun policy evals",
    "invalid schema": "tighten output contract and rerun format evals",
    "slow first token": "profile queueing and KV-cache pressure",
}

print("checkpoint remains:", trace["checkpoint"])
print("next action:", action_by_failure[trace["failure"]])
Output
checkpoint remains: returns-assistant-v3
next action: refresh retrieval document and rerun policy evals

Keep the boundaries straight

The lifecycle gets simpler once you ask a precise question about each symptom.

| Observation in the return assistant | First place to look | Why |
| --- | --- | --- |
| Base model continues a request rather than answering | SFT or choose an instruction checkpoint | Desired assistant format wasn't trained or selected |
| Response is valid but repeatedly overpromises actions | Preference/post-training data and eval rubric | Better behavior must be preferred and checked |
| Answer cites an old return window | Retrieval index and document versioning | Current fact is missing from request context |
| JSON fields disappear after a prompt release | Prompt/schema contract and regression eval | Product interface changed without weight updates |
| Time to first token spikes under load | Inference serving and capacity | Model may be fine while queue or memory is overloaded |
| New checkpoint passes style grading but fails P-7 | Evaluation gate | A pleasant answer can still violate policy |

What not to conclude

  • Preference tuning isn't a reliable database update. It may shape how a model responds, but current policy facts belong in retrievable evidence.
  • A verifier isn't automatically a truth oracle. It rewards only rules it can observe.
  • Long context isn't guaranteed recall. Relevant evidence still needs selection and testing.
  • A product incident isn't automatically a training incident. Most deployed fixes start with evidence, contracts, gates, or serving.

A lab-sized lifecycle artifact

After completing this chapter, you can turn the snippets into a small repository artifact:

| File | Purpose |
| --- | --- |
| data/sft_examples.jsonl | Demonstrations that require eligible, policy_id, and evidence |
| data/preference_pairs.jsonl | Chosen and rejected outputs for unsupported action claims |
| evals/policy_cases.jsonl | Held-out P-7 cases with deterministic constraints |
| evaluate.py | Policy verifier and release-gate report |
| trace_router.py | Maps failed traces to retrieval, contract, post-training, or serving work |
| README.md | Explains which fixes update weights and which leave the checkpoint frozen |

It is a small artifact that builds the same habit: inspect the failure, then choose the smallest intervention backed by evidence instead of reflexively retraining.
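As a starting point for evaluate.py, the sketch below reads held-out cases in JSON Lines form and derives the expected P-7 decision for each. An in-memory buffer stands in for evals/policy_cases.jsonl so the example runs without files; the field names follow the earlier snippets, and the helper names are hypothetical.

```python
import io
import json

# In-memory stand-in for evals/policy_cases.jsonl: one JSON object per line.
policy_cases_jsonl = io.StringIO(
    '{"order_id": "A10234", "days_since_delivery": 12, "damage_confirmed": true}\n'
    '{"order_id": "A10235", "days_since_delivery": 45, "damage_confirmed": true}\n'
)

def load_cases(handle) -> list[dict]:
    """Parse one case per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

def expected_decision(case: dict) -> str:
    """Deterministic P-7 rule: confirmed damage reported within 30 days."""
    within_window = case["days_since_delivery"] <= 30
    return "eligible" if case["damage_confirmed"] and within_window else "not_eligible"

cases = load_cases(policy_cases_jsonl)
report = {case["order_id"]: expected_decision(case) for case in cases}
print(report)
```

Swapping the buffer for `open("evals/policy_cases.jsonl")` would be the natural next step once the file exists.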

Mastery check

Key concepts

  • Pre-training predicts tokens and changes model weights.
  • SFT teaches instruction behavior from demonstrations.
  • Preference methods and verifiable rewards provide different post-training signals.
  • Retrieval supplies current evidence without changing model weights.
  • Inference exposes memory, latency, throughput, and reliability constraints.
  • Evaluation connects production failures to the layer that owns the fix.

Evaluation rubric

  • Foundational: Explains why a base next-token predictor may continue a return request instead of answering it.
  • Foundational: Marks pre-training, SFT, and post-training as weight-updating stages, while separating retrieval and inference.
  • Intermediate: Chooses SFT, preference data, a verifier, or retrieval for a concrete A10234 failure and states why.
  • Intermediate: Calculates weight-only memory for a checkpoint and names the omitted serving memory.
  • Intermediate: Designs a held-out policy case and a release gate that catches unsupported side-effect claims.

Common pitfalls

Treating post-training as fact storage

Symptom: A team creates preference pairs to teach tomorrow's policy revision.

Cause: It confuses preferred behavior with current evidence.

Fix: Retrieve versioned policy text at request time and evaluate grounded decisions against it.

Rewarding a shortcut

Symptom: Verifier scores rise while answers become unhelpful or game a formatting rule.

Cause: Programmatic reward observes only a narrow proxy.

Fix: Add adversarial verifier tests, held-out cases, and human-reviewed rubric samples.

Shipping without a product invariant

Symptom: Assistant produces fluent output that claims a refund or label was executed.

Cause: Eval set checks wording but not side effects.

Fix: Gate release on deterministic constraints tied to recorded tool actions and policy evidence.

Blaming weights for serving failures

Symptom: Latency spike triggers a request for another tuning run.

Cause: Team hasn't separated inference capacity from model behavior.

Fix: Inspect queue time, time to first token, KV-cache pressure, and trace errors before considering training.

Trusting a long prompt to retrieve itself

Symptom: Correct policy is somewhere in a large context, but answer cites the wrong rule.

Cause: Context availability isn't the same as reliable use of evidence.

Fix: Retrieve focused policy passages, place evidence clearly, and evaluate citation accuracy on held-out cases.

Complete the lesson

Mastery Check


1. During pre-training, the probability assigned to an observed next token rises from 0.20 to 0.80. Using loss = -log(p), which conclusion is correct?
2. An engineering log lists continued pre-training, SFT on JSON demonstrations, DPO on preference pairs, and a refresh of the P-7 retrieval index. Which grouping correctly identifies the weight updates?
3. You are preparing training data for the return assistant. One dataset teaches the model to output eligible, policy_id, and evidence. A later dataset compares an evidence-citing answer against one that falsely says a label was created. Which mapping is correct?
4. For which return-assistant task is a verifiable reward most appropriate?
5. A trace for A10234 shows checkpoint returns-assistant-v3, policy_version P-7-2024, and an answer that escalated the return. The current P-7 document says damaged items within 30 days are eligible, and the order is 12 days old. What is the first fix to try?
6. An 8B-parameter checkpoint is stored at 2 bytes per parameter. The weight-only estimate is about 14.9 GiB. What does that number tell you about serving capacity?
7. A candidate release passes tone judgments and most policy checks, but one output says a return label was created even though no label tool ran. Which release gate should block it?
8. A return assistant is scored only by exact string match against one friendly reference answer. Which design tests both policy correctness and open-ended quality?


Next Step
Continue to Linear Regression from Scratch

You now know where training sits inside an LLM product, but training has stayed abstract. Next you will fit the smallest useful model yourself, so loss, gradients, validation, and failure diagnosis become concrete before later large-model training chapters.

References

[1] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.

[2] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.

[3] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.

[4] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

[5] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

[6] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

[7] Liu, N.F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL 2023.

[8] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.