LeetLLM

Instruction Tuning & Chat Templates

Teach a base language model to answer as an assistant: curate grounded SFT rows, serialize chat turns exactly, choose loss targets, pack safely, and detect serving-time template drift.


In the previous lesson, you built evaluations for a policy-answering assistant. One private case asks about a damaged electronics delivery, with this evidence:

Policy evidence: Damaged electronics must be reported within 48 hours with photos.

A candidate model may retrieve the right clause and still respond badly. A base language model predicts plausible next tokens; it isn't reliably trained to interpret a customer turn, answer as support, cite the policy, and stop. This chapter shows how supervised fine-tuning (SFT) teaches that response behavior, and why the same chat template must survive from data preparation to serving.

[Figure: Policy support instruction tuning and template parity workflow]
Instruction tuning rewards the assistant continuations you want. Serving has to reproduce the role-token pattern that made those continuations learnable.

One example defines a behavior contract

A pretrained causal language model has one basic mechanism: given preceding tokens, assign probabilities to the next token. It can sometimes answer a question because conversations appeared in pretraining data, but answering isn't yet a dependable product contract.

For ShopFlow support, one SFT training row can look like this:

Role | Content | What this turn contributes
--- | --- | ---
system | Answer from the supplied policy evidence. | Sets the behavior boundary.
user | My electronics arrived damaged. What should I do? | Gives the customer's request.
assistant | Report the damage within 48 hours and attach photos. | Supplies the continuation to reward.

During SFT, the training objective increases the probability of target response tokens conditioned on the turns that came before them. In InstructGPT, supervised demonstrations were the first post-training stage before preference feedback was used to refine response quality.[1]
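Written as an equation, "increasing the probability of target tokens" is usually a token-level negative log-likelihood. The notation below is an assumption introduced for illustration, not taken from the cited paper: $x$ is the serialized context (for example the system and user turns), $y_1, \dots, y_T$ are the target response tokens, and $\theta$ are the model parameters.

```latex
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```

Minimizing this loss raises the probability of each target token given everything that precedes it. Which tokens belong to the target set is a separate design choice, covered in the section on loss targets.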

[Figure: Grounded SFT row (48 hours + photos), chat template role-token stream, optimize chosen target tokens, and serve + evaluate with the same token contract.]

SFT isn't a guarantee that the model only changes style or formatting. Fine-tuning can alter factual behavior and task competence too. LIMA provides a useful, narrower result: with a capable base model, a carefully curated set of 1,000 demonstrations produced strong response-format and instruction-following behavior in its experiments.[2] Treat that as evidence for data quality, not a promise that every task needs little data.

Chat messages eventually become one token stream

Applications store a conversation as structured records. The language model receives tokens. A chat template is the serialization rule between those two representations: it adds role markers, delimiters, whitespace, and, when appropriate, the cue that a new assistant response should begin.[3]

[Figure: Policy support messages serialized into one token stream]
A template isn't UI formatting. It determines the exact token context on which both fine-tuning and generation operate.

Model families don't share one universal chat serialization:

Checkpoint family example | Shape of its turn markers | Engineering consequence
--- | --- | ---
ChatML-style | <|im_start|>user ... <|im_end|> | An assistant-start marker can indicate a fresh generation turn.
Llama 3 Instruct | Header tokens followed by <|eot_id|> | Use the tokenizer's shipped headers and end-of-turn marker.[4][5]
Mistral-7B-Instruct-v0.1 | <s>[INST] ... [/INST] ... </s> | Spacing and turn layout are part of the tokenizer contract.[3][6]

Those strings are examples of checkpoint-specific formats, not a menu of interchangeable wrappers. Feeding [INST] formatting to a checkpoint trained with another role-token layout may still produce text, but you have changed its input distribution.
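A small sketch can make the "changed input distribution" point concrete. Both renderers here are hypothetical simplifications written for this illustration; they do not reproduce any checkpoint's shipped template exactly, and a real project should use the tokenizer's own template.

```python
# Hypothetical, simplified renderers for illustration only.
# They are NOT the exact templates of any real checkpoint.
def render_chatml_like(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def render_inst_like(messages):
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        else:
            parts.append(f" {m['content']}")
    return "".join(parts)


messages = [
    {"role": "user", "content": "My electronics arrived damaged. What should I do?"},
]

chatml_text = render_chatml_like(messages)
inst_text = render_inst_like(messages)

# The customer's words are identical, but the model would see different contexts.
print(repr(chatml_text))
print(repr(inst_text))
print("same_context=", chatml_text == inst_text)
```

The message content is the same in both strings; only the wrapper differs, and the wrapper is exactly what the fine-tuned checkpoint learned to condition on.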

Render training, fresh generation, and prefill separately

The smallest useful template exercise is to render the same support task in three modes:

  1. A completed training transcript contains the known assistant answer.
  2. A new inference request ends at an assistant-start cue.
  3. A prefill request already contains the beginning of an assistant answer and asks the model to continue it.

The last two modes aren't the same. Starting a new assistant turn and continuing an existing assistant message should never happen at once.

render-chat-contract.py
START = "<|im_start|>"
END = "<|im_end|>"

def render(messages, *, add_generation_prompt=False, continue_final_message=False):
    if add_generation_prompt and continue_final_message:
        raise ValueError("choose a new assistant turn or a prefill, not both")

    pieces = []
    for index, message in enumerate(messages):
        role = message["role"]
        content = message["content"]
        is_prefill = (
            continue_final_message
            and index == len(messages) - 1
            and role == "assistant"
        )
        pieces.append(f"{START}{role}\n{content}")
        if not is_prefill:
            pieces.append(f"{END}\n")

    if add_generation_prompt:
        pieces.append(f"{START}assistant\n")
    return "".join(pieces)

context = [
    {"role": "system", "content": "Use only the supplied policy evidence."},
    {"role": "user", "content": "My electronics arrived damaged. What should I do?"},
]
answer = {
    "role": "assistant",
    "content": "Report the damage within 48 hours and attach photos.",
}

training_text = render(context + [answer])
generation_text = render(context, add_generation_prompt=True)
prefill_text = render(
    context + [{"role": "assistant", "content": '{"deadline_hours": '}],
    continue_final_message=True,
)

print("training_has_answer=", "48 hours" in training_text)
print("generation_ends_at_assistant=", generation_text.endswith(f"{START}assistant\n"))
print("prefill_is_open=", prefill_text.endswith('{"deadline_hours": '))

try:
    render(context, add_generation_prompt=True, continue_final_message=True)
except ValueError as error:
    print("invalid_mode_caught=", str(error))
Output
training_has_answer= True
generation_ends_at_assistant= True
prefill_is_open= True
invalid_mode_caught= choose a new assistant turn or a prefill, not both

Hugging Face tokenizers implement this idea with apply_chat_template. For generation, add_generation_prompt=True appends a start-of-assistant sequence only when that particular template defines one. For a response prefill, continue_final_message=True keeps the final assistant content open, and the documentation treats combining both flags as an error.[3]

use-the-checkpoint-tokenizer.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "My electronics arrived damaged. What should I do?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

This snippet intentionally uses the tokenizer attached to the checkpoint rather than recreating the format with string concatenation. If you render text with tokenize=False and tokenize it in a later step, pass add_special_tokens=False; otherwise BOS or EOS tokens may be inserted twice.[3]
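The double-insertion risk is easy to reproduce without downloading a model. The sketch below uses a toy renderer and a toy encoder, both hypothetical stand-ins written for this example: the renderer already emits a BOS marker, and the encoder prepends one more whenever `add_special_tokens` is left at its default.

```python
BOS = "<s>"


def toy_render(messages):
    # Hypothetical template that already emits BOS, as many chat templates do.
    body = "".join(f"[{m['role']}] {m['content']}\n" for m in messages)
    return BOS + body


def toy_encode(text, add_special_tokens=True):
    # Hypothetical encoder: whitespace split, plus an automatic BOS
    # when add_special_tokens is left on.
    tokens = text.replace(BOS, f" {BOS} ").split()
    return ([BOS] if add_special_tokens else []) + tokens


messages = [{"role": "user", "content": "My electronics arrived damaged."}]
rendered = toy_render(messages)

doubled = toy_encode(rendered)                          # default adds a second BOS
clean = toy_encode(rendered, add_special_tokens=False)  # template's BOS only

print("bos_count_default=", doubled.count(BOS))
print("bos_count_fixed=", clean.count(BOS))
```

Counting special tokens after the final tokenization step, as done here, is a cheap check to run against a real tokenizer as well.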

Before rendering thousands of rows, reject malformed dialogue structure. An assistant reply without a preceding user request is a bad supervised example even if its prose is fluent.

validate-conversation-roles.py
rows = {
    "valid": ["system", "user", "assistant"],
    "missing_answer": ["system", "user"],
    "double_user": ["system", "user", "user", "assistant"],
}

def role_errors(roles):
    turns = roles[1:] if roles and roles[0] == "system" else roles
    expected = "user"
    for role in turns:
        if role != expected:
            return [f"expected {expected}, got {role}"]
        expected = "assistant" if expected == "user" else "user"
    if expected == "assistant":
        return ["conversation ends before assistant answer"]
    return []

for row_id, roles in rows.items():
    errors = role_errors(roles)
    print(f"{row_id}: {'accept' if not errors else 'reject'} {errors}")

assert role_errors(rows["valid"]) == []
assert role_errors(rows["missing_answer"]) == ["conversation ends before assistant answer"]
Output
valid: accept []
missing_answer: reject ['conversation ends before assistant answer']
double_user: reject ['expected assistant, got user']

Fine-tuning data must teach supported answers

Good formatting can't rescue bad supervision. If a training row claims that damaged electronics have a 30-day return window when the policy says 48 hours with photos, SFT reinforces an unsupported answer.

Your first data design decision is not "How many rows can I generate?" It is "What behavior is each accepted row allowed to teach?"

Data source | Useful for | Risk to test before training
--- | --- | ---
Reviewed support examples | Policy-critical responses and refusal behavior | Sparse coverage of unusual tickets
Synthetic variations from approved seeds | Paraphrases, edge cases, tone variation | Unsupported policy claims or near-duplicates
Multi-turn transcripts | Follow-ups such as "Where do I upload photos?" | Old context, personal data, or unhelpful agent habits

Self-Instruct demonstrated a repeatable way to expand instruction data: start from human-written tasks, generate new instructions and responses, then filter invalid or similar rows before fine-tuning.[7] A product team can use the same pattern with policy questions, but it must validate answers against source clauses rather than trusting a generator's confidence.

[Figure: Grounded synthetic instruction data filtering workflow]
Synthetic expansion is a candidate-row factory. Filtering and review decide which behavior becomes training signal.
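The "filter similar rows" step can be sketched with a very simple similarity score. This example swaps in word-overlap (Jaccard) similarity and a hypothetical 0.7 cutoff purely for illustration; it is not the similarity measure or threshold of any particular paper or pipeline, and the candidate questions are invented.

```python
def words(text):
    # Crude normalization: lowercase and drop sentence punctuation.
    return set(text.lower().replace("?", "").replace(".", "").split())


def jaccard(a, b):
    wa, wb = words(a), words(b)
    if not (wa | wb):
        return 0.0
    return len(wa & wb) / len(wa | wb)


THRESHOLD = 0.7  # hypothetical cutoff; tune on reviewed examples

seed_pool = ["My electronics arrived damaged. What should I do?"]
candidates = [
    "My electronics arrived damaged. What should I do now?",
    "The carrier scan says delivered but nothing arrived.",
]

kept = list(seed_pool)
for candidate in candidates:
    best = max(jaccard(candidate, existing) for existing in kept)
    decision = "drop" if best >= THRESHOLD else "keep"
    print(f"{decision} similarity={best:.2f} {candidate!r}")
    if decision == "keep":
        kept.append(candidate)

print("pool_size=", len(kept))
```

A near-duplicate filter only controls diversity. It says nothing about whether an answer is supported by policy, so it complements grounding checks rather than replacing them.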

The next lab treats each assistant answer as a candidate SFT row. A row is accepted only if it contains required policy facts and omits a known unsupported claim.

filter-grounded-sft-rows.py
policies = {
    "damaged_electronics": {
        "required": ("48 hours", "photos"),
        "forbidden": ("30 days",),
    },
    "late_delivery": {
        "required": ("carrier scan", "support ticket"),
        "forbidden": ("automatic refund",),
    },
}

candidate_rows = [
    {
        "id": "row-001",
        "policy": "damaged_electronics",
        "answer": "Report the damage within 48 hours and attach photos.",
    },
    {
        "id": "row-002",
        "policy": "damaged_electronics",
        "answer": "Return any damaged electronics within 30 days.",
    },
    {
        "id": "row-003",
        "policy": "late_delivery",
        "answer": "Share the carrier scan in a support ticket so we can investigate.",
    },
]

def evaluate(row):
    text = row["answer"].lower()
    rule = policies[row["policy"]]
    has_required_facts = all(fact in text for fact in rule["required"])
    contains_forbidden_claim = any(claim in text for claim in rule["forbidden"])
    return has_required_facts and not contains_forbidden_claim

accepted = []
for row in candidate_rows:
    decision = "accept" if evaluate(row) else "reject"
    print(f"{row['id']}: {decision}")
    if decision == "accept":
        accepted.append(row["id"])

print("accepted_rows=", accepted)
assert accepted == ["row-001", "row-003"]
Output
row-001: accept
row-002: reject
row-003: accept
accepted_rows= ['row-001', 'row-003']

This filter is deliberately small. A real data pipeline also checks duplicated instructions, personal information, unsafe replies, role ordering, length limits, and human-review requirements for high-risk cases. Version the evidence snapshot, generator prompt, filters, and accepted dataset together. Otherwise you won't know which change caused a behavior regression.
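One lightweight way to version those pieces together is to fingerprint each one and then fingerprint the combination. The following is a minimal sketch under stated assumptions: the `fingerprint` helper, the field names, and the generator prompt text are all hypothetical, and a real pipeline would likely hash files or dataset snapshots rather than small in-memory objects.

```python
import hashlib
import json


def fingerprint(obj):
    # Canonical JSON so that key order does not change the hash.
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical inputs standing in for real artifacts.
evidence = {
    "damaged_electronics": "Damaged electronics must be reported within 48 hours with photos."
}
generator_prompt = "Paraphrase the customer question without changing the policy facts."
filters = {"required": ["48 hours", "photos"], "forbidden": ["30 days"]}
accepted_ids = ["row-001", "row-003"]

manifest = {
    "evidence": fingerprint(evidence),
    "generator_prompt": fingerprint(generator_prompt),
    "filters": fingerprint(filters),
    "accepted_rows": fingerprint(accepted_ids),
}
# One identifier that changes whenever any component changes.
manifest["dataset_version"] = fingerprint(manifest)
print(manifest)

# Editing a filter produces a different component fingerprint.
changed_filters = dict(filters, forbidden=["30 days", "automatic refund"])
print("filters_changed=", fingerprint(changed_filters) != manifest["filters"])
```

Storing this manifest next to the training run lets you compare two runs component by component when behavior regresses.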

An accepted dataset also needs coverage. Rows for damage and delivery delay don't demonstrate how the assistant should answer a sealed-return question.

report-sft-coverage-gaps.py
required_intents = {
    "damaged_electronics",
    "late_delivery",
    "sealed_return",
}
accepted_rows = [
    {"id": "row-001", "intent": "damaged_electronics"},
    {"id": "row-003", "intent": "late_delivery"},
    {"id": "row-004", "intent": "damaged_electronics"},
]

covered = {row["intent"] for row in accepted_rows}
missing = sorted(required_intents - covered)
counts = {
    intent: sum(row["intent"] == intent for row in accepted_rows)
    for intent in sorted(required_intents)
}

print("accepted_counts=", counts)
print("missing_critical_intents=", missing)
print("ready_for_training=", not missing)

assert missing == ["sealed_return"]
Output
accepted_counts= {'damaged_electronics': 2, 'late_delivery': 1, 'sealed_return': 0}
missing_critical_intents= ['sealed_return']
ready_for_training= False

Which tokens should produce gradient?

Every completed chat transcript contains tokens from the system prompt, customer question, and assistant answer. You must decide which of those tokens are supervised targets.

Two choices matter:

Objective choice | Target tokens | When it can be reasonable
--- | --- | ---
Full-sequence causal loss | All non-padding transcript tokens | You intentionally train the model on the whole conversation distribution.
Assistant-only loss | Assistant response spans, usually including their end-of-turn markers | You want the gradient budget focused on response behavior rather than reproducing prompt text.

Assistant-only loss is common, but it isn't the definition of SFT. TRL exposes it as assistant_only_loss=True for conversational data only when the chat template can identify assistant spans through generation markers. Current TRL releases automatically patch templates for some bundled model families; inspect the resolved template for the checkpoint you train.[8] If the span mask is wrong, you may silently train on customer text or mask out the answer you meant to learn.

In a causal language model, each target token is predicted from the tokens that precede it. The tiny preprocessing lab below marks assistant words and the assistant turn terminator as targets; role markers, system text, and user text receive the usual ignore label -100.

build-assistant-only-labels.py
conversation = [
    ("system", "Use supplied policy evidence"),
    ("user", "Damaged electronics arrived"),
    ("assistant", "Report within 48 hours with photos"),
]

tokens = []
labels = []

for role, text in conversation:
    tokens.append(f"<{role}>")
    labels.append("-100")
    for word in text.split():
        tokens.append(word)
        labels.append(word if role == "assistant" else "-100")
    tokens.append("<eot>")
    labels.append("<eot>" if role == "assistant" else "-100")

trained_targets = [label for label in labels if label != "-100"]
masked_tokens = sum(label == "-100" for label in labels)

print("tokens=", tokens)
print("trained_targets=", trained_targets)
print("masked_tokens=", masked_tokens)

assert trained_targets == ["Report", "within", "48", "hours", "with", "photos", "<eot>"]
assert labels[tokens.index("<user>")] == "-100"
Output
tokens= ['<system>', 'Use', 'supplied', 'policy', 'evidence', '<eot>', '<user>', 'Damaged', 'electronics', 'arrived', '<eot>', '<assistant>', 'Report', 'within', '48', 'hours', 'with', 'photos', '<eot>']
trained_targets= ['Report', 'within', '48', 'hours', 'with', 'photos', '<eot>']
masked_tokens= 12

Use this kind of inspected tiny batch before launching training. A training-loss curve can't reveal that your boundary finder shifted one token too far and trained the wrong span.
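The off-by-one risk can be checked mechanically. The sketch below assumes the common convention where labels are aligned with input tokens and the prediction at position t is scored against the label at position t + 1; the token IDs and probabilities are invented for illustration, and the helper names are hypothetical.

```python
import math

IGNORE = -100

# Toy ids: 10-12 = system/user text, 20 = assistant marker,
# 30-31 = answer tokens, 99 = end-of-turn. All values are made up.
input_ids = [10, 11, 12, 20, 30, 31, 99]
labels = [IGNORE, IGNORE, IGNORE, IGNORE, 30, 31, 99]


def scored_positions(labels):
    # The prediction made at position t is scored against labels[t + 1].
    return [
        (t, labels[t + 1])
        for t in range(len(labels) - 1)
        if labels[t + 1] != IGNORE
    ]


def mask_is_aligned(input_ids, labels):
    # Every supervised label must equal the token it sits on.
    return all(lab == tok for tok, lab in zip(input_ids, labels) if lab != IGNORE)


pairs = scored_positions(labels)
print("scored (position, target)=", pairs)

# Hypothetical model probabilities for the correct next token.
prob_of_correct_next = {3: 0.5, 4: 0.25, 5: 0.8}
loss = -sum(math.log(prob_of_correct_next[t]) for t, _ in pairs) / len(pairs)
print("masked_mean_loss=", round(loss, 4))

# A span mask shifted one token too far no longer lines up with the tokens.
shifted = [IGNORE] + labels[:-1]
print("correct_mask_aligned=", mask_is_aligned(input_ids, labels))
print("shifted_mask_aligned=", mask_is_aligned(input_ids, shifted))
```

Note that the assistant marker at position 3 is masked as a target yet still serves as the context from which the first answer token is predicted. The loss value alone would not reveal the shifted mask; the alignment check does.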

The following calculation makes the objective choice visible. Full-sequence loss scores nearly the whole serialized row; assistant-only loss scores only the desired answer span and its terminator.

compare-supervised-target-budgets.py
stream = [
    ("system", "<system>"),
    ("system", "Use"),
    ("system", "policy"),
    ("system", "<eot>"),
    ("user", "<user>"),
    ("user", "Damaged"),
    ("user", "electronics"),
    ("user", "<eot>"),
    ("assistant-marker", "<assistant>"),
    ("assistant", "Report"),
    ("assistant", "within"),
    ("assistant", "48"),
    ("assistant", "hours"),
    ("assistant", "<eot>"),
]

full_sequence_targets = [token for _, token in stream[1:]]
assistant_only_targets = [token for role, token in stream if role == "assistant"]

print("full_sequence_target_count=", len(full_sequence_targets))
print("assistant_only_target_count=", len(assistant_only_targets))
print("assistant_only_targets=", assistant_only_targets)

assert assistant_only_targets == ["Report", "within", "48", "hours", "<eot>"]
assert len(assistant_only_targets) < len(full_sequence_targets)
Output
full_sequence_target_count= 13
assistant_only_target_count= 5
assistant_only_targets= ['Report', 'within', '48', 'hours', '<eot>']

Packing saves padding, but test isolation explicitly

SFT rows have different lengths. If every short chat is padded to a long context window, the accelerator spends much of its work processing padding. Packing fills a window with several short sequences instead. TRL supports packing as a training configuration for this reason.[8]

Packing introduces a decision that teams often miss: may a token in conversation B attend to earlier tokens in conversation A? Some concatenated language-model recipes separate samples with an end token while retaining ordinary causal attention. If you need each support conversation to be an independent supervised example, use a trainer or attention kernel that supports isolation and verify its boundary behavior.

First measure why packing is tempting. Four short training rows padded separately to a 16-token window waste most of their capacity; filling shared windows reduces that waste.

measure-packing-utilization.py
window = 16
row_lengths = [9, 6, 7, 5]

separate_capacity = window * len(row_lengths)
separate_utilization = sum(row_lengths) / separate_capacity

packed_windows = []
for length in row_lengths:
    for index, used in enumerate(packed_windows):
        if used + length <= window:
            packed_windows[index] += length
            break
    else:
        packed_windows.append(length)

packed_capacity = window * len(packed_windows)
packed_utilization = sum(row_lengths) / packed_capacity

print("separate_utilization=", f"{separate_utilization:.1%}")
print("packed_windows=", packed_windows)
print("packed_utilization=", f"{packed_utilization:.1%}")

assert packed_windows == [15, 12]
assert packed_utilization > separate_utilization
Output
separate_utilization= 42.2%
packed_windows= [15, 12]
packed_utilization= 84.4%

Suppose two three-token rows are packed into one sequence:

Packed position | 0 | 1 | 2 | 3 | 4 | 5
--- | --- | --- | --- | --- | --- | ---
Conversation ID | A | A | A | B | B | B
Allowed history for strict isolation | A only | A only | A only | B only | B only | B only

For strict isolation, the causal attention matrix contains two blocks:

$$
M = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1
\end{bmatrix}
$$

Each row is a query position; each column is a possible earlier key position. The zeros in row 4 under columns 0 through 2 prevent the second conversation from reading the first one.

verify-packed-isolation.py
conversation_ids = ["A", "A", "A", "B", "B", "B"]

mask = [
    [
        int(key_position <= query_position and key_id == query_id)
        for key_position, key_id in enumerate(conversation_ids)
    ]
    for query_position, query_id in enumerate(conversation_ids)
]

for row in mask:
    print(" ".join(map(str, row)))

cross_conversation_edges = [
    (query, key)
    for query, row in enumerate(mask)
    for key, allowed in enumerate(row)
    if allowed and conversation_ids[query] != conversation_ids[key]
]

print("cross_conversation_edges=", cross_conversation_edges)
assert cross_conversation_edges == []
assert mask[4][1] == 0
Output
1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 1 1 1
cross_conversation_edges= []

Don't assume an option named packing=True automatically constructs this matrix. Inspect your trainer's documented semantics and run a small boundary test with the implementation you will train.
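A boundary test of that kind can be written as a function that accepts any attention mask and reports cross-conversation edges. The sketch below compares an ordinary causal mask with a block-causal one on the same packed sequence. The helper names are hypothetical, and extracting the actual mask from a real trainer or attention kernel is implementation-specific and not shown.

```python
def causal_mask(n):
    # Ordinary causal attention: every position sees all earlier positions.
    return [[int(k <= q) for k in range(n)] for q in range(n)]


def block_causal_mask(conversation_ids):
    # Causal attention restricted to tokens from the same conversation.
    return [
        [int(k <= q and kid == qid) for k, kid in enumerate(conversation_ids)]
        for q, qid in enumerate(conversation_ids)
    ]


def leaks(mask, conversation_ids):
    # Allowed (query, key) edges that cross a conversation boundary.
    return [
        (q, k)
        for q, row in enumerate(mask)
        for k, allowed in enumerate(row)
        if allowed and conversation_ids[q] != conversation_ids[k]
    ]


ids = ["A", "A", "A", "B", "B", "B"]
plain = leaks(causal_mask(len(ids)), ids)
isolated = leaks(block_causal_mask(ids), ids)

print("plain_causal_leaks=", len(plain))
print("isolated_leaks=", len(isolated))
```

With plain causal attention, every token of conversation B can read all three tokens of conversation A. Whether that is acceptable is a design decision; the point of the test is that the answer is known rather than assumed.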

Production failures are usually contract failures

Once a checkpoint has been fine-tuned, evaluate it with the private cases you built in the previous chapter. Log the rendered prompt, tokenizer revision, template revision, truncation policy, and generation settings alongside every result. Otherwise an answer regression could be a template deployment bug rather than a model change.

Symptom | Likely contract failure | First diagnostic check
--- | --- | ---
Model generates another customer question instead of answering | Missing assistant-start cue for a template that needs one | Inspect final rendered tokens before .generate().
Model emits unfamiliar role markers or rambles | Checkpoint served with another template family | Compare tokenizer/template revision with training manifest.
Responses changed after refactoring preprocessing | BOS, EOS, or delimiters were inserted twice | Count special-token IDs after rendering and tokenization.
Good short answers, incorrect long threads | Truncation removed the policy evidence or system instruction | Log retained messages and token count.
Low loss, poor response quality | Wrong target-span mask or noisy accepted rows | Render one batch with visible labels and inspect rejected data.
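The truncation row deserves a concrete check, because it fails silently. Below is one possible truncation policy, sketched under explicit assumptions: token counts are approximated by whitespace words via a hypothetical `count_tokens` helper, system messages are always pinned, and the newest turns are kept until the budget is exhausted. A real service would count tokens with the checkpoint's tokenizer and might choose a different policy.

```python
def count_tokens(text):
    # Hypothetical stand-in: whitespace words instead of real tokenizer ids.
    return len(text.split())


def truncate(messages, budget):
    # Always keep system messages, then keep the newest turns that still fit.
    # This sketch does not handle a system message that alone exceeds the budget.
    pinned = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in pinned)
    kept = []
    for message in reversed(rest):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return pinned + kept, used


thread = [
    {"role": "system", "content": "Policy evidence: report damage within 48 hours with photos."},
    {"role": "user", "content": "My electronics arrived damaged."},
    {"role": "assistant", "content": "Report the damage within 48 hours and attach photos."},
    {"role": "user", "content": "Where do I upload photos?"},
]

retained, used = truncate(thread, budget=20)

# Log exactly what survived so a regression can be traced to truncation.
print("retained_roles=", [m["role"] for m in retained])
print("dropped_messages=", len(thread) - len(retained))
print("token_count=", used)
print("evidence_retained=", any("48 hours" in m["content"] for m in retained))
```

Logging the retained roles and the token count alongside each response makes it possible to tell a truncation problem apart from a model problem.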

Here is a minimal deployment manifest check. It can't measure model quality, but it prevents shipping an endpoint whose serialization contract is known to differ from the fine-tuning run.

audit-serving-contract.py
training_manifest = {
    "checkpoint": "shopflow-policy-sft-v3",
    "tokenizer_revision": "tokens-7",
    "template_revision": "chatml-policy-v2",
    "add_special_tokens_after_template": False,
}

deployments = [
    {
        "name": "candidate-safe",
        "checkpoint": "shopflow-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "chatml-policy-v2",
        "add_special_tokens_after_template": False,
    },
    {
        "name": "candidate-drifted",
        "checkpoint": "shopflow-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "mistral-wrapper-v1",
        "add_special_tokens_after_template": True,
    },
]

def mismatches(candidate):
    return [
        field
        for field, expected in training_manifest.items()
        if candidate[field] != expected
    ]

for deployment in deployments:
    drift = mismatches(deployment)
    status = "block" if drift else "evaluate"
    print(f"{deployment['name']}: {status} drift={drift}")

assert mismatches(deployments[0]) == []
assert mismatches(deployments[1]) == [
    "template_revision",
    "add_special_tokens_after_template",
]
Output
candidate-safe: evaluate drift=[]
candidate-drifted: block drift=['template_revision', 'add_special_tokens_after_template']

A matching manifest earns the right to run quality evaluations; it doesn't prove the model is ready. Re-run grounded policy cases, critical failure slices, latency checks, and human review gates after every fine-tune or serving change.

The final executable check joins this chapter back to evaluation. Template parity is mandatory, but a correctly formatted candidate still ships only if grounded cases and operational limits pass.

gate-finetuned-candidates.py
candidates = [
    {
        "name": "sft-v3",
        "template_matches": True,
        "grounded_rate": 0.99,
        "critical_errors": 0,
        "p95_latency_ms": 620,
    },
    {
        "name": "sft-v4-fast",
        "template_matches": True,
        "grounded_rate": 0.94,
        "critical_errors": 1,
        "p95_latency_ms": 410,
    },
]

def blockers(candidate):
    failed = []
    if not candidate["template_matches"]:
        failed.append("template_drift")
    if candidate["grounded_rate"] < 0.98:
        failed.append("grounded_quality")
    if candidate["critical_errors"] > 0:
        failed.append("critical_policy_error")
    if candidate["p95_latency_ms"] > 700:
        failed.append("latency")
    return failed

for candidate in candidates:
    failed = blockers(candidate)
    decision = "release" if not failed else "block"
    print(f"{candidate['name']}: {decision} blockers={failed}")

assert blockers(candidates[0]) == []
assert blockers(candidates[1]) == ["grounded_quality", "critical_policy_error"]
Output
sft-v3: release blockers=[]
sft-v4-fast: block blockers=['grounded_quality', 'critical_policy_error']

From chat foundations to applied engineering

Instruction tuning connects data to behavior:

  1. A conversation row demonstrates the response you want.
  2. A chat template turns that row into the exact token context used for training and serving.
  3. A chosen label mask decides where supervised gradient is spent.
  4. Data filtering, packing checks, and private evaluations keep the learned behavior honest.

You don't need to fine-tune a large model to practice this chapter. Render transcripts, inspect labels, validate synthetic rows, and treat the serving manifest as part of your experiment record. Those habits scale from a tiny exercise to a serious training run.

Mastery check

Key concepts

  • Base-model continuation versus instruction-tuned response behavior
  • SFT examples grounded in approved evidence
  • Chat-template serialization and checkpoint-specific markers
  • Fresh generation versus assistant prefill
  • Full-sequence versus assistant-only loss
  • Synthetic-data acceptance checks
  • Packed-sequence isolation as a verified design choice
  • Training and serving template parity

Evaluation rubric

  • Foundational: Explains why a base model can complete dialogue text without reliably acting as a support assistant.
  • Intermediate: Renders completed, generation, and prefill versions of one conversation without mixing their boundary rules.
  • Intermediate: Builds an assistant-only target mask and explains why that's a choice rather than a universal SFT requirement.
  • Advanced: Designs a data and serving audit that rejects unsupported policy rows and template drift before evaluation.

Common pitfalls

  • Training on plausible synthetic answers without checking them against policy evidence.
  • Hand-formatting messages with delimiters that don't match the checkpoint's tokenizer template.
  • Treating assistant_only_loss or packed isolation as automatic behavior without inspecting the trainer.
  • Using a new-assistant generation cue when the request already contains an assistant prefill.
  • Declaring a fine-tuned checkpoint better before rerunning grounded private evaluations.

Practice extension

Add a third policy row for a sealed return window and a second rejected answer that invents a refund. Extend filter-grounded-sft-rows.py to print which required fact is missing or which forbidden claim was found. Then add its accepted answer to the rendering and label-building labs. You should be able to show exactly which assistant tokens become supervised targets and exactly which evidence allowed the row into training.

Next Step
Continue to Dimensionality Reduction for Embeddings

You can now connect supervision data, serialized prompts, and evaluated assistant behavior. The applied phase begins by examining how high-dimensional representations are compressed and visualized, which prepares you to inspect retrieval and ranking systems rather than treating embeddings as a black box.

References

[1] Ouyang, L., et al. Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022, 2022.

[2] Zhou, C., et al. LIMA: Less Is More for Alignment. NeurIPS 2023, 2023.

[3] Hugging Face. Transformers Documentation: Chat templates. 2026.

[4] Dubey, A., et al. The Llama 3 Herd of Models. arXiv preprint, 2024.

[5] Hugging Face. Transformers Documentation: Writing a chat template. 2026.

[6] Mistral AI. Tokenization. 2026.

[7] Wang, Y., et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL 2023, 2023.

[8] Hugging Face. TRL Documentation: SFT Trainer. 2026.