Teach a base language model to answer as an assistant: curate grounded SFT rows, serialize chat turns exactly, choose loss targets, pack safely, and detect serving-time template drift.
In the previous lesson, you built evaluations for a policy-answering assistant. One private case asks about a damaged electronics delivery, with this evidence:
Policy evidence: Damaged electronics must be reported within 48 hours with photos.
A candidate model may retrieve the right clause and still respond badly. A base language model predicts plausible next tokens; it isn't reliably trained to interpret a customer turn, answer as support, cite the policy, and stop. This chapter shows how supervised fine-tuning (SFT) teaches that response behavior, and why the same chat template must survive from data preparation to serving.
A pretrained causal language model has one basic mechanism: given preceding tokens, assign probabilities to the next token. It can sometimes answer a question because conversations appeared in pretraining data, but answering isn't yet a dependable product contract.
For ShopFlow support, one SFT training row can look like this:
| Role | Content | What this turn contributes |
|---|---|---|
| system | Answer from the supplied policy evidence. | Sets the behavior boundary. |
| user | My electronics arrived damaged. What should I do? | Gives the customer's request. |
| assistant | Report the damage within 48 hours and attach photos. | Supplies the continuation to reward. |
During SFT, the training objective increases the probability of target response tokens conditioned on the turns that came before them. In InstructGPT, supervised demonstrations were the first post-training stage before preference feedback was used to refine response quality.[1]
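One way to make that objective concrete is as an average negative log-likelihood over the target tokens. The sketch below is a toy calculation, not a training loop: the probabilities are invented for illustration, and treating only assistant tokens as targets is an assumption of this example rather than a requirement of SFT.

```python
import math

# Hypothetical probabilities a model assigns to each correct next token.
# These numbers are invented for illustration, not measured from a model.
steps = [
    ("user", "damaged", 0.20),
    ("user", "?", 0.50),
    ("assistant", "Report", 0.25),
    ("assistant", "within", 0.50),
    ("assistant", "48", 0.10),
]


def sft_loss(scored_steps):
    """Mean negative log-likelihood over assistant target tokens only."""
    target_log_probs = [
        math.log(probability)
        for role, _, probability in scored_steps
        if role == "assistant"
    ]
    return -sum(target_log_probs) / len(target_log_probs)


# Pretend one update doubled the probability of every target token.
improved = [
    (role, token, probability * 2 if role == "assistant" else probability)
    for role, token, probability in steps
]

loss_before = sft_loss(steps)
loss_after = sft_loss(improved)
print(f"loss_before={loss_before:.3f}")
print(f"loss_after={loss_after:.3f}")
```

Raising the probability of the target tokens lowers the loss, while the user tokens never enter the sum. In this toy case, doubling every target probability lowers the mean loss by exactly log 2.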
SFT isn't a guarantee that the model only changes style or formatting. Fine-tuning can alter factual behavior and task competence too. LIMA provides a useful, narrower result: with a capable base model, a carefully curated set of 1,000 demonstrations produced strong response-format and instruction-following behavior in its experiments.[2] Treat that as evidence for data quality, not a promise that every task needs little data.
Applications store a conversation as structured records. The language model receives tokens. A chat template is the serialization rule between those two representations: it adds role markers, delimiters, whitespace, and, when appropriate, the cue that a new assistant response should begin.[3]
Model families don't share one universal chat serialization:
| Checkpoint family example | Shape of its turn markers | Engineering consequence |
|---|---|---|
| ChatML-style | `<\|im_start\|>user ... <\|im_end\|>` | An assistant-start marker can indicate a fresh generation turn. |
| Llama 3 Instruct | Header tokens followed by `<\|eot_id\|>` | Use the tokenizer's shipped headers and end-of-turn marker.[4][5] |
| Mistral-7B-Instruct-v0.1 | `<s>[INST] ... [/INST] ... </s>` | Spacing and turn layout are part of the tokenizer contract.[3][6] |
Those strings are examples of checkpoint-specific formats, not a menu of interchangeable wrappers. Feeding [INST] formatting to a checkpoint trained with another role-token layout may still produce text, but you have changed its input distribution.
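To see why the wrappers are not interchangeable, it can help to render one request under two layouts and compare the resulting strings. Both renderers below are simplified, hypothetical stand-ins written for this comparison; they do not reproduce any checkpoint's exact template, which should always come from the shipped tokenizer.

```python
messages = [
    {"role": "user", "content": "My electronics arrived damaged. What should I do?"},
]


def render_chatml_like(messages):
    """Simplified ChatML-style layout ending with an assistant-start cue."""
    body = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return body + "<|im_start|>assistant\n"


def render_inst_like(messages):
    """Simplified [INST]-style layout; real tokenizers define the exact spacing."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return "<s>" + "".join(f"[INST] {text} [/INST]" for text in user_turns)


chatml_prompt = render_chatml_like(messages)
inst_prompt = render_inst_like(messages)

print(repr(chatml_prompt))
print(repr(inst_prompt))
print("same_prompt=", chatml_prompt == inst_prompt)
```

The customer text is identical in both prompts, but the surrounding markers share nothing, so a model trained on one layout sees an unfamiliar prefix when served the other.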
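The smallest useful template exercise is to render the same support task in three modes: a complete training transcript that includes the assistant answer, a generation prompt that ends with a fresh assistant-start cue, and a prefill that leaves a partially written assistant message open.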
The last two modes aren't the same. Starting a new assistant turn and continuing an existing assistant message should never happen at once.
```python
START = "<|im_start|>"
END = "<|im_end|>"


def render(messages, *, add_generation_prompt=False, continue_final_message=False):
    # The two modes are mutually exclusive: open a new turn or extend the last one.
    if add_generation_prompt and continue_final_message:
        raise ValueError("choose a new assistant turn or a prefill, not both")

    pieces = []
    for index, message in enumerate(messages):
        role = message["role"]
        content = message["content"]
        is_prefill = (
            continue_final_message
            and index == len(messages) - 1
            and role == "assistant"
        )
        pieces.append(f"{START}{role}\n{content}")
        # A prefilled final assistant message stays open: no end-of-turn marker.
        if not is_prefill:
            pieces.append(f"{END}\n")

    if add_generation_prompt:
        pieces.append(f"{START}assistant\n")
    return "".join(pieces)


context = [
    {"role": "system", "content": "Use only the supplied policy evidence."},
    {"role": "user", "content": "My electronics arrived damaged. What should I do?"},
]
answer = {
    "role": "assistant",
    "content": "Report the damage within 48 hours and attach photos.",
}

training_text = render(context + [answer])
generation_text = render(context, add_generation_prompt=True)
prefill_text = render(
    context + [{"role": "assistant", "content": '{"deadline_hours": '}],
    continue_final_message=True,
)

print("training_has_answer=", "48 hours" in training_text)
print("generation_ends_at_assistant=", generation_text.endswith(f"{START}assistant\n"))
print("prefill_is_open=", prefill_text.endswith('{"deadline_hours": '))

try:
    render(context, add_generation_prompt=True, continue_final_message=True)
except ValueError as error:
    print("invalid_mode_caught=", str(error))
```

```
training_has_answer= True
generation_ends_at_assistant= True
prefill_is_open= True
invalid_mode_caught= choose a new assistant turn or a prefill, not both
```

Hugging Face tokenizers implement this idea with apply_chat_template. For generation, add_generation_prompt=True appends a start-of-assistant sequence only when that particular template defines one. For a response prefill, continue_final_message=True keeps the final assistant content open, and the documentation treats combining both flags as an error.[3]
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "My electronics arrived damaged. What should I do?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
```

This snippet intentionally uses the tokenizer attached to the checkpoint rather than recreating the format with string concatenation. If you render text with tokenize=False and tokenize it in a later step, pass add_special_tokens=False; otherwise BOS or EOS tokens may be inserted twice.[3]
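The double-insertion risk is easy to reproduce without downloading a model. The sketch below uses a toy template and a toy whitespace tokenizer, both hypothetical and written only for this illustration; the one thing they share with real tokenizers is that the template writes BOS into the text while the encoder adds BOS again by default.

```python
BOS = "<s>"


def toy_render(user_text):
    # This toy template already writes the BOS marker into the text.
    return f"{BOS}[INST] {user_text} [/INST]"


def toy_encode(text, add_special_tokens=True):
    # Toy whitespace tokenizer that treats BOS as its own token.
    tokens = text.replace(BOS, f" {BOS} ").split()
    return ([BOS] if add_special_tokens else []) + tokens


rendered = toy_render("My electronics arrived damaged.")
doubled = toy_encode(rendered)  # default behavior adds a second BOS
clean = toy_encode(rendered, add_special_tokens=False)

print("bos_with_default=", doubled.count(BOS))
print("bos_with_flag_off=", clean.count(BOS))
```

Counting special tokens after the final tokenization step, as done here, is a cheap check to run against the real tokenizer as well.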
Before rendering thousands of rows, reject malformed dialogue structure. An assistant reply without a preceding user request is a bad supervised example even if its prose is fluent.
```python
rows = {
    "valid": ["system", "user", "assistant"],
    "missing_answer": ["system", "user"],
    "double_user": ["system", "user", "user", "assistant"],
}


def role_errors(roles):
    # An optional leading system turn is allowed; the rest must alternate.
    turns = roles[1:] if roles and roles[0] == "system" else roles
    expected = "user"
    for role in turns:
        if role != expected:
            return [f"expected {expected}, got {role}"]
        expected = "assistant" if expected == "user" else "user"
    # If an assistant turn is still expected, the last user turn was never answered.
    if expected == "assistant":
        return ["conversation ends before assistant answer"]
    return []


for row_id, roles in rows.items():
    errors = role_errors(roles)
    print(f"{row_id}: {'accept' if not errors else 'reject'} {errors}")

assert role_errors(rows["valid"]) == []
assert role_errors(rows["missing_answer"]) == ["conversation ends before assistant answer"]
```

```
valid: accept []
missing_answer: reject ['conversation ends before assistant answer']
double_user: reject ['expected assistant, got user']
```

Good formatting can't rescue bad supervision. If a training row claims that damaged electronics have a 30-day return window when the policy says 48 hours with photos, SFT reinforces an unsupported answer.
Your first data design decision is not "How many rows can I generate?" It is "What behavior is each accepted row allowed to teach?"
| Data source | Useful for | Risk to test before training |
|---|---|---|
| Reviewed support examples | Policy-critical responses and refusal behavior | Sparse coverage of unusual tickets |
| Synthetic variations from approved seeds | Paraphrases, edge cases, tone variation | Unsupported policy claims or near-duplicates |
| Multi-turn transcripts | Follow-ups such as "Where do I upload photos?" | Old context, personal data, or unhelpful agent habits |
Self-Instruct demonstrated a repeatable way to expand instruction data: start from human-written tasks, generate new instructions and responses, then filter invalid or similar rows before fine-tuning.[7] A product team can use the same pattern with policy questions, but it must validate answers against source clauses rather than trusting a generator's confidence.
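The "filter similar rows" step can be sketched with a very small similarity check. This example substitutes word-set Jaccard overlap as a simple stand-in for whatever similarity measure a real pipeline uses, and the 0.7 threshold and helper names are assumptions chosen for illustration.

```python
import re


def words(text):
    """Lowercased alphanumeric word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(left, right):
    a, b = words(left), words(right)
    return len(a & b) / len(a | b) if a | b else 0.0


def drop_near_duplicates(seeds, candidates, threshold=0.7):
    """Keep a candidate only if it is dissimilar to every seed and kept row."""
    pool = list(seeds)
    kept = []
    for candidate in candidates:
        if all(jaccard(candidate, existing) < threshold for existing in pool):
            kept.append(candidate)
            pool.append(candidate)
    return kept


seeds = ["My electronics arrived damaged. What should I do?"]
candidates = [
    "My electronics arrived damaged, what should I do now?",
    "The carrier scan says delivered but nothing arrived.",
]

kept = drop_near_duplicates(seeds, candidates)
print("kept=", kept)
```

The first candidate only adds one word to the seed and is dropped; the second covers a different situation and survives. A similarity filter like this removes redundancy only, so grounding against policy clauses still has to be checked separately.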
The next lab treats each assistant answer as a candidate SFT row. A row is accepted only if it contains required policy facts and omits a known unsupported claim.
```python
policies = {
    "damaged_electronics": {
        "required": ("48 hours", "photos"),
        "forbidden": ("30 days",),
    },
    "late_delivery": {
        "required": ("carrier scan", "support ticket"),
        "forbidden": ("automatic refund",),
    },
}

candidate_rows = [
    {
        "id": "row-001",
        "policy": "damaged_electronics",
        "answer": "Report the damage within 48 hours and attach photos.",
    },
    {
        "id": "row-002",
        "policy": "damaged_electronics",
        "answer": "Return any damaged electronics within 30 days.",
    },
    {
        "id": "row-003",
        "policy": "late_delivery",
        "answer": "Share the carrier scan in a support ticket so we can investigate.",
    },
]


def evaluate(row):
    # Case-insensitive substring checks against the policy's fact lists.
    text = row["answer"].lower()
    rule = policies[row["policy"]]
    has_required_facts = all(fact in text for fact in rule["required"])
    contains_forbidden_claim = any(claim in text for claim in rule["forbidden"])
    return has_required_facts and not contains_forbidden_claim


accepted = []
for row in candidate_rows:
    decision = "accept" if evaluate(row) else "reject"
    print(f"{row['id']}: {decision}")
    if decision == "accept":
        accepted.append(row["id"])

print("accepted_rows=", accepted)
assert accepted == ["row-001", "row-003"]
```

```
row-001: accept
row-002: reject
row-003: accept
accepted_rows= ['row-001', 'row-003']
```

This filter is deliberately small. A real data pipeline also checks for duplicated instructions, personal information, unsafe replies, role ordering, length limits, and human-review requirements for high-risk cases. Version the evidence snapshot, generator prompt, filters, and accepted dataset together. Otherwise you won't know which change caused a behavior regression.
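One lightweight way to version those ingredients together is to hash them as a single record. The following is a minimal sketch under stated assumptions: the field names, the generator prompt text, and the `grounded-filter-v1` label are hypothetical, and a real project would more likely hash files or dataset artifacts than inline strings.

```python
import hashlib
import json


def fingerprint(record):
    """Short, stable hash of a JSON-serializable record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical release record; every value here is illustrative.
release = {
    "evidence_snapshot": {
        "damaged_electronics": "Damaged electronics must be reported within 48 hours with photos.",
    },
    "generator_prompt": "Paraphrase the customer question without changing policy facts.",
    "filter_version": "grounded-filter-v1",
    "accepted_row_ids": ["row-001", "row-003"],
}

baseline = fingerprint(release)

# Changing any one ingredient produces a different fingerprint.
edited = dict(release, filter_version="grounded-filter-v2")

print("baseline=", baseline)
print("changed_after_filter_edit=", fingerprint(edited) != baseline)
```

Storing the fingerprint next to each trained checkpoint makes it possible to tell whether two runs really used the same evidence, prompt, filter, and rows.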
An accepted dataset also needs coverage. Rows for damage and delivery delay don't demonstrate how the assistant should answer a sealed-return question.
```python
required_intents = {
    "damaged_electronics",
    "late_delivery",
    "sealed_return",
}
accepted_rows = [
    {"id": "row-001", "intent": "damaged_electronics"},
    {"id": "row-003", "intent": "late_delivery"},
    {"id": "row-004", "intent": "damaged_electronics"},
]

covered = {row["intent"] for row in accepted_rows}
missing = sorted(required_intents - covered)
counts = {
    intent: sum(row["intent"] == intent for row in accepted_rows)
    for intent in sorted(required_intents)
}

print("accepted_counts=", counts)
print("missing_critical_intents=", missing)
print("ready_for_training=", not missing)

assert missing == ["sealed_return"]
```

```
accepted_counts= {'damaged_electronics': 2, 'late_delivery': 1, 'sealed_return': 0}
missing_critical_intents= ['sealed_return']
ready_for_training= False
```

Every completed chat transcript contains tokens from the system prompt, customer question, and assistant answer. You must decide which of those tokens are supervised targets.
Two choices matter:
| Objective choice | Target tokens | When it can be reasonable |
|---|---|---|
| Full-sequence causal loss | All non-padding transcript tokens | You intentionally train the model on the whole conversation distribution. |
| Assistant-only loss | Assistant response spans, usually including their end-of-turn markers | You want the gradient budget focused on response behavior rather than reproducing prompt text. |
Assistant-only loss is common, but it isn't the definition of SFT. TRL exposes it as assistant_only_loss=True for conversational data only when the chat template can identify assistant spans through generation markers. Current TRL releases automatically patch templates for some bundled model families; inspect the resolved template for the checkpoint you train.[8] If the span mask is wrong, you may silently train on customer text or mask out the answer you meant to learn.
In a causal language model, the target for a token is evaluated after the preceding tokens. The tiny preprocessing lab below marks assistant words and the assistant turn terminator as targets; role markers, system text, and user text receive the usual ignore label -100.
```python
conversation = [
    ("system", "Use supplied policy evidence"),
    ("user", "Damaged electronics arrived"),
    ("assistant", "Report within 48 hours with photos"),
]

tokens = []
labels = []

for role, text in conversation:
    # Role markers are never supervised, even for the assistant turn.
    tokens.append(f"<{role}>")
    labels.append("-100")
    for word in text.split():
        tokens.append(word)
        labels.append(word if role == "assistant" else "-100")
    # The end-of-turn marker is a target only when it closes an assistant turn.
    tokens.append("<eot>")
    labels.append("<eot>" if role == "assistant" else "-100")

trained_targets = [label for label in labels if label != "-100"]
masked_tokens = sum(label == "-100" for label in labels)

print("tokens=", tokens)
print("trained_targets=", trained_targets)
print("masked_tokens=", masked_tokens)

assert trained_targets == ["Report", "within", "48", "hours", "with", "photos", "<eot>"]
assert labels[tokens.index("<user>")] == "-100"
```

```
tokens= ['<system>', 'Use', 'supplied', 'policy', 'evidence', '<eot>', '<user>', 'Damaged', 'electronics', 'arrived', '<eot>', '<assistant>', 'Report', 'within', '48', 'hours', 'with', 'photos', '<eot>']
trained_targets= ['Report', 'within', '48', 'hours', 'with', 'photos', '<eot>']
masked_tokens= 12
```

Use this kind of inspected tiny batch before launching training. A training-loss curve can't reveal that your boundary finder shifted one token too far and trained the wrong span.
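A shifted boundary of that kind can be caught mechanically before training. The checker below is a small sketch that assumes the toy `<role>` and `<eot>` markers used in this chapter and an integer ignore label of -100; the function name is hypothetical, and a real check would work on token IDs from the actual tokenizer.

```python
IGNORE = -100


def misplaced_positions(tokens, labels):
    """Return positions whose supervision disagrees with the assistant span."""
    problems = []
    inside_assistant = False
    for position, (token, label) in enumerate(zip(tokens, labels)):
        supervised = label != IGNORE
        is_role_marker = token.startswith("<") and token != "<eot>"
        if is_role_marker:
            inside_assistant = token == "<assistant>"
            expected = False  # role markers are never targets in this scheme
        else:
            expected = inside_assistant
        if supervised != expected:
            problems.append(position)
        if token == "<eot>":
            inside_assistant = False
    return problems


tokens = ["<user>", "Damaged", "<eot>", "<assistant>", "Report", "photos", "<eot>"]
correct = [IGNORE, IGNORE, IGNORE, IGNORE, "Report", "photos", "<eot>"]
# Same number of targets, but the span starts one position too early.
shifted = [IGNORE, IGNORE, IGNORE, "<assistant>", "Report", "photos", IGNORE]

print("correct_mask_problems=", misplaced_positions(tokens, correct))
print("shifted_mask_problems=", misplaced_positions(tokens, shifted))
```

Both masks supervise three positions, so a target count alone would not distinguish them. The position-level check flags the supervised role marker and the unsupervised end-of-turn token.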
The following calculation makes the objective choice visible. Full-sequence loss scores nearly the whole serialized row; assistant-only loss scores only the desired answer span and its terminator.
```python
stream = [
    ("system", "<system>"),
    ("system", "Use"),
    ("system", "policy"),
    ("system", "<eot>"),
    ("user", "<user>"),
    ("user", "Damaged"),
    ("user", "electronics"),
    ("user", "<eot>"),
    ("assistant-marker", "<assistant>"),
    ("assistant", "Report"),
    ("assistant", "within"),
    ("assistant", "48"),
    ("assistant", "hours"),
    ("assistant", "<eot>"),
]

# The first token has no preceding context, so it is never a prediction target.
full_sequence_targets = [token for _, token in stream[1:]]
assistant_only_targets = [token for role, token in stream if role == "assistant"]

print("full_sequence_target_count=", len(full_sequence_targets))
print("assistant_only_target_count=", len(assistant_only_targets))
print("assistant_only_targets=", assistant_only_targets)

assert assistant_only_targets == ["Report", "within", "48", "hours", "<eot>"]
assert len(assistant_only_targets) < len(full_sequence_targets)
```

```
full_sequence_target_count= 13
assistant_only_target_count= 5
assistant_only_targets= ['Report', 'within', '48', 'hours', '<eot>']
```

SFT rows have different lengths. If every short chat is padded to a long context window, the accelerator spends much of its work processing padding. Packing fills a window with several short sequences instead. TRL supports packing as a training configuration for this reason.[8]
Packing introduces a decision that teams often miss: may a token in conversation B attend to earlier tokens in conversation A? Some concatenated language-model recipes separate samples with an end token while retaining ordinary causal attention. If you need each support conversation to be an independent supervised example, use a trainer or attention kernel that supports isolation and verify its boundary behavior.
First measure why packing is tempting. Four short training rows padded separately to a 16-token window waste most of their capacity; filling shared windows reduces that waste.
```python
window = 16
row_lengths = [9, 6, 7, 5]

separate_capacity = window * len(row_lengths)
separate_utilization = sum(row_lengths) / separate_capacity

# First-fit packing: put each row in the first window that still has room.
packed_windows = []
for length in row_lengths:
    for index, used in enumerate(packed_windows):
        if used + length <= window:
            packed_windows[index] += length
            break
    else:
        # No existing window had room, so open a new one.
        packed_windows.append(length)

packed_capacity = window * len(packed_windows)
packed_utilization = sum(row_lengths) / packed_capacity

print("separate_utilization=", f"{separate_utilization:.1%}")
print("packed_windows=", packed_windows)
print("packed_utilization=", f"{packed_utilization:.1%}")

assert packed_windows == [15, 12]
assert packed_utilization > separate_utilization
```

```
separate_utilization= 42.2%
packed_windows= [15, 12]
packed_utilization= 84.4%
```

Suppose two three-token rows are packed into one sequence:
| Packed position | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Conversation ID | A | A | A | B | B | B |
| Allowed history for strict isolation | A only | A only | A only | B only | B only | B only |
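For strict isolation, the causal attention matrix contains two blocks:

$$
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1
\end{bmatrix}
$$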
Each row is a query position; each column is a possible earlier key position. The zeros in row 4 under columns 0 through 2 prevent the second conversation from reading the first one.
```python
conversation_ids = ["A", "A", "A", "B", "B", "B"]

# A query may attend to a key only if the key is not later in the sequence
# and both positions belong to the same conversation.
mask = [
    [
        int(key_position <= query_position and key_id == query_id)
        for key_position, key_id in enumerate(conversation_ids)
    ]
    for query_position, query_id in enumerate(conversation_ids)
]

for row in mask:
    print(" ".join(map(str, row)))

cross_conversation_edges = [
    (query, key)
    for query, row in enumerate(mask)
    for key, allowed in enumerate(row)
    if allowed and conversation_ids[query] != conversation_ids[key]
]

print("cross_conversation_edges=", cross_conversation_edges)
assert cross_conversation_edges == []
assert mask[4][1] == 0
```

```
1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 1 1 1
cross_conversation_edges= []
```

Don't assume an option named packing=True automatically constructs this matrix. Inspect your trainer's documented semantics and run a small boundary test with the implementation you will train.
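A boundary test of that kind can be as small as counting allowed attention edges that cross a conversation boundary. The helper names below are hypothetical, and the masks are built by hand as plain lists; with a real trainer you would extract or reconstruct the effective mask from its collator or attention implementation and feed it to the same counter.

```python
def causal_mask(length):
    """Ordinary causal attention: each position sees itself and everything earlier."""
    return [[int(key <= query) for key in range(length)] for query in range(length)]


def isolated_mask(conversation_ids):
    """Causal attention restricted to keys from the same conversation."""
    size = len(conversation_ids)
    return [
        [
            int(key <= query and conversation_ids[key] == conversation_ids[query])
            for key in range(size)
        ]
        for query in range(size)
    ]


def count_leaks(mask, conversation_ids):
    """Count allowed query-key pairs that cross a conversation boundary."""
    return sum(
        1
        for query, row in enumerate(mask)
        for key, allowed in enumerate(row)
        if allowed and conversation_ids[query] != conversation_ids[key]
    )


ids = ["A", "A", "A", "B", "B", "B"]
plain_leaks = count_leaks(causal_mask(len(ids)), ids)
isolated_leaks = count_leaks(isolated_mask(ids), ids)

print("plain_causal_leaks=", plain_leaks)
print("isolated_leaks=", isolated_leaks)
```

With plain causal attention, each of the three B positions can read all three A positions, which gives nine cross-conversation edges; the isolated mask has none.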
Once a checkpoint has been fine-tuned, evaluate it with the private cases you built in the previous chapter. Log the rendered prompt, tokenizer revision, template revision, truncation policy, and generation settings alongside every result. Otherwise an answer regression could be a template deployment bug rather than a model change.
| Symptom | Likely contract failure | First diagnostic check |
|---|---|---|
| Model generates another customer question instead of answering | Missing assistant-start cue for a template that needs one | Inspect final rendered tokens before .generate(). |
| Model emits unfamiliar role markers or rambles | Checkpoint served with another template family | Compare tokenizer/template revision with training manifest. |
| Responses changed after refactoring preprocessing | BOS, EOS, or delimiters were inserted twice | Count special-token IDs after rendering and tokenization. |
| Good short answers, incorrect long threads | Truncation removed the policy evidence or system instruction | Log retained messages and token count. |
| Low loss, poor response quality | Wrong target-span mask or noisy accepted rows | Render one batch with visible labels and inspect rejected data. |
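The truncation row in the table suggests logging what survived. The sketch below shows one possible truncation policy (keep system messages, then add the newest turns that fit) together with that log. The policy, the helper names, and the whitespace word count used in place of a real tokenizer are all assumptions for illustration.

```python
def count_tokens(message):
    # Assumption: whitespace words stand in for real tokenizer counts.
    return len(message["content"].split())


def truncate(messages, budget):
    """Keep system messages, then add the newest turns that still fit."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m) for m in system)
    kept = []
    for message in reversed(turns):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    # Restore chronological order for the retained turns.
    return system + kept[::-1], used


thread = [
    {"role": "system", "content": "Answer from the supplied policy evidence."},
    {"role": "user", "content": "My electronics arrived damaged."},
    {"role": "assistant", "content": "Report the damage within 48 hours and attach photos."},
    {"role": "user", "content": "Where do I upload photos?"},
]

retained, used = truncate(thread, budget=16)
log = {
    "retained_roles": [m["role"] for m in retained],
    "dropped_messages": len(thread) - len(retained),
    "token_count": used,
}
print(log)
```

In this run the follow-up question survives but the earlier answer carrying the 48-hour rule does not, which is exactly the kind of loss that only shows up when retained messages are logged.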
Here is a minimal deployment manifest check. It can't measure model quality, but it prevents shipping an endpoint whose serialization contract is known to differ from the fine-tuning run.
```python
training_manifest = {
    "checkpoint": "shopflow-policy-sft-v3",
    "tokenizer_revision": "tokens-7",
    "template_revision": "chatml-policy-v2",
    "add_special_tokens_after_template": False,
}

deployments = [
    {
        "name": "candidate-safe",
        "checkpoint": "shopflow-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "chatml-policy-v2",
        "add_special_tokens_after_template": False,
    },
    {
        "name": "candidate-drifted",
        "checkpoint": "shopflow-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "mistral-wrapper-v1",
        "add_special_tokens_after_template": True,
    },
]


def mismatches(candidate):
    # Compare only the fields recorded at training time, in manifest order.
    return [
        field
        for field, expected in training_manifest.items()
        if candidate[field] != expected
    ]


for deployment in deployments:
    drift = mismatches(deployment)
    status = "block" if drift else "evaluate"
    print(f"{deployment['name']}: {status} drift={drift}")

assert mismatches(deployments[0]) == []
assert mismatches(deployments[1]) == [
    "template_revision",
    "add_special_tokens_after_template",
]
```

```
candidate-safe: evaluate drift=[]
candidate-drifted: block drift=['template_revision', 'add_special_tokens_after_template']
```

A matching manifest earns the right to run quality evaluations; it doesn't prove the model is ready. Re-run grounded policy cases, critical failure slices, latency checks, and human review gates after every fine-tune or serving change.
The final executable check joins this chapter back to evaluation. Template parity is mandatory, but a correctly formatted candidate still ships only if grounded cases and operational limits pass.
```python
candidates = [
    {
        "name": "sft-v3",
        "template_matches": True,
        "grounded_rate": 0.99,
        "critical_errors": 0,
        "p95_latency_ms": 620,
    },
    {
        "name": "sft-v4-fast",
        "template_matches": True,
        "grounded_rate": 0.94,
        "critical_errors": 1,
        "p95_latency_ms": 410,
    },
]


def blockers(candidate):
    # Every gate is checked independently so the report lists all failures.
    failed = []
    if not candidate["template_matches"]:
        failed.append("template_drift")
    if candidate["grounded_rate"] < 0.98:
        failed.append("grounded_quality")
    if candidate["critical_errors"] > 0:
        failed.append("critical_policy_error")
    if candidate["p95_latency_ms"] > 700:
        failed.append("latency")
    return failed


for candidate in candidates:
    failed = blockers(candidate)
    decision = "release" if not failed else "block"
    print(f"{candidate['name']}: {decision} blockers={failed}")

assert blockers(candidates[0]) == []
assert blockers(candidates[1]) == ["grounded_quality", "critical_policy_error"]
```

```
sft-v3: release blockers=[]
sft-v4-fast: block blockers=['grounded_quality', 'critical_policy_error']
```

Instruction tuning connects data to behavior.
You don't need to fine-tune a large model to practice this chapter. Render transcripts, inspect labels, validate synthetic rows, and treat the serving manifest as part of your experiment record. Those habits scale from a tiny exercise to a serious training run.
Don't treat assistant_only_loss or packed isolation as automatic behavior without inspecting the trainer.

Add a third policy row for a sealed return window and a second rejected answer that invents a refund. Extend filter-grounded-sft-rows.py to print which required fact is missing or which forbidden claim was found. Then add its accepted answer to the rendering and label-building labs. You should be able to show exactly which assistant tokens become supervised targets and exactly which evidence allowed the row into training.
[1] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
[2] Zhou, C., et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023.
[3] Hugging Face (2026). Transformers Documentation: Chat templates.
[4] Dubey, A., et al. (2024). The Llama 3 Herd of Models. arXiv preprint.
[5] Hugging Face (2026). Transformers Documentation: Writing a chat template.
[6] Mistral AI (2026). Tokenization.
[7] Wang, Y., et al. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL 2023.
[8] Hugging Face (2026). TRL Documentation: SFT Trainer.