Instruction Tuning & Chat Templates

Teach a base language model to answer as an assistant: curate grounded SFT rows, serialize chat turns exactly, choose loss targets, pack safely, and detect serving-time template drift.

The policy-answering assistant now has evaluations. One private case asks about a stale service-account key, with this evidence:

Policy evidence: Service-account keys older than 14 days require a rotation ticket before continued use.

A candidate model may retrieve the right clause and still respond badly. A base language model predicts plausible next tokens; it isn't reliably trained to interpret a user turn, answer as a policy assistant, cite the policy, and stop. Supervised fine-tuning (SFT) teaches that response behavior, and the same chat template must survive from data preparation to serving.

[Figure: Instruction tuning turns one chat context into a training contract. A base model sees system and user turns and could continue in several directions. SFT picks one grounded assistant continuation as the target from many plausible next-token paths, and serving has to recreate the same role boundaries before generating the answer.]

One example defines a behavior contract

A pretrained causal language model has one basic mechanism: given preceding tokens, assign probabilities to the next token. It can sometimes answer a question because conversations appeared in pretraining data, but answering isn't yet a dependable product contract.

For an access-policy assistant, one SFT training row can look like this:

Role	Content	What this turn contributes
`system`	Answer from the supplied policy evidence.	Sets the behavior boundary.
`user`	My service-account key is 18 days old. Can I keep using it?	Gives the user's request.
`assistant`	Open a rotation ticket because keys older than 14 days require review.	Supplies the continuation to reward.

During SFT, the optimizer increases the probability of target response tokens conditioned on the turns that came before them. In InstructGPT, supervised demonstrations were the first post-training stage before preference feedback was used to refine response quality.[1]
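As a rough illustration of that objective, the sketch below computes the mean negative log-likelihood over target tokens only. The per-token probabilities are invented numbers rather than outputs of a real model, and `sft_loss` is a hypothetical helper name.

```python
import math

# Invented probabilities that a model assigns to each correct next token.
# Context tokens are listed but flagged as non-targets.
steps = [
    {"token": "Can", "prob": 0.20, "is_target": False},    # user turn
    {"token": "I", "prob": 0.35, "is_target": False},      # user turn
    {"token": "Open", "prob": 0.10, "is_target": True},    # assistant turn
    {"token": "a", "prob": 0.60, "is_target": True},
    {"token": "ticket", "prob": 0.50, "is_target": True},
]

def sft_loss(steps):
    """Mean negative log-likelihood over target tokens only."""
    target_probs = [s["prob"] for s in steps if s["is_target"]]
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

before = sft_loss(steps)

# Pretend one optimizer step raised the probability of the first answer token.
steps[2]["prob"] = 0.40
after = sft_loss(steps)

print(f"loss_before={before:.3f}")
print(f"loss_after={after:.3f}")
```

Raising the probability of a target token lowers the loss, while the probabilities of the user tokens never enter the sum.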

[Figure: The SFT pipeline. A grounded SFT row (14 days + ticket) passes through the chat template into a role-token stream, the optimizer trains on the chosen target tokens, and serving and evaluation reuse the same token contract.]

SFT isn't a guarantee that the model only changes style or formatting. Fine-tuning can alter factual behavior and task competence too. LIMA provides a useful, narrower result: with a capable base model, a carefully curated set of 1,000 demonstrations produced strong response-format and instruction-following behavior in its experiments.[2] Treat that as evidence for data quality, not a promise that every task needs little data.

Chat messages eventually become one token stream

Applications store a conversation as structured records. The language model receives tokens. A chat template is the serialization rule between those two representations: it adds role markers, delimiters, whitespace, and, when appropriate, the cue that a new assistant response should begin.[3]

[Figure: One serialized conversation prefix branches into three modes. Training closes the known assistant answer, fresh generation ends at an assistant-start cue, and prefill leaves the existing assistant message open. The last two modes can't be combined.]

Model families don't share one universal chat serialization:

Checkpoint family example	Shape of its turn markers	Engineering consequence
ChatML-style	`<|im_start|>user ... <|im_end|>`	An assistant-start marker can indicate a fresh generation turn.
Llama 3 Instruct	Header tokens followed by `<|eot_id|>`	Use the tokenizer's shipped headers and end-of-turn marker.[4][5]
Mistral-7B-Instruct-v0.1	`<s>[INST] ... [/INST] ... </s>`	Spacing and turn layout are part of the tokenizer contract.[3][6]

Those strings are examples of checkpoint-specific formats, not a menu of interchangeable wrappers. Feeding `[INST]` formatting to a checkpoint trained with another role-token layout may still produce text, but you have moved the prompt away from the input distribution the checkpoint was tuned on.
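To make the mismatch concrete, the sketch below serializes one user message in two simplified layouts. Both functions are hypothetical approximations of the shapes in the table above, not the exact templates shipped with any tokenizer; real templates differ in whitespace and special-token handling.

```python
def render_chatml_like(messages):
    # Approximates the ChatML-style shape, ending with an assistant-start cue.
    turns = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return turns + "<|im_start|>assistant\n"

def render_inst_like(messages):
    # Approximates the [INST] shape; handles user turns only, for brevity.
    user_text = " ".join(m["content"] for m in messages if m["role"] == "user")
    return f"<s>[INST] {user_text} [/INST]"

messages = [{"role": "user", "content": "Can I keep using an 18-day-old key?"}]

chatml_prompt = render_chatml_like(messages)
inst_prompt = render_inst_like(messages)

print(repr(chatml_prompt))
print(repr(inst_prompt))
print("same_string=", chatml_prompt == inst_prompt)
```

The user content is identical, yet the model would see two different prefixes. Only one of them can match what a given checkpoint saw during fine-tuning.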

Render training, fresh generation, and prefill separately

The smallest useful template exercise is to render the same support task in three modes:

A completed training transcript contains the known assistant answer.
A new inference request ends at an assistant-start cue.
A prefill already contains the beginning of an assistant answer and asks the model to continue it.

The last two modes are mutually exclusive. A single request either opens a new assistant turn or continues an existing assistant message, never both.

render-chat-contract.py

START = "<|im_start|>"
END = "<|im_end|>"

def render(messages, *, add_generation_prompt=False, continue_final_message=False):
    if add_generation_prompt and continue_final_message:
        raise ValueError("choose a new assistant turn or a prefill, not both")

    pieces = []
    for index, message in enumerate(messages):
        role = message["role"]
        content = message["content"]
        is_prefill = (
            continue_final_message
            and index == len(messages) - 1
            and role == "assistant"
        )
        pieces.append(f"{START}{role}\n{content}")
        if not is_prefill:
            pieces.append(f"{END}\n")

    if add_generation_prompt:
        pieces.append(f"{START}assistant\n")
    return "".join(pieces)

context = [
    {"role": "system", "content": "Use only the supplied policy evidence."},
    {"role": "user", "content": "My service-account key is 18 days old. Can I keep using it?"},
]
answer = {
    "role": "assistant",
    "content": "Open a rotation ticket because keys older than 14 days require review.",
}

training_text = render(context + [answer])
generation_text = render(context, add_generation_prompt=True)
prefill_text = render(
    context + [{"role": "assistant", "content": '{"rotation_window_days": '}],
    continue_final_message=True,
)

print("training_has_answer=", "14 days" in training_text)
print("generation_ends_at_assistant=", generation_text.endswith(f"{START}assistant\n"))
print("prefill_is_open=", prefill_text.endswith('{"rotation_window_days": '))

try:
    render(context, add_generation_prompt=True, continue_final_message=True)
except ValueError as error:
    print("invalid_mode_caught=", str(error))

Output

training_has_answer= True
generation_ends_at_assistant= True
prefill_is_open= True
invalid_mode_caught= choose a new assistant turn or a prefill, not both

Hugging Face tokenizers implement this idea with `apply_chat_template`. For generation, `add_generation_prompt=True` appends a start-of-assistant sequence only when that particular template defines one. For a response prefill, `continue_final_message=True` keeps the final assistant content open, and the documentation treats combining both flags as an error.[3]

use-the-checkpoint-tokenizer.py

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "My service-account key is 18 days old. Can I keep using it?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

This snippet intentionally uses the tokenizer attached to the checkpoint rather than recreating the format with string concatenation. For Mistral-7B-Instruct-v0.1, the shipped template already places generation after `[/INST]`; `add_generation_prompt=True` doesn't append a separate assistant-start token. Other checkpoint templates do append a cue. If you render text with `tokenize=False` and tokenize it in a later step, pass `add_special_tokens=False` to that later call; otherwise BOS or EOS tokens may be inserted twice.[3]
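The double-insertion risk can be reproduced without downloading a model. The sketch below uses a toy tokenizer: `toy_encode` and its vocabulary are invented for illustration and are not a Transformers API. Its `add_special_tokens` flag mimics the behavior described above, where the rendered template already contains `<s>` and the encoder prepends another one unless told not to.

```python
BOS = "<s>"
vocab = {BOS: 1, "[INST]": 3, "[/INST]": 4}

def toy_encode(text, add_special_tokens=True):
    """Whitespace tokenizer; unknown words map to ID 0. Illustration only."""
    ids = [vocab.get(piece, 0) for piece in text.split()]
    if add_special_tokens:
        ids = [vocab[BOS]] + ids  # the tokenizer adds its own BOS
    return ids

# Pretend this string came from a template rendered with tokenize=False.
rendered = "<s> [INST] Can I keep using it? [/INST]"

doubled = toy_encode(rendered)  # flag left at its default
clean = toy_encode(rendered, add_special_tokens=False)

print("bos_count_default=", doubled.count(vocab[BOS]))
print("bos_count_disabled=", clean.count(vocab[BOS]))
```

Counting special-token IDs like this on a real tokenizer's output is a cheap regression test after any preprocessing refactor.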

Before rendering thousands of rows, reject malformed dialogue structure. An assistant reply without a preceding user request is a bad supervised example even if its text is fluent.

validate-conversation-roles.py

rows = {
    "valid": ["system", "user", "assistant"],
    "missing_answer": ["system", "user"],
    "double_user": ["system", "user", "user", "assistant"],
}

def role_errors(roles):
    turns = roles[1:] if roles and roles[0] == "system" else roles
    expected = "user"
    for role in turns:
        if role != expected:
            return [f"expected {expected}, got {role}"]
        expected = "assistant" if expected == "user" else "user"
    if expected == "assistant":
        return ["conversation ends before assistant answer"]
    return []

for row_id, roles in rows.items():
    errors = role_errors(roles)
    print(f"{row_id}: {'accept' if not errors else 'reject'} {errors}")

assert role_errors(rows["valid"]) == []
assert role_errors(rows["missing_answer"]) == ["conversation ends before assistant answer"]

Output

valid: accept []
missing_answer: reject ['conversation ends before assistant answer']
double_user: reject ['expected assistant, got user']

Fine-tuning data must teach supported answers

Good formatting can't rescue bad supervision. If a training row claims that stale keys can wait 30 days when the policy says 14 days with a rotation ticket, SFT reinforces an unsupported answer.

Your first data design decision isn't "How many rows can I generate?" It's "What behavior is each accepted row allowed to teach?"

Data source	Useful for	Risk to test before training
Reviewed policy examples	Policy-critical responses and refusal behavior	Sparse coverage of unusual tickets
Synthetic variations from approved seeds	Paraphrases, edge cases, tone variation	Unsupported policy claims or near-duplicates
Multi-turn transcripts	Follow-ups such as "Which rotation ticket should I use?"	Old context, personal data, or unhelpful agent habits

Self-Instruct demonstrated a repeatable way to expand instruction data: start from human-written tasks, generate new instructions and responses, then filter invalid or similar rows before fine-tuning.[7] A product team can use the same pattern with policy questions, but it must validate answers against source clauses rather than trusting a generator's confidence.
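The step that filters similar rows can be sketched with a simple overlap measure. Self-Instruct filtered on ROUGE-L similarity; the version below swaps in token-level Jaccard overlap to stay dependency-free. The 0.6 threshold, the helper names, and the sample instructions are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Token-set overlap between two instructions, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def drop_near_duplicates(instructions, threshold=0.6):
    kept = []
    for candidate in instructions:
        # Keep a candidate only if it is not too close to anything already kept.
        if all(jaccard(candidate, existing) < threshold for existing in kept):
            kept.append(candidate)
    return kept

generated = [
    "My service-account key is 18 days old. Can I keep using it?",
    "My service-account key is 19 days old. Can I keep using it?",
    "Who approves temporary admin access?",
]

kept = drop_near_duplicates(generated)
print("kept=", kept)
```

The second instruction differs from the first by one token and is dropped. A production filter would likely also normalize punctuation and compare against previously accepted rows, not just the current batch.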

A three-row synthetic SFT grounding matrix accepts row-001 and row-003 after required-fact and forbidden-claim checks, rejects row-002, then shows source citation missing from intent coverage. — Passing row-level grounding is necessary but not sufficient: the two accepted examples cover stale-key and admin-access cases, while source citation still lacks a demonstration.

The next lab treats each assistant answer as a candidate SFT row. A row is accepted only if it contains required policy facts and omits a known unsupported claim.

filter-grounded-sft-rows.py

policies = {
    "stale_key": {
        "required": ("14 days", "rotation ticket"),
        "forbidden": ("30 days",),
    },
    "admin_access": {
        "required": ("reviewer approval", "policy p-7"),
        "forbidden": ("auto-approve",),
    },
}

candidate_rows = [
    {
        "id": "row-001",
        "policy": "stale_key",
        "answer": "Open a rotation ticket because keys older than 14 days require review.",
    },
    {
        "id": "row-002",
        "policy": "stale_key",
        "answer": "Keep using the key and rotate it within 30 days.",
    },
    {
        "id": "row-003",
        "policy": "admin_access",
        "answer": "Escalate temporary admin access for reviewer approval under policy P-7.",
    },
]

def evaluate(row):
    text = row["answer"].lower()
    rule = policies[row["policy"]]
    has_required_facts = all(fact in text for fact in rule["required"])
    contains_forbidden_claim = any(claim in text for claim in rule["forbidden"])
    return has_required_facts and not contains_forbidden_claim

accepted = []
for row in candidate_rows:
    decision = "accept" if evaluate(row) else "reject"
    print(f"{row['id']}: {decision}")
    if decision == "accept":
        accepted.append(row["id"])

print("accepted_rows=", accepted)
assert accepted == ["row-001", "row-003"]

Output

row-001: accept
row-002: reject
row-003: accept
accepted_rows= ['row-001', 'row-003']

This filter is deliberately small. A real data pipeline also checks duplicated instructions, personal information, unsafe replies, role ordering, length limits, and human-review requirements for high-risk cases. Version the evidence snapshot, generator prompt, filters, and accepted dataset together. Otherwise you won't know which change caused a behavior regression.
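One lightweight way to version those pieces together is to fingerprint each artifact and then fingerprint the set. This is a minimal sketch: the artifact contents are placeholders, and `fingerprint` is a hypothetical helper rather than part of any training library.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable short hash of any JSON-serializable artifact."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

# Placeholder artifacts; a real pipeline would load these from files.
artifacts = {
    "evidence_snapshot": {"stale_key": "Keys older than 14 days require a rotation ticket."},
    "generator_prompt": "Write a question about the policy clause below.",
    "filters": {"required": ["14 days", "rotation ticket"], "forbidden": ["30 days"]},
    "accepted_rows": ["row-001", "row-003"],
}

manifest = {name: fingerprint(value) for name, value in artifacts.items()}
# The combined hash changes whenever any single artifact changes.
manifest["dataset_version"] = fingerprint(manifest)

print(json.dumps(manifest, indent=2))
```

Storing this manifest next to the trained checkpoint lets a later regression be traced to the specific artifact whose hash moved.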

An accepted dataset also needs coverage. Rows for stale-key and admin-access cases don't demonstrate how the assistant should answer a source-citation question.

report-sft-coverage-gaps.py

required_intents = {
    "stale_key",
    "admin_access",
    "source_citation",
}
accepted_rows = [
    {"id": "row-001", "intent": "stale_key"},
    {"id": "row-003", "intent": "admin_access"},
    {"id": "row-004", "intent": "stale_key"},
]

covered = {row["intent"] for row in accepted_rows}
missing = sorted(required_intents - covered)
counts = {
    intent: sum(row["intent"] == intent for row in accepted_rows)
    for intent in sorted(required_intents)
}

print("accepted_counts=", counts)
print("missing_critical_intents=", missing)
print("ready_for_training=", not missing)

assert missing == ["source_citation"]

Output

accepted_counts= {'admin_access': 1, 'source_citation': 0, 'stale_key': 2}
missing_critical_intents= ['source_citation']
ready_for_training= False

Which tokens should produce gradient?

Every completed chat transcript contains tokens from the system prompt, user question, and assistant answer. You must decide which of those tokens are supervised targets.

Two choices matter:

Objective choice	Target tokens	When it can be reasonable
Full-sequence causal loss	All non-padding transcript tokens	You intentionally train the model on the whole conversation distribution.
Assistant-only loss	Assistant response spans, usually including their end-of-turn markers	You want the gradient budget focused on response behavior rather than reproducing prompt text.

Assistant-only loss is common, but it isn't the definition of SFT. TRL exposes it as `assistant_only_loss=True` for conversational data only when the chat template can identify assistant spans through `{% generation %}` and `{% endgeneration %}` markers. Current TRL releases automatically patch templates for some bundled model families; inspect the resolved template for the checkpoint you train.[8] If the span mask is wrong, you may silently train on user text or mask out the answer you meant to learn.

In a causal language model, each target token is predicted from the tokens that precede it. The tiny preprocessing lab below marks assistant words and the assistant turn terminator as targets; role markers, system text, and user text receive the usual ignore label `-100`.

build-assistant-only-labels.py

conversation = [
    ("system", "Use supplied policy evidence"),
    ("user", "Stale key needs access"),
    ("assistant", "Open rotation ticket"),
]

tokens = []
labels = []

for role, text in conversation:
    tokens.append(f"<{role}>")
    labels.append("-100")
    for word in text.split():
        tokens.append(word)
        labels.append(word if role == "assistant" else "-100")
    tokens.append("<eot>")
    labels.append("<eot>" if role == "assistant" else "-100")

trained_targets = [label for label in labels if label != "-100"]
masked_tokens = sum(label == "-100" for label in labels)

print("tokens=", tokens)
print("trained_targets=", trained_targets)
print("masked_tokens=", masked_tokens)

assert trained_targets == ["Open", "rotation", "ticket", "<eot>"]
assert labels[tokens.index("<user>")] == "-100"

Output

tokens= ['<system>', 'Use', 'supplied', 'policy', 'evidence', '<eot>', '<user>', 'Stale', 'key', 'needs', 'access', '<eot>', '<assistant>', 'Open', 'rotation', 'ticket', '<eot>']
trained_targets= ['Open', 'rotation', 'ticket', '<eot>']
masked_tokens= 13

Use this kind of inspected tiny batch before launching training. A training-loss curve can't reveal that your boundary finder shifted one token too far and trained the wrong span.
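An off-by-one boundary can also be caught mechanically once each token position carries a role. The sketch below assumes you can recover per-token roles from your preprocessing; `mask_errors` is a hypothetical helper, the token IDs 11 to 13 are placeholders, and role markers are assumed to stay masked.

```python
IGNORE = -100

def mask_errors(roles, labels):
    """Compare a label mask against the role of each token position."""
    errors = []
    for position, (role, label) in enumerate(zip(roles, labels)):
        supervised = label != IGNORE
        if supervised and role != "assistant":
            errors.append(f"position {position}: {role} token is supervised")
        if not supervised and role == "assistant":
            errors.append(f"position {position}: assistant token is masked")
    return errors

# Role of each token position; role markers are tagged "marker".
roles = ["marker", "user", "user", "marker", "assistant", "assistant", "assistant"]

correct = [IGNORE, IGNORE, IGNORE, IGNORE, 11, 12, 13]
shifted = [IGNORE, IGNORE, IGNORE, 11, 12, 13, IGNORE]  # span starts one token early

print("correct_mask_errors=", mask_errors(roles, correct))
print("shifted_mask_errors=", mask_errors(roles, shifted))
```

The shifted mask produces two complaints, one at each end of the span, while both masks supervise the same number of tokens. That is why a count alone would not catch the bug.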

This calculation makes the objective choice visible. Full-sequence loss scores every token except the first, which has no preceding context to predict it from; assistant-only loss scores only the desired answer span and its terminator.

compare-supervised-target-budgets.py

stream = [
    ("system", "<system>"),
    ("system", "Use"),
    ("system", "policy"),
    ("system", "<eot>"),
    ("user", "<user>"),
    ("user", "Stale"),
    ("user", "key"),
    ("user", "<eot>"),
    ("assistant-marker", "<assistant>"),
    ("assistant", "Open"),
    ("assistant", "ticket"),
    ("assistant", "<eot>"),
]

full_sequence_targets = [token for _, token in stream[1:]]
assistant_only_targets = [token for role, token in stream if role == "assistant"]

print("full_sequence_target_count=", len(full_sequence_targets))
print("assistant_only_target_count=", len(assistant_only_targets))
print("assistant_only_targets=", assistant_only_targets)

assert assistant_only_targets == ["Open", "ticket", "<eot>"]
assert len(assistant_only_targets) < len(full_sequence_targets)

Output

full_sequence_target_count= 11
assistant_only_target_count= 3
assistant_only_targets= ['Open', 'ticket', '<eot>']

Packing saves padding, but test isolation explicitly

SFT rows have different lengths. If every short chat is padded to a long context window, the accelerator spends much of its work processing padding. Packing fills a window with several short sequences instead. TRL supports packing as a training configuration for this reason.[8]

Packing introduces a decision that teams often miss: may a token in conversation B attend to earlier tokens in conversation A? Some concatenated language-model recipes separate samples with an end token while retaining ordinary causal attention. If you need each support conversation to be an independent supervised example, use a trainer or attention kernel that supports isolation and verify its boundary behavior.

First measure why packing is tempting. Four short training rows padded separately to a 16-token window waste most of their capacity; filling shared windows reduces that waste.

measure-packing-utilization.py

window = 16
row_lengths = [9, 6, 7, 5]

separate_capacity = window * len(row_lengths)
separate_utilization = sum(row_lengths) / separate_capacity

packed_windows = []
for length in row_lengths:
    for index, used in enumerate(packed_windows):
        if used + length <= window:
            packed_windows[index] += length
            break
    else:
        packed_windows.append(length)

packed_capacity = window * len(packed_windows)
packed_utilization = sum(row_lengths) / packed_capacity

print("separate_utilization=", f"{separate_utilization:.1%}")
print("packed_windows=", packed_windows)
print("packed_utilization=", f"{packed_utilization:.1%}")

assert packed_windows == [15, 12]
assert packed_utilization > separate_utilization

Output

separate_utilization= 42.2%
packed_windows= [15, 12]
packed_utilization= 84.4%

Suppose two three-token rows are packed into one sequence:

Packed position	0	1	2	3	4	5
Conversation ID	A	A	A	B	B	B
Allowed history for strict isolation	A only	A only	A only	B only	B only	B only

For strict isolation, the causal self-attention matrix contains two blocks:

M = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}

Each row is a query position; each column is a possible earlier key position. The zeros in rows 3 through 5 under columns 0 through 2 prevent the second conversation from reading the first one.

verify-packed-isolation.py

conversation_ids = ["A", "A", "A", "B", "B", "B"]

mask = [
    [
        int(key_position <= query_position and key_id == query_id)
        for key_position, key_id in enumerate(conversation_ids)
    ]
    for query_position, query_id in enumerate(conversation_ids)
]

for row in mask:
    print(" ".join(map(str, row)))

cross_conversation_edges = [
    (query, key)
    for query, row in enumerate(mask)
    for key, allowed in enumerate(row)
    if allowed and conversation_ids[query] != conversation_ids[key]
]

print("cross_conversation_edges=", cross_conversation_edges)
assert cross_conversation_edges == []
assert mask[4][1] == 0

Output

1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 1 1 1
cross_conversation_edges= []

Don't assume an option named `packing=True` automatically constructs this matrix. Inspect your trainer's documented semantics and run a small boundary test with the implementation you'll train.
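One way to prepare such a boundary test is to derive per-token bookkeeping from the packed row lengths. The sketch assumes an isolation-aware implementation consumes conversation IDs, restarting position IDs, or cumulative row offsets. Those input names are illustrative, so check what your trainer or attention kernel actually expects.

```python
from itertools import accumulate

def packing_metadata(row_lengths):
    """Derive per-token bookkeeping for rows packed back to back."""
    sequence_ids = []   # which packed row each token belongs to
    position_ids = []   # position restarts at 0 for every row
    for row_index, length in enumerate(row_lengths):
        sequence_ids.extend([row_index] * length)
        position_ids.extend(range(length))
    boundaries = [0] + list(accumulate(row_lengths))  # cumulative row offsets
    return sequence_ids, position_ids, boundaries

sequence_ids, position_ids, boundaries = packing_metadata([3, 3])

print("sequence_ids=", sequence_ids)
print("position_ids=", position_ids)
print("boundaries=", boundaries)
```

For the two three-token rows above, the sequence IDs play the role of the conversation IDs A and B. A boundary test can then assert that no query attends to a key with a different sequence ID.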

Production failures are usually contract failures

Once a checkpoint has been fine-tuned, evaluate it with the private cases you built in the previous chapter. Log the rendered prompt, tokenizer revision, template revision, truncation policy, and generation settings alongside every result. Otherwise an answer regression could be a template deployment bug rather than a model change.

Symptom	Likely contract failure	First diagnostic check
Model generates another user question instead of answering	Missing assistant-start cue for a template that needs one	Inspect final rendered tokens before `.generate()`.
Model emits unfamiliar role markers or rambles	Checkpoint served with another template family	Compare tokenizer/template revision with training manifest.
Responses changed after refactoring preprocessing	BOS, EOS, or delimiters were inserted twice	Count special-token IDs after rendering and tokenization.
Good short answers, incorrect long threads	Truncation removed the policy evidence or system instruction	Log retained messages and token count.
Low loss, poor response quality	Wrong target-span mask or noisy accepted rows	Render one batch with visible labels and inspect rejected data.
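The truncation row in the table above is a policy choice that can be tested in isolation. The sketch below shows one possible policy, assuming the system message carries the policy evidence and must survive: it drops the oldest non-system turns first and always keeps the latest message. `truncate_history` and the word-count stand-in for a tokenizer are hypothetical.

```python
def truncate_history(messages, budget, count_tokens):
    """Drop the oldest non-system turns until the conversation fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    dropped = 0
    # Always keep at least the most recent turn.
    while len(turns) > 1 and sum(count_tokens(m) for m in system + turns) > budget:
        turns.pop(0)
        dropped += 1
    return system + turns, dropped

def count_words(message):
    # Stand-in for a real tokenizer count.
    return len(message["content"].split())

thread = [
    {"role": "system", "content": "Answer from the supplied policy evidence."},
    {"role": "user", "content": "Is a 10 day old key fine?"},
    {"role": "assistant", "content": "Yes, it is under 14 days."},
    {"role": "user", "content": "And an 18 day old key?"},
]

kept, dropped = truncate_history(thread, budget=15, count_tokens=count_words)
print("kept_roles=", [m["role"] for m in kept])
print("dropped_turns=", dropped)
```

Logging the kept roles and the dropped count next to each evaluation result makes it clear whether a long-thread failure came from the model or from the context it was given.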

This minimal deployment manifest check can't measure model quality, but it prevents shipping an endpoint whose serialization contract is known to differ from the fine-tuning run.

audit-serving-contract.py

training_manifest = {
    "checkpoint": "access-policy-sft-v3",
    "tokenizer_revision": "tokens-7",
    "template_revision": "chatml-policy-v2",
    "add_special_tokens_after_template": False,
}

deployments = [
    {
        "name": "candidate-safe",
        "checkpoint": "access-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "chatml-policy-v2",
        "add_special_tokens_after_template": False,
    },
    {
        "name": "candidate-drifted",
        "checkpoint": "access-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "mistral-wrapper-v1",
        "add_special_tokens_after_template": True,
    },
]

def mismatches(candidate):
    return [
        field
        for field, expected in training_manifest.items()
        if candidate[field] != expected
    ]

for deployment in deployments:
    drift = mismatches(deployment)
    status = "block" if drift else "evaluate"
    print(f"{deployment['name']}: {status} drift={drift}")

assert mismatches(deployments[0]) == []
assert mismatches(deployments[1]) == [
    "template_revision",
    "add_special_tokens_after_template",
]

Output

candidate-safe: evaluate drift=[]
candidate-drifted: block drift=['template_revision', 'add_special_tokens_after_template']

A matching manifest earns the right to run quality evaluations; it doesn't prove the model is ready. Re-run grounded policy cases, critical failure slices, p95 latency checks, and human review checks after every fine-tune or serving change.
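For the latency part of that checklist, p95 can be computed directly from logged request times. The sketch uses the nearest-rank definition, which is one of several percentile conventions, and the latency samples are invented.

```python
import math

def percentile_nearest_rank(samples, percent):
    """Smallest sample with at least `percent` percent of the data at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Invented request latencies in milliseconds, including one slow outlier.
latencies_ms = [310, 295, 402, 350, 330, 980, 340, 365, 300, 320,
                355, 345, 390, 410, 305, 335, 360, 375, 315, 325]

p95 = percentile_nearest_rank(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)

print("p95_latency_ms=", p95)
print("mean_latency_ms=", round(mean, 1))
```

With only 20 samples, the single outlier sits above the 95th percentile, so a larger sample is needed before a p95 gate says much about tail behavior.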

The final executable check joins template work back to evaluation. Template parity is mandatory, but a correctly formatted candidate still ships only if grounded cases and operational limits pass.

gate-finetuned-candidates.py

candidates = [
    {
        "name": "sft-v3",
        "template_matches": True,
        "grounded_rate": 0.99,
        "critical_errors": 0,
        "p95_latency_ms": 620,
    },
    {
        "name": "sft-v4-fast",
        "template_matches": True,
        "grounded_rate": 0.94,
        "critical_errors": 1,
        "p95_latency_ms": 410,
    },
]

def blockers(candidate):
    failed = []
    if not candidate["template_matches"]:
        failed.append("template_drift")
    if candidate["grounded_rate"] < 0.98:
        failed.append("grounded_quality")
    if candidate["critical_errors"] > 0:
        failed.append("critical_policy_error")
    if candidate["p95_latency_ms"] > 700:
        failed.append("latency")
    return failed

for candidate in candidates:
    failed = blockers(candidate)
    decision = "release" if not failed else "block"
    print(f"{candidate['name']}: {decision} blockers={failed}")

assert blockers(candidates[0]) == []
assert blockers(candidates[1]) == ["grounded_quality", "critical_policy_error"]

Output

sft-v3: release blockers=[]
sft-v4-fast: block blockers=['grounded_quality', 'critical_policy_error']

From chat foundations to applied engineering

Instruction tuning connects data to behavior:

A conversation row demonstrates the response you want.
A chat template turns that row into the exact token context used for training and serving.
A chosen label mask decides where supervised gradient is spent.
Data filtering, packing checks, and private evaluations keep the learned behavior honest.

You don't need to fine-tune a large model to practice. Render transcripts, inspect labels, validate synthetic rows, and treat the serving manifest as part of your experiment record. Those habits scale from a tiny exercise to a serious training run.

Mastery check

Key concepts

Base-model continuation versus instruction-tuned response behavior
SFT examples grounded in approved evidence
Chat-template serialization and checkpoint-specific markers
Fresh generation versus assistant prefill
Full-sequence versus assistant-only loss
Synthetic-data acceptance checks
Packed-sequence isolation as a verified design choice
Training and serving template parity

Evaluation rubric

Foundational: Explains why a base model can complete dialogue text without reliably acting as a support assistant.
Intermediate: Renders completed, generation, and prefill versions of one conversation without mixing their boundary rules.
Intermediate: Builds an assistant-only target mask and explains why that's a choice rather than a universal SFT requirement.
Advanced: Designs a data and serving audit that rejects unsupported policy rows and template drift before evaluation.

Common pitfalls

Training on plausible synthetic answers without checking them against policy evidence.
Hand-formatting messages with delimiters that don't match the checkpoint's tokenizer template.
Treating assistant_only_loss or packed isolation as automatic behavior without inspecting the trainer.
Using a new-assistant generation cue when the request already contains an assistant prefill.
Declaring a fine-tuned checkpoint better before rerunning grounded private evaluations.

Practice extension

Add a third policy row for a source-citation requirement and a second rejected answer that invents automatic approval. Extend filter-grounded-sft-rows.py to print which required fact is missing or which forbidden claim was found. Then add its accepted answer to the rendering and label-building labs. Show exactly which assistant tokens become supervised targets and exactly which evidence allowed the row into training.

Next Step

Continue to Dimensionality Reduction for Embeddings

You can now connect supervision data, serialized prompts, and evaluated assistant behavior. The applied phase begins by examining how high-dimensional representations are compressed and visualized, which prepares you to inspect retrieval and ranking systems instead of accepting embedding outputs without measurement.

References

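[1] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022. https://arxiv.org/abs/2203.02155

[2] Zhou, C., et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023. https://arxiv.org/abs/2305.11206

[3] Hugging Face (2026). Transformers Documentation: Chat templates. https://huggingface.co/docs/transformers/main/en/chat_templating

[4] Dubey, A., et al. (2024). The Llama 3 Herd of Models. arXiv preprint. https://arxiv.org/abs/2407.21783

[5] Hugging Face (2026). Transformers Documentation: Writing a chat template. https://huggingface.co/docs/transformers/main/en/chat_templating_writing

[6] Mistral AI (2026). Tokenization. https://docs.mistral.ai/resources/cookbooks/concept-deep-dive-tokenization-templates

[7] Wang, Y., et al. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL 2023. https://aclanthology.org/2023.acl-long.754/

[8] Hugging Face (2026). TRL Documentation: SFT Trainer. https://huggingface.co/docs/trl/sft_trainer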

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Back to Topics

LearnCore LLM FoundationsInstruction Tuning & Chat Templates

⚡MediumFine-Tuning & Training

Instruction Tuning & Chat Templates

Teach a base language model to answer as an assistant: curate grounded SFT rows, serialize chat turns exactly, choose loss targets, pack safely, and detect serving-time template drift.

17 min read

Learning path

Step 55 of 158 in the full curriculum

LLM Benchmarks & Limitations Dimensionality Reduction for Embeddings

The policy-answering assistant now has evaluations. One private case asks about a stale service-account key, with this evidence:

Policy evidence: Service-account keys older than 14 days require a rotation ticket before continued use.

One example defines a behavior contract

For an access-policy assistant, one SFT training row can look like this:

Role	Content	What this turn contributes
`system`	Answer from the supplied policy evidence.	Sets the behavior boundary.
`user`	My service-account key is 18 days old. Can I keep using it?	Gives the user's request.
`assistant`	Open a rotation ticket because keys older than 14 days require review.	Supplies the continuation to reward.

Chat messages eventually become one token stream

Model families don't share one universal chat serialization:

Checkpoint family example	Shape of its turn markers	Engineering consequence
ChatML-style	`<|im_start|>user ... <|im_end|>`	An assistant-start marker can indicate a fresh generation turn.
Llama 3 Instruct	Header tokens followed by `<|eot_id|>`	Use the tokenizer's shipped headers and end-of-turn marker. [4][5]
Mistral-7B-Instruct-v0.1	`<s>[INST] ... [/INST] ... </s>`	Spacing and turn layout are part of the tokenizer contract. [3][6]
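To make the differences in the table concrete, the following sketch renders one user question with three hand-written, simplified layouts. The function names are hypothetical, and the exact whitespace and BOS handling shown here are assumptions; in real work the strings should come from the checkpoint's own tokenizer template rather than from code like this.

```python
# Simplified renderers for illustration only. Real checkpoints ship their
# own template; whitespace and BOS handling here are assumptions.
def chatml(user_text):
    return f"<|im_start|>user\n{user_text}<|im_end|>\n<|im_start|>assistant\n"

def llama3(user_text):
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_text}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def mistral_inst(user_text):
    return f"<s>[INST] {user_text} [/INST]"

question = "Can I keep using an 18-day-old key?"
rendered = {fn.__name__: fn(question) for fn in (chatml, llama3, mistral_inst)}

for name, text in rendered.items():
    print(f"{name}: {text!r}")

# The same message becomes three different strings, hence three token streams.
print("all_distinct=", len(set(rendered.values())) == len(rendered))
```

A checkpoint fine-tuned on one of these layouts has not seen the role markers of the other two, which is why the template is treated as part of the model contract.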

Render training, fresh generation, and prefill separately

The smallest useful template exercise is to render the same support task in three modes:

A completed training transcript contains the known assistant answer.
A new inference request ends at an assistant-start cue.
A prefill already contains the beginning of an assistant answer and asks the model to continue it.

The last two modes aren't the same. A single render call should never both open a new assistant turn and continue an existing assistant message.

render-chat-contract.py

START = "<|im_start|>"
END = "<|im_end|>"

def render(messages, *, add_generation_prompt=False, continue_final_message=False):
    if add_generation_prompt and continue_final_message:
        raise ValueError("choose a new assistant turn or a prefill, not both")

    pieces = []
    for index, message in enumerate(messages):
        role = message["role"]
        content = message["content"]
        is_prefill = (
            continue_final_message
            and index == len(messages) - 1
            and role == "assistant"
        )
        pieces.append(f"{START}{role}\n{content}")
        if not is_prefill:
            pieces.append(f"{END}\n")

    if add_generation_prompt:
        pieces.append(f"{START}assistant\n")
    return "".join(pieces)

context = [
    {"role": "system", "content": "Use only the supplied policy evidence."},
    {"role": "user", "content": "My service-account key is 18 days old. Can I keep using it?"},
]
answer = {
    "role": "assistant",
    "content": "Open a rotation ticket because keys older than 14 days require review.",
}

training_text = render(context + [answer])
generation_text = render(context, add_generation_prompt=True)
prefill_text = render(
    context + [{"role": "assistant", "content": '{"rotation_window_days": '}],
    continue_final_message=True,
)

print("training_has_answer=", "14 days" in training_text)
print("generation_ends_at_assistant=", generation_text.endswith(f"{START}assistant\n"))
print("prefill_is_open=", prefill_text.endswith('{"rotation_window_days": '))

try:
    render(context, add_generation_prompt=True, continue_final_message=True)
except ValueError as error:
    print("invalid_mode_caught=", str(error))

Output

training_has_answer= True
generation_ends_at_assistant= True
prefill_is_open= True
invalid_mode_caught= choose a new assistant turn or a prefill, not both

use-the-checkpoint-tokenizer.py

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "My service-account key is 18 days old. Can I keep using it?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

Before rendering thousands of rows, reject malformed dialogue structure. An assistant reply without a preceding user request is a bad supervised example even if its text is fluent.

validate-conversation-roles.py

rows = {
    "valid": ["system", "user", "assistant"],
    "missing_answer": ["system", "user"],
    "double_user": ["system", "user", "user", "assistant"],
}

def role_errors(roles):
    # An optional leading system turn is allowed; the rest must alternate user/assistant.
    turns = roles[1:] if roles and roles[0] == "system" else roles
    if not turns:
        return ["conversation has no user turn"]
    expected = "user"
    for role in turns:
        if role != expected:
            return [f"expected {expected}, got {role}"]
        expected = "assistant" if expected == "user" else "user"
    if expected == "assistant":
        return ["conversation ends before assistant answer"]
    return []

for row_id, roles in rows.items():
    errors = role_errors(roles)
    print(f"{row_id}: {'accept' if not errors else 'reject'} {errors}")

assert role_errors(rows["valid"]) == []
assert role_errors(rows["missing_answer"]) == ["conversation ends before assistant answer"]

Output

valid: accept []
missing_answer: reject ['conversation ends before assistant answer']
double_user: reject ['expected assistant, got user']

Fine-tuning data must teach supported answers

Good formatting can't rescue bad supervision. If a training row claims that stale keys can wait 30 days when the policy says 14 days with a rotation ticket, SFT reinforces an unsupported answer.

Your first data design decision isn't "How many rows can I generate?" It's "What behavior is each accepted row allowed to teach?"

Data source	Useful for	Risk to test before training
Reviewed policy examples	Policy-critical responses and refusal behavior	Sparse coverage of unusual tickets
Synthetic variations from approved seeds	Paraphrases, edge cases, tone variation	Unsupported policy claims or near-duplicates
Multi-turn transcripts	Follow-ups such as "Which rotation ticket should I use?"	Old context, personal data, or unhelpful agent habits

The next lab treats each assistant answer as a candidate SFT row. A row is accepted only if it contains required policy facts and omits a known unsupported claim.

filter-grounded-sft-rows.py

policies = {
    "stale_key": {
        "required": ("14 days", "rotation ticket"),
        "forbidden": ("30 days",),
    },
    "admin_access": {
        "required": ("reviewer approval", "policy p-7"),
        "forbidden": ("auto-approve",),
    },
}

candidate_rows = [
    {
        "id": "row-001",
        "policy": "stale_key",
        "answer": "Open a rotation ticket because keys older than 14 days require review.",
    },
    {
        "id": "row-002",
        "policy": "stale_key",
        "answer": "Keep using the key and rotate it within 30 days.",
    },
    {
        "id": "row-003",
        "policy": "admin_access",
        "answer": "Escalate temporary admin access for reviewer approval under policy P-7.",
    },
]

def evaluate(row):
    text = row["answer"].lower()
    rule = policies[row["policy"]]
    has_required_facts = all(fact in text for fact in rule["required"])
    contains_forbidden_claim = any(claim in text for claim in rule["forbidden"])
    return has_required_facts and not contains_forbidden_claim

accepted = []
for row in candidate_rows:
    decision = "accept" if evaluate(row) else "reject"
    print(f"{row['id']}: {decision}")
    if decision == "accept":
        accepted.append(row["id"])

print("accepted_rows=", accepted)
assert accepted == ["row-001", "row-003"]

Output

row-001: accept
row-002: reject
row-003: accept
accepted_rows= ['row-001', 'row-003']
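The data-source table also lists near-duplicates as a risk of synthetic variations, and the fact filter above doesn't catch them. A minimal sketch of one possible check follows, assuming word-set Jaccard similarity and a cutoff of 0.8; both the metric and the threshold are illustrative choices, and `row-005` and `row-006` are hypothetical rows.

```python
import re

def words(text):
    # Lowercase word set; punctuation is dropped so trivial edits don't hide duplicates.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

candidates = [
    ("row-001", "Open a rotation ticket because keys older than 14 days require review."),
    ("row-005", "Open a rotation ticket, because keys older than 14 days require review!"),
    ("row-006", "Keys past 14 days need a rotation ticket before you continue using them."),
]

THRESHOLD = 0.8  # assumed cutoff; tune it on reviewed pairs
kept = []
for row_id, answer in candidates:
    # Compare against rows already kept; the first sufficiently similar one wins.
    duplicate_of = next(
        (kept_id for kept_id, kept_answer in kept if jaccard(answer, kept_answer) >= THRESHOLD),
        None,
    )
    print(f"{row_id}: {'drop' if duplicate_of else 'keep'} duplicate_of={duplicate_of}")
    if duplicate_of is None:
        kept.append((row_id, answer))

print("kept_rows=", [row_id for row_id, _ in kept])
```

Word overlap only catches surface-level copies. A paraphrase such as `row-006` passes, which is usually what you want from synthetic variation.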

An accepted dataset also needs coverage. Rows for stale-key and admin-access cases don't demonstrate how the assistant should answer a source-citation question.

report-sft-coverage-gaps.py

required_intents = {
    "stale_key",
    "admin_access",
    "source_citation",
}
accepted_rows = [
    {"id": "row-001", "intent": "stale_key"},
    {"id": "row-003", "intent": "admin_access"},
    {"id": "row-004", "intent": "stale_key"},
]

covered = {row["intent"] for row in accepted_rows}
missing = sorted(required_intents - covered)
counts = {
    intent: sum(row["intent"] == intent for row in accepted_rows)
    for intent in sorted(required_intents)
}

print("accepted_counts=", counts)
print("missing_critical_intents=", missing)
print("ready_for_training=", not missing)

assert missing == ["source_citation"]

Output

accepted_counts= {'admin_access': 1, 'source_citation': 0, 'stale_key': 2}
missing_critical_intents= ['source_citation']
ready_for_training= False

Which tokens should produce gradient?

Every completed chat transcript contains tokens from the system prompt, user question, and assistant answer. You must decide which of those tokens are supervised targets.

Two choices matter:

Objective choice	Target tokens	When it can be reasonable
Full-sequence causal loss	All non-padding transcript tokens	You intentionally train the model on the whole conversation distribution.
Assistant-only loss	Assistant response spans, usually including their end-of-turn markers	You want the gradient budget focused on response behavior rather than reproducing prompt text.
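Both rows of the table can be written as one masked next-token objective. The notation below is an assumption introduced for illustration: x_1, ..., x_T is the serialized row, p_θ is the model, and m_t is 1 when token x_t is a supervised target and 0 otherwise. Averaging over the number of target tokens is one common normalization; a given trainer may normalize differently.

```latex
\mathcal{L}(\theta) = -\frac{1}{\sum_{t=2}^{T} m_t} \sum_{t=2}^{T} m_t \,\log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad m_t \in \{0, 1\}
```

Full-sequence causal loss sets m_t = 1 for every non-padding token. Assistant-only loss sets m_t = 1 only for tokens inside assistant response spans, usually including their end-of-turn markers.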

build-assistant-only-labels.py

conversation = [
    ("system", "Use supplied policy evidence"),
    ("user", "Stale key needs access"),
    ("assistant", "Open rotation ticket"),
]

tokens = []
labels = []

for role, text in conversation:
    tokens.append(f"<{role}>")
    labels.append("-100")
    for word in text.split():
        tokens.append(word)
        labels.append(word if role == "assistant" else "-100")
    tokens.append("<eot>")
    labels.append("<eot>" if role == "assistant" else "-100")

trained_targets = [label for label in labels if label != "-100"]
masked_tokens = sum(label == "-100" for label in labels)

print("tokens=", tokens)
print("trained_targets=", trained_targets)
print("masked_tokens=", masked_tokens)

assert trained_targets == ["Open", "rotation", "ticket", "<eot>"]
assert labels[tokens.index("<user>")] == "-100"

Output

tokens= ['<system>', 'Use', 'supplied', 'policy', 'evidence', '<eot>', '<user>', 'Stale', 'key', 'needs', 'access', '<eot>', '<assistant>', 'Open', 'rotation', 'ticket', '<eot>']
trained_targets= ['Open', 'rotation', 'ticket', '<eot>']
masked_tokens= 13

Use this kind of inspected tiny batch before launching training. A training-loss curve can't reveal that your boundary finder shifted one token too far and trained the wrong span.

This calculation makes the objective choice visible. Full-sequence loss scores nearly the whole serialized row; assistant-only loss scores only the desired answer span and its terminator.

compare-supervised-target-budgets.py

stream = [
    ("system", "<system>"),
    ("system", "Use"),
    ("system", "policy"),
    ("system", "<eot>"),
    ("user", "<user>"),
    ("user", "Stale"),
    ("user", "key"),
    ("user", "<eot>"),
    ("assistant-marker", "<assistant>"),
    ("assistant", "Open"),
    ("assistant", "ticket"),
    ("assistant", "<eot>"),
]

full_sequence_targets = [token for _, token in stream[1:]]
assistant_only_targets = [token for role, token in stream if role == "assistant"]

print("full_sequence_target_count=", len(full_sequence_targets))
print("assistant_only_target_count=", len(assistant_only_targets))
print("assistant_only_targets=", assistant_only_targets)

assert assistant_only_targets == ["Open", "ticket", "<eot>"]
assert len(assistant_only_targets) < len(full_sequence_targets)

Output

full_sequence_target_count= 11
assistant_only_target_count= 3
assistant_only_targets= ['Open', 'ticket', '<eot>']
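The two target budgets also change what the reported loss means. The sketch below averages negative log-likelihood over each target set for one toy row. The per-token probabilities are invented for illustration and don't come from any model.

```python
import math

tokens = ["<user>", "Stale", "key", "<eot>", "<assistant>", "Open", "ticket", "<eot>"]
# Hypothetical probability the model gave each correct next token.
# Position 0 has no preceding context, so it has no prediction.
prob_of_target = [None, 0.20, 0.10, 0.50, 0.90, 0.60, 0.80, 0.95]
is_assistant = [False, False, False, False, False, True, True, True]

def mean_nll(keep_mask):
    # Average negative log-likelihood over the positions selected as targets.
    losses = [
        -math.log(p)
        for p, keep in zip(prob_of_target, keep_mask)
        if keep and p is not None
    ]
    return sum(losses) / len(losses), len(losses)

full_loss, full_n = mean_nll([True] * len(tokens))
assistant_loss, assistant_n = mean_nll(is_assistant)

print(f"full_sequence: targets={full_n} mean_nll={full_loss:.3f}")
print(f"assistant_only: targets={assistant_n} mean_nll={assistant_loss:.3f}")
```

With these made-up numbers the same predictions produce two different averages, so loss curves from runs with different masks aren't directly comparable.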

Packing saves padding, but test isolation explicitly

First measure why packing is tempting. Four short training rows padded separately to a 16-token window waste most of their capacity; filling shared windows reduces that waste.

measure-packing-utilization.py

window = 16
row_lengths = [9, 6, 7, 5]

separate_capacity = window * len(row_lengths)
separate_utilization = sum(row_lengths) / separate_capacity

packed_windows = []
for length in row_lengths:
    for index, used in enumerate(packed_windows):
        if used + length <= window:
            packed_windows[index] += length
            break
    else:
        packed_windows.append(length)

packed_capacity = window * len(packed_windows)
packed_utilization = sum(row_lengths) / packed_capacity

print("separate_utilization=", f"{separate_utilization:.1%}")
print("packed_windows=", packed_windows)
print("packed_utilization=", f"{packed_utilization:.1%}")

assert packed_windows == [15, 12]
assert packed_utilization > separate_utilization

Output

separate_utilization= 42.2%
packed_windows= [15, 12]
packed_utilization= 84.4%

Suppose two three-token rows are packed into one sequence:

Packed position	0	1	2	3	4	5
Conversation ID	A	A	A	B	B	B
Allowed history for strict isolation	A only	A only	A only	B only	B only	B only

For strict isolation, the causal self-attention matrix contains two blocks:

M = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}

Each row is a query position and each column is a key position, both zero-indexed. The zeros in row 4 under columns 0 through 2 prevent the second conversation from reading the first one.

verify-packed-isolation.py

conversation_ids = ["A", "A", "A", "B", "B", "B"]

mask = [
    [
        int(key_position <= query_position and key_id == query_id)
        for key_position, key_id in enumerate(conversation_ids)
    ]
    for query_position, query_id in enumerate(conversation_ids)
]

for row in mask:
    print(" ".join(map(str, row)))

cross_conversation_edges = [
    (query, key)
    for query, row in enumerate(mask)
    for key, allowed in enumerate(row)
    if allowed and conversation_ids[query] != conversation_ids[key]
]

print("cross_conversation_edges=", cross_conversation_edges)
assert cross_conversation_edges == []
assert mask[4][1] == 0

Output

1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 1 1 1
cross_conversation_edges= []

Don't assume an option named packing=True automatically constructs this matrix. Inspect your trainer's documented semantics and run a small boundary test with the implementation you'll train.
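A small boundary test of that kind could look like the sketch below. It derives per-row document IDs and restarted position IDs from row lengths, then counts cross-row attention edges for a block-causal mask and for a plain causal mask. The helper names are hypothetical, and whether your trainer restarts position IDs at each boundary is an assumption to check against its documentation.

```python
def pack_metadata(lengths):
    # For each packed position: the row it belongs to and its offset inside that row.
    document_ids, position_ids = [], []
    for doc, length in enumerate(lengths):
        document_ids.extend([doc] * length)
        position_ids.extend(range(length))
    return document_ids, position_ids

def block_causal_mask(document_ids):
    # A query may attend to a key only if it is not later and is in the same row.
    n = len(document_ids)
    return [
        [int(k <= q and document_ids[k] == document_ids[q]) for k in range(n)]
        for q in range(n)
    ]

def leaks(mask, document_ids):
    # Allowed attention edges that cross a row boundary.
    return [
        (q, k)
        for q, row in enumerate(mask)
        for k, allowed in enumerate(row)
        if allowed and document_ids[q] != document_ids[k]
    ]

document_ids, position_ids = pack_metadata([3, 3])
isolated = block_causal_mask(document_ids)
plain_causal = [[int(k <= q) for k in range(6)] for q in range(6)]

print("position_ids=", position_ids)
print("isolated_leaks=", leaks(isolated, document_ids))
print("plain_causal_leaks=", len(leaks(plain_causal, document_ids)))
```

The plain causal mask is the useful negative case: a test that can't fail on it isn't checking isolation.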

Production failures are usually contract failures

Symptom	Likely contract failure	First diagnostic check
Model generates another user question instead of answering	Missing assistant-start cue for a template that needs one	Inspect final rendered tokens before `.generate()`.
Model emits unfamiliar role markers or rambles	Checkpoint served with another template family	Compare tokenizer/template revision with training manifest.
Responses changed after refactoring preprocessing	BOS, EOS, or delimiters were inserted twice	Count special-token IDs after rendering and tokenization.
Good short answers, incorrect long threads	Truncation removed the policy evidence or system instruction	Log retained messages and token count.
Low loss, poor response quality	Wrong target-span mask or noisy accepted rows	Render one batch with visible labels and inspect rejected data.

This minimal deployment manifest check can't measure model quality, but it prevents shipping an endpoint whose serialization contract is known to differ from the fine-tuning run.

audit-serving-contract.py

training_manifest = {
    "checkpoint": "access-policy-sft-v3",
    "tokenizer_revision": "tokens-7",
    "template_revision": "chatml-policy-v2",
    "add_special_tokens_after_template": False,
}

deployments = [
    {
        "name": "candidate-safe",
        "checkpoint": "access-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "chatml-policy-v2",
        "add_special_tokens_after_template": False,
    },
    {
        "name": "candidate-drifted",
        "checkpoint": "access-policy-sft-v3",
        "tokenizer_revision": "tokens-7",
        "template_revision": "mistral-wrapper-v1",
        "add_special_tokens_after_template": True,
    },
]

def mismatches(candidate):
    return [
        field
        for field, expected in training_manifest.items()
        if candidate[field] != expected
    ]

for deployment in deployments:
    drift = mismatches(deployment)
    status = "block" if drift else "evaluate"
    print(f"{deployment['name']}: {status} drift={drift}")

assert mismatches(deployments[0]) == []
assert mismatches(deployments[1]) == [
    "template_revision",
    "add_special_tokens_after_template",
]

Output

candidate-safe: evaluate drift=[]
candidate-drifted: block drift=['template_revision', 'add_special_tokens_after_template']
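The `add_special_tokens_after_template` field corresponds to the table row about BOS, EOS, or delimiters being inserted twice. A minimal sketch of the suggested diagnostic, counting special-token IDs after rendering and tokenization, is shown below. The token IDs are hypothetical placeholders; real IDs come from the checkpoint's tokenizer.

```python
from collections import Counter

# Hypothetical token IDs for illustration only.
BOS, EOS = 1, 2
SPECIAL = {BOS: "BOS", EOS: "EOS"}

def special_counts(input_ids):
    # Count how often each special token appears in the final ID sequence.
    counts = Counter(SPECIAL[token] for token in input_ids if token in SPECIAL)
    return {name: counts.get(name, 0) for name in SPECIAL.values()}

# The template already wrote BOS into the rendered text.
template_only = [BOS, 101, 102, 103]
# The tokenizer then prepended its own BOS on top of the template's.
template_plus_tokenizer = [BOS] + template_only

for name, ids in [
    ("template_only", template_only),
    ("template_plus_tokenizer", template_plus_tokenizer),
]:
    counts = special_counts(ids)
    status = "ok" if counts["BOS"] == 1 else "block"
    print(f"{name}: {status} counts={counts}")
```

Running the same count on one training batch and one serving request gives a cheap parity signal before any quality evaluation.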

The final executable check joins template work back to evaluation. Template parity is mandatory, but a correctly formatted candidate still ships only if grounded cases and operational limits pass.

gate-finetuned-candidates.py

candidates = [
    {
        "name": "sft-v3",
        "template_matches": True,
        "grounded_rate": 0.99,
        "critical_errors": 0,
        "p95_latency_ms": 620,
    },
    {
        "name": "sft-v4-fast",
        "template_matches": True,
        "grounded_rate": 0.94,
        "critical_errors": 1,
        "p95_latency_ms": 410,
    },
]

def blockers(candidate):
    failed = []
    if not candidate["template_matches"]:
        failed.append("template_drift")
    if candidate["grounded_rate"] < 0.98:
        failed.append("grounded_quality")
    if candidate["critical_errors"] > 0:
        failed.append("critical_policy_error")
    if candidate["p95_latency_ms"] > 700:
        failed.append("latency")
    return failed

for candidate in candidates:
    failed = blockers(candidate)
    decision = "release" if not failed else "block"
    print(f"{candidate['name']}: {decision} blockers={failed}")

assert blockers(candidates[0]) == []
assert blockers(candidates[1]) == ["grounded_quality", "critical_policy_error"]

Output

sft-v3: release blockers=[]
sft-v4-fast: block blockers=['grounded_quality', 'critical_policy_error']
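The symptom table in this section also names truncation that removes the policy evidence or system instruction. The sketch below shows one possible truncation policy and the logging the table recommends. It assumes a whitespace word count as a stand-in for a real tokenizer and a hypothetical budget of 20 tokens; the thread content is illustrative.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: whitespace-separated words only.
    return len(text.split())

def truncate(messages, budget):
    # Keep system messages; drop the oldest remaining turns until the budget fits.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and sum(count_tokens(m["content"]) for m in system + turns) > budget:
        turns.pop(0)
    return system + turns

thread = [
    {"role": "system", "content": "Policy evidence: keys older than 14 days require a rotation ticket."},
    {"role": "user", "content": "My key is 18 days old."},
    {"role": "assistant", "content": "Open a rotation ticket."},
    {"role": "user", "content": "Which ticket type should I use?"},
]

kept = truncate(thread, budget=20)
used = sum(count_tokens(m["content"]) for m in kept)

# Log what survived so a long-thread failure can be traced to missing context.
print("retained_roles=", [m["role"] for m in kept])
print("token_count=", used)
print("evidence_retained=", any("14 days" in m["content"] for m in kept))
```

Dropping single turns can leave an assistant message without its user request, so a production policy would more likely drop whole user/assistant pairs and rerun the role validation afterward.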

From chat foundations to applied engineering

Instruction tuning connects data to behavior:

A conversation row demonstrates the response you want.
A chat template turns that row into the exact token context used for training and serving.
A chosen label mask decides where supervised gradient is spent.
Data filtering, packing checks, and private evaluations keep the learned behavior honest.

Mastery check

Key concepts

Base-model continuation versus instruction-tuned response behavior
SFT examples grounded in approved evidence
Chat-template serialization and checkpoint-specific markers
Fresh generation versus assistant prefill
Full-sequence versus assistant-only loss
Synthetic-data acceptance checks
Packed-sequence isolation as a verified design choice
Training and serving template parity

Evaluation rubric

Foundational: Explains why a base model can complete dialogue text without reliably acting as a support assistant.
Intermediate: Renders completed, generation, and prefill versions of one conversation without mixing their boundary rules.
Intermediate: Builds an assistant-only target mask and explains why that's a choice rather than a universal SFT requirement.
Advanced: Designs a data and serving audit that rejects unsupported policy rows and template drift before evaluation.

Common pitfalls

Training on plausible synthetic answers without checking them against policy evidence.
Hand-formatting messages with delimiters that don't match the checkpoint's tokenizer template.
Treating assistant_only_loss or packed isolation as automatic behavior without inspecting the trainer.
Using a new-assistant generation cue when the request already contains an assistant prefill.
Declaring a fine-tuned checkpoint better before rerunning grounded private evaluations.



Follow-up questions

Why can a base model produce conversation-looking text without being a reliable assistant?

When do you use add_generation_prompt=True, and when do you use continue_final_message=True?

Why isn't assistant-only loss the definition of supervised fine-tuning?

What must you verify before using packed support conversations as independent examples?
