Build and test grounded prompts with clear roles, few-shot examples, structured outputs, evidence checks, and failure-focused evaluation.
You already know that a modern large language model (LLM) predicts the next token from its input context. That doesn't give it a live copy of your merchant's return policy. If support agent Luna needs an answer about damaged order A10234, your application must supply the applicable policy or use tool calling to retrieve it from a trusted source.
Prompt engineering is the practice of packaging instructions, context, examples, and output constraints so a model response can be checked. It isn't a hunt for secret phrases. The useful work is ordinary engineering: state the task, supply trusted facts, isolate untrusted input, define the response shape, and evaluate failures.
Key idea: Prompts don't make an answer trustworthy on their own. A grounded task needs trusted context, a response contract, and a test that catches unsupported outputs.
A common beginner mistake is treating a plain LLM call like a database lookup. Without attached retrieval or tools, a generation request doesn't fetch a live policy during inference. Transformer layers process the supplied tokens and the decoder generates one continuation token after another.
Think of it like a very well-read autocomplete. The prompt is the text on the screen so far, and the model's job is to continue it in a way that looks coherent. At the model-input level, prompting is probability shaping: you arrange the input so useful continuations become more likely.
This has two practical consequences:
One way to think about prompting is writing a work order for a fulfillment system. The model has broad capability, but it needs a task, constraints, and relevant context. Stable instructions say it must answer only from supplied return-policy lines. The current request carries order facts and the customer's message. If the work order is vague, the model may fill gaps. If it has evidence and explicit checks, a bad output is easier to catch.
Chat-style APIs structure a conversation into roles. Exact role names vary by provider and API, but the idea is the same: separate application instructions, user requests, and prior assistant outputs.
A provider's chat-style request might represent a grounded support turn like this:
1[
2 {"role": "system", "content": "Answer return questions only from supplied policy lines. Return JSON."},
3 {"role": "user", "content": "Policy P-7: damaged items may be returned within 30 days. Order A10234 arrived damaged on day 12."},
4 {"role": "assistant", "content": "{\"decision\":\"eligible\",\"source_line_ids\":[\"P-7\"]}"},
5 {"role": "user", "content": "Add a one-sentence agent summary for the same decision."}
6]developer, system, or expose it separately. It holds stable behavior and format constraints.Roles aren't security boundaries. They're structure.
Without roles, the model receives a blob of text and has to infer which parts are instructions, which parts are user data, and which parts are prior answers.
Roles make that separation explicit:
| Role | Plain-English job | Beginner mistake |
|---|---|---|
Application instructions (developer, system, or provider-specific layer) | Stable instructions for behavior and format. | Putting user-specific data here and making it hard to update. |
| User | Current request or task. | Mixing instructions with untrusted pasted content. |
| Assistant | Prior model output in the conversation. | Forgetting to include prior turns when calling a stateless API. |
If an API is stateless, it won't remember old turns unless your application sends them again. Browser chat apps usually manage that history for you. API clients often make you manage it yourself.
Build the message list in code so you can test where dynamic data goes. The assertion prevents order details from leaking into stable application instructions.
1import json
2
3system_rules = "Answer return questions only from supplied policy lines. Return JSON."
4policy_line = "P-7: Damaged items may be returned within 30 days."
5order_context = "Order A10234 arrived damaged on day 12."
6messages = [
7 {"role": "system", "content": system_rules},
8 {"role": "user", "content": f"{policy_line}\n{order_context}"},
9]
10
11assert "A10234" not in messages[0]["content"]
12assert "A10234" in messages[1]["content"]
13print(json.dumps(messages, indent=2))1[
2 {
3 "role": "system",
4 "content": "Answer return questions only from supplied policy lines. Return JSON."
5 },
6 {
7 "role": "user",
8 "content": "P-7: Damaged items may be returned within 30 days.\nOrder A10234 arrived damaged on day 12."
9 }
10]Here is how a prompt evolves from a vague request to a testable one. Imagine you're building a support tool that extracts return eligibility from merchant policy text.
1Summarize this.
2Return policy: damaged items may be returned within 30 days.1Summarize the damaged-item return rule for a customer support agent.
2Focus on eligibility and time window.1Role: You are a support triage assistant.
2
3Task: Decide whether the order is eligible for a return.
4
5Rules:
6- Use only facts from the policy and order context.
7- If the policy doesn't answer something, say "Not specified."
8
9Policy:
10<policy>
11P-7: Damaged items may be returned within 30 days.
12</policy>
13
14Order context:
15<order>
16Order A10234 arrived damaged on day 12.
17</order>
18
19Format: Return valid JSON with keys "decision", "refund_window_days", and "source_line_ids".The engineered version does four useful things:
The delimiter doesn't make prompt injection impossible. It gives the model a clearer layout and gives your application a place to apply extra checks.
Now render the same contract from variables rather than pasting facts into an untestable string:
1from textwrap import dedent
2
3policy = "P-7: Damaged items may be returned within 30 days."
4order = "Order A10234 arrived damaged on day 12."
5question = "Is this order eligible for a return?"
6
7prompt = dedent(f"""\
8 Task: Decide return eligibility from supplied facts.
9 Rules: Use only <policy> and <order>. If unsupported, answer "not_specified".
10 <policy>
11 {policy}
12 </policy>
13 <order>
14 {order}
15 </order>
16 Question: {question}
17 Output JSON fields: decision, refund_window_days, source_line_ids
18""")
19
20assert "<policy>" in prompt and "</policy>" in prompt
21assert "P-7" in prompt and "A10234" in prompt
22print(prompt)1Task: Decide return eligibility from supplied facts.
2Rules: Use only <policy> and <order>. If unsupported, answer "not_specified".
3<policy>
4P-7: Damaged items may be returned within 30 days.
5</policy>
6<order>
7Order A10234 arrived damaged on day 12.
8</order>
9Question: Is this order eligible for a return?
10Output JSON fields: decision, refund_window_days, source_line_ids
Reliable prompts are built in layers. You put stable rules first, add trusted context and examples, isolate untrusted user data, then check the output shape.
At each layer, you choose a prompting pattern based on what the model needs to know:
| Pattern | Add to context | Best beginner use | Failure to watch |
|---|---|---|---|
| Zero-shot | Task instruction only | Common tasks with obvious format | Model guesses hidden requirements. |
| Few-shot | Two or more examples | Exact output shape or style | Bad examples teach the wrong pattern. |
| Structured output | Schema and delimiters | JSON, extraction, tool inputs | User data leaks into instructions. |
| Checkable explanation | Short evidence or calculation fields | Multi-step decisions a reviewer must inspect | Long free-form reasoning adds cost without a useful check. |
A zero-shot prompt asks the model to do something without examples. It relies on the model's training, instruction tuning, and the instruction text in the prompt.
User: Classify this ticket as
refund,shipping, orcatalog: "Where is my parcel?"
A few-shot prompt provides examples in the prompt to teach the model the exact format or logic you want. This uses the model's in-context learning ability, made famous by GPT-3.[1]
User: Ticket: "The mug arrived cracked." Label: refund Ticket: "The carrier scan hasn't moved." Label: shipping Ticket: "Is the mug available in blue?" Label: catalog Ticket: "The box arrived, but the lamp is broken." Label:
Few-shot prompting is useful when you need a specific JSON shape, classification boundary, or unusual style.
You will also hear one-shot, which sits between zero-shot and few-shot.
| Pattern | Examples in prompt | Best use |
|---|---|---|
| Zero-shot | 0 | Simple tasks where the instruction is enough. |
| One-shot | 1 | Show one output format example. |
| Few-shot | 2 or more | Teach a pattern, rubric, or edge-case boundary. |
1Classify ticket as refund, shipping, or catalog.
2
3Ticket: "The box arrived, but the lamp is broken."
4Label:1Classify ticket as refund, shipping, or catalog.
2
3Ticket: "My tracking scan has not changed since Tuesday."
4Label: shipping
5
6Ticket: "The box arrived, but the lamp is broken."
7Label:1Classify ticket as refund, shipping, or catalog.
2
3Ticket: "My tracking scan has not changed since Tuesday."
4Label: shipping
5
6Ticket: "The lamp arrived with a cracked base."
7Label: refund
8
9Ticket: "Do you stock this lamp in brass?"
10Label: catalog
11
12Ticket: "The box arrived, but the lamp is broken."
13Label:The few-shot version teaches a boundary: a delivered package can still be a refund issue when the item is damaged, rather than a shipping issue.
On classification and multiple-choice tasks tested by Min et al., randomly replacing demonstration labels barely reduced performance across the evaluated models; label space, input distribution, and format were important signals.[2] Don't misread that result as permission to use wrong policy examples. A product prompt still needs correct examples and held-out tests because a wrong boundary can create a wrong business action.
| Bad example pattern | Failure |
|---|---|
| Examples use invalid JSON. | Model may copy invalid JSON. |
| Labels contradict each other. | Model learns unclear boundaries. |
| Examples are too easy. | Edge cases still fail. |
| Examples include hidden policy decisions. | Model may copy decisions without explanation. |
If a prompt feels unreliable, inspect the examples before blaming the model. Many prompt bugs are data bugs in miniature.
Render each shot strategy from the same label rule, then keep the target ticket out of its demonstrations:
1examples = [
2 ("My tracking scan hasn't changed since Tuesday.", "shipping"),
3 ("The lamp arrived with a cracked base.", "refund"),
4 ("Do you stock this lamp in brass?", "catalog"),
5]
6target = "The box arrived, but the lamp is broken."
7
8def make_prompt(shots: int) -> str:
9 lines = ["Classify ticket as refund, shipping, or catalog."]
10 for text, label in examples[:shots]:
11 lines.extend([f'Ticket: "{text}"', f"Label: {label}", ""])
12 lines.extend([f'Ticket: "{target}"', "Label:"])
13 return "\n".join(lines)
14
15for name, shots in [("zero-shot", 0), ("one-shot", 1), ("few-shot", 3)]:
16 prompt = make_prompt(shots)
17 print(f"{name}: examples={shots}, chars={len(prompt)}")
18 print(prompt.splitlines()[-2:])1zero-shot: examples=0, chars=106
2['Ticket: "The box arrived, but the lamp is broken."', 'Label:']
3one-shot: examples=1, chars=180
4['Ticket: "The box arrived, but the lamp is broken."', 'Label:']
5few-shot: examples=3, chars=302
6['Ticket: "The box arrived, but the lamp is broken."', 'Label:']Examples used in the prompt aren't evaluation evidence. Reserve separate gold cases so you can see whether a revised prompt generalizes beyond the demonstrations:
1demonstrations = {
2 "ticket_train_01": "shipping",
3 "ticket_train_02": "refund",
4 "ticket_train_03": "catalog",
5}
6held_out_expected = {
7 "ticket_eval_damaged_delivery": "refund",
8 "ticket_eval_missing_parcel": "shipping",
9}
10
11overlap = set(demonstrations) & set(held_out_expected)
12assert not overlap, f"leaked eval cases into prompt: {sorted(overlap)}"
13print("demonstrations:", len(demonstrations))
14print("held_out_cases:", len(held_out_expected))
15print("leakage_check: PASS")1demonstrations: 3
2held_out_cases: 2
3leakage_check: PASSOne important prompt engineering result is chain-of-thought prompting: Wei et al. supplied worked intermediate steps as few-shot exemplars and reported improvements on arithmetic, commonsense, and symbolic reasoning benchmarks for sufficiently large models.[3] We'll return to advanced reasoning patterns later.
If you ask an LLM a multi-step inventory question and demand only the final number, it can fail silently. A short, checkable calculation gives a reviewer intermediate values to verify.
Q: A warehouse has 5 pallets, ships 2 to a store, receives 3 more, and splits every pallet into half-pallet units. How many half-pallet units are there? A:
For models and tasks where it helps, ask for useful intermediate work: a short calculation, cited policy lines, or a few-shot example with checkable steps. The goal isn't to expose unrestricted deliberation. It's to request evidence a reviewer or test can verify.
Q: A warehouse has 5 pallets, ships 2 to a store, receives 3 more, and splits every pallet into half-pallet units. How many half-pallet units are there? A: Short solution: 5 pallets - 2 shipped = 3. Then 3 + 3 received = 6 pallets. Each pallet becomes 2 half-pallet units, so 6 x 2 = 12. The answer is 12.
By generating intermediate steps, the model adds those steps to its context window. It can use its visible output as a scratchpad, making the final answer easier to predict.
For applicable OpenAI reasoning models, current API documentation recommends giving the task, constraints, and desired output format, then treating reasoning.effort as a tuning knob rather than the primary way to recover quality.[4] Don't assume visible chain-of-thought prompts help every provider or model. Compare prompt variants on held-out cases and measure answer quality, tokens, and latency.
Use visible reasoning when:
Avoid long visible reasoning when:
For production apps, ask for structured evidence instead of unrestricted reasoning.
Think step by step and show all your reasoning.
Return:
- answer
- 2-sentence explanation
- source line IDs used
- missing facts, if any
That gives users something useful to inspect without turning every response into a long scratchpad.
For a return decision, a compact evidence contract is easier to check than an open-ended monologue:
1policy_lines = {"P-7": "Damaged items may be returned within 30 days."}
2order = {"order_id": "A10234", "condition": "damaged", "days_since_delivery": 12}
3
4eligible = order["condition"] == "damaged" and order["days_since_delivery"] <= 30
5answer = {
6 "decision": "eligible" if eligible else "not_eligible",
7 "calculation": f"{order['days_since_delivery']} <= 30",
8 "source_line_ids": ["P-7"],
9}
10
11assert all(line_id in policy_lines for line_id in answer["source_line_ids"])
12print(answer)1{'decision': 'eligible', 'calculation': '12 <= 30', 'source_line_ids': ['P-7']}Prompts become much easier to test when the output has a shape.
For example, a support triage tool might need this object:
1{
2 "decision": "eligible",
3 "refund_window_days": 30,
4 "needs_human": false,
5 "source_line_ids": ["P-7"],
6 "summary": "Damaged order A10234 arrived within the return window."
7}You can ask for JSON in plain text, but production systems should use a provider's schema-constrained output facility when it supports the needed schema. In OpenAI's API, Structured Outputs with a supported strict JSON schema is distinct from JSON mode: Structured Outputs enforces supported schema adherence, while JSON mode targets valid JSON and still has documented edge cases that callers must handle.[5] Regardless of provider, your application must still handle refusals or incomplete output and validate business facts and permissions.
If you are using a plain prompt, still define the shape clearly:
1Return valid JSON with these fields:
2- decision: one of ["eligible", "not_eligible", "not_specified"]
3- refund_window_days: integer or null
4- needs_human: boolean
5- source_line_ids: list of policy line IDs
6- summary: one sentence
7
8Return only the JSON object.Then validate the result in code. A prompt isn't a parser.
| Output issue | What to do |
|---|---|
| Missing field | Reject and retry with validation error. |
| Extra prose | Strip only if safe, otherwise retry. |
| Invalid enum | Reject and ask the model to choose an allowed value. |
| Unsupported claim | Re-run with source text and require citations. |
This is the bridge from prompt engineering to system design: prompts are only one layer. Schemas, validators, tests, and fallback paths make the behavior reliable.
You can start with a validator before you ever call a model. One candidate is correct; the other contains a plausible but unsupported window:
1import json
2
3known_policy_lines = {"P-7": 30}
4candidates = [
5 '{"decision":"eligible","refund_window_days":30,"needs_human":false,"source_line_ids":["P-7"]}',
6 '{"decision":"eligible","refund_window_days":90,"needs_human":false,"source_line_ids":["P-7"]}',
7]
8
9def validate(candidate: str) -> list[str]:
10 data = json.loads(candidate)
11 errors = []
12 allowed_decisions = {"eligible", "not_eligible", "not_specified"}
13 if data.get("decision") not in allowed_decisions:
14 errors.append("invalid decision")
15 source_ids = data.get("source_line_ids", [])
16 if source_ids != ["P-7"] or data.get("refund_window_days") != known_policy_lines["P-7"]:
17 errors.append("window isn't supported by P-7")
18 return errors
19
20for index, candidate in enumerate(candidates, start=1):
21 errors = validate(candidate)
22 print(f"candidate_{index}:", "PASS" if not errors else f"FAIL {errors}")1candidate_1: PASS
2candidate_2: FAIL ["window isn't supported by P-7"]Even good prompts fail when common mistakes slip in. Here are four anti-patterns to watch for.
Without visible boundaries, instruction drift is easier. When instructions, user data, and examples are all mashed together in one paragraph, the model has to guess where one ends and the other begins. Instead, use headers, XML tags, or triple backticks to create visible boundaries.
Asking a model for facts it doesn't have, without providing a knowledge base, invites fabrication. If your prompt asks for a merchant's return rule but the rule isn't supplied, the model may emit a plausible window. Retrieve trusted evidence (or use a tool), then require source line IDs.
Positive constraints are usually more actionable than negative-only instructions. Instead of "Don't make up details," say "Only use facts from the provided policy. If something is missing, say 'Not specified.'"
Including pages of unrelated policy when only one rule is relevant adds input work and distractors. Before sending a long document, ask: what is the smallest evidence slice needed for this question? Retrieve relevant chunks and place decisive evidence close to the final request; long-context studies have found lower performance when relevant information is placed in the middle of long inputs.[6]
This toy keyword retriever isn't a production search engine, but it proves the prompt-building idea: carry only lines that can support the answer and keep their IDs.
1policy_lines = {
2 "P-1": "Unused items may be returned within 45 days.",
3 "P-7": "Damaged items may be returned within 30 days.",
4 "P-9": "Final-sale items aren't eligible for voluntary returns.",
5 "P-12": "International exchanges need a support review.",
6}
7ticket = "Order A10234 arrived damaged. Can I return it?"
8query_terms = {"damaged", "return"}
9
10selected = {
11 line_id: text
12 for line_id, text in policy_lines.items()
13 if query_terms & set(text.lower().replace(".", "").split())
14}
15assert "P-7" in selected
16print("all_policy_lines:", len(policy_lines))
17print("selected_lines:", selected)1all_policy_lines: 4
2selected_lines: {'P-7': 'Damaged items may be returned within 30 days.'}Putting customer text in <ticket> tags helps identify it as data. It can't stop the model from proposing an unauthorized refund or tool action. Authorization belongs in code outside the prompt.
1allowed_actions = {"draft_reply", "escalate_to_agent"}
2model_proposals = [
3 {"action": "draft_reply", "order_id": "A10234"},
4 {"action": "issue_refund", "order_id": "A10234"},
5]
6
7for proposal in model_proposals:
8 action = proposal["action"]
9 authorized = action in allowed_actions
10 print(f"{action:17s} -> {'ALLOW' if authorized else 'BLOCK'}")1draft_reply -> ALLOW
2issue_refund -> BLOCKWhen a prompt fails, don't rewrite the whole thing first. Debug it like a program.
| Symptom | Likely cause | Fix |
|---|---|---|
| Output format changes every run | Format examples are missing or inconsistent. | Add schema or few-shot examples. |
| Model ignores key instruction | Instruction is buried in long context. | Repeat critical rule near final request. |
| Model follows malicious pasted text | Untrusted data isn't isolated. | Delimit user data and add tool permissions outside the prompt. |
| Answer is plausible but false | Source text is missing or stale. | Retrieve trusted policy lines and require line IDs. |
| Too verbose | Prompt asks for reasoning but app needs answer. | Ask for concise answer plus short explanation. |
Here is a prompt you can test by hand:
1Task:
2Decide return eligibility and extract the refund window from the supplied policy and order context.
3
4Policy:
5<policy>
6P-7: Damaged items may be returned within 30 days.
7P-9: Final-sale items aren't eligible for voluntary returns.
8</policy>
9
10Order context:
11<order>
12Order A10234 arrived damaged on day 12.
13</order>
14
15Return JSON:
16{"decision": string, "refund_window_days": number, "source_line_ids": [string]}1{
2 "decision": "eligible",
3 "refund_window_days": 30,
4 "source_line_ids": ["P-7"]
5}You should not trust that shape by inspection alone. Parse it and check the fields you care about:
1import json
2
3candidate = """
4{"decision": "eligible", "refund_window_days": 30, "source_line_ids": ["P-7"]}
5"""
6
7data = json.loads(candidate)
8errors = []
9
10required = {"decision", "refund_window_days", "source_line_ids"}
11missing = sorted(required - data.keys())
12if missing:
13 errors.append(f"missing keys: {missing}")
14
15if data.get("decision") != "eligible":
16 errors.append("decision should be eligible")
17
18if data.get("refund_window_days") != 30:
19 errors.append("refund_window_days should be 30")
20
21if data.get("source_line_ids") != ["P-7"]:
22 errors.append("source_line_ids should identify P-7")
23
24print("PASS" if not errors else "FAIL")
25for error in errors:
26 print(error)1PASSIf the model returns 90 days, it contradicts supplied evidence. If it omits source_line_ids, it failed the traceability contract.
That's how prompt engineering becomes engineering: define expected behavior, run examples, and check failures.
Once a failing case exists, turn it into a regression test for candidate prompt versions. Here, outputs are fixtures standing in for results collected from two prompt versions; the test logic is real.
1expected = {
2 "damaged_day_12": ("eligible", 30, ["P-7"]),
3 "missing_policy": ("not_specified", None, []),
4}
5outputs_by_version = {
6 "vague_v1": {
7 "damaged_day_12": ("eligible", 90, []),
8 "missing_policy": ("eligible", 30, []),
9 },
10 "grounded_v2": {
11 "damaged_day_12": ("eligible", 30, ["P-7"]),
12 "missing_policy": ("not_specified", None, []),
13 },
14}
15
16for version, outputs in outputs_by_version.items():
17 passed = sum(outputs[case] == answer for case, answer in expected.items())
18 print(f"{version}: {passed}/{len(expected)} held-out checks passed")1vague_v1: 0/2 held-out checks passed
2grounded_v2: 2/2 held-out checks passedBefore writing a prompt, answer three questions:
If you can't answer those, the prompt isn't ready for production.
Here is a quick decision table for common tasks:
| Task | Best starting pattern | Why |
|---|---|---|
| Summarize one carrier scan | Zero-shot | The task is familiar and format is simple. |
| Extract return eligibility into JSON | Structured output | Parser reliability matters. |
| Classify tricky support tickets | Few-shot | Examples teach category boundaries. |
| Reconcile inventory movements | Checkable calculation or reasoning model | Intermediate quantities can be verified. |
| Answer from a policy document | Retrieved context plus source quote | Local facts matter more than model memory. |
By now you should be able to:
| Symptom | Cause | Fix |
|---|---|---|
| Model invents a refund rule or policy detail | Needed fact was never placed in prompt, retrieved context, or tool result | Add trusted policy lines, require source line IDs, and fail when the source is missing |
| Model ignores a key instruction | Critical rule is buried inside long context or mixed with noisy data | Move the rule near the final request, isolate data with delimiters, and rerun the exact failing case |
| User text hijacks the task | Delimiters were treated like security instead of layout | Keep user data isolated, but also enforce tool permissions, schema validation, and logging outside the prompt |
| JSON breaks parser in production | Prompt asked for JSON, but app trusted raw text without validation | Parse response, reject missing fields or invalid enums, and retry with exact validation error |
Answer every question, then check your score. Score above 75% to mark this lesson complete.
8 questions remaining.
Language Models are Few-Shot Learners.
Brown, T., et al. · 2020 · NeurIPS 2020
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., & Zettlemoyer, L. · 2022 · EMNLP 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Wei, J., et al. · 2022 · NeurIPS
Reasoning models
OpenAI · 2026
Structured outputs
OpenAI · 2024
Lost in the Middle: How Language Models Use Long Contexts
Liu, N.F., et al. · 2023 · TACL 2023