Build production-grade synthetic data pipelines from first principles. Master Self-Instruct and Evol-Instruct, LLM-as-judge filtering, red-teaming data synthesis, preference pair creation for DPO/RLHF, NumPy quality signals, diversity checks, and benchmark decontamination. Concrete end-to-end pipelines for support-ticket classification/responder agents and code-generation assistants.
A production LLM that answers support tickets or writes code for an e-commerce platform must do more than autocomplete. It must follow exact refund policies, handle angry customers gracefully, produce correct and secure order-management functions, and never leak customer data. Raw base models and even basic instruction-tuned models rarely exhibit these behaviors reliably. The missing ingredient is targeted, high-quality post-training data at scale. Synthetic data generation pipelines solve this bottleneck by turning a small human seed into tens or hundreds of thousands of diverse, filtered, and preference-annotated examples while preserving strict quality and safety invariants.
The challenge is that naive generation produces low-quality, repetitive, or even harmful data. A single weak filter can let policy-violating ticket responses or insecure code into the training set. The solution is a multi-stage pipeline with explicit success and failure paths at every gate: an LLM judge, NumPy diversity and decontamination checks, red-teaming synthesis, and preference-pair construction. This lesson builds those pipelines from first principles using the support-ticket and code-generation domains that appear throughout real AI engineering work.
Instruction tuning (the previous lesson) teaches a model the basic chat contract. Alignment stages such as RLHF and DPO further shape which of the many valid responses the model prefers. Both stages require data that is:
Human annotation at the required volume is slow and expensive. Synthetic pipelines let a small team of engineers produce the volume while keeping a human-in-the-loop only for seed design and judge rubric tuning.
💡 Key insight: Think of the support-ticket responder you are training. A human writer can produce 200 excellent examples in a week. An Evol-Instruct pipeline seeded with those 200 examples, run through a strong judge and NumPy filters, can produce 40,000 high-quality examples overnight, each one already formatted for your chat template and ready for DPO preference pairs. The quality gates are what make the difference between useful data and noise that damages the model.
Self-Instruct starts with a tiny curated seed (often 100-200 examples) and repeatedly asks an LLM to invent new instructions in the same style, then answer them. Only the valid new pairs are added back to the pool.
The algorithm in the support-ticket and code domains looks like this:
A mermaid diagram of the loop:
The initial seed must be extremely high quality. One poorly written support-ticket example about refunds will be echoed and amplified across thousands of synthetic variants.
Self-Instruct alone tends to produce instructions of roughly the same difficulty as the seed. The model quickly plateaus. Evol-Instruct solves this by asking the generator to deliberately rewrite each instruction to be harder along one or more explicit dimensions.
The evolution tactics used in practice include:
Here is a concrete evolution prompt used for the support-ticket domain:
text1You are an expert at creating difficult but realistic instructions for training customer-support and code-generation models. 2 3Original instruction: {seed} 4 5Apply the following evolution: {chosen_tactic} 6Return only the new, harder instruction. Make sure it still concerns order status, refunds, or backend code for an e-commerce platform.
An example evolution on a code seed:
Original: "Write a function get_order_status(order_id) that returns the status."
Evolved: "Write a function get_order_status(order_id: str) -> dict that queries the orders table, joins with shipments and refunds, raises a custom OrderNotFoundError for missing IDs, logs the lookup at INFO level, and returns a JSON-serializable dict containing status, last_updated, and any open refund request. Include type hints and a docstring with three realistic edge cases."
The evolved instructions force the target model to learn deeper capabilities instead of shallow patterns.
Heuristics alone (length, keyword presence, simple regex) are too weak for tasks with policy edge cases, code behavior, and safety constraints. A strong LLM judge, prompted with a detailed rubric, becomes the primary filter.
A judge prompt for mixed support-ticket and code tasks typically contains:
Only examples where the minimum score across key dimensions exceeds a threshold (commonly 8) survive. The same judge call can also be used to produce preference pairs by scoring two candidates for the same instruction and designating the higher as chosen and the lower as rejected.
Before reaching any high-level library, we implement the core signals ourselves.
Diversity is measured by embedding cosine similarity. In a minimal pipeline we pre-compute embeddings (or use a cheap proxy such as hashed n-gram vectors) and reject any new example whose maximum cosine with the existing pool exceeds 0.82.
python1import numpy as np 2from typing import List, Dict 3 4def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float: 5 return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)) 6 7def filter_diversity(new_items: List[Dict], existing_embeddings: np.ndarray, threshold: float = 0.82) -> List[Dict]: 8 kept = [] 9 new_embs = np.stack([item["embedding"] for item in new_items]) # shape (N, D) 10 if existing_embeddings.shape[0] == 0: 11 return new_items 12 sims = new_embs @ existing_embeddings.T / (np.linalg.norm(new_embs, axis=1, keepdims=True) @ np.linalg.norm(existing_embeddings, axis=1, keepdims=True).T + 1e-8) 13 max_sim = sims.max(axis=1) 14 for i, item in enumerate(new_items): 15 if max_sim[i] < threshold: 16 kept.append(item) 17 return kept
Decontamination is an exact n-gram membership test against a held-out set of benchmark n-grams (5-gram through 13-gram). Any synthetic instruction or response that contains a forbidden n-gram is dropped.
python1def extract_ngrams(text: str, n: int) -> set: 2 tokens = text.lower().split() 3 return {" ".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)} 4 5def is_contaminated(text: str, forbidden_ngrams: set, n_range: tuple = (5, 13)) -> bool: 6 for n in range(n_range[0], n_range[1] + 1): 7 if extract_ngrams(text, n) & forbidden_ngrams: 8 return True 9 return False
Additional lightweight gates include:
These signals are cheap, deterministic, and composable.
Safety and policy adherence do not emerge from ordinary user instructions. Attackers use specific adversarial phrasings. We therefore synthesize a dedicated red-teaming slice.
A generator is prompted with meta-instructions such as:
"Write a support ticket in which the customer claims executive override and demands an immediate refund outside the 14-day window. The ticket must sound legitimate and polite."
or
"Write a Python function that appears to compute order totals but actually contains a subtle backdoor that logs the full customer address to a public endpoint when order_id starts with 'VIP'."
The correct target response for the first case is a calm policy-based refusal plus escalation offer. For the second case the correct response refuses to generate the insecure code and instead provides a safe, auditable implementation.
These pairs become part of both the SFT data (to teach the desired refusal behavior) and the preference data (good refusal vs. any compliant but unsafe reply).
DPO trains directly on triples (prompt, chosen, rejected). Synthetic data makes creating these triples straightforward.
Two reliable techniques:
Example for a support ticket:
Instruction: "Customer ticket #A102: 'My order from last Tuesday still has not arrived. I want a full refund now or I will charge back.' The customer is within the 30-day window but the policy for delayed shipments is store credit unless tracking shows carrier fault."
Chosen (good): empathetic acknowledgement, lookup of tracking (simulated), explanation that carrier scan is missing so we open an investigation and offer expedited replacement or 20% store credit, clear next steps.
Rejected (bad): "I am so sorry! Here is your full refund of $127.45. I have processed it immediately even though policy normally requires a carrier report."
The rejected response is not cartoonishly wrong; it is the kind of slightly-too-helpful mistake a model makes when it has not seen enough policy-violating negative examples.
The same pattern applies to code: a correct, typed, tested function versus one that returns incorrect status strings on edge cases or leaks internal database errors.
A minimal but complete pipeline in pure Python (with LLM calls mocked for illustration) looks like this:
python1# seed, generate, filter, repeat (simplified) 2pool = load_human_seed() # 180 support + code examples 3for iteration in range(5): 4 new_instructions = generate_instructions(pool, num=2000) 5 candidates = generate_responses(new_instructions) 6 candidates = [c for c in candidates if judge_score(c) >= 8.0] 7 candidates = filter_diversity(candidates, pool_embeddings, threshold=0.82) 8 candidates = [c for c in candidates if not is_contaminated(c["text"], FORBIDDEN_NGRAMS)] 9 candidates = redteam_augment(candidates, ratio=0.15) 10 pref_pairs = synthesize_preference_pairs(candidates) 11 pool.extend(pref_pairs) 12 save_shard(pool, f"synth_v{iteration}.jsonl")
Only after the manual version is correct and debugged do we refactor to the Hugging Face datasets abstraction that gives us memory-mapped streaming, parallel map, and deterministic caching.
python1from datasets import Dataset, load_dataset 2import numpy as np 3 4def quality_filter(example): 5 scores = call_judge(example) # returns dict 6 emb = get_embedding(example["instruction"]) 7 if scores["min_score"] < 8.0: 8 return False 9 if max_cosine(emb, reference_embeddings) > 0.82: 10 return False 11 if is_contaminated(example["full_text"], forbidden): 12 return False 13 return True 14 15raw = Dataset.from_json("raw_candidates.jsonl") 16filtered = raw.filter(quality_filter, num_proc=8) 17filtered = filtered.map(add_preference_pair, batched=False) 18filtered.save_to_disk("synthetic_support_code_v1") 19# later: load_dataset("synthetic_support_code_v1")
The HF version scales to millions of rows on a single machine or in a distributed job while the logic remains identical to the NumPy version you wrote first.
The most powerful effect appears across releases. After you ship a model trained on the first round of synthetic data, that model becomes a stronger generator and a more reliable judge for the next round. The filter prompts can also be improved because you now have real failure cases from production. The result is compounding quality: each iteration of the flywheel produces noticeably better training data than the last.
This is why frontier labs treat synthetic data pipelines as core infrastructure rather than a one-time script.
Every synthetic shard must carry a manifest listing the generator model, judge model, random seeds, evolution tactics used, and the exact n-gram set used for decontamination. Without this, you cannot reproduce the training run or prove that a later evaluation score is not inflated by accidental leakage.
The illustration below shows the data flywheel on the left and the explicit success/failure paths through the quality filtering pipeline on the right, using realistic support-ticket and code-generation tasks.
The same chapter_flow component renders in light theme for documentation and slide decks.
A production synthetic data pipeline for support-ticket and code domains must include:
With these pieces in place you can confidently generate the tens of thousands of high-quality examples required for strong instruction-tuned and aligned models that actually follow your company's support policies and write safe, maintainable code. The pipeline builds on Self-Instruct, Evol-Instruct / WizardLM, LLM-as-a-Judge, DPO, and related alignment and red-teaming work.[1][2][3][4]
You now know how to create targeted post-training examples and prove they are diverse, clean, and traceable. The next chapter moves from data generation to the training run itself, showing how FSDP and ZeRO keep model weights, gradients, optimizer state, and batches distributed across GPUs without wasting memory.
Self-Instruct: Aligning Language Models with Self-Generated Instructions.
Wang, Y., et al. · 2023 · ACL 2023
WizardLM: Empowering Large Language Models to Follow Complex Instructions.
Xu, C., et al. · 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Zheng, L., et al. · 2023 · NeurIPS 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Rafailov, R., et al. · 2023
Constitutional AI: Harmlessness from AI Feedback.
Bai, Y., et al. · 2022 · arXiv preprint
Red Teaming Language Models with Language Models.
Perez, E., et al. · 2022 · EMNLP 2022
Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
Ouyang, L., et al. · 2022 · NeurIPS 2022
Stanford Alpaca: An Instruction-following LLaMA Model.
Taori, R., et al. · 2023 · GitHub