Python for AI Engineering

Learn Python as the first AI engineering loop: read JSONL rows, validate fields, compute exact-match accuracy, and harden the scorer with pytest, prompt snapshots, leakage checks, seeded runs, and a CI gate.


Most AI engineering starts before the model call.

You have files on disk. You have examples with labels. You have model predictions that may be wrong, malformed, or missing. Before you can trust a training run, an evaluation report, or a demo, you need Python code that can read those examples, check the shape of each row, compute one clear metric, and fail when the input is broken.

Python's job here is to turn loose files into named data, named data into measurements, and measurements into something another person can reproduce. This chapter builds that first loop from the ground up: read a JSONL evaluation file, parse each row into a typed object, score exact-match accuracy, and protect the behavior with tests.[1]

Figure: Python evaluation script diagram showing JSONL rows becoming an Example dataclass, exact-match row results, a 0.667 metric, and tests that stop malformed input before scoring.
Follow the order 102 row. It starts as one JSONL line, becomes an Example, scores False, and contributes to the final 0.667 metric while malformed rows stop before scoring.

Git and Docker protected the file and runtime. Now we open the scorer and build the code those earlier chapters were carrying. This chapter also hardens that behavior into a reproducible contract with tests, snapshots, leakage checks, and CI.

The first useful loop

For this Python chapter, don't think about model APIs yet. Think about the smallest reliable AI workflow:

| Boundary | Question | Output |
| --- | --- | --- |
| file | Can I read the examples? | non-empty text lines |
| row | Does each line have the fields I expect? | one parsed object |
| metric | Did each prediction match the expected answer? | True or False |
| report | How did the whole file do? | one accuracy number |
| test | Can I prove the script rejects bad input? | passing checks |

That sequence is the foundation for later work. RAG evaluations, fine-tuning datasets, judge results, and agent traces all need the same habit: make the data visible before turning it into a number.

Here is the whole flow:

Figure: eval/support_tickets.jsonl (the 3-row contract from Git/Docker) → load_examples(path) reads lines → parse_example(line) checks required fields → Example(prompt, expected, prediction).

The diagram stays small on purpose. A beginner should be able to trace every arrow before the curriculum moves into arrays, tensors, training loops, and LLM systems.

Start with the three-row contract you already own

The Git and Docker chapters already gave you a protected three-row evaluation file at eval/support_tickets.jsonl. We reuse it here so the Python scorer operates on the exact same contract the previous lessons locked down.

| prompt | expected | prediction | correct? |
| --- | --- | --- | --- |
| Order 101 status? | shipped | shipped | yes |
| Order 102 status? | delayed | delivered | no |
| Order 103 status? | refunded | refunded | yes |

Two predictions match. One doesn't.

The exact-match score is:

$$\frac{2}{3} \approx 0.667$$

That hand calculation still matters. The Python code must produce this number on this file because the earlier Git and Docker chapters locked the same contract. This is the same fixture the pre-commit gate, the Docker container, and the later CI workflow will protect.

The file lives in the eval directory the repo already tracks

eval/support_tickets.jsonl

```json
{"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"}
{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}
{"prompt": "Order 103 status?", "expected": "refunded", "prediction": "refunded"}
```

JSONL (JSON Lines) is the format used throughout the curriculum because it streams cleanly, supports appending new eval rows, and stays human-inspectable with cat or head.
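
The append and streaming properties can be seen on a scratch file. This is a minimal sketch, not part of the curriculum scripts: the temporary `scratch_eval.jsonl` path and the "Order 104" row are made up for illustration, and the protected fixture is never touched.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical scratch file; the protected fixture is left untouched.
scratch = Path(tempfile.mkdtemp()) / "scratch_eval.jsonl"

rows = [
    {"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"},
    {"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"},
]

# Write one JSON object per line.
with scratch.open("w", encoding="utf-8") as handle:
    for row in rows:
        handle.write(json.dumps(row) + "\n")

# Appending a new eval row adds a line without rewriting the existing ones.
new_row = {"prompt": "Order 104 status?", "expected": "shipped", "prediction": "shipped"}
with scratch.open("a", encoding="utf-8") as handle:
    handle.write(json.dumps(new_row) + "\n")

# Streaming: iterate the file handle so only one line is decoded at a time.
streamed_prompts = []
with scratch.open(encoding="utf-8") as handle:
    for line in handle:
        streamed_prompts.append(json.loads(line)["prompt"])

print(streamed_prompts)
```

Because every line is a complete JSON document, a reader never needs to see the whole file to decode one row.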

Inspect one line (the order-102 failure)

```text
{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}
```

It carries the three fields the scorer contract requires:

| Field | What it means |
| --- | --- |
| prompt | the input question the model saw |
| expected | the ground-truth label |
| prediction | what the system actually output |

The row is a valid failure case. The scorer must treat it as incorrect. A row missing prediction or carrying a non-string value must fail before any metric is computed.

Name the pieces before writing code

The script uses a few Python ideas. Keep them attached to the evaluation task.

| Python idea | Meaning in this chapter |
| --- | --- |
| function | one named operation, such as exact_match |
| module | one .py file that can be run or imported |
| dataclass | a small class for storing named fields |
| type hint | a reader-facing contract such as list[Example] |
| Path | a file location passed to the loader |
| exception | a loud failure when the input breaks the contract |
| test | a small check that protects one behavior |

A type hint helps humans and editors read your intent. It doesn't validate JSON from disk by itself. A static type checker such as mypy or pyright verifies hints before the code runs, and a linter and formatter such as ruff keeps the style consistent. None of them inspect the actual file at runtime, so the runtime checks below still matter because real files can contain missing fields, wrong types, blank lines, or invalid JSON.

Figure: Runtime validation contract for Python evaluation rows showing JSON text passing object, required-field, and string-type checks before becoming an Example dataclass.
Type hints describe the shape you want. Runtime checks prove each row from disk really has that shape before the metric can use it.
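
To see the gap between a hint and a check, here is a small sketch. The `Example` dataclass mirrors the one used in this chapter; `checked_prediction` is a hypothetical helper written only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    expected: str
    prediction: str


# The annotation says str, but Python does not enforce it at runtime:
# this constructs without raising, even though prediction is an int.
unchecked = Example(prompt="Order 101 status?", expected="shipped", prediction=7)
print(type(unchecked.prediction).__name__)


def checked_prediction(value: object) -> str:
    """Hypothetical helper: a runtime check that actually rejects the bad value."""
    if not isinstance(value, str):
        raise TypeError(f"prediction must be str, got {type(value).__name__}")
    return value


try:
    checked_prediction(unchecked.prediction)
    rejected = False
except TypeError as error:
    rejected = True
    print(error)
```

A static checker would flag the `prediction=7` call in source code, but it can't see values that arrive from a file, which is why the explicit `isinstance` check is still needed.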

Build the scorer

The eval/support_tickets.jsonl file is already present from the Git chapter and already travels through the Docker setup. Create the Python module that reads it. Put this code in python_eval_demo.py.

If you are testing this block in a blank scratch directory, create the same three-row fixture first:

create-demo-fixture.py

```python
from pathlib import Path

fixture = Path("eval/support_tickets.jsonl")
fixture.parent.mkdir(exist_ok=True)
fixture.write_text(
    "\n".join(
        [
            '{"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"}',
            '{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}',
            '{"prompt": "Order 103 status?", "expected": "refunded", "prediction": "refunded"}',
        ]
    )
    + "\n",
    encoding="utf-8",
)
print("fixture_rows", len(fixture.read_text(encoding="utf-8").splitlines()))
```

Output

```text
fixture_rows 3
```

Path comes from the standard library module pathlib. It turns a file path into an object you can read, check, and pass around. Using Path instead of a plain string makes file operations explicit and behaves the same way on Linux, macOS, and Windows.
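
A few Path operations are worth trying before reading the scorer. This is a short sketch using only standard pathlib attributes; nothing here touches the disk except the `exists()` check.

```python
from pathlib import Path

# The "/" operator joins path segments with the right separator for the OS.
fixture = Path("eval") / "support_tickets.jsonl"

print(fixture.name)        # final path component
print(fixture.suffix)      # file extension, including the dot
print(fixture.parent)      # containing directory
print(fixture.as_posix())  # forward-slash form, handy for logs
print(fixture.exists())    # depends on your working directory
```

Passing `fixture` around as a Path keeps later calls such as `read_text` and `write_text` attached to the object instead of to loose string handling.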

Read the code in five blocks:

  1. Example names one row.
  2. require_str checks one field.
  3. parse_example turns text into a checked row.
  4. load_examples reads the file and rejects empty input.
  5. score_examples turns row checks into one metric.

python_eval_demo.py

```python
from dataclasses import dataclass
import json
from pathlib import Path

@dataclass
class Example:
    prompt: str
    expected: str
    prediction: str

def require_str(row: dict[str, object], key: str) -> str:
    # A missing field and a wrong-typed field are different failures.
    if key not in row:
        raise KeyError(key)

    value = row[key]
    if not isinstance(value, str):
        raise TypeError(f"{key} must be str, got {type(value).__name__}")

    return value

def parse_example(line: str) -> Example:
    row = json.loads(line)
    if not isinstance(row, dict):
        raise TypeError("each line must decode to a JSON object")

    return Example(
        prompt=require_str(row, "prompt"),
        expected=require_str(row, "expected"),
        prediction=require_str(row, "prediction"),
    )

def load_examples(path: Path) -> list[Example]:
    # Blank and whitespace-only lines are dropped before the empty check.
    lines = [line for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
    if not lines:
        raise ValueError("input file is empty")

    return [parse_example(line) for line in lines]

def exact_match(example: Example) -> bool:
    return example.prediction.strip() == example.expected.strip()

def score_examples(examples: list[Example]) -> float:
    if not examples:
        raise ValueError("examples must not be empty")

    # True counts as 1 and False as 0, so the sum is the number of matches.
    return sum(exact_match(example) for example in examples) / len(examples)

def main() -> None:
    # Use the same protected fixture the Git and Docker chapters already placed in the repo
    input_path = Path("eval/support_tickets.jsonl")
    examples = load_examples(input_path)
    score = score_examples(examples)

    print("rows", len(examples))
    print("first_prompt", examples[0].prompt)
    print("exact_match", round(score, 3))

if __name__ == "__main__":
    main()
```

Output

```text
rows 3
first_prompt Order 101 status?
exact_match 0.667
```

Run it

terminal

```shell
uv run python_eval_demo.py
```

The printed score matches the hand calculation. Use that as your first sanity check.

Trace the order 102 row through the script

The best way to read the program is to follow one row all the way through.

| Step | Python value | What changed |
| --- | --- | --- |
| file line | '{"prompt": "Order 102 status?", ...}' | text came from disk |
| json.loads(line) | {"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"} | text became a dictionary |
| require_str(row, "prediction") | "delivered" | one field was checked |
| Example(...) | Example(prompt="Order 102 status?", expected="delayed", prediction="delivered") | loose fields became a named row |
| exact_match(example) | False | one prediction was scored |
| score_examples(examples) | 0.667 | all row scores were averaged |

The word "averaged" is literal here. In Python, True behaves like 1 and False behaves like 0 when summed:

trace-the-order-102-row-through-the-script.py

```python
row_scores = [True, False, True]
print(sum(row_scores))
print(sum(row_scores) / len(row_scores))
```

Output

```text
2
0.6666666666666666
```

The script rounds that final value to three decimal places with round(score, 3), so it prints 0.667.

Read the code by boundaries

Good AI scripts make each boundary explicit. If the final number looks wrong, you can walk backward through the pipeline and inspect the smallest suspicious step.

| Code | Boundary it protects |
| --- | --- |
| Path("eval/support_tickets.jsonl") | the protected fixture location (the same file Git and Docker already track) |
| path.read_text(...).splitlines() | file contents become lines |
| json.loads(line) | one text line becomes Python data |
| isinstance(row, dict) | each line must be an object, not a list or string |
| require_str(row, "prediction") | required fields must exist and be strings |
| Example(...) | loose dictionaries become named rows |
| exact_match(example) | one row becomes one decision |
| score_examples(examples) | row decisions become one metric |

That structure is more important than exact-match accuracy itself. Exact match is a simple first metric. Later chapters will use richer metrics, tensors, loss functions, eval sets, and model outputs. The boundary habit stays the same.

When debugging, print one value at each boundary:

read-the-code-by-boundaries.py

```python
import json
from pathlib import Path

# Assumes python_eval_demo.py is saved in the same folder.
from python_eval_demo import exact_match, parse_example

input_path = Path("eval/support_tickets.jsonl")
lines = [line for line in input_path.read_text(encoding="utf-8").splitlines() if line.strip()]
row = json.loads(lines[1])  # index 1 is the order 102 row
example = parse_example(lines[1])

print(input_path)
print(len(lines))
print(row)
print(example)
print(exact_match(example))
```

Output

```text
eval/support_tickets.jsonl
3
{'prompt': 'Order 102 status?', 'expected': 'delayed', 'prediction': 'delivered'}
Example(prompt='Order 102 status?', expected='delayed', prediction='delivered')
False
```

Don't debug the aggregate score first. Debug one row first.

Make bad input fail loudly

Now break the input on purpose. Run this in a small scratch file or REPL after saving python_eval_demo.py in the same folder:

make-bad-input-fail-loudly.py

```python
# Reuse parse_example if it is already defined (for example in a REPL session),
# otherwise import it from the saved module.
try:
    parse_example
except NameError:
    from python_eval_demo import parse_example

bad_row = '{"prompt": "Order 101 status?", "expected": "shipped"}'
bad_type_row = '{"prompt": "Order 101 status?", "expected": "shipped", "prediction": 7}'

try:
    parse_example(bad_row)
except KeyError as error:
    print("missing", error.args[0])

try:
    parse_example(bad_type_row)
except TypeError as error:
    print("bad type", error)
```

Output

```text
missing prediction
bad type prediction must be str, got int
```

Both failures are useful. The first row is missing a required field. The second row has the right field name but the wrong value type. In both cases, the script stops before it prints a misleading metric.

Treat this as a production habit. Silent bad data can make an evaluation dashboard look healthy while the input is nonsense.
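
One way to make these failures even louder is to name the line that broke. The sketch below is an illustration, not the chapter's loader: `parse_lines` is a hypothetical helper, and the truncated second row is made-up input.

```python
import json


def parse_lines(text: str) -> list[dict[str, object]]:
    """Parse JSONL text, naming the 1-based line number of the first bad row."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines, as load_examples does
        try:
            row = json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {line_number}: invalid JSON ({error.msg})") from error
        if not isinstance(row, dict):
            raise TypeError(f"line {line_number}: each line must decode to a JSON object")
        rows.append(row)
    return rows


# Made-up input: the second row is cut off before its prediction value.
broken_text = (
    '{"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"}\n'
    '{"prompt": "Order 102 status?", "expected": "delayed", "prediction": \n'
)

try:
    parse_lines(broken_text)
    message = ""
except ValueError as error:
    message = str(error)

print(message)
```

With a line number in the message, whoever owns the file can open it at the right row instead of bisecting by hand.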

Turn the script into a command

The first version hardcodes the protected eval/support_tickets.jsonl path so the example stays easy to run on the curriculum fixture. The next version should accept the file path from the command line so the same module can score any later eval set.

Add import sys near the top of the file, then replace main():

turn-the-script-into-a-command.py

```python
import sys

def main() -> None:
    if len(sys.argv) != 2:
        raise SystemExit("usage: uv run python_eval_demo.py <examples.jsonl>")

    input_path = Path(sys.argv[1])
    examples = load_examples(input_path)
    print("exact_match", round(score_examples(examples), 3))

# The same command works on the protected three-row fixture:
#   uv run python_eval_demo.py eval/support_tickets.jsonl
#   → exact_match 0.667
```

simulate-command-argument.py

```python
sys.argv = ["python_eval_demo.py", "eval/support_tickets.jsonl"]
main()
```

Output

```text
exact_match 0.667
```

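sys.argv is a list that holds the script name followed by the words you type after it. sys.argv[0] is the script name, and sys.argv[1] is the first argument, which we expect to be the JSONL file path. That is why main() checks for a length of exactly 2.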

Run it on the curriculum fixture

terminal-2

```shell
uv run python_eval_demo.py eval/support_tickets.jsonl
```

That small change makes the script reusable across any JSONL file that follows the same three-field schema. A teammate (or your future self on a fresh clone) can point it at the protected fixture or any later eval set and get a trustworthy number. The earlier fixture tests still pass because the behavior on this exact file is locked.
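
If the command later grows more options, the standard library's argparse can take over the argument handling. The following is a sketch of that alternative, not the version the chapter uses; `parse_args` is a hypothetical helper name.

```python
import argparse
from pathlib import Path


def parse_args(argv: list[str]) -> Path:
    """Return the JSONL path from a list of command-line words (script name excluded)."""
    parser = argparse.ArgumentParser(
        prog="python_eval_demo.py",
        description="Score exact-match accuracy for a JSONL eval file.",
    )
    # type=Path converts the raw string into a Path before the script sees it.
    parser.add_argument("examples", type=Path, help="path to a JSONL file of eval rows")
    args = parser.parse_args(argv)
    return args.examples


# In a real script you would pass sys.argv[1:]; a literal list keeps the sketch runnable.
input_path = parse_args(["eval/support_tickets.jsonl"])
print(input_path)
```

argparse prints a usage message and exits on a missing argument, so the manual length check on sys.argv would no longer be needed in that variant.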

The input contract is now easy to state:

| User provides | Script promises |
| --- | --- |
| one JSONL file path | read that file |
| rows with prompt, expected, prediction | compute exact-match accuracy |
| empty file | fail before dividing |
| malformed row | fail before scoring |

Add tests before the script grows

Once python_eval_demo.py is a module, tests can import its functions directly. Save the test file as tests/test_python_eval_demo.py so the project uses a standard pytest layout.

tests/test_python_eval_demo.py

```python
from pathlib import Path

from python_eval_demo import Example, exact_match, load_examples, parse_example

def test_exact_match_strips_whitespace():
    example = Example("Q", "shipped", " shipped ")
    assert exact_match(example)

def test_parse_example_requires_prediction():
    bad_row = '{"prompt": "Q", "expected": "A"}'
    try:
        parse_example(bad_row)
    except KeyError as error:
        assert error.args[0] == "prediction"
    else:
        raise AssertionError("missing prediction should fail")

def test_load_examples_rejects_empty_file(tmp_path: Path):
    path = tmp_path / "empty.jsonl"
    path.write_text("", encoding="utf-8")

    try:
        load_examples(path)
    except ValueError as error:
        assert str(error) == "input file is empty"
    else:
        raise AssertionError("empty file should fail")
```

tmp_path is a pytest fixture. pytest automatically creates a temporary directory and passes it as the tmp_path argument, so the test doesn't leave files behind after it finishes.

Run the tests

Run pytest through an isolated tool command instead of installing packages into your global Python environment.

terminal-3

```shell
uv run --with pytest -m pytest tests/test_python_eval_demo.py -q --tb=line
```

Expected output

```text
...
3 passed in 0.01s
```

These tests protect the three behaviors most likely to break first:

| Test | What it protects |
| --- | --- |
| whitespace test | metric normalization |
| missing field test | required input fields |
| empty file test | no divide-by-zero metric |

The point isn't to write many tests. The point is to test the boundaries that would make a metric lie.

Notice that the tests don't call main(). They test the smaller functions. That makes failures easier to understand. If parse_example fails, you know the parsing boundary broke. If exact_match fails, you know the scoring rule broke.
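
Testing the small functions also makes it cheap to cover more malformed rows. The sketch below is a plain-Python, table-driven version of that idea; it carries a condensed copy of the parser so it runs on its own, and the four cases are made-up inputs.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    expected: str
    prediction: str


def parse_example(line: str) -> Example:
    # Condensed copy of the chapter's parser so this sketch runs on its own.
    row = json.loads(line)
    if not isinstance(row, dict):
        raise TypeError("each line must decode to a JSON object")
    fields = {}
    for key in ("prompt", "expected", "prediction"):
        if key not in row:
            raise KeyError(key)
        if not isinstance(row[key], str):
            raise TypeError(f"{key} must be str, got {type(row[key]).__name__}")
        fields[key] = row[key]
    return Example(**fields)


# Each case pairs one malformed line with the exception it should raise.
CASES = [
    ('{"prompt": "Q", "expected": "A"}', KeyError),  # missing field
    ('{"prompt": "Q", "expected": "A", "prediction": 7}', TypeError),  # wrong type
    ('["Q", "A", "A"]', TypeError),  # a list, not an object
    ('{"prompt": "Q"', json.JSONDecodeError),  # invalid JSON
]

results = []
for line, expected_error in CASES:
    try:
        parse_example(line)
    except expected_error:
        results.append(True)  # failed in the expected way
    else:
        results.append(False)  # parsed without error, which would be a bug

print(results)
```

In a pytest file the same table could feed a parametrized test; the plain loop is used here only to keep the sketch dependency-free.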

The same checks are easy to run directly while learning:

run-core-unit-checks.py

```python
from pathlib import Path

# Assumes python_eval_demo.py is saved in the same folder.
from python_eval_demo import Example, exact_match, load_examples, parse_example

tmp_path = Path("tmp-empty.jsonl")
tmp_path.write_text("", encoding="utf-8")

assert exact_match(Example("Q", "shipped", " shipped "))

try:
    parse_example('{"prompt": "Q", "expected": "A"}')
except KeyError as error:
    assert error.args[0] == "prediction"
else:
    raise AssertionError("missing prediction should fail")

try:
    load_examples(tmp_path)
except ValueError as error:
    assert str(error) == "input file is empty"
else:
    raise AssertionError("empty file should fail")

tmp_path.unlink()  # clean up the scratch file
print("unit checks passed")
```

Output

```text
unit checks passed
```

Make the tests strong enough for ML work

Three unit tests catch parser mistakes. Real ML evaluation code usually needs four more protections: prompt snapshots, simple property checks, train/eval leakage detection, and seeded CI runs.[2][3]

Snapshot the prompt template

Prompt drift is one of the fastest ways to make a metric move without anybody noticing why. Lock the exact prompt string with a small snapshot-style test:

run-prompt-snapshot-check.py

```python
# Assumes python_eval_demo.py is saved in the same folder.
from python_eval_demo import Example

PROMPT_TEMPLATE = (
    "You are a support ticket classifier.\n"
    "Ticket: {prompt}\n"
    "What is the status? Answer with exactly one word: shipped, delayed, or refunded."
)

def build_prompt(example: Example) -> str:
    return PROMPT_TEMPLATE.format(prompt=example.prompt)

def test_prompt_snapshot_is_stable():
    example = Example("Order 101 status?", "shipped", "shipped")
    # The expected string is written out in full so any template edit shows up here.
    expected = (
        "You are a support ticket classifier.\n"
        "Ticket: Order 101 status?\n"
        "What is the status? Answer with exactly one word: shipped, delayed, or refunded."
    )
    assert build_prompt(example) == expected

test_prompt_snapshot_is_stable()
print("prompt snapshot stable")
```

Output

```text
prompt snapshot stable
```

If one word changes, the test fails before the scorer or model call hides the reason.
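
When a snapshot does fail, a line diff shows which words moved. This sketch uses the standard library's difflib; the drifted wording ("Respond with one word") is a hypothetical edit chosen for illustration.

```python
import difflib

expected_prompt = (
    "You are a support ticket classifier.\n"
    "Ticket: Order 101 status?\n"
    "What is the status? Answer with exactly one word: shipped, delayed, or refunded."
)
# Hypothetical drifted template: "Answer with exactly" became "Respond with".
actual_prompt = (
    "You are a support ticket classifier.\n"
    "Ticket: Order 101 status?\n"
    "What is the status? Respond with one word: shipped, delayed, or refunded."
)

# unified_diff compares lists of lines and marks removals with "-" and additions with "+".
diff_lines = list(
    difflib.unified_diff(
        expected_prompt.splitlines(),
        actual_prompt.splitlines(),
        fromfile="snapshot",
        tofile="current",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Printing the diff in the assertion message is one option for making a failed snapshot self-explanatory in CI logs.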

Add a property check and a leakage detector

ML tests should protect invariants, not only one fixed example. Accuracy should never leave [0, 1], and the eval set should never overlap with the training prompts:

run-ml-guard-checks.py

```python
import random
from pathlib import Path

# Assumes python_eval_demo.py is saved in the same folder.
from python_eval_demo import Example, load_examples, score_examples

def assert_no_leakage(train_prompts: set[str], eval_prompts: set[str]) -> None:
    overlap = train_prompts & eval_prompts
    assert not overlap, f"LEAKAGE DETECTED: {sorted(overlap)}"

def test_score_stays_in_unit_interval():
    for seed in range(20):
        random.seed(seed)
        n = random.randint(1, 8)
        examples = [
            Example(
                prompt=f"p{i}",
                expected=random.choice(["shipped", "delayed", "refunded"]),
                prediction=random.choice(["shipped", "delayed", "refunded"]),
            )
            for i in range(n)
        ]
        # Call score_examples itself so a bug in the real metric trips this check.
        score = score_examples(examples)
        assert 0.0 <= score <= 1.0

def test_leakage_detector_catches_overlap():
    eval_examples = load_examples(Path("eval/support_tickets.jsonl"))
    eval_prompts = {example.prompt.strip().lower() for example in eval_examples}
    train_prompts = {"order 101 status?", "warehouse delay escalation"}

    try:
        assert_no_leakage(train_prompts, eval_prompts)
    except AssertionError as error:
        assert "LEAKAGE DETECTED" in str(error)
    else:
        raise AssertionError("overlapping train/eval prompts should fail")

test_score_stays_in_unit_interval()
test_leakage_detector_catches_overlap()
print("ml guards passed")
```

Output

```text
ml guards passed
```

This is how you turn "that score looks suspiciously high" into one reproducible red test instead of a debugging argument.[4]
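
Exact set overlap only catches leakage when both sides are normalized the same way. The sketch below shows one possible normalization; `normalize_prompt` and `find_leakage` are hypothetical helpers, and the training prompts are made up.

```python
def normalize_prompt(text: str) -> str:
    """Hypothetical normalization: casefold and collapse runs of whitespace."""
    return " ".join(text.casefold().split())


def find_leakage(train_prompts: list[str], eval_prompts: list[str]) -> list[str]:
    """Return the sorted normalized prompts that appear in both sets."""
    train = {normalize_prompt(prompt) for prompt in train_prompts}
    evaluation = {normalize_prompt(prompt) for prompt in eval_prompts}
    return sorted(train & evaluation)


eval_prompts = ["Order 101 status?", "Order 102 status?", "Order 103 status?"]
# Made-up training prompts; the first differs from an eval prompt only in case and spacing.
train_prompts = ["  ORDER 101   status? ", "warehouse delay escalation"]

overlap = find_leakage(train_prompts, eval_prompts)
print(overlap)
```

This still only finds near-identical strings. Paraphrased duplicates would need a fuzzier comparison, which is outside the scope of this sketch.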

Seed test runs and let CI enforce them

Notebook state and random seeds are another common source of fake progress. Put the seeding contract in one place, then run the same test command in CI:

tests/conftest.py

```python
import os
import random

import numpy as np
import pytest

def seed_all(seed: int = 42) -> None:
    # PYTHONHASHSEED is read when an interpreter starts, so setting it here only
    # affects child processes. The CI workflow also sets it in the job environment.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

@pytest.fixture(autouse=True)
def _seed_everything() -> None:
    seed_all(42)
```

verify-seed-contract.py

```python
seed_all(42)
print("seed", os.environ["PYTHONHASHSEED"])
print("numpy_sample", np.random.randint(0, 10, size=3).tolist())
```

Output

```text
seed 42
numpy_sample [6, 3, 7]
```

.github/workflows/eval-gate.yml

```yaml
name: Eval Gate

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: astral-sh/setup-uv@v8
        with:
          python-version: "3.14"
      - name: Run eval tests
        env:
          PYTHONHASHSEED: "0"
        run: |
          uv run --with pytest --with numpy -m pytest tests/test_python_eval_demo.py -q --tb=no
```

Run the fuller suite with:

terminal-4

```shell
uv run --with pytest --with numpy -m pytest tests/test_python_eval_demo.py -q --tb=line
```

Now the contract is stronger:

| Guard | What it catches |
| --- | --- |
| snapshot test | prompt drift |
| property check | impossible metric values |
| leakage detector | train/eval contamination |
| seeded fixture + CI | notebook randomness and teammate drift |

Figure: ML evaluation safety net showing parser unit tests plus prompt snapshot, metric invariant, leakage detector, and seeded CI gate before publishing a score.
Unit tests catch parser mistakes. ML guard tests catch prompt drift, impossible metrics, train/eval leakage, and hidden randomness before a score reaches a report.
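
Global seeding works, but any other code that reseeds the global generator can undo it. A complementary option, sketched here with the standard library only, is to give each function its own generator; `sample_labels` is a hypothetical helper.

```python
import random


def sample_labels(seed: int, n: int) -> list[str]:
    """Draw n labels from a generator owned by this call, not the global one."""
    rng = random.Random(seed)
    return [rng.choice(["shipped", "delayed", "refunded"]) for _ in range(n)]


first = sample_labels(seed=42, n=5)
random.seed(999)  # unrelated code reseeding the global generator
second = sample_labels(seed=42, n=5)

# The two draws agree because neither depends on global random state.
print(first == second)
```

Passing the seed as an argument also documents, at the call site, which randomness a result depends on.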

Common bugs and how to find them

Beginner Python evaluation scripts usually fail in predictable ways.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| score is 0.0 even when answers look right | whitespace or casing mismatch | print one Example, then choose the normalization rule |
| KeyError: 'prediction' | an input row is missing a required field | inspect the broken row and keep the required-field check |
| TypeError: prediction must be str, got int | a row has the right field but wrong value type | validate at parse time, not after scoring |
| ValueError: input file is empty | the path points at an empty or whitespace-only file | print the path and check the file contents |
| JSONDecodeError | one line isn't valid JSON | inspect that line and fix quotes, commas, or braces |
| teammate gets a different score | different input file or command | print row count, commit the fixture, and document the command |

Notebook-only code becomes fragile for the same reason. A notebook can hide state in old cells, local paths, and copied snippets. A small module with tests is easier to rerun and review.
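
For the first symptom in the table, it helps to compare candidate normalization rules side by side before committing to one. This is a sketch under assumptions: `exact_match_casefold` is a hypothetical looser rule, not the chapter's scorer, and the three pairs are made up.

```python
def exact_match_strict(expected: str, prediction: str) -> bool:
    # Same rule as the chapter's scorer: strip outer whitespace only.
    return prediction.strip() == expected.strip()


def exact_match_casefold(expected: str, prediction: str) -> bool:
    # Hypothetical looser rule: also ignore letter case.
    return prediction.strip().casefold() == expected.strip().casefold()


# Made-up (expected, prediction) pairs that look similar to a human reader.
pairs = [("shipped", "Shipped"), ("delayed", " delayed "), ("refunded", "refund")]

strict = [exact_match_strict(expected, prediction) for expected, prediction in pairs]
loose = [exact_match_casefold(expected, prediction) for expected, prediction in pairs]

print(strict)
print(loose)
```

Whichever rule you pick changes the metric, so it deserves its own test, like the whitespace test earlier in the chapter.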

Practice on the same file

Try these in order. Predict the result before running code.

  1. Add a fourth row where prediction equals expected.
  2. Predict the new exact-match score.
  3. Add extra spaces around one correct prediction.
  4. Confirm exact_match still treats it as correct.
  5. Replace one prediction with the number 7.
  6. Confirm parse_example rejects that row.
  7. Create an empty file and confirm load_examples rejects it.
  8. Change main() so the file path comes from sys.argv[1].

After you try, check your explanation. If your answer only says "the code works," make it more specific. Name the input, the checked fields, the metric, and the failure behavior.

How this becomes production Python

This first script stays deliberately small, but it points at the same Python skills used in production AI systems.

| Later skill | Same habit from this chapter |
| --- | --- |
| Pydantic schemas | turn fuzzy model text into checked fields before downstream code sees it |
| async API calls | keep the loop structure clear before running many model requests at once |
| generators | stream large datasets one row at a time instead of loading everything into memory |
| retry logic | make transient API failures visible without hiding malformed input |
| prompt caches | key repeated inputs deterministically so evaluation and cost reports are reproducible |

The important part is the boundary, not the library. Whether the next system uses a dataclass, a Pydantic model, an async client, or a cache, the script should still make inputs explicit, validate shape early, and fail before a bad result looks trustworthy.
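
As one illustration of the generators row in the table above, here is a sketch of a streaming scorer that keeps the same boundaries. `iter_rows` and `streaming_accuracy` are hypothetical names, and the scratch file is a temporary copy of the three fixture rows. Unlike the chapter's parser, this condensed version reports a missing field as a TypeError.

```python
import json
import tempfile
from collections.abc import Iterator
from pathlib import Path


def iter_rows(path: Path) -> Iterator[dict[str, object]]:
    """Yield one decoded row at a time instead of building a full list."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            if not isinstance(row, dict):
                raise TypeError("each line must decode to a JSON object")
            yield row


def streaming_accuracy(path: Path) -> float:
    """Exact-match accuracy computed without holding every row in memory."""
    matches = 0
    total = 0
    for row in iter_rows(path):
        expected, prediction = row.get("expected"), row.get("prediction")
        if not isinstance(expected, str) or not isinstance(prediction, str):
            raise TypeError("expected and prediction must both be str")
        matches += prediction.strip() == expected.strip()
        total += 1
    if total == 0:
        raise ValueError("input file is empty")
    return matches / total


# Scratch copy of the three fixture rows so the sketch runs anywhere.
scratch = Path(tempfile.mkdtemp()) / "support_tickets.jsonl"
scratch.write_text(
    '{"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"}\n'
    '{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}\n'
    '{"prompt": "Order 103 status?", "expected": "refunded", "prediction": "refunded"}\n',
    encoding="utf-8",
)

score = streaming_accuracy(scratch)
print("exact_match", round(score, 3))
```

The validation, the empty-input check, and the metric are unchanged; only the way rows reach them is different.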

Before moving on, you should be able to explain the finished script in four sentences:

  1. The input is a JSONL file with one evaluation example per line.
  2. The parser checks that each row has string fields named prompt, expected, and prediction.
  3. The metric scores one row with exact match, then averages the row scores.
  4. The tests prove whitespace handling, missing fields, and empty files behave as expected.

What this pattern unlocks

This chapter isn't about memorizing every Python feature. It's about one engineering pattern:

Read a data file, check one row, turn it into a named object, compute one metric, and protect the behavior with a test.

You will reuse that pattern constantly. NumPy will turn rows into arrays. PyTorch will turn arrays into tensors and gradients. Retrieval systems will turn documents into ranked candidates. Agent systems will turn tool calls into traces. Each later step is larger, but the first question stays familiar: what enters, what shape should it have, what operation happens, and what fails before the result can be trusted?[5][6]

Mastery check

Key concepts

  • functions and modules
  • dataclasses and type hints
  • pathlib file IO
  • JSONL datasets
  • runtime validation
  • pytest tests
  • prompt snapshot tests
  • property-style checks for metric invariants
  • train/eval leakage detection
  • seeded reproducibility and CI gates
  • command-line scripts
  • error handling

Evaluation rubric

  • Foundational: Explains Python evaluation scripts as repeatable data pipelines from file path to checked row to metric
  • Intermediate: Builds a dataclass parser, file loader, exact-match scorer, command-line entrypoint, and pytest suite with visible output
  • Advanced: Catches missing fields, wrong types, empty files, prompt drift, train/eval leakage, and hidden notebook state before scoring

Common pitfalls

  • Notebook-only code works once, then breaks when another person changes paths, rerun order, or input shape.
  • Type hints help readers, but they don't validate runtime JSON on their own. Parse and check required fields before scoring.
  • A metric is only useful if malformed rows and empty files fail before the score is printed.
  • Prompt wording can drift silently unless one test locks the exact string the scorer or judge sees.
  • A high score can be fake if the training prompts overlap the eval prompts or if random seeds change between runs.

Mastery Check

  1. A JSONL eval file has three (expected, prediction) pairs: ("shipped", " shipped "), ("delayed", "delivered"), and ("refunded", "refunded"). The scorer strips both strings and averages all row results. What score is printed after rounding to three decimals?
  2. parse_example receives two lines. The first is {"prompt": "Q", "expected": "A"}. The second also has "prediction": 7. What should happen?
  3. A JSONL file contains only blank and whitespace-only lines. load_examples removes those lines before checking the input. What should happen?
  4. A Python JSONL exact-match scorer should be reusable for any file whose rows contain string fields prompt, expected, and prediction, while preserving the existing loader, validation, and scoring behavior. Which main() behavior should it use?
  5. A snapshot test expects the exact prompt string "Answer with exactly one word: shipped, delayed, or refunded." A teammate changes the template to say "Respond with one word: shipped, delayed, or refunded." What should the test do?
  6. The eval prompts include "order 101 status?", and train_prompts is {"order 101 status?", "warehouse delay escalation"} after normalization. What should the leakage detector do?
  7. A refactor accidentally computes accuracy as matches / (len(examples) - 1), so some non-empty inputs produce scores above 1.0. Which test most directly exposes the broken metric contract?
  8. Two engineers run a test that randomly generates evaluation examples and receive different cases. Which setup addresses that reproducibility problem while keeping the contract automatically enforced?

Next Step
Continue to NumPy and Tensor Shapes

You now have a small Python script that reads the exact same `eval/support_tickets.jsonl` three-row contract the Git and Docker chapters protected, and computes a reproducible exact-match metric. The next chapter teaches you how to load those same rows (and later much larger batches) into NumPy arrays, attach explicit meaning to every axis, and prepare the tensor shapes that every PyTorch training loop, attention mechanism, and retrieval system in the curriculum will expect.

References

[1] Python Software Foundation. The Python Tutorial. Python Documentation, 2026.
[2] pytest contributors. pytest Documentation. Official pytest documentation, 2026.
[3] Hypothesis contributors. Hypothesis Documentation. 2026.
[4] Pineau, J., et al. The Machine Learning Reproducibility Checklist. 2021.
[5] Harris, C. R., et al. Array programming with NumPy. Nature, 2020.
[6] Paszke, A., et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.