Learn Python as the first AI engineering loop: read JSONL rows, validate fields, compute exact-match accuracy, and harden the scorer with pytest, prompt snapshots, leakage checks, seeded runs, and a CI gate.
Most AI engineering starts before the model call.
You have files on disk. You have examples with labels. You have model predictions that may be wrong, malformed, or missing. Before you can trust a training run, an evaluation report, or a demo, you need Python code that can read those examples, check the shape of each row, compute one clear metric, and fail when the input is broken.
Python's job here is to turn loose files into named data, named data into measurements, and measurements into something another person can reproduce. This chapter builds that first loop from the ground up: read a JSONL evaluation file, parse each row into a typed object, score exact-match accuracy, and protect the behavior with tests.[1]
Example, scores False, and contributes to the final 0.667 metric while malformed rows stop before scoring.Git and Docker protected the file and runtime. Now we open the scorer and build the code those earlier chapters were carrying. This chapter also hardens that behavior into a reproducible contract with tests, snapshots, leakage checks, and CI.
For this Python chapter, don't think about model APIs yet. Think about the smallest reliable AI workflow:
| Boundary | Question | Output |
|---|---|---|
| file | Can I read the examples? | non-empty text lines |
| row | Does each line have the fields I expect? | one parsed object |
| metric | Did each prediction match the expected answer? | True or False |
| report | How did the whole file do? | one accuracy number |
| test | Can I prove the script rejects bad input? | passing checks |
That sequence is the foundation for later work. RAG evaluations, fine-tuning datasets, judge results, and agent traces all need the same habit: make the data visible before turning it into a number.
Here is the whole flow:
The diagram stays small on purpose. A beginner should be able to trace every arrow before the curriculum moves into arrays, tensors, training loops, and LLM systems.
The Git and Docker chapters already gave you a protected three-row evaluation file at eval/support_tickets.jsonl. We reuse it here so the Python scorer operates on the exact same contract the previous lessons locked down.
| prompt | expected | prediction | correct? |
|---|---|---|---|
Order 101 status? | shipped | shipped | yes |
Order 102 status? | delayed | delivered | no |
Order 103 status? | refunded | refunded | yes |
Two predictions match. One doesn't.
The exact-match score is:
That hand calculation still matters. The Python code must produce this number on this file because the earlier Git and Docker chapters locked the same contract. This is the same fixture the pre-commit gate, the Docker container, and the later CI workflow will protect.
1{"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"}
2{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}
3{"prompt": "Order 103 status?", "expected": "refunded", "prediction": "refunded"}JSONL (JSON Lines) is the format used throughout the curriculum because it streams cleanly, supports appending new eval rows, and stays human-inspectable with cat or head.
1{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}It carries the three fields the scorer contract requires:
| Field | What it means |
|---|---|
prompt | the input question the model saw |
expected | the ground-truth label |
prediction | what the system actually output |
The row is a valid failure case. The scorer must treat it as incorrect. A row missing prediction or carrying a non-string value must fail before any metric is computed.
The script uses a few Python ideas. Keep them attached to the evaluation task.
| Python idea | Meaning in this chapter |
|---|---|
| function | one named operation, such as exact_match |
| module | one .py file that can be run or imported |
| dataclass | a small class for storing named fields |
| type hint | a reader-facing contract such as list[Example] |
Path | a file location passed to the loader |
| exception | a loud failure when the input breaks the contract |
| test | a small check that protects one behavior |
A type hint helps humans and editors read your intent. It doesn't validate JSON from disk by itself. A static type checker such as mypy or pyright verifies hints before the code runs, and a linter and formatter such as ruff keeps the style consistent. None of them inspect the actual file at runtime, so the runtime checks below still matter because real files can contain missing fields, wrong types, blank lines, or invalid JSON.
The eval/support_tickets.jsonl file is already present from the Git chapter and already travels through the Docker setup. Create the Python module that reads it. Put this code in python_eval_demo.py.
If you are testing this block in a blank scratch directory, create the same three-row fixture first:
1from pathlib import Path
2
3fixture = Path("eval/support_tickets.jsonl")
4fixture.parent.mkdir(exist_ok=True)
5fixture.write_text(
6 "\n".join(
7 [
8 '{"prompt": "Order 101 status?", "expected": "shipped", "prediction": "shipped"}',
9 '{"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"}',
10 '{"prompt": "Order 103 status?", "expected": "refunded", "prediction": "refunded"}',
11 ]
12 )
13 + "\n",
14 encoding="utf-8",
15)
16print("fixture_rows", len(fixture.read_text(encoding="utf-8").splitlines()))1fixture_rows 3Path comes from the standard library module pathlib. It turns a file path into an object you can read, check, and pass around. Using Path instead of a plain string makes file operations explicit and behaves the same way on Linux, macOS, and Windows.
Read the code in five blocks:
Example names one row.require_str checks one field.parse_example turns text into a checked row.load_examples reads the file and rejects empty input.score_examples turns row checks into one metric.1from dataclasses import dataclass
2import json
3from pathlib import Path
4
5@dataclass
6class Example:
7 prompt: str
8 expected: str
9 prediction: str
10
11def require_str(row: dict[str, object], key: str) -> str:
12 if key not in row:
13 raise KeyError(key)
14
15 value = row[key]
16 if not isinstance(value, str):
17 raise TypeError(f"{key} must be str, got {type(value).__name__}")
18
19 return value
20
21def parse_example(line: str) -> Example:
22 row = json.loads(line)
23 if not isinstance(row, dict):
24 raise TypeError("each line must decode to a JSON object")
25
26 return Example(
27 prompt=require_str(row, "prompt"),
28 expected=require_str(row, "expected"),
29 prediction=require_str(row, "prediction"),
30 )
31
32def load_examples(path: Path) -> list[Example]:
33 lines = [line for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
34 if not lines:
35 raise ValueError("input file is empty")
36
37 return [parse_example(line) for line in lines]
38
39def exact_match(example: Example) -> bool:
40 return example.prediction.strip() == example.expected.strip()
41
42def score_examples(examples: list[Example]) -> float:
43 if not examples:
44 raise ValueError("examples must not be empty")
45
46 return sum(exact_match(example) for example in examples) / len(examples)
47
48def main() -> None:
49 # Use the same protected fixture the Git and Docker chapters already placed in the repo
50 input_path = Path("eval/support_tickets.jsonl")
51 examples = load_examples(input_path)
52 score = score_examples(examples)
53
54 print("rows", len(examples))
55 print("first_prompt", examples[0].prompt)
56 print("exact_match", round(score, 3))
57
58if __name__ == "__main__":
59 main()1rows 3
2first_prompt Order 101 status?
3exact_match 0.6671uv run python_eval_demo.pyThe printed score matches the hand calculation. Use that as your first sanity check.
The best way to read the program is to follow one row all the way through.
| Step | Python value | What changed |
|---|---|---|
| file line | '{"prompt": "Order 102 status?", ...}' | text came from disk |
json.loads(line) | {"prompt": "Order 102 status?", "expected": "delayed", "prediction": "delivered"} | text became a dictionary |
require_str(row, "prediction") | "delivered" | one field was checked |
Example(...) | Example(prompt="Order 102 status?", expected="delayed", prediction="delivered") | loose fields became a named row |
exact_match(example) | False | one prediction was scored |
score_examples(examples) | 0.667 | all row scores were averaged |
The word "averaged" is literal here. In Python, True behaves like 1 and False behaves like 0 when summed:
1row_scores = [True, False, True]
2print(sum(row_scores))
3print(sum(row_scores) / len(row_scores))12
20.6666666666666666The article rounds that final value to three decimal places, so it prints 0.667.
Good AI scripts make each boundary explicit. If the final number looks wrong, you can walk backward through the pipeline and inspect the smallest suspicious step.
| Code | Boundary it protects |
|---|---|
Path("eval/support_tickets.jsonl") | the protected fixture location (the same file Git and Docker already track) |
path.read_text(...).splitlines() | file contents become lines |
json.loads(line) | one text line becomes Python data |
isinstance(row, dict) | each line must be an object, not a list or string |
require_str(row, "prediction") | required fields must exist and be strings |
Example(...) | loose dictionaries become named rows |
exact_match(example) | one row becomes one decision |
score_examples(examples) | row decisions become one metric |
That structure is more important than exact-match accuracy itself. Exact match is a simple first metric. Later chapters will use richer metrics, tensors, loss functions, eval sets, and model outputs. The boundary habit stays the same.
When debugging, print one value at each boundary:
1input_path = Path("eval/support_tickets.jsonl")
2lines = [line for line in input_path.read_text(encoding="utf-8").splitlines() if line.strip()]
3row = json.loads(lines[1])
4example = parse_example(lines[1])
5
6print(input_path)
7print(len(lines))
8print(row)
9print(example)
10print(exact_match(example))1eval/support_tickets.jsonl
23
3{'prompt': 'Order 102 status?', 'expected': 'delayed', 'prediction': 'delivered'}
4Example(prompt='Order 102 status?', expected='delayed', prediction='delivered')
5FalseDon't debug the aggregate score first. Debug one row first.
Now break the input on purpose. Run this in a small scratch file or REPL after saving python_eval_demo.py in the same folder:
1try:
2 parse_example
3except NameError:
4 from python_eval_demo import parse_example
5
6bad_row = '{"prompt": "Order 101 status?", "expected": "shipped"}'
7bad_type_row = '{"prompt": "Order 101 status?", "expected": "shipped", "prediction": 7}'
8
9try:
10 parse_example(bad_row)
11except KeyError as error:
12 print("missing", error.args[0])
13
14try:
15 parse_example(bad_type_row)
16except TypeError as error:
17 print("bad type", error)1missing prediction
2bad type prediction must be str, got intBoth failures are useful. The first row is missing a required field. The second row has the right field name but the wrong value type. In both cases, the script stops before it prints a misleading metric.
Treat this as a production habit. Silent bad data can make an evaluation dashboard look healthy while the input is nonsense.
The first version hardcodes the protected eval/support_tickets.jsonl path so the example stays easy to run on the curriculum fixture. The next version should accept the file path from the command line so the same module can score any later eval set.
Add import sys near the top of the file, then replace main():
1import sys
2
3def main() -> None:
4 if len(sys.argv) != 2:
5 raise SystemExit("usage: uv run python_eval_demo.py <examples.jsonl>")
6
7 input_path = Path(sys.argv[1])
8 examples = load_examples(input_path)
9 print("exact_match", round(score_examples(examples), 3))
10
11 # The same command works on the protected three-row fixture:
12 # uv run python_eval_demo.py eval/support_tickets.jsonl
13 # → exact_match 0.6671sys.argv = ["python_eval_demo.py", "eval/support_tickets.jsonl"]
2main()1exact_match 0.667sys.argv is a list that holds the words you type after the script name. sys.argv[1] is the first argument, which we expect to be the JSONL file path.
1uv run python_eval_demo.py eval/support_tickets.jsonlThat small change makes the script reusable across any JSONL file that follows the same three-field schema. A teammate (or your future self on a fresh clone) can point it at the protected fixture or any later eval set and get a trustworthy number. The earlier fixture tests still pass because the behavior on this exact file is locked.
The input contract is now easy to state:
| User provides | Script promises |
|---|---|
| one JSONL file path | read that file |
rows with prompt, expected, prediction | compute exact-match accuracy |
| empty file | fail before dividing |
| malformed row | fail before scoring |
Once python_eval_demo.py is a module, tests can import its functions directly. Save the test file as tests/test_python_eval_demo.py so the project uses a standard pytest layout.
1from pathlib import Path
2
3from python_eval_demo import Example, exact_match, load_examples, parse_example
4
5def test_exact_match_strips_whitespace():
6 example = Example("Q", "shipped", " shipped ")
7 assert exact_match(example)
8
9def test_parse_example_requires_prediction():
10 bad_row = '{"prompt": "Q", "expected": "A"}'
11 try:
12 parse_example(bad_row)
13 except KeyError as error:
14 assert error.args[0] == "prediction"
15 else:
16 raise AssertionError("missing prediction should fail")
17
18def test_load_examples_rejects_empty_file(tmp_path: Path):
19 path = tmp_path / "empty.jsonl"
20 path.write_text("", encoding="utf-8")
21
22 try:
23 load_examples(path)
24 except ValueError as error:
25 assert str(error) == "input file is empty"
26 else:
27 raise AssertionError("empty file should fail")tmp_path is a pytest fixture. pytest automatically creates a temporary directory and passes it as the tmp_path argument, so the test doesn't leave files behind after it finishes.
Run pytest through an isolated tool command instead of installing packages into your global Python environment.
1uv run --with pytest -m pytest tests/test_python_eval_demo.py -q --tb=line1...
23 passed in 0.01sThese tests protect the three behaviors most likely to break first:
| Test | What it protects |
|---|---|
| whitespace test | metric normalization |
| missing field test | required input fields |
| empty file test | no divide-by-zero metric |
The point isn't to write many tests. The point is to test the boundaries that would make a metric lie.
Notice that the tests don't call main(). They test the smaller functions. That makes failures easier to understand. If parse_example fails, you know the parsing boundary broke. If exact_match fails, you know the scoring rule broke.
The same checks are easy to run directly while learning:
1tmp_path = Path("tmp-empty.jsonl")
2tmp_path.write_text("", encoding="utf-8")
3
4assert exact_match(Example("Q", "shipped", " shipped "))
5
6try:
7 parse_example('{"prompt": "Q", "expected": "A"}')
8except KeyError as error:
9 assert error.args[0] == "prediction"
10else:
11 raise AssertionError("missing prediction should fail")
12
13try:
14 load_examples(tmp_path)
15except ValueError as error:
16 assert str(error) == "input file is empty"
17else:
18 raise AssertionError("empty file should fail")
19
20tmp_path.unlink()
21print("unit checks passed")1unit checks passedThree unit tests catch parser mistakes. Real ML evaluation code usually needs four more protections: prompt snapshots, simple property checks, train/eval leakage detection, and seeded CI runs.[2][3]
Prompt drift is one of the fastest ways to make a metric move without anybody noticing why. Lock the exact prompt string with a small snapshot-style test:
1PROMPT_TEMPLATE = (
2 "You are a support ticket classifier.\n"
3 "Ticket: {prompt}\n"
4 "What is the status? Answer with exactly one word: shipped, delayed, or refunded."
5)
6
7def build_prompt(example: Example) -> str:
8 return PROMPT_TEMPLATE.format(prompt=example.prompt)
9
10def test_prompt_snapshot_is_stable():
11 example = Example("Order 101 status?", "shipped", "shipped")
12 expected = (
13 "You are a support ticket classifier.\n"
14 "Ticket: Order 101 status?\n"
15 "What is the status? Answer with exactly one word: shipped, delayed, or refunded."
16 )
17 assert build_prompt(example) == expected
18
19test_prompt_snapshot_is_stable()
20print("prompt snapshot stable")1prompt snapshot stableIf one word changes, the test fails before the scorer or model call hides the reason.
ML tests should protect invariants, not only one fixed example. Accuracy should never leave [0, 1], and the eval set should never overlap with the training prompts:
1import random
2
3def assert_no_leakage(train_prompts: set[str], eval_prompts: set[str]) -> None:
4 overlap = train_prompts & eval_prompts
5 assert not overlap, f"LEAKAGE DETECTED: {sorted(overlap)}"
6
7def test_score_stays_in_unit_interval():
8 for seed in range(20):
9 random.seed(seed)
10 n = random.randint(1, 8)
11 examples = [
12 Example(
13 prompt=f"p{i}",
14 expected=random.choice(["shipped", "delayed", "refunded"]),
15 prediction=random.choice(["shipped", "delayed", "refunded"]),
16 )
17 for i in range(n)
18 ]
19 score = sum(exact_match(example) for example in examples) / len(examples)
20 assert 0.0 <= score <= 1.0
21
22def test_leakage_detector_catches_overlap():
23 eval_examples = load_examples(Path("eval/support_tickets.jsonl"))
24 eval_prompts = {example.prompt.strip().lower() for example in eval_examples}
25 train_prompts = {"order 101 status?", "warehouse delay escalation"}
26
27 try:
28 assert_no_leakage(train_prompts, eval_prompts)
29 except AssertionError as error:
30 assert "LEAKAGE DETECTED" in str(error)
31 else:
32 raise AssertionError("overlapping train/eval prompts should fail")
33
34test_score_stays_in_unit_interval()
35test_leakage_detector_catches_overlap()
36print("ml guards passed")1ml guards passedThis is how you turn "that score looks suspiciously high" into one reproducible red test instead of a debugging argument.[4]
Notebook state and random seeds are another common source of fake progress. Put the seeding contract in one place, then run the same test command in CI:
1import os
2import random
3
4import numpy as np
5import pytest
6
7def seed_all(seed: int = 42) -> None:
8 os.environ["PYTHONHASHSEED"] = str(seed)
9 random.seed(seed)
10 np.random.seed(seed)
11
12@pytest.fixture(autouse=True)
13def _seed_everything() -> None:
14 seed_all(42)1seed_all(42)
2print("seed", os.environ["PYTHONHASHSEED"])
3print("numpy_sample", np.random.randint(0, 10, size=3).tolist())1seed 42
2numpy_sample [6, 3, 7]1name: Eval Gate
2
3on: [push, pull_request]
4
5jobs:
6 eval:
7 runs-on: ubuntu-latest
8 steps:
9 - uses: actions/checkout@v6
10 - uses: astral-sh/setup-uv@v8
11 with:
12 python-version: "3.14"
13 - name: Run eval tests
14 env:
15 PYTHONHASHSEED: "0"
16 run: |
17 uv run --with pytest --with numpy -m pytest tests/test_python_eval_demo.py -q --tb=noRun the fuller suite with:
1uv run --with pytest --with numpy -m pytest tests/test_python_eval_demo.py -q --tb=lineNow the contract is stronger:
| Guard | What it catches |
|---|---|
| snapshot test | prompt drift |
| property check | impossible metric values |
| leakage detector | train/eval contamination |
| seeded fixture + CI | notebook randomness and teammate drift |
Beginner Python evaluation scripts usually fail in predictable ways.
| Symptom | Likely cause | Fix |
|---|---|---|
score is 0.0 even when answers look right | whitespace or casing mismatch | print one Example, then choose the normalization rule |
KeyError: 'prediction' | an input row is missing a required field | inspect the broken row and keep the required-field check |
TypeError: prediction must be str, got int | a row has the right field but wrong value type | validate at parse time, not after scoring |
ValueError: input file is empty | wrong path or empty fixture | print the path and check the file contents |
JSONDecodeError | one line isn't valid JSON | inspect that line and fix quotes, commas, or braces |
| teammate gets a different score | different input file or command | print row count, commit the fixture, and document the command |
Notebook-only code becomes fragile for the same reason. A notebook can hide state in old cells, local paths, and copied snippets. A small module with tests is easier to rerun and review.
Try these in order. Predict the result before running code.
prediction equals expected.exact_match still treats it as correct.7.parse_example rejects that row.load_examples rejects it.main() so the file path comes from sys.argv[1].Use these checks after you try:
If your answer only says "the code works," make it more specific. Name the input, the checked fields, the metric, and the failure behavior.
This first script stays deliberately small, but it points at the same Python skills used in production AI systems.
| Later skill | Same habit from this chapter |
|---|---|
| Pydantic schemas | turn fuzzy model text into checked fields before downstream code sees it |
| async API calls | keep the loop structure clear before running many model requests at once |
| generators | stream large datasets one row at a time instead of loading everything into memory |
| retry logic | make transient API failures visible without hiding malformed input |
| prompt caches | key repeated inputs deterministically so evaluation and cost reports are reproducible |
The important part is the boundary, not the library. Whether the next system uses a dataclass, a Pydantic model, an async client, or a cache, the script should still make inputs explicit, validate shape early, and fail before a bad result looks trustworthy.
Before moving on, you should be able to explain the finished script in four sentences:
prompt, expected, and prediction.This chapter isn't about memorizing every Python feature. It's about one engineering pattern:
Read a data file, check one row, turn it into a named object, compute one metric, and protect the behavior with a test.
You will reuse that pattern constantly. NumPy will turn rows into arrays. PyTorch will turn arrays into tensors and gradients. Retrieval systems will turn documents into ranked candidates. Agent systems will turn tool calls into traces. Each later step is larger, but the first question stays familiar: what enters, what shape should it have, what operation happens, and what fails before the result can be trusted?[5][6]
Answer every question, then check your score. Score above 75% to mark this lesson complete.
8 questions remaining.
The Python Tutorial.
Python Software Foundation. · 2026 · Python Documentation
pytest Documentation
pytest contributors · 2026 · Official pytest documentation
Hypothesis Documentation
Hypothesis contributors · 2026
The Machine Learning Reproducibility Checklist
Pineau, J., et al. · 2021
Array programming with NumPy.
Harris, C. R., et al. · 2020 · Nature
PyTorch: An Imperative Style, High-Performance Deep Learning Library.
Paszke, A., et al. · 2019 · NeurIPS 2019