
Python for AI Engineering

A beginner-first Python chapter that turns one JSONL eval row into a typed object, exact-match metric, failing test, and reproducible command.


Python is the first workshop tool in this curriculum.

Before you train models, build agents, or deploy services, you need one simple habit: turn messy AI work into small Python functions that another engineer can run.

Imagine a folder with model answers in a JSONL file. Each line has a prompt, an expected answer, and a prediction. Your job isn't to build a chatbot yet. Your job is to load the rows, score them, print a metric, and fail loudly when the data is wrong.
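
Loading that file takes only the standard library. Here is a minimal sketch, assuming a hypothetical eval_rows.jsonl in the current directory:

```python
import json
from pathlib import Path

# Each JSONL line is one standalone JSON object.
lines = Path("eval_rows.jsonl").read_text().splitlines()
rows = [json.loads(line) for line in lines if line.strip()]
print("loaded", len(rows), "rows")
```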

What You Learn

This chapter teaches Python as a repeatable eval tool. By the end, a beginner should be able to explain how one JSONL row becomes a typed object, how one prediction becomes a score, and how one failing assertion prevents a broken script from shipping.[1][2][3]

[Figure: Python AI engineering pipeline showing JSONL examples parsed into dataclasses, checked by tests, scored by exact match, and stopped by a CI gate.]
Visual anchor: follow one JSONL row into a typed example, then watch the test and metric gate decide whether the AI script is shippable.

Step Map

| Step | Question | What you should be able to do |
|------|----------|-------------------------------|
| 1 | What data enters? | Name the fields in one eval row. |
| 2 | What shape should it have? | Store the row in a small typed object. |
| 3 | What operation happens? | Score one prediction against one expected answer. |
| 4 | What can fail? | Catch missing fields and wrong types early. |
| 5 | What ships? | Run one command and get one clear metric. |

Think of the script like a lab scale. It doesn't make the experiment interesting. It makes the measurement trustworthy.


Tiny Story

Start with three examples from a fictional geography eval:

| prompt | expected | prediction |
|--------|----------|------------|
| Capital of France? | Paris | Paris |
| Capital of Spain? | Madrid | Barcelona |
| Capital of Italy? | Rome | Rome |

Two predictions are correct. One is wrong.

Before using any AI framework, you can already answer the important question:

What fraction of rows matched the expected answer?

The answer is:

$\frac{2}{3} \approx 0.667$

That is a metric. Small, plain, and useful.
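
You can compute it with built-ins alone; booleans average cleanly because Python treats True as 1:

```python
# One boolean per row: France correct, Spain wrong, Italy correct.
matches = [True, False, True]
print(sum(matches) / len(matches))  # 0.6666666666666666
```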

Vocabulary

  • function: named operation with inputs and an output.
  • dataclass: small Python class for storing named fields.
  • type hint: label that says what kind of value a function expects or returns.
  • JSONL: file format with one JSON object per line.
  • metric: number that summarizes behavior, such as exact-match accuracy.
  • test: small check that protects an assumption.
  • exit code: process result where 0 means success and nonzero means failure.

Each word should point to something in the eval story. If it can't point to a row, field, function, score, or failure, it is floating.

Worked Example

One row is a dictionary:

```text
{"prompt": "Capital of France?", "expected": "Paris", "prediction": "Paris"}
```

A beginner might keep it as a dictionary forever. That works for one file, then gets fragile.

A typed object gives the row a name and stable fields:

| Field | Meaning |
|-------|---------|
| prompt | input question |
| expected | answer we wanted |
| prediction | answer the model gave |

The object isn't fancy. It is a little box with labels.

Code Lab

Put this in python_eval_demo.py:

```python
from dataclasses import dataclass
import json

@dataclass
class Example:
    prompt: str
    expected: str
    prediction: str

def parse_example(line: str) -> Example:
    row = json.loads(line)
    return Example(
        prompt=row["prompt"],
        expected=row["expected"],
        prediction=row["prediction"],
    )

def exact_match(example: Example) -> bool:
    return example.prediction.strip() == example.expected.strip()

raw_rows = [
    '{"prompt": "Capital of France?", "expected": "Paris", "prediction": "Paris"}',
    '{"prompt": "Capital of Spain?", "expected": "Madrid", "prediction": "Barcelona"}',
    '{"prompt": "Capital of Italy?", "expected": "Rome", "prediction": "Rome"}',
]

examples = [parse_example(row) for row in raw_rows]
score = sum(exact_match(example) for example in examples) / len(examples)

print("rows", len(examples))
print("exact_match", round(score, 3))
```

Expected output:

```text
rows 3
exact_match 0.667
```

Read the code as a pipeline:

| Line of work | Meaning |
|--------------|---------|
| parse JSON | turn text into data |
| build Example | give fields stable names |
| compare strings | score one row |
| average booleans | score the whole file |

Python is doing ordinary bookkeeping. That bookkeeping is what makes later model work inspectable.

Second Worked Example: Fail Early

Now break one row:

```python
bad_row = '{"prompt": "Capital of France?", "expected": "Paris"}'

try:
    parse_example(bad_row)
except KeyError as error:
    print("missing", error.args[0])
```

Expected output:

```text
missing prediction
```

Read The Failure

That failure is good.

It tells you the input is wrong before a metric report lies to you. In real AI systems, silent bad data is worse than a loud crash.
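
The Step Map also promised catching wrong types, not just missing fields. Here is a sketch of a stricter parser, building on Example and json from the Code Lab (parse_example_strict is an illustrative name, not part of the demo):

```python
def parse_example_strict(line: str) -> Example:
    # Sketch: reject wrong types as well as missing fields.
    row = json.loads(line)
    for field in ("prompt", "expected", "prediction"):
        if field not in row:
            raise KeyError(field)
        if not isinstance(row[field], str):
            raise TypeError(f"{field} must be str, got {type(row[field]).__name__}")
    return Example(row["prompt"], row["expected"], row["prediction"])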

Mini Test

Add tests beside the demo:

```python
def test_exact_match_strips_whitespace():
    example = Example("Q", "Paris", " Paris ")
    assert exact_match(example)

def test_parse_example_requires_prediction():
    bad_row = '{"prompt": "Q", "expected": "A"}'
    try:
        parse_example(bad_row)
    except KeyError as error:
        assert error.args[0] == "prediction"
    else:
        raise AssertionError("missing prediction should fail")
```

These tests protect two beginner lessons:

  • string cleanup belongs in the metric
  • missing required fields should fail before scoring
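
Nothing here requires a test framework. A runner like pytest will discover functions named test_*, but a minimal sketch can call them directly:

```python
if __name__ == "__main__":
    # Call each test by hand; any failed assert stops the run with a traceback.
    test_exact_match_strips_whitespace()
    test_parse_example_requires_prediction()
    print("2 tests passed")
```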

Common Trap

The common trap is notebook-only code.

Notebook-only code usually has hidden state:

| Hidden thing | Why it hurts |
|--------------|--------------|
| variable from an old cell | rerun order changes the answer |
| local file path | teammate can't reproduce the script |
| copied metric code | bugs drift between notebooks |
| no tests | broken input still prints a score |

A small script is less glamorous, but it is easier to trust.
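
The fix is mechanical: move the scoring into a function whose inputs are explicit, so rerun order and leftover variables cannot change the answer. A minimal sketch, reusing Example and exact_match from the Code Lab (score_examples is an illustrative name):

```python
def score_examples(examples: list[Example]) -> float:
    # Explicit input, explicit failure: no hidden notebook state.
    if not examples:
        raise ValueError("no examples to score")
    return sum(exact_match(example) for example in examples) / len(examples)
```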

From Script To Command

The next step is a command that accepts paths:

```python
def main(input_path: str, output_path: str) -> None:
    print("input", input_path)
    print("output", output_path)
```

This tiny shape matters:

| Input | Output |
|-------|--------|
| path to examples | metric file |
| path to predictions | error log |
| config value | exit code |

Later chapters will use the same pattern for NumPy arrays, PyTorch tensors, RAG evals, and deployment checks.
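
Here is a fuller sketch of that shape, assuming the Code Lab was saved as python_eval_demo.py and using argparse from the standard library (the flag names are illustrative):

```python
import argparse
import sys

# Importing runs the demo's top-level prints once; fine for a sketch.
from python_eval_demo import parse_example, exact_match

def main(input_path: str, output_path: str) -> int:
    try:
        with open(input_path) as handle:
            examples = [parse_example(line) for line in handle if line.strip()]
    except (OSError, KeyError, ValueError) as error:
        print("bad input:", error, file=sys.stderr)
        return 1  # nonzero exit code: the run is not trustworthy
    if not examples:
        print("empty input file", file=sys.stderr)
        return 1
    score = sum(exact_match(example) for example in examples) / len(examples)
    with open(output_path, "w") as handle:
        handle.write(f"exact_match {round(score, 3)}\n")
    return 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Score a JSONL eval file.")
    parser.add_argument("--input", required=True, help="path to JSONL examples")
    parser.add_argument("--output", required=True, help="path to write the metric")
    args = parser.parse_args()
    sys.exit(main(args.input, args.output))
```

Returning an int from main and passing it to sys.exit is what turns "the metric is wrong" into a nonzero exit code that CI can see.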

Practice

  1. Add a fourth row where prediction equals expected.
  2. Predict the new exact-match score before running code.
  3. Add a row with extra whitespace around the prediction.
  4. Confirm exact_match still treats it as correct.
  5. Add a row missing expected and make sure it fails loudly.
  6. Write one sentence that starts: "This script is trustworthy because..."

Production Check

Before you trust a Python eval script, write down:

  • input file format
  • required fields
  • metric definition
  • behavior on empty files
  • behavior on missing fields
  • command used to run it
  • test command
  • expected output example

For example:

Input is JSONL with prompt, expected, and prediction. Metric is exact match after stripping whitespace. Missing fields raise KeyError. Tests cover whitespace and missing fields.

That note is small. It is also the difference between "I ran some code" and "another engineer can reproduce this result."
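
One item on that list deserves its own test. The Code Lab's averaging line divides by len(examples), so an empty file would crash with ZeroDivisionError instead of reporting a score. Here is a sketch that pins down the guarded behavior, assuming the score_examples helper sketched in the Common Trap section:

```python
def test_empty_dataset_fails_loudly():
    try:
        score_examples([])
    except ValueError:
        pass  # the guard fired before any metric was printed
    else:
        raise AssertionError("empty dataset should fail before scoring")
```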

Next, continue to NumPy and Tensor Shapes. Python gave you named functions and testable rows. NumPy turns rows into arrays, and arrays need shape discipline.

Evaluation Rubric
  1. Explains Python eval scripts as repeatable data pipelines from row to object to metric
  2. Builds a dataclass parser, exact-match scorer, and tests with visible output
  3. Catches missing fields, hidden notebook state, and unreproducible inputs before scoring
Common Pitfalls
  • The common failure is notebook-only code. It works once, then breaks when another person changes paths, environment variables, or input shape.
  • Hidden notebook state makes a result hard to reproduce. Prefer a small script with inputs, outputs, and tests.
  • A metric is only useful if malformed rows fail before the score is printed.
Key Concepts Tested
functions and modules · type hints · file IO · JSONL datasets · testable CLI scripts · error handling
References

  1. Python Software Foundation. "The Python Tutorial." Python Documentation, 2026.
  2. Harris, C. R., et al. "Array programming with NumPy." Nature, 2020.
  3. Paszke, A., et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." NeurIPS 2019.
