LeetLLM
LearnPracticeFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Practice
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnML Algorithms & EvaluationLinear Regression from Scratch
📊MediumEvaluation & Benchmarks

Linear Regression from Scratch

Fit return-assistant latency by hand, implement least squares and gradient descent in NumPy, then test failure cases and held-out behavior.

21 min read
Learning path
Step 28 of 155 in the full curriculum
The LLM LifecycleLogistic Regression and Metrics

The previous chapter traced an LLM feature from training data to a release gate. Now make one part measurable: how much latency does added policy evidence cost when the returns assistant answers order A10234? Linear regression is the smallest useful training problem because every moving part remains visible: feature, target, intercept, slope, residual, loss, gradient, validation, and failure.

You will fit one line by hand, reproduce it with NumPy, break it on purpose, and compare it with scikit-learn only after the mechanics are clear.[1][2][3][4]

Latency observations for order A10234 around the fitted line y equals 100 plus 20x, with the 100 millisecond intercept, 20 millisecond slope, signed residual bars, and squared-error tiles producing MSE 2.8. Latency observations for order A10234 around the fitted line y equals 100 plus 20x, with the 100 millisecond intercept, 20 millisecond slope, signed residual bars, and squared-error tiles producing MSE 2.8.
The fitted line separates the trend from row-level error. The intercept captures fixed overhead, the slope captures added token cost, and squared residuals average to MSE 2.8.

You need two prerequisites here: comfort reading a small NumPy array, and familiarity with y = mx + b as a straight-line rule. We keep one tiny request replay throughout so arithmetic, code, and debugging habits stay connected.

Predicting latency from retrieved evidence

The returns assistant must answer whether damaged order A10234 is eligible under policy line P-7. You replay the same question while varying retrieved policy evidence. The input feature is evidence length in hundreds of tokens. The target is end-to-end decision latency in milliseconds.

Evidence length x (hundreds of tokens)Observed latency y (ms)
1118
2141
3162
4181
5198

This small dataset works for beginners because each number means something you can picture:

  • The feature is retrieved evidence length.
  • The target is measured decision latency.
  • The intercept is the latency when x = 0, which you can read as fixed server overhead.
  • The slope is how many more milliseconds you expect for each extra 100 tokens.

A reasonable starting guess is:

y^=100+20x\hat{y} = 100 + 20xy^​=100+20x

Read that sentence before reading it as math: "start around 100 ms, then add about 20 ms for each extra 100 tokens of policy evidence."

That one line already lets you answer concrete questions. If a request has x = 6, the model predicts:

100+20×6=220100 + 20 \times 6 = 220100+20×6=220

That gives you the core mental model of linear regression. There isn't a hidden trick. We're finding the straight line that makes measured mistakes as small as possible.

inspect-the-request-replay.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5 6print("requests", len(x)) 7print("evidence_range", (int(x.min()), int(x.max())), "hundreds_of_tokens") 8print("latency_range", (int(y.min()), int(y.max())), "ms")
Output
1requests 5 2evidence_range (1, 5) hundreds_of_tokens 3latency_range (118, 198) ms

Residuals show what the line missed

A prediction becomes useful when you compare it with reality. The difference between the actual value and the predicted value is the residual:

residual=y−y^\text{residual} = y - \hat{y}residual=y−y^​

Using the candidate line 100 + 20x, you get:

xActual yPredicted y-hatResidual y - y-hatResidual squared
1118120-24
214114011
316216024
418118011
5198200-24

The signs matter:

  • Negative residual: the model predicted too high.
  • Positive residual: the model predicted too low.
  • Residual near zero: the line was close on that row.

To turn all five misses into one score, we square them and average them:

MSE=4+1+4+1+45=2.8\text{MSE} = \frac{4 + 1 + 4 + 1 + 4}{5} = 2.8MSE=54+1+4+1+4​=2.8

This score is mean squared error. Squaring does two jobs at once: it stops positive and negative errors from canceling out, and it punishes large misses more than small ones.

For this latency model, a positive residual means the request ran slower than predicted; a negative residual means it ran faster than predicted. If release gates care about slow surprises, inspect positive residuals directly in addition to reporting MSE.

A useful beginner habit is to compare against a naive baseline. The mean-latency baseline predicts 160 for each row, which gives MSE 802.8. The fitted line is useful because it crushes the "predict the average" baseline. That ratio has a name. The standard goodness-of-fit score for regression is the coefficient of determination, R-squared, which reports the fraction of the target's variance the model explains relative to the mean baseline:[1]

R2=1−∑(y−y^)2∑(y−yˉ)2=1−144014=1−2.8802.8=0.9965R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{14}{4014} = 1 - \frac{2.8}{802.8} = 0.9965R2=1−∑(y−yˉ​)2∑(y−y^​)2​=1−401414​=1−802.82.8​=0.9965

R^2 = 0 means you tied the mean baseline; R^2 = 1 means perfect fit; a negative value means you did worse than predicting the average. The formula uses sums of squared errors. Dividing numerator and denominator by the same five rows gives the identical 1 - MSE_line / MSE_baseline calculation above. Here 0.9965 says the line explains about 99.65 percent of variation in these replay measurements, not every future production request.

score-candidate-line-against-baseline.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5pred = 100 + 20 * x 6residual = y - pred 7mse = np.mean(residual ** 2) 8baseline_mse = np.mean((y - y.mean()) ** 2) 9r2 = 1 - mse / baseline_mse 10 11print("predictions", pred.astype(int)) 12print("residuals", residual.astype(int)) 13print("mse", round(float(mse), 1)) 14print("baseline_mse", round(float(baseline_mse), 1)) 15print("r2", round(float(r2), 4))
Output
1predictions [120 140 160 180 200] 2residuals [-2 1 2 1 -2] 3mse 2.8 4baseline_mse 802.8 5r2 0.9965

Why the math becomes matrix multiplication

For one feature, y^=100+20x\hat{y} = 100 + 20xy^​=100+20x is readable. For many features, you want the same idea written in a form code can scale.

Start by turning the data into a design matrix: a tidy table where each row is one training example and each column is one input the model can use.

X=[1112131415]X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}X=​11111​12345​​

Each row is one training example. The leading column is all ones so the model can learn an intercept. The second column is the feature itself.

Now bundle the weights into one vector:

w=[bm]w = \begin{bmatrix} b \\ m \end{bmatrix}w=[bm​]

The prediction rule becomes:

y^=Xw\hat{y} = Xwy^​=Xw

In this chapter, b is the intercept and m is the slope. For our latency data, the least-squares weights will turn out to be:

w=[10020]w = \begin{bmatrix} 100 \\ 20 \end{bmatrix}w=[10020​]

It's still the same old line. We changed the notation so the computer can do the bookkeeping.

Here is one row of that bookkeeping by itself. For the third request, x = 3, so the design-matrix row is [1, 3]:

[13][10020]=1×100+3×20=160\begin{bmatrix} 1 & 3 \end{bmatrix} \begin{bmatrix} 100 \\ 20 \end{bmatrix} = 1 \times 100 + 3 \times 20 = 160[1​3​][10020​]=1×100+3×20=160

The leading 1 selects the intercept. The 3 scales the slope. Matrix multiplication does that same dot product for each row at once.

design-matrix-keeps-fixed-overhead.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5 6with_intercept = np.c_[np.ones_like(x), x] 7without_intercept = x[:, None] 8 9w_full, *_ = np.linalg.lstsq(with_intercept, y, rcond=None) 10w_forced, *_ = np.linalg.lstsq(without_intercept, y, rcond=None) 11mse_full = np.mean((with_intercept @ w_full - y) ** 2) 12mse_forced = np.mean((without_intercept @ w_forced - y) ** 2) 13 14print("with_intercept", w_full.round(2), "mse", round(float(mse_full), 2)) 15print("forced_through_zero", w_forced.round(2), "mse", round(float(mse_forced), 2))
Output
1with_intercept [100. 20.] mse 2.8 2forced_through_zero [47.27] mse 1820.98

NumPy implementation with the closed-form solve

Start with the normal equation, the closed-form condition that the least-squares optimum satisfies.[1] It makes the derivation visible when the feature matrix is small and full rank.

Written as math, the normal equation is:

XTXw=XTyX^T X w = X^T yXTXw=XTy

Solving that system gives the same weights as the textbook formula w=(XTX)−1XTyw = (X^T X)^{-1}X^T yw=(XTX)−1XTy. In this small demonstration, np.linalg.solve avoids explicitly forming an inverse. Forming X^T X is useful for seeing the derivation, but it can magnify numerical conditioning problems. Later you will use np.linalg.lstsq(X, y) directly, which also reports matrix rank and handles redundant columns without asking you to form X^T X.

numpy-implementation-with-the-closed-form.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5 6X = np.c_[np.ones_like(x), x] 7 8w_closed = np.linalg.solve(X.T @ X, X.T @ y) 9pred_closed = X @ w_closed 10residuals = y - pred_closed 11 12baseline = np.full_like(y, y.mean()) 13line_mse = np.mean((pred_closed - y) ** 2) 14baseline_mse = np.mean((baseline - y) ** 2) 15r2 = 1 - line_mse / baseline_mse 16 17print("closed_form_weights", w_closed.round(2)) 18print("residuals", residuals.round(2)) 19print("line_mse", line_mse.round(2)) 20print("baseline_mse", baseline_mse.round(2)) 21print("r2", round(r2, 4))
Output
1closed_form_weights [100. 20.] 2residuals [-2. 1. 2. 1. -2.] 3line_mse 2.8 4baseline_mse 802.8 5r2 0.9965

Read the code in this order:

  1. np.c_[np.ones_like(x), x] creates the intercept column and the feature column.
  2. X.T @ X and X.T @ y are the matrix pieces of least squares.
  3. np.linalg.solve(...) solves the system without explicitly computing a matrix inverse.
  4. X @ w_closed turns weights back into predictions.
  5. The baseline check proves the model learned a useful pattern instead of memorizing the mean target value.

For a reusable implementation, fit against X directly:

least-squares-with-rank-check.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5X = np.c_[np.ones_like(x), x] 6 7w, residual_sum, rank, singular_values = np.linalg.lstsq(X, y, rcond=None) 8pred = X @ w 9 10print("weights", w.round(2)) 11print("rank", rank, "columns", X.shape[1]) 12print("mse", round(float(np.mean((pred - y) ** 2)), 2)) 13print("singular_values", singular_values.round(2))
Output
1weights [100. 20.] 2rank 2 columns 2 3mse 2.8 4singular_values [7.69 0.92]

The two columns are independent, so rank equals the number of columns. Now make a bad feature table by adding double_tokens = 2 * tokens. That third column contains no new information.

redundant-features-break-normal-equation.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5X_bad = np.c_[np.ones_like(x), x, 2 * x] 6 7try: 8 np.linalg.solve(X_bad.T @ X_bad, X_bad.T @ y) 9except np.linalg.LinAlgError: 10 print("normal_equation", "singular") 11 12w, _, rank, _ = np.linalg.lstsq(X_bad, y, rcond=None) 13print("lstsq_rank", rank, "columns", X_bad.shape[1]) 14print("lstsq_weights", w.round(2)) 15print("mse", round(float(np.mean((X_bad @ w - y) ** 2)), 2))
Output
1normal_equation singular 2lstsq_rank 2 columns 3 3lstsq_weights [100. 4. 8.] 4mse 2.8

The prediction remains the same because 4*x + 8*(2*x) = 20*x; the individual feature weights are no longer uniquely identifiable. Rank is the warning signal for an exact duplicate. Near-duplicate columns can be harder to catch: the solve may return weights, but small data changes can move those weights sharply. Inspect singular values or a condition number when coefficients look unstable.

There is also a scaling limit worth knowing. With N rows and D features, forming X^T X costs roughly O(ND^2), then solving the D x D system costs roughly O(D^3). With a handful of features that is instant. With very wide data, iterative optimization often becomes the practical path.[1]

Gradient descent finds the same line by walking downhill

The closed-form solve is elegant, but it doesn't scale to each setting you'll care about later. Neural networks, embeddings, and large feature sets are usually trained by iterative optimization instead. Linear regression is the friendliest place to learn that loop.

Gradient descent is like walking down a mountain in thick fog. You don't see the bottom, but you can feel the slope under your feet. You take small steps in the steepest downward direction until the ground levels out.

Start with terrible weights:

w=[00]w = \begin{bmatrix} 0 \\ 0 \end{bmatrix}w=[00​]

That predicts 0 for each row. Each error is negative because each prediction is too low. The gradient points in the direction where loss grows fastest, so subtracting it moves the weights toward lower loss.

On the initial step, the average gradient works out to [-320, -1040]. Here is where that number comes from when w = [0, 0]:

PieceValue
predictions[0, 0, 0, 0, 0]
errors Xw - y[-118, -141, -162, -181, -198]
intercept sum-118 - 141 - 162 - 181 - 198 = -800
slope-weighted sum1(-118) + 2(-141) + 3(-162) + 4(-181) + 5(-198) = -2600
multiply by 2 / n = 2 / 5[-320, -1040]

The slope gradient is larger because later rows have larger x values, so their errors pull harder on the slope than on the intercept.

The number 0.05 is the learning rate: it controls how large each step is. Too small and the model crawls; too large and it can overshoot the valley and bounce around. With lr = 0.05, the update becomes:

w=[0,0]−0.05×[−320,−1040]=[16,52]w = [0, 0] - 0.05 \times [-320, -1040] = [16, 52]w=[0,0]−0.05×[−320,−1040]=[16,52]

That's not the final answer. The slope overshoots. But the update is still useful because the model has learned the right direction: both the intercept and slope need to increase.

One common mistake is using wildly different feature scales, such as distance in miles alongside package weight in grams. The loss surface becomes stretched, so one coordinate can drive much larger updates than another. Gradient descent may zig-zag or diverge before it reaches the minimum. Normalize features before training when their ranges differ.

For mean squared error, the gradient is:

∇wMSE=2nXT(Xw−y)\nabla_w \text{MSE} = \frac{2}{n}X^T(Xw - y)∇w​MSE=n2​XT(Xw−y)

Where does the 2 come from? It's the derivative of the squared term. If you write out MSE as the average of (Xw−y)2(Xw - y)^2(Xw−y)2 and differentiate with respect to www, the chain rule brings down a 2 and leaves XTX^TXT in front because each weight touches each prediction through the design matrix.[2][1]

Plain English:

  • Xw - y is the vector of prediction errors.
  • X.T turns row-wise errors back into "how much should each weight change?"
  • 2 / n averages the result across rows.
verify-the-first-gradient-step.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5X = np.c_[np.ones_like(x), x] 6w = np.zeros(2) 7 8error = X @ w - y 9grad = (2 / len(X)) * X.T @ error 10w_next = w - 0.05 * grad 11 12print("gradient", grad) 13print("next_weights", w_next) 14print("old_mse", round(float(np.mean(error ** 2)), 1)) 15print("new_mse", round(float(np.mean((X @ w_next - y) ** 2)), 1))
Output
1gradient [ -320. -1040.] 2next_weights [16. 52.] 3old_mse 26402.8 4new_mse 2194.8

Here is the full training loop:

Gradient descent on the latency model loss surface: a learning rate of 0.05 zig-zags from zero weights to the least-squares minimum at intercept 100 and slope 20, while a learning rate of 0.2 sends log loss upward. Gradient descent on the latency model loss surface: a learning rate of 0.05 zig-zags from zero weights to the least-squares minimum at intercept 100 and slope 20, while a learning rate of 0.2 sends log loss upward.
The same gradient can converge or explode depending on step size. With `lr = 0.05`, updates follow the narrow loss valley to `[100, 20]`; with `lr = 0.2`, each overshoot makes the next error larger.
gradient-descent-line-fit.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5X = np.c_[np.ones_like(x), x] 6 7w = np.zeros(2) 8lr = 0.05 9history = {} 10 11for step in range(1000): 12 pred = X @ w 13 error = pred - y 14 grad = (2 / len(X)) * (X.T @ error) 15 w -= lr * grad 16 17 if step in {0, 99, 999}: 18 loss = np.mean((X @ w - y) ** 2) 19 print(f"after_step={step + 1} loss={loss:.2f} w={w.round(2)}") 20 history[step + 1] = (loss, w.copy())
Output
1after_step=1 loss=2194.80 w=[16. 52.] 2after_step=100 loss=49.09 w=[84.05 24.42] 3after_step=1000 loss=2.80 w=[100. 20.]

The initial update looks dramatic because the model started absurdly wrong. Each progress line reports loss and weights after the same update, so you can compare snapshots without mixing states. By the end, gradient descent reaches the same answer as the closed-form solve.

Remember this distinction:

  • Least squares solve: jumps straight to the answer when the matrix is small enough and stable enough.
  • Gradient descent: improves the answer step by step and scales to settings where a direct solve isn't practical.

The one-feature data also exposes a failed optimization run cleanly. Increase lr from 0.05 to 0.2; the steps overshoot and loss grows:

detect-an-unstable-learning-rate.py
1import numpy as np 2 3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) 4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 5X = np.c_[np.ones_like(x), x] 6w = np.zeros(2) 7losses = [] 8 9for _ in range(5): 10 error = X @ w - y 11 losses.append(float(np.mean(error ** 2))) 12 grad = (2 / len(X)) * X.T @ error 13 w -= 0.2 * grad 14 15print("losses", [round(loss, 1) for loss in losses]) 16print("loss_increased", losses[-1] > losses[0])
Output
1losses [26402.8, 349474.8, 4852473.8, 67584328.5, 941482657.6] 2loss_increased True

When to use each approach

ApproachGood fitMain thing to watch
Direct least squaresSmall tabular problems, few features, reproducible solution such as evidence-length latency estimatesRedundant features make individual weights ambiguous; inspect rank
Gradient descentLarge feature sets, streaming updates, same optimization pattern used by neural networksLearning rate and feature scale can make training unstable

This comparison matters because later AI systems still reuse the same question: "Should I solve this directly, or optimize it iteratively?" Linear regression is the smallest place to learn that tradeoff with numbers you can still inspect.

Evaluation starts on unseen requests

The five rows above were chosen to make mechanics visible. Their training R^2 is descriptive, not release evidence. A real latency model needs a train, validation, and test split with held-out replays from evidence lengths, policy documents, and load conditions the fit never saw.

For a first clean check, fit the first four evidence lengths and reserve the fifth request as a tiny test set:

evaluate-a-held-out-request.py
1import numpy as np 2 3x_train = np.array([1.0, 2.0, 3.0, 4.0]) 4y_train = np.array([118.0, 141.0, 162.0, 181.0]) 5x_test = np.array([5.0]) 6y_test = np.array([198.0]) 7 8X_train = np.c_[np.ones_like(x_train), x_train] 9X_test = np.c_[np.ones_like(x_test), x_test] 10w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None) 11prediction = X_test @ w 12baseline = np.full_like(y_test, y_train.mean()) 13 14print("train_weights", w.round(2)) 15print("test_prediction_ms", prediction.round(2)) 16print("test_error_ms", (y_test - prediction).round(2)) 17print("line_squared_error", round(float(np.mean((prediction - y_test) ** 2)), 2)) 18print("baseline_squared_error", round(float(np.mean((baseline - y_test) ** 2)), 2))
Output
1train_weights [98. 21.] 2test_prediction_ms [203.] 3test_error_ms [-5.] 4line_squared_error 25.0 5baseline_squared_error 2256.25

One held-out row isn't an uncertainty estimate, but it teaches the correct boundary: fit on training rows, judge on data held back from fitting. A later evaluation chapter will make that split robust with cross-validation and leakage controls.

Match a trusted implementation

From-scratch code teaches mechanics. Once it works, compare it with a maintained library on the same features and target. LinearRegression fits ordinary least squares, while mean_squared_error and r2_score apply the same metrics used above.[4]

compare-with-scikit-learn.py
1import numpy as np 2from sklearn.linear_model import LinearRegression 3from sklearn.metrics import mean_squared_error, r2_score 4 5x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1) 6y = np.array([118.0, 141.0, 162.0, 181.0, 198.0]) 7 8model = LinearRegression().fit(x, y) 9pred = model.predict(x) 10 11print("intercept", round(float(model.intercept_), 2)) 12print("slope", np.round(model.coef_, 2)) 13print("mse", round(float(mean_squared_error(y, pred)), 2)) 14print("r2", round(float(r2_score(y, pred)), 4))
Output
1intercept 100.0 2slope [20.] 3mse 2.8 4r2 0.9965

Common failure patterns

Low error doesn't mean the model is healthy. Beginners usually get tricked by one of these patterns:

SymptomLikely causeFix
Residuals bend in a pattern, such as low-high-low instead of random noiseThe real relationship isn't well described by one straight lineAdd a transformed feature, try a different model, or inspect whether a key variable is missing
Loss explodes or bounces up and down during gradient descentLearning rate is too high, or features live on wildly different scalesLower the learning rate and normalize features before training
Normal-equation solve fails, or weights change wildly after adding a similar featureFeature columns are redundant or nearly redundantInspect rank, remove redundant features, and fit with np.linalg.lstsq
Training error looks great, but future predictions are badYou evaluated on the same rows you trained on, or leaked future information into featuresSplit data before fitting and keep a truly held-out set
One extreme row drags the line toward itselfAn outlier dominates squared lossPlot the points, inspect the row, and decide whether the data is wrong or the model needs an outlier-resistant approach

Our one-feature example hides one practical issue: feature scaling. With one feature, the optimizer has a simple job. Once one column is in milliseconds and another is in millions of tokens, gradient descent can zig-zag badly until you rescale the inputs. The same problem appears in logistics when you mix delivery distance in miles with package count or weight in grams without normalization.

Two interview traps are worth separating:

  • Linear regression doesn't require the raw feature x or target y to be normally distributed. Normal residuals matter for certain statistical inference claims, not for fitting the least-squares line itself.
  • A straight line is most trustworthy inside the range you trained on. Predicting x = 6 after training on 1 through 5 is a small extension. Predicting x = 50 would be extrapolation, and the model may be confidently wrong.

Try it yourself

Start with the NumPy code above, then use the library comparison as a check rather than a substitute for understanding the fit.

Try thisGood answer should notice
Predict latency for x = 6 before running anything220 ms, because the fitted line adds 20 ms for each extra 100 evidence tokens on top of fixed overhead
Change the last target from 198 to 240 and refitThe slope rises because the last point pulls the line upward, and the residual pattern changes most at the high-token end
Set lr = 0.2 in the gradient loopLoss shoots upward or becomes unstable, which is a learning-rate problem rather than proof that linear regression is broken
Delete the intercept column and fit with x aloneThe line is forced through zero, which usually worsens the fit because the system still has base latency when prompt length is zero
Add a duplicate feature 2 * x and call np.linalg.solveThe normal equation becomes singular; rank-aware least squares still predicts the same line but coefficients aren't unique
Remove the baseline comparisonYou lose the simplest proof that the model learned structure instead of chasing the average target

If a practice answer feels hand-wavy, make it measurable. Print the new weights, print the new loss, or sketch the new residual pattern. Beginners improve faster when each guess turns into an observable result.

What you have built and where it leads

You can now do six useful things without a library hiding the mechanics:

  1. Turn a small dataset into a design matrix.
  2. Fit a line and explain what each weight means.
  3. Measure error with residuals and mean squared error.
  4. Compare a direct least-squares solve with gradient descent.
  5. Detect a redundant-feature failure through rank.
  6. Hold out a request and verify the fit against scikit-learn.

That's enough to fit a model. It isn't enough to trust one.

The next step is to move from numeric latency to a binary return decision: "Should this request be routed to human review?" Logistic regression gives you probabilities, log-loss, threshold tuning, ROC, and calibration, tools later classifiers reuse.

Mastery check

Key concepts

  • intercept and slope
  • design matrix
  • residuals
  • mean squared error
  • closed-form least squares
  • gradient descent
  • learning rate
  • baseline comparison
  • R-squared
  • rank deficiency
  • held-out evaluation

Evaluation rubric

  • Foundational: Explains intercept, slope, residuals, and mean squared error with a concrete worked example before introducing matrix notation
  • Intermediate: Implements linear regression in NumPy with direct least squares and a gradient descent loop, then compares against a mean baseline and a held-out request
  • Advanced: Diagnoses rank deficiency, non-linearity, unstable learning rates, outliers, and training-row evaluation as distinct failure modes with concrete fixes

Follow-up questions

Common pitfalls

  • Symptom: fitted line looks consistently shifted even though slope seems plausible. Cause: the design matrix forgot the all-ones intercept column, so the model is forced through zero. Fix: rebuild X with the intercept column and refit before changing anything else.
  • Symptom: adding a second token-count feature makes the direct solve fail or produces hard-to-explain weights. Cause: the new column repeats information already in x. Fix: inspect rank, remove the redundant column, and use np.linalg.lstsq while diagnosing the data design.
  • Symptom: training MSE looks low, but predictions still fail on new rows. Cause: training loss was treated as enough evidence, even though residual pattern or held-out performance was never checked. Fix: compare against a baseline, inspect residuals, and evaluate on rows the model did not train on.
  • Symptom: gradient descent loss bounces upward or turns unstable. Cause: learning rate is too high, or feature scales are badly mismatched. Fix: lower lr, normalize features, and watch whether the loss curve starts descending smoothly.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.For the latency replay, a design row for x = 3 is [1, 3] and the weights are [100, 20]. What prediction does X @ w produce for that row, and what role does the leading 1 play?
2.For a candidate latency line, two requests have residuals +10 ms and -10 ms. What happens if you average the raw residuals rather than squared residuals?
3.A latency model evaluated on the same rows as its mean baseline has MSE 900, while the mean-latency baseline has MSE 600. What is R-squared, and what does it mean?
4.You build X_bad = [1, x, 2 * x] for the replay and call np.linalg.solve(X_bad.T @ X_bad, X_bad.T @ y). It raises a singular-matrix error, while np.linalg.lstsq reports rank 2 for 3 columns and MSE 2.8. What diagnosis follows?
5.In the first gradient descent step, the model starts with w = [0, 0], the average MSE gradient is [-320, -1040], and lr = 0.05. Which update is correct?
6.A gradient descent latency model uses lr = 0.2 and feature columns whose ranges differ by orders of magnitude. Its MSE grows at every step. Which change addresses both stated optimization risks?
7.You have two tasks: fit a tiny full-rank latency table for inspectable weights, then train a much wider or streaming feature set where a direct solve may be impractical. Which pairing matches the practical trade-off?
8.A team fits the line on five replay rows and reports training MSE 2.8 and R-squared 0.9965 as proof that future latency is safe. Which next step directly tests generalization?
9.A single high-evidence replay changes from 198 ms to 240 ms. After refitting by least squares, the slope rises sharply and the line shifts toward that row. What diagnosis and response fit?
10.After fitting a straight line, residuals are consistently negative at low evidence lengths, positive in the middle, and negative again at high lengths. Which diagnosis and next step match this pattern?

10 questions remaining.

Next Step
Continue to Logistic Regression and Metrics

There you will turn numeric features into a binary decision, derive sigmoid and log loss, tune thresholds, inspect calibration, and compare a NumPy trainer against scikit-learn.

PreviousThe LLM Lifecycle
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

The Elements of Statistical Learning.

Hastie, T., Tibshirani, R., Friedman, J. · 2009

Pattern Recognition and Machine Learning.

Bishop, C. M. · 2006

Array programming with NumPy.

Harris, C. R., et al. · 2020 · Nature

Scikit-learn: Machine Learning in Python.

Pedregosa, F., et al. · 2011 · JMLR