Fit return-assistant latency by hand, implement least squares and gradient descent in NumPy, then test failure cases and held-out behavior.
The previous chapter traced an LLM feature from training data to a release gate. Now make one part measurable: how much latency does added policy evidence cost when the returns assistant answers order A10234? Linear regression is the smallest useful training problem because every moving part remains visible: feature, target, intercept, slope, residual, loss, gradient, validation, and failure.
You will fit one line by hand, reproduce it with NumPy, break it on purpose, and compare it with scikit-learn only after the mechanics are clear.[1][2][3][4]
You need two prerequisites here: comfort reading a small NumPy array, and familiarity with y = mx + b as a straight-line rule. We keep one tiny request replay throughout so arithmetic, code, and debugging habits stay connected.
The returns assistant must answer whether damaged order A10234 is eligible under policy line P-7. You replay the same question while varying retrieved policy evidence. The input feature is evidence length in hundreds of tokens. The target is end-to-end decision latency in milliseconds.
Evidence length x (hundreds of tokens) | Observed latency y (ms) |
|---|---|
| 1 | 118 |
| 2 | 141 |
| 3 | 162 |
| 4 | 181 |
| 5 | 198 |
This small dataset works for beginners because each number means something you can picture:
x = 0, which you can read as fixed server overhead.A reasonable starting guess is:
Read that sentence before reading it as math: "start around 100 ms, then add about 20 ms for each extra 100 tokens of policy evidence."
That one line already lets you answer concrete questions. If a request has x = 6, the model predicts:
That gives you the core mental model of linear regression. There isn't a hidden trick. We're finding the straight line that makes measured mistakes as small as possible.
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5
6print("requests", len(x))
7print("evidence_range", (int(x.min()), int(x.max())), "hundreds_of_tokens")
8print("latency_range", (int(y.min()), int(y.max())), "ms")1requests 5
2evidence_range (1, 5) hundreds_of_tokens
3latency_range (118, 198) msA prediction becomes useful when you compare it with reality. The difference between the actual value and the predicted value is the residual:
Using the candidate line 100 + 20x, you get:
x | Actual y | Predicted y-hat | Residual y - y-hat | Residual squared |
|---|---|---|---|---|
| 1 | 118 | 120 | -2 | 4 |
| 2 | 141 | 140 | 1 | 1 |
| 3 | 162 | 160 | 2 | 4 |
| 4 | 181 | 180 | 1 | 1 |
| 5 | 198 | 200 | -2 | 4 |
The signs matter:
To turn all five misses into one score, we square them and average them:
This score is mean squared error. Squaring does two jobs at once: it stops positive and negative errors from canceling out, and it punishes large misses more than small ones.
For this latency model, a positive residual means the request ran slower than predicted; a negative residual means it ran faster than predicted. If release gates care about slow surprises, inspect positive residuals directly in addition to reporting MSE.
A useful beginner habit is to compare against a naive baseline. The mean-latency baseline predicts 160 for each row, which gives MSE 802.8. The fitted line is useful because it crushes the "predict the average" baseline. That ratio has a name. The standard goodness-of-fit score for regression is the coefficient of determination, R-squared, which reports the fraction of the target's variance the model explains relative to the mean baseline:[1]
R^2 = 0 means you tied the mean baseline; R^2 = 1 means perfect fit; a negative value means you did worse than predicting the average. The formula uses sums of squared errors. Dividing numerator and denominator by the same five rows gives the identical 1 - MSE_line / MSE_baseline calculation above. Here 0.9965 says the line explains about 99.65 percent of variation in these replay measurements, not every future production request.
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5pred = 100 + 20 * x
6residual = y - pred
7mse = np.mean(residual ** 2)
8baseline_mse = np.mean((y - y.mean()) ** 2)
9r2 = 1 - mse / baseline_mse
10
11print("predictions", pred.astype(int))
12print("residuals", residual.astype(int))
13print("mse", round(float(mse), 1))
14print("baseline_mse", round(float(baseline_mse), 1))
15print("r2", round(float(r2), 4))1predictions [120 140 160 180 200]
2residuals [-2 1 2 1 -2]
3mse 2.8
4baseline_mse 802.8
5r2 0.9965For one feature, is readable. For many features, you want the same idea written in a form code can scale.
Start by turning the data into a design matrix: a tidy table where each row is one training example and each column is one input the model can use.
Each row is one training example. The leading column is all ones so the model can learn an intercept. The second column is the feature itself.
Now bundle the weights into one vector:
The prediction rule becomes:
In this chapter, b is the intercept and m is the slope. For our latency data, the least-squares weights will turn out to be:
It's still the same old line. We changed the notation so the computer can do the bookkeeping.
Here is one row of that bookkeeping by itself. For the third request, x = 3, so the design-matrix row is [1, 3]:
The leading 1 selects the intercept. The 3 scales the slope. Matrix multiplication does that same dot product for each row at once.
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5
6with_intercept = np.c_[np.ones_like(x), x]
7without_intercept = x[:, None]
8
9w_full, *_ = np.linalg.lstsq(with_intercept, y, rcond=None)
10w_forced, *_ = np.linalg.lstsq(without_intercept, y, rcond=None)
11mse_full = np.mean((with_intercept @ w_full - y) ** 2)
12mse_forced = np.mean((without_intercept @ w_forced - y) ** 2)
13
14print("with_intercept", w_full.round(2), "mse", round(float(mse_full), 2))
15print("forced_through_zero", w_forced.round(2), "mse", round(float(mse_forced), 2))1with_intercept [100. 20.] mse 2.8
2forced_through_zero [47.27] mse 1820.98Start with the normal equation, the closed-form condition that the least-squares optimum satisfies.[1] It makes the derivation visible when the feature matrix is small and full rank.
Written as math, the normal equation is:
Solving that system gives the same weights as the textbook formula . In this small demonstration, np.linalg.solve avoids explicitly forming an inverse. Forming X^T X is useful for seeing the derivation, but it can magnify numerical conditioning problems. Later you will use np.linalg.lstsq(X, y) directly, which also reports matrix rank and handles redundant columns without asking you to form X^T X.
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5
6X = np.c_[np.ones_like(x), x]
7
8w_closed = np.linalg.solve(X.T @ X, X.T @ y)
9pred_closed = X @ w_closed
10residuals = y - pred_closed
11
12baseline = np.full_like(y, y.mean())
13line_mse = np.mean((pred_closed - y) ** 2)
14baseline_mse = np.mean((baseline - y) ** 2)
15r2 = 1 - line_mse / baseline_mse
16
17print("closed_form_weights", w_closed.round(2))
18print("residuals", residuals.round(2))
19print("line_mse", line_mse.round(2))
20print("baseline_mse", baseline_mse.round(2))
21print("r2", round(r2, 4))1closed_form_weights [100. 20.]
2residuals [-2. 1. 2. 1. -2.]
3line_mse 2.8
4baseline_mse 802.8
5r2 0.9965Read the code in this order:
np.c_[np.ones_like(x), x] creates the intercept column and the feature column.X.T @ X and X.T @ y are the matrix pieces of least squares.np.linalg.solve(...) solves the system without explicitly computing a matrix inverse.X @ w_closed turns weights back into predictions.For a reusable implementation, fit against X directly:
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5X = np.c_[np.ones_like(x), x]
6
7w, residual_sum, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
8pred = X @ w
9
10print("weights", w.round(2))
11print("rank", rank, "columns", X.shape[1])
12print("mse", round(float(np.mean((pred - y) ** 2)), 2))
13print("singular_values", singular_values.round(2))1weights [100. 20.]
2rank 2 columns 2
3mse 2.8
4singular_values [7.69 0.92]The two columns are independent, so rank equals the number of columns. Now make a bad feature table by adding double_tokens = 2 * tokens. That third column contains no new information.
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5X_bad = np.c_[np.ones_like(x), x, 2 * x]
6
7try:
8 np.linalg.solve(X_bad.T @ X_bad, X_bad.T @ y)
9except np.linalg.LinAlgError:
10 print("normal_equation", "singular")
11
12w, _, rank, _ = np.linalg.lstsq(X_bad, y, rcond=None)
13print("lstsq_rank", rank, "columns", X_bad.shape[1])
14print("lstsq_weights", w.round(2))
15print("mse", round(float(np.mean((X_bad @ w - y) ** 2)), 2))1normal_equation singular
2lstsq_rank 2 columns 3
3lstsq_weights [100. 4. 8.]
4mse 2.8The prediction remains the same because 4*x + 8*(2*x) = 20*x; the individual feature weights are no longer uniquely identifiable. Rank is the warning signal for an exact duplicate. Near-duplicate columns can be harder to catch: the solve may return weights, but small data changes can move those weights sharply. Inspect singular values or a condition number when coefficients look unstable.
There is also a scaling limit worth knowing. With N rows and D features, forming X^T X costs roughly O(ND^2), then solving the D x D system costs roughly O(D^3). With a handful of features that is instant. With very wide data, iterative optimization often becomes the practical path.[1]
The closed-form solve is elegant, but it doesn't scale to each setting you'll care about later. Neural networks, embeddings, and large feature sets are usually trained by iterative optimization instead. Linear regression is the friendliest place to learn that loop.
Gradient descent is like walking down a mountain in thick fog. You don't see the bottom, but you can feel the slope under your feet. You take small steps in the steepest downward direction until the ground levels out.
Start with terrible weights:
That predicts 0 for each row. Each error is negative because each prediction is too low. The gradient points in the direction where loss grows fastest, so subtracting it moves the weights toward lower loss.
On the initial step, the average gradient works out to [-320, -1040]. Here is where that number comes from when w = [0, 0]:
| Piece | Value |
|---|---|
| predictions | [0, 0, 0, 0, 0] |
errors Xw - y | [-118, -141, -162, -181, -198] |
| intercept sum | -118 - 141 - 162 - 181 - 198 = -800 |
| slope-weighted sum | 1(-118) + 2(-141) + 3(-162) + 4(-181) + 5(-198) = -2600 |
multiply by 2 / n = 2 / 5 | [-320, -1040] |
The slope gradient is larger because later rows have larger x values, so their errors pull harder on the slope than on the intercept.
The number 0.05 is the learning rate: it controls how large each step is. Too small and the model crawls; too large and it can overshoot the valley and bounce around. With lr = 0.05, the update becomes:
That's not the final answer. The slope overshoots. But the update is still useful because the model has learned the right direction: both the intercept and slope need to increase.
One common mistake is using wildly different feature scales, such as distance in miles alongside package weight in grams. The loss surface becomes stretched, so one coordinate can drive much larger updates than another. Gradient descent may zig-zag or diverge before it reaches the minimum. Normalize features before training when their ranges differ.
For mean squared error, the gradient is:
Where does the 2 come from? It's the derivative of the squared term. If you write out MSE as the average of and differentiate with respect to , the chain rule brings down a 2 and leaves in front because each weight touches each prediction through the design matrix.[2][1]
Plain English:
Xw - y is the vector of prediction errors.X.T turns row-wise errors back into "how much should each weight change?"2 / n averages the result across rows.1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5X = np.c_[np.ones_like(x), x]
6w = np.zeros(2)
7
8error = X @ w - y
9grad = (2 / len(X)) * X.T @ error
10w_next = w - 0.05 * grad
11
12print("gradient", grad)
13print("next_weights", w_next)
14print("old_mse", round(float(np.mean(error ** 2)), 1))
15print("new_mse", round(float(np.mean((X @ w_next - y) ** 2)), 1))1gradient [ -320. -1040.]
2next_weights [16. 52.]
3old_mse 26402.8
4new_mse 2194.8Here is the full training loop:
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5X = np.c_[np.ones_like(x), x]
6
7w = np.zeros(2)
8lr = 0.05
9history = {}
10
11for step in range(1000):
12 pred = X @ w
13 error = pred - y
14 grad = (2 / len(X)) * (X.T @ error)
15 w -= lr * grad
16
17 if step in {0, 99, 999}:
18 loss = np.mean((X @ w - y) ** 2)
19 print(f"after_step={step + 1} loss={loss:.2f} w={w.round(2)}")
20 history[step + 1] = (loss, w.copy())1after_step=1 loss=2194.80 w=[16. 52.]
2after_step=100 loss=49.09 w=[84.05 24.42]
3after_step=1000 loss=2.80 w=[100. 20.]The initial update looks dramatic because the model started absurdly wrong. Each progress line reports loss and weights after the same update, so you can compare snapshots without mixing states. By the end, gradient descent reaches the same answer as the closed-form solve.
Remember this distinction:
The one-feature data also exposes a failed optimization run cleanly. Increase lr from 0.05 to 0.2; the steps overshoot and loss grows:
1import numpy as np
2
3x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
4y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
5X = np.c_[np.ones_like(x), x]
6w = np.zeros(2)
7losses = []
8
9for _ in range(5):
10 error = X @ w - y
11 losses.append(float(np.mean(error ** 2)))
12 grad = (2 / len(X)) * X.T @ error
13 w -= 0.2 * grad
14
15print("losses", [round(loss, 1) for loss in losses])
16print("loss_increased", losses[-1] > losses[0])1losses [26402.8, 349474.8, 4852473.8, 67584328.5, 941482657.6]
2loss_increased True| Approach | Good fit | Main thing to watch |
|---|---|---|
| Direct least squares | Small tabular problems, few features, reproducible solution such as evidence-length latency estimates | Redundant features make individual weights ambiguous; inspect rank |
| Gradient descent | Large feature sets, streaming updates, same optimization pattern used by neural networks | Learning rate and feature scale can make training unstable |
This comparison matters because later AI systems still reuse the same question: "Should I solve this directly, or optimize it iteratively?" Linear regression is the smallest place to learn that tradeoff with numbers you can still inspect.
The five rows above were chosen to make mechanics visible. Their training R^2 is descriptive, not release evidence. A real latency model needs a train, validation, and test split with held-out replays from evidence lengths, policy documents, and load conditions the fit never saw.
For a first clean check, fit the first four evidence lengths and reserve the fifth request as a tiny test set:
1import numpy as np
2
3x_train = np.array([1.0, 2.0, 3.0, 4.0])
4y_train = np.array([118.0, 141.0, 162.0, 181.0])
5x_test = np.array([5.0])
6y_test = np.array([198.0])
7
8X_train = np.c_[np.ones_like(x_train), x_train]
9X_test = np.c_[np.ones_like(x_test), x_test]
10w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
11prediction = X_test @ w
12baseline = np.full_like(y_test, y_train.mean())
13
14print("train_weights", w.round(2))
15print("test_prediction_ms", prediction.round(2))
16print("test_error_ms", (y_test - prediction).round(2))
17print("line_squared_error", round(float(np.mean((prediction - y_test) ** 2)), 2))
18print("baseline_squared_error", round(float(np.mean((baseline - y_test) ** 2)), 2))1train_weights [98. 21.]
2test_prediction_ms [203.]
3test_error_ms [-5.]
4line_squared_error 25.0
5baseline_squared_error 2256.25One held-out row isn't an uncertainty estimate, but it teaches the correct boundary: fit on training rows, judge on data held back from fitting. A later evaluation chapter will make that split robust with cross-validation and leakage controls.
From-scratch code teaches mechanics. Once it works, compare it with a maintained library on the same features and target. LinearRegression fits ordinary least squares, while mean_squared_error and r2_score apply the same metrics used above.[4]
1import numpy as np
2from sklearn.linear_model import LinearRegression
3from sklearn.metrics import mean_squared_error, r2_score
4
5x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
6y = np.array([118.0, 141.0, 162.0, 181.0, 198.0])
7
8model = LinearRegression().fit(x, y)
9pred = model.predict(x)
10
11print("intercept", round(float(model.intercept_), 2))
12print("slope", np.round(model.coef_, 2))
13print("mse", round(float(mean_squared_error(y, pred)), 2))
14print("r2", round(float(r2_score(y, pred)), 4))1intercept 100.0
2slope [20.]
3mse 2.8
4r2 0.9965Low error doesn't mean the model is healthy. Beginners usually get tricked by one of these patterns:
| Symptom | Likely cause | Fix |
|---|---|---|
| Residuals bend in a pattern, such as low-high-low instead of random noise | The real relationship isn't well described by one straight line | Add a transformed feature, try a different model, or inspect whether a key variable is missing |
| Loss explodes or bounces up and down during gradient descent | Learning rate is too high, or features live on wildly different scales | Lower the learning rate and normalize features before training |
| Normal-equation solve fails, or weights change wildly after adding a similar feature | Feature columns are redundant or nearly redundant | Inspect rank, remove redundant features, and fit with np.linalg.lstsq |
| Training error looks great, but future predictions are bad | You evaluated on the same rows you trained on, or leaked future information into features | Split data before fitting and keep a truly held-out set |
| One extreme row drags the line toward itself | An outlier dominates squared loss | Plot the points, inspect the row, and decide whether the data is wrong or the model needs an outlier-resistant approach |
Our one-feature example hides one practical issue: feature scaling. With one feature, the optimizer has a simple job. Once one column is in milliseconds and another is in millions of tokens, gradient descent can zig-zag badly until you rescale the inputs. The same problem appears in logistics when you mix delivery distance in miles with package count or weight in grams without normalization.
Two interview traps are worth separating:
x or target y to be normally distributed. Normal residuals matter for certain statistical inference claims, not for fitting the least-squares line itself.x = 6 after training on 1 through 5 is a small extension. Predicting x = 50 would be extrapolation, and the model may be confidently wrong.Start with the NumPy code above, then use the library comparison as a check rather than a substitute for understanding the fit.
| Try this | Good answer should notice |
|---|---|
Predict latency for x = 6 before running anything | 220 ms, because the fitted line adds 20 ms for each extra 100 evidence tokens on top of fixed overhead |
Change the last target from 198 to 240 and refit | The slope rises because the last point pulls the line upward, and the residual pattern changes most at the high-token end |
Set lr = 0.2 in the gradient loop | Loss shoots upward or becomes unstable, which is a learning-rate problem rather than proof that linear regression is broken |
Delete the intercept column and fit with x alone | The line is forced through zero, which usually worsens the fit because the system still has base latency when prompt length is zero |
Add a duplicate feature 2 * x and call np.linalg.solve | The normal equation becomes singular; rank-aware least squares still predicts the same line but coefficients aren't unique |
| Remove the baseline comparison | You lose the simplest proof that the model learned structure instead of chasing the average target |
If a practice answer feels hand-wavy, make it measurable. Print the new weights, print the new loss, or sketch the new residual pattern. Beginners improve faster when each guess turns into an observable result.
You can now do six useful things without a library hiding the mechanics:
That's enough to fit a model. It isn't enough to trust one.
The next step is to move from numeric latency to a binary return decision: "Should this request be routed to human review?" Logistic regression gives you probabilities, log-loss, threshold tuning, ROC, and calibration, tools later classifiers reuse.
X with the intercept column and refit before changing anything else.x. Fix: inspect rank, remove the redundant column, and use np.linalg.lstsq while diagnosing the data design.lr, normalize features, and watch whether the loss curve starts descending smoothly.Answer every question, then check your score. Score above 75% to mark this lesson complete.
10 questions remaining.