Route access-change requests with logistic regression from scratch: derive sigmoid and log loss, fit NumPy weights, select a cost-aware threshold on validation data, audit ranking and calibration, then compare with scikit-learn.
Logistic regression turns features into a binary-routing score, then challenges you to justify treating that score as a probability.[1] In the previous lesson, access-change request REQ-10234 needed a numeric estimate: would retrieved evidence fit the response-latency budget? Now it needs a decision: may the workflow auto-process the permission change under policy line P-7, or must a human review it first?
You'll fit a model by hand on eight historical access-change requests, derive its gradient, implement each line in NumPy, and check it against scikit-learn. A separate eight-request validation slice then carries the decisions that matter: threshold cost, ranking, and calibration. Training output explains the mechanism. Held-out labels are where you begin to trust an operating policy.
You need the linear-regression chapter (design matrix, gradient descent loop, NumPy broadcasting) and comfort with basic probability. Each derivative gets worked with real numbers on paper before the code appears.
In the previous chapter you predicted a number: "how many milliseconds will this evidence cost?" Now the target is binary: 1 means an access request needs human review before any permission change; 0 means policy evidence supports automatic handling. An incorrect automatic approval can cost $120 in permission rollback and investigation. An unnecessary review costs $18 of queue time.
Linear regression on a 0/1 label would happily predict 1.3 or −0.2. Those numbers are meaningless as probabilities. We need something that squeezes any real number into (0, 1).
This is also how you evaluate LLM-era classifiers. A prompt-injection guardrail, a toxicity filter, or an agent action router emits a score; a threshold turns that score into an intervention. Precision, recall, cost, and calibration belong in its model card and on-call review.
The sigmoid (or logistic) function does that:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
It has four properties you'll feel in your fingers:
Compute five concrete values by hand. Do this on a napkin before reaching for code:
| z | e^{-z} | 1 + e^{-z} | σ(z) |
|---|---|---|---|
| −4.0 | 54.6 | 55.6 | 0.018 |
| −1.0 | 2.718 | 3.718 | 0.269 |
| 0.0 | 1.0 | 2.0 | 0.500 |
| 1.0 | 0.368 | 1.368 | 0.731 |
| 4.0 | 0.018 | 1.018 | 0.982 |
```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("scores:", scores)
print("probabilities:", sigmoid(scores).round(3))
```

```text
scores: [-4. -1. 0. 1. 4.]
probabilities: [0.018 0.269 0.5 0.731 0.982]
```
The curve saturates quickly. A score of +4 produces p = 0.982; whether that number represents an honest frequency still needs validation.
The model itself is still linear inside:

$$z = w_0 + w_1 x_1 + w_2 x_2$$

Then p = σ(z). The positive class is "needs human review." This one line is the whole logistic regression model.

The score z is also the logit, or log-odds:[1]

$$z = \log\frac{p}{1 - p}$$
At z = 0, the odds are 1:1 and p = 0.5. Adding one to z multiplies the odds by e; it doesn't guarantee that the displayed probability is calibrated on production traffic.
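As a quick check of the odds interpretation, the sketch below converts a few scores to odds and confirms that each one-unit step in z multiplies the odds by e. The helper name `odds_from_score` is illustrative and not part of the lesson's code.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def odds_from_score(z: np.ndarray) -> np.ndarray:
    # Hypothetical helper: odds = p / (1 - p), which equals exp(z).
    p = sigmoid(z)
    return p / (1.0 - p)

z = np.array([-1.0, 0.0, 1.0, 2.0])
odds = odds_from_score(z)
ratios = odds[1:] / odds[:-1]  # odds ratio for each +1 step in z

print("odds:", odds.round(3))
print("ratio per unit step:", ratios.round(3))
```

The ratio says how the score moves the odds; it says nothing about whether those odds match observed review rates.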
We continue the REQ-10234 access-control workflow. Each historical request has two scaled features produced from retrieved evidence:
- x1 = ambiguity score: conflicting audit logs, missing requester context, or unclear scope description
- x2 = auto-policy support score: strength of evidence that policy line P-7 permits automatic handling

The label y = 1 when the completed case required human review before a permission change, and y = 0 when automatic handling was acceptable.
This tiny dataset stays with the entire chapter:
| Request | x1 ambiguity | x2 auto-policy support | y needs review |
|---|---|---|---|
| R1 | 1.0 | 1 | 0 |
| R2 | 1.5 | 2 | 0 |
| R3 | 2.0 | 3 | 1 |
| R4 | 3.5 | 6 | 1 |
| R5 | 4.0 | 7 | 1 |
| R6 | 4.5 | 5 | 1 |
| R7 | 2.8 | 4 | 0 |
| R8 | 3.2 | 8 | 0 |
We kept this training fixture small so you can verify each number. R3 needed review despite only moderate ambiguity. R8 had substantial ambiguity, but strong policy support made automatic processing acceptable. The resulting coefficient signs should now tell a coherent story: ambiguity raises review risk; supporting policy evidence lowers it.
Add the intercept column of ones and build the design matrix X (8 × 3) and the label vector y.
If the model outputs p = 0.93 for a request with y = 0, it made a confident mistake that deserves a large penalty. If it outputs p = 0.07 for that request, the loss is small. Negative log-likelihood encodes that asymmetry.
For one example the loss is the negative log-likelihood (also called binary cross-entropy):[2][3]

$$L(p, y) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$
When y=1 the second term vanishes and we pay −log p (large when p is near 0). When y=0 we pay −log(1−p).
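To make the asymmetry described above concrete, this small sketch evaluates the single-example loss for the two predictions mentioned earlier (p = 0.93 and p = 0.07 on a request with y = 0). The function name `bce` is an illustrative choice.

```python
import numpy as np

def bce(p: float, y: int) -> float:
    # Binary cross-entropy for one example; assumes 0 < p < 1.
    return float(-(y * np.log(p) + (1 - y) * np.log(1.0 - p)))

confident_wrong = bce(0.93, 0)  # pays -log(0.07)
confident_right = bce(0.07, 0)  # pays -log(0.93)

print("p=0.93, y=0:", round(confident_wrong, 3))
print("p=0.07, y=0:", round(confident_right, 3))
```

The confident mistake costs roughly 2.66 against roughly 0.07 for the confident correct prediction.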
Why not squared error? Reusing MSE from linear regression makes the objective non-convex after composition with a sigmoid. For unregularized linear logistic regression, log loss is convex in the weights and yields the clean gradient below.[1] Convexity makes optimization behavior easier to reason about, but it never replaces evaluation on unseen data.
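A one-dimensional sketch can illustrate the convexity claim. It assumes a single example with x = 1 and y = 1, so each loss depends on one weight w, and compares the loss at a midpoint with the average of the losses at the endpoints. A convex function never sits above that chord. The endpoints −6 and −2 are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return float(1.0 / (1.0 + np.exp(-z)))

def squared_error(w: float) -> float:
    # Squared error after the sigmoid, for x = 1 and y = 1.
    return (sigmoid(w) - 1.0) ** 2

def log_loss(w: float) -> float:
    # -log(sigmoid(w)) written stably as log(1 + exp(-w)).
    return float(np.logaddexp(0.0, -w))

a, b = -6.0, -2.0
mid = 0.5 * (a + b)
results = {}
for name, f in (("squared error", squared_error), ("log loss", log_loss)):
    chord = 0.5 * (f(a) + f(b))
    results[name] = (f(mid), chord)
    print(f"{name:13s} f(mid)={f(mid):.4f} chord={chord:.4f} below_chord={f(mid) <= chord}")
```

Squared error rises above its chord on this interval, which is enough to show non-convexity. A single passing check does not prove convexity for log loss; that follows from the derivation, not from this spot test.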
Compute by hand for R4 (x1=3.5, x2=6, y=1) with a starting guess w = [0, 0, 0]:
- z = 0 + 0·3.5 + 0·6 = 0
- p = σ(0) = 0.5
- L = −[1 · log(0.5) + 0] = −log(0.5) ≈ 0.693

Now suppose the model is slightly better and predicts p = 0.82 for the same row:

- L = −log(0.82) ≈ 0.198

For code, compute the same loss from logits rather than taking log(sigmoid(z)). The expression log(1 + exp(z)) - y*z is the same loss, but evaluating exp(z) directly can overflow. NumPy's logaddexp(0, z) - y*z evaluates that softplus term stably even when z is very large in either direction.
```python
import numpy as np

def log_loss_from_logits(y: np.ndarray, z: np.ndarray) -> float:
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

print("correct positive:", round(log_loss_from_logits(np.array([1.0]), np.array([2.0])), 3))
print("confident wrong:", round(log_loss_from_logits(np.array([1.0]), np.array([-4.0])), 3))
extreme = log_loss_from_logits(np.array([0.0, 1.0]), np.array([1000.0, -1000.0]))
print("extreme logits finite:", np.isfinite(extreme))
```

```text
correct positive: 0.127
confident wrong: 4.018
extreme logits finite: True
```

Derive this once, with real numbers, on R4. The same pattern will reappear in neural-network classifiers later.
We have L(p, y) and p = σ(z), z = w·x (including w₀).
The algebra collapses more neatly than it looks. Look at the two label cases:
| label | loss | derivative with respect to score z | meaning |
|---|---|---|---|
| y = 1 | −log(p) | p − 1 | prediction is too low unless p is already near 1 |
| y = 0 | −log(1−p) | p | prediction is too high unless p is already near 0 |
Both rows are the same formula: ∂L/∂z = p − y. Then the weight gradient is:

$$\frac{\partial L}{\partial w_j} = (p - y)\, x_j$$
This holds for each weight, including the intercept (x₀ = 1).[2][3]
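For readers who want the intermediate steps, the chain rule can be written out as follows. This is the standard derivation, using the same symbols as above (p = σ(z), z = w·x):

```latex
\frac{\partial L}{\partial p} = -\frac{y}{p} + \frac{1 - y}{1 - p},
\qquad
\frac{\partial p}{\partial z} = p\,(1 - p)

\frac{\partial L}{\partial z}
  = \left( -\frac{y}{p} + \frac{1 - y}{1 - p} \right) p\,(1 - p)
  = -y\,(1 - p) + (1 - y)\,p
  = p - y

\frac{\partial L}{\partial w_j}
  = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_j}
  = (p - y)\, x_j
```

Setting y = 1 or y = 0 in the middle line recovers the two table rows, p − 1 and p.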
Plug in numbers for the starting point on R4 (p=0.5, y=1, x=[1, 3.5, 6]):

$$\nabla_w L = (0.5 - 1) \cdot [1,\ 3.5,\ 6] = [-0.5,\ -1.75,\ -3.0]$$
The gradient points uphill (direction of increasing loss). We subtract it (scaled by the learning rate) to descend.
That sign is the entire learning signal. If y=1 and p=0.5, p-y is negative, so subtracting the gradient increases weights attached to positive features and raises the future score for similar requests. If y=0 and p=0.8, p-y is positive, so subtracting the gradient lowers those weights and reduces the score.
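The sign argument can be checked with a single update on R4 alone. This is a minimal sketch that assumes a step size of 0.2 and updates on one row only, so it overshoots compared with a fit that averages gradients over all rows.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return float(1.0 / (1.0 + np.exp(-z)))

x = np.array([1.0, 3.5, 6.0])  # R4 with the intercept term first
y = 1.0
w = np.zeros(3)
lr = 0.2                       # assumed step size for illustration

p_before = sigmoid(x @ w)
grad = (p_before - y) * x      # negative here because p < y
w_after = w - lr * grad        # subtracting a negative gradient raises the weights
p_after = sigmoid(x @ w_after)

print("p before:", round(p_before, 3))
print("w after :", w_after.round(3))
print("p after :", round(p_after, 3))
```

Every weight moves up because every feature value on this row is positive, and the predicted review probability for R4 rises as a result.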
Do the same arithmetic for each row, average the gradients, and you have one step of gradient descent for logistic regression. The code is close to the linear-regression GD loop you already wrote; the prediction and gradient formula change.
```python
import numpy as np

def sigmoid(z: float) -> float:
    return float(1.0 / (1.0 + np.exp(-z)))

def loss(w: np.ndarray, x: np.ndarray, y: float) -> float:
    z = x @ w
    return float(np.logaddexp(0.0, z) - y * z)

x = np.array([1.0, 3.5, 6.0])
y = 1.0
w = np.zeros(3)
analytic = (sigmoid(x @ w) - y) * x
eps = 1e-6
# Central finite difference along each coordinate direction.
numeric = np.array([
    (loss(w + eps * np.eye(3)[j], x, y) - loss(w - eps * np.eye(3)[j], x, y)) / (2 * eps)
    for j in range(3)
])
print("analytic:", analytic.round(3))
print("numeric :", numeric.round(3))
print("match:", np.allclose(analytic, numeric, atol=1e-6))
```

```text
analytic: [-0.5 -1.75 -3. ]
numeric : [-0.5 -1.75 -3. ]
match: True
```
This complete, self-contained implementation can be copied into a file and run.
```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Clip logits so np.exp cannot overflow on extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def log_loss_from_logits(y_true: np.ndarray, z: np.ndarray) -> float:
    return float(np.mean(np.logaddexp(0.0, z) - y_true * z))

def fit_logistic(X: np.ndarray, y: np.ndarray, lr: float = 0.2, epochs: int = 3000, verbose: bool = True) -> np.ndarray:
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        z = X @ w
        p = sigmoid(z)
        grad = (1 / n) * (X.T @ (p - y))  # average of (p - y) * x over rows
        w -= lr * grad
        if verbose and epoch in (0, 200, 500, 1000, 2999):
            loss = log_loss_from_logits(y, X @ w)
            print(f"after_epoch={epoch + 1:4d} loss={loss:.4f} w={np.round(w, 3)}")
    return w

def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    return sigmoid(X @ w)

def predict_with_threshold(p: np.ndarray, threshold: float) -> np.ndarray:
    return (p >= threshold).astype(int)

def confusion_matrix_at_threshold(y_true: np.ndarray, p: np.ndarray, threshold: float) -> dict[str, int]:
    pred = predict_with_threshold(p, threshold)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    tn = int(((pred == 0) & (y_true == 0)).sum())
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

def precision_recall_f1(cm: dict[str, int]) -> tuple[float, float, float]:
    tp, fp, fn = cm["tp"], cm["fp"], cm["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_points(y_true: np.ndarray, p: np.ndarray, thresholds: list[float]) -> list[tuple[float, float, float]]:
    points = []
    for threshold in thresholds:
        cm = confusion_matrix_at_threshold(y_true, p, threshold)
        fpr = cm["fp"] / (cm["fp"] + cm["tn"]) if cm["fp"] + cm["tn"] else 0.0
        tpr = cm["tp"] / (cm["tp"] + cm["fn"]) if cm["tp"] + cm["fn"] else 0.0
        points.append((threshold, fpr, tpr))
    return points

def expected_calibration_error(y_true: np.ndarray, p: np.ndarray, n_bins: int = 4) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are half-open except the last, which includes 1.0.
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if not mask.any():
            continue
        avg_pred = p[mask].mean()
        frac_pos = y_true[mask].mean()
        ece += mask.mean() * abs(avg_pred - frac_pos)
    return float(ece)

# Historical access-change requests: [ambiguity, auto-policy support]
X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

X = np.c_[np.ones((len(X_raw), 1)), X_raw]  # add intercept column

w = fit_logistic(X, y, lr=0.2, epochs=3000)
p = predict_proba(X, w)
cm = confusion_matrix_at_threshold(y, p, threshold=0.5)
precision, recall, f1 = precision_recall_f1(cm)
ece = expected_calibration_error(y, p, n_bins=4)

print("\nFinal weights (w0, w1, w2):", w.round(3))
print("Probabilities:", p.round(3))
print("Confusion @ 0.5:", cm)
print("Precision / recall / F1:", tuple(round(v, 3) for v in (precision, recall, f1)))
print("Training ECE (4 bins):", round(ece, 3))
print("weights_match_reference", np.allclose(w, [-4.894, 3.399, -0.959], atol=0.01))
print("confusion_matches_reference", cm == {"tp": 3, "fp": 1, "fn": 1, "tn": 3})
print("f1_matches_reference", np.isclose(f1, 0.75))
print("ece_in_expected_band", 0.26 < ece < 0.28)
```

```text
after_epoch= 1 loss=0.6830 w=[0. 0.069 0.075]
after_epoch= 201 loss=0.4648 w=[-2.205 1.615 -0.462]
after_epoch= 501 loss=0.4206 w=[-3.524 2.452 -0.686]
after_epoch=1001 loss=0.4092 w=[-4.356 3.023 -0.85 ]
after_epoch=3000 loss=0.4072 w=[-4.894 3.399 -0.959]

Final weights (w0, w1, w2): [-4.894 3.399 -0.959]
Probabilities: [0.079 0.153 0.274 0.777 0.88 0.996 0.687 0.156]
Confusion @ 0.5: {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
Precision / recall / F1: (0.75, 0.75, 0.75)
Training ECE (4 bins): 0.268
weights_match_reference True
confusion_matches_reference True
f1_matches_reference True
ece_in_expected_band True
```

The negative intercept keeps the score low when both signals are absent. Ambiguity's positive coefficient raises review risk, while the negative policy-support coefficient lowers it, matching the feature definitions. Each progress line reports loss and weights after the same update, so snapshots stay comparable. These eight rows teach model mechanics; they never justify a shipping threshold by themselves.
The code builds a working logistic regression model whose scoring, gradients, and threshold behavior are visible.
The fitted model emits a score in (0, 1). A business action needs a threshold t: route to review when p >= t. The training confusion matrix at t = 0.5 is a useful code check (TP=3, FP=1, FN=1, TN=3), but selecting t on those same rows would reward overfitting.
Use a validation slice with cases the optimizer never saw. Here a false negative means automatic processing when review was required, costed at $120. A false positive means unnecessary human review, costed at $18.
```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def metrics(y: np.ndarray, p: np.ndarray, threshold: float) -> tuple[float, float, float, int]:
    pred = (p >= threshold).astype(int)
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    cost = 120 * fn + 18 * fp  # $120 per missed review, $18 per unnecessary review
    return precision, recall, f1, cost

w = np.array([-4.894, 3.399, -0.959])  # weights from the training fit
X_val_raw = np.array([
    [1.2, 2.0], [2.1, 4.0], [2.4, 3.0], [3.0, 5.0],
    [3.6, 4.0], [3.2, 7.0], [4.2, 5.0], [2.8, 6.0],
])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
X_val = np.c_[np.ones(len(X_val_raw)), X_val_raw]
p_val = sigmoid(X_val @ w)

for t in (0.20, 0.50, 0.80):
    precision, recall, f1, cost = metrics(y_val, p_val, t)
    print(f"t={t:.2f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} cost=${cost}")
```

```text
t=0.20 precision=0.667 recall=1.000 f1=0.800 cost=$36
t=0.50 precision=0.750 recall=0.750 f1=0.750 cost=$138
t=0.80 precision=1.000 recall=0.500 f1=0.667 cost=$240
```

On this validation slice, t = 0.20 costs least: its two unnecessary reviews ($36 in total) cost far less than the single missed review ($120) that t = 0.50 lets through. In production, estimate these costs with stakeholders and re-evaluate thresholds as request mix shifts.
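One reference point worth knowing, under the assumption that scores are calibrated probabilities (which this lesson has not yet established): sending a request to review is cheaper in expectation whenever p × 120 > (1 − p) × 18, which gives a break-even threshold of 18 / (18 + 120). The sketch below computes that value and prices it on the validation scores, rounded to three decimals. The names `COST_FN`, `COST_FP`, and `total_cost` are illustrative.

```python
import numpy as np

COST_FN = 120  # missed review, from the lesson's cost model
COST_FP = 18   # unnecessary review

# Break-even probability, valid only if p is a calibrated probability.
t_star = COST_FP / (COST_FP + COST_FN)

# Validation scores from the fitted model, rounded to three decimals.
p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def total_cost(threshold: float) -> int:
    pred = p_val >= threshold
    fn = int((~pred & (y_val == 1)).sum())
    fp = int((pred & (y_val == 0)).sum())
    return COST_FN * fn + COST_FP * fp

print(f"break-even threshold: {t_star:.3f}")
print("cost at break-even:", total_cost(t_star))
print("cost at t=0.20    :", total_cost(0.20))
```

On these eight rows the break-even threshold is about 0.13 and costs $54, more than the $36 at t = 0.20, because it pulls in one extra unnecessary review. That gap is a reminder that the formula leans on calibration and that a slice this small cannot settle the choice.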
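For any threshold, fill the 2x2 confusion matrix:

| | Routed to review (p ≥ t) | Auto-processed (p < t) |
|---|---|---|
| Needs review (y = 1) | TP | FN |
| Auto acceptable (y = 0) | FP | TN |

Definitions:

- Precision = TP / (TP + FP): among requests routed to review, what fraction required it?
- Recall = TP / (TP + FN): among requests requiring review, what fraction reached it?
- F1 = 2 * precision * recall / (precision + recall): one balance of those rates.

Accuracy alone can be nearly useless when positive cases are rare: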
```python
import numpy as np

y_true = np.array([1] * 4 + [0] * 96)  # 4% of requests need review
predict_auto_process_everything = np.zeros(100, dtype=int)
accuracy = (predict_auto_process_everything == y_true).mean()
recall = ((predict_auto_process_everything == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
print("accuracy:", round(float(accuracy), 3))
print("review recall:", round(float(recall), 3))
```

```text
accuracy: 0.96
review recall: 0.0
```

To draw an ROC (Receiver Operating Characteristic) curve, vary the threshold from high to low and plot recall (the true-positive rate) against the false-positive rate. AUC answers a ranking question: how often does a random positive score rank above a random negative score? Ties get half credit. AUC never chooses a business threshold for you.
```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
positive = p_val[y_val == 1]
negative = p_val[y_val == 0]
# One entry per (positive, negative) pair: 1 if ranked correctly, 0.5 on a tie.
pair_scores = [(pos > neg) + 0.5 * (pos == neg) for pos in positive for neg in negative]
auc = float(np.mean(pair_scores))
print("positive-negative pairs:", len(pair_scores))
print("validation AUC:", round(auc, 3))
```

```text
positive-negative pairs: 16
validation AUC: 0.812
```
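The pairwise AUC can be cross-checked against the area under the ROC points themselves. The sketch below is one way to do that under simple assumptions: every distinct validation score is used as a threshold, swept from high to low, and the area is a hand-written trapezoid sum rather than a library call.

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])

thresholds = np.sort(np.unique(p_val))[::-1]  # high to low
n_pos = int((y_val == 1).sum())
n_neg = int((y_val == 0).sum())

fpr, tpr = [0.0], [0.0]  # start at the "route nothing to review" corner
for t in thresholds:
    pred = p_val >= t
    tpr.append(float((pred & (y_val == 1)).sum() / n_pos))
    fpr.append(float((pred & (y_val == 0)).sum() / n_neg))

fpr_a, tpr_a = np.array(fpr), np.array(tpr)
# Trapezoid rule over consecutive (FPR, TPR) points.
auc_trapezoid = float(np.sum(np.diff(fpr_a) * (tpr_a[1:] + tpr_a[:-1]) / 2.0))

print("FPR:", fpr_a.round(2))
print("TPR:", tpr_a.round(2))
print("trapezoid AUC:", round(auc_trapezoid, 4))
```

With no tied scores, the trapezoid area equals the pairwise ranking fraction, 13 of 16 pairs.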
For rare positives, ROC can look generous because many true negatives keep false-positive rate small. Plot a precision-recall curve as well: it exposes how much review load accompanies higher recall.[1]
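A precision-recall sweep on the same validation scores shows that review-load tradeoff directly. This is a minimal sketch that uses each observed score as a threshold; `pr_points` is an illustrative name, and `routed` counts how many requests land in the review queue.

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
n_pos = int((y_val == 1).sum())

pr_points = []  # (routed, precision, recall) per threshold, high to low
for t in np.sort(np.unique(p_val))[::-1]:
    pred = p_val >= t
    tp = int((pred & (y_val == 1)).sum())
    fp = int((pred & (y_val == 0)).sum())
    routed = tp + fp            # at least 1, because t is an observed score
    precision = tp / routed
    recall = tp / n_pos
    pr_points.append((routed, precision, recall))
    print(f"t={t:.3f} routed={routed} precision={precision:.3f} recall={recall:.3f}")
```

Reading down the rows, recall reaches 1.0 only once six of the eight requests are routed to review, which is the queue cost a ROC curve does not show.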
A model can rank access-change requests well while producing over-confident or under-confident numbers. If requests scored near 0.80 require review only half the time, exposing 80% to an operator misstates risk.
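A reliability diagram compares average score with observed positive frequency per bin. Expected Calibration Error (ECE) summarizes binned gaps:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \bar{p}_b - \bar{y}_b \right|$$

Here n_b is the number of requests in bin b, N is the total number of requests, p̄_b is the average score in the bin, and ȳ_b is the observed fraction of positives in the bin.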
ECE depends on sample size, prevalence, and bin choices. It has no universal pass threshold. State the binning scheme, measure on held-out labels, and use a separate calibration split when fitting Platt scaling, isotonic regression, or temperature scaling. Temperature scaling is a common post-hoc method for neural classifiers.[4]
```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
edges = np.linspace(0.0, 1.0, 5)
ece = 0.0
for lo, hi in zip(edges[:-1], edges[1:]):
    # Bins are half-open except the last, which includes 1.0.
    mask = (p_val >= lo) & ((p_val < hi) if hi < 1.0 else (p_val <= hi))
    if mask.any():
        avg_p = p_val[mask].mean()
        frac_pos = y_val[mask].mean()
        ece += mask.mean() * abs(avg_p - frac_pos)
        print(f"[{lo:.2f}, {hi:.2f}) n={mask.sum()} avg_p={avg_p:.3f} frac_pos={frac_pos:.3f}")
print("validation ECE:", round(float(ece), 3))
```

```text
[0.00, 0.25) n=3 avg_p=0.158 frac_pos=0.333
[0.25, 0.50) n=1 avg_p=0.325 frac_pos=0.000
[0.50, 0.75) n=2 avg_p=0.609 frac_pos=0.500
[0.75, 1.00) n=2 avg_p=0.980 frac_pos=1.000
validation ECE: 0.139
```

Eight validation requests are enough for arithmetic, not a release gate. A single disputed label moves both metrics noticeably:
```python
import numpy as np

def f1_at_half(y: np.ndarray, p: np.ndarray) -> float:
    pred = p >= 0.5
    tp = ((pred == 1) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    fn = ((pred == 0) & (y == 1)).sum()
    return float(2 * tp / (2 * tp + fp + fn))

def ece(y: np.ndarray, p: np.ndarray) -> float:
    edges = np.linspace(0.0, 1.0, 5)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            total += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(total)

p = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
base = np.array([0, 0, 1, 0, 1, 0, 1, 1])
one_flip = base.copy()
one_flip[3] = 1  # relabel the fourth validation request as "needs review"
for name, labels in (("base", base), ("one flip", one_flip)):
    print(f"{name:8s} f1={f1_at_half(labels, p):.3f} ece={ece(labels, p):.3f}")
```

```text
base f1=0.750 ece=0.139
one flip f1=0.889 ece=0.209
```

First audit action selection:
Then audit probability meaning:
A from-scratch implementation earns trust by matching a maintained implementation under equivalent settings.[5] Scikit-learn applies regularization by default; set C=np.inf here to compare against our unregularized gradient-descent fit.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X = np.c_[np.ones((len(X_raw), 1)), X_raw]
w_ours = np.array([-4.894, 3.399, -0.959])
X_val_raw = np.array([
    [1.2, 2.0], [2.1, 4.0], [2.4, 3.0], [3.0, 5.0],
    [3.6, 4.0], [3.2, 7.0], [4.2, 5.0], [2.8, 6.0],
])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
X_val = np.c_[np.ones(len(X_val_raw)), X_val_raw]

# fit_intercept=False because X already carries the column of ones.
clf = LogisticRegression(C=np.inf, solver="lbfgs", max_iter=5000, fit_intercept=False)
clf.fit(X, y)
w_sk = clf.coef_[0]
print("sklearn weights:", np.round(w_sk, 3))
print("our weights :", w_ours)
print("weights close :", np.allclose(w_sk, w_ours, atol=0.05))
p_val = clf.predict_proba(X_val)[:, 1]
print("validation F1 @ 0.20:", round(f1_score(y_val, (p_val >= 0.20).astype(int)), 3))
print("validation AUC:", round(roc_auc_score(y_val, p_val), 3))
```

```text
sklearn weights: [-4.914 3.414 -0.964]
our weights : [-4.894 3.399 -0.959]
weights close : True
validation F1 @ 0.20: 0.8
validation AUC: 0.812
```

The weight match validates the mechanics. The validation metrics validate only this small held-out exercise; larger and temporally later data is still required before deployment.
Tests should protect stable numerics, the known fit, held-out threshold behavior, and parity settings:
```python
import numpy as np
import pytest
from logistic_scratch import (
    sigmoid, log_loss_from_logits, fit_logistic, predict_proba,
    confusion_matrix_at_threshold, precision_recall_f1
)

X_train_raw = np.array([[1.,1.],[1.5,2.],[2.,3.],[3.5,6.],[4.,7.],[4.5,5.],[2.8,4.],[3.2,8.]])
y_train = np.array([0,0,1,1,1,1,0,0])
X_train = np.c_[np.ones((8,1)), X_train_raw]
X_val_raw = np.array([[1.2,2.],[2.1,4.],[2.4,3.],[3.,5.],[3.6,4.],[3.2,7.],[4.2,5.],[2.8,6.]])
y_val = np.array([0,0,1,0,1,0,1,1])
X_val = np.c_[np.ones((8,1)), X_val_raw]

def test_sigmoid():
    assert sigmoid(0) == pytest.approx(0.5)
    assert sigmoid(10) > 0.999
    assert sigmoid(-10) < 0.001

def test_log_loss_is_stable_for_extreme_logits():
    loss = log_loss_from_logits(np.array([0., 1.]), np.array([1000., -1000.]))
    assert np.isfinite(loss)

def test_gradient_direction():
    # on a single row with bad prediction, gradient should point toward correction
    w0 = np.zeros(3)
    p = sigmoid(X_train[3] @ w0)
    grad = X_train[3:4].T @ (p - y_train[3:4])
    assert grad[0] < 0

def test_fit_converges():
    w = fit_logistic(X_train, y_train, lr=0.2, epochs=3000, verbose=False)
    loss = log_loss_from_logits(y_train, X_train @ w)
    assert loss < 0.42
    assert np.allclose(w, [-4.894, 3.399, -0.959], atol=0.01)

def test_validation_threshold_cost_contract():
    w = fit_logistic(X_train, y_train, lr=0.2, epochs=3000, verbose=False)
    p = predict_proba(X_val, w)
    cm = confusion_matrix_at_threshold(y_val, p, 0.2)
    prec, rec, f1 = precision_recall_f1(cm)
    assert cm == {"tp": 4, "fp": 2, "fn": 0, "tn": 2}
    assert (prec, rec, f1) == pytest.approx((2/3, 1.0, 0.8))

def test_matches_sklearn():
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(C=np.inf, solver='lbfgs', max_iter=5000, fit_intercept=False)
    clf.fit(X_train, y_train)
    w_ours = fit_logistic(X_train, y_train, lr=0.2, epochs=3000, verbose=False)
    assert np.allclose(w_ours, clf.coef_[0], atol=0.05)
```

Run with `pytest tests/test_logistic.py -q --tb=line` after placing the implementation in `logistic_scratch.py`. Unlike an exact ECE assertion on eight labels, these tests guard code contracts and the selected validation scenario.
| Symptom | Likely cause | Fix |
|---|---|---|
| All predicted probabilities hover around 0.4-0.6 | Features are weak, data is tiny, or GD is under-trained | Add informative evidence features, increase epochs, tune learning rate, or evaluate a stronger model |
| F1 at threshold 0.5 is mediocre but recall at 0.2 is excellent | Class imbalance + wrong operating point | Read the ROC or cost-sensitive curve; pick threshold by expected business cost, not by accuracy |
| Good AUC but large reliability gaps on substantial held-out data | Model ranks well but scores misstate frequency | Withhold raw percentages; calibrate on a reserved split and re-audit |
| Your NumPy weights differ from sklearn by >0.3 | Forgot the intercept column, used different regularization, or feature scaling mismatch | Compare design matrices, disable sklearn regularization, and verify the same X |
| Loss decreases then suddenly increases | Learning rate too large or probability-form loss overflowed | Lower lr, compute loss with logaddexp, and print loss during fitting |
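The learning-rate row can be reproduced on the chapter's training rows. The sketch below is illustrative only: it compares the first few loss values at the lesson's step size of 0.2 with an assumed, deliberately oversized step of 1.0, computing the loss from logits so the comparison itself stays finite. `loss_history` is a hypothetical helper name.

```python
import numpy as np

X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X = np.c_[np.ones(len(X_raw)), X_raw]

def mean_log_loss(w: np.ndarray) -> float:
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def loss_history(lr: float, steps: int = 5) -> list[float]:
    # Loss at w = 0, then after each of `steps` full-batch updates.
    w = np.zeros(X.shape[1])
    history = [mean_log_loss(w)]
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)
        history.append(mean_log_loss(w))
    return history

h_small = loss_history(0.2)
h_big = loss_history(1.0)  # assumed oversized step for illustration
print("lr=0.2:", [round(v, 3) for v in h_small])
print("lr=1.0:", [round(v, 3) for v in h_big])
```

At lr=0.2 the first update lowers the loss from about 0.693 to about 0.683, while at lr=1.0 the very first update overshoots and the loss rises. Printing a short history like this is usually enough to tell a step-size problem from a data problem.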
These exercises turn the chapter from reading into mastery. Use training rows for fitting and validation rows for policy selection.
- Sweep thresholds [0.2, 0.35, 0.5, 0.65, 0.8] on validation and print precision, recall, F1, and cost. Which threshold minimizes the stated cost?

If any answer feels hand-wavy, print the arrays, loss numbers, and concrete cells of the confusion matrix. Concrete numbers build classification skill.
The classifier can now:
These are the same habits behind production classifiers: fit the model, inspect the threshold tradeoff, measure calibration, and write tests that keep the implementation honest.
The validation slice improved the evaluation boundary, but eight requests remain far too few for a release claim. Next, decision trees and ensembles attack the same access-review routing task with nonlinear rules; the later cross-validation lesson will deepen the evidence required before shipping a model.
- … (p - y) * x gradient by hand on an access-change request with concrete scores and loss values
- … LogisticRegression

- … 0.5 on imbalanced data, so recall collapsed.
  Fix: sweep thresholds on a held-out set and pick by expected business cost or recall target, not habit.
- … 0.87 even though AUC looks useful.
  Cause: ranking and calibration answer different questions.
  Fix: draw reliability diagrams on enough held-out labels, reserve calibration data, and withhold percentages until verified.
[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.

[2] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.

[3] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.

[4] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.

[5] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR.