LeetLLM

Logistic Regression and Metrics

Route access-change requests with logistic regression from scratch: derive sigmoid and log loss, fit NumPy weights, select a cost-aware threshold on validation data, audit ranking and calibration, then compare with scikit-learn.

25 min read

Logistic regression turns features into a binary-routing score, then challenges you to justify treating that score as a probability [1] (The Elements of Statistical Learning, https://hastie.su.domains/ElemStatLearn/). In the previous lesson, access-change request REQ-10234 needed a numeric estimate: would retrieved evidence fit the response-latency budget? Now it needs a decision: may the workflow auto-process the permission change under policy line P-7, or must a human review it first?

You'll fit a model by hand on eight historical access-change requests, derive its gradient, implement each line in NumPy, and check it against scikit-learn. A separate eight-request validation slice then carries the decisions that matter: threshold cost, ranking, and calibration. Training output explains the mechanism. Held-out labels are where you begin to trust an operating policy.

Figure: Logistic-regression probability field over ambiguity and policy-support features, with eight labeled training requests, decision contours at thresholds 0.20, 0.50, and 0.80, and held-out score, cost, AUC, and calibration audits.
Training requests shape a probability field and its threshold contours. Held-out requests then test the routing cost, ranking quality, and probability meaning without leaking validation labels into the fit.

You need the linear-regression chapter (design matrix, gradient descent loop, NumPy broadcasting) and comfort with basic probability. Each derivative gets worked with real numbers on paper before the code appears.

From latency estimates to access-review routing

In the previous chapter you predicted a number: "how many milliseconds will this evidence cost?" Now the target is binary: 1 means an access request needs human review before any permission change; 0 means policy evidence supports automatic handling. An incorrect automatic approval can cost $120 in permission rollback and investigation. An unnecessary review costs $18 of queue time.

Linear regression on a 0/1 label would happily predict 1.3 or −0.2. Those numbers are meaningless as probabilities. We need something that squeezes any real number into (0, 1).

This is also how you evaluate LLM-era classifiers. A prompt-injection guardrail, a toxicity filter, or an agent action router emits a score; a threshold turns that score into an intervention. Precision, recall, cost, and calibration belong in its model card and on-call review.

The sigmoid turns any score into a probability

The sigmoid (or logistic) function does that:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

It has four properties you'll feel in your fingers:

  • As z → +∞, σ(z) → 1
  • As z → −∞, σ(z) → 0
  • At z = 0, σ(0) = 0.5
  • Its derivative is σ'(z) = σ(z) · (1 − σ(z)), which is positive for finite z and peaks at 0.25 when z=0. This tidy form is why the gradient of log-loss later becomes so simple.
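
The derivative identity in the last bullet can be checked numerically. The sketch below compares the closed form σ(z)(1 − σ(z)) with a central finite difference on a grid of scores; the grid range and the step size `eps` are illustrative assumptions, not values from the lesson.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6.0, 6.0, 121)  # assumed grid of scores
eps = 1e-6                       # assumed step for the central difference
numeric_slope = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic_slope = sigmoid_slope(z)

print("slopes agree:", np.allclose(numeric_slope, analytic_slope, atol=1e-8))
print("slope at z=0:", sigmoid_slope(0.0))
print("largest slope on grid:", round(float(analytic_slope.max()), 3))
# The same two-point score move matters far more near the center than in the tail.
print("p change, z from -1 to 1:", round(float(sigmoid(1.0) - sigmoid(-1.0)), 3))
print("p change, z from 4 to 6:", round(float(sigmoid(6.0) - sigmoid(4.0)), 3))
```

The last two lines make the saturation concrete: an equal move in score produces a much smaller move in probability once the score is already far from zero.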

Compute five concrete values by hand. Do this on a napkin before reaching for code:

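| z | e^{-z} | 1 + e^{-z} | σ(z) |
|------|-------|-------|-------|
| −4.0 | 54.6 | 55.6 | 0.018 |
| −1.0 | 2.718 | 3.718 | 0.269 |
| 0.0 | 1.0 | 2.0 | 0.500 |
| 1.0 | 0.368 | 1.368 | 0.731 |
| 4.0 | 0.018 | 1.018 | 0.982 |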
sigmoid-probabilities.py

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

# Same five scores as the hand-computed table.
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("scores:", scores)
print("probabilities:", sigmoid(scores).round(3))
```

Output

```
scores: [-4. -1.  0.  1.  4.]
probabilities: [0.018 0.269 0.5   0.731 0.982]
```
Figure: Smooth sigmoid and derivative curves from scores minus six to six, showing maximum slope 0.25 at zero and comparing an equal score move from minus one to one with a move from four to six, where probability changes by 0.462 versus 0.016.
The derivative peaks at `0.25` when `z = 0`, so a two-point score move near the center changes probability by `0.462`. The same score move in the positive tail changes it by only `0.016`.

The curve saturates quickly. A score of +4 produces p = 0.982; whether that number represents an honest frequency still needs validation.

The model itself is still linear inside:

$$z = w_0 + w_1 x_1 + w_2 x_2$$

Then p = σ(z). The positive class is "needs human review". This one line is the whole logistic regression model.

The score z is also the logit, or log-odds [1]:

$$z = \log\left(\frac{p}{1 - p}\right)$$

At z = 0, the odds are 1:1 and p = 0.5. Adding one to z multiplies the odds by e; it doesn't guarantee that the displayed probability is calibrated on production traffic.
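
A short numeric check of the log-odds reading, as a sketch: the helper `logit` and the three example scores are assumptions chosen for illustration. It confirms that the logit inverts the sigmoid and that adding one to z multiplies the odds by e.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid on (0, 1)."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 1.5])            # assumed example scores
p = sigmoid(z)
odds = p / (1.0 - p)                       # odds at the original scores
p_shifted = sigmoid(z + 1.0)
odds_shifted = p_shifted / (1.0 - p_shifted)

print("logit recovers z:", np.allclose(logit(p), z))
print("odds ratio after adding 1 to z:", (odds_shifted / odds).round(4))
print("e:", round(float(np.e), 4))
```

The odds ratio is the same for every starting score, while the change in probability is not. That is why coefficients are usually read on the odds scale.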

Our running example: eight access-change requests

We continue the REQ-10234 access-control workflow. Each historical request has two scaled features produced from retrieved evidence:

  • x1 = ambiguity score: conflicting audit logs, missing requester context, or unclear scope description
  • x2 = auto-policy support score: strength of evidence that policy line P-7 permits automatic handling

The label y = 1 when the completed case required human review before a permission change, and y = 0 when automatic handling was acceptable.

This tiny dataset stays with the entire chapter:

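| Request | x1 ambiguity | x2 auto-policy support | y needs review |
|----|-----|---|---|
| R1 | 1.0 | 1 | 0 |
| R2 | 1.5 | 2 | 0 |
| R3 | 2.0 | 3 | 1 |
| R4 | 3.5 | 6 | 1 |
| R5 | 4.0 | 7 | 1 |
| R6 | 4.5 | 5 | 1 |
| R7 | 2.8 | 4 | 0 |
| R8 | 3.2 | 8 | 0 |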

We kept this training fixture small so you can verify each number. R3 needed review despite only moderate ambiguity. R8 had substantial ambiguity, but strong policy support made automatic processing acceptable. The resulting coefficient signs should now tell a coherent story: ambiguity raises review risk; supporting policy evidence lowers it.

Add the intercept column of ones and build the design matrix X (8 × 3) and the label vector y.
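
A minimal sketch of that step, using the eight rows from the table above. It uses `np.column_stack`; the chapter's full listing builds the same matrix with `np.c_`, and either works.

```python
import numpy as np

# Rows R1..R8: [x1 ambiguity, x2 auto-policy support]
X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)  # 1 = needs human review

# Prepend a column of ones so w0 acts as the intercept in X @ w.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

print("X shape:", X.shape)
print("y shape:", y.shape)
print("first row:", X[0])
```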

Log-loss (binary cross-entropy) is the right loss for probabilities

If the model outputs p = 0.93 for a request with y = 0, it made a confident mistake that deserves a large penalty. If it outputs p = 0.07 for that request, the loss is small. Negative log-likelihood encodes that asymmetry.

For one example the loss is the negative log-likelihood (also called binary cross-entropy) [2] (Pattern Recognition and Machine Learning, https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/) [3] (Machine Learning: A Probabilistic Perspective, https://probml.github.io/pml-book/book0.html).

$$L = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]$$

When y=1 the second term vanishes and we pay −log p (large when p is near 0). When y=0 we pay −log(1−p).

Why not squared error? Reusing MSE from linear regression makes the objective non-convex after composition with a sigmoid. For unregularized linear logistic regression, log loss is convex in the weights and yields the clean gradient below [1]. Convexity makes optimization behavior easier to reason about, but it never replaces evaluation on unseen data.
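
The convexity claim can be spot-checked on a one-weight toy problem. This is a sketch, not a proof: the single feature `x = 1`, the label `y = 1`, and the two test weights are assumptions picked to expose the flat region of the sigmoid. A convex function never rises above the chord between two points, so a positive gap at the midpoint is a counterexample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(w, x=1.0, y=1.0):
    """Squared error applied to the sigmoid output."""
    return (sigmoid(w * x) - y) ** 2

def log_loss(w, x=1.0, y=1.0):
    """Binary cross-entropy written in terms of the logit."""
    z = w * x
    return np.logaddexp(0.0, z) - y * z

a, b = -6.0, -2.0          # assumed weights in the saturated, confidently wrong region
mid = 0.5 * (a + b)

# gap > 0 means the function sits above its chord at the midpoint (convexity violated)
mse_gap = squared_error(mid) - 0.5 * (squared_error(a) + squared_error(b))
ll_gap = log_loss(mid) - 0.5 * (log_loss(a) + log_loss(b))

print("squared-error midpoint gap:", round(float(mse_gap), 4))
print("log-loss midpoint gap:", round(float(ll_gap), 4))
```

The squared-error gap comes out positive and the log-loss gap negative. In the region where the model is confidently wrong, squared error flattens out and gives almost no gradient, while log loss keeps a steep slope.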

Compute by hand for R4 (x1=3.5, x2=6, y=1) with a starting guess w = [0, 0, 0]:

  • z = 0 + 0·3.5 + 0·6 = 0
  • p = σ(0) = 0.5
  • L = − [1 · log(0.5) + 0] = − log(0.5) ≈ 0.693

Now suppose the model is slightly better and predicts p = 0.82 for the same row:

  • L = − log(0.82) ≈ 0.198
  • The loss dropped dramatically because the model became more confident in the correct direction, exactly the behavior we want.

For code, compute the same loss from logits rather than taking log(sigmoid(z)). The expression log(1 + exp(z)) - y*z is the same loss, but evaluating exp(z) directly can overflow. NumPy's logaddexp(0, z) - y*z evaluates that softplus term stably even when z is very large in either direction.

stable-log-loss-from-logits.py

```python
import numpy as np

def log_loss_from_logits(y: np.ndarray, z: np.ndarray) -> float:
    # logaddexp(0, z) is a stable softplus: log(1 + exp(z))
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

print("correct positive:", round(log_loss_from_logits(np.array([1.0]), np.array([2.0])), 3))
print("confident wrong:", round(log_loss_from_logits(np.array([1.0]), np.array([-4.0])), 3))
extreme = log_loss_from_logits(np.array([0.0, 1.0]), np.array([1000.0, -1000.0]))
print("extreme logits finite:", np.isfinite(extreme))
```

Output

```
correct positive: 0.127
confident wrong: 4.018
extreme logits finite: True
```

The log-loss gradient collapses to (p − y) · x

Derive this once, with real numbers, on R4. The same pattern will reappear in neural-network classifiers later.

We have L(p, y) and p = σ(z), z = w·x (including w₀).

  • By chain rule: ∂L/∂wⱼ = (∂L/∂p) · (∂p/∂z) · (∂z/∂wⱼ)
  • First, ∂L/∂p = − (y/p − (1−y)/(1−p) )
  • Second, ∂p/∂z = p · (1 − p) (the sigmoid derivative)
  • Third, ∂z/∂wⱼ = xⱼ because increasing weight wⱼ changes the score in proportion to feature xⱼ.

The algebra collapses more neatly than it looks. Look at the two label cases:

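| label | loss | derivative with respect to score z | meaning |
|-------|-----------|-------|--------------------------------------------------|
| y = 1 | −log(p) | p − 1 | prediction is too low unless p is already near 1 |
| y = 0 | −log(1−p) | p | prediction is too high unless p is already near 0 |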

Both rows are the same formula: ∂L/∂z = p − y. Then the weight gradient is:

$$\frac{\partial L}{\partial w_j} = (p - y) \cdot x_j$$

This holds for each weight, including the intercept (x₀ = 1) [2][3].
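
For readers who want the algebra spelled out, the first two chain-rule factors listed above multiply as follows. This uses only the derivatives already stated and introduces no new symbols:

```latex
\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial z}
  = -\left( \frac{y}{p} - \frac{1 - y}{1 - p} \right) p \, (1 - p)
  = -\bigl( y (1 - p) - (1 - y) \, p \bigr)
  = p - y
```

Multiplying by the third factor, ∂z/∂wⱼ = xⱼ, gives the weight gradient (p − y) · xⱼ.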

Plug in numbers for the starting point on R4 (p=0.5, y=1, x=[1, 3.5, 6]):

  • grad_w0 = (0.5 − 1) · 1 = −0.5
  • grad_w1 = (0.5 − 1) · 3.5 = −1.75
  • grad_w2 = (0.5 − 1) · 6 = −3.0

The gradient points uphill (direction of increasing loss). We subtract it (scaled by the learning rate) to descend.

That sign is the entire learning signal. If y=1 and p=0.5, p-y is negative, so subtracting the gradient increases weights attached to positive features and raises the future score for similar requests. If y=0 and p=0.8, p-y is positive, so subtracting the gradient lowers those weights and reduces the score.

Do the same arithmetic for each row, average the gradients, and you have one step of gradient descent for logistic regression. The code is close to the linear-regression GD loop you already wrote; the prediction and gradient formula change.

check-gradient-with-finite-differences.py

```python
import numpy as np

def sigmoid(z: float) -> float:
    return float(1.0 / (1.0 + np.exp(-z)))

def loss(w: np.ndarray, x: np.ndarray, y: float) -> float:
    z = x @ w
    return float(np.logaddexp(0.0, z) - y * z)

# Row R4 with the intercept feature prepended, at zero weights.
x = np.array([1.0, 3.5, 6.0])
y = 1.0
w = np.zeros(3)
analytic = (sigmoid(x @ w) - y) * x
eps = 1e-6
# Central difference along each weight axis.
numeric = np.array([
    (loss(w + eps * np.eye(3)[j], x, y) - loss(w - eps * np.eye(3)[j], x, y)) / (2 * eps)
    for j in range(3)
])
print("analytic:", analytic.round(3))
print("numeric :", numeric.round(3))
print("match:", np.allclose(analytic, numeric, atol=1e-6))
```

Output

```
analytic: [-0.5  -1.75 -3.  ]
numeric : [-0.5  -1.75 -3.  ]
match: True
```
Figure: Logistic-regression gradient heatmap for eight access-change requests at zero weights, showing each row contribution to intercept, ambiguity, and policy gradients, their average gradient, and the first update to weights 0, 0.069, and 0.075.
At zero weights, every request starts at `p = 0.5`. Multiplying each signed error by its feature vector creates the row heatmap; column averages give `g = [0, -0.344, -0.375]`, so the first `lr = 0.2` update becomes `w = [0, 0.069, 0.075]`.
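
The per-row arithmetic behind that first step can be reproduced directly. This sketch uses the eight training rows and `lr = 0.2` from the chapter; the variable names `row_grads`, `g`, and `w_next` are illustrative.

```python
import numpy as np

X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)
X = np.column_stack([np.ones(len(X_raw)), X_raw])

w = np.zeros(3)
p = 1.0 / (1.0 + np.exp(-(X @ w)))   # every row starts at p = 0.5

# One gradient row per request: signed error times its feature vector.
row_grads = (p - y)[:, None] * X     # shape (8, 3)
g = row_grads.mean(axis=0)           # column averages give the batch gradient

lr = 0.2
w_next = w - lr * g                  # step against the gradient

print("per-row gradients:\n", row_grads)
print("average gradient:", g.round(3))
print("weights after one step:", w_next.round(3))
```

The intercept gradient is exactly zero because the labels are balanced four to four, so the signed errors cancel at `p = 0.5`.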

Implementing logistic regression from scratch in NumPy

This complete, self-contained implementation can be copied into a file and run.

logistic-regression-numpy.py

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Clip logits so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def log_loss_from_logits(y_true: np.ndarray, z: np.ndarray) -> float:
    return float(np.mean(np.logaddexp(0.0, z) - y_true * z))

def fit_logistic(X: np.ndarray, y: np.ndarray, lr: float = 0.2, epochs: int = 3000, verbose: bool = True) -> np.ndarray:
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        z = X @ w
        p = sigmoid(z)
        grad = (1 / n) * (X.T @ (p - y))
        w -= lr * grad
        if verbose and epoch in (0, 200, 500, 1000, 2999):
            loss = log_loss_from_logits(y, X @ w)
            print(f"after_epoch={epoch + 1:4d} loss={loss:.4f} w={np.round(w, 3)}")
    return w

def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    return sigmoid(X @ w)

def predict_with_threshold(p: np.ndarray, threshold: float) -> np.ndarray:
    return (p >= threshold).astype(int)

def confusion_matrix_at_threshold(y_true: np.ndarray, p: np.ndarray, threshold: float) -> dict[str, int]:
    pred = predict_with_threshold(p, threshold)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    tn = int(((pred == 0) & (y_true == 0)).sum())
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

def precision_recall_f1(cm: dict[str, int]) -> tuple[float, float, float]:
    tp, fp, fn = cm["tp"], cm["fp"], cm["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_points(y_true: np.ndarray, p: np.ndarray, thresholds: list[float]) -> list[tuple[float, float, float]]:
    points = []
    for threshold in thresholds:
        cm = confusion_matrix_at_threshold(y_true, p, threshold)
        fpr = cm["fp"] / (cm["fp"] + cm["tn"]) if cm["fp"] + cm["tn"] else 0.0
        tpr = cm["tp"] / (cm["tp"] + cm["fn"]) if cm["tp"] + cm["fn"] else 0.0
        points.append((threshold, fpr, tpr))
    return points

def expected_calibration_error(y_true: np.ndarray, p: np.ndarray, n_bins: int = 4) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins, except the last one, which includes p == 1.0.
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if not mask.any():
            continue
        avg_pred = p[mask].mean()
        frac_pos = y_true[mask].mean()
        ece += mask.mean() * abs(avg_pred - frac_pos)
    return float(ece)

# Historical access-change requests: [ambiguity, auto-policy support]
X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

X = np.c_[np.ones((len(X_raw), 1)), X_raw]  # add intercept column

w = fit_logistic(X, y, lr=0.2, epochs=3000)
p = predict_proba(X, w)
cm = confusion_matrix_at_threshold(y, p, threshold=0.5)
precision, recall, f1 = precision_recall_f1(cm)
ece = expected_calibration_error(y, p, n_bins=4)

print("\nFinal weights (w0, w1, w2):", w.round(3))
print("Probabilities:", p.round(3))
print("Confusion @ 0.5:", cm)
print("Precision / recall / F1:", tuple(round(v, 3) for v in (precision, recall, f1)))
print("Training ECE (4 bins):", round(ece, 3))
print("weights_match_reference", np.allclose(w, [-4.894, 3.399, -0.959], atol=0.01))
print("confusion_matches_reference", cm == {"tp": 3, "fp": 1, "fn": 1, "tn": 3})
print("f1_matches_reference", np.isclose(f1, 0.75))
print("ece_in_expected_band", 0.26 < ece < 0.28)
```

Output

```
after_epoch=   1 loss=0.6830 w=[0.    0.069 0.075]
after_epoch= 201 loss=0.4648 w=[-2.205  1.615 -0.462]
after_epoch= 501 loss=0.4206 w=[-3.524  2.452 -0.686]
after_epoch=1001 loss=0.4092 w=[-4.356  3.023 -0.85 ]
after_epoch=3000 loss=0.4072 w=[-4.894  3.399 -0.959]

Final weights (w0, w1, w2): [-4.894  3.399 -0.959]
Probabilities: [0.079 0.153 0.274 0.777 0.88  0.996 0.687 0.156]
Confusion @ 0.5: {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
Precision / recall / F1: (0.75, 0.75, 0.75)
Training ECE (4 bins): 0.268
weights_match_reference True
confusion_matches_reference True
f1_matches_reference True
ece_in_expected_band True
```

The negative intercept keeps the score low when both signals are absent. Ambiguity's positive coefficient raises review risk, while the negative policy-support coefficient lowers it, matching the feature definitions. Each progress line reports loss and weights after the same update, so snapshots stay comparable. These eight rows teach model mechanics; they never justify a shipping threshold by themselves.

The code builds a working logistic regression model whose scoring, gradients, and threshold behavior are visible.

Select a threshold on held-out requests

The fitted model emits a score in (0, 1). A business action needs a threshold t: route to review when p >= t. The training confusion matrix at t = 0.5 is a useful code check (TP=3, FP=1, FN=1, TN=3), but selecting t on those same rows would reward overfitting.

Use a validation slice with cases the optimizer never saw. Here a false negative means automatic processing when review was required, costed at $120. A false positive means unnecessary human review, costed at $18.

choose-threshold-on-validation.py

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def metrics(y: np.ndarray, p: np.ndarray, threshold: float) -> tuple[float, float, float, int]:
    pred = (p >= threshold).astype(int)
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    cost = 120 * fn + 18 * fp  # $120 per missed review, $18 per unnecessary review
    return precision, recall, f1, cost

# Weights from the training fit; validation rows were never seen by the optimizer.
w = np.array([-4.894, 3.399, -0.959])
X_val_raw = np.array([
    [1.2, 2.0], [2.1, 4.0], [2.4, 3.0], [3.0, 5.0],
    [3.6, 4.0], [3.2, 7.0], [4.2, 5.0], [2.8, 6.0],
])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
X_val = np.c_[np.ones(len(X_val_raw)), X_val_raw]
p_val = sigmoid(X_val @ w)

for t in (0.20, 0.50, 0.80):
    precision, recall, f1, cost = metrics(y_val, p_val, t)
    print(f"t={t:.2f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} cost=${cost}")
```

Output

```
t=0.20 precision=0.667 recall=1.000 f1=0.800 cost=$36
t=0.50 precision=0.750 recall=0.750 f1=0.750 cost=$138
t=0.80 precision=1.000 recall=0.500 f1=0.667 cost=$240
```

On this validation slice, t = 0.20 costs least because avoiding missed reviews outweighs two extra queue items. In production, estimate these costs with stakeholders and re-evaluate thresholds as request mix shifts.
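
Three hand-picked thresholds are a coarse grid. A sketch of a fuller sweep follows, assuming the eight validation scores rounded to three decimals and the same $120 / $18 costs; `routing_cost` is an illustrative helper. Only the observed scores need testing, because the confusion matrix can change only when the threshold crosses one of them.

```python
import numpy as np

# Validation scores from the fitted model, rounded to three decimals.
p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
COST_FN, COST_FP = 120, 18   # missed review vs. unnecessary review

def routing_cost(threshold: float) -> int:
    """Total validation cost when routing to review at p >= threshold."""
    pred = p_val >= threshold
    fn = int((~pred & (y_val == 1)).sum())
    fp = int((pred & (y_val == 0)).sum())
    return COST_FN * fn + COST_FP * fp

candidates = np.unique(p_val)                 # sorted distinct scores
costs = [routing_cost(t) for t in candidates]
best = int(np.argmin(costs))

for t, c in zip(candidates, costs):
    print(f"t={t:.3f} cost=${c}")
print(f"cheapest: t={candidates[best]:.3f} cost=${costs[best]}")
```

Every threshold between two adjacent scores produces the same decisions, so the cheapest candidate stands for a whole interval of equivalent thresholds on this slice. With eight rows that interval is wide and unstable, so treat it as a starting point for a larger validation set.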

Precision, recall, F1, and class imbalance

For any threshold, fill the 2x2 confusion matrix:

Figure: Eight validation scores split at threshold 0.20 into TN 2, FP 2, FN 0, and TP 4, yielding precision 0.667, recall 1.0, and 36 dollars of weighted error cost.
At `t = 0.20`, all four required reviews are caught. Two routine requests enter the queue, producing `$36` of validation cost while avoiding the `$120` false-negative penalty.

Definitions:

  • Precision = TP / (TP + FP): among requests routed to review, what fraction required it?
  • Recall = TP / (TP + FN): among requests requiring review, what fraction reached it?
  • F1 = 2 * precision * recall / (precision + recall): one balance of those rates.
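
As a worked instance of these definitions, the sketch below fills the four confusion-matrix cells at `t = 0.20`, assuming the validation scores rounded to three decimals. It skips zero-denominator guards because this slice has both predicted and actual positives.

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])

pred = (p_val >= 0.20).astype(int)   # 1 = route to human review

tp = int(((pred == 1) & (y_val == 1)).sum())  # reviewed and needed it
fp = int(((pred == 1) & (y_val == 0)).sum())  # reviewed unnecessarily
fn = int(((pred == 0) & (y_val == 1)).sum())  # auto-processed but needed review
tn = int(((pred == 0) & (y_val == 0)).sum())  # auto-processed correctly

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp}")
print(f"FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```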

Accuracy alone can be nearly useless when positive cases are rare:

accuracy-fails-on-rare-review-cases.py

```python
import numpy as np

# 4 requests need review, 96 do not; the "model" never routes anything to review.
y_true = np.array([1] * 4 + [0] * 96)
predict_auto_process_everything = np.zeros(100, dtype=int)
accuracy = (predict_auto_process_everything == y_true).mean()
recall = ((predict_auto_process_everything == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
print("accuracy:", round(float(accuracy), 3))
print("review recall:", round(float(recall), 3))
```

Output

```
accuracy: 0.96
review recall: 0.0
```

ROC and precision-recall curves measure ranking

To draw an ROC (Receiver Operating Characteristic) curve, vary the threshold from high to low and plot recall (the true-positive rate) on the vertical axis against the false-positive rate on the horizontal axis. AUC answers a ranking question: how often does a random positive score rank above a random negative score? Ties get half credit. AUC never chooses a business threshold for you.

auc-from-positive-negative-pairs.py

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
positive = p_val[y_val == 1]
negative = p_val[y_val == 0]
# One entry per (positive, negative) pair: 1 for a win, 0.5 for a tie, 0 for a loss.
pair_scores = [(pos > neg) + 0.5 * (pos == neg) for pos in positive for neg in negative]
auc = float(np.mean(pair_scores))
print("positive-negative pairs:", len(pair_scores))
print("validation AUC:", round(auc, 3))
```

Output

```
positive-negative pairs: 16
validation AUC: 0.812
```

Figure: Validation ROC staircase with threshold checkpoints 0.80, 0.50, and 0.20 alongside a four-by-four positive-versus-negative score matrix with 13 ranking wins out of 16 pairs, producing AUC 0.8125.
The staircase sweeps every validation score. The matrix gives the equivalent ranking view: positive requests outrank negatives in `13` of `16` pairs, so AUC is `13 / 16 = 0.8125`.

For rare positives, ROC can look generous because many true negatives keep false-positive rate small. Plot a precision-recall curve as well: it exposes how much review load accompanies higher recall [1].
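
Both curves can be traced from one sorted pass over the validation scores. This is a minimal sketch that assumes no tied scores, which holds for this slice; production code would group ties before stepping. The trapezoid sum is written out by hand to avoid depending on a particular NumPy integration helper.

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])

order = np.argsort(-p_val)          # highest score first = threshold swept high to low
labels = y_val[order]

tp = np.cumsum(labels == 1)         # positives caught so far
fp = np.cumsum(labels == 0)         # negatives flagged so far

# ROC staircase, starting from the (0, 0) corner.
tpr = np.concatenate([[0.0], tp / tp[-1]])
fpr = np.concatenate([[0.0], fp / fp[-1]])
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Precision-recall points at the same cut positions.
precision = tp / (tp + fp)
recall = tp / tp[-1]

print("ROC area:", auc)
for score, r, pr in zip(p_val[order], recall, precision):
    print(f"cut at {score:.3f}: recall={r:.2f} precision={pr:.3f}")
```

The area under the staircase equals the pairwise win rate of positives over negatives, so the two views of AUC agree. The precision-recall rows show what the ROC hides: each step toward full recall adds unnecessary reviews to the queue.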

Figure: Validation precision-recall curve beside review queues at thresholds 0.80, 0.50, and 0.20, showing two true positives and zero false positives, three and one, then four and two.
Lowering the threshold grows the review queue from `2 TP / 0 FP` to `4 TP / 2 FP`. The final required catch comes with one additional false alarm, leaving precision at `0.667`.

Calibration: do scores represent frequencies?

A model can rank access-change requests well while producing over-confident or under-confident numbers. If requests scored near 0.80 require review only half the time, exposing 80% to an operator misstates risk.

A reliability diagram compares average score with observed positive frequency per bin. Expected Calibration Error (ECE) summarizes binned gaps:

$$\mathrm{ECE} = \sum_b \frac{|B_b|}{n} \left| \operatorname{avg}_{i \in B_b}(p_i) - \operatorname{avg}_{i \in B_b}(y_i) \right|$$

ECE depends on sample size, prevalence, and bin choices. It has no universal pass threshold. State the binning scheme, measure on held-out labels, and use a separate calibration split when fitting Platt scaling, isotonic regression, or temperature scaling. Temperature scaling is a common post-hoc method for neural classifiers [4] (On Calibration of Modern Neural Networks, https://arxiv.org/abs/1706.04599).

calibration-on-validation-labels.py

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
edges = np.linspace(0.0, 1.0, 5)  # four equal-width bins
ece = 0.0
for lo, hi in zip(edges[:-1], edges[1:]):
    # Half-open bins, except the last one, which includes p == 1.0.
    mask = (p_val >= lo) & ((p_val < hi) if hi < 1.0 else (p_val <= hi))
    if mask.any():
        avg_p = p_val[mask].mean()
        frac_pos = y_val[mask].mean()
        ece += mask.mean() * abs(avg_p - frac_pos)
        print(f"[{lo:.2f}, {hi:.2f}) n={mask.sum()} avg_p={avg_p:.3f} frac_pos={frac_pos:.3f}")
print("validation ECE:", round(float(ece), 3))
```

Output

```
[0.00, 0.25) n=3 avg_p=0.158 frac_pos=0.333
[0.25, 0.50) n=1 avg_p=0.325 frac_pos=0.000
[0.50, 0.75) n=2 avg_p=0.609 frac_pos=0.500
[0.75, 1.00) n=2 avg_p=0.980 frac_pos=1.000
validation ECE: 0.139
```

Eight validation requests are enough for arithmetic, not a release gate. A single disputed label moves both metrics noticeably:

one-label-changes-small-sample-conclusions.py

```python
import numpy as np

def f1_at_half(y: np.ndarray, p: np.ndarray) -> float:
    pred = p >= 0.5
    tp = ((pred == 1) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    fn = ((pred == 0) & (y == 1)).sum()
    return float(2 * tp / (2 * tp + fp + fn))

def ece(y: np.ndarray, p: np.ndarray) -> float:
    edges = np.linspace(0.0, 1.0, 5)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            total += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(total)

p = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
base = np.array([0, 0, 1, 0, 1, 0, 1, 1])
one_flip = base.copy()
one_flip[3] = 1  # relabel the request scored 0.624 as "needs review"
for name, labels in (("base", base), ("one flip", one_flip)):
    print(f"{name:8s} f1={f1_at_half(labels, p):.3f} ece={ece(labels, p):.3f}")
```

Output

```
base     f1=0.750 ece=0.139
one flip f1=0.889 ece=0.209
```

First audit action selection:

Figure: Validation threshold comparison with stacked error-cost bars: threshold 0.20 costs 36 dollars from two false positives, 0.50 costs 138 dollars from one false positive and one false negative, and 0.80 costs 240 dollars from two false negatives.
Threshold `0.20` wins this validation audit because two `$18` reviews cost less than one `$120` missed review. It's a candidate policy, not a permanent constant.
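The comparison behind this choice can be reproduced from the eight validation scores. The sketch below is a minimal illustration, assuming the `$18` unnecessary-review cost and the `$120` missed-review cost quoted above; the helper name `threshold_cost` is hypothetical and not part of the lesson's module.

```python
import numpy as np

# Validation scores and labels from the ECE example in this lesson.
p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])

COST_FALSE_POSITIVE = 18    # unnecessary human review
COST_FALSE_NEGATIVE = 120   # missed review (false automatic approval)

def threshold_cost(y, p, threshold):
    """Return (fp, fn, total_cost) when scores >= threshold are sent to review."""
    pred = p >= threshold
    fp = int((pred & (y == 0)).sum())
    fn = int((~pred & (y == 1)).sum())
    return fp, fn, fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

costs = {t: threshold_cost(y_val, p_val, t) for t in (0.20, 0.50, 0.80)}
for t, (fp, fn, cost) in costs.items():
    print(f"threshold={t:.2f} fp={fp} fn={fn} cost=${cost}")
```

Changing either cost constant and rerunning the loop shows how quickly the preferred threshold can move, which is why the selected value is treated as a candidate policy.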

Then audit probability meaning:

Figure: Four validation calibration bins with paired average-score and observed-positive-rate bars, plus weighted ECE contributions 0.066, 0.041, 0.027, and 0.005 summing to 0.139.
Each bin contributes its score-frequency gap weighted by its share of the eight validation requests: `0.066 + 0.041 + 0.027 + 0.005 ≈ 0.139`. The arithmetic is exact for this slice; the evidence is still too small for a release claim.
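To see where each term in that sum comes from, the following sketch recomputes the per-bin contributions from the same eight scores. It is an illustration under the lesson's four equal-width bins; the variable name `contributions` is introduced here only for this example.

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
edges = np.linspace(0.0, 1.0, 5)

contributions = []
for lo, hi in zip(edges[:-1], edges[1:]):
    # The last bin is closed on the right so a score of exactly 1.0 is counted.
    mask = (p_val >= lo) & ((p_val < hi) if hi < 1.0 else (p_val <= hi))
    if mask.any():
        gap = abs(p_val[mask].mean() - y_val[mask].mean())
        contributions.append(mask.mean() * gap)  # bin share times score-frequency gap
    else:
        contributions.append(0.0)  # empty bins add nothing

contributions = np.array(contributions)
print("per-bin contributions:", np.round(contributions, 3))
print("ECE:", round(float(contributions.sum()), 3))
```

The first bin dominates because it holds three of the eight requests and has the second-largest gap, so a single relabel in that bin would shift the total more than a relabel in the last bin.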

Verify the NumPy fit with scikit-learn

A from-scratch implementation earns trust by matching a maintained implementation under equivalent settings.[5] Scikit-learn applies regularization by default; set `C=np.inf` here to compare against our unregularized gradient-descent fit.

compare-with-scikit-learn.py
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])
X = np.c_[np.ones((len(X_raw), 1)), X_raw]  # explicit intercept column
w_ours = np.array([-4.894, 3.399, -0.959])
X_val_raw = np.array([
    [1.2, 2.0], [2.1, 4.0], [2.4, 3.0], [3.0, 5.0],
    [3.6, 4.0], [3.2, 7.0], [4.2, 5.0], [2.8, 6.0],
])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])
X_val = np.c_[np.ones(len(X_val_raw)), X_val_raw]

# fit_intercept=False because the design matrix already carries the ones column.
clf = LogisticRegression(C=np.inf, solver="lbfgs", max_iter=5000, fit_intercept=False)
clf.fit(X, y)
w_sk = clf.coef_[0]
print("sklearn weights:", np.round(w_sk, 3))
print("our weights    :", w_ours)
print("weights close  :", np.allclose(w_sk, w_ours, atol=0.05))
p_val = clf.predict_proba(X_val)[:, 1]
print("validation F1 @ 0.20:", round(f1_score(y_val, (p_val >= 0.20).astype(int)), 3))
print("validation AUC:", round(roc_auc_score(y_val, p_val), 3))
```
Output
```
sklearn weights: [-4.914 3.414 -0.964]
our weights    : [-4.894 3.399 -0.959]
weights close  : True
validation F1 @ 0.20: 0.8
validation AUC: 0.812
```

The weight match validates the mechanics. The validation metrics validate only this small held-out exercise; larger and temporally later data is still required before deployment.
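The AUC figure can also be read as a pairwise ranking rate: the share of positive-negative pairs in which the positive request gets the higher score. The sketch below counts those pairs directly. It assumes the rounded validation scores from the ECE example stand in for the fitted probabilities, and the names `wins`, `ties`, and `pairwise_auc` are introduced here for illustration.

```python
import numpy as np

p_val = np.array([0.061, 0.169, 0.595, 0.624, 0.971, 0.325, 0.990, 0.244])
y_val = np.array([0, 0, 1, 0, 1, 0, 1, 1])

pos = p_val[y_val == 1]  # scores of requests that need review
neg = p_val[y_val == 0]  # scores of requests that do not

# Broadcasting builds the full positive-by-negative comparison grid.
wins = int((pos[:, None] > neg[None, :]).sum())
ties = int((pos[:, None] == neg[None, :]).sum())
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(f"pairs={len(pos) * len(neg)} wins={wins} ties={ties} auc={pairwise_auc:.4f}")
```

With four positives and four negatives there are only 16 pairs, so each mis-ordered pair moves the AUC by 0.0625; that granularity is another reason this slice cannot support a release claim.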

Tests that protect the implementation

Tests should protect stable numerics, the known fit, held-out threshold behavior, and parity settings:

tests/test_logistic.py
```python
import numpy as np
import pytest
from logistic_scratch import (
    sigmoid, log_loss_from_logits, fit_logistic, predict_proba,
    confusion_matrix_at_threshold, precision_recall_f1
)

X_train_raw = np.array([[1.,1.],[1.5,2.],[2.,3.],[3.5,6.],[4.,7.],[4.5,5.],[2.8,4.],[3.2,8.]])
y_train = np.array([0,0,1,1,1,1,0,0])
X_train = np.c_[np.ones((8,1)), X_train_raw]
X_val_raw = np.array([[1.2,2.],[2.1,4.],[2.4,3.],[3.,5.],[3.6,4.],[3.2,7.],[4.2,5.],[2.8,6.]])
y_val = np.array([0,0,1,0,1,0,1,1])
X_val = np.c_[np.ones((8,1)), X_val_raw]

def test_sigmoid():
    assert sigmoid(0) == pytest.approx(0.5)
    assert sigmoid(10) > 0.999
    assert sigmoid(-10) < 0.001

def test_log_loss_is_stable_for_extreme_logits():
    loss = log_loss_from_logits(np.array([0., 1.]), np.array([1000., -1000.]))
    assert np.isfinite(loss)

def test_gradient_direction():
    # on a single row with bad prediction, gradient should point toward correction
    w0 = np.zeros(3)
    p = sigmoid(X_train[3] @ w0)
    grad = X_train[3:4].T @ (p - y_train[3:4])
    assert grad[0] < 0

def test_fit_converges():
    w = fit_logistic(X_train, y_train, lr=0.2, epochs=3000, verbose=False)
    loss = log_loss_from_logits(y_train, X_train @ w)
    assert loss < 0.42
    assert np.allclose(w, [-4.894, 3.399, -0.959], atol=0.01)

def test_validation_threshold_cost_contract():
    w = fit_logistic(X_train, y_train, lr=0.2, epochs=3000, verbose=False)
    p = predict_proba(X_val, w)
    cm = confusion_matrix_at_threshold(y_val, p, 0.2)
    prec, rec, f1 = precision_recall_f1(cm)
    assert cm == {"tp": 4, "fp": 2, "fn": 0, "tn": 2}
    assert (prec, rec, f1) == pytest.approx((2/3, 1.0, 0.8))

def test_matches_sklearn():
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(C=np.inf, solver='lbfgs', max_iter=5000, fit_intercept=False)
    clf.fit(X_train, y_train)
    w_ours = fit_logistic(X_train, y_train, lr=0.2, epochs=3000, verbose=False)
    assert np.allclose(w_ours, clf.coef_[0], atol=0.05)
```

Run with `pytest tests/test_logistic.py -q --tb=line` after placing the implementation in `logistic_scratch.py`. Unlike an exact ECE assertion on eight labels, these tests guard code contracts and the selected validation scenario.

Symptom, cause, fix table (the debugging checklist)

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| All predicted probabilities hover around 0.4-0.6 | Features are weak, data is tiny, or GD is under-trained | Add informative evidence features, increase epochs, tune learning rate, or evaluate a stronger model |
| F1 at threshold 0.5 is mediocre but recall at 0.2 is excellent | Class imbalance + wrong operating point | Read the ROC or cost-sensitive curve; pick threshold by expected business cost, not by accuracy |
| Good AUC but large reliability gaps on substantial held-out data | Model ranks well but scores misstate frequency | Withhold raw percentages; calibrate on a reserved split and re-audit |
| Your NumPy weights differ from sklearn by >0.3 | Forgot the intercept column, used different regularization, or feature scaling mismatch | Compare design matrices, disable sklearn regularization, and verify the same X |
| Loss decreases then suddenly increases | Learning rate too large or probability-form loss overflowed | Lower lr, compute loss with logaddexp, and print loss during fitting |
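The last row's overflow failure is easy to reproduce. The sketch below contrasts a probability-form loss with the logit-form `logaddexp(0, z) - y*z` on the same extreme logits the test suite uses. The function names `naive_log_loss` and `stable_log_loss` are hypothetical and may not match the helpers in `logistic_scratch.py`.

```python
import numpy as np

def naive_log_loss(y, z):
    """Probability-form loss: fails once sigmoid saturates to exactly 0 or 1."""
    p = 1.0 / (1.0 + np.exp(-z))
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def stable_log_loss(y, z):
    """Logit-form loss: log(1 + e^z) - y*z, evaluated without overflow."""
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

y = np.array([0.0, 1.0])
z = np.array([1000.0, -1000.0])  # confidently wrong on both rows

# Silence the overflow and log(0) warnings that the naive form triggers.
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive = naive_log_loss(y, z)
stable = stable_log_loss(y, z)

print("naive finite :", np.isfinite(naive))
print("stable loss  :", stable)
```

The stable form reports a large but finite penalty, so gradient descent can still compare steps; the naive form returns a non-finite value that hides whether the fit is improving.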

Try it yourself

These exercises turn the chapter from reading into mastery. Use training rows for fitting and validation rows for policy selection.

  1. Start with w = [0,0,0] and compute the gradient vector on the full batch by hand for the initial step (lr = 0.5). Verify it matches the initial printed line of your NumPy run.
  2. Change the learning rate to 2.0 and watch the loss explode or oscillate. Explain why in one sentence.
  3. Sweep thresholds [0.2, 0.35, 0.5, 0.65, 0.8] on validation and print precision, recall, F1, and cost. Which threshold minimizes the stated cost?
  4. Bin validation probabilities into four equal-width bins, compute ECE by hand, and compare with code output. Repeat with eight bins and explain why the number changes.
  5. Flip one validation label and measure F1 and ECE again. Write the argument for gathering a larger, later-in-time validation set before publishing probability claims.

If any answer feels hand-wavy, print the arrays, loss numbers, and concrete cells of the confusion matrix. Concrete numbers build classification skill.

What you have built and where it leads

The classifier can now:

  • Turn a binary classification problem into a design matrix and fit logistic weights with gradient descent using NumPy alone.
  • Derive and implement the (p − y)·x gradient yourself.
  • Select a decision threshold on held-out requests with an explicit cost model.
  • Compute classification metrics and distinguish threshold decisions from ranking quality.
  • Draw a reliability diagram and compute ECE without inventing a universal pass line.
  • Write deterministic tests for numeric stability, parity, and validation behavior.
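One way to confirm the `(p − y)·x` gradient from the list above is a finite-difference check on the training rows. This is a sketch, assuming the loss is averaged over rows (the lesson's `fit_logistic` may scale it differently); the helper names and the test point `w` are chosen only for illustration.

```python
import numpy as np

X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)
X = np.c_[np.ones(len(X_raw)), X_raw]  # intercept column plus two features

def mean_log_loss(w):
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def analytic_gradient(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)  # average of (p - y) * x over rows

def numeric_gradient(w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        # Central difference along coordinate i.
        grad[i] = (mean_log_loss(w + step) - mean_log_loss(w - step)) / (2 * eps)
    return grad

w = np.array([0.1, -0.2, 0.05])  # arbitrary test point, not a fitted weight vector
analytic = analytic_gradient(w)
numeric = numeric_gradient(w)
print("analytic:", np.round(analytic, 6))
print("numeric :", np.round(numeric, 6))
print("match   :", np.allclose(analytic, numeric, atol=1e-6))
```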

These are the same habits behind production classifiers: fit the model, inspect the threshold tradeoff, measure calibration, and write tests that keep the implementation honest.

The validation slice improved the evaluation boundary, but eight requests remain far too few for a release claim. Next, decision trees and ensembles attack the same access-review routing task with nonlinear rules; the later cross-validation lesson will deepen the evidence required before shipping a model.

Mastery check

Key concepts

  • sigmoid activation
  • log loss / binary cross-entropy
  • gradient of logistic loss
  • decision threshold tuning
  • confusion matrix (TP/FP/TN/FN)
  • precision, recall, F1-score
  • ROC curve and AUC
  • reliability diagram / calibration plot
  • Expected Calibration Error (ECE)
  • NumPy from-scratch implementation
  • scikit-learn comparison
  • imbalanced classification pitfalls

Evaluation rubric

  • Foundational: Computes sigmoid, log loss, and the (p - y) * x gradient by hand on an access-change request with concrete scores and loss values
  • Intermediate: Implements NumPy logistic regression with stable logit loss; selects a cost-aware validation threshold; computes F1, AUC, and binned ECE; verifies unregularized parity with LogisticRegression
  • Advanced: Separates fit, threshold selection, ranking audit, and calibration audit; explains label sensitivity and the evidence needed before exposing probability claims

Follow-up questions

Common pitfalls

  • Symptom: dashboard accuracy looks fine, but cases needing review are auto-processed. Cause: threshold stayed at 0.5 on imbalanced data, so recall collapsed. Fix: sweep thresholds on a held-out set and pick by expected business cost or recall target, not habit.
  • Symptom: operators distrust percentages like 0.87 even though AUC looks useful. Cause: ranking and calibration answer different questions. Fix: draw reliability diagrams on enough held-out labels, reserve calibration data, and withhold percentages until verified.
  • Symptom: your NumPy weights disagree sharply with scikit-learn. Cause: intercept handling, regularization, scaling, or solver settings don't match. Fix: compare design matrices first, disable sklearn regularization when matching plain GD, and verify feature preprocessing is identical.
  • Symptom: loss falls for a few steps, then spikes or oscillates. Cause: learning rate is too large or logits are hitting unstable numeric regions. Fix: lower the learning rate, clip logits inside sigmoid, and print loss every few hundred steps until the curve is stable.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. A logistic routing model uses z = w0 + w1x1 + w2x2 and p = 1/(1 + e^-z). With w = [-4.9, 3.4, -0.96], x1 = 3.0, x2 = 5.0, and positive class = needs human review, what action follows at threshold 0.5?
2. For a positive access-review case, y = 1, compare logits z = 2 and z = -4 using stable logistic loss logaddexp(0, z) - y*z. What should the losses show?
3. For one training row with features x = [1, 3.5, 6], label y = 1, and current prediction p = 0.5, what is the row gradient of log loss with respect to [w0, w1, w2]?
4. On validation, threshold 0.50 gives FP = 1 and FN = 1. Threshold 0.20 gives FP = 2 and FN = 0. A false automatic approval costs $120 and an unnecessary review costs $18. Which threshold has lower validation cost?
5. At a validation threshold, TP = 4, FP = 2, and FN = 0. What are precision, recall, and F1?
6. In a 100-request validation set, 4 requests truly need review and 96 do not. A model predicts auto-process for every request. What do accuracy and review recall reveal?
7. Across 16 positive-negative validation score pairs, the positive request has the higher score in 13 pairs and there are no ties. What does the AUC equal, and what does that value mean?
8. ECE is sum over bins of (bin size / n) * abs(avg score - observed positive rate). For n = 8, four bins have (n_b, avg score, positive rate): (3, 0.158, 0.333), (1, 0.325, 0.000), (2, 0.609, 0.500), and (2, 0.980, 1.000). What is ECE?
9. Validation on eight held-out access-change requests gives AUC = 0.812 and four-bin ECE = 0.139. A raw score for one request is 0.87, and a product manager wants to display it as an 87% chance of manual review. What should you do first?
10. A NumPy logistic-regression fit and a scikit-learn fit have similar F1, but one coefficient differs by about 0.4. What should you inspect before blaming gradient descent?


Next Step
Continue to Decision Trees, Random Forests & Gradient Boosting from Scratch

There you'll model the same access-review decision with nonlinear rules and inspect why high-scoring rankings still need calibration audits.

References

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.
[2] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
[3] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
[4] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.
[5] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR. https://jmlr.org/papers/v12/pedregosa11a.html