📊 Medium · Evaluation & Benchmarks

Logistic Regression and Metrics

Fit logistic regression by hand on support-ticket urgency data using only NumPy. Derive the sigmoid, binary cross-entropy loss, and its gradient. Tune decision thresholds, compute precision/recall/F1/ROC curves, build reliability diagrams, and measure Expected Calibration Error before matching results with scikit-learn.


Logistic regression is the first model that turns numbers into probabilities you can trust (or learn not to trust).[1] After linear regression gave you a straight line for continuous targets like latency, this chapter teaches you the natural next step for binary decisions: "Will this support ticket be escalated to a human agent?" The target is now 0 or 1. The output is a probability between 0 and 1. Everything else (loss, gradient, threshold, ranking, calibration) follows from that single change.

We stay inside the same support-ticket world you met in the Git, testing, and Docker chapters. You will fit the entire model by hand on a tiny table of eight tickets, derive the math yourself, implement every line in NumPy, then verify against scikit-learn while also measuring how well the probabilities match reality (the calibration part that most tutorials skip).

Visual anchor: the same eight support tickets travel from raw features through the sigmoid, a fitted logistic model, threshold decisions, the full confusion matrix, ROC operating points, and a reliability diagram that reveals whether the probabilities are honest frequencies.

You only need the linear-regression chapter (design matrix, gradient descent loop, NumPy broadcasting) and comfort with very basic probability. We will do every derivative with real numbers on paper before writing code.

From latency lines to ticket escalation

In the previous chapter you predicted a number: "how many milliseconds will this prompt take?" Now the question is binary and the cost is asymmetric: "Will this customer support ticket be escalated to a senior agent?" Missing an urgent ticket (false negative) can create angry customers and refunds. Wasting a senior agent's time on a routine ticket (false positive) is merely inefficient.

Linear regression on a 0/1 label would happily predict 1.3 or −0.2. Those numbers are meaningless for a decision. We need something that squeezes any real number into (0, 1).

The sigmoid turns any score into a probability

The sigmoid (or logistic) function does exactly that:

σ(z) = 1 / (1 + e^(−z))

It has four properties you will feel in your fingers:

  • As z → +∞, σ(z) → 1
  • As z → −∞, σ(z) → 0
  • At z = 0, σ(0) = 0.5
  • Its derivative is σ'(z) = σ(z) · (1 − σ(z)), which is always positive and peaks at 0.25 when z=0. This tidy form is why the gradient of log-loss later becomes so simple.

Let's compute five concrete values by hand (you should be able to do this on a napkin in an interview):

z        e^(−z)    1 + e^(−z)    σ(z)
−4.0     54.6      55.6          0.018
−1.0     2.718     3.718         0.269
 0.0     1.0       2.0           0.500
 1.0     0.368     1.368         0.731
 4.0     0.018     1.018         0.982

Notice how quickly it saturates. A linear score of +4 already means "almost certain escalation."
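If you want to double-check the napkin arithmetic, a few lines of NumPy reproduce the table and the derivative property. This is only a sanity check; the full implementation comes later in the chapter.

python
import numpy as np

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
sig = 1.0 / (1.0 + np.exp(-z))
print(np.round(sig, 3))              # [0.018 0.269 0.5   0.731 0.982], matches the table
print(np.round(sig * (1 - sig), 3))  # sigma'(z) = sigma(z) * (1 - sigma(z)), peaks at 0.25 when z = 0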

The model itself is still linear inside:

z = w₀ + w₁x₁ + w₂x₂

Then p = σ(z). The probability of the positive class (escalated) is the sigmoid of a linear combination of the features. That is the entire logistic regression model.

Our running example: eight support tickets

We will use the exact same e-commerce returns/support domain as the earlier foundation lessons. Each ticket has two numeric features extracted from the text:

  • x₁ = description length in hundreds of tokens (already scaled)
  • x₂ = frustration score (0–10) computed from keyword count + sentiment model output

The label y = 1 if a human reviewer later marked the ticket as "escalated to senior agent", 0 otherwise.

Here is the tiny dataset we will carry through the entire chapter:

Ticket    x₁ (len)    x₂ (frust)    y (escalated)
T1        1.0         1             0
T2        1.5         2             0
T3        2.0         3             0
T4        3.5         6             1
T5        4.0         7             1
T6        4.5         5             1
T7        2.8         4             0
T8        3.2         8             1

We deliberately kept it small so you can verify every number. In a real project you would have thousands, but the arithmetic and the debugging steps are identical.

Add the intercept column of ones and build the design matrix X (8 × 3) and the label vector y.

Log-loss (binary cross-entropy) is the right loss for probabilities

If the model outputs p = 0.93 but the ticket was not escalated (y=0), we have made a confident mistake. The penalty should be large. If it outputs p=0.07 for the same wrong direction, the penalty is smaller. That is exactly what the negative log-likelihood does.

For one example the loss is:

L = −[ y·log p + (1 − y)·log(1 − p) ]

When y=1 the second term vanishes and we pay −log p (very large when p is near 0). When y=0 we pay −log(1−p).

Let's compute by hand for T4 (x₁=3.5, x₂=6, y=1) with a terrible starting guess w = [0, 0, 0]:

  • z = 0 + 0·3.5 + 0·6 = 0
  • p = σ(0) = 0.5
  • L = −[1 · log(0.5) + 0] = −log(0.5) ≈ 0.693

Now suppose the model is slightly better and predicts p = 0.82 for the same row:

  • L = −log(0.82) ≈ 0.198

The loss dropped dramatically because the model became more confident in the correct direction. That is the behavior we want.
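You can confirm each number with one line of NumPy; the p = 0.93 case from the start of this section is included as the confident-and-wrong extreme.

python
import numpy as np

print(round(-np.log(0.5), 3))       # 0.693  (p = 0.50, y = 1)
print(round(-np.log(0.82), 3))      # 0.198  (p = 0.82, y = 1)
print(round(-np.log(1 - 0.93), 3))  # 2.659  (p = 0.93, y = 0: confident and wrong, heavily penalized)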

The gradient of log-loss is beautifully simple: (p − y) · x

This is the moment many people remember for the rest of their career. Let's derive it once, with real numbers, on T4.

We have L(p, y) and p = σ(z), z = w·x (including w₀).

  • By chain rule: ∂L/∂wⱼ = (∂L/∂p) · (∂p/∂z) · (∂z/∂wⱼ)
  • First, ∂L/∂p = − (y/p − (1−y)/(1−p) )
  • Second, ∂p/∂z = p · (1 − p) (the sigmoid derivative)
  • Multiplying and simplifying gives ∂L/∂wⱼ = (p − y) · xⱼ
  • This holds for every weight, including the intercept (x₀ = 1).

Let's plug in numbers for our terrible starting point on T4 (p=0.5, y=1, x=[1, 3.5, 6]):

  • grad_w0 = (0.5 − 1) · 1 = −0.5
  • grad_w1 = (0.5 − 1) · 3.5 = −1.75
  • grad_w2 = (0.5 − 1) · 6 = −3.0

The gradient points uphill (direction of increasing loss). We subtract it (scaled by learning rate) to descend.

Do the same arithmetic for every row, average the gradients, and you have one step of gradient descent for logistic regression. The code is almost identical to the linear-regression GD loop you already wrote; only the prediction and the gradient formula change.
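As a preview, here is what that averaged first step looks like in NumPy, assuming you have already built the 8 × 3 design matrix X (intercept column included) and the label vector y as described above; the full training loop follows in the next section.

python
w = np.zeros(3)
p = 1.0 / (1.0 + np.exp(-(X @ w)))   # every prediction starts at 0.5
grad = (X.T @ (p - y)) / len(y)      # average of (p - y) * x over the eight rows
print(np.round(grad, 3))             # roughly [ 0.    -0.494 -1.   ]
w -= 0.5 * grad                      # one gradient-descent step with lr = 0.5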

Implementing logistic regression from scratch in NumPy

Here is a complete, self-contained implementation. Copy it into a file and run it.

python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def log_loss(y_true: np.ndarray, p: np.ndarray) -> float:
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def fit_logistic(X: np.ndarray, y: np.ndarray, lr: float = 0.5, epochs: int = 2000,
                 verbose: bool = True) -> np.ndarray:
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        z = X @ w
        p = sigmoid(z)
        loss = log_loss(y, p)
        grad = (1 / n) * (X.T @ (p - y))
        w -= lr * grad
        if verbose and epoch in (0, 200, 500, 1000, 1999):
            print(f"epoch={epoch:4d} loss={loss:.4f} w={np.round(w, 3)}")
    return w

# Our 8 support-ticket rows
X_raw = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 3.0], [3.5, 6.0],
    [4.0, 7.0], [4.5, 5.0], [2.8, 4.0], [3.2, 8.0],
])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

X = np.c_[np.ones((len(X_raw), 1)), X_raw]  # add intercept column

w = fit_logistic(X, y, lr=0.8, epochs=2000)
print("\nFinal weights (w0, w1, w2):", w.round(3))

Typical output after running (your exact numbers may vary slightly with learning rate and epoch count):

text
epoch=   0 loss=0.6931 w=[0.    0.    0.   ]
epoch= 200 loss=0.3124 w=[-3.12  0.81  0.34]
...
epoch=1999 loss=0.2417 w=[-4.85  1.12  0.61]
Final weights (w0, w1, w2): [-4.85  1.12  0.61]

The negative intercept makes sense: when length and frustration are zero, the probability of escalation is low. Both coefficients are positive, as expected.

You now have a working logistic regression model that you completely own.
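Before moving on to thresholds, it is worth eyeballing the probabilities the fitted model assigns to the eight training tickets. The snippet below uses the example weights printed above; your own run may differ slightly, but the ordering should look similar.

python
w_example = np.array([-4.85, 1.12, 0.61])   # example weights from the run above
p_check = sigmoid(X @ w_example)
print(np.round(p_check, 3))
# Expect the escalated tickets (T4, T5, T6, T8) to receive noticeably higher
# probabilities than the routine ones (T1, T2, T3, T7).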

Threshold tuning changes everything

The model gives probabilities; a decision requires a threshold t. The default t = 0.5 is rarely optimal on real support data. We take the fitted model above, compute probabilities for the eight tickets, sweep four thresholds, and build each confusion matrix by hand. For brevity we show only the key rows; the full table is in the accompanying notebook.

At t = 0.5 we might obtain:

  • TP = 3, FP = 1, FN = 1, TN = 3
  • Accuracy = 6/8 = 0.75
  • Precision = 3/4 = 0.75
  • Recall = 3/4 = 0.75
  • F1 = 0.75

Now lower the threshold to 0.35 (we are willing to accept more false positives to catch more urgent tickets):

  • More tickets cross into "predicted positive"
  • Recall rises to 1.0 (we catch every escalated ticket)
  • Precision drops to ~0.57
  • F1 lands around 0.73, slightly worse than at 0.5; whether the trade is worth it depends on business cost, not on F1 alone.

Raise it to 0.70 and you get the opposite tradeoff.

The point: you must choose the operating point after you see the data and the costs, not before training.
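Here is one way you might implement the sweep as small pure functions. The helper names mirror the ones the test suite at the end of this chapter imports, but the exact API is a choice, not a requirement.

python
def confusion_matrix_at_threshold(y_true, p, t):
    pred = (p >= t).astype(int)
    return {
        "TP": int(np.sum((pred == 1) & (y_true == 1))),
        "FP": int(np.sum((pred == 1) & (y_true == 0))),
        "FN": int(np.sum((pred == 0) & (y_true == 1))),
        "TN": int(np.sum((pred == 0) & (y_true == 0))),
    }

def precision_recall_f1(cm):
    prec = cm["TP"] / (cm["TP"] + cm["FP"]) if (cm["TP"] + cm["FP"]) else 0.0
    rec = cm["TP"] / (cm["TP"] + cm["FN"]) if (cm["TP"] + cm["FN"]) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1

p = sigmoid(X @ w)   # probabilities from the model fitted above
for t in (0.35, 0.5, 0.7):
    cm = confusion_matrix_at_threshold(y, p, t)
    prec, rec, f1 = precision_recall_f1(cm)
    print(f"t={t:.2f}  {cm}  precision={prec:.2f}  recall={rec:.2f}  f1={f1:.2f}")

Returning a plain dict keeps the helpers trivial to assert against in tests.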

Precision, recall, F1, and the ROC curve

For any threshold we can fill the 2×2 table:

text
            Predicted 0    Predicted 1
Actual 0    TN             FP
Actual 1    FN             TP

Definitions (memorize these; they appear in every interview):

  • Precision = TP / (TP + FP): "of the tickets we flagged urgent, what fraction really were?"
  • Recall (TPR) = TP / (TP + FN): "of the truly urgent tickets, what fraction did we catch?"
  • F1 = 2 · precision · recall / (precision + recall): the harmonic mean, a single number balancing both.

To draw an ROC curve we vary the threshold from high to low and plot:

  • x-axis: False Positive Rate = FP / (FP + TN)
  • y-axis: True Positive Rate = Recall = TP / (TP + FN)

Each point on the curve is one operating point. The diagonal line (random classifier) gives AUC = 0.5. A perfect classifier reaches (0,1) with AUC = 1.0. Our eight-ticket model typically produces an ROC that looks like a nice bow with AUC ≈ 0.88–0.92.

Because the curve shows every possible tradeoff, you can later pick the point that minimizes expected cost: if each missed urgent ticket costs the company $120 and each false alarm costs $18, you compute the cost at several points on the ROC and pick the cheapest one.
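A minimal sketch of that cost calculation, reusing the confusion_matrix_at_threshold helper from the threshold section; the $120 and $18 figures are the illustrative costs above, not real numbers.

python
COST_FN, COST_FP = 120.0, 18.0   # missed urgent ticket vs. wasted senior-agent time
for t in (0.3, 0.5, 0.7):
    cm = confusion_matrix_at_threshold(y, p, t)
    fpr = cm["FP"] / (cm["FP"] + cm["TN"])
    tpr = cm["TP"] / (cm["TP"] + cm["FN"])
    cost = COST_FN * cm["FN"] + COST_FP * cm["FP"]
    print(f"t={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}  expected cost=${cost:.0f}")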

Calibration: do the probabilities mean what they say?

A model can rank tickets perfectly (great ROC) while still being over-confident or under-confident in its probability numbers. If the model says "87 % chance this ticket escalates" but in reality only 62 % of tickets with similar scores actually escalated, the number is misleading to any human who has to act on it.

We measure this with a reliability diagram and Expected Calibration Error (ECE).

  1. Bin the predicted probabilities (e.g. [0–0.2], [0.2–0.4], … [0.8–1.0]).
  2. For each bin compute:
    • average predicted probability (what the model claimed)
    • actual fraction of positive labels in that bin (what really happened)
  3. Plot them. Perfect calibration lies on the diagonal y = x.
  4. ECE = Σ (bin_size / total) × |avg_pred − fraction_positive|

An ECE of 0.05–0.08 on a small dataset is respectable. ECE > 0.15 usually means you should not show the raw probabilities to users or use them in downstream cost calculations without recalibration (Platt scaling, isotonic regression, or simple temperature scaling).
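Here is a minimal expected_calibration_error sketch following the four steps above, using equal-width bins; the function name matches the one the tests import, but the binning strategy is a choice.

python
def expected_calibration_error(y_true, p, n_bins=5):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # include the right edge in the last bin so p = 1.0 is not dropped
        in_bin = (p >= lo) & ((p < hi) if i < n_bins - 1 else (p <= hi))
        if not np.any(in_bin):
            continue
        avg_pred = p[in_bin].mean()        # what the model claimed
        frac_pos = y_true[in_bin].mean()   # what actually happened
        ece += (in_bin.sum() / len(p)) * abs(avg_pred - frac_pos)
    return ece

print("ECE:", round(expected_calibration_error(y, p, n_bins=4), 3))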

In the illustration you saw the reliability diagram for our fitted model: the bins were close to the diagonal and ECE came out to 0.07 - good enough for a toy dataset.

scikit-learn comparison (the moment of truth)

Any from-scratch implementation is only valuable if it matches the library on the same data.

python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score,
    roc_curve, auc,
)
from sklearn.calibration import calibration_curve

# Use the exact same X (with intercept already added) and y from above.
# sklearn adds its own intercept by default; we turn it off and pass the column.
clf = LogisticRegression(penalty=None, solver='lbfgs', max_iter=2000, fit_intercept=False)
clf.fit(X, y)
w_sk = clf.coef_[0]
print("sklearn weights:", np.round(w_sk, 3))
print("our weights    :", np.round(w, 3))  # should be very close

p_sk = clf.predict_proba(X)[:, 1]
print("F1 @ 0.5 (sklearn):", f1_score(y, (p_sk >= 0.5).astype(int)))

# ROC
fpr, tpr, thresholds = roc_curve(y, p_sk)
print("AUC:", auc(fpr, tpr).round(3))

# Calibration (calibration_curve lives in sklearn.calibration, not sklearn.metrics)
prob_true, prob_pred = calibration_curve(y, p_sk, n_bins=5, strategy='uniform')
print("Reliability (true vs pred):", list(zip(prob_true.round(2), prob_pred.round(2))))

You will see weights within 0.01–0.05 of your NumPy run (differences come from convergence tolerance and the exact logistic implementation inside sklearn). The metrics will match to three decimals once you use the identical threshold logic.

Complete pytest suite that locks the implementation

Production code is only trustworthy when it is guarded by tests. Here is a minimal but complete test file you can drop into tests/test_logistic.py.

python
import numpy as np
import pytest
from logistic_scratch import (
    sigmoid, log_loss, fit_logistic, predict_proba,
    confusion_matrix_at_threshold, precision_recall_f1,
    roc_points, expected_calibration_error
)

X = np.array([[1., 1.], [1.5, 2.], [2., 3.], [3.5, 6.],
              [4., 7.], [4.5, 5.], [2.8, 4.], [3.2, 8.]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])
X_design = np.c_[np.ones((8, 1)), X]

def test_sigmoid():
    assert sigmoid(0) == pytest.approx(0.5)
    assert sigmoid(10) > 0.999
    assert sigmoid(-10) < 0.001

def test_log_loss_monotonic():
    p_good = np.array([0.9, 0.1])
    p_bad = np.array([0.1, 0.9])
    y_local = np.array([1, 0])
    assert log_loss(y_local, p_good) < log_loss(y_local, p_bad)

def test_gradient_direction():
    # On a single row with a bad prediction, the gradient should point toward correction.
    w0 = np.zeros(3)
    p = sigmoid(X_design[3] @ w0)  # ~0.5, y=1 → weights should be pushed up
    grad = X_design[3:4].T @ (p - y[3:4])
    assert grad[0] < 0  # negative gradient on the intercept → intercept increases after the update

def test_fit_converges():
    w = fit_logistic(X_design, y, lr=0.6, epochs=1500, verbose=False)
    p = sigmoid(X_design @ w)
    loss = log_loss(y, p)
    assert loss < 0.30
    # The toy data is linearly separable, so probabilities drift toward 0/1 as training
    # continues; only guard against numerical blow-up rather than demanding moderate values.
    assert np.all((p > 1e-4) & (p < 1 - 1e-4))

def test_metrics_at_threshold():
    w = fit_logistic(X_design, y, lr=0.6, epochs=1500, verbose=False)
    p = sigmoid(X_design @ w)
    cm = confusion_matrix_at_threshold(y, p, 0.5)
    prec, rec, f1 = precision_recall_f1(cm)
    assert f1 > 0.65

def test_roc_and_ece():
    w = fit_logistic(X_design, y, lr=0.6, epochs=1500, verbose=False)
    p = sigmoid(X_design @ w)
    points = roc_points(y, p, thresholds=[0.3, 0.5, 0.7])
    assert len(points) == 3
    ece = expected_calibration_error(y, p, n_bins=4)
    assert 0.0 <= ece <= 0.25  # realistic on n=8

def test_matches_sklearn():
    # Integration-style: our implementation should be close to sklearn on identical data.
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(penalty=None, solver='lbfgs', max_iter=2000, fit_intercept=False)
    clf.fit(X_design, y)
    w_ours = fit_logistic(X_design, y, lr=0.6, epochs=2000, verbose=False)
    assert np.allclose(w_ours, clf.coef_[0], atol=0.15)

Run with pytest tests/test_logistic.py -q --tb=line. All tests should pass on every machine, exactly as the 0.667 scorer was locked in the earlier testing chapter.

Symptom → Cause → Fix table (the debugging checklist)

  • Symptom: all predicted probabilities hover around 0.4–0.6. Likely cause: features are weak, the data is tiny, or gradient descent has not converged. Fix: add better features (text embeddings later), increase epochs, try a larger learning rate, or switch to a stronger model.
  • Symptom: F1 at threshold 0.5 is mediocre but recall at 0.3 is excellent. Likely cause: class imbalance plus the wrong operating point. Fix: read the ROC or cost-sensitive curve; pick the threshold by expected business cost, not by accuracy.
  • Symptom: great AUC (0.91) but ECE = 0.22. Likely cause: the model ranks well but its probabilities are over-confident. Fix: do not show raw p to users; add Platt scaling or temperature scaling before deployment (a minimal temperature-scaling sketch follows below).
  • Symptom: your NumPy weights differ from sklearn by more than 0.3. Likely cause: forgot the intercept column, used different regularization, or a feature-scaling mismatch. Fix: compare design matrices, set penalty=None / C=1e9, and verify you pass the exact same X.
  • Symptom: the loss decreases, then suddenly increases. Likely cause: the learning rate is too large, or sigmoid saturation is causing numerical issues. Fix: lower lr by 2–5×, clip the logit z inside sigmoid, or use a more reliable solver.
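The over-confidence row points at recalibration. As one concrete and deliberately minimal option, temperature scaling fits a single scalar T and divides the logits by it; the sketch below is an assumption about how you might wire it up, reusing this chapter's sigmoid, log_loss, and expected_calibration_error functions and SciPy's bounded scalar minimizer.

python
from scipy.optimize import minimize_scalar

def fit_temperature(z_val, y_val):
    # Find T > 0 that minimizes log-loss of sigmoid(z / T) on a validation split.
    result = minimize_scalar(lambda T: log_loss(y_val, sigmoid(z_val / T)),
                             bounds=(0.05, 20.0), method="bounded")
    return result.x

z = X @ w                   # raw logits from the fitted model
T = fit_temperature(z, y)   # in a real project: fit on validation data, not the training set
p_scaled = sigmoid(z / T)
print("T =", round(T, 2),
      "ECE after scaling:", round(expected_calibration_error(y, p_scaled, n_bins=4), 3))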

Try it yourself

These exercises turn the chapter from reading into mastery. Do them with the exact eight-row table above.

  1. Start with w = [0,0,0] and compute the gradient vector on the full batch by hand for the first step (lr = 0.5). Verify it matches the first printed line of your NumPy run.
  2. Change the learning rate to 2.0 and watch the loss explode or oscillate. Explain why in one sentence.
  3. Implement predict_with_threshold(p, t) and compute_confusion_matrix(y, pred) as pure functions. Sweep thresholds [0.2, 0.35, 0.5, 0.65, 0.8] and print a table of precision / recall / F1 for each. Which threshold gives the highest F1 on this toy set?
  4. Manually bin the eight predicted probabilities into four equal-width bins, compute the ECE by hand, and compare with your expected_calibration_error function.
  5. Replace one label (flip T8 from 1 to 0) and retrain. How does the ROC curve and ECE change? What does this tell you about sensitivity to labeling errors in real support queues?

If any answer feels hand-wavy, print the actual arrays, the exact loss numbers, and the concrete cells of the confusion matrix. That is how you become dangerous with classification.

What you've built and where it leads

You can now:

  • Turn any binary classification problem into a design matrix and fit logistic weights with gradient descent using only NumPy.
  • Derive and implement the (p − y)·x gradient yourself.
  • Tune a decision threshold with a real cost model in mind.
  • Compute and interpret every standard classification metric and draw an ROC curve.
  • Draw a reliability diagram and compute ECE so you know whether your probabilities are honest.
  • Write deterministic pytest guards that will still pass when the dataset grows to 50 000 tickets.

These are the exact skills used in every production classifier at every AI company.

The F1 of 0.75 and the ECE of 0.07 you just measured were computed on the same eight tickets the model saw during training. How do you know any of it will hold up on tomorrow's new support tickets? The very next lesson (decision trees) will show a completely different family of models on the identical task, while the cross-validation chapter that follows soon after teaches the validation discipline required to trust any of these numbers in production. You now own logistic regression end-to-end. The rest of the curriculum will reuse these ideas constantly.

Evaluation Rubric
  1. Computes sigmoid, log-loss, and the (p − y)·x gradient by hand on a 3-row support-ticket example with concrete probabilities and loss numbers.
  2. Implements a complete NumPy logistic regression trainer with gradient descent, a threshold-tunable predict function, and all core classification metrics plus ECE; compares weights and scores to LogisticRegression + calibration_curve.
  3. Diagnoses over-confident probabilities, threshold mismatch on imbalanced data, and poor calibration as distinct failure modes; shows how to read an ROC curve, pick an operating point by business cost, and interpret an ECE number.
Common Pitfalls
  • Treating the default 0.5 threshold as sacred on imbalanced ticket data; a model that always predicts 0.3 on the positive class will still have high accuracy but zero recall.
  • Reporting only accuracy or only F1 at 0.5 when the business cost of false negatives is 5–10× higher than false positives.
  • Plotting an ROC curve but never looking at the actual probabilities or drawing the reliability diagram; good ranking does not imply calibrated probabilities.
  • Forgetting the intercept column when building the design matrix, which forces the decision boundary through the origin in feature space.
  • Comparing your from-scratch weights directly to sklearn without standardizing features or checking the regularization (C=1e10 for no reg) and solver.

Key Concepts Tested
sigmoid activation · log loss / binary cross-entropy · gradient of logistic loss · decision threshold tuning · confusion matrix (TP/FP/TN/FN) · precision, recall, F1-score · ROC curve and AUC · reliability diagram / calibration plot · Expected Calibration Error (ECE) · NumPy from-scratch implementation · scikit-learn comparison · imbalanced classification pitfalls
Next Step
Next: Continue to Decision Trees, Random Forests & Gradient Boosting from Scratch

There you will see how tree-based models naturally handle the same support-ticket urgency task, produce feature importances, and often require different calibration strategies than logistic regression.

References

1. Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.

2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.

3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.

4. Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR.
