LeetLLM
© 2026 LeetLLM. All rights reserved.


Neural Networks from Scratch

Trace a delivery-risk network from one neuron to a batched NumPy forward pass, then diagnose activation, shape, scale, and numerical-stability failures.

Step 15 of 155 in the full curriculum

The previous lesson asked whether an observed model improvement was convincing. This lesson moves one step earlier in the pipeline: where does a model score come from?

Imagine a delivery operations team. Orders arrive with numerical signals such as route disruption and warehouse backlog. Before anyone can measure accuracy or compare models, a network must transform those signals into a delay-risk score that can trigger manual review.

Earlier math lessons gave you the notation:

| Object | In this lesson |
| --- | --- |
| vector $x$ | features for one shipment |
| matrix $W$ | weights connecting all inputs to all units in a layer |
| batch matrix $X$ | several shipments sent through the same network at once |

We will build only the forward pass: input to score. Training will eventually choose the weights from labeled examples. Evaluation will tell you whether trained scores deserve trust.

Core idea: A dense neural network alternates affine maps, W @ x + b, with non-linear activations. The matrices move information; the activations keep many layers from collapsing into one linear rule.

One neuron produces one score

A neuron is the smallest useful scoring unit. For feature vector $x$, weights $w$, and bias $b$, it computes:

$$z = w^\top x + b$$

The intermediate value $z$ is a logit or pre-activation. Apply a function such as the sigmoid to map it into the open interval (0, 1):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The resulting number is a convenient score. It becomes a probability claim only after the model has been trained for that purpose and checked on held-out data.

Consider two inputs on a zero-to-ten operational scale:

| Feature | Value | Chosen weight | Contribution |
| --- | --- | --- | --- |
| Route disruption score | 6 | 0.7 | 4.2 |
| Warehouse backlog score | 8 | 0.3 | 2.4 |

With bias -4.0, the logit is 4.2 + 2.4 - 4.0 = 2.6; sigmoid maps it to 0.931. These are fixed demonstration parameters, not learned evidence that the order truly has a 93.1% delay probability.

one-neuron-score.py

```python
import numpy as np

x = np.array([6.0, 8.0])  # disruption, backlog
w = np.array([0.7, 0.3])
b = -4.0

z = float(x @ w + b)
score = 1.0 / (1.0 + np.exp(-z))

print(f"logit: {z:.3f}")
print(f"sigmoid score: {score:.3f}")
```

Output

```
logit: 2.600
sigmoid score: 0.931
```
[Figure: Weighted-edge diagram for one delay-risk neuron: disruption 6 and backlog 8 contribute 4.2 and 2.4, bias subtracts 4.0, and the resulting logit 2.6 lands at 0.931 on a sigmoid curve.]
Edge width and labels expose each weighted contribution. The marked sigmoid point is a score from chosen parameters, not calibration evidence.

A bias shifts the intervention threshold without changing feature values. Keep the same weights and compare a quieter order with the high-risk order:

bias-shifts-threshold.py

```python
import numpy as np

w = np.array([0.7, 0.3])
b = -4.0
orders = {
    "quiet route": np.array([2.0, 3.0]),
    "disrupted route": np.array([6.0, 8.0]),
}

for name, x in orders.items():
    z = float(x @ w + b)
    score = 1.0 / (1.0 + np.exp(-z))
    print(f"{name:15s} logit={z:5.2f} score={score:.3f}")
```

Output

```
quiet route     logit=-1.70 score=0.154
disrupted route logit= 2.60 score=0.931
```
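The listing above holds the bias fixed at -4.0. As a minimal sketch of the threshold shift itself, the snippet below scores the same two orders under three bias values. The biases -6.0 and -2.0, the 0.5 review cutoff, and the helper name `sigmoid_score` are assumptions chosen for illustration, not part of the lesson's model.

```python
import numpy as np

w = np.array([0.7, 0.3])
orders = {
    "quiet route": np.array([2.0, 3.0]),
    "disrupted route": np.array([6.0, 8.0]),
}
CUTOFF = 0.5  # hypothetical review cutoff; sigmoid equals 0.5 at logit 0


def sigmoid_score(x: np.ndarray, b: float) -> float:
    """Score one order with the shared weights and the given bias."""
    z = float(x @ w + b)
    return float(1.0 / (1.0 + np.exp(-z)))


bias_sweep = {}
for b in (-6.0, -4.0, -2.0):  # -4.0 is the lesson's bias; the others are assumed
    bias_sweep[b] = {name: sigmoid_score(x, b) for name, x in orders.items()}
    for name, score in bias_sweep[b].items():
        queue = "review" if score >= CUTOFF else "monitor"
        print(f"b={b:4.1f} {name:15s} score={score:.3f} -> {queue}")
```

With a 0.5 cutoff, an order is flagged exactly when its weighted sum reaches -b. Raising the bias from -4.0 to -2.0 lowers that requirement from 4.0 to 2.0, so the quiet route (weighted sum 2.3) crosses the cutoff only at the highest bias, even though its features never changed.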

The weighted-sum pattern reaches back to early perceptron models.[1] Modern networks replace a single thresholding unit with many differentiable layers, but the local calculation remains recognizable.

Non-linearity is what makes depth useful

Suppose two layers contain only weighted sums and biases:

$$h = W_1x + b_1$$

$$y = W_2h + b_2$$

Substitute the first equation into the second:

$$y = (W_2W_1)x + (W_2b_1 + b_2)$$

Two affine layers have collapsed into one affine layer. The same is true for twenty layers. A non-linear activation placed between layers prevents that collapse, letting a network represent curved or piecewise decision boundaries.[2]
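The twenty-layer claim can be checked with a short sketch. The layer width, the random seed, and the weight scales below are assumptions picked only to make the run repeatable; any affine stack folds the same way.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_layers, width = 20, 3         # assumed sizes for illustration

weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(n_layers)]
biases = [rng.normal(scale=0.1, size=width) for _ in range(n_layers)]

# Fold the stack into one matrix and one bias, one layer at a time:
# W_k (M x + c) + b_k = (W_k M) x + (W_k c + b_k).
merged_w = np.eye(width)
merged_b = np.zeros(width)
for w, b in zip(weights, biases):
    merged_w = w @ merged_w
    merged_b = w @ merged_b + b

# Run the same input through all twenty layers with no activation.
x = np.array([2.0, 1.0, -1.0])
layered = x
for w, b in zip(weights, biases):
    layered = w @ layered + b

collapsed = merged_w @ x + merged_b
print("twenty affine layers equal one:", np.allclose(layered, collapsed))
```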

Three functions matter here:

| Activation | Formula | Useful intuition |
| --- | --- | --- |
| Sigmoid | $\sigma(z) = 1/(1+e^{-z})$ | Maps a final binary-classification logit to a score in (0, 1) |
| ReLU | $\max(0, z)$ | Passes positive evidence and clips negative pre-activations to zero |
| GELU | $z\Phi(z)$ | Smoothly gates values using a Gaussian cumulative distribution |

ReLU became a practical hidden-layer choice because its positive side doesn't saturate like sigmoid; its negative side can still produce inactive units.[3] GELU is a smooth alternative that weights inputs by their values rather than gating only by sign.[4]
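"Saturate" here means that the slope of the activation shrinks toward zero as the input grows. A small sketch makes that visible, using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and the fact that ReLU has slope 1 for positive inputs and 0 for negative ones. The probe points are arbitrary assumptions.

```python
import numpy as np

z = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])  # assumed probe points

sigmoid = 1.0 / (1.0 + np.exp(-z))
sigmoid_slope = sigmoid * (1.0 - sigmoid)  # derivative of the sigmoid
relu_slope = (z > 0).astype(float)         # 1 on the positive side, 0 otherwise

for value, s, r in zip(z, sigmoid_slope, relu_slope):
    print(f"z={value:5.1f} sigmoid slope={s:.4f} relu slope={r:.1f}")
```

The sigmoid slope never exceeds 0.25 and is close to zero at both extremes, while the ReLU slope stays at 1 for every positive input. The zero slope on ReLU's negative side is the "inactive unit" behavior mentioned above.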

[Figure: Linear, ReLU, sigmoid, and GELU curves showing how non-linear activations bend stacked layers away from one straight-line transform.]
The bends, not the names, are the point: ReLU clips negatives, sigmoid and GELU curve smoothly, and the stack stops acting like one affine transform.

Compute the bends rather than memorizing their names:

activation-values.py

```python
from math import erf, sqrt
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.0, 2.0])
relu = np.maximum(0.0, z)
sigmoid = 1.0 / (1.0 + np.exp(-z))
gelu = np.array([value * 0.5 * (1.0 + erf(value / sqrt(2.0))) for value in z])

for value, r, s, g in zip(z, relu, sigmoid, gelu):
    print(f"z={value:4.1f} relu={r:5.3f} sigmoid={s:5.3f} gelu={g:6.3f}")
```

Output

```
z=-2.0 relu=0.000 sigmoid=0.119 gelu=-0.046
z=-0.5 relu=0.000 sigmoid=0.378 gelu=-0.154
z= 0.0 relu=0.000 sigmoid=0.500 gelu= 0.000
z= 1.0 relu=1.000 sigmoid=0.731 gelu= 0.841
z= 2.0 relu=2.000 sigmoid=0.881 gelu= 1.954
```

Here is the collapse as executable evidence. Without ReLU, two layers match one merged affine map exactly. With ReLU inserted, they no longer match.

activation-prevents-collapse.py

```python
import numpy as np

x = np.array([2.0, 1.0])
w1 = np.array([[1.0, -2.0], [0.5, 1.0]])
b1 = np.array([-0.5, 0.25])
w2 = np.array([[0.4, -0.8]])
b2 = np.array([0.1])

two_affine = w2 @ (w1 @ x + b1) + b2
merged_affine = (w2 @ w1) @ x + (w2 @ b1 + b2)
with_relu = w2 @ np.maximum(0.0, w1 @ x + b1) + b2

print("affine equals merged:", np.allclose(two_affine, merged_affine))
print(f"two affine output: {float(two_affine[0]):.3f}")
print(f"with ReLU output: {float(with_relu[0]):.3f}")
```

Output

```
affine equals merged: True
two affine output: -1.900
with ReLU output: -1.700
```

The first hidden pre-activation is negative in this trace. ReLU clips that value to zero, so the activated network no longer matches the merged affine map.

A network is many neurons arranged as layers

One neuron emits one score. A dense layer emits a vector of scores by giving each output unit its own row of weights:

$$z = Wx + b$$

For three input features and four hidden units, W has shape (4, 3), x has shape (3,), and z has shape (4,). Matrix multiplication computes all four neurons at once.
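As a minimal sketch of that shape rule, the snippet below compares one matrix multiplication against a loop that evaluates each neuron separately. The weight, bias, and input values are arbitrary placeholders, not parameters from the delivery example.

```python
import numpy as np

# Hypothetical weights for 3 inputs and 4 hidden units; values are arbitrary.
W = np.array([
    [0.2, -0.1, 0.4],
    [0.0, 0.3, -0.2],
    [0.5, 0.1, 0.1],
    [-0.3, 0.2, 0.6],
])
b = np.array([0.1, 0.0, -0.2, 0.05])
x = np.array([6.0, 8.0, 3.0])

z = W @ x + b  # all four neurons in one call

# Same result, one neuron (one row of W) at a time.
z_by_neuron = np.array([row @ x + bias for row, bias in zip(W, b)])

print("W:", W.shape, "x:", x.shape, "z:", z.shape)
print("matches per-neuron loop:", np.allclose(z, z_by_neuron))
```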

[Figure: Inputs feed four hidden units; ReLU drops one before the output score.]
The negative hidden unit is gated off before output.

Trace a tiny forward pass

Now use a two-feature, two-hidden-unit network. Its parameters are deliberately small enough to calculate by hand:

$$W_1 = \begin{bmatrix} 0.3 & -0.1 \\ 0.1 & 0.6 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0.0 \\ 0.0 \end{bmatrix}$$

For x = [2.0, 5.0], the hidden pre-activation is:

$$z_1 = W_1x + b_1 = [0.1, 3.2]$$

Both entries are positive, so ReLU leaves them unchanged: h = [0.1, 3.2]. The output parameters are w_2 = [0.4, 0.1] and b_2 = -0.5, giving:

$$z_2 = 0.4(0.1) + 0.1(3.2) - 0.5 = -0.14$$

$$\text{score} = \sigma(-0.14) \approx 0.465$$

[Figure: Node-level trace of a 2-2-1 neural network: inputs 2 and 5 travel across four labeled weights, produce ReLU hidden values 0.10 and 3.20, then contribute 0.04 and 0.32 before bias gives logit -0.14 and sigmoid score 0.465.]
Each edge keeps its own weight and contribution visible. Once this single-example graph is clear, batching performs the same operations for several rows at once.

trace-one-forward-pass.py

```python
import numpy as np

x = np.array([2.0, 5.0])
w1 = np.array([[0.3, -0.1], [0.1, 0.6]])
b1 = np.array([0.0, 0.0])
w2 = np.array([0.4, 0.1])
b2 = -0.5

z1 = w1 @ x + b1
h = np.maximum(0.0, z1)
z2 = float(w2 @ h + b2)
score = 1.0 / (1.0 + np.exp(-z2))

print("hidden pre-activation:", z1)
print("hidden after ReLU: ", h)
print(f"output logit: {z2:.3f}")
print(f"sigmoid score: {score:.3f}")
```

Output

```
hidden pre-activation: [0.1 3.2]
hidden after ReLU:  [0.1 3.2]
output logit: -0.140
sigmoid score: 0.465
```

Batch several orders through the same network

Production code rarely scores one order at a time. Put one order in each row of X. With row-major batches, compute X @ W1.T because each row of W1 describes one hidden unit.

batch-forward-pass.py

```python
import numpy as np

x = np.array([
    [2.0, 5.0],
    [6.0, 8.0],
    [1.0, 1.0],
    [8.0, 2.0],
])
w1 = np.array([[0.3, -0.1], [0.1, 0.6]])
b1 = np.array([0.0, 0.0])
w2 = np.array([0.4, 0.1])
b2 = -0.5

h = np.maximum(0.0, x @ w1.T + b1)
logits = h @ w2 + b2
scores = 1.0 / (1.0 + np.exp(-logits))

print("input shape: ", x.shape)
print("hidden shape:", h.shape)
print("score shape: ", scores.shape)
print("scores:", np.round(scores, 3))
```

Output

```
input shape:  (4, 2)
hidden shape: (4, 2)
score shape:  (4,)
scores: [0.465 0.608 0.413 0.641]
```

Shape checks aren't busywork. An accidental transpose can either fail loudly or silently attach the wrong meaning to a dimension.
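Both outcomes can be reproduced in a few lines. The sketch below reuses the lesson's square `W1`, where a forgotten transpose still runs, and adds a hypothetical three-unit layer `W_wide` (not part of the lesson's network) where the same mistake raises an error.

```python
import numpy as np

X = np.array([[2.0, 5.0], [6.0, 8.0]])    # two orders, two features
W1 = np.array([[0.3, -0.1], [0.1, 0.6]])  # square: (2 units, 2 features)

correct = X @ W1.T  # each row of W1 is one hidden unit
wrong = X @ W1      # runs without error, but mixes up units and features
print("same shape:", correct.shape == wrong.shape)
print("same values:", np.allclose(correct, wrong))

# Hypothetical non-square layer: 3 hidden units over 2 features.
W_wide = np.array([[0.3, -0.1], [0.1, 0.6], [0.2, 0.2]])
try:
    X @ W_wide  # (2, 2) @ (3, 2): inner dimensions 2 and 3 disagree
    failed_loudly = False
except ValueError as error:
    failed_loudly = True
    print("shape error:", type(error).__name__)
```

A square weight matrix is the dangerous case: the shapes agree either way, so only a check against known values (such as the hand-traced `[0.1, 3.2]` for the first order) exposes the mistake.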

Parameters are stored decisions

Each weight and bias is one parameter. A dense layer with n_in inputs and n_out outputs has:

$$n_{\text{out}} \times n_{\text{in}} + n_{\text{out}}$$

parameters: one weight per connection plus one bias per output unit.

For our 2 -> 2 -> 1 network:

| Layer | Weights | Biases | Parameters |
| --- | --- | --- | --- |
| Input to hidden | 2 x 2 = 4 | 2 | 6 |
| Hidden to output | 2 x 1 = 2 | 1 | 3 |
| Total | | | 9 |

count-dense-parameters.py

```python
def dense_parameters(n_in: int, n_out: int) -> int:
    if n_in <= 0 or n_out <= 0:
        raise ValueError("dense dimensions must be positive")
    return n_in * n_out + n_out

tiny = dense_parameters(2, 2) + dense_parameters(2, 1)
image_classifier = (
    dense_parameters(784, 256)
    + dense_parameters(256, 128)
    + dense_parameters(128, 10)
)

print("2 -> 2 -> 1 parameters:", tiny)
print("784 -> 256 -> 128 -> 10 parameters:", f"{image_classifier:,}")
try:
    dense_parameters(0, 2)
except ValueError as error:
    print(error)
```

Output

```
2 -> 2 -> 1 parameters: 9
784 -> 256 -> 128 -> 10 parameters: 235,146
dense dimensions must be positive
```

During training, labels and an optimization procedure adjust these parameters. At inference time, the network applies the stored parameters to new input values. New prompt tokens, retrieved context, or sensor readings are inputs; they aren't new weights.

Failure case: scales can overpower meaning

Suppose you replace route disruption with order value in dollars and feed raw values into a neuron:

$$z = 0.3 \cdot \text{backlog} + 0.3 \cdot \text{order value} - 2$$

Equal coefficients don't mean equal influence when one feature ranges from 0 to 10 and another ranges from 10 to 1000. A large measurement scale can dominate an initial weighted sum and make optimization harder.

feature-scale-failure.py

```python
import numpy as np

raw_order = np.array([8.0, 900.0])  # backlog score, dollars
raw_contributions = 0.3 * raw_order

mean = np.array([5.0, 500.0])
scale = np.array([2.0, 200.0])
standardized_order = (raw_order - mean) / scale
scaled_contributions = 0.3 * standardized_order

print("raw contributions: ", raw_contributions)
print("standardized features: ", standardized_order)
print("scaled contributions: ", np.round(scaled_contributions, 2))
```

Output

```
raw contributions:  [  2.4 270. ]
standardized features:  [1.5 2. ]
scaled contributions:  [0.45 0.6 ]
```

This doesn't prove standardization is always required. It gives you a diagnostic question: do input magnitudes reflect meaning, or merely units such as dollars, milliseconds, and counts?

Failure case: naive sigmoid can overflow

Forward-pass code also needs numerical care. The formula 1 / (1 + exp(-z)) is harmless for z = 2.6, but for a very negative logit, exp(-z) can exceed floating-point range.
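A short sketch shows what that failure looks like in practice. It assumes NumPy's default floating-point error settings, under which overflow produces a `RuntimeWarning` rather than an exception; the logit -1000.0 is an arbitrary extreme value.

```python
import math
import warnings

import numpy as np

z = -1000.0

# Pure Python: math.exp raises when the result exceeds float range.
try:
    naive_python = 1.0 / (1.0 + math.exp(-z))
except OverflowError as error:
    naive_python = None
    print("math.exp failed:", error)

# NumPy: np.exp returns inf and emits a RuntimeWarning instead of raising.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    naive_numpy = 1.0 / (1.0 + np.exp(np.float64(-z)))

print("numpy result:", naive_numpy)
print("warnings:", [w.category.__name__ for w in caught])
```

The NumPy path still lands on 0.0 because dividing by infinity gives zero, but it gets there through an overflowed intermediate and a warning that can hide real problems in logs.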

Use an algebraically equivalent branch:

$$\sigma(z) = \begin{cases} 1/(1+e^{-z}) & z \ge 0 \\ e^z/(1+e^z) & z < 0 \end{cases}$$

stable-sigmoid.py

```python
from math import exp

def stable_sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

for logit in [-1000.0, -2.0, 0.0, 2.0, 1000.0]:
    print(f"logit={logit:7.1f} score={stable_sigmoid(logit):.6f}")
```

Output

```
logit=-1000.0 score=0.000000
logit=   -2.0 score=0.119203
logit=    0.0 score=0.500000
logit=    2.0 score=0.880797
logit= 1000.0 score=1.000000
```
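The branch is not the only stable form. One NumPy-only alternative, offered here as a sketch and not as the lesson's method, uses `np.logaddexp` to compute $\log(1 + e^{-z})$ without overflow and then exponentiates the negative of that value. The function name `sigmoid_logaddexp` is hypothetical.

```python
import numpy as np


def sigmoid_logaddexp(z) -> np.ndarray:
    """Sigmoid via sigma(z) = exp(-log(1 + exp(-z)))."""
    z = np.asarray(z, dtype=float)
    # np.logaddexp(0, -z) evaluates log(exp(0) + exp(-z)) stably.
    return np.exp(-np.logaddexp(0.0, -z))


logits = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
scores = sigmoid_logaddexp(logits)
print(np.round(scores, 6))
```

This version works on whole arrays without masks, at the cost of being less obvious to read than the two-branch formula.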

Build an inspectable delay-risk report

This final forward pass combines five engineering habits:

  1. Standardize heterogeneous inputs with saved statistics.
  2. Assert shapes before multiplying.
  3. Apply a hidden non-linearity.
  4. Use a numerically stable output activation.
  5. Present scores as scores until validation supports probability language.

The parameters below are still illustrative. The point is an auditable computation path, not a trained delivery model.

delay-risk-report.py

```python
import numpy as np

def stable_sigmoid(values: np.ndarray) -> np.ndarray:
    scores = np.empty_like(values, dtype=float)
    nonnegative = values >= 0
    scores[nonnegative] = 1.0 / (1.0 + np.exp(-values[nonnegative]))
    exp_values = np.exp(values[~nonnegative])
    scores[~nonnegative] = exp_values / (1.0 + exp_values)
    return scores

feature_names = ["route disruption", "backlog", "weather severity"]
order_ids = ["A-104", "B-208", "C-311"]
raw = np.array([
    [2.0, 3.0, 1.0],
    [8.0, 7.0, 9.0],
    [5.0, 9.0, 4.0],
])
mean = np.array([5.0, 5.0, 5.0])
scale = np.array([2.0, 2.0, 2.0])
x = (raw - mean) / scale

w1 = np.array([
    [0.8, 0.3, 0.4],
    [-0.2, 0.9, 0.5],
    [0.4, -0.1, 0.7],
])
b1 = np.array([0.0, -0.2, 0.1])
w2 = np.array([0.9, 0.7, 0.6])
b2 = -0.3

assert x.shape[1] == len(feature_names) == w1.shape[1]
assert w1.shape[0] == b1.shape[0] == w2.shape[0]

hidden = np.maximum(0.0, x @ w1.T + b1)
logits = hidden @ w2 + b2
scores = stable_sigmoid(logits)

print("Delay-risk scoring report")
for order_id, score in zip(order_ids, scores):
    queue = "review" if score >= 0.60 else "monitor"
    print(f"{order_id}: score={score:.3f} -> {queue}")
```

Output

```
Delay-risk scoring report
A-104: score=0.426 -> monitor
B-208: score=0.981 -> review
C-311: score=0.732 -> review
```
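The report hard-codes its mean and scale. The first habit, standardizing with saved statistics, could look like the sketch below: fit the statistics once on historical data, keep them with the weights, and reuse them unchanged for every new order. The historical rows, the new order, and the helper name `standardize` are all hypothetical.

```python
import numpy as np

# Hypothetical historical orders: route disruption, backlog, weather severity.
train_raw = np.array([
    [2.0, 3.0, 1.0],
    [8.0, 7.0, 9.0],
    [5.0, 9.0, 4.0],
    [4.0, 2.0, 6.0],
])

# Fit once on training data, then store alongside the weights.
saved_mean = train_raw.mean(axis=0)
saved_scale = train_raw.std(axis=0)
saved_scale = np.where(saved_scale == 0.0, 1.0, saved_scale)  # guard constant columns


def standardize(raw: np.ndarray, mean: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Apply previously saved statistics; always returns a 2-D batch."""
    raw = np.atleast_2d(raw)
    assert raw.shape[1] == mean.shape[0] == scale.shape[0]
    return (raw - mean) / scale


# At inference time, reuse the saved statistics; do not refit on the new order.
new_order = np.array([9.0, 1.0, 5.0])
x_new = standardize(new_order, saved_mean, saved_scale)
print("saved mean:", saved_mean)
print("standardized new order:", np.round(x_new, 3))
```

Recomputing the mean and scale from each incoming batch would make the same raw order map to different inputs on different days, which breaks the auditable path the report is meant to provide.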

What approximation theorems leave open

A classic result shows that a feed-forward network with one sufficiently wide hidden layer and suitable non-linear activation can approximate broad classes of target functions arbitrarily well.[5][6]

That is a statement about representation: useful parameters can exist. It doesn't say:

  • gradient-based training will find them;
  • a finite dataset identifies the right behavior;
  • a model is calibrated or safe to deploy;
  • one architecture is efficient for images, sequences, or language.

Those remaining questions motivate the rest of the curriculum: convolution adds structure for spatial data, training turns parameters into learned behavior, and evaluation tests whether behavior generalizes.
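The representation claim can be illustrated, though not proven, with a small sketch. It assumes a sine target on a fixed grid, a single hidden ReLU layer with random untrained input weights, and output weights chosen by least squares instead of gradient training. All of those choices are assumptions made for illustration and are not taken from the cited theorems.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
target = np.sin(x)  # assumed target function

# One hidden ReLU layer with random, untrained input weights and biases.
max_width = 64
w_hidden = rng.normal(size=max_width)
b_hidden = rng.normal(size=max_width)
features = np.maximum(0.0, np.outer(x, w_hidden) + b_hidden)  # shape (200, 64)

errors = {}
for width in (2, 8, 64):
    # Keep the first `width` hidden units and add a constant column as output bias.
    design = np.column_stack([features[:, :width], np.ones_like(x)])
    w_out, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = design @ w_out - target
    errors[width] = float(np.sqrt(np.mean(residual ** 2)))
    print(f"width={width:3d} rms error on the grid={errors[width]:.4f}")
```

Because each wider model contains the narrower one's hidden units, the least-squares error on these sample points cannot increase with width. That matches the spirit of the theorem, and it also shows its limits: the fit says nothing about points off the grid, about what gradient descent would find, or about calibration.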

Practice

  1. In bias-shifts-threshold.py, change the bias from -4.0 to -6.0. Explain why both orders receive lower scores even though their inputs don't change.
  2. Extend batch-forward-pass.py with a third feature and update W1. Write each expected shape before executing the code.
  3. Remove np.maximum from the hidden layer in delay-risk-report.py. Derive the single merged affine map and verify its logits match the two-layer version without ReLU.
  4. Add a second output for lost package risk. Decide whether independent sigmoid scores or a softmax category fits the operational question, and state the assumption.

Mastery check

Key concepts

  • weighted sum, bias, logit, and sigmoid score
  • affine maps and dense-layer shapes
  • ReLU and GELU non-linearities
  • single-example and batched forward passes
  • parameter counting
  • feature scaling and stable sigmoid implementation
  • representation versus training and calibration

Evaluation rubric

  • Foundational: Computes one neuron's logit and sigmoid score, then explains why the score isn't automatically a probability.
  • Intermediate: Traces a dense ReLU forward pass for one example and a batch while stating each tensor shape.
  • Advanced: Demonstrates linear-layer collapse, counts parameters, and diagnoses scaling or numerical-stability failures before deployment.

Common pitfalls

  • Symptom: A hand-selected sigmoid output is reported as a delay probability. Cause: output range was confused with calibration. Fix: call it a score until held-out calibration supports probability language.
  • Symptom: Adding more affine layers doesn't improve representational power. Cause: no hidden non-linearity was applied. Fix: insert and trace an activation such as ReLU or GELU.
  • Symptom: Dollar-valued features swamp smaller operational scores. Cause: coefficients were compared without checking feature units. Fix: inspect distributions and scale features when the model contract requires it.
  • Symptom: Scoring extreme logits crashes or produces warnings. Cause: naive sigmoid exponentiation overflowed. Fix: use a numerically stable branch or library primitive.


Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. For one order, x = [4.0, 5.0], w = [0.5, 0.4], and b = -2.8. A neuron computes z = x @ w + b and then applies a sigmoid. Which statement is correct?
2. An engineer removes the hidden activation from a dense block: h = W1 @ x + b1 and y = W2 @ h + b2. W1 has shape (3, 2), b1 has shape (3,), W2 has shape (1, 3), and b2 is a scalar. Which single affine expression computes the same y for every x?
3. A network uses x = [2.0, 5.0], W1 = [[0.3, -0.1], [0.1, 0.6]], b1 = [0, 0], w2 = [0.4, 0.1], and b2 = -0.5. Which forward trace is correct?
4. A row-major batch X has shape (5, 3): five orders and three features per order. First-layer weights W1 have shape (4, 3), with one row per hidden unit. Which expression computes the hidden pre-activations before adding b1 in the usual batch layout?
5. A dense network has architecture 3 -> 4 -> 2, with a bias for every non-input unit. How many trainable parameters does it contain?
6. A raw neuron uses z = 0.3 * backlog + 0.3 * order_value - 2, where backlog is on a 0 to 10 scale and order value is measured in dollars. For backlog = 8 and order_value = 900, what diagnostic does the forward pass reveal?
7. For an extreme negative logit z = -1000, a naive sigmoid 1 / (1 + exp(-z)) can overflow. Which branch computes an algebraically equivalent sigmoid without evaluating exp(1000)?
8. A feed-forward network has one sufficiently wide hidden layer and a suitable nonlinear activation. An engineer cites a universal approximation result. Which conclusion is justified by that result alone?


Next Step
Continue to Convolutional Neural Networks from Scratch

Dense forward-pass intuition now becomes a spatial model for images, where shared kernels preserve local structure.

References

[1] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6).

[2] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning.

[3] Nair, V., Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.

[4] Hendrycks, D., Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).

[5] Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5).

[6] Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems, 2(4).