Trace a delivery-risk network from one neuron to a batched NumPy forward pass, then diagnose activation, shape, scale, and numerical-stability failures.
The previous lesson asked whether an observed model improvement was convincing. This lesson moves one step earlier in the pipeline: where does a model score come from?
Imagine a delivery operations team. Orders arrive with numerical signals such as route disruption and warehouse backlog. Before anyone can measure accuracy or compare models, a network must transform those signals into a delay-risk score that can trigger manual review.
Earlier math lessons gave you the notation:
| Object | In this lesson |
|---|---|
| vector x | features for one shipment |
| matrix W | weights connecting all inputs to all units in a layer |
| batch matrix X | several shipments sent through the same network at once |
We will build only the forward pass: input to score. Training will eventually choose the weights from labeled examples. Evaluation will tell you whether trained scores deserve trust.
Core idea: A dense neural network alternates affine maps, W @ x + b, with non-linear activations. The matrices move information; the activations keep many layers from collapsing into one linear rule.
A neuron is the smallest useful scoring unit. For feature vector $x$, weights $w$, and bias $b$, it computes:

$$z = w \cdot x + b = \sum_i w_i x_i + b$$

The intermediate value $z$ is a logit or pre-activation. Apply a function such as the sigmoid to map it into the open interval (0, 1):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The resulting number is a convenient score. It becomes a probability claim only after the model has been trained for that purpose and checked on held-out data.
Consider two inputs on a zero-to-ten operational scale:
| Feature | Value | Chosen weight | Contribution |
|---|---|---|---|
| Route disruption score | 6 | 0.7 | 4.2 |
| Warehouse backlog score | 8 | 0.3 | 2.4 |
With bias -4.0, the logit is 4.2 + 2.4 - 4.0 = 2.6; sigmoid maps it to 0.931. These are fixed demonstration parameters, not learned evidence that the order truly has a 93.1% delay probability.

    import numpy as np

    x = np.array([6.0, 8.0])  # disruption, backlog
    w = np.array([0.7, 0.3])
    b = -4.0

    z = float(x @ w + b)                # weighted sum plus bias (the logit)
    score = 1.0 / (1.0 + np.exp(-z))    # sigmoid maps the logit into (0, 1)

    print(f"logit: {z:.3f}")
    print(f"sigmoid score: {score:.3f}")

Output:

    logit: 2.600
    sigmoid score: 0.931

A bias shifts the intervention threshold without changing feature values. Keep the same weights and compare a quieter order with the high-risk order:

    import numpy as np

    w = np.array([0.7, 0.3])
    b = -4.0
    orders = {
        "quiet route": np.array([2.0, 3.0]),
        "disrupted route": np.array([6.0, 8.0]),
    }

    # Same weights and bias for every order; only the inputs differ.
    for name, x in orders.items():
        z = float(x @ w + b)
        score = 1.0 / (1.0 + np.exp(-z))
        print(f"{name:15s} logit={z:5.2f} score={score:.3f}")

Output:

    quiet route     logit=-1.70 score=0.154
    disrupted route logit= 2.60 score=0.931

The weighted-sum pattern reaches back to early perceptron models.[1] Modern networks replace a single thresholding unit with many differentiable layers, but the local calculation remains recognizable.
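To watch the bias act as a threshold shift, the sketch below holds the disrupted route's features and the weights fixed and varies only the bias. The alternative biases -6.0 and -8.0 are hypothetical values chosen for illustration, and the 0.60 cut-off is borrowed from the scoring report at the end of this lesson.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.7, 0.3])
x = np.array([6.0, 8.0])        # the disrupted route from the trace above
threshold = 0.60                # review cut-off, assumed for illustration

results = {}
for b in (-4.0, -6.0, -8.0):    # hypothetical alternative biases
    z = float(x @ w + b)        # the weighted sum 6.6 never changes; only b does
    score = float(sigmoid(z))
    results[b] = (z, score)
    queue = "review" if score >= threshold else "monitor"
    print(f"bias={b:5.1f} logit={z:5.2f} score={score:.3f} -> {queue}")
```

The features and weights contribute the same 6.6 every time. A more negative bias simply demands more evidence before the score crosses the cut-off, which is why the last setting moves the same order from review to monitor.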
Suppose two layers contain only weighted sums and biases:

$$h = W_1 x + b_1, \qquad y = W_2 h + b_2$$

Substitute the first equation into the second:

$$y = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$$

Two affine layers have collapsed into one affine layer, with weight matrix $W_2 W_1$ and bias $W_2 b_1 + b_2$. The same is true for twenty layers. A non-linear activation placed between layers prevents that collapse, letting a network represent curved or piecewise decision boundaries.[2]
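The twenty-layer claim can be checked directly. The following is a minimal sketch with hypothetical random parameters (the seed, width, and scale are arbitrary choices, not values from this lesson): it runs twenty activation-free layers one by one, folds them into a single weight matrix and bias, and compares the two results.

```python
import numpy as np

rng = np.random.default_rng(0)    # seed chosen only for repeatability
n_layers, width = 20, 3           # hypothetical sizes

# Small random weights keep the stacked output in a comfortable numeric range.
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(n_layers)]
biases = [rng.normal(scale=0.5, size=width) for _ in range(n_layers)]

x = np.array([2.0, -1.0, 0.5])

# Layer-by-layer forward pass with no activation in between.
stacked = x
for W, b in zip(weights, biases):
    stacked = W @ stacked + b

# Fold every layer into one pair (W_total, b_total) using the substitution rule.
W_total = np.eye(width)
b_total = np.zeros(width)
for W, b in zip(weights, biases):
    W_total = W @ W_total
    b_total = W @ b_total + b

merged = W_total @ x + b_total
print("twenty affine layers equal one:", np.allclose(stacked, merged))
```

Depth alone adds parameters here, not expressive power: the merged map has one 3 x 3 matrix and one bias vector regardless of how many layers were stacked.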
Three functions matter here:
| Activation | Formula | Useful intuition |
|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | Maps a final binary-classification logit to a score in (0, 1) |
| ReLU | $\max(0, z)$ | Passes positive evidence and clips negative pre-activations to zero |
| GELU | $z\,\Phi(z)$, where $\Phi$ is the standard normal CDF | Smoothly gates values using a Gaussian cumulative distribution |
ReLU became a practical hidden-layer choice because its positive side doesn't saturate like sigmoid; its negative side can still produce inactive units.[3] GELU is a smooth alternative that weights inputs by their values rather than gating only by sign.[4]
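One way to put a number on "saturate" is to compare slopes. The sketch below evaluates the derivative of sigmoid, $\sigma(z)(1 - \sigma(z))$, and the slope of ReLU at a few arbitrary sample logits. Treating ReLU's slope at exactly zero as 0 is a convention assumed here, not something this lesson specifies.

```python
import numpy as np

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])   # arbitrary sample logits

sigmoid = 1.0 / (1.0 + np.exp(-z))
sigmoid_slope = sigmoid * (1.0 - sigmoid)   # derivative of sigmoid with respect to z
relu_slope = (z > 0).astype(float)          # slope of max(0, z); 0 at z == 0 by convention

for value, s, r in zip(z, sigmoid_slope, relu_slope):
    print(f"z={value:5.1f} sigmoid slope={s:.4f} relu slope={r:.1f}")
```

The sigmoid slope peaks at 0.25 and shrinks toward zero on both sides, while the ReLU slope stays at 1 for every positive input and is 0 for negative ones, which is the inactive-unit case mentioned above.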
Compute the bends rather than memorizing their names:

    from math import erf, sqrt
    import numpy as np

    z = np.array([-2.0, -0.5, 0.0, 1.0, 2.0])
    relu = np.maximum(0.0, z)
    sigmoid = 1.0 / (1.0 + np.exp(-z))
    # Exact GELU: z times the standard normal CDF, written with erf.
    gelu = np.array([value * 0.5 * (1.0 + erf(value / sqrt(2.0))) for value in z])

    for value, r, s, g in zip(z, relu, sigmoid, gelu):
        print(f"z={value:4.1f} relu={r:5.3f} sigmoid={s:5.3f} gelu={g:6.3f}")

Output:

    z=-2.0 relu=0.000 sigmoid=0.119 gelu=-0.046
    z=-0.5 relu=0.000 sigmoid=0.378 gelu=-0.154
    z= 0.0 relu=0.000 sigmoid=0.500 gelu= 0.000
    z= 1.0 relu=1.000 sigmoid=0.731 gelu= 0.841
    z= 2.0 relu=2.000 sigmoid=0.881 gelu= 1.954

Here is the collapse as executable evidence. Without ReLU, two layers match one merged affine map exactly. With ReLU inserted, they no longer match.

    import numpy as np

    x = np.array([2.0, 1.0])
    w1 = np.array([[1.0, -2.0], [0.5, 1.0]])
    b1 = np.array([-0.5, 0.25])
    w2 = np.array([[0.4, -0.8]])
    b2 = np.array([0.1])

    two_affine = w2 @ (w1 @ x + b1) + b2                  # two layers, no activation
    merged_affine = (w2 @ w1) @ x + (w2 @ b1 + b2)        # one equivalent affine map
    with_relu = w2 @ np.maximum(0.0, w1 @ x + b1) + b2    # ReLU between the layers

    print("affine equals merged:", np.allclose(two_affine, merged_affine))
    print(f"two affine output: {float(two_affine[0]):.3f}")
    print(f"with ReLU output: {float(with_relu[0]):.3f}")

Output:

    affine equals merged: True
    two affine output: -1.900
    with ReLU output: -1.700

The first hidden pre-activation is negative in this trace. ReLU clips that value to zero, so the activated network no longer matches the merged affine map.
One neuron emits one score. A dense layer emits a vector of scores by giving each output unit its own row of weights:

$$z = W x + b$$

For three input features and four hidden units, W has shape (4, 3), x has shape (3,), and z has shape (4,). Matrix multiplication computes all four neurons at once.
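The shapes in that example can be confirmed in a few lines. This is a small sketch with hypothetical weights and biases (the numbers carry no meaning); it also checks that the matrix product agrees with computing each neuron separately from its own row of W.

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(4, 3) / 10.0   # hypothetical weights: 4 units x 3 features
b = np.array([0.1, 0.0, -0.1, 0.2])                   # hypothetical biases, one per unit
x = np.array([1.0, 2.0, 3.0])                         # one shipment with 3 features

z = W @ x + b

# Each entry of z is one row of W dotted with x, plus that unit's bias.
z_by_neuron = np.array([W[i] @ x + b[i] for i in range(W.shape[0])])

print("W:", W.shape, "x:", x.shape, "z:", z.shape)
print("matches per-neuron loop:", np.allclose(z, z_by_neuron))
```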
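Now use a two-feature, two-hidden-unit network. Its parameters are deliberately small enough to calculate by hand:

$$W_1 = \begin{bmatrix} 0.3 & -0.1 \\ 0.1 & 0.6 \end{bmatrix}, \qquad b_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

For x = [2.0, 5.0], the hidden pre-activation is:

$$z_1 = W_1 x + b_1 = \begin{bmatrix} 0.3(2.0) - 0.1(5.0) \\ 0.1(2.0) + 0.6(5.0) \end{bmatrix} = \begin{bmatrix} 0.1 \\ 3.2 \end{bmatrix}$$

Both entries are positive, so ReLU leaves them unchanged: h = [0.1, 3.2]. The output parameters are w_2 = [0.4, 0.1] and b_2 = -0.5, giving:

$$z_2 = w_2 \cdot h + b_2 = 0.4(0.1) + 0.1(3.2) - 0.5 = -0.14, \qquad \sigma(z_2) \approx 0.465$$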

    import numpy as np

    x = np.array([2.0, 5.0])
    w1 = np.array([[0.3, -0.1], [0.1, 0.6]])
    b1 = np.array([0.0, 0.0])
    w2 = np.array([0.4, 0.1])
    b2 = -0.5

    z1 = w1 @ x + b1                      # hidden pre-activation
    h = np.maximum(0.0, z1)               # ReLU
    z2 = float(w2 @ h + b2)               # output logit
    score = 1.0 / (1.0 + np.exp(-z2))     # sigmoid score

    print("hidden pre-activation:", z1)
    print("hidden after ReLU:    ", h)
    print(f"output logit: {z2:.3f}")
    print(f"sigmoid score: {score:.3f}")

Output:

    hidden pre-activation: [0.1 3.2]
    hidden after ReLU:     [0.1 3.2]
    output logit: -0.140
    sigmoid score: 0.465

Production code rarely scores one order at a time. Put one order in each row of X. With row-major batches, compute X @ W1.T because each row of W1 describes one hidden unit.

    import numpy as np

    # One order per row: shape (4 orders, 2 features).
    x = np.array([
        [2.0, 5.0],
        [6.0, 8.0],
        [1.0, 1.0],
        [8.0, 2.0],
    ])
    w1 = np.array([[0.3, -0.1], [0.1, 0.6]])
    b1 = np.array([0.0, 0.0])
    w2 = np.array([0.4, 0.1])
    b2 = -0.5

    h = np.maximum(0.0, x @ w1.T + b1)    # (4, 2): hidden activations per order
    logits = h @ w2 + b2                  # (4,): one logit per order
    scores = 1.0 / (1.0 + np.exp(-logits))

    print("input shape: ", x.shape)
    print("hidden shape:", h.shape)
    print("score shape: ", scores.shape)
    print("scores:", np.round(scores, 3))

Output:

    input shape:  (4, 2)
    hidden shape: (4, 2)
    score shape:  (4,)
    scores: [0.465 0.608 0.413 0.641]

Shape checks aren't busywork. An accidental transpose can either fail loudly or silently attach the wrong meaning to a dimension.
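Both failure modes are easy to reproduce. In the sketch below, the square W1 from this lesson lets a missing transpose slip through with the right shape and the wrong numbers, while a hypothetical three-unit layer (W_wide, invented here for illustration) makes the same slip raise an error.

```python
import numpy as np

X = np.array([[2.0, 5.0], [6.0, 8.0]])        # two orders, two features
W1 = np.array([[0.3, -0.1], [0.1, 0.6]])      # square: 2 hidden units x 2 features
b1 = np.zeros(2)

correct = X @ W1.T + b1
wrong = X @ W1 + b1                           # missing transpose, still shape (2, 2)

print("same shape: ", correct.shape == wrong.shape)
print("same values:", np.allclose(correct, wrong))

# Hypothetical non-square layer: 3 hidden units x 2 features.
W_wide = np.array([[0.3, -0.1], [0.1, 0.6], [0.2, 0.2]])
try:
    X @ W_wide                                # inner dimensions 2 and 3 do not match
    failed_loudly = False
except ValueError:
    failed_loudly = True
print("non-square slip fails loudly:", failed_loudly)
```

The silent case is the dangerous one: nothing in the shapes reveals that feature weights were applied along the wrong axis, so a check against a hand-computed row is the safer guard.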
Each weight and bias is one parameter. A dense layer with n_in inputs and n_out outputs has

$$n_{\text{in}} \times n_{\text{out}} + n_{\text{out}}$$

parameters: one weight per connection plus one bias per output unit.
For our 2 -> 2 -> 1 network:
| Layer | Weights | Biases | Parameters |
|---|---|---|---|
| Input to hidden | 2 x 2 = 4 | 2 | 6 |
| Hidden to output | 2 x 1 = 2 | 1 | 3 |
| Total | | | 9 |

    def dense_parameters(n_in: int, n_out: int) -> int:
        """Count weights plus biases for one dense layer."""
        if n_in <= 0 or n_out <= 0:
            raise ValueError("dense dimensions must be positive")
        return n_in * n_out + n_out

    tiny = dense_parameters(2, 2) + dense_parameters(2, 1)
    image_classifier = (
        dense_parameters(784, 256)
        + dense_parameters(256, 128)
        + dense_parameters(128, 10)
    )

    print("2 -> 2 -> 1 parameters:", tiny)
    print("784 -> 256 -> 128 -> 10 parameters:", f"{image_classifier:,}")
    try:
        dense_parameters(0, 2)
    except ValueError as error:
        print(error)

Output:

    2 -> 2 -> 1 parameters: 9
    784 -> 256 -> 128 -> 10 parameters: 235,146
    dense dimensions must be positive

During training, labels and an optimization procedure adjust these parameters. At inference time, the network applies the stored parameters to new input values. New prompt tokens, retrieved context, or sensor readings are inputs; they aren't new weights.
Suppose you replace route disruption with order value in dollars and feed raw values into a neuron:

$$z = 0.3\,x_{\text{backlog}} + 0.3\,x_{\text{dollars}} + b$$
Equal coefficients don't mean equal influence when one feature ranges from 0 to 10 and another ranges from 10 to 1000. A large measurement scale can dominate an initial weighted sum and make optimization harder.

    import numpy as np

    raw_order = np.array([8.0, 900.0])  # backlog score, dollars
    raw_contributions = 0.3 * raw_order

    # Illustrative per-feature statistics used to standardize the inputs.
    mean = np.array([5.0, 500.0])
    scale = np.array([2.0, 200.0])
    standardized_order = (raw_order - mean) / scale
    scaled_contributions = 0.3 * standardized_order

    print("raw contributions:     ", raw_contributions)
    print("standardized features: ", standardized_order)
    print("scaled contributions:  ", np.round(scaled_contributions, 2))

Output:

    raw contributions:      [  2.4 270. ]
    standardized features:  [1.5 2. ]
    scaled contributions:   [0.45 0.6 ]

This doesn't prove standardization is always required. It gives you a diagnostic question: do input magnitudes reflect meaning, or merely units such as dollars, milliseconds, and counts?
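The mean and scale vectors above are written by hand. A common practice, assumed here rather than stated in this lesson, is to estimate them once from training rows and reuse the stored values for every new order. The training matrix below is hypothetical and exists only to make the sketch runnable.

```python
import numpy as np

# Hypothetical training rows: backlog score, order value in dollars.
train = np.array([
    [3.0, 120.0],
    [5.0, 480.0],
    [7.0, 900.0],
    [4.0, 300.0],
])

mean = train.mean(axis=0)
scale = train.std(axis=0)
scale = np.where(scale == 0.0, 1.0, scale)    # guard against constant columns

def standardize(rows: np.ndarray) -> np.ndarray:
    """Apply the stored training statistics; new orders never refit them."""
    return (rows - mean) / scale

new_order = np.array([[8.0, 900.0]])
print("train means after scaling:", np.round(standardize(train).mean(axis=0), 6))
print("train stds after scaling: ", np.round(standardize(train).std(axis=0), 6))
print("new order, standardized:  ", np.round(standardize(new_order), 2))
```

Keeping the statistics fixed means the same raw order always maps to the same standardized input, so a score can be reproduced and audited later.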
Forward-pass code also needs numerical care. The formula 1 / (1 + exp(-z)) is harmless for z = 2.6, but for a very negative logit, exp(-z) can exceed floating-point range.
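The failure is easy to trigger. The sketch below uses a hypothetical helper, naive_sigmoid, that applies the textbook formula unchanged. With Python's math.exp the overflow raises OverflowError; NumPy instead returns inf with a RuntimeWarning by default, so np.errstate is used to turn that warning into an exception.

```python
import math
import numpy as np

def naive_sigmoid(z: float) -> float:
    """Textbook formula with no protection against large exponents."""
    return 1.0 / (1.0 + math.exp(-z))

print(f"naive at z=2.6: {naive_sigmoid(2.6):.3f}")

try:
    naive_sigmoid(-1000.0)            # needs exp(1000), beyond float64 range
    scalar_overflowed = False
except OverflowError:
    scalar_overflowed = True
print("math.exp overflowed:", scalar_overflowed)

# errstate(over="raise") makes NumPy raise instead of quietly returning inf.
try:
    with np.errstate(over="raise"):
        np.exp(np.array([1000.0]))
    array_overflowed = False
except FloatingPointError:
    array_overflowed = True
print("np.exp overflowed:", array_overflowed)
```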
Use an algebraically equivalent branch:

    from math import exp

    def stable_sigmoid(z: float) -> float:
        # For z >= 0, exp(-z) is at most 1, so it cannot overflow.
        if z >= 0:
            return 1.0 / (1.0 + exp(-z))
        # For z < 0, exp(z) is below 1; this form equals 1 / (1 + exp(-z)).
        ez = exp(z)
        return ez / (1.0 + ez)

    for logit in [-1000.0, -2.0, 0.0, 2.0, 1000.0]:
        print(f"logit={logit:7.1f} score={stable_sigmoid(logit):.6f}")

Output:

    logit=-1000.0 score=0.000000
    logit=   -2.0 score=0.119203
    logit=    0.0 score=0.500000
    logit=    2.0 score=0.880797
    logit= 1000.0 score=1.000000

This final forward pass combines five engineering habits:
The parameters below are still illustrative. The point is an auditable computation path, not a trained delivery model.

    import numpy as np

    def stable_sigmoid(values: np.ndarray) -> np.ndarray:
        """Vectorized sigmoid that only exponentiates non-positive numbers."""
        scores = np.empty_like(values, dtype=float)
        nonnegative = values >= 0
        scores[nonnegative] = 1.0 / (1.0 + np.exp(-values[nonnegative]))
        exp_values = np.exp(values[~nonnegative])
        scores[~nonnegative] = exp_values / (1.0 + exp_values)
        return scores

    feature_names = ["route disruption", "backlog", "weather severity"]
    order_ids = ["A-104", "B-208", "C-311"]
    raw = np.array([
        [2.0, 3.0, 1.0],
        [8.0, 7.0, 9.0],
        [5.0, 9.0, 4.0],
    ])
    # Standardize each feature before it reaches the network.
    mean = np.array([5.0, 5.0, 5.0])
    scale = np.array([2.0, 2.0, 2.0])
    x = (raw - mean) / scale

    w1 = np.array([
        [0.8, 0.3, 0.4],
        [-0.2, 0.9, 0.5],
        [0.4, -0.1, 0.7],
    ])
    b1 = np.array([0.0, -0.2, 0.1])
    w2 = np.array([0.9, 0.7, 0.6])
    b2 = -0.3

    # Shape checks: features line up with w1 columns, hidden units with b1 and w2.
    assert x.shape[1] == len(feature_names) == w1.shape[1]
    assert w1.shape[0] == b1.shape[0] == w2.shape[0]

    hidden = np.maximum(0.0, x @ w1.T + b1)
    logits = hidden @ w2 + b2
    scores = stable_sigmoid(logits)

    print("Delay-risk scoring report")
    for order_id, score in zip(order_ids, scores):
        queue = "review" if score >= 0.60 else "monitor"
        print(f"{order_id}: score={score:.3f} -> {queue}")

Output:

    Delay-risk scoring report
    A-104: score=0.426 -> monitor
    B-208: score=0.981 -> review
    C-311: score=0.732 -> review

A classic result shows that a feed-forward network with one sufficiently wide hidden layer and suitable non-linear activation can approximate broad classes of target functions arbitrarily well.[5][6]
That is a statement about representation: useful parameters can exist. It doesn't say:
Those remaining questions motivate the rest of the curriculum: convolution adds structure for spatial data, training turns parameters into learned behavior, and evaluation tests whether behavior generalizes.
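The representation claim above can be made concrete in one narrow case. The sketch below hand-builds, without any training, a one-hidden-layer ReLU network that interpolates sin on [0, pi] and reports how the worst-case error shrinks as the layer widens. The helper names relu_interpolator and forward, the target function, and the widths are all choices made for this illustration; it is a single constructed example, not the cited theorems.

```python
import numpy as np

def relu_interpolator(target, lo: float, hi: float, width: int):
    """Hand-build a one-hidden-layer ReLU net matching `target` at width + 1 knots."""
    knots = np.linspace(lo, hi, width + 1)
    values = target(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope of each linear piece
    w1 = np.ones(width)                         # hidden weights
    b1 = -knots[:-1]                            # each unit switches on at one knot
    w2 = np.diff(slopes, prepend=0.0)           # change of slope at each knot
    b2 = values[0]                              # output bias: target value at lo
    return w1, b1, w2, b2

def forward(x: np.ndarray, w1, b1, w2, b2) -> np.ndarray:
    hidden = np.maximum(0.0, np.outer(x, w1) + b1)   # shape (n_points, width)
    return hidden @ w2 + b2

grid = np.linspace(0.0, np.pi, 1001)
errors = {}
for width in (2, 8, 32):
    params = relu_interpolator(np.sin, 0.0, np.pi, width)
    errors[width] = float(np.max(np.abs(forward(grid, *params) - np.sin(grid))))
    print(f"width={width:2d} max error={errors[width]:.4f}")
```

The parameters here were computed from the known target, which is exactly what a real problem does not allow. Whether training could find comparable parameters from labeled examples is a separate question.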
1. In one-neuron-score.py, change the bias from -4.0 to -6.0. Explain why both orders receive lower scores even though their inputs don't change.
2. Extend batch-forward-pass.py with a third feature and update W1. Write each expected shape before executing the code.
3. Remove np.maximum from the hidden layer in delay-risk-report.py. Derive the single merged affine map and verify its logits match the two-layer version without ReLU.
4. Add a second output for lost package risk. Decide whether independent sigmoid scores or a softmax category fits the operational question, and state the assumption.
[1] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6).

[2] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning.

[3] Nair, V., Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.

[4] Hendrycks, D., Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).

[5] Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5).

[6] Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems, 2(4).