Trace a CNN over a cracked equipment-panel photo patch: shared kernels, feature-map shapes, pooling, padding failures, and a NumPy-to-PyTorch forward pass.
An engineer uploads a photo of a cracked equipment panel with a narrow fault line across its surface. In the previous lesson, a small neural network accepted already-extracted numbers and produced a score. For a photo, those numbers are still arranged in space: neighboring bright and dark pixels may form the crack, while the same pixels in a different arrangement may mean nothing.
A convolutional neural network (CNN) handles that structure by applying small filters across nearby pixels. The same filter is reused at every position, so it can respond to a useful local pattern wherever it appears in the image. This use of local receptive fields and shared weights is the central CNN idea.[1][2]
Suppose the upload is resized to 224 x 224 RGB pixels. Feeding every pixel into one dense hidden layer of 256 units creates 38,535,424 parameters before the model has even recognized an edge: 38,535,168 weights plus 256 biases. A convolution layer with 16 filters of size 3 x 3 across three color channels uses 448 parameters: 432 weights plus 16 biases.
This comparison is about parameter count, not identical capabilities: a dense layer connects every location independently, while a convolution deliberately assumes that the same local detector should be useful across the image.
```python
# Dense layer: every input value connects to every hidden unit.
height, width, channels, hidden = 224, 224, 3, 256
dense_weights = height * width * channels * hidden
dense_biases = hidden
dense_parameters = dense_weights + dense_biases

# Convolution layer: each filter spans all input channels but only a 3 x 3 window.
filters, kernel = 16, 3
conv_weights = filters * channels * kernel * kernel
conv_biases = filters
conv_parameters = conv_weights + conv_biases

print(f"dense parameters: {dense_parameters:,} ({dense_weights:,} weights + {dense_biases} biases)")
print(f"conv parameters: {conv_parameters:,} ({conv_weights:,} weights + {conv_biases} biases)")
print(f"parameter-count ratio: {dense_parameters / conv_parameters:,.0f}x")
```

```text
dense parameters: 38,535,424 (38,535,168 weights + 256 biases)
conv parameters: 448 (432 weights + 16 biases)
parameter-count ratio: 86,017x
```
Start with one grayscale crop from the equipment-panel photo. Bright values in the center might be reflective surface around a crack:
| Input patch | Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|---|
| Row 1 | 0.05 | 0.08 | 0.06 | 0.07 |
| Row 2 | 0.12 | 0.92 | 0.88 | 0.09 |
| Row 3 | 0.08 | 0.90 | 0.93 | 0.11 |
| Row 4 | 0.06 | 0.07 | 0.09 | 0.05 |
For now, choose a 3 x 3 filter that gives positive weight to a bright horizontal middle line and negative weight above and below it:
| Kernel | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| Row 1 | -0.6 | -0.6 | -0.6 |
| Row 2 | 1.1 | 1.1 | 1.1 |
| Row 3 | -0.6 | -0.6 | -0.6 |
The first output value comes from overlaying the kernel on the upper-left 3 x 3 region, multiplying aligned values, and summing:
```python
import numpy as np

# Upper-left 3 x 3 region of the input patch.
patch = np.array([
    [0.05, 0.08, 0.06],
    [0.12, 0.92, 0.88],
    [0.08, 0.90, 0.93],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

# Multiply aligned values, then sum.
products = patch * kernel
print(f"row contributions: {products.sum(axis=1).round(3).tolist()}")
print(f"first response: {products.sum():.3f}")
```

```text
row contributions: [-0.114, 2.112, -1.146]
first response: 0.852
```

Slide exactly the same kernel one column or row at a time. The output is called a feature map: it records where this chosen pattern responds strongly.
```python
import numpy as np

image = np.array([
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

def convolve_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # "Valid" means no padding: the kernel must fit entirely inside the image.
    out_height = image.shape[0] - kernel.shape[0] + 1
    out_width = image.shape[1] - kernel.shape[1] + 1
    output = np.empty((out_height, out_width))
    for row in range(out_height):
        for col in range(out_width):
            window = image[row:row + kernel.shape[0], col:col + kernel.shape[1]]
            output[row, col] = np.sum(window * kernel)
    return output

feature_map = convolve_valid(image, kernel)
strongest = tuple(int(position) for position in np.unravel_index(np.argmax(feature_map), feature_map.shape))
print(np.round(feature_map, 3))
print("strongest response:", strongest)
```

```text
[[0.852 0.789]
 [0.817 0.874]]
strongest response: (1, 1)
```
Deep-learning libraries usually call this sliding multiply-and-sum a convolution even though the kernel isn't flipped. In signal-processing terminology, this exact operation is cross-correlation.[2]
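The difference between the two operations can be sketched with a small NumPy comparison. The helper names and the asymmetric kernel below are hypothetical, chosen only so that flipping changes the result; this is an illustration, not how any particular library implements its layers.

```python
import numpy as np

def cross_correlate_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel as-is: multiply aligned values and sum."""
    k_height, k_width = kernel.shape
    output = np.empty((image.shape[0] - k_height + 1, image.shape[1] - k_width + 1))
    for row in range(output.shape[0]):
        for col in range(output.shape[1]):
            output[row, col] = np.sum(image[row:row + k_height, col:col + k_width] * kernel)
    return output

def convolve_flipped_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Signal-processing convolution: flip the kernel on both axes first."""
    return cross_correlate_valid(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)

# Hypothetical asymmetric kernel: flipping it swaps the +1 and -1 corners.
asymmetric = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
])

corr = cross_correlate_valid(image, asymmetric)
conv = convolve_flipped_valid(image, asymmetric)
print("cross-correlation:")
print(corr)
print("flipped-kernel convolution:")
print(conv)
print("same result:", np.allclose(corr, conv))
```

The horizontal-line kernel used in this lesson is unchanged by flipping both axes, so for that particular kernel the two operations give the same feature map. For a learned kernel the distinction is mostly a naming convention, since training can learn either orientation.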
This detector is fixed for inspection. During training, kernel weights are adjusted from prediction error rather than supplied by a person. Also note the careful claim: shifting an input pattern tends to shift its convolution response, which is translation equivariance; it isn't automatic invariance to arbitrary movement or deformation.[2]
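A minimal sketch of that equivariance claim, assuming a hypothetical 6 x 6 image containing one short bright horizontal segment: moving the segment one column to the right moves the response peak one column to the right, and the rest of the response map shifts with it.

```python
import numpy as np

def convolve_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Unpadded sliding multiply-and-sum, as in the worked example."""
    k_height, k_width = kernel.shape
    output = np.empty((image.shape[0] - k_height + 1, image.shape[1] - k_width + 1))
    for row in range(output.shape[0]):
        for col in range(output.shape[1]):
            output[row, col] = np.sum(image[row:row + k_height, col:col + k_width] * kernel)
    return output

kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

# Hypothetical input: a bright segment on row 2, columns 1 to 3.
image = np.zeros((6, 6))
image[2, 1:4] = 1.0
# Same segment moved one column right (the wrapped-around column is all zeros).
shifted = np.roll(image, shift=1, axis=1)

response = convolve_valid(image, kernel)
shifted_response = convolve_valid(shifted, kernel)

peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
shifted_peak = tuple(int(i) for i in np.unravel_index(np.argmax(shifted_response), shifted_response.shape))
print("peak before shift:", peak)
print("peak after shift:", shifted_peak)
# Away from the left border, the shifted response is the original response moved one column.
print("response shifted with input:", np.allclose(shifted_response[:, 1:], response[:, :-1]))
```

The response moves with the pattern; it does not stay the same. Producing an unchanged score under movement requires something extra, such as pooling or a later layer that aggregates over positions.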
For one spatial dimension of a standard convolution with dilation 1, output size is:

$$\text{output} = \left\lfloor \frac{N + 2P - K}{S} \right\rfloor + 1$$

where N is input size, K is kernel size, P is padding on each side, and S is stride. The worked example uses N = 4, K = 3, P = 0, and S = 1, giving output width and height 2.
```python
def conv_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    if size <= 0 or kernel <= 0 or stride <= 0 or padding < 0:
        raise ValueError("size, kernel, and stride must be positive; padding cannot be negative")
    if kernel > size + 2 * padding:
        raise ValueError("kernel cannot exceed padded input size")
    # Floor division discards any partial window at the end.
    return (size + 2 * padding - kernel) // stride + 1

# Each entry: (name, input size, kernel size, stride, padding).
configurations = [
    ("worked patch", 4, 3, 1, 0),
    ("same-size patch", 4, 3, 1, 1),
    ("strided photo", 224, 3, 2, 1),
]

for name, size, kernel, stride, padding in configurations:
    output = conv_size(size, kernel, stride, padding)
    print(f"{name}: {size} -> {output}")
```

```text
worked patch: 4 -> 2
same-size patch: 4 -> 4
strided photo: 224 -> 112
```

A color image tensor has channels as well as height and width. Filters looking at RGB input must span all three channels, so one 3 x 3 filter has shape 3 x 3 x 3. A layer with 16 such filters creates 16 output feature maps.
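How a channel-spanning filter produces a single output value can be sketched as follows. The 8 x 8 crop size, the random values, and the seed are assumptions made purely for illustration; only the shapes and the summation across channels matter.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed, illustrative values only
image = rng.random((3, 8, 8))                  # (channels, height, width), hypothetical small RGB crop
filters = rng.standard_normal((16, 3, 3, 3))   # (output channels, input channels, kernel height, kernel width)
biases = np.zeros(16)

out_size = 8 - 3 + 1                           # valid convolution, stride 1
output = np.empty((16, out_size, out_size))
for f in range(16):
    for row in range(out_size):
        for col in range(out_size):
            # The window covers all three channels at once.
            window = image[:, row:row + 3, col:col + 3]
            output[f, row, col] = np.sum(window * filters[f]) + biases[f]

# One output value equals the sum of three per-channel 3 x 3 responses plus the bias.
per_channel = [np.sum(image[c, 0:3, 0:3] * filters[0, c]) for c in range(3)]
channel_sum_matches = bool(np.isclose(sum(per_channel) + biases[0], output[0, 0, 0]))

print("output shape:", output.shape)
print("channel sum matches:", channel_sum_matches)
```

Each filter collapses the three input channels into one number per position, which is why the output has one channel per filter rather than one per input color.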
```python
import numpy as np

image = np.zeros((3, 224, 224))      # (channels, height, width)
filters = np.zeros((16, 3, 3, 3))    # (filters, channels, kernel height, kernel width)
biases = np.zeros(16)

# Each filter must have as many channels as the input.
assert filters.shape[1] == image.shape[0]
print("input shape:", image.shape)
print("one filter shape:", filters[0].shape)
print("output channels:", filters.shape[0])
print("parameters including biases:", filters.size + biases.size)
```

```text
input shape: (3, 224, 224)
one filter shape: (3, 3, 3)
output channels: 16
parameters including biases: 448
```

One response in the first feature map sees only a 3 x 3 patch. Later layers combine nearby responses, so one later value depends on a larger region of the original image. That dependency region is the receptive field.
Keep two numbers while tracing a stack:
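- field: how many original pixels influence one current value.
- jump: how far apart neighboring current values are in original-pixel coordinates.

For a layer with kernel size K and stride S:

$$\text{field}_{\text{new}} = \text{field} + (K - 1) \cdot \text{jump}, \qquad \text{jump}_{\text{new}} = \text{jump} \cdot S$$

The field update uses the jump from before the layer; the jump is updated afterward.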
```python
# Each entry: (name, kernel size, stride).
layers = [
    ("conv 3x3", 3, 1),
    ("conv 3x3", 3, 1),
    ("pool 2x2", 2, 2),
    ("conv 3x3", 3, 1),
]

field, jump = 1, 1
for name, kernel, stride in layers:
    # Update field with the old jump, then update jump.
    field = field + (kernel - 1) * jump
    jump = jump * stride
    print(f"{name:9s} receptive field={field:2d}, jump={jump}")
```

```text
conv 3x3  receptive field= 3, jump=1
conv 3x3  receptive field= 5, jump=1
pool 2x2  receptive field= 6, jump=2
conv 3x3  receptive field=10, jump=2
```

A larger receptive field only says what input can affect a value. It doesn't prove that the model understands cracks, panels, or any other object. That depends on learned weights and training data.
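The receptive-field arithmetic can be cross-checked by perturbation. The sketch below is a one-dimensional stand-in under a stated simplification: every layer, including the pooling layer, is replaced by a sliding sum with all-ones weights, so any input inside the field is guaranteed to change the output. The function names and the input length of 32 are hypothetical.

```python
import numpy as np

def window_sum(signal: np.ndarray, kernel: int, stride: int) -> np.ndarray:
    """1-D unpadded sliding sum: same window geometry as a conv or pooling layer."""
    count = (len(signal) - kernel) // stride + 1
    return np.array([signal[i * stride:i * stride + kernel].sum() for i in range(count)])

def stack(signal: np.ndarray) -> np.ndarray:
    # Same (kernel, stride) sequence as the traced stack: conv, conv, pool, conv.
    for kernel, stride in [(3, 1), (3, 1), (2, 2), (3, 1)]:
        signal = window_sum(signal, kernel, stride)
    return signal

length = 32
baseline = stack(np.zeros(length))
target = len(baseline) // 2   # inspect one output value away from the borders

# Set one input position at a time and record which ones move the target output.
influencers = []
for position in range(length):
    probe = np.zeros(length)
    probe[position] = 1.0
    if stack(probe)[target] != baseline[target]:
        influencers.append(position)

print("output length:", len(baseline))
print("input positions reaching the target:", influencers[0], "to", influencers[-1])
print("measured receptive field:", len(influencers))
```

The measured count matches the field of 10 from the recurrence. With real max pooling and learned weights, the set of inputs that can matter is the same, but how much each one matters depends on the data and the weights.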
A common CNN step is 2 x 2 max pooling: in each small block, keep only the largest activation. It reduces spatial resolution and gives limited tolerance to small local shifts in a strong response.
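The "limited tolerance" can be made concrete with a small sketch, assuming a hypothetical 4 x 4 map that contains a single strong activation. A one-pixel move inside a pooling block leaves the pooled output unchanged; a one-pixel move across a block boundary changes it.

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """Non-overlapping 2 x 2 max pooling for even-sized 2-D maps."""
    height, width = feature_map.shape
    blocks = feature_map.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))

def single_peak(row: int, col: int) -> np.ndarray:
    """Hypothetical 4 x 4 response map with one strong activation."""
    feature_map = np.zeros((4, 4))
    feature_map[row, col] = 1.0
    return feature_map

pooled_col0 = max_pool_2x2(single_peak(0, 0))
pooled_col1 = max_pool_2x2(single_peak(0, 1))  # one pixel right, same 2 x 2 block
pooled_col2 = max_pool_2x2(single_peak(0, 2))  # one more pixel right, next block

unchanged_within_block = np.array_equal(pooled_col0, pooled_col1)
unchanged_across_blocks = np.array_equal(pooled_col1, pooled_col2)
print("shift inside a block leaves output unchanged:", unchanged_within_block)
print("shift across a block boundary leaves output unchanged:", unchanged_across_blocks)
```

Whether a small shift is absorbed therefore depends on where the activation sits relative to the pooling grid, which is why the tolerance is described as limited.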
Consider a larger response map:
| Feature map | Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|---|
| Row 1 | 0.10 | 0.42 | 0.18 | 0.30 |
| Row 2 | 0.25 | 0.91 | 0.44 | 0.12 |
| Row 3 | 0.08 | 0.21 | 0.77 | 0.55 |
| Row 4 | 0.16 | 0.14 | 0.32 | 0.49 |
```python
import numpy as np

feature_map = np.array([
    [0.10, 0.42, 0.18, 0.30],
    [0.25, 0.91, 0.44, 0.12],
    [0.08, 0.21, 0.77, 0.55],
    [0.16, 0.14, 0.32, 0.49],
])

pooled = np.empty((2, 2))
winners = []
for row in range(2):
    for col in range(2):
        # Non-overlapping 2 x 2 block.
        block = feature_map[row * 2:row * 2 + 2, col * 2:col * 2 + 2]
        local_row, local_col = np.unravel_index(np.argmax(block), block.shape)
        pooled[row, col] = block[local_row, local_col]
        # Convert the block-local winner position to feature-map coordinates.
        winners.append((int(row * 2 + local_row), int(col * 2 + local_col)))

print(np.round(pooled, 2))
print("winner coordinates:", winners)
```

```text
[[0.91 0.44]
 [0.21 0.77]]
winner coordinates: [(1, 1), (1, 2), (2, 1), (2, 2)]
```
Why save winner coordinates? In the next lesson, when an error signal travels backward through max pooling, it must return to the input value that won the forward comparison. For now, the forward-pass rule is enough: pool values and retain their locations.
Valid convolution, with no padding, never centers a 3 x 3 filter on an outermost pixel. Evidence near an image border participates in fewer windows than identical evidence near the center. That matters in equipment-panel photos: a crack reaching the edge of a crop can be weakened by preprocessing choices.
```python
import numpy as np

def total_window_response(image: np.ndarray, padding: int) -> float:
    # Zero-pad, then sum the response of an all-ones 3 x 3 kernel over every window.
    padded = np.pad(image, padding)
    kernel = np.ones((3, 3))
    total = 0.0
    for row in range(padded.shape[0] - 2):
        for col in range(padded.shape[1] - 2):
            total += np.sum(padded[row:row + 3, col:col + 3] * kernel)
    return total

# A single bright pixel in the corner versus the same pixel in the center.
edge_signal = np.zeros((5, 5))
edge_signal[0, 0] = 1.0
center_signal = np.zeros((5, 5))
center_signal[2, 2] = 1.0

print(f"valid edge total: {total_window_response(edge_signal, padding=0):.0f}")
print(f"valid center total: {total_window_response(center_signal, padding=0):.0f}")
print(f"padded edge total: {total_window_response(edge_signal, padding=1):.0f}")
```

```text
valid edge total: 1
valid center total: 9
padded edge total: 4
```

Padding doesn't make border context identical to interior context: outside-image pixels still have to be filled somehow. It still keeps more filter placements available near the border, which is why padding policy is part of model debugging rather than a cosmetic setting.
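How the outside-image pixels are filled changes the response at the border. The sketch below compares three `np.pad` modes on a hypothetical 3 x 3 crop whose bright pixels touch the left edge, using the lesson's horizontal-line kernel centered on the border pixel. The crop values are made up for illustration.

```python
import numpy as np

kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

# Hypothetical crop: a bright horizontal streak that reaches the left border.
crop = np.array([
    [0.1, 0.1, 0.1],
    [0.9, 0.5, 0.1],
    [0.1, 0.1, 0.1],
])

responses = {}
for mode in ("constant", "edge", "reflect"):
    padded = np.pad(crop, 1, mode=mode)
    # Crop pixel (1, 0) sits at padded position (2, 1); take the 3 x 3 window around it.
    window = padded[1:4, 0:3]
    responses[mode] = float(np.sum(window * kernel))
    print(f"{mode:8s} fill -> border response {responses[mode]:.2f}")
```

Zero fill treats everything outside the crop as dark, so the streak appears to end at the border; edge replication and reflection assume it continues in some form. None of these is the true outside content, so the choice is a modeling assumption worth checking when border detections look wrong.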
Now assemble the pieces into a tiny forward pass. This model has one chosen convolution filter, a ReLU nonlinearity, max pooling, and a two-score linear classifier. The scores are logits, not probabilities; converting logits into probabilities and training weights comes later.
```python
import numpy as np

image = np.array([
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

def conv_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Sizes are hard-coded for this 4 x 4 image and 3 x 3 kernel.
    output = np.empty((2, 2))
    for row in range(2):
        for col in range(2):
            output[row, col] = np.sum(image[row:row + 3, col:col + 3] * kernel)
    return output

feature_map = conv_valid(image, kernel)
activated = np.maximum(feature_map, 0.0)     # ReLU
pooled = np.array([activated.max()])         # 2 x 2 max pooling over a 2 x 2 map
classifier_weights = np.array([[1.4], [-0.9]])
classifier_bias = np.array([-0.2, 0.1])
logits = classifier_weights @ pooled + classifier_bias

assert feature_map.shape == (2, 2)
assert pooled.shape == (1,)
print("feature map:")
print(np.round(feature_map, 3))
print("pooled activation:", np.round(pooled, 3))
print("logits:", np.round(logits, 3))
```

```text
feature map:
[[0.852 0.789]
 [0.817 0.874]]
pooled activation: [0.874]
logits: [ 1.024 -0.687]
```

PyTorch performs the same operations with standard layers. Here the filter and classifier weights are copied from the NumPy example only so the two implementations can be checked against one another.
```python
import numpy as np
import torch
from torch import nn

# Shape (batch, channels, height, width) = (1, 1, 4, 4).
image = torch.tensor([[
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
]], dtype=torch.float32).unsqueeze(0)

model = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(1, 2),
)

# Overwrite the random initial weights with the values from the NumPy example.
with torch.no_grad():
    model[0].weight.copy_(torch.tensor([[[[-0.6, -0.6, -0.6],
                                          [1.1, 1.1, 1.1],
                                          [-0.6, -0.6, -0.6]]]]))
    model[4].weight.copy_(torch.tensor([[1.4], [-0.9]]))
    model[4].bias.copy_(torch.tensor([-0.2, 0.1]))

logits = model(image).detach().numpy()[0]
numpy_logits = np.array([1.0236, -0.6866])
print("pytorch logits:", np.round(logits, 3))
print("matches NumPy:", np.allclose(logits, numpy_logits, atol=1e-4))
```

```text
pytorch logits: [ 1.024 -0.687]
matches NumPy: True
```

CNNs became practical image models in part because local connections and shared weights made them efficient enough to train for tasks such as handwritten-document recognition in LeNet-style systems.[1] Newer vision architectures may use different starting assumptions: a Vision Transformer, for example, treats fixed-size image patches as a sequence of embedded tokens rather than beginning with sliding convolution filters.[3]
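For orientation only, the patch-sequence idea can be sketched as a reshape. The patch size of 16 is an assumption chosen for illustration, and the learned embedding that turns each flattened patch into a token is deliberately left out; this shows only how a 224 x 224 RGB image becomes a sequence of patches.

```python
import numpy as np

height, width, channels = 224, 224, 3
patch = 16  # assumed patch size, for illustration only

# Use distinct values so the patch layout can be checked.
image = np.arange(height * width * channels, dtype=np.float32).reshape(height, width, channels)

# Split into a grid of non-overlapping patches, then flatten each patch into one row.
grid = image.reshape(height // patch, patch, width // patch, patch, channels)
patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * channels)

print("patch sequence shape:", patches.shape)
# The first row is exactly the upper-left 16 x 16 region, flattened.
print("first patch matches upper-left region:",
      np.array_equal(patches[0], image[:patch, :patch, :].ravel()))
```

Unlike a sliding 3 x 3 filter, these patches do not overlap, and nothing in the reshape itself shares weights across positions.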
That contrast is orientation, not a model-selection rule. The transferable skill is being able to trace what spatial evidence reaches a score, which assumptions the model makes, and which failure mode a shape or padding choice can introduce.