CNNs from Scratch

Trace a CNN over a cracked equipment-panel photo patch: shared kernels, feature-map shapes, pooling, padding failures, and a NumPy-to-PyTorch forward pass.

An engineer uploads a photo of a cracked equipment panel with a narrow fault line across its surface. In the previous lesson, a small neural network accepted already-extracted numbers and produced a score. For a photo, those numbers are still arranged in space: neighboring bright and dark pixels may form the crack, while the same pixels in a different arrangement may mean nothing.

A convolutional neural network (CNN) handles that structure by applying small filters across nearby pixels. The same filter is reused at every position, so it can respond to a useful local pattern wherever it appears in the image. This use of local receptive fields and shared weights is the central CNN idea.[1][2]

Why a photo needs local weights

Suppose the upload is resized to 224 x 224 RGB pixels. Feeding every pixel into one dense hidden layer of 256 units creates 38,535,424 parameters before the model has even recognized an edge: 38,535,168 weights plus 256 biases. A convolution layer with 16 filters of size 3 x 3 across three color channels uses 448 parameters: 432 weights plus 16 biases.

This comparison is about parameter count, not identical capabilities: a dense layer connects every location independently, while a convolution deliberately assumes that the same local detector should be useful across the image.

dense-versus-conv-parameters.py
```python
height, width, channels, hidden = 224, 224, 3, 256
# Dense layer: one weight per (input value, hidden unit) pair, plus one bias per unit.
dense_weights = height * width * channels * hidden
dense_biases = hidden
dense_parameters = dense_weights + dense_biases

filters, kernel = 16, 3
# Convolution layer: each filter stores kernel x kernel weights per input channel.
conv_weights = filters * channels * kernel * kernel
conv_biases = filters
conv_parameters = conv_weights + conv_biases

print(f"dense parameters: {dense_parameters:,} ({dense_weights:,} weights + {dense_biases} biases)")
print(f"conv parameters: {conv_parameters:,} ({conv_weights:,} weights + {conv_biases} biases)")
print(f"parameter-count ratio: {dense_parameters / conv_parameters:,.0f}x")
```
Output
```text
dense parameters: 38,535,424 (38,535,168 weights + 256 biases)
conv parameters: 448 (432 weights + 16 biases)
parameter-count ratio: 86,017x
```
[Figure: CNN output-shape and parameter-savings visual comparing a dense layer on a 224 by 224 RGB support photo with a small 3 by 3 convolution and showing the 4 by 4 to 2 by 2 output-size formula.]
Weight sharing gives every image position access to the same detector without storing a separate set of weights at every position.

Slide one chosen kernel by hand

Start with one grayscale crop from the equipment-panel photo. Bright values in the center might be reflective surface around a crack:

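| Input patch | Column 1 | Column 2 | Column 3 | Column 4 |
| --- | --- | --- | --- | --- |
| Row 1 | 0.05 | 0.08 | 0.06 | 0.07 |
| Row 2 | 0.12 | 0.92 | 0.88 | 0.09 |
| Row 3 | 0.08 | 0.90 | 0.93 | 0.11 |
| Row 4 | 0.06 | 0.07 | 0.09 | 0.05 |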

For now, choose a 3 x 3 filter that gives positive weight to a bright horizontal middle line and negative weight above and below it:

| Kernel | Column 1 | Column 2 | Column 3 |
| --- | --- | --- | --- |
| Row 1 | -0.6 | -0.6 | -0.6 |
| Row 2 | 1.1 | 1.1 | 1.1 |
| Row 3 | -0.6 | -0.6 | -0.6 |

The first output value comes from overlaying the kernel on the upper-left 3 x 3 region, multiplying aligned values, and summing:

$$
\begin{aligned}
y_{0,0} ={}& (-0.6)(0.05 + 0.08 + 0.06) \\
&+ (1.1)(0.12 + 0.92 + 0.88) \\
&+ (-0.6)(0.08 + 0.90 + 0.93) \\
={}& 0.852
\end{aligned}
$$

one-convolution-window.py
```python
import numpy as np

patch = np.array([
    [0.05, 0.08, 0.06],
    [0.12, 0.92, 0.88],
    [0.08, 0.90, 0.93],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

# Multiply aligned values, then sum per kernel row and overall.
products = patch * kernel
print(f"row contributions: {products.sum(axis=1).round(3).tolist()}")
print(f"first response: {products.sum():.3f}")
```
Output
```text
row contributions: [-0.114, 2.112, -1.146]
first response: 0.852
```

Slide exactly the same kernel one column or row at a time. The output is called a feature map: it records where this chosen pattern responds strongly.

slide-the-kernel.py
```python
import numpy as np

image = np.array([
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

def convolve_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # No padding, stride 1: the kernel must fit entirely inside the image.
    out_height = image.shape[0] - kernel.shape[0] + 1
    out_width = image.shape[1] - kernel.shape[1] + 1
    output = np.empty((out_height, out_width))
    for row in range(out_height):
        for col in range(out_width):
            window = image[row:row + kernel.shape[0], col:col + kernel.shape[1]]
            output[row, col] = np.sum(window * kernel)
    return output

feature_map = convolve_valid(image, kernel)
strongest = tuple(int(position) for position in np.unravel_index(np.argmax(feature_map), feature_map.shape))
print(np.round(feature_map, 3))
print("strongest response:", strongest)
```
Output
```text
[[0.852 0.789]
 [0.817 0.874]]
strongest response: (1, 1)
```
[Figure: Chosen convolutional kernel sliding over a 4x4 photo patch with a horizontal line pattern and producing a 2x2 feature map.]
This fixed detector produces high responses where the bright local line matches its pattern. Training will learn useful filters from data instead of receiving them by hand.
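
The explicit loop above can also be expressed as "collect every window, then apply the same kernel to all of them". The following is a minimal sketch of that view, assuming NumPy 1.20 or newer for `sliding_window_view`; the names `windows` and `vectorized_map` are illustrative and not part of the lesson's code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

# windows[i, j] is the 3 x 3 region whose upper-left corner sits at (i, j).
windows = sliding_window_view(image, kernel.shape)

# Broadcast the one shared kernel over every window, then sum each window.
vectorized_map = (windows * kernel).sum(axis=(-2, -1))

print("windows shape:", windows.shape)
print(np.round(vectorized_map, 3))
```

The windows array has one 3 x 3 slice per output position, which makes weight sharing visible: a single kernel is broadcast against all of them.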

Deep-learning libraries usually call this sliding multiply-and-sum a convolution even though the kernel isn't flipped. In signal-processing terminology, this exact operation is cross-correlation.[2]
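
To make the naming difference concrete, here is a small one-dimensional sketch using NumPy's `np.correlate` (no flip) and `np.convolve` (flips the kernel). The signal and the deliberately asymmetric kernel are made-up values chosen only for illustration.

```python
import numpy as np

signal = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0])
kernel = np.array([1.0, 0.0, -1.0])  # asymmetric, so flipping changes the result

# Sliding multiply-and-sum without flipping: what CNN layers compute.
cross_correlation = np.correlate(signal, kernel, mode="valid")

# Signal-processing convolution: the kernel is reversed before sliding.
true_convolution = np.convolve(signal, kernel, mode="valid")

# Cross-correlating with the reversed kernel reproduces true convolution.
flipped_correlation = np.correlate(signal, kernel[::-1], mode="valid")

print("cross-correlation:", cross_correlation.tolist())
print("true convolution: ", true_convolution.tolist())
print("same after flip:  ", np.array_equal(true_convolution, flipped_correlation))
```

Because a trained layer learns its kernel values, the flip makes no practical difference to what the layer can represent; it only matters when comparing against signal-processing code.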

This detector is fixed for inspection. During training, kernel weights are adjusted from prediction error rather than supplied by a person. Also note the careful claim: shifting an input pattern tends to shift its convolution response, which is translation equivariance; it isn't automatic invariance to arbitrary movement or deformation.[2]
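
The equivariance claim can be checked on a toy input. This sketch uses a hypothetical 5 x 7 image containing a short bright horizontal segment, shifts it one column to the right, and compares the two feature maps; the image and the helper are illustrative assumptions, not data from the lesson.

```python
import numpy as np

kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

def convolve_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    out_height = image.shape[0] - kernel.shape[0] + 1
    out_width = image.shape[1] - kernel.shape[1] + 1
    output = np.empty((out_height, out_width))
    for row in range(out_height):
        for col in range(out_width):
            window = image[row:row + kernel.shape[0], col:col + kernel.shape[1]]
            output[row, col] = np.sum(window * kernel)
    return output

# Hypothetical input: a three-pixel bright segment on the middle row.
image = np.zeros((5, 7))
image[2, 1:4] = 1.0

# Move every pixel one column to the right; the new first column is empty.
shifted = np.zeros_like(image)
shifted[:, 1:] = image[:, :-1]

original_response = convolve_valid(image, kernel)
shifted_response = convolve_valid(shifted, kernel)

def peak(response: np.ndarray) -> tuple:
    return tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))

print("original peak:", peak(original_response))
print("shifted peak: ", peak(shifted_response))
# Equivariance: the response map moves with the input.
print("response shifted with input:",
      np.allclose(shifted_response[:, 1:], original_response[:, :-1]))
# Not invariance: the two response maps are not identical.
print("responses identical:", np.allclose(shifted_response, original_response))
```

The peak moves by one column along with the segment, while the two maps themselves differ, which is exactly the distinction between equivariance and invariance.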

Shapes and channels are part of the contract

For one spatial dimension of a standard convolution with dilation 1, output size is:

$$
\left\lfloor \frac{N + 2P - K}{S} \right\rfloor + 1
$$

where N is input size, K is kernel size, P is padding on each side, and S is stride. The worked example uses N = 4, K = 3, P = 0, and S = 1, giving output width and height 2.

convolution-output-size.py
```python
def conv_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    if size <= 0 or kernel <= 0 or stride <= 0 or padding < 0:
        raise ValueError("size, kernel, and stride must be positive; padding cannot be negative")
    if kernel > size + 2 * padding:
        raise ValueError("kernel cannot exceed padded input size")
    # Floor division implements the floor in the output-size formula.
    return (size + 2 * padding - kernel) // stride + 1

configurations = [
    ("worked patch", 4, 3, 1, 0),
    ("same-size patch", 4, 3, 1, 1),
    ("strided photo", 224, 3, 2, 1),
]

for name, size, kernel, stride, padding in configurations:
    output = conv_size(size, kernel, stride, padding)
    print(f"{name}: {size} -> {output}")
```
Output
```text
worked patch: 4 -> 2
same-size patch: 4 -> 4
strided photo: 224 -> 112
```

A color image tensor has channels as well as height and width. Filters looking at RGB input must span all three channels, so one 3 x 3 filter has shape 3 x 3 x 3. A layer with 16 such filters creates 16 output feature maps.

rgb-filter-parameters.py
```python
import numpy as np

image = np.zeros((3, 224, 224))    # channels, height, width
filters = np.zeros((16, 3, 3, 3))  # output channels, input channels, kernel height, kernel width
biases = np.zeros(16)

# Each filter's channel depth must match the input's channel count.
assert filters.shape[1] == image.shape[0]
print("input shape:", image.shape)
print("one filter shape:", filters[0].shape)
print("output channels:", filters.shape[0])
print("parameters including biases:", filters.size + biases.size)
```
Output
```text
input shape: (3, 224, 224)
one filter shape: (3, 3, 3)
output channels: 16
parameters including biases: 448
```
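
The shape bookkeeping above can be exercised with a small multi-channel forward pass. This is a sketch under stated assumptions: the 6 x 6 random input, the two random filters, and the helper name `conv_multichannel` are all made up for illustration, and the loop favors readability over speed.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((3, 6, 6))       # channels, height, width (hypothetical data)
filters = rng.random((2, 3, 3, 3))  # out_channels, in_channels, kernel_h, kernel_w
biases = np.zeros(2)

def conv_multichannel(image: np.ndarray, filters: np.ndarray, biases: np.ndarray) -> np.ndarray:
    out_channels, in_channels, k_h, k_w = filters.shape
    if in_channels != image.shape[0]:
        raise ValueError("filter depth must match the number of input channels")
    out_h = image.shape[1] - k_h + 1
    out_w = image.shape[2] - k_w + 1
    output = np.empty((out_channels, out_h, out_w))
    for f in range(out_channels):
        for row in range(out_h):
            for col in range(out_w):
                # The window spans every input channel at this location.
                window = image[:, row:row + k_h, col:col + k_w]
                output[f, row, col] = np.sum(window * filters[f]) + biases[f]
    return output

feature_maps = conv_multichannel(image, filters, biases)
print("output shape:", feature_maps.shape)

# One output value is the sum of per-channel responses at the same location.
per_channel = sum(np.sum(image[c, :3, :3] * filters[0, c]) for c in range(3))
print("channels summed into one value:", np.isclose(feature_maps[0, 0, 0], per_channel))
```

Each filter collapses all input channels into a single feature map, so the number of output channels equals the number of filters, not the number of input channels.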

Receptive fields grow one layer at a time

One response in the first feature map sees only a 3 x 3 patch. Later layers combine nearby responses, so one later value depends on a larger region of the original image. That dependency region is the receptive field.

Keep two numbers while tracing a stack:

  • field: how many original pixels influence one current value.
  • jump: how far apart neighboring current values are in original-pixel coordinates.

For a layer with kernel size K and stride S:

$$
\text{new field} = \text{field} + (K - 1) \cdot \text{jump}, \qquad \text{new jump} = \text{jump} \cdot S
$$

receptive-field-trace.py
```python
layers = [
    ("conv 3x3", 3, 1),
    ("conv 3x3", 3, 1),
    ("pool 2x2", 2, 2),
    ("conv 3x3", 3, 1),
]

# Start at the input: each value sees one pixel, and neighbors are one pixel apart.
field, jump = 1, 1
for name, kernel, stride in layers:
    field = field + (kernel - 1) * jump
    jump = jump * stride
    print(f"{name:9s} receptive field={field:2d}, jump={jump}")
```
Output
```text
conv 3x3  receptive field= 3, jump=1
conv 3x3  receptive field= 5, jump=1
pool 2x2  receptive field= 6, jump=2
conv 3x3  receptive field=10, jump=2
```

A larger receptive field only says what input can affect a value. It doesn't prove that the model understands cracks, panels, or any other object. That depends on learned weights and training data.
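
The receptive-field arithmetic can be cross-checked empirically for the first two layers of the trace. This sketch probes a 9 x 9 input one pixel at a time and records which pixels can change the center output after two 3 x 3 stride-one convolutions. The all-ones kernel and the helper `conv_valid_ones` are assumptions made so that any influence shows up as a nonzero response.

```python
import numpy as np

def conv_valid_ones(x: np.ndarray) -> np.ndarray:
    """3 x 3 valid convolution with an all-ones kernel (illustrative helper)."""
    out = np.empty((x.shape[0] - 2, x.shape[1] - 2))
    for row in range(out.shape[0]):
        for col in range(out.shape[1]):
            out[row, col] = x[row:row + 3, col:col + 3].sum()
    return out

size = 9
influence = np.zeros((size, size), dtype=bool)
for row in range(size):
    for col in range(size):
        probe = np.zeros((size, size))
        probe[row, col] = 1.0
        # Two stacked layers: 9 x 9 -> 7 x 7 -> 5 x 5.
        response = conv_valid_ones(conv_valid_ones(probe))
        # Does this single input pixel reach the center output value?
        influence[row, col] = response[2, 2] != 0.0

rows, cols = np.nonzero(influence)
height = int(rows.max() - rows.min() + 1)
width = int(cols.max() - cols.min() + 1)
print("pixels influencing the center output:", int(influence.sum()))
print(f"receptive field extent: {height} x {width}")
```

The probe finds a 5 x 5 region, agreeing with the second line of the trace. It shows only which pixels can matter, not how much a trained model relies on them.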

Pooling reduces resolution and records winners

A common CNN step is 2 x 2 max pooling: in each small block, keep only the largest activation. It reduces spatial resolution and gives limited tolerance to small local shifts in a strong response.

Consider a larger response map:

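| Feature map | Column 1 | Column 2 | Column 3 | Column 4 |
| --- | --- | --- | --- | --- |
| Row 1 | 0.10 | 0.42 | 0.18 | 0.30 |
| Row 2 | 0.25 | 0.91 | 0.44 | 0.12 |
| Row 3 | 0.08 | 0.21 | 0.77 | 0.55 |
| Row 4 | 0.16 | 0.14 | 0.32 | 0.49 |
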
max-pooling-winners.py
```python
import numpy as np

feature_map = np.array([
    [0.10, 0.42, 0.18, 0.30],
    [0.25, 0.91, 0.44, 0.12],
    [0.08, 0.21, 0.77, 0.55],
    [0.16, 0.14, 0.32, 0.49],
])

pooled = np.empty((2, 2))
winners = []
for row in range(2):
    for col in range(2):
        # Non-overlapping 2 x 2 block (stride 2).
        block = feature_map[row * 2:row * 2 + 2, col * 2:col * 2 + 2]
        local_row, local_col = np.unravel_index(np.argmax(block), block.shape)
        pooled[row, col] = block[local_row, local_col]
        # Record the winner's zero-based position in the original feature map.
        winners.append((int(row * 2 + local_row), int(col * 2 + local_col)))

print(np.round(pooled, 2))
print("winner coordinates:", winners)
```
Output
```text
[[0.91 0.44]
 [0.21 0.77]]
winner coordinates: [(1, 1), (1, 2), (2, 1), (2, 2)]
```
[Figure: Max pooling forward and backward route showing four 2 by 2 windows reduced to a 2 by 2 output and the backward gradient routed only to each winning maximum location.]
Max pooling keeps the strongest value in each local region. Winner locations become relevant when the next lesson sends an error signal backward.

Why save winner coordinates? In the next lesson, when an error signal travels backward through max pooling, it must return to the input value that won the forward comparison. For now, the forward-pass rule is enough: pool values and retain their locations.
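
As a preview only, the routing idea can be sketched in a few lines. The `upstream` values below are hypothetical placeholders for whatever error signal arrives at each pooled output; how that signal is computed belongs to the next lesson.

```python
import numpy as np

feature_map = np.array([
    [0.10, 0.42, 0.18, 0.30],
    [0.25, 0.91, 0.44, 0.12],
    [0.08, 0.21, 0.77, 0.55],
    [0.16, 0.14, 0.32, 0.49],
])

# Hypothetical error signal: one value per pooled output position.
upstream = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
])

routed = np.zeros_like(feature_map)
for row in range(2):
    for col in range(2):
        block = feature_map[row * 2:row * 2 + 2, col * 2:col * 2 + 2]
        local_row, local_col = np.unravel_index(np.argmax(block), block.shape)
        # Only the forward-pass winner receives this block's signal.
        routed[row * 2 + local_row, col * 2 + local_col] = upstream[row, col]

print(routed)
```

Every non-winning position stays at zero, which is why the forward pass has to remember where each maximum came from.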

Failure case: border evidence gets fewer chances to fire

Valid convolution, with no padding, never centers a 3 x 3 filter on an outermost pixel. Evidence near an image border participates in fewer windows than identical evidence near the center. That matters in equipment-panel photos: a crack reaching the edge of a crop can be weakened by preprocessing choices.

padding-border-evidence.py
```python
import numpy as np

def total_window_response(image: np.ndarray, padding: int) -> float:
    # np.pad fills the border with zeros by default.
    padded = np.pad(image, padding)
    kernel = np.ones((3, 3))
    total = 0.0
    # Sum the response of every 3 x 3 window that fits inside the padded image.
    for row in range(padded.shape[0] - 2):
        for col in range(padded.shape[1] - 2):
            total += np.sum(padded[row:row + 3, col:col + 3] * kernel)
    return total

edge_signal = np.zeros((5, 5))
edge_signal[0, 0] = 1.0
center_signal = np.zeros((5, 5))
center_signal[2, 2] = 1.0

print(f"valid edge total: {total_window_response(edge_signal, padding=0):.0f}")
print(f"valid center total: {total_window_response(center_signal, padding=0):.0f}")
print(f"padded edge total: {total_window_response(edge_signal, padding=1):.0f}")
```
Output
```text
valid edge total: 1
valid center total: 9
padded edge total: 4
```

Padding doesn't make border context identical to interior context: outside-image pixels still have to be filled somehow. It does, however, keep more filter placements available near the border, which is why padding policy is part of model debugging rather than a cosmetic setting.
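
The same effect can be viewed for every pixel at once by counting how many 3 x 3 windows cover each position. The helper name `window_coverage` is an illustrative assumption; it counts placements only and ignores kernel values.

```python
import numpy as np

def window_coverage(size: int, padding: int, kernel: int = 3) -> np.ndarray:
    """Count, per original pixel, how many kernel placements include it."""
    padded_size = size + 2 * padding
    counts = np.zeros((padded_size, padded_size), dtype=int)
    for row in range(padded_size - kernel + 1):
        for col in range(padded_size - kernel + 1):
            counts[row:row + kernel, col:col + kernel] += 1
    # Crop away the padding ring so only real pixels remain.
    return counts[padding:padding + size, padding:padding + size]

valid_coverage = window_coverage(size=5, padding=0)
padded_coverage = window_coverage(size=5, padding=1)

print("valid (no padding):")
print(valid_coverage)
print("padding = 1:")
print(padded_coverage)
```

The corner count rises from 1 to 4 and the center stays at 9, matching the totals printed above: padding narrows the gap between border and interior without closing it.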

Build it: an inspectable CNN forward pass

Now assemble the pieces into a tiny forward pass. This model has one chosen convolution filter, a ReLU nonlinearity, max pooling, and a two-score linear classifier. The scores are logits, not probabilities; converting logits into probabilities and training weights comes later.

cnn-forward-numpy.py
```python
import numpy as np

image = np.array([
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
])
kernel = np.array([
    [-0.6, -0.6, -0.6],
    [1.1, 1.1, 1.1],
    [-0.6, -0.6, -0.6],
])

def conv_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Output size is fixed at 2 x 2 for this 4 x 4 input and 3 x 3 kernel.
    output = np.empty((2, 2))
    for row in range(2):
        for col in range(2):
            output[row, col] = np.sum(image[row:row + 3, col:col + 3] * kernel)
    return output

feature_map = conv_valid(image, kernel)
activated = np.maximum(feature_map, 0.0)  # ReLU
pooled = np.array([activated.max()])      # one 2 x 2 max pool over the whole map
classifier_weights = np.array([[1.4], [-0.9]])
classifier_bias = np.array([-0.2, 0.1])
logits = classifier_weights @ pooled + classifier_bias

assert feature_map.shape == (2, 2)
assert pooled.shape == (1,)
print("feature map:")
print(np.round(feature_map, 3))
print("pooled activation:", np.round(pooled, 3))
print("logits:", np.round(logits, 3))
```
Output
```text
feature map:
[[0.852 0.789]
 [0.817 0.874]]
pooled activation: [0.874]
logits: [ 1.024 -0.687]
```

PyTorch performs the same operations with standard layers. Here the filter and classifier weights are copied from the NumPy example only so the two implementations can be checked against one another.

same-forward-pass-pytorch.py
```python
import numpy as np
import torch
from torch import nn

# Shape (batch, channels, height, width) = (1, 1, 4, 4).
image = torch.tensor([[
    [0.05, 0.08, 0.06, 0.07],
    [0.12, 0.92, 0.88, 0.09],
    [0.08, 0.90, 0.93, 0.11],
    [0.06, 0.07, 0.09, 0.05],
]], dtype=torch.float32).unsqueeze(0)

model = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(1, 2),
)

# Overwrite the random initial weights with the values from the NumPy example.
with torch.no_grad():
    model[0].weight.copy_(torch.tensor([[[[-0.6, -0.6, -0.6],
                                          [1.1, 1.1, 1.1],
                                          [-0.6, -0.6, -0.6]]]]))
    model[4].weight.copy_(torch.tensor([[1.4], [-0.9]]))
    model[4].bias.copy_(torch.tensor([-0.2, 0.1]))

logits = model(image).detach().numpy()[0]
numpy_logits = np.array([1.0236, -0.6866])
print("pytorch logits:", np.round(logits, 3))
print("matches NumPy:", np.allclose(logits, numpy_logits, atol=1e-4))
```
Output
```text
pytorch logits: [ 1.024 -0.687]
matches NumPy: True
```

A short bridge to later vision models

CNNs became practical image models in part because local connections and shared weights made them efficient enough to train for tasks such as handwritten-document recognition in LeNet-style systems.[1] Newer vision architectures may use different starting assumptions: a Vision Transformer, for example, treats fixed-size image patches as a sequence of embedded tokens rather than beginning with sliding convolution filters.[3]

That contrast is orientation, not a model-selection rule. The transferable skill is being able to trace what spatial evidence reaches a score, which assumptions the model makes, and which failure mode a shape or padding choice can introduce.

Mastery check

Key concepts

  • A convolution applies one local filter at many positions, producing a feature map.
  • Weight sharing reduces parameter count and supports translation-equivariant responses.
  • Output shapes depend on kernel size, stride, and padding; channel counts must also agree.
  • Receptive fields grow as layers combine neighboring responses.
  • Max pooling retains strong local activations and must remember winning positions for later training.
  • Padding policy can change how strongly edge-of-image evidence is represented.

Evaluation rubric

  • Foundational: Calculate one convolution response and state why the same kernel is reused across positions.
  • Intermediate: Derive output shapes and parameter counts for grayscale or RGB convolution layers, then trace pooling output.
  • Advanced: Diagnose a border-evidence or receptive-field failure and implement an equivalent NumPy and PyTorch forward pass.

Common pitfalls

  • Symptom: A hand-entered detector is described as learned. Cause: Forward-pass inspection is confused with training. Fix: Call it a chosen kernel until gradients have updated it from data.
  • Symptom: Feature-map dimensions differ from expected dimensions. Cause: Padding or stride was omitted from the shape calculation. Fix: Compute each spatial output with the convolution formula before wiring the next layer.
  • Symptom: An RGB image fails to enter a convolution layer. Cause: Each filter doesn't span all input channels. Fix: Match filter input-channel dimension to the tensor channel dimension.
  • Symptom: A defect near a crop boundary produces a weaker response than the same defect at the center. Cause: Valid convolution gives border pixels fewer filter placements. Fix: Audit crop and padding policy, then compare edge and center responses explicitly.


Mastery Check

1. An upper-left 3 x 3 convolution window originally has response 0.852. Its aligned upper-left input pixel is 0.05 and the kernel weight on that pixel is -0.6. If only that weight changes to 0.0, what is the new response and which input pixel causes the change?
2. An RGB 64 by 64 image enters a convolution layer with eight 5 by 5 filters, stride 2, and padding 2. What output shape and parameter count, including one bias per filter, does the layer have?
3. Using field' = field + (K - 1) * jump and jump' = jump * S, start with field 1 and jump 1. What do you get for two 3 by 3 stride-one convolutions followed by one 3 by 3 stride-two convolution?
4. With a 3 x 3 valid convolution, why can a defect near a crop boundary produce a weaker aggregate response than the same defect near the center, and what does padding change?
5. Apply 2 x 2 max pooling with stride 2 to this feature map, and report zero-based winner coordinates in the original map: [[0.10, 0.42, 0.18, 0.30], [0.25, 0.91, 0.44, 0.12], [0.08, 0.21, 0.77, 0.55], [0.16, 0.14, 0.32, 0.49]]. Which result is retained?
6. A 224 by 224 RGB image enters a layer with sixteen 3 x 3 convolution filters and one bias per filter. Why does the layer have 448 parameters rather than a parameter count that grows with the number of output positions?
7. A hand-entered convolution kernel responds strongly to a bright line. The line shifts one pixel while remaining within positions where the kernel can be applied. Which claim is justified?
8. A convolution produces [[0.852, 0.789], [0.817, 0.874]]. Apply ReLU, one 2 x 2 max pool, then a linear layer with weights [[1.4], [-0.9]] and bias [-0.2, 0.1]. What is the result?

Next Step
Continue to Training & Backpropagation

You can now trace where a CNN score comes from while treating kernel weights as fixed. Next you'll calculate how prediction error changes those weights so useful detectors are learned from data.

References

[1] Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. 1998. https://ieeexplore.ieee.org/document/726791

[2] Goodfellow, I., Bengio, Y., Courville, A. Deep Learning. 2016. https://www.deeplearningbook.org/

[3] Dosovitskiy, A., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020 (ICLR 2021). https://arxiv.org/abs/2010.11929