LeetLLM

NumPy and Tensor Shapes

Learn NumPy shape reasoning from first principles: name axes, predict indexing and broadcasting, reduce safely, distinguish reshape from transpose, and add shape guards.

Two customers send support tickets: "refund request order" and "shipped from hub." A routing model can't work directly with those words, so it receives small lists of numbers that represent each word. If a data-loading step swaps the customer and word positions, later math can fail or, worse, run while attaching predictions to the wrong text.

This is why array shape is your first AI engineering habit. An array is an ordered collection of numbers; a multi-axis array is often called a tensor in machine learning. NumPy calls its array object an ndarray: it holds numerical data in a buffer plus metadata such as shape and strides that tell NumPy how an index reaches that data. Views can reuse the same buffer with different metadata, which makes many operations cheap, but that speed is no excuse to ignore what each axis means.[1]

That habit matters early in AI work. Many beginner bugs are not mysterious math bugs. They are shape bugs:

  • A batch axis disappears and the loss function crashes.
  • A feature vector broadcasts across the wrong dimension and each token receives the wrong offset.
  • A transpose should have happened, but a reshape happened instead and token meaning silently changes.

This chapter uses those two support tickets as one tiny batch. By the end, you should be able to name the axes in an array, predict how indexing or broadcasting changes its shape, reduce the right axis on purpose, choose reshape or transpose for the job, and check an attention-score tensor without guessing.

[Figure: Two support tickets represented as a B by T by D tensor: two ticket rows, three word positions per ticket, and four numerical features for each word, with the request vector highlighted as an example slice.]
The first axis selects a ticket, the second selects a word position, and the third holds four numerical features for that word. Reading `(B, T, D)` as a sentence prevents most shape mistakes.

Your first batch: two support tickets

Use one tiny batch for the whole chapter.

Imagine two short support-ticket phrases:

| Batch item | Tokens (support ticket words) |
|---|---|
| 0 | ["refund", "request", "order"] |
| 1 | ["shipped", "from", "hub"] |

A token is a piece of text a model processes; here we use one word as one token. A feature vector is the list of numbers representing one token. Later chapters call these per-position vectors hidden states. Suppose each token receives four features. The resulting tensor has shape:

    (B, T, D) = (2, 3, 4)

Read that slowly:

| Axis | Meaning | Size |
|---|---|---|
| B | batch, how many tickets | 2 |
| T | token positions per ticket | 3 |
| D | numbers per token | 4 |

That tuple is a contract, not decoration. If you can say the sentence, you can usually debug the code.

The anatomy of a tensor

Don't treat a tensor as a mysterious cube. Our tensor (2, 3, 4) is two ticket trays; each tray contains three token slots; each token slot contains four numerical features. To find one value, you need three addresses. Shape tells you how many choices exist at each address.

| Term | Shape | What it holds |
|---|---|---|
| scalar | () | one number, no axes |
| vector | (4,) | one row of 4 numbers |
| matrix | (3, 4) | 3 rows, each with 4 numbers |
| tensor | (2, 3, 4) | any number of axes, generalizing the same idea |

Keep the words concrete. If D means feature count, each vector on that axis should contain D numbers.
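
As a quick check of the table above, here is a minimal sketch that builds one array of each kind and prints its shape and axis count. The variable names and the `np.arange` fill values are illustrative assumptions, chosen only so that each value is easy to locate.

```python
import numpy as np

# Illustrative arrays, one per row of the table above.
scalar = np.array(7)                      # no axes
vector = np.arange(4)                     # one axis of size 4
matrix = np.arange(12).reshape(3, 4)      # 3 rows, 4 numbers each
tensor = np.arange(24).reshape(2, 3, 4)   # 2 tickets, 3 tokens, 4 features

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, "shape", arr.shape, "ndim", arr.ndim)

# Three addresses (ticket, token, feature) reach exactly one value.
value = tensor[1, 2, 3]
print("tensor[1, 2, 3] =", value)
```

The number of entries in `shape` always equals `ndim`, which is the number of addresses you need to reach a single value.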

Indexing: which axes stay and which disappear

Start with real numbers so the array stops feeling abstract:

    x = [
      [[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]],

      [[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]]
    ]

The shape is still (2, 3, 4), but now you can point at actual token vectors.

| Expression | Plain-English meaning | Result shape |
|---|---|---|
| x[0] | ticket 0, all tokens, all features | (3, 4) |
| x[0, 1] | token 1 in ticket 0 | (4,) |
| x[:, 1] | token 1 from each ticket | (2, 4) |
| x[:, :, 0] | feature 0 from each token in each ticket | (2, 3) |

The pattern is simple:

  1. Write the original shape.
  2. A single number like 0 or 1 chooses one item and usually removes that axis.
  3. A colon : keeps the whole axis.
  4. Write the remaining axes in order.
  5. Check with .shape.

Code lab: inspect one batch

Run this as a tiny script. Notice that it prints values as well as shapes.

code-lab-inspect-one-batch.py

    import numpy as np

    B, T, D = 2, 3, 4
    x = np.arange(B * T * D).reshape(B, T, D)

    print("x shape", x.shape)
    print("x[0] shape", x[0].shape)
    print("x[0,1]", x[0, 1])
    print("x[:,1]")
    print(x[:, 1])

Output

    x shape (2, 3, 4)
    x[0] shape (3, 4)
    x[0,1] [4 5 6 7]
    x[:,1]
    [[ 4  5  6  7]
     [16 17 18 19]]

The useful beginner habit isn't "look at .shape after the fact." It's "predict the shape, then let .shape confirm or correct you."
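
One way to practice that habit is to write the prediction down next to the expression. The sketch below uses a hypothetical helper named `check` (not a NumPy function) that compares a predicted shape with the real one. It also shows why the indexing rule says a single number "usually" removes an axis: a length-1 slice such as `0:1` selects the same item but keeps the axis.

```python
import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # (B, T, D)

def check(label, arr, predicted):
    """Compare a predicted shape with the real one (hypothetical helper)."""
    ok = arr.shape == predicted
    print(f"{label}: predicted {predicted}, actual {arr.shape}, match={ok}")
    return ok

results = [
    check("x[:, :, 0]", x[:, :, 0], (2, 3)),   # integer on D removes D
    check("x[0]", x[0], (3, 4)),               # integer on B removes B
    check("x[0:1]", x[0:1], (1, 3, 4)),        # slice on B keeps B with size 1
    check("x[:, 1:2]", x[:, 1:2], (2, 1, 4)),  # slice on T keeps T with size 1
]
```

If a prediction is wrong, the mismatch points at the axis you misread, which is more useful than discovering the shape after the fact.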

Broadcasting: reuse one feature bias

Now add one feature bias to each token vector:

    bias = [100, 200, 300, 400]

This bias has shape (4,), so it matches the last axis of x.

One token by hand

    x[0, 1] = [  4,   5,   6,   7]
    bias    = [100, 200, 300, 400]
    result  = [104, 205, 306, 407]

NumPy applies that addition to each token in each ticket without a Python loop. You can imagine the bias vector repeating across the batch and token axes, but the repetition is conceptual: broadcasting doesn't need to copy that smaller operand in memory.[2]
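
To see that the repetition really is conceptual, the following sketch uses `np.broadcast_to`, which returns a read-only view stretched to the target shape. The variable name `stretched` is an assumption for illustration; the point is that the stretched axes have a stride of 0, so every ticket and token position reads the same four numbers from the original `bias` buffer.

```python
import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)
bias = np.array([100, 200, 300, 400])

# A read-only view of bias stretched to x's shape; no data is copied.
stretched = np.broadcast_to(bias, x.shape)

print("stretched shape", stretched.shape)
print("stretched strides", stretched.strides)  # first two strides are 0
print("shares memory with bias:", np.shares_memory(stretched, bias))
```

`x + bias` does not call `np.broadcast_to` for you in a visible way, but the alignment it performs follows the same rule.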

What broadcasting keeps

  • the batch axis stays
  • the token axis stays
  • the feature bias repeats across both of those axes

The matching rule is easier than it sounds:

  1. Line shapes up from the right.
  2. Two axes are compatible if they are equal or one of them is 1.
  3. If neither rule holds, NumPy raises an error.
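
The three rules above can be checked without allocating any data. This sketch assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available; the list of candidate shapes is illustrative.

```python
import numpy as np

x_shape = (2, 3, 4)  # (B, T, D)

for other in [(4,), (3, 1), (1, 1, 4), (3,)]:
    try:
        result = np.broadcast_shapes(x_shape, other)
        print(other, "->", result)
    except ValueError:
        # Raised when some aligned pair is neither equal nor 1.
        print(other, "-> incompatible with", x_shape)
```

Only `(3,)` is rejected: aligned from the right, its 3 meets the feature axis of size 4.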

For x.shape == (2, 3, 4):

| Smaller shape | What NumPy does | Meaning |
|---|---|---|
| (4,) | broadcasts across batch and token axes | one bias per feature |
| (3, 1) | broadcasts across batch and feature axes | one scalar per token position |
| (3,) | raises an error | no feature axis matches that 3 |

That last row is important. A shape can be wrong in two ways:

  • it can fail loudly
  • it can succeed but mean the wrong thing

Here is the same rule shown as a shape-alignment diagram:

[Figure: Broadcasting alignment examples showing which smaller shapes line up with a B T D tensor and which one fails at the trailing feature axis.]
Broadcasting starts at the trailing edge. Equal sizes or size-1 axes can align. A bare `(3,)` doesn't mean token axis unless you make that axis placement explicit.

Code lab: broadcast one feature bias

code-lab-broadcast-one-feature-bias.py

    import numpy as np

    B, T, D = 2, 3, 4
    x = np.arange(B * T * D).reshape(B, T, D)
    bias = np.array([100, 200, 300, 400])
    shifted = x + bias

    print("shifted shape", shifted.shape)
    print("shifted[0,1]", shifted[0, 1])

Output

    shifted shape (2, 3, 4)
    shifted[0,1] [104 205 306 407]

If you accidentally try np.array([10, 20, 30]), NumPy raises:

    ValueError: operands could not be broadcast together with shapes (2,3,4) (3,)

That message is useful. It tells you the feature axis was not where you thought it was.

You can test that failure deliberately instead of waiting for it to surprise you:

broadcast-shape-error.py

    import numpy as np

    x = np.arange(2 * 3 * 4).reshape(2, 3, 4)
    token_bias = np.array([10, 20, 30])

    try:
        x + token_bias
    except ValueError as exc:
        print(str(exc))
    else:
        raise RuntimeError("Expected NumPy to reject shape (3,) against (2, 3, 4)")

Output

    operands could not be broadcast together with shapes (2,3,4) (3,)

The (N,) vs (N, 1) trap

A 1D array (5,) is not the same as a column vector (5, 1). This is a common cause of silent broadcasting errors in neural networks.

the-n-vs-n-1-trap.py

    import numpy as np

    v = np.array([1, 2, 3])
    print("v shape", v.shape)
    print("v[:, None] shape", v[:, None].shape)
    print("v[None, :] shape", v[None, :].shape)

Output

    v shape (3,)
    v[:, None] shape (3, 1)
    v[None, :] shape (1, 3)

Broadcasting always aligns a plain (3,) with the trailing axis, so it behaves like a row of shape (1, 3) and never like a column. Adding it to a (3, 1) column therefore doesn't fail; it silently produces a (3, 3) grid. When you need an explicit row or column vector, make the axis visible with v[None, :] or v[:, None].
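
A minimal sketch of how this trap stays silent, and of the explicit fix. The names `grid` and `token_offset` are assumptions for illustration. Adding a `(3,)` array to a `(3, 1)` array does not raise; both operands stretch and the result is a 3 by 3 grid.

```python
import numpy as np

v = np.array([1, 2, 3])

# (3,) + (3, 1) does not fail: both stretch, producing a 3x3 grid.
grid = v + v[:, None]
print("grid shape", grid.shape)
print(grid)

# Placing a per-token offset on the T axis of a (B, T, D) tensor.
x = np.zeros((2, 3, 4))
token_offset = np.array([10, 20, 30])
shifted = x + token_offset[:, None]      # (3, 1) lines up with (T, D)
print("shifted[0, :, 0]", shifted[0, :, 0])
```

Writing `token_offset[:, None]` states the intent in the code itself: the three numbers belong to the token axis, and the size-1 axis is the one allowed to stretch over features.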

Reductions: collapsing the right axis

A reduction answers the question, "which axis am I summarizing away?"

Suppose you want one vector per ticket, not one vector per token. Then average across the token axis:

    x.mean(axis=1)

For easier arithmetic, keep the same (B, T, D) contract but switch to feature values whose average you can check by hand:

Mean ticket 0 by hand

    token 0 = [ 0,  3,  6,  9]
    token 1 = [ 3,  6,  9, 12]
    token 2 = [12,  9,  6,  3]
    mean    = [ 5,  6,  7,  8]

So x.mean(axis=1) should keep batch and feature axes, giving shape (B, D) = (2, 4).

What that reduction keeps

  • the batch axis stays
  • the token axis disappears
  • the feature axis stays

Now compare that with:

    x.mean(axis=-1)

That reduces the feature axis instead. It keeps batch and token axes, so the shape becomes (B, T) = (2, 3).

Code lab: compare two reductions

code-lab-compare-two-reductions.py

    import numpy as np

    B, T, D = 2, 3, 4
    x = np.array([
        [[0, 3, 6, 9], [3, 6, 9, 12], [12, 9, 6, 3]],
        [[1, 4, 7, 10], [4, 7, 10, 13], [13, 10, 7, 4]],
    ])
    pooled_tokens = x.mean(axis=1)
    feature_means = x.mean(axis=-1)

    print("pooled_tokens shape", pooled_tokens.shape)
    print("pooled_tokens[0]", pooled_tokens[0])
    print("feature_means shape", feature_means.shape)
    print("feature_means[0]", feature_means[0])

Output

    pooled_tokens shape (2, 4)
    pooled_tokens[0] [5. 6. 7. 8.]
    feature_means shape (2, 3)
    feature_means[0] [4.5 7.5 7.5]

The easiest way to avoid reduction mistakes is to say the axis name out loud before you write the code:

"I want one vector per ticket, so I am reducing the token axis."

Why keepdims=True is a safety switch

By default, a reduction removes the axis entirely. That can break later broadcasting because the shape has changed.

why-keepdims-true-is-a-safety-switch.py

    import numpy as np

    B, T, D = 2, 3, 4
    x = np.arange(B * T * D).reshape(B, T, D)

    # Without keepdims: shape becomes (2, 4), which no longer broadcasts with (2, 3, 4)
    mean_flat = x.mean(axis=1)
    print("mean_flat shape", mean_flat.shape)

    # With keepdims: shape stays (2, 1, 4), which still broadcasts with (2, 3, 4)
    mean_kept = x.mean(axis=1, keepdims=True)
    print("mean_kept shape", mean_kept.shape)

Output

    mean_flat shape (2, 4)
    mean_kept shape (2, 1, 4)

keepdims=True keeps the collapsed axis as a placeholder of size 1. That preserves the original axis positions so operations against the original tensor can still broadcast correctly. Use it when a reduction result must be added, subtracted, or divided back into its source-shaped tensor.

This is easier to trust when you compare both shape paths side by side:

[Figure: Reduction comparison for a ticket tensor: mean_flat has actual shape (2, 4) and fails when aligned against (2, 3, 4), while mean_kept has shape (2, 1, 4) and broadcasts back over token positions.]
`keepdims=True` keeps the placeholder where the reduction happened. That size-1 slot is what lets a ticket-level mean broadcast cleanly back over every token instead of having its batch axis line up against the token axis.
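
Here is a short sketch of the typical use: subtracting each ticket's mean token vector from every token in that ticket. The name `centered` and the `float` dtype are illustrative assumptions. The second half deliberately triggers the failure so you can read the message.

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D, dtype=float).reshape(B, T, D)

# (2, 3, 4) - (2, 1, 4): the size-1 token axis stretches over all tokens.
centered = x - x.mean(axis=1, keepdims=True)
print("centered shape", centered.shape)
print("per-ticket mean after centering")
print(centered.mean(axis=1))

# Without keepdims, the (2, 4) mean cannot align with (2, 3, 4).
try:
    x - x.mean(axis=1)
except ValueError as exc:
    print("without keepdims:", exc)
```

After centering, the mean over the token axis is zero for every ticket and feature, which is a cheap way to confirm that the subtraction used the axis you intended.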

Reshape vs. transpose: regrouping or reordering

Beginners often mix these up because both can produce a new shape.

Use a smaller array so you can see the numbers move:

reshape-vs-transpose.py

    import numpy as np

    small = np.arange(12).reshape(2, 3, 2)
    reshaped = small.reshape(2, 2, 3)
    transposed = np.transpose(small, (0, 2, 1))

    print("small[0]")
    print(small[0])
    print("reshaped[0]")
    print(reshaped[0])
    print("transposed[0]")
    print(transposed[0])

Output

    small[0]
    [[0 1]
     [2 3]
     [4 5]]
    reshaped[0]
    [[0 1 2]
     [3 4 5]]
    transposed[0]
    [[0 2 4]
     [1 3 5]]

Both results have shape (2, 2, 3), but they answer different questions:

| Operation | What it does | When to use it |
|---|---|---|
| reshape | keeps the same data stream and groups it into a new shape | when the flat order is correct and you want new grouping |
| transpose | permutes the axes | when the same values are attached to the wrong axis order |

If your real goal is to turn (B, T, D) into (B, D, T), that is an axis-order problem, so transpose is the right tool.
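
The sketch below applies that to the chapter's ticket tensor and follows one token vector through both operations. The names `moved` and `regrouped` are assumptions for illustration. In a `(B, D, T)` layout, token 1 of ticket 0 lives at `[0, :, 1]`.

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)

moved = np.transpose(x, (0, 2, 1))   # (B, D, T): axes reordered
regrouped = x.reshape(B, D, T)       # (B, D, T): flat stream regrouped

# Token 1 of ticket 0 should still be [4, 5, 6, 7] in the new layout.
print("original token  ", x[0, 1])
print("after transpose ", moved[0, :, 1])
print("after reshape   ", regrouped[0, :, 1])
```

Both results have shape `(2, 4, 3)`, but only the transposed array still holds the original token vector at that address; the reshaped one mixes features from different tokens.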

One more beginner trap lives here:

vector-transpose-stays-1d.py

    import numpy as np

    v = np.array([1, 2, 3])
    print(v.shape)
    print(v.T.shape)

Output

    (3,)
    (3,)

A 1-D vector has one axis, so transposing it does nothing. If you need a row or column vector, make that axis explicit with v[None, :] or v[:, None].

Views vs. copies: who shares memory

Basic slicing produces a view: an array that reaches the same data buffer through different metadata. transpose also returns a view whenever possible. reshape returns a view when the new layout can be expressed with strides, but it can make a copy for a non-contiguous input. Fancy indexing (an integer array or a boolean mask) returns a copy. Those differences matter when you mutate arrays or reason about memory use.[1]

views-vs-copies.py

    import numpy as np

    x = np.arange(6).reshape(2, 3)

    token = x[:, 1]  # basic slice -> view shares memory with x
    token[0] = 99
    print("after editing a slice, x[0, 1] =", x[0, 1])

    picked = x[[0, 1]]  # fancy indexing -> independent copy
    picked[0, 0] = -1
    print("after editing a fancy-index copy, x[0, 0] =", x[0, 0])

Output

    after editing a slice, x[0, 1] = 99
    after editing a fancy-index copy, x[0, 0] = 0

If you need to mutate a slice without touching the source, take an explicit copy first with x[:, 1].copy(). When unsure, np.shares_memory(a, b) answers the question directly.
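
As a small sketch of that check, the loop below asks `np.shares_memory` about each operation mentioned in this section. The dictionary labels are illustrative. The fourth entry reshapes a transposed (non-contiguous) array into a flat vector, which is a case where `reshape` has to copy.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

candidates = {
    "x[:, 1] (basic slice)": x[:, 1],
    "x.T (transpose)": x.T,
    "x.reshape(3, 2) (contiguous input)": x.reshape(3, 2),
    "x.T.reshape(6) (non-contiguous input)": x.T.reshape(6),
    "x[[0, 1]] (fancy indexing)": x[[0, 1]],
}

for label, arr in candidates.items():
    print(label, "shares memory with x:", np.shares_memory(x, arr))
```

The first three report `True` and the last two report `False`, so only the first three would propagate an in-place edit back to `x`.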

From hidden states to attention scores

Attention looks harder than indexing or reduction, but the shape logic is the same. A later Transformer chapter will explain the full computation. For now, track one intermediate result: each token's query vector is compared with every token's key vector by a dot product before those scores become weights.[3]

Start with hidden states of shape (B, T, D). Build queries and keys with matrices of shape (D, D). That keeps the batch axis and token axis in place:

    q.shape == (B, T, D)
    k.shape == (B, T, D)

To compare each token with each token, use matrix multiplication (many dot products calculated together). The last two axes of k must swap first:

    unscaled_scores = q @ np.swapaxes(k, -1, -2)
    unscaled_scores.shape == (B, T, T)

Read the two T axes

  • first T: which token is asking the question
  • second T: which token is being compared against

Code lab: check attention scores

code-lab-check-attention-scores.py

    import numpy as np

    B, T, D = 2, 3, 4
    rng = np.random.default_rng(7)
    x = rng.normal(size=(B, T, D))
    wq = rng.normal(size=(D, D))
    wk = rng.normal(size=(D, D))

    q = x @ wq
    k = x @ wk
    unscaled_scores = q @ np.swapaxes(k, -1, -2)

    print("q", q.shape)
    print("k", k.shape)
    print("unscaled_scores", unscaled_scores.shape)

Output

    q (2, 3, 4)
    k (2, 3, 4)
    unscaled_scores (2, 3, 3)

This is why shape practice belongs before Transformer internals. The score matrix is still built from small, explainable axis moves; later you will scale it, hide positions a token shouldn't read, turn scores into weights, and use those weights to mix values.
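
A shape check alone does not prove that each score compares the right pair of tokens, so it helps to verify one entry by hand. In this sketch `q` and `k` are drawn directly from a random generator instead of being projected from hidden states; that shortcut is an assumption to keep the example short. The `np.einsum` line is an alternative spelling in which the index letters name the axes (`s` stands for the key-side token axis).

```python
import numpy as np

B, T, D = 2, 3, 4
rng = np.random.default_rng(7)
q = rng.normal(size=(B, T, D))
k = rng.normal(size=(B, T, D))

scores = q @ np.swapaxes(k, -1, -2)            # (B, T, T)

# One entry by hand: query token 2 against key token 0 in ticket 1.
by_hand = np.dot(q[1, 2], k[1, 0])
print("matches matmul entry:", np.isclose(scores[1, 2, 0], by_hand))

# The same contraction written with named index letters.
scores_einsum = np.einsum("btd,bsd->bts", q, k)
print("matches einsum:", np.allclose(scores, scores_einsum))
```

Reading `scores[b, i, j]` as "in ticket b, query token i against key token j" is the sentence form of the `(B, T, T)` shape.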

Splitting heads: reshape then transpose

Transformers run several attention heads in parallel by carving the feature axis D into H smaller feature groups of width D // H.[3] This is exactly the reshape-versus-transpose trap from earlier. To go from (B, T, D) to (B, H, T, D // H), first reshape to expose the head axis, then transpose to move it next to the batch axis. Reshaping straight to the target shape produces a legal array whose numbers are jumbled.

split-attention-heads.py

    import numpy as np

    B, T, D, H = 2, 3, 8, 2
    head_dim = D // H
    x = np.arange(B * T * D).reshape(B, T, D)

    # Right: expose the head axis, then move it next to the batch axis.
    heads = x.reshape(B, T, H, head_dim).transpose(0, 2, 1, 3)

    # Wrong: same target shape, but the values land in the wrong heads.
    jumbled = x.reshape(B, H, T, head_dim)

    print("heads shape", heads.shape)
    print("same values?", np.array_equal(heads, jumbled))

Output

    heads shape (2, 2, 3, 4)
    same values? False

Both arrays have shape (2, 2, 3, 4), so a shape check alone would pass. Only the reshape-then-transpose path keeps each token's features grouped inside the right head.
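
Because the shape cannot tell the two apart, a value-level check is more convincing. This sketch confirms that head 1 of a token holds the second half of that token's features, and that reversing the two steps restores the original tensor. The name `merged` is an assumption for illustration.

```python
import numpy as np

B, T, D, H = 2, 3, 8, 2
head_dim = D // H
x = np.arange(B * T * D).reshape(B, T, D)

heads = x.reshape(B, T, H, head_dim).transpose(0, 2, 1, 3)   # (B, H, T, head_dim)

# Head 1 of token 2 in ticket 0 should be the second half of that token's features.
print("expected", x[0, 2, head_dim:])
print("actual  ", heads[0, 1, 2])

# Undo the split: move heads back beside the features, then merge them.
merged = heads.transpose(0, 2, 1, 3).reshape(B, T, D)
print("round trip restores x:", np.array_equal(merged, x))
```

A round-trip test like this is a reasonable guard to keep next to any head-splitting code, since it fails as soon as the reshape and transpose are applied in the wrong order.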

Transfer: normalizing catalog photos

The same rule applies outside ticket text. Suppose ShopFlow prepares 32 catalog photos for a product-image model. Each red-green-blue (RGB) photo is 224 pixels high by 224 pixels wide. The shape is:

    images = (32, 224, 224, 3)

The last axis is the color channel. Pixel values are scaled between 0 and 1, and you want to subtract a measured mean-color vector (3,) from each pixel in every photo.

normalize-catalog-photos.py

    import numpy as np

    images = np.ones((32, 224, 224, 3))  # placeholder normalized catalog photos
    mean_color = np.array([0.52, 0.48, 0.44])  # measured from training photos

    normalized = images - mean_color
    print("normalized shape", normalized.shape)

Output

    normalized shape (32, 224, 224, 3)

NumPy lines the shapes up from the right. The trailing 3 matches, so the mean color broadcasts across all 224 rows, 224 columns, and 32 photos. No Python loop is needed.
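
The mean-color vector itself comes from a reduction, so the same axis reasoning applies when you measure it. The sketch below is a minimal illustration under stated assumptions: it uses a small random stand-in batch of 4 photos at 8 by 8 pixels rather than real catalog data, and the names `mean_image` and `rng` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-in batch: 4 photos of 8x8 RGB pixels with values in [0, 1).
images = rng.random((4, 8, 8, 3))

# Reduce over photo, row, and column axes; only the channel axis survives.
mean_color = images.mean(axis=(0, 1, 2))
print("mean_color shape", mean_color.shape)

normalized = images - mean_color
print("channel means after centering", normalized.mean(axis=(0, 1, 2)))

# A per-pixel mean also broadcasts, but it means something different.
mean_image = images.mean(axis=0)
print("mean_image shape", mean_image.shape)
print("same result?", np.allclose(normalized, images - mean_image))
```

Both subtractions return an array of the original shape, which is why the output shape alone cannot tell you whether you removed one color per channel or one color per pixel location.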

Common shape bugs: symptoms and fixes

Shape bugs become easier to fix when you name the symptom, the cause, and the correction.

| Symptom | Likely cause | Fix |
|---|---|---|
| ValueError: operands could not be broadcast together with shapes (2,3,4) (3,) | you tried to match a token-sized vector with the feature axis | decide whether the smaller array should be (D,), (T, 1), or something else |
| output has shape (2, 3) after mean, but you expected (2, 4) | you reduced the feature axis instead of the token axis | say the axis name before calling mean, then verify the kept axes |
| code works with x[0] but breaks on real batches | you dropped the batch axis during indexing | keep tiny batch examples in scripts and assertions |
| v.T still has shape (4,) | a 1-D vector has a single axis | make the row or column explicit with v[None, :] or v[:, None] |
| reshape gave a legal array, but token meaning changed | axis order was wrong, but you regrouped values instead of permuting axes | use transpose when the axis names must move |
| q @ k raises a matrix-multiply mismatch | the last axis of q doesn't line up with the second-to-last axis of k | swap the last two axes of k before the multiply |
| subtracting a mean gives (32, 224, 224, 3) but colors look wrong | the mean might have shape (224, 224, 3), one color per pixel location, instead of (3,) | check that the mean has shape (3,), one value per color channel |
| editing a slice silently changes the original array | basic slices are views that share memory, not copies | call .copy() when you need an independent array |
| split heads have the right shape but attention output is garbage | you reshaped straight to (B, H, T, D // H) instead of reshaping then transposing | reshape to (B, T, H, D // H) first, then transpose(0, 2, 1, 3) |

The pattern behind all of these fixes is the same: write the intended axis meaning before asking whether the operation preserves that meaning.

Key insight: The error message operands could not be broadcast together isn't a cryptic NumPy complaint. It's a precise statement that two axes you assumed matched don't match. Treat it as a shape-contract violation, not a random failure, and the next check becomes obvious.

Build a shape guard script

Build one tiny script that proves you understand the whole chapter.

  1. Create x = np.arange(24).reshape(2, 3, 4).
  2. Add a feature bias of shape (4,) and assert the result is still (2, 3, 4).
  3. Average over the token axis and assert the result is (2, 4).
  4. Create unscaled attention scores and assert the result is (2, 3, 3).
  5. Intentionally replace the bias with shape (3,) and read the failure.

If you want a stronger version, wrap each step in a function:

  • add_feature_bias(x, bias)
  • pool_tokens(x)
  • unscaled_attention_scores(x, wq, wk)

Each function should assert both the input shape it expects and the output shape it promises.

Here is a compact reference version:

build-a-shape-guard-script.py

    import numpy as np

    def add_feature_bias(x: np.ndarray, bias: np.ndarray) -> np.ndarray:
        assert x.ndim == 3, f"expected (B, T, D), got {x.shape}"
        assert bias.shape == (x.shape[-1],), f"expected bias shape ({x.shape[-1]},), got {bias.shape}"
        out = x + bias
        assert out.shape == x.shape
        return out

    def pool_tokens(x: np.ndarray) -> np.ndarray:
        assert x.ndim == 3, f"expected (B, T, D), got {x.shape}"
        out = x.mean(axis=1)
        assert out.shape == (x.shape[0], x.shape[2])
        return out

    def unscaled_attention_scores(x: np.ndarray, wq: np.ndarray, wk: np.ndarray) -> np.ndarray:
        batch, tokens, width = x.shape
        assert wq.shape == (width, width)
        assert wk.shape == (width, width)
        q = x @ wq
        k = x @ wk
        unscaled_scores = q @ np.swapaxes(k, -1, -2)
        assert unscaled_scores.shape == (batch, tokens, tokens)
        return unscaled_scores

    x = np.arange(24).reshape(2, 3, 4)
    bias = np.array([100, 200, 300, 400])
    wq = np.eye(4)
    wk = np.eye(4)

    shifted = add_feature_bias(x, bias)
    pooled = pool_tokens(shifted)
    unscaled_scores = unscaled_attention_scores(x, wq, wk)

    print("shifted", shifted.shape)
    print("pooled", pooled.shape)
    print("unscaled_scores", unscaled_scores.shape)

    try:
        add_feature_bias(x, np.array([10, 20, 30]))
    except AssertionError as exc:
        print("caught:", exc)
    else:
        raise AssertionError("Expected the shape guard to reject a token-sized bias")

Output

    shifted (2, 3, 4)
    pooled (2, 4)
    unscaled_scores (2, 3, 3)
    caught: expected bias shape (4,), got (3,)

Check your reasoning: predict before you run

Cover the code blocks above and answer the mastery-check questions at the end of this chapter without running anything. If your explanation names numbers without axis meanings, slow down and add labels: reliable shape reasoning attaches each number to its role.

Commit to an answer before you check it.

If you get any of them wrong, re-read "Your first batch: two support tickets" and trace the indices by hand before moving on.

Keep the axis contract visible

The durable lesson is simple: a shape is a sentence about data.

(B, T, D) is not three numbers. It says how many examples you have, how many positions each example contains, and how many numbers describe each position. Once you keep that sentence attached to the array, matrix multiplies, reductions, and attention stop feeling opaque.
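
One way to keep that sentence attached to the array is to check names, not just numbers. The sketch below defines a hypothetical helper, `check_axes`, which is not part of NumPy: it records the size bound to each axis name the first time it sees it and raises if a later array disagrees. Treat it as an illustration of the idea under those assumptions, not a recommended library.

```python
import numpy as np

def check_axes(arr, names, sizes):
    """Hypothetical helper: bind axis names to sizes and reject conflicts."""
    axis_names = names.split()
    if arr.ndim != len(axis_names):
        raise ValueError(f"expected axes ({names}), got shape {arr.shape}")
    for name, size in zip(axis_names, arr.shape):
        # The first time a name appears it is recorded; later uses must agree.
        if sizes.setdefault(name, size) != size:
            raise ValueError(f"axis {name}: expected {sizes[name]}, got {size}")
    return sizes

sizes = {}
x = np.zeros((2, 3, 4))
check_axes(x, "B T D", sizes)
check_axes(x.mean(axis=1), "B D", sizes)
print(sizes)

try:
    check_axes(x.mean(axis=-1), "B D", sizes)   # actually (B, T)
except ValueError as exc:
    print("caught:", exc)
```

The last call fails because reducing the feature axis leaves `(B, T)`, so the size found where `D` was promised is 3 instead of 4. A check like this only catches the mistake when the two axes differ in size, which is one more reason to use distinct sizes for each axis in small test batches.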

Mastery check

Key concepts

  • ndarrays
  • axis naming
  • indexing
  • broadcasting
  • axis reductions
  • keepdims
  • reshape vs transpose
  • matrix multiplication
  • shape assertions
  • column vs row vectors

Evaluation rubric

  • Foundational: Reads tensor axes in plain English before slicing, broadcasting, or reducing arrays
  • Intermediate: Predicts and verifies shapes for indexing, broadcasting, reductions, projection, and attention-score examples
  • Intermediate: Uses keepdims=True to preserve axis positions during reductions and avoid silent broadcasting mismatches
  • Advanced: Adds shape assertions that catch wrong broadcasting, wrong reduction axes, reshape-versus-transpose confusion, and dropped batch dimensions

Common pitfalls

"Broadcast succeeded, so shape must be right"

Symptom: Code runs and the output shape looks plausible, but every token gets the wrong offset.
Cause: A legal broadcast matched the wrong axis. NumPy checked size compatibility, not semantic intent.
Fix: Name the target axis first, then verify the smaller array is shaped for that axis on purpose.

"Reduction returned numbers, so averaging step was correct"

Symptom: mean() runs without complaint, but downstream code receives an array with the wrong axes, such as (B, T) per-token scalars where (B, D) per-ticket vectors were expected.
Cause: The wrong axis was reduced, so the surviving axes don't mean what you intended.
Fix: Say the axis name out loud before reducing, then check the kept axes against the plain-English goal.

"Reshape and transpose are interchangeable"

Symptom: Final shape matches expectation, but token order, catalog-photo layout, or attention heads become mixed up.
Cause: reshape regrouped the flat stream when the real task was axis permutation.
Fix: Use transpose when axis meaning moves. Use reshape only when flat order is already correct.

"A 1D vector is close enough to row or column"

Symptom: (N,) acts like a row in broadcasting but can act like either a row or a column in matrix multiplication, creating silent surprises.
Cause: The array has only one axis. Broadcasting always aligns it with the trailing axis, and @ treats it as a row on the left and a column on the right.
Fix: Make the axis explicit with v[None, :] or v[:, None] whenever orientation matters.

"Slice is harmless to mutate"

Symptom: Editing a slice unexpectedly changes the original batch.
Cause: Basic slices are views, and reshape may return a view that shares memory with the source.
Fix: Call .copy() when you need an independent array before mutating it.

Mastery Check

1. An array x has shape (2, 3, 4), interpreted as (B, T, D): 2 tickets, 3 token positions, and 4 features per token. What does x[:, 1] return?
2. For x.shape == (2, 3, 4) interpreted as (B, T, D), you want to add one scalar to each token position and reuse it across every feature and ticket. Which bias shape encodes that intent, and why does a bare (3,) fail?
3. A support-ticket tensor uses (B, T, D) = (32, 10, 64), where T is token positions and D is features. A teammate writes x.mean(axis=-1) while saying they want one feature vector per ticket. What went wrong?
4. For x.shape == (2, 3, 4) interpreted as (B, T, D), you want to subtract each ticket's mean token vector from every token: x - x.mean(axis=1). NumPy raises a broadcast error. Which change preserves the intended axes?
5. You want (B, T, D) to become (B, H, T, head_dim) for attention heads. Why is reshape alone not enough even if the final shape looks right?
6. You have q.shape == k.shape == (2, 3, 4) from a (B, T, D) batch. You need unscaled attention scores where each query token is compared with every key token in the same ticket. Which code and assertion match that contract?
7. For images.shape == (32, 224, 224, 3), you have one RGB mean value per channel to subtract from every pixel. Which mean shape matches that intent, and what would (224, 224, 3) imply?
8. Given x = np.arange(6).reshape(2, 3), what happens after token = x[:, 1]; token[0] = 99; picked = x[[0, 1]]; picked[0, 0] = -1?
9. Which guard belongs inside add_feature_bias(x, bias) when the function promises to add one feature bias to a (B, T, D) tensor and return the same shape?


Next Step
Continue to CUDA & GPU Computing for ML

NumPy taught you how to keep explicit axis meaning attached to each tensor so reshapes, reductions, and matrix multiplies stay readable. CUDA is next because those same tensors now have to live on the right device, move across the CPU/GPU boundary at the right time, and execute asynchronously without hiding bugs.

References

[1] NumPy Contributors. "Copies and views." Official NumPy documentation, 2026.

[2] NumPy Contributors. "Broadcasting." Official NumPy documentation, 2026.

[3] Vaswani, A., et al. "Attention Is All You Need." 2017.