๐Ÿ“EasyNLP Fundamentals

NumPy and Tensor Shapes

A textbook-style NumPy chapter that teaches B/T/D axes, indexing, matrix multiplication, attention score shapes, and shape assertions.

35 min read · OpenAI, Anthropic, Google +1 · 6 key concepts

NumPy teaches you to see data as shapes.

Python gave you rows and functions. NumPy adds a new rule: every value has axes, and those axes must mean something.

In LLM work, many bugs look like mysterious math bugs. Usually they aren't mysterious: the batch, token, and hidden dimensions got swapped, squeezed, or broadcast in the wrong place.

What You Learn

This chapter teaches the shape habit. By the end, a beginner should be able to look at an array, name each axis, predict the output shape of a small operation, and write assertions that catch wrong shapes before model code runs.[1][2][3]

[Diagram: NumPy tensor shapes, mapping (B, T, D) hidden states through query and key projections into a (B, T, T) attention score matrix.]
Visual anchor: keep your eye on the axes. (B, T, D) becomes query and key tensors, then K flips so attention scores end as (B, T, T).

Step Map

| Step | Question | What you should be able to do |
| --- | --- | --- |
| 1 | What are the axes? | Name B, T, and D in plain English. |
| 2 | What does one slice mean? | Point to one batch item, token, or feature vector. |
| 3 | What operation happens? | Predict the shape after indexing or matrix multiply. |
| 4 | What can fail silently? | Spot wrong broadcasting and swapped axes. |
| 5 | What ships? | Add shape assertions around model boundaries. |

Analogy: a tensor is like a set of labeled shelves. If you remove the labels, the boxes may still stack, but nobody knows what the stack means.


Tiny Story

Imagine two short prompts:

| Batch item | Tokens |
| --- | --- |
| 0 | ["hello", "ai", "world"] |
| 1 | ["debug", "the", "shape"] |

Each token becomes four numbers.

That means the hidden states have shape:

```text
(B, T, D) = (2, 3, 4)
```

Read it slowly:

| Axis | Meaning | Size |
| --- | --- | --- |
| B | batch, how many prompts | 2 |
| T | tokens per prompt | 3 |
| D | numbers per token | 4 |

The shape isn't decoration. It is the contract.
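To make the contract concrete, here is a minimal sketch that builds exactly that (2, 3, 4) array from the two prompts above. The four-number vector assigned to each token is made up for illustration; real embeddings come from a trained model.

```python
import numpy as np

# Hypothetical 4-number vectors per token (values are made up for the demo).
embed = {
    "hello": [1.0, 0.0, 0.0, 0.0],
    "ai":    [0.0, 1.0, 0.0, 0.0],
    "world": [0.0, 0.0, 1.0, 0.0],
    "debug": [0.0, 0.0, 0.0, 1.0],
    "the":   [0.5, 0.5, 0.0, 0.0],
    "shape": [0.0, 0.5, 0.5, 0.0],
}

prompts = [["hello", "ai", "world"], ["debug", "the", "shape"]]

# Stack: B prompts, each with T tokens, each token a D-vector.
x = np.array([[embed[tok] for tok in prompt] for prompt in prompts])

print(x.shape)  # (2, 3, 4): B=2 prompts, T=3 tokens, D=4 numbers per token
```

Each axis in the result has a name you can point to, which is the whole point of the contract.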

Vocabulary

  • array: grid of numbers.
  • axis: one direction in an array.
  • shape: tuple of axis sizes, such as (2, 3, 4).
  • scalar: one number with shape ().
  • vector: one axis, such as (4,).
  • matrix: two axes, such as (3, 4).
  • tensor: array with any number of axes.
  • broadcasting: NumPy rule that stretches compatible axes.

Keep the words concrete. If D means hidden size, every vector on that axis should have D numbers.
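Broadcasting is the one vocabulary word that deserves its own sketch. A minimal example, assuming a per-token scale factor as the broadcast value:

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.ones((B, T, D))

# A per-token scale with shape (T, 1) broadcasts across B and D.
scale = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
y = x * scale                            # (B, T, D) * (T, 1) -> (B, T, D)

# Broadcasting aligns trailing axes: (T, 1) lines up against (..., T, D),
# so the 1 stretches to D and the missing leading axis stretches to B.
print(y.shape)        # (2, 3, 4)
print(y[0, 1, 0])     # 2.0: every feature of token 1 scaled by 2
```

The same rule is what makes the wrong-axis mask bug in the Common Trap section possible: a legal stretch is not always the stretch you meant.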

Worked Example: Indexing

Start with shape (2, 3, 4).

| Expression | Meaning | Result shape |
| --- | --- | --- |
| x[0] | first prompt | (3, 4) |
| x[0, 1] | second token in first prompt | (4,) |
| x[:, 1] | second token from every prompt | (2, 4) |
| x[:, :, 0] | first feature from every token | (2, 3) |

Indexing is reading from the shelf.

When you remove one coordinate, you usually remove one axis.

Code Lab

Put this in numpy_shapes_demo.py:

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)

print("x", x.shape)
print("first_prompt", x[0].shape)
print("one_token", x[0, 1].shape)
print("all_second_tokens", x[:, 1].shape)

assert x.shape == (2, 3, 4)
assert x[0, 1].shape == (4,)
```

Expected output:

```text
x (2, 3, 4)
first_prompt (3, 4)
one_token (4,)
all_second_tokens (2, 4)
```

The values inside x aren't important yet. The shape changes are the lesson.

Second Worked Example: Matrix Multiply

Now transform each token vector from four numbers to five numbers.

Use a weight matrix:

| Object | Shape | Meaning |
| --- | --- | --- |
| x | (2, 3, 4) | two prompts, three tokens, four features |
| W | (4, 5) | map four features to five features |
| y = x @ W | (2, 3, 5) | same prompts and tokens, new feature size |

The last axis of x must match the first axis of W.

Shape rule:

```text
(2, 3, 4) @ (4, 5) -> (2, 3, 5)
```

The 4 disappears because it is the matched dimension. The 5 appears because it is the new feature size.

Run The Projection

```python
import numpy as np

B, T, D, H = 2, 3, 4, 5
x = np.ones((B, T, D))
W = np.ones((D, H))
y = x @ W

print("y", y.shape)
assert y.shape == (B, T, H)
```

Expected output:

```text
y (2, 3, 5)
```

That one rule appears everywhere in neural networks.

Third Worked Example: Attention Scores

Attention compares tokens to tokens.

For one batch item with three tokens, each token needs a score against three tokens. That gives a T x T table.

For a full batch, the score shape should be:

```text
(B, T, T)
```

Shape Checkpoint

Build the smallest version:

```python
import numpy as np

B, T, D = 2, 3, 4
rng = np.random.default_rng(7)
x = rng.normal(size=(B, T, D))
wq = rng.normal(size=(D, D))
wk = rng.normal(size=(D, D))

q = x @ wq
k = x @ wk
scores = q @ np.swapaxes(k, -1, -2)

print("q", q.shape)
print("k", k.shape)
print("scores", scores.shape)
assert scores.shape == (B, T, T)
```

Expected output shape:

```text
scores (2, 3, 3)
```

Read The Output

This is the first shape checkpoint for Transformer attention. You don't need full attention yet. You only need to know why token-to-token scores have two token axes.
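To see those two token axes directly, rebuild the checkpoint arrays and inspect one batch item (a minimal sketch; the row reading is the point, not the random values):

```python
import numpy as np

B, T, D = 2, 3, 4
rng = np.random.default_rng(7)
x = rng.normal(size=(B, T, D))
q = x @ rng.normal(size=(D, D))
k = x @ rng.normal(size=(D, D))
scores = q @ np.swapaxes(k, -1, -2)

# One batch item is a T x T table: row i holds token i's scores
# against every token in the same prompt.
table = scores[0]
print(table.shape)  # (3, 3)
print(table[1])     # token 1's scores against tokens 0, 1, and 2
```

One token axis indexes the token doing the looking; the other indexes the token being looked at. That is why neither can be dropped.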

Common Trap

The common trap is trusting NumPy because it returned an array.

NumPy can return a legal array that means the wrong thing.

Example:

| Mistake | Why it is dangerous |
| --- | --- |
| using (T, B, D) when code expects (B, T, D) | batch and token axes swap silently |
| broadcasting a (T,) mask over the wrong axis | every prompt gets the wrong mask |
| calling reshape without naming axes | values move into a shape that looks valid |
| dropping a batch axis with x[0] | code works for one prompt and breaks later |

The fix isn't more confidence. The fix is shape assertions with names.
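One minimal way to attach names to those assertions is a small helper. `expect_shape` below is a hypothetical function written for this chapter, not part of NumPy:

```python
import numpy as np

def expect_shape(name, array, expected):
    """Raise a readable error when an array's shape drifts from its contract."""
    if array.shape != expected:
        raise ValueError(
            f"{name}: expected shape {expected}, got {array.shape}"
        )

B, T, D = 2, 3, 4
x = np.ones((B, T, D))
expect_shape("hidden_states (B, T, D)", x, (B, T, D))

# A swapped-axes bug now fails loudly instead of flowing downstream:
bad = np.swapaxes(x, 0, 1)  # shape (T, B, D)
try:
    expect_shape("hidden_states (B, T, D)", bad, (B, T, D))
except ValueError as err:
    print(err)
```

The named error message tells you which contract broke, which a bare `assert` does not.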

Mini Test

Add tests like these:

```python
import numpy as np

def test_token_projection_shape():
    B, T, D, H = 2, 3, 4, 5
    x = np.ones((B, T, D))
    W = np.ones((D, H))
    y = x @ W
    assert y.shape == (B, T, H)

def test_attention_scores_are_token_by_token():
    B, T, D = 2, 3, 4
    q = np.ones((B, T, D))
    k = np.ones((B, T, D))
    scores = q @ np.swapaxes(k, -1, -2)
    assert scores.shape == (B, T, T)
```

These tests don't prove a model is good. They prove the arrays still mean what their names claim.

Practice

  1. Change T from 3 to 6.
  2. Predict every printed shape before running code.
  3. Change H from 5 to 2.
  4. Explain why y changes only on the last axis.
  5. Remove np.swapaxes and read the error.
  6. Write one sentence that starts: "The scores shape is (B, T, T) because..."

Production Check

At every model boundary, write down:

  • input shape
  • axis names
  • expected output shape
  • batch behavior
  • sequence behavior
  • feature or hidden size
  • shape assertion
  • example with tiny numbers

For example:

Input hidden states are (B, T, D). Projection matrix is (D, H). Output is (B, T, H). Test uses B=2, T=3, D=4, H=5.

That note prevents the most common beginner mistake: treating arrays as anonymous grids.
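Written as code at a model boundary, that note might look like the following sketch; the function name and docstring convention here are illustrative, not a fixed standard:

```python
import numpy as np

def project_tokens(x, W):
    """Project hidden states (B, T, D) through W (D, H) to (B, T, H)."""
    B, T, D = x.shape
    assert W.shape[0] == D, f"W must map D={D} features, got {W.shape}"
    y = x @ W
    assert y.shape == (B, T, W.shape[1])
    return y

# Tiny-number example from the note: B=2, T=3, D=4, H=5.
y = project_tokens(np.ones((2, 3, 4)), np.ones((4, 5)))
print(y.shape)  # (2, 3, 5)
```

The docstring carries the axis names, and the assertions turn the written contract into a check that runs on every call.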

Next, continue to Data Structures for AI. NumPy taught you to keep shape meaning attached to arrays. Data structures teach you to keep lookup meaning attached to storage.

Evaluation Rubric

  1. Names tensor axes in plain English before doing array operations.
  2. Predicts and verifies shapes for the indexing, projection, and attention-score examples.
  3. Adds shape assertions that catch swapped axes, wrong broadcasting, and dropped batch dimensions.
Common Pitfalls
  • Most tensor bugs are silent until a later layer sees the wrong axis. Broadcasting can hide a bug by producing a legal but wrong shape.
  • A legal broadcast can still be wrong. Check meaning, not only whether NumPy returned an array.
  • Dropping a batch axis can make a one-example demo pass while production batches break.
Key Concepts Tested
ndarrays · broadcasting · matrix multiplication · batch dimensions · random seeds · shape assertions
References

[1] Harris, C. R., et al. (2020). Array programming with NumPy. Nature.
[2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
[3] Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
