LeetLLM
📝 Easy · NLP Fundamentals

Vectors, Matrices & Tensors

Turn one gradient vector into batches of model inputs while learning dot products, matrix transforms, tensor axes, and shape debugging.


In the previous chapter, ShopFlow's delivery model produced a gradient [4, 2]: one correction for its distance weight and one for its warehouse-backlog weight. That list already has a mathematical name. It's a vector.

Now you will learn how one list becomes a row in a dataset, how learned weights change its features, and how batches of token sequences need extra axes. These shapes are the notation behind training code, similarity search, and later Transformer chapters.

Figure: Progression from one ShopFlow delivery feature vector to an order matrix and then a batched support-ticket tensor.
A single delivery example needs one feature vector. Training adds rows. Text models add a sequence axis, so a batch of ticket tokens needs three named axes.

Start with the vector you already computed

A scalar is one number, such as a loss of 1.0. A vector is an ordered list of numbers. Its order matters because each position has a meaning.

Reuse the delivery-time example from the gradient calculation:

| Vector | Meaning of position 0 | Meaning of position 1 | Values |
| --- | --- | --- | --- |
| features | hundreds of kilometres | warehouse backlog hours | [2, 1] |
| weights | hours per distance unit | hours per backlog unit | [2, 2] |
| gradient | correction for distance weight | correction for backlog weight | [4, 2] |

Every vector above has shape [2], but they play different roles. Shape tells you how values are arranged. Names tell you what those values mean.

a-gradient-is-a-vector.py
    import numpy as np

    features = np.array([2.0, 1.0])  # distance, backlog
    weights = np.array([2.0, 2.0])
    gradient = np.array([4.0, 2.0])

    print("features", features.tolist(), "shape", features.shape)
    print("weights", weights.tolist(), "shape", weights.shape)
    print("gradient", gradient.tolist(), "shape", gradient.shape)
    print("updated_weights", (weights - 0.1 * gradient).tolist())

Output

    features [2.0, 1.0] shape (2,)
    weights [2.0, 2.0] shape (2,)
    gradient [4.0, 2.0] shape (2,)
    updated_weights [1.6, 1.8]

A vector isn't automatically an embedding. Here it holds two features or two parameters. Much later, a token embedding will be a longer vector whose positions are learned rather than named by hand.

A dot product makes one weighted prediction

The delivery model uses a dot product to combine two features into one prediction:

$$\hat{y} = x \cdot w = x_1 w_1 + x_2 w_2$$

Here x is the feature vector, w is the weight vector, and \hat{y} is the predicted delivery time. To isolate the operation, use a calibrated rule with x = [2, 1] and w = [2, 1]:

$$[2, 1] \cdot [2, 1] = (2 \times 2) + (1 \times 1) = 5$$

The dot product multiplies matching positions, then adds the contributions. Two vectors of length two produce one scalar prediction.

Figure: A delivery-time prediction formed by multiplying distance and backlog features by matching weights and summing the contributions.
The dot product is inspectable bookkeeping: distance contributes four hours, backlog contributes one hour, and their sum predicts a five-hour delivery.
one-order-dot-product.py
    import numpy as np

    features = np.array([2.0, 1.0])
    weights = np.array([2.0, 1.0])

    contributions = features * weights
    prediction = features @ weights

    print("contributions", contributions.tolist())
    print("prediction_hours", float(prediction))
    print("output_shape", prediction.shape)

Output

    contributions [4.0, 1.0]
    prediction_hours 5.0
    output_shape ()

The output shape is (): NumPy's representation for a scalar. Multiplying element by element produced a vector of contributions; applying @ summed them into one number.
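To make the "multiply matching positions, then add" rule concrete without any library, here is a minimal sketch in plain Python. The helper name `dot` is an illustration, not something from NumPy; it assumes both inputs are ordinary lists of the same length.

```python
def dot(x, w):
    """Multiply matching positions, then sum the contributions."""
    if len(x) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(x_i * w_i for x_i, w_i in zip(x, w))


features = [2.0, 1.0]  # distance, backlog
weights = [2.0, 1.0]   # hours per distance unit, hours per backlog unit

prediction = dot(features, weights)
print("prediction_hours", prediction)
```

The length check plays the same role as NumPy's shape error: two vectors can only be combined when every position has a partner.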

Several orders make a matrix

Training from one delivery would be fragile. Put three observed deliveries into a matrix: a rectangular array whose rows are orders and whose columns are features.

$$X = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 3 & 1 \end{bmatrix} \quad w = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

Read the conceptual shape [3, 2] as three orders by two features. NumPy prints the same shape as the tuple (3, 2). Multiplying every row by the same weight vector returns one prediction per order:

$$[3, 2] \; @ \; [2] \rightarrow [3]$$
many-orders-one-weight-vector.py
    import numpy as np

    X = np.array([
        [1.0, 0.0],  # nearby order, no backlog
        [2.0, 1.0],  # medium distance, backlog
        [3.0, 1.0],  # farther order, backlog
    ])
    weights = np.array([2.0, 1.0])

    predictions = X @ weights

    print("X_shape", X.shape)
    print("weights_shape", weights.shape)
    print("predictions_shape", predictions.shape)
    print("predicted_hours", predictions.tolist())

Output

    X_shape (3, 2)
    weights_shape (2,)
    predictions_shape (3,)
    predicted_hours [2.0, 5.0, 7.0]

This is the same computation you did for one order, repeated efficiently for every row. In this matrix-vector case, the feature dimensions must agree and the row axis survives in the result.[1]
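The claim that the matrix-vector product is one dot product per row can be checked directly. This sketch reuses the same `X` and `weights` values and compares an explicit row-by-row loop with the single `@` call; the variable names `row_by_row` and `all_at_once` are only for illustration.

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
weights = np.array([2.0, 1.0])

# One dot product per order row, collected into a vector.
row_by_row = np.array([row @ weights for row in X])

# The same work expressed as one matrix-vector product.
all_at_once = X @ weights

print("row_by_row", row_by_row.tolist())
print("matches", np.allclose(row_by_row, all_at_once))
```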

A matrix can transform features into new features

The weight vector above produces one output. A model often wants several outputs from the same input. Suppose ShopFlow needs a predicted number of hours and an internal handling-priority score.

Use one column of a weight matrix for each output:

$$W = \begin{bmatrix} 2.0 & 0.2 \\ 1.0 & 1.5 \end{bmatrix}$$

The first column computes delivery hours; the second computes priority. For x = [2, 1]:

$$xW = [2, 1] \begin{bmatrix} 2.0 & 0.2 \\ 1.0 & 1.5 \end{bmatrix} = [5.0, 1.9]$$

The shape contract is:

$$[2] \; @ \; [2, 2] \rightarrow [2]$$

For the whole three-row matrix:

$$[3, 2] \; @ \; [2, 2] \rightarrow [3, 2]$$
Figure: Shape contract for three delivery rows with two input features multiplied by a two-by-two weight matrix to produce two outputs per row.
The two feature columns meet the two input rows of the weight matrix. Three order rows and two output columns remain in the result.
matrix-transforms-order-features.py
    import numpy as np

    X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
    W = np.array([
        [2.0, 0.2],  # distance contribution to [hours, priority]
        [1.0, 1.5],  # backlog contribution to [hours, priority]
    ])

    outputs = X @ W

    print("contract", X.shape, "@", W.shape, "->", outputs.shape)
    print("second_order", outputs[1].tolist())
    print("all_hour_predictions", outputs[:, 0].tolist())

Output

    contract (3, 2) @ (2, 2) -> (3, 2)
    second_order [5.0, 1.9]
    all_hour_predictions [2.0, 5.0, 7.0]

This is why matrices matter in neural networks: one learned transform can mix input features into many output features for every row at once.
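One way to read a weight matrix is as several weight vectors stored side by side. The sketch below, using the same `X` and `W` values, applies each column of `W` separately and then stacks the results; it is an illustration of the column view, not a faster way to compute.

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
W = np.array([[2.0, 0.2], [1.0, 1.5]])

hours = X @ W[:, 0]      # first column acts as the delivery-hours weight vector
priority = X @ W[:, 1]   # second column acts as the priority weight vector

# Placing the two prediction vectors side by side rebuilds X @ W.
stacked = np.stack([hours, priority], axis=1)

print("stacked_shape", stacked.shape)
print("matches_matrix_product", np.allclose(stacked, X @ W))
```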

Meeting dimensions are a contract, not a suggestion

Suppose a teammate replaces W with a matrix expecting three input features:

$$[3, 2] \; @ \; [3, 2]$$

The first object ends with feature count 2; the second expects feature count 3. No computation can match the missing feature, so NumPy raises an error.

catch-the-matrix-shape-error.py
    import numpy as np

    X = np.zeros((3, 2))
    wrong_W = np.zeros((3, 2))

    try:
        X @ wrong_W
    except ValueError:
        print("invalid", X.shape, "@", wrong_W.shape)
        print("meeting_dimensions", X.shape[-1], "and", wrong_W.shape[0], "do_not_match")

    correct_W = np.zeros((2, 2))
    print("valid_output_shape", (X @ correct_W).shape)

Output

    invalid (3, 2) @ (3, 2)
    meeting_dimensions 2 and 3 do_not_match
    valid_output_shape (3, 2)

When you see a multiplication error, don't randomly transpose arrays until it disappears. Name the axes first: order rows, input features, and output features. Then fix the object whose meaning is wrong.
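Naming the axes first can be practised without any arrays at all. The following is a small sketch of a contract checker that works on (name, size) pairs; `matmul_contract` and the axis names are hypothetical helpers for this exercise, and the function only covers the case where the right operand is a two-axis matrix.

```python
def matmul_contract(left_axes, right_axes):
    """Return the output axes of left @ right, or raise if the meeting axes differ.

    Each argument is a list of (name, size) pairs.
    """
    if len(right_axes) != 2:
        raise ValueError("right operand must be a two-axis matrix")
    left_name, left_size = left_axes[-1]
    right_name, right_size = right_axes[0]
    if left_size != right_size:
        raise ValueError(
            f"meeting axes disagree: {left_name}={left_size} vs {right_name}={right_size}"
        )
    # Leading axes of the left operand survive; the meeting axis is replaced.
    return left_axes[:-1] + right_axes[1:]


X_axes = [("order", 3), ("input_feature", 2)]
W_axes = [("input_feature", 2), ("output_feature", 2)]
wrong_W_axes = [("input_feature", 3), ("output_feature", 2)]

print(matmul_contract(X_axes, W_axes))
try:
    matmul_contract(X_axes, wrong_W_axes)
except ValueError as error:
    print("invalid:", error)
```

Because the error message carries axis names rather than bare numbers, it points at the object whose meaning is wrong instead of inviting a blind transpose.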

Framework tensors add axes for sequences and batches

In NumPy and PyTorch code, a tensor is an array with any number of axes. This chapter uses that framework meaning: a vector is a one-axis tensor, and a matrix is a two-axis tensor. Text models commonly need three named axes:

| Axis | Symbol | ShopFlow meaning |
| --- | --- | --- |
| batch | B | several support tickets processed together |
| sequence | T | token positions within each ticket |
| feature | D | numbers describing one token at one layer |

Use tiny handcrafted token features so every number stays readable. This isn't a trained embedding table; it's a shape exercise:

| Token feature | Position 0 | Position 1 |
| --- | --- | --- |
| late-like signal | 1.0 | 0.0 |
| package-like signal | 0.5 | 0.5 |
| refund-like signal | 0.0 | 1.0 |

Two tickets, each containing three token-feature vectors, have shape [2, 3, 2].

stack-ticket-sequences-into-a-tensor.py
    import numpy as np

    ticket_tensor = np.array([
        [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],  # "late package refund"
        [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],  # "refund package late"
    ])

    print("tensor_shape", ticket_tensor.shape)
    print("first_ticket_shape", ticket_tensor[0].shape)
    print("first_token_shape", ticket_tensor[0, 0].shape)
    print("first_token", ticket_tensor[0, 0].tolist())

Output

    tensor_shape (2, 3, 2)
    first_ticket_shape (3, 2)
    first_token_shape (2,)
    first_token [1.0, 0.0]

The last axis still contains features. Extra axes organize which ticket and which token each vector belongs to.
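Slicing along one axis at a time is a quick way to confirm which axis means what. This sketch uses the same two-ticket tensor; the variable names are only illustrative.

```python
import numpy as np

ticket_tensor = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],  # "late package refund"
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],  # "refund package late"
])

# Fix the sequence axis: the first token of every ticket -> [batch, feature].
first_tokens = ticket_tensor[:, 0, :]

# Fix the feature axis: feature 0 at every position of every ticket -> [batch, sequence].
feature_zero = ticket_tensor[:, :, 0]

print("first_tokens", first_tokens.shape, first_tokens.tolist())
print("feature_zero", feature_zero.shape, feature_zero.tolist())
```

Each integer index removes one axis, so the remaining shape tells you which two axes are still in play.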

Figure: One order x: [2]; one prediction x @ w: scalar; many orders X: [3, 2]; two outputs each X @ W: [3, 2].

A token projection keeps leading tensor axes

Once a ticket tensor has token features, a learned projection can change only its final feature dimension. Let P map two input features into three output features:

$$[B, T, 2] \; @ \; [2, 3] \rightarrow [B, T, 3]$$

Batch and sequence remain intact. Only the token-feature size changes.

project-every-token-vector.py
    import numpy as np

    H = np.array([
        [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
        [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
    ])
    P = np.array([
        [1.0, 0.0, 0.5],
        [0.0, 1.0, 0.5],
    ])

    projected = H @ P

    print("contract", H.shape, "@", P.shape, "->", projected.shape)
    print("first_ticket_first_token", projected[0, 0].tolist())
    print("middle_token_shared", np.allclose(projected[0, 1], projected[1, 1]))

Output

    contract (2, 3, 2) @ (2, 3) -> (2, 3, 3)
    first_ticket_first_token [1.0, 0.0, 0.5]
    middle_token_shared True

For this [B, T, D] @ [D, P] case, NumPy applies the same two-dimensional projection at every leading [B, T] position, preserving the batch and sequence axes. More generally, higher-rank @ treats operands as stacks of matrices on their last two axes and broadcasts compatible leading axes.[1]
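Two equivalent spellings can help confirm that the projection really touches only the final axis. The sketch below flattens the leading axes into ordinary rows, multiplies, and reshapes back, then repeats the operation with `np.einsum` using axis letters chosen for this example.

```python
import numpy as np

H = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
])
P = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
])

B, T, D = H.shape

# Treat every token as one row of a plain matrix, project, then restore the axes.
flat = H.reshape(B * T, D) @ P           # (6, 2) @ (2, 3) -> (6, 3)
via_reshape = flat.reshape(B, T, -1)     # back to [B, T, 3]

# Same contract written with named index letters: b=batch, t=sequence, d=in, p=out.
via_einsum = np.einsum("btd,dp->btp", H, P)

print("reshape_matches", np.allclose(via_reshape, H @ P))
print("einsum_matches", np.allclose(via_einsum, H @ P))
```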

Broadcasting: valid arithmetic can still be the wrong meaning

Broadcasting lets a smaller array participate in element-wise operations against a larger one. Adding an output bias [hours_offset, priority_offset] to every order row is useful:

$$[3, 2] + [2] \rightarrow [3, 2]$$

NumPy compares dimensions from the right: equal dimensions match, a dimension of size 1 can expand, and missing leading dimensions act like size 1. This is powerful, but it can hide an axis mistake.[2]
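The right-to-left comparison rule is short enough to write out by hand. This is a sketch of the rule for two shapes in plain Python; `broadcast_shape` is a hypothetical helper written for this explanation, not NumPy's own implementation.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Apply the right-aligned broadcasting rule to two shape tuples."""
    result = []
    # Walk both shapes from the right; a missing leading dimension acts like size 1.
    for dim_a, dim_b in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(result))


print(broadcast_shape((3, 2), (2,)))     # one bias per output column
print(broadcast_shape((3, 2), (3, 1)))   # one value per order row, also legal
try:
    broadcast_shape((3, 2), (3,))
except ValueError as error:
    print("invalid:", error)
```

Note that a per-row value of shape (3,) is rejected while the same values shaped (3, 1) are accepted, which is exactly how an axis mistake can slip through once someone adds a singleton axis to silence the error.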

broadcast-the-right-axis.py
    import numpy as np

    outputs = np.array([
        [2.0, 0.2],
        [5.0, 1.9],
        [7.0, 2.1],
    ])  # columns: [hours, priority]

    feature_bias = np.array([0.1, -0.1])  # intended: adjust each output feature
    wrong_row_bias = np.array([[0.1], [-0.1], [0.0]])  # wrong: adjusts each order row

    corrected = outputs + feature_bias
    wrong_but_valid = outputs + wrong_row_bias

    print("correct_shape", corrected.shape)
    print("wrong_shape", wrong_but_valid.shape)
    print("correct_second_order", corrected[1].tolist())
    print("wrong_second_order", wrong_but_valid[1].tolist())
    assert feature_bias.shape == (outputs.shape[-1],)

Output

    correct_shape (3, 2)
    wrong_shape (3, 2)
    correct_second_order [5.1, 1.7999999999999998]
    wrong_second_order [4.9, 1.7999999999999998]

Both additions return shape [3, 2], yet only one uses the intended axis. A failing operation is easy to debug. A legal broadcast on the wrong axis is more dangerous, so assertions should state meaning, not only final size.
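An assertion that states meaning can be wrapped in a small function so the check travels with the operation. This is a minimal sketch; `add_output_bias` is a hypothetical helper name, and it assumes outputs are always laid out as [order, output_feature].

```python
import numpy as np


def add_output_bias(outputs, bias):
    """Add one bias per output feature, refusing any other bias layout."""
    if outputs.ndim != 2:
        raise ValueError("expected outputs [order, output_feature]")
    expected = (outputs.shape[-1],)
    if bias.shape != expected:
        raise ValueError(f"expected bias shape {expected}, got {bias.shape}")
    return outputs + bias


outputs = np.array([[2.0, 0.2], [5.0, 1.9], [7.0, 2.1]])  # columns: [hours, priority]
feature_bias = np.array([0.1, -0.1])
wrong_row_bias = np.array([[0.1], [-0.1], [0.0]])

print("guarded_shape", add_output_bias(outputs, feature_bias).shape)
try:
    add_output_bias(outputs, wrong_row_bias)
except ValueError as error:
    print("bias_guardrail", error)
```

The guard rejects the row-shaped bias even though plain addition would have accepted it.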

A score grid is repeated dot products

The delivery predictor used a dot product to combine features with weights. The same operation can compare vectors to one another.

For three token-feature vectors in one ticket, multiply the token matrix by its transpose:

$$[T, D] \; @ \; [D, T] \rightarrow [T, T]$$

Every entry in the output is one dot product between two token positions.

These are raw alignment scores, not automatically cosine similarities. A dot product changes when vector magnitude changes; cosine similarity first divides by vector lengths.

token-similarity-grid.py
    import numpy as np

    tokens = np.array([
        [1.0, 0.0],  # late-like
        [0.5, 0.5],  # package-like
        [0.0, 1.0],  # refund-like
    ])

    scores = tokens @ tokens.T

    print("contract", tokens.shape, "@", tokens.T.shape, "->", scores.shape)
    print(scores)
    print("late_to_refund", float(scores[0, 2]))

Output

    contract (3, 2) @ (2, 3) -> (3, 3)
    [[1.  0.5 0. ]
     [0.5 0.5 0.5]
     [0.  0.5 1. ]]
    late_to_refund 0.0
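The difference between raw dot-product scores and cosine similarity can be checked on the same three token vectors. In this sketch, `cosine_grid` is a hypothetical helper that divides each row by its length before taking dot products; it assumes no row is the zero vector.

```python
import numpy as np


def cosine_grid(vectors):
    """Pairwise cosine similarity between the rows of a [T, D] matrix."""
    lengths = np.linalg.norm(vectors, axis=1, keepdims=True)  # shape [T, 1]
    unit = vectors / lengths                                  # each row now has length 1
    return unit @ unit.T


tokens = np.array([
    [1.0, 0.0],  # late-like
    [0.5, 0.5],  # package-like
    [0.0, 1.0],  # refund-like
])

raw = tokens @ tokens.T
scaled_raw = (10 * tokens) @ (10 * tokens).T

print("raw_package_to_package", float(raw[1, 1]))
print("scaled_raw_package_to_package", float(scaled_raw[1, 1]))
print("cosine_unchanged_by_scaling", np.allclose(cosine_grid(tokens), cosine_grid(10 * tokens)))
```

Scaling every vector by ten multiplies the raw scores by one hundred, while the cosine grid stays the same because the lengths are divided out.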

Later, attention will create separate query and key vectors, scale the Q @ K.T score grid, and convert each row into weights that sum to one.[3] You don't need attention mechanics yet. For now, recognize its shape as many dot products arranged in a matrix.

Build it: trace one tiny ticket projection

The lab below packages the shape reasoning into a guardrail. It accepts a batch of token-feature sequences and a projection matrix. Before multiplying, it checks that the hidden states have three named axes, the projection is a matrix, and the feature size agrees with the matrix input size.

build-it-ticket-projection.py
    import numpy as np

    def project_tickets(hidden: np.ndarray, projection: np.ndarray) -> np.ndarray:
        if hidden.ndim != 3:
            raise ValueError("expected [batch, sequence, feature]")
        if projection.ndim != 2:
            raise ValueError("expected projection [input_feature, output_feature]")
        if hidden.shape[-1] != projection.shape[0]:
            raise ValueError("feature axis does not match projection input")
        return hidden @ projection

    hidden = np.array([
        [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
        [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
    ])
    projection = np.array([[2.0, 0.0], [0.0, 3.0]])

    projected = project_tickets(hidden, projection)
    ticket_summaries = projected.mean(axis=1)

    print("input", hidden.shape, "projection", projection.shape)
    print("projected", projected.shape)
    print("ticket_summaries", ticket_summaries.tolist())

    try:
        project_tickets(hidden, np.zeros((3, 2)))
    except ValueError as error:
        print("feature_guardrail", str(error))

    try:
        project_tickets(hidden, np.zeros(2))
    except ValueError as error:
        print("rank_guardrail", str(error))

Output

    input (2, 3, 2) projection (2, 2)
    projected (2, 3, 2)
    ticket_summaries [[1.0, 1.5], [1.0, 1.5]]
    feature_guardrail feature axis does not match projection input
    rank_guardrail expected projection [input_feature, output_feature]

These handcrafted sequences have three meaningful token positions each. Real padded text batches need masked pooling: plain .mean(axis=1) would let padding positions change the summaries.
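A masked mean might look like the sketch below. The second ticket and the `mask` array are made up for illustration: its last position is treated as padding, with 1.0 marking a real token and 0.0 marking padding. The sketch also assumes every ticket has at least one real token, so the division is safe.

```python
import numpy as np

projected = np.array([
    [[2.0, 0.0], [1.0, 1.5], [0.0, 3.0]],  # three real tokens
    [[2.0, 0.0], [1.0, 1.5], [0.0, 0.0]],  # two real tokens, then one padding position
])
mask = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])  # [batch, sequence]; hypothetical padding mask

plain_mean = projected.mean(axis=1)

# Add a trailing axis so the mask broadcasts over features, not over tokens.
masked_sum = (projected * mask[:, :, None]).sum(axis=1)   # [batch, feature]
token_counts = mask.sum(axis=1, keepdims=True)            # [batch, 1]
masked_mean = masked_sum / token_counts

print("plain_mean", plain_mean.tolist())
print("masked_mean", masked_mean.tolist())
```

The first ticket gets the same summary either way. The second ticket's plain mean is pulled toward zero by the padding row, while the masked mean divides by two real tokens instead of three positions.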

Real token vectors won't be hand written, and the model's learned projections will be much wider. The invariant stays small enough to say aloud: preserve batch and sequence; match and transform feature size.

PyTorch keeps the same shape reasoning

NumPy is excellent for seeing the algebra. PyTorch carries the same tensor operations into trainable models and automatic differentiation.[4] Confirm that the same batch projection gives the same values in both libraries:

pytorch-confirms-tensor-shapes.py
    import numpy as np
    import torch

    hidden_np = np.array([
        [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
        [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
    ], dtype=np.float32)
    projection_np = np.array([[2.0, 0.0], [0.0, 3.0]], dtype=np.float32)

    numpy_output = hidden_np @ projection_np
    torch_output = torch.tensor(hidden_np) @ torch.tensor(projection_np)

    print("torch_shape", tuple(torch_output.shape))
    print("same_values", np.allclose(numpy_output, torch_output.numpy()))
    print("first_token", torch_output[0, 0].tolist())

Output

    torch_shape (2, 3, 2)
    same_values True
    first_token [2.0, 0.0]

This is enough PyTorch for this chapter. Device placement, mixed precision, training modules, and full attention blocks each need their own careful lesson.

Practice with shapes before code

Use the conventions from this chapter: rows represent orders or token positions; the final dimension represents features.

  1. A delivery feature vector has shape [3], and a weight vector has shape [3]. What does x @ w return?
  2. A training table X has shape [20, 3], and W has shape [3, 4]. What does X @ W mean and what shape does it produce?
  3. Ticket hidden states have shape [8, 12, 64], and a projection has shape [64, 16]. Which axes remain unchanged?
  4. Scores have shape [20, 4]. Why can adding a value of shape [20, 1] be a silent bug when you meant a four-feature bias?
  5. Token vectors have shape [12, 64]. What transpose is needed to create a pairwise score grid?

Solution sketches

  1. It returns a scalar because matching feature positions are multiplied and summed.
  2. It applies one four-output transformation to each of 20 rows, producing [20, 4].
  3. Batch size 8 and sequence length 12 stay unchanged; feature size changes from 64 to 16, giving [8, 12, 16].
  4. A [20, 1] array legally broadcasts one value per row across all four output columns of that row. The output shape looks right even though the adjustment is attached to orders instead of features.
  5. Use the transpose [64, 12]; then [12, 64] @ [64, 12] produces [12, 12].
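Once you have written your own answers, the five shape predictions can be confirmed with zero-filled arrays. This is only a checking sketch; the variable names are illustrative and the values are irrelevant because only shapes are inspected.

```python
import numpy as np

# 1. vector @ vector returns a scalar
scalar_shape = (np.zeros(3) @ np.zeros(3)).shape

# 2. [20, 3] @ [3, 4] gives four outputs for each of 20 rows
table_shape = (np.zeros((20, 3)) @ np.zeros((3, 4))).shape

# 3. batch and sequence survive; only the feature size changes
projected_shape = (np.zeros((8, 12, 64)) @ np.zeros((64, 16))).shape

# 4. a [20, 1] value broadcasts legally, so the shape alone reveals nothing
silent_shape = (np.zeros((20, 4)) + np.zeros((20, 1))).shape

# 5. token matrix times its transpose gives a pairwise grid
tokens = np.zeros((12, 64))
grid_shape = (tokens @ tokens.T).shape

print(scalar_shape, table_shape, projected_shape, silent_shape, grid_shape)
```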

What you are ready for next

You can now read a model operation as a contract:

| Object | Example shape | Question to ask |
| --- | --- | --- |
| Vector | [2] | What does each feature position mean? |
| Matrix of rows | [3, 2] | What does one row describe? |
| Weight matrix | [2, 3] | Which input feature axis must match? |
| Tensor | [2, 3, 2] | What do batch, sequence, and feature axes mean? |
| Similarity grid | [3, 3] | Which vector pairs produced these dot products? |

You should be able to predict a multiplication output before running it, catch an invalid meeting dimension, and explain why a legal broadcast might still corrupt meaning.

Mastery check

Key concepts

  • A vector is an ordered list whose positions carry agreed meaning.
  • A dot product turns two matching vectors into one weighted prediction or similarity score.
  • A matrix can hold many rows or map input features into new output features.
  • Matrix multiplication succeeds only when the meeting feature dimensions agree.
  • A tensor adds named axes such as batch and sequence while preserving feature-vector reasoning.
  • Broadcasting needs semantic axis checks because incorrect operations can still run.
  • PyTorch uses the same shape reasoning that you practiced in NumPy.

Evaluation rubric

  • Foundational: Calculates the two-feature delivery prediction by hand and identifies vector, scalar, and matrix shapes.
  • Intermediate: Traces X @ W and [B, T, D] @ [D, P] contracts before executing code.
  • Advanced: Diagnoses both a hard dimension mismatch and a silent broadcasting bug, then implements guarded ticket projection in NumPy and confirms it with PyTorch.

Common pitfalls

Swapping feature order

Symptom: A prediction is numerically plausible but wrong for known examples.
Cause: Feature and weight vectors have the same length but disagree about position meanings.
Fix: Name feature order once and assert it at data boundaries.
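One way to name the feature order once is to build every vector from named fields rather than from positional lists. The sketch below is an assumption about how such a boundary check could look; `FEATURE_ORDER` and `to_feature_vector` are hypothetical names, and the field names follow the ones used in this lesson's quiz.

```python
FEATURE_ORDER = ("distance_units", "backlog_hours")  # hypothetical documented order


def to_feature_vector(record):
    """Build a feature list in the documented order from a dict of named values."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]


# The upstream pipeline may emit keys in any order; the vector order stays fixed.
order = {"backlog_hours": 1, "distance_units": 2}
features = to_feature_vector(order)
print(features)
```

Because positions are assigned from names in a single place, a pipeline that reorders its fields can't silently swap distance and backlog.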

Transposing until code runs

Symptom: A shape error disappears, but outputs no longer mean what the model intended.
Cause: A transpose was used as a guess rather than from the row/column contract.
Fix: Label each axis, then transpose only when the operation requires pairwise rows, such as a similarity grid.

Trusting a legal broadcast

Symptom: Results are shifted across entire orders or token positions.
Cause: A singleton axis expanded in a valid but unintended direction.
Fix: Assert the bias or mask shape before arithmetic and test a row with hand-computed expected values.

Losing the feature axis

Symptom: A token projection fails after batching or reshaping.
Cause: The last axis no longer represents the feature size expected by the projection.
Fix: Print or assert [batch, sequence, feature] immediately before multiplication.

Averaging padding positions

Symptom: Ticket summaries change when shorter tickets receive more padding.
Cause: A plain sequence-axis mean treats padding positions like meaningful tokens.
Fix: Use a padding mask and divide by the number of real tokens.

Mastery Check

  1. A delivery vector is documented as [distance_units, backlog_hours], and weights are [hours_per_distance, hours_per_backlog]. A data pipeline accidentally sends [backlog_hours, distance_units] while keeping the same two weights. What is the main problem?
  2. ShopFlow stores one order as x = [3, 2], where positions mean [distance_units, backlog_hours], and uses weights w = [1.5, 4.0] in the same order. Which computation returns the model's single delivery-time prediction?
  3. An order table X has shape [20, 3]: 20 order rows and 3 input features. A weight matrix W has shape [3, 4], with one column for each output feature. What does X @ W produce?
  4. In project_tickets, hidden should be [batch, sequence, feature] and projection should be [input_feature, output_feature]. For hidden.shape == [4, 10, 32] and projection.shape == [16, 64], which guard should fail before multiplication?
  5. A batch of ticket hidden states H has shape [2, 3, 2] for [batch, sequence, feature]. A projection P has shape [2, 3] for [input_feature, output_feature]. What is the result of H @ P?
  6. An output array has shape [3, 2], with rows as orders and columns as [hours, priority]. You meant to add one bias per output column. Which choice matches that intent and avoids a silent axis bug?
  7. A single ticket has token vectors stored in tokens with shape [4, 6]: four token positions, each with six features. Which operation forms a raw pairwise dot-product score grid between token positions, and what shape does it have?
  8. A projected ticket tensor has shape [2, 3, 2], and mean(axis=1) averages the three token positions to get one summary per ticket. In a real padded text batch, why is a plain mean risky?

Next Step
Continue to Linear Algebra for ML

You can now trace vectors, matrix transforms, tensor axes, and dot-product grids without guessing shapes. Next you will examine the directions inside a matrix through SVD, rank, and conditioning, which makes compression and numerical stability understandable rather than mysterious.

References

[1] Harris, C. R., et al. (2020). Array programming with NumPy. Nature.

[2] NumPy Contributors (2026). Broadcasting. Official NumPy documentation.

[3] Vaswani, A., et al. (2017). Attention Is All You Need.

[4] Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.