Easy · NLP Fundamentals

Vectors, Matrices & Tensors

Learn the linear algebra vocabulary behind LLMs: vectors, matrices, tensors, dot products, matrix multiplication, and shape reasoning.


If you've ever seen an LLM error like mat1 and mat2 shapes cannot be multiplied, you've touched the math layer underneath modern AI. Transformers, embeddings, attention, and neural networks all run on the same small set of linear algebra ideas.

The good news: you don't need a math degree to follow LLM internals. You need a practical mental model for vectors, matrices, tensors, and shapes. Once those click, later articles on neural networks and attention become much easier.

This article gives you the minimum math toolkit for the rest of LeetLLM. We'll focus on what each object means, how dimensions flow through a model, and how to debug the most common shape mistakes.

💡 Key insight: Deep learning is mostly learned linear algebra plus nonlinearities. Vectors store features, matrices transform those features, and tensors batch the whole computation so GPUs can process many tokens at once.[1]

Visual anchor: a map pairing each math object with its LLM counterpart, so embeddings are vectors, weights are matrices, hidden states are tensors, and attention scores come from dot products.

Step map

Step   Object        Beginner check
1      Vector        One token or item becomes one list of features.
2      Dot product   Two vectors produce one alignment score.
3      Matrix        A learned grid transforms many features at once.
4      Tensor        Batch and sequence axes keep many examples organized.

The main habit is shape tracing. Read each operation from left to right and check which dimensions are allowed to multiply.


Vectors: lists of meaningful numbers

A vector is a list of numbers. That's it.

x = [0.2, -1.4, 3.1, 0.7]

In geometry, a vector is an arrow pointing somewhere in space. In machine learning, a vector is a compact representation of features. The numbers might describe a token, an image patch, a user profile, or an internal model state.

For LLMs, the most important vector is the embedding. After tokenization, each token ID gets mapped to a learned vector:

Token   Token ID   Embedding shape
"cat"   3797       [768]
"sat"   3332       [768]
"mat"   2603       [768]

The shape [768] means the token is represented by 768 numbers. Bigger models often use much larger hidden dimensions, but the idea is the same: each token becomes a point in a learned space.

Why vectors? Because numbers can be compared, transformed, and optimized. Strings can't be multiplied by a neural network. Vectors can.
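
As a concrete sketch, here is what that lookup looks like in NumPy. The token IDs mirror the table above; the vocabulary size is an assumed round number, and the embedding values are random placeholders rather than trained weights.

```python
import numpy as np

# Assumed sizes: a ~50k-token vocabulary and a 768-dimensional hidden size.
vocab_size, hidden_dim = 50_257, 768

# In a real model this table is a trained parameter; here it is random.
embedding_table = np.random.randn(vocab_size, hidden_dim).astype(np.float32)

token_ids = [3797, 3332, 2603]           # "cat", "sat", "mat" from the table above
embeddings = embedding_table[token_ids]  # integer indexing is the embedding lookup

print(embedding_table.shape)  # (50257, 768): one row per token ID
print(embeddings.shape)       # (3, 768): one 768-number vector per token
```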


Dot products: alignment between vectors

The dot product takes two vectors of the same length and returns one number:

a \cdot b = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n

Example:

[1, 2, 3] \cdot [4, 0, 2] = (1 \times 4) + (2 \times 0) + (3 \times 2) = 10

Intuitively, the dot product measures alignment. If two vectors point in similar directions, their dot product is high. If they point in opposite directions, it's low or negative. If they're unrelated, it's near zero.

This is why dot products appear everywhere in LLMs:

  • Retrieval systems compare query vectors with document vectors.
  • Attention compares query vectors with key vectors.
  • Output layers compare hidden states with vocabulary vectors.

Cosine similarity is a normalized version of the dot product:

\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}

The denominator divides out vector length, so cosine focuses on direction. Raw dot product mixes direction and magnitude.
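
A quick NumPy check of both quantities, reusing the example vectors from above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 2.0])

# Dot product: elementwise products summed into one alignment score.
dot = np.dot(a, b)  # (1*4) + (2*0) + (3*2) = 10.0

# Cosine similarity: the same dot product divided by both vector lengths,
# so only direction matters, not magnitude.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)     # 10.0
print(cosine)  # ~0.598
```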


Matrices: transformations for vectors

A matrix is a rectangular grid of numbers:

W = \begin{bmatrix} 0.2 & -0.1 & 0.7 \\ 1.3 & 0.4 & -0.5 \end{bmatrix}

You can think of a matrix as a machine that transforms one vector into another vector. If the input vector has 3 numbers and the matrix has shape [2, 3], the output vector has 2 numbers:

W x = \begin{bmatrix} 0.2 & -0.1 & 0.7 \\ 1.3 & 0.4 & -0.5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}

Shape rule:

[2, 3] \times [3] \rightarrow [2]

The inner dimensions must match. If they don't, multiplication is undefined.

Neural network layers are built from this operation:

y = W x + b
  • x is the input vector.
  • W is the learned weight matrix.
  • b is the learned bias vector.
  • y is the transformed output vector.

That one equation is the backbone of feed-forward networks, output projections, and many internal Transformer operations.
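
A small NumPy sketch of that layer, using the W from the example above together with an illustrative x and b (those two values are made up for the demo):

```python
import numpy as np

# W from the example above: shape [2, 3].
W = np.array([[0.2, -0.1,  0.7],
              [1.3,  0.4, -0.5]])
x = np.array([1.0, 2.0, 3.0])   # illustrative input with 3 features
b = np.array([0.1, -0.2])       # illustrative bias with 2 entries

y = W @ x + b   # [2, 3] x [3] -> [2], then add the bias elementwise
print(y.shape)  # (2,)
print(y)        # [2.2  0.4] for these particular values
```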


Matrix multiplication: many vectors at once

In practice, models don't process one vector at a time. They process a whole sequence of token vectors.

Suppose a sentence has 4 tokens, and each token has a 768-dimensional embedding. The sequence is a matrix:

X \in \mathbb{R}^{4 \times 768}

That means:

  • 4 rows, one per token position.
  • 768 columns, one per embedding feature.

If a layer projects hidden size 768 to hidden size 3072, its weight matrix has shape:

W \in \mathbb{R}^{768 \times 3072}

Then:

X W \in \mathbb{R}^{4 \times 3072}

Shape rule:

[4, 768] \times [768, 3072] \rightarrow [4, 3072]

Every token gets transformed by the same weight matrix. That's why neural networks can run efficiently on GPUs: one matrix multiplication applies the same learned operation across many tokens in parallel.
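
A shape-level sketch in NumPy; the matrices are random stand-ins because only the shapes matter here:

```python
import numpy as np

seq_len, hidden_dim, ffn_dim = 4, 768, 3072

X = np.random.randn(seq_len, hidden_dim)  # 4 token vectors, 768 features each
W = np.random.randn(hidden_dim, ffn_dim)  # one learned projection, 768 -> 3072

Y = X @ W       # [4, 768] x [768, 3072] -> [4, 3072]
print(Y.shape)  # (4, 3072)

# Each row of Y is the same transformation applied to one token's vector.
print(np.allclose(Y[0], X[0] @ W))  # True
```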


Tensors: matrices with more axes

A tensor is a generalization of vectors and matrices:

Object   Rank   Example shape    Meaning
Scalar   0      []               One number
Vector   1      [768]            One token embedding
Matrix   2      [128, 768]       One sequence of 128 token embeddings
Tensor   3+     [8, 128, 768]    Batch of 8 sequences

LLMs usually carry hidden states as a 3D tensor:

H \in \mathbb{R}^{B \times T \times D}

Where:

  • B = batch size (how many examples at once)
  • T = sequence length (how many token positions)
  • D = hidden dimension (how many numbers per token)

Example:

```text
hidden_states.shape == [8, 128, 768]
```

This means: 8 examples, 128 tokens each, 768 numbers per token.

Once you can read shapes, model internals become less mysterious. A tensor shape tells you what the model is storing and which dimensions can interact.
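
You can poke at those axes directly. A short NumPy sketch with random values:

```python
import numpy as np

batch, seq_len, hidden_dim = 8, 128, 768
hidden_states = np.random.randn(batch, seq_len, hidden_dim)

print(hidden_states.shape)        # (8, 128, 768): batch, sequence, hidden

# Peeling off axes one at a time:
print(hidden_states[0].shape)     # (128, 768): one sequence, i.e. a matrix
print(hidden_states[0, 5].shape)  # (768,): one token's vector
print(hidden_states[0, 5, 0])     # a single number, i.e. a scalar
```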


Attention preview: shapes tell the story

Attention is covered deeply later, but the shape story is useful now.[2]

Each token vector gets projected into three new vectors:

  • Query (Q): what this token is looking for
  • Key (K): what this token offers for matching
  • Value (V): what information this token will contribute

For one sequence:

Q \in \mathbb{R}^{T \times D}, \quad K \in \mathbb{R}^{T \times D}, \quad V \in \mathbb{R}^{T \times D}

Attention scores come from:

Q K^T

Shape check:

[T, D] \times [D, T] \rightarrow [T, T]

The result is a square matrix: every token compared with every other token. If the sequence has 128 tokens, attention produces a [128, 128] score matrix. This is the core reason attention gets expensive for long context windows.
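
Here is a single-head sketch in NumPy, with an assumed head dimension of 64, no masking, and no batching, just to watch the shapes flow:

```python
import numpy as np

T, D = 128, 64                 # sequence length and an assumed head dimension
Q = np.random.randn(T, D)
K = np.random.randn(T, D)
V = np.random.randn(T, D)

scores = Q @ K.T / np.sqrt(D)  # [T, D] x [D, T] -> [T, T] scaled alignment scores

# Softmax over each row turns scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V           # [T, T] x [T, D] -> [T, D]
print(scores.shape)            # (128, 128)
print(output.shape)            # (128, 64)
```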


Shape debugging: the skill that saves hours

When a model crashes with a shape error, don't guess. Write down the dimensions.

Common mistake 1: wrong matrix order

Matrix multiplication isn't commutative:

AB \ne BA

If:

A: [4, 768], \quad B: [768, 3072]

Then:

AB: [4, 3072]

But:

BA

is invalid, because [768, 3072] x [4, 768] has inner dimensions 3072 and 4, which don't match.
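
Both cases are easy to reproduce in NumPy; the invalid order fails with a shape error you can read directly:

```python
import numpy as np

A = np.random.randn(4, 768)
B = np.random.randn(768, 3072)

print((A @ B).shape)   # (4, 3072): inner dimensions 768 and 768 match

try:
    B @ A              # inner dimensions 3072 and 4 do not match
except ValueError as err:
    print("matmul failed:", err)
```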

Common mistake 2: forgetting the transpose

Attention needs QK^T, not QK.

If both Q and K are [T, D], then:

QK: [T, D] \times [T, D]

is invalid. Transpose K first:

QK^T: [T, D] \times [D, T] \rightarrow [T, T]

Common mistake 3: mixing hidden size and vocabulary size

Inside the Transformer, most tensors use hidden dimension D:

```text
hidden_states.shape == [batch, sequence, hidden_dim]
```

Only the final output projection maps hidden states to vocabulary logits:

```text
logits.shape == [batch, sequence, vocab_size]
```

Confusing these two dimensions leads to broken mental models. Hidden dimension is the model's internal workspace. Vocabulary size is the set of tokens it can predict.
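
A sketch of the two shapes side by side, using a toy vocabulary size (real tokenizers are closer to 50,000 entries):

```python
import numpy as np

batch, seq_len, hidden_dim, vocab_size = 2, 16, 768, 1_000

hidden_states = np.random.randn(batch, seq_len, hidden_dim)
W_out = np.random.randn(hidden_dim, vocab_size)  # final output projection (unembedding)

logits = hidden_states @ W_out   # matmul broadcasts over the batch axis
print(hidden_states.shape)       # (2, 16, 768): hidden dimension, the internal workspace
print(logits.shape)              # (2, 16, 1000): vocab size, one score per predictable token
```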


Key takeaways

  • A vector is a list of numbers that represents features, such as one token embedding.
  • A dot product measures alignment between two vectors and powers similarity, attention, and logits.
  • A matrix is a learned transformation that maps vectors from one space to another.
  • Matrix multiplication only works when inner dimensions match.
  • A tensor is a vector or matrix with extra axes, such as [batch, sequence, hidden].
  • Shape reasoning is the fastest way to debug neural networks and understand LLM internals.

Next, continue to Neural Networks from Scratch, where these shapes become trainable layers.

Evaluation Rubric
  1. Defines vectors, matrices, and tensors using concrete LLM examples
  2. Explains dot product intuition and why it measures alignment or similarity
  3. Checks matrix multiplication shapes and explains why incompatible shapes fail
  4. Maps embeddings, hidden states, and attention scores to tensor shapes
  5. Explains how weight matrices transform vectors inside neural network layers
  6. Uses shape notation to debug attention computations like QK^T
Common Pitfalls
  • Treating tensor shapes as bookkeeping instead of core model behavior
  • Forgetting that matrix multiplication order matters: AB and BA are usually different and may not both be valid
  • Confusing dot product similarity with cosine similarity without checking vector length
  • Ignoring batch and sequence dimensions when reasoning about LLM hidden states
  • Assuming a tensor's last dimension is always vocabulary size, when it's usually the hidden dimension

Key Concepts Tested
  • Vectors as lists of features and directions in space
  • Dot product as a similarity score
  • Matrices as transformations that move vectors between spaces
  • Matrix multiplication and shape compatibility
  • Tensors as batched stacks of vectors and matrices
  • Common LLM tensor shapes: batch, sequence, hidden dimension
  • Shape debugging for attention and neural network layers
References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
Vaswani, A., et al. (2017). Attention Is All You Need.
