
Linear Algebra for ML

Master SVD by hand on small matrices, build geometric intuition for PCA as a rotation to variance-maximizing axes, and learn how rank and condition numbers are used directly in embedding compression, latent semantic retrieval, and numerical stability for ML systems. Includes a from-scratch NumPy eigendecomposition SVD and sklearn comparisons.


Linear algebra is the operating system underneath every embedding table, every attention head, and every retrieval index. The introductory vectors-matrices-tensors lesson taught you the nouns (vector, matrix, dot product, shape). The clustering-and-PCA lesson showed you a black-box PCA call that magically gave you lower-dimensional ticket embeddings. This chapter teaches you the verbs: how to actually factor a matrix by hand, why the factors have geometric meaning, and exactly how that factorization is used to compress embeddings, denoise retrieval, and keep numerical systems stable when you deploy real LLM products.

Where this fits: it builds on vectors, matrices, and PCA, then unlocks later chapters on embeddings, retrieval, LoRA, and interpretability. We will keep the examples close to e-commerce and logistics systems: merchant catalog matrices, support-ticket embeddings, and warehouse search indexes.

[Figure: a 3×2 term-document matrix factored step by step via AᵀA eigendecomposition (V, Σ, then U), a 2-D point cloud with principal component axes, explained-variance bars, the condition number κ = σ_max/σ_min, and a truncated low-rank approximation used for embedding compression.]
Visual anchor: Everything in one picture: a real data matrix, the Gram matrix you actually diagonalize, the V and Σ factors you obtain by hand, the resulting U, a geometric PCA view of the same idea on points, variance bars, rank, condition number, and the compressed low-rank form used in production embedding stores.

You only need to be comfortable multiplying small matrices and solving a 2×2 quadratic equation. We will compute every singular vector with pencil-and-paper arithmetic on two concrete matrices before we write a single line of NumPy. By the end you will be able to look at a 768-dimensional embedding matrix and know, without running code, whether it is well-conditioned, how much you can compress it with almost no quality loss, and why a tiny change in one token can sometimes flip the top-1 result in a vector database.

The SVD is the Swiss-army knife of applied linear algebra

Any real matrix ( A ) of size ( m \times n ) admits the factorization[1]

A = U \Sigma V^T

where

  • ( U ) is ( m \times m ) (or ( m \times r ) in the economy form) with orthonormal columns (the left singular vectors),
  • ( \Sigma ) is diagonal, ( m \times n ) in the full form or ( r \times r ) in the economy form, with non-negative entries ( \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0 ) (the singular values),
  • ( V ) is ( n \times n ) with orthonormal columns (the right singular vectors).

The number ( r ) of strictly positive singular values is exactly the rank of ( A ). The factorization is unique up to signs when singular values are distinct.

Why this particular factorization is magical for machine learning:

  • The columns of ( V ) give the natural coordinate system for the columns of ( A ).
  • The first few columns of ( V ) are the directions of greatest variance (exactly the principal components when ( A ) is a centered data matrix).
  • Truncating after the first ( k ) singular values gives the optimal rank-( k ) approximation of ( A ) in both Frobenius and spectral norm (Eckart–Young theorem).
  • The ratio ( \kappa(A) = \sigma_1 / \sigma_r ) (the 2-norm condition number) tells you how much tiny input perturbations will be magnified in any linear operation that uses ( A ).

All of these statements become obvious once you have computed one SVD by hand.
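If you want to see the claims before doing the hand computation, here is a minimal NumPy check on an arbitrary small matrix built to be nearly rank-2 (the sizes, rank, and noise level are illustrative choices only):

python
import numpy as np

rng = np.random.default_rng(0)
# A 6x4 matrix that is nearly rank-2: a rank-2 product plus a little noise
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4)) + 1e-3 * rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", np.round(S, 4))    # two large, two near zero -> numerical rank 2

# Eckart-Young: the rank-k truncation is the best rank-k approximation, and its
# squared Frobenius error equals the sum of the discarded sigma^2
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k, "fro") ** 2, np.sum(S[k:] ** 2))

# Condition number kappa = sigma_max / sigma_min
print(S[0] / S[-1], np.linalg.cond(A))       # the two numbers agree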

Hand-worked SVD on a 2×2 matrix (the cleanest numbers possible)

Let

A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

(This could be the co-occurrence counts of two terms across two documents, or the covariance of two features.)

First form the Gram matrix ( G = A^T A ). Because ( A ) is symmetric we have ( G = A^2 ):

G = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}

We now solve the eigenvalue problem ( G v = \lambda v ), or equivalently ( \det(G - \lambda I) = 0 ):

(5 - \lambda)^2 - 16 = 0 \quad \Rightarrow \quad \lambda^2 - 10\lambda + 9 = 0 \quad \Rightarrow \quad (\lambda - 9)(\lambda - 1) = 0

So ( \lambda_1 = 9 ), ( \lambda_2 = 1 ).

The singular values are the square roots (always taken positive and sorted descending):

\sigma_1 = 3, \quad \sigma_2 = 1

Now the eigenvectors (right singular vectors).

For ( \lambda_1 = 9 ):

(5 - 9)v_1 + 4v_2 = 0 \quad \Rightarrow \quad -4v_1 + 4v_2 = 0 \quad \Rightarrow \quad v_1 = v_2

A unit vector satisfying this is ( v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \approx \begin{pmatrix} 0.7071 \\ 0.7071 \end{pmatrix} ).

For ( \lambda_2 = 1 ):

(5 - 1)v_1 + 4v_2 = 0 \quad \Rightarrow \quad 4v_1 + 4v_2 = 0 \quad \Rightarrow \quad v_1 = -v_2

Unit vector: ( v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix} \approx \begin{pmatrix} 0.7071 \\ -0.7071 \end{pmatrix} ).

Therefore the matrix ( V ) whose columns are these vectors (in descending-( \sigma ) order) is

V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}

Because ( A ) is symmetric with positive eigenvalues, ( U = V ) here (in general you compute ( U = A V \Sigma^{-1} ) column by column). The diagonal matrix of singular values is simply

\Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}

Verify the factorization (with rounded numbers for readability):

U \Sigma V^T = \begin{pmatrix} 0.707 & 0.707 \\ 0.707 & -0.707 \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0.707 & 0.707 \\ 0.707 & -0.707 \end{pmatrix} \approx \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

With the exact entries ( 1/\sqrt{2} ) the reconstruction is exact; with the rounded values shown it matches ( A ) to about three decimal places. You have just performed a complete SVD by hand.

A 3×2 non-square example (the realistic case)

Real data matrices are rarely square. Let

A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}

(three documents, two terms; the third document contains both terms).

The Gram matrix is now 2×2:

A^T A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

Characteristic equation: ( \lambda^2 - 4\lambda + 3 = 0 ), with solutions ( \lambda_1 = 3 ) and ( \lambda_2 = 1 ).

Singular values:

\sigma_1 = \sqrt{3} \approx 1.732, \quad \sigma_2 = 1

(Conveniently, this ( A^T A ) is exactly the matrix from the 2×2 example, so you already know its eigenvectors; the difference is that here its eigenvalues are the squared singular values of ( A ).)

Solving for the eigenvectors (same procedure) yields

V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \approx \begin{pmatrix} 0.7071 & 0.7071 \\ 0.7071 & -0.7071 \end{pmatrix}

The left singular vectors ( U ) (now 3×2) are obtained column-wise without ever forming the huge 3×3 matrix ( AA^T ):

u_i = \frac{1}{\sigma_i} A v_i

After normalization you obtain

U \approx \begin{pmatrix} 0.4082 & 0.7071 \\ 0.4082 & -0.7071 \\ 0.8165 & 0.0000 \end{pmatrix}

(You can check the columns of ( U ) are orthonormal and that ( U \Sigma V^T ) reproduces ( A ) within rounding error.)

The key computational fact: you always diagonalize the smaller Gram matrix (( n \times n ) when ( m \gg n )). This is why SVD scales to real embedding matrices.
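Here is a short sketch of that small-Gram-matrix route applied to the 3×2 example, in the same NumPy style as the code later in the article (eigh may flip the sign of a column, which is still a valid SVD):

python
import numpy as np

# The 3x2 term-document matrix from the example above
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Diagonalize the small 2x2 Gram matrix instead of the 3x3 A A^T
AtA = A.T @ A
eigvals, V = np.linalg.eigh(AtA)
idx = np.argsort(eigvals)[::-1]          # sort descending
eigvals, V = eigvals[idx], V[:, idx]
sigmas = np.sqrt(eigvals)                # ~[1.7321, 1.0]

# Left singular vectors column by column: u_i = A v_i / sigma_i
U = (A @ V) / sigmas

print("sigmas:", np.round(sigmas, 4))
print("U:\n", np.round(U, 4))
print("max reconstruction error:",
      np.max(np.abs(U @ np.diag(sigmas) @ V.T - A)))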

Geometric meaning – PCA is SVD in disguise

Suppose you have a cloud of points in 2-D (or 768-D) that you have already mean-centered to form the data matrix ( X ) (samples × features).

  • The right singular vectors ( V ) are exactly the principal component directions.
  • The first column of ( V ) is the single direction in feature space that maximizes projected variance ( \lVert X v_1 \rVert^2 = \sigma_1^2 ).
  • The second column is the best direction orthogonal to the first, and so on.
  • In the illustration, the long axis is PC1 and the short axis is PC2.
  • When you call pca.fit_transform(embeddings), the library is doing a truncated SVD on the centered matrix and returning the scores ( X V_k ); the short sketch after this list checks that equivalence numerically.
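A minimal check of the PCA-is-SVD claim on toy data (the cloud shape and sizes are arbitrary, and signs of individual components may flip between the two routes):

python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# An elongated 2-D cloud: stretched 3x along one axis, 0.5x along the other
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0],
                                          [0.0, 0.5]])

Xc = X - X.mean(axis=0)                      # center first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pca = PCA(n_components=2).fit(X)             # PCA centers internally

# Rows of Vt and pca.components_ agree up to a per-component sign flip
print(np.round(Vt, 3))
print(np.round(pca.components_, 3))

# Explained variance from the singular values: sigma_i^2 / (n - 1)
print(S**2 / (len(X) - 1))
print(pca.explained_variance_)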

Rank, truncation, and compression

Rank(( A )) = number of strictly positive singular values. In the 3×2 example the matrix has full rank 2. If one singular value had been exactly zero, the columns would have been linearly dependent and we could have described the same data with a single column of ( U ), one number in ( \Sigma ), and one column of ( V ).

Truncating after the first ( k ) components gives the matrix ( A_k = U_{:,k} \Sigma_k V_{k,:}^T ) that is closest to ( A ) among all rank-( k ) matrices. The squared Frobenius error is exactly the sum of the squares of the discarded singular values.

This is why embedding compression works so well. A 1,000,000 × 768 matrix of embeddings can be stored, with almost no loss in cosine-similarity quality, as three much smaller arrays whose total size is roughly 1,000,000 × 64 + 64 + 64 × 768 numbers, more than 10× smaller.
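A rough sketch of that compression arithmetic, with a smaller synthetic matrix standing in for the million-row case (the 5,000 × 768 size, rank 64, and noise level are illustrative assumptions):

python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Stand-in for a large embedding matrix: mostly rank-64 structure plus a little noise
E = rng.normal(size=(5000, 64)) @ rng.normal(size=(64, 768)) \
    + 0.01 * rng.normal(size=(5000, 768))

k = 64
svd = TruncatedSVD(n_components=k, random_state=0)
scores = svd.fit_transform(E)        # 5000 x 64, roughly U_k Sigma_k
components = svd.components_         # 64 x 768, the rows of V_k^T

# Storage: full matrix vs (scores + components)
print(f"compression factor ~ {E.size / (scores.size + components.size):.1f}x")

# Cosine similarity between two rows survives almost unchanged after reconstruction
E_k = scores @ components
print(cosine_similarity(E[[0]], E[[1]])[0, 0],
      cosine_similarity(E_k[[0]], E_k[[1]])[0, 0])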

Condition number and numerical stability

The 2-norm condition number of ( A ) is

\kappa_2(A) = \frac{\sigma_1}{\sigma_r}

In our first 2×2 example ( \kappa = 3 / 1 = 3 ) (very well-conditioned). In real embedding matrices you will often see values from a few hundred up to ( 10^5 )–( 10^6 ).

Interpretation: if you perturb the input by a relative amount ( \epsilon ), the output of any linear operation involving ( A ) (matrix-vector product, least-squares solve, even a dot-product similarity after projection) can change by up to ( \kappa \epsilon ). When ( \kappa = 10,000 ), a 0.01 % noise in a single coordinate can produce a 100 % change in the result.

In a vector database this manifests as “query drift”: two almost-identical queries can retrieve completely different top documents because the tiny difference was magnified along a tiny-singular-value direction that is mostly noise. Truncating the SVD removes precisely those unstable directions and therefore improves both compression and reliability.
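A toy sketch of that amplification, assuming a 2×2 matrix with singular values 10 and 0.001 chosen purely for illustration:

python
import numpy as np

rng = np.random.default_rng(0)

# Build a matrix with one dominant and one tiny singular value
U, _ = np.linalg.qr(rng.normal(size=(2, 2)))
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))
sigmas = np.array([10.0, 1e-3])
A = U @ np.diag(sigmas) @ V.T
print("condition number:", np.linalg.cond(A))        # ~1e4

# Worst case for a matrix-vector product: signal along the weak direction,
# perturbation along the strong one
q = V[:, 1]
dq = 1e-4 * V[:, 0]
rel_in = np.linalg.norm(dq) / np.linalg.norm(q)
rel_out = np.linalg.norm(A @ dq) / np.linalg.norm(A @ q)
print(f"relative change: input {rel_in:.0e}, output {rel_out:.0e} (kappa times larger)")

# Keeping only the top singular component discards the direction that blew up
A1 = sigmas[0] * np.outer(U[:, 0], V[:, 0])
print("rank-1 output for q:", A1 @ q)                # ~zero: unstable direction removed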

From-scratch SVD in NumPy versus the libraries

Here is the exact procedure you just performed, written as runnable code.

python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# From-scratch via eigendecomposition of the Gram matrix
AtA = A.T @ A
eigvals, V = np.linalg.eigh(AtA)
idx = np.argsort(eigvals)[::-1]   # sort descending by eigenvalue
eigvals = eigvals[idx]
V = V[:, idx]
sigmas = np.sqrt(eigvals)
U = (A @ V) / sigmas              # u_i = A v_i / sigma_i, divided column-wise

print("Singular values:", sigmas)
print("V (right singular vectors):\n", V)
print("U (left singular vectors):\n", U)

# Verify
Sigma = np.diag(sigmas)
recon = U @ Sigma @ V.T
print("Max reconstruction error:", np.max(np.abs(recon - A)))

Running the identical data through the library call produces (within floating point, and up to a possible sign flip per column) the same factors:

python
U_lib, S_lib, Vt_lib = np.linalg.svd(A, full_matrices=False)
print("Library singular values:", S_lib)

For a non-square data matrix or when you only want the top ( k ) components, sklearn is usually more convenient and faster:

python
from sklearn.decomposition import TruncatedSVD, PCA

# TruncatedSVD works directly on the (possibly sparse) data matrix
svd = TruncatedSVD(n_components=2, random_state=0)
U_trunc = svd.fit_transform(A)   # == U_k @ Sigma_k
Vt_trunc = svd.components_       # rows are top V^T

# For classical PCA you must center first (or use the PCA class)
X = np.array([[0.2, 0.1], [0.1, 0.3], [-0.4, -0.2], [0.1, -0.2]])  # toy centered points
pca = PCA(n_components=2).fit(X)
print("PCA components_ (V^T):\n", pca.components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)

The numbers will match your hand calculation on the same data. The only difference is that the library versions are numerically stable, support sparse matrices, and use randomized algorithms for very large data.

Concrete applications inside LLM and retrieval systems

Use case | Matrix you factor | Production reason
Merchant catalog search | SKU-by-term or SKU-by-embedding matrix | Recover similar products even when titles use different wording
Support-ticket retrieval | Ticket-by-feature matrix | Denoise noisy tickets before vector search
Warehouse routing analytics | Route-by-delay-feature matrix | Separate carrier, region, and seasonality directions
LoRA adaptation | Weight-update matrix | Store a useful update with far fewer parameters
  • Latent Semantic Indexing / LSA. Build a term-by-document matrix (tf-idf or raw counts), run SVD, project both queries and documents into the top-100 or top-300 latent dimensions. Documents about “late package” and “delivery delay” become close even if they never shared a term, because the first few right singular vectors discovered the underlying logistics direction.

  • Embedding compression for vector databases. Many production systems (Pinecone, Weaviate, pgvector with indexes, etc.) optionally run a PCA or OPQ step (which is a rotation derived from SVD) before product quantization. You pay a one-time SVD cost on a sample of your embeddings and then store 8–32× less data while often improving recall because you discarded noise.

  • Activation steering and interpretability. Collect residual-stream activations across many prompts, run SVD, and the top directions frequently correspond to human-interpretable concepts such as truthfulness, refusal, or code mode.

  • Low-rank adaptation (LoRA). When you train a LoRA adapter you are explicitly learning a low-rank update ( \Delta W = BA ). The mathematically optimal rank-( r ) factors for a given full update matrix are given by its truncated SVD. Training finds a nearby good factorization; the SVD view tells you the absolute best compression you could ever achieve for that rank (a small numerical check of this follows after the list).

  • Numerical hygiene in production. Any time you solve a linear system, compute an inverse, or even just do many dot products against a fixed matrix, a high condition number is a red flag. Monitoring the singular values (or at least the condition number of a small random projection) of your embedding or weight matrices is a cheap way to catch impending instability before it appears in your RAG quality metrics.
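To make the LoRA bullet concrete, here is a minimal sketch comparing the truncated-SVD factors of a synthetic update matrix against a random rank-r factorization (the 256×256 size, rank 16, and noise level are made-up numbers for illustration, not real model statistics):

python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a full fine-tuning update Delta W: mostly rank-16 structure plus noise
dW = rng.normal(size=(256, 16)) @ rng.normal(size=(16, 256)) \
     + 0.1 * rng.normal(size=(256, 256))

r = 16
U, S, Vt = np.linalg.svd(dW, full_matrices=False)
B = U[:, :r] * S[:r]     # 256 x r, singular values absorbed into B
A_lr = Vt[:r, :]         # r x 256

opt_err = np.linalg.norm(dW - B @ A_lr, "fro")

# Any other rank-r factorization can only do worse (Eckart-Young)
rand_err = np.linalg.norm(
    dW - rng.normal(size=(256, r)) @ rng.normal(size=(r, 256)), "fro")

print(f"best rank-{r} (truncated SVD) error: {opt_err:.1f}")
print(f"random rank-{r} factors error:      {rand_err:.1f}")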

Practice problems

Do these with pencil and paper first, then verify with the code patterns above.

  1. For the matrix ( A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix} ), compute ( A^T A ), its eigenvalues, the singular values, and the condition number ( \kappa ). What changed compared with our first example?

  2. Implement a tiny function manual_svd_2x2(A) that accepts any 2×2 matrix, performs the eigendecomposition route, returns (U, S, Vt), and prints the reconstruction error. Test it on both matrices we used in the article.

  3. Generate (or hard-code) six 2-D points that form a clearly elongated cloud. Center them, run your manual SVD (or np.linalg.svd), and print the first column of ( V ). Verify that it points roughly along the long axis of the cloud.

  4. Construct a deliberately ill-conditioned 5×3 matrix (make the third column almost equal to the sum of the first two plus a tiny ( \epsilon )). Add 0.001 noise to a single entry. Compute the change in the top singular vector before and after keeping only the largest two singular values. Quantify how much more stable the truncated version is.

Solution sketches

  1. ( A^T A = \begin{pmatrix} 10 & 6 \\ 6 & 10 \end{pmatrix} ). Eigenvalues 16 and 4. Singular values 4 and 2. ( \kappa = 2 ). The matrix is better conditioned than the first example because the off-diagonal is relatively smaller.

  2. The function is essentially the block shown in the “from-scratch” section, wrapped in a helper; on both matrices the max error is below 1e-14. A packaged sketch appears after these solution notes.

  3. After centering, the first right singular vector will be within a few degrees of the direction of greatest elongation (you can eyeball it or compute the sample covariance and its dominant eigenvector - they must agree).

  4. The full-rank top singular vector moves by a large angle when you perturb the near-dependent column. After truncation to rank 2 the direction is almost unchanged - the unstable direction has been removed.
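For reference, a minimal sketch of the manual_svd_2x2 helper from problem 2, using the same eigendecomposition route as the article (it assumes the input has full rank, since it divides by each singular value):

python
import numpy as np

def manual_svd_2x2(A):
    """SVD of a 2x2 matrix via eigendecomposition of the Gram matrix A^T A."""
    A = np.asarray(A, dtype=float)
    eigvals, V = np.linalg.eigh(A.T @ A)
    idx = np.argsort(eigvals)[::-1]            # descending sigma order
    eigvals, V = eigvals[idx], V[:, idx]
    S = np.sqrt(np.clip(eigvals, 0.0, None))   # clip tiny negatives from round-off
    U = (A @ V) / S                            # u_i = A v_i / sigma_i
    print("max reconstruction error:", np.max(np.abs(U @ np.diag(S) @ V.T - A)))
    return U, S, V.T

# The 2x2 matrix from the worked example and the one from practice problem 1
manual_svd_2x2([[2, 1], [1, 2]])
manual_svd_2x2([[3, 1], [1, 3]])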

Evaluation Rubric
  1. Carries out a complete SVD by hand for a 2×2 or 3×2 matrix with integer entries, computes singular values and vectors, and reconstructs the original matrix to verify A = UΣVᵀ.
  2. Implements a from-scratch SVD routine using NumPy eigh on AᵀA, applies it to a toy embedding matrix, and contrasts reconstruction error and singular values against np.linalg.svd and sklearn TruncatedSVD.
  3. Diagnoses a high-condition-number embedding matrix in a retrieval setting, explains how noise gets amplified in dot-product scores, and shows how keeping only the top-k singular components both compresses and denoises the representation.
Common Pitfalls
  • Forgetting to center (subtract column means) before treating a data matrix as input to PCA. The first principal component then points toward the data centroid instead of the direction of greatest spread; a short demonstration follows after this list.
  • Running full np.linalg.svd on a 1-million-row embedding matrix when you only need the top 128 components; always use TruncatedSVD or randomized_svd for production scale.
  • Reporting raw singular values as importance. Always convert to explained_variance_ratio = σ² / Σσ² so that you can say “the first 32 components explain 87 % of variance.”
  • Treating U and V symmetrically. V lives in the original feature space (the columns you can interpret), U lives in the sample space (one coordinate per row of the original matrix).
  • Applying SVD or PCA to unscaled features. A column with values in the thousands will dominate every singular vector even if it carries no semantic signal.
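A quick demonstration of the centering pitfall on made-up data (the offsets and scales are arbitrary):

python
import numpy as np

rng = np.random.default_rng(0)
# A cloud elongated along the x-axis but shifted far from the origin
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.3]) + np.array([50.0, 50.0])

# Without centering, the top direction points at the centroid (~45 degrees)
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
# With centering, it recovers the true long axis (~the x-axis)
_, _, Vt_ctr = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

print("uncentered first direction:", np.round(Vt_raw[0], 3))   # roughly (0.707, 0.707)
print("centered first direction:  ", np.round(Vt_ctr[0], 3))   # roughly (+-1, 0)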

Key Concepts Tested
  • Singular value decomposition (SVD) hand computation
  • Eigendecomposition of AᵀA and its relation to V
  • Principal component analysis as SVD on centered data
  • Matrix rank and truncated low-rank approximation
  • Condition number and its effect on numerical stability
  • Applications of SVD/PCA to embeddings, retrieval, and compression
Next Step
Next: Optimization Algorithms: Adam, Schedulers, and Why They Matter

You now own the factorizations that turn raw high-dimensional vectors into clean, stable, low-dimensional directions that embeddings and retrievers actually use. The optimization chapter shows you how those same matrix properties (condition numbers, curvature along the principal axes you just discovered) explain why plain gradient descent zig-zags or diverges, and why Adam plus learning-rate schedules succeed in practice. The two chapters together give you the complete mathematical story behind every training run.

References

1. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
