Linear Algebra for ML

Find hidden directions in a support-incident matrix with SVD, then use rank, PCA, truncation, and condition numbers without losing sight of what the numbers mean.


Access-ticket features can already be arranged into tensors and projected through a matrix. A harder question comes next: if two feature columns mostly repeat the same pattern, do you need to keep both?

That question appears everywhere in machine learning. An embedding table can contain redundant directions. A learned update can be almost low-rank. A data matrix can have a weak direction that makes a numerical solve unreliable. Singular value decomposition (SVD) gives you a precise way to find those directions, measure their strengths, and decide what to keep.

Figure: Incident matrix decomposed into two orthogonal feature directions, with one stronger singular direction and one weaker contrast direction.
SVD turns the incident matrix into feature directions, then measures their strength. The common latency-plus-rollback direction is stronger than the contrast direction; compression comes later.

A matrix can hide repeated directions

Use a tiny matrix with the same named-axis discipline as the previous lesson. Each row is one ticket. The two columns record whether the ticket contains a latency signal or a rollback signal:

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$$

The third ticket contains both signals. That makes the combined direction [latency + rollback] stronger than the contrast direction [latency - rollback].

To compare directions fairly, give each one length 1:

$$v_{\text{common}} = \frac{1}{\sqrt{2}} \begin{bmatrix}1 \\ 1\end{bmatrix}, \qquad v_{\text{contrast}} = \frac{1}{\sqrt{2}} \begin{bmatrix}1 \\ -1\end{bmatrix}$$

Multiply A by either direction. The result tells you how strongly each ticket activates that pattern; its length tells you the pattern's total strength in the matrix.

measure-two-ticket-directions.py

```python
import numpy as np

A = np.array([
    [1.0, 0.0],  # latency
    [0.0, 1.0],  # rollback
    [1.0, 1.0],  # latency and rollback
])
common = np.array([1.0, 1.0]) / np.sqrt(2)
contrast = np.array([1.0, -1.0]) / np.sqrt(2)

# Per-ticket activation of each unit-length direction.
common_scores = A @ common
contrast_scores = A @ contrast

print("A_shape", A.shape)
print("common_scores", common_scores.round(4).tolist())
print("contrast_scores", contrast_scores.round(4).tolist())
print("strengths", round(np.linalg.norm(common_scores), 4), round(np.linalg.norm(contrast_scores), 4))
```

Output

```text
A_shape (3, 2)
common_scores [0.7071, 0.7071, 1.4142]
contrast_scores [0.7071, -0.7071, 0.0]
strengths 1.7321 1.0
```

The common direction has strength $\sqrt{3}$; the contrast direction has strength $1$. The third row reinforces the common direction and cancels in the contrast direction. Nothing is compressed yet. You're only measuring which feature combinations the rows support most strongly.

The Gram matrix reveals those directions by hand

For a data matrix A, multiply it by its transpose on the feature side:

$$G = A^T A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

G has shape [2, 2] because it compares feature directions with feature directions. A direction v that obeys $Gv = \lambda v$ is an eigenvector of G: G scales it without turning it. Its eigenvalue $\lambda$ is squared strength, because

$$\lVert A v \rVert^2 = v^T A^T A v = \lambda$$

when v has length 1.

For this matrix:

$$G \begin{bmatrix}1 \\ 1\end{bmatrix} = 3 \begin{bmatrix}1 \\ 1\end{bmatrix}, \qquad G \begin{bmatrix}1 \\ -1\end{bmatrix} = 1 \begin{bmatrix}1 \\ -1\end{bmatrix}$$

Figure: Equal-scale vector plots showing the Gram matrix stretching the common diagonal direction by eigenvalue three while leaving the contrast diagonal direction at eigenvalue one.
Both eigenvectors keep their direction under G. The common direction stretches 3× while the contrast direction stays unchanged; because G = AᵀA measures squared strength, A's singular values are √3 and 1.

check-the-gram-directions.py

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = A.T @ A
common = np.array([1.0, 1.0]) / np.sqrt(2)
contrast = np.array([1.0, -1.0]) / np.sqrt(2)

print("G", G.tolist())
print("G_common", (G @ common).round(4).tolist())
print("three_common", (3 * common).round(4).tolist())
print("G_contrast_equals_contrast", np.allclose(G @ contrast, contrast))
```

Output

```text
G [[2.0, 1.0], [1.0, 2.0]]
G_common [2.1213, 2.1213]
three_common [2.1213, 2.1213]
G_contrast_equals_contrast True
```
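
The same directions can be recovered without guessing them first. The sketch below is a minimal illustration, assuming NumPy's `np.linalg.eigh` (intended for symmetric matrices such as G); the variable names are illustrative. It checks that the square root of each eigenvalue of G matches the measured length of `A @ v` for the corresponding unit eigenvector.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = A.T @ A

# eigh targets symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(G)

# Each eigenvalue is a squared strength, so square roots give strengths.
strengths = np.sqrt(eigenvalues)

# Measure ||A v|| directly for each unit-length eigenvector (stored as columns).
measured = np.linalg.norm(A @ eigenvectors, axis=0)

print("eigenvalues", eigenvalues.round(4).tolist())
print("strengths", strengths.round(4).tolist())
print("measured", measured.round(4).tolist())
```

Because the eigenvalues come back in ascending order, the contrast direction (eigenvalue 1) appears before the common direction (eigenvalue 3).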

SVD names the directions and strengths

Every real matrix has a singular value decomposition [1]:

$$A = U \Sigma V^T$$

For the incident matrix A with shape [3, 2], use the compact (or reduced) form. It keeps only the two singular directions that can fit in a matrix with two feature columns:

| Factor | Shape | Meaning here |
| --- | --- | --- |
| $V$ | [2, 2] | directions in feature space, such as common signal versus contrast |
| $\Sigma$ | [2, 2] | non-negative strengths, ordered largest first |
| $U$ | [3, 2] | how each ticket participates in each direction |

Our right singular vectors are the normalized common and contrast directions. Our singular values are:

$$\sigma_1 = \sqrt{3}, \qquad \sigma_2 = 1$$

The corresponding left singular vectors come from $u_i = A v_i / \sigma_i$. Put the factors back together and they reconstruct every original entry.

Use the direct SVD routine for computation. It returns U, the one-dimensional singular-value array S, and Vt, which is $V^T$.
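
The hand construction can be checked numerically. This is a minimal sketch, assuming the normalized common and contrast directions as the columns of a matrix named `V` (an illustrative name); it builds each left singular vector from $u_i = A v_i / \sigma_i$ and confirms that the three factors multiply back to A.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Right singular vectors as columns: common direction first, then contrast.
V = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
sigma = np.array([np.sqrt(3.0), 1.0])

# u_i = A v_i / sigma_i, applied to both columns at once via broadcasting.
U = (A @ V) / sigma

reconstructed = U @ np.diag(sigma) @ V.T

print("U_shape", U.shape)
print("U_columns_orthonormal", np.allclose(U.T @ U, np.eye(2)))
print("reconstructs_A", np.allclose(reconstructed, A))
```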

factor-and-reconstruct-ticket-matrix.py

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# full_matrices=False requests the compact (reduced) form.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
reconstructed = U @ np.diag(S) @ Vt

print("shapes", U.shape, S.shape, Vt.shape)
print("singular_values", S.round(4).tolist())
print("reconstruction_error", float(np.max(np.abs(A - reconstructed))))
print("reconstructs_A", np.allclose(A, reconstructed))
```

Output

```text
shapes (3, 2) (2,) (2, 2)
singular_values [1.7321, 1.0]
reconstruction_error 4.440892098500626e-16
reconstructs_A True
```

The sign of a singular vector may differ between correct implementations. If both matching directions in U and V flip sign, their product stays the same. Compare singular values and reconstruction, not raw signs.
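
The sign argument is easy to confirm. The following sketch, with illustrative variable names, flips the first singular pair on both sides and then on one side only. A paired flip leaves the product unchanged, while a one-sided flip does not.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the first singular pair: column 0 of U and row 0 of Vt together.
U_flipped = U.copy()
Vt_flipped = Vt.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1
paired_ok = bool(np.allclose(U_flipped @ np.diag(S) @ Vt_flipped, A))

# Flip only the U side: the reconstruction no longer matches A.
U_only = U.copy()
U_only[:, 0] *= -1
one_sided_ok = bool(np.allclose(U_only @ np.diag(S) @ Vt, A))

print("paired_flip_reconstructs_A", paired_ok)
print("one_sided_flip_reconstructs_A", one_sided_ok)
```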

Rank counts independent patterns

The rank of a matrix is the number of non-zero singular values. It counts how many independent directions the matrix really needs.

Suppose every rollback feature exactly repeats the latency feature:

$$B = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 1 \end{bmatrix}$$

The second column adds no new information. B has two columns, but only one independent direction, so its rank is 1.

rank-detects-a-duplicate-feature.py

```python
import numpy as np

B = np.array([
    [1.0, 1.0],
    [0.0, 0.0],
    [1.0, 1.0],
])
# compute_uv=False returns only the singular values.
singular_values = np.linalg.svd(B, compute_uv=False)

print("shape", B.shape)
print("singular_values", singular_values.round(6).tolist())
print("rank", int(np.linalg.matrix_rank(B)))
```

Output

```text
shape (3, 2)
singular_values [2.0, 0.0]
rank 1
```

A feature table with rank below its column count deserves inspection. Sometimes it contains a deliberate encoding. Sometimes two features duplicate each other and waste compute. Rank exposes the question; domain meaning answers it.

Exact rank is a mathematical definition. Floating-point data rarely produces a perfect zero, so np.linalg.matrix_rank treats sufficiently small singular values as zero using a tolerance. Measurement noise or downstream requirements may justify a different cutoff.
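
To see the tolerance at work, consider a hypothetical near-duplicate feature. The sketch below assumes a perturbation of 1e-10 on one entry and an illustrative cutoff of 1e-6 passed through the `tol` argument of `np.linalg.matrix_rank`; neither number comes from a real dataset.

```python
import numpy as np

# Hypothetical near-duplicate: the second column repeats the first up to 1e-10.
B_noisy = np.array([
    [1.0, 1.0 + 1e-10],
    [0.0, 0.0],
    [1.0, 1.0],
])
singular_values = np.linalg.svd(B_noisy, compute_uv=False)

# The default tolerance is tied to floating-point precision, so the tiny
# second singular value still counts as non-zero.
default_rank = int(np.linalg.matrix_rank(B_noisy))

# tol is an absolute threshold: singular values at or below it count as zero.
noise_aware_rank = int(np.linalg.matrix_rank(B_noisy, tol=1e-6))

print("singular_values", singular_values.tolist())
print("default_rank", default_rank)
print("noise_aware_rank", noise_aware_rank)
```

The default answers the floating-point question. The explicit cutoff answers the measurement question, and choosing it is a modeling decision.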

Truncation trades detail for a smaller matrix

Return to the full-rank incident matrix A. Keeping both singular directions reconstructs it exactly. Keeping only the strongest direction creates a rank-one approximation:

$$A_1 = U_{:,1} \sigma_1 V_{:,1}^T$$

Among all rank-one matrices, this approximation has the smallest Frobenius reconstruction error: the square root of the sum of squared entry errors [2]. Equivalently, it minimizes that squared sum. The guarantee is about reconstructing A. It doesn't promise that a downstream classifier or retriever keeps its quality.

truncate-to-one-ticket-direction.py

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
# Keep only the first singular triplet.
rank_one = U[:, :1] @ np.diag(S[:1]) @ Vt[:1, :]

error = np.linalg.norm(A - rank_one, ord="fro")

print("rank_one")
print(rank_one.round(3))
print("frobenius_error", round(float(error), 4))
print("discarded_singular_value", round(float(S[1]), 4))
```

Output

```text
rank_one
[[0.5 0.5]
 [0.5 0.5]
 [1.  1. ]]
frobenius_error 1.0
discarded_singular_value 1.0
```

Here the discarded direction has strength 1, so the rank-one approximation loses meaningful contrast between a latency-only ticket and a rollback-only ticket. A small matrix makes that cost visible before you try the same idea on embeddings.
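
It is no accident that the Frobenius error and the discarded singular value are both 1.0 in the output above: for a rank-k truncation, the Frobenius error equals the square root of the sum of the squared discarded singular values. The sketch below checks that identity on a small illustrative 4-by-3 matrix, chosen only so that more than one direction can be discarded.

```python
import numpy as np

# Illustrative matrix with three singular directions.
M = np.array([
    [3.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
    [2.0, 1.0, 0.5],
    [1.0, 2.0, 1.5],
])
U, S, Vt = np.linalg.svd(M, full_matrices=False)

errors = []
for k in range(1, len(S) + 1):
    truncated = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    measured = float(np.linalg.norm(M - truncated, ord="fro"))
    # Predicted error uses only the singular values that were dropped.
    predicted = float(np.sqrt(np.sum(S[k:] ** 2)))
    errors.append((measured, predicted))
    print("k", k, "measured", round(measured, 4), "predicted", round(predicted, 4))
```

The prediction needs only the singular values, so the reconstruction cost of a cutoff can be estimated before building the truncated matrix.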

Figure: Singular value cutoff diagram showing squared strength retained by two components and a reminder to validate rankings after compression.
Singular values order candidate directions by measured strength. Squared singular values measure retained matrix energy; a retrieval check decides whether dropping the tail is acceptable for the task.

PCA is SVD after centering each feature

So far the matrix contained signal counts and we kept its origin at zero. Principal component analysis (PCA) answers a different question: which directions describe variation around the average row?

That means centering first. Given a matrix X of tickets by numeric features:

$$X_{\text{centered}} = X - \text{column means}$$

Then SVD on the centered matrix gives principal-component directions in V. If you skip centering, a large average offset can dominate the first direction even when it doesn't describe variation between tickets.

center-before-finding-principal-directions.py

```python
import numpy as np

X = np.array([
    [1.0, 1.0],  # low delay evidence, low escalation evidence
    [2.0, 2.1],
    [3.0, 2.8],
    [4.0, 4.2],  # high on both signals
])
centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = Vt[0]
# Fix the sign so the printed direction is stable across implementations.
if pc1[0] < 0:
    pc1 = -pc1

retained = S[0] ** 2 / np.sum(S ** 2)

print("column_means", X.mean(axis=0).round(3).tolist())
print("centered_means", centered.mean(axis=0).round(8).tolist())
print("pc1", pc1.round(4).tolist())
print("pc1_variance_fraction", round(float(retained), 4))
```

Output

```text
column_means [2.5, 2.525]
centered_means [0.0, -0.0]
pc1 [0.6937, 0.7203]
pc1_variance_fraction 0.9961
```

This leading direction points roughly along [delay evidence + escalation evidence], because those two signals rise together in the sample. PCA uses variance, not meaning. A direction that explains a lot of variation still needs human interpretation and task evaluation.
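
The variance claim can be cross-checked against the sample covariance matrix. This is a minimal sketch, assuming the usual n - 1 denominator (the default in `np.cov`): the squared singular values of the centered matrix, divided by n - 1, should equal the covariance eigenvalues.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.1], [3.0, 2.8], [4.0, 4.2]])
centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Variance along each principal direction: squared singular value over (n - 1).
n = X.shape[0]
variance_from_svd = S ** 2 / (n - 1)

# Cross-check against the eigenvalues of the sample covariance matrix.
covariance = np.cov(X, rowvar=False)
variance_from_cov = np.linalg.eigvalsh(covariance)[::-1]  # descending order

print("variance_from_svd", variance_from_svd.round(4).tolist())
print("variance_from_cov", variance_from_cov.round(4).tolist())
print("match", bool(np.allclose(variance_from_svd, variance_from_cov)))
```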

Condition number warns about solving through weak directions

SVD also reveals a failure mode. If a matrix stretches one direction strongly and almost erases another, an operation that tries to reverse that transformation must amplify the weak direction.

The 2-norm condition number is:

$$\kappa_2(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$$

when the smallest singular value is non-zero. A condition number near 1 means all directions have comparable scales. A large condition number warns that solving, inverting, whitening (rescaling projected components toward unit variance), or other operations that divide by small singular values can magnify perturbations [1].
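
The definition translates directly into code. The helper below is a hypothetical sketch (the name `condition_number_2` is not a NumPy function); it computes the ratio from the singular values and compares the result with `np.linalg.cond`, which uses the 2-norm by default.

```python
import numpy as np

def condition_number_2(matrix: np.ndarray) -> float:
    """Return sigma_max / sigma_min, or infinity when sigma_min is exactly zero."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    smallest = singular_values[-1]  # NumPy returns them in descending order
    # Simplistic guard: real code would usually compare against a tolerance.
    if smallest == 0.0:
        return float("inf")
    return float(singular_values[0] / smallest)

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weak = np.diag([1.0, 0.0001])

kappa_A = condition_number_2(A)
kappa_weak = condition_number_2(weak)

print("kappa_A", round(kappa_A, 4))
print("kappa_weak", round(kappa_weak, 4))
print("matches_numpy", bool(np.isclose(kappa_weak, np.linalg.cond(weak))))
```

The incident matrix is well conditioned because its two directions differ in strength only by a factor of $\sqrt{3}$.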

tiny-noise-becomes-large-after-a-solve.py

```python
import numpy as np

A = np.diag([1.0, 0.0001])
x_true = np.array([1.0, 1.0])
y = A @ x_true
# Perturb only the coordinate that A almost erases.
y_with_noise = y + np.array([0.0, 0.0001])

x_clean = np.linalg.solve(A, y)
x_noisy = np.linalg.solve(A, y_with_noise)

print("condition_number", float(np.linalg.cond(A)))
print("clean_solution", x_clean.tolist())
print("noisy_solution", x_noisy.tolist())
print("small_output_noise_doubled_second_coordinate", x_noisy[1] == 2.0)
```

Output

```text
condition_number 10000.0
clean_solution [1.0, 1.0]
noisy_solution [1.0, 2.0]
small_output_noise_doubled_second_coordinate True
```

Notice the scope of the claim. Merely storing vectors in a matrix with a weak direction doesn't automatically make cosine or dot-product retrieval unstable. The amplification appears when a pipeline performs an inverse-like operation, such as a solve or whitening transform, along that weak direction.
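
Whitening shows the same effect without an explicit solve. The sketch below is an assumption-laden illustration: the singular values and the perturbation size are hypothetical, and whitening is simplified to dividing each projected coordinate by its singular value.

```python
import numpy as np

# Hypothetical singular values for three projected components.
singular_values = np.array([20.0, 4.0, 0.0002])

# The same small perturbation lands on every projected coordinate.
perturbation = np.full(3, 1e-4)

# Whitening-style rescaling divides each coordinate by its singular value.
amplified = perturbation / singular_values
gain = amplified / perturbation

print("condition_number", float(singular_values[0] / singular_values[-1]))
print("amplified", amplified.tolist())
print("gain", gain.tolist())
```

The strong directions shrink the perturbation, while the weakest one multiplies it by 5,000. Dropping or regularizing that component before whitening is a common response, and it still needs task-level validation.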

Use the Gram matrix for insight, not as a numerical shortcut

A.T @ A made our hand calculation easy, but it isn't the safe general implementation of SVD. Its condition number is squared:

$$\kappa_2(A^T A) = \kappa_2(A)^2$$

for a full-column-rank matrix. A matrix with condition number 10,000 turns into a Gram matrix with condition number 100,000,000. That can discard useful precision before the decomposition even begins. Use direct SVD implementations in numerical code; keep A.T @ A as a teaching device for the relationship between eigenvectors and singular vectors.

forming-the-gram-matrix-squares-conditioning.py

```python
import numpy as np

A = np.diag([1.0, 0.0001])
G = A.T @ A

print("cond_A", float(np.linalg.cond(A)))
print("cond_AtA", float(np.linalg.cond(G)))
print("ratio", float(np.linalg.cond(G) / np.linalg.cond(A)))
```

Output

```text
cond_A 10000.0
cond_AtA 100000000.0
ratio 10000.0
```
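
The precision loss can be made concrete. The sketch below uses a hypothetical matrix whose weak direction has scale 1e-8, chosen so that its square falls below float64 resolution near 1. Forming the Gram matrix rounds that information away before any decomposition runs, while a direct SVD of A still recovers it.

```python
import numpy as np

eps = 1e-8  # hypothetical weak-direction scale; eps**2 is lost when added to 1.0
A = np.array([
    [1.0, 1.0],
    [eps, 0.0],
    [0.0, eps],
])

# Direct SVD works on A itself and still sees the weak direction.
direct = np.linalg.svd(A, compute_uv=False)

# Each diagonal entry of G is 1 + eps**2, which rounds to exactly 1.0.
G = A.T @ A
gram_is_all_ones = bool(np.all(G == 1.0))

# Square roots of G's eigenvalues, clipped so rounding noise can't go negative.
gram_eigenvalues = np.linalg.eigvalsh(G)[::-1]
via_gram = np.sqrt(np.clip(gram_eigenvalues, 0.0, None))

print("direct_singular_values", direct.tolist())
print("gram_is_all_ones", gram_is_all_ones)
print("via_gram_singular_values", via_gram.tolist())
```

Once G is stored as an exact all-ones matrix, no eigen-solver can tell it apart from the Gram matrix of two identical columns, so the second value it reports reflects rounding rather than the data.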

Choosing a cutoff uses squared singular values

For centered data, the squared singular values are proportional to how much variation each principal direction explains. For an uncentered count or embedding matrix, they still measure Frobenius energy captured by each direction. In either case, the cumulative ratio gives a transparent reconstruction-based cutoff, where r is the number of singular values:

$$\text{retained ratio at } k = \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}$$

choose-smallest-cutoff-that-retains-energy.py

```python
import numpy as np

singular_values = np.array([12.4, 8.7, 3.1, 0.9, 0.3])
fractions = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
target = 0.95
# searchsorted gives the first index whose cumulative fraction reaches the target.
k = int(np.searchsorted(fractions, target) + 1)

print("retained_by_component", fractions.round(4).tolist())
print("smallest_k_for_95_percent", k)
print("retained_at_k", round(float(fractions[k - 1]), 4))
```

Output

```text
retained_by_component [0.6408, 0.9562, 0.9962, 0.9996, 1.0]
smallest_k_for_95_percent 2
retained_at_k 0.9562
```

This rule answers a reconstruction question. For a retrieval system, follow it with ranking metrics on held-out queries rather than assuming that matrix energy equals relevance.

Build it: compress a tiny retrieval matrix and check rankings

Represent four incident tickets with three term features: latency, rollback, and trace. The query is mostly about a latency trace with a smaller rollback signal. Project both tickets and query onto the top two right-singular directions, then compare rankings before and after compression with cosine similarity. Cosine similarity divides each dot product by both vector lengths, so the score reflects direction rather than raw magnitude.

compress-ticket-retrieval-and-validate-ranking.py

```python
import numpy as np

tickets = np.array([
    [3.0, 0.0, 1.0],  # latency trace
    [0.0, 3.0, 1.0],  # rollback trace
    [2.0, 1.0, 0.5],  # latency with rollback mention
    [1.0, 2.0, 1.5],  # rollback with latency mention
])
query = np.array([2.5, 0.5, 1.0])

def cosine_scores(matrix: np.ndarray, vector: np.ndarray) -> np.ndarray:
    row_norms = np.linalg.norm(matrix, axis=1)
    vector_norm = np.linalg.norm(vector)
    if np.any(row_norms == 0) or vector_norm == 0:
        raise ValueError("cosine similarity requires non-zero vectors")
    return (matrix @ vector) / (row_norms * vector_norm)

_, singular_values, Vt = np.linalg.svd(tickets, full_matrices=False)
# Columns of basis are the top two right-singular directions.
basis = Vt[:2].T
retained_energy = np.sum(singular_values[:2] ** 2) / np.sum(singular_values ** 2)

original_scores = cosine_scores(tickets, query)
compressed_scores = cosine_scores(tickets @ basis, query @ basis)
original_order = np.argsort(original_scores)[::-1]
compressed_order = np.argsort(compressed_scores)[::-1]

print("singular_values", singular_values.round(4).tolist())
print("retained_energy", round(float(retained_energy), 4))
print("original_order", original_order.tolist())
print("compressed_order", compressed_order.tolist())
print("top_ticket_preserved", bool(original_order[0] == compressed_order[0]))
print("max_score_change", round(float(np.max(np.abs(original_scores - compressed_scores))), 4))

try:
    cosine_scores(tickets, np.zeros(3))
except ValueError as error:
    print("zero_vector_guard", str(error))
```

Output

```text
singular_values [4.7011, 3.1677, 0.6044]
retained_energy 0.9888
original_order [0, 2, 3, 1]
compressed_order [0, 2, 3, 1]
top_ticket_preserved True
max_score_change 0.0221
zero_vector_guard cosine similarity requires non-zero vectors
```

For this tiny fixture, compression drops a real tail direction and changes scores while preserving the ranking. The explicit norm guard matters too: cosine similarity is undefined for a zero-length query or ticket vector. A production pipeline should reject, quarantine, or replace that vector instead of returning NaN scores.

This is a useful sanity check, not a credible evaluation. On a real retrieval set, measure Recall@k, the fraction of relevant tickets recovered in the top k returned candidates, and inspect failures before accepting reduced dimensions.
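
Recall@k itself takes only a few lines. The helper below is a hypothetical sketch: `recall_at_k`, the score arrays, and the relevance labels are made up for illustration and are not outputs of the fixture above.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant: set[int], k: int) -> float:
    """Fraction of relevant item indices that appear in the top-k by score."""
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    top_k = np.argsort(scores)[::-1][:k]
    hits = len(relevant.intersection(int(index) for index in top_k))
    return hits / len(relevant)

# Hypothetical scores and relevance labels for one held-out query.
original_scores = np.array([0.95, 0.30, 0.90, 0.60])
compressed_scores = np.array([0.97, 0.40, 0.55, 0.80])
relevant_tickets = {0, 2}

recall_original = recall_at_k(original_scores, relevant_tickets, k=2)
recall_compressed = recall_at_k(compressed_scores, relevant_tickets, k=2)

print("recall_at_2_original", recall_original)
print("recall_at_2_compressed", recall_compressed)
```

In this made-up case the top result survives compression, yet one relevant ticket drops out of the top two. A top-1 check alone would miss that regression; averaging Recall@k over many held-out queries would catch it.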

Where these ideas return in LLM engineering

| Later task | Matrix idea used here | What must still be validated |
| --- | --- | --- |
| Latent semantic analysis for text retrieval | Truncated SVD maps term-document structure into fewer directions. [3] | Ranking quality on relevant queries |
| Embedding dimensionality reduction | Centered PCA or a task-specific projection drops feature directions | Recall, latency, and storage tradeoffs |
| Low-rank adaptation (LoRA) | A LoRA update is represented as two thin matrices whose product has limited rank. [4] | Fine-tuned model quality |
| Training diagnostics | Weak or differently scaled directions help explain why optimization can be difficult | Loss curves and held-out performance |

LoRA doesn't compute the SVD of a full trained update during fine-tuning. It learns thin factors directly. SVD gives you the language to understand what "low rank" means; later training lessons explain how those factors are optimized.

pytorch-reconstructs-a-low-rank-view.py

```python
import torch

update = torch.tensor([
    [2.0, 1.0],
    [1.0, 2.0],
])
# torch.linalg.svd returns Vh, the transpose of V.
U, S, Vh = torch.linalg.svd(update, full_matrices=False)
rank_one = U[:, :1] @ torch.diag(S[:1]) @ Vh[:1, :]

print("singular_values", [round(float(value), 4) for value in S])
print("rank_one_shape", tuple(rank_one.shape))
print("rank_one_error", round(float(torch.linalg.norm(update - rank_one)), 4))
```

Output

```text
singular_values [3.0, 1.0]
rank_one_shape (2, 2)
rank_one_error 1.0
```
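
To connect "two thin matrices" with "limited rank", the sketch below multiplies two random thin factors and inspects the product. The layer sizes, the rank `r`, and the random values are hypothetical stand-ins; in LoRA the factors are trained, and nothing here reproduces that training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4  # hypothetical layer sizes and adapter rank

# Two thin factors; random stand-ins for factors that LoRA would learn.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
update = B @ A

full_parameters = d_out * d_in
thin_parameters = d_out * r + r * d_in
singular_values = np.linalg.svd(update, compute_uv=False)
update_rank = int(np.linalg.matrix_rank(update))

print("update_shape", update.shape)
print("rank", update_rank)
print("parameters_full_vs_thin", full_parameters, thin_parameters)
print("tail_singular_values_negligible", bool(np.all(singular_values[r:] < 1e-10)))
```

The product has the full layer shape, but its rank can't exceed `r`, and the thin factors store far fewer numbers than a dense update of the same shape.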

Practice with measured answers

  1. For $C = \begin{bmatrix}3 & 1 \\ 1 & 3\end{bmatrix}$, compute $C^T C$, its two eigenvalues, and its singular values. Then compute the condition number.
  2. Modify the retrieval build exercise to keep only one singular direction. Did the first result stay the same? How large did the maximum score change become?
  3. Add a large constant offset [100, 100] to every row of the PCA example. Compare SVD before centering and after centering. Which one describes differences between rows?

Answers to check:

  1. $C^T C = \begin{bmatrix}10 & 6 \\ 6 & 10\end{bmatrix}$, eigenvalues are 16 and 4, singular values are 4 and 2, and $\kappa_2(C) = 2$.
  2. A one-direction representation throws away the latency-versus-rollback contrast. With this fixture, every compressed cosine score becomes 1.0, so the ranking stops being meaningful. The build code, modified to keep one direction, returns order [3, 2, 1, 0], doesn't preserve the first result, and reports a maximum score change of 0.7113.
  3. Uncentered SVD is pulled toward the large offset. Centered SVD removes that shared mean and recovers the variation direction.
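
Answer 3 can be checked with a short script. This is a minimal sketch, assuming the four-row PCA example with the offset [100, 100] added to every row; the helper name `first_direction` is illustrative.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.1], [3.0, 2.8], [4.0, 4.2]])
shifted = X + np.array([100.0, 100.0])

def first_direction(matrix: np.ndarray) -> np.ndarray:
    """Leading right singular vector, sign-fixed so its first entry is non-negative."""
    _, _, Vt = np.linalg.svd(matrix, full_matrices=False)
    direction = Vt[0]
    return -direction if direction[0] < 0 else direction

uncentered = first_direction(shifted)
centered = first_direction(shifted - shifted.mean(axis=0))

# Unit vector pointing at the average row of the shifted data.
mean_direction = shifted.mean(axis=0) / np.linalg.norm(shifted.mean(axis=0))

print("uncentered_direction", uncentered.round(4).tolist())
print("mean_direction", mean_direction.round(4).tolist())
print("centered_direction", centered.round(4).tolist())
```

The uncentered direction nearly coincides with the direction of the average row, while the centered direction is the same one found before the offset was added.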

Mastery check

Key concepts

  • SVD as feature directions in $V$, strengths in $\Sigma$, and row participation in $U$
  • rank as the number of independent non-zero singular directions
  • truncated SVD as best fixed-rank reconstruction, not automatic downstream success
  • PCA as SVD of a centered data matrix
  • condition number as a warning for inverse-like operations along weak directions
  • direct SVD as the computational default instead of manually decomposing $A^T A$

Evaluation rubric

  • Foundational: Starting from the three-incident matrix, computes $A^T A$, identifies common and contrast directions, and explains why the singular values are $\sqrt{3}$ and 1.
  • Intermediate: Uses NumPy SVD to choose a low-rank basis, compresses both tickets and query, and reports a retrieval metric before accepting the reduced representation.
  • Advanced: Builds an ill-conditioned matrix, demonstrates amplification during a solve or whitening-like operation, and explains why forming $A^T A$ worsens the numerical problem.

Common pitfalls

  • Symptom: PCA's first direction represents a shared baseline rather than differences among rows. Cause: The data wasn't centered. Fix: Subtract feature means before applying the PCA interpretation.
  • Symptom: A hand-built SVD becomes unreliable on near-dependent columns. Cause: It decomposes A.T @ A, squaring the condition number. Fix: Use a direct SVD routine for computation.
  • Symptom: Compression has low reconstruction error but misses important tickets. Cause: The cutoff optimized matrix energy instead of relevance. Fix: Measure retrieval metrics after projection.
  • Symptom: Similarity scores become NaN after projection. Cause: A query or ticket vector has zero length, so cosine similarity is undefined. Fix: Guard norms before division and define a reject, quarantine, or replacement policy.
  • Symptom: Engineers read columns of U as embedding feature axes. Cause: Row and feature spaces were confused. Fix: Use V for feature directions and U for row participation.

Mastery Check

1. For A = [[1, 0], [0, 1], [1, 1]], G = A.T @ A = [[2, 1], [1, 2]]. The common and contrast directions are eigenvectors of G with eigenvalues 3 and 1. What are the singular values of A, and what is its rank?
2. A centered ticket-feature matrix Xc has shape [5000, 384], and np.linalg.svd(Xc, full_matrices=False) returns U, S, Vt with Xc = U @ diag(S) @ Vt. To build a reusable 64-dimensional projection for centered ticket feature rows, which description and computation are correct?
3. For B = [[1, 1], [0, 0], [1, 1]], the two feature columns exactly repeat each other and its singular values are [2.0, 0.0]. What conclusion follows?
4. For A = [[1, 0], [0, 1], [1, 1]], the singular values are sqrt(3) and 1. A rank-one approximation keeps only the largest singular triplet. Which statement is correct?
5. A ticket-feature table has a large shared baseline in both numeric features, and you want PCA directions that describe variation between tickets rather than the baseline itself. Which preprocessing and SVD interpretation are correct?
6. For ordered singular values [12.4, 8.7, 3.1, 0.9, 0.3], choose the smallest k that retains at least 95 percent of matrix energy using the squared-singular-value ratio. Which choice is correct?
7. You reduce the four-ticket retrieval matrix to the top two right-singular directions. The projection retains 0.9888 of matrix energy and preserves the ranking on the tiny fixture. What is the defensible production conclusion?
8. A pipeline whitens projected components by dividing by their singular values. For singular values [20, 4, 0.0002], use kappa_2 = sigma_max / sigma_min and the effect of inverse-like division by small singular values to choose the correct warning.
9. A full-column-rank feature matrix has kappa_2(A) = 10000. An engineer plans to compute singular vectors by forming G = A.T @ A and eigen-decomposing G. Using kappa_2(A.T @ A) = kappa_2(A)^2, which numerical issue and implementation choice are correct?

You can now see how a matrix stretches some directions more than others and why weak directions can make inverse-like calculations fragile. Next you'll apply that geometric intuition to loss surfaces, where uneven directions make gradient descent zig-zag and motivate momentum, adaptive steps, and learning-rate schedules.

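References

[1] Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. https://people.maths.ox.ac.uk/trefethen/books.html

[2] Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. https://doi.org/10.1007/BF02288367

[3] Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9

[4] Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR. https://arxiv.org/abs/2106.09685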