LeetLLM
LearnPracticeFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Practice
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

ยฉ 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

๐Ÿ› ๏ธComputing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
๐Ÿ“ŠMath & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
๐Ÿ“šPreparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
๐ŸงฎML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
๐Ÿ“ฆProduction ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
๐ŸงชCore LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
๐ŸงฐApplied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
๐ŸŽ“Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
๐Ÿง Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
๐ŸงฌAdvanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
๐Ÿค–Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
โšกInference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
๐Ÿ—๏ธSystem Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
๐ŸŽคAI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnML Algorithms & EvaluationClustering and PCA
๐Ÿ“MediumEmbeddings & Vector Search

Clustering and PCA

Inspect unlabeled support-message embeddings with k-means and PCA, then stress-test whether apparent neighborhoods survive scale, metric, and compression choices.

17 min read
Learning path
Step 33 of 155 in the full curriculum
Validation and LeakageCore Retrieval Algorithms

In the last chapter, you learned to protect evaluation from leakage when labels exist. Now a support team has embedded thousands of untagged return and delivery messages, and nobody has assigned reliable topics yet. Before you build search or train a classifier, you need to ask a quieter question: does this representation contain useful neighborhoods at all?

This chapter gives you a disciplined answer. You'll group a tiny unlabeled corpus with k-means, inspect it from fewer dimensions with principal component analysis (PCA), and run failure tests that stop a pretty plot from becoming a false claim. Those habits matter later when a retrieval-augmented generation system selects evidence for an LLM.

Unlabeled support-message embedding map with three numbered neighborhoods and a PC1 axis, linked to sampled text that suggests refund, login, and shipping labels only after review, plus a gate requiring held-out query tests before retrieval. Unlabeled support-message embedding map with three numbered neighborhoods and a PC1 axis, linked to sampled text that suggests refund, login, and shipping labels only after review, plus a gate requiring held-out query tests before retrieval.
K-means and PCA can expose nearby pairs and dominant directions, but those are geometric hypotheses. Read centroid and boundary messages, keep labels separate from cluster IDs, and test held-out queries before using the representation for retrieval.

Six Messages Without Topic Labels

Suppose a text embedding model has already converted six customer messages into two-number vectors. Real embeddings might contain hundreds or thousands of coordinates. Two dimensions let us compute every important step by hand first.

Message IDCustomer textxxxyyy
refund_A"I was charged twice."1.01.0
refund_B"Please undo this duplicate payment."1.21.8
login_A"My reset link never arrives."4.04.2
login_B"Two-factor code keeps failing."5.03.8
shipping_A"Where is my package?"8.01.0
shipping_B"Tracking says delayed again."9.01.8

The human-readable names are available here so you can check the geometry. Pretend the algorithm sees only coordinates. It doesn't receive refund, login, or shipping as labels.

The first diagnostic is simply nearest neighbors. For two points a=(ax,ay)a=(a_x,a_y)a=(axโ€‹,ayโ€‹) and b=(bx,by)b=(b_x,b_y)b=(bxโ€‹,byโ€‹), Euclidean distance is:

d(a,b)=(axโˆ’bx)2+(ayโˆ’by)2d(a,b)=\sqrt{(a_x-b_x)^2 + (a_y-b_y)^2}d(a,b)=(axโ€‹โˆ’bxโ€‹)2+(ayโ€‹โˆ’byโ€‹)2โ€‹

That formula asks how far apart two messages lie in the current representation. Run it before clustering so you know whether local pairs look sensible.

nearest-message-neighbors.py
1import numpy as np 2 3names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"]) 4vectors = np.array([ 5 [1.0, 1.0], 6 [1.2, 1.8], 7 [4.0, 4.2], 8 [5.0, 3.8], 9 [8.0, 1.0], 10 [9.0, 1.8], 11]) 12 13distances = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2) 14np.fill_diagonal(distances, np.inf) 15 16for row, name in enumerate(names): 17 neighbor = distances[row].argmin() 18 print(f"{name:10} -> {names[neighbor]:10} distance={distances[row, neighbor]:.2f}")
Output
1refund_A -> refund_B distance=0.82 2refund_B -> refund_A distance=0.82 3login_A -> login_B distance=1.08 4login_B -> login_A distance=1.08 5shipping_A -> shipping_B distance=1.28 6shipping_B -> shipping_A distance=1.28

The intended local structure appears if each message's closest neighbor belongs to the same readable theme. That isn't a proof that all future messages will organize cleanly. It's a first representation check.

K-Means Turns Nearby Points Into Hypotheses

K-means is an unsupervised algorithm: it receives vectors and a requested number of groups, kkk, but no correct topic names. It repeats two operations:

  1. Assignment: send every vector to its nearest centroid.
  2. Update: replace each centroid with the mean of its assigned vectors.

The algorithm minimizes inertia, the within-cluster sum of squared distances:

J=โˆ‘i=1nโˆฅxiโˆ’ฮผziโˆฅ22J = \sum_{i=1}^{n} \left\lVert x_i - \mu_{z_i} \right\rVert_2^2J=i=1โˆ‘nโ€‹โˆฅxiโ€‹โˆ’ฮผziโ€‹โ€‹โˆฅ22โ€‹

Here, xix_ixiโ€‹ is message vector iii, ziz_iziโ€‹ is its assigned cluster, and ฮผzi\mu_{z_i}ฮผziโ€‹โ€‹ is that cluster's centroid. Lower inertia means points lie closer to their selected centers. It doesn't mean the resulting groups deserve human topic names.[1]

Start with k=3k=3k=3 and choose one seed from each visible area:

  • c0=(1.0,1.0)c_0=(1.0,1.0)c0โ€‹=(1.0,1.0) from refund_A
  • c1=(4.0,4.2)c_1=(4.0,4.2)c1โ€‹=(4.0,4.2) from login_A
  • c2=(8.0,1.0)c_2=(8.0,1.0)c2โ€‹=(8.0,1.0) from shipping_A

After assignment, each pair stays together. Averaging coordinates moves the three centroids to:

Candidate neighborhoodAssigned pointsUpdated centroid
left grouprefund_A, refund_B(1.1,1.4)(1.1, 1.4)(1.1,1.4)
upper grouplogin_A, login_B(4.5,4.0)(4.5, 4.0)(4.5,4.0)
right groupshipping_A, shipping_B(8.5,1.4)(8.5, 1.4)(8.5,1.4)

Notice the wording: the algorithm found left, upper, and right groups. A reviewer reads sampled messages and proposes the topic labels afterward.

one-kmeans-update.py
1import numpy as np 2 3vectors = np.array([ 4 [1.0, 1.0], [1.2, 1.8], 5 [4.0, 4.2], [5.0, 3.8], 6 [8.0, 1.0], [9.0, 1.8], 7]) 8centroids = vectors[[0, 2, 4]].copy() 9 10squared_distance = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2) 11cluster = squared_distance.argmin(axis=1) 12updated = np.vstack([vectors[cluster == index].mean(axis=0) for index in range(3)]) 13inertia = ((vectors - updated[cluster]) ** 2).sum() 14 15print("assignment:", cluster.tolist()) 16print("updated centroids:", np.round(updated, 2).tolist()) 17print(f"inertia after update: {inertia:.2f}")
Output
1assignment: [0, 0, 1, 1, 2, 2] 2updated centroids: [[1.1, 1.4], [4.5, 4.0], [8.5, 1.4]] 3inertia after update: 1.74

Two coordinate plots of six message embeddings showing nearest-seed assignment and centroid arrows moving to pair means, with inertia falling from 3.48 to 1.74 and cluster IDs separated from reviewed topic labels. Two coordinate plots of six message embeddings showing nearest-seed assignment and centroid arrows moving to pair means, with inertia falling from 3.48 to 1.74 and cluster IDs separated from reviewed topic labels.
One assignment/update pass moves each seed to its pair's mean and cuts this toy objective from `3.48` to `1.74`. The numeric IDs remain arbitrary; only source-message review can justify tentative topic names.

Build K-Means Instead of Hiding the Loop

A production implementation uses a tested library, but writing the small loop once makes later debugging far easier. The function below accepts a matrix of vectors and initial centers. It stops when the centers no longer move.

kmeans-from-scratch.py
1import numpy as np 2 3names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"]) 4vectors = np.array([ 5 [1.0, 1.0], [1.2, 1.8], 6 [4.0, 4.2], [5.0, 3.8], 7 [8.0, 1.0], [9.0, 1.8], 8]) 9 10def kmeans(x: np.ndarray, initial: np.ndarray, max_steps: int = 20): 11 centers = initial.astype(float).copy() 12 for step in range(1, max_steps + 1): 13 squared = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) 14 labels = squared.argmin(axis=1) 15 next_centers = np.vstack([x[labels == index].mean(axis=0) for index in range(len(centers))]) 16 if np.allclose(next_centers, centers): 17 inertia = ((x - next_centers[labels]) ** 2).sum() 18 return labels, next_centers, inertia, step 19 centers = next_centers 20 raise RuntimeError("k-means did not converge") 21 22labels, centers, inertia, steps = kmeans(vectors, vectors[[0, 2, 4]]) 23for index in range(3): 24 members = names[labels == index].tolist() 25 print(f"cluster {index}: {members} center={centers[index].round(2).tolist()}") 26print(f"converged in {steps} steps with inertia={inertia:.2f}")
Output
1cluster 0: ['refund_A', 'refund_B'] center=[1.1, 1.4] 2cluster 1: ['login_A', 'login_B'] center=[4.5, 4.0] 3cluster 2: ['shipping_A', 'shipping_B'] center=[8.5, 1.4] 4converged in 2 steps with inertia=1.74

One defensive detail is missing on purpose: if a center gets no assigned vectors, its mean becomes undefined. Library implementations handle empty clusters and initialization more carefully. Your learning target here is the alternating assignment/update mechanism, not a full clustering package.

Choose k by Usefulness, Not by Wishful Naming

You chose k=3k=3k=3 because the map visibly contained three pairs. On a new corpus, kkk isn't supplied by nature. A small audit combines several signals:

  • Inertia: lower is better, but always falls as kkk rises, eventually reaching zero when every point gets its own center.
  • Silhouette score: for each point, compares cohesion within its assigned group against separation from neighboring groups. Higher is better, but it still judges geometry rather than business meaning.
  • Sample review: read messages nearest each centroid and messages close to cluster boundaries.
  • Stability: rerun with different seeds or resampled messages and check whether the interpretation survives.

The scikit-learn version below evaluates possible values for kkk.[2] Setting n_init=20 makes the choice explicit: try several starting configurations and retain the lowest-inertia result for each kkk.

compare-candidate-cluster-counts.py
1import numpy as np 2from sklearn.cluster import KMeans 3from sklearn.metrics import silhouette_score 4 5vectors = np.array([ 6 [1.0, 1.0], [1.2, 1.8], 7 [4.0, 4.2], [5.0, 3.8], 8 [8.0, 1.0], [9.0, 1.8], 9]) 10 11for k in (2, 3, 4): 12 model = KMeans(n_clusters=k, random_state=7, n_init=20).fit(vectors) 13 silhouette = silhouette_score(vectors, model.labels_) 14 print(f"k={k}: inertia={model.inertia_:.2f}, silhouette={silhouette:.2f}")
Output
1k=2: inertia=20.06, silhouette=0.56 2k=3: inertia=1.74, silhouette=0.76 3k=4: inertia=0.92, silhouette=0.51

On this unusually neat six-point example, k=3k=3k=3 should separate the three visible pairs. On a real support corpus, no single metric gives permission to publish topic names. Read the examples.

Equal Scores Can Still Tell Different Stories

Even a stable-looking objective can hide ambiguity. Four messages positioned at the corners of a square have two equally valid two-cluster solutions: split left from right, or split top from bottom.

same-inertia-different-story.py
1import numpy as np 2from sklearn.cluster import KMeans 3 4square = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]) 5vertical_seed = np.array([[0.0, 1.0], [2.0, 1.0]]) 6horizontal_seed = np.array([[1.0, 0.0], [1.0, 2.0]]) 7 8for name, seed in [("left/right", vertical_seed), ("bottom/top", horizontal_seed)]: 9 model = KMeans(n_clusters=2, init=seed, n_init=1).fit(square) 10 groups = [np.where(model.labels_ == i)[0].tolist() for i in range(2)] 11 print(f"{name:10} groups={groups} inertia={model.inertia_:.1f}")
Output
1left/right groups=[[0, 1], [2, 3]] inertia=4.0 2bottom/top groups=[[0, 2], [1, 3]] inertia=4.0

Both answers optimize the same squared-distance objective equally well. If each axis meant a different product concern, only human review or a downstream task could say which grouping is useful.

PCA Gives You a Smaller View

K-means assigns groups. Principal component analysis (PCA) does something different: it replaces the original coordinate axes with new perpendicular axes that preserve as much variance as possible in the first few dimensions.

First center the six vectors by subtracting their mean, (4.7,2.27)(4.7, 2.27)(4.7,2.27). A centered coordinate describes how a message differs from the average message. Centering isn't scaling: subtracting a mean doesn't stop a high-unit column from dominating variance. Then factor the centered matrix using singular value decomposition (SVD):

Xcentered=UฮฃVโŠคX_{\text{centered}} = U \Sigma V^\topXcenteredโ€‹=UฮฃVโŠค

The rows of VโŠคV^\topVโŠค are principal directions. The first direction, PC1, captures the greatest remaining linear spread. Its share of total variance is the squared first singular value divided by the sum of squared singular values. Keeping the top components gives the best rank-limited reconstruction in squared-error terms, but it doesn't guarantee that the retained directions preserve a rare semantic distinction.[1][3]

For our map, PC1 runs almost horizontally:

QuantityValueInterpretation
PC1 direction(1.000,โˆ’0.016)(1.000, -0.016)(1.000,โˆ’0.016)Most spread is left to right.
PC1 explained variance85.2%85.2\%85.2%One number retains most linear variation.
PC2 explained variance14.8%14.8\%14.8%Vertical placement still carries information.
pca-from-svd.py
1import numpy as np 2 3names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"]) 4vectors = np.array([ 5 [1.0, 1.0], [1.2, 1.8], 6 [4.0, 4.2], [5.0, 3.8], 7 [8.0, 1.0], [9.0, 1.8], 8]) 9 10centered = vectors - vectors.mean(axis=0) 11_, singular_values, vt = np.linalg.svd(centered, full_matrices=False) 12pc1 = vt[0] 13if pc1[0] < 0: 14 pc1 = -pc1 15projection = centered @ pc1 16ratio = singular_values**2 / (singular_values**2).sum() 17 18print("mean:", np.round(vectors.mean(axis=0), 2).tolist()) 19print("pc1:", np.round(pc1, 3).tolist()) 20print("explained variance:", np.round(ratio, 3).tolist()) 21for name, value in zip(names, projection): 22 print(f"{name:10} pc1={value:5.2f}")
Output
1mean: [4.7, 2.27] 2pc1: [1.0, -0.016] 3explained variance: [0.852, 0.148] 4refund_A pc1=-3.68 5refund_B pc1=-3.49 6login_A pc1=-0.73 7login_B pc1= 0.27 8shipping_A pc1= 3.32 9shipping_B pc1= 4.31

PCA has created a new feature, pc1. It didn't pick an existing embedding coordinate. In high-dimensional embeddings, each principal component is a weighted mixture of many original coordinates.

PCA drops points onto PC1 and leaves residual error off the score line. PCA drops points onto PC1 and leaves residual error off the score line.
PC1 keeps the main spread; residuals show lost detail.

Measure What Compression Discards

Reconstruction makes the tradeoff concrete. Project onto one component, map back into the original space, and measure squared error. Two components can perfectly reconstruct this two-dimensional input; one component can't.

pca-reconstruction-error.py
1import numpy as np 2 3vectors = np.array([ 4 [1.0, 1.0], [1.2, 1.8], 5 [4.0, 4.2], [5.0, 3.8], 6 [8.0, 1.0], [9.0, 1.8], 7]) 8mean = vectors.mean(axis=0) 9centered = vectors - mean 10_, singular_values, vt = np.linalg.svd(centered, full_matrices=False) 11 12for components in (1, 2): 13 basis = vt[:components] 14 reconstructed = (centered @ basis.T) @ basis + mean 15 error = ((vectors - reconstructed) ** 2).sum() 16 kept = (singular_values[:components] ** 2).sum() / (singular_values ** 2).sum() 17 print(f"components={components}: variance_kept={kept:.3f}, squared_error={error:.2f}")
Output
1components=1: variance_kept=0.852, squared_error=9.72 2components=2: variance_kept=1.000, squared_error=0.00

Explained variance measures geometry, not business value. A rare fraud or policy-escalation direction may carry little total variance and still matter greatly.

Failure Test 1: Mixed Scales Manufacture Clusters

Embeddings often appear next to metadata such as message length, order value, or age of the shipment. If you concatenate features with unlike numeric scales, both k-means and PCA can chase the largest units rather than the concept you care about.

The script below creates two intended topics (refund and shipping) while message length cuts across both. Without scaling, length dominates. Standardizing each column to comparable scale recovers the topic split in this constructed check.

scale-can-overrule-topic.py
1import numpy as np 2from sklearn.cluster import KMeans 3from sklearn.metrics import adjusted_rand_score 4from sklearn.pipeline import make_pipeline 5from sklearn.preprocessing import StandardScaler 6 7# Column 0 carries topic; column 1 is message length in characters. 8features = np.array([ 9 [-3.2, 100.0], [-3.0, 500.0], [-2.8, 900.0], 10 [ 2.8, 110.0], [ 3.0, 510.0], [ 3.2, 910.0], 11]) 12topic = np.array([0, 0, 0, 1, 1, 1]) 13 14raw = KMeans(n_clusters=2, random_state=0, n_init=20).fit_predict(features) 15scaled = make_pipeline( 16 StandardScaler(), 17 KMeans(n_clusters=2, random_state=0, n_init=20), 18).fit_predict(features) 19 20print(f"raw agreement with topic: {adjusted_rand_score(topic, raw):.2f}") 21print(f"scaled agreement with topic: {adjusted_rand_score(topic, scaled):.2f}")
Output
1raw agreement with topic: -0.22 2scaled agreement with topic: 1.00

This isn't a rule to standardize every embedding blindly. If a model's embedding vectors already come with a specified normalization and similarity contract, follow that contract. During evaluation, fit any learned scaler on the development corpus and reuse it for held-out examples. The rule is simpler: don't mix measurements and then pretend their scales are irrelevant.

Failure Test 2: The Similarity Rule Changes Neighbors

Text retrieval frequently compares vector direction with cosine similarity, while ordinary k-means minimizes Euclidean distance to centroids. Dot product additionally rewards magnitude. A model or index must use the similarity rule its representations were built for.

This tiny example gives a query two candidate policy snippets. One candidate points almost exactly in the query direction; the other has a much larger norm.

metric-changes-the-neighbor.py
1import numpy as np 2 3query = np.array([1.0, 0.0]) 4names = ["duplicate_charge_policy", "very_long_shipping_page"] 5documents = np.array([ 6 [1.0, 0.1], 7 [8.0, 4.0], 8]) 9 10dot = documents @ query 11cosine = dot / (np.linalg.norm(documents, axis=1) * np.linalg.norm(query)) 12 13print("dot-product winner:", names[int(dot.argmax())]) 14print("cosine winner: ", names[int(cosine.argmax())]) 15print("cosine scores: ", np.round(cosine, 3).tolist())
Output
1dot-product winner: very_long_shipping_page 2cosine winner: duplicate_charge_policy 3cosine scores: [0.995, 0.894]

On unit-normalized vectors, cosine ranking and Euclidean nearest-neighbor ranking are directly related: maximizing cosine similarity minimizes squared Euclidean distance. That relation is useful, but only after normalization is explicit.

Failure Test 3: PCA Can Hide the Dimension Your Query Needs

A retrieval system doesn't need a globally beautiful map. It needs the right document for a specific question. Suppose horizontal variation across refund and shipping policies dominates a corpus, while the distinction between password-reset and billing pages sits mostly in a quieter vertical direction.

The following diagnostic projects documents to only one principal component, then runs a nearest-document lookup for a login question. The compressed view preserves most overall spread but collapses a distinction that mattered for this query.

compression-can-change-retrieval.py
1import numpy as np 2from sklearn.decomposition import PCA 3 4names = np.array(["refund_policy", "billing_help", "login_2fa", "shipping_policy", "warehouse_note"]) 5documents = np.array([ 6 [-20.0, 0.0], 7 [ 0.0, 0.0], 8 [ 0.0, 3.0], 9 [20.0, 0.0], 10 [ 0.0, -3.0], 11]) 12query = np.array([[0.0, 3.2]]) 13 14full_distance = np.linalg.norm(documents - query, axis=1) 15full_winner = names[full_distance.argmin()] 16 17pca = PCA(n_components=1).fit(documents) 18compressed_documents = pca.transform(documents) 19compressed_query = pca.transform(query) 20compressed_distance = np.linalg.norm(compressed_documents - compressed_query, axis=1) 21compressed_ties = names[np.isclose(compressed_distance, compressed_distance.min())].tolist() 22 23print(f"PC1 variance kept: {pca.explained_variance_ratio_[0]:.3f}") 24print("full-space nearest: ", full_winner) 25print("one-PC nearest ties: ", compressed_ties)
Output
1PC1 variance kept: 0.978 2full-space nearest: login_2fa 3one-PC nearest ties: ['billing_help', 'login_2fa', 'warehouse_note']

After projection, billing_help, login_2fa, and warehouse_note land at the same one-dimensional coordinate. The failure isn't that billing became uniquely closest. PCA removed the quieter vertical distinction that separated the useful login page from two tied alternatives.

PCA is excellent for quick inspection and sometimes useful for compression. Before deploying compressed retrieval vectors, measure retrieval quality on labeled queries. In the next chapter, you'll build exactly those ranking checks.

Run a Representation Audit Before Shipping Search

An unsupervised plot is most useful when it produces testable questions. For a support or policy corpus, write an audit record before claiming that the embedding space is organized:

Audit questionEvidence to collect
Do local neighbors discuss the same issue?Read nearest-message pairs sampled across the corpus.
Do proposed clusters have coherent content?Review messages nearest each centroid and near boundaries.
Is a large axis merely formatting or length?Compare PCA coordinates against metadata such as length and source.
Does an embedding metric match later search?Run cosine or dot-product checks according to model contract.
Can compression hurt a rare but important query?Evaluate retrieval before and after PCA on held-out query-document pairs.

Use a handful of reviewed examples as an audit set, not as hidden training labels. It gives you a way to reject a broken representation before a downstream LLM begins quoting the wrong policy.

write-a-cluster-audit-summary.py
1import numpy as np 2from sklearn.cluster import KMeans 3 4names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"]) 5reviewed_topic = np.array(["refund", "refund", "login", "login", "shipping", "shipping"]) 6vectors = np.array([ 7 [1.0, 1.0], [1.2, 1.8], 8 [4.0, 4.2], [5.0, 3.8], 9 [8.0, 1.0], [9.0, 1.8], 10]) 11 12labels = KMeans(n_clusters=3, random_state=4, n_init=20).fit_predict(vectors) 13for cluster in sorted(set(labels)): 14 topics, counts = np.unique(reviewed_topic[labels == cluster], return_counts=True) 15 dominant = topics[counts.argmax()] 16 purity = counts.max() / counts.sum() 17 members = names[labels == cluster].tolist() 18 print(f"cluster {cluster}: proposed={dominant:8} purity={purity:.2f} members={members}")
Output
1cluster 0: proposed=shipping purity=1.00 members=['shipping_A', 'shipping_B'] 2cluster 1: proposed=refund purity=1.00 members=['refund_A', 'refund_B'] 3cluster 2: proposed=login purity=1.00 members=['login_A', 'login_B']

On this tiny teaching set, perfect purity is expected because the coordinates were designed to be readable. A production audit should be harder: sample boundary cases, unusual languages, policy updates, and short messages with ambiguous intent.

Practice Tasks

  1. Change the initial centers in kmeans-from-scratch.py so one cluster becomes empty. Add a guard that either reseeds the empty center or raises a clear error.
  2. Add one metadata column with a much larger numeric scale to the six-message matrix. Compare cluster assignments before and after deliberate scaling.
  3. Keep two PCA components in compression-can-change-retrieval.py, then confirm the login query recovers login_2fa. Explain why explained variance alone couldn't guarantee that result.
  4. Run write-a-cluster-audit-summary.py with one mislabeled or boundary message. Identify which samples you'd inspect before assigning topic names.
  5. Write a representation-audit record for a small policy corpus: metric contract, normalization rule, cluster stability check, sampled neighbors, and held-out retrieval metric.

Practice guidance

  1. mean(axis=0) on an empty slice produces invalid centers. A learning implementation should detect not np.any(labels == index) before averaging; a production library may reseed more carefully.
  2. Without scaling, the large-unit metadata can dominate squared distance. After scaling, inspect whether recovered groups reflect useful semantics rather than assuming standardization is always correct.
  3. With both components, this two-dimensional fixture is reconstructed exactly and login retrieval returns login_2fa. In a real index, compare held-out retrieval metrics before and after compression.
  4. A single conflicting message should reduce purity or create a boundary case. Inspect centroid-near examples and edge cases; don't attach a permanent topic name from one clean run.
  5. Record embedding model/version, similarity metric, normalization, sample strategy, stability runs, and before/after retrieval quality if compression is proposed.

Key Concepts

  • unlabeled embeddings and representation inspection
  • k-means assignments, centroids, and inertia
  • choosing kkk with geometry plus human audit
  • PCA, SVD, projection, and reconstruction error
  • explained variance versus semantic usefulness
  • feature-scale and similarity-metric failures
  • compression risk before retrieval

Evaluation rubric

  • Foundational: Computes one centroid update and explains why a cluster integer isn't a topic label.
  • Intermediate: Runs k-means and PCA diagnostics, then distinguishes explained variance from semantic evidence.
  • Advanced: Designs an embedding audit that catches scale, metric, or compression failures before an LLM retrieval pipeline uses the vectors.

Follow-up questions

Common pitfalls

SymptomCauseFix
Dashboard permanently interprets cluster 2 as refunds.Arbitrary cluster IDs were treated as product labels.Version each run and attach reviewed labels after sampling members.
Clusters track message length rather than issue type.Mixed-scale metadata dominated distance or variance.Standardize deliberately or keep metadata out of embedding geometry.
Teams trust a clean PCA chart without reading messages.A compressed visualization was confused with semantic validation.Audit centroid and boundary examples in original text.
Search results change after switching dot product to cosine.Similarity contract wasn't made explicit.Use documented embedding metric and normalize only when appropriate.
A compressed index misses an important policy page.PCA dropped a small direction needed by that query.Test held-out retrieval before deploying compression.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.In one k-means update, two vectors assigned to the same cluster are refund_A = (1.0, 1.0) and refund_B = (1.2, 1.8). What centroid should the update step use for that cluster?
2.Two k-means runs on the same support corpus find the same three message neighborhoods, but the numeric IDs are permuted: one run calls refunds cluster 0, while another calls refunds cluster 2. How should a dashboard represent the result?
3.A six-message audit reports these k-means results: k=2 has inertia 20.06 and silhouette 0.56, k=3 has inertia 1.74 and silhouette 0.76, and k=4 has inertia 0.92 and silhouette 0.51. What conclusion follows for this toy corpus?
4.Six two-dimensional vectors are centered and PCA reports explained variance ratios of 0.852 and 0.148. A one-component reconstruction has squared error 9.72, while two components have error 0.00. Which interpretation is correct?
5.A retrieval diagnostic keeps one PCA component with 0.978 explained variance. In full space, a login query is nearest to login_2fa, but after projection billing_help, login_2fa, and warehouse_note tie at the same one-dimensional coordinate. What should the team conclude?
6.A feature matrix combines a topic coordinate near -3 for refunds and +3 for shipping with a message-length column ranging from 100 to 900 characters. Raw k-means groups mostly by length, while a deliberate StandardScaler step recovers the topic split in this constructed check. What caused the raw failure?
7.A query vector is (1.0, 0.0). Two document vectors are duplicate_charge_policy = (1.0, 0.1) and very_long_shipping_page = (8.0, 4.0). Dot product ranks very_long_shipping_page higher, while cosine ranks duplicate_charge_policy higher. What explains the disagreement?
8.A team has support-message clusters and a PCA plot that look coherent, and it wants to deploy vector search over the same representation. Which audit record is most appropriate?

8 questions remaining.

Next Step
Continue to Core Retrieval Algorithms

You can now test whether vectors form meaningful neighborhoods; next you will use that geometry to rank the actual policy documents an LLM receives as evidence.

PreviousValidation and Leakage
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

The Elements of Statistical Learning.

Hastie, T., Tibshirani, R., Friedman, J. ยท 2009

Scikit-learn: Machine Learning in Python.

Pedregosa, F., et al. ยท 2011 ยท JMLR

The approximation of one matrix by another of lower rank

Eckart, C. & Young, G. ยท 1936