LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnApplied LLM EngineeringDimensionality Reduction for Embeddings
📐MediumEmbeddings & Vector Search

Dimensionality Reduction for Embeddings

Shrink and inspect embedding indexes without guessing: measure recall while testing PCA, projections, native shortening, and quantization.

20 min read
Learning path
Step 53 of 155 in the full curriculum
Instruction Tuning & Chat TemplatesCoT, ToT & Self-Consistency Prompting

An instruction-tuned support assistant still needs evidence before it answers. Suppose its retrieval index stores policy chunks for damaged items, delayed parcels, return windows, and refund approvals. Each chunk is represented by a long embedding, and a query such as "the screen arrived cracked" should land near the damaged-electronics policy.

That retrieval system creates two different questions:

  1. Can we store and search shorter vectors while returning the same useful policy chunks?
  2. Can we draw a two-dimensional map that helps an engineer inspect confusing clusters?

Both questions use dimensionality reduction. They don't have the same success criterion. A production vector is approved by retrieval quality and cost. A plot is approved by what it lets you investigate, with important conclusions checked back in the original embedding space.

Dimensionality reduction overview showing high-dimensional embeddings compressed into smaller vectors while preserving useful neighbor structure. Dimensionality reduction overview showing high-dimensional embeddings compressed into smaller vectors while preserving useful neighbor structure.
A short embedding is valuable only when the policy neighbors needed by real queries still survive the reduction.

Start with the budget, not an algorithm

An embedding is an array of numbers. With float32 storage, every dimension costs four bytes. Storing ten million 1,536-dimensional policy or product vectors therefore costs:

10,000,000×1,536×4=61,440,000,000 bytes10{,}000{,}000 \times 1{,}536 \times 4 = 61{,}440{,}000{,}000 \text{ bytes}10,000,000×1,536×4=61,440,000,000 bytes

That is 61.44 GB using decimal gigabytes, before the index graph, metadata, replicas, or backups. Cutting the vector to 256 dimensions makes the raw array six times smaller. It does not prove that search remains good.

Run the arithmetic first. It turns "embeddings feel large" into a concrete engineering constraint.

estimate-vector-storage.py
1def decimal_gb(byte_count: int) -> float: 2 return byte_count / 1_000_000_000 3 4documents = 10_000_000 5bytes_per_float32 = 4 6full_dim = 1_536 7candidate_dim = 256 8 9full_bytes = documents * full_dim * bytes_per_float32 10candidate_bytes = documents * candidate_dim * bytes_per_float32 11 12print(f"Full float32 array: {decimal_gb(full_bytes):.2f} GB") 13print(f"256-d float32 array: {decimal_gb(candidate_bytes):.2f} GB") 14print(f"Raw-vector reduction: {full_bytes / candidate_bytes:.1f}x")
Output
1Full float32 array: 61.44 GB 2256-d float32 array: 10.24 GB 3Raw-vector reduction: 6.0x

The first applied habit is simple: separate a cost metric from a quality metric.

QuestionUseful metric
Will vectors fit in memory?Bytes per vector, total RAM, index overhead
Will useful evidence still be retrieved?Recall@K, nDCG@K, answer success on grounded evaluations
Will an engineer understand failures?A labeled plot plus checks in the original space

Build a retrieval baseline

Before compressing anything, define what "the same useful neighbors" means. The small fixture below represents six support documents. Its coordinates are hand-written so you can see the experiment without needing an API key: the first two coordinates concern damaged items, the next two concern delivery, and the last two concern returns.

Cosine similarity compares directions rather than raw lengths. We L2-normalize each vector, score one damaged-screen query, and save the top results as the baseline that reduction must try to preserve.

baseline-policy-retrieval.py
1import numpy as np 2 3documents = [ 4 "damaged electronics: photo evidence and replacement", 5 "damaged furniture: upload photos for claim", 6 "late parcel: carrier scan and escalation", 7 "missing parcel: investigate delivery trace", 8 "sealed return: return within 30 days", 9 "opened return: restocking review required", 10] 11 12vectors = np.array([ 13 [1.00, 0.92, 0.04, 0.00, 0.02, 0.00], 14 [0.92, 0.80, 0.04, 0.01, 0.02, 0.00], 15 [0.02, 0.00, 1.00, 0.88, 0.04, 0.00], 16 [0.02, 0.02, 0.91, 0.98, 0.00, 0.00], 17 [0.00, 0.00, 0.04, 0.02, 1.00, 0.89], 18 [0.02, 0.00, 0.00, 0.02, 0.91, 1.00], 19], dtype=float) 20query = np.array([0.98, 1.00, 0.03, 0.00, 0.01, 0.00]) 21 22def normalize(rows: np.ndarray) -> np.ndarray: 23 return rows / np.linalg.norm(rows, axis=1, keepdims=True) 24 25scores = normalize(vectors) @ normalize(query.reshape(1, -1))[0] 26ranking = np.argsort(-scores) 27 28for rank, index in enumerate(ranking[:3], start=1): 29 print(f"{rank}. {documents[index]} score={scores[index]:.3f}")
Output
11. damaged electronics: photo evidence and replacement score=0.999 22. damaged furniture: upload photos for claim score=0.997 33. missing parcel: investigate delivery trace score=0.036

In a real evaluation set, each query has judged relevant documents. For a query with two relevant policy chunks, recall@2 = 1.0 means both appear in the top two returned results. That is the comparison target for every compressed representation in this lesson.

PCA: learn a shorter linear coordinate system

Principal Component Analysis (PCA) fits a linear transformation to a corpus. Center the embedding matrix, find directions with the most variation, and keep the first k directions. If XXX is the centered document matrix and VkV_kVk​ holds the chosen directions, the reduced vectors are:

Z=XVkZ = X V_kZ=XVk​

The rows are still documents; there are simply fewer columns. In implementations, Singular Value Decomposition (SVD) is a practical way to find principal directions without explicitly constructing a covariance matrix.[1]

PCA algorithm flow: high-dimensional embeddings are centered, covariance or SVD is computed, top components are selected, and vectors are projected into k dimensions. PCA algorithm flow: high-dimensional embeddings are centered, covariance or SVD is computed, top components are selected, and vectors are projected into k dimensions.
PCA produces one learned coordinate system. Stored documents and incoming queries must be transformed by that same fitted reducer.

There is one failure mode to understand before using the library: fitting PCA separately on document vectors and live queries creates different axes. Their dot products no longer compare like with like. Fit on representative document data, persist the fitted transform, and call transform for both indexed documents and future queries.

This example adds small variations around the three policy themes, fits a two-dimensional PCA model, and retrieves with a query transformed by the same model.

fit-one-pca-transform.py
1import numpy as np 2from sklearn.decomposition import PCA 3 4rng = np.random.default_rng(4) 5centers = np.array([ 6 [1.0, 0.9, 0.0, 0.0, 0.0, 0.0], # damaged item 7 [0.0, 0.0, 1.0, 0.9, 0.0, 0.0], # delivery issue 8 [0.0, 0.0, 0.0, 0.0, 1.0, 0.9], # return 9]) 10labels = np.repeat(["damage", "delivery", "return"], 20) 11documents = np.vstack([ 12 center + rng.normal(0, 0.05, size=(20, 6)) 13 for center in centers 14]) 15query = np.array([[0.96, 0.91, 0.02, 0.01, 0.00, 0.00]]) 16 17def normalize(rows: np.ndarray) -> np.ndarray: 18 return rows / np.linalg.norm(rows, axis=1, keepdims=True) 19 20pca = PCA(n_components=2, random_state=0) 21reduced_documents = normalize(pca.fit_transform(documents)) 22reduced_query = normalize(pca.transform(query)) 23best_index = int(np.argmax(reduced_documents @ reduced_query[0])) 24 25print(f"Shape: {documents.shape} -> {reduced_documents.shape}") 26print(f"Explained variance kept: {pca.explained_variance_ratio_.sum():.3f}") 27print(f"Closest policy theme: {labels[best_index]}")
Output
1Shape: (60, 6) -> (60, 2) 2Explained variance kept: 0.992 3Closest policy theme: damage

PCA optimizes variance, not relevance. A direction that rarely changes in the corpus can still distinguish a valid replacement from an invalid refund. Explained variance is useful diagnostic information; retrieval recall is the release metric.

Sweep dimensions against neighbor recall

On a small data set, we can provide the relevance judgments ourselves. The following lab creates twenty support intents in a noisy 24-dimensional space. Each query has three acceptable policy chunks. It then tries PCA dimensions 2, 4, 8, and 12.

pca-recall-sweep.py
1import numpy as np 2from sklearn.decomposition import PCA 3 4rng = np.random.default_rng(21) 5dimensions = 24 6intent_count = 20 7acceptable_chunks = 3 8intent_vectors = rng.normal(size=(intent_count, dimensions)) 9intent_vectors -= intent_vectors.mean(axis=0) 10intent_vectors /= np.linalg.norm(intent_vectors, axis=1, keepdims=True) 11 12documents = [] 13queries = [] 14relevant = [] 15for intent_vector in intent_vectors: 16 first_chunk = len(documents) 17 documents.extend( 18 intent_vector + rng.normal(0, 0.11, size=(acceptable_chunks, dimensions)) 19 ) 20 queries.append(intent_vector + rng.normal(0, 0.03, size=dimensions)) 21 relevant.append(set(range(first_chunk, first_chunk + acceptable_chunks))) 22documents = np.asarray(documents) 23queries = np.asarray(queries) 24 25def normalize(rows: np.ndarray) -> np.ndarray: 26 return rows / np.linalg.norm(rows, axis=1, keepdims=True) 27 28def top_k(query_rows: np.ndarray, doc_rows: np.ndarray, k: int) -> np.ndarray: 29 return np.argsort(-(normalize(query_rows) @ normalize(doc_rows).T), axis=1)[:, :k] 30 31for target_dim in [2, 4, 8, 12]: 32 reducer = PCA(n_components=target_dim, random_state=0) 33 short_docs = reducer.fit_transform(documents) 34 short_queries = reducer.transform(queries) 35 returned = top_k(short_queries, short_docs, k=3) 36 recall = np.mean([ 37 len(expected & set(found)) / acceptable_chunks 38 for expected, found in zip(relevant, returned) 39 ]) 40 print(f"{target_dim:>2} dims: recall@3={recall:.3f}, " 41 f"variance={reducer.explained_variance_ratio_.sum():.3f}")
Output
12 dims: recall@3=0.333, variance=0.266 2 4 dims: recall@3=0.817, variance=0.454 3 8 dims: recall@3=1.000, variance=0.715 412 dims: recall@3=1.000, variance=0.869

The numbers belong to this generated fixture, not to all embedding models. Notice that variance and relevant-chunk recall answer different questions. Copy the experiment pattern, then replace its documents, queries, and judgments with data from your own product.

A two-dimensional map is an investigation tool

A two-dimensional projection answers questions such as:

  • Are failed damaged-item queries scattered into the returns cluster?
  • Are documents from a new seller-policy version isolated from older versions?
  • Which points should an engineer inspect in the original data?

It doesn't become a search index just because the plot looks separated. Compressing a rich semantic vector to two coordinates forces distortion.

Compare a linear map and a neighborhood map

PCA offers a deterministic linear view. t-SNE and UMAP focus more strongly on neighborhood relationships, using nonlinear objectives. The visual difference matters because a layout can make clusters appear convincingly separated even when a production retrieval test would disagree.

Comparison of PCA, t-SNE, and UMAP projections for embedding visualization and structure preservation. Comparison of PCA, t-SNE, and UMAP projections for embedding visualization and structure preservation.
Each 2D method distorts geometry differently. Use maps to choose records to inspect, then verify retrieval behavior in the source representation.

The next lab demonstrates the evaluation idea without asking you to trust a picture. It measures how many of each point's original nearest neighbors remain nearby after a two-dimensional PCA view and after a t-SNE view.

measure-neighbor-retention-in-a-map.py
1import os 2 3# Keep this small teaching run from oversubscribing local CPU threads. 4for variable in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"): 5 os.environ[variable] = "1" 6 7import numpy as np 8from sklearn.datasets import make_blobs 9from sklearn.decomposition import PCA 10from sklearn.manifold import TSNE 11 12points, labels = make_blobs( 13 n_samples=90, 14 n_features=10, 15 centers=3, 16 cluster_std=1.0, 17 random_state=8, 18) 19 20def neighbor_ids(rows: np.ndarray, k: int = 5) -> np.ndarray: 21 distances = np.linalg.norm(rows[:, None, :] - rows[None, :, :], axis=2) 22 return np.argsort(distances, axis=1)[:, 1:k + 1] 23 24def overlap(reference: np.ndarray, candidate: np.ndarray) -> float: 25 return float(np.mean([ 26 len(set(left) & set(right)) / left.size 27 for left, right in zip(reference, candidate) 28 ])) 29 30original_neighbors = neighbor_ids(points) 31pca_map = PCA(n_components=2).fit_transform(points) 32tsne_map = TSNE( 33 n_components=2, 34 perplexity=12, 35 init="random", 36 learning_rate="auto", 37 max_iter=750, 38 random_state=8, 39).fit_transform(points) 40 41# t-SNE is stochastic and its exact value shifts across library versions, 42# so we report two decimals rather than a brittle high-precision figure. 43print(f"PCA local-neighbor overlap: {overlap(original_neighbors, neighbor_ids(pca_map)):.2f}") 44print(f"t-SNE local-neighbor overlap: {overlap(original_neighbors, neighbor_ids(tsne_map)):.2f}") 45print(f"Labels available for inspection: {[int(label) for label in sorted(set(labels))]}")
Output
1PCA local-neighbor overlap: 0.35 2t-SNE local-neighbor overlap: 0.61 3Labels available for inspection: [0, 1, 2]

This local-overlap measurement still isn't a retrieval release gate. It tells you whether a map is useful for investigating local cases. The retrieval gate remains based on queries and relevant documents.

t-SNE: excellent local pictures, unsafe search coordinates

t-SNE converts high-dimensional neighbor relationships into probabilities pijp_{ij}pij​, creates corresponding low-dimensional probabilities qijq_{ij}qij​ using a heavy-tailed Student-t distribution, and minimizes:

DKL(P∥Q)=∑i≠jpijlog⁡pijqijD_{KL}(P \parallel Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}DKL​(P∥Q)=∑i=j​pij​logqij​pij​​

The heavy tail lets moderately distant points spread in the map instead of crowding into the center. This is the central construction in van der Maaten and Hinton's t-SNE paper.[2]

t-SNE algorithm flow showing Gaussian neighbor probabilities in high dimensions, symmetrized probabilities, Student-t similarities in 2D, KL optimization, and a final visualization map. t-SNE algorithm flow showing Gaussian neighbor probabilities in high dimensions, symmetrized probabilities, Student-t similarities in 2D, KL optimization, and a final visualization map.
t-SNE spends its objective on local neighborhood probabilities. Large gaps between resulting islands aren't reliable semantic distances.

perplexity controls the rough neighborhood scale considered by the map. Trying several values is useful for investigation. It isn't acceptable to pick whichever rendering tells the cleanest story.

UMAP: optimize a weighted neighbor graph

UMAP first constructs a weighted nearest-neighbor graph, represented as a fuzzy simplicial set, then optimizes a lower-dimensional graph using a cross-entropy objective.[3] Parameters such as n_neighbors and min_dist change the view: one layout can expose small pockets of policy confusion while another reveals broader groupings.

UMAP algorithm flow showing high-dimensional data converted to a weighted nearest-neighbor graph, fuzzy simplicial set, low-dimensional initialization, cross-entropy layout optimization, and final manifold projection. UMAP algorithm flow showing high-dimensional data converted to a weighted nearest-neighbor graph, fuzzy simplicial set, low-dimensional initialization, cross-entropy layout optimization, and final manifold projection.
UMAP is a useful map builder because it starts from local connections. Its map still needs checks against the original vectors and labeled cases.

Common UMAP implementations can place later samples into a fitted map with an approximate transform. That is convenient for a diagnostic dashboard, but it isn't the same as PCA's fixed linear transform. If incoming support traffic changes, re-check the map before using it to explain failures.

GoalAppropriate first measurementSuitable methods to test
Inspect clusters or outliersNeighbor overlap plus human reviewPCA, t-SNE, UMAP
Shorten vectors for retrievalRecall@K or nDCG@K plus RAM and latencyNative shortening, PCA, random projection
Reduce memory furtherRecall after compressed search and rerankingScalar, PQ, binary quantization

Random projection: a no-fit serving baseline

PCA learns from the document corpus. A random projection samples a fixed matrix once and applies it to every document and query:

z=xR,R∈Rd×kz = xR, \qquad R \in \mathbb{R}^{d \times k}z=xR,R∈Rd×k

The Johnson-Lindenstrauss lemma establishes that a finite set of points can be projected into a lower-dimensional space while approximately preserving pairwise distances, given enough target dimensions.[4] The bound is a worst-case guarantee, not a claim that a specific 128-dimensional support index will meet your recall requirement.

Random projection flow showing a 1536-dimensional vector multiplied by a 1536 by 128 random matrix to produce a 128-dimensional projected vector. Random projection flow showing a 1536-dimensional vector multiplied by a 1536 by 128 random matrix to produce a 128-dimensional projected vector.
A random matrix has no training step. Save it, apply it consistently to documents and queries, then measure whether relevant neighbors survive.

This experiment compares original cosine neighbors with a Gaussian random projection. A fixed seed makes the transform reproducible.

random-projection-recall.py
1import numpy as np 2from sklearn.random_projection import GaussianRandomProjection 3 4rng = np.random.default_rng(33) 5documents = rng.normal(size=(180, 48)) 6queries = documents[:12] + rng.normal(0, 0.03, size=(12, 48)) 7 8def normalized(rows: np.ndarray) -> np.ndarray: 9 return rows / np.linalg.norm(rows, axis=1, keepdims=True) 10 11def neighbors(q: np.ndarray, d: np.ndarray, k: int = 5) -> np.ndarray: 12 return np.argsort(-(normalized(q) @ normalized(d).T), axis=1)[:, :k] 13 14gold = neighbors(queries, documents) 15projector = GaussianRandomProjection(n_components=16, random_state=7) 16short_docs = projector.fit_transform(documents) 17short_queries = projector.transform(queries) 18predicted = neighbors(short_queries, short_docs) 19recall = np.mean([ 20 len(set(left) & set(right)) / left.size 21 for left, right in zip(gold, predicted) 22]) 23 24original_bytes = documents.shape[1] * 4 25short_bytes = short_docs.shape[1] * 4 26print(f"Bytes/vector: {original_bytes} -> {short_bytes}") 27print(f"Neighbor recall@5: {recall:.3f}")
Output
1Bytes/vector: 192 -> 64 2Neighbor recall@5: 0.350

Random projection is worth benchmarking when fitting a data-dependent reducer is inconvenient or expensive. It isn't automatically better or worse than PCA. Let the same labeled retrieval set judge both.

Compression after reduction: quantization

Dimension reduction stores fewer coordinates. Quantization stores each coordinate, or each subvector, with fewer bits. You can use them independently or together.

For example, a 256-dimensional float32 vector occupies 256 × 4 = 1,024 bytes. Storing one signed byte per dimension uses 256 bytes. Storing one bit per dimension uses 32 bytes. Lower memory comes with progressively more approximation.

Scalar and binary codes

Scalar quantization maps each value onto an integer grid. Binary quantization records only a sign or threshold result. Vector databases often pair aggressive compressed candidate search with rescoring of a larger candidate set in a more accurate representation; the available choices and configuration depend on the database implementation.[5]

The next lab makes the compression visible. It quantizes four normalized vectors to int8 and converts their signs to bits. The bit count is exact; the quality decision still needs retrieval measurements.

scalar-and-binary-codes.py
1import numpy as np 2 3vectors = np.array([ 4 [0.90, 0.80, -0.10, -0.20, 0.04, 0.02, -0.05, 0.08], 5 [0.85, 0.74, -0.04, -0.16, 0.01, 0.04, -0.02, 0.05], 6 [-0.02, 0.04, 0.92, 0.81, -0.06, 0.01, 0.02, -0.04], 7 [-0.05, 0.03, 0.88, 0.79, -0.03, 0.00, 0.06, -0.02], 8], dtype=np.float32) 9 10scale = 127 / np.max(np.abs(vectors)) 11int8_vectors = np.round(vectors * scale).astype(np.int8) 12binary_vectors = vectors > 0 13 14query_bits = binary_vectors[0] 15hamming = np.count_nonzero(binary_vectors != query_bits, axis=1) 16 17print(f"float32 bytes/vector: {vectors.shape[1] * 4}") 18print(f"int8 bytes/vector: {int8_vectors.shape[1]}") 19print(f"binary bits/vector: {binary_vectors.shape[1]}") 20print(f"Binary Hamming distances from row 0: {hamming.tolist()}")
Output
1float32 bytes/vector: 32 2int8 bytes/vector: 8 3binary bits/vector: 8 4Binary Hamming distances from row 0: [0, 0, 6, 7]

Binary codes are tiny, but "same sign" is a much weaker statement than "same semantic evidence." For a support application, retrieve more candidates with a compressed representation and rerank before final evidence selection if benchmarks justify that design.

Product quantization: code sub-vectors

Product Quantization (PQ) splits a vector into sub-vectors, learns a codebook for each subspace, and stores only the selected centroid IDs.[6] In the common case of 256 possible centroids, each ID fits in one byte. A 128-dimensional float32 vector split into eight subspaces can therefore be represented by eight code bytes, plus shared codebooks.

Product quantization flow: a 128-dimensional float vector is split into eight 16-dimensional subvectors, each subvector is assigned to a nearest centroid, and the stored result is an 8-byte code. Product quantization flow: a 128-dimensional float vector is split into eight 16-dimensional subvectors, each subvector is assigned to a nearest centroid, and the stored result is an 8-byte code.
PQ stores centroid IDs rather than every float. At query time, asymmetric distance computation uses query-to-centroid lookup tables to score stored codes.

The original PQ paper describes asymmetric distance computation (ADC): keep a query unquantized, precompute its distances to codebook centroids, and score each stored code through table lookups.[6] This miniature implementation uses only two subspaces and four centroids so that every shape remains visible.

mini-product-quantization.py
1import numpy as np 2from sklearn.cluster import KMeans 3 4rng = np.random.default_rng(12) 5documents = np.vstack([ 6 rng.normal([1, 1, 0, 0], 0.08, size=(20, 4)), 7 rng.normal([0, 0, 1, 1], 0.08, size=(20, 4)), 8]) 9query = np.array([0.94, 1.02, 0.01, -0.02]) 10 11subspace_width = 2 12codebooks = [] 13codes = [] 14for start in range(0, documents.shape[1], subspace_width): 15 block = documents[:, start:start + subspace_width] 16 model = KMeans(n_clusters=4, random_state=0, n_init=10).fit(block) 17 codebooks.append(model.cluster_centers_) 18 codes.append(model.labels_) 19codes = np.column_stack(codes) 20 21adc_scores = np.zeros(documents.shape[0]) 22for subspace, start in enumerate(range(0, documents.shape[1], subspace_width)): 23 query_block = query[start:start + subspace_width] 24 table = np.linalg.norm(codebooks[subspace] - query_block, axis=1) ** 2 25 adc_scores += table[codes[:, subspace]] 26 27print(f"Document shape: {documents.shape}") 28print(f"Stored code shape: {codes.shape}") 29print(f"Raw bytes/vector: {documents.shape[1] * 8} (float64 in this lab)") 30print(f"Code bytes/vector with <=256 centroids: {codes.shape[1]}") 31print(f"Nearest PQ-coded document group: {'damage' if np.argmin(adc_scores) < 20 else 'delivery'}")
Output
1Document shape: (40, 4) 2Stored code shape: (40, 2) 3Raw bytes/vector: 32 (float64 in this lab) 4Code bytes/vector with <=256 centroids: 2 5Nearest PQ-coded document group: damage

PQ isn't an interchangeable synonym for PCA. PCA shortens a coordinate system; PQ encodes sub-vectors with learned IDs. An index may use a shortened embedding and then PQ-code it when memory is tight.

Native shortening: train prefixes to be useful

PCA and random projection operate after an encoder produces its vector. Matryoshka Representation Learning (MRL) changes training itself: the model is optimized so prefixes of its representation remain useful at several selected lengths.[7]

For an embedding eee and prefix lengths in a set D\mathcal{D}D, a simplified training view is:

LMRL=∑d∈DwdL(e:d)\mathcal{L}_{MRL} = \sum_{d \in \mathcal{D}} w_d \mathcal{L}(e_{:d})LMRL​=∑d∈D​wd​L(e:d​)

Here, e:de_{:d}e:d​ is the first d coordinates. Unlike PCA, serving a shorter MRL-trained representation doesn't require a separately fitted rotation. It still requires matching document and query lengths and testing retrieval quality.

Matryoshka embedding truncation showing one ordered embedding whose early prefixes can be served at shorter lengths. Matryoshka embedding truncation showing one ordered embedding whose early prefixes can be served at shorter lengths.
MRL makes selected prefixes meaningful during training. A deployment still chooses one shared prefix length for documents and queries by measurement.

This lab illustrates the mechanics with deliberately ordered vectors. It isn't pretending to train an embedding model. The first coordinates carry policy theme information; later coordinates add detail and noise. Truncating a standard, unordered representation wouldn't have this promise.

evaluate-ordered-prefixes.py
1import numpy as np 2 3rng = np.random.default_rng(15) 4centers = np.array([ 5 [1.0, 0.9, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0], 6 [0.0, 0.0, 1.0, 0.9, 0.3, 0.2, 0.0, 0.0], 7 [0.0, 0.0, 0.0, 0.0, 1.0, 0.9, 0.3, 0.2], 8]) 9documents = np.vstack([ 10 center + rng.normal(0, 0.04, size=(18, 8)) 11 for center in centers 12]) 13queries = np.vstack([ 14 center + rng.normal(0, 0.02, size=(3, 8)) 15 for center in centers 16]) 17 18def top3(rows: np.ndarray, corpus: np.ndarray) -> np.ndarray: 19 rows = rows / np.linalg.norm(rows, axis=1, keepdims=True) 20 corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True) 21 return np.argsort(-(rows @ corpus.T), axis=1)[:, :3] 22 23gold = top3(queries, documents) 24for prefix in [2, 4, 8]: 25 shortened = top3(queries[:, :prefix], documents[:, :prefix]) 26 recall = np.mean([ 27 len(set(expected) & set(found)) / 3 28 for expected, found in zip(gold, shortened) 29 ]) 30 print(f"Prefix {prefix}: recall@3={recall:.3f}")
Output
1Prefix 2: recall@3=0.296 2Prefix 4: recall@3=0.519 3Prefix 8: recall@3=1.000

One hosted API example is OpenAI's text-embedding-3 family. Its documentation lists default lengths of 1,536 for text-embedding-3-small and 3,072 for text-embedding-3-large, and documents a dimensions request parameter for shorter outputs.[8] A request shape looks like this:

embedding-request.json
1{ 2 "model": "text-embedding-3-large", 3 "input": "screen arrived cracked and will not turn on", 4 "dimensions": 256 5}

The provider feature makes generation of shorter vectors convenient. It doesn't decide whether 256 dimensions meet your damaged-item retrieval requirement.

Choose a deployment experiment

At this point you have three kinds of tools:

Tool familyWhat changesBest first question
PCA or random projectionNumber of served coordinates after encodingHow much recall survives at a lower dimension?
MRL-style native shorteningNumber of coordinates emitted from a trained modelDoes a supported prefix beat post-hoc alternatives at the same cost?
Scalar, binary, or product quantizationPrecision or code used to store/search vectorsCan compressed candidate search plus reranking fit the RAM budget?
t-SNE or UMAPCoordinates used for inspection mapsWhat failures should an engineer examine next?
Decision framework for picking dimensionality reduction by goal: visualize, serve smaller vectors, or compress search codes. Decision framework for picking dimensionality reduction by goal: visualize, serve smaller vectors, or compress search codes.
Start from the bottleneck and compare candidates on its metric. Queries and documents must always follow the same serving representation.

A release gate should turn this table into a decision that someone else can audit. The following lab accepts only configurations that meet both a recall floor and a raw-vector memory budget, then chooses the smallest accepted representation.

select-a-serving-candidate.py
1measurements = [ 2 {"method": "float32-full", "bytes_per_vector": 6144, "recall_at_10": 1.000}, 3 {"method": "native-256", "bytes_per_vector": 1024, "recall_at_10": 0.984}, 4 {"method": "pca-256", "bytes_per_vector": 1024, "recall_at_10": 0.971}, 5 {"method": "pq-16-code", "bytes_per_vector": 16, "recall_at_10": 0.939}, 6] 7 8required_recall = 0.975 9maximum_bytes = 1200 10accepted = [ 11 row for row in measurements 12 if row["recall_at_10"] >= required_recall 13 and row["bytes_per_vector"] <= maximum_bytes 14] 15choice = min(accepted, key=lambda row: row["bytes_per_vector"]) 16 17for row in measurements: 18 status = "pass" if row in accepted else "reject" 19 print(f"{row['method']:<14} recall={row['recall_at_10']:.3f} " 20 f"bytes={row['bytes_per_vector']:<4} {status}") 21print(f"Selected representation: {choice['method']}")
Output
1float32-full recall=1.000 bytes=6144 reject 2native-256 recall=0.984 bytes=1024 pass 3pca-256 recall=0.971 bytes=1024 reject 4pq-16-code recall=0.939 bytes=16 reject 5Selected representation: native-256

The values above are example measurements for the gate, not benchmark results for a real model. A serious report would include:

  1. A frozen set of realistic support queries and judged policy chunks.
  2. Recall or nDCG at the candidate count actually sent to generation or reranking.
  3. Index RAM, build time, and query-latency percentiles for each candidate.
  4. A failure list: which queries lost relevant evidence, and what risk that creates.
  5. Re-evaluation when embedding models, policy documents, or traffic distributions change.

What you've built

You began with a long embedding and a budget question. You now know how to:

  • calculate the raw cost of storing a vector representation;
  • establish cosine-retrieval neighbors before changing representations;
  • fit PCA once, apply it consistently, and evaluate shortened retrieval;
  • use t-SNE and UMAP as diagnostic maps rather than search coordinates;
  • benchmark random projection as a no-fit alternative;
  • distinguish fewer dimensions from lower-precision or PQ-coded storage; and
  • evaluate model-supported prefixes such as MRL-style shortening through the same retrieval gate.

The recurring idea is more important than any preferred method: compression is an experiment with a quality constraint. Storage savings are real immediately. Semantic safety must be measured.

Mastery check

Key concepts

  • PCA fits variance-aligned linear coordinates and reuses one transform for documents and queries.
  • t-SNE and UMAP make investigation maps; neither plot distance should approve retrieval.
  • Random projection supplies a fixed, data-independent serving baseline.
  • MRL trains selected prefixes to remain useful, so shortening can happen at model output.
  • Quantization reduces storage precision or stores centroid codes; PQ is distinct from dimensionality reduction.
  • Recall@K or an equivalent downstream metric decides whether reduced vectors remain useful.

Evaluation rubric

  • Foundational: Calculates float32 vector storage and explains why it isn't a quality metric.
  • Intermediate: Fits a PCA transform once, normalizes reduced vectors for cosine comparison, and computes neighbor recall.
  • Intermediate: Explains why a 2D t-SNE or UMAP map is useful for debugging but unsafe as a retrieval representation.
  • Advanced: Designs a comparison among native shortening, linear projection, and quantization under RAM and recall constraints.

Common pitfalls

  • Refitting a reducer for query batches: documents and queries end up in different coordinate systems. Fit and version one serving transform.
  • Choosing dimensions by explained variance only: retained variance isn't relevance. Use labeled retrieval or grounded-answer evaluations.
  • Reading plot gaps as semantic distance: nonlinear map layout distorts separation. Check specific claims in original vectors.
  • Truncating an arbitrary embedding: prefix shortening is justified only when the model documents or was trained for it.
  • Compressing without reranking analysis: aggressive codes can lose required evidence. Evaluate oversampling and rescoring when using them.

Practice extension

Create a small policy corpus with four query categories: damaged item, delivery delay, missing parcel, and return eligibility. Assign each query two acceptable policy chunks. Replace the generated embeddings in pca-recall-sweep.py with embeddings from a model you can run or an API you can audit. Compare full vectors, one shorter representation, and one quantized candidate. Your artifact is a table of recall, bytes per vector, and the exact failed queries, followed by a release recommendation.

Next Step
Continue to CoT, ToT & Self-Consistency Prompting

You can now retrieve and audit compact evidence representations; next you'll study how spending additional inference-time computation can improve multi-step decisions made from that evidence.

PreviousInstruction Tuning & Chat Templates
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Numerical Linear Algebra.

Trefethen, L. N., & Bau, D. · 1997 · SIAM

Visualizing Data using t-SNE.

van der Maaten, L. & Hinton, G. · 2008

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.

McInnes, L., Healy, J., & Melville, J. · 2018

Extensions of Lipschitz mappings into a Hilbert space.

Johnson, W. B. & Lindenstrauss, J. · 1984 · Contemporary Mathematics, 26

Qdrant Quantization Guide

Qdrant. · 2024

Product Quantization for Nearest Neighbor Search.

Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011

Matryoshka Representation Learning.

Kusupati, A., et al. · 2022 · NeurIPS 2022

Vector embeddings

OpenAI · 2024