Dimensionality Reduction for Embeddings

Shrink and inspect embedding indexes without guessing: measure recall while testing PCA, projections, native shortening, and quantization.

An instruction-tuned incident assistant still needs evidence before it answers. Suppose its retrieval index stores runbook chunks for canary failures, latency spikes, rollback approvals, and access escalations. Each chunk is represented by a long embedding, and a query such as "the canary started returning 500s" should land near the canary-failure runbook.

That retrieval system creates two different questions:

Can we store and search shorter vectors while returning the same useful runbook chunks?
Can we draw a two-dimensional map that helps an engineer inspect confusing clusters?

Both questions use dimensionality reduction. They don't have the same success criterion. A production vector is approved by retrieval quality and cost. A plot is approved by what it lets you investigate, with important conclusions checked back in the original embedding space.

Figure: Embedding reduction jobs for serving recall and 2D diagnostics. Reduction has two different jobs: serving candidates must preserve relevant neighbors under a cost budget, while 2D maps help investigate records without becoming the production index.

Start with the budget, not an algorithm

An embedding is an array of numbers. With float32 storage, every dimension costs four bytes. Storing ten million 1,536-dimensional runbook or document vectors therefore costs:

$10{,}000{,}000 \times 1{,}536 \times 4 = 61{,}440{,}000{,}000 \text{ bytes}$

That's 61.44 GB using decimal gigabytes, before the index graph, metadata, replicas, or backups. Cutting the vector to 256 dimensions makes the raw array six times smaller. It does not prove that search remains good.

Run the arithmetic first. It turns "embeddings feel large" into a concrete engineering constraint.

estimate-vector-storage.py

def decimal_gb(byte_count: int) -> float:
    return byte_count / 1_000_000_000

documents = 10_000_000
bytes_per_float32 = 4
full_dim = 1_536
candidate_dim = 256

full_bytes = documents * full_dim * bytes_per_float32
candidate_bytes = documents * candidate_dim * bytes_per_float32

print(f"Full float32 array: {decimal_gb(full_bytes):.2f} GB")
print(f"256-d float32 array: {decimal_gb(candidate_bytes):.2f} GB")
print(f"Raw-vector reduction: {full_bytes / candidate_bytes:.1f}x")

Output

Full float32 array: 61.44 GB
256-d float32 array: 10.24 GB
Raw-vector reduction: 6.0x

The first applied habit is simple: separate a cost metric from a quality metric.

Question	Useful metric
Will vectors fit in memory?	Bytes per vector, total RAM, index overhead
Will useful evidence still be retrieved?	Recall@K, nDCG@K, answer success on grounded evaluations
Will an engineer understand failures?	A labeled plot plus checks in the original space
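The table names nDCG@K without defining it. As a minimal sketch, assuming binary relevance judgments and the usual log2 position discount, it can be computed as follows. The helper names and document IDs are illustrative and do not come from any library.

```python
import math

def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevance[:k], start=1)
    )

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@K: 1.0 when every relevant item leads the ranking."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0

# Hypothetical judgments and rankings for one canary-failure query.
relevant = {"canary-rollback", "canary-traces"}
good = ["canary-rollback", "canary-traces", "latency-queue"]
worse = ["latency-queue", "canary-rollback", "canary-traces"]

print(f"relevant first: {ndcg_at_k(good, relevant, k=3):.3f}")   # 1.000
print(f"relevant later: {ndcg_at_k(worse, relevant, k=3):.3f}")  # 0.693
```

Both rankings have recall@3 = 1.0, yet nDCG penalizes the second because the relevant chunks arrive later. That matters when only the first few results reach generation.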

Build a retrieval baseline

Before compressing anything, define what "the same useful neighbors" means. The small fixture below represents six incident documents. Its coordinates are hand-written so you can see the experiment without needing an API key: the first two coordinates concern canary failures, the next two concern latency, and the last two concern rollback.

Cosine similarity compares directions rather than raw lengths. We L2-normalize each vector, score one canary-failure query, and save the top results as the baseline that reduction must try to preserve.

baseline-runbook-retrieval.py

import numpy as np

documents = [
    "canary 500s: inspect deploy diff and rollback",
    "canary errors: check trace spans before rollback",
    "latency spike: inspect dependency saturation",
    "timeout storm: investigate queue depth",
    "rollback approved: page deploy owner",
    "rollback blocked: wait for database migration",
]

vectors = np.array([
    [1.00, 0.92, 0.04, 0.00, 0.02, 0.00],
    [0.92, 0.80, 0.04, 0.01, 0.02, 0.00],
    [0.02, 0.00, 1.00, 0.88, 0.04, 0.00],
    [0.02, 0.02, 0.91, 0.98, 0.00, 0.00],
    [0.00, 0.00, 0.04, 0.02, 1.00, 0.89],
    [0.02, 0.00, 0.00, 0.02, 0.91, 1.00],
], dtype=float)
query = np.array([0.98, 1.00, 0.03, 0.00, 0.01, 0.00])

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

scores = normalize(vectors) @ normalize(query.reshape(1, -1))[0]
ranking = np.argsort(-scores)

for rank, index in enumerate(ranking[:3], start=1):
    print(f"{rank}. {documents[index]}  score={scores[index]:.3f}")

Output

1. canary 500s: inspect deploy diff and rollback  score=0.999
2. canary errors: check trace spans before rollback  score=0.997
3. timeout storm: investigate queue depth  score=0.036

In a real evaluation set, each query has judged relevant documents. For a query with two relevant runbook chunks, recall@2 = 1.0 means both appear in the top two returned results. That's the comparison target for every compressed representation here.
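The recall@K definition above fits in one small function. This is a sketch under stated assumptions: the judgments and returned rankings are invented for illustration, and the integers stand for positions in the six-document fixture.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the judged-relevant items found in the first k results."""
    if not relevant_ids:
        raise ValueError("each evaluated query needs at least one relevant item")
    found = set(ranked_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)

# Hypothetical judged set: query text -> indexes of acceptable runbook chunks.
judgments = {
    "canary returns 500s": {0, 1},
    "rollback is blocked": {5},
}
# Hypothetical rankings returned by some representation under test.
results = {
    "canary returns 500s": [0, 1, 3],
    "rollback is blocked": [4, 2, 5],
}

per_query = {
    query: recall_at_k(results[query], judgments[query], k=2)
    for query in judgments
}
mean_recall = sum(per_query.values()) / len(per_query)

for query, value in per_query.items():
    print(f"{query}: recall@2={value:.1f}")
print(f"mean recall@2={mean_recall:.2f}")
```

Reporting the per-query values next to the mean keeps the failed query visible, because an average alone can hide it.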

PCA: learn a shorter linear coordinate system

Principal Component Analysis (PCA) fits a linear transformation to a corpus. Center the embedding matrix, find directions with the most variation, and keep the first k directions. If $X$ is the centered document matrix and $V_k$ holds the chosen directions, the reduced vectors are:

$Z = X V_k$

The rows are still documents, now with fewer columns. In implementations, Singular Value Decomposition (SVD) is a practical way to find principal directions without explicitly constructing a covariance matrix [1].

Figure: Centered embedding points are projected onto the highest-variance principal axis, while document and query matrices both pass through the same fitted mean and $V_k$ basis to produce comparable reduced coordinates. PCA learns variance-aligned axes from representative documents. Persist the mean and basis, then project stored documents and incoming queries through that same coordinate system.

There is one failure mode to understand before using the library: fitting PCA separately on document vectors and live queries creates different axes. Their dot products no longer compare like with like. Fit on representative document data, persist the fitted transform, and call transform for both indexed documents and future queries.
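To make "persist the fitted transform" concrete, here is a NumPy-only sketch that finds $V_k$ with an SVD, saves the mean and basis, and reloads them for both documents and queries. The data is random, the in-memory buffer stands in for whatever versioned storage you use, and `fit_basis` and `project` are illustrative names.

```python
import io
import numpy as np

rng = np.random.default_rng(3)
column_scale = np.linspace(2.0, 0.2, 12)
documents = rng.normal(size=(200, 12)) * column_scale
queries = rng.normal(size=(5, 12)) * column_scale

def fit_basis(rows, k):
    """Return the mean and the top-k principal directions (columns of V_k)."""
    mean = rows.mean(axis=0)
    # Thin SVD of the centered matrix; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(rows - mean, full_matrices=False)
    return mean, vt[:k].T

def project(rows, mean, basis):
    """Apply Z = (X - mean) V_k."""
    return (rows - mean) @ basis

# Fit once on documents, then persist the two arrays that define the transform.
mean, basis = fit_basis(documents, k=4)
stored = io.BytesIO()  # stands in for a versioned file or object store
np.savez(stored, mean=mean, basis=basis)

# At query time, reload the saved transform instead of refitting.
stored.seek(0)
saved = np.load(stored)
short_docs = project(documents, saved["mean"], saved["basis"])
short_queries = project(queries, saved["mean"], saved["basis"])

# Anti-pattern: a basis fitted on the query batch defines different axes.
_, query_basis = fit_basis(queries, k=4)
same_axes = np.allclose(np.abs(basis), np.abs(query_basis))

print(short_docs.shape, short_queries.shape)
print("Refit on queries gives the same axes:", same_axes)
```

Versioning the saved arrays together with the index matters because an index built with one basis is not comparable to queries projected through another.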

This example adds small variations around the three runbook themes, fits a two-dimensional PCA model, and retrieves with a query transformed by the same model.

fit-one-pca-transform.py

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
centers = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.0, 0.0],  # canary failure
    [0.0, 0.0, 1.0, 0.9, 0.0, 0.0],  # latency issue
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.9],  # rollback
])
labels = np.repeat(["canary", "latency", "rollback"], 20)
documents = np.vstack([
    center + rng.normal(0, 0.05, size=(20, 6))
    for center in centers
])
query = np.array([[0.96, 0.91, 0.02, 0.01, 0.00, 0.00]])

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

pca = PCA(n_components=2, random_state=0)
reduced_documents = normalize(pca.fit_transform(documents))
reduced_query = normalize(pca.transform(query))
best_index = int(np.argmax(reduced_documents @ reduced_query[0]))

print(f"Shape: {documents.shape} -> {reduced_documents.shape}")
print(f"Explained variance kept: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Closest runbook theme: {labels[best_index]}")

Output

Shape: (60, 6) -> (60, 2)
Explained variance kept: 0.992
Closest runbook theme: canary

PCA optimizes variance, not relevance. A direction that rarely changes in the corpus can still distinguish a rollback-approved incident from a rollback-blocked incident. Explained variance is useful diagnostic information; retrieval recall is the release metric.
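A synthetic two-column fixture shows how this can happen. In this sketch, which is an invented construction rather than real embedding data, column 0 carries large wording noise and column 1 carries a small approved-versus-blocked signal.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
approved = rng.integers(0, 2, size=n).astype(bool)

# Column 0: high-variance noise shared by both rollback states.
# Column 1: low-variance signal, about +0.1 for approved and -0.1 for blocked.
rows = np.column_stack([
    rng.normal(0.0, 3.0, size=n),
    np.where(approved, 0.1, -0.1) + rng.normal(0.0, 0.01, size=n),
])

centered = rows - rows.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
one_dim = centered @ vt[0]  # what a 1-component PCA would keep

def best_sign_accuracy(values, labels):
    """Accuracy of the better of the two sign-threshold classifiers."""
    guess = values > 0
    return max(np.mean(guess == labels), np.mean(guess != labels))

kept_accuracy = best_sign_accuracy(one_dim, approved)
dropped_accuracy = best_sign_accuracy(centered[:, 1], approved)

print(f"Variance explained by the kept axis: {explained[0]:.4f}")
print(f"Label accuracy from the kept axis:   {kept_accuracy:.2f}")
print(f"Label accuracy from the dropped one: {dropped_accuracy:.2f}")
```

The kept axis explains almost all of the variance but separates the two states at roughly chance level. The discarded column separates them cleanly.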

Sweep dimensions against neighbor recall

On a small data set, you can provide the relevance judgments yourself. This lab creates twenty incident intents in a noisy 24-dimensional space. Each query has three acceptable runbook chunks. It then tries PCA dimensions 2, 4, 8, and 12.

pca-recall-sweep.py

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(21)
dimensions = 24
intent_count = 20
acceptable_chunks = 3
intent_vectors = rng.normal(size=(intent_count, dimensions))
intent_vectors -= intent_vectors.mean(axis=0)
intent_vectors /= np.linalg.norm(intent_vectors, axis=1, keepdims=True)

documents = []
queries = []
relevant = []
for intent_vector in intent_vectors:
    first_chunk = len(documents)
    documents.extend(
        intent_vector + rng.normal(0, 0.11, size=(acceptable_chunks, dimensions))
    )
    queries.append(intent_vector + rng.normal(0, 0.03, size=dimensions))
    relevant.append(set(range(first_chunk, first_chunk + acceptable_chunks)))
documents = np.asarray(documents)
queries = np.asarray(queries)

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def top_k(query_rows: np.ndarray, doc_rows: np.ndarray, k: int) -> np.ndarray:
    return np.argsort(-(normalize(query_rows) @ normalize(doc_rows).T), axis=1)[:, :k]

for target_dim in [2, 4, 8, 12]:
    reducer = PCA(n_components=target_dim, random_state=0)
    short_docs = reducer.fit_transform(documents)
    short_queries = reducer.transform(queries)
    returned = top_k(short_queries, short_docs, k=3)
    recall = np.mean([
        len(expected & set(found)) / acceptable_chunks
        for expected, found in zip(relevant, returned)
    ])
    print(f"{target_dim:>2} dims: recall@3={recall:.3f}, "
          f"variance={reducer.explained_variance_ratio_.sum():.3f}")

Output

 2 dims: recall@3=0.333, variance=0.266
 4 dims: recall@3=0.817, variance=0.454
 8 dims: recall@3=1.000, variance=0.715
12 dims: recall@3=1.000, variance=0.869

The numbers belong to this generated fixture, not to all embedding models. Notice that variance and relevant-chunk recall answer different questions. Copy the experiment pattern, then replace its documents, queries, and judgments with data from your own product.

A two-dimensional map is an investigation tool

A two-dimensional projection answers questions such as:

Are failed canary queries scattered into the rollback cluster?
Are documents from a new runbook version isolated from older versions?
Which points should an engineer inspect in the original data?

It doesn't become a search index just because the plot looks separated. Compressing a high-dimensional semantic vector to two coordinates forces distortion.

Compare a linear map and a neighborhood map

PCA offers a deterministic linear view. t-SNE and UMAP focus more strongly on neighborhood relationships, using nonlinear objectives. The visual difference matters because a layout can make clusters appear convincingly separated even when a production retrieval test would disagree.

Figure: The same three runbook groups form an overlapping diagonal under PCA, separated local islands under t-SNE, and a UMAP neighbor-graph view, with one anchor point keeping its nearest neighbors in each map. Each method tells a different 2D story. Treat local neighborhoods as investigation leads, not proof that cluster gaps or production retrieval are faithful.

The next lab demonstrates the evaluation idea without asking you to trust a picture. It measures how many of each point's original nearest neighbors remain nearby after a two-dimensional PCA view and after a t-SNE view.

measure-neighbor-retention-in-a-map.py

import os

# Keep this small teaching run from oversubscribing local CPU threads.
for variable in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
    os.environ[variable] = "1"

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

points, labels = make_blobs(
    n_samples=90,
    n_features=10,
    centers=3,
    cluster_std=1.0,
    random_state=8,
)

def neighbor_ids(rows: np.ndarray, k: int = 5) -> np.ndarray:
    distances = np.linalg.norm(rows[:, None, :] - rows[None, :, :], axis=2)
    return np.argsort(distances, axis=1)[:, 1:k + 1]

def overlap(reference: np.ndarray, candidate: np.ndarray) -> float:
    return float(np.mean([
        len(set(left) & set(right)) / left.size
        for left, right in zip(reference, candidate)
    ]))

original_neighbors = neighbor_ids(points)
pca_map = PCA(n_components=2).fit_transform(points)
tsne_map = TSNE(
    n_components=2,
    perplexity=12,
    init="random",
    learning_rate="auto",
    max_iter=750,
    random_state=8,
).fit_transform(points)

# t-SNE is stochastic and its exact value shifts across library versions,
# so we report two decimals rather than a brittle high-precision figure.
print(f"PCA local-neighbor overlap:  {overlap(original_neighbors, neighbor_ids(pca_map)):.2f}")
print(f"t-SNE local-neighbor overlap: {overlap(original_neighbors, neighbor_ids(tsne_map)):.2f}")
print(f"Labels available for inspection: {[int(label) for label in sorted(set(labels))]}")

Output

PCA local-neighbor overlap:  0.35
t-SNE local-neighbor overlap: 0.61
Labels available for inspection: [0, 1, 2]

This local-overlap measurement still isn't a retrieval release gate. It checks whether a map is useful for investigating local cases. The retrieval gate remains based on queries and relevant documents.

t-SNE: excellent local pictures, unsafe search coordinates

t-SNE converts high-dimensional neighbor relationships into probabilities $p_{ij}$, creates corresponding low-dimensional probabilities $q_{ij}$ using a heavy-tailed Student-t distribution, and minimizes:

$D_{KL}(P \parallel Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

The heavy tail lets moderately distant points spread in the map instead of crowding into the center. This is the central construction in van der Maaten and Hinton's t-SNE paper [2].

Figure: t-SNE compares a block-structured high-dimensional neighbor-probability matrix P with a crowded low-dimensional Student-t matrix Q, then moves 2D points until local probability blocks align while inter-cluster gaps remain uninterpretable. t-SNE optimizes local probability agreement. The final islands help identify nearby records, but the empty space between islands isn't a semantic distance scale.
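The objective can be evaluated by hand for a fixed layout. The sketch below is a simplification, not the full algorithm: it builds $P$ from one shared Gaussian bandwidth, where real t-SNE tunes a bandwidth per point from the perplexity, and it only scores two given layouts without optimizing anything. All function names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(rows):
    diff = rows[:, None, :] - rows[None, :, :]
    return np.sum(diff**2, axis=2)

def gaussian_p(rows, sigma=1.0):
    """Joint P from a single shared Gaussian bandwidth (simplifying assumption)."""
    weights = np.exp(-pairwise_sq_dists(rows) / (2 * sigma**2))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

def student_t_q(layout):
    """Joint Q from the heavy-tailed kernel 1 / (1 + squared distance)."""
    weights = 1.0 / (1.0 + pairwise_sq_dists(layout))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

def kl_divergence(p, q):
    mask = p > 0  # skips the diagonal, where p is zero by construction
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
high = np.vstack([
    rng.normal(0.0, 0.3, size=(10, 5)),
    rng.normal(3.0, 0.3, size=(10, 5)),
])
# A 2D layout that keeps the two groups together...
faithful = np.vstack([
    rng.normal(0.0, 0.3, size=(10, 2)),
    rng.normal(5.0, 0.3, size=(10, 2)),
])
# ...and the same 2D points assigned to the wrong records (0, 10, 1, 11, ...).
order = np.arange(20).reshape(2, 10).T.ravel()
scrambled = faithful[order]

p = gaussian_p(high)
kl_faithful = kl_divergence(p, student_t_q(faithful))
kl_scrambled = kl_divergence(p, student_t_q(scrambled))
print(f"KL for neighbor-preserving layout: {kl_faithful:.3f}")
print(f"KL for scrambled layout:           {kl_scrambled:.3f}")
```

The objective rewards keeping high-$p_{ij}$ pairs close. It says little about how far apart the two islands should sit, which is why gaps between clusters should not be read as distances.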

perplexity controls the rough neighborhood scale considered by the map. Trying several values is useful for investigation. It isn't acceptable to pick whichever rendering tells the cleanest story.

UMAP: optimize a weighted neighbor graph

UMAP first constructs a weighted nearest-neighbor graph, represented as a fuzzy simplicial set, then optimizes a lower-dimensional graph using a cross-entropy objective [3]. Parameters such as n_neighbors and min_dist change the view: one layout can expose small pockets of runbook confusion while another reveals broader groupings.

Figure: UMAP turns high-dimensional nearest-neighbor relations into a fuzzy weighted graph, combines directed memberships, and optimizes a 2D layout that attracts connected points and repels sampled non-neighbors; n_neighbors and min_dist change graph scope and cluster tightness. UMAP's output follows its weighted graph and chosen parameters. Use the map to investigate labeled cases, then verify important neighbors in source vectors.

Common UMAP implementations can place later samples into a fitted map with an approximate transform. That's convenient for a diagnostic dashboard, but it isn't the same as PCA's fixed linear transform. If incoming incident traffic changes, re-check the map before using it to explain failures.

Goal	Appropriate first measurement	Suitable methods to test
Inspect clusters or outliers	Neighbor overlap plus human review	PCA, t-SNE, UMAP
Shorten vectors for retrieval	Recall@K or nDCG@K plus RAM and latency	Native shortening, PCA, random projection
Reduce memory further	Recall after compressed search and reranking	Scalar, PQ, binary quantization

Random projection: a no-fit serving baseline

PCA learns from the document corpus. A random projection samples a fixed matrix once and applies it to every document and query:

$z = xR, \qquad R \in \mathbb{R}^{d \times k}$

The Johnson-Lindenstrauss lemma establishes that a finite set of points can be projected into a lower-dimensional space while approximately preserving pairwise distances, given enough target dimensions [4]. The bound is a worst-case guarantee, not a claim that a specific 128-dimensional incident index will meet your recall requirement.

Figure: Random projection maps 1536-D embeddings to 128-D with one saved matrix and a recall gate. Random projection is one persisted matrix multiplication, not a fitted corpus model. Its approximate geometry still needs a judged retrieval gate.
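To see why the worst-case bound is not a deployment rule, compare it with what one sampled matrix does in practice. The sketch assumes a commonly quoted form of the bound, $k \ge 4 \ln n / (\varepsilon^2/2 - \varepsilon^3/3)$, and a Gaussian matrix scaled by $1/\sqrt{k}$. The 50 random points are a stand-in for real embeddings.

```python
import math
import numpy as np

def jl_min_dim(n_points, eps):
    """Target dimension that is sufficient for distortion eps under this bound."""
    return math.ceil(4 * math.log(n_points) / (eps**2 / 2 - eps**3 / 3))

for eps in (0.1, 0.3, 0.5):
    print(f"eps={eps}: bound asks for k >= {jl_min_dim(10_000_000, eps)}")

# Empirical check: project 1536-D points to 128-D with one fixed matrix.
rng = np.random.default_rng(7)
d, k = 1536, 128
points = rng.normal(size=(50, d))
projection = rng.normal(size=(d, k)) / np.sqrt(k)  # persist this matrix once
short = points @ projection

def pairwise(rows):
    """Euclidean distances for every unordered pair of rows."""
    diff = rows[:, None, :] - rows[None, :, :]
    return np.linalg.norm(diff, axis=2)[np.triu_indices(len(rows), k=1)]

ratio = pairwise(short) / pairwise(points)
print(f"projected/original distance ratio: {ratio.min():.2f} to {ratio.max():.2f}")
```

For ten million points the bound asks for far more than 128 dimensions at small distortion, while the sampled matrix often behaves better than the worst case on a given sample. Neither observation says anything about recall on judged queries.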

This experiment compares original cosine neighbors with a Gaussian random projection. A fixed seed makes the transform reproducible.

random-projection-recall.py

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(33)
documents = rng.normal(size=(180, 48))
queries = documents[:12] + rng.normal(0, 0.03, size=(12, 48))

def normalized(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def neighbors(q: np.ndarray, d: np.ndarray, k: int = 5) -> np.ndarray:
    return np.argsort(-(normalized(q) @ normalized(d).T), axis=1)[:, :k]

gold = neighbors(queries, documents)
projector = GaussianRandomProjection(n_components=16, random_state=7)
# fit() only samples the random matrix for this input width; it learns nothing from the values.
short_docs = projector.fit_transform(documents)
short_queries = projector.transform(queries)
predicted = neighbors(short_queries, short_docs)
recall = np.mean([
    len(set(left) & set(right)) / left.size
    for left, right in zip(gold, predicted)
])

original_bytes = documents.shape[1] * 4
short_bytes = short_docs.shape[1] * 4
print(f"Bytes/vector: {original_bytes} -> {short_bytes}")
print(f"Neighbor recall@5: {recall:.3f}")

Output

Bytes/vector: 192 -> 64
Neighbor recall@5: 0.350

Random projection is worth benchmarking when fitting a data-dependent reducer is inconvenient or expensive. It isn't automatically better or worse than PCA. Let the same labeled retrieval set judge both.

Compression after reduction: quantization

Dimension reduction stores fewer coordinates. Quantization stores each coordinate, or each subvector, with fewer bits. You can use them independently or together.

For example, a 256-dimensional float32 vector occupies 256 × 4 = 1,024 bytes. Storing one signed byte per dimension uses 256 bytes. Storing one bit per dimension uses 32 bytes. Lower memory comes with progressively more approximation.
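The same arithmetic, extended to a ten-million-vector corpus, is easy to script. The helper name is illustrative, and the totals cover raw codes only, not index structures or metadata.

```python
def bytes_per_vector(dimensions, bits_per_dimension):
    """Raw code size in bytes, rounding up to whole bytes."""
    return (dimensions * bits_per_dimension + 7) // 8

dimensions = 256
vector_count = 10_000_000
sizes = {}
for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    sizes[name] = bytes_per_vector(dimensions, bits)
    total_gb = sizes[name] * vector_count / 1_000_000_000
    print(f"{name:<8} {sizes[name]:>5} bytes/vector  {total_gb:>6.2f} GB for 10M vectors")
```

For 256 dimensions this gives 1,024, 256, and 32 bytes per vector, or 10.24 GB, 2.56 GB, and 0.32 GB for the corpus.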

Scalar and binary codes

Scalar quantization maps each value onto an integer grid. Binary quantization records only a sign or threshold result. Vector databases often pair aggressive compressed candidate search with rescoring of a larger candidate set in a more accurate representation; the available choices and configuration depend on the database implementation [5].

The next lab makes the compression visible. It quantizes four small hand-written vectors to int8 and converts their signs to bits. The bit count is exact; the quality decision still needs retrieval measurements.

scalar-and-binary-codes.py

import numpy as np

vectors = np.array([
    [0.90, 0.80, -0.10, -0.20, 0.04, 0.02, -0.05, 0.08],
    [0.85, 0.74, -0.04, -0.16, 0.01, 0.04, -0.02, 0.05],
    [-0.02, 0.04, 0.92, 0.81, -0.06, 0.01, 0.02, -0.04],
    [-0.05, 0.03, 0.88, 0.79, -0.03, 0.00, 0.06, -0.02],
], dtype=np.float32)

scale = 127 / np.max(np.abs(vectors))
int8_vectors = np.round(vectors * scale).astype(np.int8)
binary_vectors = vectors > 0

query_bits = binary_vectors[0]
hamming = np.count_nonzero(binary_vectors != query_bits, axis=1)

print(f"float32 bytes/vector: {vectors.shape[1] * 4}")
print(f"int8 bytes/vector:    {int8_vectors.shape[1]}")
print(f"binary bits/vector:   {binary_vectors.shape[1]}")
print(f"Binary Hamming distances from row 0: {hamming.tolist()}")

Output

float32 bytes/vector: 32
int8 bytes/vector:    8
binary bits/vector:   8
Binary Hamming distances from row 0: [0, 0, 6, 7]

Binary codes are tiny, but "same sign" is a much weaker statement than "same semantic evidence." For an incident application, retrieve more candidates with a compressed representation and rerank before final evidence selection if benchmarks justify that design.
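The retrieve-then-rerank pattern can be sketched with NumPy alone. Everything specific here is an assumption for illustration: the corpus is random, the query is a noisy copy of document 42, and the oversampling count of 50 is arbitrary rather than a recommendation.

```python
import numpy as np

rng = np.random.default_rng(5)
documents = rng.normal(size=(500, 64)).astype(np.float32)
documents /= np.linalg.norm(documents, axis=1, keepdims=True)

target = 42
query = documents[target] + rng.normal(0, 0.02, size=64).astype(np.float32)
query /= np.linalg.norm(query)

# Stage 1: cheap candidate search on packed sign bits (8 bytes per vector).
doc_bits = np.packbits(documents > 0, axis=1)
query_bits = np.packbits(query > 0)
hamming = np.unpackbits(doc_bits ^ query_bits, axis=1).sum(axis=1)
oversample = 50
candidates = np.argsort(hamming, kind="stable")[:oversample]

# Stage 2: rescore only those candidates with the full-precision vectors.
exact = documents[candidates] @ query
final = candidates[np.argsort(-exact)[:5]]

print(f"Packed code shape: {doc_bits.shape}")
print(f"Target in candidate set: {target in candidates}")
print(f"Top result after rescoring: {int(final[0])}")
```

The oversampling count is the parameter to sweep in a benchmark. Too small and the compressed stage drops relevant evidence before rescoring can recover it; too large and the memory saving buys little latency.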

Product quantization: code sub-vectors

Product Quantization (PQ) splits a vector into sub-vectors, learns a codebook for each subspace, and stores only the selected centroid IDs [6]. In the common case of 256 possible centroids, each ID fits in one byte. A 128-dimensional float32 vector split into eight subspaces can therefore be represented by eight code bytes, plus shared codebooks.

Figure: Product quantization stores centroid IDs and scores them with query lookup tables. PQ stores eight centroid choices instead of 128 floats. ADC keeps the query uncompressed, builds query-to-centroid lookup tables, and lets each stored byte select one distance to add.
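The storage arithmetic for that 128-dimensional, eight-subspace configuration is worth writing out, including the shared codebooks that the per-vector figure leaves out. The ten-million-vector corpus size is an assumption carried over from the earlier budget example.

```python
dimensions = 128
subspaces = 8
centroids_per_subspace = 256
bytes_per_float32 = 4

raw_bytes = dimensions * bytes_per_float32   # uncompressed float32 vector
code_bytes = subspaces * 1                   # one uint8 centroid ID per subspace
sub_dim = dimensions // subspaces            # coordinates covered by each codebook
# Codebooks are stored once and shared by every vector in the index.
codebook_bytes = subspaces * centroids_per_subspace * sub_dim * bytes_per_float32

vector_count = 10_000_000
raw_total = vector_count * raw_bytes
pq_total = vector_count * code_bytes + codebook_bytes

print(f"Raw bytes/vector:  {raw_bytes}")
print(f"Code bytes/vector: {code_bytes}")
print(f"Shared codebooks:  {codebook_bytes} bytes")
print(f"Corpus total: {raw_total / 1e9:.2f} GB -> {pq_total / 1e9:.3f} GB")
```

The codebooks cost 131,072 bytes in total, which is negligible next to ten million codes. The per-vector ratio of 512 to 8 bytes is therefore close to the corpus-level ratio.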

The original PQ paper describes asymmetric distance computation (ADC): keep a query unquantized, precompute its distances to codebook centroids, and score each stored code through table lookups [6]. This miniature implementation uses only two subspaces and four centroids so that every shape remains visible.

mini-product-quantization.py

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
documents = np.vstack([
    rng.normal([1, 1, 0, 0], 0.08, size=(20, 4)),
    rng.normal([0, 0, 1, 1], 0.08, size=(20, 4)),
])
query = np.array([0.94, 1.02, 0.01, -0.02])

subspace_width = 2
codebooks = []
codes = []
for start in range(0, documents.shape[1], subspace_width):
    block = documents[:, start:start + subspace_width]
    model = KMeans(n_clusters=4, random_state=0, n_init=10).fit(block)
    codebooks.append(model.cluster_centers_)
    codes.append(model.labels_)
codes = np.column_stack(codes)

adc_scores = np.zeros(documents.shape[0])
for subspace, start in enumerate(range(0, documents.shape[1], subspace_width)):
    query_block = query[start:start + subspace_width]
    table = np.linalg.norm(codebooks[subspace] - query_block, axis=1) ** 2
    adc_scores += table[codes[:, subspace]]

print(f"Document shape: {documents.shape}")
print(f"Stored code shape: {codes.shape}")
print(f"Raw bytes/vector: {documents.shape[1] * 8} (float64 in this lab)")
print(f"Code bytes/vector with <=256 centroids: {codes.shape[1]}")
print(f"Nearest PQ-coded document group: {'canary' if np.argmin(adc_scores) < 20 else 'latency'}")

Output

Document shape: (40, 4)
Stored code shape: (40, 2)
Raw bytes/vector: 32 (float64 in this lab)
Code bytes/vector with <=256 centroids: 2
Nearest PQ-coded document group: canary

PQ isn't an interchangeable synonym for PCA. PCA shortens a coordinate system; PQ encodes sub-vectors with learned IDs. An index may use a shortened embedding and then PQ-code it when memory is tight.

Native shortening: train prefixes to be useful

PCA and random projection operate after an encoder produces its vector. Matryoshka Representation Learning (MRL) changes training itself: the model is optimized so prefixes of its representation remain useful at several selected lengths [7].

For an embedding $e$ and prefix lengths in a set $\mathcal{D}$, a simplified training view is:

$\mathcal{L}_{MRL} = \sum_{d \in \mathcal{D}} w_d \mathcal{L}(e_{:d})$

Here, $e_{:d}$ is the first d coordinates. Unlike PCA, serving a shorter MRL-trained representation doesn't require a separately fitted rotation. It still requires matching document and query lengths and testing retrieval quality.

Figure: Matryoshka representation learning trains nested embedding prefixes and deploys one shared cut. MRL trains selected prefixes to carry task signal. Production then benchmarks supported cuts and applies one chosen length identically to document and query embeddings.
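On the serving side, applying a prefix cut amounts to slicing and re-normalizing, as long as documents and queries use the same length. In this sketch the random arrays are stand-ins for model output, so the scores carry no meaning. `shorten` and `served_length` are illustrative names.

```python
import numpy as np

def shorten(rows, length):
    """Keep the first `length` coordinates and L2-normalize the result."""
    prefix = np.asarray(rows, dtype=float)[:, :length]
    return prefix / np.linalg.norm(prefix, axis=1, keepdims=True)

rng = np.random.default_rng(2)
document_embeddings = rng.normal(size=(6, 16))  # stand-ins for model output
query_embeddings = rng.normal(size=(2, 16))

served_length = 8  # hypothetical cut chosen by an earlier benchmark
short_docs = shorten(document_embeddings, served_length)
short_queries = shorten(query_embeddings, served_length)
scores = short_queries @ short_docs.T
print(f"Score matrix shape: {scores.shape}")

# A mismatched cut fails loudly instead of returning meaningless scores.
try:
    shorten(query_embeddings, 4) @ short_docs.T
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(f"Length mismatch detected: {mismatch_detected}")
```

Re-normalizing after the cut keeps dot products equal to cosine similarity. A prefix of a unit-length vector is generally shorter than unit length.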

This lab illustrates the mechanics with deliberately ordered vectors. It isn't pretending to train an embedding model. The first coordinates carry runbook theme information; later coordinates add detail and noise. Truncating a standard, unordered representation wouldn't have this promise.

evaluate-ordered-prefixes.py

import numpy as np

rng = np.random.default_rng(15)
centers = np.array([
    [1.0, 0.9, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9, 0.3, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.9, 0.3, 0.2],
])
documents = np.vstack([
    center + rng.normal(0, 0.04, size=(18, 8))
    for center in centers
])
queries = np.vstack([
    center + rng.normal(0, 0.02, size=(3, 8))
    for center in centers
])

def top3(rows: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(rows @ corpus.T), axis=1)[:, :3]

gold = top3(queries, documents)
for prefix in [2, 4, 8]:
    shortened = top3(queries[:, :prefix], documents[:, :prefix])
    recall = np.mean([
        len(set(expected) & set(found)) / 3
        for expected, found in zip(gold, shortened)
    ])
    print(f"Prefix {prefix}: recall@3={recall:.3f}")

Output

Prefix 2: recall@3=0.296
Prefix 4: recall@3=0.519
Prefix 8: recall@3=1.000

One hosted API example is OpenAI's text-embedding-3 family. Its documentation lists default lengths of 1,536 for text-embedding-3-small and 3,072 for text-embedding-3-large, and documents a dimensions request parameter for shorter outputs [8]. A request shape looks like this:

embedding-request.json

{
  "model": "text-embedding-3-large",
  "input": "screen arrived cracked and will not turn on",
  "dimensions": 256
}

The provider feature makes generation of shorter vectors convenient. It doesn't decide whether 256 dimensions meet your canary-failure retrieval requirement.

Choose a deployment experiment

You have four tool families:

Tool family	What changes	Best first question
PCA or random projection	Number of served coordinates after encoding	How much recall survives at a lower dimension?
MRL-style native shortening	Number of coordinates emitted from a trained model	Does a supported prefix beat post-hoc alternatives at the same cost?
Scalar, binary, or product quantization	Precision or code used to store/search vectors	Can compressed candidate search plus reranking fit the RAM budget?
t-SNE or UMAP	Coordinates used for inspection maps	What failures should an engineer examine next?

Figure: A job-first dimensionality reduction map fans into three lanes: use PCA, t-SNE, and UMAP for investigation maps; use native shortening, PCA, or random projection for smaller serving vectors; and use scalar quantization, product quantization, or binary codes for compressed candidate search. Every lane passes through its own evidence gate, and documents plus queries still need a compatible representation path. Pick the method from the job. Maps debug, shortened vectors serve, and compressed codes still need a compatible query and document path.

A release gate should turn this table into a decision that someone else can audit. This lab accepts only configurations that meet both a recall floor and a raw-vector memory budget, then chooses the smallest accepted representation.

select-a-serving-candidate.py

measurements = [
    {"method": "float32-full", "bytes_per_vector": 6144, "recall_at_10": 1.000},
    {"method": "native-256", "bytes_per_vector": 1024, "recall_at_10": 0.984},
    {"method": "pca-256", "bytes_per_vector": 1024, "recall_at_10": 0.971},
    {"method": "pq-16-code", "bytes_per_vector": 16, "recall_at_10": 0.939},
]

required_recall = 0.975
maximum_bytes = 1200
accepted = [
    row for row in measurements
    if row["recall_at_10"] >= required_recall
    and row["bytes_per_vector"] <= maximum_bytes
]
choice = min(accepted, key=lambda row: row["bytes_per_vector"])

for row in measurements:
    status = "pass" if row in accepted else "reject"
    print(f"{row['method']:<14} recall={row['recall_at_10']:.3f} "
          f"bytes={row['bytes_per_vector']:<4} {status}")
print(f"Selected representation: {choice['method']}")

Output

float32-full   recall=1.000 bytes=6144 reject
native-256     recall=0.984 bytes=1024 pass
pca-256        recall=0.971 bytes=1024 reject
pq-16-code     recall=0.939 bytes=16   reject
Selected representation: native-256

The values above are example measurements for the gate, not benchmark results for a real model. A serious report would include:

A frozen set of realistic incident queries and judged runbook chunks.
Recall or nDCG at the candidate count sent to generation or reranking.
Index RAM, build time, and query-latency percentiles for each candidate.
A failure list: which queries lost relevant evidence, and what risk that creates.
Re-evaluation when embedding models, runbook documents, or traffic distributions change.
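The failure list in that report can be produced mechanically from the same judged set. This sketch assumes each representation's results are stored as ranked document indexes per query. All data below is invented for illustration.

```python
def lost_evidence(baseline, candidate, judged, k):
    """List queries where a relevant chunk was in the baseline top-k but not the candidate's."""
    failures = []
    for query, relevant in judged.items():
        before = set(baseline[query][:k]) & relevant
        after = set(candidate[query][:k]) & relevant
        lost = sorted(before - after)
        if lost:
            failures.append({"query": query, "lost_chunks": lost})
    return failures

# Hypothetical judged chunks and rankings for three incident queries.
judged = {
    "canary returning 500s": {0, 1},
    "rollback blocked by migration": {5},
    "queue depth timeouts": {2, 3},
}
baseline = {
    "canary returning 500s": [0, 1, 4],
    "rollback blocked by migration": [5, 4, 0],
    "queue depth timeouts": [3, 2, 1],
}
candidate = {
    "canary returning 500s": [0, 1, 4],
    "rollback blocked by migration": [4, 0, 5],
    "queue depth timeouts": [3, 1, 2],
}

failures = lost_evidence(baseline, candidate, judged, k=2)
for row in failures:
    print(f"{row['query']}: lost chunks {row['lost_chunks']}")
```

Attaching this list to the recall table lets a reviewer judge risk per query. Losing one rollback-blocked chunk may matter more than an average recall drop suggests.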

What you've built

The long-embedding budget question leaves seven concrete skills:

calculate the raw cost of storing a vector representation;
establish cosine-retrieval neighbors before changing representations;
fit PCA once, apply it consistently, and evaluate shortened retrieval;
use t-SNE and UMAP as diagnostic maps rather than search coordinates;
benchmark random projection as a no-fit alternative;
distinguish fewer dimensions from lower-precision or PQ-coded storage; and
evaluate model-supported prefixes such as MRL-style shortening through the same retrieval gate.

The recurring idea is more important than any preferred method: compression is an experiment with a quality constraint. Storage savings are real immediately. Semantic safety must be measured.

Mastery check

Key concepts

PCA fits variance-aligned linear coordinates and reuses one transform for documents and queries.
t-SNE and UMAP make investigation maps; neither plot distance should approve retrieval.
Random projection supplies a fixed, data-independent serving baseline.
MRL trains selected prefixes to remain useful, so shortening can happen at model output.
Quantization reduces storage precision or stores centroid codes; PQ is distinct from dimensionality reduction.
Recall@K or an equivalent downstream metric decides whether reduced vectors remain useful.

Evaluation rubric

Foundational: Calculates float32 vector storage and explains why it isn't a quality metric.
Intermediate: Fits a PCA transform once, normalizes reduced vectors for cosine comparison, and computes neighbor recall.
Intermediate: Explains why a 2D t-SNE or UMAP map is useful for debugging but unsafe as a retrieval representation.
Advanced: Designs a comparison among native shortening, linear projection, and quantization under RAM and recall constraints.

Common pitfalls

Refitting a reducer for query batches: documents and queries end up in different coordinate systems. Fit and version one serving transform.
Choosing dimensions by explained variance only: retained variance isn't relevance. Use labeled retrieval or grounded-answer evaluations.
Reading plot gaps as semantic distance: nonlinear map layout distorts separation. Check specific claims in original vectors.
Truncating an arbitrary embedding: prefix shortening is justified only when the model documents or was trained for it.
Compressing without reranking analysis: aggressive codes can lose required evidence. Evaluate oversampling and rescoring when using them.

Practice extension

Create a small runbook corpus with four query categories: canary failure, latency spike, rollback eligibility, and access escalation. Assign each query two acceptable runbook chunks. Replace the generated embeddings in pca-recall-sweep.py with embeddings from a model you can run or an API you can audit. Compare full vectors, one shorter representation, and one quantized candidate. Your artifact is a table of recall, bytes per vector, and the exact failed queries, followed by a release recommendation.

Next Step

Continue to CoT, ToT & Self-Consistency Prompting

You can now retrieve and audit compact evidence representations; next you'll study how spending additional inference-time computation can improve multi-step decisions made from that evidence.


References

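[1] Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. https://people.maths.ox.ac.uk/trefethen/books.html

[2] van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. https://jmlr.org/papers/v9/vandermaaten08a.html

[3] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426

[4] Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26. https://web.stanford.edu/class/cs114/readings/JL-Johnson.pdf

[5] Qdrant. (2024). Qdrant Quantization Guide. https://qdrant.tech/documentation/guides/quantization/

[6] Jégou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI. https://dblp.org/rec/journals/pami/JegouDS11

[7] Kusupati, A., et al. (2022). Matryoshka Representation Learning. NeurIPS 2022. https://arxiv.org/abs/2205.13147

[8] OpenAI. (2024). Vector embeddings. https://developers.openai.com/api/docs/guides/embeddings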


Checkpoint questions

Why can't a six-times smaller vector index be approved from storage savings alone?
A PCA model keeps 95 percent of explained variance but loses a relevant rollback-runbook chunk from top-5 retrieval. Should it ship?
Why can a system use both PCA or native shortening and product quantization?
An engineer proposes using a visually clean UMAP plot as a 2D production index. What experiment should you request instead?
What is the single rule that applies to PCA, random projection, native shortening, and compressed search?