Shrink and inspect embedding indexes without guessing: measure recall while testing PCA, projections, native shortening, and quantization.
An instruction-tuned support assistant still needs evidence before it answers. Suppose its retrieval index stores policy chunks for damaged items, delayed parcels, return windows, and refund approvals. Each chunk is represented by a long embedding, and a query such as "the screen arrived cracked" should land near the damaged-electronics policy.
That retrieval system creates two different questions:
Both questions use dimensionality reduction. They don't have the same success criterion. A production vector is approved by retrieval quality and cost. A plot is approved by what it lets you investigate, with important conclusions checked back in the original embedding space.
An embedding is an array of numbers. With float32 storage, every dimension costs four bytes. Storing ten million 1,536-dimensional policy or product vectors therefore costs:
That is 61.44 GB using decimal gigabytes, before the index graph, metadata, replicas, or backups. Cutting the vector to 256 dimensions makes the raw array six times smaller. It does not prove that search remains good.
Run the arithmetic first. It turns "embeddings feel large" into a concrete engineering constraint.
1def decimal_gb(byte_count: int) -> float:
2 return byte_count / 1_000_000_000
3
4documents = 10_000_000
5bytes_per_float32 = 4
6full_dim = 1_536
7candidate_dim = 256
8
9full_bytes = documents * full_dim * bytes_per_float32
10candidate_bytes = documents * candidate_dim * bytes_per_float32
11
12print(f"Full float32 array: {decimal_gb(full_bytes):.2f} GB")
13print(f"256-d float32 array: {decimal_gb(candidate_bytes):.2f} GB")
14print(f"Raw-vector reduction: {full_bytes / candidate_bytes:.1f}x")1Full float32 array: 61.44 GB
2256-d float32 array: 10.24 GB
3Raw-vector reduction: 6.0xThe first applied habit is simple: separate a cost metric from a quality metric.
| Question | Useful metric |
|---|---|
| Will vectors fit in memory? | Bytes per vector, total RAM, index overhead |
| Will useful evidence still be retrieved? | Recall@K, nDCG@K, answer success on grounded evaluations |
| Will an engineer understand failures? | A labeled plot plus checks in the original space |
Before compressing anything, define what "the same useful neighbors" means. The small fixture below represents six support documents. Its coordinates are hand-written so you can see the experiment without needing an API key: the first two coordinates concern damaged items, the next two concern delivery, and the last two concern returns.
Cosine similarity compares directions rather than raw lengths. We L2-normalize each vector, score one damaged-screen query, and save the top results as the baseline that reduction must try to preserve.
1import numpy as np
2
3documents = [
4 "damaged electronics: photo evidence and replacement",
5 "damaged furniture: upload photos for claim",
6 "late parcel: carrier scan and escalation",
7 "missing parcel: investigate delivery trace",
8 "sealed return: return within 30 days",
9 "opened return: restocking review required",
10]
11
12vectors = np.array([
13 [1.00, 0.92, 0.04, 0.00, 0.02, 0.00],
14 [0.92, 0.80, 0.04, 0.01, 0.02, 0.00],
15 [0.02, 0.00, 1.00, 0.88, 0.04, 0.00],
16 [0.02, 0.02, 0.91, 0.98, 0.00, 0.00],
17 [0.00, 0.00, 0.04, 0.02, 1.00, 0.89],
18 [0.02, 0.00, 0.00, 0.02, 0.91, 1.00],
19], dtype=float)
20query = np.array([0.98, 1.00, 0.03, 0.00, 0.01, 0.00])
21
22def normalize(rows: np.ndarray) -> np.ndarray:
23 return rows / np.linalg.norm(rows, axis=1, keepdims=True)
24
25scores = normalize(vectors) @ normalize(query.reshape(1, -1))[0]
26ranking = np.argsort(-scores)
27
28for rank, index in enumerate(ranking[:3], start=1):
29 print(f"{rank}. {documents[index]} score={scores[index]:.3f}")11. damaged electronics: photo evidence and replacement score=0.999
22. damaged furniture: upload photos for claim score=0.997
33. missing parcel: investigate delivery trace score=0.036In a real evaluation set, each query has judged relevant documents. For a query with two relevant policy chunks, recall@2 = 1.0 means both appear in the top two returned results. That is the comparison target for every compressed representation in this lesson.
Principal Component Analysis (PCA) fits a linear transformation to a corpus. Center the embedding matrix, find directions with the most variation, and keep the first k directions. If is the centered document matrix and holds the chosen directions, the reduced vectors are:
The rows are still documents; there are simply fewer columns. In implementations, Singular Value Decomposition (SVD) is a practical way to find principal directions without explicitly constructing a covariance matrix.[1]
There is one failure mode to understand before using the library: fitting PCA separately on document vectors and live queries creates different axes. Their dot products no longer compare like with like. Fit on representative document data, persist the fitted transform, and call transform for both indexed documents and future queries.
This example adds small variations around the three policy themes, fits a two-dimensional PCA model, and retrieves with a query transformed by the same model.
1import numpy as np
2from sklearn.decomposition import PCA
3
4rng = np.random.default_rng(4)
5centers = np.array([
6 [1.0, 0.9, 0.0, 0.0, 0.0, 0.0], # damaged item
7 [0.0, 0.0, 1.0, 0.9, 0.0, 0.0], # delivery issue
8 [0.0, 0.0, 0.0, 0.0, 1.0, 0.9], # return
9])
10labels = np.repeat(["damage", "delivery", "return"], 20)
11documents = np.vstack([
12 center + rng.normal(0, 0.05, size=(20, 6))
13 for center in centers
14])
15query = np.array([[0.96, 0.91, 0.02, 0.01, 0.00, 0.00]])
16
17def normalize(rows: np.ndarray) -> np.ndarray:
18 return rows / np.linalg.norm(rows, axis=1, keepdims=True)
19
20pca = PCA(n_components=2, random_state=0)
21reduced_documents = normalize(pca.fit_transform(documents))
22reduced_query = normalize(pca.transform(query))
23best_index = int(np.argmax(reduced_documents @ reduced_query[0]))
24
25print(f"Shape: {documents.shape} -> {reduced_documents.shape}")
26print(f"Explained variance kept: {pca.explained_variance_ratio_.sum():.3f}")
27print(f"Closest policy theme: {labels[best_index]}")1Shape: (60, 6) -> (60, 2)
2Explained variance kept: 0.992
3Closest policy theme: damagePCA optimizes variance, not relevance. A direction that rarely changes in the corpus can still distinguish a valid replacement from an invalid refund. Explained variance is useful diagnostic information; retrieval recall is the release metric.
On a small data set, we can provide the relevance judgments ourselves. The following lab creates twenty support intents in a noisy 24-dimensional space. Each query has three acceptable policy chunks. It then tries PCA dimensions 2, 4, 8, and 12.
1import numpy as np
2from sklearn.decomposition import PCA
3
4rng = np.random.default_rng(21)
5dimensions = 24
6intent_count = 20
7acceptable_chunks = 3
8intent_vectors = rng.normal(size=(intent_count, dimensions))
9intent_vectors -= intent_vectors.mean(axis=0)
10intent_vectors /= np.linalg.norm(intent_vectors, axis=1, keepdims=True)
11
12documents = []
13queries = []
14relevant = []
15for intent_vector in intent_vectors:
16 first_chunk = len(documents)
17 documents.extend(
18 intent_vector + rng.normal(0, 0.11, size=(acceptable_chunks, dimensions))
19 )
20 queries.append(intent_vector + rng.normal(0, 0.03, size=dimensions))
21 relevant.append(set(range(first_chunk, first_chunk + acceptable_chunks)))
22documents = np.asarray(documents)
23queries = np.asarray(queries)
24
25def normalize(rows: np.ndarray) -> np.ndarray:
26 return rows / np.linalg.norm(rows, axis=1, keepdims=True)
27
28def top_k(query_rows: np.ndarray, doc_rows: np.ndarray, k: int) -> np.ndarray:
29 return np.argsort(-(normalize(query_rows) @ normalize(doc_rows).T), axis=1)[:, :k]
30
31for target_dim in [2, 4, 8, 12]:
32 reducer = PCA(n_components=target_dim, random_state=0)
33 short_docs = reducer.fit_transform(documents)
34 short_queries = reducer.transform(queries)
35 returned = top_k(short_queries, short_docs, k=3)
36 recall = np.mean([
37 len(expected & set(found)) / acceptable_chunks
38 for expected, found in zip(relevant, returned)
39 ])
40 print(f"{target_dim:>2} dims: recall@3={recall:.3f}, "
41 f"variance={reducer.explained_variance_ratio_.sum():.3f}")12 dims: recall@3=0.333, variance=0.266
2 4 dims: recall@3=0.817, variance=0.454
3 8 dims: recall@3=1.000, variance=0.715
412 dims: recall@3=1.000, variance=0.869The numbers belong to this generated fixture, not to all embedding models. Notice that variance and relevant-chunk recall answer different questions. Copy the experiment pattern, then replace its documents, queries, and judgments with data from your own product.
A two-dimensional projection answers questions such as:
It doesn't become a search index just because the plot looks separated. Compressing a rich semantic vector to two coordinates forces distortion.
PCA offers a deterministic linear view. t-SNE and UMAP focus more strongly on neighborhood relationships, using nonlinear objectives. The visual difference matters because a layout can make clusters appear convincingly separated even when a production retrieval test would disagree.
The next lab demonstrates the evaluation idea without asking you to trust a picture. It measures how many of each point's original nearest neighbors remain nearby after a two-dimensional PCA view and after a t-SNE view.
1import os
2
3# Keep this small teaching run from oversubscribing local CPU threads.
4for variable in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
5 os.environ[variable] = "1"
6
7import numpy as np
8from sklearn.datasets import make_blobs
9from sklearn.decomposition import PCA
10from sklearn.manifold import TSNE
11
12points, labels = make_blobs(
13 n_samples=90,
14 n_features=10,
15 centers=3,
16 cluster_std=1.0,
17 random_state=8,
18)
19
20def neighbor_ids(rows: np.ndarray, k: int = 5) -> np.ndarray:
21 distances = np.linalg.norm(rows[:, None, :] - rows[None, :, :], axis=2)
22 return np.argsort(distances, axis=1)[:, 1:k + 1]
23
24def overlap(reference: np.ndarray, candidate: np.ndarray) -> float:
25 return float(np.mean([
26 len(set(left) & set(right)) / left.size
27 for left, right in zip(reference, candidate)
28 ]))
29
30original_neighbors = neighbor_ids(points)
31pca_map = PCA(n_components=2).fit_transform(points)
32tsne_map = TSNE(
33 n_components=2,
34 perplexity=12,
35 init="random",
36 learning_rate="auto",
37 max_iter=750,
38 random_state=8,
39).fit_transform(points)
40
41# t-SNE is stochastic and its exact value shifts across library versions,
42# so we report two decimals rather than a brittle high-precision figure.
43print(f"PCA local-neighbor overlap: {overlap(original_neighbors, neighbor_ids(pca_map)):.2f}")
44print(f"t-SNE local-neighbor overlap: {overlap(original_neighbors, neighbor_ids(tsne_map)):.2f}")
45print(f"Labels available for inspection: {[int(label) for label in sorted(set(labels))]}")1PCA local-neighbor overlap: 0.35
2t-SNE local-neighbor overlap: 0.61
3Labels available for inspection: [0, 1, 2]This local-overlap measurement still isn't a retrieval release gate. It tells you whether a map is useful for investigating local cases. The retrieval gate remains based on queries and relevant documents.
t-SNE converts high-dimensional neighbor relationships into probabilities , creates corresponding low-dimensional probabilities using a heavy-tailed Student-t distribution, and minimizes:
The heavy tail lets moderately distant points spread in the map instead of crowding into the center. This is the central construction in van der Maaten and Hinton's t-SNE paper.[2]
perplexity controls the rough neighborhood scale considered by the map. Trying several values is useful for investigation. It isn't acceptable to pick whichever rendering tells the cleanest story.
UMAP first constructs a weighted nearest-neighbor graph, represented as a fuzzy simplicial set, then optimizes a lower-dimensional graph using a cross-entropy objective.[3] Parameters such as n_neighbors and min_dist change the view: one layout can expose small pockets of policy confusion while another reveals broader groupings.
Common UMAP implementations can place later samples into a fitted map with an approximate transform. That is convenient for a diagnostic dashboard, but it isn't the same as PCA's fixed linear transform. If incoming support traffic changes, re-check the map before using it to explain failures.
| Goal | Appropriate first measurement | Suitable methods to test |
|---|---|---|
| Inspect clusters or outliers | Neighbor overlap plus human review | PCA, t-SNE, UMAP |
| Shorten vectors for retrieval | Recall@K or nDCG@K plus RAM and latency | Native shortening, PCA, random projection |
| Reduce memory further | Recall after compressed search and reranking | Scalar, PQ, binary quantization |
PCA learns from the document corpus. A random projection samples a fixed matrix once and applies it to every document and query:
The Johnson-Lindenstrauss lemma establishes that a finite set of points can be projected into a lower-dimensional space while approximately preserving pairwise distances, given enough target dimensions.[4] The bound is a worst-case guarantee, not a claim that a specific 128-dimensional support index will meet your recall requirement.
This experiment compares original cosine neighbors with a Gaussian random projection. A fixed seed makes the transform reproducible.
1import numpy as np
2from sklearn.random_projection import GaussianRandomProjection
3
4rng = np.random.default_rng(33)
5documents = rng.normal(size=(180, 48))
6queries = documents[:12] + rng.normal(0, 0.03, size=(12, 48))
7
8def normalized(rows: np.ndarray) -> np.ndarray:
9 return rows / np.linalg.norm(rows, axis=1, keepdims=True)
10
11def neighbors(q: np.ndarray, d: np.ndarray, k: int = 5) -> np.ndarray:
12 return np.argsort(-(normalized(q) @ normalized(d).T), axis=1)[:, :k]
13
14gold = neighbors(queries, documents)
15projector = GaussianRandomProjection(n_components=16, random_state=7)
16short_docs = projector.fit_transform(documents)
17short_queries = projector.transform(queries)
18predicted = neighbors(short_queries, short_docs)
19recall = np.mean([
20 len(set(left) & set(right)) / left.size
21 for left, right in zip(gold, predicted)
22])
23
24original_bytes = documents.shape[1] * 4
25short_bytes = short_docs.shape[1] * 4
26print(f"Bytes/vector: {original_bytes} -> {short_bytes}")
27print(f"Neighbor recall@5: {recall:.3f}")1Bytes/vector: 192 -> 64
2Neighbor recall@5: 0.350Random projection is worth benchmarking when fitting a data-dependent reducer is inconvenient or expensive. It isn't automatically better or worse than PCA. Let the same labeled retrieval set judge both.
Dimension reduction stores fewer coordinates. Quantization stores each coordinate, or each subvector, with fewer bits. You can use them independently or together.
For example, a 256-dimensional float32 vector occupies 256 × 4 = 1,024 bytes. Storing one signed byte per dimension uses 256 bytes. Storing one bit per dimension uses 32 bytes. Lower memory comes with progressively more approximation.
Scalar quantization maps each value onto an integer grid. Binary quantization records only a sign or threshold result. Vector databases often pair aggressive compressed candidate search with rescoring of a larger candidate set in a more accurate representation; the available choices and configuration depend on the database implementation.[5]
The next lab makes the compression visible. It quantizes four normalized vectors to int8 and converts their signs to bits. The bit count is exact; the quality decision still needs retrieval measurements.
1import numpy as np
2
3vectors = np.array([
4 [0.90, 0.80, -0.10, -0.20, 0.04, 0.02, -0.05, 0.08],
5 [0.85, 0.74, -0.04, -0.16, 0.01, 0.04, -0.02, 0.05],
6 [-0.02, 0.04, 0.92, 0.81, -0.06, 0.01, 0.02, -0.04],
7 [-0.05, 0.03, 0.88, 0.79, -0.03, 0.00, 0.06, -0.02],
8], dtype=np.float32)
9
10scale = 127 / np.max(np.abs(vectors))
11int8_vectors = np.round(vectors * scale).astype(np.int8)
12binary_vectors = vectors > 0
13
14query_bits = binary_vectors[0]
15hamming = np.count_nonzero(binary_vectors != query_bits, axis=1)
16
17print(f"float32 bytes/vector: {vectors.shape[1] * 4}")
18print(f"int8 bytes/vector: {int8_vectors.shape[1]}")
19print(f"binary bits/vector: {binary_vectors.shape[1]}")
20print(f"Binary Hamming distances from row 0: {hamming.tolist()}")1float32 bytes/vector: 32
2int8 bytes/vector: 8
3binary bits/vector: 8
4Binary Hamming distances from row 0: [0, 0, 6, 7]Binary codes are tiny, but "same sign" is a much weaker statement than "same semantic evidence." For a support application, retrieve more candidates with a compressed representation and rerank before final evidence selection if benchmarks justify that design.
Product Quantization (PQ) splits a vector into sub-vectors, learns a codebook for each subspace, and stores only the selected centroid IDs.[6] In the common case of 256 possible centroids, each ID fits in one byte. A 128-dimensional float32 vector split into eight subspaces can therefore be represented by eight code bytes, plus shared codebooks.
The original PQ paper describes asymmetric distance computation (ADC): keep a query unquantized, precompute its distances to codebook centroids, and score each stored code through table lookups.[6] This miniature implementation uses only two subspaces and four centroids so that every shape remains visible.
1import numpy as np
2from sklearn.cluster import KMeans
3
4rng = np.random.default_rng(12)
5documents = np.vstack([
6 rng.normal([1, 1, 0, 0], 0.08, size=(20, 4)),
7 rng.normal([0, 0, 1, 1], 0.08, size=(20, 4)),
8])
9query = np.array([0.94, 1.02, 0.01, -0.02])
10
11subspace_width = 2
12codebooks = []
13codes = []
14for start in range(0, documents.shape[1], subspace_width):
15 block = documents[:, start:start + subspace_width]
16 model = KMeans(n_clusters=4, random_state=0, n_init=10).fit(block)
17 codebooks.append(model.cluster_centers_)
18 codes.append(model.labels_)
19codes = np.column_stack(codes)
20
21adc_scores = np.zeros(documents.shape[0])
22for subspace, start in enumerate(range(0, documents.shape[1], subspace_width)):
23 query_block = query[start:start + subspace_width]
24 table = np.linalg.norm(codebooks[subspace] - query_block, axis=1) ** 2
25 adc_scores += table[codes[:, subspace]]
26
27print(f"Document shape: {documents.shape}")
28print(f"Stored code shape: {codes.shape}")
29print(f"Raw bytes/vector: {documents.shape[1] * 8} (float64 in this lab)")
30print(f"Code bytes/vector with <=256 centroids: {codes.shape[1]}")
31print(f"Nearest PQ-coded document group: {'damage' if np.argmin(adc_scores) < 20 else 'delivery'}")1Document shape: (40, 4)
2Stored code shape: (40, 2)
3Raw bytes/vector: 32 (float64 in this lab)
4Code bytes/vector with <=256 centroids: 2
5Nearest PQ-coded document group: damagePQ isn't an interchangeable synonym for PCA. PCA shortens a coordinate system; PQ encodes sub-vectors with learned IDs. An index may use a shortened embedding and then PQ-code it when memory is tight.
PCA and random projection operate after an encoder produces its vector. Matryoshka Representation Learning (MRL) changes training itself: the model is optimized so prefixes of its representation remain useful at several selected lengths.[7]
For an embedding and prefix lengths in a set , a simplified training view is:
Here, is the first d coordinates. Unlike PCA, serving a shorter MRL-trained representation doesn't require a separately fitted rotation. It still requires matching document and query lengths and testing retrieval quality.
This lab illustrates the mechanics with deliberately ordered vectors. It isn't pretending to train an embedding model. The first coordinates carry policy theme information; later coordinates add detail and noise. Truncating a standard, unordered representation wouldn't have this promise.
1import numpy as np
2
3rng = np.random.default_rng(15)
4centers = np.array([
5 [1.0, 0.9, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0],
6 [0.0, 0.0, 1.0, 0.9, 0.3, 0.2, 0.0, 0.0],
7 [0.0, 0.0, 0.0, 0.0, 1.0, 0.9, 0.3, 0.2],
8])
9documents = np.vstack([
10 center + rng.normal(0, 0.04, size=(18, 8))
11 for center in centers
12])
13queries = np.vstack([
14 center + rng.normal(0, 0.02, size=(3, 8))
15 for center in centers
16])
17
18def top3(rows: np.ndarray, corpus: np.ndarray) -> np.ndarray:
19 rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)
20 corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
21 return np.argsort(-(rows @ corpus.T), axis=1)[:, :3]
22
23gold = top3(queries, documents)
24for prefix in [2, 4, 8]:
25 shortened = top3(queries[:, :prefix], documents[:, :prefix])
26 recall = np.mean([
27 len(set(expected) & set(found)) / 3
28 for expected, found in zip(gold, shortened)
29 ])
30 print(f"Prefix {prefix}: recall@3={recall:.3f}")1Prefix 2: recall@3=0.296
2Prefix 4: recall@3=0.519
3Prefix 8: recall@3=1.000One hosted API example is OpenAI's text-embedding-3 family. Its documentation lists default lengths of 1,536 for text-embedding-3-small and 3,072 for text-embedding-3-large, and documents a dimensions request parameter for shorter outputs.[8] A request shape looks like this:
1{
2 "model": "text-embedding-3-large",
3 "input": "screen arrived cracked and will not turn on",
4 "dimensions": 256
5}The provider feature makes generation of shorter vectors convenient. It doesn't decide whether 256 dimensions meet your damaged-item retrieval requirement.
At this point you have three kinds of tools:
| Tool family | What changes | Best first question |
|---|---|---|
| PCA or random projection | Number of served coordinates after encoding | How much recall survives at a lower dimension? |
| MRL-style native shortening | Number of coordinates emitted from a trained model | Does a supported prefix beat post-hoc alternatives at the same cost? |
| Scalar, binary, or product quantization | Precision or code used to store/search vectors | Can compressed candidate search plus reranking fit the RAM budget? |
| t-SNE or UMAP | Coordinates used for inspection maps | What failures should an engineer examine next? |
A release gate should turn this table into a decision that someone else can audit. The following lab accepts only configurations that meet both a recall floor and a raw-vector memory budget, then chooses the smallest accepted representation.
1measurements = [
2 {"method": "float32-full", "bytes_per_vector": 6144, "recall_at_10": 1.000},
3 {"method": "native-256", "bytes_per_vector": 1024, "recall_at_10": 0.984},
4 {"method": "pca-256", "bytes_per_vector": 1024, "recall_at_10": 0.971},
5 {"method": "pq-16-code", "bytes_per_vector": 16, "recall_at_10": 0.939},
6]
7
8required_recall = 0.975
9maximum_bytes = 1200
10accepted = [
11 row for row in measurements
12 if row["recall_at_10"] >= required_recall
13 and row["bytes_per_vector"] <= maximum_bytes
14]
15choice = min(accepted, key=lambda row: row["bytes_per_vector"])
16
17for row in measurements:
18 status = "pass" if row in accepted else "reject"
19 print(f"{row['method']:<14} recall={row['recall_at_10']:.3f} "
20 f"bytes={row['bytes_per_vector']:<4} {status}")
21print(f"Selected representation: {choice['method']}")1float32-full recall=1.000 bytes=6144 reject
2native-256 recall=0.984 bytes=1024 pass
3pca-256 recall=0.971 bytes=1024 reject
4pq-16-code recall=0.939 bytes=16 reject
5Selected representation: native-256The values above are example measurements for the gate, not benchmark results for a real model. A serious report would include:
You began with a long embedding and a budget question. You now know how to:
The recurring idea is more important than any preferred method: compression is an experiment with a quality constraint. Storage savings are real immediately. Semantic safety must be measured.
Create a small policy corpus with four query categories: damaged item, delivery delay, missing parcel, and return eligibility. Assign each query two acceptable policy chunks. Replace the generated embeddings in pca-recall-sweep.py with embeddings from a model you can run or an API you can audit. Compare full vectors, one shorter representation, and one quantized candidate. Your artifact is a table of recall, bytes per vector, and the exact failed queries, followed by a release recommendation.
Numerical Linear Algebra.
Trefethen, L. N., & Bau, D. · 1997 · SIAM
Visualizing Data using t-SNE.
van der Maaten, L. & Hinton, G. · 2008
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
McInnes, L., Healy, J., & Melville, J. · 2018
Extensions of Lipschitz mappings into a Hilbert space.
Johnson, W. B. & Lindenstrauss, J. · 1984 · Contemporary Mathematics, 26
Qdrant Quantization Guide
Qdrant. · 2024
Product Quantization for Nearest Neighbor Search.
Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011
Matryoshka Representation Learning.
Kusupati, A., et al. · 2022 · NeurIPS 2022
Vector embeddings
OpenAI · 2024