Inspect unlabeled support-message embeddings with k-means and PCA, then stress-test whether apparent neighborhoods survive scale, metric, and compression choices.
In the last chapter, you learned to protect evaluation from leakage when labels exist. Now a support team has embedded thousands of untagged return and delivery messages, and nobody has assigned reliable topics yet. Before you build search or train a classifier, you need to ask a quieter question: does this representation contain useful neighborhoods at all?
This chapter gives you a disciplined answer. You'll group a tiny unlabeled corpus with k-means, inspect it from fewer dimensions with principal component analysis (PCA), and run failure tests that stop a pretty plot from becoming a false claim. Those habits matter later when a retrieval-augmented generation system selects evidence for an LLM.
Suppose a text embedding model has already converted six customer messages into two-number vectors. Real embeddings might contain hundreds or thousands of coordinates. Two dimensions let us compute every important step by hand first.
| Message ID | Customer text | ||
|---|---|---|---|
refund_A | "I was charged twice." | 1.0 | 1.0 |
refund_B | "Please undo this duplicate payment." | 1.2 | 1.8 |
login_A | "My reset link never arrives." | 4.0 | 4.2 |
login_B | "Two-factor code keeps failing." | 5.0 | 3.8 |
shipping_A | "Where is my package?" | 8.0 | 1.0 |
shipping_B | "Tracking says delayed again." | 9.0 | 1.8 |
The human-readable names are available here so you can check the geometry. Pretend the algorithm sees only coordinates. It doesn't receive refund, login, or shipping as labels.
The first diagnostic is simply nearest neighbors. For two points and , Euclidean distance is:
That formula asks how far apart two messages lie in the current representation. Run it before clustering so you know whether local pairs look sensible.
1import numpy as np
2
3names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"])
4vectors = np.array([
5 [1.0, 1.0],
6 [1.2, 1.8],
7 [4.0, 4.2],
8 [5.0, 3.8],
9 [8.0, 1.0],
10 [9.0, 1.8],
11])
12
13distances = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
14np.fill_diagonal(distances, np.inf)
15
16for row, name in enumerate(names):
17 neighbor = distances[row].argmin()
18 print(f"{name:10} -> {names[neighbor]:10} distance={distances[row, neighbor]:.2f}")1refund_A -> refund_B distance=0.82
2refund_B -> refund_A distance=0.82
3login_A -> login_B distance=1.08
4login_B -> login_A distance=1.08
5shipping_A -> shipping_B distance=1.28
6shipping_B -> shipping_A distance=1.28The intended local structure appears if each message's closest neighbor belongs to the same readable theme. That isn't a proof that all future messages will organize cleanly. It's a first representation check.
K-means is an unsupervised algorithm: it receives vectors and a requested number of groups, , but no correct topic names. It repeats two operations:
The algorithm minimizes inertia, the within-cluster sum of squared distances:
Here, is message vector , is its assigned cluster, and is that cluster's centroid. Lower inertia means points lie closer to their selected centers. It doesn't mean the resulting groups deserve human topic names.[1]
Start with and choose one seed from each visible area:
refund_Alogin_Ashipping_AAfter assignment, each pair stays together. Averaging coordinates moves the three centroids to:
| Candidate neighborhood | Assigned points | Updated centroid |
|---|---|---|
| left group | refund_A, refund_B | |
| upper group | login_A, login_B | |
| right group | shipping_A, shipping_B |
Notice the wording: the algorithm found left, upper, and right groups. A reviewer reads sampled messages and proposes the topic labels afterward.
1import numpy as np
2
3vectors = np.array([
4 [1.0, 1.0], [1.2, 1.8],
5 [4.0, 4.2], [5.0, 3.8],
6 [8.0, 1.0], [9.0, 1.8],
7])
8centroids = vectors[[0, 2, 4]].copy()
9
10squared_distance = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
11cluster = squared_distance.argmin(axis=1)
12updated = np.vstack([vectors[cluster == index].mean(axis=0) for index in range(3)])
13inertia = ((vectors - updated[cluster]) ** 2).sum()
14
15print("assignment:", cluster.tolist())
16print("updated centroids:", np.round(updated, 2).tolist())
17print(f"inertia after update: {inertia:.2f}")1assignment: [0, 0, 1, 1, 2, 2]
2updated centroids: [[1.1, 1.4], [4.5, 4.0], [8.5, 1.4]]
3inertia after update: 1.74
A production implementation uses a tested library, but writing the small loop once makes later debugging far easier. The function below accepts a matrix of vectors and initial centers. It stops when the centers no longer move.
1import numpy as np
2
3names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"])
4vectors = np.array([
5 [1.0, 1.0], [1.2, 1.8],
6 [4.0, 4.2], [5.0, 3.8],
7 [8.0, 1.0], [9.0, 1.8],
8])
9
10def kmeans(x: np.ndarray, initial: np.ndarray, max_steps: int = 20):
11 centers = initial.astype(float).copy()
12 for step in range(1, max_steps + 1):
13 squared = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
14 labels = squared.argmin(axis=1)
15 next_centers = np.vstack([x[labels == index].mean(axis=0) for index in range(len(centers))])
16 if np.allclose(next_centers, centers):
17 inertia = ((x - next_centers[labels]) ** 2).sum()
18 return labels, next_centers, inertia, step
19 centers = next_centers
20 raise RuntimeError("k-means did not converge")
21
22labels, centers, inertia, steps = kmeans(vectors, vectors[[0, 2, 4]])
23for index in range(3):
24 members = names[labels == index].tolist()
25 print(f"cluster {index}: {members} center={centers[index].round(2).tolist()}")
26print(f"converged in {steps} steps with inertia={inertia:.2f}")1cluster 0: ['refund_A', 'refund_B'] center=[1.1, 1.4]
2cluster 1: ['login_A', 'login_B'] center=[4.5, 4.0]
3cluster 2: ['shipping_A', 'shipping_B'] center=[8.5, 1.4]
4converged in 2 steps with inertia=1.74One defensive detail is missing on purpose: if a center gets no assigned vectors, its mean becomes undefined. Library implementations handle empty clusters and initialization more carefully. Your learning target here is the alternating assignment/update mechanism, not a full clustering package.
k by Usefulness, Not by Wishful NamingYou chose because the map visibly contained three pairs. On a new corpus, isn't supplied by nature. A small audit combines several signals:
The scikit-learn version below evaluates possible values for .[2] Setting n_init=20 makes the choice explicit: try several starting configurations and retain the lowest-inertia result for each .
1import numpy as np
2from sklearn.cluster import KMeans
3from sklearn.metrics import silhouette_score
4
5vectors = np.array([
6 [1.0, 1.0], [1.2, 1.8],
7 [4.0, 4.2], [5.0, 3.8],
8 [8.0, 1.0], [9.0, 1.8],
9])
10
11for k in (2, 3, 4):
12 model = KMeans(n_clusters=k, random_state=7, n_init=20).fit(vectors)
13 silhouette = silhouette_score(vectors, model.labels_)
14 print(f"k={k}: inertia={model.inertia_:.2f}, silhouette={silhouette:.2f}")1k=2: inertia=20.06, silhouette=0.56
2k=3: inertia=1.74, silhouette=0.76
3k=4: inertia=0.92, silhouette=0.51On this unusually neat six-point example, should separate the three visible pairs. On a real support corpus, no single metric gives permission to publish topic names. Read the examples.
Even a stable-looking objective can hide ambiguity. Four messages positioned at the corners of a square have two equally valid two-cluster solutions: split left from right, or split top from bottom.
1import numpy as np
2from sklearn.cluster import KMeans
3
4square = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
5vertical_seed = np.array([[0.0, 1.0], [2.0, 1.0]])
6horizontal_seed = np.array([[1.0, 0.0], [1.0, 2.0]])
7
8for name, seed in [("left/right", vertical_seed), ("bottom/top", horizontal_seed)]:
9 model = KMeans(n_clusters=2, init=seed, n_init=1).fit(square)
10 groups = [np.where(model.labels_ == i)[0].tolist() for i in range(2)]
11 print(f"{name:10} groups={groups} inertia={model.inertia_:.1f}")1left/right groups=[[0, 1], [2, 3]] inertia=4.0
2bottom/top groups=[[0, 2], [1, 3]] inertia=4.0Both answers optimize the same squared-distance objective equally well. If each axis meant a different product concern, only human review or a downstream task could say which grouping is useful.
K-means assigns groups. Principal component analysis (PCA) does something different: it replaces the original coordinate axes with new perpendicular axes that preserve as much variance as possible in the first few dimensions.
First center the six vectors by subtracting their mean, . A centered coordinate describes how a message differs from the average message. Centering isn't scaling: subtracting a mean doesn't stop a high-unit column from dominating variance. Then factor the centered matrix using singular value decomposition (SVD):
The rows of are principal directions. The first direction, PC1, captures the greatest remaining linear spread. Its share of total variance is the squared first singular value divided by the sum of squared singular values. Keeping the top components gives the best rank-limited reconstruction in squared-error terms, but it doesn't guarantee that the retained directions preserve a rare semantic distinction.[1][3]
For our map, PC1 runs almost horizontally:
| Quantity | Value | Interpretation |
|---|---|---|
| PC1 direction | Most spread is left to right. | |
| PC1 explained variance | One number retains most linear variation. | |
| PC2 explained variance | Vertical placement still carries information. |
1import numpy as np
2
3names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"])
4vectors = np.array([
5 [1.0, 1.0], [1.2, 1.8],
6 [4.0, 4.2], [5.0, 3.8],
7 [8.0, 1.0], [9.0, 1.8],
8])
9
10centered = vectors - vectors.mean(axis=0)
11_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
12pc1 = vt[0]
13if pc1[0] < 0:
14 pc1 = -pc1
15projection = centered @ pc1
16ratio = singular_values**2 / (singular_values**2).sum()
17
18print("mean:", np.round(vectors.mean(axis=0), 2).tolist())
19print("pc1:", np.round(pc1, 3).tolist())
20print("explained variance:", np.round(ratio, 3).tolist())
21for name, value in zip(names, projection):
22 print(f"{name:10} pc1={value:5.2f}")1mean: [4.7, 2.27]
2pc1: [1.0, -0.016]
3explained variance: [0.852, 0.148]
4refund_A pc1=-3.68
5refund_B pc1=-3.49
6login_A pc1=-0.73
7login_B pc1= 0.27
8shipping_A pc1= 3.32
9shipping_B pc1= 4.31PCA has created a new feature, pc1. It didn't pick an existing embedding coordinate. In high-dimensional embeddings, each principal component is a weighted mixture of many original coordinates.
Reconstruction makes the tradeoff concrete. Project onto one component, map back into the original space, and measure squared error. Two components can perfectly reconstruct this two-dimensional input; one component can't.
1import numpy as np
2
3vectors = np.array([
4 [1.0, 1.0], [1.2, 1.8],
5 [4.0, 4.2], [5.0, 3.8],
6 [8.0, 1.0], [9.0, 1.8],
7])
8mean = vectors.mean(axis=0)
9centered = vectors - mean
10_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
11
12for components in (1, 2):
13 basis = vt[:components]
14 reconstructed = (centered @ basis.T) @ basis + mean
15 error = ((vectors - reconstructed) ** 2).sum()
16 kept = (singular_values[:components] ** 2).sum() / (singular_values ** 2).sum()
17 print(f"components={components}: variance_kept={kept:.3f}, squared_error={error:.2f}")1components=1: variance_kept=0.852, squared_error=9.72
2components=2: variance_kept=1.000, squared_error=0.00Explained variance measures geometry, not business value. A rare fraud or policy-escalation direction may carry little total variance and still matter greatly.
Embeddings often appear next to metadata such as message length, order value, or age of the shipment. If you concatenate features with unlike numeric scales, both k-means and PCA can chase the largest units rather than the concept you care about.
The script below creates two intended topics (refund and shipping) while message length cuts across both. Without scaling, length dominates. Standardizing each column to comparable scale recovers the topic split in this constructed check.
1import numpy as np
2from sklearn.cluster import KMeans
3from sklearn.metrics import adjusted_rand_score
4from sklearn.pipeline import make_pipeline
5from sklearn.preprocessing import StandardScaler
6
7# Column 0 carries topic; column 1 is message length in characters.
8features = np.array([
9 [-3.2, 100.0], [-3.0, 500.0], [-2.8, 900.0],
10 [ 2.8, 110.0], [ 3.0, 510.0], [ 3.2, 910.0],
11])
12topic = np.array([0, 0, 0, 1, 1, 1])
13
14raw = KMeans(n_clusters=2, random_state=0, n_init=20).fit_predict(features)
15scaled = make_pipeline(
16 StandardScaler(),
17 KMeans(n_clusters=2, random_state=0, n_init=20),
18).fit_predict(features)
19
20print(f"raw agreement with topic: {adjusted_rand_score(topic, raw):.2f}")
21print(f"scaled agreement with topic: {adjusted_rand_score(topic, scaled):.2f}")1raw agreement with topic: -0.22
2scaled agreement with topic: 1.00This isn't a rule to standardize every embedding blindly. If a model's embedding vectors already come with a specified normalization and similarity contract, follow that contract. During evaluation, fit any learned scaler on the development corpus and reuse it for held-out examples. The rule is simpler: don't mix measurements and then pretend their scales are irrelevant.
Text retrieval frequently compares vector direction with cosine similarity, while ordinary k-means minimizes Euclidean distance to centroids. Dot product additionally rewards magnitude. A model or index must use the similarity rule its representations were built for.
This tiny example gives a query two candidate policy snippets. One candidate points almost exactly in the query direction; the other has a much larger norm.
1import numpy as np
2
3query = np.array([1.0, 0.0])
4names = ["duplicate_charge_policy", "very_long_shipping_page"]
5documents = np.array([
6 [1.0, 0.1],
7 [8.0, 4.0],
8])
9
10dot = documents @ query
11cosine = dot / (np.linalg.norm(documents, axis=1) * np.linalg.norm(query))
12
13print("dot-product winner:", names[int(dot.argmax())])
14print("cosine winner: ", names[int(cosine.argmax())])
15print("cosine scores: ", np.round(cosine, 3).tolist())1dot-product winner: very_long_shipping_page
2cosine winner: duplicate_charge_policy
3cosine scores: [0.995, 0.894]On unit-normalized vectors, cosine ranking and Euclidean nearest-neighbor ranking are directly related: maximizing cosine similarity minimizes squared Euclidean distance. That relation is useful, but only after normalization is explicit.
A retrieval system doesn't need a globally beautiful map. It needs the right document for a specific question. Suppose horizontal variation across refund and shipping policies dominates a corpus, while the distinction between password-reset and billing pages sits mostly in a quieter vertical direction.
The following diagnostic projects documents to only one principal component, then runs a nearest-document lookup for a login question. The compressed view preserves most overall spread but collapses a distinction that mattered for this query.
1import numpy as np
2from sklearn.decomposition import PCA
3
4names = np.array(["refund_policy", "billing_help", "login_2fa", "shipping_policy", "warehouse_note"])
5documents = np.array([
6 [-20.0, 0.0],
7 [ 0.0, 0.0],
8 [ 0.0, 3.0],
9 [20.0, 0.0],
10 [ 0.0, -3.0],
11])
12query = np.array([[0.0, 3.2]])
13
14full_distance = np.linalg.norm(documents - query, axis=1)
15full_winner = names[full_distance.argmin()]
16
17pca = PCA(n_components=1).fit(documents)
18compressed_documents = pca.transform(documents)
19compressed_query = pca.transform(query)
20compressed_distance = np.linalg.norm(compressed_documents - compressed_query, axis=1)
21compressed_ties = names[np.isclose(compressed_distance, compressed_distance.min())].tolist()
22
23print(f"PC1 variance kept: {pca.explained_variance_ratio_[0]:.3f}")
24print("full-space nearest: ", full_winner)
25print("one-PC nearest ties: ", compressed_ties)1PC1 variance kept: 0.978
2full-space nearest: login_2fa
3one-PC nearest ties: ['billing_help', 'login_2fa', 'warehouse_note']After projection, billing_help, login_2fa, and warehouse_note land at the same one-dimensional coordinate. The failure isn't that billing became uniquely closest. PCA removed the quieter vertical distinction that separated the useful login page from two tied alternatives.
PCA is excellent for quick inspection and sometimes useful for compression. Before deploying compressed retrieval vectors, measure retrieval quality on labeled queries. In the next chapter, you'll build exactly those ranking checks.
An unsupervised plot is most useful when it produces testable questions. For a support or policy corpus, write an audit record before claiming that the embedding space is organized:
| Audit question | Evidence to collect |
|---|---|
| Do local neighbors discuss the same issue? | Read nearest-message pairs sampled across the corpus. |
| Do proposed clusters have coherent content? | Review messages nearest each centroid and near boundaries. |
| Is a large axis merely formatting or length? | Compare PCA coordinates against metadata such as length and source. |
| Does an embedding metric match later search? | Run cosine or dot-product checks according to model contract. |
| Can compression hurt a rare but important query? | Evaluate retrieval before and after PCA on held-out query-document pairs. |
Use a handful of reviewed examples as an audit set, not as hidden training labels. It gives you a way to reject a broken representation before a downstream LLM begins quoting the wrong policy.
1import numpy as np
2from sklearn.cluster import KMeans
3
4names = np.array(["refund_A", "refund_B", "login_A", "login_B", "shipping_A", "shipping_B"])
5reviewed_topic = np.array(["refund", "refund", "login", "login", "shipping", "shipping"])
6vectors = np.array([
7 [1.0, 1.0], [1.2, 1.8],
8 [4.0, 4.2], [5.0, 3.8],
9 [8.0, 1.0], [9.0, 1.8],
10])
11
12labels = KMeans(n_clusters=3, random_state=4, n_init=20).fit_predict(vectors)
13for cluster in sorted(set(labels)):
14 topics, counts = np.unique(reviewed_topic[labels == cluster], return_counts=True)
15 dominant = topics[counts.argmax()]
16 purity = counts.max() / counts.sum()
17 members = names[labels == cluster].tolist()
18 print(f"cluster {cluster}: proposed={dominant:8} purity={purity:.2f} members={members}")1cluster 0: proposed=shipping purity=1.00 members=['shipping_A', 'shipping_B']
2cluster 1: proposed=refund purity=1.00 members=['refund_A', 'refund_B']
3cluster 2: proposed=login purity=1.00 members=['login_A', 'login_B']On this tiny teaching set, perfect purity is expected because the coordinates were designed to be readable. A production audit should be harder: sample boundary cases, unusual languages, policy updates, and short messages with ambiguous intent.
kmeans-from-scratch.py so one cluster becomes empty. Add a guard that either reseeds the empty center or raises a clear error.compression-can-change-retrieval.py, then confirm the login query recovers login_2fa. Explain why explained variance alone couldn't guarantee that result.write-a-cluster-audit-summary.py with one mislabeled or boundary message. Identify which samples you'd inspect before assigning topic names.mean(axis=0) on an empty slice produces invalid centers. A learning implementation should detect not np.any(labels == index) before averaging; a production library may reseed more carefully.login_2fa. In a real index, compare held-out retrieval metrics before and after compression.| Symptom | Cause | Fix |
|---|---|---|
Dashboard permanently interprets cluster 2 as refunds. | Arbitrary cluster IDs were treated as product labels. | Version each run and attach reviewed labels after sampling members. |
| Clusters track message length rather than issue type. | Mixed-scale metadata dominated distance or variance. | Standardize deliberately or keep metadata out of embedding geometry. |
| Teams trust a clean PCA chart without reading messages. | A compressed visualization was confused with semantic validation. | Audit centroid and boundary examples in original text. |
| Search results change after switching dot product to cosine. | Similarity contract wasn't made explicit. | Use documented embedding metric and normalize only when appropriate. |
| A compressed index misses an important policy page. | PCA dropped a small direction needed by that query. | Test held-out retrieval before deploying compression. |
Answer every question, then check your score. Score above 75% to mark this lesson complete.
8 questions remaining.