Turn token IDs into vectors, learn what nearby usage captures, and see why a word such as charge needs sentence-dependent representations.
A tokenizer can turn a support request into IDs:

```text
"dispute the charge" -> [91, 12, 407]
"charge the scanner" -> [407, 12, 982]
```

IDs solve naming. They don't tell a model that refund resembles return, or that charge means a billing event in one message and supplying power in another.
An embedding gives each token a vector of numbers. A contextual representation goes one step further: it adjusts that vector using the sentence around this particular occurrence.
By the end of this lesson, you'll build both ideas from scratch and watch the same token leave with two different meanings.
Suppose a tokenizer has a vocabulary of size V, and you choose embedding dimension d. The model stores an embedding table E with shape V x d, one row per token, and E[token_id] is the starting vector for that token.

That last point matters. If charge is ID 0, both the billing and the device sentences initially retrieve row E[0].
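To make the shapes concrete, here is a minimal sketch that looks up a whole ID sequence at once. The sizes V = 1000 and d = 4 and the random table are assumptions for illustration only. The IDs are the ones from the tokenizer example above.

```python
import numpy as np

V, d = 1000, 4                      # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))         # random stand-in for a learned embedding table

dispute_ids = [91, 12, 407]         # "dispute the charge"
scanner_ids = [407, 12, 982]        # "charge the scanner"

dispute_vectors = E[dispute_ids]    # integer-list indexing returns one row per ID
scanner_vectors = E[scanner_ids]

print("table shape:", E.shape)
print("sequence shape:", dispute_vectors.shape)
# ID 407 ("charge") retrieves the same row in both sentences.
print("charge rows match:", np.array_equal(dispute_vectors[2], scanner_vectors[0]))
```

A sequence of n IDs becomes an n x d matrix, and the row for a given ID does not depend on which sentence asked for it.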
```python
import numpy as np

vocab = {"charge": 0, "refund": 1, "scanner": 2}
embedding_table = np.array([
    [0.80, 0.20],  # charge
    [0.15, 0.95],  # refund
    [0.92, 0.10],  # scanner
])

billing_charge = embedding_table[vocab["charge"]]
device_charge = embedding_table[vocab["charge"]]

print("billing start:", billing_charge.tolist())
print("device start:", device_charge.tolist())
print("same starting row:", np.array_equal(billing_charge, device_charge))
```

```text
billing start: [0.8, 0.2]
device start: [0.8, 0.2]
same starting row: True
```

A lookup table alone can't distinguish senses. First, though, it has to learn a useful map of token usage.
Words that repeatedly appear near similar words tend to play similar roles. In return workflows, refund and replacement may both occur near approved, request, or label; carrier and tracking may cluster around shipment status language.
The simplest evidence is a co-occurrence count: for every token, count words within a fixed context window.
```python
from collections import Counter, defaultdict

messages = [
    "refund approved send return label".split(),
    "replacement approved send return label".split(),
    "refund request needs label".split(),
    "tracking update from carrier hub".split(),
    "shipment delayed at carrier hub".split(),
]

window = 2
neighbors = defaultdict(Counter)

for message in messages:
    for center_index, center in enumerate(message):
        left = max(0, center_index - window)
        right = min(len(message), center_index + window + 1)
        for context_index in range(left, right):
            if context_index != center_index:
                neighbors[center][message[context_index]] += 1

print("refund context:", neighbors["refund"].most_common(4))
print("replacement context:", neighbors["replacement"].most_common(4))
print("carrier context:", neighbors["carrier"].most_common(4))
```

```text
refund context: [('approved', 1), ('send', 1), ('request', 1), ('needs', 1)]
replacement context: [('approved', 1), ('send', 1)]
carrier context: [('hub', 2), ('update', 1), ('from', 1), ('delayed', 1)]
```

A count row can be high-dimensional and noisy. One historical route, latent semantic analysis, uses singular value decomposition (SVD) to compress a high-dimensional term-document matrix.[1] For this lesson, use the same compression move on word-context counts. The compressed rows act as static vectors: one vector per word, independent of sentence.
The next exercise follows that word-context route. It computes positive pointwise mutual information (PPMI), which emphasizes pairs occurring more often than chance, then compresses the matrix with SVD.
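For reference, the usual definition of PPMI for a word w and a context c can be written as follows. The symbols are notation introduced here: P(w, c) is estimated from the pair count, and P(w) and P(c) from the row and column totals.

```latex
\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}, \qquad
\mathrm{PPMI}(w, c) = \max\bigl(0,\ \mathrm{PMI}(w, c)\bigr)
```

With raw counts, the ratio inside the logarithm equals the observed count divided by the count expected under independence (row total times column total, divided by the grand total). Pairs seen less often than chance get a negative PMI and are clipped to zero.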
```python
import numpy as np

words = ["refund", "replacement", "carrier", "tracking"]
contexts = ["label", "approved", "hub", "delayed"]
counts = np.array([
    [8, 6, 0, 0],  # refund
    [7, 6, 0, 0],  # replacement
    [0, 0, 8, 5],  # carrier
    [0, 0, 7, 6],  # tracking
], dtype=float)

expected = counts.sum(axis=1, keepdims=True) @ counts.sum(axis=0, keepdims=True) / counts.sum()
ppmi = np.maximum(np.log((counts + 1e-9) / (expected + 1e-9)), 0.0)
u, singular_values, _ = np.linalg.svd(ppmi, full_matrices=False)
vectors = u[:, :2] * np.sqrt(singular_values[:2])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("refund vs replacement:", round(cosine(vectors[0], vectors[1]), 3))
print("refund vs carrier:", round(cosine(vectors[0], vectors[2]), 3))
```

```text
refund vs replacement: 1.0
refund vs carrier: 0.0
```

This tiny dataset is designed to make the pattern visible. Real corpora have far more contexts, imperfect wording, and competing meanings.
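Counting first and compressing later isn't the only way to learn a static embedding table. Word2Vec trains vectors through prediction. Continuous bag-of-words (CBOW) predicts a center word from its surrounding context, while Skip-gram predicts the surrounding context words from the center word.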
Both objectives reward vectors that help predict observed neighborhoods. Mikolov and colleagues introduced these two architectures in 2013.[2]
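As a rough illustration of the continuous bag-of-words (CBOW) side, the sketch below builds training examples in which the whole window is the input and the center word is the target. The sentence and window size are assumptions chosen for illustration, and no model is trained here.

```python
tokens = "package arrived at carrier hub today".split()
window = 2
cbow_examples = []

for center_index, center in enumerate(tokens):
    left = max(0, center_index - window)
    right = min(len(tokens), center_index + window + 1)
    # CBOW input: every neighbor inside the window; target: the center word.
    context = [tokens[i] for i in range(left, right) if i != center_index]
    cbow_examples.append((context, center))

print(cbow_examples[1])
# (['package', 'at', 'carrier'], 'arrived')
```

Each example pairs a bag of neighbors with one target, so CBOW makes one prediction per position.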
For Skip-gram, a training set can be built directly from windows. With window size 2, the center word arrived produces one positive pair for each visible neighbor.
```python
tokens = "package arrived at carrier hub today".split()
window = 2
pairs = []

for center_index, center in enumerate(tokens):
    left = max(0, center_index - window)
    right = min(len(tokens), center_index + window + 1)
    for context_index in range(left, right):
        if context_index != center_index:
            pairs.append((center, tokens[context_index]))

arrived_pairs = [pair for pair in pairs if pair[0] == "arrived"]
print(arrived_pairs)
```

```text
[('arrived', 'package'), ('arrived', 'at'), ('arrived', 'carrier')]
```

Predicting every vocabulary item for every pair is costly. Negative sampling trains the observed pair as positive and a small set of sampled unobserved pairs as negative. The following objective is a small, inspectable version of that idea.
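Written out for one center vector, one observed neighbor, and K sampled negatives, the loss takes the form below. The symbols are notation introduced here: v_c is the center vector, u_o the observed neighbor, u_k the k-th negative sample, and σ the logistic sigmoid.

```latex
\mathcal{L} = -\log \sigma\bigl(v_c^{\top} u_o\bigr) \;-\; \sum_{k=1}^{K} \log \sigma\bigl(-v_c^{\top} u_k\bigr)
```

The first term shrinks as the observed pair's dot product grows. Each term in the sum shrinks as a negative pair's dot product becomes more negative.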
```python
import numpy as np

center = np.array([1.0, 0.0])
observed_neighbor = np.array([1.2, 0.1])
wrong_neighbor = np.array([-1.0, 0.1])
negative_samples = [np.array([-0.9, -0.2]), np.array([-1.1, 0.0])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vector, positive_vector, negatives):
    positive_loss = -np.log(sigmoid(center_vector @ positive_vector))
    negative_loss = sum(-np.log(sigmoid(-(center_vector @ n))) for n in negatives)
    return float(positive_loss + negative_loss)

observed_loss = negative_sampling_loss(center, observed_neighbor, negative_samples)
wrong_loss = negative_sampling_loss(center, wrong_neighbor, negative_samples)

print("observed pair loss:", round(observed_loss, 3))
print("wrong pair loss:", round(wrong_loss, 3))
print("training prefers observed pair:", observed_loss < wrong_loss)
```

```text
observed pair loss: 0.892
wrong pair loss: 1.942
training prefers observed pair: True
```

No analogy trick is required to understand the useful result: after enough examples, tokens that support similar neighbor predictions can end up near one another.
GloVe starts from global co-occurrence counts. It fits a weighted least-squares objective: a word-vector and context-vector dot product, plus biases, should approximate the logarithm of each observed count. The logarithm compresses the count range, while the weighting function limits how much very common pairs dominate training.[3]
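The weighting function can be sketched in a few lines. The form used here, a power of the count that is capped at 1, and the defaults x_max = 100 and alpha = 0.75 follow the values reported in the GloVe paper.[3] Treat them as assumptions for this sketch, not tuned settings. The helper names are hypothetical.

```python
import math

def glove_weight(count, x_max=100.0, alpha=0.75):
    # Grows with the count, then saturates at 1.0 once count >= x_max.
    return (count / x_max) ** alpha if count < x_max else 1.0

def weighted_squared_error(count, model_score):
    # model_score stands in for the dot product plus the two bias terms.
    return glove_weight(count) * (model_score - math.log(count)) ** 2

for count in (3, 30, 300, 3000):
    print(count, round(glove_weight(count), 3))
```

Rare pairs get a small weight, so noisy counts move the vectors less. Pairs above x_max all get weight 1.0, so a count of 3000 has no more influence than a count of 300.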
```python
import numpy as np

def squared_glove_residual(observed_count, model_score):
    target = np.log(observed_count)
    return float((model_score - target) ** 2)

model_score = np.log(30)
matching_error = squared_glove_residual(30, model_score)
different_error = squared_glove_residual(3, model_score)

print("log targets:", [round(float(np.log(c)), 3) for c in (3, 30, 300)])
print("matching count error:", round(matching_error, 3))
print("different count error:", round(different_error, 3))
```

```text
log targets: [1.099, 3.401, 5.704]
matching count error: 0.0
different count error: 5.302
```

Word2Vec and GloVe give each known word one static row. That is efficient, but it creates two problems: new spellings have no row, and ambiguous words have only one.
Support systems constantly encounter variants: reship, reshipped, reshipping, or tracking codes mixed into prose. FastText represents a word using character n-grams as well as its whole-word identity, letting related spellings share parameters.[4]
```python
def character_ngrams(word, n=3):
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

known = character_ngrams("reship")
new_form = character_ngrams("reshipped")
shared = sorted(known & new_form)

print("shared pieces:", shared)
print("new spelling can reuse pieces:", len(shared) > 0)
```

```text
shared pieces: ['<re', 'esh', 'hip', 'res', 'shi']
new spelling can reuse pieces: True
```

Subword composition helps with unfamiliar surface forms. It still doesn't decide which meaning an existing ambiguous token carries.
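To see why shared pieces matter, the sketch below composes a word vector by averaging one vector per character n-gram. It is a deliberate simplification and not FastText's actual implementation: the n-gram vectors are random, not trained, only 3-grams are used, and the whole-word vector and n-gram hashing are left out. The names `ngram_vectors` and `word_vector` are hypothetical.

```python
import numpy as np

def character_ngrams(word, n=3):
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

dim = 256
rng = np.random.default_rng(0)
ngram_vectors = {}  # hypothetical table: one random vector per n-gram seen so far

def word_vector(word):
    # Sort the pieces so vector assignment is deterministic across runs.
    pieces = sorted(character_ngrams(word))
    for piece in pieces:
        if piece not in ngram_vectors:
            ngram_vectors[piece] = rng.normal(size=dim)
    return np.mean([ngram_vectors[piece] for piece in pieces], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reship = word_vector("reship")
reshipped = word_vector("reshipped")
carrier = word_vector("carrier")

print("reship vs reshipped:", round(cosine(reship, reshipped), 3))
print("reship vs carrier:", round(cosine(reship, carrier), 3))
```

Because reship and reshipped share five of their 3-grams, their averaged vectors overlap even before any training. The word carrier shares none of those pieces, so it gets an essentially unrelated vector.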
Read these messages:

```text
"Dispute the charge on order 8142."
"Charge the scanner before the warehouse shift."
```

The word charge is spelled the same way in both. A static embedding lookup returns the same vector, even though the first case belongs near billing language and the second near device language.
```python
import numpy as np

static = {
    "charge": np.array([0.8, 0.2]),
    "refund": np.array([0.2, 1.0]),
    "scanner": np.array([1.0, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

billing_charge = static["charge"]
device_charge = static["charge"]

print("same charge vector:", np.array_equal(billing_charge, device_charge))
print("billing similarity to refund:", round(cosine(billing_charge, static["refund"]), 3))
print("device similarity to refund:", round(cosine(device_charge, static["refund"]), 3))
```

```text
same charge vector: True
billing similarity to refund: 0.428
device similarity to refund: 0.428
```

Because the two charge vectors are identical, every downstream comparison begins from the same mistaken compromise. The representation needs evidence from the sentence.
ELMo showed that a token's representation can be produced from a bidirectional language model and selected from multiple learned layers, rather than stored as one context-free vector.[5]
BERT later trained a Transformer encoder with masked language modeling: hide some tokens, then predict them using context on both sides. For a visible token state, encoder self-attention can incorporate evidence before and after that token.[6]
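The masking step can be sketched without any model. In the snippet below the masked position is chosen by hand and the `[MASK]` string stands in for a special token ID. Both are assumptions for illustration, since BERT-style training samples positions at random.

```python
tokens = ["dispute", "the", "charge", "on", "order"]
mask_position = 2  # chosen by hand for this sketch

inputs = list(tokens)
labels = [None] * len(tokens)          # None marks positions that are not scored
labels[mask_position] = inputs[mask_position]
inputs[mask_position] = "[MASK]"

# A bidirectional encoder may use evidence from both sides of the gap.
left_context = inputs[:mask_position]
right_context = inputs[mask_position + 1:]

print("model input:", inputs)
print("target at masked position:", labels[mask_position])
print("left context:", left_context)
print("right context:", right_context)
```

The training signal comes only from the hidden position, yet predicting it well requires reading the tokens on both sides.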
You don't need a pretrained model to see the central mechanism. In this toy contextualizer, the same starting vector for charge is mixed with a clue from its sentence.
```python
import numpy as np

charge_start = np.array([0.8, 0.2])
clues = {
    "dispute": np.array([0.0, 1.0]),  # billing evidence
    "scanner": np.array([1.0, 0.0]),  # device evidence
}

def mix_with_context(token_vector, clue_vector):
    return 0.5 * token_vector + 0.5 * clue_vector

billing_state = mix_with_context(charge_start, clues["dispute"])
device_state = mix_with_context(charge_start, clues["scanner"])

print("billing charge state:", billing_state.tolist())
print("device charge state:", device_state.tolist())
print("same final state:", np.array_equal(billing_state, device_state))
```

```text
billing charge state: [0.4, 0.6]
device charge state: [0.9, 0.1]
same final state: False
```

The weights above are illustrative, not learned model parameters. A trained attention layer learns which context tokens should contribute, and later layers can refine that state again.
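One step closer to attention is to compute the mixing weights from the vectors themselves. The sketch below uses scaled dot-product scores and a softmax. The two-dimensional vectors are hand-set assumptions, and the learned query, key, and value projections of a real attention layer are omitted.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exps / exps.sum()

# Hand-set vectors: axis 0 is device-like, axis 1 is billing-like.
vectors = {
    "charge": np.array([0.8, 0.2]),
    "dispute": np.array([0.0, 1.0]),
    "scanner": np.array([1.0, 0.0]),
    "the": np.array([0.1, 0.1]),
}

def contextual_state(sentence, position):
    keys = np.stack([vectors[token] for token in sentence])
    query = vectors[sentence[position]]
    scores = keys @ query / np.sqrt(len(query))   # scaled dot products
    weights = softmax(scores)                     # one weight per sentence token
    return weights, weights @ keys                # weighted mix of the token vectors

billing_weights, billing_state = contextual_state(["dispute", "the", "charge"], 2)
device_weights, device_state = contextual_state(["charge", "the", "scanner"], 0)

print("billing weights:", billing_weights.round(3).tolist())
print("billing charge state:", billing_state.round(3).tolist())
print("device charge state:", device_state.round(3).tolist())
```

The weights now depend on how well each neighbor matches the token asking the question, so the two occurrences of charge end up with different states.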
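Not every contextual model can look in the same places. A bidirectional encoder lets each position attend to tokens on both sides, while a causal decoder, the design used in generative pre-training,[7] lets each position attend only to itself and earlier tokens.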
For the first token in charge the scanner, the right-hand clue scanner is visible to an encoder but not to a causal decoder state at position charge.
```python
import numpy as np

tokens = ["charge", "the", "scanner"]
encoder_visibility = np.ones((len(tokens), len(tokens)), dtype=int)
causal_visibility = np.tril(np.ones((len(tokens), len(tokens)), dtype=int))

def visible_tokens(mask, position):
    return [token for token, visible in zip(tokens, mask[position]) if visible]

print("encoder state for charge sees:", visible_tokens(encoder_visibility, 0))
print("causal state for charge sees:", visible_tokens(causal_visibility, 0))
```

```text
encoder state for charge sees: ['charge', 'the', 'scanner']
causal state for charge sees: ['charge']
```

Both designs produce contextual states. Their visibility rules suit different training objectives and later tasks.
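A common way to enforce a visibility rule is to set hidden positions to negative infinity before the softmax, so they receive exactly zero weight. The sketch below does this with made-up attention scores. The numbers are assumptions for illustration, not outputs of any model.

```python
import numpy as np

tokens = ["charge", "the", "scanner"]
# Made-up scores: row = position asking, column = position being looked at.
scores = np.array([
    [2.0, 0.5, 1.5],
    [0.3, 1.0, 0.8],
    [1.2, 0.4, 2.0],
])

causal_mask = np.tril(np.ones_like(scores, dtype=bool))
masked_scores = np.where(causal_mask, scores, -np.inf)   # hide future positions

def row_softmax(x):
    exps = np.exp(x - x.max(axis=1, keepdims=True))       # exp(-inf) evaluates to 0.0
    return exps / exps.sum(axis=1, keepdims=True)

encoder_weights = row_softmax(scores)
causal_weights = row_softmax(masked_scores)

print("encoder weights for charge:", encoder_weights[0].round(3).tolist())
print("causal weights for charge:", causal_weights[0].round(3).tolist())
```

Under the causal mask, the first position can only attend to itself, so the clue scanner contributes nothing to its state.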
Embedding applications often use cosine similarity: a value near 1 means two vectors point in similar directions. That measurement is useful only if the space behaves sensibly.
Research on contextual representations found that token vectors can be anisotropic: many vectors share a dominant direction, inflating raw cosine similarities even for tokens that shouldn't be treated as alike.[8]
This tiny example has a large shared horizontal component. Raw cosine says the two states are almost identical; subtracting their shared mean reveals that their informative vertical components oppose one another.
```python
import numpy as np

billing = np.array([10.0, 1.0])
device = np.array([10.0, -1.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_similarity = cosine(billing, device)
mean_direction = (billing + device) / 2
centered_similarity = cosine(billing - mean_direction, device - mean_direction)

print("raw cosine:", round(raw_similarity, 3))
print("centered cosine:", round(centered_similarity, 3))
```

```text
raw cosine: 0.98
centered cosine: -1.0
```

Centering isn't a universal production recipe. It is a diagnostic reminder: measure retrieval or classification quality on held-out examples instead of trusting a similarity score in isolation.
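The same diagnostic scales to a batch of vectors: compare the average pairwise cosine before and after subtracting the batch mean. The states below are synthetic (random noise plus a large shared offset), an assumption standing in for real token states.

```python
import numpy as np

rng = np.random.default_rng(0)
shared_offset = np.full(8, 10.0)                    # assumed dominant direction
states = shared_offset + rng.normal(size=(50, 8))   # synthetic stand-ins for token states

def mean_pairwise_cosine(vectors):
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit.T
    off_diagonal = ~np.eye(len(vectors), dtype=bool)  # ignore each vector's match with itself
    return float(similarity[off_diagonal].mean())

raw_average = mean_pairwise_cosine(states)
centered_average = mean_pairwise_cosine(states - states.mean(axis=0))

print("average raw cosine:", round(raw_average, 3))
print("average centered cosine:", round(centered_average, 3))
```

An average raw cosine close to 1 across unrelated vectors is a sign that a shared direction, not shared meaning, is driving the score.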
No representation is best by slogan. Start with the failure mode and the measurement that would prove improvement.
| Need | Useful starting point | What to measure |
|---|---|---|
| Small fixed vocabulary, simple classifier | Trainable embedding lookup | Held-out classification quality and latency |
| Rare spelling variants such as reshipped | Character or subword-aware static vectors | Recall on unseen variants |
| Ambiguous words such as billing/device charge | Contextual token states | Accuracy on sense-dependent cases |
| Search over whole support messages | Sentence or chunk embedding model | Retrieval precision and recall on real queries |
Contextual token states aren't automatically good message-level retrieval vectors. Pooling, task training, and evaluation still matter. You will build those choices in later retrieval lessons.
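As one example of a pooling choice, the sketch below averages token states into a single message vector while ignoring padding positions. The token states and the mask are made-up values, and mean pooling is only one of several options a real system would evaluate.

```python
import numpy as np

token_states = np.array([
    [0.4, 0.6],   # charge (billing sense)
    [0.1, 0.9],   # refund
    [0.0, 0.0],   # padding
])
attention_mask = np.array([1, 1, 0])   # 1 = real token, 0 = padding

def mean_pool(states, mask):
    weights = mask[:, None].astype(float)
    # Sum only the real tokens, then divide by how many there are.
    return (states * weights).sum(axis=0) / weights.sum()

message_vector = mean_pool(token_states, attention_mask)
print("message vector:", message_vector.tolist())
```

Without the mask, the padding row would be averaged in and would pull every short message toward the origin.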
You now have a concrete chain from token IDs to context-dependent meaning:
Ambiguous words such as charge require context mixing.

[1] Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS.
[2] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
[3] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
[4] Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information.
[5] Peters, M., et al. (2018). Deep contextualized word representations. NAACL 2018.
[6] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[7] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
[8] Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.