
Word to Contextual Embeddings

Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.

30 min read · Google, Meta, OpenAI +3 · 10 key concepts

How does a computer understand that "king" and "queen" are related, or that "Paris" is to "France" what "Tokyo" is to "Japan"? It can't read a dictionary. It needs numbers. The solution is to give every word a set of coordinates, like pinning cities on a map. Words with similar meanings end up close together, and the directions between them encode relationships. These numerical coordinates are called embeddings, and they're the foundation of how every language model (from GPT-5.4 to Gemini 3.1 Pro) understands meaning.

This article explores the full journey: from early counting methods through Word2Vec's breakthrough to today's context-aware representations in BERT and GPT, where the same word gets different coordinates depending on its surroundings.

💡 Key insight: Understanding embeddings means grasping the core idea behind representation learning: that a word's meaning comes from the company it keeps, that static coordinates evolved into dynamic ones, and the practical tradeoffs involved. This knowledge is foundational to every modern NLP (Natural Language Processing) and LLM (Large Language Model) system. See also: how text becomes tokens before it becomes embeddings, and sentence embeddings for scaling this idea to whole passages.


The distributional hypothesis: the foundation of everything

Before any algorithm, understand the core insight that makes all word embeddings possible:

"You shall know a word by the company it keeps." (J.R. Firth, 1957)[1]

This linguistic intuition has deep roots in information theory. Shannon (1948)[2] formalized the idea that the next symbol in a sequence can be predicted from its preceding context, establishing the mathematical foundation for measuring information content in language. The distributional hypothesis operationalizes this: if two words are interchangeable in the same contexts without changing the information content, they must carry similar meaning.

Think of it this way: if you moved to a new city and saw a store you'd never heard of, you could guess what it sells by looking at its neighbors. A store between a bakery and a deli is probably a food shop. A store between a nail salon and a hair salon is probably a beauty shop. You didn't need to go inside. The neighborhood told you everything.

Words work the same way. The word "espresso" tends to appear near "latte," "barista," and "café," so a computer can figure out it's a coffee-related word just by tracking those patterns. This is the distributional hypothesis: words that appear in similar contexts have similar meanings. Every embedding method (from TF-IDF to GPT) is at its core an operationalization of this idea.

To see this in practice, consider a simple fill-in-the-blank sentence. The words that naturally fit into the blank will develop similar embeddings because they share the same surrounding context:

```text
Context: "The ___ sat on the mat"

Words that fit: cat, dog, child, kitten, puppy
→ These words should have similar embeddings

Words that don't fit: democracy, algorithm, quantum
→ These should have very different embeddings
```

The evolution of embeddings is the story of increasingly powerful ways to capture this distributional information:


1. Count-based methods: the starting point

Before neural networks dominated NLP, words were represented using large, sparse matrices of co-occurrence statistics. The goal was to mathematically capture the distributional hypothesis by counting how often words appeared near each other.

TF-IDF (Term Frequency-Inverse Document Frequency) was one of the earliest and most resilient approaches. It represents documents as sparse vectors of word counts, penalizing words that appear frequently across all documents (like "the" or "is") while highlighting rare, discriminative words. While simple, it remains highly effective for basic information retrieval and search engines today.
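A minimal sketch of the TF-IDF computation (toy documents; a production system would use a library implementation such as scikit-learn's TfidfVectorizer):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum computer uses qubits",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency within this doc
    df = sum(1 for d in tokenized if term in d)     # documents containing the term
    idf = math.log(N / df)                          # rare terms get a boost
    return tf * idf

print(tf_idf("the", tokenized[0]))      # 0.0 — "the" is in every doc, so idf = log(1) = 0
print(tf_idf("qubits", tokenized[2]))   # positive — rare, discriminative term
```

The idf term is exactly the penalty described above: a word that appears in every document contributes nothing, no matter how often it occurs.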

PMI (Pointwise Mutual Information) extended this by measuring the statistical dependence between two words. If "coffee" and "cup" appear together much more often than their independent probabilities would suggest, their PMI is high. A Word-Context Matrix populated with PMI values is a direct precursor to modern embeddings.
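The PMI calculation itself is short. This sketch uses made-up counts purely for illustration:

```python
import math

# How much more often do two words co-occur than independence would predict?
total = 10_000                                          # total pair observations (hypothetical)
p_word = {"coffee": 0.01, "cup": 0.012, "democracy": 0.008}

def pmi(w, c, pair_count):
    p_wc = pair_count / total
    return math.log(p_wc / (p_word[w] * p_word[c]))

print(pmi("coffee", "cup", 40))          # positive: the pair occurs far above chance
print(pmi("coffee", "democracy", 0.8))   # 0.0: 0.8 is exactly the count independence predicts
```

A positive PMI means the pair is informative; a value near zero means the words are statistically unrelated.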

LSA (Latent Semantic Analysis), introduced by Deerwester et al. (1990)[3], took the critical next step: dimensionality reduction. By applying Singular Value Decomposition (SVD) to the sparse term-document matrix, LSA compressed the vocabulary into a dense, lower-dimensional space (e.g., 300 dimensions). This was revolutionary because it captured latent semantics: words that never co-occurred in the same document could still end up with similar vectors if they shared similar neighbors.
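The LSA pipeline can be sketched in a few lines of NumPy on a hand-built term-document matrix (counts are invented; real LSA runs on large corpora with hundreds of dimensions kept):

```python
import numpy as np

# Rows = terms, columns = documents (illustrative counts)
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],   # cat
    [1, 2, 0, 0],   # dog
    [1, 1, 0, 0],   # pet
    [0, 0, 2, 1],   # stock
    [0, 0, 1, 2],   # market
], dtype=float)

# Truncated SVD: keep only the top-k latent dimensions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]    # dense "semantic" term vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(term_vecs[0], term_vecs[1]))   # "cat" vs "dog": high
print(cos(term_vecs[0], term_vecs[3]))   # "cat" vs "stock": near zero
```

Note that "cat" and "dog" end up close even though the raw matrix never compares them directly; the shared latent dimension does the work.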


Key limitation

These methods produce a single vector per word regardless of context. The word "bank" has the same representation whether it means a financial institution or a riverbank. This is the polysemy problem, and solving it would drive the next decade of research.


2. Word2Vec: the neural embedding revolution

Mikolov et al. (2013)[4] introduced two neural architectures that learn dense word vectors by predicting words from their local context. This was the breakthrough that launched modern NLP embeddings.

Two architectures

CBOW (Continuous Bag of Words)

Predicts the center word from surrounding context:

$$P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$$

Reading the formula

Given the surrounding words (the "context window" of size $c$ on each side), predict the word in the middle. For example, given "the ___ sat on", predict "cat."

Skip-gram

Predicts context words from the center word (inverse of CBOW):

$$P(w_{t+j} \mid w_t) \quad \text{for } j \in [-c, c],\ j \neq 0$$

Reading the formula

This is the reverse. Given the center word "cat", predict which words likely surround it ("the", "sat", "on"). Skip-gram tends to work better for rare words because each word gets more training updates.


💡 Key insight: CBOW is faster and better for frequent words. Skip-gram is slower but captures rare words better (each rare word generates multiple training examples as a center word, and the model updates its vector more frequently)[4].

Training: from full softmax to negative sampling

Here's the problem: to learn good embeddings, the model needs to check its prediction against every word in the vocabulary, often 100,000+ words. That's like grading a multiple-choice test with 100,000 answer options for every single question. Two major solutions emerged:

Hierarchical softmax (Morin & Bengio, 2005)[5] organizes the vocabulary as a binary Huffman tree. Instead of computing a softmax over all $V$ words, the model makes a series of binary left/right decisions traversing the tree from root to leaf. This reduces the per-prediction cost from $O(V)$ to $O(\log V)$. Frequent words get shorter paths (fewer decisions), making the overall computation efficient. However, hierarchical softmax struggles with rare words because their longer paths accumulate more gradient noise, and the tree structure imposes a fixed hierarchy that may not reflect semantic relationships.

Negative sampling (Mikolov et al., 2013)[4] took a different approach that ultimately won out for large-scale training. Instead of comparing against all $V$ words, the model picks a handful of random "wrong" answers (negatives) and learns to tell the difference between the real context word and those few imposters. It's like a flashcard game: "Is 'cat' a real neighbor of 'sat'?" (yes ✅) "Is 'democracy' a real neighbor of 'sat'?" (no ❌). This tiny binary quiz is dramatically cheaper while still teaching the model which words belong together. Negative sampling eventually dominated because it provides more uniform gradient quality across rare and frequent words, and its $O(k)$ cost (where $k$ is typically 5–15) beats even logarithmic scaling at large vocabulary sizes.

The code below demonstrates how to compute this negative sampling loss for a single word pair. It takes the embeddings of the center word, the actual context word, and a set of random negative samples. The loss uses log-sigmoid for numerical stability, encouraging high similarity between the center and true context word while pushing similarity to negative samples toward zero:

```python
import numpy as np

def skip_gram_loss(center_vec: np.ndarray,
                   context_vec: np.ndarray,
                   neg_samples: np.ndarray) -> float:
    """
    Computes the negative sampling loss for a single skip-gram pair.

    Uses log-sigmoid formulation for numerical stability:
    - Positive: -log(sigmoid(score)) = log(1 + exp(-score))
    - Negative: -log(1 - sigmoid(score)) = log(1 + exp(score))

    Args:
        center_vec: Embedding of center word (d,)
        context_vec: Embedding of context word (d,)
        neg_samples: Embeddings of k negative samples (k, d)

    Returns:
        Scalar loss value
    """
    # Positive pair: maximize similarity to true context word
    pos_score = np.dot(context_vec, center_vec)
    pos_loss = np.log(1 + np.exp(-pos_score))  # = -log(sigmoid(pos_score))

    # Negative pairs: minimize similarity to random words
    neg_scores = np.dot(neg_samples, center_vec)  # (k,)
    neg_loss = np.sum(np.log(1 + np.exp(neg_scores)))  # = -sum(log(1 - sigmoid(neg_scores)))

    return pos_loss + neg_loss
```

Negative words are sampled proportional to $f(w)^{3/4}$, where $f(w)$ is the word frequency. The $3/4$ exponent gives rare words slightly more chance of being selected as negatives, improving their representations.
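The effect of that smoothing is easy to see numerically. In this sketch (word counts invented), compare the raw frequency distribution with the 3/4-smoothed one:

```python
import numpy as np

freqs = np.array([1000, 100, 10], dtype=float)   # e.g. "the", "coffee", "espresso" (made up)

p_raw = freqs / freqs.sum()
p_smooth = freqs ** 0.75
p_smooth /= p_smooth.sum()

print(p_raw)      # the rare word gets under 1% of the mass
print(p_smooth)   # and noticeably more under the 3/4-smoothed distribution

# Drawing k = 5 negatives from the smoothed distribution:
rng = np.random.default_rng(0)
negatives = rng.choice(len(freqs), size=5, p=p_smooth)
```

The smoothed distribution still favors frequent words, just less aggressively, so rare words appear as negatives often enough to get useful gradient updates.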

The famous analogy property

Word2Vec's most striking property is that semantic relationships are captured as linear directions in the embedding space. By performing arithmetic operations on these vectors, we can mathematically traverse conceptual relationships, combining related terms to produce a conceptually analogous result:

```text
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walking + swimming ≈ swam
```

This works because regularities in the training data create consistent vector offsets. The "gender direction" (man → woman) is roughly the same vector regardless of the word pair.
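The analogy protocol itself is just vector arithmetic plus a nearest-neighbor search. The sketch below uses hand-picked 2-D toy vectors (illustrative, not trained) in which the "gender" offset is a consistent direction:

```python
import numpy as np

vecs = {
    "king":  np.array([5.0, 8.0]),
    "queen": np.array([5.0, 2.0]),
    "man":   np.array([1.0, 7.0]),
    "woman": np.array([1.0, 1.0]),
    "apple": np.array([9.0, 9.0]),   # unrelated distractor
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vecs["king"] - vecs["man"] + vecs["woman"]

# Standard protocol: nearest neighbor by cosine, excluding the query words
candidates = {w: v for w, v in vecs.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cos(candidates[w], target))
print(best)   # queen
```

Note the exclusion of the query words: without it, the nearest neighbor of `king - man + woman` is very often "king" itself, which is part of why analogy benchmarks can overstate how clean the geometry really is.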

Word embedding analogy in 2D space: the vector from "king" to "queen" is parallel to the vector from "man" to "woman", demonstrating learned semantic relationships.

🔬 Research insight: Ethayarajh (2019)[6] found that the king-queen analogy is somewhat cherry-picked. Many analogies don't work cleanly, and the cosine similarity nearest neighbor often isn't the "correct" answer. Understanding this subtlety distinguishes deep expertise from surface-level knowledge.

Practical details

| Hyperparameter | Typical Value | Impact |
|---|---|---|
| Embedding dimension | 100–300 | 300 is standard; diminishing returns beyond |
| Context window | 5–10 | Larger → more topical; smaller → more syntactic |
| Negative samples (k) | 5–15 | More negatives = better for rare words |
| Min word count | 5 | Filters out very rare words |
| Subsampling threshold | 1e-5 | Drops frequent words like "the", "a" |

3. GloVe: global vectors

Pennington, Socher & Manning (2014)[7] at Stanford combined the best of count-based and prediction-based methods.

Core insight: co-occurrence ratios encode meaning

Here's a clever detective trick that GloVe exploits. Say you're trying to figure out the difference between "ice" and "steam." Looking at which words appear near each one individually isn't very helpful, since both appear near "water." But if you look at the ratio, the pattern jumps out: the word "solid" appears 100× more often with "ice" than with "steam," while "gas" appears 100× more with "steam." Neutral words like "water" appear equally with both, giving a ratio near 1. It's like comparing two suspects by looking at who they hang out with relative to each other.

The key insight is that ratios of co-occurrence probabilities, not raw probabilities, capture semantic relationships:

| Probe word | P(·\|ice) | P(·\|steam) | Ratio |
|---|---|---|---|
| solid | High | Low | Large → associated with ice |
| gas | Low | High | Small → associated with steam |
| water | High | High | ≈ 1 → neutral |
| random | Low | Low | ≈ 1 → neutral |

GloVe is fundamentally a log-bilinear model: it assumes that the dot product of two word vectors should be a bilinear function of the log co-occurrence statistic. The key derivation starts from the observation that co-occurrence probability ratios encode meaning (as shown in the table above). Pennington et al. (2014)[7] showed that requiring the vector difference $\mathbf{w}_a - \mathbf{w}_b$ to predict these ratios leads directly to a least-squares objective on the log counts:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Reading the formula

For every pair of words in the vocabulary, the loss penalizes the difference between (a) the dot product of their learned vectors and (b) the log of how often they actually co-occur in the corpus. The weighting function $f(X_{ij})$ prevents extremely common pairs (like "the, of") from overwhelming the training. The result: vectors whose geometry directly encodes co-occurrence statistics. The bias terms $b_i$ and $\tilde{b}_j$ absorb the log-frequency of individual words, ensuring the dot product captures only the interaction between word pairs.

Where:

  • $X_{ij}$ = co-occurrence count of words $i$ and $j$
  • $f(x)$ = weighting function that caps at $x_{\text{max}} = 100$ (prevents frequent pairs from dominating)
  • $\mathbf{w}_i, \tilde{\mathbf{w}}_j$ = word and context vectors
  • $b_i, \tilde{b}_j$ = per-word bias terms (absorb unigram frequency effects)
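The objective for a single $(i, j)$ pair can be sketched directly from the formula. Here random vectors stand in for trained parameters, and the weighting function follows the paper's form with $x_{\text{max}} = 100$ and exponent $\alpha = 3/4$:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): down-weights rare pairs, caps at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence."""
    diff = w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2

rng = np.random.default_rng(0)
w_i = rng.normal(size=50) * 0.1
w_tilde_j = rng.normal(size=50) * 0.1

print(glove_pair_loss(w_i, w_tilde_j, 0.0, 0.0, x_ij=50.0))   # weighted squared error, >= 0
print(glove_weight(500.0))                                     # 1.0 — capped for frequent pairs
```

Training sums this loss over all nonzero cells of the co-occurrence matrix and minimizes it with stochastic gradient descent on the vectors and biases.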

GloVe vs Word2Vec

| Aspect | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local context windows | Global co-occurrence matrix |
| Objective | Predictive (softmax/negative sampling) | Least squares on log co-occurrence |
| Theoretical basis | Distributional hypothesis via prediction | Explicit matrix factorization |
| Performance | Strong on analogy tasks | Competitive; often slightly better on similarity |
| Training | Online (stochastic) | Batch (needs full co-occurrence matrix) |

🔬 Research insight: Levy & Goldberg (2014) showed that Word2Vec with negative sampling is implicitly factorizing a shifted PMI (Pointwise Mutual Information) matrix. The methods are more similar than they appear[8].


4. FastText: solving the out-of-vocabulary problem

Bojanowski et al. (2017)[9] at Facebook extended Word2Vec by representing words as bags of character n-grams (a distinct approach from the subword tokenization used by BERT's WordPiece). Instead of learning a single vector for a word, the model learns vectors for its character-level component parts and sums them together. The following example demonstrates this breakdown, showing how an input word is split into constituent overlapping n-grams, producing a final embedding that is the sum of those subword components:

```text
Word: "where" (with n = 3..6)

N-grams: <wh, whe, her, ere, re>,
         <whe, wher, here, ere>,
         <wher, where, here>,
         <where>,
         plus the full word "where" itself

Embedding("where") = Σ embedding(ngram) for each ngram
```

The diagram below visualizes how these component parts are aggregated into a single continuous representation:


Because FastText shares representations at the subword level, it allows the model to generalize across words with similar roots or affixes. This parameter sharing also means it can achieve broad vocabulary coverage efficiently, as related words reuse the same underlying n-gram vectors.

The hashing trick

A naive implementation would need to store a separate embedding vector for every unique n-gram, which can reach tens of millions of entries. FastText manages this via the hashing trick: all n-grams are hashed into a fixed-size bucket table (default: 2M buckets). Multiple n-grams may collide into the same bucket, sharing a single vector. This trades a small amount of precision for dramatic memory savings, keeping the model's memory footprint bounded regardless of n-gram diversity. In practice, the collision rate is low enough that embedding quality degrades minimally.
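Both ideas together fit in a short sketch. The boundary markers, bucket count, and hash function below are simplified stand-ins (FastText's default table has 2M buckets and uses its own hash), and the sum over n-gram vectors follows the composition rule shown above:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Overlapping character n-grams of the word wrapped in boundary markers."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]          # FastText also keeps the full word itself

NUM_BUCKETS = 20_000            # demo size; FastText defaults to 2,000,000
DIM = 100
bucket_table = np.zeros((NUM_BUCKETS, DIM))   # in practice, trained parameters

def embed(word):
    """Embedding = sum of hashed n-gram vectors; works for unseen words too."""
    ids = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return bucket_table[ids].sum(axis=0)

print(char_ngrams("where")[:5])   # first few 3-grams of "<where>"
print(embed("recieve").shape)     # OOV word still gets a (100,) vector
```

Because colliding n-grams share a bucket vector, the table size stays fixed no matter how many distinct n-grams the corpus contains.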

🎯 Production tip: FastText can compute embeddings for OOV (Out-of-Vocabulary) words (words it has never seen) by summing the vectors of their subword n-grams. This is critical for:

  • Morphologically rich languages (Turkish, Finnish, Hungarian): "evlerimizdekilerden" → meaningful embedding from subparts
  • Typos and misspellings: "recieve" gets a reasonable embedding from shared n-grams with "receive"
  • Neologisms and slang: novel words get useful representations

5. ELMo: the contextual revolution

Every method so far has a fundamental flaw: the word "bank" always gets the same set of coordinates, whether you're talking about a river bank or a savings bank. It's like giving every person named "Jordan" the same ID photo, ignoring everything about who they actually are.

Peters et al. (2018)[10] changed this with Embeddings from Language Models (ELMo), the first widely successful contextual word representations. Now words are like chameleons: they change color based on their surroundings. This was the inflection point, the end of the "one vector per word" era.

Context window visualization: the model can only attend to tokens within a fixed window, with older tokens falling out of context as the sequence grows.

Architecture

ELMo uses a 2-layer bidirectional LSTM (Long Short-Term Memory) trained as a language model:


Layer specialization

Each layer captures different linguistic information. Peters et al. (2018)[10] demonstrated this via probing tasks: training simple classifiers on frozen layer outputs to test what each layer has learned. The results show a clear hierarchy from surface-level features to abstract semantics:

| Layer | What It Captures | Probing Evidence |
|---|---|---|
| Layer 0 (token embeddings) | Surface form, character patterns | Character CNN captures morphology; best at word shape classification |
| Layer 1 (lower LSTM) | Syntax: POS tags, dependency relations | 97% POS tagging accuracy from Layer 1 alone; outperforms Layer 2 on syntactic tasks by ~3% |
| Layer 2 (upper LSTM) | Semantics: word sense, coreference | Best for WSD (Word Sense Disambiguation) and sentiment; captures long-range semantic dependencies |

The final representation is a task-specific weighted combination of all layers:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \cdot \mathbf{h}_{k,j}$$

Reading the formula

The final representation for token $k$ is a weighted blend of all layers' outputs. The weights $s_j$ are learned during fine-tuning, so the model can decide "for sentiment analysis, I need more of the upper (semantic) layer; for POS tagging, more of the lower (syntactic) layer." The scaling factor $\gamma$ adjusts the overall magnitude.
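The mixing step is a one-liner once the layer outputs are stacked. In this sketch the shapes, layer values, and scalar weights are all illustrative stand-ins for a trained model's:

```python
import numpy as np

L, seq_len, dim = 2, 5, 8
rng = np.random.default_rng(0)
layers = rng.normal(size=(L + 1, seq_len, dim))   # h_{k,j}: layers 0..L for each token k

s_raw = np.array([0.1, 0.3, 0.6])                 # learned scalars, softmax-normalized
s = np.exp(s_raw) / np.exp(s_raw).sum()
gamma = 1.0

elmo = gamma * np.einsum("j,jkd->kd", s, layers)  # weighted sum over the layer axis
print(elmo.shape)   # (5, 8): one mixed vector per token
```

A downstream task only learns the three scalars and $\gamma$; the LSTM weights themselves stay frozen in the original ELMo recipe.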

The polysemy breakthrough

By calculating a vector dynamically based on the surrounding sentence, ELMo finally solved the long-standing polysemy problem. As demonstrated below, identical words generate distinctly different representations depending on the meaning implied by the adjacent context:

```text
"The river bank was steep and muddy."
 → ELMo("bank") ≈ [terrain, geography, nature...]

"I need to visit the bank to deposit a check."
 → ELMo("bank") ≈ [finance, money, institution...]
```

For the first time, the same word gets different vectors in different contexts. This single insight drove massive improvements across NLP benchmarks.

Limitation

The bidirectional LSTMs process left and right contexts independently, then concatenate. They don't jointly attend to both directions simultaneously. BERT would solve this.


6. BERT: truly bidirectional transformers

Devlin et al. (2019)[11] replaced LSTMs with Transformers and introduced truly bidirectional pre-training, which perfected and popularized the "pre-train then fine-tune" workflow.

Pre-training objectives

Masked Language Modeling (MLM)

Randomly mask 15% of tokens and predict them using the 80/10/10 strategy, which was carefully designed to address the pre-training/fine-tuning discrepancy:

  • 80% replaced with [MASK] (the primary training signal)
  • 10% replaced with a random token (prevents the model from learning that "something to predict" always looks like [MASK]; forces robustness to corrupted input)
  • 10% kept unchanged (this is the critical detail: during fine-tuning, the model never sees [MASK] tokens, so keeping some targets unchanged during pre-training forces the model to build good representations even for unmasked positions. Without this, the model would learn to only "pay attention" when it sees [MASK], degrading downstream performance.)

Here's an example of what the model sees during MLM training. It takes an input sentence with a randomly masked token and trains the model to predict the original hidden word as its target output:

```text
Input:  "The cat [MASK] on the mat"
Target: "sat"
```
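The 80/10/10 corruption scheme is simple to implement. This sketch operates on token strings rather than ids and uses a tiny made-up vocabulary, both purely for readability:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:         # select ~15% of positions
            labels[i] = tok                  # prediction target = original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)  # 10%: random token
            # else: 10% keep the token unchanged
    return inputs, labels

inp, lab = mask_tokens("the cat sat on the mat".split())
print(inp)
print(lab)   # None everywhere except the selected positions
```

The loss is computed only at positions where `labels` is set; all other positions contribute nothing, which is why MLM uses its training signal less efficiently than next-token prediction.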

Next Sentence Prediction (NSP)

Binary classification: are two sentences consecutive? (Later work showed NSP's contribution is minimal. RoBERTa dropped it entirely and improved results.)

Architecture

The BERT architecture processes input tokens through an embedding layer that combines token, segment, and position information. This combined input is then passed through a deep stack of bidirectional transformer encoders to produce the final contextual representations:

| Variant | Layers | Hidden Dim | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1,024 | 16 | 340M |

Why bidirectional self-attention matters

Unlike ELMo's concatenated left/right contexts, BERT's self-attention jointly conditions on the full context in every layer. The following example illustrates this difference. The model takes a full sentence and allows the target word to simultaneously attend to both preceding and succeeding tokens, resulting in a single, deeply contextualized representation:

```text
"I accessed my bank account"
→ Attention: "bank" attends to BOTH "accessed" AND "account"
→ Result: financial meaning

"The river bank was steep"
→ Attention: "bank" attends to BOTH "river" AND "steep"
→ Result: geographical meaning
```

This is fundamentally more powerful than ELMo's approach of processing left and right independently, then concatenating.

💡 Key insight: Analogy, reading a mystery novel: ELMo is like two detectives investigating a crime scene separately. One reads the clues left-to-right, the other right-to-left, and they compare notes at the end. BERT is like a single detective who can look at all the clues simultaneously, spotting connections the two separate investigators would miss. When "bank" appears between "river" and "account," BERT sees both context words interacting with each other through "bank." ELMo never gets that cross-directional evidence.

The fine-tuning approach

BERT established a workflow that dominated NLP for years:

  1. Pre-train on massive unlabeled corpus (BooksCorpus + Wikipedia, ~3.3B words)
  2. Add a task-specific head (classification layer, span extraction, etc.)
  3. Fine-tune all parameters on labeled downstream data
  4. Result: state-of-the-art on 11 NLP tasks simultaneously at release

7. GPT: autoregressive representations at scale

Radford et al. (2018, 2019)[12] at OpenAI took the opposite approach: unidirectional (left-to-right) transformer pre-training.

The autoregressive objective

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

Reading the formula

For each position in the text, the model tries to predict the next word using only the words that came before it. The loss sums up how wrong those predictions were (via negative log-probability). Better predictions = lower loss. This "predict the next word" game is all GPT needs to learn language.
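Numerically, the loss just sums negative log-probabilities of the true next tokens. The probabilities below are made up; a real model produces them with a softmax over the vocabulary at each position:

```python
import numpy as np

# P(x_t | x_<t) that the model assigned to each actual next token in a sequence
next_token_probs = np.array([0.9, 0.6, 0.05])

loss = -np.sum(np.log(next_token_probs))
print(loss)   # the badly predicted token (0.05) dominates the total
```

Confident correct predictions contribute almost nothing, while one token the model found surprising accounts for most of the loss; that asymmetry is what drives the gradient signal.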

Causal masking: enforcing unidirectionality

GPT enforces its left-to-right constraint through a causal attention mask (also called a triangular mask). In the self-attention computation, a lower-triangular matrix sets all attention scores for future positions to $-\infty$ before the softmax, making their attention weights exactly zero. This prevents position $t$ from attending to any position $t' > t$, ensuring the model cannot "cheat" by looking at tokens it hasn't generated yet. This is what makes GPT inherently autoregressive: each token's representation is built exclusively from the tokens that precede it, which is precisely the property needed for sequential text generation.
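The mask construction can be sketched directly. Here uniform zero scores stand in for real attention logits, which makes the effect of the mask easy to read off:

```python
import numpy as np

T = 4
scores = np.zeros((T, T))                          # stand-in attention scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future
scores[mask] = -np.inf                             # exp(-inf) = 0 after softmax

weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(np.round(weights, 2))
# Row t attends uniformly over positions 0..t and gives zero weight to the future:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```

The same lower-triangular pattern is applied in every layer and every head, so no information from future tokens ever leaks into a position's representation.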

BERT vs GPT: the key tradeoff

| Aspect | BERT | GPT |
|---|---|---|
| Attention | Bidirectional (full context) | Causal (left-to-right only) |
| Pre-training | MLM + NSP | Next-token prediction |
| Best for | Natural Language Understanding (NLU) | Natural Language Generation (NLG) |
| Representations | Deep bidirectional context | Left context only |
| Adaptation | Fine-tuning required | In-context learning (prompting) |
| Scaling trajectory | Plateaued at ~340M | Scaled to 1T+ parameters |

Why GPT's approach won at scale

Despite BERT's bidirectional advantage for understanding tasks, GPT's autoregressive approach scaled better:

  1. Unified framework: any text task can be cast as text generation (no task-specific heads needed)
  2. In-context learning: few-shot examples in the prompt replace fine-tuning entirely (Brown et al., 2020)[13]
  3. Emergent abilities: larger models unlock new capabilities like chain-of-thought reasoning, code generation, and tool use (Wei et al., 2022)[14]
  4. Natural generation: autoregressive models naturally produce coherent multi-sentence output

💡 Key insight: Saying "BERT is better because it's bidirectional" is incomplete. The correct framing is: BERT has stronger per-token representations for understanding tasks, but GPT's autoregressive formulation enables generation and scaling properties that proved more valuable for general-purpose AI.


8. Embedding geometry: what research tells us

A frequently overlooked but critical concept is the geometry of embeddings: how they're distributed in high-dimensional space.

Anisotropy: the cone problem

Imagine you're in a dark room with a flashlight, and you scatter glow-in-the-dark balls everywhere. Ideally, they'd be spread evenly across the room so you could tell any two balls apart by their positions. But what if all the balls ended up clustered in the narrow beam of the flashlight? Suddenly, every ball looks like it's in roughly the same direction, and it's hard to distinguish them.

Ethayarajh (2019)[6] at Stanford discovered that this is exactly what happens with embeddings. In BERT, ELMo, and GPT-2, word embeddings are anisotropic: they occupy a narrow cone in the embedding space rather than being uniformly distributed.


The representation degeneracy problem

Gao et al. (2019)[15] traced the root cause of anisotropy to weight tying: most language models share parameters between the input embedding layer and the output prediction layer. This architectural shortcut (used for parameter efficiency) creates a degenerate solution where the model pushes all embeddings into a narrow subspace to maximize the log-likelihood surface. The learned embedding matrix ends up with a few dominant singular values, causing all vectors to cluster along those principal directions. This explains why even semantically unrelated words can have cosine similarity > 0.5 in raw BERT embeddings.

Key findings

  • On average, less than 5% of the variance in a word's contextualized representations can be explained by a static embedding, meaning contextual representations are highly context-dependent
  • Upper layers produce more context-specific representations than lower layers
  • Lower-layer BERT representations, when reduced to their first principal component, actually outperform GloVe and FastText on standard static embedding benchmarks

🎯 Production tip: Anisotropy affects how cosine similarity behaves. Two random words might have surprisingly high cosine similarity simply because all vectors point in roughly the same direction, not because the words are semantically related. This motivates techniques like whitening and isotropy calibration in production systems.
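The effect is easy to reproduce with synthetic vectors. In this sketch every "embedding" shares one dominant direction (a crude stand-in for anisotropy), and mean-centering, which is the simplest whitening-style correction, removes most of the inflated similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
common = rng.normal(size=128) * 5.0              # dominant shared direction
embs = common + rng.normal(size=(1000, 128))     # 1000 fake word vectors in a narrow cone

def avg_cos(X):
    """Mean cosine similarity over all off-diagonal pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    return float((sims.sum() - len(X)) / (len(X) * (len(X) - 1)))

print(avg_cos(embs))                        # high: random pairs look "similar"
print(avg_cos(embs - embs.mean(axis=0)))    # near zero after mean-centering
```

Production whitening pipelines go further (full covariance whitening, isotropy calibration), but even this one-line centering illustrates why raw cosine scores from anisotropic models need calibration before thresholding.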


The complete evolution at a glance

Looking back over the decades of NLP research, the evolution of word representations follows a clear trajectory from simple frequency counts to massive neural networks. Each step in this journey was driven by the need to capture deeper semantic relationships and solve specific limitations of previous methods.

The earliest approaches, like TF-IDF and LSA, proved that simple mathematical operations on document statistics could capture a surprising amount of meaning. However, they were limited by their reliance on static matrices and struggled with capturing details like syntax or fine-grained semantic analogies.

The neural revolution fundamentally changed the approach. By framing representation learning as a prediction task, models like Word2Vec and BERT learned to compress meaning into dense geometric spaces. This transition from counting to predicting, and finally to contextualizing with transformers, laid the groundwork for modern generative AI.

| Method | Year | Team | Context | Key Innovation | Still Used? |
|---|---|---|---|---|---|
| TF-IDF | 1970s | Various | Static | Frequency weighting | ✅ Search engines |
| LSA | 1990 | Deerwester et al. | Static | SVD on term-document matrix | Rarely |
| Word2Vec | 2013 | Mikolov et al. (Google) | Static | Neural prediction, negative sampling | ✅ Features, baselines |
| GloVe | 2014 | Pennington et al. (Stanford) | Static | Co-occurrence matrix factorization | ✅ Features, baselines |
| FastText | 2017 | Bojanowski et al. (Meta) | Static | Subword n-gram composition | ✅ Low-resource, OOV |
| ELMo | 2018 | Peters et al. (Allen AI) | Dynamic | Bidirectional LSTM LM features | Rarely |
| BERT | 2018 | Devlin et al. (Google) | Dynamic | MLM + bidirectional Transformer | ✅ Encoders, retrieval |
| GPT | 2018+ | Radford et al. (OpenAI) | Dynamic | Next-token prediction at scale (up to 175B params in GPT-3[13]) | ✅ All LLMs |

When to use what (practical guide)

A strong engineer knows when not to use the latest model. While large language models dominate the headlines, they're often overkill for simple production tasks. Understanding the trade-offs between static and contextual embeddings is crucial for building cost-effective and low-latency systems.

Static embeddings like FastText or Word2Vec remain highly relevant in resource-constrained environments. They require no specialized hardware like GPUs for inference, making them ideal for lightweight applications or as feature inputs for traditional tabular machine learning models. FastText, in particular, continues to be a standard choice for morphologically rich languages where dealing with unseen words is a common challenge.
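The "dense features from text columns" pattern is simple enough to sketch in a few lines. The tiny embedding table below is hypothetical; in practice you would load pretrained Word2Vec or FastText vectors instead.

```python
import numpy as np

# Hypothetical 3-d static embedding table (word -> vector); real tables
# would come from pretrained Word2Vec/FastText files with 100-300 dims.
vocab = {
    "cheap": np.array([0.9, 0.1, 0.0]),
    "flight": np.array([0.2, 0.8, 0.1]),
    "hotel": np.array([0.1, 0.7, 0.3]),
}

def sentence_feature(text, table, dim=3):
    """Average the static vectors of known words into one dense feature row.
    OOV words are skipped here; FastText would compose them from n-grams."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

row = sentence_feature("Cheap flight", vocab)
print(row)  # element-wise mean of the "cheap" and "flight" vectors
```

The resulting fixed-width row can be concatenated with numeric columns and fed to any tabular model (gradient boosting, logistic regression) with no GPU in the loop.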

Contextual embeddings, on the other hand, are strictly necessary when semantic disambiguation is critical. Tasks like complex semantic search, passage retrieval, and detailed text classification benefit immensely from BERT-style encoders. For anything involving text generation or open-ended reasoning, autoregressive models from the GPT family are the definitive standard. The key is matching the complexity of the embedding model to the complexity of the business requirement.

| Scenario | Best Choice | Why |
|---|---|---|
| Simple text similarity search | FastText or Word2Vec | Fast, no GPU needed, good enough |
| Feature engineering for tabular ML | GloVe/FastText average | Dense features from text columns |
| Morphologically rich language | FastText | Handles OOV via subword composition |
| Semantic search / retrieval | Fine-tuned BERT (bi-encoder) | Context-aware, high quality |
| Text classification | Fine-tuned BERT / RoBERTa | Best accuracy for classification |
| Text generation | GPT / decoder model | Autoregressive = natural generation |
| Production with compute constraints | Distilled BERT (DistilBERT) | ~97% of BERT's performance at ~60% of its size |
| Research prototype | Full BERT-large / latest GPT model | Maximum quality, cost not a concern |

Visualizing and reducing embeddings in production

When building search or retrieval systems, engineers frequently need to visualize high-dimensional embeddings and reduce their dimensionality for efficiency. PCA is the go-to for fast linear reduction and provides interpretable principal components. t-SNE excels at creating 2D/3D visualizations that preserve local cluster structure, making it ideal for quality inspection. UMAP offers the best balance: it preserves both local and global structure better than t-SNE while being significantly faster, making it the standard choice for interactive exploration of embedding spaces. For more on production dimensionality reduction techniques including quantization, see Dimensionality Reduction for Embeddings.
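A short scikit-learn sketch of the linear option. The embeddings are synthetic clusters standing in for real sentence embeddings; with three well-separated clusters, two principal components capture most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for real sentence embeddings: three synthetic clusters in 128-d.
centers = rng.normal(scale=5.0, size=(3, 128))
emb = np.vstack([c + rng.normal(size=(50, 128)) for c in centers])

# Linear reduction to 2-D for plotting / quick quality inspection.
pca = PCA(n_components=2)
coords = pca.fit_transform(emb)
print(coords.shape)  # (150, 2)
print(round(pca.explained_variance_ratio_.sum(), 2))

# For nonlinear neighborhood-preserving maps, umap.UMAP(n_components=2)
# from the umap-learn package is a near drop-in replacement for PCA here.
```

PCA also doubles as a cheap pre-reduction step before t-SNE or UMAP on large corpora, which is a common production pattern.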


Key takeaways

  • Trace the evolution: Understand the progression from count-based (statistics) → prediction-based (local context) → contextual (full sentence awareness).
  • Technical distinctions matter: Know the difference between Word2Vec's local window, GloVe's global matrix factorization, and BERT's bidirectional attention.
  • Context is king: ELMo proved that one vector per word is insufficient; BERT proved that bidirectional attention beats concatenated LSTMs.
  • Scale wins: GPT showed that autoregressive objectives, while locally less powerful than bidirectional ones, scale better to general reasoning tasks.
  • Geometry awareness: Remember that embedding spaces are often anisotropic (cone-shaped), which distorts cosine similarity.

Common misconceptions

  • •"BERT is always better." False. Static embeddings are faster, cheaper, and sufficient for many simple similarity or feature engineering tasks.
  • •"ELMo and BERT are both bidirectional in the same way." False. ELMo concatenates independent left/right LSTMs; BERT jointly attends to all tokens.
  • •"GPT is strictly worse than BERT for understanding." False. While BERT is more sample-efficient for understanding, scaled-up GPT models perform well on NLU tasks via few-shot prompting.
  • •"King - Man + Woman = Queen always works." False. This is a cherry-picked example. Many analogies fail, and the geometry is often distorted.
  • •"High dimensionality is always better." False. While larger models generally perform better, there are tradeoffs. Static embeddings see diminishing returns beyond ~300d. Contextual embeddings from encoders (BERT) plateau around 1024d for most tasks. Decoder models (GPT) scale differently, with modern LLMs using 4K-12K+ dimensions to support massive context windows and emergent capabilities.

Summary

  1. Start with the problem: Words need numerical representations for neural networks to process them.
  2. Milestones: Count-based → Word2Vec → GloVe → FastText → ELMo → BERT → GPT.
  3. Key insights: Co-occurrence → prediction → subwords → context → attention → scale.
  4. Modern state: Today's LLMs use contextual embeddings computed by transformer layers, but static embeddings remain a vital tool for efficiency.

Evaluation Rubric
  1. Articulates the distributional hypothesis as the foundation for all embeddings
  2. Clearly explains Word2Vec training (skip-gram vs CBOW) with negative sampling details
  3. Describes GloVe's co-occurrence ratio insight and how it bridges count-based and neural methods
  4. Identifies FastText's key advantage: OOV handling via subword n-grams
  5. Explains ELMo as the first contextual model, with layer-specific syntax/semantics specialization
  6. Distinguishes ELMo's concatenated bidirectionality from BERT's joint self-attention
  7. Contrasts BERT (understanding) and GPT (generation) with a detailed scaling discussion
  8. Demonstrates practical judgment about when to use static vs contextual embeddings
Common Pitfalls
  • Claiming BERT is 'always better' without acknowledging static embedding use cases
  • Confusing ELMo's concatenated bidirectionality with BERT's joint bidirectionality
  • Not understanding why GPT's unidirectional approach scales better than BERT
  • Treating the king-queen analogy as a reliable, general property rather than a cherry-picked example
  • Forgetting the OOV problem and how FastText/subword tokenization solved it
  • Ignoring embedding geometry: anisotropy and its impact on cosine similarity
Key Concepts Tested
  • Distributional hypothesis and its operationalization
  • Word2Vec CBOW and Skip-gram architectures
  • Negative sampling and subsampling tricks
  • GloVe co-occurrence ratio insight and matrix factorization
  • FastText subword composition for OOV handling
  • ELMo layer specialization (syntax vs semantics)
  • BERT MLM objective and joint bidirectional attention
  • GPT autoregressive objective and scaling advantages
  • Embedding geometry and anisotropy
  • Static vs contextual embedding tradeoffs
References

  1. Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955.
  2. Shannon, C. E. (1948). A Mathematical Theory of Communication.
  3. Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS.
  4. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
  5. Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model.
  6. Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.
  7. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
  8. Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization.
  9. Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information.
  10. Peters, M., et al. (2018). Deep contextualized word representations. NAACL 2018.
  11. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
  12. Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
  13. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
  14. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.
  15. Gao, J., He, D., Tan, X., Qin, T., Wang, L., & Liu, T.-Y. (2019). Representation Degeneration Problem in Training Natural Language Generation Models.
