
Word to Contextual Embeddings

Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.

30 min read · Google, Meta, OpenAI +3 · 10 key concepts

How does a computer understand that "king" and "queen" are related, or that "Paris" is to "France" what "Tokyo" is to "Japan"? It can't read a dictionary. It needs numbers. The solution is to give every word a set of coordinates, like pinning cities on a map. Words with similar meanings end up close together, and the directions between them encode relationships. These numerical coordinates are called embeddings, and they're the foundation of how every language model (from GPT-5.4 to Gemini 3.1 Pro) understands meaning.

This article explores the full journey: from early counting methods through Word2Vec's breakthrough to today's context-aware representations in BERT and GPT, where the same word gets different coordinates depending on its surroundings.

💡 Key insight: Understanding embeddings means grasping the core idea behind representation learning: that a word's meaning comes from the company it keeps, that static coordinates evolved into dynamic ones, and the practical tradeoffs involved. This knowledge is foundational to every modern NLP (Natural Language Processing) and LLM (Large Language Model) system. See also: how text becomes tokens before it becomes embeddings, and sentence embeddings for scaling this idea to whole passages.


The distributional hypothesis: the foundation of everything

Before any algorithm, understand the core insight that makes all word embeddings possible:

"You shall know a word by the company it keeps." (J.R. Firth, 1957)[1]

This linguistic intuition has deep roots in information theory. Shannon (1948)[2] formalized the idea that the next symbol in a sequence can be predicted from its preceding context, establishing the mathematical foundation for measuring information content in language. The distributional hypothesis operationalizes this: if two words are interchangeable in the same contexts without changing the information content, they must carry similar meaning.

Think of it this way: if you moved to a new city and saw a store you'd never heard of, you could guess what it sells by looking at its neighbors. A store between a bakery and a deli is probably a food shop. A store between a nail salon and a hair salon is probably a beauty shop. You didn't need to go inside. The neighborhood told you everything.

Words work the same way. The word "espresso" tends to appear near "latte," "barista," and "café," so a computer can figure out it's a coffee-related word just by tracking those patterns. This is the distributional hypothesis: words that appear in similar contexts have similar meanings. Every embedding method (from TF-IDF to GPT) is at its core an operationalization of this idea.

To see this in practice, consider a simple fill-in-the-blank sentence. The words that naturally fit into the blank will develop similar embeddings because they share the same surrounding context:

```text
Context: "The ___ sat on the mat"

Words that fit: cat, dog, child, kitten, puppy
→ These words should have similar embeddings

Words that don't fit: democracy, algorithm, quantum
→ These should have very different embeddings
```

The evolution of embeddings is the story of increasingly powerful ways to capture this distributional information:


1. Count-based methods: the starting point

Before neural networks dominated NLP, words were represented using large, sparse matrices of co-occurrence statistics. The goal was to mathematically capture the distributional hypothesis by counting how often words appeared near each other.

TF-IDF (Term Frequency-Inverse Document Frequency) was one of the earliest and most resilient approaches. It represents documents as sparse vectors of word counts, penalizing words that appear frequently across all documents (like "the" or "is") while highlighting rare, discriminative words. While simple, it remains highly effective for basic information retrieval and search engines today.
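A minimal sketch of the TF-IDF computation (toy documents; a production system would use a library implementation such as scikit-learn's TfidfVectorizer):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum computer uses qubits",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency within this doc
    df = sum(1 for d in tokenized if term in d)     # documents containing the term
    idf = math.log(N / df)                          # rare terms get a boost
    return tf * idf

print(tf_idf("the", tokenized[0]))      # 0.0 — "the" is in every doc, so idf = log(1) = 0
print(tf_idf("qubits", tokenized[2]))   # positive — rare, discriminative term
```

The idf term is exactly the penalty described above: a word that appears in every document contributes nothing, no matter how often it occurs.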

PMI (Pointwise Mutual Information) extended this by measuring the statistical dependence between two words. If "coffee" and "cup" appear together much more often than their independent probabilities would suggest, their PMI is high. A Word-Context Matrix populated with PMI values is a direct precursor to modern embeddings.
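The PMI calculation itself is short. This sketch uses made-up counts purely for illustration:

```python
import math

# How much more often do two words co-occur than independence would predict?
total = 10_000                                          # total pair observations (hypothetical)
p_word = {"coffee": 0.01, "cup": 0.012, "democracy": 0.008}

def pmi(w, c, pair_count):
    p_wc = pair_count / total
    return math.log(p_wc / (p_word[w] * p_word[c]))

print(pmi("coffee", "cup", 40))          # positive: the pair occurs far above chance
print(pmi("coffee", "democracy", 0.8))   # 0.0: 0.8 is exactly the count independence predicts
```

A positive PMI means the pair is informative; a value near zero means the words are statistically unrelated.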

LSA (Latent Semantic Analysis), introduced by Deerwester et al. (1990)[3], took the critical next step: dimensionality reduction. By applying Singular Value Decomposition (SVD) to the sparse term-document matrix, LSA compressed the vocabulary into a dense, lower-dimensional space (e.g., 300 dimensions). This was revolutionary because it captured latent semantics: words that never co-occurred in the same document could still end up with similar vectors if they shared similar neighbors.
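The LSA pipeline can be sketched in a few lines of NumPy on a hand-built term-document matrix (counts are invented; real LSA runs on large corpora with hundreds of dimensions kept):

```python
import numpy as np

# Rows = terms, columns = documents (illustrative counts)
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],   # cat
    [1, 2, 0, 0],   # dog
    [1, 1, 0, 0],   # pet
    [0, 0, 2, 1],   # stock
    [0, 0, 1, 2],   # market
], dtype=float)

# Truncated SVD: keep only the top-k latent dimensions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]    # dense "semantic" term vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(term_vecs[0], term_vecs[1]))   # "cat" vs "dog": high
print(cos(term_vecs[0], term_vecs[3]))   # "cat" vs "stock": near zero
```

Note that "cat" and "dog" end up close even though the raw matrix never compares them directly; the shared latent dimension does the work.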


Key limitation

These methods produce a single vector per word regardless of context. The word "bank" has the same representation whether it means a financial institution or a riverbank. This is the polysemy problem, and solving it would drive the next decade of research.


2. Word2Vec: the neural embedding revolution

Mikolov et al. (2013)[4] introduced two neural architectures that learn dense word vectors by predicting words from their local context. This was the breakthrough that launched modern NLP embeddings.

Two architectures

CBOW (Continuous Bag of Words)

Predicts the center word from surrounding context:

$$P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$$

Reading the formula

Given the surrounding words (the "context window" of size $c$ on each side), predict the word in the middle. For example, given "the ___ sat on", predict "cat."

Skip-gram

Predicts context words from the center word (inverse of CBOW):

$$P(w_{t+j} \mid w_t) \quad \text{for } j \in [-c, c],\ j \neq 0$$

Reading the formula

This is the reverse. Given the center word "cat", predict which words likely surround it ("the", "sat", "on"). Skip-gram tends to work better for rare words because each word gets more training updates.


💡 Key insight: CBOW is faster and better for frequent words. Skip-gram is slower but captures rare words better (each rare word generates multiple training examples as a center word, and the model updates its vector more frequently)[4].

Training: from full softmax to negative sampling

Here's the problem: to learn good embeddings, the model needs to check its prediction against every word in the vocabulary, often 100,000+ words. That's like grading a multiple-choice test with 100,000 answer options for every single question. Two major solutions emerged:

Hierarchical softmax (Morin & Bengio, 2005)[5] organizes the vocabulary as a binary Huffman tree. Instead of computing a softmax over all $V$ words, the model makes a series of binary left/right decisions traversing the tree from root to leaf. This reduces the per-prediction cost from $O(V)$ to $O(\log V)$. Frequent words get shorter paths (fewer decisions), making the overall computation efficient. However, hierarchical softmax struggles with rare words because their longer paths accumulate more gradient noise, and the tree structure imposes a fixed hierarchy that may not reflect semantic relationships.

Negative sampling (Mikolov et al., 2013)[4] took a different approach that ultimately won out for large-scale training. Instead of comparing against all $V$ words, the model picks a handful of random "wrong" answers (negatives) and learns to tell the difference between the real context word and those few imposters. It's like a flashcard game: "Is 'cat' a real neighbor of 'sat'?" (yes ✅) "Is 'democracy' a real neighbor of 'sat'?" (no ❌). This tiny binary quiz is dramatically cheaper while still teaching the model which words belong together. Negative sampling eventually dominated because it provides more uniform gradient quality across rare and frequent words, and its $O(k)$ cost (where $k$ is typically 5–15) beats even logarithmic scaling at large vocabulary sizes.

The code below demonstrates how to compute this negative sampling loss for a single word pair. It takes the embeddings of the center word, the actual context word, and a set of random negative samples. The loss uses log-sigmoid for numerical stability, encouraging high similarity between the center and true context word while pushing similarity to negative samples toward zero:

```python
import numpy as np

def skip_gram_loss(center_vec: np.ndarray,
                   context_vec: np.ndarray,
                   neg_samples: np.ndarray) -> float:
    """
    Computes the negative sampling loss for a single skip-gram pair.

    Uses log-sigmoid formulation for numerical stability:
    - Positive: -log(sigmoid(score)) = log(1 + exp(-score))
    - Negative: -log(1 - sigmoid(score)) = log(1 + exp(score))

    Args:
        center_vec: Embedding of center word (d,)
        context_vec: Embedding of context word (d,)
        neg_samples: Embeddings of k negative samples (k, d)

    Returns:
        Scalar loss value
    """
    # Positive pair: maximize similarity to true context word
    pos_score = np.dot(context_vec, center_vec)
    pos_loss = np.log(1 + np.exp(-pos_score))  # = -log(sigmoid(pos_score))

    # Negative pairs: minimize similarity to random words
    neg_scores = np.dot(neg_samples, center_vec)  # (k,)
    neg_loss = np.sum(np.log(1 + np.exp(neg_scores)))  # = -sum(log(1 - sigmoid(neg_scores)))

    return pos_loss + neg_loss
```

Negative words are sampled proportional to $f(w)^{3/4}$, where $f(w)$ is the word frequency. The $3/4$ exponent gives rare words slightly more chance of being selected as negatives, improving their representations.
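The effect of that smoothing is easy to see numerically. In this sketch (word counts invented), compare the raw frequency distribution with the 3/4-smoothed one:

```python
import numpy as np

freqs = np.array([1000, 100, 10], dtype=float)   # e.g. "the", "coffee", "espresso" (made up)

p_raw = freqs / freqs.sum()
p_smooth = freqs ** 0.75
p_smooth /= p_smooth.sum()

print(p_raw)      # the rare word gets under 1% of the mass
print(p_smooth)   # and noticeably more under the 3/4-smoothed distribution

# Drawing k = 5 negatives from the smoothed distribution:
rng = np.random.default_rng(0)
negatives = rng.choice(len(freqs), size=5, p=p_smooth)
```

The smoothed distribution still favors frequent words, just less aggressively, so rare words appear as negatives often enough to get useful gradient updates.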

The famous analogy property

Word2Vec's most striking property is that semantic relationships are captured as linear directions in the embedding space. By performing arithmetic operations on these vectors, we can mathematically traverse conceptual relationships, combining related terms to produce a conceptually analogous result:

```text
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walking + swimming ≈ swam
```

This works because regularities in the training data create consistent vector offsets. The "gender direction" (man → woman) is roughly the same vector regardless of the word pair.
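The analogy protocol itself is just vector arithmetic plus a nearest-neighbor search. The sketch below uses hand-picked 2-D toy vectors (illustrative, not trained) in which the "gender" offset is a consistent direction:

```python
import numpy as np

vecs = {
    "king":  np.array([5.0, 8.0]),
    "queen": np.array([5.0, 2.0]),
    "man":   np.array([1.0, 7.0]),
    "woman": np.array([1.0, 1.0]),
    "apple": np.array([9.0, 9.0]),   # unrelated distractor
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vecs["king"] - vecs["man"] + vecs["woman"]

# Standard protocol: nearest neighbor by cosine, excluding the query words
candidates = {w: v for w, v in vecs.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cos(candidates[w], target))
print(best)   # queen
```

Note the exclusion of the query words: without it, the nearest neighbor of `king - man + woman` is very often "king" itself, which is part of why analogy benchmarks can overstate how clean the geometry really is.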

Word embedding analogy in 2D space: the vector from "king" to "queen" is parallel to the vector from "man" to "woman", demonstrating learned semantic relationships.

🔬 Research insight: Ethayarajh (2019)[6] found that the king-queen analogy is somewhat cherry-picked. Many analogies don't work cleanly, and the cosine similarity nearest neighbor often isn't the "correct" answer. Understanding this subtlety distinguishes deep expertise from surface-level knowledge.

Practical details

| Hyperparameter | Typical Value | Impact |
|---|---|---|
| Embedding dimension | 100–300 | 300 is standard; diminishing returns beyond |
| Context window | 5–10 | Larger → more topical; smaller → more syntactic |
| Negative samples (k) | 5–15 | More negatives = better for rare words |
| Min word count | 5 | Filters out very rare words |
| Subsampling threshold | 1e-5 | Drops frequent words like "the", "a" |

3. GloVe: global vectors

Pennington, Socher & Manning (2014)[7] at Stanford combined the best of count-based and prediction-based methods.

Core insight: co-occurrence ratios encode meaning

Here's a clever detective trick that GloVe exploits. Say you're trying to figure out the difference between "ice" and "steam." Looking at which words appear near each one individually isn't very helpful, since both appear near "water." But if you look at the ratio, the pattern jumps out: the word "solid" appears 100× more often with "ice" than with "steam," while "gas" appears 100× more with "steam." Neutral words like "water" appear equally with both, giving a ratio near 1. It's like comparing two suspects by looking at who they hang out with relative to each other.

The key insight is that ratios of co-occurrence probabilities, not raw probabilities, capture semantic relationships:

| Probe word | P(·\|ice) | P(·\|steam) | Ratio |
|---|---|---|---|
| solid | High | Low | Large → associated with ice |
| gas | Low | High | Small → associated with steam |
| water | High | High | ≈ 1 → neutral |
| random | Low | Low | ≈ 1 → neutral |

GloVe is fundamentally a log-bilinear model: it assumes that the dot product of two word vectors should be a bilinear function of the log co-occurrence statistic. The key derivation starts from the observation that co-occurrence probability ratios encode meaning (as shown in the table above). Pennington et al. (2014)[7] showed that requiring the vector difference $\mathbf{w}_a - \mathbf{w}_b$ to predict these ratios leads directly to a least-squares objective on the log counts:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Reading the formula

For every pair of words in the vocabulary, the loss penalizes the difference between (a) the dot product of their learned vectors and (b) the log of how often they actually co-occur in the corpus. The weighting function $f(X_{ij})$ prevents extremely common pairs (like "the, of") from overwhelming the training. The result: vectors whose geometry directly encodes co-occurrence statistics. The bias terms $b_i$ and $\tilde{b}_j$ absorb the log-frequency of individual words, ensuring the dot product captures only the interaction between word pairs.

Where:

  • $X_{ij}$ = co-occurrence count of words $i$ and $j$
  • $f(x)$ = weighting function that caps at $x_{\text{max}} = 100$ (prevents frequent pairs from dominating)
  • $\mathbf{w}_i, \tilde{\mathbf{w}}_j$ = word and context vectors
  • $b_i, \tilde{b}_j$ = per-word bias terms (absorb unigram frequency effects)
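The objective for a single $(i, j)$ pair can be sketched directly from the formula. Here random vectors stand in for trained parameters, and the weighting function follows the paper's form with $x_{\text{max}} = 100$ and exponent $\alpha = 3/4$:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): down-weights rare pairs, caps at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence."""
    diff = w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2

rng = np.random.default_rng(0)
w_i = rng.normal(size=50) * 0.1
w_tilde_j = rng.normal(size=50) * 0.1

print(glove_pair_loss(w_i, w_tilde_j, 0.0, 0.0, x_ij=50.0))   # weighted squared error, >= 0
print(glove_weight(500.0))                                     # 1.0 — capped for frequent pairs
```

Training sums this loss over all nonzero cells of the co-occurrence matrix and minimizes it with stochastic gradient descent on the vectors and biases.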

GloVe vs Word2Vec

| Aspect | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local context windows | Global co-occurrence matrix |
| Objective | Predictive (softmax/negative sampling) | Least squares on log co-occurrence |
| Theoretical basis | Distributional hypothesis via prediction | Explicit matrix factorization |
| Performance | Strong on analogy tasks | Competitive; often slightly better on similarity |
| Training | Online (stochastic) | Batch (needs full co-occurrence matrix) |

🔬 Research insight: Levy & Goldberg (2014) showed that Word2Vec with negative sampling is implicitly factorizing a shifted PMI (Pointwise Mutual Information) matrix. The methods are more similar than they appear[8].


4. FastText: solving the out-of-vocabulary problem

Bojanowski et al. (2017)[9] at Facebook extended Word2Vec by representing words as bags of character n-grams (a distinct approach from the subword tokenization used by BERT's WordPiece). Instead of learning a single vector for a word, the model learns vectors for its character-level component parts and sums them together. The following example demonstrates this breakdown, showing how an input word is split into constituent overlapping n-grams, producing a final embedding that is the sum of those subword components:

```text
Word: "where" (with n = 3..6)

N-grams: <wh, whe, her, ere, re>,
         <whe, wher, here, ere>,
         <wher, where, here>,
         <where>,
         plus the full word "where" itself

Embedding("where") = Σ embedding(ngram) for each ngram
```

The diagram below visualizes how these component parts are aggregated into a single continuous representation:


Because FastText shares representations at the subword level, it allows the model to generalize across words with similar roots or affixes. This parameter sharing also means it can achieve broad vocabulary coverage efficiently, as related words reuse the same underlying n-gram vectors.

The hashing trick

A naive implementation would need to store a separate embedding vector for every unique n-gram, which can reach tens of millions of entries. FastText manages this via the hashing trick: all n-grams are hashed into a fixed-size bucket table (default: 2M buckets). Multiple n-grams may collide into the same bucket, sharing a single vector. This trades a small amount of precision for dramatic memory savings, keeping the model's memory footprint bounded regardless of n-gram diversity. In practice, the collision rate is low enough that embedding quality degrades minimally.
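Both ideas together fit in a short sketch. The boundary markers, bucket count, and hash function below are simplified stand-ins (FastText's default table has 2M buckets and uses its own hash), and the sum over n-gram vectors follows the composition rule shown above:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Overlapping character n-grams of the word wrapped in boundary markers."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]          # FastText also keeps the full word itself

NUM_BUCKETS = 20_000            # demo size; FastText defaults to 2,000,000
DIM = 100
bucket_table = np.zeros((NUM_BUCKETS, DIM))   # in practice, trained parameters

def embed(word):
    """Embedding = sum of hashed n-gram vectors; works for unseen words too."""
    ids = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return bucket_table[ids].sum(axis=0)

print(char_ngrams("where")[:5])   # first few 3-grams of "<where>"
print(embed("recieve").shape)     # OOV word still gets a (100,) vector
```

Because colliding n-grams share a bucket vector, the table size stays fixed no matter how many distinct n-grams the corpus contains.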

🎯 Production tip: FastText can compute embeddings for OOV (Out-of-Vocabulary) words (words it has never seen) by summing the vectors of their subword n-grams. This is critical for:

  • Morphologically rich languages (Turkish, Finnish, Hungarian): "evlerimizdekilerden" → meaningful embedding from subparts
  • Typos and misspellings: "recieve" gets a reasonable embedding from shared n-grams with "receive"
  • Neologisms and slang: novel words get useful representations

5. ELMo: the contextual revolution

Every method so far has a fundamental flaw: the word "bank" always gets the same set of coordinates, whether you're talking about a river bank or a savings bank. It's like giving every person named "Jordan" the same ID photo, ignoring everything about who they actually are.

Peters et al. (2018)[10] changed this with Embeddings from Language Models (ELMo), the first widely successful contextual word representations. Now words are like chameleons: they change color based on their surroundings. This was the inflection point, the end of the "one vector per word" era.

Context window visualization: the model can only attend to tokens within a fixed window, with older tokens falling out of context as the sequence grows.

Architecture

ELMo uses a 2-layer bidirectional LSTM (Long Short-Term Memory) trained as a language model:


Layer specialization

Each layer captures different linguistic information. Peters et al. (2018)[10] demonstrated this via probing tasks: training simple classifiers on frozen layer outputs to test what each layer has learned. The results show a clear hierarchy from surface-level features to abstract semantics:

| Layer | What It Captures | Probing Evidence |
|---|---|---|
| Layer 0 (token embeddings) | Surface form, character patterns | Character CNN captures morphology; best at word shape classification |
| Layer 1 (lower LSTM) | Syntax: POS tags, dependency relations | 97% POS tagging accuracy from Layer 1 alone; outperforms Layer 2 on syntactic tasks by ~3% |
| Layer 2 (upper LSTM) | Semantics: word sense, coreference | Best for WSD (Word Sense Disambiguation) and sentiment; captures long-range semantic dependencies |

The final representation is a task-specific weighted combination of all layers:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \cdot \mathbf{h}_{k,j}$$

Reading the formula

The final representation for token $k$ is a weighted blend of all layers' outputs. The weights $s_j$ are learned during fine-tuning, so the model can decide "for sentiment analysis, I need more of the upper (semantic) layer; for POS tagging, more of the lower (syntactic) layer." The scaling factor $\gamma$ adjusts the overall magnitude.
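The mixing step is a one-liner once the layer outputs are stacked. In this sketch the shapes, layer values, and scalar weights are all illustrative stand-ins for a trained model's:

```python
import numpy as np

L, seq_len, dim = 2, 5, 8
rng = np.random.default_rng(0)
layers = rng.normal(size=(L + 1, seq_len, dim))   # h_{k,j}: layers 0..L for each token k

s_raw = np.array([0.1, 0.3, 0.6])                 # learned scalars, softmax-normalized
s = np.exp(s_raw) / np.exp(s_raw).sum()
gamma = 1.0

elmo = gamma * np.einsum("j,jkd->kd", s, layers)  # weighted sum over the layer axis
print(elmo.shape)   # (5, 8): one mixed vector per token
```

A downstream task only learns the three scalars and $\gamma$; the LSTM weights themselves stay frozen in the original ELMo recipe.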

The polysemy breakthrough

By calculating a vector dynamically based on the surrounding sentence, ELMo finally solved the long-standing polysemy problem. As demonstrated below, identical words generate distinctly different representations depending on the meaning implied by the adjacent context:

```text
"The river bank was steep and muddy."
 → ELMo("bank") ≈ [terrain, geography, nature...]

"I need to visit the bank to deposit a check."
 → ELMo("bank") ≈ [finance, money, institution...]
```

For the first time, the same word gets different vectors in different contexts. This single insight drove massive improvements across NLP benchmarks.

Limitation

The bidirectional LSTMs process left and right contexts independently, then concatenate. They don't jointly attend to both directions simultaneously. BERT would solve this.


6. BERT: truly bidirectional transformers

Devlin et al. (2019)[11] replaced LSTMs with Transformers and introduced truly bidirectional pre-training, which perfected and popularized the "pre-train then fine-tune" workflow.

Pre-training objectives

Masked Language Modeling (MLM)

Randomly mask 15% of tokens and predict them using the 80/10/10 strategy, which was carefully designed to address the pre-training/fine-tuning discrepancy:

  • 80% replaced with [MASK] (the primary training signal)
  • 10% replaced with a random token (prevents the model from learning that "something to predict" always looks like [MASK]; forces robustness to corrupted input)
  • 10% kept unchanged (this is the critical detail: during fine-tuning, the model never sees [MASK] tokens, so keeping some targets unchanged during pre-training forces the model to build good representations even for unmasked positions. Without this, the model would learn to only "pay attention" when it sees [MASK], degrading downstream performance.)

Here's an example of what the model sees during MLM training. It takes an input sentence with a randomly masked token and trains the model to predict the original hidden word as its target output:

```text
Input:  "The cat [MASK] on the mat"
Target: "sat"
```
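The 80/10/10 corruption scheme is simple to implement. This sketch operates on token strings rather than ids and uses a tiny made-up vocabulary, both purely for readability:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:         # select ~15% of positions
            labels[i] = tok                  # prediction target = original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)  # 10%: random token
            # else: 10% keep the token unchanged
    return inputs, labels

inp, lab = mask_tokens("the cat sat on the mat".split())
print(inp)
print(lab)   # None everywhere except the selected positions
```

The loss is computed only at positions where `labels` is set; all other positions contribute nothing, which is why MLM uses its training signal less efficiently than next-token prediction.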

Next Sentence Prediction (NSP)

Binary classification: are two sentences consecutive? (Later work showed NSP's contribution is minimal. RoBERTa dropped it entirely and improved results.)

Architecture

The BERT architecture processes input tokens through an embedding layer that combines token, segment, and position information. This combined input is then passed through a deep stack of bidirectional transformer encoders to produce the final contextual representations:

| Variant | Layers | Hidden Dim | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1,024 | 16 | 340M |

Why bidirectional self-attention matters

Unlike ELMo's concatenated left/right contexts, BERT's self-attention jointly conditions on the full context in every layer. The following example illustrates this difference. The model takes a full sentence and allows the target word to simultaneously attend to both preceding and succeeding tokens, resulting in a single, deeply contextualized representation:

```text
"I accessed my bank account"
→ Attention: "bank" attends to BOTH "accessed" AND "account"
→ Result: financial meaning

"The river bank was steep"
→ Attention: "bank" attends to BOTH "river" AND "steep"
→ Result: geographical meaning
```

This is fundamentally more powerful than ELMo's approach of processing left and right independently, then concatenating.

💡 Key insight: Analogy, reading a mystery novel: ELMo is like two detectives investigating a crime scene separately. One reads the clues left-to-right, the other right-to-left, and they compare notes at the end. BERT is like a single detective who can look at all the clues simultaneously, spotting connections the two separate investigators would miss. When "bank" appears between "river" and "account," BERT sees both context words interacting with each other through "bank." ELMo never gets that cross-directional evidence.

The fine-tuning approach

BERT established a workflow that dominated NLP for years:

  1. Pre-train on massive unlabeled corpus (BooksCorpus + Wikipedia, ~3.3B words)
  2. Add a task-specific head (classification layer, span extraction, etc.)
  3. Fine-tune all parameters on labeled downstream data
  4. Result: state-of-the-art on 11 NLP tasks simultaneously at release

7. GPT: autoregressive representations at scale

Radford et al. (2018, 2019)[12] at OpenAI took the opposite approach: unidirectional (left-to-right) transformer pre-training.

The autoregressive objective

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

Reading the formula

For each position in the text, the model tries to predict the next word using only the words that came before it. The loss sums up how wrong those predictions were (via negative log-probability). Better predictions = lower loss. This "predict the next word" game is all GPT needs to learn language.
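Numerically, the loss just sums negative log-probabilities of the true next tokens. The probabilities below are made up; a real model produces them with a softmax over the vocabulary at each position:

```python
import numpy as np

# P(x_t | x_<t) that the model assigned to each actual next token in a sequence
next_token_probs = np.array([0.9, 0.6, 0.05])

loss = -np.sum(np.log(next_token_probs))
print(loss)   # the badly predicted token (0.05) dominates the total
```

Confident correct predictions contribute almost nothing, while one token the model found surprising accounts for most of the loss; that asymmetry is what drives the gradient signal.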

Causal masking: enforcing unidirectionality

GPT enforces its left-to-right constraint through a causal attention mask (also called a triangular mask). In the self-attention computation, a lower-triangular matrix sets all attention scores for future positions to $-\infty$ before the softmax, making their attention weights exactly zero. This prevents position $t$ from attending to any position $t' > t$, ensuring the model cannot "cheat" by looking at tokens it hasn't generated yet. This is what makes GPT inherently autoregressive: each token's representation is built exclusively from the tokens that precede it, which is precisely the property needed for sequential text generation.
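The mask construction can be sketched directly. Here uniform zero scores stand in for real attention logits, which makes the effect of the mask easy to read off:

```python
import numpy as np

T = 4
scores = np.zeros((T, T))                          # stand-in attention scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future
scores[mask] = -np.inf                             # exp(-inf) = 0 after softmax

weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(np.round(weights, 2))
# Row t attends uniformly over positions 0..t and gives zero weight to the future:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```

The same lower-triangular pattern is applied in every layer and every head, so no information from future tokens ever leaks into a position's representation.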

BERT vs GPT: the key tradeoff

| Aspect | BERT | GPT |
|---|---|---|
| Attention | Bidirectional (full context) | Causal (left-to-right only) |
| Pre-training | MLM + NSP | Next-token prediction |
| Best for | Natural Language Understanding (NLU) | Natural Language Generation (NLG) |
| Representations | Deep bidirectional context | Left context only |
| Adaptation | Fine-tuning required | In-context learning (prompting) |
| Scaling trajectory | Plateaued at ~340M | Scaled to 1T+ parameters |

Why GPT's approach won at scale

Despite BERT's bidirectional advantage for understanding tasks, GPT's autoregressive approach scaled better:

  1. Unified framework: any text task can be cast as text generation (no task-specific heads needed)
  2. In-context learning: few-shot examples in the prompt replace fine-tuning entirely (Brown et al., 2020)[13]
  3. Emergent abilities: larger models unlock new capabilities like chain-of-thought reasoning, code generation, and tool use (Wei et al., 2022)[14]
  4. Natural generation: autoregressive models naturally produce coherent multi-sentence output

💡 Key insight: Saying "BERT is better because it's bidirectional" is incomplete. The correct framing is: BERT has stronger per-token representations for understanding tasks, but GPT's autoregressive formulation enables generation and scaling properties that proved more valuable for general-purpose AI.


8. Embedding geometry: what research tells us

A frequently overlooked but critical concept is the geometry of embeddings: how they're distributed in high-dimensional space.

Anisotropy: the cone problem

Imagine you're in a dark room with a flashlight, and you scatter glow-in-the-dark balls everywhere. Ideally, they'd be spread evenly across the room so you could tell any two balls apart by their positions. But what if all the balls ended up clustered in the narrow beam of the flashlight? Suddenly, every ball looks like it's in roughly the same direction, and it's hard to distinguish them.

Ethayarajh (2019)[6] at Stanford discovered that this is exactly what happens with embeddings. In BERT, ELMo, and GPT-2, word embeddings are anisotropic: they occupy a narrow cone in the embedding space rather than being uniformly distributed.


The representation degeneracy problem

Gao et al. (2019)[15] traced the root cause of anisotropy to weight tying: most language models share parameters between the input embedding layer and the output prediction layer. This architectural shortcut (used for parameter efficiency) creates a degenerate solution where the model pushes all embeddings into a narrow subspace to maximize the log-likelihood surface. The learned embedding matrix ends up with a few dominant singular values, causing all vectors to cluster along those principal directions. This explains why even semantically unrelated words can have cosine similarity > 0.5 in raw BERT embeddings.

Key findings

  • On average, less than 5% of the variance in a word's contextualized representations can be explained by a static embedding, meaning contextual representations are highly context-dependent
  • Upper layers produce more context-specific representations than lower layers
  • Lower-layer BERT representations, when reduced to their first principal component, actually outperform GloVe and FastText on standard static embedding benchmarks

🎯 Production tip: Anisotropy affects how cosine similarity behaves. Two random words might have surprisingly high cosine similarity simply because all vectors point in roughly the same direction, not because the words are semantically related. This motivates techniques like whitening and isotropy calibration in production systems.
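The effect is easy to reproduce with synthetic vectors. In this sketch every "embedding" shares one dominant direction (a crude stand-in for anisotropy), and mean-centering, which is the simplest whitening-style correction, removes most of the inflated similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
common = rng.normal(size=128) * 5.0              # dominant shared direction
embs = common + rng.normal(size=(1000, 128))     # 1000 fake word vectors in a narrow cone

def avg_cos(X):
    """Mean cosine similarity over all off-diagonal pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    return float((sims.sum() - len(X)) / (len(X) * (len(X) - 1)))

print(avg_cos(embs))                        # high: random pairs look "similar"
print(avg_cos(embs - embs.mean(axis=0)))    # near zero after mean-centering
```

Production whitening pipelines go further (full covariance whitening, isotropy calibration), but even this one-line centering illustrates why raw cosine scores from anisotropic models need calibration before thresholding.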


The complete evolution at a glance

Looking back over the decades of NLP research, the evolution of word representations follows a clear trajectory from simple frequency counts to massive neural networks. Each step in this journey was driven by the need to capture deeper semantic relationships and solve specific limitations of previous methods.

The earliest approaches, like TF-IDF and LSA, proved that simple mathematical operations on document statistics could capture a surprising amount of meaning. However, they were limited by their reliance on static matrices and struggled with capturing details like syntax or fine-grained semantic analogies.

The neural revolution fundamentally changed the approach. By framing representation learning as a prediction task, models like Word2Vec and BERT learned to compress meaning into dense geometric spaces. This transition from counting to predicting, and finally to contextualizing with transformers, laid the groundwork for modern generative AI.

| Method | Year | Team | Context | Key Innovation | Still Used? |
|---|---|---|---|---|---|
| TF-IDF | 1970s | Various | Static | Frequency weighting | ✅ Search engines |
| LSA | 1990 | Deerwester et al. | Static | SVD on term-document matrix | Rarely |
| Word2Vec | 2013 | Mikolov et al. (Google) | Static | Neural prediction, negative sampling | ✅ Features, baselines |
| GloVe | 2014 | Pennington et al. (Stanford) | Static | Co-occurrence matrix factorization | ✅ Features, baselines |
| FastText | 2017 | Bojanowski et al. (Meta) | Static | Subword n-gram composition | ✅ Low-resource, OOV |
| ELMo | 2018 | Peters et al. (Allen AI) | Dynamic | Bidirectional LSTM LM features | Rarely |
| BERT | 2018 | Devlin et al. (Google) | Dynamic | MLM + bidirectional Transformer | ✅ Encoders, retrieval |
| GPT | 2018+ | Radford et al. (OpenAI) | Dynamic | Next-token prediction at scale (up to 175B params in GPT-3[13]) | ✅ All LLMs |

When to use what (practical guide)

A strong engineer knows when not to use the latest model. While large language models dominate the headlines, they're often overkill for simple production tasks. Understanding the trade-offs between static and contextual embeddings is crucial for building cost-effective and low-latency systems.

Static embeddings like FastText or Word2Vec remain highly relevant in resource-constrained environments. They require no specialized hardware like GPUs for inference, making them ideal for lightweight applications or as feature inputs for traditional tabular machine learning models. FastText, in particular, continues to be a standard choice for morphologically rich languages where dealing with unseen words is a common challenge.
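The "dense features from text columns" pattern is simple enough to sketch in a few lines. The tiny embedding table below is hypothetical; in practice you would load pretrained Word2Vec or FastText vectors instead.

```python
import numpy as np

# Hypothetical 3-d static embedding table (word -> vector); real tables
# would come from pretrained Word2Vec/FastText files with 100-300 dims.
vocab = {
    "cheap": np.array([0.9, 0.1, 0.0]),
    "flight": np.array([0.2, 0.8, 0.1]),
    "hotel": np.array([0.1, 0.7, 0.3]),
}

def sentence_feature(text, table, dim=3):
    """Average the static vectors of known words into one dense feature row.
    OOV words are skipped here; FastText would compose them from n-grams."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

row = sentence_feature("Cheap flight", vocab)
print(row)  # element-wise mean of the "cheap" and "flight" vectors
```

The resulting fixed-width row can be concatenated with numeric columns and fed to any tabular model (gradient boosting, logistic regression) with no GPU in the loop.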

Contextual embeddings, on the other hand, are strictly necessary when semantic disambiguation is critical. Tasks like complex semantic search, passage retrieval, and detailed text classification benefit immensely from BERT-style encoders. For anything involving text generation or open-ended reasoning, autoregressive models from the GPT family are the definitive standard. The key is matching the complexity of the embedding model to the complexity of the business requirement.

| Scenario | Best Choice | Why |
|---|---|---|
| Simple text similarity search | FastText or Word2Vec | Fast, no GPU needed, good enough |
| Feature engineering for tabular ML | GloVe/FastText average | Dense features from text columns |
| Morphologically rich language | FastText | Handles OOV via subword composition |
| Semantic search / retrieval | Fine-tuned BERT (bi-encoder) | Context-aware, high quality |
| Text classification | Fine-tuned BERT / RoBERTa | Best accuracy for classification |
| Text generation | GPT / decoder model | Autoregressive = natural generation |
| Production with compute constraints | Distilled BERT (DistilBERT) | ~97% of BERT's performance at ~60% of its size |
| Research prototype | Full BERT-large / latest GPT model | Maximum quality, cost not a concern |

Visualizing and reducing embeddings in production

When building search or retrieval systems, engineers frequently need to visualize high-dimensional embeddings and reduce their dimensionality for efficiency. PCA is the go-to for fast linear reduction and provides interpretable principal components. t-SNE excels at creating 2D/3D visualizations that preserve local cluster structure, making it ideal for quality inspection. UMAP offers the best balance: it preserves both local and global structure better than t-SNE while being significantly faster, making it the standard choice for interactive exploration of embedding spaces. For more on production dimensionality reduction techniques including quantization, see Dimensionality Reduction for Embeddings.
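A short scikit-learn sketch of the linear option. The embeddings are synthetic clusters standing in for real sentence embeddings; with three well-separated clusters, two principal components capture most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for real sentence embeddings: three synthetic clusters in 128-d.
centers = rng.normal(scale=5.0, size=(3, 128))
emb = np.vstack([c + rng.normal(size=(50, 128)) for c in centers])

# Linear reduction to 2-D for plotting / quick quality inspection.
pca = PCA(n_components=2)
coords = pca.fit_transform(emb)
print(coords.shape)  # (150, 2)
print(round(pca.explained_variance_ratio_.sum(), 2))

# For nonlinear neighborhood-preserving maps, umap.UMAP(n_components=2)
# from the umap-learn package is a near drop-in replacement for PCA here.
```

PCA also doubles as a cheap pre-reduction step before t-SNE or UMAP on large corpora, which is a common production pattern.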


Key takeaways

  • Trace the evolution: Understand the progression from count-based (statistics) → prediction-based (local context) → contextual (full sentence awareness).
  • Technical distinctions matter: Know the difference between Word2Vec's local window, GloVe's global matrix factorization, and BERT's bidirectional attention.
  • Context is king: ELMo proved that one vector per word is insufficient; BERT proved that bidirectional attention beats concatenated LSTMs.
  • Scale wins: GPT showed that autoregressive objectives, while locally less powerful than bidirectional ones, scale better to general reasoning tasks.
  • Geometry awareness: Remember that embedding spaces are often anisotropic (cone-shaped), which distorts cosine similarity.

Common misconceptions

  • •"BERT is always better." False. Static embeddings are faster, cheaper, and sufficient for many simple similarity or feature engineering tasks.
  • •"ELMo and BERT are both bidirectional in the same way." False. ELMo concatenates independent left/right LSTMs; BERT jointly attends to all tokens.
  • •"GPT is strictly worse than BERT for understanding." False. While BERT is more sample-efficient for understanding, scaled-up GPT models perform well on NLU tasks via few-shot prompting.
  • •"King - Man + Woman = Queen always works." False. This is a cherry-picked example. Many analogies fail, and the geometry is often distorted.
  • •"High dimensionality is always better." False. While larger models generally perform better, there are tradeoffs. Static embeddings see diminishing returns beyond ~300d. Contextual embeddings from encoders (BERT) plateau around 1024d for most tasks. Decoder models (GPT) scale differently, with modern LLMs using 4K-12K+ dimensions to support massive context windows and emergent capabilities.

Summary

  1. Start with the problem: Words need numerical representations for neural networks to process them.
  2. Milestones: Count-based → Word2Vec → GloVe → FastText → ELMo → BERT → GPT.
  3. Key insights: Co-occurrence → prediction → subwords → context → attention → scale.
  4. Modern state: Today's LLMs use contextual embeddings computed by transformer layers, but static embeddings remain a vital tool for efficiency.

Evaluation Rubric
  1. Articulates the distributional hypothesis as the foundation for all embeddings
  2. Clearly explains Word2Vec training (skip-gram vs CBOW) with negative sampling details
  3. Describes GloVe's co-occurrence ratio insight and how it bridges count-based and neural methods
  4. Identifies FastText's key advantage: OOV handling via subword n-grams
  5. Explains ELMo as the first contextual model, with layer-specific syntax/semantics specialization
  6. Distinguishes ELMo's concatenated bidirectionality from BERT's joint self-attention
  7. Contrasts BERT (understanding) and GPT (generation) with a detailed scaling discussion
  8. Demonstrates practical judgment about when to use static vs contextual embeddings
Common Pitfalls
  • Claiming BERT is 'always better' without acknowledging static embedding use cases
  • Confusing ELMo's concatenated bidirectionality with BERT's joint bidirectionality
  • Not understanding why GPT's unidirectional approach scales better than BERT
  • Treating the king-queen analogy as a reliable, general property rather than a cherry-picked example
  • Forgetting the OOV problem and how FastText/subword tokenization solved it
  • Ignoring embedding geometry: anisotropy and its impact on cosine similarity
Key Concepts Tested
  • Distributional hypothesis and its operationalization
  • Word2Vec CBOW and Skip-gram architectures
  • Negative sampling and subsampling tricks
  • GloVe co-occurrence ratio insight and matrix factorization
  • FastText subword composition for OOV handling
  • ELMo layer specialization (syntax vs semantics)
  • BERT MLM objective and joint bidirectional attention
  • GPT autoregressive objective and scaling advantages
  • Embedding geometry and anisotropy
  • Static vs contextual embedding tradeoffs
References

  1. Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955.
  2. Shannon, C. E. (1948). A Mathematical Theory of Communication.
  3. Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS.
  4. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
  5. Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model.
  6. Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.
  7. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
  8. Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization.
  9. Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information.
  10. Peters, M., et al. (2018). Deep contextualized word representations. NAACL 2018.
  11. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
  12. Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
  13. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
  14. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.
  15. Gao, J., He, D., Tan, X., Qin, T., Wang, L., & Liu, T.-Y. (2019). Representation Degeneration Problem in Training Natural Language Generation Models.
