Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.
How does a computer understand that "king" and "queen" are related, or that "Paris" is to "France" what "Tokyo" is to "Japan"? It can't read a dictionary. It needs numbers. The solution is to give every word a set of coordinates, like pinning cities on a map. Words with similar meanings end up close together, and the directions between them encode relationships. These numerical coordinates are called embeddings, and they are the foundation of how every language model (from ChatGPT to Gemini) understands meaning.
This article traces the complete arc: from early counting methods through Word2Vec's breakthrough to today's context-aware representations in BERT and GPT, where the same word gets different coordinates depending on its surroundings.
🎯 Core concept: Understanding embeddings means grasping the core idea behind representation learning (a word's meaning comes from the company it keeps), how static coordinates evolved into dynamic ones, and the practical tradeoffs involved. This knowledge is foundational to every modern NLP and LLM system. (See also: how text becomes tokens before it becomes embeddings, and sentence embeddings for scaling this idea to whole passages.)
Before any algorithm, understand the core insight that makes all word embeddings possible:
"You shall know a word by the company it keeps." (J.R. Firth, 1957)
Think of it this way: if you moved to a new city and saw a store you'd never heard of, you could guess what it sells by looking at its neighbors. A store between a bakery and a deli is probably a food shop. A store between a nail salon and a hair salon is probably a beauty shop. You didn't need to go inside. The neighborhood told you everything.
Words work the same way. The word "espresso" tends to appear near "latte," "barista," and "café," so a computer can figure out it's a coffee-related word just by tracking those patterns. This is the distributional hypothesis: words that appear in similar contexts have similar meanings. Every embedding method (from TF-IDF to GPT) is at its core an operationalization of this idea.
```text
Context: "The ___ sat on the mat"

Words that fit: cat, dog, child, kitten, puppy
→ These words should have similar embeddings

Words that don't fit: democracy, algorithm, quantum
→ These should have very different embeddings
```
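As a toy sketch of the hypothesis (tiny invented corpus, simple symmetric context window), words that fill the same slot end up with overlapping neighbor counts:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the kitten sat on the mat".split(),
]
counts = cooccurrence_counts(corpus)

# "cat", "dog", and "kitten" all share the neighbors "the" and "sat"
shared = set(counts["cat"]) & set(counts["dog"]) & set(counts["kitten"])
```

Every method in this article is, in effect, a more sophisticated way of compressing a table like `counts` into dense vectors.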
The evolution of embeddings is the story of increasingly powerful ways to capture this distributional information:
Before neural approaches, words were represented using co-occurrence statistics.
TF-IDF (Term Frequency–Inverse Document Frequency) weights words by importance within a document relative to a corpus. Simple but effective for information retrieval, it's still used today in search engines.
LSA (Latent Semantic Analysis), introduced by Deerwester et al. (1990)[1], applies Singular Value Decomposition (SVD) to the term-document matrix to discover latent semantic dimensions.
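To make both ideas concrete, here is a minimal NumPy-only sketch, with a hypothetical term-document matrix invented for illustration: TF-IDF weighting followed by the truncated SVD at the heart of LSA:

```python
import numpy as np

# Toy term-document counts (hypothetical): rows = terms, columns = documents
terms = ["cat", "dog", "economy", "market"]
X = np.array([
    [3.0, 2.0, 0.0],   # "cat" appears in docs 0 and 1
    [2.0, 3.0, 0.0],   # "dog"
    [0.0, 0.0, 4.0],   # "economy"
    [0.0, 1.0, 3.0],   # "market"
])

# TF-IDF: term frequency times log(N / document frequency)
N = X.shape[1]
df = (X > 0).sum(axis=1)
tfidf = X * np.log(N / df)[:, None]

# LSA: truncated SVD projects terms into k latent semantic dimensions
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cos(term_vecs[0], term_vecs[1])
sim_cat_econ = cos(term_vecs[0], term_vecs[2])
# "cat" lands closer to "dog" than to "economy" in the latent space
```

Even on this tiny example, the latent dimensions group terms by the documents they share, which is the distributional hypothesis applied at document granularity.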
Key limitation: These methods produce a single vector per word regardless of context. The word "bank" has the same representation whether it means a financial institution or a riverbank. This is the polysemy problem, and solving it would drive the next decade of research.
Mikolov et al. (2013)[2] introduced two neural architectures that learn dense word vectors by predicting words from their local context. This was the breakthrough that launched modern NLP embeddings.
CBOW (Continuous Bag of Words) predicts the center word from surrounding context:

$$\frac{1}{T}\sum_{t=1}^{T} \log P\left(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}\right)$$

Reading the formula: given the surrounding words (the "context window" of size $m$ on each side), predict the word $w_t$ in the middle, averaged over all $T$ positions in the corpus. For example, given "the ___ sat on", predict "cat."
Skip-gram predicts context words from the center word (inverse of CBOW):

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log P\left(w_{t+j} \mid w_t\right)$$

Reading the formula: this is the reverse. Given the center word "cat", predict which words likely surround it ("the", "sat", "on"). Skip-gram tends to work better for rare words because each word gets more training updates.
💡 Key distinction: CBOW is faster and better for frequent words. Skip-gram is slower but captures rare words better (each rare word generates multiple training examples as a center word)[2].
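To make the objectives concrete, a small sketch (illustrative sentence and window size) of how skip-gram extracts its training pairs; CBOW uses the same windows but flips each one into a (context set → center) example:

```python
def skip_gram_pairs(tokens, window=2):
    """(center, context) training pairs: each center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skip_gram_pairs("the cat sat on the mat".split())
# "cat" as a center word yields ("cat", "the"), ("cat", "sat"), ("cat", "on")
```

Note how a rare word still produces one pair per neighbor every time it appears as a center word, which is why skip-gram gives rare words more gradient updates.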
Here's the problem: to learn good embeddings, the model needs to check its prediction against every word in the vocabulary, often 100,000+ words. That's like grading a multiple-choice test with 100,000 answer options for every single question.
Negative sampling (Mikolov et al., 2013)[2] is an elegant shortcut. Instead of comparing against all 100K words, the model picks a handful of random "wrong" answers (negatives) and just learns to tell the difference between the real context word and those few imposters. It's like a flashcard game: "Is 'cat' a real neighbor of 'sat'?" (yes ✅) "Is 'democracy' a real neighbor of 'sat'?" (no ❌). This tiny binary quiz is dramatically cheaper while still teaching the model which words belong together. The code below demonstrates how to compute this negative sampling loss for a single word pair. It takes the embeddings of the center word, the actual context word, and a set of random negative samples, returning a scalar loss value that forces the true pair to have high similarity and the fake pairs to have low similarity:
```python
import numpy as np

def skip_gram_loss(center_vec: np.ndarray,
                   context_vec: np.ndarray,
                   neg_samples: np.ndarray) -> float:
    """
    Computes the negative sampling loss for a single skip-gram pair.

    Args:
        center_vec: Embedding of center word (d,)
        context_vec: Embedding of context word (d,)
        neg_samples: Embeddings of k negative samples (k, d)

    Returns:
        Scalar loss value
    """
    # 1. Positive pair: maximize log probability of real context word
    #    sigmoid(u_o^T v_c)
    pos_score = np.dot(context_vec, center_vec)
    # sigmoid(x) = 1 / (1 + exp(-x))
    pos_prob = 1 / (1 + np.exp(-pos_score))
    pos_loss = -np.log(pos_prob + 1e-10)

    # 2. Negative pairs: maximize log probability of NOT being context words
    #    sigmoid(-u_k^T v_c) for all k negative samples
    neg_scores = np.dot(neg_samples, center_vec)
    # We want these to be low, so maximize 1 - sigmoid(x) = sigmoid(-x)
    neg_prob_inv = 1 / (1 + np.exp(neg_scores))
    neg_loss = -np.sum(np.log(neg_prob_inv + 1e-10))

    return pos_loss + neg_loss
```
Negative words are sampled proportional to $f(w)^{3/4}$, where $f(w)$ is the word frequency. The 3/4 exponent gives rare words slightly more chance of being selected as negatives, improving their representations.
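A quick numeric sketch of that effect, with invented frequencies:

```python
import numpy as np

# Hypothetical corpus frequencies
freqs = {"the": 5000, "cat": 50, "quasar": 2}
words = list(freqs)
raw = np.array([freqs[w] for w in words], dtype=float)

p_unigram = raw / raw.sum()                      # plain frequency sampling
p_smoothed = raw ** 0.75 / (raw ** 0.75).sum()   # word2vec's f(w)^(3/4)

# The 3/4 power flattens the distribution: "quasar" is now sampled
# relatively more often, "the" relatively less often
```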
Word2Vec's most striking property is that semantic relationships are captured as linear directions in the embedding space:
```text
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walking + swimming ≈ swam
```
This works because regularities in the training data create consistent vector offsets. The "gender direction" (man → woman) is roughly the same vector regardless of the word pair.
⚠️ Reality check: Research by Ethayarajh (2019)[3] and others has shown that the king→queen analogy is somewhat cherry-picked. Many analogies don't work cleanly, and the cosine similarity nearest neighbor often isn't the "correct" answer. Understanding this nuance distinguishes deep expertise from surface-level knowledge.
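A minimal sketch of the vector-arithmetic lookup, using contrived 2-d vectors chosen so the offsets line up exactly (real embeddings are far noisier, per the reality check above):

```python
import numpy as np

# Hypothetical 2-d vectors contrived so the gender offset is consistent
vecs = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
}

def analogy(a, b, c, vecs):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    best, best_sim = None, -1.0
    for w, v in vecs.items():
        if w in (a, b, c):
            continue  # standard trick: exclude the three query words
        sim = target @ v / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```

The `continue` line matters: implementations conventionally exclude the query words, because without that exclusion the nearest neighbor of king − man + woman is frequently "king" itself, which is part of the cherry-picking critique.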
| Hyperparameter | Typical Value | Impact |
|---|---|---|
| Embedding dimension | 100–300 | 300 is standard; diminishing returns beyond |
| Context window | 5–10 | Larger → more topical; smaller → more syntactic |
| Negative samples (k) | 5–15 | More negatives = better for rare words |
| Min word count | 5 | Filters out very rare words |
| Subsampling threshold | 1e-5 | Drops frequent words like "the", "a" |
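The subsampling row deserves a quick illustration. The paper drops each occurrence of word $w$ with probability $1 - \sqrt{t / f(w)}$, where $f(w)$ is the word's share of all tokens. A sketch of the resulting keep probability (note the released word2vec.c code uses a slightly different but similar expression):

```python
import numpy as np

def keep_prob(freq_fraction, t=1e-5):
    """Probability of KEEPING an occurrence of a word during training.

    freq_fraction: the word's share of all corpus tokens, e.g. 0.05
    for a word making up 5% of tokens. Rare words are always kept;
    very frequent words like "the" are dropped most of the time.
    """
    return float(min(1.0, np.sqrt(t / freq_fraction)))

# "the" at ~5% of all tokens survives only ~1.4% of the time,
# while any word rarer than t = 1e-5 is always kept
```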
Pennington, Socher & Manning (2014)[4] at Stanford combined the best of count-based and prediction-based methods.
Here's a clever detective trick that GloVe exploits. Say you're trying to figure out the difference between "ice" and "steam." Looking at which words appear near each one individually isn't very helpful, since both appear near "water." But if you look at the ratio, the pattern jumps out: the word "solid" appears 100× more often with "ice" than with "steam," while "gas" appears 100× more with "steam." Neutral words like "water" appear equally with both, giving a ratio near 1. It's like comparing two suspects by looking at who they hang out with relative to each other.
The key insight is that ratios of co-occurrence probabilities, not raw probabilities, capture semantic relationships:
| Word | P(·\|ice) | P(·\|steam) | Ratio |
|---|---|---|---|
| solid | High | Low | Large → associated with ice |
| gas | Low | High | Small → associated with steam |
| water | High | High | ≈ 1 → neutral |
| random | Low | Low | ≈ 1 → neutral |
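The table's logic in code, with hypothetical co-occurrence counts in the spirit of the GloVe paper's ice/steam example (not real corpus statistics):

```python
# Hypothetical co-occurrence counts (not real corpus statistics)
counts = {
    ("ice", "solid"): 100, ("steam", "solid"): 1,
    ("ice", "gas"): 1,     ("steam", "gas"): 100,
    ("ice", "water"): 300, ("steam", "water"): 290,
}
totals = {"ice": 1000, "steam": 1000}  # total co-occurrences per word

def ratio(probe):
    """P(probe | ice) / P(probe | steam): the signal GloVe is built on."""
    p_ice = counts[("ice", probe)] / totals["ice"]
    p_steam = counts[("steam", probe)] / totals["steam"]
    return p_ice / p_steam

# ratio("solid") is large, ratio("gas") is small, ratio("water") is near 1
```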
GloVe optimizes word vectors so their dot product equals the log co-occurrence count:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Reading the formula: for every pair of words in the vocabulary, the loss penalizes the difference between (a) the dot product of their learned vectors and (b) the log of how often they actually co-occur in the corpus. The weighting function $f(X_{ij})$ prevents extremely common pairs (like "the, of") from overwhelming the training. The result: vectors whose geometry directly encodes co-occurrence statistics.

Where:

- $X_{ij}$ counts how often word $j$ appears in the context of word $i$, and $V$ is the vocabulary size
- $w_i$ and $\tilde{w}_j$ are the word and context vectors; $b_i$ and $\tilde{b}_j$ are bias terms
- $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ (else 1), with $x_{\max} = 100$ and $\alpha = 3/4$ in the paper
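As a sketch of this objective for a single word pair, using the paper's weighting function with $x_{\max} = 100$ and $\alpha = 3/4$ (the vectors and biases here are arbitrary toy values):

```python
import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss for one (i, j) co-occurrence pair.

    The weight f(x) caps the influence of very frequent pairs, and the
    squared error pushes the dot product toward log(x_ij)."""
    f = min(1.0, (x_ij / x_max) ** alpha)
    err = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return float(f * err ** 2)

w_i = np.array([0.1, 0.2])
w_j = np.array([0.3, -0.1])
loss = glove_pair_loss(w_i, w_j, b_i=0.0, b_j=0.0, x_ij=50.0)
# The loss reaches zero exactly when w_i . w_j + b_i + b_j == log(x_ij)
```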
| Aspect | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local context windows | Global co-occurrence matrix |
| Objective | Predictive (softmax/NCE) | Least squares on log co-occurrence |
| Theoretical basis | Distributional hypothesis via prediction | Explicit matrix factorization |
| Performance | Strong on analogy tasks | Competitive; often slightly better on similarity |
| Training | Online (stochastic) | Batch (needs full co-occurrence matrix) |
💡 Insight: Levy & Goldberg (2014) showed that Word2Vec with negative sampling is implicitly factorizing a shifted PMI matrix. The methods are more similar than they appear[5].
Bojanowski et al. (2017)[6] at Facebook extended Word2Vec by representing words as bags of character n-grams:
```text
Word: "where" (with n = 3..6)
N-grams: {"<wh", "whe", "her", "ere", "re>", "<whe", "wher", "here", "ere>",
          "<wher", "where", "here>", "<where", "where>", "<where>"}

Embedding("where") = Σ embedding(ngram) for each ngram
```
The killer feature: FastText can compute embeddings for OOV (Out-of-Vocabulary) words, words it has never seen, by summing the vectors of their subword n-grams. This is critical for:
"evlerimizdekilerden" β meaningful embedding from subparts"recieve" gets a reasonable embedding from shared n-grams with "receive"π― Still useful today: FastText embeddings remain the go-to for resource-constrained settings, feature engineering for tabular ML, and simple similarity search where a full transformer is overkill.
Every method so far has a fundamental flaw: the word "bank" always gets the same set of coordinates, whether you're talking about a river bank or a savings bank. It's like giving every person named "Jordan" the same ID photo, ignoring everything about who they actually are.
Peters et al. (2018)[7] changed this with Embeddings from Language Models (ELMo), the first widely successful contextual word representations. Now words are like chameleons: they change color based on their surroundings. This was the inflection point, the end of the "one vector per word" era.
ELMo uses a 2-layer bidirectional LSTM (Long Short-Term Memory) trained as a language model: a forward LM reads the sentence left to right, a backward LM reads it right to left, and the two are trained jointly.
Each layer captures different linguistic information:
| Layer | What It Captures | Evidence |
|---|---|---|
| Layer 0 (Token embeddings) | Surface form, character patterns | Character CNN (Convolutional Neural Network) layer |
| Layer 1 (Lower LSTM) | Syntax: POS tags (Part-of-Speech), NER (Named Entity Recognition) | Best for syntactic probing tasks |
| Layer 2 (Upper LSTM) | Semantics: word sense, meaning | Best for WSD (Word Sense Disambiguation), sentiment |
The final representation is a task-specific weighted combination of all layers:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}$$

Reading the formula: the final representation for token $k$ is a weighted blend of all layers' outputs $h_{k,j}$. The weights $s_j^{task}$ are learned during fine-tuning, so the model can decide "for sentiment analysis, I need more of the upper (semantic) layer; for POS tagging, more of the lower (syntactic) layer." The scaling factor $\gamma^{task}$ adjusts the overall magnitude.
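The combination is a few lines of NumPy. This sketch uses random layer outputs of illustrative shape and, as in the paper, softmax-normalizes the raw layer weights:

```python
import numpy as np

def elmo_combine(layer_outputs, s_logits, gamma=1.0):
    """Task-specific weighted sum of layer representations.

    layer_outputs: (L+1, seq_len, dim) -- token layer plus LSTM layers
    s_logits: (L+1,) raw layer weights, softmax-normalized before mixing
    """
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()  # softmax over layers
    # Contract the layer axis: result is (seq_len, dim)
    return gamma * np.tensordot(s, layer_outputs, axes=1)

# 3 layers, 5 tokens, dimension 4 (illustrative sizes)
layers = np.random.default_rng(0).normal(size=(3, 5, 4))
combined = elmo_combine(layers, s_logits=np.array([0.0, 0.0, 2.0]))
```

With equal logits the blend reduces to a plain average of the layers; fine-tuning moves the logits to favor whichever layers the task needs.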
text1"The river bank was steep and muddy." 2 β ELMo("bank") β [terrain, geography, nature...] 3 4"I need to visit the bank to deposit a check." 5 β ELMo("bank") β [finance, money, institution...]
For the first time, the same word gets different vectors in different contexts. This single insight drove massive improvements across NLP benchmarks.
Limitation: The bidirectional LSTMs process left and right contexts independently, then concatenate. They don't jointly attend to both directions simultaneously. BERT would solve this.
Devlin et al. (2019)[8] replaced LSTMs with Transformers and introduced truly bidirectional pre-training, establishing the "pre-train then fine-tune" workflow.
Masked Language Modeling (MLM): Randomly mask 15% of tokens and predict them:
```text
Input:  "The cat [MASK] on the mat"
Target: "sat"
```
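The masking recipe can be sketched as follows. This is a simplified version of BERT's scheme (of the ~15% selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged); the vocabulary here is illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking. Returns (masked tokens, {position: original})."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:    # select ~15% of positions
            targets[i] = tok            # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"    # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% left unchanged (keeps [MASK]-free inputs familiar)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab=["dog", "rug", "ran"])
```

The 10%/10% tweaks exist because `[MASK]` never appears at fine-tuning time; without them the model would over-specialize to the artificial token.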
Next Sentence Prediction (NSP): Binary classification: are two sentences consecutive? (Later work showed NSP's contribution is minimal. RoBERTa dropped it entirely and improved results.)
| Variant | Layers | Hidden Dim | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1,024 | 16 | 340M |
Unlike ELMo's concatenated left/right contexts, BERT's self-attention jointly conditions on the full context in every layer:
text1"I accessed my bank account" 2β Attention: "bank" attends to BOTH "accessed" AND "account" 3β Result: financial meaning 4 5"The river bank was steep" 6β Attention: "bank" attends to BOTH "river" AND "steep" 7β Result: geographical meaning
This is fundamentally more powerful than ELMo's approach of processing left and right independently, then concatenating.
💡 Analogy (reading a mystery novel): ELMo is like two detectives investigating a crime scene separately. One reads the clues left-to-right, the other right-to-left, and they compare notes at the end. BERT is like a single detective who can look at all the clues simultaneously, spotting connections the two separate investigators would miss. When "bank" appears between "river" and "account," BERT sees both context words interacting with each other through "bank." ELMo never gets that cross-directional evidence.
BERT established the workflow that dominated NLP for years: pre-train once on massive unlabeled text, then fine-tune the full model, with a small task-specific head, separately for each downstream task.
Radford et al. (2018, 2019)[9] at OpenAI took the opposite approach: unidirectional (left-to-right) transformer pre-training.
$$L(\theta) = -\sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta\right)$$

Reading the formula: for each position $i$ in the text, the model tries to predict the next word $u_i$ using only the $k$ words that came before it. The loss sums up how wrong those predictions were (via negative log-probability). Better predictions = lower loss. This "predict the next word" game is all GPT needs to learn language.
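A toy numeric illustration of the objective. The per-token probabilities here are invented; a real model produces them from a softmax over the vocabulary at each step:

```python
import math

def next_token_loss(probs_per_step):
    """Sum of negative log-probabilities the model assigned to each
    actual next token: the autoregressive LM loss for one sequence."""
    return -sum(math.log(p) for p in probs_per_step)

# A model that is confident about the true continuation gets a lower
# loss on the same text than one that is not
loss_confident = next_token_loss([0.9, 0.8, 0.7])
loss_unsure = next_token_loss([0.2, 0.1, 0.3])
```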
| Aspect | BERT | GPT |
|---|---|---|
| Attention | Bidirectional (full context) | Causal (left-to-right only) |
| Pre-training | MLM + NSP | Next-token prediction |
| Best for | Understanding (NLU) | Generation (NLG) |
| Representations | Deep bidirectional context | Left context only |
| Adaptation | Fine-tuning required | In-context learning (prompting) |
| Scaling trajectory | Plateaued at ~340M | Scaled to 1T+ parameters |
Despite BERT's bidirectional advantage for understanding tasks, GPT's autoregressive approach scaled better: next-token prediction works on raw text at any scale, and the same objective doubles as a generation interface, which is what made prompting and in-context learning possible.
💡 Technical nuance: Saying "BERT is better because it's bidirectional" is incomplete. The correct nuance is: BERT has stronger per-token representations for understanding tasks, but GPT's autoregressive formulation enables generation and scaling properties that proved more valuable for general-purpose AI.
A frequently overlooked but critical concept is the geometry of embeddings: how they're distributed in high-dimensional space.
Imagine you're in a dark room with a flashlight, and you scatter glow-in-the-dark balls everywhere. Ideally, they'd be spread evenly across the room so you could tell any two balls apart by their positions. But what if all the balls ended up clustered in the narrow beam of the flashlight? Suddenly, every ball looks like it's in roughly the same direction, and it's hard to distinguish them.
Ethayarajh (2019)[3] at Stanford discovered that this is exactly what happens with embeddings. In BERT, ELMo, and GPT-2, word embeddings are anisotropic: they occupy a narrow cone in the embedding space rather than being uniformly distributed.
Key findings:

- Contextual representations are anisotropic in all layers of BERT, ELMo, and GPT-2, and the anisotropy grows in higher layers
- Upper layers are more context-specific: the same word's vectors spread further apart across different contexts
- GPT-2's final layer is so anisotropic that two random words can have near-perfect cosine similarity
🔬 Why this matters: Anisotropy affects how cosine similarity behaves. Two random words might have surprisingly high cosine similarity simply because all vectors point in roughly the same direction, not because the words are semantically related. This motivates techniques like whitening and isotropy calibration in production systems.
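The effect is easy to simulate with synthetic vectors: adding a shared offset to otherwise random embeddings inflates the average pairwise cosine similarity, and subtracting the mean vector (the simplest whitening-style correction) restores near-isotropy. All numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate an anisotropic embedding table: isotropic noise plus a large
# shared offset, so every vector points in roughly the same direction
common = rng.normal(size=300) * 5.0
embs = rng.normal(size=(1000, 300)) + common

def mean_cos(X):
    """Average cosine similarity over all distinct vector pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = len(Xn)
    sims = Xn @ Xn.T
    return float((sims.sum() - n) / (n * (n - 1)))  # off-diagonal average

before = mean_cos(embs)                       # large: narrow cone
after = mean_cos(embs - embs.mean(axis=0))    # near zero: roughly isotropic
```

In a perfectly isotropic space the average pairwise cosine similarity of unrelated vectors sits near zero; a large `before` value is exactly the "narrow cone" Ethayarajh measured.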
| Method | Year | Team | Context | Key Innovation | Still Used? |
|---|---|---|---|---|---|
| TF-IDF | 1970s | Various | Static | Frequency weighting | ✅ Search engines |
| LSA | 1990 | Deerwester et al. | Static | SVD on term-document matrix | Rarely |
| Word2Vec | 2013 | Mikolov et al. (Google) | Static | Neural prediction, negative sampling | ✅ Features, baselines |
| GloVe | 2014 | Pennington et al. (Stanford) | Static | Co-occurrence matrix factorization | ✅ Features, baselines |
| FastText | 2017 | Bojanowski et al. (Meta) | Static | Subword n-gram composition | ✅ Low-resource, OOV |
| ELMo | 2018 | Peters et al. (Allen AI) | Dynamic | Bidirectional LSTM LM features | Rarely |
| BERT | 2019 | Devlin et al. (Google) | Dynamic | MLM + bidirectional Transformer | ✅ Encoders, retrieval |
| GPT | 2018+ | Radford et al. (OpenAI) | Dynamic | Autoregressive Transformer at scale | ✅ All LLMs |
A strong engineer knows when not to use the latest model:
| Scenario | Best Choice | Why |
|---|---|---|
| Simple text similarity search | FastText or Word2Vec | Fast, no GPU needed, good enough |
| Feature engineering for tabular ML | GloVe/FastText average | Dense features from text columns |
| Morphologically rich language | FastText | Handles OOV via subword composition |
| Semantic search / retrieval | Fine-tuned BERT (bi-encoder) | Context-aware, high quality |
| Text classification | Fine-tuned BERT / RoBERTa | Best accuracy for classification |
| Text generation | GPT / decoder model | Autoregressive = natural generation |
| Production with compute constraints | Distilled BERT (DistilBERT) | 97% accuracy at 60% the size |
| Research prototype | Full BERT-large / GPT | Maximum quality, cost not a concern |
[1] Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS.

[2] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.

[3] Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. EMNLP-IJCNLP 2019.

[4] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.

[5] Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS 2014.

[6] Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information. TACL.

[7] Peters, M., et al. (2018). Deep Contextualized Word Representations. NAACL 2018.

[8] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

[9] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI technical report.

[10] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.

[11] Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.