LeetLLM

Your go-to resource for mastering AI & LLM systems.

© 2026 LeetLLM. All rights reserved.

πŸ“MediumNLP Fundamentals

Word to Contextual Embeddings

Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.

30 min read Β· Google, Meta, OpenAI +3 Β· 10 key concepts

How does a computer understand that "king" and "queen" are related, or that "Paris" is to "France" what "Tokyo" is to "Japan"? It can't read a dictionary. It needs numbers. The solution is to give every word a set of coordinates, like pinning cities on a map. Words with similar meanings end up close together, and the directions between them encode relationships. These numerical coordinates are called embeddings, and they are the foundation of how every language model (from ChatGPT to Gemini) understands meaning.

This article traces the complete arc: from early counting methods through Word2Vec's breakthrough to today's context-aware representations in BERT and GPT, where the same word gets different coordinates depending on its surroundings.

🎯 Core concept: Understanding embeddings means grasping three things: that a word's meaning comes from the company it keeps, how static coordinates evolved into dynamic ones, and the practical tradeoffs between the two. This knowledge is foundational to every modern NLP and LLM system. (See also: how text becomes tokens before it becomes embeddings, and sentence embeddings for scaling this idea to whole passages.)


The distributional hypothesis: the foundation of everything

Before any algorithm, understand the core insight that makes all word embeddings possible:

"You shall know a word by the company it keeps." (J.R. Firth, 1957)

Think of it this way: if you moved to a new city and saw a store you'd never heard of, you could guess what it sells by looking at its neighbors. A store between a bakery and a deli is probably a food shop. A store between a nail salon and a hair salon is probably a beauty shop. You didn't need to go inside. The neighborhood told you everything.

Words work the same way. The word "espresso" tends to appear near "latte," "barista," and "cafΓ©," so a computer can figure out it's a coffee-related word just by tracking those patterns. This is the distributional hypothesis: words that appear in similar contexts have similar meanings. Every embedding method (from TF-IDF to GPT) is at its core an operationalization of this idea.

```text
Context: "The ___ sat on the mat"

Words that fit: cat, dog, child, kitten, puppy
β†’ These words should have similar embeddings

Words that don't fit: democracy, algorithm, quantum
β†’ These should have very different embeddings
```

The evolution of embeddings is the story of increasingly powerful ways to capture this distributional information:


1. Count-based methods: the starting point

Before neural approaches, words were represented using co-occurrence statistics.

TF-IDF (Term Frequency–Inverse Document Frequency) weights words by importance within a document relative to a corpus. Simple but effective for information retrieval, it's still used today in search engines.

LSA (Latent Semantic Analysis), introduced by Deerwester et al. (1990)[1], applies Singular Value Decomposition (SVD) to the term-document matrix to discover latent semantic dimensions.
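To make the count-based pipeline concrete, here is a toy numpy-only sketch (the corpus, the function name, and the simplified TF-IDF weighting are illustrative assumptions, not the exact classical formulation): TF-IDF weighting followed by a truncated SVD is a minimal LSA.

```python
import numpy as np

def tfidf_lsa(docs, k=2):
    """Toy LSA: TF-IDF weight a term-document matrix, then truncate via SVD."""
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    # Term-document count matrix (terms x docs)
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[idx[w], j] += 1
    tf = X / X.sum(axis=0, keepdims=True)      # term frequency within each doc
    df = (X > 0).sum(axis=1)                   # document frequency per term
    idf = np.log(len(docs) / df)               # inverse document frequency
    W = tf * idf[:, None]                      # TF-IDF matrix
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # k-dimensional latent term vectors: rows of U scaled by singular values
    return vocab, U[:, :k] * S[:k]

docs = ["cat sits on mat", "dog sits on mat", "stocks rose on earnings"]
vocab, term_vecs = tfidf_lsa(docs)
print(term_vecs.shape)  # (8, 2): one 2-d latent vector per vocabulary term
```

Even on this tiny corpus, terms that co-occur in the same documents end up with similar latent vectors, which is exactly the distributional idea operationalized through counts.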


Key limitation: These methods produce a single vector per word regardless of context. The word "bank" has the same representation whether it means a financial institution or a riverbank. This is the polysemy problem, and solving it would drive the next decade of research.


2. Word2Vec: the neural embedding revolution

Mikolov et al. (2013)[2] introduced two neural architectures that learn dense word vectors by predicting words from their local context. This was the breakthrough that launched modern NLP embeddings.

Two architectures

CBOW (Continuous Bag of Words) predicts the center word from surrounding context:

$$P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$$

Reading the formula: given the surrounding words (the "context window" of size c on each side), predict the word in the middle. For example, given "the ___ sat on", predict "cat."

Skip-gram predicts context words from the center word (inverse of CBOW):

$$P(w_{t+j} \mid w_t) \quad \text{for } j \in [-c, c],\; j \neq 0$$

Reading the formula: this is the reverse. Given the center word "cat", predict which words likely surround it ("the", "sat", "on"). Skip-gram tends to work better for rare words because each word gets more training updates.


πŸ’‘ Key distinction: CBOW is faster and better for frequent words. Skip-gram is slower but captures rare words better (each rare word generates multiple training examples as a center word)[2].
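A minimal sketch of how the two architectures slice the same sentence into training examples (function names and window handling are illustrative):

```python
def skip_gram_pairs(tokens, window=2):
    """Skip-gram: for each center word, emit (center, context) pairs in the window."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: for each position, emit one (context_words, center) example."""
    examples = []
    for t, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, t - window), min(len(tokens), t + window + 1))
               if j != t]
        examples.append((ctx, center))
    return examples

tokens = "the cat sat on the mat".split()
print(skip_gram_pairs(tokens, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Note the asymmetry: skip-gram turns each occurrence of a word into several training pairs with that word as input, which is why rare words get more gradient updates than under CBOW.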

Training: negative sampling

Here's the problem: to learn good embeddings, the model needs to check its prediction against every word in the vocabulary, often 100,000+ words. That's like grading a multiple-choice test with 100,000 answer options for every single question.

Negative sampling (Mikolov et al., 2013)[2] is an elegant shortcut. Instead of comparing against all 100K words, the model picks a handful of random "wrong" answers (negatives) and just learns to tell the difference between the real context word and those few imposters. It's like a flashcard game: "Is 'cat' a real neighbor of 'sat'?" (yes βœ…) "Is 'democracy' a real neighbor of 'sat'?" (no ❌). This tiny binary quiz is dramatically cheaper while still teaching the model which words belong together. The code below demonstrates how to compute this negative sampling loss for a single word pair. It takes the embeddings of the center word, the actual context word, and a set of random negative samples, returning a scalar loss value that forces the true pair to have high similarity and the fake pairs to have low similarity:

```python
import numpy as np

def skip_gram_loss(center_vec: np.ndarray,
                   context_vec: np.ndarray,
                   neg_samples: np.ndarray) -> float:
    """
    Computes the negative sampling loss for a single skip-gram pair.

    Args:
        center_vec: Embedding of center word (d,)
        context_vec: Embedding of context word (d,)
        neg_samples: Embeddings of k negative samples (k, d)

    Returns:
        Scalar loss value
    """
    # 1. Positive pair: maximize log probability of real context word
    #    sigmoid(u_o^T v_c)
    pos_score = np.dot(context_vec, center_vec)
    # sigmoid(x) = 1 / (1 + exp(-x))
    pos_prob = 1 / (1 + np.exp(-pos_score))
    pos_loss = -np.log(pos_prob + 1e-10)

    # 2. Negative pairs: maximize log probability of NOT being context words
    #    sigmoid(-u_k^T v_c) for all k negative samples
    neg_scores = np.dot(neg_samples, center_vec)
    # We want these to be low, so maximize 1 - sigmoid(x) = sigmoid(-x)
    neg_prob_inv = 1 / (1 + np.exp(neg_scores))
    neg_loss = -np.sum(np.log(neg_prob_inv + 1e-10))

    return pos_loss + neg_loss
```

Negative words are sampled proportional to f(w)^{3/4}, where f(w) is the word frequency. The 3/4 exponent gives rare words slightly more chance of being selected as negatives, improving their representations.
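The smoothing effect of the 3/4 exponent is easy to see in a few lines (the counts below are hypothetical):

```python
import numpy as np

def negative_sampling_dist(counts):
    """Unigram counts raised to 3/4, renormalized: rare words get boosted."""
    freqs = np.asarray(counts, dtype=float)
    p = freqs ** 0.75
    return p / p.sum()

counts = [1000, 100, 10]                 # hypothetical word frequencies
raw = np.asarray(counts) / sum(counts)   # plain unigram distribution
smoothed = negative_sampling_dist(counts)
# the rarest word's sampling probability rises relative to its raw frequency,
# while the most frequent word's probability shrinks
print(raw.round(3), smoothed.round(3))
```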

The famous analogy property

Word2Vec's most striking property is that semantic relationships are captured as linear directions in the embedding space:

```text
king - man + woman β‰ˆ queen
Paris - France + Italy β‰ˆ Rome
walked - walking + swimming β‰ˆ swam
```

This works because regularities in the training data create consistent vector offsets. The "gender direction" (man β†’ woman) is roughly the same vector regardless of the word pair.
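A toy sketch of the analogy arithmetic with hand-made vectors (the table values are illustrative, not trained; as in standard practice, the query words themselves are excluded from the candidate set):

```python
import numpy as np

# Tiny hand-made embedding table (values are illustrative, not trained)
E = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, 0.2, 0.0]),
    "apple": np.array([0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Solve a - b + c β‰ˆ ?, excluding the query words themselves."""
    target = E[a] - E[b] + E[c]
    candidates = [w for w in E if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(E[w], target))

print(analogy("king", "man", "woman"))  # prints "queen"
```

The exclusion step matters: without it, the nearest neighbor of king - man + woman is often "king" itself, one reason the analogy property is less magical than it first appears.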

Word embedding analogy in 2D space: the vector from "king" to "queen" is parallel to the vector from "man" to "woman", demonstrating learned semantic relationships.

⚠️ Reality check: Research by Ethayarajh (2019)[3] and others has shown that the king–queen analogy is somewhat cherry-picked. Many analogies don't work cleanly, and the cosine-similarity nearest neighbor often isn't the "correct" answer. Understanding this nuance distinguishes deep expertise from surface-level knowledge.

Practical details

| Hyperparameter | Typical Value | Impact |
|---|---|---|
| Embedding dimension | 100–300 | 300 is standard; diminishing returns beyond |
| Context window | 5–10 | Larger β†’ more topical; smaller β†’ more syntactic |
| Negative samples (k) | 5–15 | More negatives = better for rare words |
| Min word count | 5 | Filters out very rare words |
| Subsampling threshold | 1e-5 | Drops frequent words like "the", "a" |

3. GloVe: global vectors

Pennington, Socher & Manning (2014)[4] at Stanford combined the best of count-based and prediction-based methods.

Core insight: co-occurrence ratios encode meaning

Here's a clever detective trick that GloVe exploits. Say you're trying to figure out the difference between "ice" and "steam." Looking at which words appear near each one individually isn't very helpful, since both appear near "water." But if you look at the ratio, the pattern jumps out: the word "solid" appears 100Γ— more often with "ice" than with "steam," while "gas" appears 100Γ— more with "steam." Neutral words like "water" appear equally with both, giving a ratio near 1. It's like comparing two suspects by looking at who they hang out with relative to each other.

The key insight is that ratios of co-occurrence probabilities, not raw probabilities, capture semantic relationships:

| Word | P(Β· \| ice) | P(Β· \| steam) | Ratio |
|---|---|---|---|
| solid | High | Low | Large β†’ associated with ice |
| gas | Low | High | Small β†’ associated with steam |
| water | High | High | β‰ˆ 1 β†’ neutral |
| random | Low | Low | β‰ˆ 1 β†’ neutral |

GloVe optimizes word vectors so their dot product equals the log co-occurrence count:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Reading the formula: for every pair of words in the vocabulary, the loss penalizes the difference between (a) the dot product of their learned vectors and (b) the log of how often they actually co-occur in the corpus. The weighting function f(X_{ij}) prevents extremely common pairs (like "the, of") from overwhelming the training. The result: vectors whose geometry directly encodes co-occurrence statistics.

Where:

  • X_{ij} = co-occurrence count of words i and j
  • f(x) = weighting function that caps at x_max = 100 (prevents frequent pairs from dominating)
  • w_i, w̃_j = word and context vectors

GloVe vs Word2Vec

| Aspect | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local context windows | Global co-occurrence matrix |
| Objective | Predictive (softmax/NCE) | Least squares on log co-occurrence |
| Theoretical basis | Distributional hypothesis via prediction | Explicit matrix factorization |
| Performance | Strong on analogy tasks | Competitive; often slightly better on similarity |
| Training | Online (stochastic) | Batch (needs full co-occurrence matrix) |

πŸ’‘ Insight: Levy & Goldberg (2014) showed that Word2Vec with negative sampling is implicitly factorizing a shifted PMI matrix. The methods are more similar than they appear[5].


4. FastText: solving the out-of-vocabulary problem

Bojanowski et al. (2017)[6] at Facebook extended Word2Vec by representing words as bags of character n-grams:

```text
Word: "where" (with n = 3..6)
N-grams: {"<wh", "whe", "her", "ere", "re>", "<whe", "wher", "here", "ere>",
          "<wher", "where", "here>", "<where", "where>", "<where>"}

Embedding("where") = Ξ£ embedding(ngram) for each ngram
```

The killer feature: FastText can compute embeddings for OOV (Out-of-Vocabulary) words, words it has never seen, by summing the vectors of their subword n-grams. This is critical for:

  • β€’Morphologically rich languages (Turkish, Finnish, Hungarian): "evlerimizdekilerden" β†’ meaningful embedding from subparts
  • β€’Typos and misspellings: "recieve" gets a reasonable embedding from shared n-grams with "receive"
  • β€’Neologisms and slang: novel words get useful representations

🎯 Still useful today: FastText embeddings remain the go-to for resource-constrained settings, feature engineering for tabular ML, and simple similarity search where a full transformer is overkill.
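A sketch of the n-gram extraction (the helper name is illustrative), showing why a typo still lands near the correct spelling:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary markers, plus the full word."""
    w = f"<{word}>"
    grams = {w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)  # FastText also keeps the whole word as a special token
    return grams

# A typo shares many of its n-grams with the correct spelling,
# so the summed subword embeddings stay close
a, b = char_ngrams("receive"), char_ngrams("recieve")
overlap = len(a & b) / len(a | b)
print(round(overlap, 2))
```

The embedding of an OOV word is then just the sum (or average) of the vectors of these n-grams, which is why FastText never has to return an "unknown word" vector.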


5. ELMo: the contextual revolution

Every method so far has a fundamental flaw: the word "bank" always gets the same set of coordinates, whether you're talking about a river bank or a savings bank. It's like giving every person named "Jordan" the same ID photo, ignoring everything about who they actually are.

Peters et al. (2018)[7] changed this with Embeddings from Language Models (ELMo), the first widely successful contextual word representations. Now words are like chameleons: they change color based on their surroundings. This was the inflection point, the end of the "one vector per word" era.

Context window visualization: the model can only attend to tokens within a fixed window, with older tokens falling out of context as the sequence grows.

Architecture

ELMo uses a 2-layer bidirectional LSTM (Long Short-Term Memory) trained as a language model:


Layer specialization

Each layer captures different linguistic information:

| Layer | What It Captures | Evidence |
|---|---|---|
| Layer 0 (token embeddings) | Surface form, character patterns | Character CNN (Convolutional Neural Network) layer |
| Layer 1 (lower LSTM) | Syntax: POS (Part-of-Speech) tags, NER (Named Entity Recognition) | Best for syntactic probing tasks |
| Layer 2 (upper LSTM) | Semantics: word sense, meaning | Best for WSD (Word Sense Disambiguation), sentiment |

The final representation is a task-specific weighted combination of all layers:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \cdot \mathbf{h}_{k,j}$$

Reading the formula: the final representation for token k is a weighted blend of all layers' outputs. The weights s_j are learned during fine-tuning, so the model can decide "for sentiment analysis, I need more of the upper (semantic) layer; for POS tagging, more of the lower (syntactic) layer." The scaling factor Ξ³ adjusts the overall magnitude.
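The weighted combination can be sketched in a few lines of numpy (the layer values and logits here are toy assumptions; in real ELMo the rows would be LSTM layer outputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def elmo_combine(layer_outputs, s_logits, gamma=1.0):
    """Task-weighted mix of per-layer representations for one token.

    layer_outputs: (L+1, d) array of the token's vector at each layer
    s_logits: (L+1,) learned (here: hypothetical) layer-weight logits
    """
    s = softmax(np.asarray(s_logits, dtype=float))   # weights sum to 1
    return gamma * (s[:, None] * layer_outputs).sum(axis=0)

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))                          # 3 layers, d = 4 (toy values)
rep = elmo_combine(H, s_logits=[0.0, 0.0, 2.0])      # a "semantic" task favoring the top layer
print(rep.shape)                                     # (4,)
```

With equal logits the combination reduces to a plain layer average; fine-tuning moves the logits away from uniform toward whichever layers help the task.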

The polysemy breakthrough

```text
"The river bank was steep and muddy."
  β†’ ELMo("bank") β‰ˆ [terrain, geography, nature...]

"I need to visit the bank to deposit a check."
  β†’ ELMo("bank") β‰ˆ [finance, money, institution...]
```

For the first time, the same word gets different vectors in different contexts. This single insight drove massive improvements across NLP benchmarks.

Limitation: The bidirectional LSTMs process left and right contexts independently, then concatenate. They don't jointly attend to both directions simultaneously. BERT would solve this.


6. BERT: truly bidirectional transformers

Devlin et al. (2019)[8] replaced LSTMs with Transformers and introduced truly bidirectional pre-training, establishing the "pre-train then fine-tune" workflow.

Pre-training objectives

Masked Language Modeling (MLM): Randomly mask 15% of tokens and predict them:

  • β€’80% replaced with [MASK]
  • β€’10% replaced with a random token (prevents the model from only learning to fill masks)
  • β€’10% kept unchanged (brings representations closer to fine-tuning, where no masks exist)
text
1Input: "The cat [MASK] on the mat" 2Target: "sat"
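A sketch of the 80/10/10 masking rule (the function name and sampling details are illustrative; real BERT operates on WordPiece token ids, not strings):

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking sketch: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (corrupted inputs, {position: original token} targets)."""
    rng = random.Random(seed)
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok                  # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # random replacement
            # else: leave the token unchanged
    return inputs, targets

tokens = "the cat sat on the mat".split()
inputs, targets = mlm_mask(tokens, vocab=tokens, seed=42)
print(inputs, targets)
```

The loss is computed only at the selected positions, which is why the targets dictionary records where the originals were.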

Next Sentence Prediction (NSP): Binary classification: are two sentences consecutive? (Later work showed NSP's contribution is minimal. RoBERTa dropped it entirely and improved results.)

Architecture

| Variant | Layers | Hidden Dim | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1,024 | 16 | 340M |

Why bidirectional self-attention matters

Unlike ELMo's concatenated left/right contexts, BERT's self-attention jointly conditions on the full context in every layer:

```text
"I accessed my bank account"
β†’ Attention: "bank" attends to BOTH "accessed" AND "account"
β†’ Result: financial meaning

"The river bank was steep"
β†’ Attention: "bank" attends to BOTH "river" AND "steep"
β†’ Result: geographical meaning
```

This is fundamentally more powerful than ELMo's approach of processing left and right independently, then concatenating.

πŸ’‘ A mystery-novel analogy: ELMo is like two detectives investigating a crime scene separately. One reads the clues left-to-right, the other right-to-left, and they compare notes at the end. BERT is like a single detective who can look at all the clues simultaneously, spotting connections the two separate investigators would miss. When "bank" appears between "river" and "account," BERT sees both context words interacting with each other through "bank." ELMo never gets that cross-directional evidence.
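The structural difference between the two attention patterns reduces to a mask. A numpy sketch (real models apply this mask inside scaled dot-product attention; the helper name is illustrative):

```python
import numpy as np

def attention_mask(seq_len, causal):
    """1 = position may be attended to. Full mask: BERT-style bidirectional
    attention. Lower-triangular mask: GPT-style causal attention."""
    m = np.ones((seq_len, seq_len))
    return np.tril(m) if causal else m

full = attention_mask(4, causal=False)    # every token sees every other token
causal = attention_mask(4, causal=True)   # token t sees only positions <= t
print(causal)
```

Row t of the causal mask zeroes out all future positions, which is exactly why a GPT representation of "bank" can never use a disambiguating word that appears later in the sentence.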

The fine-tuning approach

BERT established a workflow that dominated NLP for years:

  1. Pre-train on a massive unlabeled corpus (BooksCorpus + Wikipedia, ~3.3B words)
  2. Add a task-specific head (classification layer, span extraction, etc.)
  3. Fine-tune all parameters on labeled downstream data
  4. Result: state-of-the-art on 11 NLP tasks simultaneously at release

7. GPT: autoregressive representations at scale

Radford et al. (2018, 2019)[9] at OpenAI took the opposite approach: unidirectional (left-to-right) transformer pre-training.

The autoregressive objective

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

Reading the formula: for each position in the text, the model tries to predict the next word using only the words that came before it. The loss sums up how wrong those predictions were (via negative log-probability). Better predictions = lower loss. This "predict the next word" game is all GPT needs to learn language.
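The objective is a few lines of numpy (toy logits; real models compute this over vocabularies of tens of thousands of tokens):

```python
import numpy as np

def next_token_nll(logits, targets):
    """Average negative log-likelihood of the true next tokens.

    logits: (T, V) unnormalized scores at each position
    targets: (T,) index of the actual next token at each position
    """
    logits = np.asarray(logits, dtype=float)
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# A model that puts most of its mass on the right token scores a low loss;
# a clueless (uniform) model scores log(V)
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(next_token_nll(confident, [0, 1]), next_token_nll(uniform, [0, 1]))
```

Exponentiating this average NLL gives perplexity, the standard intrinsic metric for autoregressive language models.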

BERT vs GPT: the key tradeoff

| Aspect | BERT | GPT |
|---|---|---|
| Attention | Bidirectional (full context) | Causal (left-to-right only) |
| Pre-training | MLM + NSP | Next-token prediction |
| Best for | Understanding (NLU) | Generation (NLG) |
| Representations | Deep bidirectional context | Left context only |
| Adaptation | Fine-tuning required | In-context learning (prompting) |
| Scaling trajectory | Plateaued at ~340M | Scaled to 1T+ parameters |

Why GPT's approach won at scale

Despite BERT's bidirectional advantage for understanding tasks, GPT's autoregressive approach scaled better:

  1. Unified framework: any text task can be cast as text generation (no task-specific heads needed)
  2. In-context learning: few-shot examples in the prompt replace fine-tuning entirely (Brown et al., 2020)[10]
  3. Emergent abilities: larger models unlock new capabilities like chain-of-thought reasoning, code generation, and tool use (Wei et al., 2022)[11]
  4. Natural generation: autoregressive models naturally produce coherent multi-sentence output

πŸ’‘ Technical nuance: Saying "BERT is better because it's bidirectional" is incomplete. The correct nuance is: BERT has stronger per-token representations for understanding tasks, but GPT's autoregressive formulation enables generation and scaling properties that proved more valuable for general-purpose AI.


8. Embedding geometry: what research tells us

A frequently overlooked but critical concept is the geometry of embeddings: how they're distributed in high-dimensional space.

Anisotropy: the cone problem

Imagine you're in a dark room with a flashlight, and you scatter glow-in-the-dark balls everywhere. Ideally, they'd be spread evenly across the room so you could tell any two balls apart by their positions. But what if all the balls ended up clustered in the narrow beam of the flashlight? Suddenly, every ball looks like it's in roughly the same direction, and it's hard to distinguish them.

Ethayarajh (2019)[3] at Stanford discovered that this is exactly what happens with embeddings. In BERT, ELMo, and GPT-2, word embeddings are anisotropic: they occupy a narrow cone in the embedding space rather than being uniformly distributed.


Key findings:

  • β€’On average, less than 5% of the variance in a word's contextualized representations can be explained by a static embedding, meaning contextual representations are highly context-dependent
  • β€’Upper layers produce more context-specific representations than lower layers
  • β€’Lower-layer BERT representations, when reduced to their first principal component, actually outperform GloVe and FastText on standard static embedding benchmarks

πŸ”¬ Why this matters: Anisotropy affects how cosine similarity behaves. Two random words might have surprisingly high cosine similarity simply because all vectors point in roughly the same direction, not because the words are semantically related. This motivates techniques like whitening and isotropy calibration in production systems.
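A quick simulation of the effect (the synthetic "common direction" is an assumption used to mimic anisotropy; mean-centering stands in for full whitening):

```python
import numpy as np

def mean_cosine(X):
    """Average pairwise cosine similarity of the row vectors of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = len(X)
    return (S.sum() - n) / (n * (n - 1))   # exclude self-similarity

rng = np.random.default_rng(0)
# Simulate an anisotropic space: every vector shares a large common direction
common = rng.normal(size=64)
X = common + 0.3 * rng.normal(size=(200, 64))

before = mean_cosine(X)                    # high: random pairs look "similar"
after = mean_cosine(X - X.mean(axis=0))    # mean-centering removes the cone
print(round(before, 2), round(after, 2))
```

Before centering, two unrelated vectors have high cosine similarity purely because of the shared direction; after centering, similarity reflects the actual per-vector differences. This is the intuition behind whitening-style post-processing of embeddings.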


The complete evolution at a glance

| Method | Year | Team | Context | Key Innovation | Still Used? |
|---|---|---|---|---|---|
| TF-IDF | 1970s | Various | Static | Frequency weighting | βœ… Search engines |
| LSA | 1990 | Deerwester et al. | Static | SVD on term-document matrix | Rarely |
| Word2Vec | 2013 | Mikolov et al. (Google) | Static | Neural prediction, negative sampling | βœ… Features, baselines |
| GloVe | 2014 | Pennington et al. (Stanford) | Static | Co-occurrence matrix factorization | βœ… Features, baselines |
| FastText | 2017 | Bojanowski et al. (Meta) | Static | Subword n-gram composition | βœ… Low-resource, OOV |
| ELMo | 2018 | Peters et al. (Allen AI) | Dynamic | Bidirectional LSTM LM features | Rarely |
| BERT | 2019 | Devlin et al. (Google) | Dynamic | MLM + bidirectional Transformer | βœ… Encoders, retrieval |
| GPT | 2018+ | Radford et al. (OpenAI) | Dynamic | Autoregressive Transformer at scale | βœ… All LLMs |

When to use what (practical guide)

A strong engineer knows when not to use the latest model:

| Scenario | Best Choice | Why |
|---|---|---|
| Simple text similarity search | FastText or Word2Vec | Fast, no GPU needed, good enough |
| Feature engineering for tabular ML | GloVe/FastText average | Dense features from text columns |
| Morphologically rich language | FastText | Handles OOV via subword composition |
| Semantic search / retrieval | Fine-tuned BERT (bi-encoder) | Context-aware, high quality |
| Text classification | Fine-tuned BERT / RoBERTa | Best accuracy for classification |
| Text generation | GPT / decoder model | Autoregressive = natural generation |
| Production with compute constraints | DistilBERT | ~97% of BERT's performance at 60% of the size |
| Research prototype | Full BERT-large / GPT | Maximum quality, cost not a concern |

Key takeaways

  • β€’Trace the evolution: Understand the progression from count-based (statistics) β†’ prediction-based (local context) β†’ contextual (full sentence awareness).
  • β€’Technical distinctions matter: Know the difference between Word2Vec's local window, GloVe's global matrix factorization, and BERT's bidirectional attention.
  • β€’Context is king: ELMo proved that one vector per word is insufficient; BERT proved that bidirectional attention beats concatenated LSTMs.
  • β€’Scale wins: GPT showed that autoregressive objectives, while locally less powerful than bidirectional ones, scale better to general reasoning tasks.
  • β€’Geometry awareness: Remember that embedding spaces are often anisotropic (cone-shaped), which distorts cosine similarity.

Common misconceptions

  • β€’"BERT is always better." False. Static embeddings are faster, cheaper, and sufficient for many simple similarity or feature engineering tasks.
  • β€’"ELMo and BERT are both bidirectional in the same way." False. ELMo concatenates independent left/right LSTMs; BERT jointly attends to all tokens.
  • β€’"GPT is strictly worse than BERT for understanding." False. While BERT is more sample-efficient for understanding, scaled-up GPT models perform well on NLU tasks via few-shot prompting.
  • β€’"King - Man + Woman = Queen always works." False. This is a cherry-picked example. Many analogies fail, and the geometry is often distorted.
  • β€’"High dimensionality is always better." False. Diminishing returns set in (often around 300d for static, 768d+ for contextual), and higher dimensions increase compute cost.

Summary

  1. Start with the problem: Words need numerical representations for neural networks to process them.
  2. Milestones: Count-based β†’ Word2Vec β†’ GloVe β†’ FastText β†’ ELMo β†’ BERT β†’ GPT.
  3. Key insights: Co-occurrence β†’ prediction β†’ subwords β†’ context β†’ attention β†’ scale.
  4. Modern state: Today's LLMs use contextual embeddings computed by transformer layers, but static embeddings remain a vital tool for efficiency.

Evaluation Rubric

  1. Articulates the distributional hypothesis as the foundation for all embeddings
  2. Clearly explains Word2Vec training (skip-gram vs CBOW) with negative sampling details
  3. Describes GloVe's co-occurrence ratio insight and how it bridges count-based and neural methods
  4. Identifies FastText's key advantage: OOV handling via subword n-grams
  5. Explains ELMo as the first contextual model, with layer-specific syntax/semantics specialization
  6. Distinguishes ELMo's concatenated bidirectionality from BERT's joint self-attention
  7. Contrasts BERT (understanding) and GPT (generation) with nuanced scaling discussion
  8. Demonstrates practical judgment about when to use static vs contextual embeddings
Common Pitfalls
  • Claiming BERT is 'always better' without acknowledging static embedding use cases
  • Confusing ELMo's concatenated bidirectionality with BERT's joint bidirectionality
  • Not understanding why GPT's unidirectional approach scales better than BERT
  • Treating the king–queen analogy as a reliable, general property rather than a cherry-picked example
  • Forgetting the OOV problem and how FastText/subword tokenization solved it
  • Not knowing the distributional hypothesis underlying all embedding methods
  • Ignoring embedding geometry: anisotropy and its impact on cosine similarity

Key Concepts Tested

  • Distributional hypothesis and its operationalization
  • Word2Vec CBOW and Skip-gram architectures
  • Negative sampling and subsampling tricks
  • GloVe co-occurrence ratio insight and matrix factorization
  • FastText subword composition for OOV handling
  • ELMo layer specialization (syntax vs semantics)
  • BERT MLM objective and joint bidirectional attention
  • GPT autoregressive objective and scaling advantages
  • Embedding geometry and anisotropy
  • Static vs contextual embedding tradeoffs
References

  1. Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS.
  2. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
  3. Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.
  4. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
  5. Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization.
  6. Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information.
  7. Peters, M., et al. (2018). Deep Contextualized Word Representations. NAACL 2018.
  8. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
  9. Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
  10. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
  11. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.
