Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.
How does a computer understand that "king" and "queen" are related, or that "Paris" is to "France" what "Tokyo" is to "Japan"? It can't read a dictionary. It needs numbers. The solution is to give every word a set of coordinates, like pinning cities on a map. Words with similar meanings end up close together, and the directions between them encode relationships. These numerical coordinates are called embeddings, and they're the foundation of how every language model (from GPT-5.4 to Gemini 3.1 Pro) understands meaning.
This article explores the full journey: from early counting methods through Word2Vec's breakthrough to today's context-aware representations in BERT and GPT, where the same word gets different coordinates depending on its surroundings.
💡 Key insight: Understanding embeddings means grasping the core idea behind representation learning: that a word's meaning comes from the company it keeps, that static coordinates evolved into dynamic ones, and the practical tradeoffs involved. This knowledge is foundational to every modern NLP (Natural Language Processing) and LLM (Large Language Model) system. See also: how text becomes tokens before it becomes embeddings, and sentence embeddings for scaling this idea to whole passages.
Before any algorithm, understand the core insight that makes all word embeddings possible:
"You shall know a word by the company it keeps." (J.R. Firth, 1957)[1]
This linguistic intuition has deep roots in information theory. Shannon (1948)[2] formalized the idea that the next symbol in a sequence can be predicted from its preceding context, establishing the mathematical foundation for measuring information content in language. The distributional hypothesis operationalizes this: if two words are interchangeable in the same contexts without changing the information content, they must carry similar meaning.
Think of it this way: if you moved to a new city and saw a store you'd never heard of, you could guess what it sells by looking at its neighbors. A store between a bakery and a deli is probably a food shop. A store between a nail salon and a hair salon is probably a beauty shop. You didn't need to go inside. The neighborhood told you everything.
Words work the same way. The word "espresso" tends to appear near "latte," "barista," and "café," so a computer can figure out it's a coffee-related word just by tracking those patterns. This is the distributional hypothesis: words that appear in similar contexts have similar meanings. Every embedding method (from TF-IDF to GPT) is at its core an operationalization of this idea.
To see this in practice, consider a simple fill-in-the-blank sentence. The words that naturally fit into the blank will develop similar embeddings because they share the same surrounding context:
```text
Context: "The ___ sat on the mat"

Words that fit: cat, dog, child, kitten, puppy
→ These words should have similar embeddings

Words that don't fit: democracy, algorithm, quantum
→ These should have very different embeddings
```
The evolution of embeddings is the story of increasingly powerful ways to capture this distributional information:
Before neural networks dominated NLP, words were represented using large, sparse matrices of co-occurrence statistics. The goal was to mathematically capture the distributional hypothesis by counting how often words appeared near each other.
TF-IDF (Term Frequency-Inverse Document Frequency) was one of the earliest and most resilient approaches. It represents documents as sparse vectors of word counts, penalizing words that appear frequently across all documents (like "the" or "is") while highlighting rare, discriminative words. While simple, it remains highly effective for basic information retrieval and search engines today.
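To make the mechanics concrete, here is a minimal sketch of TF-IDF on a toy corpus. It uses a smoothed IDF variant; real implementations differ in smoothing and normalization details:

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute TF-IDF vectors for tokenized documents (smoothed IDF variant)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {term: (count / len(doc)) * math.log((1 + n_docs) / (1 + df[term]))
               for term, count in tf.items()}
        vectors.append(vec)
    return vectors

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
vecs = tfidf(docs)
# "the" appears in every document, so its IDF (and TF-IDF weight) collapses to 0,
# while rarer, discriminative words like "cat" keep a positive weight
```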
PMI (Pointwise Mutual Information) extended this by measuring the statistical dependence between two words. If "coffee" and "cup" appear together much more often than their independent probabilities would suggest, their PMI is high. A Word-Context Matrix populated with PMI values is a direct precursor to modern embeddings.
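A toy sketch of that dependence measure, estimating PMI from document-level co-occurrence (window-based counting over a large corpus is more common in practice):

```python
import math
from collections import Counter
from itertools import combinations

def pmi(docs: list[list[str]]):
    """Return a scorer for PMI(x, y) = log[P(x, y) / (P(x) P(y))]."""
    n = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = set(doc)
        word_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))

    def score(x: str, y: str) -> float:
        p_x, p_y = word_df[x] / n, word_df[y] / n
        p_xy = pair_df[frozenset((x, y))] / n
        return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

    return score

docs = [["coffee", "cup"], ["coffee", "cup"], ["tea", "pot"], ["tea", "pot"]]
score = pmi(docs)
# "coffee" and "cup" co-occur more than chance predicts → positive PMI;
# "coffee" and "pot" never co-occur → PMI of -inf in this tiny sample
```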
LSA (Latent Semantic Analysis), introduced by Deerwester et al. (1990)[3], took the critical next step: dimensionality reduction. By applying Singular Value Decomposition (SVD) to the sparse term-document matrix, LSA compressed the vocabulary into a dense, lower-dimensional space (e.g., 300 dimensions). This was revolutionary because it captured latent semantics: words that never co-occurred in the same document could still end up with similar vectors if they shared similar neighbors.
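A small sketch of the LSA idea, assuming a hand-built term-document count matrix: truncated SVD gives dense term vectors, and words that share documents end up close even in this tiny example:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents
terms = ["cat", "kitten", "dog", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],   # cat
    [1, 2, 0, 0],   # kitten
    [1, 1, 0, 0],   # dog
    [0, 0, 2, 1],   # stock
    [0, 0, 1, 2],   # market
], dtype=float)

# LSA: keep only the top-k singular directions of the matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]   # dense k-dimensional term embeddings

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" and "kitten" share documents → similar LSA vectors;
# "cat" and "stock" never co-occur → near-orthogonal vectors
sim_cat_kitten = cos(term_vecs[0], term_vecs[1])
sim_cat_stock = cos(term_vecs[0], term_vecs[3])
```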
These methods produce a single vector per word regardless of context. The word "bank" has the same representation whether it means a financial institution or a riverbank. This is the polysemy problem, and solving it would drive the next decade of research.
Mikolov et al. (2013)[4] introduced two neural architectures that learn dense word vectors by predicting words from their local context. This was the breakthrough that launched modern NLP embeddings.
Predicts the center word from surrounding context:
Given the surrounding words (the "context window" of a fixed size, typically 5–10 words on each side), predict the word in the middle. For example, given "the ___ sat on", predict "cat."
Predicts context words from the center word (inverse of CBOW):
This is the reverse. Given the center word "cat", predict which words likely surround it ("the", "sat", "on"). Skip-gram tends to work better for rare words because each word gets more training updates.
💡 Key insight: CBOW is faster and better for frequent words. Skip-gram is slower but captures rare words better (each rare word generates multiple training examples as a center word, and the model updates its vector more frequently)[4].
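Both architectures train on (center, context) pairs extracted with a sliding window. A minimal sketch of that pair extraction (the skip-gram view, where each center word predicts each of its neighbors):

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Generate (center, context) training pairs with a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=1)
# e.g. ("cat", "the") and ("cat", "sat") — each word predicts its neighbors
```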
Here's the problem: to learn good embeddings, the model needs to check its prediction against every word in the vocabulary, often 100,000+ words. That's like grading a multiple-choice test with 100,000 answer options for every single question. Two major solutions emerged:
Hierarchical softmax (Morin & Bengio, 2005)[5] organizes the vocabulary as a binary Huffman tree. Instead of computing a softmax over all words, the model makes a series of binary left/right decisions traversing the tree from root to leaf. This reduces the per-prediction cost from O(V) to O(log V), where V is the vocabulary size. Frequent words get shorter paths (fewer decisions), making the overall computation efficient. However, hierarchical softmax struggles with rare words because their longer paths accumulate more gradient noise, and the tree structure imposes a fixed hierarchy that may not reflect semantic relationships.
Negative sampling (Mikolov et al., 2013)[4] took a different approach that ultimately won out for large-scale training. Instead of comparing against all words, the model picks a handful of random "wrong" answers (negatives) and learns to tell the difference between the real context word and those few imposters. It's like a flashcard game: "Is 'cat' a real neighbor of 'sat'?" (yes ✅) "Is 'democracy' a real neighbor of 'sat'?" (no ❌). This tiny binary quiz is dramatically cheaper while still teaching the model which words belong together. Negative sampling eventually dominated because it provides more uniform gradient quality across rare and frequent words, and its O(k) cost (where k is typically 5–15) beats even logarithmic scaling at large vocabulary sizes.
The code below demonstrates how to compute this negative sampling loss for a single word pair. It takes the embeddings of the center word, the actual context word, and a set of random negative samples. The loss uses log-sigmoid for numerical stability, encouraging high similarity between the center and true context word while pushing similarity to negative samples toward zero:
```python
import numpy as np

def skip_gram_loss(center_vec: np.ndarray,
                   context_vec: np.ndarray,
                   neg_samples: np.ndarray) -> float:
    """Computes the negative sampling loss for a single skip-gram pair.

    Uses np.logaddexp for a numerically stable log-sigmoid:
      - Positive: -log(sigmoid(score))     = log(1 + exp(-score))
      - Negative: -log(1 - sigmoid(score)) = log(1 + exp(score))

    Args:
        center_vec:  Embedding of the center word, shape (d,)
        context_vec: Embedding of the true context word, shape (d,)
        neg_samples: Embeddings of k negative samples, shape (k, d)

    Returns:
        Scalar loss value.
    """
    # Positive pair: maximize similarity to the true context word
    pos_score = np.dot(context_vec, center_vec)
    pos_loss = np.logaddexp(0.0, -pos_score)      # = -log(sigmoid(pos_score))

    # Negative pairs: minimize similarity to randomly sampled words
    neg_scores = neg_samples @ center_vec          # shape (k,)
    neg_loss = np.sum(np.logaddexp(0.0, neg_scores))

    return float(pos_loss + neg_loss)
```
Negative words are sampled proportional to f(w)^(3/4), where f(w) is the word's corpus frequency. The 3/4 exponent gives rare words slightly more chance of being selected as negatives, improving their representations.
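A small sketch of this frequency-smoothed sampler (illustrative helper names, not from the Word2Vec codebase):

```python
import numpy as np

def make_negative_sampler(freqs: dict[str, int], power: float = 0.75):
    """Build a sampler drawing negatives proportional to f(w)^power."""
    words = list(freqs)
    probs = np.array([freqs[w] ** power for w in words], dtype=float)
    probs /= probs.sum()
    rng = np.random.default_rng(0)

    def sample(k: int) -> list[str]:
        return list(rng.choice(words, size=k, p=probs))

    return sample, dict(zip(words, probs))

freqs = {"the": 1000, "cat": 50, "sat": 40, "quantum": 2}
sample, probs = make_negative_sampler(freqs)
negatives = sample(5)
# The 3/4 power flattens the distribution: "the" is 500x more frequent than
# "quantum" but only ~500^0.75 ≈ 106x more likely to be drawn as a negative
```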
Word2Vec's most striking property is that semantic relationships are captured as linear directions in the embedding space. By performing arithmetic operations on these vectors, we can mathematically traverse conceptual relationships, combining related terms to produce a conceptually analogous result:
```text
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walking + swimming ≈ swam
```
This works because regularities in the training data create consistent vector offsets. The "gender direction" (man → woman) is roughly the same vector regardless of the word pair.
🔬 Research insight: Ethayarajh et al. (2019)[6] found that the king-queen analogy is somewhat cherry-picked. Many analogies don't work cleanly, and the cosine similarity nearest neighbor often isn't the "correct" answer. Understanding this subtlety distinguishes deep expertise from surface-level knowledge.
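Despite those caveats, the arithmetic itself is easy to demonstrate. A toy sketch on hand-built 2-D vectors (real embeddings have hundreds of dimensions; following common practice, the query words are excluded from the candidate set):

```python
import numpy as np

def analogy(emb: dict[str, np.ndarray], a: str, b: str, c: str) -> str:
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = emb[b] - emb[a] + emb[c]

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Standard convention: exclude the three input words from the search
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

# Hand-built 2-D toy vectors: axis 0 ≈ "royalty", axis 1 ≈ "gender"
emb = {
    "king":   np.array([0.9, 0.9]),
    "queen":  np.array([0.9, 0.1]),
    "man":    np.array([0.1, 0.9]),
    "woman":  np.array([0.1, 0.1]),
    "banana": np.array([0.2, 0.8]),
}
result = analogy(emb, "man", "king", "woman")   # king - man + woman
# → "queen": the gender offset (man → woman) transfers to the royalty pair
```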
| Hyperparameter | Typical Value | Impact |
|---|---|---|
| Embedding dimension | 100-300 | 300 is standard; diminishing returns beyond |
| Context window | 5-10 | Larger → more topical; smaller → more syntactic |
| Negative samples (k) | 5-15 | More negatives = better for rare words |
| Min word count | 5 | Filters out very rare words |
| Subsampling threshold | 1e-5 | Drops frequent words like "the", "a" |
Pennington, Socher & Manning (2014)[7] at Stanford combined the best of count-based and prediction-based methods.
Here's a clever detective trick that GloVe exploits. Say you're trying to figure out the difference between "ice" and "steam." Looking at which words appear near each one individually isn't very helpful, since both appear near "water." But if you look at the ratio, the pattern jumps out: the word "solid" appears 100× more often with "ice" than with "steam," while "gas" appears 100× more with "steam." Neutral words like "water" appear equally with both, giving a ratio near 1. It's like comparing two suspects by looking at who they hang out with relative to each other.
The key insight is that ratios of co-occurrence probabilities, not raw probabilities, capture semantic relationships:
| Word | P(·|ice) | P(·|steam) | Ratio |
|---|---|---|---|
| solid | High | Low | Large → associated with ice |
| gas | Low | High | Small → associated with steam |
| water | High | High | ≈ 1 → neutral |
| random | Low | Low | ≈ 1 → neutral |
GloVe is fundamentally a log-bilinear model: it assumes that the dot product of two word vectors should be a bilinear function of the log co-occurrence statistic. The key derivation starts from the observation that co-occurrence probability ratios encode meaning (as shown in the table above). Pennington et al. (2014)[7] showed that requiring vector differences to predict these ratios leads directly to a least-squares objective on the log counts:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

For every pair of words in the vocabulary, the loss penalizes the difference between (a) the dot product of their learned vectors and (b) the log of how often they actually co-occur in the corpus. The weighting function f(X_ij) prevents extremely common pairs (like "the, of") from overwhelming the training. The bias terms b_i and b̃_j absorb the log-frequency of individual words, ensuring the dot product captures only the interaction between word pairs. The result: vectors whose geometry directly encodes co-occurrence statistics.

Where:

- X_ij — the number of times word j occurs in the context of word i
- w_i, w̃_j — the word vector and the (separate) context word vector
- b_i, b̃_j — scalar bias terms
- f(x) — the weighting function: f(x) = (x / x_max)^α for x < x_max, and 1 otherwise
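The per-pair objective can be sketched in a few lines (my own illustrative implementation, not the reference code), using the paper's standard settings x_max = 100 and α = 3/4:

```python
import numpy as np

def glove_pair_loss(w_i: np.ndarray, w_j: np.ndarray,
                    b_i: float, b_j: float, x_ij: float,
                    x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Weighted least-squares loss for one co-occurrence pair.

    The weight f(x) caps the influence of very frequent pairs like ("the", "of").
    """
    f = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    err = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return float(f * err ** 2)

rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
loss = glove_pair_loss(w_i, w_j, 0.0, 0.0, x_ij=42.0)
# A rare pair (x_ij = 2) gets weight (2/100)^0.75 ≈ 0.053; a pair at or above
# x_max gets weight 1.0, so no single frequent pair dominates training
```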
| Aspect | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local context windows | Global co-occurrence matrix |
| Objective | Predictive (softmax/Negative Sampling) | Least squares on log co-occurrence |
| Theoretical basis | Distributional hypothesis via prediction | Explicit matrix factorization |
| Performance | Strong on analogy tasks | Competitive; often slightly better on similarity |
| Training | Online (stochastic) | Batch (needs full co-occurrence matrix) |
🔬 Research insight: Levy & Goldberg (2014) showed that Word2Vec with negative sampling is implicitly factorizing a shifted PMI (Pointwise Mutual Information) matrix. The methods are more similar than they appear[8].
Bojanowski et al. (2017)[9] at Facebook extended Word2Vec by representing words as bags of character n-grams (a distinct approach from the subword tokenization used by BERT's WordPiece). Instead of learning a single vector for a word, the model learns vectors for its character-level component parts and sums them together. The following example demonstrates this breakdown, showing how an input word is split into constituent overlapping n-grams, producing a final embedding that is the sum of those subword components:
```text
Word: "where" (n-grams with n = 3..6, using boundary markers < and >)

n=3: <wh, whe, her, ere, re>
n=4: <whe, wher, here, ere>
n=5: <wher, where, here>
n=6: <where, where>
plus the full word "<where>" itself

Embedding("where") = Σ embedding(ngram) over all of its n-grams
```
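The n-gram extraction step is simple to sketch. Boundary markers let the model distinguish a prefix like "<wh" from the same characters mid-word:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Extract FastText-style character n-grams with boundary markers < and >."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    grams.append(marked)   # the full (marked) word is kept as its own feature
    return grams

grams = char_ngrams("where")
# "<where>" has 7 characters → 5 trigrams ("<wh", "whe", "her", "ere", "re>"),
# 4 four-grams, 3 five-grams, 2 six-grams, plus the full word: 15 features total
```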
Because FastText shares representations at the subword level, it allows the model to generalize across words with similar roots or affixes. This parameter sharing also means it can achieve broad vocabulary coverage efficiently, as related words reuse the same underlying n-gram vectors.
A naive implementation would need to store a separate embedding vector for every unique n-gram, which can reach tens of millions of entries. FastText manages this via the hashing trick: all n-grams are hashed into a fixed-size bucket table (default: 2M buckets). Multiple n-grams may collide into the same bucket, sharing a single vector. This trades a small amount of precision for dramatic memory savings, keeping the model's memory footprint bounded regardless of n-gram diversity. In practice, the collision rate is low enough that embedding quality degrades minimally.
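A toy sketch of the hashing trick. FastText's real hash is an FNV-1a variant over a 2M-bucket table; the version below uses plain 32-bit FNV-1a and a tiny demo table to show the mechanics:

```python
import numpy as np

N_BUCKETS = 4096   # demo size only; FastText's default is 2,000,000 buckets
DIM = 8

def fnv1a(s: str) -> int:
    """32-bit FNV-1a string hash."""
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

# One fixed-size table holds every n-gram vector; colliding n-grams share a row
table = np.random.default_rng(0).normal(size=(N_BUCKETS, DIM)).astype(np.float32)

def ngram_vector(ngram: str) -> np.ndarray:
    """Look up an n-gram's vector by hash bucket."""
    return table[fnv1a(ngram) % N_BUCKETS]

v = ngram_vector("<wh")
# Any string maps to some bucket, so memory stays bounded no matter how many
# distinct n-grams the corpus contains
```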
🎯 Production tip: FastText can compute embeddings for OOV (Out-of-Vocabulary) words (words it has never seen) by summing the vectors of their subword n-grams. This is critical for:
- Morphologically rich languages (Turkish, Finnish, Hungarian): `"evlerimizdekilerden"` → meaningful embedding from subparts
- Typos and misspellings: `"recieve"` gets a reasonable embedding from shared n-grams with `"receive"`
- Neologisms and slang: novel words get useful representations
Every method so far has a fundamental flaw: the word "bank" always gets the same set of coordinates, whether you're talking about a river bank or a savings bank. It's like giving every person named "Jordan" the same ID photo, ignoring everything about who they actually are.
Peters et al. (2018)[10] changed this with Embeddings from Language Models (ELMo), the first widely successful contextual word representations. Now words are like chameleons: they change color based on their surroundings. This was the inflection point, the end of the "one vector per word" era.
ELMo uses a 2-layer bidirectional LSTM (Long Short-Term Memory) trained as a language model, jointly maximizing the log-likelihood of each token under both a forward and a backward model:

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_N) \right)$$
Each layer captures different linguistic information. Peters et al. (2018)[10] demonstrated this via probing tasks: training simple classifiers on frozen layer outputs to test what each layer has learned. The results show a clear hierarchy from surface-level features to abstract semantics:
| Layer | What It Captures | Probing Evidence |
|---|---|---|
| Layer 0 (Token embeddings) | Surface form, character patterns | Character CNN captures morphology; best at word shape classification |
| Layer 1 (Lower LSTM) | Syntax: POS tags, dependency relations | 97% POS tagging accuracy from Layer 1 alone; outperforms Layer 2 on syntactic tasks by ~3% |
| Layer 2 (Upper LSTM) | Semantics: word sense, coreference | Best for WSD (Word Sense Disambiguation) and sentiment; captures long-range semantic dependencies |
The final representation is a task-specific weighted combination of all layers:

$$\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, \mathbf{h}_{k,j}$$

The final representation for token k is a weighted blend of all layers' outputs h_{k,j}. The softmax-normalized weights s_j are learned during fine-tuning, so the model can decide "for sentiment analysis, I need more of the upper (semantic) layer; for POS tagging, more of the lower (syntactic) layer." The scaling factor γ adjusts the overall magnitude.
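This layer mixing is easy to sketch (illustrative names, not from any ELMo codebase):

```python
import numpy as np

def elmo_combine(layers: np.ndarray, s_logits: np.ndarray,
                 gamma: float) -> np.ndarray:
    """Task-specific mix: gamma * sum_j softmax(s)_j * h_{k,j}.

    layers:   (L, d) hidden states for one token, one row per layer
    s_logits: (L,) learned scalars, softmax-normalized into layer weights
    """
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                               # softmax over layers
    return gamma * (s[:, None] * layers).sum(axis=0)

layers = np.stack([1.0 * np.ones(4), 2.0 * np.ones(4), 3.0 * np.ones(4)])
vec = elmo_combine(layers, s_logits=np.zeros(3), gamma=1.0)
# Equal logits → equal weights → the blend is the plain layer average (all 2s);
# fine-tuning would shift the logits toward whichever layer helps the task
```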
By calculating a vector dynamically based on the surrounding sentence, ELMo finally solved the long-standing polysemy problem. As demonstrated below, identical words generate distinctly different representations depending on the meaning implied by the adjacent context:
text1"The river bank was steep and muddy." 2 → ELMo("bank") ≈ [terrain, geography, nature...] 3 4"I need to visit the bank to deposit a check." 5 → ELMo("bank") ≈ [finance, money, institution...]
For the first time, the same word gets different vectors in different contexts. This single insight drove massive improvements across NLP benchmarks.
The bidirectional LSTMs process left and right contexts independently, then concatenate. They don't jointly attend to both directions simultaneously. BERT would solve this.
Devlin et al. (2019)[11] replaced LSTMs with Transformers and introduced truly bidirectional pre-training, which perfected and popularized the "pre-train then fine-tune" workflow.
Randomly mask 15% of tokens and predict them using the 80/10/10 strategy, which was carefully designed to address the pre-training/fine-tuning discrepancy:
- 80% of the selected tokens are replaced with `[MASK]` (the primary training signal)
- 10% are replaced with a random token (not `[MASK]`; this forces robustness to corrupted input)
- 10% are left unchanged (fine-tuning inputs contain no `[MASK]` tokens, so keeping some prediction targets unchanged during pre-training forces the model to build good representations even for unmasked positions; without this, the model would learn to "pay attention" only when it sees `[MASK]`, degrading downstream performance)

Here's an example of what the model sees during MLM training. It takes an input sentence with a randomly masked token and trains the model to predict the original hidden word as its target output:
```text
Input:  "The cat [MASK] on the mat"
Target: "sat"
```
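A sketch of the 80/10/10 corruption step, assuming whole-word tokens rather than WordPiece pieces for readability:

```python
import random

def mask_tokens(tokens: list[str], vocab: list[str],
                mask_prob: float = 0.15,
                seed: int = 0) -> tuple[list[str], dict[int, str]]:
    """Apply BERT-style 80/10/10 corruption; return (corrupted, targets)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                           # ~85% of positions untouched
        targets[i] = tok                       # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with random token
        # else: 10% keep the token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, vocab=["dog", "ran", "tree"])
```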
Next Sentence Prediction (NSP) is a binary classification task: are two sentences consecutive in the original text? (Later work showed NSP's contribution is minimal; RoBERTa dropped it entirely and improved results.)
The BERT architecture processes input tokens through an embedding layer that combines token, segment, and position information. This combined input is then passed through a deep stack of bidirectional transformer encoders to produce the final contextual representations:
| Variant | Layers | Hidden Dim | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1,024 | 16 | 340M |
Unlike ELMo's concatenated left/right contexts, BERT's self-attention jointly conditions on the full context in every layer. The following example illustrates this difference. The model takes a full sentence and allows the target word to simultaneously attend to both preceding and succeeding tokens, resulting in a single, deeply contextualized representation:
text1"I accessed my bank account" 2→ Attention: "bank" attends to BOTH "accessed" AND "account" 3→ Result: financial meaning 4 5"The river bank was steep" 6→ Attention: "bank" attends to BOTH "river" AND "steep" 7→ Result: geographical meaning
This is fundamentally more powerful than ELMo's approach of processing left and right independently, then concatenating.
💡 Key insight: An analogy from reading a mystery novel: ELMo is like two detectives investigating a crime scene separately. One reads the clues left-to-right, the other right-to-left, and they compare notes at the end. BERT is like a single detective who can look at all the clues simultaneously, spotting connections the two separate investigators would miss. When "bank" appears between "river" and "account," BERT sees both context words interacting with each other through "bank." ELMo never gets that cross-directional evidence.
BERT established the workflow that dominated NLP for years: pre-train once on massive unlabeled text, then fine-tune on each downstream task.
Radford et al. (2018, 2019)[12] at OpenAI took the opposite approach: unidirectional (left-to-right) transformer pre-training.
For each position in the text, the model tries to predict the next word using only the words that came before it. The loss sums up how wrong those predictions were (via negative log-probability). Better predictions = lower loss. This "predict the next word" game is all GPT needs to learn language.
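In symbols, this next-word game is the standard autoregressive log-likelihood objective:

```latex
\mathcal{L} = -\sum_{t=1}^{T} \log P\left(w_t \mid w_1, \ldots, w_{t-1}\right)
```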
GPT enforces its left-to-right constraint through a causal attention mask (also called a triangular mask). In the self-attention computation, a lower-triangular mask sets all attention scores for future positions to −∞ before the softmax, making their attention weights exactly zero. This prevents position i from attending to any position j > i, ensuring the model cannot "cheat" by looking at tokens it hasn't generated yet. This is what makes GPT inherently autoregressive: each token's representation is built exclusively from the tokens that precede it, which is precisely the property needed for sequential text generation.
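The masking step is easy to sketch in NumPy:

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply a lower-triangular (causal) mask, then softmax each row."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)          # future positions → -inf
    masked -= masked.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(masked)                                # exp(-inf) = 0 exactly
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))            # uniform raw scores for 4 tokens
w = causal_attention_weights(scores)
# Row i attends uniformly over positions 0..i and gives exactly 0 to the future
```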
| Aspect | BERT | GPT |
|---|---|---|
| Attention | Bidirectional (full context) | Causal (left-to-right only) |
| Pre-training | MLM + NSP | Next-token prediction |
| Best for | Natural Language Understanding (NLU) | Natural Language Generation (NLG) |
| Representations | Deep bidirectional context | Left context only |
| Adaptation | Fine-tuning required | In-context learning (prompting) |
| Scaling trajectory | Plateaued at ~340M | Scaled to 1T+ parameters |
Despite BERT's bidirectional advantage for understanding tasks, GPT's autoregressive approach scaled better.
💡 Key insight: Saying "BERT is better because it's bidirectional" is incomplete. The correct framing is: BERT has stronger per-token representations for understanding tasks, but GPT's autoregressive formulation enables generation and scaling properties that proved more valuable for general-purpose AI.
A frequently overlooked but critical concept is the geometry of embeddings: how they're distributed in high-dimensional space.
Imagine you're in a dark room with a flashlight, and you scatter glow-in-the-dark balls everywhere. Ideally, they'd be spread evenly across the room so you could tell any two balls apart by their positions. But what if all the balls ended up clustered in the narrow beam of the flashlight? Suddenly, every ball looks like it's in roughly the same direction, and it's hard to distinguish them.
Ethayarajh (2019)[6] at Stanford discovered that this is exactly what happens with embeddings. In BERT, ELMo, and GPT-2, word embeddings are anisotropic: they occupy a narrow cone in the embedding space rather than being uniformly distributed.
Gao et al. (2019)[15] traced the root cause of anisotropy to weight tying: most language models share parameters between the input embedding layer and the output prediction layer. This architectural shortcut (used for parameter efficiency) creates a degenerate solution: while optimizing the likelihood, the model pushes all embeddings into a narrow subspace. The learned embedding matrix ends up with a few dominant singular values, causing all vectors to cluster along those principal directions. This explains why even semantically unrelated words can have cosine similarity > 0.5 in raw BERT embeddings.
🎯 Production tip: Anisotropy affects how cosine similarity behaves. Two random words might have surprisingly high cosine similarity simply because all vectors point in roughly the same direction, not because the words are semantically related. This motivates techniques like whitening and isotropy calibration in production systems.
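A toy sketch of the effect and one common remedy. Synthetic vectors sharing a dominant direction all look "similar" under cosine; whitening (center, then decorrelate via SVD) restores near-orthogonality. This is illustrative, not a production recipe:

```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """Center the embeddings and map them to identity covariance via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc.T @ Xc / len(X))
    W = U @ np.diag(1.0 / np.sqrt(S + 1e-9))
    return Xc @ W

def mean_cos(X: np.ndarray) -> float:
    """Average pairwise cosine similarity over all distinct pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Simulate anisotropy: every vector shares one large common direction
common = rng.normal(size=64)
X = common + 0.1 * rng.normal(size=(1000, 64))

before, after = mean_cos(X), mean_cos(whiten(X))
# Random pairs look highly "similar" before whitening, near-orthogonal after
```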
Looking back over the decades of NLP research, the evolution of word representations follows a clear trajectory from simple frequency counts to massive neural networks. Each step in this journey was driven by the need to capture deeper semantic relationships and solve specific limitations of previous methods.
The earliest approaches, like TF-IDF and LSA, proved that simple mathematical operations on document statistics could capture a surprising amount of meaning. However, they were limited by their reliance on static matrices and struggled with capturing details like syntax or fine-grained semantic analogies.
The neural revolution fundamentally changed the approach. By framing representation learning as a prediction task, models like Word2Vec and BERT learned to compress meaning into dense geometric spaces. This transition from counting to predicting, and finally to contextualizing with transformers, laid the groundwork for modern generative AI.
| Method | Year | Team | Context | Key Innovation | Still Used? |
|---|---|---|---|---|---|
| TF-IDF | 1970s | Various | Static | Frequency weighting | ✅ Search engines |
| LSA | 1990 | Deerwester et al. | Static | SVD on term-document matrix | Rarely |
| Word2Vec | 2013 | Mikolov et al. (Google) | Static | Neural prediction, negative sampling | ✅ Features, baselines |
| GloVe | 2014 | Pennington et al. (Stanford) | Static | Co-occurrence matrix factorization | ✅ Features, baselines |
| FastText | 2017 | Bojanowski et al. (Meta) | Static | Subword n-gram composition | ✅ Low-resource, OOV |
| ELMo | 2018 | Peters et al. (Allen AI) | Dynamic | Bidirectional LSTM LM features | Rarely |
| BERT | 2018 | Devlin et al. (Google) | Dynamic | MLM + bidirectional Transformer | ✅ Encoders, retrieval |
| GPT | 2018+ | Radford et al. (OpenAI) | Dynamic | Next-token prediction at scale, 175B params[13] | ✅ All LLMs |
A strong engineer knows when not to use the latest model. While large language models dominate the headlines, they're often overkill for simple production tasks. Understanding the trade-offs between static and contextual embeddings is crucial for building cost-effective and low-latency systems.
Static embeddings like FastText or Word2Vec remain highly relevant in resource-constrained environments. They require no specialized hardware like GPUs for inference, making them ideal for lightweight applications or as feature inputs for traditional tabular machine learning models. FastText, in particular, continues to be a standard choice for morphologically rich languages where dealing with unseen words is a common challenge.
Contextual embeddings, on the other hand, are strictly necessary when semantic disambiguation is critical. Tasks like complex semantic search, passage retrieval, and detailed text classification benefit immensely from BERT-style encoders. For anything involving text generation or open-ended reasoning, autoregressive models from the GPT family are the definitive standard. The key is matching the complexity of the embedding model to the complexity of the business requirement.
| Scenario | Best Choice | Why |
|---|---|---|
| Simple text similarity search | FastText or Word2Vec | Fast, no GPU needed, good enough |
| Feature engineering for tabular ML | GloVe/FastText average | Dense features from text columns |
| Morphologically rich language | FastText | Handles OOV via subword composition |
| Semantic search / retrieval | Fine-tuned BERT (bi-encoder) | Context-aware, high quality |
| Text classification | Fine-tuned BERT / RoBERTa | Best accuracy for classification |
| Text generation | GPT / decoder model | Autoregressive = natural generation |
| Production with compute constraints | Distilled BERT (DistilBERT) | ~97% of BERT's performance at 60% of the size |
| Research prototype | Full BERT-large / Latest GPT model | Maximum quality, cost not a concern |
When building search or retrieval systems, engineers frequently need to visualize high-dimensional embeddings and reduce their dimensionality for efficiency. PCA is the go-to for fast linear reduction and provides interpretable principal components. t-SNE excels at creating 2D/3D visualizations that preserve local cluster structure, making it ideal for quality inspection. UMAP offers the best balance: it preserves both local and global structure better than t-SNE while being significantly faster, making it the standard choice for interactive exploration of embedding spaces. For more on production dimensionality reduction techniques including quantization, see Dimensionality Reduction for Embeddings.
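Of the three, PCA is the only one that needs no extra library. A quick sketch of reducing a batch of embeddings to 2-D coordinates for plotting:

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Project embeddings onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # coordinates in top-k PCs

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))    # 500 embeddings, 300 dimensions
X2 = pca_reduce(X, k=2)            # 2-D coordinates, ready for a scatter plot
```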
[1] Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955.
[2] Shannon, C. E. (1948). A Mathematical Theory of Communication.
[3] Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. JASIS.
[4] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
[5] Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model.
[6] Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.
[7] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
[8] Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization.
[9] Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information.
[10] Peters, M., et al. (2018). Deep contextualized word representations. NAACL 2018.
[11] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[12] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
[13] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
[14] Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.
[15] Gao, J., He, D., Tan, X., Qin, T., Wang, L., & Liu, T.-Y. (2019). Representation Degeneration Problem in Training Natural Language Generation Models.