LeetLLM
© 2026 LeetLLM. All rights reserved.

Static to Contextual Embeddings

Turn token IDs into vectors, learn what nearby usage captures, and see why a word such as charge needs sentence-dependent representations.


A tokenizer can turn a support request into IDs:

```text
"dispute the charge" -> [91, 12, 407]
"charge the scanner" -> [407, 12, 982]
```

IDs solve naming. They don't tell a model that refund resembles return, or that charge means a billing event in one message and supplying power in another.

An embedding gives each token a vector of numbers. A contextual representation goes one step further: it adjusts that vector using the sentence around this particular occurrence.

By the end of this lesson, you'll build both ideas from scratch and watch the same token leave with two different meanings.

Figure: A support message passes through a tokenizer (token IDs), an embedding table (one starting row per ID), and context-mixing layers (tokens exchange evidence).

Look Up Coordinates for Token IDs

Suppose a tokenizer has a vocabulary of size V, and you choose embedding dimension d. The model stores an embedding table E with shape V x d.

  • Row E[token_id] is the starting vector for that token.
  • Training updates the rows so that tokens used in similar situations end up with vectors that serve the prediction task in similar ways.
  • Before context mixing, every occurrence of the same token ID receives the same row.

That final point matters. If charge is ID 0, both billing and device sentences initially retrieve row E[0].

01-look-up-token-rows.py
```python
import numpy as np

vocab = {"charge": 0, "refund": 1, "scanner": 2}
embedding_table = np.array([
    [0.80, 0.20],  # charge
    [0.15, 0.95],  # refund
    [0.92, 0.10],  # scanner
])

# Both occurrences use the same token ID, so both read the same row.
billing_charge = embedding_table[vocab["charge"]]
device_charge = embedding_table[vocab["charge"]]

print("billing start:", billing_charge.tolist())
print("device start:", device_charge.tolist())
print("same starting row:", np.array_equal(billing_charge, device_charge))
```
Output
```text
billing start: [0.8, 0.2]
device start: [0.8, 0.2]
same starting row: True
```

A lookup table alone can't distinguish senses. First, though, it has to learn a useful map of token usage.
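One way to see why the lookup is pure row selection: indexing row `token_id` gives the same result as multiplying a one-hot vector by the table. The sketch below reuses the toy table from the example above; the `one_hot` helper is introduced here only for illustration.

```python
import numpy as np

embedding_table = np.array([
    [0.80, 0.20],  # charge
    [0.15, 0.95],  # refund
    [0.92, 0.10],  # scanner
])

def one_hot(token_id, vocab_size):
    """Return zeros with a single 1.0 at token_id (illustrative helper)."""
    vector = np.zeros(vocab_size)
    vector[token_id] = 1.0
    return vector

token_id = 1  # refund
by_index = embedding_table[token_id]
by_matmul = one_hot(token_id, len(embedding_table)) @ embedding_table

print("row lookup:", by_index.tolist())
print("one-hot product:", by_matmul.tolist())
print("identical:", np.allclose(by_index, by_matmul))
```

Nothing about the surrounding sentence enters either computation, which is why a lookup alone cannot separate senses.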

Learn Meaning from Neighboring Words

Words that repeatedly appear near similar words tend to play similar roles. In return workflows, refund and replacement may both occur near approved, request, or label; carrier and tracking may cluster around shipment status language.

The simplest evidence is a co-occurrence count: for every token, count words within a fixed context window.

Figure: Static versus contextual embeddings for the word charge, showing one Word2Vec lookup vector versus separate billing-charge and scanner-charge vectors shaped by surrounding words.
A stored charge row is identical until sentence evidence turns each occurrence into a different final state.
02-count-context-neighbors.py
```python
from collections import Counter, defaultdict

messages = [
    "refund approved send return label".split(),
    "replacement approved send return label".split(),
    "refund request needs label".split(),
    "tracking update from carrier hub".split(),
    "shipment delayed at carrier hub".split(),
]

window = 2
neighbors = defaultdict(Counter)

for message in messages:
    for center_index, center in enumerate(message):
        # Clip the window at the message boundaries.
        left = max(0, center_index - window)
        right = min(len(message), center_index + window + 1)
        for context_index in range(left, right):
            if context_index != center_index:
                neighbors[center][message[context_index]] += 1

print("refund context:", neighbors["refund"].most_common(4))
print("replacement context:", neighbors["replacement"].most_common(4))
print("carrier context:", neighbors["carrier"].most_common(4))
```
Output
```text
refund context: [('approved', 1), ('send', 1), ('request', 1), ('needs', 1)]
replacement context: [('approved', 1), ('send', 1)]
carrier context: [('hub', 2), ('update', 1), ('from', 1), ('delayed', 1)]
```

A count row can be high-dimensional and noisy. One historical route, latent semantic analysis, uses singular value decomposition (SVD) to compress a high-dimensional term-document matrix.[1] For this lesson, use the same compression move on word-context counts. The compressed rows act as static vectors: one vector per word, independent of sentence.

Figure: Count-based embedding pipeline showing a sparse word-context matrix, SVD factorization, rank-k truncation, and dense word vectors.
SVD compresses a sparse context table so return tokens and shipment tokens occupy different nearby regions.

The next exercise follows that word-context route. It computes positive pointwise mutual information (PPMI), which emphasizes pairs occurring more often than chance, then compresses the matrix with SVD.
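Written out, one common way to define the quantity is shown below. The notation is introduced here for illustration: #(w, c) is the count for word w with context c, #(w) and #(c) are the row and column totals, and N is the sum of all counts.

```latex
\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}
                   = \log \frac{\#(w, c)\cdot N}{\#(w)\cdot \#(c)},
\qquad
\mathrm{PPMI}(w, c) = \max\bigl(0,\ \mathrm{PMI}(w, c)\bigr)
```

For the refund/label cell in the exercise that follows (count 8, row total 14, column total 15, grand total 53), this gives log(8 · 53 / (14 · 15)), which is about 0.70. Cells with a zero count have a very negative PMI and are clipped to 0.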

03-compress-ppmi-with-svd.py
```python
import numpy as np

words = ["refund", "replacement", "carrier", "tracking"]
contexts = ["label", "approved", "hub", "delayed"]
counts = np.array([
    [8, 6, 0, 0],  # refund
    [7, 6, 0, 0],  # replacement
    [0, 0, 8, 5],  # carrier
    [0, 0, 7, 6],  # tracking
], dtype=float)

# Expected counts if words and contexts were independent.
expected = counts.sum(axis=1, keepdims=True) @ counts.sum(axis=0, keepdims=True) / counts.sum()
# The small constant keeps log() finite for zero counts; negatives are clipped to 0.
ppmi = np.maximum(np.log((counts + 1e-9) / (expected + 1e-9)), 0.0)
u, singular_values, _ = np.linalg.svd(ppmi, full_matrices=False)
# Keep the top two directions as dense word vectors.
vectors = u[:, :2] * np.sqrt(singular_values[:2])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("refund vs replacement:", round(cosine(vectors[0], vectors[1]), 3))
print("refund vs carrier:", round(cosine(vectors[0], vectors[2]), 3))
```
Output
```text
refund vs replacement: 1.0
refund vs carrier: 0.0
```

This tiny dataset is designed to make the pattern visible. Real corpora have far more contexts, imperfect wording, and competing meanings.

Figure: Toy static embedding map showing return-related and shipment-related words clustering because their context neighborhoods overlap.
Nearby coordinates reflect overlapping usage in this toy corpus, not a universal semantic map.

Predict Neighbors with Word2Vec

Counting first and compressing later isn't the only way to learn a static embedding table. Word2Vec trains vectors through prediction:

  • CBOW predicts a center token from its surrounding tokens.
  • Skip-gram predicts surrounding tokens from a center token.

Both objectives reward vectors that help predict observed neighborhoods. Mikolov and colleagues introduced these two architectures in 2013.[2]

Figure: Word2Vec CBOW and skip-gram objectives showing context tokens predicting a center word versus a center word predicting surrounding context tokens.
Both Word2Vec objectives learn from observed windows; they reverse which side supplies the prediction target.

For skip-gram, a training set can be built directly from windows. With window size 2, the center token arrived produces one positive pair for each neighbor inside the window: one to its left (package) and two to its right (at, carrier).

04-create-skipgram-pairs.py
```python
tokens = "package arrived at carrier hub today".split()
window = 2
pairs = []

for center_index, center in enumerate(tokens):
    left = max(0, center_index - window)
    right = min(len(tokens), center_index + window + 1)
    for context_index in range(left, right):
        if context_index != center_index:
            # One (center, neighbor) training pair per visible neighbor.
            pairs.append((center, tokens[context_index]))

arrived_pairs = [pair for pair in pairs if pair[0] == "arrived"]
print(arrived_pairs)
```
Output
```text
[('arrived', 'package'), ('arrived', 'at'), ('arrived', 'carrier')]
```
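For comparison, the CBOW objective described earlier can be fed from the same windows by reversing the roles: the neighbors become the input and the center becomes the target. This is a minimal sketch of the data preparation only; the name `cbow_examples` is an assumption for illustration, and the step where the model combines the context vectors (typically by averaging or summing them) is not shown.

```python
tokens = "package arrived at carrier hub today".split()
window = 2
cbow_examples = []

for center_index, center in enumerate(tokens):
    left = max(0, center_index - window)
    right = min(len(tokens), center_index + window + 1)
    # All neighbors in the window form the input; the center is the target.
    context = [tokens[i] for i in range(left, right) if i != center_index]
    cbow_examples.append((context, center))

arrived_example = next(ex for ex in cbow_examples if ex[1] == "arrived")
print(arrived_example)
```

Skip-gram turns this one window into three separate pairs, while CBOW keeps it as a single example with three inputs.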

Predicting every vocabulary item for every pair is costly. Negative sampling trains the observed pair as positive and a small set of sampled unobserved pairs as negative. The following objective is a small, inspectable version of that idea.

05-compare-negative-sampling-loss.py
```python
import numpy as np

center = np.array([1.0, 0.0])
observed_neighbor = np.array([1.2, 0.1])
wrong_neighbor = np.array([-1.0, 0.1])
negative_samples = [np.array([-0.9, -0.2]), np.array([-1.1, 0.0])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vector, positive_vector, negatives):
    # Reward a high score for the positive pair and low scores for the negatives.
    positive_loss = -np.log(sigmoid(center_vector @ positive_vector))
    negative_loss = sum(-np.log(sigmoid(-(center_vector @ n))) for n in negatives)
    return float(positive_loss + negative_loss)

observed_loss = negative_sampling_loss(center, observed_neighbor, negative_samples)
wrong_loss = negative_sampling_loss(center, wrong_neighbor, negative_samples)

print("observed pair loss:", round(observed_loss, 3))
print("wrong pair loss:", round(wrong_loss, 3))
print("training prefers observed pair:", observed_loss < wrong_loss)
```
Output
```text
observed pair loss: 0.892
wrong pair loss: 1.942
training prefers observed pair: True
```
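To connect the loss to learning, the sketch below takes one gradient step on the center vector only and checks that the loss for the observed pair goes down. The `center_gradient` helper and the learning rate of 0.1 are assumptions made for illustration; a real trainer also updates the neighbor and negative vectors and repeats this over many pairs.

```python
import numpy as np

center = np.array([1.0, 0.0])
observed_neighbor = np.array([1.2, 0.1])
negative_samples = [np.array([-0.9, -0.2]), np.array([-1.1, 0.0])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vector, positive_vector, negatives):
    positive_loss = -np.log(sigmoid(center_vector @ positive_vector))
    negative_loss = sum(-np.log(sigmoid(-(center_vector @ n))) for n in negatives)
    return float(positive_loss + negative_loss)

def center_gradient(center_vector, positive_vector, negatives):
    # d/dc of -log(sigmoid(c @ p)) is -(1 - sigmoid(c @ p)) * p
    grad = -(1.0 - sigmoid(center_vector @ positive_vector)) * positive_vector
    # d/dc of -log(sigmoid(-(c @ n))) is sigmoid(c @ n) * n
    for n in negatives:
        grad = grad + sigmoid(center_vector @ n) * n
    return grad

learning_rate = 0.1  # illustrative value
loss_before = negative_sampling_loss(center, observed_neighbor, negative_samples)
step = learning_rate * center_gradient(center, observed_neighbor, negative_samples)
updated_center = center - step
loss_after = negative_sampling_loss(updated_center, observed_neighbor, negative_samples)

print("loss before:", round(loss_before, 3))
print("loss after one step:", round(loss_after, 3))
print("loss decreased:", loss_after < loss_before)
```

The step moves the center vector toward the observed neighbor and away from the sampled negatives, which is the pull that gradually shapes the static geometry.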

No analogy trick is required to understand the useful result: after enough examples, tokens that support similar neighbor predictions can end up near one another.

Fit Global Counts with GloVe

GloVe starts from global co-occurrence counts. It fits a weighted least-squares objective: a word-vector and context-vector dot product, plus biases, should approximate the logarithm of each observed count. The logarithm compresses the count range, while the weighting function limits how much very common pairs dominate training.[3]

06-fit-glove-log-count.py
```python
import numpy as np

def squared_glove_residual(observed_count, model_score):
    # GloVe compares the model score against the log of the observed count.
    target = np.log(observed_count)
    return float((model_score - target) ** 2)

model_score = np.log(30)
matching_error = squared_glove_residual(30, model_score)
different_error = squared_glove_residual(3, model_score)

print("log targets:", [round(float(np.log(c)), 3) for c in (3, 30, 300)])
print("matching count error:", round(matching_error, 3))
print("different count error:", round(different_error, 3))
```
Output
```text
log targets: [1.099, 3.401, 5.704]
matching count error: 0.0
different count error: 5.302
```
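The example above leaves out the weighting function. A minimal sketch of one weighted term follows, using the form from the GloVe paper with its reported defaults (`x_max = 100`, `alpha = 0.75`). The vectors and biases are hypothetical toy values chosen so the score is exactly 3.0, and the helper names are introduced here for illustration.

```python
import numpy as np

def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weight that grows with the count and is capped at 1."""
    return float(min(1.0, (count / x_max) ** alpha))

def weighted_glove_term(word_vec, context_vec, word_bias, context_bias, count):
    # Model score: dot product plus both biases, compared against log(count).
    score = word_vec @ context_vec + word_bias + context_bias
    residual = score - np.log(count)
    return float(glove_weight(count) * residual ** 2)

# Hypothetical toy parameters: dot product 2.5 plus biases 0.5 gives score 3.0.
word_vec = np.array([1.0, 0.5])
context_vec = np.array([2.0, 1.0])
word_bias, context_bias = 0.2, 0.3

for count in (3, 30, 300):
    term = weighted_glove_term(word_vec, context_vec, word_bias, context_bias, count)
    print(count, "weight:", round(glove_weight(count), 3), "weighted term:", round(term, 3))
```

Rare pairs get a small weight, so a large residual on a count of 3 contributes less than it would unweighted, while every pair at or above `x_max` is capped at the same weight of 1.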

Word2Vec and GloVe give each known word one static row. That is efficient, but it creates two problems: new spellings have no row, and ambiguous words have only one.

Compose New Spellings from Pieces

Support systems constantly encounter variants: reship, reshipped, reshipping, or tracking codes mixed into prose. FastText represents a word using character n-grams as well as its whole-word identity, letting related spellings share parameters.[4]

Figure: FastText subword composition showing a boundary-marked word split into character n-grams, hashed into buckets, and summed with the full-word vector.
Character pieces provide a usable starting vector for an unseen variant such as reshipped.
07-share-character-ngrams.py
```python
def character_ngrams(word, n=3):
    # Boundary markers let prefixes and suffixes form their own pieces.
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

known = character_ngrams("reship")
new_form = character_ngrams("reshipped")
shared = sorted(known & new_form)

print("shared pieces:", shared)
print("new spelling can reuse pieces:", len(shared) > 0)
```
Output
```text
shared pieces: ['<re', 'esh', 'hip', 'res', 'shi']
new spelling can reuse pieces: True
```
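Sharing pieces only helps if the pieces map to vectors. The sketch below, under stated assumptions, hashes each n-gram into a small table of bucket vectors and sums them, adding a whole-word vector when one exists. The bucket count, dimension, random "trained" rows, and the use of `zlib.crc32` as the hash are all stand-ins chosen for a deterministic toy; they are not FastText's actual settings or hash function.

```python
import zlib
import numpy as np

NUM_BUCKETS = 50   # tiny on purpose; real systems use far more buckets
DIM = 4
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(NUM_BUCKETS, DIM))  # stand-in for trained rows

def character_ngrams(word, n=3):
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

def bucket_id(ngram):
    # crc32 is a deterministic stand-in hash, not FastText's own.
    return zlib.crc32(ngram.encode("utf-8")) % NUM_BUCKETS

def compose(word, known_word_vectors):
    """Sum hashed n-gram rows, plus the whole-word row when the word is known."""
    vector = np.zeros(DIM)
    for ngram in character_ngrams(word):
        vector += bucket_vectors[bucket_id(ngram)]
    if word in known_word_vectors:
        vector += known_word_vectors[word]
    return vector

known_word_vectors = {"reship": rng.normal(size=DIM)}
unseen = compose("reshipped", known_word_vectors)

seen_buckets = {bucket_id(g) for g in character_ngrams("reship")}
unseen_buckets = {bucket_id(g) for g in character_ngrams("reshipped")}
shared_buckets = sorted(seen_buckets & unseen_buckets)

print("vector for unseen word:", np.round(unseen, 3).tolist())
print("shared bucket ids:", shared_buckets)
```

Because reshipped has no whole-word row, its vector comes entirely from bucket rows, several of which it shares with reship.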

Subword composition helps with unfamiliar surface forms. It still doesn't decide which meaning an existing ambiguous token carries.

One Static Row Can't Choose a Sense

Read these messages:

```text
"Dispute the charge on order 8142."
"Charge the scanner before the warehouse shift."
```

The word charge is spelled the same way in both. A static embedding lookup returns the same vector, even though the first case belongs near billing language and the second near device language.

08-expose-static-sense-collision.py
```python
import numpy as np

static = {
    "charge": np.array([0.8, 0.2]),
    "refund": np.array([0.2, 1.0]),
    "scanner": np.array([1.0, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A static table has one entry per word, so both senses read the same vector.
billing_charge = static["charge"]
device_charge = static["charge"]

print("same charge vector:", np.array_equal(billing_charge, device_charge))
print("billing similarity to refund:", round(cosine(billing_charge, static["refund"]), 3))
print("device similarity to refund:", round(cosine(device_charge, static["refund"]), 3))
```
Output
```text
same charge vector: True
billing similarity to refund: 0.428
device similarity to refund: 0.428
```

Because the two charge vectors are identical, every downstream comparison begins from the same mistaken compromise. The representation needs evidence from the sentence.

Let Tokens Exchange Context

ELMo showed that a token's representation can be computed by a bidirectional language model and formed as a learned weighted combination of its layers, rather than stored as one context-free vector.[5]

Figure: ELMo architecture showing character CNN token representations feeding separate forward and backward LSTM stacks, then a learned weighted layer mixture.
ELMo's output depends on sentence context because each directional language model contributes to the token state.
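The layer mixture can be sketched in a few lines: softmax-normalize one scalar per layer, take the weighted sum of that token's layer states, and scale the result. The layer states, logits, and `gamma` below are hypothetical toy values; in ELMo the weights and scale are learned for each downstream task.

```python
import numpy as np

# Hypothetical per-layer states for one token: 3 layers, 4 dimensions each.
layer_states = np.array([
    [0.2, 0.1, 0.0, 0.5],   # layer 0: character-based token representation
    [0.6, 0.3, 0.1, 0.2],   # layer 1: first biLSTM layer
    [0.1, 0.9, 0.4, 0.0],   # layer 2: second biLSTM layer
])

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def mix_layers(states, layer_logits, gamma=1.0):
    """Return gamma times the softmax-weighted sum of layer states, plus the weights."""
    weights = softmax(layer_logits)
    return gamma * (weights @ states), weights

mixed, weights = mix_layers(layer_states, layer_logits=np.array([0.0, 1.0, 2.0]))
uniform_mix, _ = mix_layers(layer_states, layer_logits=np.zeros(3))

print("layer weights:", np.round(weights, 3).tolist())
print("mixed token vector:", np.round(mixed, 3).tolist())
print("equal logits give the layer average:", np.allclose(uniform_mix, layer_states.mean(axis=0)))
```

Different tasks can settle on different weights, so the same sentence can yield different token vectors depending on which layers a task finds useful.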

BERT later trained a Transformer encoder with masked language modeling: hide some tokens, then predict them using context on both sides. For a visible token state, encoder self-attention can incorporate evidence before and after that token.[6]

Figure: BERT encoder stack showing token, segment, and position embeddings entering bidirectional transformer layers that produce contextual token representations.
The visibility mask shows available context positions; it deliberately doesn't pretend to show learned attention weights.
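A masked language modeling example can be built by hand to show what the encoder is asked to do. This is a simplified sketch: the masked position is chosen explicitly, whereas BERT samples positions at random (about 15% of tokens in the paper) and sometimes substitutes a random or unchanged token instead of the mask symbol. The helper name `mask_positions` is an assumption for illustration.

```python
MASK = "[MASK]"

def mask_positions(tokens, positions):
    """Hide the chosen positions and record the original tokens as targets."""
    inputs = [MASK if i in positions else token for i, token in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets

tokens = "dispute the charge on order 8142".split()
inputs, targets = mask_positions(tokens, positions={2})

print("model input:", inputs)
for position, target in targets.items():
    left_context = inputs[:position]
    right_context = inputs[position + 1:]
    print("predict", repr(target), "from left", left_context, "and right", right_context)
```

The target charge has to be recovered from dispute on the left and order on the right, which pushes the encoder to build states that use both sides.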

You don't need a pretrained model to see the central mechanism. In this toy contextualizer, the same starting vector for charge is mixed with a clue from its sentence.

09-mix-context-into-charge.py
```python
import numpy as np

charge_start = np.array([0.8, 0.2])
clues = {
    "dispute": np.array([0.0, 1.0]),  # billing evidence
    "scanner": np.array([1.0, 0.0]),  # device evidence
}

def mix_with_context(token_vector, clue_vector):
    # Fixed 50/50 blend; a trained model would learn these weights.
    return 0.5 * token_vector + 0.5 * clue_vector

billing_state = mix_with_context(charge_start, clues["dispute"])
device_state = mix_with_context(charge_start, clues["scanner"])

print("billing charge state:", billing_state.tolist())
print("device charge state:", device_state.tolist())
print("same final state:", np.array_equal(billing_state, device_state))
```
Output
```text
billing charge state: [0.4, 0.6]
device charge state: [0.9, 0.1]
same final state: False
```

The weights above are illustrative, not learned model parameters. A trained attention layer learns which context tokens should contribute, and later layers can refine that state again.
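To hint at where such weights come from, the sketch below replaces the fixed 50/50 blend with scaled dot-product attention for a single query. It is a simplification under stated assumptions: the 2-d vectors are hypothetical, and the token vectors serve directly as queries, keys, and values, whereas a trained layer would first apply learned projections.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def attend(query, keys, values):
    """Single-query scaled dot-product attention over a small set of tokens."""
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)
    return weights @ values, weights

# Hypothetical token vectors for the sentence "charge the scanner".
charge = np.array([0.8, 0.2])
sentence = {
    "charge": charge,
    "the": np.array([0.1, 0.1]),
    "scanner": np.array([1.0, 0.0]),
}
keys = values = np.stack(list(sentence.values()))

state, weights = attend(query=charge, keys=keys, values=values)
for token, weight in zip(sentence, weights):
    print(f"{token:>8}: {weight:.3f}")
print("contextual charge state:", np.round(state, 3).tolist())
```

Here the weights fall out of vector similarity instead of being fixed by hand, so scanner contributes more to the charge state than the filler word the.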

Bidirectional and Causal Context

Contextual representations differ in which positions each token state is allowed to see.

  • A BERT-style encoder token can use tokens on both sides of its position during encoding.
  • A GPT-style causal language model predicts the next token from the prefix, so a token state can't use later tokens while preserving that next-token objective. The original GPT work used a Transformer decoder for generative pretraining.[7]

For the first token in charge the scanner, the right-hand clue scanner is visible to an encoder but not to a causal decoder state at position charge.

10-compare-visibility-masks.py
```python
import numpy as np

tokens = ["charge", "the", "scanner"]
# Encoder: every position sees every position.
encoder_visibility = np.ones((len(tokens), len(tokens)), dtype=int)
# Causal: position i sees only positions 0..i (lower triangle).
causal_visibility = np.tril(np.ones((len(tokens), len(tokens)), dtype=int))

def visible_tokens(mask, position):
    return [token for token, visible in zip(tokens, mask[position]) if visible]

print("encoder state for charge sees:", visible_tokens(encoder_visibility, 0))
print("causal state for charge sees:", visible_tokens(causal_visibility, 0))
```
Output
```text
encoder state for charge sees: ['charge', 'the', 'scanner']
causal state for charge sees: ['charge']
```

Both designs produce contextual states. Their visibility rules suit different training objectives and later tasks.
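A visibility mask takes effect when attention weights are computed. In this sketch, hidden positions are set to negative infinity before the softmax so they receive exactly zero weight. The score matrix is hypothetical; a real model would compute it from queries and keys.

```python
import numpy as np

tokens = ["charge", "the", "scanner"]
# Hypothetical raw scores: row i holds scores from position i to every position.
scores = np.array([
    [1.0, 0.2, 2.0],
    [0.3, 1.0, 0.5],
    [1.5, 0.1, 1.0],
])
encoder_visibility = np.ones((3, 3), dtype=bool)
causal_visibility = np.tril(np.ones((3, 3), dtype=bool))

def masked_softmax(score_matrix, visibility):
    """Row-wise softmax where hidden positions get exactly zero weight."""
    masked = np.where(visibility, score_matrix, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=1, keepdims=True)

encoder_weights = masked_softmax(scores, encoder_visibility)
causal_weights = masked_softmax(scores, causal_visibility)

print("encoder weights for charge:", np.round(encoder_weights[0], 3).tolist())
print("causal weights for charge:", np.round(causal_weights[0], 3).tolist())
```

With the causal mask, the first position can only attend to itself, so the high raw score toward scanner never reaches the charge state.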

Figure: Timeline from count-based static vectors through prediction and subword methods to sentence-dependent encoder and causal states.
The progression is about information available to each token state, not a blanket ranking of model families.

Similarity Needs a Geometry Check

Embedding applications often use cosine similarity: a value near 1 means two vectors point in similar directions. That measurement is useful only if the space behaves sensibly.

Research on contextual representations found that token vectors can be anisotropic: many vectors share a dominant direction, inflating raw cosine similarities even for tokens that shouldn't be treated as alike.[8]

Figure: Embedding geometry comparison showing an isotropic point cloud spread across directions versus an anisotropic cone where contextual vectors cluster around dominant directions.
When many vectors share one direction, raw cosine similarity can hide the distinction a task needs.

This tiny example has a large shared horizontal component. Raw cosine says the two states are almost identical; subtracting their shared mean reveals that their informative vertical components oppose one another.

11-diagnose-shared-direction.py
```python
import numpy as np

billing = np.array([10.0, 1.0])
device = np.array([10.0, -1.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_similarity = cosine(billing, device)
# Remove the component both vectors share before comparing again.
mean_direction = (billing + device) / 2
centered_similarity = cosine(billing - mean_direction, device - mean_direction)

print("raw cosine:", round(raw_similarity, 3))
print("centered cosine:", round(centered_similarity, 3))
```
Output
```text
raw cosine: 0.98
centered cosine: -1.0
```

Centering isn't a universal production recipe. It is a diagnostic reminder: measure retrieval or classification quality on held-out examples instead of trusting a similarity score in isolation.
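The same diagnostic scales to a whole set of vectors: average the cosine similarity over all pairs, before and after subtracting the mean. The data below is synthetic (random variation on top of one large shared offset), and `mean_pairwise_cosine` is a helper introduced here, so treat this as a rough check in the spirit of the anisotropy analysis, not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic token states: small random variation plus one large shared offset.
shared_offset = np.array([10.0, 0.0, 0.0])
states = shared_offset + rng.normal(scale=1.0, size=(200, 3))

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all ordered pairs of distinct rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit.T
    n = len(vectors)
    off_diagonal_sum = similarity.sum() - np.trace(similarity)
    return float(off_diagonal_sum / (n * (n - 1)))

raw = mean_pairwise_cosine(states)
centered = mean_pairwise_cosine(states - states.mean(axis=0))

print("mean pairwise cosine, raw:", round(raw, 3))
print("mean pairwise cosine, centered:", round(centered, 3))
```

A raw average close to 1 across unrelated vectors is a sign that a shared direction, not task-relevant similarity, is driving the scores.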

Choose What the Failure Requires

No representation is best by slogan. Start with the failure mode and the measurement that would prove improvement.

| Need | Useful starting point | What to measure |
| --- | --- | --- |
| Small fixed vocabulary, simple classifier | Trainable embedding lookup | Held-out classification quality and latency |
| Rare spelling variants such as reshipped | Character or subword-aware static vectors | Recall on unseen variants |
| Ambiguous words such as billing/device charge | Contextual token states | Accuracy on sense-dependent cases |
| Search over whole support messages | Sentence or chunk embedding model | Retrieval precision and recall on real queries |

Contextual token states aren't automatically good message-level retrieval vectors. Pooling, task training, and evaluation still matter. You will build those choices in later retrieval lessons.
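As one concrete instance of pooling, the sketch below averages token states into a single message vector while ignoring padded positions. The states and mask are hypothetical toy values, and mean pooling is only one option; whether it gives useful retrieval vectors depends on how the model was trained.

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Average token states over real tokens, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, tokens, 1)
    summed = (token_states * mask).sum(axis=1)        # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)     # avoid dividing by zero
    return summed / counts

# Hypothetical contextual states: 2 messages, up to 3 tokens each, 2 dimensions.
token_states = np.array([
    [[0.4, 0.6], [0.2, 0.8], [0.0, 0.0]],   # last position is padding
    [[0.9, 0.1], [0.7, 0.3], [0.8, 0.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

message_vectors = mean_pool(token_states, attention_mask)
print(np.round(message_vectors, 3).tolist())
```

Dividing by the count of real tokens, not the padded length, keeps short messages from being pulled toward zero.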

What You Built

You now have a concrete chain from token IDs to context-dependent meaning:

  1. An embedding table maps each token ID to one starting vector.
  2. Co-occurrence, Word2Vec, and GloVe learn static geometry from usage evidence.
  3. FastText-style character pieces let related spellings share parameters.
  4. Static vectors fail when identical token text carries different senses.
  5. Context mixing gives each occurrence its own state.
  6. Cosine similarity still requires evaluation because geometry can be distorted.

Mastery Check

Evaluation rubric

  • Foundational: You can explain why an embedding lookup returns one stored row for every occurrence of a token ID.
  • Intermediate: You can build a co-occurrence or prediction example and show why the two meanings of charge require context mixing.
  • Advanced: You can choose a representation for a measured failure case and test whether cosine geometry supports the chosen metric.

Common pitfalls

  • Treating token IDs as meaning: An ID only selects a row; training and context create useful geometry.
  • Calling subword handling disambiguation: Character pieces help unfamiliar spellings, not multiple senses of one familiar word.
  • Trusting cosine without an evaluation set: A high score can come from shared directions rather than task-relevant similarity.
Next Step
Continue to Perplexity & Model Evaluation

Embeddings decide what information a language model can carry forward. Next you'll measure how well its predicted token probabilities fit real text.

References

[1] Deerwester, S., et al. Indexing by Latent Semantic Analysis. JASIS, 1990.

[2] Mikolov, T., Chen, K., Corrado, G., & Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv preprint, 2013.

[3] Pennington, J., Socher, R., & Manning, C. D. GloVe: Global Vectors for Word Representation. 2014.

[4] Bojanowski, P., et al. Enriching Word Vectors with Subword Information. 2017.

[5] Peters, M., et al. Deep contextualized word representations. NAACL 2018.

[6] Devlin, J., et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

[7] Radford, A., et al. Improving Language Understanding by Generative Pre-Training. 2018.

[8] Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. 2019.