LeetLLM

CUDA for ML Training

Build beginner-first CUDA intuition for model training: CPU vs GPU roles, host-device copies, asynchronous execution, PyTorch device placement, and first-line debugging of OOM and performance issues.

Platform path

If you train on a Mac with Apple silicon, pair it with MPS & Metal for ML on Mac. Same device-placement ideas, different backend and setup checks.

CUDA isn't a separate "AI mode." Suppose the access-ticket batch from the previous lesson has shape (32, 128, 768): 32 tickets, 128 tokens per ticket, and 768 features per token. CUDA adds a second contract to that shape contract: where the batch, weights, activations, and gradients live while training runs.

CUDA is NVIDIA's parallel computing platform and programming model. A CPU is built to run a small number of complicated threads with low latency; a GPU is built to run huge amounts of similar arithmetic at high throughput. For dense tensor work like matrix multiplies and attention, that difference can be dramatic once work becomes a large GPU kernel.[1] The flip side matters too: tiny tensors and repeated copies can spend more time on launch and transfer overhead than on useful math.
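To make "large" concrete, here is a rough back-of-the-envelope sketch for the (32, 128, 768) batch. It counts the bytes one host-to-device copy would move and the floating-point operations of a single projection. The float32 storage and the 768-to-768 projection are illustrative assumptions, not part of the lesson's model.

```python
# Rough size estimate for the (32, 128, 768) ticket batch.
# Assumptions (illustrative only): float32 storage, one 768 -> 768 linear projection.
batch, tokens, features = 32, 128, 768
bytes_per_float32 = 4

elements = batch * tokens * features
copy_bytes = elements * bytes_per_float32

# A matrix multiply performs one multiply and one add per (position, input, output) triple.
out_features = 768
matmul_flops = 2 * batch * tokens * features * out_features

print(f"elements in batch: {elements:,}")
print(f"bytes copied host -> device: {copy_bytes / 1024**2:.0f} MiB")
print(f"FLOPs for one projection: {matmul_flops / 1e9:.2f} GFLOP")
```

Under these assumptions, one copy of about 12 MiB feeds billions of arithmetic operations, which is the regime where a GPU tends to pay off. Shrink the batch to a handful of elements and the fixed costs start to dominate.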

The arrays are the same ones you already learned to reason about. Now device placement becomes part of the meaning of every tensor. You'll check an environment, move a training batch, catch a placement failure before a forward pass, budget memory, and measure asynchronous work honestly.[2]

Figure: Support-ticket batch moving once from CPU memory to CUDA memory, staying on the GPU for dense training math, then returning one loss scalar for logging.
Track the one expensive boundary crossing: copy the ticket batch to CUDA once, keep the forward and backward pass there, then bring back only the scalar loss for logging.

CPU orchestration vs GPU execution

A training loop has two different jobs:

  1. The CPU side handles Python control flow, dataloading, launching kernels, logging, checkpointing, and filesystem work.
  2. The GPU side handles the heavy tensor math: matrix multiplies, attention kernels, layer norms, optimizer updates, and other parallel operations.

That split matters because the CPU and GPU don't share one flat memory space in the way beginners often imagine. On the standard discrete-GPU path, the GPU has its own device memory. If your tensors live on the CPU, GPU kernels can't use them until you copy them over.[1]

Keep this practical comparison in your head:

| Workload | CPU usually wins when | GPU usually wins when |
| --- | --- | --- |
| Python control flow, branching, filesystem work | the work is serial, branchy, or tiny | not the right tool |
| Tensor math | the tensor is so small that transfer and launch overhead dominate | the operation is large, batched, and parallel, like matrix multiplication, convolutions, or attention |
| End-to-end training step | dataloading, logging, or synchronization stalls the loop | weights, activations, and batches already live on device and kernels stay large enough to saturate throughput |

Think about a CPU coordinator and a GPU worker pool. The CPU schedules work, but the GPU performs the bulk tensor math. Making the CPU path faster doesn't remove the GPU bottleneck when tensor operations dominate the work.

What CUDA is

CUDA is NVIDIA's GPU computing platform and programming model.[1] In practice, for most AI engineers, that means five related ideas:

  • Kernels: functions that run on the GPU across many threads in parallel.
  • Thread hierarchy: threads are grouped into blocks, and blocks are grouped into a grid.
  • Warps: on NVIDIA GPUs, threads execute in groups of 32 called warps, so branch-heavy code can waste throughput when lanes in a warp diverge.[1]
  • Device memory: in the standard discrete-GPU setup, the GPU has a device-memory pool separate from host RAM.
  • Asynchronous launch: the CPU often queues GPU work and continues running until something forces synchronization.
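The thread hierarchy in the list above can be made tangible with a little launch arithmetic. This sketch assumes one thread per element of the (32, 128, 768) batch and 256 threads per block. Both are illustrative choices; library kernels pick their own launch configuration.

```python
import math

# Illustrative launch arithmetic: one thread per element of the (32, 128, 768) batch.
# threads_per_block = 256 is an assumed, commonly used value, not a requirement.
WARP_SIZE = 32
threads_per_block = 256

elements = 32 * 128 * 768
blocks_in_grid = math.ceil(elements / threads_per_block)
warps_per_block = threads_per_block // WARP_SIZE

print("blocks in grid:", blocks_in_grid)
print("warps per block:", warps_per_block)
print("total threads launched:", blocks_in_grid * threads_per_block)
```

The point is scale: even a simple elementwise operation on this batch maps to thousands of blocks, each made of several warps that execute in lockstep.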

You don't need to write custom CUDA kernels on day one. You do need to understand that model layers, loss computation, backward passes, and optimizer updates launch GPU work once their tensors are on a CUDA device.

Select a compatible PyTorch build

The NVIDIA System Management Interface command, nvidia-smi, reports devices visible to the NVIDIA driver. PyTorch's torch.cuda.is_available() reports whether this Python process can use CUDA. One can succeed while the other fails: for example, the driver may see a GPU while your environment has a CPU-only PyTorch installation.[3][2]

Run these checks before changing code:

terminal
nvidia-smi
python3 - <<'PY'
import torch
print("torch version:", torch.__version__)
print("compiled CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
PY

If the process can't access CUDA, use PyTorch's official installation selector for the current operating system, package manager, and supported CUDA option.[4] Wheel tags and supported runtimes change; a hard-coded installation command in an article ages badly.

First device checks in PyTorch

Before you worry about throughput, make sure tensors land where you think they do. This script is intentionally device-agnostic: it runs on a CUDA machine and remains executable on a CPU-only laptop.

cuda_sanity_check.py
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"selected device: {device}")
print(f"cuda available: {torch.cuda.is_available()}")

x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
y = (x * 2).sum(dim=1)

print(f"x device: {x.device}")
print(f"y device: {y.device}")
print(f"result: {y.detach().cpu().tolist()}")
Output
selected device: cuda
cuda available: True
x device: cuda:0
y device: cuda:0
result: [6.0, 24.0]

The output above is the happy path from a configured NVIDIA machine. If your local run reports no accessible CUDA device, that doesn't automatically mean your code is wrong. It means one of these is true:

  • you're on a machine without an NVIDIA GPU
  • the driver is missing or mismatched
  • the environment isn't linked to a CUDA-enabled PyTorch build
  • the process can't access the GPU

The first quick checks are usually:

terminal-2
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"

nvidia-smi tells you whether the driver sees the device. PyTorch tells you whether the framework can use it.
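One way to read the three check results together is a small decision helper. The function below is a hypothetical sketch, not a PyTorch API. It assumes you pass in whether nvidia-smi listed a GPU, the value of torch.version.cuda, and the value of torch.cuda.is_available(), and it returns only a first hypothesis to investigate.

```python
from typing import Optional


def diagnose_cuda_setup(
    driver_sees_gpu: bool,
    compiled_cuda: Optional[str],
    cuda_available: bool,
) -> str:
    """Map the three check results to a first hypothesis (hypothetical helper)."""
    if cuda_available:
        return "CUDA is usable from this process"
    if not driver_sees_gpu:
        return "no visible NVIDIA GPU or a driver problem: check the machine and driver first"
    if compiled_cuda is None:
        return "CPU-only PyTorch build: install a CUDA-enabled build"
    return "driver and build are present: check for a driver mismatch or process access"


# Example: the driver sees a GPU, but the environment has a CPU-only PyTorch build.
print(diagnose_cuda_setup(driver_sees_gpu=True, compiled_cuda=None, cuda_available=False))
```

The ordering mirrors the checks above: confirm the driver view first, then the PyTorch build, and only then suspect mismatches or access restrictions.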

Host memory vs device memory

Next comes memory placement.

| Where data lives | Typical examples | Why it matters |
| --- | --- | --- |
| Host RAM | Python objects, dataset rows, CPU tensors | Easy to manipulate from Python; ordinary training tensors need a transfer before CUDA kernels use them |
| Device memory | model weights, activations, gradients, optimizer buffers on GPU | Fast for GPU compute, bounded per device, and expensive to overflow |

An out-of-memory (OOM) failure is local to the device running your job. A model that loads can still fail on its first training batch because weights are only one part of the budget. For a simplified full-precision Adam optimizer floor, count weights, gradients, and Adam's two running statistics. This still excludes activations, temporary buffers, and allocator overhead, so it's a lower bound rather than a capacity promise.

training_memory_floor.py
params = 1_000_000_000
bytes_per_param = {
    "fp32 weights": 4,
    "fp32 gradients": 4,
    "fp32 Adam moments": 8,
}

total_bytes = sum(params * bytes_each for bytes_each in bytes_per_param.values())
gib = total_bytes / (1024 ** 3)
print(f"parameter-related floor: {gib:.2f} GiB")
print("activations and temporary buffers: add more memory")
Output
parameter-related floor: 14.90 GiB
activations and temporary buffers: add more memory
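The same floor can be compared against a device's capacity. This is a minimal sketch that reuses the 16 bytes per parameter from the calculation above; the 8, 16, and 24 GiB capacities are hypothetical card sizes, and real usable memory is somewhat lower than the nominal figure.

```python
BYTES_PER_PARAM = 4 + 4 + 8  # fp32 weights + fp32 gradients + two fp32 Adam moments


def parameter_floor_gib(params: int) -> float:
    """Parameter-related memory floor in GiB, excluding activations and buffers."""
    return params * BYTES_PER_PARAM / 1024**3


def floor_fits(params: int, capacity_gib: float) -> bool:
    """True only means the floor fits; activations still need room on top."""
    return parameter_floor_gib(params) < capacity_gib


# Hypothetical card capacities, chosen only for illustration.
for capacity_gib in (8, 16, 24):
    verdict = "floor fits" if floor_fits(1_000_000_000, capacity_gib) else "floor alone overflows"
    print(f"{capacity_gib:>2} GiB card: {verdict}")
```

A "floor fits" verdict is necessary, not sufficient: on the 16 GiB card, roughly 1 GiB would remain for activations, temporary buffers, and allocator overhead.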

Three beginner rules cover most cases:

  1. Model and inputs must be on compatible devices.
  2. Every host-device copy costs time.
  3. Training failures often come from memory, not math alone.

A standard PyTorch training loop satisfies the first rule by moving both the model and the batch to the same device:

ticket_batch_placement.py
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)
ticket_batch = torch.randn(32, 128, 768).to(device)
logits = model(ticket_batch)

print("model device:", next(model.parameters()).device)
print("batch device:", ticket_batch.device)
print("logits shape:", tuple(logits.shape))
Output
model device: cuda:0
batch device: cuda:0
logits shape: (32, 128, 3)

GPU index can vary; model and batch still need matching CUDA devices, and the shape contract stays stable.

Real batches often contain inputs, labels, and masks. Move every tensor that participates in device work:

move_whole_batch.py
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "token_features": torch.randn(4, 8, 16),
    "attention_mask": torch.ones(4, 8, dtype=torch.bool),
    "labels": torch.tensor([0, 2, 1, 0]),
}
moved = {name: tensor.to(device) for name, tensor in batch.items()}

assert all(tensor.device.type == device.type for tensor in moved.values())
print("all batch fields moved:", sorted(moved))

If one input stays on CPU while model parameters are on CUDA, the forward pass fails. A small preflight check makes that failure readable before a long training run begins:

catch_device_mismatch.py
import torch

def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
    # Note: torch.device("cuda") and torch.device("cuda:0") compare as unequal,
    # so in real code pass the device read from the model's parameters.
    if batch.device != model_device:
        raise RuntimeError(f"batch device does not match model device {model_device}")

# The batch is created on the CPU, so the check fails on purpose.
batch = torch.randn(4, 3)
try:
    require_same_device(torch.device("cuda"), batch)
except RuntimeError as error:
    print("caught:", error)
Output
caught: batch device does not match model device cuda

A small training example, step by step

Make the example concrete. Suppose you're training an access-ticket model that predicts whether a request should be answered, escalated, or blocked.

  1. The dataloader reads a batch of token IDs on the CPU.
  2. The batch is copied to device memory.
  3. The model weights already live on the GPU.
  4. PyTorch launches matmul, attention, and loss kernels on the GPU.
  5. Backward pass produces gradients on the GPU.
  6. The optimizer updates weights on the GPU.
  7. Only when you log a scalar or save results back to disk does the CPU need some of that state again.

That's why CUDA bugs often look strange at first. The Python code line you wrote and the GPU work it triggered are related, but they don't run in one shared place or finish at the same instant.

This complete stochastic-gradient-descent training step keeps the model, features, labels, logits, loss, and gradients on device until the final scalar is brought back for logging:

one_ticket_training_step.py
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(7)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(4, 8, device=device)
labels = torch.tensor([0, 2, 1, 0], device=device)

optimizer.zero_grad()
logits = model(features)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
logged_loss = loss.detach().cpu().item()

print("step device:", device)
print("finite loss:", math.isfinite(logged_loss))
Output
step device: cuda
finite loss: True

The output shows a run on a configured NVIDIA machine. On a CPU-only machine the same script prints step device: cpu; the structure of the training step doesn't change, only where the tensors live.

The same idea as a small table:

| Step | CPU side | GPU side | Common beginner mistake |
| --- | --- | --- | --- |
| batch read | collator builds tensors | nothing yet | assuming data is already on GPU |
| device copy | launch host-to-device transfer | receives batch in device memory | copying every tiny tensor separately |
| forward | queues layer calls | executes kernels | model on GPU, batch on CPU |
| backward | launches autograd work | computes gradients | OOM because activations were ignored |
| logging | asks for loss value | may still be finishing kernels | .item() every step hides synchronization cost |

If you can explain those five rows in your own words, you already understand more CUDA than many people who only know the slogan "GPUs are parallel."
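The "copying every tiny tensor separately" mistake in the device-copy row can be illustrated with a toy cost model: each copy pays a fixed overhead plus a size-dependent transfer time. Both constants below are made-up, illustrative numbers, not measurements of any real GPU or bus.

```python
# Toy cost model for host-to-device copies. Both constants are assumed,
# illustrative values, not measurements of real hardware.
FIXED_OVERHEAD_US = 10.0           # assumed per-copy setup cost, in microseconds
BANDWIDTH_BYTES_PER_US = 16_000.0  # assumed transfer rate (about 16 GB/s)


def copy_time_us(num_copies: int, total_bytes: int) -> float:
    """Total time when total_bytes is split evenly across num_copies transfers."""
    return num_copies * FIXED_OVERHEAD_US + total_bytes / BANDWIDTH_BYTES_PER_US


total_bytes = 1_000 * 4_096  # 1,000 small tensors of 4 KiB each
separate = copy_time_us(num_copies=1_000, total_bytes=total_bytes)
batched = copy_time_us(num_copies=1, total_bytes=total_bytes)

print(f"1,000 separate copies: {separate:,.0f} us")
print(f"one batched copy:      {batched:,.0f} us")
```

In this model the bytes moved are identical; only the number of fixed overheads changes. That is why collating fields into fewer, larger tensors before the transfer usually helps.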

Asynchronous execution and hidden sync points

One reason CUDA feels confusing is that the CPU usually launches GPU work asynchronously.[2] That means:

  • Python may continue before the GPU finishes the queued kernels.
  • timing a block with a naive host timer can under-report real GPU time
  • operations that need a CPU value force the host to wait for completion

Common sync points include:

  • loss.item() when loss is a CUDA tensor
  • tensor.cpu(), including the tensor.cpu().numpy() path used for NumPy analysis
  • logging or printing that materializes a CUDA value on the CPU
  • explicit torch.cuda.synchronize()

Calling .numpy() directly on a CUDA tensor raises a TypeError instead of returning an array: move the tensor to CPU first. This explicit boundary is a useful place to control logging frequency:

logging_boundary.py
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
logged_loss = loss.detach().cpu().item()

print(f"reported loss: {logged_loss:.1f}")
Output
reported loss: 2.5

On a CUDA device, the .cpu() call above waits until data needed for the copy is ready. That's why a loop can look fast until you add "just one print."
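Controlling logging frequency can be as simple as a step filter. The sketch below counts how many host waits a loop would incur if each logged step made one loss.detach().cpu().item() call. The should_log helper and the step counts are illustrative assumptions.

```python
def should_log(step: int, log_every: int) -> bool:
    """Log on steps log_every, 2 * log_every, ... (steps are 1-indexed here)."""
    return step % log_every == 0


total_steps = 1_000
for log_every in (1, 50):
    # Each logged step would bring one scalar back to the CPU: one host wait.
    host_waits = sum(should_log(step, log_every) for step in range(1, total_steps + 1))
    print(f"log_every={log_every:>2}: {host_waits} host waits in {total_steps} steps")
```

Logging every 50 steps instead of every step cuts the forced waits from 1,000 to 20 in this example, while still giving a usable loss curve.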

A timing trap you should recognize by hand

Suppose one forward pass queues 40 ms of GPU work, but the CPU finishes launching it in 2 ms.

  • A naive timer wrapped only around the Python call might report about 2 ms.
  • A synchronized timer reports the real end-to-end GPU time: about 40 ms.

That mismatch isn't a rounding error. It changes the engineering conclusion.

  • If you believe the 2 ms number, you may think the GPU is extremely fast and the bottleneck must be elsewhere.
  • If you measure the real 40 ms number, you may correctly conclude that sequence length, batch size, or kernel efficiency still need work.

Beginner CUDA debugging should always ask: did the measurement include synchronization, or did it only measure kernel launch overhead?
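The trap can be reproduced without a GPU. In this toy model a worker thread stands in for the GPU, time.sleep stands in for kernel execution, and queue.join() plays the role of synchronization. It is an analogy built from the standard library, not how CUDA streams are implemented.

```python
import queue
import threading
import time

# Toy model of asynchronous launch: a worker thread stands in for the GPU and
# time.sleep stands in for kernel execution. No real CUDA work happens here.
work_queue = queue.Queue()


def fake_gpu_worker() -> None:
    while True:
        kernel_seconds = work_queue.get()
        time.sleep(kernel_seconds)  # pretend to run the kernel
        work_queue.task_done()


threading.Thread(target=fake_gpu_worker, daemon=True).start()

start = time.perf_counter()
for _ in range(4):
    work_queue.put(0.01)  # "launch" four 10 ms kernels; put() returns immediately
naive_ms = (time.perf_counter() - start) * 1000

work_queue.join()  # wait for all queued work, like an explicit synchronize
synced_ms = (time.perf_counter() - start) * 1000

print(f"naive host timer:   {naive_ms:.1f} ms")
print(f"synchronized timer: {synced_ms:.1f} ms")
```

The naive timer measures only how long it took to enqueue the work. The synchronized timer includes the roughly 40 ms the worker needed to finish it.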

Figure: CUDA timing trap showing a naive host timer reporting 2 ms, the GPU still running queued kernels, and synchronization revealing the real 40 ms step time.
Asynchronous launch makes the host timer lie unless you synchronize. The CPU can finish queuing work quickly while the GPU is still busy with the real tensor math.

For real CUDA measurements, PyTorch recommends CUDA events or explicit synchronization around host timers.[2] Warm up the operation before recording steady-state work because first execution can include one-time setup costs. This script uses events when CUDA is available and keeps a runnable CPU fallback:

honest_matmul_timing.py
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 128, device=device)

if device.type == "cuda":
    for _ in range(3):
        y = x @ x
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        y = x @ x
    end.record()
    torch.cuda.synchronize()
    print("measured with CUDA events:", start.elapsed_time(end) >= 0)
else:
    y = x @ x
    print("CUDA events need CUDA; fallback result shape:", tuple(y.shape))
Output
measured with CUDA events: True

Reading nvidia-smi without over-trusting it

nvidia-smi is useful, but it isn't a full profiler. PyTorch also uses a caching allocator, so memory visible in nvidia-smi can include reserved memory that's not currently occupied by live tensors.[3][2]

Use it for:

  • checking whether the process attached to the GPU
  • checking rough process and device memory footprint
  • spotting obvious OOM pressure
  • seeing rough utilization snapshots

Don't use it as your only answer for:

  • kernel-level bottlenecks
  • whether dataloading is the issue
  • whether synchronization is killing throughput
  • whether the GPU is compute-bound or memory-bound

For code-level memory checks, separate live tensor bytes from allocator reservations. memory_allocated() tracks memory occupied by tensors. memory_reserved() tracks the larger pool managed by PyTorch's caching allocator. That pool can include unused memory kept for fast reuse, which is why nvidia-smi can report more memory than your live tensors occupy.[2]

allocator_counter_check.py
import torch

if torch.cuda.is_available():
    before_allocated = torch.cuda.memory_allocated()
    tensor = torch.ones(1024, 1024, device="cuda")
    after_allocated = torch.cuda.memory_allocated()
    after_reserved = torch.cuda.memory_reserved()
    print("live tensor allocation increased:", after_allocated > before_allocated)
    print("allocator reserved at least live bytes:", after_reserved >= after_allocated)
else:
    print("CUDA allocator counters need an accessible CUDA device")
Output
live tensor allocation increased: True
allocator reserved at least live bytes: True
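Why reserved memory can exceed allocated memory is easier to see in a toy model. The class below is a deliberately simplified, hypothetical allocator written for this explanation. It is not PyTorch's implementation; it only mimics the idea that freed blocks stay cached for reuse.

```python
class ToyCachingAllocator:
    """Toy model only: freed blocks stay in a pool instead of returning to the device."""

    def __init__(self) -> None:
        self.allocated = 0     # bytes held by live "tensors"
        self.reserved = 0      # bytes obtained from the device so far
        self.free_blocks = []  # sizes of cached blocks available for reuse

    def alloc(self, size: int) -> int:
        if size in self.free_blocks:
            self.free_blocks.remove(size)  # reuse a cached block of the same size
        else:
            self.reserved += size          # otherwise grow the reserved pool
        self.allocated += size
        return size

    def free(self, size: int) -> None:
        self.allocated -= size
        self.free_blocks.append(size)      # keep the block cached, do not shrink the pool


pool = ToyCachingAllocator()
block = pool.alloc(4 * 1024 * 1024)
pool.free(block)
print("allocated after free:", pool.allocated)
print("reserved after free: ", pool.reserved)
pool.alloc(4 * 1024 * 1024)
print("reserved after reuse:", pool.reserved)
```

After the free, the toy pool holds no live bytes but still reserves 4 MiB, and the next allocation of the same size reuses that block without growing the pool. A tool that only sees the reserved figure would report memory in use even though no tensor is alive.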

At the beginning, nvidia-smi, correct device placement, and these counters catch a large share of broken setups. Detailed profiling comes later.

First CUDA mistakes in training loops

1. Model on GPU, batch on CPU

  • Symptom: device mismatch error on forward pass.

  • Cause: weights and input tensors are on different devices.

  • Fix: move the whole batch, not a single field.

2. OOM on the first real batch

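  • Symptom: the script starts, maybe even builds the model, then fails on the forward pass, the backward pass, or the first optimizer step.

  • Cause: activations, gradients, and optimizer state push total memory over the card limit. PyTorch optimizers such as Adam typically allocate their state on the first optimizer.step(), so that memory arrives after the model has already loaded. Parameters alone aren't the full bill.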

  • Fix: shrink per-step batch size first. If you need to preserve effective batch size, accumulate gradients across several smaller steps. Reduce sequence length or enable mixed precision when the task allows it.
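The gradient-accumulation part of that fix rests on a small identity: for a mean loss and equal-sized micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient. Here is a minimal pure-Python sketch with a one-parameter model; the data and the mse_grad helper are illustrative assumptions.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full_batch_grad = mse_grad(w, xs, ys)

micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for i in range(accumulation_steps):
    lo = i * micro_batch_size
    hi = lo + micro_batch_size
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print("full-batch gradient: ", full_batch_grad)
print("accumulated gradient:", accumulated_grad)
```

In a PyTorch loop, the usual pattern is to divide each micro-batch loss by the number of accumulation steps, call backward() on each, and call optimizer.step() once after the last micro-batch. Only one micro-batch of activations is live at a time, which is where the memory saving comes from.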

Memory lever: Batch size and sequence length both reduce the number of token positions in a batch. Halving batch size halves that count; halving sequence length does too. Attention score tensors drop faster when sequence length shrinks because they have two token axes: where the scores are materialized, halving the sequence length cuts them to a quarter.

activation_position_budget.py
batch_size = 32
sequence_length = 128

def positions(batch: int, tokens: int) -> int:
    return batch * tokens

baseline = positions(batch_size, sequence_length)
for name, batch, tokens in [
    ("baseline", batch_size, sequence_length),
    ("half batch", batch_size // 2, sequence_length),
    ("half length", batch_size, sequence_length // 2),
]:
    ratio = positions(batch, tokens) / baseline
    print(f"{name:11s}: {ratio:.1%} of token positions")
Output
baseline   : 100.0% of token positions
half batch : 50.0% of token positions
half length: 50.0% of token positions
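The same comparison for attention scores shows why sequence length is the stronger lever there. This sketch assumes a materialized score tensor of shape (batch, heads, tokens, tokens) and an arbitrary head count of 12; fused kernels such as FlashAttention avoid storing the full tensor, so treat it as an upper-bound illustration.

```python
def attention_score_elements(batch: int, heads: int, tokens: int) -> int:
    """Elements in a materialized (batch, heads, tokens, tokens) score tensor."""
    return batch * heads * tokens * tokens


heads = 12  # assumed head count, for illustration only
baseline = attention_score_elements(32, heads, 128)
for name, batch, tokens in [
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    ratio = attention_score_elements(batch, heads, tokens) / baseline
    print(f"{name:11s}: {ratio:.1%} of attention score elements")
```

Halving the batch halves the score tensor, while halving the sequence length quarters it, because the token count appears on two axes.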

3. Slow loop despite high GPU memory usage

  • Symptom: the GPU memory is full enough to look "active," but throughput is poor.

  • Cause: the bottleneck may be dataloading, synchronization, small batch size, or repeated host-device copies.

  • Fix: check whether the data pipeline feeds the GPU fast enough before assuming the math kernels are the problem.
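A toy step-time model shows how a slow data pipeline starves the GPU even when memory looks full. The timings below are hypothetical, and the model ignores everything except data preparation and GPU compute. The overlapped case assumes the next batch is prepared while the current one trains.

```python
def gpu_busy_fraction(data_ms: float, gpu_ms: float, overlapped: bool) -> float:
    """Fraction of each step the GPU spends computing in this toy model."""
    step_ms = max(data_ms, gpu_ms) if overlapped else data_ms + gpu_ms
    return gpu_ms / step_ms


# Hypothetical per-step timings, chosen only to illustrate the effect.
for data_ms, gpu_ms in [(30.0, 10.0), (5.0, 10.0)]:
    serial = gpu_busy_fraction(data_ms, gpu_ms, overlapped=False)
    overlap = gpu_busy_fraction(data_ms, gpu_ms, overlapped=True)
    print(f"data {data_ms:>4.0f} ms, GPU {gpu_ms:.0f} ms: "
          f"busy {serial:.0%} serial, {overlap:.0%} overlapped")
```

When data preparation takes three times as long as the GPU work, the GPU stays mostly idle even with overlap. In that situation, speeding up the kernels would barely change throughput.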

4. Timing without synchronization

  • Symptom: a kernel appears to take almost no time.

  • Cause: the timer stopped before queued CUDA work completed.

  • Fix: warm up the operation, then synchronize before starting and after enqueueing the measured work, or use CUDA events.

What to understand before writing custom kernels

You don't need Triton or CUDA C++ to start training models, but you should already understand:

  • why GPUs help matrix-heavy workloads
  • why tensor placement is explicit
  • why device memory is limited and precious
  • why copies and sync points can dominate step time
  • why "GPU utilization" alone isn't a diagnosis

That foundation makes later topics less mysterious:

  • mixed precision
  • FlashAttention
  • FSDP and ZeRO
  • tensor parallelism
  • custom kernels

Self-check before bigger training runs

Answer these before moving on.

Expected output from your own explanation

At this point, explain without code:

  1. where the tensor starts
  2. when it moves
  3. where the heavy math runs
  4. which operations force the CPU to wait
  5. which memory terms can trigger OOM

If one of those five is fuzzy, re-read the step-by-step table and the timing trap section before moving on. Later training chapters assume this picture is stable.

What to remember

  • CUDA is an execution and memory model, not a speed checkbox.
  • The CPU orchestrates. The GPU executes dense parallel math.
  • Host RAM and device memory are different places with real transfer costs.
  • PyTorch queues CUDA work asynchronously, so .item() and .cpu() can stall the host.
  • OOM errors usually mean the full training footprint doesn't fit, not model weights alone.

If that picture feels solid, you're ready to reason about training loops on accelerators instead of treating the GPU as an opaque speed device.

Use this checklist as the handoff artifact: run the device check, measure one operation with explicit synchronization, and write which memory terms can trigger OOM.

Mastery check

Key concepts

  • CPU vs GPU execution roles
  • host memory vs device memory
  • kernels, thread blocks, and warps
  • PyTorch device placement
  • asynchronous CUDA execution
  • common synchronization points
  • CUDA OOM debugging basics
  • nvidia-smi and runtime sanity checks

Evaluation rubric

  • Foundational: Explains why training work moves from CPU orchestration to GPU kernels and device memory
  • Intermediate: Uses PyTorch device placement correctly and names when host-device transfers become a bottleneck
  • Advanced: Diagnoses first-line CUDA issues such as missing device placement, accidental synchronization, and out-of-memory failures

Common pitfalls

  • Treating CUDA as a speed flag instead of a different execution and memory model.
  • Moving the model to GPU but forgetting the inputs, causing device-mismatch errors.
  • Reading nvidia-smi as if it were a profiler. It shows memory and utilization snapshots, not full kernel timelines.
  • Calling .item(), .cpu(), or print() inside tight loops without realizing they can force synchronization.

Mastery check questions

1. An access-ticket model is training with its model weights and batches already on cuda:0. Profiling shows the forward and backward passes are dominated by large matrix multiplies and attention kernels, while dataloading and logging are minor. What conclusion follows?
2. In a CUDA training loop, adding print(loss.item()) after every batch makes the measured step time jump. What changed?
3. nvidia-smi lists an NVIDIA GPU, but this Python process prints torch.version.cuda as None and torch.cuda.is_available() as False. Which conclusion should you draw first?
4. For a simplified full-precision Adam setup with 1,000,000,000 parameters, count 4 bytes for weights, 4 bytes for gradients, and 8 bytes for Adam moments per parameter. What is the parameter-related memory floor before activations and temporary buffers?
5. During a training step on cuda:0, a batch dictionary contains token_features, attention_mask, and labels. The model and token_features are on cuda:0, but attention_mask and labels remain on CPU. What should the loop do before the forward pass and loss computation?
6. One forward pass queues about 40 ms of GPU work, but the CPU finishes launching it in about 2 ms. A host timer wrapped only around the Python call reports 2 ms. What measurement change gives the meaningful GPU time?
7. nvidia-smi shows a PyTorch process using much more GPU memory than torch.cuda.memory_allocated() reports for live tensors. Which interpretation matches PyTorch's allocator behavior?
8. A model loads on an 8 GiB GPU but OOMs on the first backward pass for sequences of length 128 and batch size 32. You need to keep the effective batch size near 32. What first change targets the likely training memory cause?
9. GPU memory is mostly allocated, but examples per second are poor. The loop copies many small tensors each step and prints CUDA losses frequently. What should you investigate before blaming the matrix kernels?

Next Step
Continue to MPS & Metal for ML on Mac

CUDA gave you accelerator basics in the NVIDIA world: host orchestration, device placement, synchronization, and memory pressure. The MPS chapter now maps those same ideas onto Apple silicon so Mac users can follow later training lessons with the right backend names and debugging checks.

References

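[1] CUDA Programming Guide. NVIDIA, 2026. https://docs.nvidia.com/cuda/cuda-programming-guide/

[2] CUDA semantics. PyTorch Contributors, 2026. https://docs.pytorch.org/docs/2.12/notes/cuda.html

[3] nvidia-smi documentation. NVIDIA, 2026. https://docs.nvidia.com/deploy/nvidia-smi/index.html

[4] Get Started. PyTorch Contributors, 2026. https://pytorch.org/get-started/locally/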