Build beginner-first CUDA intuition for model training: CPU vs GPU roles, host-device copies, asynchronous execution, PyTorch device placement, and first-line debugging of OOM and performance issues.
If you train on a Mac with Apple silicon, pair this lesson with MPS & Metal for ML on Mac. The device-placement ideas are the same; the backend and setup checks differ.
CUDA isn't a separate "AI mode." Suppose the access-ticket batch from the previous lesson has shape (32, 128, 768): 32 tickets, 128 tokens per ticket, and 768 features per token. CUDA adds a second contract to that shape contract: where the batch, weights, activations, and gradients live while training runs.
CUDA is NVIDIA's parallel computing platform and programming model. A CPU is built to run a small number of complicated threads with low latency; a GPU is built to run huge amounts of similar arithmetic at high throughput. For dense tensor work like matrix multiplies and attention, that difference can be dramatic once work becomes a large GPU kernel.[1] The flip side matters too: tiny tensors and repeated copies can spend more time on launch and transfer overhead than on useful math.
The arrays are the same ones you already learned to reason about. Now device placement becomes part of the meaning of every tensor. You'll check an environment, move a training batch, catch a placement failure before a forward pass, budget memory, and measure asynchronous work honestly.[2]
A training loop has two different jobs:
That split matters because the CPU and GPU don't share one flat memory space in the way beginners often imagine. On the standard discrete-GPU path, the GPU has its own device memory. If your tensors live on the CPU, GPU kernels can't use them until you copy them over.[1]
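To make that copy concrete, here is a minimal sketch that sizes the example batch from earlier in this lesson. It assumes the batch is stored as fp32 (4 bytes per element); `batch_bytes` is a hypothetical helper name, not a PyTorch API.

```python
import math

def batch_bytes(shape: tuple[int, ...], bytes_per_element: int = 4) -> int:
    """Return the storage size in bytes of a dense tensor with this shape."""
    return math.prod(shape) * bytes_per_element

# Assumption: the (32, 128, 768) access-ticket batch is stored as fp32.
shape = (32, 128, 768)
size = batch_bytes(shape)
print(f"elements: {math.prod(shape):,}")
print(f"host-to-device copy: {size / 1024**2:.1f} MiB per batch")
```

Under that assumption, a step that starts from CPU tensors has to move about 12 MiB to the device before the first kernel can use the batch.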
Keep this practical comparison in your head:
| Workload | CPU usually wins when | GPU usually wins when |
|---|---|---|
| Python control flow, branching, filesystem work | the work is serial, branchy, or tiny | not the right tool |
| Tensor math | the tensor is so small that transfer and launch overhead dominate | the operation is large, batched, and parallel, like matrix multiplication, convolutions, or attention |
| End-to-end training step | dataloading, logging, or synchronization stalls the loop | weights, activations, and batches already live on device and kernels stay large enough to saturate throughput |
Think about a CPU coordinator and a GPU worker pool. The CPU schedules work, but the GPU performs the bulk tensor math. Making the CPU path faster doesn't remove the GPU bottleneck when tensor operations dominate the work.
CUDA is NVIDIA's GPU computing platform and programming model.[1] In practice, for most AI engineers, that means five related ideas:
You don't need to write custom CUDA kernels on day one. You do need to understand that model layers, loss computation, backward passes, and optimizer updates launch GPU work once their tensors are on a CUDA device.
The NVIDIA System Management Interface command, nvidia-smi, reports devices visible to the NVIDIA driver. PyTorch's torch.cuda.is_available() reports whether this Python process can use CUDA. One can succeed while the other fails: for example, the driver may see a GPU while your environment has a CPU-only PyTorch installation.[3][2]
Run these checks before changing code:
```shell
nvidia-smi
python3 - <<'PY'
import torch
print("torch version:", torch.__version__)
print("compiled CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
PY
```

If the process can't access CUDA, use PyTorch's official installation selector for the current operating system, package manager, and supported CUDA option.[4] Wheel tags and supported runtimes change; a hard-coded installation command in an article ages badly.
Before you worry about throughput, make sure tensors land where you think they do. This script is intentionally device-agnostic: it runs on a CUDA machine and remains executable on a CPU-only laptop.
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"selected device: {device}")
print(f"cuda available: {torch.cuda.is_available()}")

x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
y = (x * 2).sum(dim=1)

print(f"x device: {x.device}")
print(f"y device: {y.device}")
print(f"result: {y.detach().cpu().tolist()}")
```

```text
selected device: cuda
cuda available: True
x device: cuda:0
y device: cuda:0
result: [6.0, 24.0]
```

The output above is the happy path from a configured NVIDIA machine. If your local run reports no accessible CUDA device, that doesn't automatically mean your code is wrong. It means one of these is true:
The first quick checks are usually:
```shell
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```

nvidia-smi tells you whether the driver sees the device. PyTorch tells you whether the framework can use it.
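The same framework-side checks can be gathered in one place. The sketch below is one possible way to do it: `describe_cuda` is a hypothetical helper name, and it only reports what PyTorch sees from this Python process, not what the driver sees.

```python
import torch

def describe_cuda() -> dict:
    """Collect the facts PyTorch reports about CUDA for this process."""
    info = {
        "torch_version": torch.__version__,
        "compiled_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "device_names": [],
    }
    if info["cuda_available"]:
        # Only query device names when CUDA is actually usable.
        info["device_names"] = [
            torch.cuda.get_device_name(i) for i in range(info["device_count"])
        ]
    return info

for key, value in describe_cuda().items():
    print(f"{key}: {value}")
```

If `compiled_cuda` prints `None` while nvidia-smi lists a GPU, the likely culprit is a CPU-only PyTorch build rather than the driver.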
Next comes memory placement.
| Where data lives | Typical examples | Why it matters |
|---|---|---|
| Host RAM | Python objects, dataset rows, CPU tensors | Easy to manipulate from Python; ordinary training tensors need a transfer before CUDA kernels use them |
| Device memory | model weights, activations, gradients, optimizer buffers on GPU | Fast for GPU compute, bounded per device, and expensive to overflow |
An out-of-memory (OOM) failure is local to the device running your job. A model that loads can still fail on its first training batch because weights are only one part of the budget. For a simplified full-precision Adam optimizer floor, count weights, gradients, and Adam's two running statistics. This still excludes activations, temporary buffers, and allocator overhead, so it's a lower bound rather than a capacity promise.
```python
params = 1_000_000_000
bytes_per_param = {
    "fp32 weights": 4,
    "fp32 gradients": 4,
    "fp32 Adam moments": 8,
}

total_bytes = sum(params * bytes_each for bytes_each in bytes_per_param.values())
gib = total_bytes / (1024 ** 3)
print(f"parameter-related floor: {gib:.2f} GiB")
print("activations and temporary buffers: add more memory")
```

```text
parameter-related floor: 14.90 GiB
activations and temporary buffers: add more memory
```

Three beginner rules cover most cases:
A standard PyTorch training loop usually does both:
```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)
ticket_batch = torch.randn(32, 128, 768).to(device)
logits = model(ticket_batch)

print("model device:", next(model.parameters()).device)
print("batch device:", ticket_batch.device)
print("logits shape:", tuple(logits.shape))
```

```text
model device: cuda:0
batch device: cuda:0
logits shape: (32, 128, 3)
```

GPU index can vary; model and batch still need matching CUDA devices, and the shape contract stays stable.
Real batches often contain inputs, labels, and masks. Move every tensor that participates in device work:
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "token_features": torch.randn(4, 8, 16),
    "attention_mask": torch.ones(4, 8, dtype=torch.bool),
    "labels": torch.tensor([0, 2, 1, 0]),
}
moved = {name: tensor.to(device) for name, tensor in batch.items()}

assert all(tensor.device.type == device.type for tensor in moved.values())
print("all batch fields moved:", sorted(moved))
```

If one input stays on CPU while model parameters are on CUDA, the forward pass fails. A small preflight check makes that failure readable before a long training run begins:
```python
import torch

def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
    if batch.device != model_device:
        raise RuntimeError(f"batch device does not match model device {model_device}")

batch = torch.randn(4, 3)
try:
    require_same_device(torch.device("cuda"), batch)
except RuntimeError as error:
    print("caught:", error)
```

```text
caught: batch device does not match model device cuda
```

Make the example concrete. Suppose you're training an access-ticket model that predicts whether a request should be answered, escalated, or blocked.
That's why CUDA bugs often look strange at first. The Python code line you wrote and the GPU work it triggered are related, but they don't run in one shared place or finish at the same instant.
This complete stochastic-gradient-descent training step keeps the model, features, labels, logits, loss, and gradients on device until the final scalar is brought back for logging:
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(7)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(4, 8, device=device)
labels = torch.tensor([0, 2, 1, 0], device=device)

optimizer.zero_grad()
logits = model(features)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
logged_loss = loss.detach().cpu().item()

print("step device:", device)
print("finite loss:", math.isfinite(logged_loss))
```

```text
step device: cuda
finite loss: True
```

The rendered output shows the configured NVIDIA path. The training-step structure stays the same across environments, but this lesson's output should model the accelerator run you are aiming for.
The same idea as a small table:
| Step | CPU side | GPU side | Common beginner mistake |
|---|---|---|---|
| batch read | collator builds tensors | nothing yet | assuming data is already on GPU |
| device copy | launch host-to-device transfer | receives batch in device memory | copying every tiny tensor separately |
| forward | queues layer calls | executes kernels | model on GPU, batch on CPU |
| backward | launches autograd work | computes gradients | OOM because activations were ignored |
| logging | asks for loss value | may still be finishing kernels | .item() every step hides synchronization cost |
If you can explain those five rows in your own words, you already understand more CUDA than many people who only know the slogan "GPUs are parallel."
One reason CUDA feels confusing is that the CPU usually launches GPU work asynchronously.[2] That means:
Common sync points include:
- loss.item() when loss is a CUDA tensor
- tensor.cpu(), including the tensor.cpu().numpy() path used for NumPy analysis
- torch.cuda.synchronize()

Calling .numpy() directly on a CUDA tensor isn't the route back to NumPy: move it to CPU first. This explicit boundary is a useful place to control logging frequency:
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
logged_loss = loss.detach().cpu().item()

print(f"reported loss: {logged_loss:.1f}")
```

```text
reported loss: 2.5
```

On a CUDA device, the .cpu() call above waits until data needed for the copy is ready. That's why a loop can look fast until you add "just one print."
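One way to control logging frequency is to keep a running loss on the device and read it back only every few steps. The sketch below is illustrative rather than a recommended recipe: the interval `log_every`, the random data, and the tiny linear model are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

log_every = 5  # assumption: an illustrative logging interval
running_loss = torch.zeros((), device=device)  # accumulator stays on device
logged = []

for step in range(1, 11):
    features = torch.randn(4, 8, device=device)
    labels = torch.randint(0, 3, (4,), device=device)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    optimizer.step()

    running_loss += loss.detach()  # device-side add, no host read
    if step % log_every == 0:
        # The only host read: one .item() per log_every steps.
        logged.append((running_loss / log_every).item())
        running_loss.zero_()

print("host reads:", len(logged))
```

Ten steps produce two host reads instead of ten, so the host spends less time waiting on the device just to print a number.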
Suppose one forward pass queues 40 ms of GPU work, but the CPU finishes launching it in 2 ms.
That mismatch isn't a rounding error. It changes the engineering conclusion.
Beginner CUDA debugging should always ask: did the measurement include synchronization, or did it only measure kernel launch overhead?
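The 2 ms versus 40 ms example can be turned into throughput numbers. This is a back-of-the-envelope sketch: the batch size of 32 is borrowed from the lesson's example batch, and it treats the 40 ms of GPU work as the whole cost of the pass.

```python
batch_size = 32   # assumption: the lesson's example batch size
launch_ms = 2.0   # host time to queue the forward pass
gpu_ms = 40.0     # device time to actually finish it

def samples_per_second(batch: int, elapsed_ms: float) -> float:
    """Convert a batch size and elapsed milliseconds into samples per second."""
    return batch / (elapsed_ms / 1000.0)

apparent = samples_per_second(batch_size, launch_ms)  # unsynchronized timer
actual = samples_per_second(batch_size, gpu_ms)       # synchronized timer
print(f"apparent: {apparent:,.0f} samples/s")
print(f"actual:   {actual:,.0f} samples/s")
print(f"overestimate: {apparent / actual:.0f}x")
```

A timer that stops after the launch would report roughly 16,000 samples per second, while the device is delivering about 800. That is a 20x overestimate.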
For real CUDA measurements, PyTorch recommends CUDA events or explicit synchronization around host timers.[2] Warm up the operation before recording steady-state work because first execution can include one-time setup costs. This script uses events when CUDA is available and keeps a runnable CPU fallback:
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 128, device=device)

if device.type == "cuda":
    for _ in range(3):
        y = x @ x
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        y = x @ x
    end.record()
    torch.cuda.synchronize()
    print("measured with CUDA events:", start.elapsed_time(end) >= 0)
else:
    y = x @ x
    print("CUDA events need CUDA; fallback result shape:", tuple(y.shape))
```

```text
measured with CUDA events: True
```

nvidia-smi without over-trusting it

nvidia-smi is useful, but it isn't a full profiler. PyTorch also uses a caching allocator, so memory visible in nvidia-smi can include reserved memory that's not currently occupied by live tensors.[3][2]
Use it for:
Don't use it as your only answer for:
For code-level memory checks, separate live tensor bytes from allocator reservations. memory_allocated() tracks memory occupied by tensors. memory_reserved() tracks the larger pool managed by PyTorch's caching allocator. That pool can include unused memory kept for fast reuse, which is why nvidia-smi can report more memory than your live tensors occupy.[2]
```python
import torch

if torch.cuda.is_available():
    before_allocated = torch.cuda.memory_allocated()
    tensor = torch.ones(1024, 1024, device="cuda")
    after_allocated = torch.cuda.memory_allocated()
    after_reserved = torch.cuda.memory_reserved()
    print("live tensor allocation increased:", after_allocated > before_allocated)
    print("allocator reserved at least live bytes:", after_reserved >= after_allocated)
else:
    print("CUDA allocator counters need an accessible CUDA device")
```

```text
live tensor allocation increased: True
allocator reserved at least live bytes: True
```

At the beginning, nvidia-smi, correct device placement, and these counters catch a large share of broken setups. Detailed profiling comes later.
Symptom: device mismatch error on forward pass.
Cause: weights and input tensors are on different devices.
Fix: move the whole batch, not a single field.
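Real collators sometimes nest tensors inside dicts or lists, so a flat comprehension can miss a field. A minimal sketch of a recursive mover follows; `move_batch` is a hypothetical helper, and the nested batch layout (including `extras` and `ticket_ids`) is an assumption made for illustration.

```python
import torch

def move_batch(batch, device: torch.device):
    """Recursively move tensors inside dicts, lists, and tuples to `device`."""
    if isinstance(batch, torch.Tensor):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_batch(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        # Rebuilds plain lists and tuples; namedtuples would need extra care.
        return type(batch)(move_batch(item, device) for item in batch)
    return batch  # leave non-tensor values (strings, ints) untouched

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "token_features": torch.randn(4, 8, 16),
    "extras": {"attention_mask": torch.ones(4, 8, dtype=torch.bool)},
    "labels": torch.tensor([0, 2, 1, 0]),
    "ticket_ids": ["a", "b", "c", "d"],
}
moved = move_batch(batch, device)
print("mask on target device:",
      moved["extras"]["attention_mask"].device.type == device.type)
```

Calling one helper at the top of the training step keeps placement in a single spot, which makes a stray CPU tensor easier to find.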
Symptom: the script starts, maybe even builds the model, then fails on the forward or backward pass.
Cause: activations and optimizer state push total memory over the card limit. Parameters alone aren't the full bill.
Fix: shrink per-step batch size first. If you need to preserve effective batch size, accumulate gradients across several smaller steps. Reduce sequence length or enable mixed precision when the task allows it.
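Gradient accumulation can be sketched in a few lines. The example below is a toy under stated assumptions: a tiny linear model, random data, a micro-batch of 4, and 8 accumulation steps for an effective batch of 32. Each micro-batch loss is divided by the number of steps so that the accumulated gradient matches the mean over the full batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

micro_batch_size = 4
accumulation_steps = 8  # effective batch size: 4 * 8 = 32
features = torch.randn(32, 8, device=device)
labels = torch.randint(0, 3, (32,), device=device)

optimizer.zero_grad()
for i in range(accumulation_steps):
    rows = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
    loss = F.cross_entropy(model(features[rows]), labels[rows])
    # Scale so the summed gradients equal the mean over all 32 rows.
    (loss / accumulation_steps).backward()
optimizer.step()  # one weight update for the whole effective batch

print("effective batch size:", micro_batch_size * accumulation_steps)
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from. The trade-off is more forward and backward passes per weight update.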
Memory lever: Batch size and sequence length both reduce the number of token positions in a batch. Halving batch size halves that count; halving sequence length does too. Attention score tensors can drop faster when sequence length shrinks because they have two token axes.
```python
batch_size = 32
sequence_length = 128

def positions(batch: int, tokens: int) -> int:
    return batch * tokens

baseline = positions(batch_size, sequence_length)
for name, batch, tokens in [
    ("baseline", batch_size, sequence_length),
    ("half batch", batch_size // 2, sequence_length),
    ("half length", batch_size, sequence_length // 2),
]:
    ratio = positions(batch, tokens) / baseline
    print(f"{name:11s}: {ratio:.1%} of token positions")
```

```text
baseline   : 100.0% of token positions
half batch : 50.0% of token positions
half length: 50.0% of token positions
```

Symptom: the GPU memory is full enough to look "active," but throughput is poor.
Cause: the bottleneck may be dataloading, synchronization, small batch size, or repeated host-device copies.
Fix: check whether the data pipeline feeds the GPU fast enough before assuming the math kernels are the problem.
Symptom: a kernel appears to take almost no time.
Cause: the timer stopped before queued CUDA work completed.
Fix: warm up the operation, then synchronize once before starting the timer and again after enqueueing the measured work, or use CUDA events.
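The host-timer variant of that fix might look like the sketch below. `timed_matmul_ms` is a hypothetical helper, and the matrix size and repeat count are arbitrary choices for illustration. On a CPU-only machine the synchronization calls are skipped because there is no queued device work to wait for.

```python
import time

import torch

def timed_matmul_ms(x: torch.Tensor, repeats: int = 10) -> float:
    """Average milliseconds per `x @ x`, measured with a host timer."""
    on_cuda = x.device.type == "cuda"
    for _ in range(3):  # warm-up: keep one-time setup costs out of the timing
        x @ x
    if on_cuda:
        torch.cuda.synchronize()  # drain queued work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        x @ x
    if on_cuda:
        torch.cuda.synchronize()  # wait for the measured work to finish
    return (time.perf_counter() - start) * 1000.0 / repeats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(256, 256, device=device)
print(f"{device.type}: {timed_matmul_ms(x):.3f} ms per matmul")
```

Removing the second synchronize on a CUDA device would make the timer measure mostly launch overhead, which is the failure described in the symptom.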
You don't need Triton or CUDA C++ to start training models, but you should already understand:
That foundation makes later topics less mysterious:
Answer these before moving on.
At this point, explain without code:
If one of those five is fuzzy, re-read the step-by-step table and the timing trap section before moving on. Later training chapters assume this picture is stable.
.item() and .cpu() can stall the host.
If that picture feels solid, you're ready to reason about training loops on accelerators instead of treating the GPU as an opaque speed device.
Use this checklist as the handoff artifact: run the device check, measure one operation with explicit synchronization, and write which memory terms can trigger OOM.
- nvidia-smi as if it were a profiler. It shows memory and utilization snapshots, not full kernel timelines.
- .item(), .cpu(), or print() inside tight loops without realizing they can force synchronization.