CUDA for ML Training

Build beginner-first CUDA intuition for model training: CPU vs GPU roles, host-device copies, asynchronous execution, PyTorch device placement, and first-line debugging of OOM and performance issues.

Learning path

Step 5 of 158 in the full curriculum

Platform path

If you train on a Mac with Apple silicon, pair it with MPS & Metal for ML on Mac. Same device-placement ideas, different backend and setup checks.

CUDA isn't a separate "AI mode." Suppose the access-ticket batch from the previous lesson has shape (32, 128, 768): 32 tickets, 128 tokens per ticket, and 768 features per token. CUDA adds a second contract to that shape contract: where the batch, weights, activations, and gradients live while training runs.

CUDA is NVIDIA's parallel computing platform and programming model. A CPU is built to run a small number of complicated threads with low latency; a GPU is built to run huge amounts of similar arithmetic at high throughput. For dense tensor work like matrix multiplies and attention, that difference can be dramatic once work becomes a large GPU kernel.[1] The flip side matters too: tiny tensors and repeated copies can spend more time on launch and transfer overhead than on useful math.

The arrays are the same ones you already learned to reason about. Now device placement becomes part of the meaning of every tensor. You'll check an environment, move a training batch, catch a placement failure before a forward pass, budget memory, and measure asynchronous work honestly.[2]

Figure: The ticket batch crosses the expensive boundary once. It is copied from CPU memory to CUDA memory, the forward and backward passes stay on the GPU, and only the scalar loss returns to the CPU for logging.

CPU orchestration vs GPU execution

A training loop has two different jobs:

The CPU side handles Python control flow, dataloading, launching kernels, logging, checkpointing, and filesystem work.
The GPU side handles the heavy tensor math: matrix multiplies, attention kernels, layer norms, optimizer updates, and other parallel operations.

That split matters because the CPU and GPU don't share one flat memory space in the way beginners often imagine. On the standard discrete-GPU path, the GPU has its own device memory. If your tensors live on the CPU, GPU kernels can't use them until you copy them over.[1]

Keep this practical comparison in your head:

Workload	CPU usually wins when	GPU usually wins when
Python control flow, branching, filesystem work	the work is serial, branchy, or tiny	not the right tool
Tensor math	the tensor is so small that transfer and launch overhead dominate	the operation is large, batched, and parallel, like matrix multiplication, convolutions, or attention
End-to-end training step	dataloading, logging, or synchronization stalls the loop	weights, activations, and batches already live on device and kernels stay large enough to saturate throughput

Think about a CPU coordinator and a GPU worker pool. The CPU schedules work, but the GPU performs the bulk tensor math. Making the CPU path faster doesn't remove the GPU bottleneck when tensor operations dominate the work.

What CUDA is

CUDA is NVIDIA's GPU computing platform and programming model.[1] In practice, for most AI engineers, that means five related ideas:

Kernels: functions that run on the GPU across many threads in parallel.
Thread hierarchy: threads are grouped into blocks, and blocks are grouped into a grid.
Warps: on NVIDIA GPUs, threads execute in groups of 32 called warps, so branch-heavy code can waste throughput when lanes in a warp diverge.[1]
Device memory: in the standard discrete-GPU setup, the GPU has a device-memory pool separate from host RAM.
Asynchronous launch: the CPU often queues GPU work and continues running until something forces synchronization.
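
To make the thread hierarchy concrete, here is a back-of-the-envelope sketch in plain Python. It assumes a one-thread-per-element launch and a block size of 256 threads, which is a common illustrative choice rather than anything this lesson prescribes; `launch_shape` is a hypothetical helper, and in practice PyTorch's kernels choose these numbers for you.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs, as stated above


def launch_shape(num_elements: int, threads_per_block: int = 256) -> dict:
    """Estimate a 1-D launch that gives each element its own thread.

    threads_per_block=256 is an assumed, illustrative block size.
    """
    blocks = math.ceil(num_elements / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return {
        "blocks_in_grid": blocks,
        "warps_per_block": warps_per_block,
        "total_threads": blocks * threads_per_block,
    }


# One thread per element of the (32, 128, 768) ticket batch.
shape = launch_shape(32 * 128 * 768)
print(shape)
```

Under these assumptions the batch maps to 12,288 blocks of 8 warps each, which gives a feel for how much parallel work one elementwise operation exposes.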

You don't need to write custom CUDA kernels on day one. You do need to understand that model layers, loss computation, backward passes, and optimizer updates launch GPU work once their tensors are on a CUDA device.

Select a compatible PyTorch build

The NVIDIA System Management Interface command, nvidia-smi, reports devices visible to the NVIDIA driver. PyTorch's torch.cuda.is_available() reports whether this Python process can use CUDA. One can succeed while the other fails: for example, the driver may see a GPU while your environment has a CPU-only PyTorch installation.[2][3]

Run these checks before changing code:

terminal

nvidia-smi
python3 - <<'PY'
import torch
print("torch version:", torch.__version__)
print("compiled CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
PY

If the process can't access CUDA, use PyTorch's official installation selector for the current operating system, package manager, and supported CUDA option.[4] Wheel tags and supported runtimes change; a hard-coded installation command in an article ages badly.

First device checks in PyTorch

Before you worry about throughput, make sure tensors land where you think they do. This script is intentionally device-agnostic: it runs on a CUDA machine and remains executable on a CPU-only laptop.

cuda_sanity_check.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"selected device: {device}")
print(f"cuda available: {torch.cuda.is_available()}")

x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
y = (x * 2).sum(dim=1)

print(f"x device: {x.device}")
print(f"y device: {y.device}")
print(f"result: {y.detach().cpu().tolist()}")

Output

selected device: cuda
cuda available: True
x device: cuda:0
y device: cuda:0
result: [6.0, 24.0]

The output above is the happy path from a configured NVIDIA machine. If your local run reports no accessible CUDA device, that doesn't automatically mean your code is wrong. It means one of these is true:

you're on a machine without an NVIDIA GPU
the driver is missing or mismatched
the environment isn't linked to a CUDA-enabled PyTorch build
the process can't access the GPU

The first quick checks are usually:

terminal-2

nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"

nvidia-smi tells you whether the driver sees the device. PyTorch tells you whether the framework can use it.
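
The PyTorch half of those checks can be folded into one small helper. This is a sketch, not an official diagnostic: `describe_cuda_setup` is a hypothetical name, and the messages are first guesses that map onto the causes listed above. It relies on `torch.version.cuda` being `None` in CPU-only builds.

```python
import torch


def describe_cuda_setup() -> str:
    """Return a one-line, first-guess diagnosis for this Python process."""
    if torch.cuda.is_available():
        return f"CUDA usable: {torch.cuda.device_count()} device(s) visible to PyTorch"
    if torch.version.cuda is None:
        # CPU-only builds report no compiled CUDA runtime.
        return "CPU-only PyTorch build: reinstall with a CUDA option if you have an NVIDIA GPU"
    return "CUDA build installed, but no usable device: check the driver and process access"


print(describe_cuda_setup())
```

The helper cannot see the driver itself, so it complements nvidia-smi rather than replacing it.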

Host memory vs device memory

Next comes memory placement.

Where data lives	Typical examples	Why it matters
Host RAM	Python objects, dataset rows, CPU tensors	Easy to manipulate from Python; ordinary training tensors need a transfer before CUDA kernels use them
Device memory	model weights, activations, gradients, optimizer buffers on GPU	Fast for GPU compute, bounded per device, and expensive to overflow

An out-of-memory (OOM) failure is local to the device running your job. A model that loads can still fail on its first training batch because weights are only one part of the budget. For a simplified full-precision Adam optimizer floor, count weights, gradients, and Adam's two running statistics. This still excludes activations, temporary buffers, and allocator overhead, so it's a lower bound rather than a capacity promise.

training_memory_floor.py

params = 1_000_000_000
bytes_per_param = {
    "fp32 weights": 4,
    "fp32 gradients": 4,
    "fp32 Adam moments": 8,
}

total_bytes = sum(params * bytes_each for bytes_each in bytes_per_param.values())
gib = total_bytes / (1024 ** 3)
print(f"parameter-related floor: {gib:.2f} GiB")
print("activations and temporary buffers: add more memory")

Output

parameter-related floor: 14.90 GiB
activations and temporary buffers: add more memory
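
For scale, the input batch itself is tiny next to that floor. A quick sketch, assuming the (32, 128, 768) ticket batch from the start of the lesson is stored as float32:

```python
import torch

# Same shape as the ticket batch: 32 tickets, 128 tokens, 768 features.
ticket_batch = torch.zeros(32, 128, 768, dtype=torch.float32)

# element_size() is bytes per element; nelement() is the element count.
batch_bytes = ticket_batch.element_size() * ticket_batch.nelement()
print(f"one fp32 batch: {batch_bytes / (1024 ** 2):.1f} MiB")  # one fp32 batch: 12.0 MiB
```

The same `element_size() * nelement()` arithmetic works for any tensor, which makes it a handy first check when you want to know what a single buffer costs.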

Three beginner rules cover most cases:

Model and inputs must be on compatible devices.
Every host-device copy costs time.
Training failures often come from memory, not math alone.

A standard PyTorch training loop usually moves both the model and each batch to the same device:

ticket_batch_placement.py

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)
ticket_batch = torch.randn(32, 128, 768).to(device)
logits = model(ticket_batch)

print("model device:", next(model.parameters()).device)
print("batch device:", ticket_batch.device)
print("logits shape:", tuple(logits.shape))

Output

model device: cuda:0
batch device: cuda:0
logits shape: (32, 128, 3)

GPU index can vary; model and batch still need matching CUDA devices, and the shape contract stays stable.

Real batches often contain inputs, labels, and masks. Move every tensor that participates in device work:

move_whole_batch.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "token_features": torch.randn(4, 8, 16),
    "attention_mask": torch.ones(4, 8, dtype=torch.bool),
    "labels": torch.tensor([0, 2, 1, 0]),
}
moved = {name: tensor.to(device) for name, tensor in batch.items()}

assert all(tensor.device.type == device.type for tensor in moved.values())
print("all batch fields moved:", sorted(moved))
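
Batches from a real collator sometimes mix tensors with plain Python values, which have no `.to()` method. The following is a minimal sketch of one way to handle that; `move_to_device` and the `ticket_ids` field are hypothetical names introduced for illustration, and the helper only covers dicts, lists, and plain tuples.

```python
import torch


def move_to_device(item, device: torch.device):
    """Recursively move tensors; leave other values (strings, ints) untouched."""
    if isinstance(item, torch.Tensor):
        return item.to(device)
    if isinstance(item, dict):
        return {key: move_to_device(value, device) for key, value in item.items()}
    if isinstance(item, (list, tuple)):
        return type(item)(move_to_device(value, device) for value in item)
    return item


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "token_features": torch.randn(4, 8, 16),
    "labels": torch.tensor([0, 2, 1, 0]),
    "ticket_ids": ["T-1", "T-2", "T-3", "T-4"],  # hypothetical non-tensor field
}
moved = move_to_device(batch, device)
print({name: getattr(value, "device", "not a tensor") for name, value in moved.items()})
```

Moving the whole structure in one call keeps the placement logic in a single place instead of scattering `.to(device)` calls through the loop.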

If one input stays on CPU while model parameters are on CUDA, the forward pass fails. A small preflight check makes that failure readable before a long training run begins:

catch_device_mismatch.py

import torch

def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
    if batch.device != model_device:
        raise RuntimeError(f"batch device does not match model device {model_device}")

batch = torch.randn(4, 3)
try:
    require_same_device(torch.device("cuda"), batch)
except RuntimeError as error:
    print("caught:", error)

Output

caught: batch device does not match model device cuda
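
One detail of exact device comparison is worth seeing once. `torch.device("cuda")` carries no index, while a tensor on the first GPU reports `cuda:0`, and the two do not compare equal. The sketch below runs without a GPU because it only constructs device objects; `same_device_type` is a hypothetical, looser alternative.

```python
import torch

bare = torch.device("cuda")       # no index: "whichever CUDA device is current"
indexed = torch.device("cuda:0")  # explicit index, as reported by tensor.device

print("bare == indexed:", bare == indexed)       # bare == indexed: False
print("same type:", bare.type == indexed.type)   # same type: True


def same_device_type(a: torch.device, b: torch.device) -> bool:
    """Looser check: only the backend ("cpu" or "cuda") has to match."""
    return a.type == b.type
```

In a real preflight check, passing `next(model.parameters()).device` as the model device avoids the issue, because that value already includes the index.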

A small training example, step by step

Make the example concrete. Suppose you're training an access-ticket model that predicts whether a request should be answered, escalated, or blocked.

The dataloader reads a batch of token IDs on the CPU.
The batch is copied to device memory.
The model weights already live on the GPU.
PyTorch launches matmul, attention, and loss kernels on the GPU.
Backward pass produces gradients on the GPU.
The optimizer updates weights on the GPU.
Only when you log a scalar or save results back to disk does the CPU need some of that state again.

That's why CUDA bugs often look strange at first. The Python code line you wrote and the GPU work it triggered are related, but they don't run in one shared place or finish at the same instant.

This complete stochastic-gradient-descent training step keeps the model, features, labels, logits, loss, and gradients on device until the final scalar is brought back for logging:

one_ticket_training_step.py

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(7)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(4, 8, device=device)
labels = torch.tensor([0, 2, 1, 0], device=device)

optimizer.zero_grad()
logits = model(features)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
logged_loss = loss.detach().cpu().item()

print("step device:", device)
print("finite loss:", math.isfinite(logged_loss))

Output

step device: cuda
finite loss: True

The output above comes from a configured NVIDIA machine. On a CPU-only machine the first line reads `step device: cpu`, and the structure of the training step is unchanged.

The same idea as a small table:

Step	CPU side	GPU side	Common beginner mistake
batch read	collator builds tensors	nothing yet	assuming data is already on GPU
device copy	launch host-to-device transfer	receives batch in device memory	copying every tiny tensor separately
forward	queues layer calls	executes kernels	model on GPU, batch on CPU
backward	launches autograd work	computes gradients	OOM because activations were ignored
logging	asks for loss value	may still be finishing kernels	`.item()` every step hides synchronization cost

If you can explain those five rows in your own words, you already understand more CUDA than many people who only know the slogan "GPUs are parallel."

Asynchronous execution and hidden sync points

One reason CUDA feels confusing is that the CPU usually launches GPU work asynchronously.[2] That means:

Python may continue before the GPU finishes the queued kernels.
Timing a block with a naive host timer can under-report real GPU time.
Operations that need a CPU value force the host to wait for completion.

Common sync points include:

loss.item() when loss is a CUDA tensor
tensor.cpu(), including the tensor.cpu().numpy() path used for NumPy analysis
logging or printing that materializes a CUDA value on the CPU
explicit torch.cuda.synchronize()

Calling .numpy() directly on a CUDA tensor raises an error instead of returning an array: move the tensor to the CPU first. This explicit boundary is a useful place to control logging frequency:

logging_boundary.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
logged_loss = loss.detach().cpu().item()

print(f"reported loss: {logged_loss:.1f}")

Output

reported loss: 2.5

On a CUDA device, the .cpu() call above waits until data needed for the copy is ready. That's why a loop can look fast until you add "just one print."
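
One way to control that cost is to keep a running sum on the device and read it back only every N steps. This is a sketch under assumptions: the interval of 50 is arbitrary, and `torch.rand` stands in for a real training loss.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log_every = 50  # assumed logging interval; tune for your run
running_loss = torch.zeros((), device=device)
logged = []

for step in range(1, 201):
    loss = torch.rand((), device=device)  # stand-in for a real training loss
    running_loss += loss.detach()         # stays on device, no host wait
    if step % log_every == 0:
        # The only host-device boundary: one .item() per logging interval.
        logged.append((running_loss / log_every).item())
        running_loss.zero_()

print("host reads:", len(logged), "for 200 steps")  # host reads: 4 for 200 steps
```

The averaged value is also less noisy than a single-step loss, so the log tends to be easier to read.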

A timing trap you should recognize by hand

Suppose one forward pass queues 40 ms of GPU work, but the CPU finishes launching it in 2 ms.

A naive timer wrapped only around the Python call might report about 2 ms.
A synchronized timer reports the real end-to-end GPU time: about 40 ms.

That mismatch isn't a rounding error. It changes the engineering conclusion.

If you believe the 2 ms number, you may think the GPU is extremely fast and the bottleneck must be elsewhere.
If you measure the real 40 ms number, you may correctly conclude that sequence length, batch size, or kernel efficiency still need work.

Beginner CUDA debugging should always ask: did the measurement include synchronization, or did it only measure kernel launch overhead?
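
The trap is easy to reproduce. The sketch below times the same matrix multiply twice with a host timer: once stopping right after the Python call returns, and once after synchronizing. The `sync` helper is a hypothetical convenience, and the matrix size is arbitrary. On a CPU-only machine the two numbers will be nearly identical, because there is no queued GPU work to miss.

```python
import time

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)


def sync() -> None:
    """Wait for queued GPU work; a no-op on CPU."""
    if device.type == "cuda":
        torch.cuda.synchronize()


for _ in range(3):  # warm-up so one-time setup is not measured
    x @ x
sync()

start = time.perf_counter()
y = x @ x
naive_ms = (time.perf_counter() - start) * 1000   # may only cover the launch
sync()
synced_ms = (time.perf_counter() - start) * 1000  # includes the finished work

print(f"naive: {naive_ms:.3f} ms, synchronized: {synced_ms:.3f} ms")
```

The gap between the two numbers on a CUDA machine is the work a naive timer would have attributed to whatever line happened to synchronize next.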

Figure: The CUDA timing trap. A naive host timer reports 2 ms while the GPU is still running queued kernels; synchronizing reveals the real 40 ms step time.

For real CUDA measurements, PyTorch recommends CUDA events or explicit synchronization around host timers.[2] Warm up the operation before recording steady-state work because first execution can include one-time setup costs. This script uses events when CUDA is available and keeps a runnable CPU fallback:

honest_matmul_timing.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 128, device=device)

if device.type == "cuda":
    for _ in range(3):
        y = x @ x
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        y = x @ x
    end.record()
    torch.cuda.synchronize()
    print("measured with CUDA events:", start.elapsed_time(end) >= 0)
else:
    y = x @ x
    print("CUDA events need CUDA; fallback result shape:", tuple(y.shape))

Output

measured with CUDA events: True

Reading `nvidia-smi` without over-trusting it

nvidia-smi is useful, but it isn't a full profiler. PyTorch also uses a caching allocator, so memory visible in nvidia-smi can include reserved memory that's not currently occupied by live tensors.[2][3]

Use it for:

checking whether the process attached to the GPU
checking rough process and device memory footprint
spotting obvious OOM pressure
seeing rough utilization snapshots

Don't use it as your only answer for:

kernel-level bottlenecks
whether dataloading is the issue
whether synchronization is killing throughput
whether the GPU is compute-bound or memory-bound

For code-level memory checks, separate live tensor bytes from allocator reservations. memory_allocated() tracks memory occupied by tensors. memory_reserved() tracks the larger pool managed by PyTorch's caching allocator. That pool can include unused memory kept for fast reuse, which is why nvidia-smi can report more memory than your live tensors occupy.[2]

allocator_counter_check.py

import torch

if torch.cuda.is_available():
    before_allocated = torch.cuda.memory_allocated()
    tensor = torch.ones(1024, 1024, device="cuda")
    after_allocated = torch.cuda.memory_allocated()
    after_reserved = torch.cuda.memory_reserved()
    print("live tensor allocation increased:", after_allocated > before_allocated)
    print("allocator reserved at least live bytes:", after_reserved >= after_allocated)
else:
    print("CUDA allocator counters need an accessible CUDA device")

Output

live tensor allocation increased: True
allocator reserved at least live bytes: True
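
The gap between the two counters shows up most clearly after a tensor is freed. The sketch below records both counters around a `del`; `allocator_after_free` is a hypothetical helper, and exact byte counts depend on the device and on what else the process has allocated, so no specific numbers are shown.

```python
from typing import Optional

import torch


def allocator_after_free() -> Optional[dict]:
    """Allocate a tensor, free it, and report both allocator counters."""
    if not torch.cuda.is_available():
        return None
    tensor = torch.ones(1024, 1024, device="cuda")
    allocated_live = torch.cuda.memory_allocated()
    del tensor  # the bytes return to PyTorch's cache, not necessarily to the driver
    return {
        "allocated_while_live": allocated_live,
        "allocated_after_del": torch.cuda.memory_allocated(),
        "reserved_after_del": torch.cuda.memory_reserved(),
    }


report = allocator_after_free()
print(report if report is not None else "CUDA allocator counters need an accessible CUDA device")
```

On a CUDA machine you would typically expect the allocated counter to fall after the `del` while the reserved counter stays put, which is the cached memory that nvidia-smi continues to report.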

At the beginning, nvidia-smi, correct device placement, and these counters catch a large share of broken setups. Detailed profiling comes later.

First CUDA mistakes in training loops

1. Model on GPU, batch on CPU

Symptom: device mismatch error on forward pass.
Cause: weights and input tensors are on different devices.
Fix: move the whole batch, not a single field.

2. OOM on the first real batch

Symptom: the script starts, maybe even builds the model, then fails on the forward or backward pass.
Cause: activations and optimizer state push total memory over the card limit. Parameters alone aren't the full bill.
Fix: shrink per-step batch size first. If you need to preserve effective batch size, accumulate gradients across several smaller steps. Reduce sequence length or enable mixed precision when the task allows it.
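
Gradient accumulation can be sketched in a few lines. The numbers here are assumptions for illustration: four micro-batches of 8 examples stand in for one batch of 32, and random tensors stand in for real ticket features and labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4  # assumed: 4 micro-batches of 8 stand in for one batch of 32
micro_batches = [
    (torch.randn(8, 8, device=device), torch.randint(0, 3, (8,), device=device))
    for _ in range(accumulation_steps)
]

optimizer.zero_grad()
for features, labels in micro_batches:
    loss = F.cross_entropy(model(features), labels)
    # Scale so the summed gradients match the mean over all 32 examples.
    (loss / accumulation_steps).backward()
optimizer.step()  # one weight update for the whole effective batch

print("optimizer steps taken: 1, examples seen:", 8 * accumulation_steps)
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from; the price is more forward and backward passes per weight update.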

Memory lever: Batch size and sequence length both control the number of token positions in a batch. Halving either one halves that count. Attention score tensors can shrink faster when sequence length drops because they have two token axes.

activation_position_budget.py

batch_size = 32
sequence_length = 128

def positions(batch: int, tokens: int) -> int:
    return batch * tokens

baseline = positions(batch_size, sequence_length)
for name, batch, tokens in [
    ("baseline", batch_size, sequence_length),
    ("half batch", batch_size // 2, sequence_length),
    ("half length", batch_size, sequence_length // 2),
]:
    ratio = positions(batch, tokens) / baseline
    print(f"{name:11s}: {ratio:.1%} of token positions")

Output

baseline   : 100.0% of token positions
half batch : 50.0% of token positions
half length: 50.0% of token positions
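
The same comparison for attention scores shows the quadratic effect. This sketch assumes the full score tensor of shape (batch, heads, tokens, tokens) is materialized, and the head count of 12 is an arbitrary illustrative value; fused attention kernels may never store this tensor in full.

```python
def attention_score_elements(batch: int, heads: int, tokens: int) -> int:
    """Elements in one layer's attention score tensor: (batch, heads, tokens, tokens)."""
    return batch * heads * tokens * tokens


heads = 12  # assumed head count, for illustration only
baseline = attention_score_elements(32, heads, 128)
for name, batch, tokens in [
    ("baseline", 32, 128),
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    ratio = attention_score_elements(batch, heads, tokens) / baseline
    print(f"{name:11s}: {ratio:.1%} of attention score elements")
```

Halving the batch leaves 50.0% of the score elements, while halving the sequence length leaves 25.0%, because the token count appears twice in the shape.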

3. Slow loop despite high GPU memory usage

Symptom: the GPU memory is full enough to look "active," but throughput is poor.
Cause: the bottleneck may be dataloading, synchronization, small batch size, or repeated host-device copies.
Fix: check whether the data pipeline feeds the GPU fast enough before assuming the math kernels are the problem.

4. Timing without synchronization

Symptom: a kernel appears to take almost no time.
Cause: the timer stopped before queued CUDA work completed.
Fix: warm up the operation, then synchronize before starting and after enqueueing the measured work, or use CUDA events.

What to understand before writing custom kernels

You don't need Triton or CUDA C++ to start training models, but you should already understand:

why GPUs help matrix-heavy workloads
why tensor placement is explicit
why device memory is limited and precious
why copies and sync points can dominate step time
why "GPU utilization" alone isn't a diagnosis

That foundation makes later topics less mysterious:

mixed precision
FlashAttention
FSDP and ZeRO
tensor parallelism
custom kernels

Self-check before bigger training runs

Work through the five prompts below before moving on.

Expected output from your own explanation

At this point, explain without code:

where the tensor starts
when it moves
where the heavy math runs
which operations force the CPU to wait
which memory terms can trigger OOM

If one of those five is fuzzy, re-read the step-by-step table and the timing trap section before moving on. Later training chapters assume this picture is stable.

What to remember

CUDA is an execution and memory model, not a speed checkbox.
The CPU orchestrates. The GPU executes dense parallel math.
Host RAM and device memory are different places with real transfer costs.
PyTorch queues CUDA work asynchronously, so .item() and .cpu() can stall the host.
OOM errors usually mean the full training footprint doesn't fit, not model weights alone.

If that picture feels solid, you're ready to reason about training loops on accelerators instead of treating the GPU as an opaque speed device.

Use this checklist as the handoff artifact: run the device check, measure one operation with explicit synchronization, and write down which memory terms can trigger OOM.

Mastery check

Key concepts

CPU vs GPU execution roles
host memory vs device memory
kernels, thread blocks, and warps
PyTorch device placement
asynchronous CUDA execution
common synchronization points
CUDA OOM debugging basics
nvidia-smi and runtime sanity checks

Evaluation rubric

Foundational: Explains why training work moves from CPU orchestration to GPU kernels and device memory
Intermediate: Uses PyTorch device placement correctly and names when host-device transfers become a bottleneck
Advanced: Diagnoses first-line CUDA issues such as missing device placement, accidental synchronization, and out-of-memory failures

Common pitfalls

Treating CUDA as a speed flag instead of a different execution and memory model.
Moving the model to GPU but forgetting the inputs, causing device-mismatch errors.
Reading nvidia-smi as if it were a profiler. It shows memory and utilization snapshots, not full kernel timelines.
Calling .item(), .cpu(), or print() inside tight loops without realizing they can force synchronization.

Next Step

Continue to MPS & Metal for ML on Mac

CUDA gave you accelerator basics in the NVIDIA world: host orchestration, device placement, synchronization, and memory pressure. The MPS chapter now maps those same ideas onto Apple silicon so Mac users can follow later training lessons with the right backend names and debugging checks.

References

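[1] CUDA Programming Guide. NVIDIA, 2026. https://docs.nvidia.com/cuda/cuda-programming-guide/

[2] CUDA semantics. PyTorch Contributors, 2026. https://docs.pytorch.org/docs/2.12/notes/cuda.html

[3] nvidia-smi documentation. NVIDIA, 2026. https://docs.nvidia.com/deploy/nvidia-smi/index.html

[4] Get Started. PyTorch Contributors, 2026. https://pytorch.org/get-started/locally/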

Back to Topics

LearnComputing FoundationsCUDA for ML Training

⚡EasyFine-Tuning & Training

CUDA for ML Training

Build beginner-first CUDA intuition for model training: CPU vs GPU roles, host-device copies, asynchronous execution, PyTorch device placement, and first-line debugging of OOM and performance issues.

14 min read

Learning path

Step 5 of 158 in the full curriculum

NumPy and Tensor Shapes MPS & Metal for ML on Mac

Platform path

If you train on a Mac with Apple silicon, pair it with MPS & Metal for ML on Mac. Same device-placement ideas, different backend and setup checks.

CPU orchestration vs GPU execution

A training loop has two different jobs:

The CPU side handles Python control flow, dataloading, launching kernels, logging, checkpointing, and filesystem work.
The GPU side handles the heavy tensor math: matrix multiplies, attention kernels, layer norms, optimizer updates, and other parallel operations.

Keep this practical comparison in your head:

Workload	CPU usually wins when	GPU usually wins when
Python control flow, branching, filesystem work	the work is serial, branchy, or tiny	not the right tool
Tensor math	the tensor is so small that transfer and launch overhead dominate	the operation is large, batched, and parallel, like matrix multiplication, convolutions, or attention
End-to-end training step	dataloading, logging, or synchronization stalls the loop	weights, activations, and batches already live on device and kernels stay large enough to saturate throughput

What CUDA is

Kernels: functions that run on the GPU across many threads in parallel.
Thread hierarchy: threads are grouped into blocks, and blocks are grouped into a grid.
Warps: on NVIDIA GPUs, threads execute in groups of 32 called warps, so branch-heavy code can waste throughput when lanes in a warp diverge.^{[1]Reference 1CUDA Programming Guide.https://docs.nvidia.com/cuda/cuda-programming-guide/}
Device memory: in the standard discrete-GPU setup, the GPU has a device-memory pool separate from host RAM.
Asynchronous launch: the CPU often queues GPU work and continues running until something forces synchronization.

Select a compatible PyTorch build

Run these checks before changing code:

terminal

nvidia-smi
python3 - <<'PY'
import torch
print("torch version:", torch.__version__)
print("compiled CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
PY

First device checks in PyTorch

Before you worry about throughput, make sure tensors land where you think they do. This script is intentionally device-agnostic: it runs on a CUDA machine and remains executable on a CPU-only laptop.

cuda_sanity_check.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"selected device: {device}")
print(f"cuda available: {torch.cuda.is_available()}")

x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
y = (x * 2).sum(dim=1)

print(f"x device: {x.device}")
print(f"y device: {y.device}")
print(f"result: {y.detach().cpu().tolist()}")

Output

selected device: cuda
cuda available: True
x device: cuda:0
y device: cuda:0
result: [6.0, 24.0]

you're on a machine without an NVIDIA GPU
the driver is missing or mismatched
the environment isn't linked to a CUDA-enabled PyTorch build
the process can't access the GPU

The first quick checks are usually:

terminal-2

nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"

nvidia-smi tells you whether the driver sees the device. PyTorch tells you whether the framework can use it.

Host memory vs device memory

Next comes memory placement.

Where data lives	Typical examples	Why it matters
Host RAM	Python objects, dataset rows, CPU tensors	Easy to manipulate from Python; ordinary training tensors need a transfer before CUDA kernels use them
Device memory	model weights, activations, gradients, optimizer buffers on GPU	Fast for GPU compute, bounded per device, and expensive to overflow

training_memory_floor.py

params = 1_000_000_000
bytes_per_param = {
    "fp32 weights": 4,
    "fp32 gradients": 4,
    "fp32 Adam moments": 8,
}

total_bytes = sum(params * bytes_each for bytes_each in bytes_per_param.values())
gib = total_bytes / (1024 ** 3)
print(f"parameter-related floor: {gib:.2f} GiB")
print("activations and temporary buffers: add more memory")

Output

parameter-related floor: 14.90 GiB
activations and temporary buffers: add more memory

Three beginner rules cover most cases:

Model and inputs must be on compatible devices.
Every host-device copy costs time.
Training failures often come from memory, not math alone.

A standard PyTorch training loop usually does both:

ticket_batch_placement.py

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)
ticket_batch = torch.randn(32, 128, 768).to(device)
logits = model(ticket_batch)

print("model device:", next(model.parameters()).device)
print("batch device:", ticket_batch.device)
print("logits shape:", tuple(logits.shape))

Output

model device: cuda:0
batch device: cuda:0
logits shape: (32, 128, 3)

GPU index can vary; model and batch still need matching CUDA devices, and the shape contract stays stable.

Real batches often contain inputs, labels, and masks. Move every tensor that participates in device work:

move_whole_batch.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "token_features": torch.randn(4, 8, 16),
    "attention_mask": torch.ones(4, 8, dtype=torch.bool),
    "labels": torch.tensor([0, 2, 1, 0]),
}
moved = {name: tensor.to(device) for name, tensor in batch.items()}

assert all(tensor.device.type == device.type for tensor in moved.values())
print("all batch fields moved:", sorted(moved))

If one input stays on CPU while model parameters are on CUDA, the forward pass fails. A small preflight check makes that failure readable before a long training run begins:

catch_device_mismatch.py

import torch

def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
    if batch.device != model_device:
        raise RuntimeError(f"batch device does not match model device {model_device}")

batch = torch.randn(4, 3)
try:
    require_same_device(torch.device("cuda"), batch)
except RuntimeError as error:
    print("caught:", error)

Output

caught: batch device does not match model device cuda

A small training example, step by step

Make the example concrete. Suppose you're training an access-ticket model that predicts whether a request should be answered, escalated, or blocked.

The dataloader reads a batch of token IDs on the CPU.
The batch is copied to device memory.
The model weights already live on the GPU.
PyTorch launches matmul, attention, and loss kernels on the GPU.
Backward pass produces gradients on the GPU.
The optimizer updates weights on the GPU.
Only when you log a scalar or save results back to disk does the CPU need some of that state again.

That's why CUDA bugs often look strange at first. The Python code line you wrote and the GPU work it triggered are related, but they don't run in one shared place or finish at the same instant.

This complete stochastic-gradient-descent training step keeps the model, features, labels, logits, loss, and gradients on device until the final scalar is brought back for logging:

one_ticket_training_step.py

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(7)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(4, 8, device=device)
labels = torch.tensor([0, 2, 1, 0], device=device)

optimizer.zero_grad()
logits = model(features)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
logged_loss = loss.detach().cpu().item()

print("step device:", device)
print("finite loss:", math.isfinite(logged_loss))

Output

step device: cuda
finite loss: True

The output above comes from a machine with a working NVIDIA GPU. On a machine without CUDA, the same script falls back to the CPU and prints step device: cpu. The structure of the training step is the same in both cases.

The same idea as a small table:

Step	CPU side	GPU side	Common beginner mistake
batch read	collator builds tensors	nothing yet	assuming data is already on GPU
device copy	launches host-to-device transfer	receives batch in device memory	copying every tiny tensor separately
forward	queues layer calls	executes kernels	model on GPU, batch on CPU
backward	launches autograd work	computes gradients	OOM because activations were ignored
logging	asks for loss value	may still be finishing kernels	calling .item() every step without noticing the synchronization cost

If you can explain those five rows in your own words, you already understand more CUDA than many people who only know the slogan "GPUs are parallel."

Asynchronous execution and hidden sync points

One reason CUDA feels confusing is that the CPU usually launches GPU work asynchronously.[2] That means:

Python may continue before the GPU finishes the queued kernels.
Timing a block with a naive host timer can under-report real GPU time.
Operations that need a CPU value force the host to wait for completion.

Common sync points include:

loss.item() when loss is a CUDA tensor
tensor.cpu(), including the tensor.cpu().numpy() path used for NumPy analysis
logging or printing that materializes a CUDA value on the CPU
explicit torch.cuda.synchronize()

Calling .numpy() directly on a CUDA tensor raises a TypeError, so move the tensor to CPU first with .cpu(). That explicit boundary is a useful place to control logging frequency:

logging_boundary.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
logged_loss = loss.detach().cpu().item()

print(f"reported loss: {logged_loss:.1f}")

Output

reported loss: 2.5

On a CUDA device, the .cpu() call above waits until data needed for the copy is ready. That's why a loop can look fast until you add "just one print."
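One way to act on this is to keep a running sum on the device and read it back only every few steps. The sketch below works under stated assumptions: log_every and the stand-in loss values are hypothetical, and a real loop would accumulate the loss produced by the forward pass.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log_every = 4  # hypothetical logging interval
running = torch.zeros((), device=device)  # running sum stays on the device
logged = []

for step in range(1, 9):
    loss = torch.tensor(float(step), device=device)  # stand-in for a real loss
    running += loss.detach()  # device-side add, no host read
    if step % log_every == 0:
        mean_loss = (running / log_every).item()  # the only host read in the loop
        logged.append((step, mean_loss))
        running.zero_()

for step, mean_loss in logged:
    print(f"step {step}: mean loss {mean_loss:.2f}")
```

With these stand-in values the loop reports a mean loss of 2.50 at step 4 and 6.50 at step 8. The host reads a CUDA value twice instead of eight times.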

A timing trap you should recognize by hand

Suppose one forward pass queues 40 ms of GPU work, but the CPU finishes launching it in 2 ms.

A naive timer wrapped only around the Python call might report about 2 ms.
A synchronized timer reports the real end-to-end GPU time: about 40 ms.

That mismatch isn't a rounding error. It changes the engineering conclusion.

If you believe the 2 ms number, you may think the GPU is extremely fast and the bottleneck must be elsewhere.
If you measure the real 40 ms number, you may correctly conclude that sequence length, batch size, or kernel efficiency still need work.
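The size of that mistake is easy to work out by hand. The arithmetic below is an illustrative sketch: the 2 ms and 40 ms figures come from the example above, and the batch size of 32 is borrowed from the lesson's ticket batch as an assumption.

```python
launch_ms = 2.0   # time the CPU spends queueing the work
gpu_ms = 40.0     # time the GPU needs to finish it
batch_size = 32   # assumed: the lesson's example batch of tickets

# Throughput implied by each measurement, in tickets per second.
naive_throughput = batch_size * 1000 / launch_ms
real_throughput = batch_size * 1000 / gpu_ms
under_report_factor = gpu_ms / launch_ms

print(f"naive timer implies {naive_throughput:.0f} tickets/s")
print(f"synchronized timer implies {real_throughput:.0f} tickets/s")
print(f"naive timer under-reports step time by {under_report_factor:.0f}x")
```

The naive number suggests 16000 tickets per second, while the synchronized number suggests 800. Any capacity plan built on the first figure is off by a factor of 20.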

Beginner CUDA debugging should always ask: did the measurement include synchronization, or did it only measure kernel launch overhead?

honest_matmul_timing.py

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 128, device=device)

if device.type == "cuda":
    for _ in range(3):
        y = x @ x
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        y = x @ x
    end.record()
    torch.cuda.synchronize()
    print("measured with CUDA events:", start.elapsed_time(end) >= 0)
else:
    y = x @ x
    print("CUDA events need CUDA; fallback result shape:", tuple(y.shape))

Output

measured with CUDA events: True

Reading nvidia-smi without over-trusting it

Use nvidia-smi for:

checking whether the process attached to the GPU
checking rough process and device memory footprint
spotting obvious OOM pressure
seeing rough utilization snapshots

Don't use it as your only answer for:

kernel-level bottlenecks
whether dataloading is the issue
whether synchronization is killing throughput
whether the GPU is compute-bound or memory-bound

allocator_counter_check.py

import torch

if torch.cuda.is_available():
    before_allocated = torch.cuda.memory_allocated()
    tensor = torch.ones(1024, 1024, device="cuda")
    after_allocated = torch.cuda.memory_allocated()
    after_reserved = torch.cuda.memory_reserved()
    print("live tensor allocation increased:", after_allocated > before_allocated)
    print("allocator reserved at least live bytes:", after_reserved >= after_allocated)
else:
    print("CUDA allocator counters need an accessible CUDA device")

Output

live tensor allocation increased: True
allocator reserved at least live bytes: True

At the beginning, nvidia-smi, correct device placement, and these counters catch a large share of broken setups. Detailed profiling comes later.
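If you want those rough snapshots inside a script, nvidia-smi can print selected fields as CSV. The sketch below is an assumption-laden illustration: the helper names and the sample line are hypothetical, and the query returns an empty list when nvidia-smi is missing or a field isn't reported as a number.

```python
import shutil
import subprocess

QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line: str) -> dict[str, int]:
    """Parse one CSV line such as '1234, 24576, 87' (MiB, MiB, percent)."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return {"memory_used_mib": used, "memory_total_mib": total, "utilization_pct": util}


def read_gpus() -> list[dict[str, int]]:
    """Return one snapshot per GPU, or an empty list if nothing usable is reported."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    try:
        return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
    except ValueError:  # a field came back as text such as "[N/A]"
        return []


sample_line = "1234, 24576, 87"  # hypothetical output line, not a real measurement
print("parsed sample:", parse_gpu_line(sample_line))
print("live snapshots:", read_gpus())
```

Treat the result the same way as the interactive command: a snapshot that confirms the process is attached and shows rough memory pressure, not a profile.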

First CUDA mistakes in training loops

1. Model on GPU, batch on CPU

Symptom: device mismatch error on forward pass.
Cause: weights and input tensors are on different devices.
Fix: move the whole batch, not a single field.
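Moving the whole batch usually means walking a nested structure. The helper below is a minimal sketch: the name to_device and the batch fields are assumptions for illustration, and the list/tuple branch handles plain lists and tuples only.

```python
import torch


def to_device(item, device: torch.device):
    """Recursively move every tensor in a nested batch to the target device."""
    # to_device is a hypothetical helper, not a PyTorch API.
    if isinstance(item, torch.Tensor):
        return item.to(device)
    if isinstance(item, dict):
        return {key: to_device(value, device) for key, value in item.items()}
    if isinstance(item, (list, tuple)):
        return type(item)(to_device(value, device) for value in item)
    return item  # strings, ints, and other metadata pass through unchanged


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "input_ids": torch.randint(0, 100, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
    "labels": torch.tensor([0, 2, 1, 0]),
    "ticket_ids": ["a1", "b2", "c3", "d4"],  # non-tensor metadata
}
batch = to_device(batch, device)

tensor_devices = {
    name: value.device.type for name, value in batch.items() if isinstance(value, torch.Tensor)
}
print(tensor_devices)
```

Every tensor field ends up on the same device type, and the ticket IDs stay a plain Python list, so no single field is left behind on the CPU.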

2. OOM on the first real batch

Symptom: the script starts, maybe even builds the model, then fails on the forward or backward pass.
Cause: activations and optimizer state push total memory over the card limit. Parameters alone aren't the full bill.
Fix: shrink per-step batch size first. If you need to preserve effective batch size, accumulate gradients across several smaller steps. Reduce sequence length or enable mixed precision when the task allows it.
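Gradient accumulation can be sketched on the lesson's small linear model. The example below is illustrative: the batch of 8 examples, the 4 micro-batches, and the variable names are assumptions. It first computes the full-batch gradient as a reference, then rebuilds it from micro-batches so that only 2 examples' activations are alive at a time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(8, 8, device=device)
labels = torch.randint(0, 3, (8,), device=device)

# Reference: gradient of the mean loss over the full batch of 8.
optimizer.zero_grad()
F.cross_entropy(model(features), labels).backward()
full_batch_grad = model.weight.grad.clone()

# Same effective batch, processed as 4 micro-batches of 2 examples.
accumulation_steps = 4
micro_batch_size = features.shape[0] // accumulation_steps
optimizer.zero_grad()
for micro_features, micro_labels in zip(
    features.split(micro_batch_size), labels.split(micro_batch_size)
):
    micro_loss = F.cross_entropy(model(micro_features), micro_labels)
    # Scale each loss so the summed gradients equal the full-batch mean.
    (micro_loss / accumulation_steps).backward()

grads_match = torch.allclose(model.weight.grad, full_batch_grad, atol=1e-6)
optimizer.step()  # one weight update for the whole effective batch
print("accumulated gradient matches full batch:", grads_match)
```

The division by accumulation_steps is the detail that is easy to miss. Without it the accumulated gradient is 4 times too large, which acts like an unplanned learning-rate increase.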

Memory lever: Batch size and sequence length both set the number of token positions in a batch. Halving batch size halves that count; halving sequence length does too. Attention score tensors shrink faster when sequence length drops because they have two token axes, so halving the length cuts them to a quarter.

activation_position_budget.py

batch_size = 32
sequence_length = 128

def positions(batch: int, tokens: int) -> int:
    return batch * tokens

baseline = positions(batch_size, sequence_length)
for name, batch, tokens in [
    ("baseline", batch_size, sequence_length),
    ("half batch", batch_size // 2, sequence_length),
    ("half length", batch_size, sequence_length // 2),
]:
    ratio = positions(batch, tokens) / baseline
    print(f"{name:11s}: {ratio:.1%} of token positions")

Output

baseline   : 100.0% of token positions
half batch : 50.0% of token positions
half length: 50.0% of token positions
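The same comparison for attention scores shows the quadratic effect of sequence length. This sketch assumes one score per query-key pair per head, giving a tensor of shape (batch, heads, tokens, tokens). The head count of 12 is a hypothetical value and cancels out of the ratios.

```python
batch_size = 32
sequence_length = 128
num_heads = 12  # hypothetical head count; it does not affect the ratios


def score_elements(batch: int, heads: int, tokens: int) -> int:
    # One score per (query token, key token) pair, per head, per example.
    return batch * heads * tokens * tokens


baseline_scores = score_elements(batch_size, num_heads, sequence_length)
ratios = {}
for name, batch, tokens in [
    ("baseline", batch_size, sequence_length),
    ("half batch", batch_size // 2, sequence_length),
    ("half length", batch_size, sequence_length // 2),
]:
    ratios[name] = score_elements(batch, num_heads, tokens) / baseline_scores
    print(f"{name:11s}: {ratios[name]:.1%} of attention score elements")
```

Halving the batch leaves 50.0% of the score elements, while halving the sequence length leaves 25.0%.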

3. Slow loop despite high GPU memory usage

Symptom: the GPU memory is full enough to look "active," but throughput is poor.
Cause: the bottleneck may be dataloading, synchronization, small batch size, or repeated host-device copies.
Fix: check whether the data pipeline feeds the GPU fast enough before assuming the math kernels are the problem.

4. Timing without synchronization

Symptom: a kernel appears to take almost no time.
Cause: the timer stopped before queued CUDA work completed.
Fix: warm up the operation, then synchronize before starting and after enqueueing the measured work, or use CUDA events.
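The host-timer version of that fix can be sketched as follows. The matrix size, repeat count, and the helper name wait_for_device are assumptions. On a CPU-only machine the synchronization step is skipped because CPU operations already finish before the call returns.

```python
import time

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(256, 256, device=device)


def wait_for_device() -> None:
    """Block the host until queued CUDA work is done; a no-op on CPU."""
    if device.type == "cuda":
        torch.cuda.synchronize()


for _ in range(3):  # warm-up so one-time setup cost is not measured
    y = x @ x
wait_for_device()  # make sure no warm-up work leaks into the timed region

repeats = 10
start = time.perf_counter()
for _ in range(repeats):
    y = x @ x
wait_for_device()  # stop the clock only after the queued work has finished
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"mean time per matmul on {device.type}: {elapsed_ms / repeats:.3f} ms")
```

Removing the second wait_for_device() call on a CUDA machine reproduces the trap: the timer then measures mostly how long it took to queue the work.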

What to understand before writing custom kernels

You don't need Triton or CUDA C++ to start training models, but you should already understand:

why GPUs help matrix-heavy workloads
why tensor placement is explicit
why device memory is limited and precious
why copies and sync points can dominate step time
why "GPU utilization" alone isn't a diagnosis

That foundation makes later topics less mysterious:

mixed precision
FlashAttention
FSDP and ZeRO
tensor parallelism
custom kernels

Self-check before bigger training runs

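Answer these before moving on.

Why can model(x) return before GPU work is done?
Why can a training loop OOM even when weights fit?
Why can one innocent print(loss.item()) change timing?
What must match before a forward pass works?
Your batch tensor is on cpu, your model weights are on cuda:0, and the first forward pass crashes. What is the first fix?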

Expected output from your own explanation

At this point, explain without code:

where the tensor starts
when it moves
where the heavy math runs
which operations force the CPU to wait
which memory terms can trigger OOM

If one of those five is fuzzy, re-read the step-by-step table and the timing trap section before moving on. Later training chapters assume this picture is stable.
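For the memory terms, a rough budget makes the point that weights are only part of the bill. The figures below are assumptions chosen for illustration: a hypothetical 125-million-parameter model, 4-byte float32 values, and an Adam-style optimizer that keeps two extra values per parameter. Plain SGD without momentum keeps none.

```python
GIB = 2**30


def persistent_training_bytes(
    parameter_count: int,
    bytes_per_value: int = 4,  # float32
    optimizer_states_per_parameter: int = 2,  # assumed Adam-style moments
) -> dict[str, int]:
    """Bytes that stay allocated across steps, before any activations."""
    weights = parameter_count * bytes_per_value
    return {
        "weights": weights,
        "gradients": weights,
        "optimizer state": weights * optimizer_states_per_parameter,
    }


parameter_count = 125_000_000  # hypothetical model size
terms = persistent_training_bytes(parameter_count)
total_bytes = sum(terms.values())

for name, size in terms.items():
    print(f"{name:15s}: {size / GIB:.2f} GiB")
print(f"{'total':15s}: {total_bytes / GIB:.2f} GiB before activations")
```

Under these assumptions the weights take about 0.47 GiB but the persistent total is about 1.86 GiB. Activations come on top and grow with batch size and sequence length, which is why they are the first thing to shrink after an OOM.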

What to remember

CUDA is an execution and memory model, not a speed checkbox.
The CPU orchestrates. The GPU executes dense parallel math.
Host RAM and device memory are different places with real transfer costs.
PyTorch queues CUDA work asynchronously, so .item() and .cpu() can stall the host.
OOM errors usually mean the full training footprint doesn't fit, not model weights alone.

If that picture feels solid, you're ready to reason about training loops on accelerators instead of treating the GPU as an opaque speed device.

Use this checklist as the handoff artifact: run the device check, measure one operation with explicit synchronization, and write down which memory terms can trigger OOM.

Mastery check

Key concepts

CPU vs GPU execution roles
host memory vs device memory
kernels, thread blocks, and warps
PyTorch device placement
asynchronous CUDA execution
common synchronization points
CUDA OOM debugging basics
nvidia-smi and runtime sanity checks

Evaluation rubric

Foundational: Explains why training work moves from CPU orchestration to GPU kernels and device memory
Intermediate: Uses PyTorch device placement correctly and names when host-device transfers become a bottleneck
Advanced: Diagnoses first-line CUDA issues such as missing device placement, accidental synchronization, and out-of-memory failures

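Follow-up questions

Why doesn't a fast CPU remove the need for CUDA when training models?
For a CPU tensor, what does tensor.to("cuda") change?
Why can loss.item() make a loop look slower than expected?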

Common pitfalls

Treating CUDA as a speed flag instead of a different execution and memory model.
Moving the model to GPU but forgetting the inputs, causing device-mismatch errors.
Reading nvidia-smi as if it were a profiler. It shows memory and utilization snapshots, not full kernel timelines.
Calling .item(), .cpu(), or print() inside tight loops without realizing they can force synchronization.

Next Step

Continue to MPS & Metal for ML on Mac


References

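1. CUDA Programming Guide. NVIDIA · 2026. https://docs.nvidia.com/cuda/cuda-programming-guide/
2. CUDA semantics. PyTorch Contributors · 2026. https://docs.pytorch.org/docs/2.12/notes/cuda.html
3. nvidia-smi documentation. NVIDIA · 2026.
4. Get Started. PyTorch Contributors · 2026.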

