MPS & Metal for ML on Mac

Build beginner-first intuition for training on Apple silicon: what Metal and MPS are, why unified memory changes the CUDA mental model, how PyTorch exposes the `mps` device, how to check availability, where CPU fallback appears, and how synchronization and memory pressure still shape performance.

Platform path

If you train on Linux or Windows with an NVIDIA GPU, start with CUDA for ML Training. If you train on a Mac with Apple silicon, use the matching path here.

The CUDA chapter moved a text-classification batch onto an NVIDIA GPU. On an Apple silicon Mac, the same batch shape (32, 128, 768) can run on the Apple GPU through PyTorch's mps device. You still place tensors deliberately, measure queued work honestly, and manage memory pressure; the backend name and memory architecture change.[1][2]

Build on that CUDA placement and timing model, then translate it to Apple's backend checks and unified-memory behavior.

That distinction matters when you develop locally. Large batched tensor math can use the Apple GPU, but unsupported operators, hidden synchronization, or an oversized workload can still leave a training step slow or broken. The Mac-specific checks in this chapter use the same issue classifier as the CUDA chapter: it predicts needs_docs, needs_test, or needs_review from developer text.

Figure: Apple silicon shares memory, but PyTorch placement is still explicit. The batch and weights need to land on `mps` together before Metal kernels do the training work. (Diagram: one shared memory pool, separate PyTorch cpu and mps placement, and a training step that moves the batch and model together before Metal kernels run.)

What MPS and Metal are

Metal is Apple's graphics and compute framework. PyTorch uses a Metal Performance Shaders (MPS) backend so ordinary tensor code can target Apple GPUs through MPS Graph and tuned MPS kernels.[1][2]

For a beginner, three names matter:

Metal: Apple's GPU programming framework.
MPS: Metal Performance Shaders, the backend whose graph and tuned kernels PyTorch maps operations onto.
mps device: the PyTorch device string used for tensors and modules assigned to this backend.

So the question isn't "CUDA or GPU?" It's "which backend does this machine expose to PyTorch?"

Backend status

Apple's PyTorch setup page currently labels MPS beta. An operation unsupported by the backend may fail or require CPU fallback, so availability alone isn't a performance guarantee.[1][3]

Why one memory pool changes the mental model

In the CUDA chapter, the key picture was system memory and discrete GPU video memory, with copies crossing an interconnect. Apple silicon uses a unified memory architecture: CPU and GPU share system memory instead of dividing storage into CPU RAM and separate video RAM.[4]

Keep the contrast precise:

On a discrete NVIDIA GPU, tensor.to("cuda") places tensor storage in the GPU's separate device memory.
On Apple silicon, tensor.to("mps") doesn't cross the same CPU-RAM-to-discrete-VRAM boundary. It still returns an mps tensor and selects MPS-backed execution. Don't infer that a framework-level device move is always free or always zero-copy.[4][2]

Two practical consequences follow, and both show up below:

There isn't a separate video-memory capacity to fill. The GPU shares system memory with macOS and other applications, so a large training run competes with the rest of the machine.
Placement is still explicit. You still write cpu and mps, keep model and batch on compatible devices, and avoid needless host-visible scalar reads in the hot path.

So unified memory doesn't let you skip device discipline. It changes the hardware boundary, not the PyTorch contract.

The tensor shape from the CUDA lesson gives a concrete first budget. A float32 text-classification batch uses four bytes for every feature value:

ticket_batch_bytes.py

batch, tokens, features = 32, 128, 768
bytes_per_float = 4
batch_bytes = batch * tokens * features * bytes_per_float

print("shape:", (batch, tokens, features))
print(f"one float32 batch: {batch_bytes / (1024 ** 2):.1f} MiB")
print("training also keeps weights, activations, gradients, and optimizer state")

Output

shape: (32, 128, 768)
one float32 batch: 12.0 MiB
training also keeps weights, activations, gradients, and optimizer state
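The last line of that output can be turned into a rough budget as well. The sketch below assumes the small classifier used later in this chapter (a 768-to-3 linear layer applied to mean-pooled token features), float32 everywhere, and an Adam-style optimizer that keeps two extra buffers per parameter. The optimizer choice is an assumption for illustration; plain SGD without momentum keeps no such buffers.

```python
# Rough float32 training budget for a 768 -> 3 linear classifier.
# Assumption: an Adam-style optimizer with two extra buffers per parameter.
batch, features, classes = 32, 768, 3
bytes_per_float = 4

param_count = features * classes + classes  # weight matrix plus bias
weight_bytes = param_count * bytes_per_float

budget = {
    "weights": weight_bytes,
    "gradients": weight_bytes,  # one gradient value per parameter
    "optimizer state": 2 * weight_bytes,  # assumed: two moment buffers
    "pooled activations": batch * features * bytes_per_float,  # (32, 768)
    "logits": batch * classes * bytes_per_float,  # (32, 3)
}

print("parameters:", param_count)
for name, size in budget.items():
    print(f"{name:>18}: {size / 1024:.1f} KiB")
```

For a classifier this small, the 12 MiB input batch dominates everything else. In larger models the balance usually shifts toward weights, activations, and optimizer state, which is why the batch figure is only a first budget line.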

First Mac requirements to check

Apple's MPS setup page lists an Apple silicon Mac, macOS 14.0 or later, Python 3.10 or later, and Xcode command-line tools for its documented path.[1] PyTorch release numbers change, so use the normal wheel install below and verify the backend instead of pinning a version in your setup notes.

Before writing model code, make sure the machine satisfies those basics.

terminal

xcode-select --install
python3 --version
sw_vers

When the machine looks compatible, install PyTorch from the normal wheel path Apple documents:

terminal

pip3 install torch torchvision torchaudio
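The same requirements can also be gathered from Python, which is handy when a notebook or virtual environment uses a different interpreter than the shell. This is a minimal sketch using only the standard library; it assumes a native Apple silicon interpreter reports `arm64` as its machine type, and the `report` dictionary is an illustrative name, not part of any PyTorch API.

```python
import importlib.util
import platform

# Collect the basics without importing torch, so the check also works
# in an environment where the install has not succeeded yet.
report = {
    "machine": platform.machine(),  # typically "arm64" on native Apple silicon
    "macos": platform.mac_ver()[0] or "not macOS",
    "python": platform.python_version(),
    "torch installed": importlib.util.find_spec("torch") is not None,
}

for key, value in report.items():
    print(f"{key}: {value}")
```

If the machine type shows `x86_64` on an Apple silicon Mac, the interpreter is probably running under translation, and it is worth fixing that before debugging the backend itself.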

First device checks in PyTorch

Start with one tiny script that distinguishes three states cleanly:

PyTorch was not built with MPS support.
PyTorch knows about MPS, but this machine or OS can't use it right now.
MPS is available, so you can move model and tensors onto mps.

mps_sanity_check.py

import torch

has_mps_backend = hasattr(torch.backends, "mps")
mps_built = bool(has_mps_backend and torch.backends.mps.is_built())
mps_available = bool(has_mps_backend and torch.backends.mps.is_available())

device = torch.device("mps") if mps_available else torch.device("cpu")

x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
model = torch.nn.Linear(3, 2).to(device)
y = model(x)

print(f"mps built: {mps_built}")
print(f"mps available: {mps_available}")
print(f"selected device: {device}")
print(f"output shape: {tuple(y.shape)}")

Output

mps built: True
mps available: True
selected device: mps
output shape: (2, 2)

PyTorch's MPS note uses the same is_built() and is_available() distinction.[2] Run this check on the target machine: on a compatible Mac it should select mps; elsewhere the fallback path is cpu. That split matters:

built = False usually means the installed PyTorch build is wrong for this machine
built = True, available = False usually means the build supports MPS, but the OS, hardware, or runtime access is missing
available = True means you can use torch.device("mps")

Check your reasoning before moving on: is_built() answers "does this wheel even know about MPS?" while is_available() answers "can this specific machine use it right now?" A strong answer keeps those two questions separate.

Figure: Read each observed state as a diagnosis, not as sequential setup steps: an available backend is the only state where this run should place model and ticket batch on `mps`. (Table: three rows mapping the not built, built but unavailable, and available states to the next concrete setup or ticket-batch placement action.)
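The three states can be sketched as a small helper that turns the two booleans into a next action. `diagnose_mps` and its message strings are hypothetical names for this lesson, not a PyTorch function; the helper takes plain booleans so it can be exercised on any machine.

```python
def diagnose_mps(built: bool, available: bool) -> str:
    """Map the two backend checks to one next action."""
    if not built:
        return "install a PyTorch build with MPS support"
    if not available:
        return "check macOS version, hardware, and runtime access"
    return "place model and batch on mps"


# Walk through the three states in the order the lesson lists them.
for built, available in [(False, False), (True, False), (True, True)]:
    print(f"built={built}, available={available}: {diagnose_mps(built, available)}")
```

On a real machine, the inputs would come from torch.backends.mps.is_built() and torch.backends.mps.is_available().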

One text-classification batch on Mac

Continue with the CUDA lesson's text-classification batch. Thirty-two issue reports become 128 hidden token positions each, with 768 features per position. For this small classifier, average those token positions into one vector per report, then produce three logits per report for needs_docs, needs_test, and needs_review.

On a Mac training run, one small step usually looks like this:

Step	CPU side	`mps` side	Why a beginner should care
batch assembly	tokenizer, collator, padding, labels	nothing yet	data still starts on the host
device move	Python asks for `.to("mps")`	batch becomes an `mps` tensor	placement is explicit and inspectable
forward pass	host launches ops	Metal kernels run embeddings, matmuls, norms	most heavy math lives here
loss read	host may ask for a scalar	device may need to finish queued work first	innocent-looking logging can stall the loop
backward pass	autograd schedules gradient work	gradient kernels run on `mps`	memory now includes activations and grads
optimizer step	host calls `step()`	parameter updates happen on `mps`	the model stays on device across steps

That table isn't theory for theory's sake. Later chapters on training loops, mixed precision, and checkpointing assume you can point at each row and say what ran on the CPU, what ran on the accelerator, and what could unexpectedly bounce back.

If you can't narrate one batch this way yet, stop here and do it slowly. Name tensor location, kernel location, sync point, and likely failure for each row. That habit transfers directly to real model debugging.

Same device rules, different backend

Unified memory tempts Mac users into thinking "one laptop, one memory pool, one device." The hardware shares memory, but PyTorch still doesn't work that way at the code level. cpu and mps are separate device targets in your code, and the forward pass still fails if model and batch land on different devices.[2]

Training code on Mac still follows the same pattern as CUDA:

mps_device_placement.py

import torch
import torch.nn as nn

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)
ticket_batch = torch.randn(32, 128, 768).to(device)

ticket_vectors = ticket_batch.mean(dim=1)
logits = model(ticket_vectors)
devices_match = next(model.parameters()).device == ticket_batch.device == logits.device
print("model and batch agree:", devices_match)
print("logits shape:", tuple(logits.shape))

Output

model and batch agree: True
logits shape: (32, 3)

Continue from the placed classifier and ticket batch with one small optimizer step. It uses MPS when available and remains executable on a non-Mac machine:

one_mps_ticket_step.py

import math

import torch
import torch.nn.functional as F

# Continues mps_device_placement.py: reuses device, model, and ticket_vectors.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
labels = torch.tensor([0, 2, 1, 0], device=device)  # one class index per report
weights_before = model.weight.detach().clone()  # snapshot to compare after the update

optimizer.zero_grad()
loss = F.cross_entropy(model(ticket_vectors[:4]), labels)  # first four reports only
loss.backward()
optimizer.step()
logged_loss = loss.detach().cpu().item()  # host read for reporting

print("weights changed:", not torch.equal(weights_before, model.weight.detach()))
print("finite loss:", math.isfinite(logged_loss))

Output

weights changed: True
finite loss: True

The .cpu().item() call is outside gradient computation and deliberately marks the boundary where a scalar returns to the host for reporting.

If model and batch devices don't agree, fix placement before looking for deeper bugs. Catch that failure before a full training run by checking the device contract:

catch_mps_mismatch.py

import torch

def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
    if batch.device != model_device:
        raise RuntimeError(f"batch device does not match model device {model_device}")

batch = torch.randn(8, 4)
try:
    require_same_device(torch.device("mps"), batch)
except RuntimeError as error:
    print("caught:", error)

Output

caught: batch device does not match model device mps

That's the same rule you learned in CUDA. The Mac path isn't an exemption from device consistency.
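Real batches are often dictionaries that mix tensors with plain Python metadata, so one common way to keep the device contract is to move the whole batch in one place. The sketch below assumes that dictionary layout; `move_batch` and the key names are hypothetical, and the script falls back to cpu when MPS is unavailable.

```python
import torch


def move_batch(batch: dict, device: torch.device) -> dict:
    """Return a copy of the batch with every tensor value placed on `device`."""
    return {
        key: value.to(device) if isinstance(value, torch.Tensor) else value
        for key, value in batch.items()
    }


device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
batch = {
    "features": torch.randn(4, 768),
    "labels": torch.tensor([0, 2, 1, 0]),
    "ticket_ids": ["a", "b", "c", "d"],  # non-tensor metadata stays on the host
}

placed = move_batch(batch, device)
tensor_devices = {
    value.device.type for value in placed.values() if isinstance(value, torch.Tensor)
}
print("tensor devices:", tensor_devices)
```

Moving everything through one helper means a mismatch can only come from the model side, which narrows the search when a forward pass fails.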

Unsupported operations and CPU fallback

An operation without an MPS implementation can stop an otherwise valid training loop. PyTorch exposes PYTORCH_ENABLE_MPS_FALLBACK=1 so unsupported MPS operations can run on CPU instead of failing immediately.[3] Treat that flag as a debugging aid, not a promise that every unsupported operation will work.

terminal

PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py

Fallback is a debugging tool

Fallback keeps debugging unblocked, but it can also hide backend switches and synchronization. If one layer keeps falling back to CPU inside a hot loop, throughput can collapse even though the script "works."

CPU work isn't automatically fallback. Tokenization and batch assembly normally happen on CPU before .to("mps"). MPS fallback means a PyTorch operation on the accelerator path has no MPS implementation and runs on CPU instead.

Use fallback to identify the blocking operation. Then decide whether to rewrite that part, upgrade PyTorch, change precision, or accept the CPU path for that workload.
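Because fallback can quietly change what a timing number means, it can help to record whether the flag was set for a given run. This is a minimal sketch that only inspects the environment; `fallback_enabled` is a hypothetical helper, and treating the value "1" as the only enabling setting is an assumption based on the usage shown above.

```python
import os
from typing import Mapping


def fallback_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Report whether the MPS CPU-fallback flag is set to "1" in `env`."""
    return env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1"


# Log the flag next to any benchmark so fallback runs are not mistaken
# for clean accelerator runs.
print("MPS CPU fallback enabled:", fallback_enabled())
print("example with flag set:", fallback_enabled({"PYTORCH_ENABLE_MPS_FALLBACK": "1"}))
```

A throughput figure recorded with the flag on deserves a second look before it is compared against a run without it.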

What fallback looks like in a real local model

Your issue classifier may spend most of its time in embedding lookup, attention, and classifier matmuls on mps, while one tensor-indexing operation has no MPS implementation and falls back to CPU inside every batch.

At small scale, you may barely notice:

batch leaves CPU
most model math runs on mps
unsupported op runs through CPU fallback
later mps work resumes

At larger scale, that repeated detour gets expensive because every backend switch disrupts the fast path. Throughput drops, step time becomes noisy, and profiler traces stop looking like one clean accelerator-bound loop. "Script finished" isn't the same as "training path is healthy."

You can reason about fallback cost without requiring an unsupported operator on your machine. Each backend change below is a boundary where execution or data handling has to switch paths:

fallback_boundary_count.py

path = ["host batch", "mps forward", "fallback op", "mps classifier", "host log"]
backends = [stage.split()[0] for stage in path]
switches = sum(left != right for left, right in zip(backends, backends[1:]))

print(" -> ".join(path))
print("backend changes:", switches)
print("fix target: remove fallback from hot path")

Output

host batch -> mps forward -> fallback op -> mps classifier -> host log
backend changes: 4
fix target: remove fallback from hot path

The first move and final log are deliberate boundaries. The middle CPU fallback is the boundary to remove from the hot path.

Timing and synchronization on MPS

The Mac timing trap looks like the CUDA timing trap. Kernel launches and queued device work can make naive timers lie. Warm up the operation first, then synchronize before and after the measured block. PyTorch exposes torch.mps.synchronize() for that boundary.[5]

mps_timing.py

import time
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(256, 256, device=device)
w = torch.randn(256, 256, device=device)

for _ in range(3):
    y = x @ w
if device.type == "mps":
    torch.mps.synchronize()

start = time.perf_counter()
for _ in range(10):
    y = x @ w
if device.type == "mps":
    torch.mps.synchronize()
elapsed_ms = (time.perf_counter() - start) * 1000

print("timed matmuls:", 10)
print("result shape:", tuple(y.shape))
print("elapsed is nonnegative:", elapsed_ms >= 0)

Output

timed matmuls: 10
result shape: (256, 256)
elapsed is nonnegative: True

The same hidden sync points still matter:

loss.item() when loss is an MPS tensor
tensor.cpu(), including tensor.cpu().numpy() for NumPy analysis
printing values that must come back to host memory

Calling .numpy() directly on an MPS tensor isn't the route back to NumPy: move the data to CPU first. Keep that reporting boundary explicit:

mps_logging_boundary.py

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
reported_loss = loss.detach().cpu().item()

print(f"reported loss: {reported_loss:.1f}")

Output

reported loss: 2.5

Don't trust timing claims until you know whether the host waited for the device.
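One way to keep host reads out of the hot path is to accumulate the loss on the device and read it back only every few steps. The sketch below uses stand-in loss values instead of a real model, and `log_every` is a hypothetical setting; it illustrates the pattern, not a tuned training loop.

```python
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
log_every = 4  # hypothetical logging interval
running_loss = torch.zeros((), device=device)
host_reads = 0

for step in range(1, 9):
    loss = torch.tensor(float(step), device=device)  # stand-in for a real loss
    running_loss += loss.detach()  # stays on the device, no host read
    if step % log_every == 0:
        # Single deliberate boundary: one scalar returns to the host.
        mean_loss = (running_loss / log_every).cpu().item()
        host_reads += 1
        print(f"step {step}: mean loss {mean_loss:.2f}")
        running_loss.zero_()

print("host reads:", host_reads)
```

Eight steps produce two host reads instead of eight, so the loop waits on the device a quarter as often while still reporting an averaged loss.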

Worked timing story by hand

Say one issue-classifier forward pass launches 35 ms of mps work, but Python finishes queueing it in 3 ms.

naive timer around the Python call reports about 3 ms
synchronized timer reports about 35 ms

That gap changes your engineering decisions.

If you believe 3 ms, you may chase dataloader code that isn't the bottleneck.
If you believe 35 ms, you know model-side math or sequence length still dominates.

Good accelerator debugging starts with honest measurement before clever optimization.
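The gap in that story can be observed directly by timing the same work twice, once without waiting for the device and once with an explicit wait. This is a sketch under the assumption that a batch of matmuls stands in for the forward pass; `time_ms` and `wait_for_device` are hypothetical helpers, and on a cpu-only machine the two numbers should be close because there is no queued device work to miss.

```python
import time

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(512, 512, device=device)
w = torch.randn(512, 512, device=device)


def wait_for_device() -> None:
    # Only MPS needs an explicit wait in this sketch; cpu ops finish inline.
    if device.type == "mps":
        torch.mps.synchronize()


def time_ms(synchronized: bool) -> float:
    wait_for_device()  # start from an idle queue either way
    start = time.perf_counter()
    for _ in range(20):
        y = x @ w
    if synchronized:
        wait_for_device()  # include the queued work in the measurement
    return (time.perf_counter() - start) * 1000


time_ms(synchronized=True)  # warm-up pass, result discarded
naive_ms = time_ms(synchronized=False)
sync_ms = time_ms(synchronized=True)
print(f"naive: {naive_ms:.2f} ms, synchronized: {sync_ms:.2f} ms on {device}")
```

The absolute numbers depend on the machine, so no expected output is shown. The comparison between the two is the part worth reading.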

Memory pressure on Apple GPUs

The first memory lesson is the same as CUDA: weights aren't the whole bill. Activations, gradients, optimizer state, and temporary workspaces matter too.

Unified memory adds one twist. There isn't a separate video-memory pool: a large run competes for system memory with macOS and every other app. PyTorch exposes current_allocated_memory() for bytes occupied by live tensors, driver_allocated_memory() for total memory allocated by Metal for the process (including cached allocator blocks and MPS/MPSGraph allocations), and empty_cache() to release unoccupied cached memory.[5]

Inspect those counters only after MPS is available:

mps_allocator_check.py

import torch

if torch.backends.mps.is_available():
    before = torch.mps.current_allocated_memory()
    tensor = torch.ones(1024, 1024, device="mps")
    after = torch.mps.current_allocated_memory()
    del tensor
    torch.mps.empty_cache()
    print("live tensor allocation increased:", after > before)
    print("recommended limit reported:", torch.mps.recommended_max_memory() > 0)
else:
    print("MPS allocator counters need an available mps device")
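For logging during a longer run, the same counters can be wrapped into one report that also includes the driver-level figure described above. `mps_memory_report` is a hypothetical helper name; it returns None when MPS is unavailable so the calling code stays portable.

```python
import torch


def mps_memory_report():
    """Return MPS memory counters in MiB, or None without an mps device."""
    if not torch.backends.mps.is_available():
        return None
    mib = 1024 ** 2
    return {
        "live tensors MiB": torch.mps.current_allocated_memory() / mib,
        "driver allocated MiB": torch.mps.driver_allocated_memory() / mib,
        "recommended max MiB": torch.mps.recommended_max_memory() / mib,
    }


report = mps_memory_report()
if report is None:
    print("no mps device: nothing to report")
else:
    for name, value in report.items():
        print(f"{name}: {value:.1f}")
```

A driver figure that stays well above the live-tensor figure usually points at cached blocks, which is the case empty_cache() is meant for.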

When memory gets tight, the fix order should stay boring:

Symptom	First question	First fix
OOM on first real batch	Is batch or sequence length too large?	shrink batch size first
Step time swings wildly	Are unsupported ops or sync points bouncing work back to CPU?	check fallback and logging paths
MPS allocator errors	Are you near working-set limits?	reduce workload before touching allocator env vars

PyTorch also exposes MPS-specific allocator controls such as PYTORCH_MPS_HIGH_WATERMARK_RATIO and PYTORCH_MPS_LOW_WATERMARK_RATIO.[3] They are advanced tuning knobs, not a first response. The documentation warns that disabling the high watermark can cause system failure under system-wide out-of-memory conditions. Start by shrinking work.

For the text-classification example, common first fixes are boring on purpose:

lower per-step batch size before touching allocator ratios
shorten sequence length if the task allows it
remove needless .cpu() calls before blaming Metal
confirm fallback isn't firing inside the hot path

Measure the simplest workload reductions before changing allocator limits:

mps_workload_reduction.py

batch_size = 32
sequence_length = 128
baseline_positions = batch_size * sequence_length

for label, batch, tokens in [
    ("baseline", 32, 128),
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    share = (batch * tokens) / baseline_positions
    print(f"{label:11s}: {share:.0%} of token positions")

Output

baseline   : 100% of token positions
half batch : 50% of token positions
half length: 50% of token positions

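Memory lever: Both changes halve token positions and the activation tensors that scale linearly with them. Shorter sequences shrink attention score tensors faster because attention has two sequence axes: halving sequence length cuts a full attention score tensor to a quarter, while halving batch size cuts it to half.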
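The attention effect can be checked with the same kind of arithmetic. The sketch assumes full self-attention that stores one float32 score per query and key position pair, and 12 attention heads; both are illustrative assumptions, since the small classifier in this chapter pools tokens without attention.

```python
# Attention-score memory for one layer under full self-attention.
# Assumptions: 12 heads (illustrative) and float32 scores.
heads, bytes_per_float = 12, 4


def score_bytes(batch: int, tokens: int) -> int:
    # One score per (query position, key position) pair, per head, per report.
    return batch * heads * tokens * tokens * bytes_per_float


baseline = score_bytes(32, 128)
shares = {}
for label, batch, tokens in [
    ("baseline", 32, 128),
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    size = score_bytes(batch, tokens)
    shares[label] = size / baseline
    print(f"{label:11s}: {size / 1024 ** 2:.0f} MiB of scores ({shares[label]:.0%})")
```

Under these assumptions the baseline holds 24 MiB of scores per layer. Halving the batch leaves 12 MiB and halving the sequence length leaves 6 MiB, which is why sequence length is often the stronger lever for attention-heavy models.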

What stays the same from CUDA

Different backend, same engineering questions:

where does the tensor live right now?
where does the forward pass run?
what forces the host to wait?
which step is moving data back to CPU?
which part of the memory footprint is blowing up?

If you can answer those on Mac, later training chapters stop feeling platform-specific.

Self-check before bigger runs

Cover the sections above and answer these before you peek.

What to remember

Metal is Apple's GPU stack. mps is PyTorch's device name for using it.
Apple silicon uses unified memory: CPU and GPU share system memory, but .to("mps") still selects MPS execution and doesn't promise a cost-free move. Apple currently labels the MPS backend beta.[4][1]
Mac training still needs explicit device placement for models and tensors.
is_built() and is_available() answer different setup questions.
PYTORCH_ENABLE_MPS_FALLBACK=1 is useful, but it can hide slow CPU detours.
Honest timing on MPS still depends on understanding synchronization.
The same text-classification training loop still needs clear batch flow, measurement, and memory discipline.

If you want broader accelerator intuition or you also work on NVIDIA servers, read CUDA for ML Training too. Same mental model, different backend details.

Mastery check

Key concepts

Metal vs MPS vs mps device
unified memory architecture vs discrete GPU VRAM
PyTorch is_built() vs is_available()
model and tensor device placement on Mac
CPU fallback for unsupported ops
MPS synchronization and timing
Apple GPU memory-pressure debugging basics

Evaluation rubric

Foundational: Explains how PyTorch maps model code onto Apple GPUs through the MPS backend, what unified memory changes versus a discrete GPU, and why Mac users still manage explicit device placement
Intermediate: Checks is_built() versus is_available(), moves model and batch onto mps, and identifies when unsupported operations fall back to CPU
Advanced: Diagnoses first-line MPS issues such as missing backend support, hidden synchronization, slow CPU fallback, and memory-pressure failures

Common pitfalls

Assuming a Mac needs no explicit device placement because CPU and GPU live in one laptop.
Checking only is_available() and missing the more basic case where the installed PyTorch build has no MPS support at all.
Leaving CPU fallback enabled and misreading a surviving script as a fast script.
Timing MPS work without synchronization, then underestimating true step time.

Next Step

Continue to Data Structures for AI Systems

CUDA and MPS taught you where training tensors live, how accelerator work gets launched, and why placement and synchronization matter before a model learns anything. Data structures now switches to another systems foundation: how to organize repeated lookups, queues, caches, and top-k state once those tensors start feeding real AI pipelines.

References

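[1] Apple. Accelerated PyTorch training on Mac. 2026. https://developer.apple.com/metal/pytorch/

[2] PyTorch Contributors. MPS backend. 2026. https://docs.pytorch.org/docs/stable/notes/mps

[3] PyTorch Contributors. MPS Environment Variables. 2026. https://docs.pytorch.org/docs/2.9/mps_environment_variables.html

[4] Apple (ml-explore). MLX: An array framework for Apple silicon. 2026. https://github.com/ml-explore/mlx

[5] PyTorch Contributors. torch.mps. 2026. https://docs.pytorch.org/docs/stable/mps.html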

Back to Topics

LearnComputing FoundationsMPS & Metal for ML on Mac

⚡EasyFine-Tuning & Training

MPS & Metal for ML on Mac

13 min read

Learning path

Step 6 of 158 in the full curriculum

CUDA for ML Training Data Structures for AI

Platform path

If you train on Linux or Windows with an NVIDIA GPU, start with CUDA for ML Training. If you train on a Mac with Apple silicon, use the matching path here.

Build on that CUDA placement and timing model, then translate it to Apple's backend checks and unified-memory behavior.

What MPS and Metal are

For a beginner, three names matter:

Metal: Apple's GPU programming framework.
MPS: Metal Performance Shaders, the backend whose graph and tuned kernels PyTorch maps operations onto.
mps device: the PyTorch device string used for tensors and modules assigned to this backend.

So the question isn't "CUDA or GPU?" It's "which backend does this machine expose to PyTorch?"

Backend status

Why one memory pool changes the mental model

Keep the contrast precise:

On a discrete NVIDIA GPU, tensor.to("cuda") places tensor storage in the GPU's separate device memory.
On Apple silicon, tensor.to("mps") doesn't cross the same CPU-RAM-to-discrete-VRAM boundary. It still returns an mps tensor and selects MPS-backed execution. Don't infer that a framework-level device move is always free or always zero-copy.^{[4]Reference 4MLX: An array framework for Apple siliconhttps://github.com/ml-explore/mlx}^{[2]Reference 2MPS backend.https://docs.pytorch.org/docs/stable/notes/mps}

Two practical consequences follow, and both show up below:

There isn't a separate video-memory capacity to fill. The GPU shares system memory with macOS and other applications, so a large training run competes with the rest of the machine.
Placement is still explicit. You still write cpu and mps, keep model and batch on compatible devices, and avoid needless host-visible scalar reads in the hot path.

So unified memory doesn't let you skip device discipline. It changes the hardware boundary, not the PyTorch contract.

The tensor shape from the CUDA lesson gives a concrete first budget. A float32 text-classification batch uses four bytes for every feature value:

ticket_batch_bytes.py

batch, tokens, features = 32, 128, 768
bytes_per_float = 4
batch_bytes = batch * tokens * features * bytes_per_float

print("shape:", (batch, tokens, features))
print(f"one float32 batch: {batch_bytes / (1024 ** 2):.1f} MiB")
print("training also keeps weights, activations, gradients, and optimizer state")

Output

shape: (32, 128, 768)
one float32 batch: 12.0 MiB
training also keeps weights, activations, gradients, and optimizer state

First Mac requirements to check

Before writing model code, make sure the machine satisfies those basics.

terminal

xcode-select --install
python3 --version
sw_vers

When the machine looks compatible, install PyTorch from the normal wheel path Apple documents:

terminal

pip3 install torch torchvision torchaudio

First device checks in PyTorch

Start with one tiny script that distinguishes three states cleanly:

PyTorch was not built with MPS support.
PyTorch knows about MPS, but this machine or OS can't use it right now.
MPS is available, so you can move model and tensors onto mps.

mps_sanity_check.py

import torch

has_mps_backend = hasattr(torch.backends, "mps")
mps_built = bool(has_mps_backend and torch.backends.mps.is_built())
mps_available = bool(has_mps_backend and torch.backends.mps.is_available())

device = torch.device("mps") if mps_available else torch.device("cpu")

x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
model = torch.nn.Linear(3, 2).to(device)
y = model(x)

print(f"mps built: {mps_built}")
print(f"mps available: {mps_available}")
print(f"selected device: {device}")
print(f"output shape: {tuple(y.shape)}")

Output

mps built: True
mps available: True
selected device: mps
output shape: (2, 2)

built = False usually means wrong PyTorch build for this machine
built = True, available = False usually means supported backend exists but OS, hardware, or runtime access is missing
available = True means you can use torch.device("mps")

One text-classification batch on Mac

On a Mac training run, one small step usually looks like this:

Step	CPU side	`mps` side	Why beginner should care
batch assembly	tokenizer, collator, padding, labels	nothing yet	data still starts on host
device move	Python asks for `.to("mps")`	batch becomes an `mps` tensor	placement is explicit and inspectable
forward pass	host launches ops	Metal kernels run embeddings, matmuls, norms	most heavy math lives here
loss read	maybe host asks for scalar	device may need to finish queued work first	innocent logging can stall loop
backward pass	autograd schedules gradient work	gradient kernels run on `mps`	memory now includes activations and grads
optimizer step	host calls `step()`	parameter updates happen on `mps`	model stays on device across steps

Same device rules, different backend

Training code on Mac still follows the same pattern as CUDA:

mps_device_placement.py

import torch
import torch.nn as nn

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)
ticket_batch = torch.randn(32, 128, 768).to(device)

ticket_vectors = ticket_batch.mean(dim=1)
logits = model(ticket_vectors)
devices_match = next(model.parameters()).device == ticket_batch.device == logits.device
print("model and batch agree:", devices_match)
print("logits shape:", tuple(logits.shape))

Output

model and batch agree: True
logits shape: (32, 3)

Continue from the placed classifier and ticket batch with one small optimizer step. It uses MPS when available and remains executable on a non-Mac machine:

one_mps_ticket_step.py

import math

import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
labels = torch.tensor([0, 2, 1, 0], device=device)
weights_before = model.weight.detach().clone()

optimizer.zero_grad()
loss = F.cross_entropy(model(ticket_vectors[:4]), labels)
loss.backward()
optimizer.step()
logged_loss = loss.detach().cpu().item()

print("weights changed:", not torch.equal(weights_before, model.weight.detach()))
print("finite loss:", math.isfinite(logged_loss))

Output

weights changed: True
finite loss: True

The .cpu().item() call is outside gradient computation and deliberately marks the boundary where a scalar returns to the host for reporting.

If model and batch devices don't agree, fix placement before looking for deeper bugs. Catch that failure before a full training run by checking the device contract:

catch_mps_mismatch.py

import torch

def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
    if batch.device != model_device:
        raise RuntimeError(f"batch device does not match model device {model_device}")

batch = torch.randn(8, 4)
try:
    require_same_device(torch.device("mps"), batch)
except RuntimeError as error:
    print("caught:", error)

Output

caught: batch device does not match model device mps

That's the same rule you learned in CUDA. The Mac path isn't an exemption from device consistency.

Unsupported operations and CPU fallback

terminal

PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py

Fallback is a debugging tool

Use fallback to identify the blocking operation. Then decide whether to rewrite that part, upgrade PyTorch, change precision, or accept the CPU path for that workload.

What fallback looks like in a real local model

At small scale, you may barely notice:

batch leaves CPU
most model math runs on mps
unsupported op runs through CPU fallback
later mps work resumes

You can reason about fallback cost without requiring an unsupported operator on your machine. Each backend change below is a boundary where execution or data handling has to switch paths:

fallback_boundary_count.py

path = ["host batch", "mps forward", "fallback op", "mps classifier", "host log"]
backends = [stage.split()[0] for stage in path]
switches = sum(left != right for left, right in zip(backends, backends[1:]))

print(" -> ".join(path))
print("backend changes:", switches)
print("fix target: remove fallback from hot path")

Output

host batch -> mps forward -> fallback op -> mps classifier -> host log
backend changes: 4
fix target: remove fallback from hot path

The first move and final log are deliberate boundaries. The middle CPU fallback is the boundary to remove from the hot path.

Timing and synchronization on MPS

mps_timing.py

import time
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(256, 256, device=device)
w = torch.randn(256, 256, device=device)

for _ in range(3):
    y = x @ w
if device.type == "mps":
    torch.mps.synchronize()

start = time.perf_counter()
for _ in range(10):
    y = x @ w
if device.type == "mps":
    torch.mps.synchronize()
elapsed_ms = (time.perf_counter() - start) * 1000

print("timed matmuls:", 10)
print("result shape:", tuple(y.shape))
print("elapsed is nonnegative:", elapsed_ms >= 0)

Output

timed matmuls: 10
result shape: (256, 256)
elapsed is nonnegative: True

The same hidden sync points from CUDA still matter:

loss.item() when loss is an MPS tensor
tensor.cpu(), including tensor.cpu().numpy() for NumPy analysis
printing values that must come back to host memory

Calling .numpy() directly on an MPS tensor isn't the route back to NumPy: move the data to CPU first. Keep that reporting boundary explicit:

mps_logging_boundary.py

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
reported_loss = loss.detach().cpu().item()

print(f"reported loss: {reported_loss:.1f}")

Output

reported loss: 2.5

Don't trust timing claims until you know whether the host waited for the device.
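One way to keep the reporting boundary cheap is to accumulate the loss on the device and read it back only every few steps. A sketch with made-up loss values and a hypothetical `log_every` interval:

```python
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
log_every = 4                                # hypothetical logging interval
running = torch.zeros((), device=device)     # running sum stays on the device
host_reads = 0

for step in range(1, 9):
    loss = torch.tensor(float(step), device=device)  # stand-in for a real loss
    running += loss.detach()                 # no read back to the host here
    if step % log_every == 0:
        mean_loss = (running / log_every).item()  # explicit host read
        host_reads += 1
        print(f"step {step}: mean loss {mean_loss:.2f}")
        running.zero_()

print("host reads:", host_reads)
```

Eight steps produce two host reads instead of eight, so the host waits on the device only at the points where a number is actually reported.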

Worked timing story by hand

Say one issue-classifier forward pass launches 35 ms of mps work, but Python finishes queueing it in 3 ms.

naive timer around the Python call reports about 3 ms
synchronized timer reports about 35 ms

That gap changes your engineering decisions.

If you believe 3 ms, you may chase dataloader code that isn't the bottleneck.
If you believe 35 ms, you know model-side math or sequence length still dominates.

Good accelerator debugging starts with honest measurement before clever optimization.
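The same gap can be observed directly. A sketch that takes two readings around one matmul: one right after the Python call returns and one after the device finishes. The tensor size is arbitrary, and on a CPU-only machine both readings will be nearly identical because there is no queue to wait on:

```python
import time
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

def wait_for_device() -> None:
    # Only MPS needs an explicit wait in this sketch.
    if device.type == "mps":
        torch.mps.synchronize()

y = x @ w            # warm-up so one-time setup is not timed
wait_for_device()

start = time.perf_counter()
y = x @ w
naive_ms = (time.perf_counter() - start) * 1000   # Python call returned
wait_for_device()
synced_ms = (time.perf_counter() - start) * 1000  # device work finished

print(f"naive reading:  {naive_ms:.3f} ms")
print(f"synced reading: {synced_ms:.3f} ms")
```

The exact numbers depend on the machine; the point is which of the two readings you would have reported.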

Memory pressure on Apple GPUs

The first memory lesson is the same as CUDA: weights aren't the whole bill. Activations, gradients, optimizer state, and temporary workspaces matter too.
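A back-of-envelope sketch of that bill, assuming fp32 values, an Adam-style optimizer that keeps two extra values per parameter, and a hypothetical 110-million-parameter model. Activations and temporary workspaces depend on batch shape and are left out:

```python
params = 110_000_000          # hypothetical parameter count
bytes_per_value = 4           # fp32

weights = params * bytes_per_value
gradients = params * bytes_per_value
optimizer_state = 2 * params * bytes_per_value  # two moment estimates per parameter

total = weights + gradients + optimizer_state
for name, size in [
    ("weights", weights),
    ("gradients", gradients),
    ("optimizer state", optimizer_state),
    ("total", total),
]:
    print(f"{name:16s}{size / 1e9:.2f} GB")
```

Under these assumptions the weights are only a quarter of the persistent training state, before any activation memory is counted.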

PyTorch exposes allocator counters under torch.mps. Inspect them only after confirming that MPS is available:

mps_allocator_check.py

import torch

if torch.backends.mps.is_available():
    before = torch.mps.current_allocated_memory()
    tensor = torch.ones(1024, 1024, device="mps")
    after = torch.mps.current_allocated_memory()
    del tensor
    torch.mps.empty_cache()
    print("live tensor allocation increased:", after > before)
    print("recommended limit reported:", torch.mps.recommended_max_memory() > 0)
else:
    print("MPS allocator counters need an available mps device")
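The allocated-memory counter and the recommended limit can be combined into a rough headroom figure. A sketch with a hypothetical helper `used_fraction`; the byte counts in the non-MPS branch are made up so the arithmetic still runs without an MPS device:

```python
import torch

def used_fraction(allocated_bytes: int, limit_bytes: int) -> float:
    # Hypothetical helper: share of the recommended limit currently allocated.
    return allocated_bytes / limit_bytes if limit_bytes > 0 else 0.0

if torch.backends.mps.is_available():
    allocated = torch.mps.current_allocated_memory()
    limit = torch.mps.recommended_max_memory()
else:
    allocated, limit = 2 * 1024**3, 8 * 1024**3  # made-up sample values

print(f"allocated share of recommended limit: {used_fraction(allocated, limit):.0%}")
```

A share that climbs toward 100% during the first few steps is a cue to reduce the workload before the allocator starts failing.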

When memory gets tight, the order of fixes should stay boring:

Symptom	First question	First fix
OOM on first real batch	Is batch or sequence length too large?	shrink batch size first
Step time swings wildly	Are unsupported ops or sync points bouncing work back to CPU?	check fallback and logging paths
MPS allocator errors	Are you near working-set limits?	reduce workload before touching allocator env vars

For the text-classification example, common first fixes are boring on purpose:

lower per-step batch size before touching allocator ratios
shorten sequence length if the task allows it
remove needless .cpu() calls before blaming Metal
confirm fallback isn't firing inside the hot path

Measure the simplest workload reductions before changing allocator limits:

mps_workload_reduction.py

batch_size = 32
sequence_length = 128
baseline_positions = batch_size * sequence_length

for label, batch, tokens in [
    ("baseline", 32, 128),
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    share = (batch * tokens) / baseline_positions
    print(f"{label:11s}: {share:.0%} of token positions")

Output

baseline   : 100% of token positions
half batch : 50% of token positions
half length: 50% of token positions

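Memory lever: Both changes halve token positions, and with them many activation tensors. Shorter sequences shrink attention score tensors faster: attention has two sequence axes, so halving sequence length cuts those tensors to about a quarter, while halving batch size cuts them to half.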
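The difference between the two levers shows up when counting attention score elements, which scale with batch × heads × length × length. A sketch assuming 12 attention heads (a hypothetical value; the ratios don't depend on it):

```python
heads = 12  # hypothetical head count

def score_elements(batch: int, tokens: int) -> int:
    # One score per (query position, key position) pair, per head, per example.
    return batch * heads * tokens * tokens

baseline = score_elements(32, 128)
for label, batch, tokens in [
    ("baseline", 32, 128),
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    share = score_elements(batch, tokens) / baseline
    print(f"{label:11s}: {share:.0%} of attention score elements")
```

Halving the batch leaves half of the score elements, while halving the sequence length leaves a quarter.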

What stays the same from CUDA

Different backend, same engineering questions:

where does the tensor live right now?
where does the forward pass run?
what forces the host to wait?
which step is moving data back to CPU?
which part of the memory footprint is blowing up?

If you can answer those on Mac, later training chapters stop feeling platform-specific.

Self-check before bigger runs

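Cover the sections above and answer these before you peek.

You are on an Apple silicon Mac. torch.backends.mps.is_built() is True, but torch.backends.mps.is_available() is False. What does that tell you first?
Your training loop "works" only with PYTORCH_ENABLE_MPS_FALLBACK=1, but step time is awful. What is most likely happening?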

What to remember

Metal is Apple's GPU stack. mps is PyTorch's device name for using it.
Apple silicon uses unified memory: CPU and GPU share system memory, but .to("mps") still selects MPS execution and doesn't promise a cost-free move. Apple currently labels the MPS backend beta.^{[4]Reference 4MLX: An array framework for Apple siliconhttps://github.com/ml-explore/mlx}^{[1]Reference 1Accelerated PyTorch training on Mac.https://developer.apple.com/metal/pytorch/}
Mac training still needs explicit device placement for models and tensors.
is_built() and is_available() answer different setup questions.
PYTORCH_ENABLE_MPS_FALLBACK=1 is useful, but it can hide slow CPU detours.
Honest timing on MPS still depends on understanding synchronization.
The same text-classification training loop still needs clear batch flow, measurement, and memory discipline.

If you want broader accelerator intuition or you also work on NVIDIA servers, read CUDA for ML Training too. Same mental model, different backend details.

Mastery check

Key concepts

Metal vs MPS vs mps device
unified memory architecture vs discrete GPU VRAM
PyTorch is_built() vs is_available()
model and tensor device placement on Mac
CPU fallback for unsupported ops
MPS synchronization and timing
Apple GPU memory-pressure debugging basics

Evaluation rubric

Foundational: Explains how PyTorch maps model code onto Apple GPUs through the MPS backend, what unified memory changes versus a discrete GPU, and why Mac users still manage explicit device placement
Intermediate: Checks is_built() versus is_available(), moves model and batch onto mps, and identifies when unsupported operations fall back to CPU
Advanced: Diagnoses first-line MPS issues such as missing backend support, hidden synchronization, slow CPU fallback, and memory-pressure failures

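Follow-up questions

Why is there a separate Mac accelerator chapter instead of only reusing the CUDA one?
What does torch.backends.mps.is_built() tell you that is_available() doesn't?
Why can PYTORCH_ENABLE_MPS_FALLBACK=1 be both useful and dangerous?
If Apple silicon shares one unified memory pool, why do you still write .to("mps") instead of skipping device placement entirely?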

Common pitfalls

Assuming a Mac needs no explicit device placement because CPU and GPU live in one laptop.
Checking only is_available() and missing the more basic case where the installed PyTorch build has no MPS support at all.
Leaving CPU fallback enabled and misreading a surviving script as a fast script.
Timing MPS work without synchronization, then underestimating true step time.

References

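[1] Apple. Accelerated PyTorch training on Mac. 2026. https://developer.apple.com/metal/pytorch/
[2] PyTorch Contributors. MPS backend. 2026. https://docs.pytorch.org/docs/stable/notes/mps
[3] PyTorch Contributors. MPS Environment Variables. 2026.
[4] Apple (ml-explore). MLX: An array framework for Apple silicon. 2026. https://github.com/ml-explore/mlx
[5] PyTorch Contributors. torch.mps. 2026.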

MPS & Metal for ML on Mac

What MPS and Metal are

Why one memory pool changes the mental model

First Mac requirements to check

First device checks in PyTorch

One text-classification batch on Mac

Same device rules, different backend

Unsupported operations and CPU fallback

Your tokenizer runs on CPU before ticket_batch.to("mps"). Is that MPS fallback?

What fallback looks like in a real local model

Timing and synchronization on MPS

Worked timing story by hand

Memory pressure on Apple GPUs

What stays the same from CUDA

Self-check before bigger runs

You are on an Apple silicon Mac. torch.backends.mps.is_built() is True, but torch.backends.mps.is_available() is False. What does that tell you first?

Your training loop "works" only with PYTORCH_ENABLE_MPS_FALLBACK=1, but step time is awful. What is most likely happening?

What to remember

Mastery check

Key concepts

Evaluation rubric

Follow-up questions

Why is there a separate Mac accelerator chapter instead of only reusing the CUDA one?

What does torch.backends.mps.is_built() tell you that is_available() doesn't?

Why can PYTORCH_ENABLE_MPS_FALLBACK=1 be both useful and dangerous?

If Apple silicon shares one unified memory pool, why do you still write .to("mps") instead of skipping device placement entirely?

Common pitfalls

Mastery Check