LeetLLM
© 2026 LeetLLM. All rights reserved.


MPS & Metal for ML on Mac

Build beginner-first intuition for training on Apple silicon: what Metal and MPS are, why unified memory changes the CUDA mental model, how PyTorch exposes the `mps` device, how to check availability, where CPU fallback appears, and how synchronization and memory pressure still shape performance.

13 min read
Platform path

If you train on Linux or Windows with an NVIDIA GPU, start with CUDA for ML Training. If you train on a Mac with Apple silicon, use the matching path here.

The CUDA chapter moved a text-classification batch onto an NVIDIA GPU. On an Apple silicon Mac, the same batch shape (32, 128, 768) can run on the Apple GPU through PyTorch's mps device. You still place tensors deliberately, measure queued work honestly, and manage memory pressure; the backend name and memory architecture change.[1][2]

Build on that CUDA placement and timing model, then translate it to Apple's backend checks and unified-memory behavior.

That distinction matters when you develop locally. Large batched tensor math can use the Apple GPU, but unsupported operators, hidden synchronization, or an oversized workload can still leave a training step slow or broken. Apply the Mac-specific checks to the same issue classifier as before, which predicts needs_docs, needs_test, or needs_review from developer text.

[Figure: Apple silicon diagram showing one shared memory pool, separate PyTorch cpu and mps placement, and a training step that moves the batch and model together before Metal kernels run.]
Apple silicon shares memory, but PyTorch placement is still explicit. The batch and weights need to land on `mps` together before Metal kernels do the training work.

What MPS and Metal are

Metal is Apple's graphics and compute framework. PyTorch uses a Metal Performance Shaders (MPS) backend so ordinary tensor code can target Apple GPUs through MPS Graph and tuned MPS kernels.[1][2]

For a beginner, three names matter:

  • Metal: Apple's GPU programming framework.
  • MPS: Metal Performance Shaders, the backend whose graph and tuned kernels PyTorch maps operations onto.
  • mps device: the PyTorch device string used for tensors and modules assigned to this backend.

So the question isn't "CUDA or GPU?" It's "which backend does this machine expose to PyTorch?"
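One way to turn that question into code is a small device picker. This is a minimal sketch: `pick_device` is a hypothetical helper name, and the preference order (CUDA, then MPS, then CPU) is an assumption for illustration, not something PyTorch prescribes.

```python
import torch


def pick_device() -> torch.device:
    """Return the first accelerator backend PyTorch can use, else CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # getattr guards older builds that predate torch.backends.mps.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
print("backend exposed to PyTorch:", device.type)
```

On an Apple silicon Mac with a working MPS build this should report mps; on other machines it reports whatever that machine exposes.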

Backend status

Apple's PyTorch setup page currently labels the MPS backend as beta. An operation that the backend doesn't support may fail or require CPU fallback, so availability alone isn't a performance guarantee.[1][3]

Why one memory pool changes the mental model

In the CUDA chapter, the key picture was system memory and discrete GPU video memory, with copies crossing an interconnect. Apple silicon uses a unified memory architecture: CPU and GPU share system memory instead of dividing storage into CPU RAM and separate video RAM.[4]

Keep the contrast precise:

  • On a discrete NVIDIA GPU, tensor.to("cuda") places tensor storage in the GPU's separate device memory.
  • On Apple silicon, tensor.to("mps") doesn't cross the same CPU-RAM-to-discrete-VRAM boundary. It still returns an mps tensor and selects MPS-backed execution. Don't infer that a framework-level device move is always free or always zero-copy.[4][2]

Two practical consequences follow, and both show up below:

  1. There isn't a separate video-memory capacity to fill. The GPU shares system memory with macOS and other applications, so a large training run competes with the rest of the machine.
  2. Placement is still explicit. You still write cpu and mps, keep model and batch on compatible devices, and avoid needless host-visible scalar reads in the hot path.

So unified memory doesn't let you skip device discipline. It changes the hardware boundary, not the PyTorch contract.

The tensor shape from the CUDA lesson gives a concrete first budget. A float32 text-classification batch uses four bytes for every feature value:

ticket_batch_bytes.py

    batch, tokens, features = 32, 128, 768
    bytes_per_float = 4
    batch_bytes = batch * tokens * features * bytes_per_float

    print("shape:", (batch, tokens, features))
    print(f"one float32 batch: {batch_bytes / (1024 ** 2):.1f} MiB")
    print("training also keeps weights, activations, gradients, and optimizer state")

Output

    shape: (32, 128, 768)
    one float32 batch: 12.0 MiB
    training also keeps weights, activations, gradients, and optimizer state
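To see why the batch is only part of the bill, here is a rough back-of-envelope sketch. The parameter count and the Adam-style optimizer with two float32 buffers per parameter are assumptions chosen for illustration. Activations are left out because they depend on the architecture.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

# Hypothetical model size, chosen only for illustration.
param_count = 110_000_000

weights_bytes = param_count * BYTES_PER_FLOAT32
gradient_bytes = param_count * BYTES_PER_FLOAT32        # one gradient per weight
optimizer_bytes = 2 * param_count * BYTES_PER_FLOAT32   # assumed Adam-style: two buffers per weight
batch_bytes = 32 * 128 * 768 * BYTES_PER_FLOAT32        # the (32, 128, 768) batch from the text

static_bytes = weights_bytes + gradient_bytes + optimizer_bytes

print(f"weights:         {weights_bytes / MIB:8.1f} MiB")
print(f"gradients:       {gradient_bytes / MIB:8.1f} MiB")
print(f"optimizer state: {optimizer_bytes / MIB:8.1f} MiB")
print(f"one batch:       {batch_bytes / MIB:8.1f} MiB")
print(f"batch share:     {batch_bytes / (static_bytes + batch_bytes):.1%}")
```

Under these assumptions the batch is a small slice of the total, which is why shrinking the batch helps mostly through the activations it drives rather than through its own storage.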

First Mac requirements to check

Apple's MPS setup page lists an Apple silicon Mac, macOS 14.0 or later, Python 3.10 or later, and Xcode command-line tools for its documented path.[1] PyTorch release numbers change, so use the normal wheel install below and verify the backend instead of pinning a version in your setup notes.

Before writing model code, make sure the machine satisfies those basics.

terminal

    xcode-select --install
    python3 --version
    sw_vers

When the machine looks compatible, install PyTorch from the normal wheel path Apple documents:

terminal

    pip3 install torch torchvision torchaudio

First device checks in PyTorch

Start with one tiny script that distinguishes three states cleanly:

  1. PyTorch was not built with MPS support.
  2. PyTorch knows about MPS, but this machine or OS can't use it right now.
  3. MPS is available, so you can move model and tensors onto mps.
mps_sanity_check.py

    import torch

    has_mps_backend = hasattr(torch.backends, "mps")
    mps_built = bool(has_mps_backend and torch.backends.mps.is_built())
    mps_available = bool(has_mps_backend and torch.backends.mps.is_available())

    device = torch.device("mps") if mps_available else torch.device("cpu")

    x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)
    model = torch.nn.Linear(3, 2).to(device)
    y = model(x)

    print(f"mps built: {mps_built}")
    print(f"mps available: {mps_available}")
    print(f"selected device: {device}")
    print(f"output shape: {tuple(y.shape)}")

Output

    mps built: True
    mps available: True
    selected device: mps
    output shape: (2, 2)

PyTorch's MPS note uses the same is_built() and is_available() distinction.[2] Run this check on the target machine: on a compatible Mac it should select mps; elsewhere the fallback path is cpu. That split matters:

  • built = False usually means wrong PyTorch build for this machine
  • built = True, available = False usually means supported backend exists but OS, hardware, or runtime access is missing
  • available = True means you can use torch.device("mps")

Check your reasoning before moving on: is_built() answers "does this wheel even know about MPS?" while is_available() answers "can this specific machine use it right now?" A strong answer keeps those two questions separate.

[Figure: Three-row PyTorch MPS diagnosis table mapping not built, built but unavailable, and available states to the next concrete setup or ticket-batch placement action.]
Read each observed state as a diagnosis, not as sequential setup steps: an available backend is the only state where this run should place model and ticket batch on `mps`.
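The three states can be written as a tiny lookup. This is a sketch only: `diagnose_mps` is a hypothetical helper, and the suggested actions paraphrase the states described above.

```python
def diagnose_mps(built: bool, available: bool) -> str:
    """Map the two backend flags to a next action."""
    if not built:
        return "not built: install a PyTorch build with MPS support"
    if not available:
        return "built but unavailable: check macOS version, hardware, and runtime access"
    return "available: place model and batch on mps"


for built, available in [(False, False), (True, False), (True, True)]:
    print(f"built={built!s:5} available={available!s:5} -> {diagnose_mps(built, available)}")
```

On a real machine you would pass torch.backends.mps.is_built() and torch.backends.mps.is_available() instead of the hard-coded pairs.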

One text-classification batch on Mac

Continue with the CUDA lesson's text-classification batch. Thirty-two issue reports become 128 hidden token positions each, with 768 features per position. For this small classifier, average those token positions into one vector per report, then produce three logits per report for needs_docs, needs_test, and needs_review.

On a Mac training run, one small step usually looks like this:

| Step | CPU side | mps side | Why a beginner should care |
| --- | --- | --- | --- |
| batch assembly | tokenizer, collator, padding, labels | nothing yet | data still starts on host |
| device move | Python asks for .to("mps") | batch becomes an mps tensor | placement is explicit and inspectable |
| forward pass | host launches ops | Metal kernels run embeddings, matmuls, norms | most heavy math lives here |
| loss read | maybe host asks for scalar | device may need to finish queued work first | innocent logging can stall loop |
| backward pass | autograd schedules gradient work | gradient kernels run on mps | memory now includes activations and grads |
| optimizer step | host calls step() | parameter updates happen on mps | model stays on device across steps |

That table isn't theory for theory's sake. Later chapters on training loops, mixed precision, and checkpointing assume you can point at each row and say what ran on the CPU, what ran on the accelerator, and what could unexpectedly bounce back.

If you can't narrate one batch this way yet, stop here and do it slowly. Name tensor location, kernel location, sync point, and likely failure for each row. That habit transfers directly to real model debugging.

Same device rules, different backend

Unified memory tempts Mac users into thinking "one laptop, one memory pool, one device." The hardware shares memory, but PyTorch still doesn't work that way at the code level. cpu and mps are separate device targets in your code, and the forward pass still fails if model and batch land on different devices.[2]

Training code on Mac still follows the same pattern as CUDA:

mps_device_placement.py

    import torch
    import torch.nn as nn

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    model = nn.Linear(768, 3).to(device)
    ticket_batch = torch.randn(32, 128, 768).to(device)

    ticket_vectors = ticket_batch.mean(dim=1)
    logits = model(ticket_vectors)
    devices_match = next(model.parameters()).device == ticket_batch.device == logits.device
    print("model and batch agree:", devices_match)
    print("logits shape:", tuple(logits.shape))

Output

    model and batch agree: True
    logits shape: (32, 3)

Continue from the placed classifier and ticket batch with one small optimizer step. It uses MPS when available and remains executable on a non-Mac machine:

one_mps_ticket_step.py

    # Continues from mps_device_placement.py: reuses torch, device, model, and ticket_vectors.
    import math

    import torch.nn.functional as F

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    labels = torch.tensor([0, 2, 1, 0], device=device)
    weights_before = model.weight.detach().clone()

    optimizer.zero_grad()
    loss = F.cross_entropy(model(ticket_vectors[:4]), labels)
    loss.backward()
    optimizer.step()
    logged_loss = loss.detach().cpu().item()

    print("weights changed:", not torch.equal(weights_before, model.weight.detach()))
    print("finite loss:", math.isfinite(logged_loss))

Output

    weights changed: True
    finite loss: True

The .cpu().item() call is outside gradient computation and deliberately marks the boundary where a scalar returns to the host for reporting.

If model and batch devices don't agree, fix placement before looking for deeper bugs. Catch that failure before a full training run by checking the device contract:

catch_mps_mismatch.py

    import torch

    def require_same_device(model_device: torch.device, batch: torch.Tensor) -> None:
        # Compare device types: a tensor's device carries an index (for example mps:0),
        # so it would not compare equal to a bare torch.device("mps").
        if batch.device.type != model_device.type:
            raise RuntimeError(f"batch device does not match model device {model_device}")

    batch = torch.randn(8, 4)
    try:
        require_same_device(torch.device("mps"), batch)
    except RuntimeError as error:
        print("caught:", error)

Output

    caught: batch device does not match model device mps

That's the same rule you learned in CUDA. The Mac path isn't an exemption from device consistency.
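Real batches often arrive as a dictionary of tensors rather than one tensor. A minimal sketch of moving such a batch to wherever the model lives follows. `batch_to_model_device` and the dictionary keys are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn


def batch_to_model_device(batch: dict, model: nn.Module) -> dict:
    """Move every tensor in a batch dict to the device that holds the model."""
    device = next(model.parameters()).device
    return {
        name: value.to(device) if isinstance(value, torch.Tensor) else value
        for name, value in batch.items()
    }


device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = nn.Linear(768, 3).to(device)

# Hypothetical collator output: features plus integer labels, both built on CPU.
cpu_batch = {"features": torch.randn(4, 768), "labels": torch.tensor([0, 2, 1, 0])}
placed = batch_to_model_device(cpu_batch, model)

logits = model(placed["features"])
same_device = all(t.device == logits.device for t in placed.values())
print("all on model device:", same_device)
```

Reading the device from the model's parameters keeps one source of truth, so the batch follows the model instead of a separately maintained device variable.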

Unsupported operations and CPU fallback

An operation without an MPS implementation can stop an otherwise valid training loop. PyTorch exposes PYTORCH_ENABLE_MPS_FALLBACK=1 so unsupported MPS operations can run on CPU instead of failing immediately.[3] Treat that flag as a debugging aid, not a promise that every unsupported operation will work.

terminal

    PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py
Fallback is a debugging tool

Fallback keeps debugging unblocked, but it can also hide backend switches and synchronization. If one layer keeps falling back to CPU inside a hot loop, throughput can collapse even though the script "works."

CPU work isn't automatically fallback. Tokenization and batch assembly normally happen on CPU before .to("mps"). MPS fallback means a PyTorch operation on the accelerator path has no MPS implementation and runs on CPU instead.

Use fallback to identify the blocking operation. Then decide whether to rewrite that part, upgrade PyTorch, change precision, or accept the CPU path for that workload.
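When prefixing the command is awkward, for example in a notebook, the same flag can be set from Python. The sketch below assumes the flag has to be in the environment before `import torch` runs, which is the cautious ordering; `fallback_enabled` is a hypothetical helper.

```python
import os

# Assumption: set the flag before `import torch` so the backend sees it when it loads.
# setdefault keeps any value that was already chosen on the command line.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def fallback_enabled() -> bool:
    """Report whether this process has the MPS fallback flag switched on."""
    return os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK", "0") == "1"


print("MPS fallback requested:", fallback_enabled())
```

Logging this value at the start of a run makes it harder to mistake a fallback-assisted run for a clean MPS run later.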

What fallback looks like in a real local model

Your issue classifier may spend most of its time in embedding lookup, attention, and classifier matmuls on mps, while one tensor-indexing operation has no MPS implementation and falls back to CPU inside every batch.

At small scale, you may barely notice:

  • batch leaves CPU
  • most model math runs on mps
  • unsupported op runs through CPU fallback
  • later mps work resumes

At larger scale, that repeated detour gets expensive because every backend switch disrupts the fast path. Throughput drops, step time becomes noisy, and profiler traces stop looking like one clean accelerator-bound loop. "Script finished" isn't the same as "training path is healthy."

You can reason about fallback cost without requiring an unsupported operator on your machine. Each backend change below is a boundary where execution or data handling has to switch paths:

fallback_boundary_count.py

    path = ["host batch", "mps forward", "fallback op", "mps classifier", "host log"]
    backends = [stage.split()[0] for stage in path]
    switches = sum(left != right for left, right in zip(backends, backends[1:]))

    print(" -> ".join(path))
    print("backend changes:", switches)
    print("fix target: remove fallback from hot path")

Output

    host batch -> mps forward -> fallback op -> mps classifier -> host log
    backend changes: 4
    fix target: remove fallback from hot path

The first move and final log are deliberate boundaries. The middle CPU fallback is the boundary to remove from the hot path.
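A toy cost model shows how that middle detour adds up over a run. Every number here is hypothetical and picked only to make the arithmetic visible. Measure your own step before drawing conclusions.

```python
# All timings below are hypothetical, picked only to make the arithmetic visible.
mps_math_ms = 20.0          # forward and backward work that stays on mps
fallback_op_cpu_ms = 4.0    # the unsupported op running on CPU
switch_cost_ms = 1.5        # assumed cost of each crossing between mps and CPU
crossings_per_step = 2      # into the fallback op and back out

with_fallback_ms = mps_math_ms + fallback_op_cpu_ms + crossings_per_step * switch_cost_ms
native_op_ms = 0.5          # assumed cost if the op had an MPS implementation
without_fallback_ms = mps_math_ms + native_op_ms

steps = 10_000
extra_seconds = (with_fallback_ms - without_fallback_ms) * steps / 1000

print(f"step with fallback:    {with_fallback_ms:.1f} ms")
print(f"step without fallback: {without_fallback_ms:.1f} ms")
print(f"extra time over {steps} steps: {extra_seconds:.0f} s")
```

The point is the shape of the calculation: a small per-step detour is multiplied by every step in the run.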

Timing and synchronization on MPS

The Mac timing trap looks like the CUDA timing trap. Kernel launches and queued device work can make naive timers lie. Warm up the operation first, then synchronize before and after the measured block. PyTorch exposes torch.mps.synchronize() for that boundary.[5]

mps_timing.py

    import time
    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(256, 256, device=device)
    w = torch.randn(256, 256, device=device)

    for _ in range(3):
        y = x @ w
    if device.type == "mps":
        torch.mps.synchronize()

    start = time.perf_counter()
    for _ in range(10):
        y = x @ w
    if device.type == "mps":
        torch.mps.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000

    print("timed matmuls:", 10)
    print("result shape:", tuple(y.shape))
    print("elapsed is nonnegative:", elapsed_ms >= 0)

Output

    timed matmuls: 10
    result shape: (256, 256)
    elapsed is nonnegative: True

The same hidden sync points still matter:

  • loss.item() when loss is an MPS tensor
  • tensor.cpu(), including tensor.cpu().numpy() for NumPy analysis
  • printing values that must come back to host memory

Calling .numpy() directly on an MPS tensor isn't the route back to NumPy: move the data to CPU first. Keep that reporting boundary explicit:

mps_logging_boundary.py

    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    loss = torch.tensor(2.5, device=device)
    reported_loss = loss.detach().cpu().item()

    print(f"reported loss: {reported_loss:.1f}")

Output

    reported loss: 2.5

Don't trust timing claims until you know whether the host waited for the device.
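The warm-up and synchronize pattern can be wrapped once and reused. This is a sketch under stated assumptions: `sync` and `timed_ms` are hypothetical helper names, and the warm-up and iteration counts are arbitrary defaults.

```python
import time
from typing import Callable

import torch


def sync(device: torch.device) -> None:
    """Block until queued accelerator work has finished (no-op on CPU)."""
    if device.type == "mps":
        torch.mps.synchronize()
    elif device.type == "cuda":
        torch.cuda.synchronize()


def timed_ms(fn: Callable[[], object], device: torch.device,
             warmup: int = 3, iters: int = 10) -> float:
    """Average milliseconds per call, measured between two synchronization points."""
    for _ in range(warmup):
        fn()
    sync(device)
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync(device)
    return (time.perf_counter() - start) * 1000 / iters


device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(256, 256, device=device)
w = torch.randn(256, 256, device=device)
per_call_ms = timed_ms(lambda: x @ w, device)
print(f"average matmul time on {device.type}: {per_call_ms:.3f} ms")
```

Keeping the synchronization inside the helper means a forgotten sync can't silently turn a device measurement into a queueing measurement.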

Worked timing story by hand

Say one issue-classifier forward pass launches 35 ms of mps work, but Python finishes queueing it in 3 ms.

  • naive timer around the Python call reports about 3 ms
  • synchronized timer reports about 35 ms

That gap changes your engineering decisions.

  • If you believe 3 ms, you may chase dataloader code that isn't the bottleneck.
  • If you believe 35 ms, you know model-side math or sequence length still dominates.

Good accelerator debugging starts with honest measurement before clever optimization.
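You can reproduce the shape of that story without a GPU. The sketch below is an analogy, not MPS code: a worker thread stands in for the device queue, `fake_device_work` is a hypothetical stand-in for queued kernels, and waiting on the future plays the role of a synchronize call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_device_work(seconds: float) -> str:
    """Stand-in for queued accelerator kernels: it just sleeps."""
    time.sleep(seconds)
    return "done"


with ThreadPoolExecutor(max_workers=1) as queue:
    start = time.perf_counter()
    pending = queue.submit(fake_device_work, 0.035)  # returns once the work is queued
    naive_ms = (time.perf_counter() - start) * 1000

    pending.result()  # wait for the queued work, like a synchronize call
    synced_ms = (time.perf_counter() - start) * 1000

print(f"naive timer:        {naive_ms:.1f} ms")
print(f"synchronized timer: {synced_ms:.1f} ms")
```

The naive number measures how long it took to hand the work off, while the synchronized number measures how long the work took.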

Memory pressure on Apple GPUs

The first memory lesson is the same as CUDA: weights aren't the whole bill. Activations, gradients, optimizer state, and temporary workspaces matter too.

Unified memory adds one twist. There isn't a separate video-memory pool: a large run competes for system memory with macOS and every other app. PyTorch exposes current_allocated_memory() for bytes occupied by live tensors, driver_allocated_memory() for total memory allocated by Metal for the process (including cached allocator blocks and MPS/MPSGraph allocations), and empty_cache() to release unoccupied cached memory.[5]

Inspect those counters only after MPS is available:

mps_allocator_check.py

    import torch

    if torch.backends.mps.is_available():
        before = torch.mps.current_allocated_memory()
        tensor = torch.ones(1024, 1024, device="mps")
        after = torch.mps.current_allocated_memory()
        del tensor
        torch.mps.empty_cache()
        print("live tensor allocation increased:", after > before)
        print("recommended limit reported:", torch.mps.recommended_max_memory() > 0)
    else:
        print("MPS allocator counters need an available mps device")

When memory gets tight, the fix order should stay boring:

| Symptom | First question | First fix |
| --- | --- | --- |
| OOM on first real batch | Is batch or sequence length too large? | shrink batch size first |
| Step time swings wildly | Are unsupported ops or sync points bouncing work back to CPU? | check fallback and logging paths |
| MPS allocator errors | Are you near working-set limits? | reduce workload before touching allocator env vars |

PyTorch also exposes MPS-specific allocator controls such as PYTORCH_MPS_HIGH_WATERMARK_RATIO and PYTORCH_MPS_LOW_WATERMARK_RATIO.[3] They are advanced tuning knobs, not a first response. The documentation warns that disabling the high watermark can cause system failure under system-wide out-of-memory conditions. Start by shrinking work.

For the text-classification example, common first fixes are boring on purpose:

  1. lower per-step batch size before touching allocator ratios
  2. shorten sequence length if the task allows it
  3. remove needless .cpu() calls before blaming Metal
  4. confirm fallback isn't firing inside the hot path

Measure the simplest workload reductions before changing allocator limits:

mps_workload_reduction.py

    batch_size = 32
    sequence_length = 128
    baseline_positions = batch_size * sequence_length

    for label, batch, tokens in [
        ("baseline", 32, 128),
        ("half batch", 16, 128),
        ("half length", 32, 64),
    ]:
        share = (batch * tokens) / baseline_positions
        print(f"{label:11s}: {share:.0%} of token positions")

Output

    baseline   : 100% of token positions
    half batch : 50% of token positions
    half length: 50% of token positions

Memory lever: Both changes halve token positions and many activation tensors. Shorter sequences can reduce attention score tensors faster because attention has two sequence axes.
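The two sequence axes are easy to check with arithmetic. This sketch assumes a model that materializes a float32 score tensor shaped (batch, heads, tokens, tokens) per layer, with a head count of 12 picked only for illustration. Kernels that avoid storing the full score matrix behave differently.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2
HEADS = 12  # assumed head count, for illustration only


def attention_score_bytes(batch: int, tokens: int, heads: int = HEADS) -> int:
    """Bytes for one layer's float32 score tensor shaped (batch, heads, tokens, tokens)."""
    return batch * heads * tokens * tokens * BYTES_PER_FLOAT32


baseline = attention_score_bytes(32, 128)
for label, batch, tokens in [
    ("baseline", 32, 128),
    ("half batch", 16, 128),
    ("half length", 32, 64),
]:
    size = attention_score_bytes(batch, tokens)
    print(f"{label:11s}: {size / MIB:5.1f} MiB ({size / baseline:.0%} of baseline)")
```

Under these assumptions, halving the batch halves the score tensor, while halving the sequence length cuts it to a quarter.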

What stays the same from CUDA

Different backend, same engineering questions:

  • where does tensor live right now?
  • where does forward pass run?
  • what forces host to wait?
  • which step is moving data back to CPU?
  • which part of memory footprint is blowing up?

If you can answer those on Mac, later training chapters stop feeling platform-specific.

Self-check before bigger runs

Cover the sections above and answer these before you peek.

What to remember

  • Metal is Apple's GPU stack. mps is PyTorch's device name for using it.
  • Apple silicon uses unified memory: CPU and GPU share system memory, but .to("mps") still selects MPS execution and doesn't promise a cost-free move. Apple currently labels the MPS backend beta.[4][1]
  • Mac training still needs explicit device placement for models and tensors.
  • is_built() and is_available() answer different setup questions.
  • PYTORCH_ENABLE_MPS_FALLBACK=1 is useful, but it can hide slow CPU detours.
  • Honest timing on MPS still depends on understanding synchronization.
  • The same text-classification training loop still needs clear batch flow, measurement, and memory discipline.

If you want broader accelerator intuition or you also work on NVIDIA servers, read CUDA for ML Training too. Same mental model, different backend details.

Mastery check

Key concepts

  • Metal vs MPS vs mps device
  • unified memory architecture vs discrete GPU VRAM
  • PyTorch is_built() vs is_available()
  • model and tensor device placement on Mac
  • CPU fallback for unsupported ops
  • MPS synchronization and timing
  • Apple GPU memory-pressure debugging basics

Evaluation rubric

  • Foundational: Explains how PyTorch maps model code onto Apple GPUs through the MPS backend, what unified memory changes versus a discrete GPU, and why Mac users still manage explicit device placement
  • Intermediate: Checks is_built() versus is_available(), moves model and batch onto mps, and identifies when unsupported operations fall back to CPU
  • Advanced: Diagnoses first-line MPS issues such as missing backend support, hidden synchronization, slow CPU fallback, and memory-pressure failures

Common pitfalls

  • Assuming a Mac needs no explicit device placement because CPU and GPU live in one laptop.
  • Checking only is_available() and missing the more basic case where the installed PyTorch build has no MPS support at all.
  • Leaving CPU fallback enabled and misreading a surviving script as a fast script.
  • Timing MPS work without synchronization, then underestimating true step time.

Mastery Check


1. You are porting a PyTorch training step to an Apple silicon Mac and write model.to("mps"). Which description of the stack is correct?
2. On Apple silicon, a tensor is moved with ticket_batch.to("mps"). Which statement matches the unified-memory model PyTorch uses?
3. Your check prints mps built: True, mps available: False, and selected device cpu. What does that state mean?
4. A model is on mps. A tokenizer produced ticket_batch = torch.randn(32, 128, 768) on CPU. The classifier averages tokens with mean(dim=1) and then applies nn.Linear(768, 3). What must happen before the forward pass succeeds on MPS, and what shape should the logits have?
5. Your ticket classifier fails on mps unless you run PYTORCH_ENABLE_MPS_FALLBACK=1. With the flag, the script finishes but each step is much slower. Tokenization happened on CPU before ticket_batch.to("mps"). What is the most likely diagnosis?
6. An issue-classifier forward pass enqueues about 35 ms of mps work, but Python finishes launching it in about 3 ms. A timer around only the Python call reports 3 ms. What should you change to measure the device work honestly?
7. How much storage is used by the raw float32 feature values in one ticket batch shaped (32, 128, 768), and why is that not the whole training memory bill?
8. While investigating MPS memory pressure, which statement correctly uses the allocator tools and first-line fix?


Next Step
Continue to Data Structures for AI Systems

CUDA and MPS taught you where training tensors live, how accelerator work gets launched, and why placement and synchronization matter before a model learns anything. The Data Structures lesson switches to another systems foundation: how to organize repeated lookups, queues, caches, and top-k state once those tensors start feeding real AI pipelines.

References

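[1] Accelerated PyTorch training on Mac. Apple, 2026. https://developer.apple.com/metal/pytorch/
[2] MPS backend. PyTorch Contributors, 2026. https://docs.pytorch.org/docs/stable/notes/mps
[3] MPS Environment Variables. PyTorch Contributors, 2026. https://docs.pytorch.org/docs/2.9/mps_environment_variables.html
[4] MLX: An array framework for Apple silicon. Apple (ml-explore), 2026. https://github.com/ml-explore/mlx
[5] torch.mps. PyTorch Contributors, 2026. https://docs.pytorch.org/docs/stable/mps.html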