Apple M5 GPU Roofline Analysis

System: MacBook Air 15-inch (2026), Apple M5, 16 GB LPDDR5X-9600, macOS 15

GPU: Apple M5 (10 cores, Apple GPU family)

Tool: metal-roofline (Swift/Metal compute, runtime-compiled MSL shaders)

This is a similar analysis to the Radeon 780M roofline study from the Ryzen 7 8745HS mini PC. That piece measured an AMD RDNA 3 iGPU running Vulkan on Linux. This one measures Apple's M5 GPU using Metal compute shaders on macOS. The benchmark design is identical in principle - STREAM bandwidth measurement plus an arithmetic intensity sweep - adapted for Metal's API and Apple Silicon's unified memory architecture.

The goal was simple: build a roofline model for the M5 Air, measure how its bandwidth and compute ceilings compare to the 780M, and see if the same patterns hold. In short, they do. Along the way, we also investigated a discrepancy in the compute ceiling, which uncovered something interesting about how Apple's GPU actually executes vector instructions.

The Roofline Model (Quick Recap)

If you've read the 780M analysis, skip this section. If you haven't: the roofline model plots GPU throughput (GFLOPS) against arithmetic intensity (abbreviated as AI, measured in FLOP/byte). At low AI, throughput is limited by memory bandwidth - the GPU can crunch numbers faster than memory can feed it. At high AI, throughput is limited by compute - there's enough data in flight to keep the ALUs busy. The crossover point is the ridge point: peak GFLOPS / peak GB/s.

Benchmark Design

The Metal benchmark mirrors the Vulkan version's structure. Four STREAM-style bandwidth kernels (copy, scale, add, triad) operating on float4 vectors, plus an arithmetic intensity sweep kernel that varies FMA iterations per memory element.

The sweep kernel loads one float4 (16 bytes), performs fma_per_load iterations of 4 cross-dependent FMA chains, then stores one float4 (16 bytes). Total bytes per thread: 32. Total FLOPs per thread: fma_per_load * 4 * 2. Arithmetic intensity: fma_per_load / 4 FLOP/byte.

The cross-dependent chain structure matters. Each chain's output feeds into the next chain's inputs:

c = fma(c, d, e);
d = fma(d, e, f);
e = fma(e, f, c);  // depends on c from line 1
f = fma(f, c, d);  // depends on c from line 1, d from line 2

This prevents the compiler from eliminating iterations (every result is consumed), reordering them (data dependencies enforce ordering), or merging chains (they share operands). It's the same pattern used in the Vulkan version, and it's the standard methodology for roofline measurement. We'll come back to why this matters.

Metal vs Vulkan Implementation Differences

For anyone coming from Vulkan, three things stand out about Metal:

No descriptor sets. Vulkan requires creating descriptor set layouts, allocating descriptor pools, updating descriptor sets, and binding them before dispatch. Metal uses direct buffer binding: encoder.setBuffer(buf, offset: 0, index: 0) maps a buffer to [[buffer(0)]] in the shader. The index match is the entire binding model.

Unified memory means no staging buffers. In Vulkan, getting data to the GPU typically involves a host-visible staging buffer, a device-local buffer, and a copy command. On Apple Silicon, .storageModeShared gives both CPU and GPU access to the same physical memory. You write from Swift, dispatch a compute kernel, and the GPU reads the same bytes. No copies, no barriers, no transfer queues.

Runtime shader compilation. SPM (Swift Package Manager) bundles .metal files as raw text resources for command-line targets - it doesn't compile them into .metallib the way Xcode does. So we compile shaders from source at startup via device.makeLibrary(source:options:). This takes milliseconds and only happens once. It's worth noting because it means the shader compiler's optimization passes may differ from ahead-of-time compilation, which becomes relevant later.

Bandwidth Results (256 MB buffers)

Variant GB/s Median ms
Copy 121.8 4.41
Scale 124.2 4.32
Add 123.1 6.54
Triad 125.0 6.44

The M5 Air's LPDDR5X-9600 delivers 153.6 GB/s of theoretical bandwidth. Copy measures 121.8 GB/s (79% efficiency); scale peaks at 124.2 GB/s (81%). The two-source variants (add, triad) show similar throughput despite accessing three buffers instead of two - the memory controller handles the additional request streams without significant contention.

For the roofline model, we use the copy bandwidth of ~122 GB/s as the measured ceiling, since STREAM copy most closely matches the sweep kernel's access pattern (one read + one write per element).

For comparison, the 780M measured 73 GB/s from DDR5-5600. The M5 delivers 67% more bandwidth - a direct consequence of LPDDR5X-9600's higher data rate and Apple Silicon's memory controller efficiency.

Roofline Sweep (128 MB buffers)

fma_per_load AI (F/B) GFLOPS Effective GB/s Median ms
1 0.25 32.5 130 2.07
2 0.50 64.4 129 2.08
4 1.00 128.7 129 2.09
8 2.00 260.7 130 2.06
16 4.00 525.6 131 2.04
32 8.00 813 102 2.64
64 16.00 815 51 5.27
128 32.00 798 25 10.77
256 64.00 787 12 21.84
512 128.00 767 6.0 44.95

The classic roofline shape is clearly visible.

Bandwidth-bound region (AI 0.25–4.0 F/B): Throughput scales linearly with arithmetic intensity. Effective bandwidth holds steady at 129–131 GB/s across the entire range - the GPU has more than enough compute to saturate the memory bus at these low arithmetic intensities. This is the diagonal line on the log-log chart.

Transition zone (AI 8.0 F/B): Throughput reaches 813 GFLOPS but effective bandwidth has dropped to 102 GB/s. The workload straddles the ridge point.

Compute-bound region (AI 16–128 F/B): Throughput plateaus at ~770–815 GFLOPS. Additional arithmetic intensity no longer increases throughput because the ALUs are the bottleneck.

Ridge Point

The ridge point falls at approximately AI ≈ 6.5 F/B (peak GFLOPS / bandwidth = 815 / 126 ≈ 6.5).

No Occupancy Cliff

Unlike the 780M (RDNA 3), which showed a 30–35% throughput drop at AI=64–128 due to VGPR pressure reducing wave occupancy, the M5 maintains stable throughput across the entire high-AI range:

AI M5 GFLOPS 780M GFLOPS (54W)
32 798 1,181 (peak)
64 787 800 (−32%)
128 767 801 (−32%)

The absence of a sustained occupancy cliff suggests Apple's GPU has either a larger register file per thread or a more flexible occupancy model than RDNA 3. There's a slight dip at AI=128 (767 GFLOPS, ~6% below peak) but it's gradual, not the sharp cliff the 780M exhibits.

The Compute Peak Question

At this point the roofline model is complete. The M5's compute ceiling sits at ~815 GFLOPS, the bandwidth ceiling at ~122 GB/s, and the ridge point at ~6.5 F/B. These are good, useful numbers for understanding real workload performance.

But ~815 GFLOPS is substantially below the M5's theoretical FP32 peak. Apple's M5 GPU has 10 cores, each with 128 ALU lanes, running at approximately 1.4–1.5 GHz. At 2 FLOPs per FMA instruction, that's a theoretical peak of ~3,500–4,000 GFLOPS. The sweep kernel measures 22% of that. A 4–5x gap.

On the 780M, we saw the same pattern: 4,700 GFLOPS theoretical versus 1,181 GFLOPS sweep peak - also roughly 4x. At the time, we attributed the gap to memory latency, cache effects, and the difference between "pure compute" (tiny buffer, maximal FMA) and "realistic compute" (large buffer with real memory traffic).

For the M5, we decided to investigate further. We built a pure compute benchmark - a 4 MB buffer that fits entirely in the System Level Cache, with 512 FMA iterations per element - and tested several kernel variants to isolate where the gap comes from.

Isolating the Gap

Six kernel variants, each controlling a different variable:

# Variant Chains Type What it tests
1 cross_dep_4 4 cross-dependent float4 Baseline (same as sweep kernel)
2 independent_4 4 self-dependent float4 Does removing cross-chain deps help?
3 independent_8 8 self-dependent float4 Does more ILP help with float4?
4 scalar_indep_4 4 self-dependent float Does scalar beat float4?
5 scalar_indep_8 8 self-dependent float Does more scalar ILP help?
6 scalar_indep_16 16 self-dependent float Diminishing returns from ILP?

"Self-dependent" means each chain depends only on itself (c = fma(c, c, a)) - the result feeds back into the same accumulator, preventing the compiler from optimizing it away, but not blocking other chains from executing in parallel. "Cross-dependent" is the sweep kernel's pattern where each chain's output feeds into other chains' inputs.

Results (4 MB buffer, fma_per_load=512)

Variant Chains Type GFLOPS vs baseline
cross_dep_4 4 cross-dep float4 791 1.0x
independent_4 4 self-dep float4 761 0.96x
independent_8 8 self-dep float4 806 1.02x
scalar_indep_4 4 self-dep float 2,772 3.50x
scalar_indep_8 8 self-dep float 3,760 4.75x
scalar_indep_16 16 self-dep float 3,575 4.52x

These results tell the whole story.

Finding 1: float4 is the bottleneck, not chain dependencies

The self-dependent float4 variants (761–806 GFLOPS) perform identically to the cross-dependent float4 baseline (791 GFLOPS). Making the four FMA chains independent of each other has no effect. Four chains, eight chains, cross-dependent or self-dependent - it doesn't matter. float4 throughput is stuck at ~800 GFLOPS regardless.

This rules out cross-chain data dependencies as the cause of the gap.

Finding 2: Apple's GPU compiles float4 into 4 scalar operations

Switching from float4 to scalar float with the same number of self-dependent chains produces a 3.5x throughput increase (791 -> 2,772 GFLOPS with 4 chains).

This means a float4 FMA is not a single wide SIMD instruction - the Metal shader compiler decomposes it into 4 scalar fmadd instructions. The near-4x throughput ratio confirms these scalar ops execute largely sequentially rather than in parallel, despite the hardware being superscalar.

This is consistent with what's publicly known about Apple's GPU ISA. Dougall Johnson's reverse-engineered G13 ISA documentation shows that general purpose registers are scalar - "each store one 32-bit value per thread" - with fmadd operating on individual scalar operands. There are no native vector instructions. Alyssa Rosenzweig, lead of the Asahi Linux GPU effort, confirmed that "the M1's GPU is scalar at all bit sizes."

Apple's own documentation describes the parallelism model: "Each SIMD unit has 32 threads, and each thread in the SIMD executes the same instruction". The "SIMD" in Apple's GPU refers to 32 threads executing in lockstep (analogous to NVIDIA's warp or AMD's wavefront) - not to vector-width parallelism within a single thread. float4 is a language-level convenience that the compiler lowers to multiple scalar instructions.

One caveat: Rosenzweig also notes the hardware is "superscalar, with more 16-bit ALUs than 32-bit ALUs." This means the GPU can issue multiple scalar instructions per cycle to different ALU pipelines. In principle, some of the 4 scalar ops from a float4 FMA could overlap. Our benchmark data shows this overlap is minimal for FP32 FMA specifically - the ~4x scalar-over-float4 ratio suggests the FP32 FMA pipeline has limited superscalar issue width, with most of the throughput gain coming from filling the pipeline with independent instructions rather than from same-cycle dual-issue.

Finding 3: 8 scalar chains saturate the ALU pipeline

The jump from 4 -> 8 chains (2,772 -> 3,760 GFLOPS, +36%) shows the M5 GPU needs at least 8 independent instructions in flight per thread to fully hide FMA latency. This implies a 4-cycle FMA latency: with 8 independent ops in the pipeline, the GPU can issue one per cycle while the others are in various stages of completion, keeping the ALU continuously occupied.

Finding 4: 16 chains cause register pressure

16 scalar accumulators (3,575 GFLOPS) is 5% slower than 8 chains. The extra registers reduce the number of threads that can be in flight simultaneously - each thread consumes more of the register file, so fewer threads fit per SIMD group, reducing overall occupancy. The lost parallelism from fewer concurrent threads outweighs the additional per-thread ILP.

8 chains is the sweet spot.

Measured FP32 Peak: 3,760 GFLOPS

The 8-chain scalar variant gives us the true measured FP32 peak: 3,760 GFLOPS.

Apple doesn't publish GPU clock speeds or ALU counts. But we can derive them. Apple announced 2.6 TFLOPS for the M1's 8-core GPU. At the M1's known clock of ~1.28 GHz, that works out to 2,600 / 8 / 1.28 / 2 ≈ 127 FMA units per core - effectively 128, a natural hardware boundary. Apple's WWDC 2020 session confirms "each shader core has multiple SIMD units" with 32 threads each; 128 ALUs / 32 threads = 4 SIMD units per core, which matches "multiple."

We can confirm the clock directly. macOS powermetrics exposes GPU DVFS state in real time. Under sustained compute load, the M5 GPU locks to its top P-state at 1578 MHz with 100% residency - no throttling, no time spent at lower frequencies, even in a fanless chassis.

The full P-state table (12 states from 338 MHz idle to 1578 MHz peak):

P-state Frequency Theoretical GFLOPS (128 ALU/core)
P12 (idle) 338 MHz 866
P6 1084 MHz 2,775
P2 1470 MHz 3,763
P1 (peak) 1578 MHz 4,040

At 1578 MHz, assuming 128 ALUs per core, the theoretical peak is 4,040 GFLOPS. We measured 3,814 GFLOPS - 94.4% utilization. The 5.6% gap is consistent with loop overhead (branch, increment, memory load/store between FMA iterations) and thread scheduling. The implied ALU count is ~121 per core, close enough to 128 that the difference is overhead, not missing hardware.

GPU power at the top P-state: 18.2W sustained. This is in a fanless chassis with no active cooling - the M5 holds its peak clock indefinitely during our ~45-second benchmark runs.

FP16: Does Double Rate Deliver?

Apple's WWDC 2020 session on Metal performance states that the GPU runs FP16 at "double rate" - twice the throughput of FP32. Doubling the 4,040 GFLOPS theoretical FP32 peak at 1578 MHz gives a theoretical FP16 peak of ~8,080 GFLOPS. We tested four FP16 kernel variants to find out whether it delivers.

The half type in MSL is IEEE 754 16-bit floating point. On Apple Silicon, the hardware has more 16-bit ALUs than 32-bit ALUs (confirmed by Asahi Linux reverse engineering), which is how double rate is implemented - it's not the same ALUs running faster, it's more of them.

Kernel FP32 (GFLOPS) FP16 (GFLOPS) Ratio
Vector4, cross-dependent 780 1,525 1.96x
Scalar, 8 independent chains 3,769 5,213 1.38x
Scalar, 16 independent chains 3,572 5,686 1.59x
Scalar, 32 independent chains - 6,004 1.59x vs best FP32

The vector4 cross-dependent case achieves nearly perfect 2x (1.96x) - each half4 FMA becomes one instruction that executes at double rate. This is the cleanest confirmation of the hardware claim.

But the scalar results are more nuanced. The best FP16 result (6,004 GFLOPS with 32 chains) is only 1.59x the best FP32 result (3,769 GFLOPS with 8 chains). Why not 2x?

The answer is ILP saturation. Double-rate FP16 means the ALU can retire FP16 FMAs twice as fast - but that also means the pipeline needs twice as many independent instructions in flight to stay saturated. Our FP32 peak needed 8 chains; FP16 should need 16. And indeed, the FP16 results keep climbing from 8 chains (5,213) to 16 chains (5,686) to 32 chains (6,004). The pipeline isn't fully saturated even at 32 chains, suggesting the register file or instruction scheduler becomes the bottleneck before we can feed enough independent work to fill the double-rate pipeline.

At 6,004 GFLOPS, we're at 74% of the 8,080 GFLOPS theoretical ceiling. This matches the pattern we saw in FP32 - there's always overhead from loop control, memory operations, and scheduling. But the 74% utilization at 32 chains (vs 94% at 8 chains for FP32) suggests diminishing returns from ILP: at some point you run out of registers to hold independent accumulators, and the spill-to-memory cost eats into your throughput gains.

Practical implications: FP16 double rate is real and delivers close to 2x for vectorized code (half4). For scalar code, expect 1.5-1.6x improvement. ML inference workloads (which heavily use FP16 matrix multiplications) benefit the most because the data is already half-precision - no conversion overhead, and the matrix multiply inner loops naturally provide enough ILP through the accumulation across the K dimension. For graphics shaders, the MSL half qualifier on intermediates gives the compiler permission to use the faster FP16 ALUs, which is why Apple's best practices recommend using half for fragment shader intermediates wherever full precision isn't required.

The float4 Trap

Here's the practical upshot: if you write a Metal compute kernel using float4 vectors - the "obvious" choice for GPU programming - your compute-bound throughput ceiling is ~800 GFLOPS. If you restructure the same computation to use scalar floats with 8 independent accumulator chains, your ceiling is ~3,760 GFLOPS. A 4.7x difference.

This isn't a benchmarking artifact; it's a direct consequence of Apple Silicon's scalar GPU ISA. The GPU's parallelism comes from 32-thread SIMD groups executing in lockstep, not from vector-width parallelism within a single thread. When the shader compiler lowers float4 to 4 scalar fmadd instructions, it converts what looks like one operation into four - and in the compute-bound regime, that 4x instruction expansion directly limits throughput.

For bandwidth-bound kernels (the majority of real GPU workloads), float4 is fine. Memory throughput is the same regardless of vector width, and the ALU isn't the bottleneck. But any kernel that's compute-bound - matrix multiply tiles, convolution inner loops, ML inference accumulators, physics simulations - should consider the scalar approach.

The optimal pattern for compute-heavy Metal kernels on Apple Silicon:

  1. Load as float4 for memory efficiency (16 bytes per load, good coalescing)
  2. Unpack to scalars for the inner loop
  3. Use 8 independent accumulator chains to saturate the ALU pipeline
  4. Repack to float4 for the store

This is likely what Apple's own Metal Performance Shaders (MPS) framework does internally. It would explain why hand-written "obvious" Metal kernels often underperform MPS by large margins - the framework is written with knowledge of the execution model that the documentation doesn't surface clearly.

The Optimized Roofline

Theory is one thing. Let's prove it. We built an optimized sweep kernel that applies everything we learned: 8 scalar floats loaded per thread (32 bytes), 8 independent self-dependent FMA chains, 8 scalar stores (32 bytes). Same arithmetic intensity scale as the original (AI = fma_per_load / 4), but with the scalar 8-chain structure that hit 94% of theoretical in the compute peak test.

kernel void sweep_optimized(
    device const float* input  [[buffer(0)]],
    device float*       output [[buffer(1)]],
    constant uint&  fma_per_load [[buffer(2)]],
    uint tid [[thread_position_in_grid]]
) {
    uint base = tid * 8;
    float a = 0.01f;
    float c0 = input[base],     c1 = input[base + 1];
    float c2 = input[base + 2], c3 = input[base + 3];
    float c4 = input[base + 4], c5 = input[base + 5];
    float c6 = input[base + 6], c7 = input[base + 7];

    for (uint i = 0; i < fma_per_load; i++) {
        c0 = fma(c0, c0, a); c1 = fma(c1, c1, a);
        c2 = fma(c2, c2, a); c3 = fma(c3, c3, a);
        c4 = fma(c4, c4, a); c5 = fma(c5, c5, a);
        c6 = fma(c6, c6, a); c7 = fma(c7, c7, a);
    }

    output[base]     = c0; output[base + 1] = c1;
    output[base + 2] = c2; output[base + 3] = c3;
    output[base + 4] = c4; output[base + 5] = c5;
    output[base + 6] = c6; output[base + 7] = c7;
}

Results: Standard vs Optimized (128 MB buffers, release build)

AI (F/B) Standard GFLOPS Optimized GFLOPS Effective GB/s (opt) Speedup
0.25 33 32 127 0.97x
0.50 65 63 126 0.97x
1.00 129 127 127 0.99x
2.00 262 258 129 0.98x
4.00 520 514 129 0.99x
8.00 813 1,046 131 1.29x
16.00 815 2,107 132 2.59x
32.00 799 3,831 120 4.80x
64.00 770 3,849 60 5.00x
128.00 769 3,757 29 4.89x

What the chart shows

The two curves share the same bandwidth-bound diagonal up to AI=4. Then they diverge.

The standard kernel (blue) hits its compute ceiling at AI=8. Throughput flattens at ~815 GFLOPS. From AI=8 onward, the GPU has unused compute capacity but the float4 serialization prevents it from being accessed. Effective bandwidth drops from 130 GB/s to 6 GB/s as AI increases - the ALUs are bottlenecked, and most of each thread's execution time is spent waiting for float4 FMAs to complete.

The optimized kernel (green) stays bandwidth-bound through AI=16. At AI=8, it delivers 1,046 GFLOPS at 131 GB/s - still saturating the memory bus. At AI=16, it reaches 2,107 GFLOPS at 132 GB/s. The standard kernel was already compute-bound here at 815 GFLOPS; the optimized kernel has enough compute headroom to keep pulling data at full bandwidth.

The optimized compute ceiling lands at 3,849 GFLOPS. This matches the pure compute peak (3,814 GFLOPS in the isolated test) - the 128 MB buffer sweep achieves the same throughput as the 4 MB SLC-resident test, confirming that at high AI the kernel is truly compute-bound, not memory-bound.

The ridge point shifts from ~6.5 to ~31 F/B. This is the real headline. The standard roofline made it look like the M5 had a 6.5:1 compute-to-bandwidth ratio. The optimized roofline reveals it's actually 31:1. The M5 has nearly 5x more usable compute than the standard methodology suggests - if you know how to access it.

What this means for real workloads

Any workload between AI 8 and AI 31 that uses float4 is leaving performance on the table. In that range, the standard kernel is compute-bound at ~815 GFLOPS, while the optimized kernel is still bandwidth-bound at ~130 GB/s - delivering 2–5x more throughput. This is exactly the regime where many ML inference kernels, physics accumulators, and signal processing algorithms operate.

Below AI 8, both kernels are identical. The float4 vs scalar choice doesn't matter when bandwidth is the bottleneck.

Above AI 31, both kernels are compute-bound, but the optimized kernel's ceiling is 4.9x higher (3,849 vs 769 GFLOPS). Dense matrix multiply, large convolutions, and other compute-heavy kernels would see the full benefit.

Cross-GPU Comparison

Metric 780M (RDNA 3, DDR5-5600) M5 (Apple GPU, LPDDR5X-9600)
Bandwidth ceiling 73 GB/s 122 GB/s
Standard sweep peak (float4) 1,181 GFLOPS 815 GFLOPS
Optimized sweep peak (scalar) - 3,849 GFLOPS
Pure compute peak (scalar) 4,700 GFLOPS 3,814 GFLOPS
Ridge point (standard) ~16 F/B ~6.5 F/B
Ridge point (optimized) - ~31 F/B
CUs/cores 12 CUs 10 cores
GPU clock (measured) ~2,600 MHz 1,578 MHz
Memory DDR5-5600 (shared) LPDDR5X-9600 (unified)
GPU power (measured) ~30W 18.2W

Both GPUs show a similar ~4x sweep-to-scalar gap

The 780M achieves 1,181 / 4,700 = 25% of its scalar peak in the float4 sweep. The M5 achieves 815 / 3,760 = 22%. The ratio is strikingly similar, though the mechanisms may differ.

On Apple Silicon, we've confirmed the cause: a scalar ISA that decomposes float4 into 4 scalar instructions. On RDNA 3, the architecture is different - AMD's compute units have a dedicated scalar ALU (SALU) and a 32-wide vector ALU (VALU), and vec4 operations compile to separate VALU instructions for each component. We didn't run the same scalar isolation tests on the 780M, so we can't confirm the gap breaks down the same way. The Vulkan compute.comp pure compute test also used cross-dependent chains, so the 4,700 GFLOPS figure may reflect a different mix of factors (cross-dependency penalty, occupancy effects, or genuine vector decomposition overhead).

What's clear is that on both architectures, the standard roofline methodology (float4 vectors, cross-dependent FMA chains) measures roughly one quarter of the hardware's theoretical scalar throughput.

The M5 is better balanced

The most interesting difference is the compute-to-bandwidth ratio. The 780M has 16x more float4 compute than bandwidth (in FLOP/byte terms), meaning its ALUs are massively overprovisioned relative to memory. The M5's ratio is 6.5:1, much closer to what typical GPU workloads demand. More of the silicon is doing useful work at any given time.

In the bandwidth-bound regime (AI 1–4, where most rendering and many compute workloads operate), the M5 delivers ~130 GB/s versus the 780M's ~73 GB/s - 1.75x more bandwidth. This translates almost directly to higher throughput for bandwidth-limited workloads. The M5's lower compute ceiling is irrelevant here because neither GPU is compute-limited.

In the compute-bound regime (AI > 16), the 780M's higher float4 peak (1,181 vs 815 GFLOPS) gives it an edge for vectorized workloads. But the true scalar peaks are closer (4,700 vs 3,760 GFLOPS) - workloads that exploit scalar ILP would see a much smaller gap.

Efficiency

The M5 Air is a fanless machine drawing 18.2W GPU power at full load (measured via powermetrics). The 780M data was collected at 25–54W package power in an actively-cooled mini PC (~30W GPU power at full load).

Metric M5 (18.2W GPU) 780M (~30W GPU)
GB/s per watt 6.7 ~2.4
Sweep GFLOPS per watt 45 ~39
Scalar GFLOPS per watt 210 ~157

The M5 is 1.2–2.8x more power-efficient depending on the metric. The bandwidth efficiency gap (2.8x) is the most significant - it reflects both LPDDR5X's superior bandwidth-per-watt and Apple's memory controller efficiency. The compute efficiency advantage is more modest (1.2–1.3x), which makes sense: both architectures are running FP32 FMA on similarly modern process nodes. The M5 achieves 81% of the 780M's scalar peak at 61% of the power draw.

What This Means for AI Inference

LLM inference has two phases with opposite roofline profiles. Prefill - processing the input prompt - is a large matrix-matrix multiply that lands in the compute-bound regime (arithmetic intensity 50–500 FLOP/byte depending on sequence length). Decode - generating tokens one at a time - is a matrix-vector multiply at batch=1, with an arithmetic intensity of roughly 2 FLOP/byte for FP16 weights, squarely in the bandwidth-bound regime. The M5's 1.75x bandwidth advantage over the 780M directly accelerates decode, while its lower compute ceiling (3,760 vs 4,700 GFLOPS scalar FP32) penalizes prefill. In practice, this tradeoff favors the M5 for interactive use.

For a typical chat interaction - say a 1K-token prompt generating 300 tokens of response - prefill completes in a fraction of a second on either GPU, while decode takes several seconds. Decode dominates wall-clock time by 10–100x in these scenarios, and it's the phase users actually perceive as "speed": the token-by-token streaming rate they watch while the model responds. A GPU that is 1.75x faster at decode but 20% slower at prefill will feel meaningfully faster to a user, because the sustained generation rate improves throughout the interaction while the prefill penalty is a one-time cost measured in milliseconds.

This breaks down for prompt-heavy workloads (RAG with long retrieved contexts, bulk summarization, classification) where prefill dominates and compute matters more - but for the conversational use case that defines most local LLM usage, the M5's bandwidth-rich, compute-modest balance is closer to ideal than the 780M's compute-heavy design.

Graphics Pipeline: TBDR Changes Everything

The compute roofline tells you how fast the ALUs can crunch numbers. But GPUs exist to render pixels. To understand how the M5 performs as a graphics processor, we need to measure the rendering pipeline: rasterization, fragment shading, texture sampling, and the fixed-function hardware that stitches it all together.

We built three graphics benchmarks using Metal render passes - the same fullscreen-triangle methodology used in the 780M fill rate analysis, adapted for Metal's tile-based deferred rendering (TBDR) architecture.

What is TBDR?

The 780M (and all desktop/console GPUs from AMD, NVIDIA, Intel) uses immediate-mode rendering: every pixel the fragment shader writes goes directly to DRAM. If you draw a triangle that covers 1000 pixels, that's 1000 × 4 bytes written to the framebuffer in main memory. Draw another triangle on top? Another 1000 × 4 bytes. Overdraw - rendering pixels that get covered by later geometry - is a direct DRAM bandwidth cost.

Apple Silicon uses tile-based deferred rendering. The screen is divided into small tiles (typically 32×32 pixels), and all rendering for each tile happens in fast on-chip tile memory (~32 KB per tile). Fragment shaders and overdraw write to tile memory, not DRAM. Only when the entire tile is finished does the final result flush to DRAM exactly once, regardless of how many triangles contributed to it.

This means the fill rate equation is fundamentally different. On the 780M, fill rate at zero ALU work measures DRAM write bandwidth. On the M5, it measures tile memory write bandwidth - a much higher ceiling.

Test 1: Fill Rate Roofline

A fullscreen triangle rendered to a 1920×1080 RGBA8 render target with 128× overdraw. The fragment shader runs cross-dependent float4 FMA chains (same pattern as the compute sweep) controlled by a Metal function constant that sets the iteration count. We sweep FMA-per-pixel from 0 (null shader, pure rasterizer throughput) to 512 (ALU-bound).

To isolate TBDR's contribution, each FMA count is tested twice:

  • .store - tile memory flushes to DRAM at render pass end (normal rendering)
  • .dontCare - tile memory is discarded (no DRAM write at all)

FMA/pixel .store GPixels/s .dontCare GPixels/s Difference
0 1,284 1,296 <1%
4 1,291 1,297 <1%
16 1,295 1,292 <1%
32 1,277 1,282 <1%
64 505 506 <1%
128 91 91 <1%
512 22.7 22.7 0%

The .store and .dontCare curves are indistinguishable. The final tile-to-DRAM flush - writing 1920×1080×4 = 7.9 MB once per render pass - takes ~0.07 ms at 118 GB/s. Against a 3.3 ms render pass, that's 2% of the total time. TBDR makes the store action essentially free.

Comparison with the 780M: The 780M measured 20.8 GPixels/sec at FMA=0 on RGBA8. The M5 measures 1,284 GPixels/sec - 62× higher. This isn't a 62× faster GPU. It's the difference between 128× DRAM writes (780M) and 128× tile memory writes + 1× DRAM write (M5). The TBDR architecture eliminates 99.2% of the pixel write bandwidth.

At FMA=0–32, the M5 sustains ~1,290 GPixels/sec - the rasterizer and tile memory are the bottleneck, not ALU or DRAM. The transition to ALU-bound happens at FMA=64, where throughput drops to 505 GPixels/sec. By FMA=512, it settles at 22.7 GPixels/sec. The GFLOPS at the ALU ceiling (FMA=128+) correspond to the float4 compute ceiling from the roofline, confirming that the fragment shader ALU is the same hardware as the compute ALU.

Test 2: R8 vs RGBA8 - Where Tile Memory Becomes the Bottleneck

On the 780M, switching from RGBA8 (4 bytes/pixel) to R8 (1 byte/pixel) revealed a DRAM bandwidth bottleneck: R8 was 1.8–3.7× faster because each pixel write moved 4× less data through the memory bus. What happens on a TBDR GPU where pixel writes go to tile memory?

FMA/pixel RGBA8 GPixels/s R8 GPixels/s R8 speedup
0 1,295 1,292 1.0×
4 1,289 1,290 1.0×
32 1,270 1,297 1.0×
64 507 1,283 2.5×
128 91 1,067 11.7×
256 46 452 9.8×
512 23 80 3.5×

At low FMA counts (0–32), RGBA8 and R8 are identical - both hit the rasterizer ceiling at ~1,290 GPixels/sec. The format doesn't matter when the bottleneck is rasterizer throughput, not memory of any kind.

The divergence at FMA=64 is dramatic. RGBA8 drops to 507 GPixels/sec while R8 sustains 1,283 - the ALU work per pixel is now large enough that tile memory bandwidth per pixel matters. Each RGBA8 pixel occupies 4 bytes in tile memory; each R8 pixel occupies 1 byte. With less tile memory consumed per pixel, more pixels can be in flight simultaneously, and the ALU has more work to chew through before the tile fills up.

At FMA=128, the gap peaks at 11.7×. R8 sustains 1,067 GPixels/sec - the rasterizer is still not the bottleneck. RGBA8 has collapsed to 91 GPixels/sec - tile memory bandwidth is now the binding constraint. This is the TBDR equivalent of the 780M's DRAM bandwidth wall, but the wall is on-chip tile memory, not off-chip DRAM.

At very high FMA (512), both formats converge again - ALU becomes the absolute bottleneck regardless of tile memory pressure.

The insight: TBDR doesn't eliminate bandwidth bottlenecks - it moves them from DRAM to tile memory. Tile memory bandwidth is much higher than DRAM's, so the bottleneck engages at a much higher workload intensity - but it's still there. Games with fat G-buffers (deferred rendering with 4–5 render targets at 16–32 bytes per pixel) will hit tile memory limits before ALU limits. This is why Metal's MTLStorageMode.memoryless and imageblock features exist - they let developers minimize the tile memory footprint of intermediate render targets that don't need to survive past the render pass.
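The footprint effect can be sketched with simple arithmetic. Apple doesn't publish per-core tile memory capacity, so the 32 KB constant below is purely illustrative:

```python
# Pixels resident per tile for a given per-pixel tile footprint, assuming
# a hypothetical 32 KB of tile memory per core (illustrative only - Apple
# does not publish this figure).
TILE_MEMORY_BYTES = 32 * 1024

def pixels_per_tile(bytes_per_pixel):
    return TILE_MEMORY_BYTES // bytes_per_pixel

r8_pixels = pixels_per_tile(1)        # 32,768 pixels in flight
rgba8_pixels = pixels_per_tile(4)     #  8,192
gbuffer_pixels = pixels_per_tile(32)  #  1,024 (fat deferred G-buffer)
```

Whatever the true constant is, the ratios are what matter: a 32 B/pixel G-buffer keeps 32× fewer pixels in flight per tile than an R8 target.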

Test 3: Texture Cache Hierarchy

The fill rate tests exercised the rasterizer, ROPs, and ALU but never touched the texture units. Real rendering is dominated by texture sampling. This benchmark profiles the texture sampling pipeline by sweeping texture size and access pattern.

Seven RGBA8 texture sizes from 64×64 (16 KB) to 4096×4096 (64 MB), noise-filled to defeat hardware compression. Two UV access modes: coherent (adjacent pixels sample adjacent texels) and random (a PCG hash scatters fetches across the entire texture). Five fetch counts per pixel (1, 2, 4, 8, 16).

Apple Silicon's cache hierarchy differs from the 780M's:

  • Per-core texture cache: ~16 KB (similar to 780M's per-CU L1)
  • System Level Cache (SLC): ~32 MB shared across CPU and GPU (vs 780M's 2 MB L2)
  • DRAM: 118 GB/s LPDDR5X (vs 780M's 73 GB/s DDR5)

The SLC is the key differentiator. The 780M's 2 MB L2 meant a 1024×1024 texture (4 MB) was already in DRAM territory. The M5's 32 MB SLC should keep textures up to at least 2048×2048 (16 MB) fully cached.

Coherent UVs (128× overdraw)

Texture Size fetch=1 GTexels/s fetch=16 GTexels/s Effective BW (fetch=16)
64×64 (16 KB) 1,282 20,642 82,567 GB/s
128×128 (64 KB) 1,294 20,671 82,682 GB/s
512×512 (1 MB) 1,281 20,544 82,175 GB/s
1024×1024 (4 MB) 1,292 20,642 82,568 GB/s
2048×2048 (16 MB) 1,295 20,519 82,077 GB/s
3072×3072 (36 MB) 1,289 20,619 82,477 GB/s
4096×4096 (64 MB) 1,295 20,730 82,920 GB/s

Zero degradation. From 16 KB to 64 MB - a 4,000× range in texture size - coherent texture throughput is flat at ~1,290 GTexels/sec (fetch=1) and ~20,600 GTexels/sec (fetch=16). The cache hierarchy is invisible. Adjacent pixels sample adjacent texels, so even "DRAM-resident" textures benefit from cache line reuse within SIMD groups and across tile passes. The SLC absorbs any miss traffic before it reaches DRAM.

The 780M showed 19% degradation from L1-resident to 4096×4096 coherent. The M5 shows 0%. For normal UV-mapped game textures, the M5 behaves as if it has infinite texture bandwidth.

Random UVs (8× overdraw)

Texture Size fetch=1 GTexels/s fetch=16 GTexels/s Effective BW (fetch=16)
64×64 (16 KB) 1,249 5,431 21,722 GB/s
128×128 (64 KB) 433 34 137 GB/s
512×512 (1 MB) 172 11.2 45 GB/s
1024×1024 (4 MB) 7.7 7.2 29 GB/s
2048×2048 (16 MB) 4.4 4.2 17 GB/s
4096×4096 (64 MB) 3.3 3.3 13 GB/s

Random access destroys the illusion. Three distinct cliffs emerge:

L1 plateau (64×64, 16 KB): Even random UVs keep the tiny texture in per-core L1 cache. Throughput is 1,249 GTexels/sec at fetch=1 - comparable to coherent. The texture is small enough that every texel is always cached.

Cache hierarchy cliff (128×128 -> 1024×1024): Throughput collapses from 433 to 7.7 GTexels/sec at fetch=1 - a 56× drop as the texture grows from just 64 KB to 4 MB. Random access patterns scatter fetches across the texture, defeating the cache line reuse that made coherent access fast. Each SIMD group's 32 threads hit 32 unrelated cache lines; the caches can't absorb the working set.

DRAM floor (2048×2048+): Throughput stabilizes at 3.3–4.4 GTexels/sec, implying 13–17 GB/s of effective DRAM bandwidth. This is far below the 118 GB/s DRAM ceiling - the same cache line waste pattern seen on the 780M. Random bilinear samples use at most 16 bytes of each 64-byte cache line, wasting 75% of fetched data. Combined with TLB thrashing across millions of pages, effective bandwidth drops to ~11% of theoretical.
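The arithmetic behind the ~11% figure, as a sketch (16 useful bytes is the best case for a bilinear RGBA8 tap; the residual gap is attributed to TLB and DRAM-page effects):

```python
# Best case for a random bilinear RGBA8 tap: the 2x2 texel quad touches at
# most 16 bytes of each 64-byte cache line fetched.
CACHE_LINE = 64
USEFUL = 16
useful_fraction = USEFUL / CACHE_LINE   # 0.25 -> 75% of fetched data wasted

# Measured effective DRAM bandwidth vs. theoretical:
measured_fraction = 13 / 118            # ~0.11
# The shortfall below 0.25 is the TLB-thrash / DRAM-page-open penalty.
```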

Comparison with the 780M

Texture Size M5 coherent (fetch=16) 780M coherent (fetch=16) M5 random (fetch=16) 780M random (fetch=16)
64×64 20,642 123 5,431 45.8
1024×1024 20,642 104 7.2 0.86
4096×4096 20,730 99.9 3.3 0.44

The absolute numbers aren't directly comparable - different overdraw factors, different resolutions, different everything. But the patterns are: the M5's SLC eliminates the coherent-access cliff that the 780M showed (the 780M dropped 19% from L1 to 4096²; the M5 drops 0%), while both GPUs show the same random-access collapse at large texture sizes. The underlying physics of cache line waste and TLB thrashing is architecture-independent.

What This Means for Real Games

The natural question: does a game, particularly one not specifically optimized for Apple Silicon, actually benefit from these architectural differences?

TBDR benefits are automatic. The Metal driver handles tile scheduling transparently. Any game using Metal (including games ported via MoltenVK or Game Porting Toolkit) automatically gets overdraw absorbed by tile memory instead of hitting DRAM. A game doesn't need to know it's running on TBDR. The driver does it. Our fill rate data shows the result: store vs dontCare is identical, meaning even the final tile flush is negligible. For any workload with overdraw - which is all real rendering - the M5 doesn't pay the bandwidth cost the 780M does.

Deferred rendering is the big caveat. Modern engines (Unreal, Unity, Frostbite) use deferred rendering with G-buffers - typically 4–5 render targets totaling 16–32 bytes per pixel. On the 780M, filling and reading back a G-buffer at 1080p costs ~7 GB/s of DRAM bandwidth just for the round-trip. On the M5, if the engine structures rendering as a single Metal render pass (G-buffer fill + lighting resolve in one pass), the G-buffer never leaves tile memory. Zero DRAM cost.
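As a sanity check on the ~7 GB/s figure, a back-of-envelope sketch - the 2× overdraw on the fill and the single full readback for lighting are our assumptions, not measured values:

```python
# G-buffer DRAM round-trip at 1080p60 on an immediate-mode GPU.
PIXELS = 1920 * 1080
BYTES_PER_PIXEL = 20   # 4-5 render targets totaling ~20 B/px
OVERDRAW = 2
FPS = 60

write_bytes = PIXELS * BYTES_PER_PIXEL * OVERDRAW    # fill, with overdraw
read_bytes = PIXELS * BYTES_PER_PIXEL                # lighting readback
gb_per_sec = (write_bytes + read_bytes) * FPS / 1e9  # ~7.5 GB/s
```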

But PC-ported games typically structure deferred rendering as multiple separate render passes - one pass fills the G-buffer, a second pass reads it back for lighting. Each pass boundary forces a tile flush to DRAM and a reload. An unoptimized port pays the DRAM round-trip, losing the TBDR advantage for deferred rendering specifically. Our R8 vs RGBA8 data quantifies what's at stake: at moderate shader complexity (FMA=128), tile-memory-optimized rendering (R8, minimal tile footprint) delivers 11.7× the throughput of tile-memory-heavy rendering (RGBA8, 4× the tile footprint).

Texture caching is automatic and transformative. Our coherent texture data applies directly to real games - normal UV-mapped textures have high spatial coherence. The M5 showed zero degradation from 16 KB to 64 MB textures with coherent access. A game loading dozens of 2048×2048 or 4096×4096 texture atlases gets effectively unlimited texture bandwidth. This is automatic; no optimization needed.

The random texture results predict performance for specific effects: volumetric ray marching, indirection textures, bindless texture arrays with scattered lookups. Both GPUs crater here (M5: 13 GB/s, 780M: 3.5 GB/s), each falling to a small fraction of its theoretical DRAM bandwidth. No architecture escapes the physics of random memory access.

What an unoptimized port gets for free:

  • Overdraw absorption (~20–30% bandwidth savings for typical 2–3× overdraw)
  • Texture cache benefits from SLC (zero coherent degradation up to 64 MB)
  • Tile-local depth/stencil testing (rejected fragments never touch DRAM)

What requires Metal-specific optimization:

  • Single-pass deferred rendering (restructuring render passes for TBDR)
  • Memoryless intermediate buffers (storing intermediate render targets only in tile memory)
  • Programmable blending (tile-local blend operations, no DRAM round-trip)

Estimated Frame Budget Comparison

Here's a rough model of a 1080p60 frame for a typical deferred-rendered game:

Workload component 780M bandwidth cost M5 bandwidth cost Notes
G-buffer fill (2× overdraw, 20 B/px) 7.1 GB/s ~0 GB/s (tile memory) TBDR eliminates G-buffer DRAM writes
Shadow maps (4 cascades, depth-only) 4.2 GB/s ~0 GB/s Depth is tile-local; .dontCare store action
Texture sampling (coherent) Capped by 73 GB/s DRAM Effectively unlimited SLC absorbs working set
Post-processing chain 12 GB/s 12 GB/s Both hit DRAM for full-screen passes
Remaining BW for everything else ~50 GB/s ~106 GB/s

The M5 effectively has 2× more usable bandwidth than the 780M, not because the DRAM is faster (1.6×), but because TBDR eliminates 30–40% of the bandwidth traffic that the 780M can't avoid. Even an unoptimized port reaps the overdraw and texture benefits automatically.

The prediction: for a typical 1080p game, the M5 would deliver roughly equivalent or better frame rates than the 780M despite having 20% less raw FP32 compute, because the bandwidth savings from TBDR more than compensate. At higher resolutions (where bandwidth pressure increases quadratically with pixel count), the M5's architectural advantage grows.

AMX: The CPU's Matrix Coprocessor

The GPU isn't the only compute engine on the M5. Apple Silicon includes AMX (Apple Matrix coprocessor), an undocumented matrix multiply accelerator that sits on the CPU side. When you call the Accelerate framework's BLAS routines - cblas_sgemm, cblas_dgemm - they dispatch to AMX automatically. No public API exists for AMX directly; the only supported path is through Accelerate.

We built a separate benchmark to measure AMX throughput across matrix sizes and precisions, using cblas_sgemm (FP32), cblas_dgemm (FP64), and BNNS BNNSMatMul (FP16) - the only Accelerate API that accepts half-precision inputs.

AMX Peak Throughput

Precision API Peak GFLOPS At size N Sustained (N≥2048)
FP32 cblas_sgemm 1,790 1024 ~1,720
FP64 cblas_dgemm 471 1024 ~435
FP16 BNNS BNNSMatMul 1,920 2560 see below

The FP32 peak of 1,790 GFLOPS is roughly half the GPU's scalar FP32 peak (3,849 GFLOPS) and a third of the GPU's FP16 peak (6,004 GFLOPS). The FP32:FP64 ratio is 3.8:1 - AMX has significantly fewer double-precision units, as expected for a consumer chip. For context, the M1's AMX was measured at roughly 2 TFLOPS FP32 by the community, so the M5 is in a similar range, suggesting AMX throughput has not scaled as aggressively as the GPU across generations.
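For reference, the GFLOPS accounting used throughout the GEMM tables: an N×N×N matmul performs 2·N³ floating-point operations. The 1.2 ms timing below is back-calculated for illustration, not a measured value:

```python
# GFLOPS for a square GEMM: 2*N^3 FLOPs (one multiply + one add per term).
def gemm_gflops(n, seconds):
    return 2 * n**3 / seconds / 1e9

# N=1024 completing in ~1.2 ms corresponds to the ~1,790 GFLOPS peak:
g = gemm_gflops(1024, 0.0012)  # ~1790
```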

BNNS FP16: A Tiling Bug

The FP16 results revealed something unexpected. Below N=3584, FP16 outperforms FP32 by 10–30%, confirming that AMX has native half-precision support. But at N=4096, throughput falls off a cliff:

N FP16 GFLOPS FP32 GFLOPS FP16/FP32 Matrix size (FP16)
2560 1,921 1,563 1.23x 13.1 MB
3072 1,872 1,621 1.15x 18.9 MB
3584 1,829 1,641 1.11x 25.7 MB
4096 1,049 1,655 0.63x 33.6 MB
5120 674 1,626 0.41x 52.4 MB
8192 676 1,633 0.41x 134.2 MB

The transition is sharp, not gradual. FP32 throughput via both cblas_sgemm and BNNS's own FP32 path is completely unaffected at these sizes - both sustain ~1,650 GFLOPS through N=8192. The dropoff is FP16-specific and BNNS-specific.

The threshold aligns with the M5's 32 MB SLC (System Level Cache). A single FP16 matrix at N=4096 is 33.6 MB - just over SLC capacity. But FP32 matrices at N=4096 are 67.2 MB (well over SLC) and don't drop at all, because cblas_sgemm uses well-optimized tiling that keeps AMX working on small blocks fitting in L1/L2 regardless of total matrix size.

The conclusion: BNNS's FP16 matmul path has a tiling deficiency for matrices exceeding SLC size. It likely either fails to tile properly (streaming cache-busting chunks through DRAM on every pass) or converts data to FP32 before dispatching to AMX. The stabilization at a constant 0.41x of FP32 throughput for all sizes above N=5120 suggests the extra cost scales with the O(N³) compute itself - a one-time O(N²) conversion would shrink relative to compute as N grows - which favors the untiled-streaming explanation.

This matters for ML inference frameworks running on CPU. Any framework using BNNS for FP16 matmul with matrices larger than ~3500×3500 (25 MB) would get better throughput by converting to FP32 and calling cblas_sgemm directly. In practice, most LLM weight matrices are smaller than this threshold (a 4096-hidden-dim model's largest matmul is 4096×4096 at 33.6 MB - right at the cliff edge), so the impact depends on exact model dimensions.
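The threshold arithmetic, as a sketch (MB here means 10⁶ bytes, matching the sizes quoted in the table):

```python
# A single square FP16 matrix is N^2 elements x 2 bytes.
def fp16_matrix_mb(n):
    return n * n * 2 / 1e6

SLC_MB = 32.0                  # approximate SLC capacity
fits = fp16_matrix_mb(3584)    # ~25.7 MB - under the SLC
spills = fp16_matrix_mb(4096)  # ~33.6 MB - just over it
```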

AMX vs GPU: Where Each Wins

AMX (CPU) GPU
FP32 peak 1,790 GFLOPS 3,849 GFLOPS
FP16 peak 1,920 GFLOPS 6,004 GFLOPS
Memory bandwidth 122 GB/s (shared) 122 GB/s (shared)
Advantage No data transfer, direct memory access 2–3x higher compute

AMX and GPU share the same unified memory - there's no PCIe transfer cost. The GPU wins on raw compute by 2–3x, but AMX avoids the overhead of Metal command buffer setup, GPU scheduling, and synchronization. For small matrices or mixed CPU/matrix workloads where GPU launch latency matters, AMX could be competitive. For sustained large-matrix compute (the LLM use case), the GPU's higher throughput wins.

ANE: The Neural Engine

The M5 also includes an ANE (Apple Neural Engine) - a dedicated matrix/tensor accelerator designed for ML inference. Unlike the GPU and AMX, the ANE has no low-level API. The only path to it is through Core ML: convert a model with coremltools, specify the compute unit, and let Core ML's compiler decide how to schedule operations on the hardware. This means we're benchmarking the Core ML stack as much as the ANE itself - but since Core ML is the only way anyone can actually use the ANE, that's the measurement that matters.

We built a benchmark that generates simple matmul models at various sizes and precisions via coremltools, then measures inference throughput across four compute unit configurations: CPU only, CPU+GPU, CPU+ANE, and ALL (Core ML chooses).

Batch=1: ANE Loses to CPU

For single-sample inference - the decode case in LLM serving - the results are stark:

N (square matmul) CPU (AMX) GPU ANE ALL
256 4.2 GFLOPS 4.5 4.7 4.4
1024 64.3 64.4 14.0 13.3
2048 224.7 15.0 34.7 35.8
4096 74.6 36.3 19.0 18.8

At batch=1, the CPU (via AMX) dominates at medium sizes, peaking at 225 GFLOPS for N=2048. The ANE manages only 19–35 GFLOPS for the sizes that matter (1024–4096). The GPU through Core ML is also poor - 15–64 GFLOPS versus the 3,849 GFLOPS we measured with raw Metal compute shaders. Core ML's scheduling overhead dwarfs the actual computation at these small batch sizes.

The CPU's 225 GFLOPS through Core ML is lower than the 1,790 GFLOPS we measured with direct cblas_sgemm calls. Core ML adds substantial overhead even on the CPU path - model loading, graph execution, memory management - that makes it uncompetitive with raw BLAS for single operations.

Batch Scaling: Where ANE Wins

The ANE's strength is throughput, not latency. When we increase the batch size for a 2048×2048 matmul:

Batch CPU (AMX) GPU ANE ALL
1 216 GFLOPS 24 36 36
4 324 57 127 121
16 1,067 226 447 446
64 1,913 656 1,384 1,376
256 2,021 2,023 2,879 2,893

At batch=256, the ANE reaches 2,879 GFLOPS - beating both CPU (2,021) and GPU (2,023) through the same Core ML interface. The "ALL" compute unit tracks ANE at high batch, confirming Core ML routes work to the ANE when it's the fastest option.

The ANE's 2.9 TFLOPS at batch=256 is consistent with Apple's published TOPS figures for the M-series neural engines (accounting for FP16 precision and Core ML overhead). The 80× throughput increase from batch=1 to batch=256 shows the ANE has massive parallelism but enormous fixed-cost overhead per invocation.
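That fixed-cost behavior fits a simple launch-overhead model, throughput = work / (t₀ + work/peak). The t₀ ≈ 231 µs and peak ≈ 4,170 GFLOPS below are fitted by us to the batch=1 and batch=256 endpoints - they are not measured quantities:

```python
# Effective throughput under a fixed per-invocation overhead:
#   throughput = work / (t0 + work / peak)
def effective_gflops(work_gflop, peak_gflops, overhead_sec):
    return work_gflop / (overhead_sec + work_gflop / peak_gflops)

# Per-sample work for a 1x2048 by 2048x2048 matmul: 2*2048^2 ~ 8.4 MFLOP.
per_sample_gflop = 2 * 2048 * 2048 / 1e9

batch1 = effective_gflops(per_sample_gflop, 4170, 231e-6)          # ~36
batch256 = effective_gflops(256 * per_sample_gflop, 4170, 231e-6)  # ~2879
```

The fitted peak sitting well above the best measured point is exactly what a launch-overhead-limited engine looks like: even batch=256 hasn't fully amortized t₀.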

What This Means for Local LLM Inference

The ANE results explain a pattern that LLM framework developers have observed: the ANE is fast for prompt processing (prefill) but slow for token generation (decode).

  • Decode (batch=1): CPU via AMX at 225 GFLOPS beats ANE at 35 GFLOPS by 6.4x. The ANE's launch overhead makes it useless for the single-token-at-a-time decode loop. This is why llama.cpp and similar frameworks default to CPU or GPU for decode.
  • Prefill (large batch): Processing a 256-token prompt is effectively a batch=256 operation. Here the ANE's 2,879 GFLOPS is 1.4x faster than AMX. A framework that dispatches prefill to ANE and decode to CPU/GPU would get the best of both worlds.
  • GPU through Core ML vs raw Metal: The GPU numbers through Core ML (24–2,023 GFLOPS) are dramatically lower than raw Metal (3,849 GFLOPS scalar FP32, 6,004 FP16). This is Core ML overhead, not GPU limitation. Frameworks that use Metal directly (like llama.cpp's Metal backend) bypass this entirely and get much higher GPU utilization.

The practical hierarchy for local LLM inference on the M5:

Phase Best engine Throughput Why
Decode (batch=1) GPU (raw Metal) ~3,849 GFLOPS FP32 Lowest overhead, highest bandwidth utilization
Prefill (batch≥64) ANE (Core ML) ~2,879 GFLOPS FP16 Highest throughput for batched matmul
Fallback CPU (AMX) ~1,790 GFLOPS FP32 No framework overhead, good for small ops

INT8 Quantization: Where It Helps and Where It Doesn't

Most local LLM inference uses quantized weights - INT8 (Q8_0) or INT4 (Q4_K_M) - to fit models in memory and reduce bandwidth pressure. We built a benchmark comparing FP16 and INT8 (W8A16: INT8 weights, FP16 activations) across CPU, GPU, and ANE via Core ML, testing both decode-like (batch=1) and prefill-like (batch=128–256) configurations.

Results

Config Engine FP16 GFLOPS INT8 GFLOPS Speedup
Decode (batch=1, 2048²) CPU 208 217 1.04x
Decode (batch=1, 2048²) ANE 35 49 1.41x
Decode (batch=1, 4096²) GPU 39 67 1.72x
Prefill (batch=128, 2048²) CPU 1,978 2,089 1.06x
Prefill (batch=128, 2048²) ANE 2,124 2,562 1.21x
Prefill (batch=256, 2048²) ANE 2,893 2,987 1.03x
FFN prefill (128, 2048×8192) ANE 1,720 2,561 1.49x

What the numbers say

AMX doesn't have native INT8 compute. CPU decode - the latency-critical path for token generation - goes from 208 to 217 GFLOPS with INT8. A 4% improvement. AMX is almost certainly dequantizing INT8 to FP32 internally and computing in floating point. The quantized weights save memory bandwidth (half the bytes to read per weight), but the multiply-accumulate units operate at the same speed regardless. Quantization doesn't make decode faster on CPU.

The ANE has real INT8 datapaths. The ANE shows 1.2–1.5x speedup from INT8, with the largest gain on FFN prefill: 1,720 -> 2,561 GFLOPS (1.49x). This isn't a bandwidth effect - the compute itself is faster, confirming the ANE has dedicated integer multiply-accumulate units distinct from its FP16 path. This is what Apple's TOPS marketing reflects: the ANE genuinely executes INT8 operations faster than FP16.

But the ANE saturates at high batch regardless of precision. At batch=256, FP16 delivers 2,893 GFLOPS and INT8 delivers 2,987 - only 3% difference. The ANE hits a compute ceiling where it's fully utilized, and precision doesn't matter. The INT8 advantage lives in the mid-batch range (32–128) where the ANE has headroom to exploit the faster integer units.

GPU benefits from reduced memory traffic, not faster compute. The GPU's 1.72x INT8 speedup at decode_4096 approaches the 2x reduction in weight data size (INT8 vs FP16). At batch=1, the GPU is memory-bandwidth-bound - halving the weight data nearly halves the time spent reading weights. But these numbers are through Core ML at 67 GFLOPS - raw Metal reaches 3,849 GFLOPS. Core ML overhead masks the real GPU bandwidth benefit.
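The 1.72× can be read through a simple Amdahl-style model. The 0.84 weight-read fraction below is inferred by us from the measured speedup, not measured independently:

```python
# If a fraction f of decode time is weight reads, and INT8 shrinks those
# bytes by ratio r, the overall speedup follows Amdahl's law:
def int8_speedup(weight_read_fraction, byte_ratio):
    f, r = weight_read_fraction, byte_ratio
    return 1 / ((1 - f) + f / r)

bound = int8_speedup(1.0, 2.0)      # 2.0x ceiling (all time in weight reads)
observed = int8_speedup(0.84, 2.0)  # ~1.72x -> implies ~84% of decode time
                                    # was spent reading weights
```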

What this means for local LLM users

INT8 quantization doesn't change the performance hierarchy on the M5. The practical implications:

No perceptible speedup in token generation. Decode is CPU-bound at batch=1, and AMX doesn't benefit from INT8 compute. The 4% improvement is invisible to users watching tokens stream in. Frameworks like llama.cpp that use raw Metal for decode bypass Core ML entirely, and the GPU's decode speed is limited by memory bandwidth for reading weights - INT8 helps here (half the bytes), but this benefit exists at the Metal level, not through Core ML.

Modest prefill speedup via ANE. Processing a 128-token prompt is 1.2–1.5x faster with INT8 on the ANE. But prefill is already the fast phase - it completes in under a second for typical prompts. Making it 1.3x faster saves fractions of a second.

The real win from quantization is memory capacity. A Q8 model uses half the RAM of FP16. On a 16 GB M5 Air, this is the difference between fitting a 7B parameter model (14 GB in FP16, 7 GB in Q8) and not. The performance benefit is secondary to the capacity benefit - quantization lets you run larger models, and a larger model producing better outputs matters more than a smaller model producing them 4% faster.
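The capacity arithmetic for a nominal 7B-parameter model (bytes per weight only - runtime overhead and KV cache come on top):

```python
# Weight footprint by precision for a 7B-parameter model.
PARAMS = 7e9
fp16_gb = PARAMS * 2.0 / 1e9  # 14 GB - doesn't fit comfortably in 16 GB
q8_gb = PARAMS * 1.0 / 1e9    #  7 GB - fits with headroom
q4_gb = PARAMS * 0.5 / 1e9    #  3.5 GB - fits easily
```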

Core ML Overhead: Not What We Expected

The ANE and earlier Core ML benchmarks showed what looked like massive framework overhead - the GPU delivering only 15–80 GFLOPS through Core ML versus 3,849 GFLOPS from raw Metal compute shaders. This seemed like a damning indictment of Core ML's efficiency. We built a dedicated benchmark to quantify the overhead precisely, comparing Core ML predictions against raw Metal compute kernels and AMX for the same matmul operations. The results inverted our assumptions.

The Fixed Overhead is Small

Config FLOPs Core ML CPU µs Core ML GPU µs Core ML ANE µs Metal (wall) µs AMX µs
tiny (1×64²) 8K 27 23 24 352 0.1
small (1×256²) 131K 25 23 22 375 0.3
medium (1×1024²) 2M 26 23 161 521 4.7
large (1×2048²) 8M 37 596 231 381 105
xlarge (1×4096²) 34M 415 677 1,765 886 860
batch32 (32×2048²) 268M 95 510 240 794 570
batch128 (128×2048²) 1.1G 296 450 238 1,466 1,063
batch256 (256×2048²) 2.1G 568 614 278 2,712 1,696

Core ML's per-prediction overhead is roughly 22–27 µs - consistent across all compute units and visible in the tiny/small configs where overhead dominates. For comparison, our raw Metal dispatch (command buffer creation, encoding, commit, wait) takes ~350 µs wall-clock for the same trivial operation. Metal's CPU-side dispatch overhead is 14x larger than Core ML's.

AMX via cblas_sgemm has effectively zero overhead (0.1 µs for tiny), which is why it dominates for small batch=1 operations.

Core ML's Internal Kernels Are Highly Optimized

The real surprise is the effective GFLOPS at large sizes:

Config FLOPs AMX Metal GPU Core ML CPU Core ML GPU Core ML ANE
batch32 (32×2048²) 268M 471 796 2,813 526 1,120
batch128 (128×2048²) 1.1G 1,010 859 3,629 2,389 4,511
batch256 (256×2048²) 2.1G 1,266 869 3,781 3,498 7,732

Core ML's CPU path delivers 3,781 GFLOPS at batch=256 - that's 4.3x faster than our tiled Metal matmul kernel running on the GPU. Core ML routes CPU matmul through AMX with highly optimized tiling, far beyond what cblas_sgemm achieves (1,266 GFLOPS for the same operation). The framework isn't adding overhead here - it's providing access to better-optimized kernels than the public BLAS API.

The ANE reaches 7,732 GFLOPS at batch=256 - the highest throughput measured on this chip, exceeding even the GPU's 6,004 GFLOPS FP16 peak from our roofline sweep. This suggests the ANE's raw compute capability is substantially higher than what Apple's published TOPS figures imply, or that Core ML is exploiting hardware features (like mixed-precision accumulation) that amplify effective throughput.

Why Our Metal Kernel Loses

Our Metal matmul uses 16×16 tiled shared memory - a standard textbook implementation. But Apple's internal GPU kernels (used by Core ML's GPU path and by MPS) use architecture-specific optimizations: SIMD group operations, optimal register tiling for Apple's scalar ALU layout, multi-level tiling matching the L1/SLC hierarchy, and possibly undocumented hardware features. The 869 vs 3,498 GFLOPS gap at batch=256 (4.0x) represents the difference between portable GPU code and Apple-tuned kernels.

This is the same pattern we saw in the GPU roofline section - float4 code achieving only 815 GFLOPS while scalar code with sufficient ILP reached 3,849. Apple's internal matmul kernels are written for the scalar ISA with optimal ILP, not for the float4 abstraction that most developers use.

What This Means

The earlier benchmarks weren't measuring Core ML overhead - they were measuring the cost of a single small operation relative to its compute. A batch=1 2048² matmul (a 1×2048 vector against a 2048×2048 matrix) is only ~8 MFLOPs of work. At 3,781 GFLOPS capability, that should take about 2 µs of compute. The 37 µs we measured is 2 µs of compute plus ~25 µs of fixed dispatch overhead - and 25 µs is actually very low for a framework that manages model graphs, memory, and multi-engine scheduling.
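The amortization math, as a sketch - the break-even point below is the 50% efficiency mark, where compute time equals dispatch overhead:

```python
# Work at which compute time equals the fixed dispatch overhead
# (the 50% efficiency point).
OVERHEAD_SEC = 25e-6    # ~25 us per Core ML prediction
PEAK_FLOPS = 3781e9     # Core ML CPU path at batch=256

break_even_flop = OVERHEAD_SEC * PEAK_FLOPS  # ~9.5e7, i.e. ~95 MFLOP
# A batch=1 1x2048^2 matmul is ~8.4 MFLOP - an order of magnitude under
# break-even, which is why it looks "slow" through Core ML.
```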

The practical implications for LLM frameworks:

  • Core ML's CPU path is the fastest way to do matmul on this chip for operations large enough to amortize the 25 µs overhead - roughly ≥100 MFLOP at its ~3.8 TFLOPS rate. It beats both raw cblas_sgemm and our Metal GPU kernels.
  • The ANE is a throughput monster at 7.7 TFLOPS, but only for batched operations. The 25 µs fixed overhead means operations under ~50 MFLOP are overhead-dominated.
  • Naive Metal code leaves 4x+ performance on the table. Developers using Metal directly for matrix operations should use MPS (Metal Performance Shaders) rather than hand-written kernels, unless they're prepared to optimize for Apple's specific scalar ISA.
  • The "Core ML is slow" perception comes from measuring small operations where the 25 µs overhead dominates, not from the framework being inefficient. For the workloads that actually matter (large matmuls in LLM inference), Core ML is faster than most alternatives.

Summary

Three compute engines, one chip:

GPU roofline Bandwidth Compute ceiling Ridge point
Standard (float4, FP32) 122 GB/s 815 GFLOPS ~6.5 F/B
Optimized (scalar 8-chain, FP32) 122 GB/s 3,849 GFLOPS ~31 F/B
FP16 (scalar 32-chain) 122 GB/s 6,004 GFLOPS ~49 F/B
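The ridge points in the table above follow directly from ceiling divided by bandwidth:

```python
# Ridge point (FLOP/byte) = compute ceiling (GFLOPS) / bandwidth (GB/s).
def ridge_point(gflops, gbps):
    return gflops / gbps

float4_ridge = ridge_point(815, 122)   # ~6.7
scalar_ridge = ridge_point(3849, 122)  # ~31.5
fp16_ridge = ridge_point(6004, 122)    # ~49.2
```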

The standard roofline methodology - float4 vectors with cross-dependent FMA chains - measures a compute ceiling that is 4.7x below the hardware's actual capability. This isn't a flaw in the methodology; the standard approach exists because most GPU code uses vector types, and measuring the float4 ceiling gives the correct performance prediction for that code. But it hides two higher ceilings: the scalar FP32 ceiling at 3,849 GFLOPS (accessible through ILP restructuring) and the FP16 ceiling at 6,004 GFLOPS (accessible through half-precision arithmetic with sufficient ILP).

The gaps between these three ceilings define optimization opportunities. Every workload between AI 6.5 and AI 31 that uses float4 is compute-bound on the standard roofline but bandwidth-bound on the scalar FP32 one. Every FP16-eligible workload between AI 31 and AI 49 can push its ridge point even further right. For ML inference - where half-precision is standard - the effective compute ceiling is 7.4x higher than what a naive float4 benchmark measures.

The graphics pipeline tells a complementary story. TBDR eliminates the DRAM bandwidth wall that defines fill rate on immediate-mode GPUs like the 780M. The M5's tile memory absorbs overdraw, its 32 MB SLC makes coherent texture access essentially free regardless of texture size, and the R8 vs RGBA8 comparison reveals that tile memory bandwidth - not DRAM - is the new binding constraint for fragment-heavy rendering. For real games, the architectural benefits of TBDR are largely automatic: overdraw absorption and texture caching require no application-level optimization. The one area that rewards Metal-specific work is render pass structure for deferred rendering, where combining G-buffer fill and lighting resolve into a single pass keeps the G-buffer entirely in tile memory.

Beyond the GPU, the M5's AMX coprocessor delivers 1,790 GFLOPS FP32 through the public cblas_sgemm API, but Core ML's internal CPU path reaches 3,781 GFLOPS for the same hardware - suggesting Apple has AMX-optimized kernels that outperform the public BLAS interface by 2x. The ANE reaches 7,732 GFLOPS FP16 at batch=256 through Core ML - the highest throughput measured on this chip - but collapses to single-digit GFLOPS for tiny operations due to a fixed ~25 µs per-prediction overhead.

INT8 quantization benefits the ANE (1.2–1.5x over FP16) but not AMX (which lacks native INT8 compute). For local LLM users, quantization's primary benefit is memory capacity - fitting a 7B model in 7 GB instead of 14 GB - not compute speed.

The Core ML overhead analysis corrected a key misconception. Core ML's per-call overhead is only ~25 µs - much lower than raw Metal dispatch (~350 µs wall-clock). The perception of "slow Core ML" comes from measuring small operations where 25 µs dominates, not from framework inefficiency. For large matmuls, Core ML provides access to Apple's internal optimized kernels that outperform both the public BLAS API and naive Metal compute shaders by 2–4x.

Takeaways

After completing these tests, we were a bit surprised that the M5's GPU still isn't up to par with the Radeon 780M in compute. The 780M has only 12 CUs - less than half of the 28 CUs in the bottom-of-the-barrel Radeon RX 7400, an OEM-market card that is by no means a gaming powerhouse. The RX 7600, widely regarded as one of the best sweet-spot offerings, clocks in at 32 CUs with 476.9 GB/s of memory bandwidth.

Granted, the 10-core GPU in this 15-inch M5 MacBook Air isn't geared toward gaming or sustained graphical loads, and its much higher memory bandwidth does pull the ridge point down to ~6.5 FLOP/byte versus the 780M's ~8 - but beyond that, it doesn't have much going for it. If manufacturers paired the 780M with much higher-bandwidth LPDDR5X, we could well see the script flip, albeit in a much higher power envelope.

Then there's the gaming market - rumor on the street is that the Steam Machine will use a variant of the RX 7400. Extrapolating our numbers, Apple would have little to no answer to that kind of compute power - not even the 20-core GPU in the M5 Pro can match the raw performance an RX 7400 can deliver. If Apple is serious about taking on the gaming market, the GPU would be one of the biggest hurdles to overcome.

As for local AI inference, we're confident that if an M5 Max Mac Studio arrives at the same or a slightly higher price ($1,999–2,199) with the same 36 GB of RAM as the M5 Max MacBook Pro, it will be the single hottest product for the local-AI crowd.

Methodology Notes

  • All timing uses Metal's gpuEndTime - gpuStartTime (GPU-side timestamps, not wall clock)
  • 3 warmup iterations, 10 measurement iterations per data point; median reported
  • Shaders compiled at runtime from MSL source with fastMathEnabled = true
  • Optimized sweep and compute peak tests built and run in release mode (swift build -c release)
  • Buffer allocation uses .storageModeShared (unified memory, no CPU→GPU copies)
  • Sweep uses 128 MB buffers; compute peak tests use 4 MB (fits in SLC)
  • Thread group size set to pipeline.maxTotalThreadsPerThreadgroup (typically 1024 on M5)

Architecture References

The claims about Apple's GPU execution model draw on these sources:

  • Dougall Johnson, Apple G13 GPU ISA reference - Reverse-engineered ISA documentation confirming scalar registers ("each store one 32-bit value per thread"), scalar fmadd instruction, and no native vector instructions. The most detailed public documentation of Apple's GPU instruction set.
  • Alyssa Rosenzweig, Dissecting the Apple M1 GPU, Part I - Lead of the Asahi Linux GPU driver effort. Confirms "the M1's GPU is scalar at all bit sizes" and notes the hardware is "superscalar, with more 16-bit ALUs than 32-bit ALUs."

Apple does not publish GPU clock speeds, ALU counts per core, or detailed microarchitecture documents for the M-series. The 128 ALU and ~1.47 GHz figures in this analysis are derived from published TFLOPS numbers and our benchmark measurements. The scalar ISA claims are supported by independent reverse engineering, not Apple's own documentation.
