Ryzen 7 8745HS (Zen 4 APU, Phoenix) - Part II: 780M iGPU Roofline & Bandwidth Analysis

System: AceMagic W1 mini PC, AMI BIOS PHXPM7B0, 32GB DDR5-5600, CachyOS

iGPU: AMD Radeon 780M (12 CUs, RDNA 3)

This is the in-depth analysis companion to Part I, which covers power/thermal tuning and benchmark results. Part II builds on that data to explain why the 780M iGPU behaves the way it does, using roofline modeling, hardware performance counter profiling, and cross-GPU comparison.

The pattern from Part I: Every GPU benchmark showed the same behavior. FurMark FPS was flat from 25-54W. RE6 scores plateaued above 40W. Geekbench compute subtests were flat from 54W down to 25W. clpeak measured 73 GB/s memory bandwidth regardless of power, while FP32 compute scaled from 3,000 to 4,700 GFLOPS. The 780M clearly had far more compute than it could use. This section quantifies why.

The Short Version

The 780M's GPU cores can crunch numbers fast - nearly as fast per core as a discrete gaming GPU. The problem is they can't be fed data fast enough. The GPU shares the system's DDR5 memory with the CPU, and that memory can only deliver about 73 GB/s. To put that in perspective: the GPU can process pixels 4x faster than the memory can supply them, and read textures 6x faster than DRAM can deliver them, because small caches close to the GPU absorb most of the traffic. When those caches work well (which is most of the time in normal gaming), performance is solid. When they don't - scattered texture access, large working sets - performance falls off a cliff.

What this means in practice:

  • Cranking up the power limit doesn't help much for gaming. The GPU hits the memory bandwidth wall regardless. Going from 35W to 54W gains less than 10% in most games while increasing heat and fan noise significantly. The sweet spot is 35-45W.
  • The 760M (8 cores) is nearly as fast as the 780M (12 cores) for gaming. Both are bottlenecked by the same shared memory. The extra 4 cores in the 780M rarely get to do useful work because they're waiting on data. The 780M's advantage is mainly in compute workloads like video encoding or ML inference.
  • Game performance depends more on the engine than the hardware. A well-optimized engine that keeps its texture working set small and accesses memory efficiently can run 100-200x faster through the texture pipeline than a poorly optimized one. Techniques like texture compression (BC7/ASTC), proper mip mapping, and texture atlas packing matter far more on iGPUs than on discrete GPUs with large caches.
  • FSR and resolution scaling help for the right reason. On this iGPU, upscaling isn't compensating for weak GPU cores - it's reducing the amount of data the GPU needs to read and write per frame, working around the memory bottleneck.
  • Faster memory would help more than a faster GPU. LPDDR5X-7500 (found in some newer laptops) would boost bandwidth by ~34%, which would translate almost directly into gaming performance gains. More GPU cores at the same memory speed would not.

The rest of this article is the technical deep dive that produces these conclusions, using custom Vulkan benchmarks, hardware performance counter profiling, and cross-GPU comparison.

The Roofline Model

Every GPU workload has two fundamental resource needs: memory bandwidth (to load/store data) and compute throughput (to do math on that data). The roofline model captures the relationship between these two limits in a single picture.

The key concept is arithmetic intensity (AI): the ratio of compute work to memory traffic, measured in FLOPS per byte (F/B). A workload that loads 32 bytes and performs 8 floating-point operations has AI = 0.25 F/B. A workload that loads the same 32 bytes but churns through 1024 operations before writing back has AI = 32 F/B.
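In code, AI is just a ratio; a minimal helper reproduces the two examples above:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# 32 bytes loaded, 8 floating-point operations:
print(arithmetic_intensity(8, 32))     # 0.25 F/B
# same 32 bytes, 1024 operations before writing back:
print(arithmetic_intensity(1024, 32))  # 32.0 F/B
```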

At low AI, the GPU spends most of its time waiting for data from memory. Throughput is limited by how fast memory can deliver bytes, regardless of how much compute power is available - the workload is bandwidth-bound. At high AI, there's enough data in flight to keep the compute units busy. Throughput is limited by how fast the ALUs can crunch numbers - the workload is compute-bound.

The ridge point is the AI value where these two limits intersect. Below it, more compute power doesn't help (the GPU is starving for data). Above it, more memory bandwidth doesn't help (the ALUs are the bottleneck). The ridge point is simply: peak GFLOPS / peak GB/s.

What makes this interesting for power tuning: raising the power limit increases the compute ceiling (higher GPU clocks = more GFLOPS) but cannot change the bandwidth ceiling (DDR5-5600 speed is fixed). This shifts the ridge point rightward, meaning at higher power, more workloads fall into the bandwidth-bound regime where additional power is wasted.
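The whole model fits in two functions. A minimal sketch, plugging in the 780M ceilings measured in the sweep below (~73 GB/s bandwidth at every power level; ~450 GFLOPS of compute at 15W vs ~1180 GFLOPS at 54W):

```python
def attainable_gflops(ai, peak_gflops, bw_gbps):
    """Roofline: throughput is the lesser of the compute ceiling
    and the bandwidth ceiling scaled by arithmetic intensity."""
    return min(peak_gflops, bw_gbps * ai)

def ridge_point(peak_gflops, bw_gbps):
    """The AI value where the two ceilings intersect."""
    return peak_gflops / bw_gbps

print(round(ridge_point(450, 73), 1))   # ~6.2 F/B at 15W
print(round(ridge_point(1180, 73), 1))  # ~16.2 F/B at 54W
print(attainable_gflops(4, 1180, 73))   # 292.0: bandwidth-bound at AI=4 even at 54W
```

Raising power only raises `peak_gflops`, so the ridge point moves right and more workloads land on the fixed bandwidth roof.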

Vulkan Roofline Power Sweep (vk-roofline)

Benchmark Design

Custom Vulkan compute kernels sweeping AI from 0.25 to 128 F/B at each package power level. Each kernel: load vec4 -> N FMA iterations -> store vec4 (32 bytes/thread, varying compute). 200 measurement iterations per data point for stable medians.

Roofline Sweep (GFLOPS at each AI point)

AI (F/B)    15W    20W    25W    30W    35W    45W    54W    60W    65W
0.25         18     18     18     18     18     18     18     18     18
0.5          36     37     37     36     36     36     37     36     36
1            72     72     74     71     73     74     72     74     74
2           145    147    147    143    146    147    147    147    148
4           277    287    293    291    292    294    295    294    295
8           345    489    592    578    593    586    588    592    588
16          399    575    772    877    961   1037   1069   1072   1072
32          443    682    804    904   1028   1135   1181   1180   1180
64          435    525    631    678    764    798    800    799    802
128         449    594    666    709    770    800    799    801    801

The AI = 8-32 rows are the transition zone where workloads shift from bandwidth-bound to compute-bound.

At AI ≤ 4, all power levels produce identical throughput: GFLOPS = bandwidth × AI ≈ 73 × AI. The GPU has enough compute even at 15W to saturate the memory bus at these low arithmetic intensities.

At AI = 8, the roofline diverges: 15W delivers only 345 GFLOPS (59% of the bandwidth-limited 584 GFLOPS), while 25W+ still tracks the bandwidth ceiling. This is the ridge point at 15W - the power-limited compute ceiling intersects the bandwidth ceiling around AI ≈ 5.

At AI = 32, throughput peaks for all power levels and the full power-scaling picture is visible: 443 GFLOPS (15W) -> 1181 GFLOPS (54W) -> 1180 GFLOPS (65W). At AI = 64 and 128, throughput drops back due to register pressure reducing wave occupancy on RDNA 3.
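A quick sanity check of the bandwidth-bound rows against GFLOPS ≈ 73 × AI (54W column; a sketch, not the benchmark code):

```python
BW = 73.0  # GB/s, the fixed memory ceiling

# (AI, measured GFLOPS) pairs from the 54W column of the sweep
measured = [(0.25, 18), (1, 72), (4, 295), (8, 588)]
for ai, gflops in measured:
    predicted = BW * ai
    err = abs(predicted - gflops) / gflops
    print(f"AI={ai}: predicted {predicted:.1f}, measured {gflops}, error {err:.1%}")
```

Every bandwidth-bound point lands within ~2% of the prediction, which is why the low-AI rows are flat across all nine power levels.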

Summary Table

Power  Copy BW (GB/s)  Peak Sweep (GFLOPS)  FP32 Scalar (GFLOPS)  Ridge AI (F/B)  Peak Die Temp  Peak Fan
15W    72.1              449                2,545                 ~5              43°C           1,590 RPM
20W    72.3              682                3,442                 ~9              45°C           1,658 RPM
25W    73.1              804                3,965                 ~11             48°C           1,815 RPM
30W    68.8              904                4,329                 ~12             51°C           1,978 RPM
35W    72.5            1,028                4,454                 ~14             54°C           2,091 RPM
45W    73.3            1,135                4,700                 ~16             56°C           2,257 RPM
54W    72.2            1,181                4,663                 ~16             57°C           2,321 RPM
60W    73.2            1,180                4,701                 ~16             57°C           2,318 RPM
65W    73.2            1,180                4,710                 ~16             57°C           2,326 RPM

Ridge AI = peak sweep GFLOPS / copy bandwidth. This is the arithmetic intensity above which workloads are compute-bound.
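The ridge column can be reproduced directly from the other two (using the 25W, 35W, and 54W rows):

```python
# (power, copy BW GB/s, peak sweep GFLOPS) rows from the summary table
rows = [("25W", 73.1, 804), ("35W", 72.5, 1028), ("54W", 72.2, 1181)]
for power, bw, peak in rows:
    print(power, "ridge AI ~", round(peak / bw))  # ~11, ~14, ~16 F/B
```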

Key Findings

The roofline tilts with power, but the bandwidth floor is fixed. The memory bandwidth ceiling (~73 GB/s) is set by DDR5-5600 shared memory and cannot be changed by power tuning. Only the compute ceiling moves, from ~450 GFLOPS at 15W to ~1180 GFLOPS at 54W. This shifts the ridge point from AI ≈ 5 to AI ≈ 16, tripling the range of workloads that are bandwidth-bound at higher power.

GPU compute saturates at 45W package power. From 45W through 65W, every metric is identical within measurement noise: peak sweep throughput (1135–1180 GFLOPS), FP32 scalar (4700 GFLOPS), bandwidth (73 GB/s), ridge AI (16 F/B). The 780M's 12 CUs reach max clock and max throughput at ~45W. The extra 20W of headroom from 45->65W goes entirely unused under GPU-only load, confirming the effective GPU power ceiling is ~30W (45W package minus ~15W idle/uncore overhead). This is consistent with the FurMark data showing GPU-only package draw of 29W at 54W limit.

Theoretical peak vs effective roofline peak - a 4x gap. The pure compute test (4MB buffer, 512 FMA iterations per element, AI ≈ 128 F/B) achieves 4,700 GFLOPS. The sweep peak at AI=32 (128MB buffer, realistic memory traffic) achieves only 1,180 GFLOPS. This 4x gap exists because realistic workloads with memory traffic cannot sustain full ALU utilization: memory latency, cache misses, and reduced wave occupancy (from register pressure at high FMA counts) all erode throughput. The 4,700 GFLOPS figure is only reachable when arithmetic intensity is high enough that memory access time is completely hidden by compute - a regime few real workloads occupy.

The roofline is not monotonic at high AI. Throughput peaks at AI=32 (fma_128) then drops 30-35% at AI=64 and AI=128. At very high FMA counts per thread, the RDNA 3 compiler allocates more VGPRs, reducing the number of waves that can execute concurrently. Fewer in-flight waves means less latency hiding, and throughput drops despite the kernel being "pure compute." This occupancy cliff is an important consideration for GPU kernel optimization: packing more arithmetic per memory access has diminishing returns beyond ~32 FLOPS/byte on this architecture.

Validates all prior benchmark conclusions. FurMark (AI < 1, pure BW-bound) is flat above 25W - confirmed by the roofline showing all power levels are identical at AI ≤ 4. Geekbench Edge Detection (BW-bound) is flat from 20-54W - consistent with AI < 4. Geekbench Stereo Matching and Particle Physics (higher AI, compute-bound) scale with power - consistent with the roofline showing divergence at AI > 8. RE6 and BMW plateau at 35-40W - consistent with game shaders having mixed AI in the 4-16 range, straddling the ridge point.

Hardware Busy-Bit Profiling (GRBM_STATUS)

The roofline model predicts which workloads are bandwidth-bound vs compute-bound. To verify this directly at the hardware level, we sampled the AMD GRBM_STATUS registers via umr at ~20 Hz with GFXOFF disabled. These registers expose per-unit busy/idle flags: the memory pipeline (TA -> TCP -> GL1 -> GL2 -> EA, from texture unit through L1/L2 cache to DRAM), the shader dispatch unit (SPI), and the fixed-function graphics pipeline (SC/DB/CB: rasterizer, depth, color).

Unit Low AI (1-2 F/B) High AI (256-512 F/B) Interpretation
GUI_ACTIVE 95% 99% GPU active in both regimes
SPI_BUSY 95% 99% Wave dispatch saturated in both
TA_BUSY 95% 63% Memory request unit: saturated -> partially idle
TCP_BUSY (L1) 93% 61% L1 cache: saturated -> partially idle
GL1CC_BUSY 93% 58% GL1 cache controller: same trend
GL2CC_BUSY 95% 43% L2 cache: saturated -> less than half busy
EA_BUSY (DRAM) 95% 41% DRAM interface: 95% -> 41% - clear shift

At low AI (bandwidth-bound), the entire memory pipeline from L1 through DRAM runs at 93-95% busy - the GPU is bottlenecked waiting for data at every level of the cache hierarchy. The compute units (SPI) are also 95% busy because they're constantly issuing memory requests and stalling on the results.

At high AI (compute-bound), the memory path drops to 41-63% utilization while SPI stays at 99%. The ALUs are fully occupied doing arithmetic, only occasionally needing to fetch or store data. The EA (DRAM interface) dropping from 95% to 41% is the most direct measurement of the bandwidth-bound -> compute-bound transition.

This confirms the roofline model's prediction: at low AI, the memory subsystem is the bottleneck (all memory units saturated); at high AI, the compute units are the bottleneck (memory units partially idle). The ~20 Hz sampling rate is coarse but sufficient to see the structural difference. For cycle-accurate measurement, the SQ_PERFCOUNTER and GL2C_PERFCOUNTER programmable counters are available via umr but require explicit event programming.
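Aggregating busy flags from such a capture is simple bookkeeping. The sketch below assumes each sample is a raw 32-bit GRBM_STATUS word; the bit positions are illustrative placeholders, not the real register layout:

```python
# Illustrative bit positions - consult the register spec for the real layout.
BUSY_BITS = {
    "GUI_ACTIVE": 31,
    "SPI_BUSY": 22,
    "TA_BUSY": 14,
}

def busy_percent(samples, bit):
    """Percentage of samples in which the given busy bit was set."""
    hits = sum(1 for s in samples if s & (1 << bit))
    return 100.0 * hits / len(samples)

def summarize(samples):
    """Per-unit busy percentages for one capture."""
    return {name: busy_percent(samples, bit) for name, bit in BUSY_BITS.items()}
```

At ~20 Hz, a 60-second capture yields ~1200 samples - enough to resolve the saturated-vs-idle contrast in the table above, but nothing per-draw.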

Real Game Validation: Resident Evil 6

To confirm this pattern holds for real rendering workloads, we profiled the RE6 built-in benchmark (DX9, 1080p, Proton) at 54W and 25W:

Unit RE6 @ 54W RE6 @ 25W Delta
GUI_ACTIVE 99% 97% -2
SPI_BUSY 97% 95% -2
TA_BUSY 83% 83% 0
SC_BUSY 94% 93% -1
DB_BUSY 93% 91% -2
CB_BUSY 74% 74% 0
TCP_BUSY (L1) 84% 83% -1
GL1CC_BUSY 94% 91% -3
GL2CC_BUSY 95% 93% -2
EA_BUSY (DRAM) 93% 89% -4

54W: 2270 samples (~113s, two benchmark passes). 25W: 1416 samples (~70s, one pass - slower FPS = longer runtime).

The profiles are nearly identical despite a significant difference in available compute power. Every hardware unit shows the same utilization within a few percent. At 25W the GPU runs at ~1900 MHz with ~68% of the 54W throughput, yet the memory pipeline (EA 89-93%, GL2CC 93-95%, GL1CC 91-94%) is saturated at both power levels. The extra compute headroom at 54W simply sits idle waiting for data.

Compared to the synthetic benchmarks: RE6 shows the fixed-function rasterizer pipeline active (SC 93-94%, DB 91-93%) as expected for a graphics workload, while CB (color writes) is lower at 74% - consistent with early-Z depth rejection discarding fragments before they reach the color stage. The memory path utilization (83-95%) matches the low-AI synthetic profile (93-95%), confirming RE6 operates firmly in the bandwidth-bound regime.

This is the hardware-level explanation for RE6's benchmark scores: 10,280 at 54W vs 8,842 at 25W (-14%). The 14% performance loss doesn't come from compute starvation - it comes from the 25W GPU clock being too low to fully saturate the memory bus in some phases, as evidenced by EA_BUSY dropping slightly from 93% to 89%. In the roofline model, RE6 sits at approximately AI 2-4, well below the ridge point at either power level.

iGPU Bandwidth Wall

The roofline and GRBM data quantify what the benchmarks showed: the 780M delivers 4,700 GFLOPS of FP32 compute but only 73 GB/s of memory bandwidth, giving a ridge point at AI ≈ 16 F/B. A workload needs to perform 16 floating-point operations for every byte it fetches from memory before the compute units become the bottleneck. Most rasterization workloads fall well below this threshold - fragment shaders typically fetch textures, do a few multiplies and adds, and write the result, landing at AI 1-4. The GPU's compute capacity is dramatically overprovisioned relative to its memory bandwidth.
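A back-of-envelope AI estimate for a typical fragment shader shows how far below the ridge rasterization sits (the byte and FLOP counts are illustrative, not measured):

```python
def shader_ai(flops, bytes_read, bytes_written):
    """Arithmetic intensity of a shader invocation, in FLOPs/byte."""
    return flops / (bytes_read + bytes_written)

# Illustrative fragment: 2 RGBA8 texture reads (8 B), ~16 FLOPs of
# blend/lighting math, one RGBA8 color write (4 B).
ai = shader_ai(flops=16, bytes_read=8, bytes_written=4)
print(round(ai, 2))  # ~1.33 F/B - an order of magnitude below the 16 F/B ridge
```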

But the 780M is not a weak GPU - it's a bandwidth-starved one. The 12 RDNA 3 CUs are genuinely capable compute engines running at nearly the same clock speed as discrete parts (2600 MHz observed, 2800 MHz max boost). The problem is the memory subsystem. Compare it to the discrete RX 7600, which shares the same RDNA 3 architecture at a similar 2655 MHz boost clock:

Spec                     740M (iGPU)         760M (iGPU)         780M (iGPU)         RX 7600 (discrete)
Compute Units            4                   8                   12                  32
Boost clock              2800 MHz            2600 MHz            2800 MHz            2655 MHz
FP32 peak (theoretical)  ~1,690 GFLOPS*      ~3,100 GFLOPS*      ~4,700 GFLOPS       ~12,200 GFLOPS
Effective roofline peak  ~420 GFLOPS*        ~790 GFLOPS*        ~1,180 GFLOPS       -
Memory                   DDR5-5600 (shared)  DDR5-5600 (shared)  DDR5-5600 (shared)  GDDR6 128-bit
Memory bandwidth         73 GB/s             73 GB/s             73 GB/s             288 GB/s
Last-level cache         2 MB L2**           2 MB L2             2 MB L2             2 MB L2 + 32 MB Infinity Cache
Ridge point (AI)         ~6 F/B*             ~11 F/B*            ~16 F/B             ~4 F/B

*iGPU values interpolated from 780M measurements, scaling by CU count and clock ratio. All four GPUs share the same RDNA 3 architecture; the 740M and 760M are the same silicon with fewer CUs enabled.

**The 740M is a Phoenix 2 variant (GFX1103_R2) which may have a smaller L2 cache; the exact size is unconfirmed.

All four GPUs run the same RDNA 3 architecture at nearly the same clock speed (~2600-2800 MHz). The performance differences come almost entirely from two factors: CU count (how much compute) and memory subsystem (how fast data arrives).

The RX 7600's 32 MB Infinity Cache acts as a bandwidth amplifier: any texture, render target, or buffer that fits in cache is served at internal bandwidth far exceeding the 288 GB/s GDDR6 rate, pushing its effective ridge point even lower than the ~4 F/B calculated from raw DRAM bandwidth. None of the iGPUs have an equivalent. Their 2 MB GPU L2 is the only cache between the compute units and DDR5, and the iGPU cannot access the CPU's 16 MB L3 (on AMD APUs, the L3 is exclusive to the CPU core complex). Every working set larger than 2 MB hits DDR5 at 73 GB/s with no intermediate cache tier to absorb the traffic. This makes the iGPUs' effective bandwidth disadvantage closer to 6-8x for typical game workloads where textures and framebuffers dwarf 2 MB.

Per-iGPU Analysis

The iGPU comparison across CU counts is instructive because all three share the same 73 GB/s memory bus. For any bandwidth-bound workload, they perform identically - additional CUs sit idle, waiting on memory. The differences only emerge above each GPU's ridge point:

  • 740M (ridge AI ≈ 6): Becomes compute-bound for workloads above AI 6 - a threshold low enough that some game rendering actually crosses it. Bandwidth-efficient engines with compressed render targets and good culling (AI < 6) will run at the same FPS as on a 780M. But engines with moderate shader complexity or post-processing chains will hit the 740M's compute ceiling, losing up to 65% throughput versus the 780M.
  • 760M (ridge AI ≈ 11): The sweet spot for DDR5-5600 gaming. Most game rendering falls below AI 11, so the 760M performs nearly identically to the 780M. The 780M only wins for workloads between AI 11-16 - a narrow band. In practice, the 760M delivers ~95% of the 780M's gaming performance on the same memory, making the 4 extra CUs poor value for gaming-only use cases.
  • 780M (ridge AI ≈ 16): The extra CU headroom only matters for compute-heavy tasks (ML inference, video encoding, physics simulations) or the small number of rendering passes that exceed AI 11. For pure gaming on DDR5-5600, the 780M is overprovisioned - its 12 CUs are underutilized for most of the frame.

Engine Efficiency and Real-World Impact

This means game performance on the 780M depends heavily on how efficiently the engine uses memory bandwidth. Techniques that reduce bandwidth pressure - such as tile-based rendering, compressed render targets, visibility buffer rendering, mesh shaders that cull before shading, and texture compression formats like BC7/ASTC - effectively shift a game's arithmetic intensity upward, moving the workload away from the bandwidth ceiling and toward the compute ceiling where the 780M has headroom to spare. Conversely, engines with naive full-screen passes, uncompressed framebuffers, or excessive overdraw will saturate the 73 GB/s wall regardless of how much power or clock speed is available.

This explains the wide variance in real-world game performance on iGPUs. Two games at the same resolution and visual complexity can perform very differently - not because of shader complexity, but because one engine is bandwidth-efficient and the other is not. FSR and other upscaling technologies help precisely because they reduce the resolution of the most bandwidth-heavy passes (shading, post-processing) while presenting a higher-resolution final image. On the 780M, FSR is less about compensating for weak compute and more about working around the memory bandwidth bottleneck.
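Two quick calculations make the bandwidth framing concrete (the 1/1.5 render scale corresponds to FSR's Quality mode; treat both numbers as rough estimates):

```python
def frame_traffic_budget_gb(bw_gbps, fps):
    """DRAM traffic available per frame at a target frame rate."""
    return bw_gbps / fps

def upscale_traffic_ratio(render_scale):
    """Shading/post traffic scales with pixel count, i.e. render_scale^2."""
    return render_scale ** 2

print(round(frame_traffic_budget_gb(73, 60), 2))  # ~1.22 GB of traffic per 60 fps frame
print(round(upscale_traffic_ratio(1 / 1.5), 2))   # FSR Quality: ~0.44x native shading traffic
```

A 60 fps target leaves about 1.2 GB of total DRAM traffic per frame for textures, framebuffers, and the CPU combined; cutting shaded-pass traffic roughly in half is a large share of that budget.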

Practical Implications

For most users, including gamers, this means the iGPU can run at a fraction of the rated power budget and still deliver close to full performance. At 35W package power the GPU retains over 90% of its 54W gaming performance while running significantly cooler and quieter. The power savings are essentially free: the extra wattage at higher budgets just heats the chip while the CUs idle-wait on memory fetches.

This mirrors what cryptocurrency miners discovered during the Ethereum era. Ethash was memory-bandwidth bound, so miners would power-limit and undervolt the GPU core while overclocking VRAM as aggressively as possible, achieving up to 30% higher hashrates with 30% less core power. The same principle applies here: when the workload is bottlenecked by memory bandwidth, throwing more compute power at it is pure waste heat.

Fill Rate Roofline (Graphics Pipeline)

The roofline measurements above used compute shaders exclusively - the GPU's shader cores loaded and stored data through the memory hierarchy, but the fixed-function graphics pipeline (rasterizer, ROPs, color compression) was never exercised. Fill rate testing completes the picture by measuring fragment processing throughput: how fast can the GPU rasterize, shade, and write pixels?

Benchmark Design

A fullscreen triangle rendered to an offscreen R8G8B8A8_UNORM render target (1920x1080, 128x overdraw per pass = 265M pixels/pass). The fragment shader uses the same FMA chain pattern as the compute sweep, controlled by a specialization constant FMA_PER_PIXEL:

  • FMA=0: Fragment shader outputs gl_FragCoord sums with no FMA loop - measures raw ROP/rasterizer throughput
  • FMA=1..512: Increasing ALU work per pixel, same 4 inter-dependent accumulator chains as the compute sweep
  • FLOP counting: fma_per_pixel * 4 chains * 2 FLOPs/FMA = fma_per_pixel * 8 per pixel
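The FLOP accounting can be checked against the first rows of the results table:

```python
def fragment_gflops(gpixels_per_sec, fma_per_pixel, chains=4):
    """FLOP throughput: each FMA is 2 FLOPs, across `chains`
    independent accumulator chains per pixel."""
    return gpixels_per_sec * fma_per_pixel * chains * 2

print(round(fragment_gflops(20.8, 1), 1))  # 166.4 -> FMA=1 row (166 GFLOPS)
print(round(fragment_gflops(16.9, 8), 1))  # 1081.6 -> FMA=8 row (1,083 GFLOPS)
```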

Results (RGBA8, 4 bytes/pixel)

FMA/pixel GPixels/sec GFLOPS Median ms Implied write BW
0 20.8 0 12.78 83.2 GB/s
1 20.8 166 12.77 83.1 GB/s
2 20.7 332 12.80 82.9 GB/s
4 20.4 654 12.98 81.8 GB/s
8 16.9 1,083 15.69 67.7 GB/s
16 9.7 1,245 27.28 38.9 GB/s
32 5.2 1,340 50.70 20.9 GB/s
64 2.7 1,397 97.26 10.9 GB/s
128 1.4 1,422 191.17 5.6 GB/s
256 0.4 810 671.35 1.6 GB/s
512 0.2 818 1,329 0.8 GB/s

The fill rate is bandwidth-limited, not ROP-limited

The official AMD spec for the 780M lists a fill rate of 86.4 GPixels/sec (32 ROPs x 2700 MHz). We measured 20.8 GPixels/sec - 24% of spec. This is not a measurement error.

The spec rate is the ROP processing rate: how fast the fixed-function color output units can blend and format pixels internally. But each pixel still has to be written to memory. At 4 bytes per pixel (RGBA8), the spec rate would require 86.4 x 4 = 345.6 GB/s of write bandwidth. The DDR5-5600 bus provides only 73 GB/s.

The implied write bandwidth at FMA=0 is 83.2 GB/s - above the 73 GB/s measured by STREAM copy. The ~14% excess comes from AMD's Delta Color Compression (DCC): the render target is allocated with hardware compression metadata, and the gl_FragCoord-seeded pixel values are smooth enough for DCC to reduce effective write traffic. This means our fill rate measurement is actually hitting the compressed DRAM write ceiling, not even the raw bandwidth ceiling.

On this iGPU, the ROPs are never the bottleneck. The 32 ROPs can process 86.4 billion pixels per second, but the memory bus can only absorb ~20 billion pixels per second (with DCC). The ROPs are 4x overprovisioned relative to memory bandwidth - the same pattern we saw with compute, where 4,700 GFLOPS of FP32 throughput faces only 73 GB/s of bandwidth.
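The implied-bandwidth arithmetic in this section reduces to a one-liner:

```python
def implied_write_bw_gbps(gpixels_per_sec, bytes_per_pixel):
    """Write bandwidth required to sustain a given pixel rate."""
    return gpixels_per_sec * bytes_per_pixel

spec_rate = 32 * 2.7  # 32 ROPs x 2.7 GHz = 86.4 GPixels/s
print(round(implied_write_bw_gbps(spec_rate, 4), 1))  # 345.6 GB/s needed at spec (RGBA8)
print(round(implied_write_bw_gbps(20.8, 4), 1))       # 83.2 GB/s at the measured rate
print(round(83.2 / 73, 2))                            # ~1.14x raw DRAM: the DCC savings
```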

Isolating the bandwidth bottleneck: R8 vs RGBA8

To confirm that the FMA=0 result is bandwidth-limited and not ROP-limited, we switched the render target from R8G8B8A8_UNORM (4 bytes/pixel) to R8_UNORM (1 byte/pixel), cutting write bandwidth demand by 4x while keeping the same pixel count (1920x1080, 128x overdraw).

FMA/pixel RGBA8 GPixels/sec R8 GPixels/sec Speedup R8 implied write BW
0 20.8 38.3 1.8x 38.3 GB/s
1 20.8 60.9 2.9x 60.9 GB/s
2 20.7 63.6 3.1x 63.6 GB/s
4 20.4 63.4 3.1x 63.4 GB/s
8 16.9 61.7 3.6x 61.7 GB/s
16 9.7 36.3 3.7x 36.3 GB/s
32 5.2 19.0 3.6x 19.0 GB/s
64 2.7 9.3 3.4x 9.3 GB/s

Metric                        RGBA8            R8
Peak GPixels/sec              20.8 (FMA=0)     63.6 (FMA=2)
Peak GFLOPS                   1,422 (FMA=128)  5,076 (FMA=512)
% of spec (86.4 GPixels/sec)  24%              74%

The R8 results reveal three regimes that RGBA8 obscured:

1. Fixed-function ceiling (FMA=0): 38.3 GPixels/sec. With 1 byte/pixel, write bandwidth is only 38.3 GB/s - just 52% of DRAM capacity. The bottleneck has shifted from DRAM bandwidth to the rasterizer/ROP fixed-function pipeline. This is the first time any of our benchmarks has isolated a non-memory, non-ALU limit.

2. Pipeline sweet spot (FMA=1-4): 60-64 GPixels/sec. Adding a small amount of ALU work increases throughput by 1.7x over the null shader. This is a well-known GPU effect: when shaders retire too fast, the fragment pipeline has insufficient occupancy to hide fixed-function latency (rasterizer tile dispatch, ROP format conversion, DCC metadata updates). A few FMAs per pixel provide just enough in-flight work for the scheduler to keep the pipeline saturated. The peak of 63.6 GPixels/sec at FMA=2 represents 74% of the theoretical 86.4 GPixels/sec spec - the remaining 26% gap is accounted for by rasterizer overhead and DCC metadata traffic.

3. ALU-bound ceiling reaches true FP32 peak. With R8's minimal write bandwidth, fragment ALU throughput is no longer suppressed by memory contention. Peak GFLOPS reaches 5,076 - matching the ~4,700 GFLOPS FP32 compute peak. The RGBA8 fragment peak of 1,422 GFLOPS was 3.6x lower because every pixel write consumed memory bandwidth that competed with shader execution.

Graphics vs compute roofline comparison

With both RGBA8 and R8 data, the fill rate sweep reveals how the fixed-function graphics pipeline compares to pure compute:

  • Bandwidth-bound floor (RGBA8 FMA 0-4): 20.8 GPixels/sec = 83 GB/s write BW, limited by DRAM bandwidth through DCC. The equivalent in the compute roofline is the flat bandwidth ceiling at ~73 GB/s for AI <= 4.
  • Fixed-function floor (R8 FMA=0): 38.3 GPixels/sec = 38 GB/s, limited by rasterizer/ROP throughput. This ceiling has no equivalent in the compute pipeline - it's unique to the graphics fixed-function path.
  • Transition zone (FMA 8-16): In both formats, throughput drops as fragment ALU work dominates. The "ridge point" is around FMA=8 per pixel, consistent with the compute ridge at AI=8-16.
  • ALU-bound ceiling: RGBA8 peaks at 1,422 GFLOPS (memory contention limits ALU throughput); R8 peaks at 5,076 GFLOPS (matching compute peak, proving memory contention was the limiter).
  • Occupancy cliff (FMA 256-512): Both formats show GFLOPS dropping at very high FMA counts due to VGPR pressure reducing wave occupancy - the same effect seen in the compute sweep at AI=64-128.

The fragment pipeline adds fixed-function overhead (rasterizer setup, ROP formatting, DCC compression) but these are pipelined and don't significantly reduce peak throughput compared to compute. The dominant architectural bottleneck - DRAM bandwidth - is shared between both pipelines, and the R8 experiment proves it: remove the bandwidth pressure and fragment shaders reach the same FP32 ceiling as compute.

Textured Fill Rate (Cache Hierarchy Profiling)

The fill rate measurements above exercised the ROPs and ALUs but never touched the texture units - every fragment shader used gl_FragCoord arithmetic with no texture fetches. Real game rendering is dominated by texture sampling, which exercises the texture address unit (TA), texture cache (TCP/L1), and the shared cache hierarchy (GL1 -> L2 -> DRAM). GRBM profiling of RE6 showed TA_BUSY at 83%, confirming texture sampling is a primary workload for real games. This benchmark directly measures the texture sampling pipeline.

Benchmark Design

The same fullscreen triangle setup as the fill rate test (1920x1080, RGBA8 render target), but the fragment shader samples a sampler2D texture using textureLod(..., 0.0) to force the base mip level. A specialization constant TEX_PER_PIXEL controls how many texture fetches each fragment performs (1–16).

The sweep has three axes:

  • Texture size - square RGBA8 textures from 64x64 (16 KB) to 4096x4096 (64 MB), targeting different levels of the cache hierarchy:
  • 64x64 = 16 KB -> fits texture L1 (16 KB/CU)
  • 256x256 = 256 KB -> fits GL1 (256 KB shared)
  • 1024x1024 = 4 MB -> exceeds L2 (2 MB)
  • 2048x2048 = 16 MB -> pure DRAM
  • 4096x4096 = 64 MB -> pure DRAM, large working set
  • Fetches per pixel - 1, 2, 4, 8, 16 texture samples per fragment
  • UV access pattern - coherent (tiling) vs random (hash-scattered)

Two UV modes test opposite ends of the spatial locality spectrum:

  • Coherent: gl_FragCoord / textureSize with REPEAT wrap - adjacent pixels sample adjacent texels, maximizing intra-wavefront cache reuse. This is the best case, representative of normal texture mapping.
  • Random: PCG hash of pixel coordinates generates uniformly distributed UVs - adjacent pixels scatter across the entire texture, destroying spatial locality. This is the worst case, representative of random-access patterns like indirection textures, volumetric ray marching, or bindless texture arrays with scattered lookups.

Texture data is filled with xorshift PRNG noise to defeat Delta Color Compression (DCC), ensuring worst-case bandwidth. Bilinear filtering (VK_FILTER_LINEAR) is used throughout. Coherent tests use 128x overdraw; random tests use 8x to avoid GPU timeout on the slowest cases.
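For reference, here is one common PCG-style hash (illustrative constants; the benchmark's exact hash may differ) and how it turns a pixel coordinate into scattered UVs:

```python
M32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound

def pcg_hash(x: int) -> int:
    """A common 32-bit PCG-style hash; decorrelates nearby inputs.
    Constants are one popular formulation, not necessarily the
    benchmark's exact choice."""
    state = (x * 747796405 + 2891336453) & M32
    word = (((state >> ((state >> 28) + 4)) ^ state) * 277803737) & M32
    return ((word >> 22) ^ word) & M32

def random_uv(px: int, py: int, width: int):
    """Scatter a pixel coordinate to a uniform-ish UV pair in [0, 1)."""
    h = pcg_hash(py * width + px)
    return (h & 0xFFFF) / 65536.0, (h >> 16) / 65536.0
```

Adjacent fragments (px, px+1) land on unrelated texels, which is exactly what destroys the intra-wavefront cache-line reuse the coherent tests rely on.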

Results: Coherent UVs (128x overdraw)

Texture Size Fetch/px GTexels/sec TexBW GB/s Median ms
64x64 (16 KB, L1)
1 41.0 164.0 6.47
2 83.1 332.3 6.39
4 115.4 461.8 9.20
8 117.1 468.5 18.13
16 123.0 491.8 34.54
256x256 (256 KB, GL1)
1 41.6 166.4 6.38
2 82.1 328.6 6.46
4 114.2 456.6 9.30
8 117.5 470.1 18.07
16 120.9 483.7 35.12
1024x1024 (4 MB, >L2)
1 41.1 164.4 6.46
2 79.9 319.6 6.64
4 111.9 447.5 9.49
8 111.7 446.9 19.01
16 104.2 416.7 40.76
2048x2048 (16 MB, DRAM)
1 40.4 161.7 6.57
2 77.3 309.1 6.87
4 109.0 436.0 9.74
8 105.7 422.7 20.10
16 106.4 425.6 39.91
4096x4096 (64 MB, DRAM)
1 34.4 137.7 7.71
2 75.0 300.0 7.08
4 96.8 387.2 10.97
8 102.7 410.9 20.67
16 99.9 399.4 42.53

Hitting the texture unit ceiling: 95% of theoretical peak

AMD specifies the 780M's texture rate at 129.6 GTexels/sec (12 CUs x 4 TMUs/CU x 2700 MHz). Our peak measurement of 123.0 GTexels/sec (64x64, fetch=16) reaches 94.9% of this theoretical maximum. The remaining ~5% is attributable to pipeline bubbles between the 128 draw calls per pass and render pass begin/end overhead.

At fetch=1, throughput is only 41 GTexels/sec (32% of peak) - the shader is too cheap to saturate the texture units. The single-fetch fragment retires so quickly that the rasterizer and ROP overhead dominate, similar to the R8 FMA=0 case in the fill rate tests. By fetch=4, we reach ~89% of peak, and fetch=8-16 saturates at 93-95%. The texture units need at least 4 fetches per pixel to stay fully occupied on this GPU.

Coherent cache hierarchy: present but modest

With coherent UVs, the expected cache hierarchy cliffs are surprisingly mild:

Texture Size Peak GTexels/sec (fetch=16) % of L1-resident peak
64x64 (16 KB, L1) 123.0 100%
256x256 (256 KB, GL1) 120.9 98%
1024x1024 (4 MB, >L2) 104.2 85%
2048x2048 (16 MB, DRAM) 106.4 87%
4096x4096 (64 MB, DRAM) 99.9 81%

Even the 64 MB texture only drops 19% from the L1-resident peak. Adjacent screen pixels sample adjacent texels, so even "DRAM-resident" textures benefit from cache line reuse within wavefronts. The peak effective texture bandwidth of 492 GB/s (L1-resident) is 6.7x the DRAM bandwidth (73 GB/s). Even the DRAM-bound 4096x4096 case achieves 399 GB/s - 5.5x DRAM - because the caches absorb most of the spatial redundancy before requests reach main memory.
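The amplification figures follow from texel rate × 4 bytes per RGBA8 texel:

```python
DRAM_BW = 73.0  # GB/s, measured copy bandwidth

def effective_tex_bw_gbps(gtexels_per_sec, bytes_per_texel=4):
    """Effective bandwidth delivered by the texture pipeline (RGBA8 = 4 B/texel)."""
    return gtexels_per_sec * bytes_per_texel

print(round(effective_tex_bw_gbps(123.0) / DRAM_BW, 1))  # L1-resident: ~6.7x DRAM
print(round(effective_tex_bw_gbps(99.9) / DRAM_BW, 1))   # 64 MB coherent: ~5.5x DRAM
```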

This is the best-case scenario that real games approximate with normal UV-mapped textures. But how much of this performance depends on spatial locality?

Results: Random UVs (8x overdraw)

Replacing gl_FragCoord / textureSize with a PCG hash destroys intra-wavefront spatial locality - adjacent fragments now scatter across the entire texture. The results are dramatically different:

| Texture size | Fetch/px | GTexels/sec | TexBW (GB/s) | Median ms |
|---|---|---|---|---|
| 64x64 (16 KB, L1) | 1 | 44.0 | 176.1 | 0.38 |
| | 2 | 43.0 | 172.0 | 0.77 |
| | 4 | 44.4 | 177.8 | 1.49 |
| | 8 | 45.3 | 181.3 | 2.93 |
| | 16 | 45.8 | 183.0 | 5.80 |
| 256x256 (256 KB, GL1) | 1 | 12.8 | 51.1 | 1.30 |
| | 2 | 15.1 | 60.5 | 2.19 |
| | 4 | 16.2 | 64.6 | 4.11 |
| | 8 | 18.2 | 72.9 | 7.28 |
| | 16 | 18.7 | 74.6 | 14.22 |
| 1024x1024 (4 MB, >L2) | 1 | 2.06 | 8.22 | 8.07 |
| | 2 | 1.04 | 4.16 | 31.92 |
| | 4 | 0.93 | 3.71 | 71.52 |
| | 8 | 0.87 | 3.48 | 152.69 |
| | 16 | 0.86 | 3.45 | 307.86 |
| 2048x2048 (16 MB, DRAM) | 1 | 0.80 | 3.20 | 20.75 |
| | 2 | 0.58 | 2.32 | 57.08 |
| | 4 | 0.53 | 2.13 | 124.43 |
| | 8 | 0.50 | 2.00 | 264.94 |
| | 16 | 0.49 | 1.94 | 546.55 |
| 4096x4096 (64 MB, DRAM) | 1 | 0.67 | 2.66 | 24.92 |
| | 2 | 0.54 | 2.18 | 60.96 |
| | 4 | 0.48 | 1.91 | 139.21 |
| | 8 | 0.44 | 1.76 | 301.53 |
| | 16 | 0.44 | 1.76 | 603.14 |

The cache hierarchy exposed

With random UVs, the cache hierarchy cliffs that coherent access completely masked become the dominant performance feature:

| Texture size | Coherent (fetch=16) | Random (fetch=16) | Ratio | Bottleneck |
|---|---|---|---|---|
| 64x64 (16 KB, L1) | 123.0 GTexels/s | 45.8 GTexels/s | 0.37x | TMU throughput (L1 still hit) |
| 256x256 (256 KB, GL1) | 120.9 GTexels/s | 18.7 GTexels/s | 0.15x | DRAM bandwidth (74.6 GB/s) |
| 1024x1024 (4 MB, >L2) | 104.2 GTexels/s | 0.86 GTexels/s | 0.008x | Cache line waste + TLB |
| 2048x2048 (16 MB, DRAM) | 106.4 GTexels/s | 0.49 GTexels/s | 0.005x | TLB thrashing |
| 4096x4096 (64 MB, DRAM) | 99.9 GTexels/s | 0.44 GTexels/s | 0.004x | TLB thrashing |

Three distinct cliffs emerge:

1. L1 plateau (64x64): 45.8 GTexels/sec. Even with random UVs, the 16 KB texture fits entirely in the per-CU L1 texture cache. Throughput still drops to 37% of the coherent peak (the 0.37x ratio above): the random pattern defeats any prefetching and eliminates spatial locality within a cache line (bilinear filtering normally pulls 4 adjacent texels from the same line, but random UVs scatter the 4 corners across unrelated lines). With all data L1-resident, the texture units sustain ~46 GTexels/sec - about 35% of the theoretical 129.6 GTexels/sec peak.

2. DRAM bandwidth wall (256x256): 18.7 GTexels/sec at 74.6 GB/s. At 256 KB, the texture exceeds L1 but fits in GL1 for coherent access. With random UVs, intra-wavefront scattering defeats GL1 reuse - each of the 64 threads in a wave hits a different cache line, and GL1 cannot absorb the working set fast enough. The effective texture bandwidth of 74.6 GB/s lands right at the system's measured DDR5-5600 copy bandwidth (73 GB/s from our bandwidth test). This is the clean DRAM bandwidth limit: every texel fetch misses the texture cache and goes to main memory at the full DRAM rate.

3. Cache line waste cliff (1024x1024+): 0.86 GTexels/sec at 3.5 GB/s. This is the most dramatic result: throughput collapses to under 1% of the coherent rate, and effective bandwidth drops to just 3.5 GB/s - 20x below the DRAM bandwidth ceiling. The texture is 4 MB, far exceeding the 2 MB L2 cache, so every random fetch misses all cache levels and goes to DRAM. But the delivered bandwidth is only 3.5 GB/s, not 73 GB/s. Three factors explain the 20x gap:

  • Cache line waste: DRAM fetches 64-byte cache lines, but a random bilinear RGBA8 sample uses at most 16 bytes (4 texels x 4 bytes). With random access, the other 48 bytes of each cache line are never used before eviction - 75% of DRAM bandwidth is wasted fetching bytes that are discarded.
  • TLB thrashing: The GPU's translation lookaside buffer maps virtual texture pages to physical memory. Random access across a 4+ MB texture hits many more pages than the TLB can cache, adding page table walk latency to every miss. This is visible in the 2048->4096 transition: throughput drops further (0.49->0.44) despite both textures being equally "DRAM-resident," because the 64 MB texture touches 4x more pages.
  • Memory controller inefficiency: Random 64-byte requests to scattered DRAM addresses prevent the memory controller from using burst transfers and page-open optimizations designed for sequential access.
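
These factors can be turned into a back-of-envelope decomposition of the 20x gap, using only numbers already measured in this section:

```python
# Decompose the ~20x gap between DRAM bandwidth and delivered texture
# bandwidth for random access into the 4 MB texture.
dram_bw = 73.0    # GB/s, measured copy bandwidth
delivered = 3.5   # GB/s, random fetch=16 into the 4 MB texture

line_bytes = 64    # DRAM fetch granularity: one cache line
useful_bytes = 16  # at most 4 texels x 4 bytes per bilinear sample

# If cache line waste were the only cost, delivered bandwidth would be:
waste_limited = dram_bw * useful_bytes / line_bytes  # 73 * 0.25 = 18.25 GB/s

total_gap = dram_bw / delivered        # ~20.9x overall
waste_gap = line_bytes / useful_bytes  # 4x from discarded bytes
residual = waste_limited / delivered   # ~5.2x left over, attributable to
                                       # TLB walks + controller inefficiency
print(f"total {total_gap:.1f}x = {waste_gap:.0f}x line waste * {residual:.1f}x other")
```

Cache line waste alone only accounts for a 4x loss; the remaining ~5x is the combined cost of TLB thrashing and memory controller inefficiency, which the benchmark cannot separate.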

Coherent vs random: what games actually see

Real game rendering falls between these two extremes. Normal UV-mapped geometry has high spatial coherence (adjacent fragments sample adjacent texels), operating near the coherent column. But several common rendering patterns create scattered access:

| Pattern | Locality | Expected regime |
|---|---|---|
| UV-mapped diffuse/albedo | High coherence | ~100-120 GTexels/s |
| Normal mapping, detail textures | Mostly coherent | ~80-100 GTexels/s |
| Shadow map sampling | Medium (depends on projection) | ~20-80 GTexels/s |
| Parallax occlusion mapping | Low (dependent reads) | ~5-20 GTexels/s |
| Volumetric ray marching | Very low (3D scatter) | ~1-5 GTexels/s |
| Bindless/virtual texturing | Random per-draw | ~0.5-5 GTexels/s |

The RE6 GRBM data showing TA_BUSY at 83% is consistent with a mixed workload: mostly coherent diffuse/normal sampling (operating near peak throughput), with some scattered access from shadow maps and post-processing passes pulling the average utilization below 100%. On a discrete GPU with Infinity Cache, the scattered patterns would remain cached in the 32 MB LLC; on the 780M, they fall through the 2 MB L2 and hit the DRAM bandwidth wall at 73 GB/s - or worse, the TLB/cache-line-waste cliff at 2-4 GB/s for truly random access into large textures.

This makes texture atlas packing and mip level selection critical performance levers on the 780M. A well-packed texture atlas that keeps the working set within 2 MB operates at 400+ GB/s effective bandwidth; a poorly packed one that scatters access across 16+ MB of shared memory operates at 2-4 GB/s - a 100-200x performance difference for the same number of texture fetches.
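
As a rough illustration of the mip-selection lever, a sketch (assuming uncompressed RGBA8 at 4 bytes/texel - block compression would shrink these footprints further) of the coarsest-resolution point where a square texture's full working set first fits inside the 780M's 2 MB L2:

```python
L2_BYTES = 2 * 1024 * 1024  # 780M L2 capacity
BPP = 4                     # RGBA8, uncompressed

def mip_bytes(side: int, mip: int) -> int:
    """Footprint of mip level `mip` of a side x side texture."""
    s = max(1, side >> mip)
    return s * s * BPP

def first_l2_resident_mip(side: int) -> int:
    """Smallest mip level whose full footprint fits in L2."""
    mip = 0
    while mip_bytes(side, mip) > L2_BYTES:
        mip += 1
    return mip

for side in (1024, 2048, 4096):
    m = first_l2_resident_mip(side)
    print(f"{side}x{side}: mip {m} ({mip_bytes(side, m) // 1024} KB)")
```

A 4096x4096 texture only becomes L2-resident at mip 3 (512x512, 1 MB), so any access pattern that forces full-resolution sampling across such a texture is guaranteed to operate in the DRAM-bound or cache-line-waste regime.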

Conclusion

The 780M is a capable GPU held back by its memory subsystem. Every layer of analysis converges on the same conclusion:

  • The roofline model puts the ridge point at AI ≈ 16 F/B: a workload must perform at least 16 FLOPs per byte fetched before compute, rather than bandwidth, becomes the limit. Most rendering workloads operate at AI 1-4, deep in the bandwidth-bound region.
  • Hardware performance counters confirm this directly: the DRAM interface (EA_BUSY) runs at 89-95% utilization during both real games and synthetic bandwidth tests, while compute units have headroom to spare.
  • Fill rate testing shows the ROPs are 4x overprovisioned (86.4 GPixels/sec spec vs 20.8 measured), bottlenecked entirely by DRAM write bandwidth.
  • Texture sampling reaches 95% of the 129.6 GTexels/sec spec for coherent access, but random access into textures larger than L2 collapses to 0.44 GTexels/sec (0.3% of peak) - a 100-200x penalty from cache line waste and TLB thrashing. The texture cache hierarchy provides a 5-7x bandwidth amplifier for spatially coherent access but offers no protection against scattered patterns, making texture atlas packing and mip selection critical on iGPUs.
  • Cross-GPU comparison shows the 760M (8 CUs) delivers ~95% of the 780M's gaming performance on the same DDR5-5600 memory, because the bottleneck is bandwidth, not CU count.
  • Power scaling confirms it empirically: GPU performance saturates at 35-45W package power for gaming, and the GPU draws only ~29W even when given a 54W budget.

For this class of iGPU, the single most impactful upgrade would be faster memory (LPDDR5X-7500 would increase bandwidth by ~34%), not more compute units or higher clocks. Until then, the 780M's 12 CUs will remain underutilized for the majority of workloads it encounters.
