Why ROCm Wins the Throughput Race but Loses the Power Bill on Strix Halo — A 35% Energy Reversal Caused by APU Runtime Polling

🎯
TL;DR — On AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S iGPU) running the same Qwen3-30B-A3B Q4_K_M model, HIP/ROCm leads on raw throughput (pp512 at 354 t/s, against 78 t/s for Vulkan on a 16-token prompt, the only prompt size measured there; warm tg 48 vs 36 t/s). But merely holding the model loaded burns about 2.5 CPU cores continuously because of HSA runtime polling. For 24/7 workloads where idle time dominates, Windows Ollama (Vulkan) is estimated to consume about 35% less energy per token. A clean case study in why dGPU intuitions don't transfer to APUs.
📊
[Image #1 — Hero] ROCm and Vulkan facing off on top of a Strix Halo APU
file: 2026-05-31-strixhalo-hero.png

📌 Intro — Strix Halo, a new "middle-ground" platform

Who this is for — Infra/ML engineers running local LLM workloads on Strix Halo / Ryzen AI MAX+ systems, and any backend decision-maker who has to validate the "AMD GPU means ROCm" intuition. If you ever picked an inference backend based on a single throughput table, this case study is for you.
AMD Strix Halo (Ryzen AI MAX+ 395 + Radeon 8060S iGPU, gfx1151) is neither a desktop dGPU nor a laptop iGPU — it's a new middle ground. With up to 65 GB of unified memory (UMA/GTT) you can fit a 30B-class LLM entirely on the GPU, and "run an MoE 30B on a single desktop" stops being marketing and becomes a measurable claim.
The trap is what happens the moment you carry the old dGPU intuition — "AMD GPU = ROCm is the answer" — onto this new variant. Your operational metrics invert. This article logs four measurement rounds comparing ROCm 7.2.4 + a self-built HIP llama.cpp against Windows-native Ollama (Vulkan), on the same model, on the same machine.
Quick glossary:
  • APU (Accelerated Processing Unit): A single silicon die containing both CPU and GPU. A dGPU (Discrete GPU) sits on a separate card connected via PCIe.
  • UMA/GTT: A BIOS setting that carves out a portion of system RAM as GPU-visible memory on an APU. Where a dGPU ships with its own dedicated VRAM, this carve-out comes from system RAM and is static: set in BIOS, not adjusted at runtime.
  • HSA polling: When the CPU waits for GPU work to complete, instead of being woken by an OS interrupt, a CPU thread keeps tapping the queue to check. On a dGPU the cost hides; on an APU it shows up loud and clear.

🔬 Round 1 — VRAM recognition ("ROCm only sees 4 GB" — solved)

The first wall was the predictable one. With default BIOS allocation, ROCm only sees the dedicated VRAM (4 GB), so the 30B model can't be GPU-offloaded. Increasing BIOS UMA/GTT to 65 GB brings rocminfo Pool 1 GLOBAL to 95.83 GiB, and llama-bench correctly enumerates the device.
🧩
[Image #2] BIOS UMA dial — from 4 GB to 65 GB
file: 2026-05-31-strixhalo-vram-dial.png
rocminfo (gfx1151 agent, Pool 1):
  Segment: GLOBAL; FLAGS: COARSE GRAINED
  Size: 100,478,882 KB ≈ 95.83 GiB
llama-bench device discovery:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151)
  VRAM: 98123 MiB, free: 97044 MiB
This part is uncontroversial. "We can use it now" is the conclusion, and most articles would stop here.

⚡ Round 2 — Raw throughput: HIP wins outright

Running Qwen3-30B-A3B Q4_K_M with -ngl 99 (full GPU offload), the numbers strongly favored HIP.
| Measurement | WSL HIP (ROCm 7.2.4) | Windows Ollama (Vulkan) | Note |
|---|---|---|---|
| pp t/s (16 tok) | 56.25 | <strong>77.7</strong> | Short prompts: Vulkan wins |
| pp t/s (128 tok) | 155.55 | (not measured) | amortization starts |
| pp t/s (512 tok) | <strong>354.09 ± 15.72</strong> | (not measured) | HIP's real strength |
| tg t/s (warm) | <strong>48.21 ± 16.39</strong> | 36.2 | HIP ~33% faster when warm |
| tg t/s (cold first run) | 36.06 | 36.2 | essentially tied cold |
| Model load time | ~27s | ~24s | comparable |
Reading just this table, the verdict is clear: "HIP dominates long-prompt processing and warm generation, so go ROCm." Stopping here would reinforce the conventional wisdom one more time.
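Before leaning on the "~33% faster" figure, it helps to see how it relates to the reported spread. The sketch below only re-does arithmetic on the table's own numbers. The helper names are mine, and it assumes the ± value can be read as a plain symmetric spread around the mean.

```python
def relative_gain(a: float, b: float) -> float:
    """Fractional advantage of a over b (0.33 means a is 33% faster)."""
    return a / b - 1.0


def within_spread(mean: float, spread: float, other: float) -> bool:
    """True if `other` lies inside mean ± spread."""
    return abs(mean - other) <= spread


# Warm token-generation numbers copied from the table above.
hip_tg_mean, hip_tg_spread = 48.21, 16.39
vulkan_tg = 36.2  # single reported value, no spread given

gain = relative_gain(hip_tg_mean, vulkan_tg)
overlaps = within_spread(hip_tg_mean, hip_tg_spread, vulkan_tg)
print(f"warm tg gain: {gain:.1%}, Vulkan inside HIP spread: {overlaps}")
```

The gain comes out at about 33%, but the Vulkan figure sits inside HIP's ±16.39 band, so the warm-generation lead is worth re-measuring with more runs before treating it as settled.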
📈
[Image #3] HIP vs Vulkan throughput bars — the surface verdict
file: 2026-05-31-strixhalo-throughput.png

🚨 Round 3 — Wait, the CPU is doing what? (idle 254% spin)

Trouble started right after the throughput run. With llama-server holding the model and zero requests inbound, a glance at top blew everything up.

Idle after model load (no requests, 15s sampling)

t=3s   llama-server  CPU: 265%  RSS: 454 MB
t=6s   llama-server  CPU: 259%  RSS: 454 MB
t=9s   llama-server  CPU: 253%  RSS: 454 MB
t=12s  llama-server  CPU: 249%  RSS: 454 MB
t=15s  llama-server  CPU: 245%  RSS: 454 MB
→ avg ~254% (2.5 cores continuously burning)

Idle right after 128-token generation (30s sampling)

t=5s   CPU: 248%
t=10s  CPU: 244%
t=15s  CPU: 240%
t=20s  CPU: 237%
t=25s  CPU: 234%
t=30s  CPU: 231%
system jiffies: busy +5089 / total +81108 = 6.27% of 32-core system → ~2.0 cores
The spin doesn't release.
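For anyone reproducing these figures, the conversion from raw samples to "cores burned" is simple enough to script. This is a minimal sketch with helper names of my own choosing. It assumes the jiffies deltas come from the aggregate `cpu` line of /proc/stat, and that top's per-process percentage uses 100% for one logical CPU.

```python
from statistics import mean


def cores_from_jiffies(busy_delta: int, total_delta: int, n_cpus: int) -> float:
    """Average number of busy logical CPUs over a sampling window.

    busy_delta and total_delta are differences between two reads of the
    aggregate `cpu` line in /proc/stat (busy = total minus idle time).
    """
    if total_delta <= 0:
        raise ValueError("total_delta must be positive")
    return busy_delta / total_delta * n_cpus


def cores_from_top(percent_samples: list) -> float:
    """Convert top-style per-process CPU% samples into an average core count."""
    return mean(percent_samples) / 100.0


# Samples copied from the two idle windows above.
after_load = [265, 259, 253, 249, 245]
after_generation = [248, 244, 240, 237, 234, 231]

print(round(cores_from_top(after_load), 2))
print(round(cores_from_top(after_generation), 2))
print(round(cores_from_jiffies(5089, 81108, 32), 2))
```

The per-process and system-wide views are different measurements, so they should be read as a cross-check rather than expected to match exactly. Both land at two-plus cores.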
A perf record/report of the hot path pointed to a single source: libhsa-runtime64.so.1.18.70204, offsets 0x9ab00–0x9ad60 (a window of 0x260, about 600 bytes), which is the HSA runtime's GPU completion-queue polling loop.
The control comparison — Windows Ollama (Vulkan) under identical conditions — looked like this:
ollama main (pid 16372)  CPU: 0.0%  RSS: 104 MB
ollama UI (pid 10212)    CPU: 0.0%  RSS: 403 MB
=> avg idle CPU: 0.0%
During inference: runner CPU ~65% of 1 core (<2% of 32-core system)
Post-generation idle: 0.0%
🔥
[Image #4] HSA runtime polling — CPU cores tapping an empty queue forever
file: 2026-05-31-strixhalo-hsa-polling.png
I tried every escape hatch I could find: HSA_ENABLE_INTERRUPT=1, HSA_ENABLE_DXG_DETECTION=1, GGML_CUDA_NO_PINNED=1, partial offload (-ngl 16). None released the spin. The polling loop appears to be built into the HSA runtime itself, and none of the settings I tried reaches it.

🧩 Why dGPUs hide this — same polling, different die

HSA polling exists everywhere ROCm runs, but on a dGPU you rarely notice it. The workload usually keeps the GPU busy enough that the polling thread blends into the noise, and the dGPU sits on a separate card with its own memory and its own power and thermal budget, so the host CPU time spent polling never competes with the GPU for either.
APUs are different. CPU and GPU share one die, one memory controller, and one power envelope. The polling thread eats host CPU cores while the GPU sits idle, and the watts it draws come out of the same package budget. Same polling, two very different costs: easy to overlook on a dGPU, 2.5 cores burned on an APU.
🧠
[Image #5] dGPU vs APU die layout — where polling hides vs. where it shows
file: 2026-05-31-strixhalo-die-comparison.png
Vulkan, on the same physical GPU, uses the OS's regular GPU completion-wait mechanism (interrupt/event-driven). That's why it hits 0% idle CPU. The difference is invisible in a throughput comparison and only shows up when you measure the whole runtime.
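The difference between the two wait styles is easy to feel in miniature. The sketch below is only an analogy built from Python threads. It does not call HSA or Vulkan, and every name in it is hypothetical. One waiter spins on a flag the way a polling loop would; the other blocks until it is woken, the way an interrupt- or event-driven wait behaves. Both wait for the same half second of "GPU idle" time, and the script records how much CPU each waiter consumed.

```python
import threading
import time


def wait_by_polling(done: threading.Event, cpu_used: list) -> None:
    """Spin on the flag without sleeping, like a completion-queue poll loop."""
    start = time.thread_time()
    while not done.is_set():
        pass
    cpu_used.append(time.thread_time() - start)


def wait_by_blocking(done: threading.Event, cpu_used: list) -> None:
    """Sleep inside the OS until signalled, like an event-driven wait."""
    start = time.thread_time()
    done.wait()
    cpu_used.append(time.thread_time() - start)


def cpu_seconds_while_waiting(waiter, idle_seconds: float) -> float:
    """Run `waiter` in a thread, signal it after `idle_seconds`, return its CPU time."""
    done = threading.Event()
    cpu_used = []
    worker = threading.Thread(target=waiter, args=(done, cpu_used))
    worker.start()
    time.sleep(idle_seconds)  # nothing happens during this window
    done.set()
    worker.join()
    return cpu_used[0]


polling_cpu = cpu_seconds_while_waiting(wait_by_polling, 0.5)
blocking_cpu = cpu_seconds_while_waiting(wait_by_blocking, 0.5)
print(f"polling: {polling_cpu:.3f} CPU-s, blocking: {blocking_cpu:.3f} CPU-s")
```

On a typical machine the polling waiter burns close to the full idle window in CPU time while the blocking waiter stays near zero, the same shape as the 254% vs 0.0% idle readings above, just at toy scale.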

🔋 Round 4 — Convert to energy per token, and the ranking flips

Even when throughput is comparable, adding the polling cost on top changes the picture. Assuming roughly 4 W per active Zen 5 core, here are the estimates:
| Scenario | HIP extra power | Vulkan extra power | Gap |
|---|---|---|---|
| Idle after model load | +~10 W constant (2.5-core spin) | ~0 W | HIP's standing cost |
| Active inference (1s, GPU equal) | +~10 W (CPU spin compounds) | +~2–3 W | HIP +7–8 W |
| Estimated J/token (GPU 40 W shared) | ~1.94 J/tok | <strong>~1.25 J/tok</strong> | <strong>Vulkan ~35% less</strong> |
| 24h low-frequency server (idle ≥95%) | +~250 Wh/day | baseline | ~90 kWh/year more |
The gap is decided by workload pattern. For burst-batch workloads where the GPU stays pinned, polling cost gets absorbed and HIP's throughput edge survives. But for Discord bots, API servers, internal RAG gateways — anything with ≥95% idle ratio — polling cost dominates total usage cost. The annual delta of ~90 kWh isn't huge in dollars (a few tens of dollars at typical electricity rates), but the heat, fan noise, and chassis-life impact pile on top.
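To make the workload dependence concrete, here is a rough model of the arithmetic. It is a sketch under loud assumptions: 4 W per spinning core and a 40 W GPU draw during inference (both taken from the estimates above), 2.5 W as the midpoint of Vulkan's "+2–3 W", GPU idle power and the rest of the system ignored, and idle energy charged to the tokens that eventually get generated. It does not attempt to reproduce the table's J/token row, whose full inputs aren't itemized above. The function names are mine.

```python
def standing_cost(cores_spinning: float, watts_per_core: float) -> dict:
    """Extra power and energy from a spin that never stops."""
    watts = cores_spinning * watts_per_core
    wh_per_day = watts * 24
    kwh_per_year = wh_per_day * 365 / 1000
    return {"watts": watts, "wh_per_day": wh_per_day, "kwh_per_year": kwh_per_year}


def joules_per_token(tps: float, active_watts: float, idle_watts: float,
                     idle_fraction: float) -> float:
    """Energy per generated token when idle time is billed to the tokens.

    Per wall-clock second: energy = active_watts*(1-f) + idle_watts*f,
    tokens produced = tps*(1-f), where f is the idle fraction.
    """
    busy = 1.0 - idle_fraction
    if busy <= 0:
        raise ValueError("idle_fraction must be below 1")
    return (active_watts * busy + idle_watts * idle_fraction) / (tps * busy)


spin = standing_cost(cores_spinning=2.5, watts_per_core=4.0)
print(spin)  # 10 W, 240 Wh/day, 87.6 kWh/year under these assumptions

# Assumed totals: HIP = 40 W GPU + 10 W spin, Vulkan = 40 W GPU + 2.5 W CPU.
for idle in (0.0, 0.5, 0.95):
    hip = joules_per_token(48.21, active_watts=50.0, idle_watts=10.0, idle_fraction=idle)
    vulkan = joules_per_token(36.2, active_watts=42.5, idle_watts=0.0, idle_fraction=idle)
    print(f"idle {idle:.0%}: HIP {hip:.2f} J/tok, Vulkan {vulkan:.2f} J/tok")
```

The standing-cost numbers line up with the rounded ~250 Wh/day and ~90 kWh/year in the table. In this simplified model HIP is the cheaper backend per token when the GPU never idles, and the ordering flips once idle time takes a large share of the day, which is the same pattern described above.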
⚖️
[Image #6] Energy-per-token seesaw — polling tips the balance even in equal-throughput territory
file: 2026-05-31-strixhalo-energy-seesaw.png

🛠️ Operational recommendation — pick by workload pattern, not by the throughput table

The conclusion from this single set of measurements is sharp. The dGPU intuition "AMD GPU = ROCm" doesn't transfer to a unified-memory APU like Strix Halo. On APUs, the top-priority operational metric isn't "which backend is faster" — it's "which runtime doesn't burn the CPU when it's not working."
| Workload | Recommended backend | Why |
|---|---|---|
| 24/7 server (Discord bot, API serving, internal RAG) | <strong>Windows Ollama (Vulkan)</strong> | 0% idle CPU, interrupt-based GPU wait |
| Burst batch (long-prompt batches, reranking) | HIP llama-server — spin up, then kill immediately | Keep the throughput edge, minimize polling exposure |
| Realtime single-user chat | Either — the 12 t/s warm-tg gap is barely perceptible | Decide by idle-time share |
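The "spin up, then kill immediately" pattern for burst batches can be wrapped so the server never outlives the batch. This is a minimal sketch, not the setup used for the measurements: the context-manager name is hypothetical, and the demo launches a sleeping Python child as a stand-in so the block runs anywhere. In real use you would pass your own llama-server command line and add a readiness check before sending requests.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def short_lived_server(cmd: list, stop_timeout: float = 10.0):
    """Start `cmd`, yield the process, and always stop it on exit."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        proc.terminate()  # ask politely first
        try:
            proc.wait(timeout=stop_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # force it if it ignores the request
            proc.wait()


# Stand-in child process; replace with the real server command line.
demo_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]

with short_lived_server(demo_cmd) as server:
    running_during_batch = server.poll() is None  # batch work would go here

stopped_after_batch = server.poll() is not None
print(running_during_batch, stopped_after_batch)
```

Because the process is torn down in `finally`, a failed batch still releases the polling threads, so the standing CPU cost is paid only while work is actually queued.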

Tunables that did NOT release the HSA spin

  • HSA_ENABLE_INTERRUPT=1 — no effect
  • HSA_ENABLE_DXG_DETECTION=1 — required for device discovery, unrelated to the spin
  • GGML_CUDA_NO_PINNED=1 — no effect
  • -ngl 16 (partial offload) — spin persists, and CPU↔GPU layer hopping makes it slower than CPU-only
  • Killing all external clients — the spin is independent of inbound traffic

Triggers for revisiting HIP (watch ROCm release notes for)

  • "HSA polling fix"
  • "gfx11 generic interrupt support"
  • "APU power efficiency"
If any of those land, re-measure immediately. Until then, pick the backend to match the form factor.
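Re-measuring is quick to automate. The sketch below is one way to do it on Linux or WSL, with assumptions stated up front: the function and dictionary names are hypothetical, the fixed settle delay is a crude stand-in for "model finished loading", and the server command is whatever you pass in. It reads utime and stime from /proc/<pid>/stat before and after an idle window, once per tunable. Variables your setup needs for device discovery, such as HSA_ENABLE_DXG_DETECTION=1, are expected to already be in the inherited environment.

```python
import os
import subprocess
import time


def parse_cpu_ticks(stat_text: str) -> int:
    """Return utime + stime (clock ticks) from the text of /proc/<pid>/stat."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split on the last ')' before tokenising the remaining fields.
    fields = stat_text.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 overall
    return utime + stime


def idle_cores(pid: int, window_s: float) -> float:
    """Average CPU cores used by `pid` over `window_s` seconds (Linux only)."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")

    def read() -> int:
        with open(f"/proc/{pid}/stat") as f:
            return parse_cpu_ticks(f.read())

    before = read()
    time.sleep(window_s)
    return (read() - before) / ticks_per_s / window_s


# Environment overrides to compare; the names come from the list above.
TUNABLES = {
    "baseline": {},
    "HSA_ENABLE_INTERRUPT=1": {"HSA_ENABLE_INTERRUPT": "1"},
    "GGML_CUDA_NO_PINNED=1": {"GGML_CUDA_NO_PINNED": "1"},
}


def sweep(cmd: list, settle_s: float, window_s: float) -> dict:
    """Launch `cmd` once per tunable and record idle cores with no requests sent."""
    results = {}
    for label, extra in TUNABLES.items():
        proc = subprocess.Popen(cmd, env={**os.environ, **extra})
        try:
            time.sleep(settle_s)  # wait for the model to load
            results[label] = idle_cores(proc.pid, window_s)
        finally:
            proc.terminate()
            proc.wait()
    return results
```

Calling `sweep(your_server_cmd, settle_s=40, window_s=15)` returns a mapping from tunable to idle cores. A result near 2.5 means the spin is still there; a result near zero would be the signal to redo the full comparison.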
🛡️
[Image #7] Workload-pattern-based backend choice — 24/7 idle vs burst batch
file: 2026-05-31-strixhalo-workload-decision.png

📚 Closing — measurement limits and what would change the verdict

Every number above is a snapshot from 2026-05-31, a single ASUS Strix Halo machine (BIOS UMA 65 GB), measured on ROCm 7.2.4. The conclusion is subject to update if any of the following changes:
  • ROCm releases: If patches for HSA polling or gfx11 interrupt support land, the gap could close.
  • Model variety: Only Qwen3-30B-A3B (MoE active 3B) measured. Ratios will differ for dense models.
  • Power measurement: Estimated from CPU core utilization, not a physical wattmeter. A follow-up with HWInfo / Ryzen Master is warranted.
  • Other Strix Halo boards: BIOS UMA settings and firmware revisions may affect reproducibility.
This is one case study in how a hardened dGPU intuition ("AMD GPU = ROCm is the answer") can invert an APU's operating metrics entirely. As unified-memory APUs like Strix Halo become more common, this kind of runtime-cost evaluation is likely to become standard practice.

Reference categories

Specific URLs change frequently across ROCm, llama.cpp, and Ollama releases, so I leave search keywords instead of brittle anchors:
  • AMD Strix Halo / gfx1151 ROCm support — official ROCm release notes "Supported GPUs / gfx1151" section, Phoronix "Strix Halo ROCm" benchmark series
  • llama.cpp HIP / Vulkan backend comparisons — ggml-org/llama.cpp repository under the Vulkan label, the HIP_UMA introduction PR
  • HSA runtime polling on APU/iGPU — ROCm issue tracker for "HSA polling APU CPU usage", discussions around libhsa-runtime64's wait_until_busy
  • Ollama Vulkan backend origins — Ollama release notes for the "Vulkan" keyword, the OLLAMA_VULKAN env-var docs

Measurement metadata (for reproduction)

  • Test machine: ASUS Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S, gfx1151, 64 GB RAM, BIOS UMA 65 GB)
  • Date: 2026-05-31
  • HIP build: llama.cpp commit 2d9b7c8 (-DGGML_HIP=ON -DGGML_HIP_UMA=ON -DAMDGPU_TARGETS=gfx1151)
  • Vulkan run: Windows-native Ollama, OLLAMA_VULKAN=1, HIP_VISIBLE_DEVICES=-1, ROCR_VISIBLE_DEVICES=-1
  • Model: Qwen3-30B-A3B-Instruct-2507 Q4_K_M (17.28 GiB, MoE active 3B)

Measured on 2026-05-31 · ROCm 7.2.4 · llama.cpp commit 2d9b7c8 · Qwen3-30B-A3B-Instruct-2507 Q4_K_M