Why ROCm Wins the Throughput Race but Loses the Power Bill on Strix Halo — A 35% Energy Reversal Caused by APU Runtime Polling

🇰🇷 Korean sister page: "Why ROCm lost on Strix Halo despite being faster: a 35% power-efficiency reversal caused by APU runtime polling" (Korean)

🎯

TL;DR: On AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S iGPU) running the same Qwen3-30B-A3B Q4_K_M model, HIP/ROCm wins raw throughput where it matters most (pp512 at 354 t/s, warm tg 48 vs 36 t/s), while Vulkan leads only on very short prompts (pp16 78 vs 56 t/s; Vulkan pp512 was not measured). But merely keeping the model loaded under ROCm burns about 2.5 CPU cores continuously because of HSA runtime polling. For 24/7 workloads where idle time dominates, Windows Ollama (Vulkan) is estimated to consume about 35% less energy per token. A clean case study in why dGPU intuitions don't transfer to APUs.

📊

[Image #1 — Hero] ROCm and Vulkan facing off on top of a Strix Halo APU

file: 2026-05-31-strixhalo-hero.png

📌 Intro — Strix Halo, a new "middle-ground" platform

Who this is for — Infra/ML engineers running local LLM workloads on Strix Halo / Ryzen AI MAX+ systems, and any backend decision-maker who has to validate the "AMD GPU means ROCm" intuition. If you ever picked an inference backend based on a single throughput table, this case study is for you.

AMD Strix Halo (Ryzen AI MAX+ 395 + Radeon 8060S iGPU, gfx1151) is neither a desktop dGPU nor a laptop iGPU — it's a new middle ground. With up to 65 GB of unified memory (UMA/GTT) you can fit a 30B-class LLM entirely on the GPU, and "run an MoE 30B on a single desktop" stops being marketing and becomes a measurable claim.

The trap is what happens the moment you carry the old dGPU intuition — "AMD GPU = ROCm is the answer" — onto this new variant. Your operational metrics invert. This article logs four measurement rounds comparing ROCm 7.2.4 + a self-built HIP llama.cpp against Windows-native Ollama (Vulkan), on the same model, on the same machine.

Quick glossary:

APU (Accelerated Processing Unit): A single processor package containing both CPU and GPU, sharing system memory and one power budget. A dGPU (Discrete GPU) sits on a separate card connected via PCIe.

UMA/GTT: On an APU, GPU-visible memory comes out of system RAM instead of living on a card. The UMA frame buffer size is a static BIOS setting; GTT is additional system memory the driver maps for the GPU on demand. The allocation changed in this article is the BIOS one.

HSA polling: When the CPU waits for GPU work to complete, instead of being woken by an OS interrupt, a CPU thread keeps tapping the queue to check. On a dGPU the cost hides; on an APU it shows up loud and clear.
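To make the polling-versus-interrupt distinction concrete, here is a minimal Python sketch. It is not HSA code: a `threading.Event` stands in for the GPU completion signal, and the function names are illustrative only. One waiter spins on the flag, the other blocks until it is signalled, and each reports how much CPU time its thread consumed while waiting.

```python
import threading
import time


def wait_by_polling(done: threading.Event) -> float:
    """Spin on the flag until it is set; return CPU seconds used by this thread."""
    start = time.thread_time()
    while not done.is_set():
        pass  # busy-wait: the thread stays runnable the whole time
    return time.thread_time() - start


def wait_by_blocking(done: threading.Event) -> float:
    """Sleep until the flag is set; return CPU seconds used by this thread."""
    start = time.thread_time()
    done.wait()  # the OS parks the thread until set() wakes it
    return time.thread_time() - start


def cpu_cost_of_waiting(waiter, work_seconds: float = 0.2) -> float:
    """Run `waiter` in a thread while a pretend GPU job takes `work_seconds`."""
    done = threading.Event()
    result = []
    thread = threading.Thread(target=lambda: result.append(waiter(done)))
    thread.start()
    time.sleep(work_seconds)  # stand-in for the GPU doing work
    done.set()                # stand-in for the completion signal
    thread.join()
    return result[0]


if __name__ == "__main__":
    print(f"polling : {cpu_cost_of_waiting(wait_by_polling):.3f} CPU-s")
    print(f"blocking: {cpu_cost_of_waiting(wait_by_blocking):.3f} CPU-s")
```

The polling waiter should report CPU time close to the length of the pretend job, while the blocking waiter should report close to zero. That gap is the whole story of this article in miniature.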

🔬 Round 1 — VRAM recognition ("ROCm only sees 4 GB" — solved)

The first wall was the predictable one. With default BIOS allocation, ROCm only sees the dedicated VRAM (4 GB), so the 30B model can't be GPU-offloaded. Increasing BIOS UMA/GTT to 65 GB brings rocminfo Pool 1 GLOBAL to 95.83 GiB, and llama-bench correctly enumerates the device.

🧩

[Image #2] BIOS UMA dial — from 4 GB to 65 GB

file: 2026-05-31-strixhalo-vram-dial.png


rocminfo (gfx1151 agent, Pool 1):
  Segment: GLOBAL; FLAGS: COARSE GRAINED
  Size:    100,478,882 KB  ≈ 95.83 GiB

llama-bench device discovery:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151)
  VRAM: 98123 MiB,  free: 97044 MiB
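A quick sanity check that the two tools are describing the same memory pool. This sketch assumes rocminfo's "KB" means 1024 bytes, which is my reading rather than something verified against its source; the variable names are illustrative.

```python
KIB_PER_MIB = 1024
MIB_PER_GIB = 1024

rocminfo_pool_kb = 100_478_882   # "Size" from rocminfo Pool 1 (assumed to be KiB)
llama_bench_vram_mib = 98_123    # "VRAM" from llama-bench device discovery


def kib_to_mib(kib: float) -> float:
    return kib / KIB_PER_MIB


def mib_to_gib(mib: float) -> float:
    return mib / MIB_PER_GIB


pool_mib = kib_to_mib(rocminfo_pool_kb)
# If both tools report the same pool, they should agree to within
# llama-bench's 1 MiB reporting granularity.
same_pool = abs(pool_mib - llama_bench_vram_mib) < 1.0

print(f"rocminfo   : {mib_to_gib(pool_mib):.1f} GiB")
print(f"llama-bench: {mib_to_gib(llama_bench_vram_mib):.1f} GiB")
print(f"agree within 1 MiB: {same_pool}")
```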

This part is uncontroversial. "We can use it now" is the conclusion, and most articles would stop here.

⚡ Round 2 — Raw throughput: HIP wins outright

Running Qwen3-30B-A3B Q4_K_M with -ngl 99 (full GPU offload), the numbers strongly favored HIP.

Measurement	WSL HIP (ROCm 7.2.4)	Windows Ollama (Vulkan)	Note
pp t/s (16 tok)	56.25	77.7	Short prompts: Vulkan wins
pp t/s (128 tok)	155.55	(not measured)	amortization starts
pp t/s (512 tok)	354.09 ± 15.72	(not measured)	HIP's real strength
tg t/s (warm)	48.21 ± 16.39	36.2	HIP ~33% faster when warm
tg t/s (cold first run)	36.06	36.2	essentially tied cold
Model load time	~27s	~24s	comparable

Reading just this table, the verdict is clear: "HIP dominates long-prompt processing and warm generation, so go ROCm." Stopping here would reinforce the conventional wisdom one more time.
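The relative gaps can be recomputed from the rows where both backends were measured. The sketch below only copies numbers from the table; the dictionary layout and function name are mine. Note that the only prompt-processing row with a Vulkan measurement is the 16-token one, so pp128 and pp512 are left out.

```python
# Throughput in tokens/second, copied from the table above.
# Ollama reported a single tg figure, used here for both warm and cold rows.
hip = {"pp16": 56.25, "tg_warm": 48.21, "tg_cold": 36.06}
vulkan = {"pp16": 77.7, "tg_warm": 36.2, "tg_cold": 36.2}


def percent_faster(a: float, b: float) -> float:
    """How much faster rate `a` is than rate `b`, in percent (negative if slower)."""
    return (a / b - 1.0) * 100.0


for metric in hip:
    gap = percent_faster(hip[metric], vulkan[metric])
    leader = "HIP" if gap > 0 else "Vulkan"
    print(f"{metric:8s} HIP {hip[metric]:6.2f}  Vulkan {vulkan[metric]:6.2f}  "
          f"-> {leader} leads (HIP {gap:+.0f}%)")
```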

📈

[Image #3] HIP vs Vulkan throughput bars — the surface verdict

file: 2026-05-31-strixhalo-throughput.png

🚨 Round 3 — Wait, the CPU is doing what? (idle 254% spin)

Trouble started right after the throughput run. With llama-server holding the model and zero requests inbound, a glance at top blew everything up.

Idle after model load (no requests, 15s sampling)


t=3s   llama-server CPU: 265%   RSS: 454 MB
t=6s   llama-server CPU: 259%   RSS: 454 MB
t=9s   llama-server CPU: 253%   RSS: 454 MB
t=12s  llama-server CPU: 249%   RSS: 454 MB
t=15s  llama-server CPU: 245%   RSS: 454 MB
→ avg ~254% (2.5 cores continuously burning)

Idle right after 128-token generation (30s sampling)


t=5s   CPU: 248%
t=10s  CPU: 244%
t=15s  CPU: 240%
t=20s  CPU: 237%
t=25s  CPU: 234%
t=30s  CPU: 231%
system jiffies: busy +5089 / total +81108 = 6.27% of 32 logical CPUs (16 cores / 32 threads)
→ ~2.0 cores. The spin doesn't release.
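For anyone reproducing the idle numbers on the Linux/WSL side, per-process CPU usage can be sampled from `/proc` without extra tools. The sketch below is one way such a sampler could look, not the script that produced the logs above; the function names are illustrative, and the field positions follow the proc(5) description of `/proc/<pid>/stat`.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime, in clock ticks, from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces,
    # so cut at the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return utime + stime


def cpu_percent(ticks_before: int, ticks_after: int,
                seconds: float, ticks_per_second: int) -> float:
    """CPU usage over the window; 100.0 means one logical CPU fully busy."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


def busy_share_to_cpus(busy_delta: int, total_delta: int, logical_cpus: int) -> float:
    """Convert a system-wide busy/total jiffies ratio into 'logical CPUs busy'."""
    return busy_delta / total_delta * logical_cpus


def sample_pid(pid: int, seconds: float = 3.0) -> float:
    """Sample one process's CPU usage over `seconds` (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(seconds)
    with open(f"/proc/{pid}/stat") as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, seconds, ticks_per_second)
```

Calling `sample_pid` on the llama-server PID in a loop gives a series comparable to the logs above, and `busy_share_to_cpus(5089, 81108, 32)` turns the jiffies line into roughly 2.0 logical CPUs.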

A perf record/report of the hot path pointed to a single source: libhsa-runtime64.so.1.18.70204, offset 0x9ab00–0x9ad60 (0x260, roughly 600 bytes), which is the HSA runtime's GPU completion-queue polling loop.

The control comparison — Windows Ollama (Vulkan) under identical conditions — looked like this:


ollama main(pid 16372)   CPU: 0.0%   RSS: 104 MB
ollama UI(pid 10212)     CPU: 0.0%   RSS: 403 MB
=> avg idle CPU: 0.0%

During inference: runner CPU ~65% of 1 core (~2% of the 32-thread system)
Post-generation idle: 0.0%

🔥

[Image #4] HSA runtime polling — CPU cores tapping an empty queue forever

file: 2026-05-31-strixhalo-hsa-polling.png

I tried the escape hatches I could find: HSA_ENABLE_INTERRUPT=1, HSA_ENABLE_DXG_DETECTION=1, GGML_CUDA_NO_PINNED=1, partial offload (-ngl 16). None released the spin. In this configuration the polling loop appears to be built into the HSA runtime itself, and none of the environment variables I tried reaches it.

🧩 Why dGPUs hide this — same polling, different die

The same HSA wait loop ships wherever ROCm runs, but on a dGPU box it rarely gets noticed. The polling thread still runs on host CPU cores there, but the GPU sits on its own card with its own memory and its own power budget, so a spinning core costs the GPU nothing, and under a steady workload it blends into the noise.

APUs are different. CPU and GPU share one package, one power and thermal budget, and the same memory controller. The polling thread eats CPU cores while the GPU sits idle, and the power those cores draw comes out of the budget the GPU shares. Same polling, two very different costs: easy to miss on a dGPU system, 2.5 cores burned on this APU.

🧠

[Image #5] dGPU vs APU die layout — where polling hides vs. where it shows

file: 2026-05-31-strixhalo-die-comparison.png

Vulkan, on the same physical GPU, uses the OS's regular GPU completion-wait mechanism (interrupt/event-driven). That's why it hits 0% idle CPU. The difference is invisible in a throughput comparison and only shows up when you measure the whole runtime.

🔋 Round 4 — Convert to energy per token, and the ranking flips

Even when throughput is comparable, adding the polling cost on top changes the picture. Assuming roughly 4 W per active Zen 5 core, the 2.5-core spin works out to about 10 W, and the rest of the estimates follow from that figure:

Scenario	HIP extra power	Vulkan extra power	Gap
Idle after model load	+~10 W constant (2.5-core spin)	~0 W	HIP's standing cost
Active inference (1s, GPU equal)	+~10 W (CPU spin compounds)	+~2–3 W	HIP +7–8 W
Estimated J/token (GPU 40 W shared)	~1.94 J/tok	~1.25 J/tok	Vulkan ~35% less
24h low-frequency server (idle ≥95%)	+~250 Wh/day	baseline	~90 kWh/year more

The gap is decided by workload pattern. For burst-batch workloads where the GPU stays pinned, polling cost gets absorbed and HIP's throughput edge survives. But for Discord bots, API servers, internal RAG gateways — anything with ≥95% idle ratio — polling cost dominates total usage cost. The annual delta of ~90 kWh isn't huge in dollars (a few tens of dollars at typical electricity rates), but the heat, fan noise, and chassis-life impact pile on top.
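How the workload pattern decides the outcome can be sketched with a small duty-cycle model. Everything here is an assumption layered on the estimates above: 4 W per core, a 2.5-core spin, 40 W of GPU power while generating, the warm tg rate for HIP, and 2.5 W as the midpoint of Vulkan's active CPU cost. It counts only power above a common idle baseline, and the absolute J/token values it prints depend entirely on these inputs, so they are not meant to replace the table's estimates.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the estimates above; all are assumptions, not wattmeter readings.
WATTS_PER_CORE = 4.0
GPU_ACTIVE_W = 40.0
HIP = {"tok_per_s": 48.21, "cpu_active_w": 10.0, "cpu_idle_w": 2.5 * WATTS_PER_CORE}
VULKAN = {"tok_per_s": 36.2, "cpu_active_w": 2.5, "cpu_idle_w": 0.0}


def standing_cost_kwh_per_year(idle_watts: float, idle_fraction: float = 1.0) -> float:
    """Energy per year spent on a constant idle draw."""
    return idle_watts * idle_fraction * 24 * 365 / 1000


def joules_per_token(backend: dict, tokens_per_day: float) -> float:
    """Average energy per token over a full day, idle time included."""
    active_s = tokens_per_day / backend["tok_per_s"]
    if active_s > SECONDS_PER_DAY:
        raise ValueError("demand exceeds one day of generation")
    idle_s = SECONDS_PER_DAY - active_s
    active_j = (GPU_ACTIVE_W + backend["cpu_active_w"]) * active_s
    idle_j = backend["cpu_idle_w"] * idle_s
    return (active_j + idle_j) / tokens_per_day


if __name__ == "__main__":
    print(f"spin standing cost: {standing_cost_kwh_per_year(HIP['cpu_idle_w']):.1f} kWh/year")
    for tokens in (100_000, 3_000_000):
        hip_j = joules_per_token(HIP, tokens)
        vk_j = joules_per_token(VULKAN, tokens)
        print(f"{tokens:>9,} tok/day  HIP {hip_j:.2f} J/tok  Vulkan {vk_j:.2f} J/tok")
```

Under these assumptions Vulkan comes out well ahead at 100,000 tokens per day, where the server is idle most of the time, while HIP comes out slightly ahead at 3,000,000 tokens per day, where the GPU is busy for most of the day and the spin has little idle time to accumulate.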

⚖️

[Image #6] Energy-per-token seesaw — polling tips the balance even in equal-throughput territory

file: 2026-05-31-strixhalo-energy-seesaw.png

🛠️ Operational recommendation — pick by workload pattern, not by the throughput table

The conclusion from this single set of measurements is sharp. The dGPU intuition "AMD GPU = ROCm" doesn't transfer to a unified-memory APU like Strix Halo. On APUs, the top-priority operational metric isn't "which backend is faster" — it's "which runtime doesn't burn the CPU when it's not working."

Workload	Recommended backend	Why
24/7 server (Discord bot, API serving, internal RAG)	Windows Ollama (Vulkan)	0% idle CPU, interrupt-based GPU wait
Burst batch (long-prompt batches, reranking)	HIP llama-server — spin up, then kill immediately	Keep the throughput edge, minimize polling exposure
Realtime single-user chat	Either — the 12 t/s warm-tg gap is barely perceptible	Decide by idle-time share

Tunables that did NOT release the HSA spin

HSA_ENABLE_INTERRUPT=1 — no effect

HSA_ENABLE_DXG_DETECTION=1 — required for device discovery, unrelated to the spin

GGML_CUDA_NO_PINNED=1 — no effect

-ngl 16 (partial offload) — spin persists, and CPU↔GPU layer hopping makes it slower than CPU-only

Killing all external clients — the spin is independent of inbound traffic

Triggers for revisiting HIP (watch ROCm release notes for)

"HSA polling fix"

"gfx11 generic interrupt support"

"APU power efficiency"

If any of those land, re-measure immediately. Until then, pick the backend to match the form factor.

🛡️

[Image #7] Workload-pattern-based backend choice — 24/7 idle vs burst batch

file: 2026-05-31-strixhalo-workload-decision.png

📚 Closing — measurement limits and what would change the verdict

Every number above is a snapshot from 2026-05-31, a single ASUS Strix Halo machine (BIOS UMA 65 GB), measured on ROCm 7.2.4. The conclusion is subject to update if any of the following changes:

ROCm releases: If patches for HSA polling or gfx11 interrupt support land, the gap could close.

Model variety: Only Qwen3-30B-A3B (MoE active 3B) measured. Ratios will differ for dense models.

Power measurement: Estimated from CPU core utilization and an assumed per-core wattage, not a physical wattmeter. A follow-up with HWiNFO / Ryzen Master is warranted.

Other Strix Halo boards: BIOS UMA settings and firmware revisions may affect reproducibility.

This is one case study in how a hardened dGPU intuition ("AMD GPU = ROCm is the answer") can invert an APU's operating metrics entirely. As unified-memory APUs like Strix Halo become more common, this kind of runtime-cost evaluation is likely to become standard.

Reference categories

Specific URLs change frequently across ROCm, llama.cpp, and Ollama releases, so I leave search keywords instead of brittle anchors:

AMD Strix Halo / gfx1151 ROCm support — official ROCm release notes "Supported GPUs / gfx1151" section, Phoronix "Strix Halo ROCm" benchmark series

llama.cpp HIP / Vulkan backend comparisons — ggml-org/llama.cpp repository under the Vulkan label, the HIP_UMA introduction PR

HSA runtime polling on APU/iGPU — ROCm issue tracker for "HSA polling APU CPU usage", discussions around libhsa-runtime64's wait_until_busy

Ollama Vulkan backend origins — Ollama release notes for the "Vulkan" keyword, the OLLAMA_VULKAN env-var docs

Measurement metadata (for reproduction)

Test machine: ASUS Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S, gfx1151, 64 GB RAM, BIOS UMA 65 GB)

Date: 2026-05-31

HIP build: llama.cpp commit 2d9b7c8 (-DGGML_HIP=ON -DGGML_HIP_UMA=ON -DAMDGPU_TARGETS=gfx1151)

Vulkan run: Windows-native Ollama, OLLAMA_VULKAN=1, HIP_VISIBLE_DEVICES=-1, ROCR_VISIBLE_DEVICES=-1

Model: Qwen3-30B-A3B-Instruct-2507 Q4_K_M (17.28 GiB, MoE active 3B)

Measured on 2026-05-31 · ROCm 7.2.4 · llama.cpp commit 2d9b7c8 · Qwen3-30B-A3B-Instruct-2507 Q4_K_M