🇰🇷 Korean sister page: "Why ROCm Lost on Strix Halo Despite Being Faster: The 35% Power-Efficiency Reversal Caused by APU Runtime Polling" (Korean)
TL;DR: On AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S iGPU) running the same Qwen3-30B-A3B Q4_K_M model, HIP/ROCm wins on raw throughput (354 t/s at pp512 and 48 vs 36 t/s for warm generation; Vulkan's only prompt-processing figure is 78 t/s on a 16-token prompt), but simply keeping the model loaded burns about 2.5 CPU cores continuously because of HSA runtime polling. For 24/7 workloads where idle time dominates, Windows Ollama (Vulkan) is estimated to use about 35% less energy per token. A clean case study in why dGPU intuitions don't transfer to APUs.
[Image #1, Hero] ROCm and Vulkan facing off on top of a Strix Halo APU (file: 2026-05-31-strixhalo-hero.png)
📌 Intro: Strix Halo, a new "middle-ground" platform
Who this is for — Infra/ML engineers running local LLM workloads on Strix Halo / Ryzen AI MAX+ systems, and any backend decision-maker who has to validate the "AMD GPU means ROCm" intuition. If you ever picked an inference backend based on a single throughput table, this case study is for you.
AMD Strix Halo (Ryzen AI MAX+ 395 + Radeon 8060S iGPU, gfx1151) is neither a desktop dGPU nor a laptop iGPU — it's a new middle ground. With up to 65 GB of unified memory (UMA/GTT) you can fit a 30B-class LLM entirely on the GPU, and "run an MoE 30B on a single desktop" stops being marketing and becomes a measurable claim.
The trap is what happens the moment you carry the old dGPU intuition — "AMD GPU = ROCm is the answer" — onto this new variant. Your operational metrics invert. This article logs four measurement rounds comparing ROCm 7.2.4 + a self-built HIP llama.cpp against Windows-native Ollama (Vulkan), on the same model, on the same machine.
Quick glossary:
- APU (Accelerated Processing Unit): A single package that puts CPU and GPU behind one memory controller, so they share system RAM and one power and thermal budget. A dGPU (Discrete GPU) sits on a separate card connected via PCIe.
- UMA/GTT: The portion of system RAM that an APU exposes as GPU-visible memory. Here it is sized by a BIOS setting and fixed at boot; unlike a dGPU, there is no separate physical VRAM.
- HSA polling: When the CPU waits for GPU work to complete, instead of being woken by an OS interrupt, a CPU thread keeps tapping the queue to check. On a dGPU the cost hides; on an APU it shows up loud and clear.
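The difference between the two wait strategies in the last glossary entry can be reproduced in miniature on the CPU alone. This is only an illustrative sketch, assuming a Python `threading.Event` stands in for the GPU completion signal; it is not how the HSA runtime is implemented, and `cpu_cost_of_wait` is a name made up for this example.

```python
import threading
import time


def cpu_cost_of_wait(strategy: str, delay_s: float = 0.2) -> float:
    """Return CPU seconds this process spent waiting for a fake completion signal."""
    done = threading.Event()
    # The timer plays the role of the GPU finishing its work after delay_s.
    timer = threading.Timer(delay_s, done.set)
    start = time.process_time()
    timer.start()
    if strategy == "poll":
        # Busy-wait: keep re-checking the flag instead of sleeping.
        while not done.is_set():
            pass
    elif strategy == "block":
        # Event-driven wait: the thread sleeps until it is woken up.
        done.wait()
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    timer.join()
    return time.process_time() - start


if __name__ == "__main__":
    print(f"poll : {cpu_cost_of_wait('poll'):.3f} CPU-s")
    print(f"block: {cpu_cost_of_wait('block'):.3f} CPU-s")
```

The polling variant should report CPU time close to the full wait, while the blocking variant should stay near zero for the same wall-clock delay.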
🔬 Round 1 — VRAM recognition ("ROCm only sees 4 GB" — solved)
The first wall was the predictable one. With the default BIOS allocation, ROCm only sees the dedicated VRAM (4 GB), so the 30B model can't be GPU-offloaded. Increasing BIOS UMA/GTT to 65 GB brings rocminfo Pool 1 GLOBAL to 95.82 GiB, and llama-bench correctly enumerates the device.
[Image #2] BIOS UMA dial: from 4 GB to 65 GB (file: 2026-05-31-strixhalo-vram-dial.png)
rocminfo (gfx1151 agent, Pool 1):
  Segment: GLOBAL; FLAGS: COARSE GRAINED
  Size: 100,478,882 KB ≈ 95.82 GiB
llama-bench device discovery:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151)
  VRAM: 98123 MiB, free: 97044 MiB
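The unit conversion in the rocminfo excerpt is easy to double-check. A minimal sketch, assuming the `KB` that rocminfo prints are binary kilobytes (KiB), which is the reading that lines up with the MiB figure llama-bench reports:

```python
def kib_to_gib(kib: int) -> float:
    """Convert binary kilobytes to binary gigabytes."""
    return kib / 1024**2


pool_kib = 100_478_882  # Pool 1 GLOBAL size from the rocminfo excerpt

# Whole MiB, truncated the way a device listing would show it.
print(f"{pool_kib // 1024} MiB")
print(f"{kib_to_gib(pool_kib):.1f} GiB")
```

The truncated MiB value matches the 98123 MiB that llama-bench reports, and the pool comes out at about 95.8 GiB.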
This part is uncontroversial. "We can use it now" is the conclusion, and most articles would stop here.
⚡ Round 2 — Raw throughput: HIP wins outright
Running Qwen3-30B-A3B Q4_K_M with -ngl 99 (full GPU offload), the numbers strongly favored HIP.
Measurement | WSL HIP (ROCm 7.2.4) | Windows Ollama (Vulkan) | Note |
pp t/s (16 tok) | 56.25 | 77.7 | Short prompts: Vulkan wins |
pp t/s (128 tok) | 155.55 | (not measured) | Amortization starts |
pp t/s (512 tok) | 354.09 ± 15.72 | (not measured) | HIP's real strength |
tg t/s (warm) | 48.21 ± 16.39 | 36.2 | HIP ~33% faster when warm |
tg t/s (cold first run) | 36.06 | 36.2 | Essentially tied cold |
Model load time | ~27 s | ~24 s | Comparable |
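The ratios in the Note column follow directly from the table. A small sketch of that arithmetic, assuming the ± values are standard deviations (the way llama-bench reports them) and using a made-up helper name, `relative_gain`:

```python
def relative_gain(a: float, b: float) -> float:
    """How much faster a is than b, as a fraction of b."""
    return a / b - 1.0


hip_tg_warm, hip_tg_std = 48.21, 16.39
vulkan_tg = 36.2
hip_pp16, vulkan_pp16 = 56.25, 77.7

print(f"warm tg, HIP over Vulkan: {relative_gain(hip_tg_warm, vulkan_tg):+.0%}")
print(f"pp16, Vulkan over HIP:    {relative_gain(vulkan_pp16, hip_pp16):+.0%}")
# Spread relative to the mean: a large value means the warm figure is noisy.
print(f"warm tg spread (std/mean): {hip_tg_std / hip_tg_warm:.0%}")
```

With a spread of roughly a third of the mean, the warm generation lead is worth re-measuring over more runs before leaning on it.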
Reading just this table, the verdict is clear: "HIP dominates long-prompt processing and warm generation, so go ROCm." Stopping here would reinforce the conventional wisdom one more time.
[Image #3] HIP vs Vulkan throughput bars: the surface verdict (file: 2026-05-31-strixhalo-throughput.png)
🚨 Round 3: Wait, the CPU is doing what? (idle 254% spin)
Trouble started right after the throughput run. With llama-server holding the model and zero requests inbound, a glance at top blew everything up.
Idle after model load (no requests, 15 s sampling)
  t=3s  llama-server CPU: 265% RSS: 454 MB
  t=6s  llama-server CPU: 259% RSS: 454 MB
  t=9s  llama-server CPU: 253% RSS: 454 MB
  t=12s llama-server CPU: 249% RSS: 454 MB
  t=15s llama-server CPU: 245% RSS: 454 MB
  → avg ~254% (2.5 cores continuously burning)
Idle right after 128-token generation (30 s sampling)
  t=5s  CPU: 248%
  t=10s CPU: 244%
  t=15s CPU: 240%
  t=20s CPU: 237%
  t=25s CPU: 234%
  t=30s CPU: 231%
  system jiffies: busy +5089 / total +81108 = 6.27% of the 32-thread (16-core) system → ~2.0 cores. The spin doesn't release.
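Readings like the ones above can also be collected without top. The sketch below assumes a Linux (or WSL) `/proc` filesystem and derives a process's CPU percentage from the `utime` and `stime` tick counters in `/proc/<pid>/stat`; the helper names are made up for this example, and the last lines simply redo the jiffies arithmetic from the log.

```python
import os
import time


def cpu_ticks(stat_line: str) -> int:
    """Sum utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split only after the last ')'.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before: int, ticks_after: int, seconds: float, hz: int) -> float:
    """CPU usage in percent of one logical CPU (100.0 means one CPU fully busy)."""
    return 100.0 * (ticks_after - ticks_before) / hz / seconds


def sample_pid(pid: int, interval_s: float = 3.0) -> float:
    """Measure one process over interval_s seconds."""
    hz = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, hz)


# System-wide cross-check using the jiffies deltas quoted in the log.
busy, total, logical_cpus = 5089, 81108, 32
print(f"{busy / total:.2%} busy, about {busy / total * logical_cpus:.1f} CPUs")
```

Calling `sample_pid` on the llama-server PID every few seconds would give a series comparable to the top readings, with 250% meaning two and a half logical CPUs fully busy.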
A perf record/report of the hot path pointed to a single source: libhsa-runtime64.so.1.18.70204, offsets 0x9ab00–0x9ad60 (0x260 = 608 bytes), the HSA runtime's GPU completion-queue polling loop.
The control comparison, Windows Ollama (Vulkan) under identical conditions, looked like this:
  ollama main (pid 16372) CPU: 0.0% RSS: 104 MB
  ollama UI (pid 10212) CPU: 0.0% RSS: 403 MB
  => avg idle CPU: 0.0%
  During inference: runner CPU ~65% of 1 core (~2% of the 32-thread system)
  Post-generation idle: 0.0%
[Image #4] HSA runtime polling: CPU cores tapping an empty queue forever (file: 2026-05-31-strixhalo-hsa-polling.png)
I tried every escape hatch I could find: HSA_ENABLE_INTERRUPT=1, HSA_ENABLE_DXG_DETECTION=1, GGML_CUDA_NO_PINNED=1, and partial offload (-ngl 16). None released the spin. In this setup the polling routine behaves as if it were hard-coded inside the HSA runtime: none of these environment variables reached it.
🧩 Why dGPUs hide this: same polling, different die
HSA polling exists everywhere ROCm runs, but on a dGPU box it rarely gets noticed. The workload usually keeps the GPU busy enough that the polling thread blends into the noise, and the dGPU sits on a separate card with its own memory, power, and cooling, so a few spinning host cores look small next to what the card itself draws.
APUs are different. CPU and GPU share one package, one power and thermal budget, and the same memory controller. The polling thread eats CPU cores while the GPU sits idle, and every watt it burns comes out of the budget the GPU shares. Same polling, two completely different costs: easy to overlook on a dGPU, 2.5 cores burned on an APU.
[Image #5] dGPU vs APU die layout: where polling hides vs. where it shows (file: 2026-05-31-strixhalo-die-comparison.png)
Vulkan, on the same physical GPU, uses the OS's regular GPU completion-wait mechanism (interrupt/event-driven). That's why it hits 0% idle CPU. The difference is invisible in a throughput comparison and only shows up when you measure the whole runtime.
🔋 Round 4 — Convert to energy per token, and the ranking flips
Even when throughput is comparable, adding the polling cost on top changes the picture. Assuming roughly 4 W per active Zen 5 core, here are the estimates:
Scenario | HIP extra power | Vulkan extra power | Gap |
Idle after model load | +~10 W constant (2.5-core spin) | ~0 W | HIP's standing cost |
Active inference (1s, GPU equal) | +~10 W (CPU spin compounds) | +~2–3 W | HIP +7–8 W |
Estimated J/token (GPU 40 W shared) | ~1.94 J/tok | ~1.25 J/tok | Vulkan ~35% less |
24h low-frequency server (idle ≥95%) | +~250 Wh/day | baseline | ~90 kWh/year more |
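The estimates in the table can be re-derived from a handful of inputs. The sketch below is only that: it assumes the article's ~4 W per spinning core, treats the polling cost as constant around the clock, and takes the two J/token figures as given rather than deriving them, since the wattage behind them is not itemized in the text. The 36.2 t/s rate used for the back-calculation is also an assumption (the cold generation rate where both backends tie), and all function names are illustrative.

```python
def polling_watts(cores_spinning: float, watts_per_core: float = 4.0) -> float:
    """Extra power drawn by cores that spin while waiting on the GPU."""
    return cores_spinning * watts_per_core


def joules_per_token(total_watts: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return total_watts / tokens_per_s


def implied_watts(j_per_token: float, tokens_per_s: float) -> float:
    """Invert joules_per_token to see what power a J/token figure implies."""
    return j_per_token * tokens_per_s


extra_w = polling_watts(2.5)              # standing cost of the idle spin
wh_per_day = extra_w * 24                 # W x h = Wh
kwh_per_year = wh_per_day * 365 / 1000

hip_j, vulkan_j = 1.94, 1.25              # J/token row of the table
saving = 1 - vulkan_j / hip_j             # Vulkan's relative saving

print(f"polling: +{extra_w:.0f} W, {wh_per_day:.0f} Wh/day, {kwh_per_year:.0f} kWh/year")
print(f"Vulkan uses {saving:.1%} less energy per token")
print(f"implied draw at 36.2 t/s: HIP {implied_watts(hip_j, 36.2):.0f} W, "
      f"Vulkan {implied_watts(vulkan_j, 36.2):.0f} W")
```

A constant 10 W works out to 240 Wh/day and roughly 88 kWh/year, in line with the table's rounded ~250 Wh/day and ~90 kWh/year. Read backwards at about 36 t/s, the J/token row corresponds to roughly 70 W versus 45 W of total draw.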
The gap is decided by workload pattern. For burst-batch workloads where the GPU stays pinned, polling cost gets absorbed and HIP's throughput edge survives. But for Discord bots, API servers, internal RAG gateways — anything with ≥95% idle ratio — polling cost dominates total usage cost. The annual delta of ~90 kWh isn't huge in dollars (a few tens of dollars at typical electricity rates), but the heat, fan noise, and chassis-life impact pile on top.
[Image #6] Energy-per-token seesaw: polling tips the balance even in equal-throughput territory (file: 2026-05-31-strixhalo-energy-seesaw.png)
🛠️ Operational recommendation: pick by workload pattern, not by the throughput table
The conclusion from this single set of measurements is sharp. The dGPU intuition "AMD GPU = ROCm" doesn't transfer to a unified-memory APU like Strix Halo. On APUs, the top-priority operational metric isn't "which backend is faster" — it's "which runtime doesn't burn the CPU when it's not working."
Workload | Recommended backend | Why |
24/7 server (Discord bot, API serving, internal RAG) | Windows Ollama (Vulkan) | 0% idle CPU, interrupt-based GPU wait |
Burst batch (long-prompt batches, reranking) | HIP llama-server — spin up, then kill immediately | Keep the throughput edge, minimize polling exposure |
Realtime single-user chat | Either — the 12 t/s warm-tg gap is barely perceptible | Decide by idle-time share |
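The "spin up, then kill immediately" pattern for burst batches can be wrapped so the server never outlives the batch. A minimal sketch using only the Python standard library; the `llama-server` command line in the usage comment (including the model path) is a placeholder, and `run_batch` is hypothetical:

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def short_lived_server(cmd: list[str], stop_timeout_s: float = 10.0):
    """Start a server process and guarantee it is gone when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        proc.terminate()                 # polite stop first
        try:
            proc.wait(timeout=stop_timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()                  # force it if it does not exit in time
            proc.wait()


# Usage (placeholder command and hypothetical batch function):
# with short_lived_server(["llama-server", "-m", "model.gguf", "-ngl", "99"]):
#     run_batch()
```

A real batch would also need to wait for the model to finish loading (about 27 s in the measurements above) before sending requests.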
Tunables that did NOT release the HSA spin
- HSA_ENABLE_INTERRUPT=1: no effect
- HSA_ENABLE_DXG_DETECTION=1: required for device discovery, unrelated to the spin
- GGML_CUDA_NO_PINNED=1: no effect
- -ngl 16 (partial offload): CPU↔GPU layer hopping makes it slower than CPU-only
- Killing all external clients: the spin is independent of inbound traffic
Triggers for revisiting HIP (watch ROCm release notes for)
- "HSA polling fix"
- "gfx11 generic interrupt support"
- "APU power efficiency"
If any of those land, re-measure immediately. Until then, pick the backend to match the form factor.
[Image #7] Workload-pattern-based backend choice: 24/7 idle vs burst batch (file: 2026-05-31-strixhalo-workload-decision.png)
📚 Closing: measurement limits and what would change the verdict
Every number above is a snapshot taken on 2026-05-31 on a single ASUS Strix Halo machine (BIOS UMA 65 GB) with ROCm 7.2.4, with HIP running under WSL and Vulkan under native Windows. The conclusion is subject to update if any of the following changes:
- ROCm releases: If patches for HSA polling or gfx11 interrupt support land, the gap could close.
- Model variety: Only Qwen3-30B-A3B (MoE active 3B) measured. Ratios will differ for dense models.
- Power measurement: Estimated from CPU core utilization, not a physical wattmeter. A follow-up with HWInfo / Ryzen Master is warranted.
- Other Strix Halo boards: BIOS UMA settings and firmware revisions differ, which may affect reproducibility.
This is one case study in how a hardened dGPU intuition, "AMD GPU = ROCm is the answer", can invert an APU's operating metrics entirely. As unified-memory APUs like Strix Halo become more common, this kind of runtime-cost evaluation is likely to become standard practice.
Reference categories
Specific URLs change frequently across ROCm, llama.cpp, and Ollama releases, so I leave search keywords instead of brittle anchors:
- AMD Strix Halo / gfx1151 ROCm support: official ROCm release notes "Supported GPUs / gfx1151" section, Phoronix "Strix Halo ROCm" benchmark series
- llama.cpp HIP / Vulkan backend comparisons: ggml-org/llama.cpp repository under the Vulkan label, the HIP_UMA introduction PR
- HSA runtime polling on APU/iGPU: ROCm issue tracker for "HSA polling APU CPU usage", discussions around libhsa-runtime64's wait_until_busy
- Ollama Vulkan backend origins: Ollama release notes for the "Vulkan" keyword, the OLLAMA_VULKAN env-var docs
Measurement metadata (for reproduction)
- Test machine: ASUS Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S, gfx1151, 64 GB RAM, BIOS UMA 65 GB)
- Date: 2026-05-31
- HIP build: llama.cpp commit 2d9b7c8 (-DGGML_HIP=ON -DGGML_HIP_UMA=ON -DAMDGPU_TARGETS=gfx1151)
- Vulkan run: Windows-native Ollama, OLLAMA_VULKAN=1, HIP_VISIBLE_DEVICES=-1, ROCR_VISIBLE_DEVICES=-1
- Model: Qwen3-30B-A3B-Instruct-2507 Q4_K_M (17.28 GiB, MoE active 3B)
Measured on 2026-05-31 · ROCm 7.2.4 · llama.cpp commit 2d9b7c8 · Qwen3-30B-A3B-Instruct-2507 Q4_K_M