🎙

AgentZero Lite v0.6.1 — Voice in, OS out: when your CLI grows hands

A minimalist Windows shell for the AI era just learned two new tricks: it can drive your desktop, and you can drive it with your voice. Mission M0014 wired in a symmetric OS-Control surface — every read-only capability is exposed both as a CLI verb (AgentZeroLite.exe -cli os …) and as an LLM tool (os_*). Mission M0008 had already brought a fully on-device voice pipeline (Whisper + Vulkan, Akka Streams, command interception). Put them together and the on-device assistant can self-operate the computer hands-free — within a deliberate approval gate.

What shipped

Two capabilities, one runway.
| Capability | Mission | What it gives you |
| --- | --- | --- |
| OS-Control surface | M0014 | List windows, screenshot, walk the UIA tree, click, type, activate, from CLI or from the LLM. |
| Voice pipeline | M0008 (and follow-ups) | Mic → Whisper STT (CPU+Vulkan) → command classifier → bot dispatch. Stop-words, "summarize terminal", and pass-through. |
Each verb, each tool, each voice command leaves a JSONL audit line. Nothing leaves the machine. Read-only is unconditional; input simulation is gated. The point is symmetry: no matter which channel the request arrives on — keyboard, voice, remote CLI, LLM tool call — the same facade enforces the same rules.

Architecture at a glance

The voice signal and the keyboard signal converge on the same dispatch surface. From there an "AI Mode" toggle decides whether the text goes straight to the focused terminal or runs through the on-device LLM agent loop, which can itself call the OS-Control facade as a tool.
flowchart LR
  subgraph IN[Inputs]
    MIC[🎙 Mic]
    KBD[⌨ Keyboard]
    RCLI[📡 Remote CLI<br/>WM_COPYDATA]
  end
  subgraph VP[Voice pipeline · M0008]
    VAD[VAD + Akka Streams<br/>VoiceSegmenterFlow]
    STT[Whisper.net<br/>CPU + Vulkan]
    VCI[VoiceCommand<br/>Interceptor]
  end
  subgraph BOT[AgentBot · UI gateway]
    MODE{AI Mode?}
    LOOP[AgentLoopActor<br/>IAgentLoop]
  end
  subgraph TOOLS[Toolbelt · M0013/M0014]
    TR[Terminal-relay<br/>tools]
    OS[OS-Control facade<br/>OsControlService]
  end
  subgraph WIN[Windows surface]
    UIA[UI Automation tree]
    GDI[GDI BitBlt → PNG]
    INP[SendInput<br/>gated]
  end
  MIC --> VAD --> STT --> VCI
  VCI -- "stop / summarize" --> BOT
  VCI -- pass-through --> BOT
  KBD --> BOT
  RCLI --> BOT
  MODE -- off --> TR
  MODE -- on --> LOOP
  LOOP --> TR
  LOOP --> OS
  OS --> UIA
  OS --> GDI
  OS -.gate.-> INP
  OS --> AUDIT[(tmp/os-cli/audit/<br/>YYYY-MM-DD.jsonl)]
A few things this picture says explicitly:
  • One facade, many callers. OsControlService is reached by the CLI dispatcher, the in-process LLM toolbelt, and the E2E smoke probe. They all stamp a caller field on the audit line, so a forensic reader can tell which channel originated a click.
  • The LLM never sees pixels. Screenshots return a path, not bytes. The agent loop's context never inflates with image data, and the prompt-injection attack surface stays small.
  • Voice is just another input. It funnels into the same bot dispatch the keyboard uses. Once the transcript arrives, AI Mode + tool calls decide what happens — including whether os_* tools fire.

The symmetric surface

Most OS-Control features are exposed twice, once as a CLI verb and once as an LLM tool; the rest are CLI-only for now:
| CLI verb | LLM tool | Side-effect | Gate |
| --- | --- | --- | --- |
| os list-windows | os_list_windows | enum | none |
| os get-window-info <hwnd> | (CLI only) | read | none |
| os screenshot | os_screenshot | GDI BitBlt → PNG file | none |
| os element-tree <hwnd> | os_element_tree | UIA walk | none |
| os text-capture <hwnd> | (CLI only, Phase A) | UIA Name aggregation | none |
| os dpi | (CLI only) | metrics | none |
| os activate <hwnd> | os_activate | Z-order change | none |
| os mouse-click <x> <y> | os_mouse_click | input simulation | gated |
| os mouse-move / mouse-wheel | (CLI only, Phase B) | input simulation | gated |
| os keypress <spec> | os_key_press | input simulation | gated |
| os audit [--last N] | (CLI only) | inspect audit log | n/a |
Read-only verbs are unconditional. Input simulation requires either:
  1. --allow-input on the CLI verb (per-call), or
  2. Process env var AGENTZERO_OS_INPUT_ALLOWED=1 (any of 1 / true / yes).
There is no per-call human prompt. The gate is binary by design — automation friction would defeat the purpose. The audit log gives the operator forensic visibility after the fact.
When a gated tool is denied, callers receive:
{ "ok": false, "error": "OS input simulation is gated. Set environment variable AGENTZERO_OS_INPUT_ALLOWED=1 (LLM/GUI) or pass --allow-input on the CLI verb to enable.", "verb": "<tool name>" }
The LLM system prompt explicitly forbids retrying after a gate denial. The model must report the failure via done rather than loop.
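The gate described above can be sketched in a few lines of Python. This is an illustration, not the shipped C# code: the function names input_allowed and run_gated are hypothetical, the case-insensitive comparison is an assumption, and the shape of the success envelope is invented for the example. Only the environment variable name, the accepted values, and the denial envelope come from the text.

```python
import os

TRUTHY = {"1", "true", "yes"}  # values the article lists as opening the gate

GATE_MESSAGE = (
    "OS input simulation is gated. Set environment variable "
    "AGENTZERO_OS_INPUT_ALLOWED=1 (LLM/GUI) or pass --allow-input "
    "on the CLI verb to enable."
)


def input_allowed(allow_input_flag=False, env=None):
    """Return True when the per-call flag or the env var opens the gate."""
    env = os.environ if env is None else env
    value = env.get("AGENTZERO_OS_INPUT_ALLOWED", "")
    # Assumption: the comparison ignores case and surrounding whitespace.
    return allow_input_flag or value.strip().lower() in TRUTHY


def run_gated(verb, action, allow_input_flag=False, env=None):
    """Run `action` only if the gate is open; otherwise return the denial envelope."""
    if not input_allowed(allow_input_flag, env):
        return {"ok": False, "error": GATE_MESSAGE, "verb": verb}
    # Hypothetical success envelope; the real one is not shown in the article.
    return {"ok": True, "error": None, "verb": verb, "result": action()}
```

Because the check is a pure function of the flag and the environment, every caller (CLI, LLM toolbelt, E2E probe) can share it without knowing which channel the request arrived on.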

How the voice path classifies a transcript

The voice transcript doesn't go directly to the LLM. It first hits a deliberately small classifier (VoiceCommandInterceptor) so a few common in-app commands stay deterministic.
flowchart TD
  T[Whisper transcript] --> N{IsNullOrWhiteSpace?}
  N -->|yes| P[PassThrough no-op]
  N -->|no| TR2[Trim + strip trailing punctuation]
  TR2 --> S{exact match<br/>StopPhrases?}
  S -->|yes| STOP[StopSpeaking<br/>cancel TTS, skip LLM]
  S -->|no| K{contains 터미널 + 요약<br/>or terminal + summar?}
  K -->|yes| SUM[SummarizeTerminal<br/>snapshot + ask LLM]
  K -->|no| PT[PassThrough<br/>forward to bot]
The rules are intentionally simple: short transcript-level matching, no NLP. Whisper output already varies across runs (punctuation, casing, trailing periods), so predictability beats sophistication for the "voice as remote control" use case. Stop-phrases include 그만 ("stop"), 그만해 ("stop it"), stop, and shut up. Summarize-terminal needs both keywords, 터미널 ("terminal") and 요약 ("summary"), in either order, with English fallbacks because Whisper occasionally transcribes Korean tech jargon in English even with lang=ko.
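The classification rules can be sketched as a small pure function. This is a Python approximation of the behaviour described for VoiceCommandInterceptor, not its actual source: the classify name, the exact punctuation set, and the lower-casing step are assumptions, and the stop-phrase set holds only the four phrases the text names.

```python
import enum


class VoiceCommand(enum.Enum):
    PASS_THROUGH = "PassThrough"
    STOP_SPEAKING = "StopSpeaking"
    SUMMARIZE_TERMINAL = "SummarizeTerminal"


# Only the phrases named in the article; the real list may be longer.
STOP_PHRASES = {"그만", "그만해", "stop", "shut up"}
# Assumed set of trailing characters to strip.
TRAILING_PUNCTUATION = ".!?,~ "


def classify(transcript):
    """Map a Whisper transcript to one of three deterministic commands."""
    if transcript is None or not transcript.strip():
        return VoiceCommand.PASS_THROUGH  # empty input is a no-op
    text = transcript.strip().rstrip(TRAILING_PUNCTUATION).lower()
    if text in STOP_PHRASES:  # exact match only
        return VoiceCommand.STOP_SPEAKING
    korean = "터미널" in text and "요약" in text
    english = "terminal" in text and "summar" in text
    if korean or english:  # both keywords, either order
        return VoiceCommand.SUMMARIZE_TERMINAL
    return VoiceCommand.PASS_THROUGH
```

Note that stop-phrases require an exact match after normalisation, so a sentence that merely contains "stop" still passes through to the bot.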

Where the artefacts go

Everything OS-Control does lands under tmp/os-cli/ (gitignored):
tmp/os-cli/
├── audit/2026-05-08.jsonl      one line per call (cli, llm, e2e)
├── screenshots/2026-05-08/     PNG outputs (HH-mm-ss-fff prefix)
└── e2e/2026-05-08.log          E2E smoke summary
Audit JSONL schema:
{
  "ts": "2026-05-08T11:42:09.123+09:00",
  "caller": "cli | llm | e2e",
  "verb": "list_windows | screenshot | mouse_click | …",
  "args": { },
  "ok": true,
  "error": null
}
os audit --last 50 prints today's tail. The audit is best-effort: failures inside the recorder swallow the exception so a transient disk error never breaks the actual operation. It's a forensic record after the fact, not a transactional gate.
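A best-effort recorder of this kind might look like the following Python sketch. The function names record_audit and tail_audit are hypothetical, and the real recorder is part of the C# facade; the sketch only mirrors the schema, the per-day file layout, and the swallow-on-failure behaviour described above.

```python
import json
from datetime import datetime
from pathlib import Path


def record_audit(root, caller, verb, args, ok, error=None, now=None):
    """Append one JSONL audit line; never raise, so the audited call keeps going."""
    try:
        now = now or datetime.now().astimezone()
        audit_dir = Path(root) / "audit"
        audit_dir.mkdir(parents=True, exist_ok=True)
        line = {
            "ts": now.isoformat(timespec="milliseconds"),
            "caller": caller,
            "verb": verb,
            "args": args,
            "ok": ok,
            "error": error,
        }
        path = audit_dir / f"{now:%Y-%m-%d}.jsonl"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(line, ensure_ascii=False) + "\n")
        return True
    except Exception:
        # Best-effort: a disk error must not break the actual operation.
        return False


def tail_audit(root, day, last=50):
    """Return the last `last` audit entries for `day` (YYYY-MM-DD)."""
    path = Path(root) / "audit" / f"{day}.jsonl"
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-last:] if line.strip()]
```

The trade-off is the one the text names: a swallowed failure means a call can go unrecorded, which is acceptable for a forensic log but would not be for a transactional gate.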

Use cases — what this combination unlocks

1. Voice-driven screen review while you keep typing in another tab

You're writing code in tab A; the build log is tailing in tab B. You say into the mic:
"터미널 작업 요약해 줘." ("Summarize the terminal work for me.")
VoiceCommandInterceptor classifies that as SummarizeTerminal. The bot snapshots the active terminal output and dispatches it to the on-device LLM with a "summarise" prompt. AI Mode + os_screenshot can optionally drop a grayscale PNG into tmp/os-cli/screenshots/today/ for the human reviewer. Your hands never leave the other keyboard.

2. Hands-free TTS interruption

Voice output (TTS) replies are now standard in AI Mode (M0011 follow-up landed alongside this), and they get long. Saying:
"그만." ("Stop.")
…hits the StopSpeaking branch, drains the in-flight TTS queue, and skips LLM dispatch entirely. There's no awkward "say the magic phrase exactly" moment: the classifier strips trailing punctuation and ignores case, so 그만. / 그만해 / stop! all match. (See commit 00610f6, fix(voice): gate AI-bubble TTS on mic-on; drain in-flight TTS at mic-off.)

3. Remote desktop check-up via CLI

The CLI is reachable from any shell on the same machine, and the -cli os verbs run in-process — no WM_COPYDATA hop, no GUI required. That makes remote inspection trivial:
# From a Claude tab (or any peer terminal) on the same box:
AgentZeroLite.exe -cli os list-windows --filter "Visual Studio"
AgentZeroLite.exe -cli os screenshot --hwnd 0x000A0234
AgentZeroLite.exe -cli os element-tree 0x000A0234 --depth 5
Pipe through ConvertFrom-Json and you have a self-describing JSON report of what's on the operator's desktop right now. CI scripts get the same surface. Read-only verbs need no opt-in; the operator authorises input simulation explicitly per-run.

4. LLM-driven UI inspection (without typing on your behalf)

In AI Mode you can ask:
"내 노트패드 열려 있어? 열려 있으면 내용 캡처해." ("Is my Notepad open? If it is, capture its contents.")
A grammar-constrained agent loop (LocalAgentLoop for Gemma 4 local / ExternalAgentLoop for OpenAI-compatible) emits:
  1. os_list_windows({ "title_filter": "Notepad" }) → returns hwnd list.
  2. os_screenshot({ "hwnd": <hwnd>, "grayscale": true }) → returns path.
  3. os text-capture (CLI only today, future LLM exposure) for the textual content.
  4. done with the operator-readable summary.
The LLM never receives pixels — it gets the file path and the textual extraction. If the operator wants to push back into Notepad (typing text, sending Ctrl+S), that crosses the gate. Without AGENTZERO_OS_INPUT_ALLOWED=1 the model gets a denial envelope and calls done with the error — no retry loop. The system prompt is explicit about this.
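The stop-on-denial behaviour can be illustrated with a small host-side loop. The article enforces the rule through the system prompt; the Python sketch below is a hypothetical equivalent in code, with run_plan and the execute callback invented for the example.

```python
def run_plan(tool_calls, execute):
    """Run tool calls in order; stop at the first failure and report it via done."""
    steps = []
    for name, args in tool_calls:
        result = execute(name, args)
        steps.append((name, result))
        if not result.get("ok", False):
            # No retry: surface the error and end the loop immediately.
            return {"tool": "done", "ok": False,
                    "error": result.get("error"), "steps": steps}
    return {"tool": "done", "ok": True, "error": None, "steps": steps}
```

A denied os_key_press therefore ends the run after a single attempt, and the later calls in the plan never execute.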

5. Self-piloting micro-flows on operator-approved sessions

When the operator turns the gate on for a focused work block:
$env:AGENTZERO_OS_INPUT_ALLOWED = "1"
…the LLM can chain os_activate → os_key_press → os_screenshot to produce true self-driving micro-tasks ("activate Notepad, type a date header, screenshot the result"). When the work block ends, Remove-Item Env:AGENTZERO_OS_INPUT_ALLOWED (or just close the shell) re-locks the gate. The audit log captures every call regardless of who fired it.
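For scripted work blocks, the same open-then-relock pattern can be scoped so the gate closes even if the script fails. The sketch below is a hypothetical Python helper (input_gate_open is not part of AgentZero Lite); it only sets and restores the environment variable named in the text, which child processes started inside the block would inherit.

```python
import os
from contextlib import contextmanager


@contextmanager
def input_gate_open(env=None, name="AGENTZERO_OS_INPUT_ALLOWED"):
    """Open the input gate for the duration of a with-block, then restore it."""
    env = os.environ if env is None else env
    previous = env.get(name)
    env[name] = "1"
    try:
        yield env
    finally:
        # Re-lock: restore the earlier value, or remove the variable entirely.
        if previous is None:
            env.pop(name, None)
        else:
            env[name] = previous
```

Usage would be a plain with input_gate_open(): block around whatever launches the gated verbs.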

Anti-patterns we deliberately ruled out

  • No screenshot bytes in LLM context. The toolbelt returns a path string. Inlining bytes wastes tokens and widens the prompt-injection attack surface.
  • No os_launch LLM tool. os_activate requires a pre-existing hwnd. The CLI has open-win, but exposing it to the LLM creates a recursion risk (LLM driving an instance of itself). If a future mission needs this, it lands behind its own approval flag.
  • No reused screenshot paths. Filenames embed millisecond precision (HH-mm-ss-fff) so two near-simultaneous captures don't collide.
  • No "trivial" call left unaudited. Even os_list_windows logs, because enumeration itself is a fingerprinting signal worth tracing.
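The collision-avoidance rule for screenshot paths can be sketched as follows. The directory layout and the HH-mm-ss-fff prefix come from the text; the screenshot_path name and the label suffix after the prefix are assumptions made for the example.

```python
from datetime import datetime
from pathlib import PurePosixPath


def screenshot_path(root, now, label="capture"):
    """Build a per-day screenshot path with a millisecond-precision prefix."""
    day = now.strftime("%Y-%m-%d")
    # HH-mm-ss-fff: strftime has no millisecond code, so derive it by hand.
    stamp = now.strftime("%H-%M-%S-") + f"{now.microsecond // 1000:03d}"
    # The "-label.png" suffix is hypothetical; only the prefix is documented.
    return PurePosixPath(root) / "screenshots" / day / f"{stamp}-{label}.png"
```

The caller hands this path string to the LLM, never the image bytes, which matches the first rule in the list above.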

Origin parity (AgentWin)

The AgentWin Origin project shipped a richer 15-verb / 9-LLM-tool surface. M0014 imported the 80% slice that fits Lite's identity:
  • Imported: list-windows, get-window-info, screenshot, element-tree, text-capture, dpi, activate, mouse-click, mouse-move, mouse-wheel, keypress.
  • Skipped: scroll-capture (folded into element-tree), copy <text> (clashes with existing copy verb), virtual-desktop service (a future, dedicated mission).
The lift was deliberate: Lite already carried 90% of the necessary Win32 P/Invoke (Project/AgentZeroWpf/NativeMethods.cs) and the ElementTreeScanner. M0014 added the wrappers, audit, gating, CLI dispatch, LLM bridge, and an E2E smoke probe.

Try it

Smoke-test a fresh build without driving any features:
pwsh Docs/scripts/launch-self-smoke.ps1 -Configuration Debug
Steps: list-windows → get-window-info → screenshot → element-tree → dpi. Exit 0 = all probes passed; PNG and audit lines are kept under tmp/os-cli/ for review.
The full installer + portable ZIP will land at:

Closing note

The release theme isn't "voice is on" or "OS automation is on." It's that the same facade governs every input vector, which is what makes self-operation safe to ship at all. Voice that types into a terminal is useful. Voice that asks a local LLM to inspect your desktop and drop a grayscale snapshot, with every call audited and input gated, is a different kind of useful — the kind where you can hand the keyboard to the assistant for a while and still know exactly what it touched.
