TL;DR — In AgentZero Lite, the voice channel is not a chat input. It is the captain's bridge for a fleet of terminal AIs. You speak; an on-device Gemma 4 with a GBNF-locked tool catalog decides which terminal to read, which terminal to write, which key to send, when to wait; and the Akka.NET actor mesh carries those decisions to N concurrent CLI agents — Claude Code, Codex, Cursor CLI, or a plain shell. Same captain, swappable crew. Same bridge, swappable language model behind it. The point of this chapter is to show the seam: how voice becomes a function call, and how a function call becomes a keystroke into another agent's terminal.
This is a follow-up to Adding Providers and a Voice — AgentZero Lite Expands Beyond a Single Local Model. That chapter added STT/TTS as new modalities behind the same gateway. This one shows what the voice channel is for.
The shape of the feature — UI view
You sit at the AgentBot window. The mic toggle is in the input strip. Your workspace already has 2–4 terminal tabs open, each running its own AI CLI (Claude Code in tab 1, Codex in tab 2, a plain PowerShell in tab 3). You say:
"Tell Claude to refactor the auth module and tell Codex to write the tests for it. When Claude finishes, paste its diff into Codex."
The bridge picks it up. Here is what the operator sees:
┌──────────────────────────────────── AgentBot ─────────────────────────────────┐
│  Mode: [AI ▾]   Backend: Gemma 4 E4B (local)        Reactor: ▶ running  R3/12 │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  🎙 YOU — "Tell Claude to refactor the auth module and tell Codex…"           │
│                                                                               │
│  🤖 REACTOR — Thinking ░░░░░░░░░ ▓                                            │
│     ↳ Tool: list_terminals → 3 terminals found                                │
│     ↳ Tool: send_to_terminal #1 (Claude) "refactor the auth module…"          │
│     ↳ Tool: send_to_terminal #2 (Codex)  "write tests for auth…"              │
│     ↳ Tool: wait 8s                                                           │
│     ↳ Tool: read_terminal #1 → captured 142 lines                             │
│     ↳ Tool: send_to_terminal #2 "<paste diff from Claude>"                    │
│     ↳ Tool: done                                                              │
│                                                                               │
│  🤖 REACTOR — done in 4 turns · 11.2s                                         │
│                                                                               │
├───────────── Workspace tabs (each is an independent ConPTY) ──────────────────┤
│  ┌────[1] Claude Code──┬──[2] Codex──┬──[3] PS──┬──[4] +]──────────────────┐  │
│  │ > /refactor auth    │ > write tests for…    │ PS> _    │                │  │
│  │ ✓ refactored 3 files│ writing test_auth.py  │          │                │  │
│  │ diff: 142 lines     │ pytest collected 17   │          │                │  │
│  └─────────────────────┴───────────────────────┴──────────┴────────────────┘  │
│                                                                               │
│  [🎙 listening · Whisper.net small]  ▮▮▮▮▮▯▯▯▯ amplitude · VAD: speaking      │
│                                                                               │
│  > _                                                                          │
└───────────────────────────────────────────────────────────────────────────────┘
Nothing on this screen is a metaphor. The mic strip is a real UserControl (VoiceWaveformIndicator.xaml.cs, a 9-bar cyan→magenta gradient driven by VAD amplitude every 50 ms). The reactor log is real ReactorProgress actor messages streamed to the UI thread. The terminal tabs are real ConPTY hosts: each one literally runs claude, codex, or pwsh.exe as a child process. The captain's instructions arrive at those processes as bytes written to a pipe. The terminals don't know they're being driven by an LLM; from their point of view, somebody is just typing.
That last sentence is the load-bearing one. It is why the system works for any CLI AI you can install.
The mental model — captain, comms, crew
Three roles. Once you see them, the architecture is obvious.
| Role | Component | What it owns | What it doesn't |
| --- | --- | --- | --- |
| Captain | The user, via voice | Intent. The desired outcome. | Tactical decisions about which tool to call when. |
| Comms officer | On-device Gemma 4 E4B (or Nemotron 8B fallback) running inside the Reactor actor | Translating intent into a sequence of typed tool calls. | Knowing what each terminal AI does. It only sees terminal IDs and text streams. |
| Crew | N TerminalActors wrapping ConPTY sessions running Claude / Codex / shell | Doing the actual work in their own process. | Each other. They don't share memory or know they have peers. |
The captain speaks in natural language. The comms officer converts that to at most six verbs: list_terminals, read_terminal, send_to_terminal, send_key, wait, done. The crew receives plain bytes and produces plain bytes back. That narrow waist is what lets the same UX pilot any future terminal AI without changing the comms officer.
Pipeline layer 1 — voice becomes text
Voice is the riskiest seam in the whole stack: any latency or transcription error here ruins the captain's bridge feeling. The pipeline is built around two principles that fell out of two months of experimentation: dual-VAD gating (so we don't transcribe silence or send half-utterances), and a factory-of-engines behind one interface (so we can swap Whisper.net / OpenAI Whisper / Webnori-Gemma audio / local Gemma audio without code changes downstream).
flowchart LR
  Mic[🎙 NAudio<br/>16 kHz mono PCM]
  Mic --> RMS[Frame RMS<br/>every ~50 ms]
  RMS --> Meter[UI amplitude meter<br/>9-bar gradient]
  RMS --> VAD[Utterance VAD<br/>~40 silent frames<br/>= 2 s threshold]
  VAD -->|UtteranceStarted| Buf[ring buffer<br/>collect PCM]
  VAD -->|UtteranceEnded| Drain[ConsumePcmBuffer]
  Drain --> STT{ISpeechToText<br/>factory}
  STT -->|local CPU| W[Whisper.net small<br/>~487 MB]
  STT -->|cloud REST| O[OpenAI Whisper-1]
  STT -->|local server| WG[Webnori-Gemma audio]
  STT -->|on-device LLM| LG[Local Gemma audio]
  W --> Txt[transcript text]
  O --> Txt
  WG --> Txt
  LG --> Txt
  Txt --> Disp[Dispatcher.Invoke<br/>txtInput on UI thread]
  Disp --> Send[SendCurrentInput<br/>mode-aware]
  Send -->|Chat/Key mode| Term[active terminal write]
  Send -->|AI mode| Bot[AgentBotActor.StartReactor]
  classDef pri fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Bot,Send pri
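The utterance gate in that diagram is small enough to sketch. What follows is a Python illustration, not the project's C# VoiceCaptureService: the SilenceGate class, the RMS threshold of 500, and the 16-bit little-endian framing are all assumptions. Only the 40-silent-frame rule comes from the text, and the sketch covers the utterance gate alone, not the UI amplitude meter.

```python
import math
import struct
from typing import Optional

SILENT_FRAMES_TO_END = 40   # from the text: ~40 frames of ~50 ms each, about 2 s
RMS_THRESHOLD = 500.0       # assumed loudness cutoff for 16-bit PCM; tune per mic


def frame_rms(pcm: bytes) -> float:
    """Root-mean-square amplitude of one frame of 16-bit little-endian mono PCM."""
    count = len(pcm) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class SilenceGate:
    """Reports 'started' on the first loud frame, 'ended' after enough silent ones."""

    def __init__(self) -> None:
        self.speaking = False
        self.silent_run = 0
        self.buffer = bytearray()   # PCM collected for the current utterance

    def feed(self, pcm: bytes) -> Optional[str]:
        loud = frame_rms(pcm) >= RMS_THRESHOLD
        if not self.speaking:
            if not loud:
                return None          # idle silence is never buffered or transcribed
            self.speaking = True
            self.silent_run = 0
            self.buffer = bytearray(pcm)
            return "started"
        self.buffer.extend(pcm)
        self.silent_run = 0 if loud else self.silent_run + 1
        if self.silent_run >= SILENT_FRAMES_TO_END:
            self.speaking = False
            return "ended"           # caller drains self.buffer and hands it to STT
        return None
```

A single loud frame resets the silent run, so a short pause inside a sentence does not end the utterance. That reset is the knob the 40-frame tuning trades against perceived lag.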
Implementation pointers:
- Project/AgentZeroWpf/Services/Voice/VoiceCaptureService.cs: dual-VAD service, exposes AmplitudeChanged, UtteranceStarted, and UtteranceEnded events. The 40-frame silence threshold (line 41) is empirically tuned: shorter thresholds produce clips that bleed into the next utterance, and longer ones feel laggy on the bridge.
- Project/AgentZeroWpf/Services/Voice/VoiceRuntimeFactory.cs: the BuildStt(VoiceSettings) switch returns the right ISpeechToText implementation. Adding a new engine is one new branch and one new class.
- Project/AgentZeroWpf/UI/APP/AgentBotWindow.Voice.cs: wires capture events to the input box and the AUTO toggle. AUTO ON means each UtteranceEnded automatically calls SendCurrentInput; AUTO OFF means the user still has to press Enter. STT engines are warmed up off-thread to hide cold-start latency (the Whisper.net small model is a ~487 MB load on CPU).
- The mode awareness is critical — the same voice transcript routes to a terminal write in Chat/Key mode but to the AIMODE reactor in AI mode. Same input, different captain.
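The two ideas in these pointers, one factory in front of interchangeable engines and one mode-aware send, can be sketched together. This is a hedged Python illustration rather than the project's BuildStt or SendCurrentInput: the engine names, the two stand-in engines, and the returned tuples are hypothetical, chosen only so the example runs without audio hardware.

```python
from typing import Callable, Dict, Protocol, Tuple


class SpeechToText(Protocol):
    """The single interface every engine satisfies."""

    def transcribe(self, pcm: bytes) -> str: ...


class EchoStt:
    """Stand-in engine: pretends the PCM bytes are already UTF-8 text."""

    def transcribe(self, pcm: bytes) -> str:
        return pcm.decode("utf-8")


class UpperStt:
    """Second stand-in engine, to show that callers never see the difference."""

    def transcribe(self, pcm: bytes) -> str:
        return pcm.decode("utf-8").upper()


# One registry entry per engine: adding an engine is one class plus one entry.
ENGINES: Dict[str, Callable[[], SpeechToText]] = {
    "echo": EchoStt,
    "upper": UpperStt,
}


def build_stt(engine: str) -> SpeechToText:
    """Factory: resolve a settings value to a concrete engine instance."""
    try:
        return ENGINES[engine]()
    except KeyError:
        raise ValueError(f"unknown STT engine: {engine}") from None


def send_current_input(mode: str, text: str) -> Tuple[str, str]:
    """Mode-aware dispatch: same transcript, different destination."""
    if mode == "ai":
        return ("start_reactor", text)
    if mode in ("chat", "key"):
        return ("terminal_write", text)
    raise ValueError(f"unknown mode: {mode}")
```

Everything downstream of build_stt depends only on transcribe, which is why swapping engines does not ripple into the routing code.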
Pipeline layer 2 — text becomes a tool plan (the GBNF function-call loop)
Once the captain's words are text, the comms officer takes over. This is the heart of AIMODE. Two backends exist for it; both expose the same IAgentToolLoop contract:
flowchart TB
  Req[user request<br/>natural language]
  Req --> Reactor[AgentReactorActor]
  Reactor --> Choose{IAgentToolLoop<br/>factory}
  Choose -->|primary, on-device| Local[AgentToolLoop<br/>Gemma 4 E4B + GBNF]
  Choose -->|fallback, REST| Ext[ExternalAgentToolLoop<br/>Nemotron Nano 8B-v1]
  subgraph LOOP [tool-call loop · same shape on both backends]
    Plan[generate next step]
    Plan --> JSON{JSON<br/>tool call}
    JSON -->|GBNF-locked| Strict["{tool, args}<br/>schema-valid by construction"]
    JSON -->|native FC| Native["OpenAI-style<br/>function call"]
    Strict --> Dispatch[dispatch to tool]
    Native --> Dispatch
    Dispatch --> T1[list_terminals]
    Dispatch --> T2[read_terminal]
    Dispatch --> T3[send_to_terminal]
    Dispatch --> T4[send_key]
    Dispatch --> T5[wait]
    Dispatch --> T6[done]
    T1 --> Obs[observation]
    T2 --> Obs
    T3 --> Obs
    T4 --> Obs
    T5 --> Obs
    Obs --> Plan
    T6 --> Stop((finalize))
  end
  Local --> LOOP
  Ext --> LOOP
  Stop --> Result[ReactorResult<br/>Success, FinalMessage,<br/>TurnCount, ElapsedMs]
  Result --> UI[stream to AgentBot UI]
  classDef gbnf fill:#3a1e5f,stroke:#C586C0,color:#fff
  class Local,Strict gbnf
  classDef ext fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Ext,Native ext
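The loop in the diagram has the same shape regardless of which backend produces the next step. A minimal Python sketch of that shape, with the language model replaced by a scripted planner so it runs anywhere: run_tool_loop, scripted_planner, and the turn cap of 12 are assumptions for illustration (the cap is a guess based on the R3/12 counter in the UI mock-up), not the project's AgentToolLoop.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[dict], str]                 # takes args, returns an observation
Planner = Callable[[List[str]], dict]        # sees observations, returns a tool call

MAX_TURNS = 12  # assumed cap on rounds


def run_tool_loop(
    planner: Planner,
    tools: Dict[str, Tool],
    max_turns: int = MAX_TURNS,
) -> Tuple[bool, str, int]:
    """Plan, dispatch, observe, repeat; stop on 'done' or when turns run out."""
    observations: List[str] = []
    for turn in range(1, max_turns + 1):
        call = planner(observations)         # one {"tool": ..., "args": ...} step
        name, args = call["tool"], call["args"]
        if name == "done":
            return True, str(args.get("message", "")), turn
        observations.append(tools[name](args))
    return False, "turn limit reached", max_turns


def scripted_planner(steps: List[dict]) -> Planner:
    """Stand-in for the model: replays a fixed list of tool calls in order."""
    remaining = iter(steps)
    return lambda observations: next(remaining)
```

The returned triple loosely mirrors the Success, FinalMessage, and TurnCount fields of ReactorResult. Swapping scripted_planner for a real model call is the only change a live backend would need in this sketch.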
Why GBNF for the local path. Gemma 4 doesn't have native function calling the way Nemotron or Claude do. We give it the function-calling property by construction: a GBNF grammar that locks output to {"tool": "<one of six>", "args": { ... }} at the sampler level. Every token the model emits is constrained by the grammar, so the model cannot produce ill-formed JSON or hallucinate an unknown tool name. This is meaningfully different from "ask the model to please return JSON and hope": it is schema-valid by construction.
Why a REST fallback. Some users will want a stronger reasoning model than a 4 B local one. Nemotron Nano 8B-v1 has native OpenAI-compatible function calling, so we can keep the same six-tool catalog and just route the calls through ILlmProvider. The ExternalAgentToolLoop keeps its own message history list (REST is stateless), whereas the local loop reuses the LLamaSharp KV cache across turns.
Why exactly six tools. Anything more is unnecessary; anything less and the captain can't compose useful behaviors. The six were chosen by tracing concrete instructions ("have Claude review what Codex just wrote", "keep watching this build until it finishes", "abort the test, send Ctrl+C") and asking what the minimum vocabulary was.
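To make "locked at the sampler level" concrete, here is what a six-tool grammar could look like in llama.cpp's GBNF notation, next to the hand-written check it replaces. Both are illustrative assumptions: TOOL_CALL_GBNF is not the project's AgentToolGrammar.Gbnf, and it restricts args to a flat object of strings and integers purely to stay short.

```python
import json
from typing import Tuple

# Hypothetical grammar, written for illustration only.
TOOL_CALL_GBNF = r'''
root   ::= "{" ws "\"tool\"" ws ":" ws tool ws "," ws "\"args\"" ws ":" ws args ws "}"
tool   ::= "\"list_terminals\"" | "\"read_terminal\"" | "\"send_to_terminal\"" | "\"send_key\"" | "\"wait\"" | "\"done\""
args   ::= "{" ws ( pair ( ws "," ws pair )* )? ws "}"
pair   ::= string ws ":" ws value
value  ::= string | number
string ::= "\"" ( [^"\\] | "\\" ["\\/bfnrt] )* "\""
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*
'''

TOOLS = frozenset(
    {"list_terminals", "read_terminal", "send_to_terminal", "send_key", "wait", "done"}
)


def parse_tool_call(raw: str) -> Tuple[str, dict]:
    """The validation an unconstrained model would need on every single turn."""
    call = json.loads(raw)   # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(call, dict) or set(call) != {"tool", "args"}:
        raise ValueError("expected exactly the keys 'tool' and 'args'")
    if call["tool"] not in TOOLS:
        raise ValueError(f"unknown tool: {call['tool']!r}")
    if not isinstance(call["args"], dict):
        raise ValueError("'args' must be an object")
    return call["tool"], call["args"]
```

With the grammar applied during sampling, every failure branch in parse_tool_call becomes unreachable for completed generations. A generation cut off by a token limit can still end mid-object, so a length check remains worthwhile even on the constrained path.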
Implementation pointers:
- Project/ZeroCommon/Llm/Tools/AgentToolLoop.cs: local Gemma path. The constructor (lines 54–58) wires ChatTemplates.Gemma and constructs the Grammar from AgentToolGrammar.Gbnf with GrammarRootRule. KV cache reuse means follow-up prompts only re-process the new turn.
- Project/ZeroCommon/Llm/Tools/ExternalAgentToolLoop.cs: REST path. The first user send seeds LlmMessage.System(AgentToolGrammar.SystemPrompt); subsequent sends append only the new user request to the loop's own history list, which exists because the REST endpoint is stateless (line 66).
- Project/ZeroCommon/Actors/AgentReactorActor.cs: the FSM around the loop. Lines 55–89 lazy-construct the loop with OnTurnCompleted and OnGenerationProgress callbacks that re-enter the actor mailbox as TurnCompletedInternal / GenerationProgressInternal. This is what keeps streaming UI updates flowing without violating actor encapsulation.
- Project/ZeroCommon/Actors/Messages.cs (lines 159–208): the public reactor protocol, consisting of StartReactor, ReactorProgress(phase, text, round), ReactorResult(success, finalMessage, turnCount, elapsedMs, failureReason), and ReactorBindings(hostFactory, optionsFactory, toolLoopFactory).
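The callback pattern described for AgentReactorActor, where work done on an inference thread re-enters the actor as a message instead of touching actor state, can be sketched without Akka at all. This Python version uses a plain queue as the mailbox. MiniReactor, drain, and run_on_worker are hypothetical names; the two message types borrow their names and fields from the protocol listed above, but their shapes here are assumptions.

```python
import queue
import threading
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TurnCompletedInternal:
    """Private message: posted by the loop callback, consumed only by the reactor."""
    round: int
    tool: str


@dataclass(frozen=True)
class ReactorProgress:
    """Public message: what the UI would receive."""
    phase: str
    text: str
    round: int


class MiniReactor:
    def __init__(self) -> None:
        self.mailbox: "queue.Queue[object]" = queue.Queue()
        self.progress: List[ReactorProgress] = []

    def on_turn_completed(self, turn: int, tool: str) -> None:
        # Runs on the inference thread: enqueue only, never mutate reactor state.
        self.mailbox.put(TurnCompletedInternal(turn, tool))

    def drain(self) -> None:
        # Runs on the reactor's own thread: the single place state changes.
        while True:
            try:
                message = self.mailbox.get_nowait()
            except queue.Empty:
                return
            if isinstance(message, TurnCompletedInternal):
                self.progress.append(
                    ReactorProgress("Acting", message.tool, message.round)
                )


def run_on_worker(reactor: MiniReactor, turn: int, tool: str) -> None:
    """Simulates the tool loop invoking its callback from another thread."""
    worker = threading.Thread(target=reactor.on_turn_completed, args=(turn, tool))
    worker.start()
    worker.join()
```

Because the callback only enqueues, the reactor's state is mutated by one thread, and progress events reach the UI in mailbox order.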
Pipeline layer 3 — tool plan becomes keystrokes in N terminals
The tool calls are not yet contact with reality. They are still messages inside one process. Reality is reached when send_to_terminal becomes bytes written to a child process's stdin pipe. That last leg goes through the actor mesh:
flowchart TB
  Reactor[AgentReactorActor<br/>/user/stage/bot/reactor]
  Reactor -->|emits tool call| Bot[AgentBotActor<br/>/user/stage/bot]
  Bot -->|SendToTerminal<br/>WorkspaceName, TerminalId, Text| Stage[StageActor<br/>/user/stage]
  Stage -->|forwards by ws-name| WS1[WorkspaceActor<br/>/user/stage/ws-default]
  Stage -->|forwards by ws-name| WS2[WorkspaceActor<br/>/user/stage/ws-secondary]
  WS1 -->|terminalId lookup| TA1[TerminalActor<br/>term-claude]
  WS1 -->|terminalId lookup| TA2[TerminalActor<br/>term-codex]
  WS1 -->|terminalId lookup| TA3[TerminalActor<br/>term-pwsh]
  WS2 -->|terminalId lookup| TA4[TerminalActor<br/>term-other]
  TA1 -->|ITerminalSession.Write| C1[ConPtyTerminalSession<br/>child: claude.exe]
  TA2 -->|ITerminalSession.Write| C2[ConPtyTerminalSession<br/>child: codex]
  TA3 -->|ITerminalSession.Write| C3[ConPtyTerminalSession<br/>child: pwsh.exe]
  TA4 -->|ITerminalSession.Write| C4[ConPtyTerminalSession<br/>child: any CLI]
  C1 -.->|stdout bytes| TA1
  C2 -.->|stdout bytes| TA2
  C3 -.->|stdout bytes| TA3
  C4 -.->|stdout bytes| TA4
  TA1 -.->|TerminalOutput| Stage
  TA2 -.->|TerminalOutput| Stage
  TA3 -.->|TerminalOutput| Stage
  TA4 -.->|TerminalOutput| Stage
  Stage -.->|broadcast / read_terminal| Reactor
  classDef agent fill:#3a1e5f,stroke:#C586C0,color:#fff
  class Reactor,Bot agent
  classDef supervisor fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Stage,WS1,WS2 supervisor
  classDef worker fill:#1e3a3a,stroke:#4EC9B0,color:#fff
  class TA1,TA2,TA3,TA4 worker
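Stripped of supervision and mailboxes, the downward path in this diagram is two dictionary lookups followed by a byte write. The Python sketch below shows only that routing shape. The Stage, Workspace, and Terminal classes are simplified stand-ins for the actors of the same role, and the write callable stands in for ITerminalSession.Write; none of this is the project's code.

```python
from typing import Callable, Dict

WriteFn = Callable[[bytes], None]   # stand-in for ITerminalSession.Write


class Terminal:
    """Leaf: turns text into bytes and hands them to the session."""

    def __init__(self, write: WriteFn) -> None:
        self._write = write

    def send(self, text: str) -> None:
        self._write(text.encode("utf-8"))


class Workspace:
    """Middle tier: owns its terminals and routes by terminal ID."""

    def __init__(self) -> None:
        self.terminals: Dict[str, Terminal] = {}

    def send(self, terminal_id: str, text: str) -> bool:
        terminal = self.terminals.get(terminal_id)
        if terminal is None:
            return False            # unknown ID: report, do not raise
        terminal.send(text)
        return True


class Stage:
    """Top tier: forwards by workspace name."""

    def __init__(self) -> None:
        self.workspaces: Dict[str, Workspace] = {}

    def send_to_terminal(self, workspace: str, terminal_id: str, text: str) -> bool:
        target = self.workspaces.get(workspace)
        return target is not None and target.send(terminal_id, text)
```

Returning False for an unknown workspace or terminal gives the tool loop an observation it can react to, for example by calling list_terminals again, rather than an exception that would end the run.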
Why Akka.NET for this. The terminal layer has hard problems that are messy with raw threads and clean with actors:
- Names from user input must not poison the addressing scheme. Workspace and terminal names go through ActorNameSanitizer before becoming actor paths, because Akka rejects /, :, and spaces.
- Supervision isolates failures. A Claude CLI that hangs or crashes is one TerminalActor's problem, not the bridge's. The supervising WorkspaceActor restarts or escalates.
- The ITerminalSession seam is what keeps the actor layer testable headlessly. Real ConPTY sessions live in AgentZeroWpf/Services/; the actors only know the interface in ZeroCommon/Services/ITerminalSession.cs. Tests can swap in a fake session and assert on the message protocol.
- Active-terminal tracking (SetActiveTerminal at StageActor.cs lines 87–91) gives the reactor a sane default for send_to_terminal when the user doesn't name one: "send this to whichever terminal is in front."
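Two of the points in this list lend themselves to a short sketch: sanitizing user-supplied names before they become path segments, and falling back to the active terminal when none is named. The Python below is an assumption-laden illustration. sanitize_actor_name uses a deliberately conservative allow-list (ASCII letters, digits, underscore, hyphen) that is stricter than necessary and is not the rule set of the project's ActorNameSanitizer. ActiveTerminal is likewise a hypothetical stand-in for the state StageActor holds.

```python
import re
from typing import Optional, Tuple

# Assumed allow-list; anything outside it collapses into a single hyphen.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]+")


def sanitize_actor_name(raw: str, fallback: str = "unnamed") -> str:
    """Map arbitrary user input to a string that is safe as one path segment."""
    cleaned = _UNSAFE.sub("-", raw.strip()).strip("-")
    return cleaned or fallback


class ActiveTerminal:
    """Remembers which terminal is in front so callers may omit the target."""

    def __init__(self) -> None:
        self.current: Optional[Tuple[str, str]] = None

    def set_active(self, workspace: str, terminal_id: str) -> None:
        self.current = (workspace, terminal_id)

    def resolve(
        self, workspace: Optional[str] = None, terminal_id: Optional[str] = None
    ) -> Tuple[str, str]:
        """An explicit target wins; otherwise fall back to the active terminal."""
        if workspace is not None and terminal_id is not None:
            return (workspace, terminal_id)
        if self.current is None:
            raise LookupError("no active terminal and no explicit target")
        return self.current
```

A sanitizer like this is lossy: two different display names can collapse to the same segment, so a real implementation also needs a collision strategy such as a numeric suffix.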
Implementation pointers:
- Project/ZeroCommon/Actors/StageActor.cs: top supervisor. Children: one AgentBotActor (singleton) and N WorkspaceActors. Holds (activeWorkspace, activeTerminalId) so the reactor doesn't need to ask each turn.
- Project/ZeroCommon/Actors/WorkspaceActor.cs: owns the _terminals dict (line 25) and routes SendToTerminal by terminal ID (lines 62–79).
- Project/ZeroCommon/Actors/TerminalActor.cs: wraps ITerminalSession. Bidirectional: outbound WriteToTerminal becomes pipe writes; inbound stdout becomes TerminalOutput events broadcast to Stage.
- Project/ZeroCommon/Data/Entities/CliDefinition.cs: the database row that says "Claude" maps to claude.exe with these args. Built-in seeds (IsBuiltIn = true) cover CMD / PowerShell 5/7 / Claude. Adding Codex or Cursor CLI is one row.
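The "one row per CLI" idea in the last pointer can be pictured as a small record plus a function that turns it into a launch command. This Python sketch is a guess at the general shape only: CliRow and its fields other than the built-in flag are hypothetical, as are example-agent and its --flag argument, and the real CliDefinition entity may look quite different.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CliRow:
    """Hypothetical stand-in for one CLI definition row."""
    name: str                  # label shown on the tab
    exe_path: str              # program to launch inside the terminal session
    arguments: str = ""        # extra arguments as one string, split at launch
    is_built_in: bool = False  # mirrors the IsBuiltIn seed flag from the text

    def command_line(self) -> List[str]:
        """Program plus arguments, ready to hand to a process launcher."""
        return [self.exe_path, *shlex.split(self.arguments)]


# Seeds named in the text, with argument strings left empty on purpose.
SEEDS = [
    CliRow("PowerShell 7", "pwsh.exe", is_built_in=True),
    CliRow("Claude", "claude.exe", is_built_in=True),
]

# A user-defined row for a made-up CLI; no code change is needed to add it.
USER_ROWS = [CliRow("Example Agent", "example-agent", "--flag value")]
```

Under this model, supporting a new terminal AI means appending to USER_ROWS; nothing in the routing or the tool loop has to know the new name.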
End-to-end sequence — voice to keystrokes
To pull all three layers together, here is the chronological sequence for the original utterance — "tell Claude to refactor the auth module and tell Codex to write the tests for it" — from the moment the user releases the mic to the moment both children have received their bytes.
sequenceDiagram
  autonumber
  actor U as User
  participant V as VoiceCaptureService
  participant S as STT (Whisper.net)
  participant W as AgentBotWindow
  participant B as AgentBotActor
  participant R as AgentReactorActor
  participant L as AgentToolLoop<br/>(Gemma 4 + GBNF)
  participant Stg as StageActor
  participant TA1 as TerminalActor<br/>(Claude)
  participant TA2 as TerminalActor<br/>(Codex)
  U->>V: speaks utterance
  V->>V: dual-VAD: speaking → silent ≥ 40 frames
  V->>W: UtteranceEnded
  W->>S: ConsumePcmBuffer → transcribe
  S-->>W: text transcript
  W->>B: StartReactor(userRequest=text)
  B->>R: forward StartReactor
  R->>L: build loop (lazy)
  R->>L: send userRequest
  loop tool-call loop
    L->>L: GBNF-constrained sample
    L-->>R: tool call JSON {tool, args}
    R-->>W: ReactorProgress(Acting, "send_to_terminal #1")
    alt send_to_terminal
      R->>Stg: SendToTerminal(ws, termId, text)
      Stg->>TA1: forward
      TA1->>TA1: ITerminalSession.Write(bytes)
      TA1-->>Stg: TerminalOutput (echo, prompt)
      Stg-->>R: observation
    else read_terminal
      R->>Stg: ReadTerminal(ws, termId, lines)
      Stg-->>R: lines
    else done
      L-->>R: tool=done, message=...
    end
  end
  R-->>B: ReactorResult(success, msg, turns, ms)
  B-->>W: stream to chat UI
  W-->>U: render final message + per-turn log
The interesting part is steps 10–20, the inner loop. Each iteration is one Gemma 4 generation constrained by the GBNF sampler. Because the grammar guarantees well-formed JSON, the actor on the receiving side does not need a JSON-parse fallback path. There is no "the model returned malformed output" branch to write or test. You either have a tool call you can dispatch, or generation hasn't finished yet.
This is one of the underrated ergonomic wins of the GBNF approach: the absence of code that would have existed in a "ask politely for JSON" architecture.
Where Claude, Codex and friends fit — the model-agnostic terminal layer
The captain (you) and the comms officer (Gemma 4) are picky about each other — they have to share a tool catalog and a system prompt and a chat template. The crew is not picky at all. From a TerminalActor's perspective:
- It receives WriteToTerminal(text) and writes those bytes to a pipe.
- It reads bytes from the pipe and wraps them in TerminalOutput.
- That is the entire contract.
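That contract can be demonstrated with any child process. The sketch below launches a tiny Python child that upper-cases what it reads, writes bytes to its stdin, and reads bytes back. It uses plain one-shot pipes through the standard subprocess module, which is a simplification: a real ConPTY session is a long-lived pseudoconsole, and the CHILD script and write_and_read helper are invented for this example.

```python
import subprocess
import sys

# The "crew member": reads lines from stdin and prints them upper-cased.
# It has no way to tell whether a human, a script, or an LLM is typing.
CHILD = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())\n"


def write_and_read(text: str) -> str:
    """Write bytes to the child's stdin, return whatever bytes it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=text.encode("utf-8"),   # outbound leg: bytes into the pipe
        stdout=subprocess.PIPE,       # inbound leg: bytes out of the pipe
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Replacing CHILD with a different program changes what the bytes mean but not how they travel, which is the property the terminal layer relies on.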
So when the comms officer says "send_to_terminal id=1, text='/refactor the auth module'", terminal 1 is just a process. If that process happens to be claude reading /refactor as a slash command, you've used Claude Code. If it's codex parsing the same text as natural-language input, you've used Codex. If it's pwsh.exe, you've sent a probably-broken PowerShell command. The terminal layer does not care what is on the other side of the pipe.
This is the secret of why this architecture handles Claude and Codex and Cursor CLI and whatever ships next month with a single integration point: there is no integration. Each of them is a row in the CliDefinition table. The captain learns the conventions of each crew member implicitly from their stdout, the same way a human operator does.
The only place the terminal AIs become first-class is when they want to talk back to AgentBot, not just to the user. That's the peer-signal protocol: a CLI agent calls AgentZeroLite.exe -cli bot-chat <msg> --from <peerName>, which the AgentBotActor routes back to the reactor as TerminalSentToBot(peerName, text) so the loop can take it as an observation. It is optional and opt-in, and useful for agents that want to coordinate ("hey AgentBot, Claude here: Codex's tests look like they need an extra fixture").
Why this design holds together
Three properties are doing most of the work, and all three were inherited from earlier architectural decisions rather than invented for voice control:
- One abstract interface, many implementations. ISpeechToText for voice. IAgentToolLoop for the function-call loop. ILlmProvider for the underlying chat backend. ITerminalSession for the terminal seam. Adding a fifth STT engine, a third tool-loop backend, a sixth provider, or a new terminal type each costs a single class, never a structural change.
- Schema-valid by construction at the LLM boundary. GBNF gives us function calling on Gemma 4 without trusting the model to behave. Without this, a 4 B-parameter local model would be too unreliable to pilot real terminals.
- Actors all the way down for routing. The captain → comms → crew chain crosses async boundaries (mic threads, LLM inference threads, ConPTY pipe threads, UI thread). Akka.NET makes those boundaries the addressable shape of the system instead of the bug surface.
The voice-driven multi-terminal feature is what those three properties look like when they meet. We didn't add voice to a chat app and bolt on terminal control. We built a fleet captain's bridge — and voice is just the most natural way for a captain to give orders.
Status (2026-04-28)
Everything in this document is in main and runs on a stock developer machine:
- Voice capture & STT — shipping. Four engines available; AUTO toggle live.
- AgentBotActor + AgentReactorActor + tool-loop FSM — shipping. Both backends exercised.
- Six-tool GBNF catalog — shipping. AgentToolGrammar.Gbnf enforced at sampler level.
- Multi-terminal actor mesh — shipping. ConPTY hosts in WPF tabs, sanitized actor paths, peer-signal protocol live.
- Built-in CLI definitions — CMD / PowerShell 5 / PowerShell 7 / Claude seeded; Codex / Cursor / others are user-defined rows.
What's next is captain-side, not infrastructure: better cancellation UX mid-utterance, optional TTS readback of reactor decisions ("I'm sending this to Claude — say cancel to stop"), and a per-tool latency budget so the captain knows which step is the slow one.
The bridge is built. We're learning to be captains.
TECH LINKS
- 𝕏 @webnori