Voice-Driven Multi-Terminal Control — How a Local Gemma 4 Function-Caller Pilots Claude, Codex and Anything Else with a Prompt

TL;DR — In AgentZero Lite, the voice channel is not a chat input. It is the captain's bridge for a fleet of terminal AIs. You speak; an on-device Gemma 4 with a GBNF-locked tool catalog decides which terminal to read, which terminal to write, which key to send, when to wait; and the Akka.NET actor mesh carries those decisions to N concurrent CLI agents — Claude Code, Codex, Cursor CLI, or a plain shell. Same captain, swappable crew. Same bridge, swappable language model behind it. The point of this chapter is to show the seam: how voice becomes a function call, and how a function call becomes a keystroke into another agent's terminal.

This is a follow-up to Adding Providers and a Voice — AgentZero Lite Expands Beyond a Single Local Model. That chapter added STT/TTS as new modalities behind the same gateway. This one shows what the voice channel is for.

The shape of the feature — UI view

You sit at the AgentBot window. The mic toggle is in the input strip. Your workspace already has 2–4 terminal tabs open, each running its own AI CLI (Claude Code in tab 1, Codex in tab 2, a plain PowerShell in tab 3). You say:

"Tell Claude to refactor the auth module and tell Codex to write the tests for it. When Claude finishes, paste its diff into Codex."

The bridge picks it up. Here is what the operator sees:


┌──────────────────────────────────── AgentBot ─────────────────────────────────┐
│  Mode: [AI ▾]   Backend: Gemma 4 E4B (local)   Reactor: ▶ running  R3/12     │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  🎙  YOU — "Tell Claude to refactor the auth module and tell Codex…"          │
│                                                                               │
│  🤖  REACTOR — Thinking ░░░░░░░░░ ▓                                           │
│        ↳ Tool: list_terminals                  → 3 terminals found            │
│        ↳ Tool: send_to_terminal #1 (Claude)   "refactor the auth module…"     │
│        ↳ Tool: send_to_terminal #2 (Codex)    "write tests for auth…"         │
│        ↳ Tool: wait 8s                                                        │
│        ↳ Tool: read_terminal  #1               → captured 142 lines           │
│        ↳ Tool: send_to_terminal #2             "<paste diff from Claude>"     │
│        ↳ Tool: done                                                           │
│                                                                               │
│  🤖  REACTOR — done in 4 turns · 11.2s                                        │
│                                                                               │
├───────────── Workspace tabs (each is an independent ConPTY) ──────────────────┤
│  ┌────[1] Claude Code──┬──[2] Codex──┬──[3] PS──┬──[4] +]──────────────────┐  │
│  │  > /refactor auth   │ > write tests for…    │ PS> _    │                │  │
│  │  ✓ refactored 3 files│ writing test_auth.py │          │                │  │
│  │  diff: 142 lines    │ pytest collected 17  │          │                │  │
│  └─────────────────────┴───────────────────────┴──────────┴────────────────┘  │
│                                                                               │
│  [🎙 listening · Whisper.net small]   ▮▮▮▮▮▯▯▯▯  amplitude · VAD: speaking   │
│                                                                               │
│  > _                                                                          │
└───────────────────────────────────────────────────────────────────────────────┘

Nothing on this screen is a metaphor. The mic strip is a real UserControl (VoiceWaveformIndicator.xaml.cs, 9-bar cyan→magenta gradient driven by VAD amplitude every 50 ms). The reactor log is real ReactorProgress actor messages streamed to the UI thread. The terminal tabs are real ConPTY hosts — each one literally runs claude, codex, or pwsh.exe as a child process. The captain's instructions arrive at those processes as bytes written to a pipe. The terminals don't know they're being driven by an LLM; from their point of view, somebody is just typing.

That last sentence is the load-bearing one. It is why the system works for any CLI AI you can install.

The mental model — captain, comms, crew

Three roles. Once you see them, the architecture is obvious.

Role	Component	What it owns	What it doesn't
Captain	The user, via voice	Intent. The desired outcome.	Tactical decisions about which tool to call when.
Comms officer	On-device Gemma 4 E4B (or Nemotron 8B fallback) running inside the Reactor actor	Translating intent into a sequence of typed tool calls.	Knowing what each terminal AI does. It only sees terminal IDs and text streams.
Crew	N TerminalActors wrapping ConPTY sessions running Claude / Codex / shell	Doing the actual work in their own process.	Each other. They don't share memory or know they have peers.

The captain speaks in natural language. The comms officer converts that to at most 6 verbs: list_terminals, read_terminal, send_to_terminal, send_key, wait, done. The crew receives plain bytes and produces plain bytes back. That narrow waist is what lets the same UX pilot any future terminal AI without changing the comms officer.

Pipeline layer 1 — voice becomes text

Voice is the riskiest seam in the whole stack: any latency or transcription error here ruins the captain's bridge feeling. The pipeline is built around two principles that fell out of two months of experimentation: dual-VAD gating (so we don't transcribe silence or send half-utterances), and a factory-of-engines behind one interface (so we can swap Whisper.net / OpenAI Whisper / Webnori-Gemma audio / local Gemma audio without code changes downstream).


flowchart LR
  Mic[🎙 NAudio<br/>16 kHz mono PCM]
  Mic --> RMS[Frame RMS<br/>every ~50 ms]
  RMS --> Meter[UI amplitude meter<br/>9-bar gradient]
  RMS --> VAD[Utterance VAD<br/>~40 silent frames<br/>= 2 s threshold]

  VAD -->|UtteranceStarted| Buf[ring buffer<br/>collect PCM]
  VAD -->|UtteranceEnded| Drain[ConsumePcmBuffer]

  Drain --> STT{ISpeechToText<br/>factory}
  STT -->|local CPU| W[Whisper.net small<br/>~487 MB]
  STT -->|cloud REST| O[OpenAI Whisper-1]
  STT -->|local server| WG[Webnori-Gemma audio]
  STT -->|on-device LLM| LG[Local Gemma audio]

  W --> Txt[transcript text]
  O --> Txt
  WG --> Txt
  LG --> Txt

  Txt --> Disp[Dispatcher.Invoke<br/>txtInput on UI thread]
  Disp --> Send[SendCurrentInput<br/>mode-aware]

  Send -->|Chat/Key mode| Term[active terminal write]
  Send -->|AI mode| Bot[AgentBotActor.StartReactor]

  classDef pri fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Bot,Send pri

Implementation pointers:

Project/AgentZeroWpf/Services/Voice/VoiceCaptureService.cs — dual-VAD service, exposes AmplitudeChanged, UtteranceStarted, UtteranceEnded events. The 40-frame silence threshold (line 41) is empirically tuned — shorter clips bleed into the next utterance, longer feels laggy on the bridge.

Project/AgentZeroWpf/Services/Voice/VoiceRuntimeFactory.cs — BuildStt(VoiceSettings) switch returning the right ISpeechToText implementation. Adding a new engine is one new branch and one new class.

Project/AgentZeroWpf/UI/APP/AgentBotWindow.Voice.cs — wires capture events to the input box and the AUTO toggle. AUTO ON means each UtteranceEnded automatically calls SendCurrentInput; AUTO OFF means the user still has to press Enter. STT engines are warmed up off-thread to hide cold-start latency (Whisper.net is ~487 MB CPU model load).

The mode awareness is critical — the same voice transcript routes to a terminal write in Chat/Key mode but to the AIMODE reactor in AI mode. Same input, different captain.

Pipeline layer 2 — text becomes a tool plan (the GBNF function-call loop)

Once the captain's words are text, the comms officer takes over. This is the heart of AIMODE. Two backends exist for it; both expose the same IAgentToolLoop contract:


flowchart TB
  Req[user request<br/>natural language]
  Req --> Reactor[AgentReactorActor]
  Reactor --> Choose{IAgentToolLoop<br/>factory}

  Choose -->|primary, on-device| Local[AgentToolLoop<br/>Gemma 4 E4B + GBNF]
  Choose -->|fallback, REST| Ext[ExternalAgentToolLoop<br/>Nemotron Nano 8B-v1]

  subgraph LOOP [tool-call loop · same shape on both backends]
    Plan[generate next step]
    Plan --> JSON{JSON<br/>tool call}
    JSON -->|GBNF-locked| Strict["{tool, args}<br/>schema-valid by construction"]
    JSON -->|native FC| Native["OpenAI-style<br/>function call"]
    Strict --> Dispatch[dispatch to tool]
    Native --> Dispatch

    Dispatch --> T1[list_terminals]
    Dispatch --> T2[read_terminal]
    Dispatch --> T3[send_to_terminal]
    Dispatch --> T4[send_key]
    Dispatch --> T5[wait]
    Dispatch --> T6[done]

    T1 --> Obs[observation]
    T2 --> Obs
    T3 --> Obs
    T4 --> Obs
    T5 --> Obs

    Obs --> Plan
    T6 --> Stop((finalize))
  end

  Local --> LOOP
  Ext --> LOOP

  Stop --> Result[ReactorResult<br/>Success, FinalMessage,<br/>TurnCount, ElapsedMs]
  Result --> UI[stream to AgentBot UI]

  classDef gbnf fill:#3a1e5f,stroke:#C586C0,color:#fff
  class Local,Strict gbnf
  classDef ext fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Ext,Native ext

Why GBNF for the local path. Gemma 4 doesn't have native function calling the way Nemotron or Claude do. We give it the function-calling property by construction — a GBNF grammar that locks output to {"tool": "<one of six>", "args": { ... }} at the sampler level. Every token the model emits is constrained by the grammar; the model literally cannot produce ill-formed JSON or hallucinate an unknown tool name. This is meaningfully different from "ask the model to please return JSON and hope" — it is schema-valid by construction.

Why a REST fallback. Some users will want a stronger reasoning model than a 4 B local one. Nemotron Nano 8B-v1 has native OpenAI-compatible function calling, so we can keep the same six-tool catalog and just route the calls through ILlmProvider. The ExternalAgentToolLoop keeps its own message history list (REST is stateless) where the local one reuses the LLamaSharp KV cache across turns.

Why exactly six tools. Anything more is unnecessary; anything less and the captain can't compose useful behaviors. The six were chosen by tracing concrete instructions — "have Claude review what Codex just wrote", "keep watching this build until it finishes", "abort the test, send Ctrl+C" — and asking what the minimum vocabulary was.

Implementation pointers:

Project/ZeroCommon/Llm/Tools/AgentToolLoop.cs — local Gemma path. Constructor (line 54–58) wires ChatTemplates.Gemma and constructs the Grammar from AgentToolGrammar.Gbnf with GrammarRootRule. KV cache reuse means follow-up prompts only re-process the new turn.

Project/ZeroCommon/Llm/Tools/ExternalAgentToolLoop.cs — REST path. First user send seeds LlmMessage.System(AgentToolGrammar.SystemPrompt); subsequent sends append only the new user request to a stateless history list (line 66).

Project/ZeroCommon/Actors/AgentReactorActor.cs — the FSM around the loop. Lines 55–89 lazy-construct the loop with OnTurnCompleted and OnGenerationProgress callbacks that re-enter the actor mailbox as TurnCompletedInternal / GenerationProgressInternal. This is what keeps streaming UI updates flowing without violating actor encapsulation.

Project/ZeroCommon/Actors/Messages.cs lines 159–208 — the public reactor protocol: StartReactor, ReactorProgress(phase, text, round), ReactorResult(success, finalMessage, turnCount, elapsedMs, failureReason), ReactorBindings(hostFactory, optionsFactory, toolLoopFactory).

Pipeline layer 3 — tool plan becomes keystrokes in N terminals

The tool calls are not yet contact with reality. They are still messages inside one process. Reality is reached when send_to_terminal becomes bytes written to a child process's stdin pipe. That last leg goes through the actor mesh:


flowchart TB
  Reactor[AgentReactorActor<br/>/user/stage/bot/reactor]
  Reactor -->|emits tool call| Bot[AgentBotActor<br/>/user/stage/bot]
  Bot -->|SendToTerminal<br/>WorkspaceName, TerminalId, Text| Stage[StageActor<br/>/user/stage]

  Stage -->|forwards by ws-name| WS1[WorkspaceActor<br/>/user/stage/ws-default]
  Stage -->|forwards by ws-name| WS2[WorkspaceActor<br/>/user/stage/ws-secondary]

  WS1 -->|terminalId lookup| TA1[TerminalActor<br/>term-claude]
  WS1 -->|terminalId lookup| TA2[TerminalActor<br/>term-codex]
  WS1 -->|terminalId lookup| TA3[TerminalActor<br/>term-pwsh]
  WS2 -->|terminalId lookup| TA4[TerminalActor<br/>term-other]

  TA1 -->|ITerminalSession.Write| C1[ConPtyTerminalSession<br/>child: claude.exe]
  TA2 -->|ITerminalSession.Write| C2[ConPtyTerminalSession<br/>child: codex]
  TA3 -->|ITerminalSession.Write| C3[ConPtyTerminalSession<br/>child: pwsh.exe]
  TA4 -->|ITerminalSession.Write| C4[ConPtyTerminalSession<br/>child: any CLI]

  C1 -.->|stdout bytes| TA1
  C2 -.->|stdout bytes| TA2
  C3 -.->|stdout bytes| TA3
  C4 -.->|stdout bytes| TA4

  TA1 -.->|TerminalOutput| Stage
  TA2 -.->|TerminalOutput| Stage
  TA3 -.->|TerminalOutput| Stage
  TA4 -.->|TerminalOutput| Stage

  Stage -.->|broadcast / read_terminal| Reactor

  classDef agent fill:#3a1e5f,stroke:#C586C0,color:#fff
  class Reactor,Bot agent
  classDef supervisor fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Stage,WS1,WS2 supervisor
  classDef worker fill:#1e3a3a,stroke:#4EC9B0,color:#fff
  class TA1,TA2,TA3,TA4 worker

Why Akka.NET for this. The terminal layer has hard problems that are messy with raw threads and clean with actors:

Names from user input must not poison the addressing scheme. Workspace and terminal names go through ActorNameSanitizer before becoming actor paths — Akka rejects /, :, spaces.

Supervision isolates failures. A Claude CLI that hangs or crashes is one TerminalActor's problem, not the bridge's. The supervising WorkspaceActor restarts or escalates.

The ITerminalSession seam is what keeps the actor layer testable headlessly. Real ConPTY sessions live in AgentZeroWpf/Services/; the actors only know the interface in ZeroCommon/Services/ITerminalSession.cs. Tests can swap a fake session and assert on the message protocol.

Active-terminal tracking (SetActiveTerminal at StageActor.cs lines 87–91) gives the reactor a sane default for send_to_terminal when the user doesn't name one — "send this to whichever terminal is in front."

Implementation pointers:

Project/ZeroCommon/Actors/StageActor.cs — top supervisor. Children: one AgentBotActor (singleton) and N WorkspaceActors. Holds (activeWorkspace, activeTerminalId) so the reactor doesn't need to ask each turn.

Project/ZeroCommon/Actors/WorkspaceActor.cs — owns _terminals dict (line 25), routes SendToTerminal by terminal ID (lines 62–79).

Project/ZeroCommon/Actors/TerminalActor.cs — wraps ITerminalSession. Bidirectional: outbound WriteToTerminal becomes pipe writes; inbound stdout becomes TerminalOutput events broadcast to Stage.

Project/ZeroCommon/Data/Entities/CliDefinition.cs — the database row that says "Claude" maps to claude.exe with these args. Built-in seeds (IsBuiltIn = true) cover CMD / PowerShell 5/7 / Claude. Adding Codex or Cursor CLI is one row.

End-to-end sequence — voice to keystrokes

To pull all three layers together, here is the chronological sequence for the original utterance — "tell Claude to refactor the auth module and tell Codex to write the tests for it" — from the moment the user releases the mic to the moment both children have received their bytes.


sequenceDiagram
  autonumber
  actor U as User
  participant V as VoiceCaptureService
  participant S as STT (Whisper.net)
  participant W as AgentBotWindow
  participant B as AgentBotActor
  participant R as AgentReactorActor
  participant L as AgentToolLoop<br/>(Gemma 4 + GBNF)
  participant Stg as StageActor
  participant TA1 as TerminalActor<br/>(Claude)
  participant TA2 as TerminalActor<br/>(Codex)

  U->>V: speaks utterance
  V->>V: dual-VAD: speaking → silent ≥ 40 frames
  V->>W: UtteranceEnded
  W->>S: ConsumePcmBuffer → transcribe
  S-->>W: text transcript
  W->>B: StartReactor(userRequest=text)
  B->>R: forward StartReactor

  R->>L: build loop (lazy)
  R->>L: send userRequest

  loop tool-call loop
    L->>L: GBNF-constrained sample
    L-->>R: tool call JSON {tool, args}
    R-->>W: ReactorProgress(Acting, "send_to_terminal #1")

    alt send_to_terminal
      R->>Stg: SendToTerminal(ws, termId, text)
      Stg->>TA1: forward
      TA1->>TA1: ITerminalSession.Write(bytes)
      TA1-->>Stg: TerminalOutput (echo, prompt)
      Stg-->>R: observation
    else read_terminal
      R->>Stg: ReadTerminal(ws, termId, lines)
      Stg-->>R: lines
    else done
      L-->>R: tool=done, message=...
    end
  end

  R-->>B: ReactorResult(success, msg, turns, ms)
  B-->>W: stream to chat UI
  W-->>U: render final message + per-turn log

The interesting part is steps 9–14, the inner loop. Each iteration is one Gemma 4 forward pass constrained by the GBNF sampler. Because the grammar guarantees well-formed JSON, the actor on the receiving side does not need a JSON-parse fallback path. There is no "the model returned malformed output" branch to write or test. You either have a tool call you can dispatch, or generation hasn't finished yet.

This is one of the underrated ergonomic wins of the GBNF approach: the absence of code that would have existed in a "ask politely for JSON" architecture.

Where Claude, Codex and friends fit — the model-agnostic terminal layer

The captain (you) and the comms officer (Gemma 4) are picky about each other — they have to share a tool catalog and a system prompt and a chat template. The crew is not picky at all. From a TerminalActor's perspective:

It receives WriteToTerminal(text) and writes those bytes to a pipe.

It reads bytes from the pipe and wraps them in TerminalOutput.

That is the entire contract.

So when the comms officer says "send_to_terminal id=1, text='/refactor the auth module'", terminal 1 is just a process. If that process happens to be claude reading /refactor as a slash command, you've used Claude Code. If it's codex parsing the same text as natural-language input, you've used Codex. If it's pwsh.exe, you've sent a probably-broken PowerShell command. The terminal layer does not care what is on the other side of the pipe.

This is the secret of why this architecture handles Claude and Codex and Cursor CLI and whatever ships next month with a single integration point: there is no integration. Each of them is a row in the CliDefinition table. The captain learns the conventions of each crew member implicitly from their stdout — same way a human operator does.

The only place the terminal AIs become first-class is when they want to talk back to AgentBot, not just to the user. That's the peer-signal protocol — a CLI agent calls AgentZeroLite.exe -cli bot-chat <msg> --from <peerName>, which the AgentBotActor routes back to the reactor as TerminalSentToBot(peerName, text) so the loop can take it as an observation. Optional, opt-in, useful for agents that want to coordinate ("hey AgentBot, Claude here — Codex's tests look like they need an extra fixture").

Why this design holds together

Three properties are doing most of the work, and all three were inherited from earlier architectural decisions rather than invented for voice control:

One abstract interface, many implementations. ISpeechToText for voice. IAgentToolLoop for the function-call loop. ILlmProvider for the underlying chat backend. ITerminalSession for the terminal seam. Adding a fifth STT engine, a third tool-loop backend, a sixth provider, or a new terminal type each costs a single class — never a structural change.

Schema-valid by construction at the LLM boundary. GBNF gives us function calling on Gemma 4 without trusting the model to behave. Without this, a 4 B-parameter local model would be too unreliable to pilot real terminals.

Actors all the way down for routing. The captain → comms → crew chain crosses async boundaries (mic threads, LLM inference threads, ConPTY pipe threads, UI thread). Akka.NET makes those boundaries the addressable shape of the system instead of the bug surface.

The voice-driven multi-terminal feature is what those three properties look like when they meet. We didn't add voice to a chat app and bolt on terminal control. We built a fleet captain's bridge — and voice is just the most natural way for a captain to give orders.

Status (2026-04-28)

Everything in this document is in main and runs on a stock developer machine:

Voice capture & STT — shipping. Four engines available; AUTO toggle live.

AgentBotActor + AgentReactorActor + tool-loop FSM — shipping. Both backends exercised.

Six-tool GBNF catalog — shipping. AgentToolGrammar.Gbnf enforced at sampler level.

Multi-terminal actor mesh — shipping. ConPTY hosts in WPF tabs, sanitized actor paths, peer-signal protocol live.

Built-in CLI definitions — CMD / PowerShell 5 / PowerShell 7 / Claude seeded; Codex / Cursor / others are user-defined rows.

What's next is captain-side, not infrastructure: better cancellation UX mid-utterance, optional TTS readback of reactor decisions ("I'm sending this to Claude — say cancel to stop"), and a per-tool latency budget so the captain knows which step is the slow one.

The bridge is built. We're learning to be captains.

TECH LINKS

𝕏 @webnori

📘 Akka Labs (Facebook group)

📝 More writings — webnori wiki