Series: AgentZero Lite — Part 11. The build report for this work is Part 10 — Shipping Voice Output to AIMODE (the engineering story). This part is a different angle on the same commit: the user-facing demo. It works. Here's what that looks like and why.
TL;DR
- I spoke to AgentZero Lite's bot. The bot wrote my instruction into a Claude CLI terminal. Claude did the work, the bot read the result back from the terminal, summarised it, and the summary came through my speakers — all in a single voice turn.
- Zero API keys in the audio path. STT is Whisper.net running a local GGML model. TTS is Windows SAPI (the OS-bundled Heami voice). Claude CLI is the only LLM in the loop, and it's the user's existing CLI, not a hosted API call.
- The same wiring works for Codex CLI, Gemini CLI, or any other peer agent that lives in a terminal — the bot doesn't know which CLI it's typing into. It just owns the keyboard.
- The voice turn auto-mutes the mic during TTS playback so the bot doesn't transcribe itself. That detail sounds trivial; it isn't (see the build report, Part 10, for the post-mortem).
The show, before we open the box
Skip this section if you want the implementation. Read it if you want to know what the user actually does.
What you see on screen
The AgentZero Lite WPF app has two halves: a multi-terminal area on the left, and the AgentBot chat panel on the right. The AgentBot has three modes — Chat (your typing goes to the active terminal), Key (each key goes to the active terminal in real time), and AI (the bot uses tools to talk to whichever AI agent is running in a terminal).
For voice work, AI mode is the one. Here's a faithful ASCII redraw of the live screen during a voice turn:
    ┌─────────────────────────────────────────────────────────────────────────────────┐
    │ AgentZero Lite                                                        [_][□][X] │
    ├──────────────────────────────────────────┬──────────────────────────────────────┤
    │ [CMD] [EDIT] [Claude•] [Codex] [+]       │ ◂ AgentBot                  [ AI ▼ ] │
    │ ─────────────────────────────────────    │ ──────────────────────────────────── │
    │                                          │ ▸ You                                │
    │ D:\Code\AI\AgentZeroLite>                │   "터미널 작업 요약해 줘"            │
    │ $ git log --oneline -5                   │                                      │
    │ 988b4cd feat(voice): AI-mode TTS …       │ ◂ Gemma                              │
    │ c17717e chore: bump version to 0.2.1     │   "활성 Claude 터미널의 최근 5개     │
    │ 2ae8b75 fix(ui/dock): break floating-…   │    커밋은 voice / dock / version 세  │
    │ 5622080 fix(ui/dock): don't auto-grab…   │    영역에 집중되어 있고, 가장 최근   │
    │ 95c6227 DOCUPDATE                        │    변경은 AI 모드 TTS 출력입니다."   │
    │                                          │                                      │
    │ Claude>                                  │ ✓ done · 3127ms · 1 turn             │
    │                                          │                                      │
    │                                          │ ┌─────────────────────────────────┐  │
    │                                          │ │ 🎙 Listening · Whisper small    │  │
    │                                          │ │ ▆▆▅▂▁▁▁▂▃▅▆▆▇  [🔇] [🔊 75%]    │  │
    │                                          │ └─────────────────────────────────┘  │
    └──────────────────────────────────────────┴──────────────────────────────────────┘
                          ↓ 🔊 (speakers play the Gemma reply, mic auto-mutes for the duration)
The • marker on Claude• means that's the active terminal: the one the bot will read from and write to when its tools fire. The little waveform strip at the bottom of the AgentBot panel is the live audio level; the icons next to it are the mute toggle and the system mic volume.
What you say, and what happens
A working turn looks like this from the user's seat:
- You speak. "터미널 작업 요약해 줘." (Or in English: "Summarise what the terminal is doing.")
- The mic indicator pulses, then settles. The bot transcribes locally — no spinner waiting on a cloud call.
- Your phrase appears as a chat bubble, immediately followed by the bot's reply: a couple of sentences summarising what the active terminal is showing.
- Your speakers play the bot's reply. The mic indicator dims to muted while the bot is speaking — no echo, no double-take.
- The mic comes back on automatically the moment playback drains. You can speak the next thing without touching the keyboard or the mute button.
Three round-trips per minute is comfortable. Five is fine if you don't hesitate. The bottleneck is honest thinking time, not the stack.
Why this is a different feeling than "voice assistant"
Two reasons.
The bot doesn't have to know what your CLI does. The summary you heard above came from Claude CLI sitting in a terminal tab. Tomorrow, that tab can be running Codex, Gemini CLI, an SSH session into a server, a python REPL: anything that's a terminal. The bot's tools (send_to_terminal, read_terminal) are the same. There's no "Claude integration" or "Codex integration"; there's terminal integration, and the agents are reachable through it.
It's all yours. The audio never leaves your machine. The Whisper model is downloaded once and lives in ~/.ollama/models/agentzero/whisper/. Heami is part of Windows. Claude CLI talks to Anthropic over the network for its own work, but the voice path is closed-loop on the local box. That changes what feels safe to say out loud.
Why a "free stack" is the headline, not a footnote
Voice-controlled AI agents have existed for years; what's been missing for self-hosters is a path that doesn't require:
- An OpenAI or ElevenLabs key for TTS,
- A Deepgram or Google Speech key for streaming STT,
- A custom server pipeline that fans the audio out and back,
- Or, in many "free" demos, an open-source TTS model whose voice sounds like a 1996 GPS unit.
AgentZero Lite picks four off-the-shelf pieces that are already on most developer machines and welds them together:
| Layer | Piece | License / cost |
| ------: | :------ | :--------------- |
| STT | Whisper.net • GGML small model | MIT, runs offline (Vulkan / CPU) |
| TTS | Windows SAPI (Microsoft Heami) | OS-bundled, free |
| Agent reasoning | Whatever LLM the user already runs (local or external) | User's existing setup |
| Worker CLI in the terminal | Claude CLI / Codex CLI / Gemini CLI / shell | User's existing tools |
That's the whole shopping list. Optionally swap Whisper for OpenAI Whisper API or Webnori Gemma audio for higher quality on noisy mics; swap Heami for OpenAI TTS for a more natural voice. Both are toggles in the Voice settings panel. The default everyone sees on first launch is the free stack above.
What's inside
Component map
    flowchart LR
        subgraph User["User"]
            Mic[("🎙 Microphone")]
            Speaker[("🔊 Speakers")]
        end
        subgraph WPF["AgentZero Lite (WPF host)"]
            VCS[VoiceCaptureService<br/>NAudio WaveInEvent<br/>16 kHz mono 50 ms frames]
            VSA[VoiceStreamActor<br/>under /user/stage/voice]
            Whisper[Whisper.net<br/>local GGML]
            SAPI[Windows TTS<br/>SAPI / Heami]
            Naudio[NAudioPlaybackQueue]
            AgentBot[AgentBotActor<br/>+ AgentReactorActor<br/>tool loop]
            TermActors[TerminalActor pool<br/>per ConPTY tab]
        end
        subgraph Agents["Worker agents (in ConPTY tabs)"]
            Claude[Claude CLI]
            Codex[Codex CLI]
            Other[Gemini / shells / REPLs]
        end
        Mic -->|PCM frames| VCS
        VCS -->|MicFrame| VSA
        VSA -->|TranscribeRequest| Whisper
        Whisper -->|transcript| VSA
        VSA -->|VoiceTranscriptReady| AgentBot
        AgentBot -->|StartReactor| TermActors
        TermActors -->|send_to_terminal<br/>read_terminal| Agents
        Agents -.->|tool result text| TermActors
        TermActors -->|ReactorResult<br/>final summary| AgentBot
        AgentBot -->|SpeakText| VSA
        VSA -->|SynthesizeRequest| SAPI
        SAPI -->|wav bytes| Naudio
        Naudio -->|WaveOut| Speaker
        Naudio -.->|PlaybackStarted/Stopped| VSA
        VSA -.->|OnTtsPlaybackChanged| VCS
The dotted edges are the control plane. The most important one is VSA → VCS (OnTtsPlaybackChanged): that's how the mic knows to mute itself while the speakers are busy with the bot's voice.
One full voice round-trip, beat by beat
    sequenceDiagram
        autonumber
        actor U as User
        participant Mic as VoiceCaptureService
        participant VSA as VoiceStreamActor
        participant W as Whisper.net
        participant Bot as AgentBotActor
        participant Reactor as AgentReactorActor
        participant Term as TerminalActor (Claude tab)
        participant CLI as Claude CLI
        participant TTS as Windows TTS
        participant PB as NAudioPlaybackQueue
        participant Spk as Speakers
        U->>Mic: speak: "터미널 작업 요약해 줘"
        Mic->>VSA: MicFrames (50 ms PCM, 16 kHz)
        Note over VSA: VAD segmenter buffers<br/>until ~2 s silence
        VSA->>W: PCM segment (~3.5 s)
        W-->>VSA: transcript (chars=15)
        Note over VSA: WhisperHallucinationFilter ✓<br/>VoiceCommandInterceptor → SummarizeTerminal<br/>Snapshot GetConsoleText() once
        VSA->>Bot: StartReactor(prompt + terminal snapshot)
        Bot->>Reactor: run tool loop
        Reactor->>Term: tool: read_terminal (group=0, tab=2)
        Term-->>Reactor: terminal screen text
        Reactor->>Reactor: model decides "done"<br/>with summary in args.message
        Reactor-->>Bot: ReactorResult(success, finalMessage)
        Bot->>Bot: AddBotMessage(finalMessage)
        Bot->>VSA: SpeakText(finalMessage, voice)
        VSA->>TTS: synthesize chunk(s)
        TTS-->>VSA: wav bytes per sentence
        VSA->>PB: Enqueue(audio, format=wav)
        PB->>Spk: WaveOut.Play()
        PB-->>VSA: PlaybackStarted
        VSA-->>Mic: OnTtsPlaybackChanged(true) → Muted=true
        Note over Mic,Spk: Mic captures audio for the level meter<br/>but VAD/forwarder are frozen
        PB->>Spk: …chunks drain in arrival order…
        PB-->>VSA: PlaybackStopped (queue idle)
        VSA-->>Mic: OnTtsPlaybackChanged(false) → Muted=false
        Note over Mic: User can speak again
Steps 17–18 open the auto-mute envelope. They look like one transition because they happen in the same dispatcher tick, but three things happen in order: the queue raises PlaybackStarted, the callback crosses to the WPF side, and VoiceCaptureService.Muted is set to true. The same sequence runs in reverse on drain (steps 20–21).
The interesting part of the loop is steps 5 through 10: the bot's reasoning happens in an Akka actor, the tool calls hit a terminal actor, the terminal actor talks to a ConPTY session running Claude CLI, and what Claude printed surfaces back through the same actor. Voice and the agent loop never block each other, because Akka.NET keeps them on different dispatchers.
Why an actor sits between the mic and the speaker
    graph TB
        subgraph Stage["StageActor — supervisor"]
            AgentBot[AgentBotActor<br/>chat orchestration]
            Reactor[AgentReactorActor<br/>tool loop]
            Voice[VoiceStreamActor<br/>singleton<br/>under /user/stage/voice]
            WS[WorkspaceActor<br/>per workspace]
            TA1[TerminalActor<br/>Claude tab]
            TA2[TerminalActor<br/>Codex tab]
            TA3[TerminalActor<br/>shell tab]
        end
        AgentBot --- Reactor
        Reactor -.->|read/send| TA1
        Reactor -.->|read/send| TA2
        Reactor -.->|read/send| TA3
        AgentBot -.->|SpeakText| Voice
        Voice -.->|VoiceTranscriptReady| AgentBot
        WS --- TA1
        WS --- TA2
        WS --- TA3
Three concrete benefits the Akka.NET shape buys us:
- One backpressure boundary for the whole audio path. Source.Queue<MicFrame>(64, OverflowStrategy.DropHead) at the front of the input graph means the soundcard can never block the actor. If STT lags, old frames drop instead of stalling capture, so the user hears no clicks and sees no jitter.
- Atomic teardown of the output side via KillSwitch. When the user says "그만" (stop), one Tell (BargeIn) collapses the OUTPUT graph: token queue closed, kill switch shut, playback queue stopped, mic auto-unmuted. Mid-sentence interruption with no leftover audio in the pipeline.
- Settings reload happens at the seam, not the leaves. VoiceRuntimeFactory.BuildStt(VoiceSettingsStore.Load()) is invoked at the moment a worker actor spawns, so changing a voice provider in the settings panel takes effect on the next worker without restarting the actor or the app.
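The drop-head behaviour in the first bullet can be tried without Akka Streams. The sketch below is not the project's code: it swaps in the BCL's bounded channel from System.Threading.Channels with BoundedChannelFullMode.DropOldest, which gives the same "newest frames win" semantics as OverflowStrategy.DropHead. The MicFrame record and the DropHeadDemo helper are hypothetical stand-ins for illustration.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;

// Hypothetical stand-in for the real MicFrame message (assumption).
public readonly record struct MicFrame(int Sequence);

public static class DropHeadDemo
{
    // Bounded queue that evicts the oldest frame when full, so the
    // producer (the soundcard callback) never blocks.
    public static Channel<MicFrame> CreateQueue(int capacity) =>
        Channel.CreateBounded<MicFrame>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true,
        });

    // Pushes `count` frames with no consumer running, then drains what survived.
    public static List<int> PushThenDrain(int capacity, int count)
    {
        var queue = CreateQueue(capacity);
        for (var i = 0; i < count; i++)
            queue.Writer.TryWrite(new MicFrame(i)); // never blocks in DropOldest mode

        var survivors = new List<int>();
        while (queue.Reader.TryRead(out var frame))
            survivors.Add(frame.Sequence);
        return survivors;
    }
}
```

Calling DropHeadDemo.PushThenDrain(3, 5) leaves only the three newest frames in the queue. That is the property the capture path relies on: a slow consumer costs stale audio, not a stalled producer.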
The five pillars
End-to-end voice-driving Claude CLI on a free stack only feels seamless because five small pieces line up. Each one was a real bug or a real gap before this commit landed.
Pillar 1 — Whisper hallucination filter
Whisper was trained on enormous amounts of YouTube audio. On near-silent input it confidently emits Korean creator outros: "감사합니다", "시청해주셔서 감사합니다", "다음 영상에서 만나요". English equivalent: "Thank you for watching." These aren't bugs in the model; they're statistical priors winning when there's nothing better to predict.
A simple normalising filter sits between STT and the dispatch layer:
    public static bool IsLikelyHallucination(string? transcript)
    {
        if (string.IsNullOrWhiteSpace(transcript)) return false;
        var trimmed = transcript.Trim();
        if (Patterns.Symbolic.Contains(trimmed)) return true;   // ♪, [Music]
        var n = Normalize(trimmed);                             // letters/digits, lowercase
        if (n.Length == 0) return false;
        return Patterns.Normalised.Contains(n);
    }
It only matches the whole transcript — a sentence containing the phrase plus other words passes through. So "감사합니다 잠깐만요" survives, "감사합니다." alone is dropped. 23 unit tests pin the boundary.
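The excerpt above references Patterns and Normalize without showing them. Here is a minimal self-contained sketch of how the pieces could fit together. The pattern lists are small samples taken from the phrases quoted in this article, not the project's real lists, and the class name WhisperHallucinationFilterSketch is made up for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhisperHallucinationFilterSketch
{
    // Illustrative samples only (assumption): the real pattern lists are not shown in the article.
    private static readonly HashSet<string> Symbolic =
        new(StringComparer.Ordinal) { "♪", "[Music]" };

    // Stored in normalised form so lookups ignore punctuation, spacing, and case.
    private static readonly HashSet<string> Normalised =
        new(new[]
        {
            "감사합니다",
            "시청해주셔서 감사합니다",
            "다음 영상에서 만나요",
            "Thank you for watching.",
        }.Select(Normalize), StringComparer.Ordinal);

    // Keeps letters and digits only, lowercased.
    public static string Normalize(string text) =>
        new string(text.Where(c => char.IsLetterOrDigit(c))
                       .Select(c => char.ToLowerInvariant(c))
                       .ToArray());

    public static bool IsLikelyHallucination(string? transcript)
    {
        if (string.IsNullOrWhiteSpace(transcript)) return false;
        var trimmed = transcript.Trim();
        if (Symbolic.Contains(trimmed)) return true;
        var n = Normalize(trimmed);
        // Whole-transcript match: a known phrase plus extra words normalises
        // to a longer string and therefore passes through.
        return n.Length != 0 && Normalised.Contains(n);
    }
}
```

With these sample lists, "감사합니다." alone is dropped while "감사합니다 잠깐만요" survives, which is the boundary described above.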
Pillar 2 — Voice command interceptor
Two phrases short-circuit the LLM dispatch entirely:
    flowchart LR
        T[Transcript] --> C{Classify}
        C -->|"그만 / stop"| Stop[VoiceCommandIntent.StopSpeaking]
        C -->|contains 터미널 + 요약| Sum[VoiceCommandIntent.SummarizeTerminal]
        C -->|else| Pass[VoiceCommandIntent.PassThrough]
        Stop --> BI[Tell BargeIn → kill OUTPUT graph]
        Sum --> SS[Snapshot GetConsoleText → embed → SendThroughAiToolLoopAsync]
        Pass --> SCI[fill txtInput → SendCurrentInput]
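The classification in the flowchart can be sketched as a pure function. This is a sketch under assumptions, not the project's classifier: the punctuation set is a guess, and accepting mixed-language pairs such as "terminal" plus "요약" is my reading of how code-switched transcripts should be handled. Only the VoiceCommandIntent names come from the diagram.

```csharp
using System;

public enum VoiceCommandIntent { PassThrough, StopSpeaking, SummarizeTerminal }

public static class VoiceCommandClassifierSketch
{
    // Assumed set of trailing characters to ignore (hypothetical).
    private static readonly char[] TrailingPunctuation = { '.', '!', '?', ',', '…', ' ' };

    public static VoiceCommandIntent Classify(string transcript)
    {
        var text = (transcript ?? string.Empty).Trim().TrimEnd(TrailingPunctuation);

        // Whole-utterance match only: "그만" inside a longer sentence must not cancel TTS.
        if (text.Equals("그만", StringComparison.Ordinal) ||
            text.Equals("stop", StringComparison.OrdinalIgnoreCase))
            return VoiceCommandIntent.StopSpeaking;

        bool Has(string s) => text.Contains(s, StringComparison.OrdinalIgnoreCase);

        // Both concepts must be present, in any order and in either language.
        var mentionsTerminal = Has("터미널") || Has("terminal");
        var mentionsSummary = Has("요약") || Has("summar");
        if (mentionsTerminal && mentionsSummary)
            return VoiceCommandIntent.SummarizeTerminal;

        return VoiceCommandIntent.PassThrough;
    }
}
```

Because the function has no dependencies, it is cheap to pin with unit tests: Classify("그만.") is StopSpeaking, Classify("이거 그만 하면 좋겠어") is PassThrough, and Classify("터미널 작업 요약해 줘") is SummarizeTerminal.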
StopSpeaking matches the whole utterance after stripping trailing punctuation, so saying "이거 그만 하면 좋겠어" doesn't accidentally cancel TTS; 그만 is a normal word too.
SummarizeTerminal requires both "터미널" and "요약" (or "terminal" and "summar", since Whisper sometimes code-switches on Korean tech jargon even with lang=ko). Order doesn't matter. The classifier is pure logic with 26 unit tests.
Pillar 3 — Auto-mute envelope (NAudio contract + self-tracked flag)
Two coordinated fixes break the bot-transcribes-itself feedback loop.
3a. The NAudio queue had a contract violation. IAudioPlaybackQueue.PlaybackStarted was supposed to fire once per idle→busy transition. The implementation fired per chunk. A multi-sentence bot reply produced N starts and 1 stop, so any state machine on the WPF side would latch on the second start.

    stateDiagram-v2
        direction LR
        [*] --> Idle
        Idle --> Busy: Enqueue + Started ✓
        Busy --> Busy: Enqueue (no event)
        Busy --> Busy: chunk drains, next clip starts<br/><i>Started fires again — BUG</i>
        Busy --> Idle: queue drains + Stopped ✓

After the fix:
    stateDiagram-v2
        direction LR
        [*] --> Idle
        Idle --> Busy: Enqueue + Started ✓<br/>_started = true
        Busy --> Busy: Enqueue (no event)
        Busy --> Busy: chunk drains, next clip starts<br/><i>silent — _started already true</i>
        Busy --> Idle: queue drains + Stopped ✓<br/>_started = false
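The fixed state machine is small enough to model directly. The sketch below leaves out the audio output entirely and only tracks the event contract. PlaybackQueueSketch and its OnClipFinished hook are hypothetical names; only the _started flag and the two event names come from the diagram. It is also not thread-safe, which a real queue driven by audio callbacks would need to be.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Minimal model of the idle→busy contract; actual audio output is omitted.
public sealed class PlaybackQueueSketch
{
    // Head of the queue is the clip currently playing.
    private readonly Queue<byte[]> _pending = new();
    private bool _started;

    public event Action? PlaybackStarted;
    public event Action? PlaybackStopped;

    public void Enqueue(byte[] clip)
    {
        _pending.Enqueue(clip);
        if (_started) return;        // already busy: no event
        _started = true;
        PlaybackStarted?.Invoke();   // once per idle→busy transition
    }

    // Hypothetical hook: called when the current clip finishes playing.
    public void OnClipFinished()
    {
        if (_pending.Count > 0) _pending.Dequeue();
        if (_pending.Count > 0 || !_started) return; // next clip starts silently
        _started = false;
        PlaybackStopped?.Invoke();   // once per busy→idle transition
    }
}
```

Enqueueing three clips and finishing all three raises exactly one PlaybackStarted and one PlaybackStopped, so a listener on the WPF side sees a single clean envelope per bot reply.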
3b. The mute handler now self-tracks ownership. Instead of snapshotting the user's prior mute state on every Started event (and hoping the upstream contract holds), the WPF callback owns one bit:
    private bool _autoMutedByTts;

    private void OnTtsPlaybackChanged(bool isPlaying)
    {
        if (isPlaying)
        {
            if (!_voiceCapture.Muted)
            {
                SetVoiceMicMuted(true, source: "tts-auto");
                _autoMutedByTts = true;
            }
        }
        else
        {
            if (_autoMutedByTts)
            {
                SetVoiceMicMuted(false, source: "tts-auto");
                _autoMutedByTts = false;
            }
        }
    }
If the user manually muted before the bot spoke, _autoMutedByTts stays false and the natural drain at end doesn't override their preference. If we muted, we unmute. One bit, no snapshot.
Pillar 4 — AIMODE tool loop
The reasoning side is a small, well-bounded loop. Five tools, JSON envelopes, a hard iteration cap.
    flowchart TD
        Start([User instruction +<br/>optional terminal snapshot])
        Start --> Sys[Append system prompt + tool catalog<br/>on first send only]
        Sys --> Loop{For iter in 0..MaxIterations}
        Loop --> Gen[Stream LLM response]
        Gen --> Parse{Extract first JSON object<br/>+ ParseToolCall}
        Parse -->|done| End([Final summary returned<br/>via ReactorResult])
        Parse -->|tool name| Exec[ExecuteToolAsync]
        Exec -->|list_terminals| LT[Host returns groups + tabs]
        Exec -->|read_terminal| RT[Host calls GetConsoleText / ReadOutput]
        Exec -->|send_to_terminal| ST[Host writes via WriteAndEnter]
        Exec -->|send_key| SK[Host sends control sequence]
        Exec -->|wait| W[Task.Delay clamped 1..30 s]
        LT --> Result[Append --- TOOL RESULT --- to history]
        RT --> Result
        ST --> Result
        SK --> Result
        W --> Result
        Result --> Loop
        Loop -->|cap reached| Fail([Failure surfaced via FailureReason])
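The Parse step in the flowchart is where a model reply becomes a tool call, and models do omit fields. Below is a minimal sketch of a tolerant parser built on System.Text.Json. The envelope shape (a "tool" string plus an optional "args" object) is inferred from this article, and ToolCall and TryParseToolCall are hypothetical names, not the project's ParseToolCall.

```csharp
#nullable enable
using System.Text.Json;

public sealed record ToolCall(string Tool, JsonElement Args);

public static class ToolCallParserSketch
{
    // Returns false instead of throwing when the model's JSON is malformed or incomplete.
    public static bool TryParseToolCall(string json, out ToolCall? call, out string? error)
    {
        call = null;
        error = null;
        try
        {
            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;

            // TryGetProperty throws on non-objects, so check the kind first.
            if (root.ValueKind != JsonValueKind.Object)
            {
                error = "tool call must be a JSON object";
                return false;
            }

            if (!root.TryGetProperty("tool", out var tool) ||
                tool.ValueKind != JsonValueKind.String)
            {
                error = "missing or non-string \"tool\" field";
                return false;
            }

            // "args" is treated as optional here; Clone() detaches it from the disposed document.
            var args = root.TryGetProperty("args", out var a) && a.ValueKind == JsonValueKind.Object
                ? a.Clone()
                : default;

            call = new ToolCall(tool.GetString()!, args);
            return true;
        }
        catch (JsonException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

The design choice is to turn every malformed reply into an ordinary failure value the loop can report or retry on, so no raw .NET exception text ever reaches the chat panel.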
For a "summarise the terminal" turn, the loop is short: one read_terminal call, then done with the summary in args.message. That summary is what AddBotMessage puts on screen and what SpeakText reads aloud.
ParseToolCall had a long-dormant bug worth flagging: JsonElement.GetProperty("tool") throws KeyNotFoundException (not JsonException) when the model omits a field. The catch only caught JsonException, so the user occasionally saw the raw .NET error "The given key was not present in the dictionary." The fix uses TryGetProperty plus an explicit JsonValueKind guard. Three regression tests now lock the contract.
Pillar 5 — Static snapshot for the summary prompt
The "summarise" command had to be deterministic. ConPTY output is a stream: the ConsoleOutputLog StringBuilder is constantly appending. If we asked the LLM to "go look at it", two consecutive summarise requests could read overlapping chunks and produce confusingly redundant answers.

    flowchart LR
        Voice[Voice intent: SummarizeTerminal] --> Snap["GetConsoleText()<br/>called ONCE at request time"]
        Snap --> Build["BuildTerminalSummaryPrompt<br/>fenced code block"]
        Build --> Bot["SendThroughAiToolLoopAsync<br/>aiInput = enriched<br/>displayText = user phrase"]
        Bot --> LLM[LLM sees a static window]
        LLM --> Sum[One-shot summary]

The user's natural phrase ("터미널 작업 요약해 줘") becomes the chat bubble; the LLM's prompt is the phrase plus a fenced block of the snapshot. Two requests in sequence simply produce two fresh snapshots. No shared streaming state, no risk of duplicate analysis.
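    private static string BuildTerminalSummaryPrompt(string userPhrase, string terminalText)
    {
        // Korean instruction, roughly: "Below is the current screen output of the
        // active terminal the user is looking at. Summarise it concisely in Korean.
        // Point out only the essentials: what was run and what results or errors occurred."
        return $"{userPhrase}\n\n"
            + "아래는 사용자가 보고 있는 활성 터미널의 현재 화면 출력입니다. "
            + "이 내용을 한국어로 간결하게 요약해 주세요. "
            + "어떤 작업이 실행되었고 어떤 결과 / 에러가 있었는지 핵심만 짚어주세요.\n\n"
            + "```\n" + terminalText + "\n```";
    }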
What one round-trip looks like in the diagnostic log
Excerpt from a real session, slightly trimmed for legibility. Timestamps run down the left edge:
    [t0]     [BOT-Voice-pipe] [t0] utterance-start
    [+2.6s]  [BOT-Voice-pipe] [t1] utterance-end | t1-t0=2600ms · pcm=115200 bytes (~3.60s)
    [+2.6s]  [BOT-Voice-pipe] [t2] pipeline-start | provider=WhisperLocal · lang=ko · peak=-10.2dBFS · rms=-32.5dBFS · VAR=25.0%
    [+5.9s]  [BOT-Voice-pipe] [stage] STT transcribe | 3319ms · chars=15
    [+5.9s]  [BOT-Voice] (batch) summarize-terminal | terminalChars=4821
    [+5.9s]  [AIMODE] StartReactor sent (backend=External, model=google/gemma-4-e4b)
    [+9.0s]  [AIMODE] result success=True turns=1 elapsed=3127ms final="현재 활성 Claude 터미널의 최근 5개 커밋은 …"
    [+9.5s]  [BOT-Voice] Mic MUTED (tts-auto)
    [+9.5s]  [BOT-Voice] TTS started — autoMuted=True micMuted=True
    [+15.4s] [BOT-Voice] Mic UNMUTED (tts-auto)
    [+15.4s] [BOT-Voice] TTS stopped — autoMutedNow=False micMuted=False
Total perceived latency from finishing the sentence to hearing the first syllable of the reply is about 6.6 to 6.9 seconds: the stage timings sum to roughly 6.6 s (3.3 s STT + 3.1 s LLM + under 0.2 s TTS), while the rounded timestamps give 6.9 s (utterance end at +2.6 s, TTS start at +9.5 s). STT is the dominant cost (3.3 s on CPU; about half that on Vulkan). The LLM takes 3.1 s, capped by MaxTokens=128 in voice mode so the reply stays speakable. TTS synth plus first audio is sub-200 ms because Heami is local.
The mute envelope is the tight window from TTS started to TTS stopped (log lines 9 to 11): about six seconds of speech, exactly bracketed by the auto-mute messages. There are no phantom mid-burst Started events after the queue contract fix, which is why this is one clean envelope and not the latching pattern the old build would produce.
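The figures in the log can be cross-checked with a few lines of arithmetic. A minimal sketch follows: 16 kHz mono comes from the capture settings described earlier, while 16-bit samples are an assumption that happens to match the logged byte count. VoiceLatencyMath is a made-up helper name.

```csharp
using System;

public static class VoiceLatencyMath
{
    // Duration of raw PCM audio. The 2-byte sample size is an assumption
    // inferred from the log (115200 bytes reported as ~3.60 s).
    public static double PcmSeconds(int byteCount, int sampleRate = 16_000,
                                    int channels = 1, int bytesPerSample = 2) =>
        (double)byteCount / (sampleRate * channels * bytesPerSample);

    // Perceived latency: end of the utterance to the first audible reply,
    // both given as seconds on the log's timeline.
    public static double PerceivedLatencySeconds(double utteranceEnd, double ttsStarted) =>
        ttsStarted - utteranceEnd;
}
```

PcmSeconds(115200) reproduces the utterance length printed on the utterance-end line, and PerceivedLatencySeconds takes the utterance-end and TTS-started timestamps straight from the left-hand column.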
Codex compatibility (and beyond)
Nothing in the bot's tool catalog mentions Claude. The five tools (list_terminals, read_terminal, send_to_terminal, send_key, wait) operate on (group, tab) coordinates; done simply ends the loop. The terminal actor underneath is a ConPtyTerminalSession that wraps the Microsoft Terminal Control's PTY surface. So:
- A tab running claude becomes a Claude peer.
- A tab running codex becomes a Codex peer.
- A tab running gemini becomes a Gemini peer.
- A tab running pwsh becomes a shell.
- A tab running python becomes a REPL.
- A tab running psql becomes a database session.
The voice → bot → terminal direction works for any of them. The reverse direction (peer → bot, e.g., Claude reporting back via WM_COPYDATA) is wired today only for tabs that actively talk back through the agent IPC channel; passive CLIs surface their output via the bot's next read_terminal call, which is enough for the summarise-style turns this article is about.
The same voice infrastructure also drives:
- "Tell Claude to refactor file X": the bot writes the instruction into the Claude tab via send_to_terminal, hits Enter, then summarises the response when Claude finishes.
- "List the terminals": the bot calls list_terminals and reads aloud what's in each tab.
- "Cancel that": 그만 tears down the OUTPUT graph if the bot is mid-sentence; you get audio silence within the next chunk boundary.
What's still on the table
A few honest gaps:
- Latency is honest, not great. ~3.3 s STT on CPU is the dominant cost. Vulkan halves it. large-v3 improves accuracy but adds another second. There are no streaming partials yet: the segmenter waits for ~2 s of silence before declaring an utterance, which adds another second to the perceived delay. Streaming partials are a stretch goal for the next iteration.
- MaxTokens=128 on voice replies is a guardrail, not a strategy. It keeps replies speakable but truncates legitimate longer answers. A better answer is sentence-chunked TTS that starts speaking before the LLM finishes, but that requires token-level streaming through the OUTPUT graph; today the OUTPUT graph receives the whole final message at once.
- Mute envelope still has a race. When two SpeakText arrivals straddle a sub-200 ms gap (rare in practice), the playback queue's natural drain and the next start can overlap such that the mic auto-unmutes for one or two frames. A small "hold mute for N ms after stop" timer would close it; deferred until we see whether real users hit it.
- Settings hot-reload is partial. Factory closures load settings at call time, so new TTS workers see the latest provider — but the existing pool keeps the original. Switching providers mid-session technically requires a mic toggle. An explicit "settings changed → reset pool" Tell would close it.
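The "hold mute for N ms after stop" idea from the mute-envelope gap can be sketched as a small state holder with an injected clock. This is only one possible shape, not something the project ships: MuteHoldSketch and its method names are hypothetical, and it deliberately ignores the user's manual mute state, which the real handler tracks separately.

```csharp
using System;

// Decides when an auto-muted mic may be released, keeping it muted
// for a short hold window after playback stops.
public sealed class MuteHoldSketch
{
    private readonly TimeSpan _hold;
    private DateTimeOffset? _stoppedAt; // null while playback is active
    private bool _autoMuted;

    public MuteHoldSketch(TimeSpan hold) => _hold = hold;

    // A new start inside the hold window cancels the pending unmute.
    public void OnPlaybackStarted()
    {
        _autoMuted = true;
        _stoppedAt = null;
    }

    public void OnPlaybackStopped(DateTimeOffset now)
    {
        if (_autoMuted) _stoppedAt = now;
    }

    // Poll from a timer tick: returns true exactly once, after the hold has elapsed.
    public bool ShouldUnmute(DateTimeOffset now)
    {
        if (!_autoMuted || _stoppedAt is null || now - _stoppedAt.Value < _hold)
            return false;
        _autoMuted = false;
        _stoppedAt = null;
        return true;
    }
}
```

With a hold of a few hundred milliseconds, a second SpeakText that lands just after the queue drains restarts playback before the unmute fires, so the mic never opens for the one or two stray frames described above. The cost is that the mic comes back slightly later after every reply.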
Closing
There are two kinds of "voice agent" demo. The first kind sends your audio to a cloud STT, reasons in a hosted LLM, synthesises in a cloud TTS, and sounds great. The second kind runs on your laptop with no keys and proves the loop closes.
This is the second kind. The voice quality is "Heami in 2026" (good enough for development; not a production VO artist). Latency is honest (~6 s end-to-end on CPU, ~4 s on Vulkan). The reasoning is whatever LLM you've already configured. And the worker on the other end of the bot's keyboard can be Claude, Codex, Gemini, or your shell — the bot doesn't care, because terminals are the lingua franca.
What the build had to get right to make this feel natural: a hallucination filter so noise doesn't reach Claude as instructions, an intent classifier so 그만 and 터미널 요약 short-circuit cleanly, an auto-mute envelope so the bot doesn't transcribe itself, a tool loop that stays inside its envelope, and a static snapshot that keeps consecutive summaries deterministic. Each of those was small. The seams between them are where the work was.
If you want to drive your own CLI agents by voice on a free stack, the four pieces are: Whisper.net, Windows SAPI, an Akka.NET-shaped IPC seam, and ConPTY for the terminals. The audio never leaves the box. The bot doesn't know which CLI you're typing into. And it works.
- Built with Claude Code Opus 4.7 (1M context)
Screenshot of it working
- TIP : a basic command-line text editor on Windows (like MS-DOS Editor) - TestCLI for VoiceInput
TECH LINKS
- 𝕏 @webnori