Voice-Driving Claude CLI on a Free Stack — A Working End-to-End Round-Trip in AgentZero Lite

Series: AgentZero Lite — Part 11. The build report for this work is Part 10 — Shipping Voice Output to AIMODE (the engineering story). This part is a different angle on the same commit: the user-facing demo. It works. Here's what that looks like and why.

TL;DR

I spoke to AgentZero Lite's bot. The bot wrote my instruction into a Claude CLI terminal. Claude did the work, the bot read the result back from the terminal, summarised it, and the summary came through my speakers — all in a single voice turn.

Zero API keys in the audio path. STT is Whisper.net running a local GGML model. TTS is Windows SAPI (the OS-bundled Heami voice). Claude CLI is the only LLM in the loop, and it's the user's existing CLI — not a hosted API call.

The same wiring works for Codex CLI, Gemini CLI, or any other peer agent that lives in a terminal — the bot doesn't know which CLI it's typing into. It just owns the keyboard.

The voice turn auto-mutes the mic during TTS playback so the bot doesn't transcribe itself. That detail sounds trivial; it isn't (see the build report, Part 10, for the post-mortem).

The show, before we open the box

Skip this section if you want the implementation. Read it if you want to know what the user actually does.

What you see on screen

The AgentZero Lite WPF app has two halves: a multi-terminal area on the left, and the AgentBot chat panel on the right. The AgentBot has three modes — Chat (your typing goes to the active terminal), Key (each key goes to the active terminal in real time), and AI (the bot uses tools to talk to whichever AI agent is running in a terminal).

For voice work, AI mode is the one. Here's a faithful ASCII redraw of the live screen during a voice turn:


┌─────────────────────────────────────────────────────────────────────────────────┐
│ AgentZero Lite                                                       [_][□][X]  │
├──────────────────────────────────────────┬──────────────────────────────────────┤
│  [CMD] [EDIT] [Claude•] [Codex] [+]      │  ◂ AgentBot                [ AI ▼ ]  │
│ ─────────────────────────────────────    │ ──────────────────────────────────── │
│                                          │  ▸ You                               │
│  D:\Code\AI\AgentZeroLite>               │    "터미널 작업 요약해 줘"            │
│  $ git log --oneline -5                  │                                      │
│  988b4cd feat(voice): AI-mode TTS …      │  ◂ Gemma                             │
│  c17717e chore: bump version to 0.2.1    │    "활성 Claude 터미널의 최근 5개    │
│  2ae8b75 fix(ui/dock): break floating-…  │     커밋은 voice / dock / version 세 │
│  5622080 fix(ui/dock): don't auto-grab…  │     영역에 집중되어 있고, 가장 최근  │
│  95c6227 DOCUPDATE                       │     변경은 AI 모드 TTS 출력입니다."  │
│                                          │                                      │
│  Claude>                                 │  ✓ done · 3127ms · 1 turn            │
│                                          │                                      │
│                                          │  ┌─────────────────────────────────┐ │
│                                          │  │ 🎙 Listening · Whisper small    │ │
│                                          │  │  ▆▆▅▂▁▁▁▂▃▅▆▆▇  [🔇] [🔊 75%]  │ │
│                                          │  └─────────────────────────────────┘ │
└──────────────────────────────────────────┴──────────────────────────────────────┘
                                                                ↓
                                     🔊 (speakers play the Gemma reply,
                                          mic auto-mutes for the duration)

The dot on Claude• marks the active terminal: the one the bot reads from and writes to when its tools fire. The waveform strip at the bottom of the AgentBot panel shows the live audio level; the icons next to it are the mute toggle and the system mic volume.

What you say, and what happens

A working turn looks like this from the user's seat:

You speak. "터미널 작업 요약해 줘." (Or in English: "Summarise what the terminal is doing.")

The mic indicator pulses, then settles. The bot transcribes locally — no spinner waiting on a cloud call.

Your phrase appears as a chat bubble, immediately followed by the bot's reply: a couple of sentences summarising what the active terminal is showing.

Your speakers play the bot's reply. The mic indicator dims to muted while the bot is speaking — no echo, no double-take.

The mic comes back on automatically the moment playback drains. You can speak the next thing without touching the keyboard or the mute button.

Three round-trips per minute is comfortable. Five is fine if you don't hesitate. The bottleneck is honest thinking time, not the stack.

Why this is a different feeling than "voice assistant"

Two reasons.

The bot doesn't have to know what your CLI does. The summary you heard above came from Claude CLI sitting in a terminal tab. Tomorrow, that tab can be running Codex, Gemini CLI, an SSH session into a server, or a Python REPL: anything that's a terminal. The bot's tools (send_to_terminal, read_terminal) are the same. There's no "Claude integration" or "Codex integration"; there's terminal integration, and the agents are reachable through it.

It's all yours. The audio never leaves your machine. The Whisper model is downloaded once and lives in ~/.ollama/models/agentzero/whisper/. Heami is part of Windows. Claude CLI talks to Anthropic over the network for its own work, but the voice path is closed-loop on the local box. That changes what feels safe to say out loud.

Why a "free stack" is the headline, not a footnote

Voice-controlled AI agents have existed for years; what's been missing for self-hosters is a path that doesn't require:

An OpenAI or ElevenLabs key for TTS,

A Deepgram or Google Speech key for streaming STT,

A custom server pipeline that fans the audio out and back,

Or, in many "free" demos, an open-source TTS model whose voice sounds like a 1996 GPS unit.

AgentZero Lite picks four off-the-shelf pieces that are already on most developer machines and welds them together:

| Layer | Piece | License / cost |
| ------: | :------ | :--------------- |
| STT | Whisper.net • GGML small model | MIT, runs offline (Vulkan / CPU) |
| TTS | Windows SAPI (Microsoft Heami) | OS-bundled, free |
| Agent reasoning | Whatever LLM the user already runs (local or external) | User's existing setup |
| Worker CLI in the terminal | Claude CLI / Codex CLI / Gemini CLI / shell | User's existing tools |

That's the whole shopping list. Optionally swap Whisper for OpenAI Whisper API or Webnori Gemma audio for higher quality on noisy mics; swap Heami for OpenAI TTS for a more natural voice. Both are toggles in the Voice settings panel. The default everyone sees on first launch is the free stack above.

What's inside

Component map


flowchart LR
    subgraph User["User"]
        Mic[("🎙 Microphone")]
        Speaker[("🔊 Speakers")]
    end

    subgraph WPF["AgentZero Lite (WPF host)"]
        VCS[VoiceCaptureService<br/>NAudio WaveInEvent<br/>16 kHz mono 50 ms frames]
        VSA[VoiceStreamActor<br/>under /user/stage/voice]
        Whisper[Whisper.net<br/>local GGML]
        SAPI[Windows TTS<br/>SAPI / Heami]
        Naudio[NAudioPlaybackQueue]
        AgentBot[AgentBotActor<br/>+ AgentReactorActor<br/>tool loop]
        TermActors[TerminalActor pool<br/>per ConPTY tab]
    end

    subgraph Agents["Worker agents (in ConPTY tabs)"]
        Claude[Claude CLI]
        Codex[Codex CLI]
        Other[Gemini / shells / REPLs]
    end

    Mic -->|PCM frames| VCS
    VCS -->|MicFrame| VSA
    VSA -->|TranscribeRequest| Whisper
    Whisper -->|transcript| VSA
    VSA -->|VoiceTranscriptReady| AgentBot
    AgentBot -->|StartReactor| TermActors
    TermActors -->|send_to_terminal<br/>read_terminal| Agents
    Agents -.->|tool result text| TermActors
    TermActors -->|ReactorResult<br/>final summary| AgentBot
    AgentBot -->|SpeakText| VSA
    VSA -->|SynthesizeRequest| SAPI
    SAPI -->|wav bytes| Naudio
    Naudio -->|WaveOut| Speaker
    Naudio -.->|PlaybackStarted/Stopped| VSA
    VSA -.->|OnTtsPlaybackChanged| VCS

The dotted edges are the control plane. The most important one is VSA → VCS (OnTtsPlaybackChanged): that's how the mic knows to mute itself while the speakers are busy with the bot's voice.

One full voice round-trip, beat by beat


sequenceDiagram
    autonumber
    actor U as User
    participant Mic as VoiceCaptureService
    participant VSA as VoiceStreamActor
    participant W as Whisper.net
    participant Bot as AgentBotActor
    participant Reactor as AgentReactorActor
    participant Term as TerminalActor (Claude tab)
    participant CLI as Claude CLI
    participant TTS as Windows TTS
    participant PB as NAudioPlaybackQueue
    participant Spk as Speakers

    U->>Mic: speak: "터미널 작업 요약해 줘"
    Mic->>VSA: MicFrames (50 ms PCM, 16 kHz)
    Note over VSA: VAD segmenter buffers<br/>until ~2 s silence
    VSA->>W: PCM segment (~3.5 s)
    W-->>VSA: transcript (chars=15)
    Note over VSA: WhisperHallucinationFilter ✓<br/>VoiceCommandInterceptor → SummarizeTerminal<br/>Snapshot GetConsoleText() once
    VSA->>Bot: StartReactor(prompt + terminal snapshot)
    Bot->>Reactor: run tool loop
    Reactor->>Term: tool: read_terminal (group=0, tab=2)
    Term-->>Reactor: terminal screen text
    Reactor->>Reactor: model decides "done"<br/>with summary in args.message
    Reactor-->>Bot: ReactorResult(success, finalMessage)
    Bot->>Bot: AddBotMessage(finalMessage)
    Bot->>VSA: SpeakText(finalMessage, voice)
    VSA->>TTS: synthesize chunk(s)
    TTS-->>VSA: wav bytes per sentence
    VSA->>PB: Enqueue(audio, format=wav)
    PB->>Spk: WaveOut.Play()
    PB-->>VSA: PlaybackStarted
    VSA-->>Mic: OnTtsPlaybackChanged(true) → Muted=true
    Note over Mic,Spk: Mic captures audio for the level meter<br/>but VAD/forwarder are frozen
    PB->>Spk: …chunks drain in arrival order…
    PB-->>VSA: PlaybackStopped (queue idle)
    VSA-->>Mic: OnTtsPlaybackChanged(false) → Muted=false
    Note over Mic: User can speak again

Steps 17–18 open the auto-mute envelope. They look like one transition because they happen in the same dispatcher tick, but three things happen in order: playback starts, the callback crosses to the WPF side, and VoiceCaptureService.Muted is set to true. The same sequence runs in reverse on drain (steps 20–21).

The interesting part of the loop is between steps 5 and 10: the bot's reasoning happens in an Akka actor, the tool calls hit a terminal actor, the terminal actor types into a ConPTY session running Claude CLI, and Claude's response surfaces back through the same actor. Voice and the agent loop never block each other, because Akka.NET keeps them on different dispatchers.

Why an actor sits between the mic and the speaker


graph TB
    subgraph Stage["StageActor — supervisor"]
        AgentBot[AgentBotActor<br/>chat orchestration]
        Reactor[AgentReactorActor<br/>tool loop]
        Voice[VoiceStreamActor<br/>singleton<br/>under /user/stage/voice]
        WS[WorkspaceActor<br/>per workspace]
        TA1[TerminalActor<br/>Claude tab]
        TA2[TerminalActor<br/>Codex tab]
        TA3[TerminalActor<br/>shell tab]
    end
    AgentBot --- Reactor
    Reactor -.->|read/send| TA1
    Reactor -.->|read/send| TA2
    Reactor -.->|read/send| TA3
    AgentBot -.->|SpeakText| Voice
    Voice -.->|VoiceTranscriptReady| AgentBot
    WS --- TA1
    WS --- TA2
    WS --- TA3

Three concrete benefits the Akka.NET shape buys us:

One backpressure boundary for the whole audio path. Source.Queue<MicFrame>(64, OverflowStrategy.DropHead) at the front of the input graph means the soundcard can never block the actor. If STT lags, old frames drop instead of stalling capture — the user hears no clicks, sees no jitter.

Atomic teardown of the output side via KillSwitch. When the user says "그만" (stop), one Tell — BargeIn — collapses the OUTPUT graph: token queue closed, kill switch shut, playback queue stopped, mic auto-unmuted. Mid-sentence interruption with no leftover audio in the pipeline.

Settings reload happens at the seam, not the leaves. VoiceRuntimeFactory.BuildStt(VoiceSettingsStore.Load()) is invoked at the moment a worker actor spawns, so changing a provider in the settings panel takes effect on the next worker without restarting the actor or the app.
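
The drop-old-frames behaviour from the first benefit can be illustrated without Akka.Streams. The sketch below is not the project's code: it swaps in System.Threading.Channels with BoundedChannelFullMode.DropOldest as a stand-in for Source.Queue with OverflowStrategy.DropHead, and the MicFrame record and DropHeadDemo class are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;

// Hypothetical frame type for illustration; the real MicFrame is not shown in this article.
public readonly record struct MicFrame(int Sequence);

public static class DropHeadDemo
{
    // Pushes `total` frames into a bounded buffer with no consumer running,
    // then returns the sequence numbers that survived.
    public static List<int> PushWithoutConsumer(int capacity, int total)
    {
        var channel = Channel.CreateBounded<MicFrame>(new BoundedChannelOptions(capacity)
        {
            // Stand-in for OverflowStrategy.DropHead: evict the oldest item when full.
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true,
            SingleWriter = true
        });

        for (var i = 0; i < total; i++)
        {
            // TryWrite never blocks, so a capture callback calling it cannot stall.
            channel.Writer.TryWrite(new MicFrame(i));
        }
        channel.Writer.Complete();

        var survivors = new List<int>();
        while (channel.Reader.TryRead(out var frame))
            survivors.Add(frame.Sequence);
        return survivors;
    }
}
```

Calling DropHeadDemo.PushWithoutConsumer(64, 100) keeps only the newest 64 frames: a lagging consumer loses old audio rather than stalling the producer.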

The five pillars

End-to-end voice-driving Claude CLI on a free stack only feels seamless because five small pieces line up. Each one was a real bug or a real gap before this commit landed.

Pillar 1 — Whisper hallucination filter

Whisper was trained on enormous amounts of YouTube audio. On near-silent input it confidently emits Korean creator outros: "감사합니다" ("thank you"), "시청해주셔서 감사합니다" ("thank you for watching"), "다음 영상에서 만나요" ("see you in the next video"). The English equivalent is "Thank you for watching." These aren't bugs in the model; they're statistical priors winning when there's nothing better to predict.

A simple normalising filter sits between STT and the dispatch layer:


public static bool IsLikelyHallucination(string? transcript)
{
    if (string.IsNullOrWhiteSpace(transcript)) return false;
    var trimmed = transcript.Trim();
    if (Patterns.Symbolic.Contains(trimmed)) return true;   // ♪, [Music]
    var n = Normalize(trimmed);                             // letters/digits, lowercase
    if (n.Length == 0) return false;
    return Patterns.Normalised.Contains(n);
}

It only matches the whole transcript: a sentence containing the phrase plus other words passes through. So "감사합니다 잠깐만요" ("thank you, just a moment") survives, while "감사합니다." alone is dropped. 23 unit tests pin the boundary.
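
To make the whole-transcript rule concrete, here is a self-contained sketch of the same shape. The pattern sets contain only the phrases quoted in this article, the Normalize body is a guess based on the "letters/digits, lowercase" comment, and the class name is hypothetical; the project's real lists and helper are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HallucinationFilterSketch
{
    // Illustrative entries only, taken from the examples in the text.
    private static readonly HashSet<string> Symbolic = new() { "♪", "[Music]" };
    private static readonly HashSet<string> Normalised = new()
    {
        "감사합니다", "시청해주셔서감사합니다", "다음영상에서만나요", "thankyouforwatching"
    };

    // Assumed normalisation: keep letters and digits only, lowercased,
    // so punctuation and spacing do not affect the lookup.
    public static string Normalize(string text) =>
        new string(text.Where(c => char.IsLetterOrDigit(c))
                       .Select(c => char.ToLowerInvariant(c))
                       .ToArray());

    public static bool IsLikelyHallucination(string? transcript)
    {
        if (string.IsNullOrWhiteSpace(transcript)) return false;
        var trimmed = transcript.Trim();
        if (Symbolic.Contains(trimmed)) return true;
        var n = Normalize(trimmed);
        // Exact match on the normalised whole transcript, never a substring match.
        return n.Length > 0 && Normalised.Contains(n);
    }
}
```

Because the lookup is an exact set membership test on the normalised string, any extra word changes the key and the transcript passes through.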

Pillar 2 — Voice command interceptor

Two phrases short-circuit the LLM dispatch entirely:


flowchart LR
    T[Transcript] --> C{Classify}
    C -->|"그만 / stop"| Stop[VoiceCommandIntent.StopSpeaking]
    C -->|contains 터미널 + 요약| Sum[VoiceCommandIntent.SummarizeTerminal]
    C -->|else| Pass[VoiceCommandIntent.PassThrough]
    Stop --> BI[Tell BargeIn → kill OUTPUT graph]
    Sum --> SS[Snapshot GetConsoleText → embed → SendThroughAiToolLoopAsync]
    Pass --> SCI[fill txtInput → SendCurrentInput]

StopSpeaking matches the whole utterance after stripping trailing punctuation, so saying "이거 그만 하면 좋겠어" ("I'd like to stop doing this") doesn't accidentally cancel TTS. 그만 is a normal word too.

SummarizeTerminal requires both "터미널" and "요약" (or "terminal" and "summar" — Whisper code-switches on Korean tech jargon sometimes even with lang=ko). Order doesn't matter. The classifier is pure logic with 26 unit tests.
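
The two rules above fit in a few lines of pure logic. This is a minimal sketch, not the project's classifier: the punctuation list, the case-insensitive English matching, and the choice to pair keywords per language are all assumptions, and the class name is hypothetical.

```csharp
using System;

public enum VoiceCommandIntent { PassThrough, StopSpeaking, SummarizeTerminal }

public static class VoiceCommandClassifierSketch
{
    // Assumed set of trailing characters to strip before the whole-utterance comparison.
    private static readonly char[] TrailingPunctuation = { '.', '!', '?', ',', '~', ' ' };

    public static VoiceCommandIntent Classify(string? transcript)
    {
        if (string.IsNullOrWhiteSpace(transcript)) return VoiceCommandIntent.PassThrough;

        // Stop: the entire utterance must be the stop word, not merely contain it.
        var bare = transcript.Trim().TrimEnd(TrailingPunctuation);
        if (bare.Equals("그만", StringComparison.Ordinal) ||
            bare.Equals("stop", StringComparison.OrdinalIgnoreCase))
            return VoiceCommandIntent.StopSpeaking;

        // Summarise: both keywords must appear, in any order.
        bool korean = transcript.Contains("터미널") && transcript.Contains("요약");
        bool english = transcript.Contains("terminal", StringComparison.OrdinalIgnoreCase)
                    && transcript.Contains("summar", StringComparison.OrdinalIgnoreCase);

        return korean || english
            ? VoiceCommandIntent.SummarizeTerminal
            : VoiceCommandIntent.PassThrough;
    }
}
```

Keeping the classifier free of I/O is what makes it cheap to pin with unit tests: every case is a string in and an enum out.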

Pillar 3 — Auto-mute envelope (NAudio contract + self-tracked flag)

Two coordinated fixes break the bot-transcribes-itself feedback loop.

3a. The NAudio queue had a contract violation. IAudioPlaybackQueue.PlaybackStarted was supposed to fire once per idle→busy transition. The implementation fired per chunk. A multi-sentence bot reply produced N starts and 1 stop, so any state machine on the WPF side would latch on the second start.


stateDiagram-v2
    direction LR
    [*] --> Idle
    Idle --> Busy: Enqueue + Started ✓
    Busy --> Busy: Enqueue (no event)
    Busy --> Busy: chunk drains, next clip starts<br/><i>Started fires again — BUG</i>
    Busy --> Idle: queue drains + Stopped ✓

After the fix:


stateDiagram-v2
    direction LR
    [*] --> Idle
    Idle --> Busy: Enqueue + Started ✓<br/>_started = true
    Busy --> Busy: Enqueue (no event)
    Busy --> Busy: chunk drains, next clip starts<br/><i>silent — _started already true</i>
    Busy --> Idle: queue drains + Stopped ✓<br/>_started = false
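
The fixed contract can be sketched as a small state holder. This is an illustration under stated assumptions, not the NAudioPlaybackQueue source: no audio is played, Enqueue and ChunkFinished stand in for the real device callbacks, and the class name is hypothetical. Only the _started flag and the two event names come from the diagrams.

```csharp
using System;
using System.Collections.Generic;

public sealed class PlaybackQueueSketch
{
    private readonly Queue<byte[]> _pending = new();
    private bool _started;   // true while the queue is in the Busy state

    public event Action? PlaybackStarted;
    public event Action? PlaybackStopped;

    public void Enqueue(byte[] chunk)
    {
        _pending.Enqueue(chunk);
        if (_started) return;      // Busy -> Busy: no event
        _started = true;           // Idle -> Busy: fire exactly once
        PlaybackStarted?.Invoke();
    }

    // Stand-in for the device reporting that one chunk finished playing.
    public void ChunkFinished()
    {
        if (_pending.Count > 0) _pending.Dequeue();
        if (_pending.Count > 0) return;   // next clip starts silently
        if (!_started) return;            // already idle, nothing to report
        _started = false;                 // Busy -> Idle
        PlaybackStopped?.Invoke();
    }
}
```

A three-sentence reply enqueued as three chunks now yields one PlaybackStarted and one PlaybackStopped, which is the pairing any listener on the WPF side relies on.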

3b. The mute handler now self-tracks ownership. Instead of snapshotting the user's prior mute state on every Started event (and hoping the upstream contract holds), the WPF callback owns one bit:


private bool _autoMutedByTts;

private void OnTtsPlaybackChanged(bool isPlaying)
{
    if (isPlaying)
    {
        if (!_voiceCapture.Muted)
        {
            SetVoiceMicMuted(true, source: "tts-auto");
            _autoMutedByTts = true;
        }
    }
    else
    {
        if (_autoMutedByTts)
        {
            SetVoiceMicMuted(false, source: "tts-auto");
            _autoMutedByTts = false;
        }
    }
}

If the user manually muted before the bot spoke, _autoMutedByTts stays false and the natural drain at end doesn't override their preference. If we muted, we unmute. One bit, no snapshot.

Pillar 4 — AIMODE tool loop

The reasoning side is a small, well-bounded loop. Five tools, JSON envelopes, a hard iteration cap.


flowchart TD
    Start([User instruction +<br/>optional terminal snapshot])
    Start --> Sys[Append system prompt + tool catalog<br/>on first send only]
    Sys --> Loop{For iter in 0..MaxIterations}
    Loop --> Gen[Stream LLM response]
    Gen --> Parse{Extract first JSON object<br/>+ ParseToolCall}
    Parse -->|done| End([Final summary returned<br/>via ReactorResult])
    Parse -->|tool name| Exec[ExecuteToolAsync]
    Exec -->|list_terminals| LT[Host returns groups + tabs]
    Exec -->|read_terminal| RT[Host calls GetConsoleText / ReadOutput]
    Exec -->|send_to_terminal| ST[Host writes via WriteAndEnter]
    Exec -->|send_key| SK[Host sends control sequence]
    Exec -->|wait| W[Task.Delay clamped 1..30 s]
    LT --> Result[Append --- TOOL RESULT --- to history]
    RT --> Result
    ST --> Result
    SK --> Result
    W --> Result
    Result --> Loop
    Loop -->|cap reached| Fail([Failure surfaced via FailureReason])

For a "summarise the terminal" turn, the loop is short: one read_terminal call, then done with the summary in args.message. That summary is what AddBotMessage puts on screen and what SpeakText reads aloud.
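
The loop in the flowchart reduces to a bounded for-loop around two calls. The sketch below is an assumption-heavy outline, not AgentReactorActor: askModel and executeTool are hypothetical delegates standing in for the LLM call plus parsing and for the host-side tool dispatch, the record shapes are invented for the example, and the default cap of 8 is a placeholder because the real MaxIterations value is not stated here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical shapes for illustration only.
public sealed record ToolCall(string Tool, string Argument);
public sealed record LoopResult(bool Success, string? FinalMessage, string? FailureReason, int ModelCalls);

public static class ToolLoopSketch
{
    // Clamp a requested wait to the 1..30 s window shown in the flowchart.
    public static int ClampWaitSeconds(int requested) => Math.Clamp(requested, 1, 30);

    public static async Task<LoopResult> RunAsync(
        string instruction,
        Func<IReadOnlyList<string>, Task<ToolCall>> askModel,
        Func<ToolCall, Task<string>> executeTool,
        int maxIterations = 8)   // assumed cap
    {
        var history = new List<string> { instruction };
        for (var iter = 0; iter < maxIterations; iter++)
        {
            var call = await askModel(history);

            // "done" ends the loop; its argument is the final summary.
            if (call.Tool == "done")
                return new LoopResult(true, call.Argument, null, iter + 1);

            // Any other tool runs on the host and its result is fed back to the model.
            var result = await executeTool(call);
            history.Add("--- TOOL RESULT ---\n" + result);
        }
        return new LoopResult(false, null, "iteration cap reached", maxIterations);
    }
}
```

The hard cap is the important design choice: a model that never emits done produces a reported failure instead of an unbounded conversation with the terminal.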

ParseToolCall had a long-dormant bug worth flagging: JsonElement.GetProperty("tool") throws KeyNotFoundException (not JsonException) when the model omits a field. The catch only caught JsonException, so the user occasionally saw the raw .NET error "The given key was not present in the dictionary." The fix uses TryGetProperty plus an explicit JsonValueKind guard. Three regression tests now lock the contract.
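
A minimal sketch of the TryGetProperty-plus-guard pattern follows. It is not the project's ParseToolCall: the method name and the null-on-failure convention are assumptions made for the example, and only the "tool" field name comes from the text.

```csharp
using System;
using System.Text.Json;

public static class ToolCallParserSketch
{
    // Returns the tool name, or null when the JSON is malformed
    // or the "tool" field is missing or not a string.
    public static string? TryParseToolName(string json)
    {
        try
        {
            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;

            // TryGetProperty is only valid on objects, so check the kind first.
            if (root.ValueKind != JsonValueKind.Object) return null;

            // Returns false for a missing field instead of throwing KeyNotFoundException.
            if (!root.TryGetProperty("tool", out var tool)) return null;

            // Guard against the model emitting a number, object, or null here.
            if (tool.ValueKind != JsonValueKind.String) return null;

            return tool.GetString();
        }
        catch (JsonException)
        {
            return null;   // the text was not valid JSON at all
        }
    }
}
```

With this shape, a missing or mistyped field becomes an ordinary "no tool call" result that the loop can report cleanly, rather than a raw exception message in the chat bubble.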

Pillar 5 — Static snapshot for the summary prompt

The "summarise" command had to be deterministic. ConPTY output is a stream — the ConsoleOutputLog StringBuilder is constantly appending. If we asked the LLM to "go look at it", two consecutive summarise requests could read overlapping chunks and produce confusingly redundant answers.


flowchart LR
    Voice[Voice intent: SummarizeTerminal] --> Snap["GetConsoleText()<br/>called ONCE at request time"]
    Snap --> Build["BuildTerminalSummaryPrompt<br/>fenced code block"]
    Build --> Bot["SendThroughAiToolLoopAsync<br/>aiInput = enriched<br/>displayText = user phrase"]
    Bot --> LLM[LLM sees a static window]
    LLM --> Sum[One-shot summary]

The user's natural phrase ("터미널 작업 요약해 줘") becomes the chat bubble; the LLM's prompt is the phrase plus a fenced block of the snapshot. Two requests in sequence simply produce two fresh snapshots. No shared streaming state, no risk of duplicate analysis.


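private static string BuildTerminalSummaryPrompt(string userPhrase, string terminalText)
{
    // The Korean instruction reads, in English: "Below is the current screen output of the
    // active terminal the user is looking at. Summarise it concisely in Korean. Point out
    // only the essentials: what was run and what results / errors occurred."
    // The snapshot goes last, inside a fenced block.
    return
        $"{userPhrase}\n\n" +
        "아래는 사용자가 보고 있는 활성 터미널의 현재 화면 출력입니다. " +
        "이 내용을 한국어로 간결하게 요약해 주세요. " +
        "어떤 작업이 실행되었고 어떤 결과 / 에러가 있었는지 핵심만 짚어주세요.\n\n" +
        "```\n" + terminalText + "\n```";
}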

What one round-trip looks like in the diagnostic log

Excerpt from a real session, slightly trimmed for legibility. The left-edge timestamps are offsets from the start of the utterance (t0):


[t0] [BOT-Voice-pipe] [t0] utterance-start
[+2.6s] [BOT-Voice-pipe] [t1] utterance-end | t1-t0=2600ms · pcm=115200 bytes (~3.60s)
[+2.6s] [BOT-Voice-pipe] [t2] pipeline-start | provider=WhisperLocal · lang=ko ·
        peak=-10.2dBFS · rms=-32.5dBFS · VAR=25.0%
[+5.9s] [BOT-Voice-pipe] [stage] STT transcribe | 3319ms · chars=15
[+5.9s] [BOT-Voice]  (batch) summarize-terminal | terminalChars=4821
[+5.9s] [AIMODE]    StartReactor sent (backend=External, model=google/gemma-4-e4b)
[+9.0s] [AIMODE]    result success=True turns=1 elapsed=3127ms
                    final="현재 활성 Claude 터미널의 최근 5개 커밋은 …"
[+9.5s] [BOT-Voice] Mic MUTED (tts-auto)
[+9.5s] [BOT-Voice] TTS started — autoMuted=True micMuted=True
[+15.4s] [BOT-Voice] Mic UNMUTED (tts-auto)
[+15.4s] [BOT-Voice] TTS stopped — autoMutedNow=False micMuted=False

Total perceived latency from finishing the sentence to hearing the first syllable of the reply is about 6.9 seconds in this trace (utterance end at +2.6 s, TTS start at +9.5 s). STT is the dominant cost (3.3 s on CPU; about half that on Vulkan). The LLM takes 3.1 s, capped by MaxTokens=128 in voice mode so the reply stays speakable. TTS synth + first audio is sub-200 ms because Heami is local.

The mute envelope is the tight window from the Mic MUTED entry at +9.5 s to the Mic UNMUTED entry at +15.4 s: about six seconds of speech, exactly bracketed by the auto-mute messages. There are no phantom mid-burst Started events after the queue contract fix, which is why this is one clean envelope and not the latching pattern that the old build would produce.
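
The pcm=115200 bytes (~3.60s) figure in the utterance-end entry follows directly from the capture format. The helper below is a small worked check, not project code: 16 kHz mono comes from the component map, while 16-bit (2-byte) samples are an assumption, chosen because it reproduces the logged duration. PcmMath is a hypothetical name.

```csharp
using System;

public static class PcmMath
{
    // Seconds of audio in a raw PCM buffer.
    // bytesPerSample = 2 (16-bit) is an assumption; the article states only 16 kHz mono.
    public static double DurationSeconds(
        long byteCount, int sampleRate = 16_000, int channels = 1, int bytesPerSample = 2)
    {
        double bytesPerSecond = (double)sampleRate * channels * bytesPerSample;
        return byteCount / bytesPerSecond;
    }
}
```

Under that assumption, PcmMath.DurationSeconds(115200) gives 3.6 seconds, and a single 50 ms capture frame works out to 1600 bytes.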

Codex compatibility (and beyond)

Nothing in the bot's tool catalog mentions Claude. The five tools (list_terminals, read_terminal, send_to_terminal, send_key, wait) and the terminating done call operate on (group, tab) coordinates. The terminal actor underneath is a ConPtyTerminalSession that wraps the Microsoft Terminal Control's PTY surface. So:

A tab running claude becomes a Claude peer.

A tab running codex becomes a Codex peer.

A tab running gemini becomes a Gemini peer.

A tab running pwsh becomes a shell.

A tab running python becomes a REPL.

A tab running psql becomes a database session.

The voice → bot → terminal direction works for any of them. The reverse direction (peer → bot, e.g., Claude reporting back via WM_COPYDATA) is wired today only for tabs that actively talk back through the agent IPC channel; passive CLIs surface their output via the bot's next read_terminal call, which is enough for the summarise-style turns this article is about.

The same voice infrastructure also drives:

"Tell Claude to refactor file X" — bot writes the instruction into the Claude tab via send_to_terminal, hits Enter, then summarises the response when Claude finishes.

"List the terminals" — bot calls list_terminals, reads aloud what's in each tab.

"Cancel that" — 그만 tears down the OUTPUT graph if the bot is mid-sentence; you get audio silence within the next chunk boundary.

What's still on the table

A few honest gaps:

Latency is honest, not great. ~3.3 s STT on CPU is the dominant cost. Vulkan halves it. large-v3 improves accuracy but adds another second. There are no streaming partials yet: the segmenter waits for ~2 s of silence before declaring an utterance, which adds another second to the perceived delay. Streaming partials are a stretch goal for the next iteration.

MaxTokens=128 on voice replies is a guardrail, not a strategy. It keeps replies speakable but truncates legitimate longer answers. A better answer is sentence-chunked TTS that starts speaking before the LLM finishes, but that requires token-level streaming through the OUTPUT graph; today the OUTPUT graph receives the whole final message at once.

Mute envelope still has a race. When two SpeakText arrivals straddle a sub-200 ms gap (rare in practice), the playback queue's natural drain and the next start can overlap such that the mic auto-unmutes for one or two frames. A small "hold mute for N ms after stop" timer would close it; deferred until we see whether real users hit it.

Settings hot-reload is partial. Factory closures load settings at call time, so new TTS workers see the latest provider — but the existing pool keeps the original. Switching providers mid-session technically requires a mic toggle. An explicit "settings changed → reset pool" Tell would close it.
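
The "hold mute for N ms after stop" idea from the mute-envelope gap above could look roughly like the following. This is a hypothetical sketch of a deferred fix, not something the build contains: the class name, the polling design, and the use of caller-supplied timestamps (which keeps the logic testable without real timers) are all assumptions.

```csharp
using System;

// Decides when an auto-muted mic may be released after playback stops.
public sealed class MuteHoldSketch
{
    private readonly TimeSpan _hold;
    private TimeSpan? _stoppedAt;   // null while playback is active or nothing is pending

    public MuteHoldSketch(TimeSpan hold) => _hold = hold;

    // Playback started or restarted: cancel any pending release.
    public void OnPlaybackStarted() => _stoppedAt = null;

    // Playback stopped at `now`: open the hold window instead of unmuting immediately.
    public void OnPlaybackStopped(TimeSpan now) => _stoppedAt = now;

    // Poll from a timer tick. Returns true once, when the hold window has fully elapsed.
    public bool ShouldUnmute(TimeSpan now)
    {
        if (_stoppedAt is null) return false;
        if (now - _stoppedAt.Value < _hold) return false;
        _stoppedAt = null;
        return true;
    }
}
```

If a second SpeakText lands inside the hold window, OnPlaybackStarted cancels the pending release, so the mic stays muted across the gap instead of opening for a frame or two.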

Closing

There are two kinds of "voice agent" demo. The first kind sends your audio to a cloud STT, reasons in a hosted LLM, synthesises in a cloud TTS, and sounds great. The second kind runs on your laptop with no keys and proves the loop closes.

This is the second kind. The voice quality is "Heami in 2026" (good enough for development; not a production VO artist). Latency is honest (~6 s end-to-end on CPU, ~4 s on Vulkan). The reasoning is whatever LLM you've already configured. And the worker on the other end of the bot's keyboard can be Claude, Codex, Gemini, or your shell — the bot doesn't care, because terminals are the lingua franca.

What the build had to get right to make this feel natural: a hallucination filter so noise doesn't reach Claude as instructions, an intent classifier so 그만 and 터미널 요약 short-circuit cleanly, an auto-mute envelope so the bot doesn't transcribe itself, a tool loop that stays inside its envelope, and a static snapshot that keeps consecutive summaries deterministic. Each of those was small. The seams between them are where the work was.

If you want to drive your own CLI agents by voice on a free stack, the four pieces are: Whisper.net, Windows SAPI, an Akka.NET-shaped IPC seam, and ConPTY for the terminals. The audio never leaves the box. The bot doesn't know which CLI you're typing into. And it works.

Built with Claude Code Opus 4.7 (1M context)

Screenshot of it working

TIP : a basic command-line text editor on Windows (like MS-DOS Editor) - TestCLI for VoiceInput

https://psmon.github.io/AgentZeroLite/
