TL;DR: Two surfaces of AgentZero Lite expanded at once. The LLM layer went from a single local LLamaSharp host (Gemma 4, Nemotron Nano 8B) to five providers: that local host plus Webnori, OpenAI, LM Studio, and Ollama, with the four external ones sharing one OpenAI-compatible client. The modality layer grew a voice: an STT/TTS subsystem with four STT engines, two TTS engines, dual-VAD mic capture, and a live pipeline that reuses the same LLM gateway. Neither expansion needed a new architecture; both fell out of the same design principle that already defined the agent: one abstract interface, many implementations.
This is a follow-up to the foundation report — An LLM Is Not an Agent: Five Layers. That doc argued the agent is what surrounds the model. This doc shows what happens when you keep that frame and ask: what else can we slot behind the same interfaces?
The two surfaces that just expanded
```mermaid
flowchart LR
    subgraph Before
        A1[Local Gemma 4<br/>+ Nemotron]:::dim
        T1[Text only]:::dim
    end
    subgraph After
        A2[Local Gemma 4<br/>Local Nemotron<br/>Webnori<br/>OpenAI<br/>LM Studio<br/>Ollama]
        T2[Text + Voice<br/>STT × 4 / TTS × 2]
    end
    A1 -.expanded.-> A2
    T1 -.expanded.-> T2
    classDef dim fill:#f5f5f5,stroke:#999,color:#666
```
Same shell. Same actor topology. Same harness. The expansions land inside the boxes the field report drew.
Surface 1 — From one local LLM to five providers
The original story was self-built `llama.cpp` + `LLamaSharp` 0.26 running Gemma-4 / Nemotron on-device. That solved one question: can the loop work entirely offline? Yes. But it left the user stuck if they wanted to try a hosted model, or already had Ollama / LM Studio running on the same machine.

The expansion adds four external providers and unifies them behind one HTTP client:
| Provider | Mode | Endpoint | Reason to include |
|---|---|---|---|
| Webnori | Hosted (free, public) | OpenAI-compatible | Zero-friction first try |
| OpenAI | Hosted (paid) | api.openai.com/v1 | Frontier comparison |
| LM Studio | Local | localhost:1234/v1 | Already-running local server |
| Ollama | Local | localhost:11434/v1 | Most common local server |
All four speak the OpenAI Chat Completions wire format, so a single `OpenAiCompatibleProvider` covers them. Only the base URL and credentials differ.

```csharp
public interface ILlmProvider
{
    string ProviderName { get; }
    Task<List<LlmModelInfo>> ListModelsAsync(CancellationToken ct = default);
    Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct = default);
    IAsyncEnumerable<LlmStreamChunk> StreamAsync(LlmRequest request, CancellationToken ct = default);
}
```
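To make "only the base URL and credentials differ" concrete, here is a minimal sketch of how one request builder could serve all four hosts. `ProviderEndpoint` and `ChatRequestBuilder` are hypothetical names introduced for illustration, not the project's actual types, and the sketch assumes each base URL already ends in the `/v1` segment shown in the table.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical settings record; the real OpenAiCompatibleProvider may store this differently.
public sealed record ProviderEndpoint(string Name, string BaseUrl, string? ApiKey);

public static class ChatRequestBuilder
{
    // Builds a Chat Completions request. Only BaseUrl and ApiKey vary per host.
    public static HttpRequestMessage Build(ProviderEndpoint ep, string model, string userText)
    {
        var body = JsonSerializer.Serialize(new
        {
            model,
            messages = new[] { new { role = "user", content = userText } }
        });

        var url = ep.BaseUrl.TrimEnd('/') + "/chat/completions";
        var req = new HttpRequestMessage(HttpMethod.Post, url)
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };

        // Assumption: local servers (LM Studio, Ollama) run without a key,
        // so the Authorization header is attached only when one is configured.
        if (!string.IsNullOrEmpty(ep.ApiKey))
            req.Headers.Authorization = new AuthenticationHeaderValue("Bearer", ep.ApiKey);

        return req;
    }
}
```

Under these assumptions, switching from Ollama to OpenAI is a change of two strings in `ProviderEndpoint`; the request body and the response parsing stay the same.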
The selection lives where the user expects it: a Local / External radio on the LLM settings tab. Picking External reveals provider + model + API-key fields; the chosen provider is persisted in `LlmSettingsStore`. A factory (`LlmProviderFactory`) returns the right `ILlmProvider` at runtime.

```mermaid
flowchart TD
    S[LLM tab<br/>Local · External] --> F[LlmProviderFactory]
    F -->|Local| LH[LLamaSharp host<br/>Gemma 4 / Nemotron / …]
    F -->|External| OAC[OpenAiCompatibleProvider]
    OAC -->|baseUrl + apiKey| W[Webnori]
    OAC -->|baseUrl + apiKey| O[OpenAI]
    OAC -->|baseUrl| LMS[LM Studio]
    OAC -->|baseUrl| OLL[Ollama]
```
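The factory step can be sketched as a single switch on the persisted mode. Everything below is illustrative: `LlmSettingsSketch`, `IProviderSketch`, and the two stand-in provider types are hypothetical, and the base-URL table is passed in by the caller rather than hard-coded because the article does not give every host's URL.

```csharp
using System;
using System.Collections.Generic;

public enum LlmMode { Local, External }

// Hypothetical shape of what LlmSettingsStore persists; the real store may differ.
public sealed record LlmSettingsSketch(LlmMode Mode, string? Provider = null, string? ApiKey = null);

// Stand-in for ILlmProvider so the sketch compiles on its own.
public interface IProviderSketch { string ProviderName { get; } }

public sealed class LocalHostSketch : IProviderSketch
{
    public string ProviderName => "Local";
}

public sealed record HttpProviderSketch(string ProviderName, string BaseUrl, string? ApiKey) : IProviderSketch;

public static class LlmProviderFactorySketch
{
    // Local mode ignores provider fields; External mode looks up the base URL by name.
    public static IProviderSketch Create(LlmSettingsSketch s, IReadOnlyDictionary<string, string> baseUrls)
    {
        if (s.Mode == LlmMode.Local)
            return new LocalHostSketch();

        if (s.Provider is null || !baseUrls.TryGetValue(s.Provider, out var url))
        {
            var name = s.Provider ?? "(none)";
            throw new ArgumentException("Unknown external provider: " + name);
        }

        return new HttpProviderSketch(s.Provider, url, s.ApiKey);
    }
}
```

In this shape, registering one more OpenAI-compatible host means adding one entry to the `baseUrls` dictionary; the factory itself does not change.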
Where this connects back to AIMODE
The foundation report described AIMODE's dual backend (Gemma 4 via GBNF, Nemotron via the Llama-3.1 chat template), both behind `IAgentToolHost`. The provider expansion follows the same principle, one layer up:

| Layer | Interface | Implementations |
|---|---|---|
| Tool host | IAgentToolHost | WorkspaceTerminalToolHost |
| Tool-call backend (AIMODE) | AgentToolLoop / ExternalAgentToolLoop | Gemma-GBNF, Nemotron-template |
| LLM provider (new) | ILlmProvider | OpenAiCompatibleProvider × 4 hosts, local LLamaSharp host |
Same shape, different layer. Adding a fifth OpenAI-compatible host tomorrow is a config entry, not a refactor.
Surface 2 — From text to voice
Adding voice could have meant adding a new agent. It didn't. The Voice tab routes through the same `LlmGateway.OpenSession()` the rest of the app uses, so whatever the user picked on the LLM tab is what answers the voice. STT and TTS are I/O adapters, not a parallel agent.

This was a deliberate split from the AgentWin origin: there is no separate "voice LLM provider", only a new way to enter and exit the existing one.
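The "adapters, not a parallel agent" idea can be sketched as one function for a single voice turn. The sketch takes delegates instead of the real `ISpeechToText`, `LlmGateway`, and `ITextToSpeech` types, because the gateway and TTS signatures are not shown in this article; `VoiceTurnSketch` and its parameter names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class VoiceTurnSketch
{
    // One voice turn: audio in -> text -> the same LLM call text chat would use -> audio out.
    // Returns null when nothing was heard or when TTS is off.
    public static async Task<byte[]?> RunTurnAsync(
        byte[] pcm16k,
        Func<byte[], CancellationToken, Task<string>> transcribe,        // stands in for ISpeechToText
        Func<string, CancellationToken, Task<string>> askLlm,            // stands in for the gateway session
        Func<string, string> cleanForSpeech,                             // stands in for TtsTextCleaner
        Func<string, CancellationToken, Task<byte[]>>? synthesize,       // stands in for ITextToSpeech
        CancellationToken ct = default)
    {
        var text = await transcribe(pcm16k, ct);
        if (string.IsNullOrWhiteSpace(text))
            return null;                       // silence or noise: skip the LLM call entirely

        var reply = await askLlm(text, ct);    // voice adds no LLM logic of its own

        if (synthesize is null)
            return null;                       // TTS "Off": voice in only

        return await synthesize(cleanForSpeech(reply), ct);
    }
}
```

Note that the LLM step has no voice-specific parameters: the function cannot tell whether the text arrived from a keyboard or a microphone, which is the property the section describes.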
Pipeline
```mermaid
flowchart LR
    M[Mic<br/>NAudio<br/>16 kHz mono] --> V[VAD<br/>frame + utterance]
    V -->|UtteranceEnded| STT[ISpeechToText]
    STT -->|text| GW[LlmGateway.OpenSession]
    GW -->|reply| CL[TtsTextCleaner<br/>strip markup]
    CL --> TTS[ITextToSpeech]
    TTS -->|audio bytes| PB[VoicePlaybackService<br/>NAudio]
    PB -->|while playing| MUTE[Muted=true<br/>mic suppressed]
    MUTE -.-> M
```
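The "strip markup" box in the pipeline exists because LLM replies often contain markdown that a speech engine would read aloud literally. A minimal sketch of such a cleaner follows; the specific rules are assumptions chosen for illustration, and the project's `TtsTextCleaner` may apply different ones.

```csharp
using System.Text.RegularExpressions;

public static class TtsTextCleanerSketch
{
    // Removes common markdown so the TTS engine does not pronounce symbols.
    public static string Clean(string reply)
    {
        // Drop fenced code blocks entirely; reading code aloud is rarely useful.
        var s = Regex.Replace(reply, @"`{3}.*?`{3}", " ", RegexOptions.Singleline);
        // Keep link text, drop the URL: [text](url) -> text
        s = Regex.Replace(s, @"\[([^\]]+)\]\([^)]+\)", "$1");
        // Strip heading markers at the start of a line.
        s = Regex.Replace(s, @"^\s{0,3}#{1,6}\s+", "", RegexOptions.Multiline);
        // Strip emphasis and inline-code markers (this also removes underscores inside words).
        s = Regex.Replace(s, @"[*_`~]+", "");
        // Collapse whitespace, including newlines, into single spaces.
        s = Regex.Replace(s, @"\s+", " ");
        return s.Trim();
    }
}
```

For example, `Clean("## Result\n**Done**: see [docs](https://example.com) and `run`.")` yields `Result Done: see docs and run.` under these rules.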
STT — four engines, one interface
```csharp
public interface ISpeechToText
{
    string ProviderName { get; }
    Task<string> TranscribeAsync(byte[] pcm16k, string? language, CancellationToken ct);
}
```
| Engine | Mode | Model / endpoint | Notes |
|---|---|---|---|
| Whisper.net | Local | `ggml-{tiny\|small\|…` | |
| OpenAI Whisper | Hosted | /v1/audio/transcriptions (whisper-1) | RIFF-WAV multipart POST |
| Webnori-Gemma audio | Hosted | OpenAI-compatible Chat Completions w/ input_audio block | Reuses LLM tab credentials |
| LocalGemmaStt | Local | (placeholder) | Will load an audio-capable Gemma GGUF via LLamaSharp multimodal once the catalog ships one |
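The "RIFF-WAV multipart POST" note implies a small conversion step: the capture service emits raw PCM, while a hosted transcription endpoint expects a file with a WAV header. A minimal sketch of that wrapping, assuming 16-bit little-endian samples (`WavWriterSketch` is a hypothetical name, not a class from the project):

```csharp
using System;
using System.IO;
using System.Text;

public static class WavWriterSketch
{
    // Prepends the canonical 44-byte PCM WAV header to raw 16-bit samples.
    public static byte[] WrapPcm16(byte[] pcm, int sampleRate = 16000, short channels = 1)
    {
        const short bitsPerSample = 16;
        short blockAlign = (short)(channels * bitsPerSample / 8);  // bytes per sample frame
        int byteRate = sampleRate * blockAlign;                     // bytes per second

        using var ms = new MemoryStream(44 + pcm.Length);
        using var w = new BinaryWriter(ms, Encoding.ASCII);         // BinaryWriter is little-endian

        w.Write(Encoding.ASCII.GetBytes("RIFF"));
        w.Write(36 + pcm.Length);                                   // RIFF chunk size = file size - 8
        w.Write(Encoding.ASCII.GetBytes("WAVE"));

        w.Write(Encoding.ASCII.GetBytes("fmt "));
        w.Write(16);                                                // fmt chunk size for plain PCM
        w.Write((short)1);                                          // audio format 1 = PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(byteRate);
        w.Write(blockAlign);
        w.Write(bitsPerSample);

        w.Write(Encoding.ASCII.GetBytes("data"));
        w.Write(pcm.Length);                                        // data chunk size
        w.Write(pcm);

        w.Flush();
        return ms.ToArray();
    }
}
```

The resulting byte array can be attached as the file part of a multipart form. The local Whisper path may not need this step if the engine accepts raw samples directly.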
Model files for Whisper.net are cached at `%USERPROFILE%\.ollama\models\agentzero\whisper\` so they live alongside the user's existing Ollama models; no new convention to learn.

TTS — three options
| Engine | Mode | Notes |
|---|---|---|
| Off | n/a | Default; voice in only |
| Windows SAPI | Local | Uses installed Win11 language packs (e.g., the Korean neural voice ships free at OS level) |
| OpenAI tts-1 | Hosted | Voices: alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse |
The "Win11 ships TTS free at the OS level — not ElevenLabs-grade but usable" observation already came up in the @Agredo10 reply chain; this is that observation built into a tab.
Mic capture: dual-VAD + pre-roll + mute-while-playing
A voice loop fails in three predictable ways. The capture service was written assuming all three would happen.
```csharp
public sealed class VoiceCaptureService : IDisposable
{
    public event EventHandler<float>? AmplitudeChanged;      // ~50 ms ticks → UI level meter
    public event EventHandler<bool>? SpeakingStateChanged;   // VAD frame state
    public event EventHandler? UtteranceStarted;
    public event EventHandler<byte[]>? UtteranceEnded;       // PCM 16k mono → STT
    public bool Muted { get; set; }                          // suppress while TTS plays
}
```
- Frame-level VAD (~50 ms) — for the live amplitude meter the user sees in the Voice Test panel.
- Utterance-level VAD (~2 s of trailing silence) — the actual segment boundary that triggers STT. Without this, every breath becomes a transcription request.
- Pre-roll buffer (~1 s) — captured before VAD trips, so the first consonant of a sentence isn't clipped off.
- `Muted` flag: set to true while `VoicePlaybackService` is playing TTS audio. Without this the speaker output gets re-captured by the mic and the loop talks to itself.
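The utterance-level VAD, pre-roll buffer, and mute flag listed above can be sketched as one small state machine over fixed-size frames. This is an illustration rather than the project's `VoiceCaptureService`: `UtteranceSegmenterSketch` is a hypothetical class, the per-frame speech decision is passed in by the caller, and the defaults assume 50 ms frames (20 frames ≈ 1 s pre-roll, 40 frames ≈ 2 s trailing silence).

```csharp
using System;
using System.Collections.Generic;

public sealed class UtteranceSegmenterSketch
{
    private readonly int _preRollFrames;
    private readonly int _hangoverFrames;
    private readonly Queue<byte[]> _preRoll = new();
    private readonly List<byte[]> _utterance = new();
    private int _silentRun;
    private bool _inUtterance;

    public bool Muted { get; set; }                 // set while TTS audio is playing
    public event Action<byte[]>? UtteranceEnded;    // concatenated PCM, ready for STT

    public UtteranceSegmenterSketch(int preRollFrames = 20, int hangoverFrames = 40)
    {
        _preRollFrames = preRollFrames;
        _hangoverFrames = hangoverFrames;
    }

    public void PushFrame(byte[] frame, bool isSpeech)
    {
        if (Muted) { Reset(); return; }             // drop everything heard during playback

        if (!_inUtterance)
        {
            if (isSpeech)
            {
                _inUtterance = true;
                _utterance.AddRange(_preRoll);      // audio from before the VAD tripped
                _preRoll.Clear();
                _utterance.Add(frame);
                _silentRun = 0;
            }
            else
            {
                _preRoll.Enqueue(frame);            // rolling window of recent silence
                if (_preRoll.Count > _preRollFrames) _preRoll.Dequeue();
            }
            return;
        }

        _utterance.Add(frame);
        _silentRun = isSpeech ? 0 : _silentRun + 1;
        if (_silentRun >= _hangoverFrames)          // enough trailing silence: close the segment
        {
            var pcm = Concat(_utterance);
            Reset();
            UtteranceEnded?.Invoke(pcm);
        }
    }

    private void Reset()
    {
        _preRoll.Clear();
        _utterance.Clear();
        _silentRun = 0;
        _inUtterance = false;
    }

    private static byte[] Concat(List<byte[]> frames)
    {
        int total = 0;
        foreach (var f in frames) total += f.Length;
        var result = new byte[total];
        int offset = 0;
        foreach (var f in frames)
        {
            Buffer.BlockCopy(f, 0, result, offset, f.Length);
            offset += f.Length;
        }
        return result;
    }
}
```

The emitted segment here includes the trailing silence; a real implementation might trim it before sending audio to STT.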
These are not ML problems. They're plumbing. Getting them wrong makes the voice feature feel "broken in a way nobody can articulate."
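The playback side has the same kind of plumbing: audio returned by a TTS engine may arrive as WAV, MP3, or headerless PCM, and the player has to tell them apart before decoding. A minimal detection sketch based on leading magic bytes follows. It is a heuristic under stated assumptions, not the logic in `VoicePlaybackService.cs`, and `AudioSniffSketch` is a hypothetical name.

```csharp
using System;

public enum AudioFormat { Wav, Mp3, RawPcm }

public static class AudioSniffSketch
{
    public static AudioFormat Detect(ReadOnlySpan<byte> d)
    {
        // WAV: "RIFF" <4-byte size> "WAVE"
        if (d.Length >= 12 &&
            d[0] == (byte)'R' && d[1] == (byte)'I' && d[2] == (byte)'F' && d[3] == (byte)'F' &&
            d[8] == (byte)'W' && d[9] == (byte)'A' && d[10] == (byte)'V' && d[11] == (byte)'E')
            return AudioFormat.Wav;

        // MP3 with an ID3v2 tag in front of the audio frames.
        if (d.Length >= 3 && d[0] == (byte)'I' && d[1] == (byte)'D' && d[2] == (byte)'3')
            return AudioFormat.Mp3;

        // Bare MPEG audio frame: 11 sync bits set, plus non-reserved header fields.
        // The extra checks matter because quiet PCM often starts with FF FF (sample value -1).
        if (d.Length >= 3 && d[0] == 0xFF && (d[1] & 0xE0) == 0xE0)
        {
            int version = (d[1] >> 3) & 0x3;     // 1 is reserved
            int layer = (d[1] >> 1) & 0x3;       // 0 is reserved
            int bitrate = (d[2] >> 4) & 0xF;     // 15 is invalid
            int sampleRate = (d[2] >> 2) & 0x3;  // 3 is reserved
            if (version != 1 && layer != 0 && bitrate != 15 && sampleRate != 3)
                return AudioFormat.Mp3;
        }

        // Nothing recognisable: treat as headerless PCM (the caller must know the sample rate).
        return AudioFormat.RawPcm;
    }
}
```

Magic-byte sniffing can still misclassify unusual PCM data, so a production player would likely also consult the response's content type when one is available.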
Source: `VoicePlaybackService.cs` (auto-detects WAV / MP3 / PCM16@24kHz; patches the OpenAI gpt-4o-audio-preview RIFF header)

What this expansion didn't change
This is as important as what it added.
- No new actor was introduced. Voice still talks to the same `AgentReactorActor` topology described in the foundation report. STT/TTS are services, not actors.
- No new gateway. Voice reuses `LlmGateway.OpenSession()`. Whatever the LLM tab decides answers the voice: local Gemma, OpenAI, anything.
- No chunked streaming yet. All STT/TTS calls take a complete utterance. Streaming is a known follow-up; the absence is intentional, not an oversight, because chunked Whisper + chunked TTS on five providers is its own multi-week problem.
- No CUDA in the installer. Whisper.net's CUDA runtime is real and works, but bundling it doubled installer size for a feature most users will run on small models. CPU-only first; GPU later if the data says it matters.
The shared design principle, restated
| Expansion | Interface that absorbed it | Implementations behind it |
|---|---|---|
| Multi-provider LLM | ILlmProvider | LLamaSharp local, OpenAiCompatibleProvider × 4 |
| Voice in | ISpeechToText | Whisper.net, OpenAI Whisper, Webnori-Gemma, (LocalGemma placeholder) |
| Voice out | ITextToSpeech | Off, Windows SAPI, OpenAI tts-1 |
| (Established earlier) Tool host | IAgentToolHost | WorkspaceTerminalToolHost |
| (Established earlier) AIMODE backend | AgentToolLoop / ExternalAgentToolLoop | Gemma-GBNF, Nemotron-template |
The agent doesn't grow by adding agents. It grows by adding implementations behind already-named interfaces. That is what made today's expansion a 2,000-line pull rather than a 20,000-line one.
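What "adding an implementation behind an already-named interface" looks like in code can be sketched with the smallest TTS engine in the table, "Off". The article names `ITextToSpeech` but does not show its members, so the interface shape, `OffTtsSketch`, and `TtsRegistrySketch` below are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape, modelled on ISpeechToText; the real ITextToSpeech may differ.
public interface ITextToSpeechSketch
{
    string ProviderName { get; }
    Task<byte[]> SynthesizeAsync(string text, CancellationToken ct = default);
}

// "Off" as a null object: callers never branch on whether TTS is enabled.
public sealed class OffTtsSketch : ITextToSpeechSketch
{
    public string ProviderName => "Off";

    public Task<byte[]> SynthesizeAsync(string text, CancellationToken ct = default)
        => Task.FromResult(Array.Empty<byte>());
}

public sealed class TtsRegistrySketch
{
    private readonly Dictionary<string, Func<ITextToSpeechSketch>> _factories =
        new(StringComparer.OrdinalIgnoreCase);

    // Adding an engine is one class plus one Register call.
    public TtsRegistrySketch Register(string name, Func<ITextToSpeechSketch> factory)
    {
        _factories[name] = factory;
        return this;
    }

    // Unknown names fall back to "Off" rather than failing the voice loop.
    public ITextToSpeechSketch Resolve(string name) =>
        _factories.TryGetValue(name, out var f) ? f() : new OffTtsSketch();
}
```

A new engine would register the same way, for example `registry.Register("Kokoro", () => new KokoroTts())`, where `KokoroTts` is a class that does not exist yet.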
Why the free-OS-voice angle is load-bearing
Win11 ships a Korean neural TTS voice at the OS level, for free, as part of the language pack — same for English, Japanese, and most major languages. This isn't loud in marketing decks, but for a desktop agent that wants to say something out loud, it removes a $0–22/month line item that competitors silently charge for.
Combine that with what's already free on each side:
| Slot | Free default | Cost in dollars per month |
|---|---|---|
| STT | | $0 |
| LLM (local) | Local Gemma 4 / Local Nemotron | $0 (your CPU/GPU) |
| LLM (hosted) | Webnori free tier | $0 |
| TTS (local) | Windows SAPI (Win11 neural voice) | $0 |
| TTS (hosted) | OpenAI tts-1 | paid |
The default loop end-to-end — speak Korean → transcribe → local model thinks → answer in Korean voice — costs $0 in API fees. Paid providers slot in only when the user opts up. That is not a downstream consequence; it is the design.
Most "voice AI" desktop products in this category bury this fact because their billing depends on you not noticing. AgentZero Lite leans into it: the Windows OS is a free voice asset for anyone targeting Windows, and we use it.
Situational use cases this stack unlocks
The architecture didn't grow voice for novelty. Most of the high-value desktop-agent scenarios are voice-first.
| Scenario | Why voice changes it | What's already in the box |
|---|---|---|
| Meeting note-taker | Bot-free meeting capture is what Otter / Granola / Fireflies charge $20–30/mo for. Meetily already proved a self-hosted Whisper + LLM-summary stack is competitive. | STT × 4, LLM × 6 incl. local; recorder + summarizer is a feature week, not a rebuild |
| Voice journaling | Speaking is faster than typing; LLM cleans up filler and commits a structured day note | Local Gemma + Win11 SAPI = $0 loop |
| Hands-busy healthcare tracking | Cooking / exercise / caretaking: typing isn't viable. Mirrors the @graylanj edge-LLM pattern (Gemma + dental scanner + pill bottle scanner) but on Windows desktop | Local-only path covers privacy; speech-in / speech-out covers ergonomics |
| Hands-free CLI | Talk to the Claude/Codex tabs while watching the screen | Already wired: Voice tab + AIMODE handshake meet at LlmGateway |
| Language practice | Voice cloning + pronunciation feedback | XTTS-v2 slots in as a new ITextToSpeech |
The meeting-note one is worth pausing on. People pay real money today for transcribe + summarize + share over their own meetings. The components AgentZero Lite already has — bot-free local recording (NAudio), STT × 4, an LLM gateway with multiple providers, and a discipline harness for prompt engineering — are exactly what Meetily bundles. The stack hasn't shipped that feature, but the lego pieces are all already in the box.
What's coming from the open-source side
The voice-model space reshaped fast in early 2026. Most of these slot into our existing `ISpeechToText` / `ITextToSpeech` interfaces with one new class each.

STT (open / efficient, 2026 picks)
| Model | License | Why it's interesting |
|---|---|---|
| Whisper Large V3 Turbo | MIT (OpenAI) | 216× real-time, multilingual; current accuracy/speed balance pick |
| Distil-Whisper | MIT (Hugging Face) | 6× faster than Whisper Large V3, WER within ~1%, ideal for streaming |
| NVIDIA Parakeet TDT 1.1B | NVIDIA OS | RTFx > 2,000 (fastest available), English-only |
| Faster-Whisper | MIT | CTranslate2 reimpl, ~4× faster, drop-in for Whisper.cpp |
| WhisperX | BSD-2 | Word-level timestamps + speaker diarization → unlocks "who said what" in meetings |
| Moonshine / Canary-Qwen / Qwen3-ASR | OS | New generation already matching commercial APIs on standard benchmarks |
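The speed figures in the table are easier to compare once converted into wall-clock time. The helper below assumes that "N× real-time" and "RTFx" both mean audio duration divided by processing time; the table does not define the terms, and vendors measure them on different hardware, so treat the numbers as rough. `RealTimeFactorSketch` is a hypothetical name.

```csharp
using System;

public static class RealTimeFactorSketch
{
    // Seconds of compute needed to transcribe `audioSeconds` of audio at a given speed-up.
    public static double ProcessingSeconds(double audioSeconds, double rtfx)
    {
        if (rtfx <= 0)
            throw new ArgumentOutOfRangeException(nameof(rtfx), "Speed-up factor must be positive.");
        return audioSeconds / rtfx;
    }
}
```

Under that reading, a one-hour meeting (3,600 s) takes about 16.7 s at 216× and 1.8 s at 2,000×. For a desktop note-taker, both are short compared with the meeting itself.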
TTS (open / efficient, 2026 picks)
| Model | License | Why it's interesting |
|---|---|---|
| Kokoro | Apache-2.0 | 82M params, ElevenLabs-comparable quality, free for commercial use |
| Coqui XTTS-v2 | Coqui non-commercial | 6-second voice cloning, 17 languages; note license restriction |
| Coqui TTS toolkit | MPL-2.0 | 1,100+ languages of pre-trained models |
Microsoft's own move (April 2026)
On 2026-04-02 Microsoft surfaced MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Foundry. The voice-relevant pieces:
- MAI-Voice-1 generates 60 seconds of audio in 1 second (60× real-time).
- Microsoft's embedded-speech path lets newer Win11 boxes (24H2 + DX12 + 8 GB RAM) run TTS locally — exactly the same OS-integration story SAPI already gives us, but neural-HD-grade.
- 2026-03 Foundry release notes mention Neural HD TTS with MAI-voice-1 integration.
For our architecture this is a known-future `ITextToSpeech` impl: the day MAI-Voice-1 ships a public Win11 local API, it slots in next to `WindowsTts` and the rest of the user's loop doesn't have to learn anything new.

What plugs in where
```mermaid
flowchart LR
    subgraph existing[Existing slots]
        S[ISpeechToText]
        T[ITextToSpeech]
    end
    subgraph stt_open[Open STT models]
        Pk[Parakeet TDT 1.1B]
        DW[Distil-Whisper]
        WT[Whisper Turbo V3]
        FW[Faster-Whisper]
        WX[WhisperX<br/>+ diarization]
    end
    subgraph tts_open[Open / future TTS]
        Ko[Kokoro 82M]
        XT[XTTS-v2]
        Mai[MAI-Voice-1<br/>via Foundry Local]
    end
    Pk --> S
    DW --> S
    WT --> S
    FW --> S
    WX --> S
    Ko --> T
    XT --> T
    Mai --> T
```
The point isn't "we'll add all of these tomorrow." The point is the architecture is ready to absorb whichever ones the user actually wants — the same way the LLM expansion absorbed five providers behind one interface. Each model added is a new file, not a new shell.
Closing — what the next slot is for
Two surfaces opened. The next obvious additions slot in without changing the shell:
- Local Gemma audio model when the catalog ships one: fills the `LocalGemmaStt` placeholder, gives a fully-offline voice loop.
- A streaming provider (e.g., Anthropic's streaming API): adds nothing structural, just a new `ILlmProvider` impl.
- A new voice provider — same.
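For a sense of what a streaming `ILlmProvider` impl has to do internally, the sketch below turns server-sent event lines into text fragments. It assumes the OpenAI-compatible framing (`data: {json}` lines ending with `data: [DONE]`); Anthropic's event schema is different and would need its own parser. `SseDeltaParserSketch` is a hypothetical helper, not part of the project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class SseDeltaParserSketch
{
    // Yields the text fragment carried by each streamed chunk, in arrival order.
    public static IEnumerable<string> ReadDeltas(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            // Skip blank keep-alive lines and any non-data SSE fields.
            if (!line.StartsWith("data:", StringComparison.Ordinal))
                continue;

            var payload = line.Substring(5).Trim();
            if (payload == "[DONE]")
                yield break;                       // end-of-stream sentinel

            using var doc = JsonDocument.Parse(payload);
            var choices = doc.RootElement.GetProperty("choices");
            if (choices.GetArrayLength() == 0)
                continue;

            // The first chunk usually carries only the role, so content may be absent.
            var delta = choices[0].GetProperty("delta");
            if (delta.TryGetProperty("content", out var content) &&
                content.ValueKind == JsonValueKind.String)
                yield return content.GetString()!;
        }
    }
}
```

A provider's `StreamAsync` could wrap each fragment in an `LlmStreamChunk` as lines arrive from the HTTP response stream; the rest of the app would see the same chunk type regardless of which host produced it.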
The agent earned its name once, in the foundation report. Each chapter after that is the same shape played in a new key.
TECH LINKS
- 𝕏 @webnori