🎙️

Adding Providers and a Voice — AgentZero Lite Expands Beyond a Single Local Model

TL;DR: Two surfaces of AgentZero Lite expanded at once. The LLM layer went from a single local runtime to six backends (local Gemma, local Nemotron Nano 8B, Webnori, OpenAI, LM Studio, Ollama) behind one OpenAI-compatible interface. The modality layer grew a voice: an STT/TTS subsystem with four STT engines, two TTS engines, dual-VAD mic capture, and a live pipeline that reuses the same LLM gateway. Neither expansion needed a new architecture; both fell out of the same design principle that already defined the agent: one abstract interface, many implementations.
This is a follow-up to the foundation report — An LLM Is Not an Agent: Five Layers. That doc argued the agent is what surrounds the model. This doc shows what happens when you keep that frame and ask: what else can we slot behind the same interfaces?

The two surfaces that just expanded

flowchart LR
    subgraph Before
        A1[Local Gemma 4<br/>+ Nemotron]:::dim
        T1[Text only]:::dim
    end
    subgraph After
        A2[Local Gemma 4<br/>Local Nemotron<br/>Webnori<br/>OpenAI<br/>LM Studio<br/>Ollama]
        T2[Text + Voice<br/>STT × 4 / TTS × 2]
    end
    A1 -.expanded.-> A2
    T1 -.expanded.-> T2
    classDef dim fill:#f5f5f5,stroke:#999,color:#666
Same shell. Same actor topology. Same harness. The expansions land inside the boxes the field report drew.

Surface 1 — From one local LLM to five providers

The original story was self-built llama.cpp + LLamaSharp 0.26 running Gemma-4 / Nemotron on-device. That solved one question: can the loop work entirely offline? Yes. But it left the user stuck if they wanted to try a hosted model, or already had Ollama / LM Studio running on the same machine.
The expansion adds four external providers and unifies them behind one HTTP client:
| Provider | Mode | Endpoint | Reason to include |
| --- | --- | --- | --- |
| Webnori | Hosted (free, public) | OpenAI-compatible | Zero-friction first try |
| OpenAI | Hosted (paid) | api.openai.com/v1 | Frontier comparison |
| LM Studio | Local | localhost:1234/v1 | Already-running local server |
| Ollama | Local | localhost:11434/v1 | Most common local server |
All four speak the OpenAI Chat Completions wire format, so a single OpenAiCompatibleProvider covers them. Only the base URL and credentials differ.
public interface ILlmProvider
{
    string ProviderName { get; }
    Task<List<LlmModelInfo>> ListModelsAsync(CancellationToken ct = default);
    Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct = default);
    IAsyncEnumerable<LlmStreamChunk> StreamAsync(LlmRequest request, CancellationToken ct = default);
}
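Because only the base URL and credentials differ between the four hosts, the request-building part can be sketched in a few lines. This is a minimal illustration, not the project's OpenAiCompatibleProvider: `ProviderEndpoint` and `ChatRequestBuilder` are hypothetical names, and the `http://` scheme for the local servers is an assumption (the table above lists host and path only).

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical config record: everything that varies between hosts.
public sealed record ProviderEndpoint(string Name, string BaseUrl, string? ApiKey = null);

public static class ChatRequestBuilder
{
    // Builds a Chat Completions request for any OpenAI-compatible host.
    public static HttpRequestMessage Build(ProviderEndpoint endpoint, string model, string userText)
    {
        // Wire format: POST {baseUrl}/chat/completions with a JSON body.
        var url = endpoint.BaseUrl.TrimEnd('/') + "/chat/completions";
        var payload = new
        {
            model,
            messages = new[] { new { role = "user", content = userText } }
        };

        var request = new HttpRequestMessage(HttpMethod.Post, url)
        {
            Content = new StringContent(
                JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json")
        };

        // Assumption: local servers accept requests without a key,
        // hosted ones expect a bearer token.
        if (!string.IsNullOrEmpty(endpoint.ApiKey))
            request.Headers.Authorization =
                new AuthenticationHeaderValue("Bearer", endpoint.ApiKey);

        return request;
    }
}
```

Calling `ChatRequestBuilder.Build(new ProviderEndpoint("Ollama", "http://localhost:11434/v1"), "some-model", "hello")` yields a request aimed at `/v1/chat/completions` with no Authorization header; swapping in an OpenAI endpoint with a key changes only the record, not the builder.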
The selection lives where the user expects it — a Local / External radio on the LLM settings tab. Picking External reveals provider + model + API-key fields; the chosen provider is persisted in LlmSettingsStore. A factory (LlmProviderFactory) returns the right ILlmProvider at runtime.
flowchart TD
    S[LLM tab<br/>Local · External] --> F[LlmProviderFactory]
    F -->|Local| LH[LLamaSharp host<br/>Gemma 4 / Nemotron / …]
    F -->|External| OAC[OpenAiCompatibleProvider]
    OAC -->|baseUrl + apiKey| W[Webnori]
    OAC -->|baseUrl + apiKey| O[OpenAI]
    OAC -->|baseUrl| LMS[LM Studio]
    OAC -->|baseUrl| OLL[Ollama]
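The factory step in the diagram can be sketched as a single switch over the persisted mode. The shapes here are assumptions for illustration: the source names LlmSettingsStore and LlmProviderFactory but does not show them, so `LlmSettings`, `LlmMode`, and `LocalLlamaProvider` are hypothetical, and `ILlmProvider` is trimmed to one member to keep the sketch self-contained.

```csharp
using System;

public enum LlmMode { Local, External }

// Hypothetical settings shape; the source only names LlmSettingsStore.
public sealed record LlmSettings(
    LlmMode Mode, string ProviderName, string BaseUrl, string? ApiKey, string Model);

// Trimmed to one member for the sketch; the real interface has more.
public interface ILlmProvider
{
    string ProviderName { get; }
}

// Stand-in for the LLamaSharp-backed local host.
public sealed class LocalLlamaProvider : ILlmProvider
{
    public string ProviderName => "Local";
}

// One class serves every OpenAI-compatible host; only its config differs.
public sealed class OpenAiCompatibleProvider : ILlmProvider
{
    public OpenAiCompatibleProvider(string name, string baseUrl, string? apiKey)
    {
        ProviderName = name;
        BaseUrl = new Uri(baseUrl);
        ApiKey = apiKey;
    }

    public string ProviderName { get; }
    public Uri BaseUrl { get; }
    public string? ApiKey { get; }
}

public static class LlmProviderFactory
{
    // Maps the Local / External radio choice to a concrete provider.
    public static ILlmProvider Create(LlmSettings s) => s.Mode switch
    {
        LlmMode.Local => new LocalLlamaProvider(),
        LlmMode.External => new OpenAiCompatibleProvider(s.ProviderName, s.BaseUrl, s.ApiKey),
        _ => throw new ArgumentOutOfRangeException(nameof(s), s.Mode, "Unknown LLM mode")
    };
}
```

Under this shape, supporting another OpenAI-compatible host means persisting a different name and base URL; the switch itself does not change.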

Where this connects back to AIMODE

The foundation report described AIMODE's dual backend — Gemma 4 via GBNF, Nemotron via the Llama-3.1 chat template, both behind IAgentToolHost. The provider expansion follows the same principle, one layer up:
| Layer | Interface | Implementations |
| --- | --- | --- |
| Tool host | IAgentToolHost | WorkspaceTerminalToolHost |
| Tool-call backend (AIMODE) | AgentToolLoop / ExternalAgentToolLoop | Gemma-GBNF, Nemotron-template |
| LLM provider (new) | ILlmProvider | OpenAiCompatibleProvider × 4 hosts, local LLamaSharp host |
Same shape, different layer. Adding a fifth provider tomorrow is a config entry, not a refactor.

Surface 2 — From text to voice

Adding voice could have meant adding a new agent. It didn't. The Voice tab routes through the same LlmGateway.OpenSession() the rest of the app uses — whatever the user picked on the LLM tab is what answers the voice. STT and TTS are I/O adapters, not a parallel agent.
This was a deliberate split from the AgentWin origin: there is no separate "voice LLM provider" — only a new way to enter and exit the existing one.

Pipeline

flowchart LR
    M[Mic<br/>NAudio<br/>16 kHz mono] --> V[VAD<br/>frame + utterance]
    V -->|UtteranceEnded| STT[ISpeechToText]
    STT -->|text| GW[LlmGateway.OpenSession]
    GW -->|reply| CL[TtsTextCleaner<br/>strip markup]
    CL --> TTS[ITextToSpeech]
    TTS -->|audio bytes| PB[VoicePlaybackService<br/>NAudio]
    PB -->|while playing| MUTE[Muted=true<br/>mic suppressed]
    MUTE -.-> M
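One turn through this pipeline can be sketched as a small orchestrator. This is an illustration under stated assumptions, not the app's code: `VoiceTurn` is a hypothetical class, and each stage is passed in as a delegate so the sketch compiles on its own. In the app those delegates would correspond to ISpeechToText, an LlmGateway session, ITextToSpeech, VoicePlaybackService, and the capture service's Muted flag. The text-cleaning step is left out for brevity.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical orchestrator for a single voice turn: STT -> LLM -> TTS -> playback.
public sealed class VoiceTurn
{
    private readonly Func<byte[], string?, CancellationToken, Task<string>> _transcribe;
    private readonly Func<string, CancellationToken, Task<string>> _askLlm;
    private readonly Func<string, CancellationToken, Task<byte[]>>? _synthesize; // null models TTS "Off"
    private readonly Func<byte[], CancellationToken, Task> _play;
    private readonly Action<bool> _setMuted;

    public VoiceTurn(
        Func<byte[], string?, CancellationToken, Task<string>> transcribe,
        Func<string, CancellationToken, Task<string>> askLlm,
        Func<string, CancellationToken, Task<byte[]>>? synthesize,
        Func<byte[], CancellationToken, Task> play,
        Action<bool> setMuted)
    {
        _transcribe = transcribe;
        _askLlm = askLlm;
        _synthesize = synthesize;
        _play = play;
        _setMuted = setMuted;
    }

    // Returns the LLM reply text; speaks it when a TTS engine is configured.
    public async Task<string> RunAsync(byte[] utterancePcm, string? language, CancellationToken ct)
    {
        var text = await _transcribe(utterancePcm, language, ct);
        if (string.IsNullOrWhiteSpace(text)) return string.Empty; // nothing intelligible heard

        var reply = await _askLlm(text, ct);
        if (_synthesize is null) return reply;                    // voice in only

        var audio = await _synthesize(reply, ct);
        _setMuted(true);                                          // keep the mic from hearing the speaker
        try
        {
            await _play(audio, ct);
        }
        finally
        {
            _setMuted(false);                                     // re-open the mic even if playback fails
        }
        return reply;
    }
}
```

The `try`/`finally` around playback is the design choice worth noting: if playback throws or is cancelled, the microphone is still unmuted, so one failed reply does not leave the loop deaf.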

STT — four engines, one interface

public interface ISpeechToText
{
    string ProviderName { get; }
    Task<string> TranscribeAsync(byte[] pcm16k, string? language, CancellationToken ct);
}
| Engine | Mode | Model / endpoint | Notes |
| --- | --- | --- | --- |
| Whisper.net | Local | ggml-tiny / small / medium | Runs on local CPU |
| OpenAI Whisper | Hosted | /v1/audio/transcriptions (whisper-1) | RIFF-WAV multipart POST |
| Webnori-Gemma audio | Hosted | OpenAI-compatible Chat Completions w/ input_audio block | Reuses LLM tab credentials |
| LocalGemmaStt | Local | (placeholder) | Will load an audio-capable Gemma GGUF via LLamaSharp multimodal once the catalog ships one |
Model files for Whisper.net are cached at %USERPROFILE%\.ollama\models\agentzero\whisper\ so they live alongside the user's existing Ollama models — no new convention to learn.
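The hosted Whisper row mentions a RIFF-WAV upload, while the capture side produces raw PCM. Wrapping one in the other is a fixed 44-byte header. The following is a minimal sketch assuming 16-bit PCM input; `WavEncoder` is a hypothetical helper, not a class named in the source.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavEncoder
{
    // Prepends a canonical 44-byte RIFF/WAVE header to raw 16-bit PCM samples.
    public static byte[] WrapPcm16(byte[] pcm, int sampleRate = 16000, short channels = 1)
    {
        const short bitsPerSample = 16;
        short blockAlign = (short)(channels * bitsPerSample / 8); // bytes per sample frame
        int byteRate = sampleRate * blockAlign;                   // bytes per second

        using var ms = new MemoryStream(44 + pcm.Length);
        using var w = new BinaryWriter(ms, Encoding.ASCII);       // BinaryWriter is little-endian

        w.Write(Encoding.ASCII.GetBytes("RIFF"));
        w.Write(36 + pcm.Length);                                 // chunk size = total size - 8
        w.Write(Encoding.ASCII.GetBytes("WAVE"));

        w.Write(Encoding.ASCII.GetBytes("fmt "));
        w.Write(16);                                              // fmt chunk length for PCM
        w.Write((short)1);                                        // format tag 1 = uncompressed PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(byteRate);
        w.Write(blockAlign);
        w.Write(bitsPerSample);

        w.Write(Encoding.ASCII.GetBytes("data"));
        w.Write(pcm.Length);                                      // payload length in bytes
        w.Write(pcm);

        w.Flush();
        return ms.ToArray();
    }
}
```

One second of 16 kHz mono audio is 32,000 bytes of PCM, so `WavEncoder.WrapPcm16(oneSecond)` returns 32,044 bytes, ready to attach as the file part of a multipart request.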

TTS — three options

| Engine | Mode | Notes |
| --- | --- | --- |
| Off | | Default; voice in only |
| Windows SAPI | Local | Uses installed Win11 language packs (e.g., the Korean neural voice ships free at OS level) |
| OpenAI tts-1 | Hosted | Voices: alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse |
The "Win11 ships TTS free at the OS level — not ElevenLabs-grade but usable" observation already came up in the @Agredo10 reply chain; this is that observation built into a tab.
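Whichever engine is selected, the reply passes through the pipeline's TtsTextCleaner step first, since an LLM reply full of Markdown sounds wrong when read aloud. The source only says that step strips markup, so the rules below are assumptions chosen for illustration, and the class is named `TtsTextCleanerSketch` to make clear it is not the project's implementation.

```csharp
using System.Text.RegularExpressions;

public static class TtsTextCleanerSketch
{
    // Removes common Markdown so a TTS engine reads words, not symbols.
    public static string Clean(string text)
    {
        // Drop fenced code blocks entirely; reading code aloud is rarely useful.
        text = Regex.Replace(text, "`{3}.*?`{3}", " ", RegexOptions.Singleline);

        // [label](url) becomes just the label.
        text = Regex.Replace(text, @"\[([^\]]+)\]\([^)]*\)", "$1");

        // Heading hashes and list bullets at the start of a line.
        text = Regex.Replace(text, @"^[ \t]*(#{1,6}|[-*+])[ \t]+", "", RegexOptions.Multiline);

        // Emphasis, strikethrough, and inline-code markers.
        text = Regex.Replace(text, @"[*_`~]+", "");

        // Collapse runs of whitespace, including newlines, into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

A known trade-off of this simple rule set: underscores are removed everywhere, so an identifier such as `snake_case` is read as one word. A production cleaner would likely be more selective.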

Mic capture: dual-VAD + pre-roll + mute-while-playing

A voice loop fails in three predictable ways. The capture service was written assuming all three would happen.
public sealed class VoiceCaptureService : IDisposable
{
    public event EventHandler<float>? AmplitudeChanged;     // ~50 ms ticks → UI level meter
    public event EventHandler<bool>? SpeakingStateChanged;  // VAD frame state
    public event EventHandler? UtteranceStarted;
    public event EventHandler<byte[]>? UtteranceEnded;      // PCM 16k mono → STT
    public bool Muted { get; set; }                         // suppress while TTS plays
}
  • Frame-level VAD (~50 ms) — for the live amplitude meter the user sees in the Voice Test panel.
  • Utterance-level VAD (~2 s of trailing silence) — the actual segment boundary that triggers STT. Without this, every breath becomes a transcription request.
  • Pre-roll buffer (~1 s) — captured before VAD trips, so the first consonant of a sentence isn't clipped off.
  • Muted flag — set true while VoicePlaybackService is playing TTS audio. Without this the speaker output gets re-captured by the mic and the loop talks to itself.
These are not ML problems. They're plumbing. Getting them wrong makes the voice feature feel "broken in a way nobody can articulate."
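The interplay of the four mechanisms above can be sketched as a frame-by-frame state machine. This is an illustration, not VoiceCaptureService itself: `UtteranceSegmenter` is a hypothetical class, the amplitude threshold of 0.02 is an arbitrary assumption, and the caller is assumed to supply each frame's normalized amplitude (0 to 1).

```csharp
using System;
using System.Collections.Generic;

public sealed class UtteranceSegmenter
{
    private readonly int _preRollFrames;
    private readonly int _silenceFramesToEnd;
    private readonly float _threshold;
    private readonly Queue<byte[]> _preRoll = new();
    private readonly List<byte[]> _current = new();
    private int _silentRun;
    private bool _inUtterance;

    public bool Muted { get; set; }                  // set while TTS audio is playing
    public event Action<byte[]>? UtteranceEnded;     // complete utterance, ready for STT

    public UtteranceSegmenter(int frameMs = 50, int preRollMs = 1000,
                              int trailingSilenceMs = 2000, float threshold = 0.02f)
    {
        _preRollFrames = preRollMs / frameMs;
        _silenceFramesToEnd = trailingSilenceMs / frameMs;
        _threshold = threshold;
    }

    // Feed one captured frame plus its normalized amplitude.
    public void Push(byte[] frame, float amplitude)
    {
        if (Muted) { Reset(); return; }              // discard audio captured during playback

        bool speaking = amplitude >= _threshold;     // frame-level VAD decision
        if (!_inUtterance)
        {
            if (!speaking)
            {
                // Keep a rolling window of recent silence as pre-roll.
                _preRoll.Enqueue(frame);
                if (_preRoll.Count > _preRollFrames) _preRoll.Dequeue();
                return;
            }
            // Speech onset: seed the utterance with the pre-roll window.
            _current.AddRange(_preRoll);
            _preRoll.Clear();
            _inUtterance = true;
        }

        _current.Add(frame);
        _silentRun = speaking ? 0 : _silentRun + 1;
        if (_silentRun >= _silenceFramesToEnd) Flush(); // utterance-level boundary
    }

    private void Flush()
    {
        int total = 0;
        foreach (var f in _current) total += f.Length;
        var pcm = new byte[total];
        int offset = 0;
        foreach (var f in _current)
        {
            Buffer.BlockCopy(f, 0, pcm, offset, f.Length);
            offset += f.Length;
        }
        Reset();
        UtteranceEnded?.Invoke(pcm);
    }

    private void Reset()
    {
        _preRoll.Clear();
        _current.Clear();
        _silentRun = 0;
        _inUtterance = false;
    }
}
```

With the defaults, an utterance closes after 40 consecutive quiet frames (2 s at 50 ms per frame) and starts with up to 20 frames of audio from before the VAD tripped. The emitted buffer includes the trailing silence; trimming it is left out of the sketch.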
Source: VoicePlaybackService.cs (auto-detects WAV / MP3 / PCM16@24kHz; patches the OpenAI gpt-4o-audio-preview RIFF header)
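Format auto-detection of this kind is usually done by inspecting the first few bytes. The sketch below shows one plausible way to tell WAV, MP3, and headerless PCM apart; it is an assumption about the general technique, not a copy of VoicePlaybackService.cs, and `AudioSniffer` and `AudioFormat` are hypothetical names.

```csharp
using System;

public enum AudioFormat { Wav, Mp3, RawPcm16 }

public static class AudioSniffer
{
    // Guesses the container from leading magic bytes; falls back to raw PCM.
    public static AudioFormat Detect(ReadOnlySpan<byte> data)
    {
        // WAV: "RIFF" <4-byte size> "WAVE"
        if (data.Length >= 12 &&
            data[0] == 'R' && data[1] == 'I' && data[2] == 'F' && data[3] == 'F' &&
            data[8] == 'W' && data[9] == 'A' && data[10] == 'V' && data[11] == 'E')
            return AudioFormat.Wav;

        // MP3 with a leading ID3v2 tag.
        if (data.Length >= 3 && data[0] == 'I' && data[1] == 'D' && data[2] == '3')
            return AudioFormat.Mp3;

        // Bare MPEG audio frame: 11 sync bits set, layer bits 01 (Layer III).
        if (data.Length >= 2 && data[0] == 0xFF && (data[1] & 0xE6) == 0xE2)
            return AudioFormat.Mp3;

        // No recognizable header: treat as headerless PCM16.
        return AudioFormat.RawPcm16;
    }
}
```

The frame-sync test is a heuristic: raw PCM can begin with bytes that happen to look like an MPEG header, so a real implementation may also rely on the response content type or the format it requested.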

What this expansion didn't change

This is as important as what it added.
  • No new actor was introduced. Voice still talks to the same AgentReactorActor topology described in the foundation report. STT/TTS are services, not actors.
  • No new gateway. Voice reuses LlmGateway.OpenSession(). Whatever the LLM tab decides answers the voice — local Gemma, OpenAI, anything.
  • No chunked streaming yet. All STT/TTS calls take a complete utterance. Streaming is a known follow-up; the absence is intentional, not an oversight, because chunked Whisper + chunked TTS on five providers is its own multi-week problem.
  • No CUDA in the installer. Whisper.net's CUDA runtime is real and works, but bundling it doubled installer size for a feature most users will run on small models. CPU-only first; GPU later if the data says it matters.

The shared design principle, restated

| Expansion | Interface that absorbed it | Implementations behind it |
| --- | --- | --- |
| Multi-provider LLM | ILlmProvider | LLamaSharp local, OpenAiCompatibleProvider × 4 |
| Voice in | ISpeechToText | Whisper.net, OpenAI Whisper, Webnori-Gemma, (LocalGemma placeholder) |
| Voice out | ITextToSpeech | Off, Windows SAPI, OpenAI tts-1 |
| (Established earlier) Tool host | IAgentToolHost | WorkspaceTerminalToolHost |
| (Established earlier) AIMODE backend | AgentToolLoop / ExternalAgentToolLoop | Gemma-GBNF, Nemotron-template |
The agent doesn't grow by adding agents. It grows by adding implementations behind already-named interfaces. That is what made today's expansion a 2,000-line pull rather than a 20,000-line one.

Why the free-OS-voice angle is load-bearing

Win11 ships a Korean neural TTS voice at the OS level, for free, as part of the language pack; the same goes for English, Japanese, and most major languages. This rarely appears in marketing decks, but for a desktop agent that wants to say something out loud, it removes a $0–22/month line item that competitors silently charge for.
Combine that with what's already free on each side:
| Slot | Free default | Cost in dollars per month |
| --- | --- | --- |
| STT | Whisper.net (ggml-tiny/small/medium, local CPU) | $0 |
| LLM (local) | Local Gemma 4 / Local Nemotron | $0 (your CPU/GPU) |
| LLM (hosted) | Webnori free tier | $0 |
| TTS (local) | Windows SAPI (Win11 neural voice) | $0 |
| TTS (hosted) | OpenAI tts-1 | paid |
The default loop end-to-end — speak Korean → transcribe → local model thinks → answer in Korean voice — costs $0 in API fees. Paid providers slot in only when the user opts up. That is not a downstream consequence; it is the design.
Most "voice AI" desktop products in this category bury this fact because their billing depends on you not noticing. AgentZero Lite leans into it: the Windows OS is a free voice asset for anyone targeting Windows, and we use it.

Situational use cases this stack unlocks

The architecture didn't grow voice for novelty. Most of the high-value desktop-agent scenarios are voice-first.
| Scenario | Why voice changes it | What's already in the box |
| --- | --- | --- |
| Meeting note-taker | Bot-free meeting capture is what Otter / Granola / Fireflies charge $20–30/mo for. Meetily already proved a self-hosted Whisper + LLM-summary stack is competitive. | STT × 4, LLM × 6 incl. local; recorder + summarizer is a feature week, not a rebuild |
| Voice journaling | Speaking is faster than typing; LLM cleans up filler and commits a structured day note | Local Gemma + Win11 SAPI = $0 loop |
| Hands-busy healthcare tracking | Cooking / exercise / caretaking: typing isn't viable. Mirrors the @graylanj edge-LLM pattern (Gemma + dental scanner + pill bottle scanner) but on Windows desktop | Local-only path covers privacy; speech-in / speech-out covers ergonomics |
| Hands-free CLI | Talk to the Claude/Codex tabs while watching the screen | Already wired: Voice tab + AIMODE handshake meet at LlmGateway |
| Language practice | Voice cloning + pronunciation feedback | XTTS-v2 slots in as a new ITextToSpeech |
The meeting-note one is worth pausing on. People pay real money today for transcribe + summarize + share over their own meetings. The components AgentZero Lite already has — bot-free local recording (NAudio), STT × 4, an LLM gateway with multiple providers, and a discipline harness for prompt engineering — are exactly what Meetily bundles. The stack hasn't shipped that feature, but the lego pieces are all already in the box.

What's coming from the open-source side

The voice-model space reshaped fast in early 2026. Most of these slot into our existing ISpeechToText / ITextToSpeech interfaces with one new class each.

STT (open / efficient, 2026 picks)

| Model | License | Why it's interesting |
| --- | --- | --- |
| Whisper Large V3 Turbo | MIT (OpenAI) | 216× real-time, multilingual; current accuracy/speed balance pick |
| Distil-Whisper | MIT (Hugging Face) | 6× faster than Whisper Large V3, WER within ~1%; ideal for streaming |
| NVIDIA Parakeet TDT 1.1B | Open source (NVIDIA) | RTFx > 2,000 (fastest available), English-only |
| Faster-Whisper | MIT | CTranslate2 reimplementation, ~4× faster, drop-in for Whisper.cpp |
| WhisperX | BSD-2 | Word-level timestamps + speaker diarization → unlocks "who said what" in meetings |
| Moonshine / Canary-Qwen / Qwen3-ASR | Open source | New generation already matching commercial APIs on standard benchmarks |

TTS (open / efficient, 2026 picks)

| Model | License | Why it's interesting |
| --- | --- | --- |
| Kokoro | Apache-2.0 | 82M params, ElevenLabs-comparable quality, free for commercial use |
| Coqui XTTS-v2 | Coqui non-commercial | 6-second voice cloning, 17 languages (note the license restriction) |
| Coqui TTS toolkit | MPL-2.0 | 1,100+ languages of pre-trained models |

Microsoft's own move (April 2026)

On 2026-04-02 Microsoft surfaced MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Foundry. The voice-relevant pieces:
  • MAI-Voice-1 generates 60 seconds of audio in 1 second (60× real-time).
  • Microsoft's embedded-speech path lets newer Win11 boxes (24H2 + DX12 + 8 GB RAM) run TTS locally — exactly the same OS-integration story SAPI already gives us, but neural-HD-grade.
  • The 2026-03 Foundry release notes mention Neural HD TTS with MAI-Voice-1 integration.
For our architecture this is a known-future ITextToSpeech impl — the day MAI-Voice-1 ships a public Win11 local API, it slots in next to WindowsTts and the rest of the user's loop doesn't have to learn anything new.

What plugs in where

flowchart LR
    subgraph existing[Existing slots]
        S[ISpeechToText]
        T[ITextToSpeech]
    end
    subgraph stt_open[Open STT models]
        Pk[Parakeet TDT 1.1B]
        DW[Distil-Whisper]
        WT[Whisper Turbo V3]
        FW[Faster-Whisper]
        WX[WhisperX<br/>+ diarization]
    end
    subgraph tts_open[Open / future TTS]
        Ko[Kokoro 82M]
        XT[XTTS-v2]
        Mai[MAI-Voice-1<br/>via Foundry Local]
    end
    Pk --> S
    DW --> S
    WT --> S
    FW --> S
    WX --> S
    Ko --> T
    XT --> T
    Mai --> T
The point isn't "we'll add all of these tomorrow." The point is the architecture is ready to absorb whichever ones the user actually wants — the same way the LLM expansion absorbed five providers behind one interface. Each model added is a new file, not a new shell.
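What "a new file, not a new shell" could look like in practice is sketched below: one class per engine plus one registration call. Everything here is an assumption for illustration. The source names ITextToSpeech but does not show its members, so the signature is a guess, and `TtsRegistry` and `SilentTts` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape; the source does not show this interface's members.
public interface ITextToSpeech
{
    string ProviderName { get; }
    Task<byte[]> SynthesizeAsync(string text, CancellationToken ct);
}

// Hypothetical registry: the settings UI lists Names, the pipeline calls Resolve.
public sealed class TtsRegistry
{
    private readonly Dictionary<string, Func<ITextToSpeech>> _factories =
        new(StringComparer.OrdinalIgnoreCase);

    public void Register(string name, Func<ITextToSpeech> factory) => _factories[name] = factory;

    // Returns null for unknown names, which the caller can treat as "Off".
    public ITextToSpeech? Resolve(string name) =>
        _factories.TryGetValue(name, out var factory) ? factory() : null;

    public IReadOnlyCollection<string> Names => _factories.Keys;
}

// A new engine is one class like this; this placeholder returns no audio.
public sealed class SilentTts : ITextToSpeech
{
    public string ProviderName => "Silent";

    public Task<byte[]> SynthesizeAsync(string text, CancellationToken ct) =>
        Task.FromResult(Array.Empty<byte>());
}
```

Under this shape, adding an engine such as Kokoro would mean writing one class that implements the interface and adding one `Register` call at startup; nothing in the capture, gateway, or playback code needs to know the engine exists.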

Closing — what the next slot is for

Two surfaces opened. The next obvious additions slot in without changing the shell:
  • Local Gemma audio model when the catalog ships one — fills the LocalGemmaStt placeholder, gives a fully-offline voice loop.
  • A streaming provider (e.g., Anthropic's streaming API) — adds nothing structural, just a new ILlmProvider impl.
  • A new voice provider — same.
The agent earned its name once, in the foundation report. Each chapter after that is the same shape played in a new key.
