Adding Providers and a Voice — AgentZero Lite Expands Beyond a Single Local Model

TL;DR: Two surfaces of AgentZero Lite expanded at once. The LLM layer went from one local LLamaSharp host (Gemma 4 or Nemotron Nano 8B) to five providers: that local host plus Webnori, OpenAI, LM Studio, and Ollama, with the four external ones sharing a single OpenAI-compatible client. The modality layer grew a voice: an STT/TTS subsystem with four STT engines, two TTS engines, dual-VAD mic capture, and a live pipeline that reuses the same LLM gateway. Neither expansion needed a new architecture; both fell out of the design principle that already defined the agent: one abstract interface, many implementations.

This is a follow-up to the foundation report — An LLM Is Not an Agent: Five Layers. That doc argued the agent is what surrounds the model. This doc shows what happens when you keep that frame and ask: what else can we slot behind the same interfaces?

The two surfaces that just expanded


flowchart LR
  subgraph Before
    A1[Local Gemma 4<br/>+ Nemotron]:::dim
    T1[Text only]:::dim
  end
  subgraph After
    A2[Local Gemma 4<br/>Local Nemotron<br/>Webnori<br/>OpenAI<br/>LM Studio<br/>Ollama]
    T2[Text + Voice<br/>STT × 4 / TTS × 2]
  end
  A1 -.expanded.-> A2
  T1 -.expanded.-> T2
  classDef dim fill:#f5f5f5,stroke:#999,color:#666

Same shell. Same actor topology. Same harness. The expansions land inside the boxes the foundation report drew.

Surface 1 — From one local LLM to five providers

The original story was self-built llama.cpp + LLamaSharp 0.26 running Gemma-4 / Nemotron on-device. That solved one question: can the loop work entirely offline? Yes. But it left the user stuck if they wanted to try a hosted model, or already had Ollama / LM Studio running on the same machine.

The expansion adds four external providers and unifies them behind one HTTP client:

Provider	Mode	Endpoint	Reason to include
Webnori	Hosted (free, public)	OpenAI-compatible	Zero-friction first try
OpenAI	Hosted (paid)	`api.openai.com/v1`	Frontier comparison
LM Studio	Local	`localhost:1234/v1`	Already-running local server
Ollama	Local	`localhost:11434/v1`	Most common local server

All four speak the OpenAI Chat Completions wire format, so a single OpenAiCompatibleProvider covers them. Only the base URL and credentials differ.


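using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface ILlmProvider
{
    string ProviderName { get; }
    // Models the backend currently exposes.
    Task<List<LlmModelInfo>> ListModelsAsync(CancellationToken ct = default);
    // One-shot completion: the full response in a single result.
    Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct = default);
    // Streaming completion: chunks are yielded as they arrive.
    IAsyncEnumerable<LlmStreamChunk> StreamAsync(LlmRequest request, CancellationToken ct = default);
}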

Source: Project/ZeroCommon/Llm/Providers/ILlmProvider.cs
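Since only the base URL and credentials differ between hosts, the request-building step can be sketched in a few lines. This is a minimal illustration and not the project's `OpenAiCompatibleProvider`: the `OpenAiEndpoint` record and `ChatRequestBuilder` helper are hypothetical names, the URL schemes are assumed, and Webnori is left out because the source does not give its URL.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical config record: one entry per OpenAI-compatible host.
public sealed record OpenAiEndpoint(string Name, string BaseUrl, string? ApiKey = null);

public static class ChatRequestBuilder
{
    // Hosts taken from the provider table; schemes are assumed, local servers need no key.
    public static readonly OpenAiEndpoint[] KnownEndpoints =
    {
        new("OpenAI", "https://api.openai.com/v1", ApiKey: "<your-key>"),
        new("LM Studio", "http://localhost:1234/v1"),
        new("Ollama", "http://localhost:11434/v1"),
    };

    // Builds a Chat Completions POST; the JSON body has the same shape for every host.
    public static HttpRequestMessage Build(OpenAiEndpoint endpoint, string model, string userText)
    {
        var payload = new
        {
            model,
            messages = new[] { new { role = "user", content = userText } },
        };
        var request = new HttpRequestMessage(
            HttpMethod.Post, endpoint.BaseUrl.TrimEnd('/') + "/chat/completions")
        {
            Content = new StringContent(
                JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json"),
        };
        // Only hosted providers carry a bearer token.
        if (!string.IsNullOrEmpty(endpoint.ApiKey))
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", endpoint.ApiKey);
        return request;
    }
}
```

Under these assumptions, pointing the client at another OpenAI-compatible host means adding one `OpenAiEndpoint` entry; the request code itself stays the same.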

The selection lives where the user expects it — a Local / External radio on the LLM settings tab. Picking External reveals provider + model + API-key fields; the chosen provider is persisted in LlmSettingsStore. A factory (LlmProviderFactory) returns the right ILlmProvider at runtime.


flowchart TD
  S[LLM tab<br/>Local · External] --> F[LlmProviderFactory]
  F -->|Local| LH[LLamaSharp host<br/>Gemma 4 / Nemotron / …]
  F -->|External| OAC[OpenAiCompatibleProvider]
  OAC -->|baseUrl + apiKey| W[Webnori]
  OAC -->|baseUrl + apiKey| O[OpenAI]
  OAC -->|baseUrl| LMS[LM Studio]
  OAC -->|baseUrl| OLL[Ollama]
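The selection flow in the diagram can be sketched as a small factory. Everything here is a stand-in so that the example compiles on its own: `LlmSettings`, `IProviderSketch`, and the two provider classes are hypothetical reductions, and the real `LlmProviderFactory` and `LlmSettingsStore` may be shaped differently.

```csharp
using System;

public enum LlmMode { Local, External }

// Hypothetical settings shape; the real LlmSettingsStore is not shown in the source.
public sealed record LlmSettings(LlmMode Mode, string? BaseUrl = null, string? ApiKey = null);

// Reduced stand-in for ILlmProvider so the sketch is self-contained.
public interface IProviderSketch { string ProviderName { get; } }

public sealed class LocalHostSketch : IProviderSketch
{
    public string ProviderName => "Local";
}

public sealed class OpenAiCompatibleSketch : IProviderSketch
{
    public OpenAiCompatibleSketch(string baseUrl, string? apiKey)
    {
        BaseUrl = baseUrl;
        ApiKey = apiKey;
    }

    public string BaseUrl { get; }
    public string? ApiKey { get; }
    public string ProviderName => "External";
}

public static class ProviderFactorySketch
{
    // Maps the persisted radio-button choice to a concrete provider.
    public static IProviderSketch Create(LlmSettings s) => s.Mode switch
    {
        LlmMode.Local => new LocalHostSketch(),
        LlmMode.External when !string.IsNullOrWhiteSpace(s.BaseUrl)
            => new OpenAiCompatibleSketch(s.BaseUrl!, s.ApiKey),
        _ => throw new InvalidOperationException("External mode needs a base URL."),
    };
}
```

The caller only ever sees the interface, which is why all four external hosts can share one class in this sketch.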

Where this connects back to AIMODE

The foundation report described AIMODE's dual backend — Gemma 4 via GBNF, Nemotron via the Llama-3.1 chat template, both behind IAgentToolHost. The provider expansion follows the same principle, one layer up:

Layer	Interface	Implementations
Tool host	`IAgentToolHost`	`WorkspaceTerminalToolHost`
Tool-call backend (AIMODE)	`AgentToolLoop` / `ExternalAgentToolLoop`	Gemma-GBNF, Nemotron-template
LLM provider (new)	`ILlmProvider`	`OpenAiCompatibleProvider` × 4 hosts, local LLamaSharp host

Same shape, different layer. Adding a fifth OpenAI-compatible host tomorrow is a config entry, not a refactor.

Sources: OpenAiCompatibleProvider.cs · LlmProviderFactory.cs · SettingsPanel.Llm.cs

Surface 2 — From text to voice

Adding voice could have meant adding a new agent. It didn't. The Voice tab routes through the same LlmGateway.OpenSession() the rest of the app uses — whatever the user picked on the LLM tab is what answers the voice. STT and TTS are I/O adapters, not a parallel agent.

This was a deliberate split from the AgentWin origin: there is no separate "voice LLM provider" — only a new way to enter and exit the existing one.

Pipeline


flowchart LR
  M[Mic<br/>NAudio<br/>16 kHz mono] --> V[VAD<br/>frame + utterance]
  V -->|UtteranceEnded| STT[ISpeechToText]
  STT -->|text| GW[LlmGateway.OpenSession]
  GW -->|reply| CL[TtsTextCleaner<br/>strip markup]
  CL --> TTS[ITextToSpeech]
  TTS -->|audio bytes| PB[VoicePlaybackService<br/>NAudio]
  PB -->|while playing| MUTE[Muted=true<br/>mic suppressed]
  MUTE -.-> M
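The "strip markup" step between the gateway and the TTS engine might look roughly like the following. This is a guess at the kind of cleanup involved, not the actual `TtsTextCleaner`: the class name and every rule here are assumptions, and the rules only target common Markdown.

```csharp
using System.Text.RegularExpressions;

public static class TtsTextCleanerSketch
{
    public static string Clean(string reply)
    {
        var text = reply;
        // Drop fenced code blocks entirely; reading code aloud is rarely useful.
        text = Regex.Replace(text, @"`{3}.*?`{3}", " ", RegexOptions.Singleline);
        // Keep link labels, drop URLs: [label](url) becomes label.
        text = Regex.Replace(text, @"\[([^\]]+)\]\([^)]*\)", "$1");
        // Remove heading markers and list bullets at line starts.
        text = Regex.Replace(text, @"^\s*(#{1,6}|[-*+])\s+", "", RegexOptions.Multiline);
        // Remove emphasis and inline-code markers but keep the words.
        // Limitation: this also strips underscores inside identifiers.
        text = Regex.Replace(text, @"[*_`]+", "");
        // Collapse whitespace so the TTS engine receives one clean line.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Without a pass like this, a speech engine would read asterisks, backticks, and URLs aloud.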

STT — four engines, one interface


public interface ISpeechToText
{
    string ProviderName { get; }
    Task<string> TranscribeAsync(byte[] pcm16k, string? language, CancellationToken ct);
}

Source: Project/ZeroCommon/Voice/ISpeechToText.cs

Engine	Mode	Model / endpoint	Notes
Whisper.net	Local	`ggml-tiny` / `small` / `medium`	Local CPU inference
OpenAI Whisper	Hosted	`/v1/audio/transcriptions` (whisper-1)	RIFF-WAV multipart POST
Webnori-Gemma audio	Hosted	OpenAI-compatible Chat Completions w/ `input_audio` block	Reuses LLM tab credentials
LocalGemmaStt	Local	(placeholder)	Will load an audio-capable Gemma GGUF via LLamaSharp multimodal once the catalog ships one

Model files for Whisper.net are cached at %USERPROFILE%\.ollama\models\agentzero\whisper\ so they live alongside the user's existing Ollama models — no new convention to learn.

Sources: WhisperLocalStt.cs · OpenAiWhisperStt.cs · WebnoriGemmaStt.cs
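The hosted Whisper path sends RIFF-WAV, while the capture side produces raw PCM, so somewhere a header has to be prepended. A minimal sketch of that step, assuming 16-bit mono samples at 16 kHz and the canonical 44-byte header (the `WavSketch` name is hypothetical, and the project's own code may build the header differently):

```csharp
using System.IO;
using System.Text;

public static class WavSketch
{
    // Wraps raw PCM16 samples in a 44-byte canonical RIFF/WAVE header.
    public static byte[] WrapPcm16(byte[] pcm, int sampleRate = 16000, short channels = 1)
    {
        const short bitsPerSample = 16;
        short blockAlign = (short)(channels * bitsPerSample / 8);
        int byteRate = sampleRate * blockAlign;

        using var ms = new MemoryStream(44 + pcm.Length);
        using var w = new BinaryWriter(ms, Encoding.ASCII);
        w.Write(Encoding.ASCII.GetBytes("RIFF"));
        w.Write(36 + pcm.Length);             // chunk size = total file size minus 8
        w.Write(Encoding.ASCII.GetBytes("WAVE"));
        w.Write(Encoding.ASCII.GetBytes("fmt "));
        w.Write(16);                          // fmt chunk size for plain PCM
        w.Write((short)1);                    // audio format 1 = PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(byteRate);
        w.Write(blockAlign);
        w.Write(bitsPerSample);
        w.Write(Encoding.ASCII.GetBytes("data"));
        w.Write(pcm.Length);                  // data chunk size in bytes
        w.Write(pcm);
        w.Flush();
        return ms.ToArray();
    }
}
```

The resulting byte array is what would go into the file part of the multipart POST to `/v1/audio/transcriptions`.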

TTS — three options

Engine	Mode	Notes
Off	—	Default; voice in only
Windows SAPI	Local	Uses installed Win11 language packs (e.g., the Korean neural voice ships free at OS level)
OpenAI tts-1	Hosted	Voices: alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse

Source: Project/ZeroCommon/Voice/ITextToSpeech.cs

The "Win11 ships TTS free at the OS level — not ElevenLabs-grade but usable" observation already came up in the @Agredo10 reply chain; this is that observation built into a tab.

Mic capture: dual-VAD + pre-roll + mute-while-playing

A voice loop fails in three predictable ways. The capture service was written assuming all three would happen.


public sealed class VoiceCaptureService : IDisposable
{
    public event EventHandler<float>? AmplitudeChanged;          // ~50 ms ticks → UI level meter
    public event EventHandler<bool>? SpeakingStateChanged;        // VAD frame state
    public event EventHandler? UtteranceStarted;
    public event EventHandler<byte[]>? UtteranceEnded;            // PCM 16k mono → STT
    public bool Muted { get; set; }                               // suppress while TTS plays
}

Source: Project/AgentZeroWpf/Services/Voice/VoiceCaptureService.cs

Frame-level VAD (~50 ms) — for the live amplitude meter the user sees in the Voice Test panel.

Utterance-level VAD (~2 s of trailing silence) — the actual segment boundary that triggers STT. Without this, every breath becomes a transcription request.

Pre-roll buffer (~1 s) — captured before VAD trips, so the first consonant of a sentence isn't clipped off.

Muted flag — set true while VoicePlaybackService is playing TTS audio. Without this the speaker output gets re-captured by the mic and the loop talks to itself.

These are not ML problems. They're plumbing. Getting them wrong makes the voice feature feel "broken in a way nobody can articulate."
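To make the plumbing concrete, here is a sketch of how utterance segmentation, pre-roll, and the mute flag can fit together. It assumes a simple RMS energy threshold and 50 ms frames. The source does not say how `VoiceCaptureService` detects speech, so the class name, the threshold, and the frame counts are all illustrative.

```csharp
using System;
using System.Collections.Generic;

public sealed class UtteranceSegmenterSketch
{
    private readonly float _threshold;
    private readonly int _preRollFrames;
    private readonly int _silenceFramesToEnd;
    private readonly Queue<byte[]> _preRoll = new();
    private readonly List<byte> _utterance = new();
    private int _silentRun;
    private bool _speaking;

    public bool Muted { get; set; }
    public event Action<byte[]>? UtteranceEnded;

    // Defaults assume 50 ms frames: 20 frames is about 1 s of pre-roll, 40 about 2 s of silence.
    public UtteranceSegmenterSketch(float threshold = 0.02f, int preRollFrames = 20, int silenceFramesToEnd = 40)
    {
        _threshold = threshold;
        _preRollFrames = preRollFrames;
        _silenceFramesToEnd = silenceFramesToEnd;
    }

    // RMS amplitude of a PCM16 frame, normalised to the range 0..1.
    public static float Rms(byte[] frame)
    {
        int samples = frame.Length / 2;
        if (samples == 0) return 0f;
        double sum = 0;
        for (int i = 0; i < samples; i++)
        {
            double s = BitConverter.ToInt16(frame, i * 2) / 32768.0;
            sum += s * s;
        }
        return (float)Math.Sqrt(sum / samples);
    }

    public void PushFrame(byte[] frame)
    {
        if (Muted) { Reset(); return; }        // TTS is playing: ignore the mic entirely
        bool loud = Rms(frame) >= _threshold;

        if (!_speaking)
        {
            if (!loud)
            {
                // Keep a rolling window of recent quiet frames as pre-roll.
                _preRoll.Enqueue(frame);
                while (_preRoll.Count > _preRollFrames) _preRoll.Dequeue();
                return;
            }
            _speaking = true;
            foreach (var f in _preRoll) _utterance.AddRange(f);
            _preRoll.Clear();
        }

        _utterance.AddRange(frame);
        _silentRun = loud ? 0 : _silentRun + 1;
        if (_silentRun >= _silenceFramesToEnd)
        {
            var pcm = _utterance.ToArray();    // includes pre-roll and trailing silence
            Reset();
            UtteranceEnded?.Invoke(pcm);
        }
    }

    private void Reset()
    {
        _speaking = false;
        _silentRun = 0;
        _utterance.Clear();
        _preRoll.Clear();
    }
}
```

A short pause resets nothing in this sketch. Only an unbroken run of quiet frames ends the utterance, which is what keeps a breath from becoming its own transcription request.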

Source: VoicePlaybackService.cs (auto-detects WAV / MP3 / PCM16@24kHz; patches the OpenAI gpt-4o-audio-preview RIFF header)
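Format auto-detection of this kind usually comes down to inspecting the first few bytes. The sketch below shows one plausible approach. It is not taken from `VoicePlaybackService`: the `AudioSniffer` name and the fallback rule are assumptions, and the RIFF header patch is not covered.

```csharp
public enum AudioFormatSketch { Wav, Mp3, RawPcm16 }

public static class AudioSniffer
{
    // True when the ASCII tag appears at the given byte offset.
    private static bool HasAscii(byte[] data, int offset, string tag)
    {
        if (data.Length < offset + tag.Length) return false;
        for (int i = 0; i < tag.Length; i++)
            if (data[offset + i] != (byte)tag[i]) return false;
        return true;
    }

    public static AudioFormatSketch Detect(byte[] data)
    {
        // WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
        if (HasAscii(data, 0, "RIFF") && HasAscii(data, 8, "WAVE"))
            return AudioFormatSketch.Wav;

        // MP3: an ID3v2 tag, or an MPEG frame sync (eleven set bits).
        if (HasAscii(data, 0, "ID3"))
            return AudioFormatSketch.Mp3;
        if (data.Length >= 2 && data[0] == 0xFF && (data[1] & 0xE0) == 0xE0)
            return AudioFormatSketch.Mp3;

        // Assumed fallback: headerless PCM16 (24 kHz per the source note).
        return AudioFormatSketch.RawPcm16;
    }
}
```

One known weakness of this heuristic: raw PCM whose first sample happens to be a small negative value (for example the bytes `FF FF`) looks like an MPEG frame sync. A production detector would want the content type from the HTTP response as a second signal.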

What this expansion didn't change

This is as important as what it added.

No new actor was introduced. Voice still talks to the same AgentReactorActor topology described in the foundation report. STT/TTS are services, not actors.

No new gateway. Voice reuses LlmGateway.OpenSession(). Whatever the LLM tab decides answers the voice — local Gemma, OpenAI, anything.

No chunked streaming yet. All STT/TTS calls take a complete utterance. Streaming is a known follow-up; the absence is intentional, not an oversight, because chunked Whisper plus chunked TTS across every engine is its own multi-week problem.

No CUDA in the installer. Whisper.net's CUDA runtime is real and works, but bundling it doubled installer size for a feature most users will run on small models. CPU-only first; GPU later if the data says it matters.

The shared design principle, restated

Expansion	Interface that absorbed it	Implementations behind it
Multi-provider LLM	`ILlmProvider`	LLamaSharp local, `OpenAiCompatibleProvider` × 4
Voice in	`ISpeechToText`	Whisper.net, OpenAI Whisper, Webnori-Gemma, (LocalGemma placeholder)
Voice out	`ITextToSpeech`	Off, Windows SAPI, OpenAI tts-1
(Established earlier) Tool host	`IAgentToolHost`	`WorkspaceTerminalToolHost`
(Established earlier) AIMODE backend	`AgentToolLoop` / `ExternalAgentToolLoop`	Gemma-GBNF, Nemotron-template

The agent doesn't grow by adding agents. It grows by adding implementations behind already-named interfaces. That is what made today's expansion a 2,000-line pull rather than a 20,000-line one.

Why the free-OS-voice angle is load-bearing

Win11 ships a Korean neural TTS voice at the OS level, for free, as part of the language pack; the same holds for English, Japanese, and most major languages. This rarely appears in marketing decks, but for a desktop agent that wants to say something out loud, it removes a line item of up to $22/month that competitors silently charge for.

Combine that with what's already free on each side:

Slot	Free default	Cost in dollars per month
STT	Whisper.net (`ggml-tiny`/`small`/`medium`, local CPU)	$0
LLM (local)	Local Gemma 4 / Local Nemotron	$0 (your CPU/GPU)
LLM (hosted)	Webnori free tier	$0
TTS (local)	Windows SAPI (Win11 neural voice)	$0
TTS (hosted)	OpenAI tts-1	paid

The default loop end-to-end — speak Korean → transcribe → local model thinks → answer in Korean voice — costs $0 in API fees. Paid providers slot in only when the user opts up. That is not a downstream consequence; it is the design.

Most "voice AI" desktop products in this category bury this fact because their billing depends on you not noticing. AgentZero Lite leans into it: the Windows OS is a free voice asset for anyone targeting Windows, and we use it.

Situational use cases this stack unlocks

The architecture didn't grow voice for novelty. Most of the high-value desktop-agent scenarios are voice-first.

Scenario	Why voice changes it	What's already in the box
Meeting note-taker	Bot-free meeting capture is what Otter / Granola / Fireflies charge $20–30/mo for. Meetily already proved a self-hosted Whisper + LLM-summary stack is competitive.	STT × 4, LLM × 6 incl. local — recorder + summarizer is a feature week, not a rebuild
Voice journaling	Speaking is faster than typing; LLM cleans up filler and commits a structured day note	Local Gemma + Win11 SAPI = $0 loop
Hands-busy healthcare tracking	Cooking / exercise / caretaking — typing isn't viable. Mirrors the @graylanj edge-LLM pattern (Gemma + dental scanner + pill bottle scanner) but on Windows desktop	Local-only path covers privacy; speech-in / speech-out covers ergonomics
Hands-free CLI	Talk to the Claude/Codex tabs while watching the screen	Already wired — Voice tab + AIMODE handshake meet at `LlmGateway`
Language practice	Voice cloning + pronunciation feedback	XTTS-v2 slots in as a new `ITextToSpeech`

The meeting-note one is worth pausing on. People pay real money today for transcribe + summarize + share over their own meetings. The components AgentZero Lite already has — bot-free local recording (NAudio), STT × 4, an LLM gateway with multiple providers, and a discipline harness for prompt engineering — are exactly what Meetily bundles. The stack hasn't shipped that feature, but the lego pieces are all already in the box.

What's coming from the open-source side

The voice-model space reshaped fast in early 2026. Most of these slot into our existing ISpeechToText / ITextToSpeech interfaces with one new class each.

STT (open / efficient, 2026 picks)

Model	License	Why it's interesting
Whisper Large V3 Turbo	MIT (OpenAI)	216× real-time, multilingual; current accuracy/speed balance pick
Distil-Whisper	MIT (Hugging Face)	6× faster than Whisper Large V3, WER within ~1% — ideal for streaming
NVIDIA Parakeet TDT 1.1B	NVIDIA OS	RTFx > 2,000 (fastest available), English-only
Faster-Whisper	MIT	CTranslate2 reimpl, ~4× faster, drop-in for Whisper.cpp
WhisperX	BSD-2	Word-level timestamps + speaker diarization → unlocks "who said what" in meetings
Moonshine / Canary-Qwen / Qwen3-ASR	OS	New generation already matching commercial APIs on standard benchmarks

TTS (open / efficient, 2026 picks)

Model	License	Why it's interesting
Kokoro	Apache-2.0	82M params, ElevenLabs-comparable quality, free for commercial use
Coqui XTTS-v2	Coqui non-commercial	6-second voice cloning, 17 languages — note license restriction
Coqui TTS toolkit	MPL-2.0	Pre-trained models covering 1,100+ languages

Microsoft's own move (April 2026)

On 2026-04-02 Microsoft surfaced MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Foundry. The voice-relevant pieces:

MAI-Voice-1 generates 60 seconds of audio in 1 second (60× real-time).

Microsoft's embedded-speech path lets newer Win11 boxes (24H2 + DX12 + 8 GB RAM) run TTS locally — exactly the same OS-integration story SAPI already gives us, but neural-HD-grade.

2026-03 Foundry release notes mention Neural HD TTS with MAI-Voice-1 integration.

For our architecture this is a known-future ITextToSpeech impl — the day MAI-Voice-1 ships a public Win11 local API, it slots in next to WindowsTts and the rest of the user's loop doesn't have to learn anything new.

What plugs in where


flowchart LR
  subgraph existing[Existing slots]
    S[ISpeechToText]
    T[ITextToSpeech]
  end
  subgraph stt_open[Open STT models]
    Pk[Parakeet TDT 1.1B]
    DW[Distil-Whisper]
    WT[Whisper Large V3 Turbo]
    FW[Faster-Whisper]
    WX[WhisperX<br/>+ diarization]
  end
  subgraph tts_open[Open / future TTS]
    Ko[Kokoro 82M]
    XT[XTTS-v2]
    Mai[MAI-Voice-1<br/>via Foundry Local]
  end
  Pk --> S
  DW --> S
  WT --> S
  FW --> S
  WX --> S
  Ko --> T
  XT --> T
  Mai --> T

The point isn't "we'll add all of these tomorrow." The point is the architecture is ready to absorb whichever ones the user actually wants — the same way the LLM expansion absorbed five providers behind one interface. Each model added is a new file, not a new shell.

Closing — what the next slot is for

Two surfaces opened. The next obvious additions slot in without changing the shell:

Local Gemma audio model when the catalog ships one — fills the LocalGemmaStt placeholder, gives a fully-offline voice loop.

A streaming provider (e.g., Anthropic's streaming API) — adds nothing structural, just a new ILlmProvider impl.

A new voice provider — same.

The agent earned its name once, in the foundation report. Each chapter after that is the same shape played in a new key.

TECH LINKS

𝕏 @webnori

📘 Akka Labs (Facebook group)

📝 More writings — webnori wiki