The Realtime Voice Mountain — Step 1 Embedding an Open vLLM, Step 2 Akka Streams, and Why ElevenLabs Stays Plugged In

TL;DR — On real-time streaming voice, ElevenLabs is the bar. Sub-second latency, near-natural prosody, actual real-time. AgentZero Lite today doesn't chase that — we VAD-batch utterances before sending to STT and accept the latency hit, which lets us treat voice behind a clean byte[] PCM → string interface. But to feel like the assistant we want — and to do it without a Realtime API in the loop — we need to climb. Adopting ElevenLabs Realtime or OpenAI Realtime is genuinely cheap on the implementation side: they already ship the streaming protocol, sentence-boundary handling, and interrupt semantics. The mountain only exists when you stitch a local open STT model and a local open TTS model into something Realtime-API-like yourself, because none of the open pieces ship that orchestration layer for you. Step 1: prove that an open voice-LLM (vLLM) embeds end-to-end on consumer hardware. Step 2: an Akka.NET Streams Graph that handles mid-stream long-sentence speech, request-unit switching, and mid-stream LLM function calls for sentence-boundary detection. The goal isn't to beat ElevenLabs Realtime or OpenAI Realtime — it's to reach a minimum usable approximation of that experience on a local-only stack. Step 2 may end up harder than the multi-CLI core itself. So Realtime APIs stay plugged in throughout — I'm a paying user; this isn't a cost argument, it's an ownership-and-architecture one. The road for fully open local realtime voice is long.

This is a follow-up to Voice-Driven Multi-Terminal Control — How a Local Gemma 4 Function-Caller Pilots Claude, Codex and Anything Else with a Prompt, and indirectly a response to a sharp X critique that landed yesterday: the OpenAI-compatible interface holds for text but leaks for voice. That critique is correct, and it's the doorway to this whole conversation.

Where realtime streaming voice actually is in 2026

Before talking architecture, calibrate the bar.

Property	ElevenLabs (today)	Best open / local stack (today)
First-token-to-audio latency	~250–600 ms	1–3 s common, sub-1 s with heavy GPU
Streaming chunk semantics	Documented, stable	Differs per backend (chars vs ms boundaries, finalization signaling)
Voice naturalness	Indistinguishable in many domains	Noticeably synthetic in long sentences
Voice library	Hundreds, multi-lingual	Tens, mostly English-leaning
Mid-sentence interruption	First-class	Mostly add-on / post-hoc
Cost	$11–$330/month	Hardware + electricity

ElevenLabs is the gold standard. The amount of polish their team has put into the streaming protocol alone is the kind of investment a side project can't replicate, and shouldn't pretend to.

This article is not "we're going to outdo ElevenLabs." It's "what does the open path look like, honestly, and where does it meet the premium path."

Why even attempt the open path

Three reasons, none of them about price:

Local control of the audio modality. The same reason we ship an on-device Gemma 4 for the function-call agent applies here — voice data is sensitive, and a captain's bridge (per the previous chapter) shouldn't have to leave the machine for every utterance.

Architectural ownership of the streaming protocol. Bek's critique on X named it precisely: behind a unified OpenAI-compatible HTTP endpoint, the streaming chunk semantics still differ per backend — chars vs ms boundaries, finalization shape, interim vs final signaling. If we own at least one path end-to-end, we get to shape that protocol seam intentionally instead of papering over it.

The captain-can-interrupt problem. Real conversational voice means the user can cut in mid-output. That requires the TTS layer to expose cancellation primitives that match how our Akka actor mesh talks. ElevenLabs' API has good interrupt support; doing it ourselves teaches us exactly how that protocol should look.

And to be explicit about my position: I'm a paying ElevenLabs user, not on a free tier. This article isn't framed by cost. It's framed by curiosity about whether the open path is good enough for the experience we want, and what the actual gap is.

Today's design — VAD utterance batching, not streaming

To set context: AgentZero Lite does not currently do streaming STT or streaming TTS. We deliberately punted that whole layer. Quick recap of what exists in main:


flowchart LR
  Mic[🎙 NAudio<br/>16 kHz mono PCM]
  Mic --> VAD[Frame VAD + Utterance VAD<br/>40 silent frames = 2 s threshold]
  VAD -->|UtteranceEnded| Drain["ConsumePcmBuffer to byte array"]
  Drain --> STT{ISpeechToText<br/>factory}
  STT --> Whisper[Whisper.net]
  STT --> OAI[OpenAI Whisper-1]
  STT --> Webnori[Webnori-Gemma audio]
  STT --> LocalG[Local Gemma audio]
  Whisper --> Txt[full transcript]
  OAI --> Txt
  Webnori --> Txt
  LocalG --> Txt
  Txt --> Bot[AgentBotActor.StartReactor]

  classDef now fill:#1e3a3a,stroke:#4EC9B0,color:#fff
  class Drain,Txt,Bot now

The abstraction is byte[] PCM → string transcript per utterance. Backend chunk semantics never escape into our consumer code, because we never operate on chunks — we operate on completed utterances. This is why our ISpeechToText factory stays clean even as we add or swap engines: the leak Bek named exists at the streaming layer, and we side-step that layer entirely.
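
To make the batch seam concrete, here is a minimal sketch of what a byte[] PCM → string interface could look like. Only the name ISpeechToText comes from the project; the TranscribeAsync signature, the FakeStt stand-in, and UtteranceHandler are assumptions made for illustration, not the actual AgentZero Lite code.

```csharp
using System;
using System.Globalization;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape of the batch seam: one completed utterance in, one transcript out.
public interface ISpeechToText
{
    Task<string> TranscribeAsync(byte[] pcm16Mono, CancellationToken ct = default);
}

// Stand-in backend for illustration only; a real adapter would wrap Whisper.net, OpenAI, etc.
public sealed class FakeStt : ISpeechToText
{
    public Task<string> TranscribeAsync(byte[] pcm16Mono, CancellationToken ct = default)
    {
        // 16 kHz mono, 16-bit samples: 32,000 bytes per second of audio.
        double seconds = pcm16Mono.Length / 32000.0;
        string label = seconds.ToString("0.0", CultureInfo.InvariantCulture);
        return Task.FromResult($"[utterance of {label} s]");
    }
}

public static class UtteranceHandler
{
    // Consumer code sees only the interface, never a backend's chunk semantics.
    public static Task<string> HandleAsync(ISpeechToText stt, byte[] utterance)
        => stt.TranscribeAsync(utterance);
}
```

Because the consumer only ever awaits a finished string, swapping FakeStt for any other backend changes nothing on the calling side, which is the property the paragraph above describes.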

The cost: ~600 ms to ~2 s added latency between when a user stops speaking and when the reactor starts thinking. For the captain-bridge UX (where the user just gave a multi-clause command and is now watching terminals), that's tolerable. For real conversation, it's not.
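
The utterance-VAD step in the diagram can be sketched as below. The "40 silent frames = 2 s" figure comes from the diagram and implies 50 ms frames; the RMS energy test, its threshold of 500, and the UtteranceBatcher name are assumptions for illustration, and a real frame VAD is usually more sophisticated than a fixed energy gate.

```csharp
using System;
using System.Collections.Generic;

// Sketch of utterance-level VAD batching over 16 kHz mono 16-bit PCM frames.
public sealed class UtteranceBatcher
{
    public const int SampleRate = 16000;
    public const int FrameMs = 50;                                   // 40 frames * 50 ms = 2 s
    public const int FrameBytes = SampleRate * FrameMs / 1000 * 2;   // 1600 bytes per frame
    public const int SilentFramesToEnd = 40;

    private readonly double _rmsThreshold;   // assumed energy gate, tune per microphone
    private readonly List<byte> _buffer = new List<byte>();
    private int _silentRun;
    private bool _heardSpeech;

    public UtteranceBatcher(double rmsThreshold = 500) => _rmsThreshold = rmsThreshold;

    // Feed one frame. Returns true and the buffered utterance once 40 silent frames follow speech.
    public bool TryPush(byte[] frame, out byte[] utterance)
    {
        utterance = Array.Empty<byte>();
        if (Rms(frame) >= _rmsThreshold) { _heardSpeech = true; _silentRun = 0; }
        else if (_heardSpeech) _silentRun++;

        if (!_heardSpeech) return false;          // leading silence is not buffered
        _buffer.AddRange(frame);
        if (_silentRun < SilentFramesToEnd) return false;

        utterance = _buffer.ToArray();            // the "UtteranceEnded" moment
        _buffer.Clear(); _silentRun = 0; _heardSpeech = false;
        return true;
    }

    private static double Rms(byte[] frame)
    {
        int n = frame.Length / 2;
        if (n == 0) return 0;
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            short s = BitConverter.ToInt16(frame, i * 2);
            sum += (double)s * s;
        }
        return Math.Sqrt(sum / n);
    }
}
```

In this sketch the silence threshold is itself part of the latency budget: nothing is handed to STT until the trailing silent frames have elapsed.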

This is the honest baseline. To go further, we have to climb.

Step 1 — Embedding an open vLLM end-to-end

Goal: prove that a single open voice-LLM (vLLM) — meaning a model that does end-to-end speech-to-speech, or at least streaming TTS with prosody — runs inside our process on consumer hardware, with acceptable latency and quality.

Candidate models (2026 landscape):

Microsoft MAI-Voice-1 via Foundry Local — interesting because Foundry Local already ships in Windows; Windows ML integration path looks clean. (Hat tip to @nkanauzu for surfacing this on X.)

Kokoro — small footprint, surprisingly good prosody, ONNX path available.

XTTS-v2 — multi-lingual, but heavier and license is restrictive.

Parakeet TDT — primarily STT side but worth probing for the streaming protocol design.

Step 1 success criteria:

The model loads and synthesizes inside Project/AgentZeroWpf without an external server process.

First-audio-out latency is bounded — say <1.5 s on a non-GPU developer machine.

The streaming output exposes a chunk interface we can plug behind an ITtsStream abstraction without leaking backend-specific chunk semantics into consumer code.

That third criterion is where Bek's critique becomes real engineering work: we have to design ITtsStream such that the chunk-semantics leak doesn't reach the consumer. Likely shape:


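using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

interface ITtsStream : IAsyncDisposable {   // DisposeAsync() is the explicit cancellation path
  IAsyncEnumerable<AudioChunk> Synthesize(string text, CancellationToken ct);
  Task SignalSentenceBoundary();        // optional, for mid-stream coordination
}

// IsFinal + Position are the backend-neutral normalization every adapter must produce.
record AudioChunk(byte[] Pcm16, bool IsFinal, TimeSpan Position);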

The IsFinal + Position pair is the normalization. Per-backend adapters absorb whether the upstream uses ms or char boundaries and translate to our shape. This is unglamorous adapter work, but it's the price of a clean seam.
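
One way such an adapter could normalize is sketched below. It assumes a hypothetical backend that yields raw 16-bit mono PCM arrays and may send empty arrays as end markers. Position is derived from bytes already emitted rather than from any backend-specific metadata, and IsFinal is set by holding one chunk back. ChunkNormalizer and its parameters are illustrative names, not project code.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public record AudioChunk(byte[] Pcm16, bool IsFinal, TimeSpan Position);

public static class ChunkNormalizer
{
    public static async IAsyncEnumerable<AudioChunk> Normalize(
        IAsyncEnumerable<byte[]> backendChunks,
        int sampleRate,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        long bytesBefore = 0;                  // audio emitted before the held-back chunk
        byte[] held = Array.Empty<byte>();     // one-chunk lookahead so the last chunk gets IsFinal
        bool hasHeld = false;

        await foreach (byte[] pcm in backendChunks.WithCancellation(ct))
        {
            if (pcm.Length == 0) continue;     // assumed: empty chunks are end markers, no audio
            if (hasHeld)
            {
                yield return new AudioChunk(held, false, Offset(bytesBefore, sampleRate));
                bytesBefore += held.Length;
            }
            held = pcm;
            hasHeld = true;
        }
        if (hasHeld)
            yield return new AudioChunk(held, true, Offset(bytesBefore, sampleRate));
    }

    // 16-bit mono: 2 bytes per sample.
    private static TimeSpan Offset(long bytes, int sampleRate)
        => TimeSpan.FromSeconds(bytes / 2.0 / sampleRate);
}
```

The lookahead costs one chunk of extra latency. A backend that signals finality explicitly would let its adapter skip the hold-back, which is exactly the kind of difference the adapter layer is meant to absorb.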

If Step 1 ships and works, we have a usable streaming TTS path. Step 2 is what we do with the stream.

Step 2 — Akka.NET Streams Graph for mid-stream long-sentence speech

A quick word on Akka Streams "working with Graph." Akka.Streams (the .NET port of Akka Streams, which implements the Reactive Streams specification) gives you two layers. The default DSL composes linear pipelines such as Source.Single(x).Via(flow).RunWith(sink, materializer). But for anything non-linear (fan-in, fan-out, cycles, dynamic routing), you graduate to the Graph DSL: you wire sources, flows, sinks, and junctions (including custom GraphStage operators) into a topology like a circuit diagram, and the materializer turns it into a back-pressured runnable graph. Voice pipelines are non-linear by nature (a token stream fanning into a sentence-detection branch that fans back into TTS sinks, with cancellation cycles), so the Graph layer is exactly the affordance we need.

Here's the simplest non-linear example, the kind of topology you would wire in the Graph DSL, drawn as a diagram:


flowchart LR
  S[Source<br/>produces items] --> B((Broadcast))
  B --> FA[Flow A<br/>transform]
  B --> FB[Flow B<br/>transform]
  FA --> M((Merge))
  FB --> M
  M --> Snk[Sink<br/>consumes]

  classDef source fill:#1e3a5f,stroke:#3794FF,color:#fff
  class S source
  classDef flow fill:#3a1e5f,stroke:#C586C0,color:#fff
  class FA,FB flow
  classDef junction fill:#5f3a1e,stroke:#FFA500,color:#fff
  class B,M junction
  classDef sink fill:#1e3a3a,stroke:#4EC9B0,color:#fff
  class Snk sink

Three things to read off this picture:

Source / Flow / Sink are the primitives. Every linear Akka Streams pipeline starts with one Source and ends with one Sink; Flows are the in-between transforms. The linear DSL stops here.

Broadcast and Merge are junctions. Broadcast duplicates each upstream item across N downstream branches; Merge interleaves N upstream branches into one downstream. Junctions are how you bend a pipeline into a non-linear shape.

Back-pressure flows backwards through everything. If the Sink is slow, Merge slows down, Flow A / Flow B slow, Broadcast slows, Source slows. You don't write any of that — the materializer wires it for you.
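
For readers who want to see the picture as code, the sketch below wires the same Source → Broadcast → Flow A / Flow B → Merge → Sink topology. It follows the builder pattern shown in the Akka.NET "Working with Graphs" documentation, but it has not been compiled against a specific Akka.Streams version, so treat it as a starting point. The flow bodies and the BroadcastMergeDemo wrapper are assumptions made for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Akka;
using Akka.Actor;
using Akka.Streams;
using Akka.Streams.Dsl;

public static class BroadcastMergeDemo
{
    public static async Task<List<int>> RunAsync()
    {
        var system = ActorSystem.Create("graph-demo");
        try
        {
            var materializer = system.Materializer();
            var results = new List<int>();
            var done = new TaskCompletionSource<bool>();

            var graph = RunnableGraph.FromGraph(GraphDsl.Create(builder =>
            {
                var source = Source.From(Enumerable.Range(1, 3));
                var flowA = Flow.Create<int>().Select(x => x * 10);   // Flow A
                var flowB = Flow.Create<int>().Select(x => x + 1);    // Flow B
                // The sink's own materialized value is dropped so every part materializes NotUsed.
                var sink = Sink.ForEach<int>(x =>
                {
                    results.Add(x);
                    if (results.Count == 6) done.TrySetResult(true);  // 3 items * 2 branches
                }).MapMaterializedValue(_ => NotUsed.Instance);

                var broadcast = builder.Add(new Broadcast<int>(2));
                var merge = builder.Add(new Merge<int>(2));

                // First branch runs end to end; second branch uses the remaining ports.
                builder.From(source).Via(broadcast).Via(flowA).Via(merge).To(sink);
                builder.From(broadcast).Via(flowB).To(merge);
                return ClosedShape.Instance;
            }));

            graph.Run(materializer);
            await done.Task;
            results.Sort();        // Merge gives no ordering guarantee across branches
            return results;
        }
        finally
        {
            await system.Terminate();
        }
    }
}
```

Nothing in this code mentions demand or buffering: the slow-sink behavior described above comes from the stages themselves.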

That's the trick in one paragraph: you describe the topology declaratively, the runtime gives you a back-pressured, cancellable, parallelizable pipeline. Now apply this lens to voice.

Honest aside, before going further: I've never actually built one of these in production. The JVM Akka community has decades of streaming-graph experience; .NET Akka.Streams shares the lineage but real-world examples in the .NET wild are thinner on the ground. So this leg is genuinely a "first attempt — didn't expect the project to drag me here" moment (light tone intended). Treat what follows as a design sketch I'm publishing partly to invite people who have built one of these to push back.

Two framings before the design. First: this isn't a Realtime-API-killer. ElevenLabs Realtime and OpenAI Realtime already do this, and do it well — if your constraint allows them, adopt them, ship faster. The Akka Streams Graph below only earns its keep when the constraint is local-only (audio cannot leave the machine, or the LLM and TTS must both be open models you control). In that mode, no vendor ships the orchestration glue, so you build it. Second: the goal is approximation, not victory. Realtime-API-quality locally is multi-year work for a side project. A usable approximation — maybe 70–80% of the perceived liveness, with cancellable streaming and reasonable latency — is the realistic Step-2 target. With those framings: this is where it gets harder than the multi-CLI core, in my honest estimate.

The problem: an LLM produces tokens incrementally over seconds. A long answer might take 8–15 seconds to fully generate. A natural conversation can't wait until the LLM is done — the voice must start speaking after the first complete clause and continue as more clauses arrive. But the LLM doesn't emit "I just finished a sentence" markers natively. We have to detect sentence boundaries as the tokens arrive, dispatch each completed sentence to TTS, and handle interrupt / switch / re-route mid-stream without dropping audio frames.
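
Setting Akka aside for a moment, the accumulate-then-emit core of this problem can be sketched with plain async streams. This is a minimal illustration that takes the boundary decision as a pluggable predicate. SentenceStream, Split, and the punctuation-only EndsWithTerminator are hypothetical names, and the predicate here is deliberately naive.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class SentenceStream
{
    // Turns an incremental token stream into a stream of completed sentences.
    public static async IAsyncEnumerable<string> Split(
        IAsyncEnumerable<string> tokens,
        Func<string, bool> isBoundary,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var buffer = new StringBuilder();
        await foreach (string token in tokens.WithCancellation(ct))
        {
            buffer.Append(token);
            string candidate = buffer.ToString();   // fine for a sketch; quadratic on long buffers
            if (!isBoundary(candidate)) continue;   // partial clause: keep accumulating
            yield return candidate.Trim();          // completed sentence goes downstream
            buffer.Clear();
        }
        string tail = buffer.ToString().Trim();
        if (tail.Length > 0) yield return tail;     // flush whatever remains when the LLM stops
    }

    // Naive punctuation-only predicate, used here just to drive the example.
    public static bool EndsWithTerminator(string text)
    {
        string t = text.TrimEnd();
        return t.Length > 0 && ".!?".IndexOf(t[t.Length - 1]) >= 0;
    }
}
```

The first sentence is available to TTS as soon as its terminator token arrives, long before the full answer has been generated, which is the behavior the paragraph above asks for.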

This is a pipeline shape, and Akka.NET Streams (the Reactive Streams implementation in Akka) is exactly the kind of tool the JVM Akka community has used for similar workloads.


flowchart LR
  Src[Source<br/>LLM token stream]
  Src --> Acc[Stage: accumulate<br/>tokens into buffer]
  Acc --> Boundary{Stage:<br/>sentence-boundary<br/>detection}
  Boundary -->|complete sentence| Sentence[Source: completed sentence]
  Boundary -->|partial / continuing| Acc
  Sentence --> Switch[Stage: request-unit<br/>switching / noise filter]
  Switch --> TTS[Sink: ITtsStream<br/>synthesize chunk]
  TTS --> Audio[AudioChunk stream]

  classDef stage fill:#3a1e5f,stroke:#C586C0,color:#fff
  class Acc,Boundary,Switch stage
  classDef io fill:#1e3a5f,stroke:#3794FF,color:#fff
  class Src,TTS io
  classDef out fill:#1e3a3a,stroke:#4EC9B0,color:#fff
  class Sentence,Audio out

Three Step-2 sub-problems, each non-trivial:

2.1 Sentence-boundary detection mid-stream — via LLM function call

Naive heuristics on punctuation get you 70% of the way. The remaining 30% — abbreviations ("Dr. Smith"), incomplete clauses, code blocks, numbered lists — wreck the experience. Our planned approach: mid-stream LLM function call. Periodically (every N tokens, or every punctuation candidate), the streaming graph emits a function call to a small local model: "is this a complete sentence end, or is the speaker still going?" The function returns a single boolean, and the graph branches accordingly.

This costs latency per check, so the question becomes: how cheap can the boundary check be? Gemma 4 E4B with GBNF-locked output is one candidate; a tiny dedicated classifier is another. Step 2.1 success = boundary check round-trip < 100 ms on consumer CPU.
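
A two-tier check is one way to keep that cost down: cheap rules reject obvious non-boundaries for free, and only real punctuation candidates are sent to the model, with a hard time budget. The sketch below assumes the model call is injected as a delegate. BoundaryChecker, the abbreviation list, and the policy of trusting the punctuation when the budget expires are all illustrative choices, not the project's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class BoundaryChecker
{
    // Small illustrative list; a real one would be longer and locale-aware.
    private static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Dr.", "Mr.", "Mrs.", "Ms.", "e.g.", "i.e.", "vs." };

    private readonly Func<string, CancellationToken, Task<bool>> _askModel;   // hypothetical model call
    private readonly TimeSpan _budget;

    public BoundaryChecker(Func<string, CancellationToken, Task<bool>> askModel, TimeSpan budget)
    {
        _askModel = askModel;
        _budget = budget;
    }

    public async Task<bool> IsSentenceEndAsync(string buffer)
    {
        string text = buffer.TrimEnd();
        if (text.Length == 0) return false;

        // Tier 1: no terminal punctuation means no candidate, so no model call.
        char last = text[text.Length - 1];
        if (last != '.' && last != '!' && last != '?') return false;

        // Tier 1: a known abbreviation at the end ("Dr.") is not a boundary.
        string lastWord = text.Substring(text.LastIndexOf(' ') + 1);
        if (Abbreviations.Contains(lastWord)) return false;

        // Tier 2: ask the model, but never wait longer than the budget.
        using var cts = new CancellationTokenSource();
        Task<bool> verdict = _askModel(text, cts.Token);
        Task first = await Task.WhenAny(verdict, Task.Delay(_budget));
        if (first != verdict)
        {
            cts.Cancel();      // over budget: abandon the call
            return true;       // assumed fallback: trust the punctuation
        }
        return await verdict;
    }
}
```

With a budget of TimeSpan.FromMilliseconds(100), the worst-case added latency per candidate matches the Step 2.1 success criterion, whatever the model's actual speed turns out to be.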

2.2 Request-unit switching (noise filter)

Different sentences in the same response may want different voice profiles — system messages in a calmer voice, code excerpts spelled differently than prose, error messages emphasized. The Akka Streams stage between sentence detection and TTS sink is where we plug a router that picks the right TTS profile per chunk.

This also doubles as a noise filter: sentences that are pure code or log spam are skipped and never reach TTS at all. The captain's bridge doesn't need to hear a stack-trace line like "at System.Threading..." read aloud.
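
A router stage of this kind can start as a pure function from sentence to routing decision. In the sketch below, the profile names, the "[system]" tag convention, the stack-frame pattern, and the 15% symbol-density threshold are all assumptions chosen to make the idea concrete; none of them come from the project.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public enum VoiceProfile { Default, Calm, Emphasized }

// Speak == false means the sentence is filtered out and never sent to TTS.
public record RoutedSentence(string Text, VoiceProfile Profile, bool Speak);

public static class RequestUnitRouter
{
    private const string SystemTag = "[system]";   // hypothetical marker for system messages
    private static readonly Regex StackFrame = new Regex(@"^at\s+[A-Za-z_]\w*(\.\w+)+");

    public static RoutedSentence Route(string sentence)
    {
        string s = sentence.Trim();
        var profile = VoiceProfile.Default;

        if (s.StartsWith(SystemTag, StringComparison.OrdinalIgnoreCase))
        {
            profile = VoiceProfile.Calm;                     // system messages in a calmer voice
            s = s.Substring(SystemTag.Length).Trim();
        }
        else if (s.StartsWith("Error", StringComparison.OrdinalIgnoreCase))
        {
            profile = VoiceProfile.Emphasized;               // error messages emphasized
        }

        bool speak = s.Length > 0 && !IsNoise(s);
        return new RoutedSentence(s, profile, speak);
    }

    private static bool IsNoise(string s)
    {
        if (StackFrame.IsMatch(s)) return true;              // "at Namespace.Type.Method(...)"
        // Code and log lines are dense in symbols that prose rarely uses.
        int symbols = s.Count(c => !char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c) && ".,!?'-:".IndexOf(c) < 0);
        return symbols > s.Length * 0.15;
    }
}
```

Keeping the decision as a pure function means it can sit inside a Select-style stage in whatever streaming layer ends up hosting it, and it can be unit-tested without any audio at all.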

2.3 Cancellation and back-pressure

The downstream TTS engine produces audio at real-time speed (1 s of audio per 1 s of synthesis). The upstream LLM produces tokens faster. So back-pressure flows backwards through the graph: when the TTS sink is full, completed sentences wait in a bounded buffer and upstream stages stop being asked for more; when LLM tokens stop arriving, the sink drains gracefully without clipping. Akka Streams gives us this for free as long as every stage is a standard back-pressure-aware operator and no stage hides an unbounded buffer.
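
The same principle can be seen in miniature without Akka. The sketch below swaps in a bounded System.Threading.Channels channel (not Akka Streams) purely to illustrate the mechanism: a fast producer stands in for the LLM side, a paced consumer stands in for real-time playback, and the producer is suspended whenever the buffer is full. BackPressureDemo, the capacity, and the 5 ms pacing are arbitrary illustration values.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class BackPressureDemo
{
    public static async Task<(List<string> Played, int MaxAhead)> RunAsync(int capacity, int sentences)
    {
        var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait,   // the producer waits instead of dropping
            SingleReader = true,
            SingleWriter = true
        });
        int produced = 0, consumed = 0, maxAhead = 0;

        // Fast side: stands in for sentences coming out of the LLM.
        Task producer = Task.Run(async () =>
        {
            for (int i = 0; i < sentences; i++)
            {
                await channel.Writer.WriteAsync($"sentence {i}");   // suspends while the buffer is full
                int ahead = Interlocked.Increment(ref produced) - Volatile.Read(ref consumed);
                if (ahead > maxAhead) maxAhead = ahead;
            }
            channel.Writer.Complete();
        });

        // Slow side: stands in for TTS playback at real-time speed.
        var played = new List<string>();
        await foreach (string s in channel.Reader.ReadAllAsync())
        {
            Interlocked.Increment(ref consumed);
            await Task.Delay(5);
            played.Add(s);
        }
        await producer;
        return (played, maxAhead);
    }
}
```

However fast the producer is, it can never run more than the buffer capacity plus the one item being played ahead of the consumer, and nothing is dropped or reordered. An Akka Streams graph provides the same guarantee across every stage rather than at a single hand-placed queue.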

Cancellation is the harder half. When the user interrupts ("stop talking, send Ctrl+C"), the graph must:

Cancel the LLM token source.

Drain or discard buffered sentences depending on policy.

Stop the in-flight TTS synthesize call cleanly.

Release the audio device without click artifacts.

ElevenLabs' streaming API handles step 3 well, partly because they've been polishing it for years. Doing this for an open backend is real work.
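
The four steps above can be sketched as one interrupt routine. This is a simplified illustration using CancellationTokenSource, not the planned Akka implementation. SpeechSession, the BufferPolicy values, and the linear fade-out used to avoid clicks are assumptions; under the Drain policy the sketch only stops generation and leaves the already-buffered sentences to be spoken.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public enum BufferPolicy { Discard, Drain }

public sealed class SpeechSession : IDisposable
{
    private readonly CancellationTokenSource _llm = new CancellationTokenSource();
    private readonly CancellationTokenSource _tts = new CancellationTokenSource();

    public Queue<string> PendingSentences { get; } = new Queue<string>();
    public CancellationToken LlmToken => _llm.Token;   // hand this to the LLM token source
    public CancellationToken TtsToken => _tts.Token;   // hand this to ITtsStream.Synthesize

    // Returns the short faded tail to play before the audio device is released.
    public short[] Interrupt(BufferPolicy policy, short[] audioStillQueued, int fadeSamples)
    {
        _llm.Cancel();                                     // 1. stop generating tokens
        if (policy == BufferPolicy.Drain)
            return Array.Empty<short>();                   // 2a. let buffered sentences play out

        PendingSentences.Clear();                          // 2b. drop buffered sentences
        _tts.Cancel();                                     // 3. stop the in-flight synthesize call
        return FadeOut(audioStillQueued, fadeSamples);     // 4. ramp to silence instead of a hard cut
    }

    // Linear fade over the first fadeSamples of the remaining audio; the last sample is exactly 0.
    public static short[] FadeOut(short[] samples, int fadeSamples)
    {
        int n = Math.Min(fadeSamples, samples.Length);
        var tail = new short[n];
        for (int i = 0; i < n; i++)
        {
            double gain = 1.0 - (i + 1) / (double)n;
            tail[i] = (short)(samples[i] * gain);
        }
        return tail;
    }

    public void Dispose()
    {
        _llm.Dispose();
        _tts.Dispose();
    }
}
```

A hard stop in the middle of a waveform is what produces the click, so the last step plays a few milliseconds of faded audio (for example 160 samples, which is 10 ms at 16 kHz) before the device is closed.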

Honest assessment — Step 2 may be harder than the core

The multi-CLI text core (everything before this chapter) was hard but well-trodden: actor mesh, ConPTY pipes, GBNF tool-calling, fairly standard async patterns. The Step-2 voice graph touches three things that compound:

Real-time audio constraints — clicks, gaps, glitches are immediately perceptible to humans.

LLM unpredictability — token timing is bursty; sentence boundaries arrive irregularly.

User interrupt semantics — the human can change intent mid-response, and the system must respond gracefully.

Any one of these in isolation is solvable. All three at once, on consumer hardware, with an open model: that's the mountain. I'm explicitly flagging this not as a humble-brag but as an honest scoping signal: expect Step 2 to take more effort than the entire text path did.

Why Realtime APIs stay the default — and what the open work is actually for

Two practical paths, very different difficulty profiles:

Realtime API path (ElevenLabs Realtime / OpenAI Realtime) — the streaming fabric is built into the API. Implementation = client integration, mostly. Recommended unless local-only is a hard constraint.

Local open path (open STT + open TTS + our own orchestration) — the streaming fabric is the work. Step 1 + Step 2 above. Activated only when audio cannot leave the machine, or the user wants total ownership of the modality.

Strategy: dual-path, Realtime-API-as-default, local-open-as-the-fallback-when-constrained.


flowchart LR
  User[User intent]
  User --> Dispatch{TTS dispatch}
  Dispatch -->|default for realtime,<br/>cheap to adopt| EL[Realtime API<br/>ElevenLabs / OpenAI<br/>built-in streaming]
  Dispatch -->|local-only constraint| Open[Local open stack<br/>Step-1 + Step-2<br/>approximation only]
  Dispatch -->|Step 1 not yet ready,<br/>fallback| EL
  EL --> Audio[audio out]
  Open --> Audio

  classDef premium fill:#3a1e5f,stroke:#C586C0,color:#fff
  class EL premium
  classDef open fill:#1e3a3a,stroke:#4EC9B0,color:#fff
  class Open open

Until Step 2 reaches "usable approximation," the Realtime API path (ElevenLabs Realtime, OpenAI Realtime) is the answer for anyone who wants real-time conversation today. That includes me — I pay for ElevenLabs. This isn't a "build everything free" project; it's a "build the open path only when the local-only constraint is actually present, and aim for approximation rather than parity."
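
The dispatch rule in the diagram reduces to a few lines. The sketch below follows the diagram literally. TtsPath, VoiceContext, and TtsDispatch are hypothetical names, and whether a hard local-only constraint should refuse to speak rather than fall back to a Realtime API is a policy question the diagram does not settle.

```csharp
using System;

public enum TtsPath { RealtimeApi, LocalOpenStack }

// LocalOnly: audio must not leave the machine. LocalStackReady: Step 1 (and later Step 2) is usable.
public record VoiceContext(bool LocalOnly, bool LocalStackReady);

public static class TtsDispatch
{
    public static TtsPath Choose(VoiceContext ctx)
    {
        // The local stack is taken only when the constraint asks for it and it is actually ready.
        if (ctx.LocalOnly && ctx.LocalStackReady) return TtsPath.LocalOpenStack;

        // Default path, and also the fallback while Step 1 is not ready (as drawn in the diagram).
        return TtsPath.RealtimeApi;
    }
}
```

A stricter variant would throw, or mute voice output, when LocalOnly is true and the local stack is not ready, so that audio never leaves the machine by accident.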

The same principle applied to the LLM layer earlier in the series: local Gemma 4 is the default for the function-call agent, but the ILlmProvider abstraction also supports OpenAI / Nemotron / Webnori / LM Studio / Ollama. The user picks based on their context. Voice will follow the same pattern.

What this means for users right now

Voice input today: Whisper.net / OpenAI Whisper / Webnori-Gemma / Local Gemma audio, all behind VAD utterance batching. Works, with a ~600 ms to ~2 s latency penalty and no streaming. Stable.

Voice output today: Windows SAPI / OpenAI tts-1, both also batch. Works, no streaming chunk semantics leaking out, no real-time interruption.

Voice output Step 1 (planned): ITtsStream interface + at least one open backend (probably Kokoro or MAI-Voice-1 via Foundry Local) embedded end-to-end. Adds streaming as an option without changing existing batched paths.

Voice output Step 2 (planned, may take a while): Akka Streams Graph for sentence-boundary-aware long-form streaming with cancellation and request-unit switching.

ElevenLabs path: kept as a first-class provider throughout. If real-time matters now, that's the answer.

The long road for fully open local realtime voice

This isn't pessimism — it's calibration. The model side has progressed enormously (Kokoro, MAI-Voice-1, XTTS-v2). The protocol/streaming side is where the gap to ElevenLabs is widest, and it's the side where individual project effort matters more than model size.

For AgentZero Lite specifically, the path is:

Ship Step 1 (embed an open vLLM, prove the seam).

Decide based on Step 1's actual feel whether Step 2's complexity is worth it for our UX.

Keep ElevenLabs plugged in regardless — they earned the bar.

If you have thoughts on which open vLLM to start with, or whether Akka Streams is the right vehicle for the Step-2 graph (vs raw Channel<T> / Reactive Extensions / something else), the comments / X are open. This is the kind of design question that benefits from people who've climbed similar mountains before.

𝕏 @webnori

📘 Akka Labs (Facebook group)

📝 More writings — webnori wiki