
Hybrid Isn't a Choice — Plugging into All Three Giants Is Inevitable; Mixing Open Models In Is the Strategy

Series: AgentZero Lite, Part 12. The previous part, Voice-Driving Claude CLI on a Free Stack, showed what one specific hybrid configuration looks like end-to-end. This part zooms out: why "pick a giant" is no longer a real option, why the open models you mix in are the actual choice, and what architecture lets a single app run all of them at once.


TL;DR

  • You don't pick OpenAI or Anthropic or Google. You plug into all three. Each one wins different turns — coding, voice, multimodal, reasoning, on-device — and your users will route around you the moment your app refuses to follow. Picking one giant is a posture for a research demo, not a product.
  • AgentZero Lite is built for the composition, not for any one provider. The on-device LLM (Gemma 4 today, Nemotron Nano in flight) is the in-shell coordinator; the giants live in their own terminal tabs. The user's recipe — Paid+Free, Paid+Free+Free, Free+Free — is a configuration choice, not an architecture rewrite.
  • What turns a plain LLM into an agent is function calling, not the model. Gemma 4 is a token generator until you constrain its decoding to a JSON tool-call grammar. That grammar is half a page. Once it exists, the same actor stage that drives Claude Code drives Gemma 4 drives Nemotron Nano.
  • Gemma 4 is the inflection point, not the destination. From April 2026 onwards, purpose-specific small models (fine-tuned for terminal log triage, voice command parsing, code-review summarisation) are expected to keep arriving faster. NVIDIA shipped Nemotron 3 Nano Omni 26 days after Gemma 4. The cadence isn't slowing.

What hybrid looks like in this app

Skip this section if you want the architecture. Read it if you want to know what the user actually does.
The AgentZero Lite shell has two halves. On the left, multi-tab ConPTY terminals — one for claude, one for codex, one for pwsh, whatever you need (Part 3 — Tour). On the right, an AgentBot chat panel with three modes: CHT (your typing routes to the active terminal), KEY (raw keystrokes), and AI (AIMODE).
In AI mode, an on-device LLM acts as a "small secretary" — not the brain, the receptionist. You ask in Korean or English; it picks the right terminal AI, sends the message, waits for the reply, brings back a summary. The small model never tries to out-think Claude or Gemini or GPT-5. It owns coordination; the heavy reasoning lives in whichever giant you put in the next tab.
flowchart LR
    U["User"] -->|"ask in KO/EN"| BOT["AgentBot AIMODE"]
    BOT -->|"GBNF JSON tool call"| LL["Local LLM<br>Gemma 4 / Nemotron Nano"]
    LL -->|"send_to_terminal"| RE["AgentReactorActor<br>FSM"]
    RE -->|"ConPTY write + Enter"| TT["Top-tier CLI in tab<br>claude / codex / gemini / aider"]
    TT -->|"reply via -cli bot-chat"| RE
    RE -->|"continuation cycle"| LL
    LL -->|"summary"| BOT
    BOT -->|"text or voice"| U
The on-device path costs nothing per turn after the model download. The top-tier path is whatever you're already paying Anthropic / OpenAI / Google for. The user moves between them by rephrasing the ask, not by switching apps.

The economics — a snapshot, not a trend

It's tempting to anchor on absolute prices. Don't. The numbers below are accurate the day this is published; they may not be accurate the day you read it.
| Provider | Surface | Current price (May 2026) | Source |
|---|---|---|---|
| Anthropic | Claude Max 5× | $100 / month | |
| Anthropic | Claude Max 20× | $200 / month | |
| OpenAI | Realtime API audio out | ≈ $0.24 / minute | |
| OpenAI | Realtime API audio in | ≈ $0.06 / minute | |
| ElevenLabs | Pro voice | $99 / month | |
| ElevenLabs | Scale voice | $330 / month | |
What the snapshot actually tells you is that the axis of pricing is unstable — not that any individual price is high or low. April 2026 alone: Anthropic briefly removed Claude Code from the $20 Pro tier before walking it back; OpenAI cut the Realtime API rate; ElevenLabs introduced Scale-tier rebates. None of those moves were predictable a quarter before they shipped.
If you are running a product team, the problem isn't "is it $200 or $50." The problem is you cannot give your finance team a forecast. A burn-rate plan against "$0.24 per voice minute × unknown user voice minutes per month" is not a plan; it is a hope.
What finance and ops teams will sign off on, every time, is fixed cost:
"We pay $X for the GPU, $Y for the model download, $Z for electricity. The marginal voice minute is free. Top-tier kicks in only on turns flagged for it."
That is what the open path actually buys you. It is not cheaper than top-tier on a single turn. A small model running locally on a developer's box is roughly free per turn but you also amortise the box. The gain is predictability: a fixed-cost floor under a variable-cost ceiling. You let top-tier handle the few high-value turns and absorb the meter; you let on-device handle the volume and never look at the meter again.
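To make the floor-versus-meter point concrete, here is a small arithmetic sketch. The per-minute rates are the ones from the pricing snapshot above. The monthly fixed cost of the local box, the 10% escalation share, and the one-minute-in-plus-one-minute-out billing model are hypothetical placeholders, not figures from AgentZero Lite.

```csharp
using System;

// Metered rates from the pricing snapshot (USD per audio minute).
const decimal AudioOutPerMin = 0.24m;
const decimal AudioInPerMin = 0.06m;

// ASSUMPTION: hypothetical amortised monthly cost of a local GPU box plus electricity.
const decimal LocalFixedPerMonth = 150m;

// Cost if every voice minute goes through the metered API.
// ASSUMPTION: each conversation minute bills one minute in and one minute out.
decimal MeteredCost(decimal minutes) => minutes * (AudioOutPerMin + AudioInPerMin);

// Cost if the local stack carries the volume and only a share of minutes is escalated.
decimal HybridCost(decimal minutes, decimal escalatedShare) =>
    LocalFixedPerMonth + MeteredCost(minutes * escalatedShare);

foreach (var minutes in new[] { 100m, 500m, 2000m, 10000m })
{
    Console.WriteLine(
        $"{minutes,6} min/month  metered: {MeteredCost(minutes),8:F2}  " +
        $"hybrid (10% escalated): {HybridCost(minutes, 0.10m),8:F2}");
}
```

With these placeholder inputs the metered line grows linearly with usage (10,000 minutes comes to 3,000.00), while the hybrid line stays close to its floor (450.00 for the same volume). At low volume the metered path is cheaper, which is the "not cheaper on a single turn" caveat in numbers.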
The "free lunch" perception of the consumer (GPT, Windows-bundled Copilot, GitHub Copilot's once-cheap tier) is over for delivery — it isn't over for budgeting. Hybrid is how you get to charge the consumer something defensible while keeping your COGS in a band your CFO can model.

What top-tier still wins at — and what open is rapidly closing in on

Be precise about this. Open does not yet match top-tier on the most demanding turns.
  • Voice prosody and latency — OpenAI Voice and ElevenLabs are still in another class. For production (audiobook narration, branded TTS, customer-facing IVR), top-tier is the right call.
  • Frontier reasoning depth — anything that benefits from a 200K+ context window, very long chain-of-thought, or recent web grounding still favours Claude / GPT / Gemini.
  • Best-in-class coding — Claude Code's tool-use and codebase agency are genuinely ahead of any open model running on a single user GPU.
What open is closing in on, fast:
  • Multimodal at the edge. Gemma 4 (Apr 2, 2026) shipped four variants from 2B to 31B with on-device deployment as the headline use case (HuggingFace blog; InfoQ). The smaller variants run on a phone or a Raspberry Pi. The licence is Apache 2.0.
  • Single-GPU omni-modal. NVIDIA Nemotron 3 Nano Omni (Apr 28, 2026) is a 30B Mixture-of-Experts model that activates only 3B parameters per token, fits in 25 GB of RAM at 4-bit, and ships as open weights with commercial licensing (DeepInfra editorial). It handles audio + video + image + text in a single backbone — the exact shape AgentZero Lite's voice subsystem currently glues out of separate single-modality models.
  • Function-calling reliability. Both Gemma 4 and the Nemotron line ship with robust tool-use templates. The reliability gap to top-tier on structured output (JSON-only, schema-constrained) is now small enough that constrained-decoding (GBNF) closes it for most production flows.
Two trends are clear. (1) The "small model" frontier is now multimodal and commercially licensed. (2) The cadence is months, not years: Gemma 4 on April 2, Nemotron 3 Nano Omni 26 days later. Every quarter, more of what you used to send to a $0.24/min API can be answered by a model on the user's machine. That is the floor that keeps rising.
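The "30B total, 3B active, 25 GB at 4-bit" figures for Nemotron 3 Nano Omni are easier to reason about with a back-of-envelope calculation. The sketch below only counts raw weight bytes. Real quantised files also carry scales and often keep some layers at higher precision, and the KV cache and activations need room too, so treat the numbers as a lower bound rather than a measurement.

```csharp
using System;

// Approximate weight footprint in gigabytes (10^9 bytes) for a parameter count and bit width.
double WeightGigabytes(double parameters, int bitsPerWeight) =>
    parameters * bitsPerWeight / 8.0 / 1e9;

double total = WeightGigabytes(30e9, 4);   // a Mixture-of-Experts model keeps all experts resident
double active = WeightGigabytes(3e9, 4);   // weights actually used for one token

Console.WriteLine($"Resident weights at 4-bit: ~{total:F1} GB");
Console.WriteLine($"Active weights per token:  ~{active:F1} GB");
Console.WriteLine($"Headroom inside a 25 GB budget: ~{25 - total:F1} GB");
```

The split explains the appeal of the design: memory is sized by the 30B total, while per-token compute is closer to that of a 3B model.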

What turns a plain LLM into an agent — function calling, walked through

A pretrained LLM by itself is a token generator. Give it the prompt "Tell terminal Claude to summarise the last failure trace," and you'll get back fluent English. Beautiful. Useless — there is no "terminal Claude" to send anything to, no mechanism to do the sending, no way for the model to even know the tab exists.
What closes that gap is function calling: the model is taught (or constrained) to emit structured JSON describing a tool to run, instead of free-form prose. Something downstream parses the JSON, executes the tool, feeds the result back, and the model continues.
There are two routes to get there with an open model like Gemma 4:

Route A — constrained decoding via GBNF grammar (what AgentZero Lite uses)

llama.cpp ships with a grammar-based decoding system (GBNF) that forbids the model from emitting any token that violates a supplied grammar. You write the grammar once; every output that runs to completion is guaranteed to parse (a generation cut off by the token limit can still end mid-object).
The Gemma 4 tool-call grammar that AgentZero Lite uses is roughly this shape:
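root        ::= "{" ws "\"tool\":" ws tool-name ws "," ws "\"args\":" ws args-object ws "}"
tool-name   ::= "\"list_terminals\"" | "\"read_terminal\"" | "\"send_to_terminal\"" | "\"send_key\"" | "\"wait\"" | "\"done\""
args-object ::= "{" ws (kv (ws "," ws kv)*)? ws "}"
kv          ::= string ws ":" ws value
value       ::= string | number | boolean
string      ::= "\"" char* "\""
ws          ::= [ \t\n]*
# char, number and boolean are not shown in the trimmed original; the three
# rules below are an assumed minimal completion so the grammar loads as written.
char        ::= [^"\\\n] | "\\" ["\\/bfnrt]
number      ::= "-"? [0-9]+ ("." [0-9]+)?
boolean     ::= "true" | "false"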
(real grammar lives at Project/ZeroCommon/Llm/Tools/AgentToolGrammar.cs; trimmed here)
With this grammar bound to the inference call, Gemma 4 cannot emit prose. The first token has to be {, the next has to lead to "tool":, and so on. The model is still doing what models do — sampling tokens — but the search space is collapsed to legal JSON. A typical output:
{ "tool": "send_to_terminal", "args": { "tab": "Claude", "text": "Summarise the last failure trace." } }
That is now actionable. Layer 2 of the five-layer ladder from Part 1 — Tools — has something to consume. The model has been agentized without retraining.
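What "something to consume" means on the receiving side can be sketched in a few lines. The example below parses one grammar-shaped output with System.Text.Json, rejects tool names that are not in the grammar, and hands the call to a dispatcher. The tool names come from the grammar above. The dispatcher and its output strings are hypothetical stand-ins, not the project's actual tool host.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Tool names copied from the grammar; anything else is rejected.
var knownTools = new HashSet<string>
{
    "list_terminals", "read_terminal", "send_to_terminal", "send_key", "wait", "done"
};

// Parse one model output into (tool, args). Throws if the shape is not what the grammar promises.
(string Tool, Dictionary<string, string> Args) ParseToolCall(string json)
{
    using var doc = JsonDocument.Parse(json);
    var root = doc.RootElement;

    var tool = root.GetProperty("tool").GetString()
               ?? throw new FormatException("tool must be a string");
    if (!knownTools.Contains(tool))
        throw new FormatException($"unknown tool: {tool}");

    var args = new Dictionary<string, string>();
    foreach (var kv in root.GetProperty("args").EnumerateObject())
        args[kv.Name] = kv.Value.ToString(); // keep every value as text for simplicity

    return (tool, args);
}

// HYPOTHETICAL dispatcher: a real host would write to a ConPTY tab here.
string Dispatch((string Tool, Dictionary<string, string> Args) call) => call.Tool switch
{
    "send_to_terminal" => $"[{call.Args["tab"]}] <- {call.Args["text"]}",
    "done"             => "finished",
    _                  => $"(no handler wired for {call.Tool} in this sketch)"
};

var output = "{ \"tool\": \"send_to_terminal\", \"args\": { \"tab\": \"Claude\", \"text\": \"Summarise the last failure trace.\" } }";
Console.WriteLine(Dispatch(ParseToolCall(output)));
```

Keeping the allow-list check even though the grammar already restricts tool names is a deliberate belt-and-braces choice: the parser then stays safe if it is ever fed output that did not pass through the grammar.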

Route B — native tool-use template (what Nemotron Nano uses)

When a model has been fine-tuned with tool-use as part of its training (Nemotron Nano, Llama-3.1-Instruct, GPT-4 family), you can skip GBNF and use the model's native template: a special chat-format that puts tool definitions in the system message and reads tool calls back out of the assistant turn. This is more idiomatic when the model supports it; AgentZero Lite uses this path for Nemotron and falls back to GBNF if anything looks malformed.
Both routes produce the same thing — a ToolCall struct that the actor stage consumes. The reactor never knows or cares which one ran.
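The "native first, GBNF if anything looks malformed" rule can be sketched as a small wrapper. Both inference delegates below are hypothetical stand-ins for real model calls, and the definition of "malformed" (not a JSON object with a string "tool" and an object "args") is an assumption made for this example.

```csharp
using System;
using System.Text.Json;

// True when the text is a JSON object with a string "tool" and an object "args".
bool LooksLikeToolCall(string text)
{
    try
    {
        using var doc = JsonDocument.Parse(text);
        var root = doc.RootElement;
        return root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("tool", out var tool) && tool.ValueKind == JsonValueKind.String
            && root.TryGetProperty("args", out var args) && args.ValueKind == JsonValueKind.Object;
    }
    catch (JsonException)
    {
        return false; // prose, truncated output, or otherwise not JSON
    }
}

// Try the native template first; if the result is malformed, rerun under the grammar.
// Both delegates are HYPOTHETICAL stand-ins for real inference calls.
(string Json, string Route) NextToolCall(
    string prompt, Func<string, string> nativeInfer, Func<string, string> gbnfInfer)
{
    var native = nativeInfer(prompt);
    if (LooksLikeToolCall(native)) return (native, "native");
    return (gbnfInfer(prompt), "gbnf-fallback");
}

// Fake backends for the demo: the native one drifts into prose, the constrained one cannot.
string ChattyNative(string prompt) => "Sure! I'll tell Claude to do that.";
string Constrained(string prompt) => "{\"tool\":\"done\",\"args\":{}}";

var result = NextToolCall("wrap up", ChattyNative, Constrained);
Console.WriteLine($"{result.Route}: {result.Json}");
```

The caller only ever sees one result shape, which is the property the reactor relies on.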

How AgentZero Lite plays both sides — the architecture

The hybrid posture only works if a single app can route between regimes without the front-end caring which side it's talking to. Three pieces make that possible.

1. The actor stage runs the loop, the LLM does not

This is the central inversion from Part 1 — An LLM Is Not an Agent. A token generator is not an agent; an agent is what you get when you wrap the generator with structure, tools, concurrency and protocol. AgentZero Lite stacks those layers on top of an Akka.NET hierarchy:
/user/stage   StageActor         (supervisor + broker)
/bot          AgentBotActor      (chat-input router)
/ws-<name>    WorkspaceActor     (one per folder)
/term-<id>    TerminalActor      (one per ConPTY tab)
/reactor      AgentReactorActor  (FSM around the LLM)
AgentReactorActor is the loop. Its FSM is small enough to read in one block:
Idle ──StartReactor──▶ Thinking ──prompt sent──▶ Generating
                                                    │
                                                    ▼
                                                 Acting ──tool result──▶ Thinking (continuation)
                                                    │
                                                    └──────done──────▶ Done
The LLM only owns the Thinking → Generating edges. Everything else — choosing which tool to fire, sequencing turns, owning the KV cache, surviving a slow inference, supervising a crashed terminal — happens in actor land. When a peer terminal replies hours later via AgentZeroLite.exe -cli bot-chat, the message lands in /user/stage/bot, gets routed to the reactor, and a continuation cycle starts. The model never had to keep state.
This is what makes the swap cheap. Whether the inference is happening in a 4 GB Gemma 4 GGUF on the user's GPU or in a Claude Code subprocess in the next tab, the reactor's job is identical.
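The reactor's FSM is small enough that its transitions fit in one pure function. The sketch below mirrors the edges described above. State names come from the diagram. The event names "PromptSent", "ToolCall", "ToolResult" and "Done" are labels chosen for this example (the Generating to Acting arrow is unlabelled in the diagram), and the string encoding is a simplification, not how the Akka.NET actor is written.

```csharp
using System;

// Pure transition function: (state, event) -> next state.
// Anything not listed is an illegal transition and fails loudly.
string Next(string state, string evt) => (state, evt) switch
{
    ("Idle",       "StartReactor") => "Thinking",
    ("Thinking",   "PromptSent")   => "Generating",
    ("Generating", "ToolCall")     => "Acting",
    ("Acting",     "ToolResult")   => "Thinking",   // continuation cycle
    ("Acting",     "Done")         => "Done",
    _ => throw new InvalidOperationException($"illegal transition: {state} on {evt}")
};

// Walk one full run: a tool call, a continuation, then completion.
var state = "Idle";
var events = new[]
{
    "StartReactor", "PromptSent", "ToolCall", "ToolResult",
    "PromptSent", "ToolCall", "Done"
};
foreach (var evt in events)
{
    state = Next(state, evt);
    Console.WriteLine($"{evt,-12} -> {state}");
}
```

Because the function has no model-specific branch, swapping the backend cannot change which transitions are legal.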

2. One contract, two backends — ILocalLlm

public interface ILocalLlm : IDisposable
{
    string ModelId { get; }
    string Backend { get; }   // "gemma4-gbnf" | "nemotron-native" | …
    bool Ready { get; }

    // ONE inference call. Caller supplies the tool catalog; receiver returns
    // the structured tool-call payload (or a "done" signal). KV cache lives
    // inside the implementation; the actor does not see it.
    Task<ToolCall> NextTurnAsync(
        ChatHistory history,
        IReadOnlyList<ToolDef> tools,
        CancellationToken ct);
}
Two implementations live behind this interface today (Gemma 4 GBNF, Nemotron Nano native). The reactor binds to one at startup based on LlmGateway config and never asks again.
The reactor side, simplified to the FSM transition that matters:
// AgentReactorActor, Generating state (simplified pseudocode: Akka.NET FSM state
// handlers are synchronous, so the awaited work would be piped back as a message)
When<TurnReady>(async t =>
{
    var call = await _llm.NextTurnAsync(_history, _tools, ct);
    if (call.Done) return GoTo(Done);

    var result = await _toolHost.ExecuteAsync(call, ct);
    _history.AppendToolResult(call, result);
    return GoTo(Thinking);   // continuation cycle
});
That's the whole pivot. The LLM is awaited like any other I/O. If it's Gemma 4 GBNF on the GPU, the call takes 200–800 ms; if it's a top-tier API, the call takes 1–4 s; the FSM doesn't notice. The supervisor strategy on StageActor covers the "what if it crashes" case.
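The same idea can be exercised without Akka or a real model. The sketch below drives turns with two async delegates, which are hypothetical stand-ins for ILocalLlm.NextTurnAsync and the tool host. The turn budget is an assumption added for this example so that a model that never says "done" fails loudly instead of looping forever.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Drive turns until the model says "done" or the turn budget is exhausted.
async Task<List<string>> RunLoopAsync(
    Func<IReadOnlyList<string>, CancellationToken, Task<string>> nextTurn,
    Func<string, CancellationToken, Task<string>> runTool,
    int maxTurns,
    CancellationToken ct)
{
    var history = new List<string>();
    for (var turn = 0; turn < maxTurns; turn++)
    {
        var tool = await nextTurn(history, ct);   // slow or fast, the loop does not care
        if (tool == "done") return history;
        var result = await runTool(tool, ct);
        history.Add($"{tool} => {result}");        // fed back on the next turn
    }
    throw new InvalidOperationException("turn budget exhausted without a done signal");
}

// Scripted fake model: two tool calls, then done. The delay imitates inference latency.
var script = new Queue<string>(new[] { "list_terminals", "read_terminal", "done" });
async Task<string> FakeModel(IReadOnlyList<string> history, CancellationToken ct)
{
    await Task.Delay(50, ct);
    return script.Dequeue();
}
Task<string> FakeTool(string tool, CancellationToken ct) => Task.FromResult($"ok({tool})");

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var transcript = await RunLoopAsync(FakeModel, FakeTool, 8, cts.Token);
Console.WriteLine(string.Join(Environment.NewLine, transcript));
```

Changing the delay in FakeModel from 50 ms to several seconds changes nothing about the loop, which is the point of the paragraph above.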

3. Voice as the synthesis — a free stack mimicking a paid one

The deepest test of "can open replace expensive" is voice, because voice is where top-tier currently wins by the widest margin. AgentZero Lite's voice path (Part 11 demo; Akka.Streams pipeline write-up) is entirely open by default:
mic ──▶ Whisper.net (local GGML) ──▶ AgentReactorActor ──▶ chunked text out ──▶ Windows SAPI / OpenAI TTS ──▶ speakers
        STT                          FSM                                        TTS
Whisper for transcription, Windows SAPI (the OS-bundled Heami voice) for synthesis, Akka.NET Streams for the chunking and back-pressure that let the bot stop transcribing itself while it speaks. Zero API keys in the audio path. The end-to-end demo runs without a network. It does not match OpenAI Voice on prosody, but it answers the right question for a developer surface: can we ship a closed-loop voice turn at all? We can.
The architectural payoff is that the same Akka.Streams stages slot in front of a paid TTS provider when the user has one configured. The voice settings panel exposes a provider selector (Part 7 — Adding Providers and a Voice); pick OpenAI TTS, the same chunker / mic-mute / wave-out pipeline carries the bytes. Hybrid all the way down to the audio buffer.
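The "chunked text out" step is what lets a TTS engine start speaking before the model has finished generating. The sketch below shows the idea in plain C#, with no Akka.Streams involved: it is not the project's pipeline stage, and the sentence-boundary rule and the 200-character cap are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Group streamed text fragments into sentence-sized chunks a TTS engine can speak early.
IEnumerable<string> ChunkSentences(IEnumerable<string> fragments, int maxChars = 200)
{
    var buffer = new StringBuilder();
    foreach (var fragment in fragments)
    {
        foreach (var ch in fragment)
        {
            buffer.Append(ch);
            var sentenceEnd = ch is '.' or '!' or '?' or '\n';
            if (sentenceEnd || buffer.Length >= maxChars)
            {
                var chunk = buffer.ToString().Trim();
                buffer.Clear();
                if (chunk.Length > 0) yield return chunk;
            }
        }
    }
    var tail = buffer.ToString().Trim();
    if (tail.Length > 0) yield return tail; // flush whatever is left at end of stream
}

// Fragments arrive split at arbitrary points, the way streamed tokens do.
var tokens = new[] { "Build ", "failed. ", "Two tests ", "are red! Re", "run them" };
foreach (var chunk in ChunkSentences(tokens))
    Console.WriteLine($"speak: {chunk}");
```

The boundary rule is deliberately naive (a decimal such as "3.5" would be split), but the output type is the same whichever TTS provider consumes it, which is what makes the provider swappable.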

Strategic compositions — what hybrid actually buys you

"Hybrid" is not a single architecture. It is a menu of recipes. Below is what the same AgentZero Lite surface lets you ship for different use-cases — the user picks the recipe by configuring providers, not by switching apps.
| Recipe | Composition | Why this mix |
|---|---|---|
| Paid + Free | Claude Code (paid) for reasoning + Gemma 4 (free) as in-shell coordinator | High-stakes coding, long sessions; the coordinator handles routing and summary so the paid context isn't burned on logistics |
| Paid + Free + Free | Claude (paid) for content + Whisper.net (free) STT + SAPI (free) TTS | Voice-driven coding flow on a developer box; voice path costs nothing, content path is metered only when a turn actually needs Claude |
| Free + Free + Free | Gemma 4 (coordinator) + Nemotron Nano specialist (e.g. log triage) + Whisper/SAPI voice | On-call mode, offline, no network, no API keys. Quality ceiling is lower; no cost variance at all |
| Paid (audio) + Free (logic) | OpenAI Realtime (paid) for prosody + Gemma 4 (free) for tool-call orchestration | Branded voice product where intonation matters; logic stays free so the meter is bounded to actual speech minutes |
| Three giants in tabs | Claude (one tab) + GPT-5 in codex (one tab) + Gemini in another tab + Gemma 4 coordinator | Comparison rig: same prompt routed to three giants, results reconciled by the coordinator. The "playground" use-case is concretely this recipe. |
The thing to notice is that the architecture doesn't change between rows. The same ILocalLlm, same AgentReactorActor, same voice pipeline, same CLI tabs. What changes is which providers are wired up.
For a product manager: this is the difference between "we built one product" and "we built one product surface that hosts five recipes." The marketing message changes per audience; the engineering doesn't.
For a finance team: this is the difference between "our COGS is a wide band" and "our COGS is a fixed-cost floor with paid surcharges that the user toggles on for the turns they care about."
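One way to read "the architecture doesn't change between rows" is that a recipe is just data. The sketch below encodes four of the recipes as slot-to-provider maps and derives a property (can this recipe run offline?) from the data alone. The slot names and provider identifiers are hypothetical labels for this example, not AgentZero Lite's configuration schema.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Each recipe is plain data: which provider fills which slot.
var recipes = new Dictionary<string, Dictionary<string, string>>
{
    ["Paid + Free"] = new()
    {
        ["reasoning"] = "claude-code", ["coordinator"] = "gemma4"
    },
    ["Paid + Free + Free"] = new()
    {
        ["content"] = "claude", ["stt"] = "whisper.net", ["tts"] = "sapi"
    },
    ["Free + Free + Free"] = new()
    {
        ["coordinator"] = "gemma4", ["specialist"] = "nemotron-nano",
        ["stt"] = "whisper.net", ["tts"] = "sapi"
    },
    ["Paid (audio) + Free (logic)"] = new()
    {
        ["voice"] = "openai-realtime", ["coordinator"] = "gemma4"
    },
};

// Providers that need a network connection and an account.
var paid = new HashSet<string> { "claude-code", "claude", "openai-realtime" };

// A recipe can run offline only if none of its slots points at a paid provider.
bool RunsOffline(string recipe) => !recipes[recipe].Values.Any(paid.Contains);

foreach (var name in recipes.Keys)
    Console.WriteLine($"{name,-28} offline-capable: {RunsOffline(name)}");
```

Adding a sixth recipe in this shape means adding one dictionary entry, which is the engineering counterpart of "which providers are wired up".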

What's next — Gemma 4 is the inflection point

Gemma 4 isn't the destination, it is the inflection. The reasons it matters for this argument, specifically:
  1. Apache 2.0 licence with commercial use: the legal blocker that kept earlier open-weights models out of consumer products is gone for this family.
  2. Four variants spanning phone-class to 31B: the same model identity (and therefore the same community fine-tuning ecosystem) can target a Pixel and a developer GPU. That is new.
  3. A real fine-tuning ecosystem from week one: HuggingFace community fine-tunes started landing within days of the April 2 release. Specialised variants (code-only, multilingual-strong, instruction-tight) compound on a base model that's now actually base, not hobbled by licence.
  4. Vendors are racing to fill the niche: NVIDIA's Nemotron 3 Nano Omni shipped 26 days later. That is not a coincidence; that is a market signal that "small, open, on-device, multimodal" is now the contested ground.
From here, the prediction is not subtle. Purpose-specific small models are expected to keep accelerating. The shape of the next 12 months looks like:
  • Workflow specialists — small models fine-tuned for terminal log triage, code-review summary, voice command parsing, error-message rewriting. A 4B model that wins at its specific job against a generic 32B.
  • Domain specialists — small models tuned on industry corpora (legal review, medical notes summarisation, financial compliance) where the data sovereignty argument forces on-device anyway.
  • Multimodal collapse — Nemotron 3 Nano Omni's pattern (one backbone, all modalities) reaches more vendors. The voice subsystem in Project/ZeroCommon/Voice/ exists because we had to glue multiple single-modality models together; an omni-modal small model could collapse half of it.
The product implication is that a hybrid surface today is not a temporary scaffolding until the giants get cheaper. It is the surface that gets more valuable as the on-device shelf fills out. AgentZero Lite's LlmModelCatalog already discovers and ranks local models on first launch; replacing Gemma 4 with a fine-tune is a config change. Replacing it with a specialist is the same config change with better results.
That is the wedge. The giants charge for capability; the open shelf supplies floor; the user composes between them per task. Whoever ships the surface that lets the composition happen with one click wins the consumer trust that any single-provider product cannot.

Closing — three questions for the reader

The core argument of this post is that hybrid is not a choice — it is the table-stakes posture for any AI surface that wants to be useful for more than one workflow. Plugging into all three giants is inevitable. What you actually choose is which open models you compose with them. AgentZero Lite is one bet on what that composition surface should look like. The bet rests on three things — and these are the questions worth arguing about:
  1. Do you really think you can ship a serious AI product on top of just one of OpenAI / Anthropic / Google? If not (and most can't), then you're already running hybrid. The question is whether you've named it as a strategic capability or are still treating it as integration debt.
  2. How do you prevent the on-device path from being a Potemkin demo? A small model that fails 30% of the time is worse than no small model: it trains the user to distrust the surface. AgentZero Lite uses GBNF + native templates + a deterministic FSM to keep the failure mode loud. What's your equivalent guardrail?
  3. What does the playground look like a year out? This project's stated value today is being a place where top-tier and open can be benchmarked head-to-head on the same UI. If Gemma 4-class models keep landing every six weeks, the playground becomes a competitive surface: every dev with the app is, in effect, running an A/B between a free model and the model their employer pays for. That is a research artifact, and it is also a buying signal. Who wants that data?
If you have answers — or sharper questions — I want to hear them. The code is at github.com/psmon/AgentZeroLite; the contract is the JSON tool call.

