🪟

AgentZero Lite — Bringing Multi-CLI and On-Device LLM to Windows (Part 1, EN)

notion image
📘
Series Note — This is Part 1 of two.
Part 1 (this article) — Multi-CLI · On-Device LLM · Hybrid Strategy
Part 2 — The engine room beneath: Akka.NET actor model, Gemma 4 + GBNF, AgentReactorActor FSM, STT/LLM/TTS three-layer ensemble →
🤖
How Actors Turn LLMs into Real Agents — AgentZero Lite Deep Dive (Part 2, EN)
A minimalist IDE for the AI era — manage multiple CLI agents in one window, mixing a smart big model with a lightweight on-device LLM in the same place.

1. Why another terminal — the rise of CLIs

The most interesting shift in 2026 dev workflows fits in one line: we're moving from humans-using-IDEs to AI-agents-using-terminals.
A unit of work is no longer "one shell." Claude Code runs a big refactor in one tab while another tab's Codex builds a PR review and a third tab tails the build log. A human supervises all three, occasionally stepping in. The trend shows up in product moves too — over the last twelve months a new category called AI terminal multiplexers has emerged. The macOS/Linux camp got cmux, amux, agent-of-empires one after another, and tmux itself returned to the spotlight as a stage for AI agent teams rather than a single prompt.
flowchart LR classDef plat fill:#1e293b,stroke:#06b6d4,color:#e2e8f0 classDef new fill:#312e81,stroke:#a855f7,color:#e9d5ff classDef gap fill:#7f1d1d,stroke:#ef4444,color:#fecaca subgraph mac["macOS / Linux"] direction TB A1["tmux"]:::plat A2["cmux"]:::new A3["amux"]:::new A4["agent-of-empires"]:::new end subgraph win["Windows"] direction TB B1["Windows Terminal"]:::plat B2["wmux (new)"]:::new B3["?? gap ??"]:::gap end mac -.plenty of options.-> goal["Multi AI Agent<br>same screen"] win -.long empty seat.-> goal B3 ==> agentzero["AgentZero Lite"]:::new
The problem is the Windows side. For a long time there was nothing on Windows that put multiple AI agents in one screen without WSL. Recent attempts like wmux showed up, but most still focus on terminal splitting alone. AgentZero Lite tries to fill that gap differently. It's not just a multiplexer — it bundles multi AI agents + on-device LLM + WASM plugins + voice input + its own harness frame into a single executable, a Windows desktop shell.

2. AgentZero Lite — one-line summary

AgentZero Lite is a Windows desktop shell that lets you handle several CLI agents in one screen. The core is simple.
  • Spin up N real ConPTY terminals (no pseudo-PTY mimicry).
  • Group tabs by folder via workspaces — keep separate CLI sets per project.
  • An AgentBot chat panel forwards text or keystrokes to the focused terminal.
  • An on-device LLM (Gemma 4) inside that chat panel coordinates the terminal AIs.
  • WASM sandbox plugins let users add their own features.
  • Voice input runs offline via Whisper.net + Vulkan.
  • Single executable, no dependency beyond .NET 10, ~60 MB build.
flowchart TB classDef tab fill:#0f172a,stroke:#06b6d4,color:#e2e8f0 classDef bot fill:#312e81,stroke:#a855f7,color:#e9d5ff classDef ws fill:#1e293b,stroke:#f59e0b,color:#fde68a user(("👤 User")) subgraph win["AgentZero Lite (single .exe)"] direction TB subgraph tabs["Tabs (N ConPTY terminals)"] direction LR T1["Claude tab"]:::tab T2["Codex tab"]:::tab T3["pwsh tab"]:::tab T4["build-log"]:::tab end subgraph side["Sidebar"] direction TB W1["▸ monorepo"]:::ws W2["▸ blog"]:::ws end subgraph bottom["Bottom panel"] direction LR P1["AGENT BOT<br>(Chat / Key / AI)"]:::bot P2["NOTE"]:::tab P3["LOG"]:::tab end end user --> tabs user --> bottom side -.context.-> tabs P1 -.focused tab.-> tabs

3. AI ↔ AI conversation — the Lite edition's signature feature

The reason AgentZero Lite exists fits in a single sentence: "I want two AI CLIs to talk to each other."
The recipe is surprisingly simple. Open a Claude tab and a Codex tab in the same workspace, then teach each tab one line.
Learn AgentZeroLite.ps1 help. Use terminal-list to see tabs and terminal-send <grp> <tab> "..." to talk to the AI in another tab.
Now when the user says "Greet the tab named Codex and propose we design a REST endpoint together", the Claude tab runs terminal-send 0 1 "Hi Codex, ...". The Codex tab sees that message in its own prompt, composes a reply, and sends it back via terminal-send 0 0 "...". A live conversation flows across two tabs.
sequenceDiagram autonumber actor U as 👤 User participant C as Claude tab (tab 0) participant IPC as AgentZero IPC<br>(WM_COPYDATA + MMF) participant X as Codex tab (tab 1) U->>C: "Greet Codex and propose REST design" C->>IPC: terminal-send 0 1 "Hi Codex, ..." IPC->>X: text auto-typed X->>X: composes reply X->>IPC: terminal-send 0 0 "Hi Claude, ..." IPC->>C: reply auto-typed Note over C,X: No cloud relay<br>local IPC only
Why this is interesting: all the traffic flows over AgentZero's own IPC (WM_COPYDATA + memory-mapped files). No cloud relay, no external server. And because the broker is just a shell command the AIs already understand, any CLI-native agent — Aider, Copilot CLI, a local ollama chat — can join with the same protocol. No catch-up work for AgentZero Lite when a new agent appears.
A terminal multiplexer lets you see several prompts. AgentZero Lite lets them talk to each other.

4. ChatMode and AIMode — a small assistant orchestrating big models

The AgentBot panel has two modes.
  • ChatMode (CHT/KEY) — an input broker that forwards your text or keystrokes straight to the active terminal. Not an AI.
  • AIMode — flip with Shift+Tab and the same chat box becomes an on-device LLM coordinator.
flowchart LR classDef cht fill:#0f172a,stroke:#06b6d4,color:#e2e8f0 classDef ai fill:#312e81,stroke:#a855f7,color:#e9d5ff user(("User")) subgraph cht1["ChatMode (input broker)"] direction LR I1["chat input"]:::cht --> O1["forward to<br>active terminal"]:::cht end subgraph aim["AIMode (LocalLLM coordinator)"] direction LR I2["chat input"]:::ai --> L["Gemma 4<br>(on-device)"]:::ai L --> R{"routing"}:::ai R -->|"heavy reasoning"| T1["Claude tab"]:::ai R -->|"quick summary"| T2["Codex tab"]:::ai R -->|"direct"| O2["reply to user"]:::ai end user --> I1 user --> I2
There's a philosophy worth stating — the LocalLLM in AIMode is not trying to be smarter than Claude or Codex. It plays the small assistant. Take a vague request ("start a debate with Codex", "summarize today's PRs"), route it to the right terminal AI, then come back with a tidy summary. The metaphor that fits: the heavy reasoning lives in the bigger CLIs; LocalLLM is the receptionist who knows everyone's extension.
This little assistant becomes an actual Agent — not just a text autocomplete engine — through four moves.
  1. Output constraint — GBNF grammar forces every output to be {"tool": "<name>", "args": { ... }} and nothing else, enforced at the sampler level.
  1. Tool execution — six-tool surface: list_terminals, read_terminal, send_to_terminal, send_key, wait, done.
  1. Inject results back into context — as the next user turn.
  1. Repeat until the LLM emits done.
That generate → tool → result → generate again loop is what turns text completion into agency. And the one cycle per run rule keeps the LLM from scripting a five-turn debate as one giant tool chain — the next cycle is triggered by either the user or an arriving peer signal.
stateDiagram-v2 [*] --> Idle Idle --> Thinking: StartReactor(text) Thinking --> Generating: prompt prefill Generating --> Acting: tool_call JSON Acting --> Generating: tool_result injected Acting --> Done: done called Done --> Idle: one cycle complete Idle --> Idle: peer signal or<br>next user input
Who do you delegate the PM role to? That's AgentZero Lite's real question. It can be the smart cloud model. It can be the small on-device model directly. Letting the user actually have that choice is the point.

5. Voice — dual multitasking with hands and mouth at once

Voice input is wired straight into AgentBot. Speak into the mic, your audio gets transcribed locally (Whisper.net + Vulkan, never touches the cloud), and the resulting text flows into the active AI CLI tab through the exact same path as if you typed it.
sequenceDiagram autonumber actor U as 👤 User participant K as Keyboard participant M as Microphone participant T0 as Tab 0 (Claude) participant T1 as Tab 1 (Codex) par keyboard channel U->>K: writing code K->>T0: "refactor this function" T0-->>U: diff response and voice channel U->>M: "summarize today's PRs" M->>T1: Whisper transcribe → auto-typed T1-->>U: voice reply (P3) end Note over T0,T1: two AI dialogs run in parallel<br>one human supervises both
The key idea is dual multitasking. While one terminal is occupied by the keyboard (writing code, reviewing a Claude diff), the second terminal can run hands-free via voice. Two AI conversations move in parallel and one human supervises both. AIMode's between-models tikitaka now extends to a between-your-own-input-channels (hands/mouth) tikitaka.
The stack matters for non-English users.
  • Whisper.net offline STT, GGML small (~466 MB) / medium (~1.5 GB) cached locally. Korean-strong turbo variants can be plugged in separately.
  • CPU + Vulkan dual bundle — the Vulkan backend is cross-vendor: AMD, Intel, NVIDIA all run on the same binary. No CUDA lock-in.
  • Even on multi-GPU laptops, Auto picks the dGPU correctly; if it picks wrong, manual override is one click.
  • TTS output is in progress — Windows SAPI / OpenAI tts-1 plumbing is in, but the response-streaming pipeline is staged.
The voice-output model will follow the same strategy — pick the best free models per language and combine them. Don't force cloud lock-in.

6. WASM plugins — bring your own feature

Another experiment in AgentZero Lite is sandboxed WASM plugins.
flowchart LR classDef host fill:#1e293b,stroke:#06b6d4,color:#e2e8f0 classDef sand fill:#312e81,stroke:#a855f7,color:#e9d5ff classDef cap fill:#064e3b,stroke:#10b981,color:#a7f3d0 subgraph host["AgentZero host"] direction TB LLM["LocalLLM API"]:::cap Term["Terminal Actor API"]:::cap Note["NOTE Panel API"]:::cap end subgraph sandbox["WASM sandbox"] direction TB Plug["Voice Note plugin<br>(manifest + html + js)"]:::sand end Plug -.explicit grant.-> LLM Plug -.explicit grant.-> Term Plug -.explicit grant.-> Note Plug -.no default access.-> X1[("✗ filesystem")] Plug -.no default access.-> X2[("✗ network")]
Some background. In 2026 WebAssembly settled in as the standard isolation layer for AI agents. NVIDIA, Cloudflare, Vercel, E2B, Firecrawl all shipped WASM-based sandboxes. The reason is simple — WASM modules are powerless by default. No filesystem, no network, no external resources. The host has to grant capabilities explicitly. Containers and microVMs struggle to give the same isolation.
AgentZero Lite picked the same philosophy. Users can build their own features and plug them in.
  • Pull on-device AI capabilities that AgentZeroCLI exposes — call LocalLLM from inside a plugin.
  • The first official plugin is Voice Note — STT-driven voice journaling, VAD-gated capture, sensitivity slider, pause/resume, LLM summary.
  • A new plugin needs only manifest.json + index.html + *.js/*.css. Install in one click via Install Plugin Picker under the AgentBot [+] menu.
"Users build the features they need themselves" is more natural in the LLM era. AI got good at writing code, so the cost of making one plugin is collapsing.

7. Harness view — the framework grows alongside

Built in parallel with AgentZero Lite is the harness harness frame. As a bonus, the Note panel's harness view lets you peek at the framework's tech elements directly.
Wiring an LLM into a useful tool chain is hard. Reasoning from scratch every time repeats the same mistakes. The harness makes the learning loop explicit.
harness/ ├── agents/ — expert evaluators (security-guard, code-coach, test-sentinel, tamer) ├── engine/ — workflows (release-build-pipeline, pre-commit-review) ├── knowledge/ — domain notes (LLM prompt conventions, tool-calling survey) └── logs/ — Mode 3 reviews, RCAs, evaluations all pinned here
There are four feedback loops.
  1. Unit-test feedbackT1G..T7G live + headless TestKit suite catches regressions in protocol/state machines.
  1. Live-run logs%LOCALAPPDATA%\AgentZeroWpf\logs\app-log.txt captures every Reactor turn, peer signal, JSON parse failure.
  1. Mode 3 RCA logs — per-regression dated post-mortems (symptom, root cause, patch, evaluation, deferred follow-ups).
  1. Human as reviewer — accept or course-correct the harness's suggestions. Not "the AI does it all" but pair-programming with an iterative improver.
The AIMode prompt got revised six times in this iteration alone — one-cycle rule, anti-passivity, anti-refusal, handshake separation, peer-signal triggers, ID scheme as string. Each attempt logged in the same Mode 3 doc, recording what failed and why the next attempt fixed it. The harness is the memory of those attempts, so the same mistakes don't repeat.

8. The on-device LLM era kicked off by open-model Gemma

The pool of LocalLLM candidates AgentZero Lite can plug in exploded over a year.
On-device LLM timeline (by release date):
  • 2025-03Gemma 3 (1B/4B/12B/27B). 140+ languages, laptop · workstation target.
  • 2025-08GPT-OSS 20B/120B (MXFP4 4-bit MoE). 20B runs on a 16 GB RAM edge.
  • 2025Gemma 3n (PLE-based RAM saving). Mobile · handheld optimized.
  • 2026-04Gemma 4 (small 128K, medium 256K context). Optimized for laptop local execution.
The simple question to ask: "Can a single decent GPU run these on-device?"
The answer is partially yes. GPT-OSS 20B runs on 16 GB edge. The 4-bit quantized Nemotron 3 Nano Omni handles multimodal in 25 GB VRAM. The cost of running an assistant model on your own machine has dropped to single-digit-thousand-dollar GPU territory.
Heavy reasoning is still a cloud-model game. But the assistant role is clearly viable on-device. AIMode's Gemma 4 sits right in that seat.

9. Will tokens really get cheaper than air?

A market mood check is in order. The assumption that tokens get infinitely cheaper is wobbling.
⚠️
AnthropicClaude Pro/Max plans got weekly limits (introduced 2025-08-28). 5-hour rolling window + weekly cap + separate Opus cap. Peak hours (05:00–11:00 PT, 13:00–19:00 GMT) get extra session-limit adjustments. The company officially admits "infrastructure investment will take 12–24 months to translate into capacity" — limits stay tight or get tighter for a while.
⚠️
GitHub Copilotmoves to usage-based billing on 2026-06-01. Every plan gets monthly GitHub AI Credits, deducted by token (input/output/cached). From 2026-04-20 new sign-ups for Pro/Pro+/student plans are paused. The reason: "agentic workflows pushed infrastructure cost up significantly."
Agent serverless cost — once a harness framework juggles N agents, nobody knows yet whether the CPU/memory pattern ends up cheaper or more expensive than AWS serverless. Nobody has a final answer.
In short — this is a supplier-power market. Promises get rescinded overnight. Consumers have few choices.

10. Hybrid strategy — vendor-lock-free, choice belongs to the consumer

AgentZero Lite's answer is plain. Hybrid mode. Combine free/on-device models where useful and delegate heavy reasoning to bigger cloud models. No vendor lock, the user keeps the choice.
flowchart LR classDef cloud fill:#1e3a8a,stroke:#3b82f6,color:#dbeafe classDef local fill:#064e3b,stroke:#10b981,color:#a7f3d0 classDef user fill:#7c2d12,stroke:#f97316,color:#fed7aa user(("👤 User<br>holds the choice")):::user subgraph cloud["Cloud (vendor)"] direction TB Cl["Claude"]:::cloud Cx["Codex"]:::cloud Gm["Gemini"]:::cloud end subgraph local["On-device (free)"] direction TB G4["Gemma 4"]:::local Wh["Whisper STT"]:::local Os["GPT-OSS"]:::local end user -->|"heavy reasoning"| cloud user -->|"PM · summary · STT"| local cloud <-.AI ↔ AI talk.-> local user -->|"WASM self-built plugin"| W["Voice Note etc."]:::local
Concretely how:
  • Multi-CLI — Claude, Codex, Gemini CLI all run side by side. A future CLI agent can slot into the same multi-view with almost no extra work.
  • On-device ↔ cloud routing — AIMode's LocalLLM, acting as PM, decides where to send each task. Heavy reasoning → Claude, quick summary → Gemma. Agent usage cost stops being a vendor-bound thing and becomes the consumer's choice.
  • Free voice-model combos — STT via Whisper, TTS via the best Korean-friendly model. Pick each separately.
  • WASM plugins — internal tools and personal workflows: build them yourself. No vendor approval needed for a feature add.
  • Harness view — see how each agent is actually working. Zero black box.
PlayGround value. AgentZero Lite is not aiming to "replace everything with on-device". It's a testbed for the partial-adoption hybrid strategy where small models step in for specific strengths. Not competing with Claude Code.

11. Closing — the agent serverless market is the real war

Stepping back — what we're watching is more than a crisis of the IDE.
  • Skill / MCP specs are interfaces. Anthropic created and opened them. By themselves they're a thin billing surface.
  • Harness frame + agent serverless is where the real money is. Tokens aren't all of the cost. Token × tool fee (agent serverless) stacks into a double-billing structure. Because a harness orchestrates N agents, the contest is system-level, not single-agent-level.
  • Benchmarks are losing discriminative power. Even Chinese models post higher LLM scores now. The real war happens in the agent serverless market.
  • A B2B billing surface. A harness framework is exactly where you can attach pricing.
In that environment, AgentZero Lite is a PlayGround for experimenting with cloud-bound AI plus on-device combinations.
  • IDEs keep getting heavier and pricier. Web/GUI tools layer in AI features and raise their subscription fees. Lighter attempts to control N things via CLI/TUI will keep coming.
  • CLIs, once a developer's exclusive turf, recently picked up serious users in high-end creative and product-planning camps. The user base is widening.
  • AgentZero Lite stakes its place in this AI upheaval as a PlayGround for validating the hybrid strategy of mixing diverse AI models (open × vendor) under user choice.
🚧 In progress · <https://blumn.ai/>

Appendix — sister repos worth a look

Sister repos
  • harness-kakashi — a single-player practice harness. A Naruto-themed sandbox to get the feel of harnesses.
  • pencil-creator — an experiment that uses a harness to auto-generate design-system templates.
  • memorizer-v1 — a vector-search-based agent memory MCP server. Slated to become the harness's shared memory.
  • DeskWeb — a qooxdoo-based Windows-XP-styled WebOS. Ships with 4 Claude Code Skills.

Sources

 

NEXT