AgentZero Lite — Bringing Multi-CLI and On-Device LLM to Windows (Part 1, EN)

📘

Series Note — This is Part 1 of two.

Part 1 (this article) — Multi-CLI · On-Device LLM · Hybrid Strategy

Part 2 — The engine room beneath: Akka.NET actor model, Gemma 4 + GBNF, AgentReactorActor FSM, STT/LLM/TTS three-layer ensemble →

🤖

How Actors Turn LLMs into Real Agents — AgentZero Lite Deep Dive (Part 2, EN)

A minimalist IDE for the AI era — manage multiple CLI agents in one window, mixing a smart big model with a lightweight on-device LLM in the same place.

1. Why another terminal — the rise of CLIs

The most interesting shift in 2026 dev workflows fits in one line: we're moving from humans-using-IDEs to AI-agents-using-terminals.

A unit of work is no longer "one shell." Claude Code runs a big refactor in one tab while another tab's Codex builds a PR review and a third tab tails the build log. A human supervises all three, occasionally stepping in. The trend shows up in product moves too — over the last twelve months a new category called AI terminal multiplexers has emerged. The macOS/Linux camp got cmux, amux, agent-of-empires one after another, and tmux itself returned to the spotlight as a stage for AI agent teams rather than a single prompt.


flowchart LR
  classDef plat fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
  classDef new fill:#312e81,stroke:#a855f7,color:#e9d5ff
  classDef gap fill:#7f1d1d,stroke:#ef4444,color:#fecaca

  subgraph mac["macOS / Linux"]
    direction TB
    A1["tmux"]:::plat
    A2["cmux"]:::new
    A3["amux"]:::new
    A4["agent-of-empires"]:::new
  end

  subgraph win["Windows"]
    direction TB
    B1["Windows Terminal"]:::plat
    B2["wmux (new)"]:::new
    B3["?? gap ??"]:::gap
  end

  mac -.plenty of options.-> goal["Multi AI Agent<br>same screen"]
  win -.long empty seat.-> goal
  B3 ==> agentzero["AgentZero Lite"]:::new

The problem is the Windows side. For a long time there was nothing on Windows that put multiple AI agents in one screen without WSL. Recent attempts like wmux showed up, but most still focus on terminal splitting alone. AgentZero Lite tries to fill that gap differently. It's not just a multiplexer — it bundles multi AI agents + on-device LLM + WASM plugins + voice input + its own harness frame into a single executable, a Windows desktop shell.

2. AgentZero Lite — one-line summary

AgentZero Lite is a Windows desktop shell that lets you handle several CLI agents in one screen. The core is simple.

Spin up N real ConPTY terminals (no pseudo-PTY mimicry).

Group tabs by folder via workspaces — keep separate CLI sets per project.

An AgentBot chat panel forwards text or keystrokes to the focused terminal.

An on-device LLM (Gemma 4) inside that chat panel coordinates the terminal AIs.

WASM sandbox plugins let users add their own features.

Voice input runs offline via Whisper.net + Vulkan.

Single executable, no dependency beyond .NET 10, ~60 MB build.


flowchart TB
  classDef tab fill:#0f172a,stroke:#06b6d4,color:#e2e8f0
  classDef bot fill:#312e81,stroke:#a855f7,color:#e9d5ff
  classDef ws fill:#1e293b,stroke:#f59e0b,color:#fde68a

  user(("👤 User"))
  subgraph win["AgentZero Lite (single .exe)"]
    direction TB
    subgraph tabs["Tabs (N ConPTY terminals)"]
      direction LR
      T1["Claude tab"]:::tab
      T2["Codex tab"]:::tab
      T3["pwsh tab"]:::tab
      T4["build-log"]:::tab
    end
    subgraph side["Sidebar"]
      direction TB
      W1["▸ monorepo"]:::ws
      W2["▸ blog"]:::ws
    end
    subgraph bottom["Bottom panel"]
      direction LR
      P1["AGENT BOT<br>(Chat / Key / AI)"]:::bot
      P2["NOTE"]:::tab
      P3["LOG"]:::tab
    end
  end
  user --> tabs
  user --> bottom
  side -.context.-> tabs
  P1 -.focused tab.-> tabs

3. AI ↔ AI conversation — the Lite edition's signature feature

The reason AgentZero Lite exists fits in a single sentence: "I want two AI CLIs to talk to each other."

The recipe is surprisingly simple. Open a Claude tab and a Codex tab in the same workspace, then teach each tab one line.

Learn AgentZeroLite.ps1 help. Use terminal-list to see tabs and terminal-send <grp> <tab> "..." to talk to the AI in another tab.

Now when the user says "Greet the tab named Codex and propose we design a REST endpoint together", the Claude tab runs terminal-send 0 1 "Hi Codex, ...". The Codex tab sees that message in its own prompt, composes a reply, and sends it back via terminal-send 0 0 "...". A live conversation flows across two tabs.


sequenceDiagram
  autonumber
  actor U as 👤 User
  participant C as Claude tab (tab 0)
  participant IPC as AgentZero IPC<br>(WM_COPYDATA + MMF)
  participant X as Codex tab (tab 1)

  U->>C: "Greet Codex and propose REST design"
  C->>IPC: terminal-send 0 1 "Hi Codex, ..."
  IPC->>X: text auto-typed
  X->>X: composes reply
  X->>IPC: terminal-send 0 0 "Hi Claude, ..."
  IPC->>C: reply auto-typed
  Note over C,X: No cloud relay<br>local IPC only

Why this is interesting: all the traffic flows over AgentZero's own IPC (WM_COPYDATA + memory-mapped files). No cloud relay, no external server. And because the broker is just a shell command the AIs already understand, any CLI-native agent — Aider, Copilot CLI, a local ollama chat — can join with the same protocol. No catch-up work for AgentZero Lite when a new agent appears.

A terminal multiplexer lets you see several prompts. AgentZero Lite lets them talk to each other.

4. ChatMode and AIMode — a small assistant orchestrating big models

The AgentBot panel has two modes.

ChatMode (CHT/KEY) — an input broker that forwards your text or keystrokes straight to the active terminal. Not an AI.

AIMode — flip with Shift+Tab and the same chat box becomes an on-device LLM coordinator.


flowchart LR
  classDef cht fill:#0f172a,stroke:#06b6d4,color:#e2e8f0
  classDef ai fill:#312e81,stroke:#a855f7,color:#e9d5ff
  user(("User"))

  subgraph cht1["ChatMode (input broker)"]
    direction LR
    I1["chat input"]:::cht --> O1["forward to<br>active terminal"]:::cht
  end

  subgraph aim["AIMode (LocalLLM coordinator)"]
    direction LR
    I2["chat input"]:::ai --> L["Gemma 4<br>(on-device)"]:::ai
    L --> R{"routing"}:::ai
    R -->|"heavy reasoning"| T1["Claude tab"]:::ai
    R -->|"quick summary"| T2["Codex tab"]:::ai
    R -->|"direct"| O2["reply to user"]:::ai
  end

  user --> I1
  user --> I2

There's a philosophy worth stating — the LocalLLM in AIMode is not trying to be smarter than Claude or Codex. It plays the small assistant. Take a vague request ("start a debate with Codex", "summarize today's PRs"), route it to the right terminal AI, then come back with a tidy summary. The metaphor that fits: the heavy reasoning lives in the bigger CLIs; LocalLLM is the receptionist who knows everyone's extension.

This little assistant becomes an actual Agent — not just a text autocomplete engine — through four moves.

Output constraint — GBNF grammar forces every output to be {"tool": "<name>", "args": { ... }} and nothing else, enforced at the sampler level.

Tool execution — six-tool surface: list_terminals, read_terminal, send_to_terminal, send_key, wait, done.

Inject results back into context — as the next user turn.

Repeat until the LLM emits done.

That generate → tool → result → generate again loop is what turns text completion into agency. And the one cycle per run rule keeps the LLM from scripting a five-turn debate as one giant tool chain — the next cycle is triggered by either the user or an arriving peer signal.


stateDiagram-v2
  [*] --> Idle
  Idle --> Thinking: StartReactor(text)
  Thinking --> Generating: prompt prefill
  Generating --> Acting: tool_call JSON
  Acting --> Generating: tool_result injected
  Acting --> Done: done called
  Done --> Idle: one cycle complete
  Idle --> Idle: peer signal or<br>next user input

Who do you delegate the PM role to? That's AgentZero Lite's real question. It can be the smart cloud model. It can be the small on-device model directly. Letting the user actually have that choice is the point.

5. Voice — dual multitasking with hands and mouth at once

Voice input is wired straight into AgentBot. Speak into the mic, your audio gets transcribed locally (Whisper.net + Vulkan, never touches the cloud), and the resulting text flows into the active AI CLI tab through the exact same path as if you typed it.


sequenceDiagram
  autonumber
  actor U as 👤 User
  participant K as Keyboard
  participant M as Microphone
  participant T0 as Tab 0 (Claude)
  participant T1 as Tab 1 (Codex)

  par keyboard channel
    U->>K: writing code
    K->>T0: "refactor this function"
    T0-->>U: diff response
  and voice channel
    U->>M: "summarize today's PRs"
    M->>T1: Whisper transcribe → auto-typed
    T1-->>U: voice reply (P3)
  end
  Note over T0,T1: two AI dialogs run in parallel<br>one human supervises both

The key idea is dual multitasking. While one terminal is occupied by the keyboard (writing code, reviewing a Claude diff), the second terminal can run hands-free via voice. Two AI conversations move in parallel and one human supervises both. AIMode's between-models tikitaka now extends to a between-your-own-input-channels (hands/mouth) tikitaka.

The stack matters for non-English users.

Whisper.net offline STT, GGML small (~466 MB) / medium (~1.5 GB) cached locally. Korean-strong turbo variants can be plugged in separately.

CPU + Vulkan dual bundle — the Vulkan backend is cross-vendor: AMD, Intel, NVIDIA all run on the same binary. No CUDA lock-in.

Even on multi-GPU laptops, Auto picks the dGPU correctly; if it picks wrong, manual override is one click.

TTS output is in progress — Windows SAPI / OpenAI tts-1 plumbing is in, but the response-streaming pipeline is staged.

The voice-output model will follow the same strategy — pick the best free models per language and combine them. Don't force cloud lock-in.

6. WASM plugins — bring your own feature

Another experiment in AgentZero Lite is sandboxed WASM plugins.


flowchart LR
  classDef host fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
  classDef sand fill:#312e81,stroke:#a855f7,color:#e9d5ff
  classDef cap fill:#064e3b,stroke:#10b981,color:#a7f3d0

  subgraph host["AgentZero host"]
    direction TB
    LLM["LocalLLM API"]:::cap
    Term["Terminal Actor API"]:::cap
    Note["NOTE Panel API"]:::cap
  end

  subgraph sandbox["WASM sandbox"]
    direction TB
    Plug["Voice Note plugin<br>(manifest + html + js)"]:::sand
  end

  Plug -.explicit grant.-> LLM
  Plug -.explicit grant.-> Term
  Plug -.explicit grant.-> Note
  Plug -.no default access.-> X1[("✗ filesystem")]
  Plug -.no default access.-> X2[("✗ network")]

Some background. In 2026 WebAssembly settled in as the standard isolation layer for AI agents. NVIDIA, Cloudflare, Vercel, E2B, Firecrawl all shipped WASM-based sandboxes. The reason is simple — WASM modules are powerless by default. No filesystem, no network, no external resources. The host has to grant capabilities explicitly. Containers and microVMs struggle to give the same isolation.

AgentZero Lite picked the same philosophy. Users can build their own features and plug them in.

Pull on-device AI capabilities that AgentZeroCLI exposes — call LocalLLM from inside a plugin.

The first official plugin is Voice Note — STT-driven voice journaling, VAD-gated capture, sensitivity slider, pause/resume, LLM summary.

A new plugin needs only manifest.json + index.html + *.js/*.css. Install in one click via Install Plugin Picker under the AgentBot [+] menu.

"Users build the features they need themselves" is more natural in the LLM era. AI got good at writing code, so the cost of making one plugin is collapsing.

7. Harness view — the framework grows alongside

Built in parallel with AgentZero Lite is the harness harness frame. As a bonus, the Note panel's harness view lets you peek at the framework's tech elements directly.

Wiring an LLM into a useful tool chain is hard. Reasoning from scratch every time repeats the same mistakes. The harness makes the learning loop explicit.


harness/
├── agents/        — expert evaluators (security-guard, code-coach, test-sentinel, tamer)
├── engine/        — workflows (release-build-pipeline, pre-commit-review)
├── knowledge/     — domain notes (LLM prompt conventions, tool-calling survey)
└── logs/          — Mode 3 reviews, RCAs, evaluations all pinned here

There are four feedback loops.

Unit-test feedback — T1G..T7G live + headless TestKit suite catches regressions in protocol/state machines.

Live-run logs — %LOCALAPPDATA%\AgentZeroWpf\logs\app-log.txt captures every Reactor turn, peer signal, JSON parse failure.

Mode 3 RCA logs — per-regression dated post-mortems (symptom, root cause, patch, evaluation, deferred follow-ups).

Human as reviewer — accept or course-correct the harness's suggestions. Not "the AI does it all" but pair-programming with an iterative improver.

The AIMode prompt got revised six times in this iteration alone — one-cycle rule, anti-passivity, anti-refusal, handshake separation, peer-signal triggers, ID scheme as string. Each attempt logged in the same Mode 3 doc, recording what failed and why the next attempt fixed it. The harness is the memory of those attempts, so the same mistakes don't repeat.

8. The on-device LLM era kicked off by open-model Gemma

The pool of LocalLLM candidates AgentZero Lite can plug in exploded over a year.

On-device LLM timeline (by release date):

2025-03 — Gemma 3 (1B/4B/12B/27B). 140+ languages, laptop · workstation target.

2025-08 — GPT-OSS 20B/120B (MXFP4 4-bit MoE). 20B runs on a 16 GB RAM edge.

2025 — Gemma 3n (PLE-based RAM saving). Mobile · handheld optimized.

2026-04 — Gemma 4 (small 128K, medium 256K context). Optimized for laptop local execution.

2026-04 — NVIDIA Nemotron 3 Nano Omni (30B-A3B Hybrid Mamba-MoE, 256K). Vision+audio+text, 9× throughput.

The simple question to ask: "Can a single decent GPU run these on-device?"

The answer is partially yes. GPT-OSS 20B runs on 16 GB edge. The 4-bit quantized Nemotron 3 Nano Omni handles multimodal in 25 GB VRAM. The cost of running an assistant model on your own machine has dropped to single-digit-thousand-dollar GPU territory.

Heavy reasoning is still a cloud-model game. But the assistant role is clearly viable on-device. AIMode's Gemma 4 sits right in that seat.

9. Will tokens really get cheaper than air?

A market mood check is in order. The assumption that tokens get infinitely cheaper is wobbling.

⚠️

Anthropic — Claude Pro/Max plans got weekly limits (introduced 2025-08-28). 5-hour rolling window + weekly cap + separate Opus cap. Peak hours (05:00–11:00 PT, 13:00–19:00 GMT) get extra session-limit adjustments. The company officially admits "infrastructure investment will take 12–24 months to translate into capacity" — limits stay tight or get tighter for a while.

⚠️

GitHub Copilot — moves to usage-based billing on 2026-06-01. Every plan gets monthly GitHub AI Credits, deducted by token (input/output/cached). From 2026-04-20 new sign-ups for Pro/Pro+/student plans are paused. The reason: "agentic workflows pushed infrastructure cost up significantly."

❓

Agent serverless cost — once a harness framework juggles N agents, nobody knows yet whether the CPU/memory pattern ends up cheaper or more expensive than AWS serverless. Nobody has a final answer.

In short — this is a supplier-power market. Promises get rescinded overnight. Consumers have few choices.

10. Hybrid strategy — vendor-lock-free, choice belongs to the consumer

AgentZero Lite's answer is plain. Hybrid mode. Combine free/on-device models where useful and delegate heavy reasoning to bigger cloud models. No vendor lock, the user keeps the choice.


flowchart LR
  classDef cloud fill:#1e3a8a,stroke:#3b82f6,color:#dbeafe
  classDef local fill:#064e3b,stroke:#10b981,color:#a7f3d0
  classDef user fill:#7c2d12,stroke:#f97316,color:#fed7aa

  user(("👤 User<br>holds the choice")):::user

  subgraph cloud["Cloud (vendor)"]
    direction TB
    Cl["Claude"]:::cloud
    Cx["Codex"]:::cloud
    Gm["Gemini"]:::cloud
  end

  subgraph local["On-device (free)"]
    direction TB
    G4["Gemma 4"]:::local
    Wh["Whisper STT"]:::local
    Os["GPT-OSS"]:::local
  end

  user -->|"heavy reasoning"| cloud
  user -->|"PM · summary · STT"| local
  cloud <-.AI ↔ AI talk.-> local
  user -->|"WASM self-built plugin"| W["Voice Note etc."]:::local

Concretely how:

Multi-CLI — Claude, Codex, Gemini CLI all run side by side. A future CLI agent can slot into the same multi-view with almost no extra work.

On-device ↔ cloud routing — AIMode's LocalLLM, acting as PM, decides where to send each task. Heavy reasoning → Claude, quick summary → Gemma. Agent usage cost stops being a vendor-bound thing and becomes the consumer's choice.

Free voice-model combos — STT via Whisper, TTS via the best Korean-friendly model. Pick each separately.

WASM plugins — internal tools and personal workflows: build them yourself. No vendor approval needed for a feature add.

Harness view — see how each agent is actually working. Zero black box.

PlayGround value. AgentZero Lite is not aiming to "replace everything with on-device". It's a testbed for the partial-adoption hybrid strategy where small models step in for specific strengths. Not competing with Claude Code.

11. Closing — the agent serverless market is the real war

Stepping back — what we're watching is more than a crisis of the IDE.

Skill / MCP specs are interfaces. Anthropic created and opened them. By themselves they're a thin billing surface.

Harness frame + agent serverless is where the real money is. Tokens aren't all of the cost. Token × tool fee (agent serverless) stacks into a double-billing structure. Because a harness orchestrates N agents, the contest is system-level, not single-agent-level.

Benchmarks are losing discriminative power. Even Chinese models post higher LLM scores now. The real war happens in the agent serverless market.

A B2B billing surface. A harness framework is exactly where you can attach pricing.

In that environment, AgentZero Lite is a PlayGround for experimenting with cloud-bound AI plus on-device combinations.

IDEs keep getting heavier and pricier. Web/GUI tools layer in AI features and raise their subscription fees. Lighter attempts to control N things via CLI/TUI will keep coming.

CLIs, once a developer's exclusive turf, recently picked up serious users in high-end creative and product-planning camps. The user base is widening.

AgentZero Lite stakes its place in this AI upheaval as a PlayGround for validating the hybrid strategy of mixing diverse AI models (open × vendor) under user choice.

🚧 In progress · <https://blumn.ai/>

Appendix — sister repos worth a look

Sister repos

harness-kakashi — a single-player practice harness. A Naruto-themed sandbox to get the feel of harnesses.

pencil-creator — an experiment that uses a harness to auto-generate design-system templates.

memorizer-v1 — a vector-search-based agent memory MCP server. Slated to become the harness's shared memory.

DeskWeb — a qooxdoo-based Windows-XP-styled WebOS. Ships with 4 Claude Code Skills.

Sources

AgentZero Lite — <https://github.com/psmon/AgentZeroLite> (README · README-KR)

Anthropic tweaks Claude usage limits to manage capacity — The Register, 2026-03-26

GitHub Copilot is moving to usage-based billing — GitHub Blog

Gemma 3 — Google DeepMind, Gemma 4 model overview, Gemma 3n developer guide

Introducing gpt-oss — OpenAI

NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning

Sandboxing Agentic AI Workflows with WebAssembly — NVIDIA

The Rise of AI Terminal Multiplexers — Beam

tmux vs cmux — ice-ice-bear blog

wmux — Windows tmux alternative for AI agents

Whisper.net .Runtime.Vulkan — NuGet

🤖