How Actors Turn LLMs into Real Agents — AgentZero Lite Deep Dive (Part 2, EN)

📘

Series Note — This is Part 2 of two.

Part 1 — Multi-CLI · On-Device LLM · Hybrid Strategy (the big picture) →

AgentZero Lite — Bringing Multi-CLI and On-Device LLM to Windows (Part 1, EN)

Part 2 (this article): the engine room beneath. Akka.NET actor model, Gemma 4 + GBNF, AgentReactorActor FSM, STT/LLM/TTS three-layer ensemble

Two questions to carry into this article:

1. An LLM is a text-completion engine. So how did it ever become an Agent?

2. Speech-to-text, the LLM, and text-to-speech run as three separate models. So how do they appear to respond concurrently, like the OpenAI Realtime API?

Both have the same answer: the Actor model.

1. An LLM is a genius with no office

Over the last year we kept making LLMs smarter. But the moment you try to make one actually do work, things get oddly frustrating. GPT-OSS, Gemma 4, Nemotron Nano Omni — all great, yet none of them, by themselves, can "send a command to a terminal, wait for the result, then decide what to do next."

That's the point. An LLM is a text-completion engine. Feed in a prompt, it emits the next token. It cannot call tools, read files, or remember yesterday's conversation. It's like a genius without an office — no desk, no phone, no colleagues.

Then who is making ChatGPT's o3 call tools, Claude Code edit files, Gemini CLI run builds? The answer is the runtime around the model. That runtime takes the LLM's output, executes tools, pushes the results back into the LLM's context, and stops when it should.

How should that runtime be built? That's where the Actor model comes in.

2. The Actor model in 30 minutes — what's different from objects?

The term Actor comes from Carl Hewitt, who introduced the actor model in 1973. Bottom line: an actor is a tiny computer with a mailbox. It talks only through messages and keeps its state to itself. It looks like an object but differs in decisive ways.


flowchart LR
  classDef obj fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
  classDef act fill:#312e81,stroke:#a855f7,color:#e9d5ff

  subgraph object["Object"]
    direction TB
    Caller1["Caller"]
    Caller1 -->|"obj.foo() direct call<br>sync + lock"| Obj["📦 Object<br>(shared state)"]:::obj
    Obj -.shared var.-> Caller1
  end

  subgraph actor["Actor"]
    direction TB
    Caller2["Sender"]
    Caller2 -->|"actor.Tell(msg)<br>async, returns now"| MB["📬 Mailbox"]:::act
    MB -->|"one at a time"| Act["🤖 Actor<br>(isolated state)"]:::act
    Act -.replies via msg only.-> Caller2
  end

Object vs Actor — six axes of difference

Call style — Object: obj.foo() direct method call. Actor: actor.Tell(msg) message send.

Synchrony — Object: usually synchronous, caller waits. Actor: async, drop into the mailbox and return immediately.

Concurrency — Object: hand-managed lock / mutex / volatile. Actor: mailbox processes one at a time — no lock needed.

State — Object: directly accessible from outside (fields, getters). Actor: lives only inside the actor — outside touches it only via messages.

Failure handling — Object: try-catch propagating up the call stack. Actor: parent supervises children (Restart / Resume / Stop).

Location — Object: same process only. Actor: same machine or remote — call sites unchanged.

Compress it to one line: an actor is "a lightweight process with no shared mutable state." That's why it needs no locks, why lock-ordering deadlocks disappear (actors that block waiting on each other's replies can still stall), and why one actor's crash leaves its neighbors alive.
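The mailbox idea fits in a few lines of plain .NET. The sketch below is not Akka.NET's API: `MiniActor<TMsg>` and `CounterActor` are hypothetical names, and a `System.Threading.Channels` channel stands in for the mailbox. Because a single loop drains the channel, the counter's private field is updated without any lock even when many threads call `Tell` at once.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical minimal actor, NOT Akka.NET: a channel is the mailbox and a single
// reader loop processes one message at a time.
public sealed class MiniActor<TMsg>
{
    private readonly Channel<TMsg> _mailbox = Channel.CreateUnbounded<TMsg>(
        new UnboundedChannelOptions { SingleReader = true });
    private readonly Task _loop;

    public MiniActor(Action<TMsg> receive)
    {
        // Only this loop ever invokes receive, so state it touches is never accessed concurrently.
        _loop = Task.Run(async () =>
        {
            await foreach (var msg in _mailbox.Reader.ReadAllAsync())
                receive(msg);
        });
    }

    // Tell: drop the message into the mailbox and return immediately.
    public void Tell(TMsg msg) => _mailbox.Writer.TryWrite(msg);

    // Stop accepting messages and wait until the mailbox has been drained.
    public Task StopAsync()
    {
        _mailbox.Writer.Complete();
        return _loop;
    }
}

// Isolated state: the count lives inside the actor and is read only after shutdown.
public sealed class CounterActor
{
    private int _count;                 // no lock, no Interlocked: one message at a time
    private readonly MiniActor<int> _actor;

    public CounterActor() => _actor = new MiniActor<int>(delta => _count += delta);

    public void Tell(int delta) => _actor.Tell(delta);

    public async Task<int> StopAndGetCountAsync()
    {
        await _actor.StopAsync();
        return _count;
    }
}
```

Calling `Tell(1)` from a 1,000-iteration `Parallel.For` and then awaiting `StopAndGetCountAsync()` returns exactly 1000: the writers only race on the thread-safe channel, never on `_count`.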

One more thing: an actor can change its behavior dynamically, at runtime. In Akka.NET a single call, Become(behavior), swaps in a new message handler, which turns the actor into a Finite State Machine (FSM). The same actor handles different messages while idle than while thinking.
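`Become` is easiest to see with the current behavior held as a swappable delegate. This is a hypothetical model (`BecomingActor`, `ThinkingAgent`, and the two message records are illustrative names), not Akka.NET's `ReceiveActor`, but it shows the same idea: while idle the agent accepts only `StartThinking`, and while thinking it refuses a second start until `FinishedThinking` arrives.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Become-style behavior switching, outside Akka.NET.
public abstract class BecomingActor
{
    private Func<object, bool> _behavior = _ => false;

    // Swap the handler used for every following message.
    protected void Become(Func<object, bool> behavior) => _behavior = behavior;

    // Returns false when the current behavior does not handle the message ("unhandled").
    public bool Receive(object msg) => _behavior(msg);
}

public sealed record StartThinking(string Prompt);
public sealed record FinishedThinking(string Answer);

public sealed class ThinkingAgent : BecomingActor
{
    public List<string> Log { get; } = new();
    public string State { get; private set; } = "";

    public ThinkingAgent() => BecomeIdle();

    private void BecomeIdle()
    {
        State = "idle";
        Become(msg =>
        {
            if (msg is not StartThinking start) return false;   // idle accepts only StartThinking
            Log.Add($"start: {start.Prompt}");
            BecomeThinking();
            return true;
        });
    }

    private void BecomeThinking()
    {
        State = "thinking";
        Become(msg =>
        {
            if (msg is not FinishedThinking done) return false; // a second start is refused here
            Log.Add($"done: {done.Answer}");
            BecomeIdle();
            return true;
        });
    }
}
```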

These properties (mailbox, messages, isolated state, supervision, Become, plus location transparency) each map to a capability an AI Agent needs. A quick mapping:

Actor essentials → AI Agent capabilities

Mailbox (one at a time) → execute exactly one tool safely at a time.

Message passing (async) → keep LLM inference, tool execution, and result feedback flowing together without blocking.

State isolation → each agent holds its own KV cache · conversation history · page context.

Supervision (Restart/Stop) → recover into a fresh context when the model's response breaks.

Become / FSM → the same agent transitions through chat mode → tool-using mode → wrap-up mode.

Location transparency → route on-device ↔ cloud ↔ another machine without code change.

This is no coincidence. Akka.io officially announced "Akka Agents" in late 2025 with the explicit pitch "actors are the natural runtime for stateful AI agents" (Akka Agents). Aaron Stannard sums it up: "There is a natural synergy between the Actor pattern and agentic AI" (Real-time Marketing Automation with Akka.NET). The convergence isn't accidental — both models solve the same problems the same way.

3. AgentZero Lite's actual actor topology

Now the code. AgentZero Lite spins up the following tree on top of Akka.NET.


flowchart TB
  classDef root fill:#1e3a8a,stroke:#3b82f6,color:#dbeafe
  classDef bot fill:#312e81,stroke:#a855f7,color:#e9d5ff
  classDef ws fill:#064e3b,stroke:#10b981,color:#a7f3d0
  classDef voice fill:#7c2d12,stroke:#f97316,color:#fed7aa

  Stage["/user/stage<br>StageActor<br>(top supervisor)"]:::root

  Bot["/user/stage/bot<br>AgentBotActor<br>(Chat / Key / AI)"]:::bot
  Reactor["/user/stage/bot/reactor<br>AgentReactorActor<br>(AIMODE FSM)"]:::bot

  Voice["/user/stage/voice<br>VoiceStreamActor"]:::voice
  STT["STT pool<br>SmallestMailbox"]:::voice
  TTS["TTS pool"]:::voice

  WS["/user/stage/ws-{name}<br>WorkspaceActor"]:::ws
  Term["/ws-*/term-{id}<br>TerminalActor<br>(one ConPTY)"]:::ws

  Stage --> Bot
  Bot --> Reactor
  Stage --> Voice
  Voice --> STT
  Voice --> TTS
  Stage --> WS
  WS --> Term

Each actor's responsibility in one line.

StageActor — supervises children's lifecycles. Holds the supervision strategy.

AgentBotActor — controller for user input. Switches modes Chat ↔ Key ↔ AI via Become().

AgentReactorActor — AIMODE inference FSM. Runs exactly one cycle (send→wait→read→done) at a time.

VoiceStreamActor — owns the Akka.Streams INPUT/OUTPUT graph. Routes STT/TTS worker pools.

WorkspaceActor / TerminalActor — a WorkspaceActor groups one workspace's terminals; each TerminalActor wraps one real ConPTY session.

The supervision strategy is interesting. Code from StageActor (Project/ZeroCommon/Actors/StageActor.cs:132-143).


protected override SupervisorStrategy SupervisorStrategy()
{
    return new OneForOneStrategy(
        maxNrOfRetries: 5,
        withinTimeRange: TimeSpan.FromMinutes(1),
        localOnlyDecider: ex => ex switch
        {
            ArgumentException        => Directive.Resume,    // ignore bad messages
            NullReferenceException   => Directive.Restart,   // restart on common bugs
            _                        => Directive.Escalate    // hand up if unknown
        });
}

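This block, by itself, is the AI Agent's resilience policy. With OneForOneStrategy, only the child that threw receives the directive, so when one agent fails inside a tool call its siblings are unaffected. A child that needs more than five restarts within one minute is stopped instead of restarted again. And which exception class warrants survival vs death is spelled out in code, the opposite of "the AI system dies quietly."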
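The retry budget in `OneForOneStrategy(maxNrOfRetries: 5, withinTimeRange: 1 minute)` is worth spelling out. The following plain-C# model is an assumption about the bookkeeping, not Akka.NET's implementation: `OneForOneSketch` and `SketchDirective` are hypothetical names. It reuses the decider mapping from `StageActor`, counts restarts per child inside a sliding window, and returns Stop once a child exceeds five restarts in a minute. In real Akka.NET a stopped child is gone; here the method only reports the decision.

```csharp
using System;
using System.Collections.Generic;

public enum SketchDirective { Resume, Restart, Stop, Escalate }

// Hypothetical model of a one-for-one policy's bookkeeping (not Akka.NET's code).
public sealed class OneForOneSketch
{
    private readonly int _maxRetries;
    private readonly TimeSpan _window;
    private readonly Dictionary<string, Queue<DateTime>> _restarts = new();

    public OneForOneSketch(int maxRetries, TimeSpan window)
    {
        _maxRetries = maxRetries;
        _window = window;
    }

    // Same exception-to-directive mapping as the StageActor decider.
    private static SketchDirective Decide(Exception ex) => ex switch
    {
        ArgumentException => SketchDirective.Resume,
        NullReferenceException => SketchDirective.Restart,
        _ => SketchDirective.Escalate,
    };

    // One-for-one: only the failing child's restart history is consulted or changed.
    public SketchDirective HandleFailure(string child, Exception ex, DateTime now)
    {
        var directive = Decide(ex);
        if (directive != SketchDirective.Restart) return directive;

        if (!_restarts.TryGetValue(child, out var times))
        {
            times = new Queue<DateTime>();
            _restarts[child] = times;
        }

        // Forget restarts that have slid out of the time window.
        while (times.Count > 0 && now - times.Peek() > _window) times.Dequeue();

        if (times.Count >= _maxRetries) return SketchDirective.Stop;   // budget used up
        times.Enqueue(now);
        return SketchDirective.Restart;
    }
}
```

Five NullReferenceExceptions from the same child within a minute each yield Restart, the sixth yields Stop, and a sibling failing at that same moment still gets Restart.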

4. Four steps to turn an LLM into an Agent — code walk

The real point. To turn a text-completion engine into an Agent that acts, you need exactly four moves.


flowchart LR
  classDef step fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
  S1["1. Output constraint<br>(GBNF)"]:::step
  S2["2. Tool execution"]:::step
  S3["3. Result injection<br>(KV cache)"]:::step
  S4{{"done ?"}}:::step
  Out["reply to user"]:::step

  S1 --> S2 --> S3 --> S4
  S4 -- "no, continue" --> S1
  S4 -- "yes" --> Out

That loop is the generate → act → observe cycle, and an Agent is whatever drives that loop through one full round per turn. Let's see how AgentZero implements each step.

4-1. Force the output to a tool call — GBNF

When an LLM emits free prose, the host can't parse it. Asking "please answer in JSON" via prompt is just prompt engineering, not enforcement at the sampler level. A flaky model breaks it.

GBNF (GGML BNF) blocks this at the sampling stage. At every token, the grammar masks the next-token distribution down to tokens it allows. Free prose becomes impossible at the token level.


flowchart LR
  classDef raw fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
  classDef mask fill:#7c2d12,stroke:#f97316,color:#fed7aa
  classDef out fill:#064e3b,stroke:#10b981,color:#a7f3d0

  LLM["LLM logits<br>(full vocab dist)"]:::raw
  Mask["GBNF grammar mask<br>(only allowed tokens pass)"]:::mask
  Sample["Sampler"]:::raw
  Out["next token<br>(JSON shape guaranteed)"]:::out
  X["✗ free-prose token<br>(blocked)"]:::mask

  LLM --> Mask --> Sample --> Out
  Mask -.block.-> X
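The masking step can be shown with a toy vocabulary. In this sketch the "grammar" is just a fixed list of allowed outputs, a deliberate simplification: a real GBNF sampler in llama.cpp tracks parser state over the context-free rules instead. `GrammarMaskSketch` and its toy model are hypothetical. Tokens that cannot extend the text toward an allowed output get a logit of negative infinity, so even a model that strongly prefers prose can only emit valid JSON.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy constrained decoding (hypothetical). The "grammar" is a finite list of allowed
// outputs; a real GBNF sampler tracks parser state over context-free rules instead.
public static class GrammarMaskSketch
{
    // Set the logit of every token that cannot lead to an allowed output to -infinity.
    public static double[] Mask(string soFar, string[] vocab, double[] logits, IReadOnlyList<string> allowed)
    {
        var masked = new double[logits.Length];
        for (int i = 0; i < vocab.Length; i++)
        {
            var candidate = soFar + vocab[i];
            bool viable = allowed.Any(a => a.StartsWith(candidate, StringComparison.Ordinal));
            masked[i] = viable ? logits[i] : double.NegativeInfinity;
        }
        return masked;
    }

    // Greedy decoding: take the best surviving token until the text is a complete allowed output.
    public static string Decode(string[] vocab, Func<string, double[]> model,
                                IReadOnlyList<string> allowed, int maxSteps = 32)
    {
        var text = "";
        for (int step = 0; step < maxSteps && !allowed.Contains(text); step++)
        {
            var masked = Mask(text, vocab, model(text), allowed);
            int best = Array.IndexOf(masked, masked.Max());
            if (double.IsNegativeInfinity(masked[best]))
                throw new InvalidOperationException("no token can continue a valid output");
            text += vocab[best];
        }
        return text;
    }
}
```

With the vocabulary `Sure`, `!`, `{"tool":`, `"wait"`, `"done"`, `}` and a model that always scores `Sure` highest, greedy decoding against the allowed outputs `{"tool":"wait"}` and `{"tool":"done"}` yields `{"tool":"done"}`; the prose tokens never survive the mask.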

The core of AgentZero's GBNF lives in AgentToolGrammar.cs:184-200.


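root      ::= ( ws "{" ws "\"tool\"" ws ":" ws toolname ws ","
                ws "\"args\"" ws ":" ws args ws "}" ws )

toolname  ::= ( "\"list_terminals\"" | "\"read_terminal\""
              | "\"send_to_terminal\"" | "\"send_key\""
              | "\"wait\"" | "\"done\"" )

args      ::= ( "{" ws "}"
              | "{" ws kv (ws "," ws kv)* ws "}" )
kv        ::= string ws ":" ws value
value     ::= string | integer | boolean

# Multi-line rules are wrapped in parentheses: outside a group, GBNF ends a rule at a line break.
# Helper rules below are assumed (modeled on llama.cpp's json.gbnf); the excerpt uses them without showing them.
ws        ::= ([ \t\n] ws)?
string    ::= "\"" ( [^"\\] | "\\" ["\\/bfnrt] )* "\""
integer   ::= "-"? [0-9]+
boolean   ::= "true" | "false"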

That is the entire output shape: one JSON object with a tool name and flat args. Because Gemma 4 lacks Llama-3.1-style native tool-calling SFT, GBNF is the most reliable way to get well-formed tool calls out of Gemma.

There are exactly six tools.

Six-tool surface

list_terminals — no args → list current workspace's tab catalog.

read_terminal — {group, tab, last_n} → last N bytes of a tab's scrollback.

send_to_terminal — {group, tab, text} → send text.

send_key — {group, tab, key} → control key (cr/lf/esc/tab/ctrlc, etc.).

wait — {seconds: 1..30} → wait for response.

done — {message} → end loop + final message to user.

These six cover every interaction AgentZero needs with a terminal AI. The smaller the tool surface, the less room an LLM has to get confused.
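The grammar guarantees shape, not meaning: its `args` rule accepts any flat key/value pairs, so the host still has to check required keys and ranges such as `wait`'s 1..30 seconds. Here is a minimal sketch with `System.Text.Json`, assuming the argument names listed above; `ToolCallSketch` and `ToolCallValidator` are hypothetical and are not AgentZero's `ParseToolCall`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public sealed record ToolCallSketch(string Tool, Dictionary<string, JsonElement> Args);

// Hypothetical host-side validation of a grammar-shaped tool call.
public static class ToolCallValidator
{
    // Required argument names per tool, taken from the six-tool list (exact names assumed).
    private static readonly Dictionary<string, string[]> Required = new()
    {
        ["list_terminals"] = Array.Empty<string>(),
        ["read_terminal"] = new[] { "group", "tab", "last_n" },
        ["send_to_terminal"] = new[] { "group", "tab", "text" },
        ["send_key"] = new[] { "group", "tab", "key" },
        ["wait"] = new[] { "seconds" },
        ["done"] = new[] { "message" },
    };

    public static ToolCallSketch Parse(string rawJson)
    {
        using var doc = JsonDocument.Parse(rawJson);
        var root = doc.RootElement;
        var tool = root.GetProperty("tool").GetString()
                   ?? throw new FormatException("tool must be a string");

        var args = new Dictionary<string, JsonElement>();
        foreach (var p in root.GetProperty("args").EnumerateObject())
            args[p.Name] = p.Value.Clone();          // Clone so values outlive the JsonDocument

        if (!Required.TryGetValue(tool, out var keys))
            throw new FormatException($"unknown tool '{tool}'");
        foreach (var k in keys)
            if (!args.ContainsKey(k)) throw new FormatException($"{tool} requires '{k}'");

        // Reject non-integers and values outside the documented 1..30 range.
        if (tool == "wait" &&
            (args["seconds"].ValueKind != JsonValueKind.Number ||
             !args["seconds"].TryGetInt32(out var s) || s < 1 || s > 30))
            throw new FormatException("wait.seconds must be an integer in 1..30");

        return new ToolCallSketch(tool, args);
    }
}
```

A failed check can be fed back to the model as the next tool result instead of crashing the loop, which keeps the recovery path inside the same generate → act → observe cycle.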

4-2. Execute the tool — AgentToolLoop's main loop

When the GBNF-constrained JSON arrives, parsing it and calling the real tool is the job of AgentToolLoop.RunAsync (Llm/Tools/AgentToolLoop.cs:66-168).


public async Task<AgentToolSession> RunAsync(string userRequest, CancellationToken ct)
{
    var turns = new List<ToolTurn>();                        // completed tool turns for this run
    for (var iter = 0; iter < _opts.MaxIterations; iter++)   // default 12
    {
        ct.ThrowIfCancellationRequested();

        // (1) first turn = system prompt + user; later turns inject prior tool_result only
        var turnInput = (iter == 0)
            ? FormatFirstTurn(userRequest)
            : FormatToolResultTurn(turns[^1].ToolResult);

        // (2) GBNF-forced single JSON
        var rawJson = await GenerateOneTurnAsync(turnInput, ct);
        var call = ParseToolCall(rawJson);

        // (3) done signal → end loop
        if (call.Tool == "done") return new AgentToolSession(turns, call.Args["message"], true);

        // (4) execute tool + record turn
        var toolResult = await ExecuteToolAsync(call, ct);
        turns.Add(new ToolTurn(call, toolResult));
        _opts.OnTurnCompleted?.Invoke(turns[^1]);    // ← UI callback
    }
    return new AgentToolSession(turns, "max iterations", false);
}

Two key details.

MaxIterations = 12. A send → wait → read round costs 3 calls, so 12 iterations allow about four rounds. That cap alone bounds the runaway scenario where an LLM keeps repeating the same tool. Separately, ToolLoopGuards catches repeated calls in two stages: stage 1 feeds an error back to the model, stage 2 hard-stops.

OnTurnCompleted callback. Every time a tool runs, the callback sends TurnCompletedInternal(turn) to AgentReactorActor's own reference (Self), so the loop reports progress through the actor's mailbox. The actor relays that progress onward, and the UI renders it.
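Here is a hypothetical take on the two-stage guard. The thresholds (the first identical repeat feeds back an error, the second stops the run) and the names `RepeatGuardSketch` and `GuardVerdict` are assumptions; only the two-stage behavior comes from the description of ToolLoopGuards.

```csharp
// Hypothetical sketch, not AgentZero's ToolLoopGuards: thresholds and names are assumed.
public enum GuardVerdict { Ok, FeedBackError, HardStop }

public sealed class RepeatGuardSketch
{
    private string _lastSignature = "";
    private int _repeats;

    // Call once per tool call; compares it with the immediately preceding call.
    public GuardVerdict Check(string tool, string argsJson)
    {
        var signature = tool + "|" + argsJson;
        _repeats = signature == _lastSignature ? _repeats + 1 : 0;   // consecutive identical calls
        _lastSignature = signature;

        return _repeats switch
        {
            0 => GuardVerdict.Ok,
            1 => GuardVerdict.FeedBackError,   // stage 1: tell the model it is repeating itself
            _ => GuardVerdict.HardStop,        // stage 2: end the run
        };
    }
}
```

In a loop like RunAsync, FeedBackError would become the next tool_result text, and HardStop would end the run the same way hitting MaxIterations does.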

4-3. Preserve results in the KV cache

"With many turns, resending the system prompt every time blows up tokens" — exactly. So we create one LLamaContext from LLamaSharp and keep it alive for the loop instance's lifecycle (AgentToolLoop.cs:45-59).


var (weights, modelParams) = llm.GetInternals();
_context  = weights.CreateContext(modelParams);  // ← KV cache here
_executor = new InteractiveExecutor(_context);
_grammar  = new Grammar(AgentToolGrammar.Gbnf, "root");

The _isFirstUserSend flag injects the full system prompt only on the very first turn; later turns inject only the previous tool result, and the KV cache holds everything in between. So one loop run covers one cycle, and carrying memory from one cycle to the next is the KV cache's job.
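What changes between turns is only the text that gets sent. Here is a minimal sketch of the `_isFirstUserSend` idea, assuming a hypothetical `TurnFormatterSketch` and placeholder `[system]`/`[user]`/`[model]` markers rather than Gemma's real chat template. The persistent LLamaContext, not this class, is what remembers earlier turns.

```csharp
using System.Text;

// Hypothetical sketch of the _isFirstUserSend idea. The [system]/[user]/[model] markers are
// placeholders, not Gemma's chat template. The long-lived context (its KV cache) remembers
// earlier turns; this class only decides which new text gets sent.
public sealed class TurnFormatterSketch
{
    private readonly string _systemPrompt;
    private bool _isFirstUserSend = true;

    public TurnFormatterSketch(string systemPrompt) => _systemPrompt = systemPrompt;

    // First call: system prompt + user request. Later calls: only the newest tool result.
    public string Next(string content)
    {
        var sb = new StringBuilder();
        if (_isFirstUserSend)
        {
            sb.Append("[system]\n").Append(_systemPrompt).Append('\n');
            _isFirstUserSend = false;
        }
        sb.Append("[user]\n").Append(content).Append("\n[model]\n");
        return sb.ToString();
    }
}
```

The saving grows with the loop: a long system prompt is prefilled once per loop instance instead of once per turn.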

4-4. ONE CYCLE PER RUN — the philosophy that blocks runaway

One principle is repeated emphatically in AgentZero's system prompt (AgentToolGrammar.cs:97-101).

CRITICAL principle — ONE CYCLE PER RUN, BUT DO THE CYCLE. Each tool chain run = ONE complete round trip with the terminal AI: send_to_terminal → wait → read_terminal → react → done. Subsequent cycles are triggered by the user OR an arriving peer signal. The KV cache preserves history across runs.

Translation: don't ask the LLM to script a five-turn debate as one giant tool chain. Each run completes one cycle and stops. The next cycle is triggered by the user pressing again, or by a peer sending a signal.

Why this rule? If you stuff N cycles into one run, then when the LLM misreads a response or starts repeating the same tool, the runaway happens inside that single run, out of reach of any outside input. Cap each run at one cycle and the start of the next run becomes a natural checkpoint: the user can step in, another peer can signal, or the system can simply stop.

That's the actor model's essence: "one message = one unit of work." A mailbox that hands over one message at a time is one of the best tools we have for keeping an LLM from emitting tokens forever.

5. AgentReactorActor — the LLM that becomes an FSM

AgentToolLoop alone can't reconcile long-running async inference with the UI. While tokens stream you need to draw progress, and when the user hits Cancel you need to stop immediately. That's AgentReactorActor's job.

This actor is a two-state FSM (Actors/AgentReactorActor.cs:50-226).


stateDiagram-v2
  [*] --> Idle
  Idle --> Running: StartReactor(userRequest)<br>BecomeRunning()
  state Running {
    [*] --> Thinking
    Thinking --> Generating: prefill ends
    Generating --> Acting: tool_call arrives<br>OnTurnCompleted
    Acting --> Generating: tool_result injected
  }
  Running --> Idle: RunCompletedInternal<br>BecomeIdle()
  Running --> Idle: RunFailedInternal<br>BecomeIdle()
  Running --> Running: CancelReactor<br>(_cts.Cancel)

Idle: waits for a StartReactor(userRequest) message. On receipt, fires Task.Run(() => loop.RunAsync(...)).PipeTo(Self) to launch the loop async, then BecomeRunning().

Running: the loop posts internal messages to the actor through the OnGenerationProgress / OnTurnCompleted callbacks. The actor relays them to its parent as ReactorProgress(Phase, Round, Tokens, ToolCall), and the parent draws them in the UI.

End: when RunCompletedInternal(session) arrives, BecomeIdle() returns the actor to Idle automatically.

Cancel: CancelReactor message → _cts.Cancel() → loop.RunAsync exits via OperationCanceledException → RunFailedInternal → BecomeIdle().

The point: inference itself runs in a separate Task, but progress and termination signals all pass through the actor's mailbox. UI code just listens to actor messages — it doesn't hold the token stream itself.

That single line, Task.Run(...).PipeTo(Self), matters a lot. A plain await inside an Akka.NET actor's async handler suspends that actor's mailbox until the awaited task finishes, so nothing else (not even Cancel) gets processed during inference. PipeTo instead turns the Task's result into a message and drops it into the actor's mailbox when the task completes. That gives complete async separation: the actor keeps receiving other messages (e.g. Cancel) while the loop runs.
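The Idle/Running split plus `PipeTo(Self)` can be modeled without Akka.NET to show why Cancel still gets through. In this hypothetical `ReactorSketch`, a channel is the mailbox, the inference loop runs in its own `Task`, and a continuation posts `RunCompleted` or `RunFailed` back into the mailbox, which is the role `PipeTo` plays. The message names mirror the article's but are simplified stand-ins, and dropping a StartReactor that arrives while Running is an assumed policy.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Simplified stand-ins for the article's messages (names mirror it; shapes are assumed).
public sealed record StartReactor(string Request);
public sealed record CancelReactor();
public sealed record RunCompleted(string Result);
public sealed record RunFailed(Exception Error);

// Hypothetical two-state reactor in plain .NET, not Akka.NET.
public sealed class ReactorSketch
{
    private readonly Channel<object> _mailbox = Channel.CreateUnbounded<object>();
    private readonly Func<string, CancellationToken, Task<string>> _runLoop;
    private CancellationTokenSource? _cts;

    public string State { get; private set; } = "Idle";

    // Completed with the RunCompleted/RunFailed message that ends the first run (test hook).
    public TaskCompletionSource<object> Finished { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public ReactorSketch(Func<string, CancellationToken, Task<string>> runLoop)
    {
        _runLoop = runLoop;
        _ = Task.Run(() => ProcessMailboxAsync());   // single consumer = one message at a time
    }

    public void Tell(object msg) => _mailbox.Writer.TryWrite(msg);

    private async Task ProcessMailboxAsync()
    {
        await foreach (var msg in _mailbox.Reader.ReadAllAsync())
        {
            switch ((State, msg))
            {
                case ("Idle", StartReactor start):
                    _cts = new CancellationTokenSource();
                    var token = _cts.Token;
                    // PipeTo analogue: don't await the loop here; post its outcome back as a message.
                    _ = Task.Run(() => _runLoop(start.Request, token)).ContinueWith(t =>
                        Tell(t.IsCompletedSuccessfully
                            ? (object)new RunCompleted(t.Result)
                            : new RunFailed(t.Exception?.GetBaseException() ?? new OperationCanceledException())));
                    State = "Running";                  // BecomeRunning()
                    break;

                case ("Running", CancelReactor _):
                    _cts?.Cancel();                     // the loop observes the token and ends
                    break;

                case ("Running", RunCompleted or RunFailed):
                    _cts?.Dispose();
                    _cts = null;
                    State = "Idle";                     // BecomeIdle()
                    Finished.TrySetResult(msg);
                    break;

                // Anything else (e.g. StartReactor while Running) is dropped in this sketch.
            }
        }
    }
}
```

Feeding it a loop that waits on `Task.Delay(Timeout.Infinite, ct)`, then sending `StartReactor` and `CancelReactor` back to back, ends the run as `RunFailed` with the reactor back in Idle: the Cancel message was processed while "inference" was still in flight.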

6. Why STT × LLM × TTS look like they respond at the same time

Now the second question: how do you make "respond while listening" work with three completely separate models, the way the OpenAI Realtime API does?

The OpenAI Realtime API carries audio in, audio out, and function calls as interleaved events over a single stateful WebSocket. That works because one unified speech model handles all of it.

AgentZero uses three completely separate models, with STT and the LLM running free and on-device.

STT — Whisper.net (~466 MB GGML, offline)

LLM — Gemma 4 (LLamaSharp, on-device)

TTS — OpenAI tts-1 / Windows SAPI (plumbing is in place; spoken output is being rolled out in stages)

The secret to making the three play together as one system is actor + Akka.Streams.

6-1. VoiceStreamActor owns Akka.Streams graphs


flowchart TB
  classDef in fill:#064e3b,stroke:#10b981,color:#a7f3d0
  classDef act fill:#312e81,stroke:#a855f7,color:#e9d5ff
  classDef out fill:#7c2d12,stroke:#f97316,color:#fed7aa

  Mic["🎤 Mic<br>NAudio MicFrame"]:::in
  Q["Source.Queue<br>(DropHead)"]:::in
  VAD["VoiceSegmenter<br>(VAD split)"]:::in
  STT["SttWorkerActor<br>SmallestMailboxPool"]:::in

  AB["AgentBotActor<br>(AIMODE active)"]:::act
  R["AgentReactorActor<br>(LLM inference)"]:::act
  Tools["Terminal Actors<br>(tool execution)"]:::act

  TQ["token Source.Queue"]:::out
  Chunk["SentenceChunker"]:::out
  TTS["TtsWorkerActor pool"]:::out
  Spk["🔊 Speaker"]:::out

  Mic --> Q --> VAD --> STT
  STT -->|"VoiceTranscriptReady"| AB
  AB -->|"StartReactor"| R
  R -->|"send_to_terminal"| Tools
  R -->|"final tokens"| TQ
  TQ --> Chunk --> TTS --> Spk

VoiceStreamActor materializes two Akka.Streams graphs at startup (Actors/VoiceStreamActor.cs:159-172).


// INPUT graph — MicFrame → VAD → STT → transcript
var materialized = Source.Queue<MicFrame>(cmd.MicBufferSize, OverflowStrategy.DropHead)
    .Via(VoiceSegmenterFlow.Create(vadCfg))             // VAD + segmentation
    .Async()
    .SelectAsync(parallelism, async (PcmSegment seg) =>
    {
        var reply = await sttPool.Ask<TranscribeReply>(   // ← delegate to STT pool
            new TranscribeRequest(seg, language),
            TimeSpan.FromSeconds(120));
        return new VoiceTranscriptReady(reply.Transcript, reply.DurationSeconds);
    })
    .Where(t => !string.IsNullOrWhiteSpace(t.Transcript))
    .ToMaterialized(sink, Keep.Left)
    .Run(_materializer);

That graph does the heavy lifting.

Source.Queue<MicFrame> accepts audio frames from the mic fire-and-forget, so the mic thread never blocks; when the buffer is full, DropHead discards the oldest frame.

VoiceSegmenterFlow uses VAD to cut out only the speaking region into PCM segments.

SelectAsync(parallelism, ...) sends an Ask to the STT worker pool once per segment, keeping up to parallelism requests in flight and emitting transcripts in the original segment order.

The STT pool uses SmallestMailboxPool routing to send work to the least-busy worker (Voice/Streams/SttWorkerActor.cs:78-80).

The result returns as a VoiceTranscriptReady message back to VoiceStreamActor's mailbox.

The sink speaks the Sink.ActorRefWithAck protocol: the receiving actor must reply with VoiceFrameAck before the stream delivers the next element. That's backpressure.
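Two knobs in the INPUT graph have close plain-.NET analogues. This sketch is an illustration, not Akka.Streams code: a bounded channel with `BoundedChannelFullMode.DropOldest` behaves like `OverflowStrategy.DropHead` (the writer is never refused; the oldest frame goes), and `PickSmallestMailbox` is a simplified version of SmallestMailbox routing. Frames are plain integers for brevity, and `VoiceInputSketch` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;

// Hypothetical plain-.NET analogues of two INPUT-graph knobs (not Akka.Streams code).
public static class VoiceInputSketch
{
    // Like Source.Queue(size, OverflowStrategy.DropHead): writes never wait or fail;
    // when the buffer is full, the oldest frame is discarded to make room.
    public static Channel<int> CreateMicBuffer(int capacity) =>
        Channel.CreateBounded<int>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
        });

    // Simplified SmallestMailbox routing: pick the worker with the fewest queued jobs
    // (ties go to the lowest index). Akka.NET's router also prefers idle routees and
    // deprioritizes remote ones; that is left out here.
    public static int PickSmallestMailbox(IReadOnlyList<int> pendingPerWorker)
    {
        int best = 0;
        for (int i = 1; i < pendingPerWorker.Count; i++)
            if (pendingPerWorker[i] < pendingPerWorker[best]) best = i;
        return best;
    }
}
```

Writing frames 1 through 5 into a capacity-3 buffer leaves 3, 4, 5, and pending counts of 4, 1, 3, 1 route the next job to worker 1. Dropping old audio rather than stalling suits a live mic: stale frames are worth less than fresh ones.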

6-2. STT → LLM → TTS message sequence

The whole sequence, from the user's perspective.


sequenceDiagram
  autonumber
  actor U as 👤 User
  participant Mic as 🎤
  participant V as VoiceStreamActor
  participant ST as STT pool
  participant B as AgentBotActor
  participant R as AgentReactorActor
  participant T as Terminal
  participant TT as TTS pool
  participant Sp as 🔊

  U->>Mic: "summarize today's PRs"
  Mic->>V: MicFrame stream
  V->>ST: TranscribeRequest (VAD segment)
  ST-->>V: TranscribeReply
  V->>B: VoiceTranscriptReady("today PRs ...")
  B->>R: StartReactor("...")
  R->>T: send_to_terminal(tab=1, "summarize PRs")
  R->>R: wait(5)
  R->>T: read_terminal(tab=1)
  T-->>R: "PRs summary..."
  R->>R: done(message)
  R->>V: SpeakResponse(token stream) [P3]
  V->>TT: SynthesizeRequest (sentence chunks)
  TT-->>V: audio chunks
  V->>Sp: playback

No step blocks another. The mic thread sees only the mic, STT workers only transcribe, the Reactor only watches the LLM, and TTS workers only synthesize. Everything connects through messages, and the stream stages (with the ack protocol at the sink) supply the backpressure.
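On the OUTPUT side, the SentenceChunker stage is what lets speech start before the reply is finished. Here is a hypothetical minimal version, `SentenceChunkerSketch`, assuming that sentence-ending punctuation or a newline is the only split rule; a production chunker would also handle abbreviations, decimals, and a maximum chunk length.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical sentence chunker: buffer streamed LLM tokens and release a chunk for TTS
// each time a sentence ends, so synthesis can start before the full reply exists.
public sealed class SentenceChunkerSketch
{
    private readonly StringBuilder _buffer = new();

    // Feed one token; returns the sentences it completed (possibly none).
    public IEnumerable<string> Push(string token)
    {
        var done = new List<string>();
        foreach (var ch in token)
        {
            _buffer.Append(ch);
            if (ch is '.' or '!' or '?' or '\n')
            {
                var sentence = _buffer.ToString().Trim();
                if (sentence.Length > 0) done.Add(sentence);
                _buffer.Clear();
            }
        }
        return done;
    }

    // Flush whatever is left when the token stream completes ("" if nothing remains).
    public string Flush()
    {
        var rest = _buffer.ToString().Trim();
        _buffer.Clear();
        return rest;
    }
}
```

Feeding the tokens "Three PRs", " merged", ". One is", " pending! Details", " follow" yields "Three PRs merged." and "One is pending!" as soon as each completes, and Flush() returns "Details follow".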

6-3. Compared to Realtime API — same outcome, different infra


flowchart LR
  classDef api fill:#1e3a8a,stroke:#3b82f6,color:#dbeafe
  classDef ens fill:#312e81,stroke:#a855f7,color:#e9d5ff

  subgraph realtime["OpenAI Realtime API"]
    direction TB
    WS["1 stateful WebSocket"]:::api
    Omni["unified omni model"]:::api
    WS <-->|"audio in/out + tool calls<br>interleaved"| Omni
  end

  subgraph actor1["AgentZero actor ensemble"]
    direction TB
    A1["STT actor<br>📬 mailbox"]:::ens
    A2["LLM actor<br>📬 mailbox"]:::ens
    A3["TTS actor<br>📬 mailbox"]:::ens
    A1 -.message.-> A2 -.message.-> A3
  end

OpenAI Realtime API vs AgentZero actor ensemble — seven axes

Transport — Realtime: 1 stateful WebSocket. AgentZero: actor messages + Akka.Streams.

Model — Realtime: 1 unified speech model (gpt-realtime). AgentZero: three separate models (STT, LLM, TTS).

Function call — Realtime: events on the same WebSocket. AgentZero: OnTurnCompleted → actor message.

Cost model — Realtime: token + audio-minute billing (vendor). AgentZero: on-device free + optional OpenAI TTS.

Barge-in — Realtime: server VAD turn detection. AgentZero: BargeIn message → cancel OUTPUT graph.

Extension — Realtime: add new events on the same socket. AgentZero: add new actor + message.

Failure isolation — Realtime: if the socket drops, everything drops. AgentZero: one worker dies, others stay alive.

Same user experience ("speak, and it listens and responds"), completely different infrastructure. What the Realtime API solves with one monolithic model, the actor system solves with small composable modules.

The trade-offs are clear.

Realtime API strength — lower latency, and one model that hears the speech itself rather than a transcript of it.

Actor ensemble strength — freedom of combination. You pick the STT model strongest in Korean, the best reasoning LLM, and the most natural TTS separately, and plug them in. When one vendor changes its policy, the rest stays alive. That's the infrastructure behind the vendor-lock-in-free point from Part 1.

7. Peer-signal bidirectional channel — why actors were born for remoting

One last piece, now in code: the mechanism from Part 1 where a peer terminal calls back to the bot directly.

Scenario: AIMODE sent "summarize" to the Claude tab and is now polling wait + read. But Claude wants to actively signal AgentBot the moment its own work is done. We'd like the next cycle to trigger immediately, before wait ends.

Here's how — the Claude tab runs this one line.


AgentZeroLite.exe -cli bot-chat "DONE(summary done)" --from Claude

The path that line takes into the actor system (CliHandler.cs:679-742 + MainWindow.xaml.cs:413-419, 796-819).


sequenceDiagram
  autonumber
  participant CT as Claude tab (external process)
  participant CLI as CliHandler.BotChat()
  participant MW as MainWindow
  participant AB as AgentBotActor
  participant R as AgentReactorActor

  CT->>CLI: bot-chat "DONE(summary done)" --from Claude
  CLI->>MW: WM_COPYDATA(0x414C "AL")
  MW->>AB: ActorSelection("/user/stage/bot")<br>.Tell(TerminalSentToBot("Claude", "summary done"))
  AB->>AB: check _activeConversations
  AB->>R: StartReactor("[from Claude] summary done")
  Note over R: even mid-wait, fires a fresh run<br>(not waking wait inside same run)

This is where the actor model shows why it was born for remoting. A signal sent by an external process (the Claude CLI) arrives not as a direct method call on an in-process object but as a message dropped into a mailbox, and the actor doesn't need to know where it came from. Whether it travels over WM_COPYDATA, gRPC, or Akka.Cluster, the same actor code receives it.
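The source-agnostic part is just a record and a couple of pure functions. In this hypothetical sketch, `TerminalSentToBot` mirrors the message in the sequence diagram, while the `KEYWORD(body)` payload parsing and the `Tag` helper are assumptions for illustration. It also shows where 0x414C comes from: the two ASCII bytes of "AL".

```csharp
using System.Text.RegularExpressions;

// Hypothetical sketch of the message side of a peer signal. Whatever the transport
// (WM_COPYDATA today, gRPC or Akka.Cluster later), it only has to deliver this record.
public sealed record TerminalSentToBot(string From, string Text);

public static class PeerSignalSketch
{
    // 'A' = 0x41 and 'L' = 0x4C, so the two-character tag "AL" packs into 0x414C.
    public static int Tag(string twoChars) => (twoChars[0] << 8) | twoChars[1];

    // Assumed payload shape KEYWORD(body), e.g. DONE(summary done); unwrap the body if present.
    public static TerminalSentToBot FromCli(string payload, string from)
    {
        var m = Regex.Match(payload, @"^\s*[A-Z_]+\((.*)\)\s*$", RegexOptions.Singleline);
        var text = m.Success ? m.Groups[1].Value : payload;
        return new TerminalSentToBot(from, text);
    }

    // The request AgentBotActor hands to the reactor, in the format the diagram shows.
    public static string ToReactorRequest(TerminalSentToBot msg) => $"[from {msg.From}] {msg.Text}";
}
```

`FromCli("DONE(summary done)", "Claude")` produces `TerminalSentToBot("Claude", "summary done")`, which becomes the reactor request `[from Claude] summary done`. Nothing in these functions knows which transport carried the payload.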

The AgentZeroRemote / AgentZeroCluster roadmap from Part 1 follows naturally from this. Tell works the same whether the target is in the same process or on another machine, with practically no code change. Systems built on actors grow naturally into distributed ones, so turning a single-machine IDE into a multi-machine AI assistant cluster looks more like a configuration change than a rewrite.

8. The hypothesis — Akka as the AI Agent's runtime

If you've followed this far, one thing should be visible. Almost everything that makes an AI Agent different from regular code, Actors already have.

AI Agent's hard parts → Actor's existing answers

LLM responses can break → Supervision (Restart / Resume / Stop).

The same tool can repeat infinitely → mailbox = one at a time + Become for state transitions.

Tools finish async → Tell + PipeTo, mailbox is the queue.

Many agents work concurrently → an actor is essentially a unit of concurrency.

Signals come from user / peer / self anywhere → message = source-agnostic.

Want to host the same agent remotely → Location transparency.

That's the same direction Akka.io is pursuing with its "Akka Agents" line, and that NVIDIA is exploring with WebAssembly-based sandboxing for agentic workflows. All of them converge on the same conclusion: the runtime around AI models, more than the models themselves, is the real infrastructure.

🎯

So what is AgentZero Lite actually doing?

On top of Akka.NET, mount one LLM (Gemma 4 + GBNF) and stick STT/TTS/terminal actors next to it as neighbors — that builds our own little Realtime API inside one desktop. Token cost: 0. Code: all open.

This is a small preview of the real AI infrastructure of the next five years.

9. Closing — two invitations

Junior developers. Developers who have hand-coded an actor at least once are starting to show different AI Agent design instincts from those who haven't. For C#, the official Akka.NET tutorial takes about a day. Working through the pattern once (send a message, seal off state, recover via supervision) is worth a semester of ordinary OOP.

Senior engineers. Take a look at the runtime layer of the AI Agent your company is building. If it's a giant while True inside a single function, it's worth one hour of prototyping to see how actor + FSM + supervision simplifies that code. AgentZero Lite's AgentReactorActor.cs is a good starting point.

Appendix — references

AgentZero Lite — <https://github.com/psmon/AgentZeroLite>

Project/ZeroCommon/Actors/StageActor.cs — actor supervision
Project/ZeroCommon/Llm/Tools/AgentToolGrammar.cs — GBNF + system prompt
Project/ZeroCommon/Llm/Tools/AgentToolLoop.cs — main loop + KV cache
Project/ZeroCommon/Actors/AgentReactorActor.cs — FSM
Project/ZeroCommon/Actors/VoiceStreamActor.cs — STT/TTS Akka.Streams

Akka Agents announcement — Lightbend/Akka's official AI agent runtime

The Akka Actor Model: A Foundation for Concurrent AI Agents — Pradeep Loganathan

The Natural Synergy Between The Actor Pattern and Agentic AI Systems — robotmunki

Real-time Marketing Automation with Distributed Actor Systems and Akka.NET — Aaron Stannard

OpenAI Realtime API Guide

OpenAI Realtime API: The Missing Manual — Latent Space

Sandboxing Agentic AI Workflows with WebAssembly — NVIDIA Developer

Akka.NET Official Docs — Actors