An LLM Is Not an Agent: Five Layers Between a Local Model and a Real Conversation

TL;DR: A local on-device LLM is, by itself, a token generator. It is not an agent. AgentZero Lite turned Gemma 4, running on a user's machine, into something that can hold a real conversation with another AI (Anthropic's Claude CLI in a neighbouring terminal tab) by stacking five distinct layers on top of inference. This is the field report.

The premise: an LLM alone is not an agent

Drop a local model into your app and you get tokens. That is all.
You do not get a thing that "decides what to do," that "uses tools," that "knows whether the other side received the message." Those are not properties of the model. They are properties of the system you build around the model. Treating the LLM as if it were already an agent is the most common mistake — and the one that makes "agentic" features brittle.
So a more useful question than "which model is best?" is "what is the smallest set of layers I have to add on top of an LLM before it earns the word agent?"
In AgentZero Lite the answer turned out to be five.
```mermaid
flowchart BT
    L[Local LLM<br/>tokens out]
    S[+ Structure<br/>GBNF grammar OR native tool template]
    T[+ Tools<br/>5-tool catalog over real ConPTY]
    C[+ Concurrency<br/>off-UI-thread reactor actor]
    P[+ Protocol<br/>identity-first handshake, peer-name routing]
    D[Discipline harness<br/>research before code]
    A((Agent))
    L --> S --> T --> C --> P --> A
    D -.-> A
```
The rest of this doc is one chapter per layer.

1. A harness before a feature

The most expensive bug in agent work is not a logic error. It is building the wrong thing because the design decision was a guess. Local-LLM tool-calling is a domain where you can spend a weekend on the wrong backend before you notice.
The Kakashi harness is a small, structured set of agents (security-guard, build-doctor, code-coach, release-build-pipeline) that each own a slice of the engineering process. Before any AIMODE feature work, the harness produced a single concrete artifact: a research note that decided AIMODE would be dual-backend — Gemma 4 via GBNF grammar enforcement, and Nemotron Nano 8B-v1 via its native chat template — and explained why.
Once the decision was an artifact instead of a guess, both backends could be implemented behind one interface (IAgentToolHost) and switched without re-litigating the choice mid-implementation. Discipline produces leverage; it is not overhead.

2. Prompts ask. Grammars constrain.

A common pattern: "Reply with one JSON object: {tool, args}. Do not add prose."
This works inconsistently. Gemma 4 in particular has no native tool-calling SFT — it will helpfully wrap your JSON in markdown, add a friendly preamble, or omit a quote. Asking nicely fails enough of the time to be unusable in a loop.
The trick is to stop asking. Move the contract from the prompt layer to the sampler layer with a GBNF grammar. At each step the sampler discards every candidate token that would break the grammar, so the model cannot emit a sequence that fails to match it. It still chooses which tool and which args; it cannot choose to break the schema.
```
root     ::= ws "{" ws "\"tool\"" ws ":" ws toolname ws "," ws "\"args\"" ws ":" ws args ws "}" ws
toolname ::= "\"list_terminals\"" | "\"read_terminal\"" | "\"send_to_terminal\"" | "\"send_key\"" | "\"wait\"" | "\"done\""
args     ::= "{" ws "}" | "{" ws kv (ws "," ws kv)* ws "}"
kv       ::= string ws ":" ws value
value    ::= string | integer | boolean
```
For models that do have native tool calling (Nemotron via the Llama-3.1 chat template), the grammar isn't needed — the structure already exists in the chat format. The application sees both backends through the same IAgentToolHost interface. The lesson generalises: the prompt is for intent, the grammar (or native channel) is for shape. Conflating them is what makes tool-calling flaky.
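Whichever backend produces the text, the application still has to turn it into a typed call. The sketch below shows one way that step might look with System.Text.Json. `ToolCall` and `ToolCallParser` are hypothetical names for illustration (the source does not show its parser), and the tool list simply mirrors the `toolname` rule in the grammar above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical types for illustration; not taken from the AgentZero Lite source.
public sealed record ToolCall(string Tool, IReadOnlyDictionary<string, JsonElement> Args);

public static class ToolCallParser
{
    // Mirrors the toolname rule of the GBNF grammar.
    private static readonly HashSet<string> KnownTools = new(StringComparer.Ordinal)
    {
        "list_terminals", "read_terminal", "send_to_terminal", "send_key", "wait", "done",
    };

    // Returns true only when text is exactly one {"tool": ..., "args": {...}} object
    // naming a known tool.
    public static bool TryParse(string text, out ToolCall? call)
    {
        call = null;
        try
        {
            using JsonDocument doc = JsonDocument.Parse(text);
            JsonElement root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return false;
            if (!root.TryGetProperty("tool", out JsonElement tool) ||
                tool.ValueKind != JsonValueKind.String) return false;
            if (!root.TryGetProperty("args", out JsonElement args) ||
                args.ValueKind != JsonValueKind.Object) return false;

            string name = tool.GetString()!;
            if (!KnownTools.Contains(name)) return false;

            var parsed = new Dictionary<string, JsonElement>();
            foreach (JsonProperty p in args.EnumerateObject())
                parsed[p.Name] = p.Value.Clone(); // Clone so values outlive the JsonDocument

            call = new ToolCall(name, parsed);
            return true;
        }
        catch (JsonException)
        {
            // Typical prompt-only failures land here: markdown fences,
            // a friendly preamble, or a missing quote.
            return false;
        }
    }
}
```

With grammar-constrained sampling the `catch` branch should be unreachable in practice. Keeping it anyway means a misconfigured backend fails loudly instead of feeding garbage to the tool host.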

3. The terminal is just five tools

What a real terminal is: a process attached to a pseudo-console that accepts bytes and emits bytes.
What an LLM should see: not a process, not a console, but five named affordances plus a done signal that ends the cycle.
| Tool | Purpose |
| --- | --- |
| list_terminals | What tabs exist? |
| read_terminal | What is the current screen contents? |
| send_to_terminal | Type this string + Enter. |
| send_key | Send one control key (esc, ctrl-c, …). |
| wait | Sleep N seconds before the next read. |
| done | End the cycle, return a final message. |
The bridge that turns the raw pseudo-console into those named tools is WorkspaceTerminalToolHost. From the LLM's point of view the terminal has been collapsed to five tool calls plus done; from the tab's point of view it is still a real ConPTY receiving real bytes.
```mermaid
flowchart TD
    U[User prompt] --> L[AgentToolLoop]
    L -->|system + grammar/template| G{LLM backend}
    G -->|Gemma 4 path| GBNF[GBNF sampler<br/>5-tool catalog]
    G -->|Nemotron path| TPL[Llama-3.1 chat template<br/>native tool channel]
    GBNF --> P[Parse JSON]
    TPL --> P
    P --> H[IAgentToolHost<br/>WorkspaceTerminalToolHost]
    H -->|tool result| L
    L -->|loop until 'done'<br/>or MaxIterations cap| L
```
The wait tool deserves a callout. Terminal AIs (Claude, Codex) take 5–15 seconds to start replying. Reading immediately after sending captures only a thinking spinner (Crafting…, esc to interrupt). Without wait, the loop reads that placeholder instead of a reply and the model hallucinates one. With wait exposed as a first-class tool, the model owns the timing decision instead of the application guessing it.
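The loop and the timing problem can be sketched together. Everything below is an assumption made for illustration: the source names `IAgentToolHost` and `AgentToolLoop` but does not show their signatures, the argument keys (`seconds`, `message`, `text`) are invented, and `FakeTerminalHost` simulates a peer that shows a spinner until five simulated seconds have passed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape; the real IAgentToolHost signature is not shown in the source.
public interface IAgentToolHost
{
    string Execute(string tool, IReadOnlyDictionary<string, string> args);
}

public sealed record ToolCall(string Tool, IReadOnlyDictionary<string, string> Args);

public static class AgentToolLoopSketch
{
    // nextCall stands in for "run the LLM and parse its JSON";
    // it receives the transcript of tool results so far.
    public static string Run(
        Func<IReadOnlyList<string>, ToolCall> nextCall,
        IAgentToolHost host,
        int maxIterations = 8)
    {
        var transcript = new List<string>();
        for (int i = 0; i < maxIterations; i++)
        {
            ToolCall call = nextCall(transcript);
            if (call.Tool == "done")
                return call.Args.TryGetValue("message", out var message) ? message : "";

            string result = host.Execute(call.Tool, call.Args);
            transcript.Add($"{call.Tool} -> {result}");
        }
        return "(stopped: iteration cap reached)";
    }
}

// Fake peer terminal: shows a spinner until enough simulated time has passed.
public sealed class FakeTerminalHost : IAgentToolHost
{
    private int _secondsWaited;
    private bool _sent;

    public string Execute(string tool, IReadOnlyDictionary<string, string> args)
    {
        switch (tool)
        {
            case "send_to_terminal":
                _sent = true;
                _secondsWaited = 0;
                return "ok";
            case "wait":
                _secondsWaited += int.Parse(args["seconds"]); // simulated; no real sleep
                return "ok";
            case "read_terminal":
                if (!_sent) return "";
                return _secondsWaited >= 5 ? "Hello from the peer" : "Crafting… (esc to interrupt)";
            default:
                return "unsupported in this sketch";
        }
    }
}
```

A scripted `nextCall` that issues send_to_terminal, read_terminal, wait, read_terminal sees the spinner on the first read and the reply on the second. The iteration cap is what keeps a model that never emits done from looping forever.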

4. Inference belongs in an actor, not a window

Multi-second token generation cannot share a thread with mouse clicks. The first version of AIMODE ran inference on the WPF UI thread and the GUI froze for the duration of every turn.
The fix is not async/await. The fix is moving the work out of the UI's universe entirely. AgentReactorActor, a new Akka actor at /user/stage/bot/reactor, wraps AgentToolLoop. The window becomes Tell-only — no _aiLoop or _aiCts field, no UI-thread inference, no freezes. Live progress visibility (a 💭 generating… (N tok) bubble) is delivered as messages from the actor to the window, never the other way around.
This isn't a performance optimisation. It's a correctness property: the UI is allowed to remain interactive because the inference work physically cannot block it. KV cache lives inside the actor and persists across cycles, so multi-turn conversation history is preserved without re-prompting.
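The tell-only shape can be illustrated without Akka.NET. The sketch below is deliberately not the AgentReactorActor: it is a hand-rolled mailbox drained by one dedicated thread, using only the .NET base library, and every type name in it is hypothetical. A `List<string>` stands in for the persistent KV cache and `Thread.Sleep` stands in for token generation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical messages for illustration.
public sealed record StartCycle(string Prompt);
public sealed record Progress(int Tokens);
public sealed record CycleDone(string Reply, int TurnsSoFar);

// NOT Akka.NET: a minimal stand-in that borrows the actor shape
// (one mailbox, one worker, state that never leaves the worker).
public sealed class ReactorSketch : IDisposable
{
    private readonly BlockingCollection<StartCycle> _mailbox = new();
    private readonly Action<object> _tellWindow;     // the only path back to the UI
    private readonly List<string> _history = new();  // stands in for the persistent KV cache
    private readonly Thread _worker;

    public ReactorSketch(Action<object> tellWindow)
    {
        _tellWindow = tellWindow;
        _worker = new Thread(Drain) { IsBackground = true, Name = "reactor-sketch" };
        _worker.Start();
    }

    // Fire-and-forget: returns immediately, so the caller never waits on inference.
    public void Tell(StartCycle message) => _mailbox.Add(message);

    private void Drain()
    {
        foreach (StartCycle msg in _mailbox.GetConsumingEnumerable())
        {
            _history.Add(msg.Prompt);
            for (int tok = 1; tok <= 3; tok++)
            {
                Thread.Sleep(10); // placeholder for multi-second token generation
                _tellWindow(new Progress(tok));
            }
            _tellWindow(new CycleDone($"echo: {msg.Prompt}", _history.Count));
        }
    }

    public void Dispose()
    {
        _mailbox.CompleteAdding(); // worker finishes queued messages, then exits
        _worker.Join();
        _mailbox.Dispose();
    }
}
```

In a WPF window the `tellWindow` callback would typically marshal onto the UI thread (for example through `Dispatcher.BeginInvoke`) before touching any control. A real actor system adds supervision and addressing on top of this shape, which is presumably part of why the project chose Akka.NET over a hand-rolled queue.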

5. Identity is the protocol

Two AIs talking is not one bigger AI. It is two systems negotiating a shared channel. Without that negotiation step, the local AI happily sends a message and then either lies about the response (hallucinates a reply) or gives up ("there is no reverse channel"). Both are failure modes of pretending the other side exists when you haven't proved it.
So before any substantive exchange, the local AI does a first-contact handshake, identity-first:
```csharp
private static string BuildFirstContactHeader(string peerName = "Claude") =>
    "[AgentBot Handshake — first contact, please read carefully]\n\n" +
    "You are " + peerName + " and I am AgentBot, " +
    "an on-device AI agent running inside the AgentZero Lite shell.\n" +
    "\"" + peerName + "\" is the name of YOUR terminal tab — " +
    "both of us will use that name to refer to you, and you will use it " +
    "as your --from identity when replying.\n\n" +
    "Step 1 — Verify the CLI exists.\n" +
    "Step 2 — Acknowledge using the same CLI.\n" +
    " AgentZeroLite.exe -cli bot-chat \"DONE(handshake-ok)\" --from " + peerName + "\n";
```
The string peerName is the name of the tab. It is also what the peer must put in --from when calling back via the bot-chat CLI. It is also the key that the actor system uses to route incoming peer signals to the right reactor cycle. One string ties together identity, routing, and handshake state — and because there is only one, there is no inconsistency to maintain.
The full peer-signal protocol is just a few records and an enum:
```csharp
public enum HandshakeState
{
    NotConnected,   // never sent intro
    HandshakeSent,  // intro dispatched, awaiting bot-chat reply
    Connected,      // peer has called back at least once
}

public sealed record TerminalSentToBot(string PeerName, string Text);
public sealed record MarkHandshakeSent(string PeerName);
public sealed record QueryHandshakeState(string PeerName);
public sealed record HandshakeStateReply(string PeerName, HandshakeState State);
```
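To make the "one string ties everything together" claim concrete, here is a minimal sketch of how these messages might drive per-peer state. The enum and records are repeated from above so the block compiles on its own. `HandshakeRegistry` is a hypothetical holder, and its transition rules are assumptions: any callback from a peer marks it Connected, a late MarkHandshakeSent never downgrades a connected peer, and peer names are matched case-sensitively.

```csharp
using System;
using System.Collections.Generic;

public enum HandshakeState { NotConnected, HandshakeSent, Connected }

public sealed record TerminalSentToBot(string PeerName, string Text);
public sealed record MarkHandshakeSent(string PeerName);
public sealed record QueryHandshakeState(string PeerName);
public sealed record HandshakeStateReply(string PeerName, HandshakeState State);

// Hypothetical holder; in the real system this state lives inside the actor system.
public sealed class HandshakeRegistry
{
    // The peer name is the only key: identity, routing, and handshake state share it.
    private readonly Dictionary<string, HandshakeState> _states = new(StringComparer.Ordinal);

    // Returns a reply only for queries; every other message just moves state.
    public HandshakeStateReply? Handle(object message)
    {
        switch (message)
        {
            case MarkHandshakeSent sent:
                // Assumption: do not downgrade a peer that has already called back.
                if (StateOf(sent.PeerName) == HandshakeState.NotConnected)
                    _states[sent.PeerName] = HandshakeState.HandshakeSent;
                return null;

            case TerminalSentToBot signal:
                // Assumption: any bot-chat callback proves the reverse channel works.
                _states[signal.PeerName] = HandshakeState.Connected;
                return null;

            case QueryHandshakeState query:
                return new HandshakeStateReply(query.PeerName, StateOf(query.PeerName));

            default:
                return null; // unknown messages are ignored in this sketch
        }
    }

    private HandshakeState StateOf(string peer) =>
        _states.TryGetValue(peer, out var state) ? state : HandshakeState.NotConnected;
}
```

Because the dictionary key is the same string the peer passes to `--from`, a callback from an unknown tab simply creates its own entry and cannot advance another peer's handshake.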
The full cycle, end to end:
```mermaid
sequenceDiagram
    participant U as User
    participant B as AgentBot<br/>(local Gemma)
    participant R as AgentReactorActor<br/>(Akka, off-UI-thread)
    participant H as WorkspaceTerminalToolHost
    participant P as Peer terminal<br/>(Claude CLI)
    U->>B: "Talk to the Claude tab"
    B->>R: StartReactor (Mode 2)
    R->>H: send_to_terminal(<handshake intro>)
    H->>P: "You are <peerName>, I am AgentBot.<br/>Verify CLI: -cli help.<br/>Ack via bot-chat DONE(handshake-ok)"
    Note over R,H: MarkHandshakeSent
    P->>P: runs -cli help, confirms
    P->>B: AgentZeroLite.exe -cli bot-chat<br/>"DONE(handshake-ok)" --from <peer>
    B->>R: TerminalSentToBot (peer signal)
    Note over R: HandshakeState → Connected<br/>Phase H done → Phase C
    R->>H: substantive turn
    H->>P: real exchange
    P->>B: real reply via bot-chat
    R->>U: result
```
Mode 1 (direct chat — greetings stay direct) and Mode 2 (terminal relay — handshake then conversation) are kept strictly separate by the system prompt and verified by live tests: greetings never accidentally route to a terminal, and vague Mode 2 requests still produce a reasonable opener instead of bouncing back to the user. The headless test suite covers FSM transitions, handshake state, and 5 sequential cycles per session — 42/42 pass.

The tech stack at a glance

| Layer | Tech | Why this and not something else |
| --- | --- | --- |
| Host | .NET 10 (preview) + WPF | Native Windows desktop, real ConPTY tabs |
| Concurrency | Akka.NET actor system | Off-UI-thread inference, supervised state machines, message-keyed routing |
| Local LLM | Self-built llama.cpp (Vulkan) + LLamaSharp 0.26 | Runs Gemma 4 / Nemotron on-device, no Ollama dependency |
| Tool calling (Gemma) | GBNF grammar at sampler level | Gemma 4 has no tool-calling SFT; grammar makes structure non-optional |
| Tool calling (Nemotron) | Native Llama-3.1 chat template | Model has its own tool channel; reuse it |
| Terminal | ConPTY via Microsoft.Terminal.Control | Real terminals, not screen-scraping |
| Reverse channel | `AgentZeroLite.exe -cli bot-chat ... --from <peer>` over WM_COPYDATA | Pre-existing IPC repurposed; one identifier ties everything |
| Engineering harness | Kakashi (security-guard, build-doctor, code-coach, release-build-pipeline) | Forces research → code → review → log to be a structured cycle |

Closing — what made the tokens become a conversation

Each layer in the ladder solves a different category of failure:
  • Structure removes "the model produced unparseable output" failures.
  • Tools remove "the model has no way to act on the world" failures.
  • Concurrency removes "the act of thinking blocks the act of using" failures.
  • Protocol removes "the model lies about a counterparty it has not actually reached" failures.
  • Discipline (the harness) removes "we built the wrong thing" failures.
None of them are about a smarter model. All of them are about what surrounds the model. That is the part that is, in the end, the agent.
The two AIs are talking. The harness logged it. The next features get planted the same way.

TECH LINKS