Building Agent Orchestration on Akka.NET — Cluster Edition (Part 2)

🎯
In Part 1 we ported the Akka.io Agent abstraction to Akka.NET — a single LlmAgentActor driving the four-Behavior FSM (Idle → GatheringContext → Reasoning → Acting), a SessionMemoryActor (PersistentActor), and an AgentRouter that routed by SessionId. All of that lived inside one ActorSystem on one machine.
This article asks exactly one question — how much of that code has to change when you move it from one machine to an N-node cluster?
Target reader: a .NET engineer comfortable with distributed systems and the actor model basics, who wants to see how the abstraction survives horizontal scale.

1. Why cluster — the real single-node limits

🎯
3-line summary
  • The single-node ceiling isn't CPU/GPU shortage — it's memory pressure as sessions grow and cold-start cost on restart.
  • Location transparency in the actor model is not marketing copy — it's the design choice that solves both problems at once.
  • Once you go cluster, which node an Agent actor lives on stops mattering.
Part 1's AgentRouter spawned one child LlmAgentActor per session and routed via a _bySession dictionary. On a single box, that pattern breaks in three places.
  1. Memory ceiling. Per session: 1 LlmAgentActor + 1 SessionMemoryActor = 2 actors. 100k active users = 200k actors. At ~4 KB per actor that's 800 MB before mailboxes and history. The GC pressure on one .NET process climbs into the abnormal range fast.
  2. Restart cold start. Restart the process and every session has to replay events from the journal. If N users reconnect simultaneously you get a concurrent replay storm.
  3. SLA: single point of failure. When the agent node dies, every conversation on it drops. In production that's a non-negotiable disqualifier.
All three have well-known distributed-systems answers: shard the actors across N nodes, lean on persistence to shrink cold start, and isolate failures with supervision plus a Split-Brain Resolver. Location transparency is what lets you apply those answers with almost no code change.
graph LR
  subgraph "Part 1 — single node"
    Router1[AgentRouter] --> A1[LlmAgentActor: alice]
    Router1 --> A2[LlmAgentActor: bob]
    Router1 --> A3[LlmAgentActor: carol]
    A1 --> M1[SessionMemoryActor]
    A2 --> M2[SessionMemoryActor]
    A3 --> M3[SessionMemoryActor]
  end
  subgraph "Part 2 — N-node cluster"
    Client[Client] --> SR[ShardRegion<br/>one per node]
    SR --> N1[Node A<br/>shard 0,3,6]
    SR --> N2[Node B<br/>shard 1,4,7]
    SR --> N3[Node C<br/>shard 2,5,8]
    N1 -. Persistence .-> DB[(EventStore<br/>PostgreSQL/Mongo)]
    N2 -. Persistence .-> DB
    N3 -. Persistence .-> DB
  end
  A1 -.->|"≈ 0 code change"| N1
Going from the left to the right, the business logic barely moves. What changes is how actors are spawned and which journal backend they use.

2. Akka.NET Cluster in 5 minutes

🎯
3-line summary
  • A cluster is a negotiated membership set of nodes. State is propagated via a Gossip protocol.
  • Same ActorSystem name + agreed seed-nodes → ClusterEvent.MemberUp and you're running.
  • Three distributed actor tools to know: Cluster Sharding, Cluster Singleton, Distributed PubSub.

Minimum HOCON

akka {
  actor.provider = cluster
  remote.dot-netty.tcp {
    hostname = "0.0.0.0"
    port = 0  # 0 = dynamic (pin in production)
  }
  cluster {
    seed-nodes = [
      "akka.tcp://agent-cluster@node-a:2552",
      "akka.tcp://agent-cluster@node-b:2552"
    ]
    roles = ["agent"]
    sharding {
      role = "agent"
      remember-entities = on  # reactivate entities after node down
    }
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-majority
    }
  }
  persistence {
    journal.plugin = "akka.persistence.journal.postgresql"
    snapshot-store.plugin = "akka.persistence.snapshot-store.postgresql"
  }
}
The lines that matter:
  • actor.provider = cluster — this single line flips ActorSystem into cluster mode.
  • seed-nodes — first contact points when joining. Usually 2–3.
  • roles — assigns a node's responsibilities (e.g. agent / gateway / persistence-writer). Sharding only rebalances within nodes that share a role.
  • split-brain-resolver — decides which partition survives a network split. Without it, two halves of a split-brain evolve into separate clusters in parallel. Non-optional for production.

Three distributed primitives

graph TB
  subgraph "Cluster Sharding"
    direction LR
    SC[ShardCoordinator<br/>Singleton] --> SR1[ShardRegion @ Node A]
    SC --> SR2[ShardRegion @ Node B]
    SC --> SR3[ShardRegion @ Node C]
    SR1 --> E11[Entity 1]
    SR1 --> E12[Entity 4]
    SR2 --> E21[Entity 2]
    SR2 --> E22[Entity 5]
    SR3 --> E31[Entity 3]
    SR3 --> E32[Entity 6]
  end
  subgraph "Cluster Singleton"
    SM[SingletonManager<br/>on every node] --> SP[Singleton Proxy] --> S[Singleton Actor<br/>exactly one, on oldest node]
  end
  subgraph "Distributed PubSub"
    P1[Publisher @ Node A] --> Med1[Mediator @ Node A]
    Med1 -. gossip .- Med2[Mediator @ Node B]
    Med2 --> Sub1[Subscriber @ Node B]
    Med2 --> Sub2[Subscriber @ Node B]
  end
  • Cluster Sharding = "one actor per entity" auto-distributed across N nodes. ShardCoordinator (a Singleton) decides which shard lives on which node. Handoff (rebalance) is automatic.
  • Cluster Singleton = "an actor that must exist exactly once cluster-wide" — lives on the oldest node by default. Rate limiter, scheduler, central coordinator.
  • Distributed PubSub = topic-based pub/sub without knowing actor locations. The backbone of multi-agent collaboration.
The core Part 2 transformation: replace Part 1's AgentRouter with Cluster Sharding's ShardRegion.

3. The five real problems when going cluster

From a core-architecture stance, the single-to-cluster move hits five concrete problems.
| Problem | Single node | Cluster | Akka.NET answer |
| --- | --- | --- | --- |
| 1. Agent location | Dictionary lookup | Which node? | Cluster Sharding |
| 2. Session affinity | Same in-memory object | Same SessionId → same actor, guaranteed | ShardId = hash(SessionId) % N |
| 3. Persistence backend | Local SQLite OK | Reachable from every node | PostgreSQL / Mongo Persistence Plugin |
| 4. Failure isolation | SupervisorStrategy | SupervisorStrategy + network partitions | Split-Brain Resolver + remember-entities |
| 5. Observability | One log file | Distributed traces | OpenTelemetry + Phobos or Petabridge.Cmd |
We'll walk through each.

4. Sharding the Agent actors

🎯
3-line summary
  • Swap AgentRouter (Part 1) for ClusterSharding.Start(...) — actors are now distributed across N nodes by SessionId.
  • Entity ID = SessionId, Shard ID = hash(SessionId) % shardCount.
  • When nodes join/leave, ShardCoordinator handles rebalance automatically. Business code stays the same.

4.1 Sharding setup

using System;
using Akka.Actor;
using Akka.Cluster.Sharding;

public sealed class AgentMessageExtractor : HashCodeMessageExtractor
{
    public AgentMessageExtractor(int maxShards = 100) : base(maxShards) { }

    // Entity ID = SessionId; the base class derives the shard ID from it
    public override string EntityId(object message) => message switch
    {
        AgentRequest req => req.SessionId,
        ShardEnvelope env => env.EntityId,
        _ => null!
    };

    // Unwrap envelopes so the entity only ever sees the payload
    public override object EntityMessage(object message) => message switch
    {
        ShardEnvelope env => env.Message,
        _ => message
    };
}

// Cluster boot: once per ActorSystem
var sharding = ClusterSharding.Get(system);
var agentShardRegion = sharding.Start(
    typeName: "agent",
    entityProps: LlmAgentActor.Props(llm), // unchanged Part 1 actor
    settings: ClusterShardingSettings.Create(system).WithRole("agent"),
    messageExtractor: new AgentMessageExtractor(maxShards: 100));

// Callers send to the ShardRegion; routing is automatic
var reply = await agentShardRegion.Ask<AgentResponse>(
    new AgentRequest("alice", "Hi, I'm Alice"),
    TimeSpan.FromSeconds(30));

4.2 What the ShardRegion does for you

sequenceDiagram
  autonumber
  participant Client
  participant SR_A as ShardRegion @ Node A
  participant SC as ShardCoordinator<br/>(Singleton)
  participant SR_C as ShardRegion @ Node C
  participant E as Entity "alice"<br/>(alive on Node C)
  Client->>SR_A: AgentRequest "alice"
  Note over SR_A: EntityId=alice<br/>ShardId=hash(alice) % 100 = 23
  SR_A->>SR_A: don't know where shard 23 lives yet
  SR_A->>SC: GetShardHome(23)
  SC-->>SR_A: shard 23 → Node C
  SR_A->>SR_C: Forward(AgentRequest "alice")
  Note over SR_C: does shard 23 contain entity "alice"?
  SR_C->>E: AgentRequest
  E-->>SR_C: AgentResponse
  SR_C-->>SR_A: AgentResponse
  SR_A-->>Client: AgentResponse
  Note over SR_A,SR_C: From now on Node A's ShardRegion<br/>caches shard 23 → Node C
Key points:
  • The client can send to any ShardRegion (Node A/B/C). Routing is the ShardRegion's job.
  • Entity actor alice exists exactly once in the entire cluster — Cluster Sharding guarantees that.
  • The first hop goes through ShardCoordinator, but subsequent calls hit a per-region shard → node cache.
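The `ShardId = hash(SessionId) % shardCount` step can be sketched without any Akka types. This is only an illustration under stated assumptions: `ShardMath`, `StableHash`, and `ShardIdFor` are hypothetical names, and FNV-1a stands in for whatever hash `HashCodeMessageExtractor` uses internally, so the shard numbers it produces will not match a real cluster's.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Akka.NET.
public static class ShardMath
{
    // FNV-1a (32-bit). Unlike string.GetHashCode() on modern .NET, which is
    // randomized per process, this yields the same value on every node.
    public static uint StableHash(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Encoding.UTF8.GetBytes(text))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Maps a session to a shard in the range [0, shardCount).
    public static int ShardIdFor(string sessionId, int shardCount)
    {
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));
        return (int)(StableHash(sessionId) % (uint)shardCount);
    }
}
```

`ShardMath.ShardIdFor("alice", 100)` returns the same value between 0 and 99 on every node, which is the property session affinity depends on: any ShardRegion can compute the shard locally and only needs the coordinator to learn where that shard lives.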

4.3 Automatic rebalance on join/leave

graph TB
  subgraph "T=0 even balance across 3 nodes"
    direction LR
    N1A[Node A<br/>shard 0,3,6,9...]
    N2A[Node B<br/>shard 1,4,7,10...]
    N3A[Node C<br/>shard 2,5,8,11...]
  end
  subgraph "T=1 Node D joins"
    direction LR
    N1B[Node A<br/>shard 0,4,8...]
    N2B[Node B<br/>shard 1,5,9...]
    N3B[Node C<br/>shard 2,6,10...]
    N4B[Node D<br/>shard 3,7,11...]
  end
  subgraph "T=2 Node B down"
    direction LR
    N1C[Node A<br/>shard 0,1,4,8...]
    N3C[Node C<br/>shard 2,5,6,10...]
    N4C[Node D<br/>shard 3,7,9,11...]
  end
  N1A -. rebalance .-> N1B
  N1B -. node down +<br/>remember-entities .-> N1C
With remember-entities = on, the active entities that were living on Node B are re-activated elsewhere on the surviving nodes. Each SessionMemoryActor (a PersistentActor) recovers its state from the event journal.

5. The Coordinator is a Cluster Singleton — workflows & rate limiters

🎯
3-line summary
  • Actors that must exist exactly once cluster-wide (multi-agent planner orchestrator, central tool registry) are Cluster Singletons.
  • LLM rate limiter is the canonical case — you need one call-counter across the whole cluster.
  • Callers use SingletonProxy and don't track location.
The Multi-Agent Planner workflow from Part 1's §9-1 chains WeatherAgent and ActivityAgent. In a cluster, we have to decide where the workflow runs.
Two options:
  1. Workflow as Sharded Entity: one Entity per Workflow instance. Pairs well with the saga pattern.
  2. Coordinator as Cluster Singleton: keep checkpoints in the persistence journal, but logical decisions in a single actor.
For LLM rate limiting, (2) is almost always correct. You need one view of cluster-wide call counts.
using System;
using Akka.Actor;
using Akka.Cluster.Tools.Singleton;

// Singleton registration: called on every node, Akka activates only on the oldest one.
// Note: nodes that should host it must list "rate-limiter" in akka.cluster.roles.
var singletonManager = system.ActorOf(
    ClusterSingletonManager.Props(
        singletonProps: Props.Create<LlmRateLimiterActor>(),
        terminationMessage: PoisonPill.Instance,
        settings: ClusterSingletonManagerSettings.Create(system)
            .WithRole("rate-limiter")),
    name: "rate-limiter");

// Caller: uses SingletonProxy, no location awareness
var proxy = system.ActorOf(
    ClusterSingletonProxy.Props(
        singletonManagerPath: "/user/rate-limiter",
        settings: ClusterSingletonProxySettings.Create(system)
            .WithRole("rate-limiter")),
    name: "rate-limiter-proxy");

// Inside LlmAgentActor
var token = await proxy.Ask<RateLimitGrant>(
    new AcquireToken(estimatedTokens: 1500),
    TimeSpan.FromSeconds(5));
sequenceDiagram
  autonumber
  participant A1 as LlmAgentActor @ Node A
  participant P1 as SingletonProxy @ Node A
  participant SM as SingletonManager @ Node B (oldest)
  participant S as LlmRateLimiterActor<br/>(one in cluster)
  participant A2 as LlmAgentActor @ Node C
  participant P2 as SingletonProxy @ Node C
  A1->>P1: AcquireToken(1500)
  P1->>SM: forward (location-aware)
  SM->>S: AcquireToken(1500)
  S-->>SM: Grant
  SM-->>P1: Grant
  P1-->>A1: Grant
  A2->>P2: AcquireToken(2000)
  P2->>SM: forward
  SM->>S: AcquireToken(2000)
  Note over S: remaining this minute: 4000<br/>→ Grant
  S-->>P2: Grant via SM
  P2-->>A2: Grant
  Note over A1,A2: Agents on Node A and C<br/>share the same rate limiter
The point of a Singleton is single responsibility plus determinism in a distributed environment. Used wrong, it becomes a bottleneck, so the rule is: let it decide, and delegate the heavy lifting.
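The article does not show what LlmRateLimiterActor does internally, so the following is a guess at the smallest decision logic that would fit the "remaining this minute" note in the diagram: a fixed-window token budget. `TokenWindowBudget` and its members are hypothetical names. Because an actor handles one message at a time, the class needs no locking when it is owned by a single actor.

```csharp
using System;

// Hypothetical decision logic; the real LlmRateLimiterActor is not shown in this article.
public sealed class TokenWindowBudget
{
    private readonly int _tokensPerWindow;
    private readonly TimeSpan _window;
    private DateTimeOffset _windowStart;
    private int _used;

    public TokenWindowBudget(int tokensPerWindow, TimeSpan window, DateTimeOffset now)
    {
        if (tokensPerWindow <= 0) throw new ArgumentOutOfRangeException(nameof(tokensPerWindow));
        if (window <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(window));
        _tokensPerWindow = tokensPerWindow;
        _window = window;
        _windowStart = now;
    }

    // Tokens still available in the window that was current at the last TryAcquire call.
    public int Remaining => _tokensPerWindow - _used;

    // Returns true (grant) if the request fits in the current window, false (deny) otherwise.
    public bool TryAcquire(int estimatedTokens, DateTimeOffset now)
    {
        if (estimatedTokens <= 0) throw new ArgumentOutOfRangeException(nameof(estimatedTokens));

        if (now - _windowStart >= _window)
        {
            _windowStart = now; // start a fresh window
            _used = 0;
        }

        if (estimatedTokens > Remaining) return false;
        _used += estimatedTokens;
        return true;
    }
}
```

Passing `now` in explicitly keeps the logic deterministic and easy to test. With a budget of 5500 tokens per minute, the two requests from the diagram (1500, then 2000) are both granted and a third request for 2500 in the same minute is denied.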

6. Persistence — event sourcing + snapshot strategy

🎯
3-line summary
  • For SessionMemoryActor to make sense in a cluster, the journal backend has to be reachable from every node (PostgreSQL/Mongo/Cassandra).
  • To avoid the restart replay storm, snapshot cadence and lazy recovery are mandatory.
  • Keep LlmAgentActor stateless; let SessionMemoryActor be the only PersistentActor. Clean separation of responsibility.

6.1 Event flow

sequenceDiagram
  autonumber
  participant Client
  participant Agent as LlmAgentActor<br/>(Sharded Entity)
  participant Mem as SessionMemoryActor<br/>(PersistentActor)
  participant Journal as EventJournal<br/>(PostgreSQL)
  participant LLM
  Client->>Agent: AgentRequest("alice", "Hi")
  Agent->>Mem: LoadSessionMessages
  Mem->>Journal: SELECT events WHERE persistence_id='session-memory-alice'
  Journal-->>Mem: [MessageAppended..., MessageAppended...]
  Note over Mem: Recover<MessageAppended>(Apply)<br/>history restored
  Mem-->>Agent: SessionMessages(history)
  Agent->>LLM: complete(prompt + history)
  LLM-->>Agent: reply
  Agent->>Mem: AppendMessages(user+assistant)
  Mem->>Journal: Persist(MessageAppended)
  Journal-->>Mem: ack
  Note over Mem: SaveSnapshot every 50th message
  Agent-->>Client: AgentResponse

6.2 Snapshot cadence — preventing replay storms

SessionMemoryActor already saves a snapshot every 50 messages (from Part 1). In a cluster, that cadence determines replay time.
using System.Collections.Generic;
using System.Linq;
using Akka.Actor;
using Akka.Persistence;

public sealed class SessionMemoryActor : ReceivePersistentActor
{
    private const int MaxHistory = 20;
    private const int SnapshotInterval = 50;

    private readonly string _sessionId;
    private readonly LinkedList<ChatMessage> _history = new();

    public override string PersistenceId => $"session-memory-{_sessionId}";

    public SessionMemoryActor(string sessionId)
    {
        _sessionId = sessionId;

        // 1. Restore from most recent snapshot first
        Recover<SnapshotOffer>(offer =>
        {
            if (offer.Snapshot is List<ChatMessage> snap)
                foreach (var m in snap) _history.AddLast(m);
        });

        // 2. Replay only events after the snapshot
        Recover<MessageAppended>(evt => Apply(evt));

        Command<LoadSessionMessages>(_ =>
            Sender.Tell(new SessionMessages(_history.ToList())));

        Command<AppendMessages>(cmd =>
            Persist(new MessageAppended(cmd.Message), evt =>
            {
                Apply(evt);
                // LastSequenceNr is the sequence number of the event just persisted,
                // so this fires once every SnapshotInterval events
                if (LastSequenceNr % SnapshotInterval == 0)
                    SaveSnapshot(_history.ToList());
            }));

        // 3. On snapshot ack: delete older snapshots/events to keep storage bounded
        Command<SaveSnapshotSuccess>(success =>
        {
            DeleteSnapshots(new SnapshotSelectionCriteria(
                maxSequenceNr: success.Metadata.SequenceNr - 1));
            DeleteMessages(success.Metadata.SequenceNr - SnapshotInterval);
        });

        // Deletion acks carry nothing we need; handle them so they are not logged as unhandled
        Command<DeleteSnapshotsSuccess>(_ => { });
        Command<DeleteMessagesSuccess>(_ => { });
    }

    // Append and trim to the most recent MaxHistory messages
    private void Apply(MessageAppended evt)
    {
        _history.AddLast(evt.Message);
        while (_history.Count > MaxHistory) _history.RemoveFirst();
    }
}
Three things to internalize:
  • Snapshot interval = max_replay_time / event_apply_time. If one apply costs 0.1 ms and your SLA is 100 ms replay, the interval is around 1000 events.
  • GC for events/snapshots. DeleteSnapshots + DeleteMessages matter. Drop those two lines and the PostgreSQL table grows unbounded.
  • DB connection pool is a cluster-level number, not a node-level number. N nodes × pool size per node = total concurrent connections.
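The two sizing rules above are plain arithmetic, and writing them down makes the units explicit. `PersistenceSizing` and its method names are hypothetical, and the inputs are assumed to come from your own measurements. Durations are passed as whole microseconds so that sub-millisecond apply costs divide exactly.

```csharp
using System;

// Hypothetical sizing helpers for the rules of thumb above.
public static class PersistenceSizing
{
    // Largest number of events that can be replayed within the replay budget.
    public static long SnapshotInterval(long maxReplayMicros, long applyCostMicros)
    {
        if (maxReplayMicros <= 0) throw new ArgumentOutOfRangeException(nameof(maxReplayMicros));
        if (applyCostMicros <= 0) throw new ArgumentOutOfRangeException(nameof(applyCostMicros));
        return Math.Max(1, maxReplayMicros / applyCostMicros);
    }

    // Worst-case number of concurrent connections the database must accept.
    public static int TotalConnections(int nodes, int poolSizePerNode)
    {
        if (nodes <= 0) throw new ArgumentOutOfRangeException(nameof(nodes));
        if (poolSizePerNode <= 0) throw new ArgumentOutOfRangeException(nameof(poolSizePerNode));
        return checked(nodes * poolSizePerNode);
    }
}
```

With the numbers from the text, `SnapshotInterval(100_000, 100)` (100 ms budget, 0.1 ms per apply) gives 1000 events. For the connection rule, a pool of 20 on each of 3 nodes means the database has to be sized for 60 connections, not 20.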

6.3 Responsibility split — Sharded vs Persistent

graph TB
  subgraph "Node A"
    SR_A[ShardRegion 'agent']
    E_A1[LlmAgentActor<br/>SessionId=alice<br/>STATELESS]
    E_A2[LlmAgentActor<br/>SessionId=bob<br/>STATELESS]
    M_A1[SessionMemoryActor<br/>persistence_id=session-memory-alice<br/>PERSISTENT]
    M_A2[SessionMemoryActor<br/>persistence_id=session-memory-bob<br/>PERSISTENT]
    SR_A --> E_A1
    SR_A --> E_A2
    E_A1 -. child .-> M_A1
    E_A2 -. child .-> M_A2
  end
  M_A1 --> J[(EventJournal<br/>PostgreSQL)]
  M_A2 --> J
  style E_A1 fill:#e3f2fd
  style E_A2 fill:#e3f2fd
  style M_A1 fill:#fce4ec
  style M_A2 fill:#fce4ec
The principle: LlmAgentActor is stateless; only SessionMemoryActor is a PersistentActor. Reasoning: LLM calls are non-deterministic, so replaying the actions of LlmAgentActor makes no sense. What's worth persisting is the conversation history (events), full stop.

7. Three multi-agent collaboration topologies

🎯
3-line summary
  • Three collaboration patterns: Conversation (direct message), Topic (pub/sub), Workflow (central saga).
  • The choice depends on interaction synchronicity and state-sharing scope.
  • All three are first-class in Akka.NET — DistributedPubSub, Cluster Sharding, Cluster Singleton.

7.1 Conversation — Agent ↔ Agent direct call

sequenceDiagram
  autonumber
  participant W as WeatherAgent<br/>(Sharded)
  participant A as ActivityAgent<br/>(Sharded)
  participant SR as ShardRegion
  Note over W: One LlmAgentActor calls another like a tool
  W->>SR: AgentRequest(SessionId="alice-activity", "Madrid")
  SR->>A: forward (EntityId=alice-activity)
  A-->>W: AgentResponse("Indoor museums...")
  Note over W: wrap as tool_result,<br/>inject into next LLM call
The simplest pattern. One agent calls another agent as if it were a tool. The downside is that long call chains become hard to debug.

7.2 Topic — DistributedPubSub broadcast

graph LR
  subgraph "Topic Mesh"
    P1[NewsCrawlerAgent<br/>publishes]
    P2[StockTickerAgent<br/>publishes]
    Med1[Mediator @ A]
    Med2[Mediator @ B]
    Med3[Mediator @ C]
    S1[SentimentAgent<br/>subscribes 'market.stock'<br/>and 'market.alert']
    S2[PortfolioAgent<br/>subscribes 'market.stock']
    S3[AlertAgent<br/>subscribes 'market.alert']
    P1 --> Med1
    P2 --> Med1
    Med1 -. gossip .- Med2
    Med1 -. gossip .- Med3
    Med2 --> S1
    Med3 --> S2
    Med3 --> S3
  end
// Publisher
var mediator = DistributedPubSub.Get(system).Mediator;
mediator.Tell(new Publish("market.stock", new StockPriceUpdate("MSFT", 510.32)));

// Subscriber (at agent startup)
mediator.Tell(new Subscribe("market.stock", Self));
Receive<StockPriceUpdate>(update => /* trigger LLM analysis */);
Useful for event-stream-driven multi-agent scenarios. Caveat: delivery guarantee is at-most-once. In production you usually pair it with an external broker like Kafka or Pulsar.

7.3 Workflow — saga / central orchestration

sequenceDiagram
  autonumber
  participant Client
  participant W as TripPlannerWorkflow<br/>(Sharded Entity)
  participant Mem as Workflow EventStore
  participant WA as WeatherAgent<br/>(Sharded)
  participant AA as ActivityAgent<br/>(Sharded)
  participant PE as PreferencesEntity<br/>(Sharded)
  Client->>W: StartWorkflow("plan-trip", "alice", "Madrid")
  Note over W: state=GetWeather, PersistEvent
  W->>Mem: Persist(StateTransitioned)
  W->>WA: forward
  WA-->>W: AgentResponse("rainy 18C")
  Note over W: state=GetPreferences, PersistEvent
  W->>Mem: Persist
  W->>PE: GetPreferences("alice")
  PE-->>W: ["museum","cafe"]
  Note over W: state=SuggestActivity, PersistEvent
  W->>Mem: Persist
  W->>AA: forward
  AA-->>W: AgentResponse("Prado Museum")
  Note over W: state=Done, PersistEvent
  W->>Mem: Persist
  W-->>Client: WorkflowResult
This is almost 1:1 with Akka.io's Workflow component. The Workflow itself is a Sharded Entity, and each step transition is recorded as a PersistentActor event — so if a node dies mid-step, the workflow resumes from exactly the same step on a different node.
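To make "resumes from exactly the same step" concrete, here is a minimal sketch of the replay fold, written without Akka types. The step names and the `StateTransitioned` event name follow the diagram. The payload shape, the `TripStep` enum, and `WorkflowReplay.ResumePoint` are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Steps in the order the diagram walks through them.
public enum TripStep { GetWeather, GetPreferences, SuggestActivity, Done }

// Event name follows the diagram; the single-field payload is an assumption.
public sealed record StateTransitioned(TripStep To);

public static class WorkflowReplay
{
    // Folds the persisted event log into the step the workflow should resume from.
    public static TripStep ResumePoint(IEnumerable<StateTransitioned> log)
    {
        if (log is null) throw new ArgumentNullException(nameof(log));

        var current = TripStep.GetWeather; // a workflow with no events starts at the first step
        foreach (var evt in log)
        {
            if (evt.To < current)
                throw new InvalidOperationException($"Log moves backwards: {current} -> {evt.To}");
            current = evt.To;
        }
        return current;
    }
}
```

If the node dies after the second transition was persisted, replaying `[GetWeather, GetPreferences]` on another node yields `GetPreferences`, so the workflow re-issues the preferences lookup rather than starting over. A step interrupted this way may run twice, which is why the agents and entities it calls should tolerate repeated requests.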
| Pattern | Best for | Determinism | Debug difficulty |
| --- | --- | --- | --- |
| Conversation | Short synchronous tool calls | Drops as call chain grows | Low (clear call graph) |
| Topic | Async event fan-out | At-most-once | Medium (tracking broadcasts) |
| Workflow | Multi-step + side effects | Replayable via event log | High (need event log tooling) |

8. Split-Brain — the most dangerous trap in production

🎯
3-line summary
  • A network partition can split the cluster into two halves that each behave like a separate cluster.
  • If both halves try to host the same Singleton, or run the same Sharded Entity instance on two nodes, data consistency breaks.
  • Among the four Akka.NET SBR strategies, keep-majority is the safest default.
graph TB
  subgraph "Healthy state — 5-node cluster"
    N1[Node A] --- N2[Node B]
    N2 --- N3[Node C]
    N3 --- N4[Node D]
    N4 --- N5[Node E]
    N1 --- N5
  end
  subgraph "Network partition"
    direction LR
    P1[Partition X<br/>Node A, B] -. broken link .- P2[Partition Y<br/>Node C, D, E]
    P1 -. SBR keep-majority .-> X1[ ❌ SHUTDOWN<br/>2 nodes — minority]
    P2 -.-> X2[ ✅ SURVIVE<br/>3 nodes — majority]
  end
  style X1 fill:#ffcdd2
  style X2 fill:#c8e6c9

Four SBR strategies

| Strategy | Behavior | Use case |
| --- | --- | --- |
| keep-majority | Survive if you're the larger partition | Most common; odd-numbered clusters |
| static-quorum | Survive if size >= preconfigured quorum | When cluster size is fixed and known in advance |
| keep-oldest | Survive if you contain the oldest node | When preserving the Singleton matters most |
| keep-referee | Survive if you contain a designated referee | Intentional master/replica designs |
akka.cluster.split-brain-resolver {
  active-strategy = keep-majority
  stable-after = 20s            # ignore network flutter
  down-all-when-unstable = on   # if instability persists, down everything
}

Operational checklist

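  • Always run an odd number of nodes. With an even number plus keep-majority, a 5:5 split has no majority; the resolver then falls back to a tiebreak (keeping the side that contains the lowest-address member), so which half survives is effectively arbitrary.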
  • Set stable-after to at least 5 s. Anything shorter overreacts to GC pauses and transient network blips.
  • Monitor membership. Surface Cluster.MemberStatus change events to your monitoring stack. SBR-downed nodes do not rejoin automatically — you need automated restart.
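The decision each side of a partition makes under keep-majority comes down to a comparison of counts. The sketch below shows only that arithmetic. It is not the Akka.NET resolver, which also accounts for roles and member status and has its own tiebreak, and `KeepMajority`, `Decide`, and `PartitionDecision` are hypothetical names. Even splits are reported as `Tie` rather than modelling the tiebreak.

```csharp
using System;

public enum PartitionDecision { Survive, DownSelf, Tie }

// Simplified model of the keep-majority comparison; not the library implementation.
public static class KeepMajority
{
    // reachable: members this side can still see, including itself.
    // total: cluster size before the partition.
    public static PartitionDecision Decide(int reachable, int total)
    {
        if (total <= 0) throw new ArgumentOutOfRangeException(nameof(total));
        if (reachable < 0 || reachable > total) throw new ArgumentOutOfRangeException(nameof(reachable));

        int unreachable = total - reachable;
        if (reachable > unreachable) return PartitionDecision.Survive;
        if (reachable < unreachable) return PartitionDecision.DownSelf;
        return PartitionDecision.Tie;
    }
}
```

For the 5-node diagram above, `Decide(3, 5)` is `Survive` and `Decide(2, 5)` is `DownSelf`. With 10 nodes split 5:5, both sides get `Tie`, which is the case an odd node count avoids.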

9. Operations — rolling upgrades and node replacement

🎯
3-line summary
  • During a rolling upgrade, take down at most ⌊(N-1)/2⌋ nodes at a time — preserve keep-majority.
  • With remember-entities = on, entities on the downed node reactivate elsewhere automatically.
  • An agent interrupted mid-LLM-call loses only the in-flight request. The conversation history held by SessionMemoryActor (the PersistentActor) is recovered from the event log on a different node.
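The ⌊(N-1)/2⌋ bound in the first bullet can be checked with a few lines of integer arithmetic. `RollingUpgrade` and its method names are hypothetical; the sketch assumes every node counts equally toward the majority.

```csharp
using System;

// Hypothetical helper for planning how many nodes an upgrade may take down at once.
public static class RollingUpgrade
{
    // Floor((N - 1) / 2): the most nodes that can be down while the rest still form a strict majority.
    public static int MaxNodesDown(int clusterSize)
    {
        if (clusterSize <= 0) throw new ArgumentOutOfRangeException(nameof(clusterSize));
        return (clusterSize - 1) / 2; // integer division floors for non-negative values
    }

    // True when the nodes still running outnumber half of the original cluster.
    public static bool KeepsMajority(int clusterSize, int nodesDown)
    {
        if (clusterSize <= 0) throw new ArgumentOutOfRangeException(nameof(clusterSize));
        if (nodesDown < 0 || nodesDown > clusterSize) throw new ArgumentOutOfRangeException(nameof(nodesDown));
        return (clusterSize - nodesDown) * 2 > clusterSize;
    }
}
```

`MaxNodesDown(3)` is 1 and `MaxNodesDown(5)` is 2. A 4-node cluster also tolerates only 1 node down, so the fourth node adds capacity but no extra upgrade headroom.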

9.1 Rolling upgrade sequence

sequenceDiagram
  autonumber
  participant Ops as Operator
  participant LB as Load Balancer
  participant NA as Node A (v1.0)
  participant NB as Node B (v1.0)
  participant NC as Node C (v1.0)
  participant Coord as ShardCoordinator
  Note over NA,NC: T=0 3-node v1.0 cluster, shards [0..99] distributed
  Ops->>LB: drain Node A (block traffic)
  Ops->>NA: Cluster.Leave
  NA->>Coord: leaving
  Coord->>NB: take over part of shards 0,3,6...
  Coord->>NC: take over the rest of shards 0,3,6...
  Note over NB,NC: remember-entities=on<br/>active entities re-activate
  NA->>NA: graceful shutdown
  Ops->>Ops: deploy v1.1 binary
  Ops->>NA: start v1.1
  NA->>Coord: join cluster
  Coord->>NA: assign shards 0,3,6...
  Note over NA: state recovery from EventJournal
  Ops->>LB: re-add Node A
  Note over Ops,Coord: Repeat for Node B, then C<br/>cluster is fully v1.1

9.2 Anti-patterns

| Anti-pattern | Consequence | Right way |
| --- | --- | --- |
| SIGKILL without Cluster.Leave | Shard timeout → 5-30 s response delay during rebalance | Always graceful leave first |
| Two nodes down simultaneously (3-node cluster) | 1 surviving node = not majority → SBR may down everything | One node at a time |
| Persistence schema migration interleaved with code deploy | No safe rollback | DB migration → code deploy → validate, in distinct phases |
| Putting in-memory state on LlmAgentActor itself | Lost when node dies; implicitly not replayable on restart | All state belongs in SessionMemoryActor or an external store |

9.3 Observability — distributed tracing is non-negotiable

On a single node, one log file captured every flow. In a cluster, one user request now traverses multiple nodes.
// Manual OpenTelemetry span around the existing handler
// (AgentTelemetry.Source is assumed to be a shared System.Diagnostics.ActivitySource)
Receive<AgentRequest>(req =>
{
    using var activity = AgentTelemetry.Source.StartActivity("agent.request");
    activity?.SetTag("session_id", req.SessionId);
    activity?.SetTag("node", Cluster.Get(Context.System).SelfAddress.Host);
    activity?.SetTag("shard_id", _shardId);
    // ... existing handling
});
Minimum trace context:
  • session_id — user tracking (mind PII)
  • shard_id / node — routing path
  • llm_call_id — identify LLM round-trips
  • tool_name — identify tool invocations
Pipe to an OpenTelemetry collector + Jaeger/Tempo and the Conversation pattern's call chains and the Workflow pattern's per-step durations become legible.

10. Closing — the cost of porting from single-node to cluster is ~0

A few decisions in Part 1 turned out to be made with a cluster in mind.
  • Keeping LlmAgentActor stateless and putting all state in a child SessionMemoryActor.
  • Defining every message as an immutable record.
  • Letting AgentRouter handle session-keyed routing as a separate concern.
Those three decisions are what let the Part 2 cluster migration barely change any code at all. The three things that did change:
  1. AgentRouter → ClusterSharding.Start(...).
  2. Persistence plugin: SQLite → PostgreSQL/Mongo.
  3. HOCON gains cluster, sharding, and split-brain-resolver blocks.
The business logic stays the same: the four-Behavior FSM (Idle/GatheringContext/Reasoning/Acting), the IAgentTool abstraction, and the event-sourced SessionMemoryActor.
This is what location transparency in the actor model actually buys you. Not a marketing line, but a design property that drops the cost of moving from single-node to cluster to near zero. Across the four axes I covered in "Actor Model Underneath Agentic AI — No Surprise" (Google AX / Akka / Orleans / Ray / Temporal: isolation, messaging, durability, location transparency), the last axis is the one whose value shows up latest, and the moment it shows up is exactly when you go cluster.
Candidate Part 3: MCP server integration and multi-tenant isolation on the Akka.NET cluster. The limits of Singleton (per-tenant rate limiters cannot be a single Singleton) and the multi-key Sharded entity pattern.

Editor's intent map (for translators and reviewers)

| Korean intent | English rendering | Reason |
| --- | --- | --- |
| "단일 노드의 진짜 한계" | "the real single-node limits" | Drops the awkward "true limits"; "real" reads more naturally for a senior eng audience |
| "진짜 문제 5가지" (concrete, hands-on tone) | "the five real problems" | Same emphasis: "real" doing the work the Korean "진짜" does |
| "코드 변경 거의 0" | "≈ 0 code change" / "barely change any code at all" | Two different renderings depending on diagram context vs prose |
| "마치며" | "Closing — the cost of porting from single-node to cluster is ~0" | Korean prose convention doesn't translate; turn the section closer into a thesis statement |
| "운영 필수" / "협상 불가능한 결격 사유" | "non-negotiable for production" / "non-negotiable disqualifier" | Standard English ops vocabulary |
