This is the follow-up to Building LLM Agents in Akka.NET — Porting Akka.io's Agent SDK to the Actor Model (Part 1).
In Part 1 we ported the Akka.io Agent abstraction to Akka.NET: a single LlmAgentActor driving the four-Behavior FSM (Idle → GatheringContext → Reasoning → Acting), a SessionMemoryActor (PersistentActor), and an AgentRouter that routed by SessionId. All of that lived inside one ActorSystem on one machine.
This article asks exactly one question: how much of that code has to change when you move it from one machine to an N-node cluster?
Target reader: a .NET engineer comfortable with distributed systems and the actor model basics, who wants to see how the abstraction survives horizontal scale.
Korean original (Part 2): Akka.NET으로 AgentAI 오케스트레이션 만들기 — 클러스터 확장편 (Building AgentAI Orchestration with Akka.NET: Cluster Expansion, Part 2) · Part 1 (predecessor): Building LLM Agents in Akka.NET — Porting Akka.io's Agent SDK to the Actor Model · Related: Actor Model Underneath Agentic AI — No Surprise
1. Why cluster — the real single-node limits
3-line summary
- The single-node ceiling isn't CPU/GPU shortage — it's memory pressure as sessions grow and cold-start cost on restart.
- Location transparency in the actor model is not marketing copy — it's the design choice that solves both problems at once.
- Once you go cluster, which node an Agent actor lives on stops mattering.
Part 1's AgentRouter spawned one child LlmAgentActor per session and routed via a _bySession dictionary. On a single box, that pattern breaks in three places.
- Memory ceiling. Per session: 1 LlmAgentActor + 1 SessionMemoryActor = 2 actors. 100k active users = 200k actors. At ~4 KB per actor that's 800 MB before mailboxes and history. The GC pressure on one .NET process climbs into the abnormal range fast.
- Restart cold start. Restart the process and every session has to replay events from the journal. If N users reconnect simultaneously you get a concurrent replay storm.
- SLA — single point of failure. When the agent node dies, every conversation on it drops. In production that's a non-negotiable disqualifier.
All three have well-known distributed-systems answers: shard the actors across N nodes, lean on persistence to shrink cold start, and isolate failures with supervision plus a Split-Brain Resolver. Location transparency is what lets you apply those answers with almost no code change.
    graph LR
      subgraph "Part 1: single node"
        Router1[AgentRouter] --> A1[LlmAgentActor: alice]
        Router1 --> A2[LlmAgentActor: bob]
        Router1 --> A3[LlmAgentActor: carol]
        A1 --> M1[SessionMemoryActor]
        A2 --> M2[SessionMemoryActor]
        A3 --> M3[SessionMemoryActor]
      end
      subgraph "Part 2: N-node cluster"
        Client[Client] --> SR[ShardRegion<br/>one per node]
        SR --> N1[Node A<br/>shard 0,3,6]
        SR --> N2[Node B<br/>shard 1,4,7]
        SR --> N3[Node C<br/>shard 2,5,8]
        N1 -. Persistence .-> DB[(EventStore<br/>PostgreSQL/Mongo)]
        N2 -. Persistence .-> DB
        N3 -. Persistence .-> DB
      end
      A1 -.->|"≈ 0 code change"| N1
Going from the left to the right, the business logic barely moves. What changes is how actors are spawned and which journal backend they use.
2. Akka.NET Cluster in 5 minutes
3-line summary
- A cluster is a negotiated membership set of nodes. State is propagated via a Gossip protocol.
- Same ActorSystem name + agreed seed-nodes → ClusterEvent.MemberUp and you're running.
- Three distributed actor tools to know: Cluster Sharding, Cluster Singleton, Distributed PubSub.
Minimum HOCON
    akka {
      actor.provider = cluster
      remote.dot-netty.tcp {
        hostname = "0.0.0.0"
        port = 0   # 0 = dynamic (pin in production)
      }
      cluster {
        seed-nodes = [
          "akka.tcp://agent-cluster@node-a:2552",
          "akka.tcp://agent-cluster@node-b:2552"
        ]
        roles = ["agent"]
        sharding {
          role = "agent"
          remember-entities = on   # reactivate entities after node down
        }
        downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
        split-brain-resolver {
          active-strategy = keep-majority
        }
      }
      persistence {
        journal.plugin = "akka.persistence.journal.postgresql"
        snapshot-store.plugin = "akka.persistence.snapshot-store.postgresql"
      }
    }
The lines that matter:
- actor.provider = cluster: this single line flips the ActorSystem into cluster mode.
- seed-nodes: first contact points when joining. Usually 2–3.
- roles: assigns a node's responsibilities (e.g. agent / gateway / persistence-writer). Sharding only rebalances within nodes that share a role.
- split-brain-resolver: decides which partition survives a network split. Without it, the two halves of a split-brain evolve into separate clusters in parallel. Non-optional for production.
Three distributed primitives
    graph TB
      subgraph "Cluster Sharding"
        direction LR
        SC[ShardCoordinator<br/>Singleton] --> SR1[ShardRegion @ Node A]
        SC --> SR2[ShardRegion @ Node B]
        SC --> SR3[ShardRegion @ Node C]
        SR1 --> E11[Entity 1]
        SR1 --> E12[Entity 4]
        SR2 --> E21[Entity 2]
        SR2 --> E22[Entity 5]
        SR3 --> E31[Entity 3]
        SR3 --> E32[Entity 6]
      end
      subgraph "Cluster Singleton"
        SM[SingletonManager<br/>on every node] --> SP[Singleton Proxy] --> S[Singleton Actor<br/>exactly one, on oldest node]
      end
      subgraph "Distributed PubSub"
        P1[Publisher @ Node A] --> Med1[Mediator @ Node A]
        Med1 -. gossip .- Med2[Mediator @ Node B]
        Med2 --> Sub1[Subscriber @ Node B]
        Med2 --> Sub2[Subscriber @ Node B]
      end
- Cluster Sharding = "one actor per entity" auto-distributed across N nodes. ShardCoordinator (a Singleton) decides which shard lives on which node. Handoff (rebalance) is automatic.
- Cluster Singleton = "an actor that must exist exactly once cluster-wide" — lives on the oldest node by default. Rate limiter, scheduler, central coordinator.
- Distributed PubSub = topic-based pub/sub without knowing actor locations. The backbone of multi-agent collaboration.
The core Part 2 transformation: replace Part 1's AgentRouter with Cluster Sharding's ShardRegion.
3. The five real problems when going cluster
From a core-architecture stance, the single-to-cluster move hits five concrete problems.
Problem | Single node | Cluster | Akka.NET answer |
1. Agent location | Dictionary lookup | Which node? | Cluster Sharding |
2. Session affinity | Same in-memory object | Same SessionId → same actor, guaranteed | ShardId = hash(SessionId) % N |
3. Persistence backend | Local SQLite OK | Reachable from every node | PostgreSQL / Mongo Persistence Plugin |
4. Failure isolation | SupervisorStrategy | SupervisorStrategy + network partitions | Split-Brain Resolver + remember-entities |
5. Observability | One log file | Distributed traces | OpenTelemetry + Phobos or Petabridge.Cmd |
We'll walk through each.
4. Sharding the Agent actors
3-line summary
- Swap AgentRouter (Part 1) for ClusterSharding.Start(...); actors are now distributed across N nodes by SessionId.
- Entity ID = SessionId, Shard ID = hash(SessionId) % shardCount.
- When nodes join/leave, ShardCoordinator handles rebalance automatically. Business code stays the same.
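The Shard ID rule in the summary can be sketched in a few lines. This is a minimal illustration, not Akka.NET's implementation: ShardMath and its FNV-1a hash are hypothetical stand-ins (HashCodeMessageExtractor uses its own hash internally). The only point is that a stable hash plus a modulo sends the same SessionId to the same shard, no matter which node computes it.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Akka.NET: illustrates hash(SessionId) % shardCount.
public static class ShardMath
{
    // FNV-1a over UTF-8 bytes. Deterministic across processes, unlike
    // string.GetHashCode(), which is randomized per process on modern .NET.
    public static uint StableHash(string value)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;
        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(value))
        {
            hash ^= b;
            hash = unchecked(hash * prime);
        }
        return hash;
    }

    // Maps a session to a shard in the range [0, shardCount).
    public static int ShardIdFor(string sessionId, int shardCount)
    {
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));
        return (int)(StableHash(sessionId) % (uint)shardCount);
    }
}
```

Calling ShardMath.ShardIdFor("alice", 100) on any node yields the same shard number, which is what lets every ShardRegion agree on where a session belongs. Note that changing shardCount remaps most sessions, so it is a value to fix up front.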
4.1 Sharding setup
    using Akka.Cluster.Sharding;

    public sealed class AgentMessageExtractor : HashCodeMessageExtractor
    {
        public AgentMessageExtractor(int maxShards = 100) : base(maxShards) { }

        // Entity ID = SessionId, so one session always maps to one entity actor
        public override string EntityId(object message) => message switch
        {
            AgentRequest req => req.SessionId,
            ShardEnvelope env => env.EntityId,
            _ => null!
        };

        // Unwrap envelopes; plain messages are delivered as-is
        public override object EntityMessage(object message) => message switch
        {
            ShardEnvelope env => env.Message,
            _ => message
        };
    }

    // Cluster boot: once per node at startup
    var sharding = ClusterSharding.Get(system);
    var agentShardRegion = sharding.Start(
        typeName: "agent",
        entityProps: HelloAgentActor.Props(llm),   // unchanged Part 1 actor
        settings: ClusterShardingSettings.Create(system).WithRole("agent"),
        messageExtractor: new AgentMessageExtractor(maxShards: 100)
    );

    // Callers send to the ShardRegion; routing is automatic
    var reply = await agentShardRegion.Ask<AgentResponse>(
        new AgentRequest("alice", "Hi, I'm Alice"),
        TimeSpan.FromSeconds(30));
4.2 What the ShardRegion does for you
    sequenceDiagram
      autonumber
      participant Client
      participant SR_A as ShardRegion @ Node A
      participant SC as ShardCoordinator<br/>(Singleton)
      participant SR_C as ShardRegion @ Node C
      participant E as Entity "alice"<br/>(alive on Node C)
      Client->>SR_A: AgentRequest "alice"
      Note over SR_A: EntityId=alice<br/>ShardId=hash(alice) % 100 = 23
      SR_A->>SR_A: don't know where shard 23 lives yet
      SR_A->>SC: GetShardHome(23)
      SC-->>SR_A: shard 23 → Node C
      SR_A->>SR_C: Forward(AgentRequest "alice")
      Note over SR_C: does shard 23 contain entity "alice"?
      SR_C->>E: AgentRequest
      E-->>SR_C: AgentResponse
      SR_C-->>SR_A: AgentResponse
      SR_A-->>Client: AgentResponse
      Note over SR_A,SR_C: From now on Node A's ShardRegion<br/>caches shard 23 → Node C
Key points:
- The client can send to any ShardRegion (Node A/B/C). Routing is the ShardRegion's job.
- Entity actor alice exists exactly once in the entire cluster; Cluster Sharding guarantees that.
- The first hop goes through ShardCoordinator, but subsequent calls hit a per-region shard → node cache.
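The lookup-then-cache behavior described in the last key point can be modeled without any Akka types. The sketch below is an assumption-laden toy: ShardHomeCache is a hypothetical class, and the real ShardRegion is an actor that exchanges messages with the ShardCoordinator rather than calling a delegate. It only shows why the coordinator is consulted once per shard rather than once per message.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the "ask the coordinator once, then cache" flow.
public sealed class ShardHomeCache
{
    private readonly Dictionary<int, string> _homes = new();
    private readonly Func<int, string> _askCoordinator;

    // Counts how often the (simulated) coordinator had to be consulted.
    public int CoordinatorLookups { get; private set; }

    public ShardHomeCache(Func<int, string> askCoordinator) =>
        _askCoordinator = askCoordinator;

    public string ResolveNode(int shardId)
    {
        // Cache hit: no coordinator round-trip.
        if (_homes.TryGetValue(shardId, out var node)) return node;

        // Cache miss: ask the coordinator and remember the answer.
        CoordinatorLookups++;
        node = _askCoordinator(shardId);
        _homes[shardId] = node;
        return node;
    }

    // After a handoff (rebalance), the cached home is stale and must be dropped.
    public void Invalidate(int shardId) => _homes.Remove(shardId);
}
```

In this model, resolving shard 23 twice triggers a single coordinator lookup, and Invalidate(23) forces the next message to ask again, which mirrors what happens after a rebalance.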
4.3 Automatic rebalance on join/leave
    graph TB
      subgraph "T=0 even balance across 3 nodes"
        direction LR
        N1A[Node A<br/>shard 0,3,6,9...]
        N2A[Node B<br/>shard 1,4,7,10...]
        N3A[Node C<br/>shard 2,5,8,11...]
      end
      subgraph "T=1 Node D joins"
        direction LR
        N1B[Node A<br/>shard 0,4,8...]
        N2B[Node B<br/>shard 1,5,9...]
        N3B[Node C<br/>shard 2,6,10...]
        N4B[Node D<br/>shard 3,7,11...]
      end
      subgraph "T=2 Node B down"
        direction LR
        N1C[Node A<br/>shard 0,1,4,8...]
        N3C[Node C<br/>shard 2,5,6,10...]
        N4C[Node D<br/>shard 3,7,9,11...]
      end
      N1A -. rebalance .-> N1B
      N1B -. node down +<br/>remember-entities .-> N1C
With remember-entities = on, the active entities that were living on Node B are re-activated elsewhere on the surviving nodes. Each SessionMemoryActor (a PersistentActor) recovers its state from the event journal.
5. The Coordinator is a Cluster Singleton — workflows & rate limiters
3-line summary
- Actors that must exist exactly once cluster-wide (multi-agent planner orchestrator, central tool registry) are Cluster Singletons.
- LLM rate limiter is the canonical case — you need one call-counter across the whole cluster.
- Callers use SingletonProxy and don't track location.
The Multi-Agent Planner workflow from Part 1's §9-1 chains WeatherAgent and ActivityAgent. In a cluster, we have to decide where the workflow runs.
Two options:
- (1) Workflow as Sharded Entity: one Entity per Workflow instance. Pairs well with the saga pattern.
- (2) Coordinator as Cluster Singleton: keep checkpoints in the persistence journal, but make the logical decisions in a single actor.
For LLM rate limiting, (2) is almost always correct. You need one view of cluster-wide call counts.
    using Akka.Cluster.Tools.Singleton;

    // Singleton registration: called on every node, Akka activates it only on the oldest one.
    // WithRole("rate-limiter") restricts it to nodes that list that role in akka.cluster.roles.
    var singletonManager = system.ActorOf(
        ClusterSingletonManager.Props(
            singletonProps: Props.Create<LlmRateLimiterActor>(),
            terminationMessage: PoisonPill.Instance,
            settings: ClusterSingletonManagerSettings.Create(system)
                .WithRole("rate-limiter")),
        name: "rate-limiter");

    // Caller: uses SingletonProxy, no location awareness
    var proxy = system.ActorOf(
        ClusterSingletonProxy.Props(
            singletonManagerPath: "/user/rate-limiter",
            settings: ClusterSingletonProxySettings.Create(system)
                .WithRole("rate-limiter")),
        name: "rate-limiter-proxy");

    // Inside LlmAgentActor
    var token = await proxy.Ask<RateLimitGrant>(
        new AcquireToken(estimatedTokens: 1500),
        TimeSpan.FromSeconds(5));
    sequenceDiagram
      autonumber
      participant A1 as LlmAgentActor @ Node A
      participant P1 as SingletonProxy @ Node A
      participant SM as SingletonManager @ Node B (oldest)
      participant S as LlmRateLimiterActor<br/>(one in cluster)
      participant A2 as LlmAgentActor @ Node C
      participant P2 as SingletonProxy @ Node C
      A1->>P1: AcquireToken(1500)
      P1->>SM: forward (location-aware)
      SM->>S: AcquireToken(1500)
      S-->>SM: Grant
      SM-->>P1: Grant
      P1-->>A1: Grant
      A2->>P2: AcquireToken(2000)
      P2->>SM: forward
      SM->>S: AcquireToken(2000)
      Note over S: remaining this minute: 4000<br/>→ Grant
      S-->>P2: Grant via SM
      P2-->>A2: Grant
      Note over A1,A2: Agents on Node A and C<br/>share the same rate limiter
The point of Singleton is single responsibility + determinism in a distributed environment. Used wrong, it becomes a bottleneck, so the rule is: let the Singleton decide, and delegate the heavy lifting to other actors.
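To make "decide, don't do the heavy lifting" concrete, here is a sketch of the kind of cheap arithmetic a rate-limiter Singleton could hold. Everything in it is an assumption for illustration: TokenBudget is a hypothetical class, the fixed one-minute window is one possible policy among several, and the article does not show how LlmRateLimiterActor is actually implemented. Time is passed in as a parameter so the logic stays deterministic.

```csharp
using System;

// Hypothetical decision core for a cluster-wide LLM rate limiter.
// Fixed one-minute window; the caller supplies the current time.
public sealed class TokenBudget
{
    private readonly int _tokensPerMinute;
    private DateTimeOffset _windowStart;
    private int _used;

    public TokenBudget(int tokensPerMinute, DateTimeOffset now)
    {
        _tokensPerMinute = tokensPerMinute;
        _windowStart = now;
    }

    // Tokens left in the window that was current at the last TryAcquire call.
    public int Remaining => _tokensPerMinute - _used;

    // Returns true and reserves the tokens if the current window has room.
    public bool TryAcquire(int estimatedTokens, DateTimeOffset now)
    {
        // Start a fresh window once a minute has elapsed.
        if (now - _windowStart >= TimeSpan.FromMinutes(1))
        {
            _windowStart = now;
            _used = 0;
        }

        if (estimatedTokens > _tokensPerMinute - _used) return false;
        _used += estimatedTokens;
        return true;
    }
}
```

An actor wrapping this would answer each AcquireToken message with a grant or a denial and nothing else; the LLM call itself stays in the sharded agents, which keeps the Singleton's mailbox short.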
6. Persistence — event sourcing + snapshot strategy
3-line summary
- For SessionMemoryActor to make sense in a cluster, the journal backend has to be reachable from every node (PostgreSQL/Mongo/Cassandra).
- To avoid the restart replay storm, snapshot cadence and lazy recovery are mandatory.
- Keep LlmAgentActor stateless; let SessionMemoryActor be the only PersistentActor. Clean separation of responsibility.
6.1 Event flow
    sequenceDiagram
      autonumber
      participant Client
      participant Agent as LlmAgentActor<br/>(Sharded Entity)
      participant Mem as SessionMemoryActor<br/>(PersistentActor)
      participant Journal as EventJournal<br/>(PostgreSQL)
      participant LLM
      Client->>Agent: AgentRequest("alice", "Hi")
      Agent->>Mem: LoadSessionMessages
      Mem->>Journal: SELECT events WHERE persistence_id='session-alice'
      Journal-->>Mem: [MessageAppended..., MessageAppended...]
      Note over Mem: Recover<MessageAppended>(Apply)<br/>history restored
      Mem-->>Agent: SessionMessages(history)
      Agent->>LLM: complete(prompt + history)
      LLM-->>Agent: reply
      Agent->>Mem: AppendMessages(user+assistant)
      Mem->>Journal: Persist(MessageAppended)
      Journal-->>Mem: ack
      Note over Mem: SaveSnapshot every 50th message
      Agent-->>Client: AgentResponse
6.2 Snapshot cadence — preventing replay storms
SessionMemoryActor already saves a snapshot every 50 messages (from Part 1). In a cluster, that cadence determines replay time.

    using System.Collections.Generic;
    using System.Linq;
    using Akka.Actor;
    using Akka.Persistence;

    public sealed class SessionMemoryActor : ReceivePersistentActor
    {
        private const int MaxHistory = 20;
        private const int SnapshotInterval = 50;

        private readonly string _sessionId;
        private readonly LinkedList<ChatMessage> _history = new();

        public override string PersistenceId => $"session-memory-{_sessionId}";

        public SessionMemoryActor(string sessionId)
        {
            _sessionId = sessionId;

            // 1. Restore from most recent snapshot first
            Recover<SnapshotOffer>(offer =>
            {
                if (offer.Snapshot is List<ChatMessage> snap)
                    foreach (var m in snap) _history.AddLast(m);
            });

            // 2. Replay only events after the snapshot
            Recover<MessageAppended>(evt => Apply(evt));

            Command<LoadSessionMessages>(_ =>
                Sender.Tell(new SessionMessages(_history.ToList())));

            Command<AppendMessages>(cmd =>
                Persist(new MessageAppended(cmd.Message), evt =>
                {
                    Apply(evt);
                    // Snapshot every SnapshotInterval persisted events;
                    // the ack arrives later as SaveSnapshotSuccess.
                    if (LastSequenceNr % SnapshotInterval == 0)
                        SaveSnapshot(_history.ToList());
                }));

            // 3. On snapshot ack: delete older snapshots/events to keep storage bounded
            Command<SaveSnapshotSuccess>(success =>
            {
                DeleteSnapshots(new SnapshotSelectionCriteria(
                    maxSequenceNr: success.Metadata.SequenceNr - 1));
                DeleteMessages(success.Metadata.SequenceNr - SnapshotInterval);
            });
        }

        // Append to the bounded in-memory history, dropping the oldest entries
        private void Apply(MessageAppended evt)
        {
            _history.AddLast(evt.Message);
            while (_history.Count > MaxHistory) _history.RemoveFirst();
        }
    }
Three things to internalize:
- Snapshot interval = max_replay_time / event_apply_time. If one apply costs 0.1 ms and your SLA is 100 ms replay, the interval is around 1000 events.
- GC for events/snapshots. DeleteSnapshots + DeleteMessages matter. Drop those two lines and the PostgreSQL table grows unbounded.
- DB connection pool is a cluster-level number, not a node-level number. N nodes × pool size per node = total concurrent connections.
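The two sizing rules in that list are plain arithmetic, sketched below. CapacityMath is a hypothetical helper, and the 5-node / 20-connection figures in the usage note are made-up inputs, not recommendations; only the 0.1 ms and 100 ms values come from the text.

```csharp
using System;

// Hypothetical sizing helpers for the two rules of thumb above.
public static class CapacityMath
{
    // Snapshot interval = max replay time / per-event apply time.
    public static int SnapshotInterval(TimeSpan maxReplay, TimeSpan applyPerEvent)
    {
        if (applyPerEvent <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(applyPerEvent));
        return (int)(maxReplay.Ticks / applyPerEvent.Ticks);
    }

    // The database sees the sum of every node's pool, not one node's pool.
    public static int TotalDbConnections(int nodes, int poolSizePerNode) =>
        nodes * poolSizePerNode;
}
```

With a 100 ms replay budget and 0.1 ms per apply (TimeSpan.FromTicks(1_000), since one tick is 100 ns), SnapshotInterval returns 1000, matching the estimate above. TotalDbConnections(5, 20) returns 100, which is the number to compare against the database's connection limit.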
6.3 Responsibility split — Sharded vs Persistent
    graph TB
      subgraph "Node A"
        SR_A[ShardRegion 'agent']
        E_A1[LlmAgentActor<br/>SessionId=alice<br/>STATELESS]
        E_A2[LlmAgentActor<br/>SessionId=bob<br/>STATELESS]
        M_A1[SessionMemoryActor<br/>persistence_id=alice<br/>PERSISTENT]
        M_A2[SessionMemoryActor<br/>persistence_id=bob<br/>PERSISTENT]
        SR_A --> E_A1
        SR_A --> E_A2
        E_A1 -. child .-> M_A1
        E_A2 -. child .-> M_A2
      end
      M_A1 --> J[(EventJournal<br/>PostgreSQL)]
      M_A2 --> J
      style E_A1 fill:#e3f2fd
      style E_A2 fill:#e3f2fd
      style M_A1 fill:#fce4ec
      style M_A2 fill:#fce4ec
The principle: LlmAgentActor is stateless; only SessionMemoryActor is a PersistentActor. Reasoning: LLM calls are non-deterministic, so replaying the actions of LlmAgentActor makes no sense. What's worth persisting is the conversation history (events), full stop.
7. Three multi-agent collaboration topologies
3-line summary
- Three collaboration patterns: Conversation (direct message), Topic (pub/sub), Workflow (central saga).
- The choice depends on interaction synchronicity and state-sharing scope.
- All three are first-class in Akka.NET — DistributedPubSub, Cluster Sharding, Cluster Singleton.
7.1 Conversation — Agent ↔ Agent direct call
    sequenceDiagram
      autonumber
      participant W as WeatherAgent<br/>(Sharded)
      participant A as ActivityAgent<br/>(Sharded)
      participant SR as ShardRegion
      Note over W: One LlmAgentActor calls another like a tool
      W->>SR: AgentRequest(SessionId="alice-activity", "Madrid")
      SR->>A: forward (EntityId=alice-activity)
      A-->>W: AgentResponse("Indoor museums...")
      Note over W: wrap as tool_result,<br/>inject into next LLM call
The simplest pattern: one agent calls another agent as if it were a tool. The downside is that long call chains become hard to debug.
7.2 Topic — DistributedPubSub broadcast
    graph LR
      subgraph "Topic Mesh"
        P1[NewsCrawlerAgent<br/>publishes]
        P2[StockTickerAgent<br/>publishes]
        Med1[Mediator @ A]
        Med2[Mediator @ B]
        Med3[Mediator @ C]
        S1[SentimentAgent<br/>subscribes 'market.*']
        S2[PortfolioAgent<br/>subscribes 'market.stock']
        S3[AlertAgent<br/>subscribes 'market.alert']
        P1 --> Med1
        P2 --> Med1
        Med1 -. gossip .- Med2
        Med1 -. gossip .- Med3
        Med2 --> S1
        Med3 --> S2
        Med3 --> S3
      end
    using Akka.Cluster.Tools.PublishSubscribe;

    // Publisher
    var mediator = DistributedPubSub.Get(system).Mediator;
    mediator.Tell(new Publish("market.stock", new StockPriceUpdate("MSFT", 510.32)));

    // Subscriber (at agent startup, inside the actor)
    mediator.Tell(new Subscribe("market.stock", Self));
    Receive<StockPriceUpdate>(update => { /* trigger LLM analysis */ });
Useful for event-stream-driven multi-agent scenarios. Caveat: delivery guarantee is at-most-once. In production you usually pair it with an external broker like Kafka or Pulsar.
7.3 Workflow — saga / central orchestration
    sequenceDiagram
      autonumber
      participant Client
      participant W as TripPlannerWorkflow<br/>(Sharded Entity)
      participant Mem as Workflow EventStore
      participant WA as WeatherAgent<br/>(Sharded)
      participant AA as ActivityAgent<br/>(Sharded)
      participant PE as PreferencesEntity<br/>(Sharded)
      Client->>W: StartWorkflow("plan-trip", "alice", "Madrid")
      Note over W: state=GetWeather, PersistEvent
      W->>Mem: Persist(StateTransitioned)
      W->>WA: forward
      WA-->>W: AgentResponse("rainy 18C")
      Note over W: state=GetPreferences, PersistEvent
      W->>Mem: Persist
      W->>PE: GetPreferences("alice")
      PE-->>W: ["museum","cafe"]
      Note over W: state=SuggestActivity, PersistEvent
      W->>Mem: Persist
      W->>AA: forward
      AA-->>W: AgentResponse("Prado Museum")
      Note over W: state=Done, PersistEvent
      W->>Mem: Persist
      W-->>Client: WorkflowResult
This is almost 1:1 with Akka.io's Workflow component. The Workflow itself is a Sharded Entity, and each step transition is recorded as a PersistentActor event, so if a node dies mid-step, the workflow resumes from exactly the same step on a different node.
Pattern | Best for | Determinism | Debug difficulty |
Conversation | Short synchronous tool calls | Drops as call chain grows | Low (clear call graph) |
Topic | Async event fan-out | At-most-once | Medium (tracking broadcasts) |
Workflow | Multi-step + side effects | Replayable via event log | High (need event log tooling) |
8. Split-Brain — the most dangerous trap in production
3-line summary
- A network partition can split the cluster into two halves that each behave like a separate cluster.
- If both halves try to host the same Singleton, or run the same Sharded Entity instance on two nodes, data consistency breaks.
- Among the four Akka.NET SBR strategies, keep-majority is the safest default.
    graph TB
      subgraph "Healthy state: 5-node cluster"
        N1[Node A] --- N2[Node B]
        N2 --- N3[Node C]
        N3 --- N4[Node D]
        N4 --- N5[Node E]
        N1 --- N5
      end
      subgraph "Network partition"
        direction LR
        P1[Partition X<br/>Node A, B] -. broken link .- P2[Partition Y<br/>Node C, D, E]
        P1 -. SBR keep-majority .-> X1[ ❌ SHUTDOWN<br/>2 nodes, minority]
        P2 -.-> X2[ ✅ SURVIVE<br/>3 nodes, majority]
      end
      style X1 fill:#ffcdd2
      style X2 fill:#c8e6c9
Four SBR strategies
Strategy | Behavior | Use case |
keep-majority | Survive if you're the larger partition | Most common; odd-numbered clusters, including ones whose size changes over time |
static-quorum | Survive if size >= preconfigured quorum | When the cluster size is fixed and known in advance |
keep-oldest | Survive if you contain the oldest node | When preserving the Singleton matters most |
keep-referee | Survive if you contain a designated referee | Intentional master/replica designs |
    akka.cluster.split-brain-resolver {
      active-strategy = keep-majority
      stable-after = 20s             # ignore network flutter
      down-all-when-unstable = on    # if instability persists, down everything
    }
Operational checklist
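- Always run an odd number of nodes. With an even number plus keep-majority, an even split such as 5:5 has no true majority. The resolver then falls back to a tie-break (the side containing the member with the lowest address is kept), so which half survives is decided by address, not by capacity.
- Set stable-after to at least 5 s. Anything shorter overreacts to GC pauses and transient network blips.
- Monitor membership. Surface cluster membership events (ClusterEvent.IMemberEvent, ClusterEvent.UnreachableMember) to your monitoring stack. SBR-downed nodes do not rejoin automatically, so you need automated restart.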
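The keep-majority rule from this section can be written down as a small pure function, which is handy for reasoning about a planned topology before deploying it. This is a hypothetical model, not the Akka.Cluster.SBR code: the KeepMajority class and SbrDecision enum are invented here, nodes are plain strings, and an exact tie is reported as its own outcome because how it is resolved depends on the resolver's tie-break rule.

```csharp
using System;
using System.Collections.Generic;

public enum SbrDecision { Survive, DownSelf, Tie }

// Hypothetical model of the keep-majority rule, seen from one side of a partition.
public static class KeepMajority
{
    // reachable:   members this side can still see (including itself)
    // unreachable: members on the other side of the partition
    public static SbrDecision Decide(
        IReadOnlyCollection<string> reachable,
        IReadOnlyCollection<string> unreachable)
    {
        if (reachable.Count == 0)
            throw new ArgumentException("A side always contains at least itself.", nameof(reachable));

        if (reachable.Count > unreachable.Count) return SbrDecision.Survive;
        if (reachable.Count < unreachable.Count) return SbrDecision.DownSelf;

        // Equal halves: no majority exists, so the outcome is left to the tie-break.
        return SbrDecision.Tie;
    }
}
```

For the 5-node diagram above, the side holding Node A and B gets DownSelf and the side holding C, D, and E gets Survive. Feeding it an even split shows immediately why odd cluster sizes are the easier operating point.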
9. Operations — rolling upgrades and node replacement
3-line summary
- During a rolling upgrade, take down at most ⌊(N-1)/2⌋ nodes at a time so that keep-majority is preserved.
- With remember-entities = on, entities on the downed node reactivate elsewhere automatically.
- Conversation state survives even if a node dies mid-LLM-call: the PersistentActor recovers from the event log on a different node. The in-flight call itself is not replayed, so the caller has to retry that request.
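The ⌊(N-1)/2⌋ bound is easy to sanity-check in code before scripting an upgrade. RollingUpgradeMath below is a hypothetical helper written for this article's rule only; it assumes every node counts equally toward the majority.

```csharp
using System;

// Hypothetical helper: how many nodes may be offline at once under keep-majority.
public static class RollingUpgradeMath
{
    // floor((N - 1) / 2); integer division floors for non-negative values.
    public static int MaxNodesDownAtOnce(int clusterSize)
    {
        if (clusterSize < 1) throw new ArgumentOutOfRangeException(nameof(clusterSize));
        return (clusterSize - 1) / 2;
    }

    // True when the nodes still running form a strict majority of the cluster.
    public static bool RemainingIsMajority(int clusterSize, int nodesDown) =>
        (clusterSize - nodesDown) * 2 > clusterSize;
}
```

For a 3-node cluster the bound is 1 and for a 5-node cluster it is 2. For 4 nodes it is still 1, since taking 2 down leaves an even split rather than a majority.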
9.1 Rolling upgrade sequence
    sequenceDiagram
      autonumber
      participant Ops as Operator
      participant LB as Load Balancer
      participant NA as Node A (v1.0)
      participant NB as Node B (v1.0)
      participant NC as Node C (v1.0)
      participant Coord as ShardCoordinator
      Note over NA,NC: T=0 3-node v1.0 cluster, shards [0..99] distributed
      Ops->>LB: drain Node A (block traffic)
      Ops->>NA: Cluster.Leave
      NA->>Coord: leaving
      Coord->>NB: rebalance shards 0,3,6...
      Coord->>NC: rebalance shards 0,3,6...
      Note over NB,NC: remember-entities=on<br/>active entities re-activate
      NA->>NA: graceful shutdown
      Ops->>Ops: deploy v1.1 binary
      Ops->>NA: start v1.1
      NA->>Coord: join cluster
      Coord->>NA: assign shards 0,3,6...
      Note over NA: state recovery from EventJournal
      Ops->>LB: re-add Node A
      Note over Ops,Coord: Repeat for Node B, then C<br/>cluster is fully v1.1
9.2 Anti-patterns
Anti-pattern | Consequence | Right way |
SIGKILL without Cluster.Leave | Shard timeout → 5-30 s response delay during rebalance | Always graceful leave first |
Two nodes down simultaneously (3-node cluster) | 1 surviving node = not majority → SBR may down everything | One node at a time |
Persistence schema migration interleaved with code deploy | No safe rollback | DB migration → code deploy → validate, in distinct phases |
Putting in-memory state on LlmAgentActor itself | Lost when node dies; implicitly not replayable on restart | All state belongs in SessionMemoryActor or an external store |
9.3 Observability — distributed tracing is non-negotiable
On a single node, one log file captured every flow. In a cluster, one user request now traverses multiple nodes.
    // Petabridge.Cmd or OpenTelemetry middleware
    Receive<AgentRequest>(req =>
    {
        using var activity = AgentTelemetry.Source.StartActivity("agent.request");
        activity?.SetTag("session_id", req.SessionId);
        activity?.SetTag("node", Cluster.Get(Context.System).SelfAddress.Host);
        activity?.SetTag("shard_id", _shardId);
        // ... existing handling
    });
Minimum trace context:
- session_id: user tracking (mind PII)
- shard_id / node: routing path
- llm_call_id: identify LLM round-trips
- tool_name: identify tool invocations
Pipe to an OpenTelemetry collector + Jaeger/Tempo and the Conversation pattern's call chains and the Workflow pattern's per-step durations become legible.
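The handler above refers to AgentTelemetry.Source without showing it. One plausible shape, using only System.Diagnostics from the .NET base library, is sketched here. The source name "AgentCluster.Agents", the StartLlmCall helper, and the "agent.llm_call" span name are assumptions for illustration, not taken from Part 1.

```csharp
using System.Diagnostics;

// Hypothetical telemetry holder; OpenTelemetry's .NET SDK listens to ActivitySource by name.
public static class AgentTelemetry
{
    // The source name is an assumption; it must match what the tracing pipeline subscribes to.
    public static readonly ActivitySource Source = new("AgentCluster.Agents");

    // Starts a span for one LLM round-trip and attaches the minimum trace context.
    // Returns null when no listener is subscribed, so callers must null-check.
    public static Activity? StartLlmCall(string sessionId, string llmCallId, string? toolName = null)
    {
        var activity = Source.StartActivity("agent.llm_call");
        activity?.SetTag("session_id", sessionId);
        activity?.SetTag("llm_call_id", llmCallId);
        if (toolName is not null) activity?.SetTag("tool_name", toolName);
        return activity;
    }
}
```

A caller would wrap each LLM request in `using var span = AgentTelemetry.StartLlmCall(sessionId, callId);`. Because StartActivity returns null when nothing is listening, the instrumentation costs close to nothing on nodes where tracing is switched off.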
10. Closing — the cost of porting from single-node to cluster is ~0
A few decisions in Part 1 turned out to be cluster-ready choices.
- Keeping LlmAgentActor stateless and putting all state in a child SessionMemoryActor.
- Defining every message as an immutable record.
- Letting AgentRouter handle session-keyed routing as a separate concern.
Those three decisions are what let the Part 2 cluster migration barely change any code at all. The three things that did change:
- AgentRouter → ClusterSharding.Start(...).
- Persistence plugin: SQLite → PostgreSQL/Mongo.
- HOCON gains cluster, sharding, and split-brain-resolver blocks.
The business logic (the four-Behavior FSM Idle/GatheringContext/Reasoning/Acting, the IAgentTool abstraction, the event-sourced SessionMemoryActor) stays the same.
This is what location transparency in the actor model actually buys you. It is not a marketing line but a design property that drops the cost of moving from single-node to cluster to near zero. Across the four axes I covered in Actor Model Underneath Agentic AI — No Surprise (Google AX / Akka / Orleans / Ray / Temporal; isolation, messaging, durability, location transparency), the last axis is the one whose value shows up latest, and the moment it shows up is exactly when you go cluster.
Candidate Part 3: MCP server integration and multi-tenant isolation on the Akka.NET cluster. The limits of Singleton (per-tenant rate limiters cannot be a single Singleton) and the multi-key Sharded entity pattern.
Editor's intent map (for translators and reviewers)
Korean intent | English rendering | Reason |
"단일 노드의 진짜 한계" | "the real single-node limits" | Drops the awkward "true limits"; "real" reads more naturally for a senior eng audience |
"진짜 문제 5가지" (concrete, hands-on tone) | "the five real problems" | Same emphasis — "real" doing the work the Korean "진짜" does |
"코드 변경 거의 0" | "≈ 0 code change" / "barely change any code at all" | Two different renderings depending on diagram context vs prose |
"마치며" | "Closing — the cost of porting from single-node to cluster is ~0" | Korean prose convention doesn't translate; turn the section closer into a thesis statement |
"운영 필수" / "협상 불가능한 결격 사유" | "non-negotiable for production" / "non-negotiable disqualifier" | Standard English ops vocabulary |
References
External
- Tyler Jewell, Agentic AI: Why Experience Matters More Than Hype, Akka, 2025