Building Agent Orchestration on Akka.NET — Cluster Edition (Part 2)

🎯

This is the follow-up to Building LLM Agents in Akka.NET — Porting Akka.io's Agent SDK to the Actor Model (Part 1).

In Part 1 we ported the Akka.io Agent abstraction to Akka.NET — a single LlmAgentActor driving the four-Behavior FSM (Idle → GatheringContext → Reasoning → Acting), a SessionMemoryActor (PersistentActor), and an AgentRouter that routed by SessionId. All of that lived inside one ActorSystem on one machine.

This article asks exactly one question — how much of that code has to change when you move it from one machine to an N-node cluster?

Target reader: a .NET engineer comfortable with distributed systems and the actor model basics, who wants to see how the abstraction survives horizontal scale.

🌐

Korean original (Part 2): Akka.NET으로 AgentAI 오케스트레이션 만들기 — 클러스터 확장편 (Part 2)

Part 1 (predecessor): Building LLM Agents in Akka.NET — Porting Akka.io's Agent SDK to the Actor Model

Related: Actor Model Underneath Agentic AI — No Surprise

1. Why cluster — the real single-node limits

🎯

3-line summary

The single-node ceiling isn't CPU/GPU shortage — it's memory pressure as sessions grow and cold-start cost on restart.

Location transparency in the actor model is not marketing copy — it's the design choice that solves both problems at once.

Once you go cluster, which node an Agent actor lives on stops mattering.

Part 1's AgentRouter spawned one child LlmAgentActor per session and routed via a _bySession dictionary. On a single box, that pattern breaks in three places.

Memory ceiling. Per session: 1 LlmAgentActor + 1 SessionMemoryActor = 2 actors. 100k active users = 200k actors. At ~4 KB per actor that's 800 MB before mailboxes and history. The GC pressure on one .NET process climbs into the abnormal range fast.

Restart cold start. Restart the process and every session has to replay events from the journal. If N users reconnect simultaneously you get a concurrent replay storm.

SLA — single point of failure. When the agent node dies, every conversation on it drops. In production that's a non-negotiable disqualifier.

All three have well-known distributed-systems answers: shard the actors across N nodes, lean on persistence to shrink cold start, and isolate failures with supervision plus a Split-Brain Resolver. Location transparency is what lets you apply those answers with almost no code change.
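
As a quick sanity check on the memory-ceiling figure, the arithmetic can be written out. The ~4 KB per actor is the rough estimate used above, not a measurement, and the variable names are illustrative.

```csharp
using System;

// Rough baseline for the single-node layout from Part 1.
// 4 KB per actor is a ballpark figure, not a measured value.
const long activeSessions = 100_000;
const long actorsPerSession = 2;     // LlmAgentActor + SessionMemoryActor
const long bytesPerActor = 4_000;    // ~4 KB, decimal for easy reading

long actorCount = activeSessions * actorsPerSession;
long baselineBytes = actorCount * bytesPerActor;

Console.WriteLine($"{actorCount} actors, ~{baselineBytes / 1_000_000} MB before mailboxes and history");
```

Mailboxes and conversation history come on top of this baseline, which is why the real ceiling arrives earlier than the raw number suggests.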


graph LR
    subgraph "Part 1 — single node"
        Router1[AgentRouter] --> A1[LlmAgentActor: alice]
        Router1 --> A2[LlmAgentActor: bob]
        Router1 --> A3[LlmAgentActor: carol]
        A1 --> M1[SessionMemoryActor]
        A2 --> M2[SessionMemoryActor]
        A3 --> M3[SessionMemoryActor]
    end

    subgraph "Part 2 — N-node cluster"
        Client[Client] --> SR[ShardRegion<br/>one per node]
        SR --> N1[Node A<br/>shard 0,3,6]
        SR --> N2[Node B<br/>shard 1,4,7]
        SR --> N3[Node C<br/>shard 2,5,8]
        N1 -. Persistence .-> DB[(EventStore<br/>PostgreSQL/Mongo)]
        N2 -. Persistence .-> DB
        N3 -. Persistence .-> DB
    end

    A1 -.->|"≈ 0 code change"| N1

Going from the left to the right, the business logic barely moves. What changes is how actors are spawned and which journal backend they use.

2. Akka.NET Cluster in 5 minutes

🎯

3-line summary

A cluster is a negotiated membership set of nodes. State is propagated via a Gossip protocol.

Same ActorSystem name + agreed seed-nodes → ClusterEvent.MemberUp and you're running.

Three distributed actor tools to know: Cluster Sharding, Cluster Singleton, Distributed PubSub.

Minimum HOCON


akka {
  actor.provider = cluster
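  remote.dot-netty.tcp {
    hostname = "0.0.0.0"        # bind address
    public-hostname = "node-a"  # address other nodes use to reach this node; set per node
    port = 0  # 0 = dynamic; seed nodes must pin the port listed in seed-nodes (2552 here)
  }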
  cluster {
    seed-nodes = [
      "akka.tcp://agent-cluster@node-a:2552",
      "akka.tcp://agent-cluster@node-b:2552"
    ]
    roles = ["agent"]
    sharding {
      role = "agent"
      remember-entities = on  # reactivate entities after node down
    }
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-majority
    }
  }
  persistence {
    journal.plugin = "akka.persistence.journal.postgresql"
    snapshot-store.plugin = "akka.persistence.snapshot-store.postgresql"
  }
}

The lines that matter:

actor.provider = cluster — this single line flips ActorSystem into cluster mode.

seed-nodes — first contact points when joining. Usually 2–3.

roles — assigns a node's responsibilities (e.g. agent / gateway / persistence-writer). Sharding only rebalances within nodes that share a role.

split-brain-resolver — decides which partition survives a network split. Without it, two halves of a split-brain evolve into separate clusters in parallel. Non-optional for production.

Three distributed primitives


graph TB
    subgraph "Cluster Sharding"
        direction LR
        SC[ShardCoordinator<br/>Singleton] --> SR1[ShardRegion @ Node A]
        SC --> SR2[ShardRegion @ Node B]
        SC --> SR3[ShardRegion @ Node C]
        SR1 --> E11[Entity 1]
        SR1 --> E12[Entity 4]
        SR2 --> E21[Entity 2]
        SR2 --> E22[Entity 5]
        SR3 --> E31[Entity 3]
        SR3 --> E32[Entity 6]
    end

    subgraph "Cluster Singleton"
        SM[SingletonManager<br/>on every node] --> SP[Singleton Proxy] --> S[Singleton Actor<br/>exactly one, on oldest node]
    end

    subgraph "Distributed PubSub"
        P1[Publisher @ Node A] --> Med1[Mediator @ Node A]
        Med1 -. gossip .- Med2[Mediator @ Node B]
        Med2 --> Sub1[Subscriber @ Node B]
        Med2 --> Sub2[Subscriber @ Node B]
    end

Cluster Sharding = "one actor per entity" auto-distributed across N nodes. ShardCoordinator (a Singleton) decides which shard lives on which node. Handoff (rebalance) is automatic.

Cluster Singleton = "an actor that must exist exactly once cluster-wide" — lives on the oldest node by default. Rate limiter, scheduler, central coordinator.

Distributed PubSub = topic-based pub/sub without knowing actor locations. The backbone of multi-agent collaboration.

The core Part 2 transformation: replace Part 1's AgentRouter with Cluster Sharding's ShardRegion.

3. The five real problems when going cluster

From a core-architecture stance, the single-to-cluster move hits five concrete problems.

Problem	Single node	Cluster	Akka.NET answer
1. Agent location	Dictionary lookup	Which node?	Cluster Sharding
2. Session affinity	Same in-memory object	Same SessionId → same actor, guaranteed	ShardId = hash(SessionId) % N
3. Persistence backend	Local SQLite OK	Reachable from every node	PostgreSQL / Mongo Persistence Plugin
4. Failure isolation	SupervisorStrategy	SupervisorStrategy + network partitions	Split-Brain Resolver + remember-entities
5. Observability	One log file	Distributed traces	OpenTelemetry + Phobos or Petabridge.Cmd

We'll walk through each.

4. Sharding the Agent actors

🎯

3-line summary

Swap AgentRouter (Part 1) for ClusterSharding.Start(...) — actors are now distributed across N nodes by SessionId.

Entity ID = SessionId, Shard ID = hash(SessionId) % shardCount.

When nodes join/leave, ShardCoordinator handles rebalance automatically. Business code stays the same.

4.1 Sharding setup


public sealed class AgentMessageExtractor : HashCodeMessageExtractor
{
    public AgentMessageExtractor(int maxShards = 100) : base(maxShards) { }

    public override string EntityId(object message) => message switch
    {
        AgentRequest req => req.SessionId,
        ShardEnvelope env => env.EntityId,
        _ => null!
    };

    public override object EntityMessage(object message) => message switch
    {
        ShardEnvelope env => env.Message,
        _ => message
    };
}

// Cluster boot — once
var sharding = ClusterSharding.Get(system);
var agentShardRegion = sharding.Start(
    typeName: "agent",
    entityProps: HelloAgentActor.Props(llm),   // unchanged Part 1 actor
    settings: ClusterShardingSettings.Create(system).WithRole("agent"),
    messageExtractor: new AgentMessageExtractor(maxShards: 100)
);

// Callers send to the ShardRegion — routing is automatic
var reply = await agentShardRegion.Ask<AgentResponse>(
    new AgentRequest("alice", "Hi, I'm Alice"),
    TimeSpan.FromSeconds(30));
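
To make the ShardId = hash(SessionId) % shardCount rule concrete, here is a minimal sketch of a stable mapping. FNV-1a is a stand-in chosen for brevity; HashCodeMessageExtractor does its own hashing internally, and the helper names are hypothetical.

```csharp
using System;
using System.Text;

// Deterministic FNV-1a hash over the UTF-8 bytes of the session id.
// string.GetHashCode() is randomized per process on modern .NET, so it cannot
// be used when every node must compute the same shard for the same session.
static uint Fnv1a(string text)
{
    uint hash = 2166136261;
    foreach (byte b in Encoding.UTF8.GetBytes(text))
    {
        hash ^= b;
        hash *= 16777619;   // wraps around on overflow (unchecked by default)
    }
    return hash;
}

// Shard ids are strings in Akka.NET, so the bucket number is formatted as text.
static string ShardIdFor(string sessionId, int shardCount) =>
    (Fnv1a(sessionId) % (uint)shardCount).ToString();

Console.WriteLine(ShardIdFor("alice", 100));
Console.WriteLine(ShardIdFor("bob", 100));
```

The property that matters is that the same SessionId always lands in the same shard on every node, and that shardCount stays fixed for the lifetime of the cluster.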

4.2 What the ShardRegion does for you


sequenceDiagram
    autonumber
    participant Client
    participant SR_A as ShardRegion @ Node A
    participant SC as ShardCoordinator<br/>(Singleton)
    participant SR_C as ShardRegion @ Node C
    participant E as Entity "alice"<br/>(alive on Node C)

    Client->>SR_A: AgentRequest "alice"
    Note over SR_A: EntityId=alice<br/>ShardId=hash(alice) % 100 = 23
    SR_A->>SR_A: don't know where shard 23 lives yet
    SR_A->>SC: GetShardHome(23)
    SC-->>SR_A: shard 23 → Node C
    SR_A->>SR_C: Forward(AgentRequest "alice")
    Note over SR_C: does shard 23 contain entity "alice"?
    SR_C->>E: AgentRequest
    E-->>SR_C: AgentResponse
    SR_C-->>SR_A: AgentResponse
    SR_A-->>Client: AgentResponse

    Note over SR_A,SR_C: From now on Node A's ShardRegion<br/>caches shard 23 → Node C

Key points:

The client can send to any ShardRegion (Node A/B/C). Routing is the ShardRegion's job.

Entity actor alice exists exactly once in the entire cluster — Cluster Sharding guarantees that.

The first hop goes through ShardCoordinator, but subsequent calls hit a per-region shard → node cache.
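
The caching behaviour in the last point can be sketched without any Akka types. AskCoordinator is a hypothetical stand-in for the GetShardHome round-trip, and the node names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the coordinator: shard id -> node name.
int coordinatorLookups = 0;
string AskCoordinator(int shardId)
{
    coordinatorLookups++;
    return shardId % 3 == 0 ? "node-a" : shardId % 3 == 1 ? "node-b" : "node-c";
}

// Per-region cache: only the first message for a shard pays the coordinator hop.
var shardHomes = new Dictionary<int, string>();
string ResolveHome(int shardId)
{
    if (!shardHomes.TryGetValue(shardId, out var node))
    {
        node = AskCoordinator(shardId);
        shardHomes[shardId] = node;
    }
    return node;
}

ResolveHome(23);
ResolveHome(23);
Console.WriteLine($"coordinator lookups: {coordinatorLookups}");   // prints "coordinator lookups: 1"
```

A real ShardRegion also buffers messages while the home is unknown and refreshes its view when shards are handed off; the sketch leaves both out.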

4.3 Automatic rebalance on join/leave


graph TB
    subgraph "T=0  even balance across 3 nodes"
        direction LR
        N1A[Node A<br/>shard 0,3,6,9...]
        N2A[Node B<br/>shard 1,4,7,10...]
        N3A[Node C<br/>shard 2,5,8,11...]
    end

    subgraph "T=1  Node D joins"
        direction LR
        N1B[Node A<br/>shard 0,4,8...]
        N2B[Node B<br/>shard 1,5,9...]
        N3B[Node C<br/>shard 2,6,10...]
        N4B[Node D<br/>shard 3,7,11...]
    end

    subgraph "T=2  Node B down"
        direction LR
        N1C[Node A<br/>shard 0,1,4,5,8,9...]
        N3C[Node C<br/>shard 2,6,10...]
        N4C[Node D<br/>shard 3,7,11...]
    end

    N1A -. rebalance .-> N1B
    N1B -. node down +<br/>remember-entities .-> N1C

With remember-entities = on, the entities that were active on Node B are restarted on the surviving nodes without waiting for a new message; without it, each entity would only come back when the next message for it arrives. Each SessionMemoryActor (a PersistentActor) then recovers its state from the event journal.
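
The diagram is schematic about which survivor inherits which shard. As one illustration, the sketch below hands each orphaned shard to the survivor that currently owns the fewest. This is a simplified least-loaded rule written for this article, not Akka.NET's actual allocation strategy, and twelve shards are used for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Shard ownership with four nodes, twelve shards.
var owners = new Dictionary<string, List<int>>
{
    ["node-a"] = new List<int> { 0, 4, 8 },
    ["node-b"] = new List<int> { 1, 5, 9 },
    ["node-c"] = new List<int> { 2, 6, 10 },
    ["node-d"] = new List<int> { 3, 7, 11 },
};

// A node goes down: give each of its shards to the least-loaded survivor.
void Reassign(string deadNode)
{
    var orphaned = owners[deadNode];
    owners.Remove(deadNode);
    foreach (var shard in orphaned)
    {
        var target = owners.OrderBy(kv => kv.Value.Count).ThenBy(kv => kv.Key).First();
        target.Value.Add(shard);
    }
}

Reassign("node-b");
foreach (var kv in owners.OrderBy(kv => kv.Key))
    Console.WriteLine($"{kv.Key}: {string.Join(",", kv.Value)}");
```

Whatever the policy, the invariant is the same: every shard has exactly one owner after the reassignment.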

5. The Coordinator is a Cluster Singleton — workflows & rate limiters

🎯

3-line summary

Actors that must exist exactly once cluster-wide (multi-agent planner orchestrator, central tool registry) are Cluster Singletons.

LLM rate limiter is the canonical case — you need one call-counter across the whole cluster.

Callers use SingletonProxy and don't track location.

The Multi-Agent Planner workflow from Part 1's §9-1 chains WeatherAgent and ActivityAgent. In a cluster, we have to decide where the workflow runs.

Two options:

Workflow as Sharded Entity — one Entity per Workflow instance. Pairs well with the saga pattern.

Coordinator as Cluster Singleton: keep checkpoints in the persistence journal, but make the logical decisions in a single actor.

For LLM rate limiting, (2) is almost always correct. You need one view of cluster-wide call counts.


// Singleton registration: called on every node; Akka runs the actor only on the oldest node
// that carries the "rate-limiter" role, so that role must appear in akka.cluster.roles
var singletonManager = system.ActorOf(
    ClusterSingletonManager.Props(
        singletonProps: Props.Create<LlmRateLimiterActor>(),
        terminationMessage: PoisonPill.Instance,
        settings: ClusterSingletonManagerSettings.Create(system)
            .WithRole("rate-limiter")),
    name: "rate-limiter");

// Caller — uses SingletonProxy, no location awareness
var proxy = system.ActorOf(
    ClusterSingletonProxy.Props(
        singletonManagerPath: "/user/rate-limiter",
        settings: ClusterSingletonProxySettings.Create(system)
            .WithRole("rate-limiter")),
    name: "rate-limiter-proxy");

// Inside LlmAgentActor
var token = await proxy.Ask<RateLimitGrant>(
    new AcquireToken(estimatedTokens: 1500),
    TimeSpan.FromSeconds(5));


sequenceDiagram
    autonumber
    participant A1 as LlmAgentActor @ Node A
    participant P1 as SingletonProxy @ Node A
    participant SM as SingletonManager @ Node B (oldest)
    participant S as LlmRateLimiterActor<br/>(one in cluster)
    participant A2 as LlmAgentActor @ Node C
    participant P2 as SingletonProxy @ Node C

    A1->>P1: AcquireToken(1500)
    P1->>SM: forward (location-aware)
    SM->>S: AcquireToken(1500)
    S-->>SM: Grant
    SM-->>P1: Grant
    P1-->>A1: Grant

    A2->>P2: AcquireToken(2000)
    P2->>SM: forward
    SM->>S: AcquireToken(2000)
    Note over S: remaining this minute: 4000<br/>→ Grant
    S-->>P2: Grant via SM
    P2-->>A2: Grant

    Note over A1,A2: Agents on Node A and C<br/>share the same rate limiter

The point of Singleton is single responsibility + determinism in a distributed environment. Used wrong, it becomes a bottleneck, so the rule is: let it decide, and delegate the heavy lifting.
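
To show how little state the "decide" part needs, here is a minimal sketch of the bookkeeping a cluster-wide limiter could hold. The fixed one-minute window, the 5,500-token budget (chosen so that 4,000 remain after the first grant, as in the diagram) and the TryAcquire name are assumptions for illustration, not Part 1's LlmRateLimiterActor.

```csharp
using System;

// Fixed one-minute window token budget. Inside an actor this state needs no
// locks, because messages are processed one at a time.
const int tokensPerMinute = 5_500;   // illustrative budget
int used = 0;
DateTime windowStart = DateTime.MinValue;

bool TryAcquire(int estimatedTokens, DateTime now)
{
    if (now - windowStart >= TimeSpan.FromMinutes(1))
    {
        windowStart = now;   // new window: reset the counter
        used = 0;
    }
    if (used + estimatedTokens > tokensPerMinute) return false;
    used += estimatedTokens;
    return true;
}

var t0 = new DateTime(2025, 1, 1, 12, 0, 0, DateTimeKind.Utc);
Console.WriteLine(TryAcquire(1500, t0));                 // True: agent on Node A
Console.WriteLine(TryAcquire(2000, t0.AddSeconds(5)));   // True: agent on Node C, 4000 were left
Console.WriteLine(TryAcquire(3000, t0.AddSeconds(10)));  // False: only 2000 left in this window
Console.WriteLine(TryAcquire(3000, t0.AddSeconds(61)));  // True: a new window has started
```

The singleton only answers grant or deny; the LLM call itself stays in the sharded agents, which keeps the singleton off the hot path.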

6. Persistence — event sourcing + snapshot strategy

🎯

3-line summary

For SessionMemoryActor to make sense in a cluster, the journal backend has to be reachable from every node (PostgreSQL/Mongo/Cassandra).

To avoid the restart replay storm, snapshot cadence and lazy recovery are mandatory.

Keep LlmAgentActor stateless; let SessionMemoryActor be the only PersistentActor. Clean separation of responsibility.

6.1 Event flow


sequenceDiagram
    autonumber
    participant Client
    participant Agent as LlmAgentActor<br/>(Sharded Entity)
    participant Mem as SessionMemoryActor<br/>(PersistentActor)
    participant Journal as EventJournal<br/>(PostgreSQL)
    participant LLM

    Client->>Agent: AgentRequest("alice", "Hi")
    Agent->>Mem: LoadSessionMessages
    Mem->>Journal: SELECT events WHERE persistence_id='session-memory-alice'
    Journal-->>Mem: [MessageAppended..., MessageAppended...]
    Note over Mem: Recover<MessageAppended>(Apply)<br/>history restored
    Mem-->>Agent: SessionMessages(history)
    Agent->>LLM: complete(prompt + history)
    LLM-->>Agent: reply
    Agent->>Mem: AppendMessages(user+assistant)
    Mem->>Journal: Persist(MessageAppended)
    Journal-->>Mem: ack
    Note over Mem: SaveSnapshot every 50th message
    Agent-->>Client: AgentResponse

6.2 Snapshot cadence — preventing replay storms

SessionMemoryActor already saves a snapshot every 50 messages (from Part 1). In a cluster, that cadence determines replay time.


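using System.Collections.Generic;
using System.Linq;
using Akka.Actor;
using Akka.Persistence;

public sealed class SessionMemoryActor : ReceivePersistentActor
{
    private readonly string _sessionId;   // assigned once in the constructor
    public override string PersistenceId => $"session-memory-{_sessionId}";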
    private readonly LinkedList<ChatMessage> _history = new();
    private const int MaxHistory = 20;
    private const int SnapshotInterval = 50;

    public SessionMemoryActor(string sessionId)
    {
        _sessionId = sessionId;

        // 1. Restore from most recent snapshot first
        Recover<SnapshotOffer>(offer =>
        {
            if (offer.Snapshot is List<ChatMessage> snap)
                foreach (var m in snap) _history.AddLast(m);
        });

        // 2. Replay only events after the snapshot
        Recover<MessageAppended>(evt => Apply(evt));

        Command<LoadSessionMessages>(_ =>
            Sender.Tell(new SessionMessages(_history.ToList())));

        Command<AppendMessages>(cmd => Persist(new MessageAppended(cmd.Message), evt =>
        {
            Apply(evt);
            // LastSequenceNr is the sequence number of the event that was just persisted
            if (LastSequenceNr % SnapshotInterval == 0)
                SaveSnapshot(_history.ToList());
        }));

        // 3. On snapshot ack — delete older snapshots/events to keep storage bounded
        Command<SaveSnapshotSuccess>(success =>
        {
            DeleteSnapshots(new SnapshotSelectionCriteria(
                maxSequenceNr: success.Metadata.SequenceNr - 1));
            DeleteMessages(success.Metadata.SequenceNr - SnapshotInterval);
        });
    }

    private void Apply(MessageAppended evt)
    {
        _history.AddLast(evt.Message);
        while (_history.Count > MaxHistory) _history.RemoveFirst();
    }
}

Three things to internalize:

Snapshot interval = max_replay_time / event_apply_time. If one apply costs 0.1 ms and your SLA is 100 ms replay, the interval is around 1000 events.

GC for events/snapshots. DeleteSnapshots + DeleteMessages matter. Drop those two calls and the journal and snapshot tables in PostgreSQL grow without bound.

DB connection pool is a cluster-level number, not a node-level number. N nodes × pool size per node = total concurrent connections.
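
The two sizing rules above reduce to simple arithmetic. In this sketch the helper names, the five-node cluster and the pool size of 20 are illustrative assumptions; the 100 ms budget and 0.1 ms apply cost come from the text.

```csharp
using System;

// Snapshot interval from a replay-time budget. Microseconds are used as
// integers to avoid floating-point rounding.
static long SnapshotIntervalFor(long maxReplayMicros, long applyMicrosPerEvent) =>
    maxReplayMicros / applyMicrosPerEvent;

// Total journal connections the database must accept across the whole cluster.
static int TotalConnections(int nodes, int poolSizePerNode) => nodes * poolSizePerNode;

// 100 ms replay budget, 0.1 ms (100 microseconds) per applied event.
Console.WriteLine(SnapshotIntervalFor(maxReplayMicros: 100_000, applyMicrosPerEvent: 100));   // 1000

// Hypothetical cluster: 5 nodes, 20 pooled connections each.
Console.WriteLine(TotalConnections(nodes: 5, poolSizePerNode: 20));                           // 100
```

The second number is the one to compare against the database's connection limit before adding nodes.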

6.3 Responsibility split — Sharded vs Persistent


graph TB
    subgraph "Node A"
        SR_A[ShardRegion 'agent']
        E_A1[LlmAgentActor<br/>SessionId=alice<br/>STATELESS]
        E_A2[LlmAgentActor<br/>SessionId=bob<br/>STATELESS]
        M_A1[SessionMemoryActor<br/>persistence_id=session-memory-alice<br/>PERSISTENT]
        M_A2[SessionMemoryActor<br/>persistence_id=session-memory-bob<br/>PERSISTENT]
        SR_A --> E_A1
        SR_A --> E_A2
        E_A1 -. child .-> M_A1
        E_A2 -. child .-> M_A2
    end

    M_A1 --> J[(EventJournal<br/>PostgreSQL)]
    M_A2 --> J

    style E_A1 fill:#e3f2fd
    style E_A2 fill:#e3f2fd
    style M_A1 fill:#fce4ec
    style M_A2 fill:#fce4ec

The principle: LlmAgentActor is stateless; only SessionMemoryActor is a PersistentActor. Reasoning: LLM calls are non-deterministic, so replaying the actions of LlmAgentActor makes no sense. What's worth persisting is the conversation history (events), full stop.
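
Because only the history is event-sourced, recovery is a pure function of the latest snapshot plus the events after it. The sketch below checks that property in isolation, with plain strings standing in for ChatMessage and a hypothetical 120-event journal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int maxHistory = 20;

// Same rule as SessionMemoryActor.Apply: append, then trim to the newest entries.
void Apply(LinkedList<string> history, string message)
{
    history.AddLast(message);
    while (history.Count > maxHistory) history.RemoveFirst();
}

var events = Enumerable.Range(1, 120).Select(i => $"msg-{i}").ToList();

// Full replay: apply every event from the start of the journal.
var full = new LinkedList<string>();
foreach (var e in events) Apply(full, e);

// Snapshot taken at event 100, then only the 20 later events are replayed.
var atSnapshot = new LinkedList<string>();
foreach (var e in events.Take(100)) Apply(atSnapshot, e);
var recovered = new LinkedList<string>(atSnapshot);   // what a SnapshotOffer would restore
foreach (var e in events.Skip(100)) Apply(recovered, e);

Console.WriteLine(full.SequenceEqual(recovered));   // True
Console.WriteLine(recovered.First());               // msg-101
```

Nothing comparable holds for LlmAgentActor: replaying its actions would mean calling the LLM again and getting different answers.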

7. Three multi-agent collaboration topologies

🎯

3-line summary

Three collaboration patterns: Conversation (direct message), Topic (pub/sub), Workflow (central saga).

The choice depends on interaction synchronicity and state-sharing scope.

All three are first-class in Akka.NET — DistributedPubSub, Cluster Sharding, Cluster Singleton.

7.1 Conversation — Agent ↔ Agent direct call


sequenceDiagram
    autonumber
    participant W as WeatherAgent<br/>(Sharded)
    participant A as ActivityAgent<br/>(Sharded)
    participant SR as ShardRegion

    Note over W: One LlmAgentActor calls another like a tool
    W->>SR: AgentRequest(SessionId="alice-activity", "Madrid")
    SR->>A: forward (EntityId=alice-activity)
    A-->>W: AgentResponse("Indoor museums...")
    Note over W: wrap as tool_result,<br/>inject into next LLM call

The simplest pattern. One agent delegates a tool to another agent. The downside is long call chains become hard to debug.

7.2 Topic — DistributedPubSub broadcast


graph LR
    subgraph "Topic Mesh"
        P1[NewsCrawlerAgent<br/>publishes]
        P2[StockTickerAgent<br/>publishes]

        Med1[Mediator @ A]
        Med2[Mediator @ B]
        Med3[Mediator @ C]

        S1[SentimentAgent<br/>subscribes 'market.stock' + 'market.alert']
        S2[PortfolioAgent<br/>subscribes 'market.stock']
        S3[AlertAgent<br/>subscribes 'market.alert']

        P1 --> Med1
        P2 --> Med1
        Med1 -. gossip .- Med2
        Med1 -. gossip .- Med3
        Med2 --> S1
        Med3 --> S2
        Med3 --> S3
    end


// Publisher
var mediator = DistributedPubSub.Get(system).Mediator;
mediator.Tell(new Publish("market.stock", new StockPriceUpdate("MSFT", 510.32)));

// Subscriber (at agent startup)
mediator.Tell(new Subscribe("market.stock", Self));
Receive<StockPriceUpdate>(update => /* trigger LLM analysis */);

Useful for event-stream-driven multi-agent scenarios. Caveat: delivery guarantee is at-most-once. In production you usually pair it with an external broker like Kafka or Pulsar.

7.3 Workflow — saga / central orchestration


sequenceDiagram
    autonumber
    participant Client
    participant W as TripPlannerWorkflow<br/>(Sharded Entity)
    participant Mem as Workflow EventStore
    participant WA as WeatherAgent<br/>(Sharded)
    participant AA as ActivityAgent<br/>(Sharded)
    participant PE as PreferencesEntity<br/>(Sharded)

    Client->>W: StartWorkflow("plan-trip", "alice", "Madrid")
    Note over W: state=GetWeather, PersistEvent
    W->>Mem: Persist(StateTransitioned)
    W->>WA: forward
    WA-->>W: AgentResponse("rainy 18C")
    Note over W: state=GetPreferences, PersistEvent
    W->>Mem: Persist
    W->>PE: GetPreferences("alice")
    PE-->>W: ["museum","cafe"]
    Note over W: state=SuggestActivity, PersistEvent
    W->>Mem: Persist
    W->>AA: forward
    AA-->>W: AgentResponse("Prado Museum")
    Note over W: state=Done, PersistEvent
    W->>Mem: Persist
    W-->>Client: WorkflowResult

This is almost 1:1 with Akka.io's Workflow component. The Workflow itself is a Sharded Entity, and each step transition is recorded as a PersistentActor event — so if a node dies mid-step, the workflow resumes from exactly the same step on a different node.
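
The resume behaviour can be sketched with a plain list standing in for the journal. The step names follow the diagram; the Recover helper and the string-based events are simplifications, not the Akka.io Workflow API.

```csharp
using System;
using System.Collections.Generic;

// Step order of the trip-planner workflow.
string[] steps = { "GetWeather", "GetPreferences", "SuggestActivity", "Done" };

// Recovery: the current step is the last transition found in the journal.
string Recover(IEnumerable<string> persisted)
{
    var current = steps[0];
    foreach (var transitionedTo in persisted) current = transitionedTo;
    return current;
}

// The first node persisted three transitions, then died while SuggestActivity was running.
var journal = new List<string> { "GetWeather", "GetPreferences", "SuggestActivity" };

// Another node recovers the same entity and re-enters the interrupted step.
var resumedAt = Recover(journal);
Console.WriteLine($"resumed at: {resumedAt}");   // resumed at: SuggestActivity

// After re-running that step, each later transition is persisted as before.
for (int i = Array.IndexOf(steps, resumedAt) + 1; i < steps.Length; i++)
    journal.Add(steps[i]);   // stand-in for Persist(StateTransitioned)

Console.WriteLine($"final state: {Recover(journal)}");   // final state: Done
```

Since the interrupted step runs again after recovery, each step's side effects should be safe to repeat.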

Pattern	Best for	Determinism	Debug difficulty
Conversation	Short synchronous tool calls	Drops as call chain grows	Low (clear call graph)
Topic	Async event fan-out	At-most-once	Medium (tracking broadcasts)
Workflow	Multi-step + side effects	Replayable via event log	High (need event log tooling)

8. Split-Brain — the most dangerous trap in production

🎯

3-line summary

A network partition can split the cluster into two halves that each behave like a separate cluster.

If both halves try to host the same Singleton, or run the same Sharded Entity instance on two nodes, data consistency breaks.

Among the four SBR strategies compared below, keep-majority is the safest default.


graph TB
    subgraph "Healthy state — 5-node cluster"
        N1[Node A] --- N2[Node B]
        N2 --- N3[Node C]
        N3 --- N4[Node D]
        N4 --- N5[Node E]
        N1 --- N5
    end

    subgraph "Network partition"
        direction LR
        P1[Partition X<br/>Node A, B] -. broken link .- P2[Partition Y<br/>Node C, D, E]
        P1 -. SBR keep-majority .-> X1[ ❌ SHUTDOWN<br/>2 nodes — minority]
        P2 -.-> X2[ ✅ SURVIVE<br/>3 nodes — majority]
    end

    style X1 fill:#ffcdd2
    style X2 fill:#c8e6c9

Four SBR strategies

Strategy	Behavior	Use case
`keep-majority`	Survive if you're the larger partition	Most common; suits clusters whose size changes, ideally with an odd node count
`static-quorum`	Survive if size >= preconfigured quorum	Fixed-size clusters (set the quorum above half the node count)
`keep-oldest`	Survive if you contain the oldest node	When preserving the Singleton matters most
`keep-referee`	Survive if you contain a designated referee	Intentional master/replica designs


akka.cluster.split-brain-resolver {
  active-strategy = keep-majority
  stable-after = 20s        # ignore network flutter
  down-all-when-unstable = on   # if instability persists, down everything
}

Operational checklist

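Always run an odd number of nodes. With an even number, a 5:5 split has no true majority: keep-majority then falls back to a tie-breaker (the half containing the lowest-address member survives), and static-quorum with a quorum above half downs both halves.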

Set stable-after to at least 5 s. Anything shorter overreacts to GC pauses and transient network blips.

Monitor membership. Subscribe to ClusterEvent.IMemberEvent and ClusterEvent.IReachabilityEvent and surface the MemberStatus changes to your monitoring stack. SBR-downed nodes do not rejoin automatically, so you need automated restart.
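
The keep-majority decision itself is small enough to sketch. This is a simplified model written for this article: it assumes a two-way split, compares node names as plain strings instead of real cluster addresses, and ignores roles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified keep-majority decision for one side of a partition.
// 'reachable' is what this side can still see; 'allMembers' is the last known membership.
// On an exact tie, the side holding the lowest address survives.
static bool Survives(IReadOnlyCollection<string> reachable, IReadOnlyCollection<string> allMembers)
{
    int mine = reachable.Count;
    int theirs = allMembers.Count - mine;
    if (mine != theirs) return mine > theirs;
    var lowest = allMembers.OrderBy(a => a, StringComparer.Ordinal).First();
    return reachable.Contains(lowest);
}

var cluster = new[] { "node-a", "node-b", "node-c", "node-d", "node-e" };
Console.WriteLine(Survives(new[] { "node-a", "node-b" }, cluster));             // False: minority
Console.WriteLine(Survives(new[] { "node-c", "node-d", "node-e" }, cluster));   // True: majority
```

Each side runs the same decision on its own view, which is why both sides must wait for stable-after before acting.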

9. Operations — rolling upgrades and node replacement

🎯

3-line summary

During a rolling upgrade, take down at most ⌊(N-1)/2⌋ nodes at a time — preserve keep-majority.

With remember-entities = on, entities on the downed node reactivate elsewhere automatically.

An agent that dies mid-LLM-call loses only the in-flight request: SessionMemoryActor recovers the persisted history from the event log on a different node, so a retry picks up with the same history.
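
The ⌊(N-1)/2⌋ bound from the first point can be checked for a few cluster sizes. MaxNodesDown is a hypothetical helper name.

```csharp
using System;

// Largest number of nodes that can be down at once while the rest
// still form a strict majority of the original cluster.
static int MaxNodesDown(int clusterSize) => (clusterSize - 1) / 2;

foreach (var n in new[] { 3, 4, 5, 7 })
{
    int down = MaxNodesDown(n);
    int up = n - down;
    Console.WriteLine($"N={n}: take down at most {down}, leaving {up} (majority: {up * 2 > n})");
}
```

This is an upper bound for keeping a majority; taking one node at a time stays well inside it and is simpler to operate.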

9.1 Rolling upgrade sequence


sequenceDiagram
    autonumber
    participant Ops as Operator
    participant LB as Load Balancer
    participant NA as Node A (v1.0)
    participant NB as Node B (v1.0)
    participant NC as Node C (v1.0)
    participant Coord as ShardCoordinator

    Note over NA,NC: T=0  3-node v1.0 cluster, shards [0..99] distributed
    Ops->>LB: drain Node A (block traffic)
    Ops->>NA: Cluster.Leave
    NA->>Coord: leaving
    Coord->>NB: hand off part of Node A's shards (0,3,6...)
    Coord->>NC: hand off the rest of Node A's shards
    Note over NB,NC: remember-entities=on<br/>active entities re-activate
    NA->>NA: graceful shutdown
    Ops->>Ops: deploy v1.1 binary
    Ops->>NA: start v1.1
    NA->>Coord: join cluster
    Coord->>NA: assign shards 0,3,6...
    Note over NA: state recovery from EventJournal
    Ops->>LB: re-add Node A

    Note over Ops,Coord: Repeat for Node B, then C<br/>cluster is fully v1.1

9.2 Anti-patterns

Anti-pattern	Consequence	Right way
SIGKILL without Cluster.Leave	Shard timeout → 5-30 s response delay during rebalance	Always graceful leave first
Two nodes down simultaneously (3-node cluster)	1 surviving node = not majority → SBR may down everything	One node at a time
Persistence schema migration interleaved with code deploy	No safe rollback	DB migration → code deploy → validate, in distinct phases
Putting in-memory state on `LlmAgentActor` itself	Lost when node dies; implicitly not replayable on restart	All state belongs in `SessionMemoryActor` or an external store

9.3 Observability — distributed tracing is non-negotiable

On a single node, one log file captured every flow. In a cluster, one user request now traverses multiple nodes.


// Manual span via System.Diagnostics.ActivitySource (OpenTelemetry-compatible); AgentTelemetry.Source is an ActivitySource
Receive<AgentRequest>(req =>
{
    using var activity = AgentTelemetry.Source.StartActivity("agent.request");
    activity?.SetTag("session_id", req.SessionId);
    activity?.SetTag("node", Cluster.Get(Context.System).SelfAddress.Host);
    activity?.SetTag("shard_id", _shardId);
    // ... existing handling
});

Minimum trace context:

session_id — user tracking (mind PII)

shard_id / node — routing path

llm_call_id — identify LLM round-trips

tool_name — identify tool invocations

Pipe to an OpenTelemetry collector + Jaeger/Tempo and the Conversation pattern's call chains and the Workflow pattern's per-step durations become legible.
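
To see the minimum trace context end to end without a collector, the sketch below records one span with those tags using only System.Diagnostics. The source name, the tag values and the in-memory list are placeholders; in production an OpenTelemetry exporter would take the place of the list.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical source name; AgentTelemetry.Source would play this role in the actor.
var source = new ActivitySource("AgentTelemetry");
var exported = new List<Activity>();

// Without a listener StartActivity returns null, so register one that samples everything.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "AgentTelemetry",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
    ActivityStopped = exported.Add
};
ActivitySource.AddActivityListener(listener);

using (var activity = source.StartActivity("agent.request"))
{
    activity?.SetTag("session_id", "alice");                        // mind PII
    activity?.SetTag("node", "node-c");
    activity?.SetTag("shard_id", 23);
    activity?.SetTag("llm_call_id", Guid.NewGuid().ToString("N"));
    activity?.SetTag("tool_name", "weather.lookup");
}

foreach (var a in exported)
    Console.WriteLine($"{a.OperationName} session={a.GetTagItem("session_id")} shard={a.GetTagItem("shard_id")}");
```

Tagging every span with the same keys is what lets the tracing backend group a session's spans across nodes.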

10. Closing — the cost of porting from single-node to cluster is ~0

A few decisions in Part 1 turned out to have been made with the cluster in mind.

Keeping LlmAgentActor stateless and putting all state in a child SessionMemoryActor.

Defining every message as an immutable record.

Letting AgentRouter handle session-keyed routing as a separate concern.

Those three decisions are what let the Part 2 cluster migration barely change any code at all. The three things that did change:

AgentRouter → ClusterSharding.Start(...).

Persistence plugin: SQLite → PostgreSQL/Mongo.

HOCON gains cluster, sharding, and split-brain-resolver blocks.

The business logic — the four-Behavior FSM (Idle/GatheringContext/Reasoning/Acting), the IAgentTool abstraction, the event-sourced SessionMemoryActor — stays the same.

This is what location transparency in the actor model actually buys you. Not a marketing line, but a design property that drops the cost of moving from single-node to cluster to near zero. Across the four axes I covered in Actor Model Underneath Agentic AI — No Surprise (Google AX / Akka / Orleans / Ray / Temporal: isolation, messaging, durability, location transparency), the last axis is the one whose value shows up latest, and the moment it shows up is exactly when you go cluster.

Candidate Part 3: MCP server integration and multi-tenant isolation on the Akka.NET cluster. The limits of Singleton (per-tenant rate limiters cannot be a single Singleton) and the multi-key Sharded entity pattern.

Editor's intent map (for translators and reviewers)

Korean intent	English rendering	Reason
"단일 노드의 진짜 한계"	"the real single-node limits"	Drops the awkward "true limits"; "real" reads more naturally for a senior eng audience
"진짜 문제 5가지" (concrete, hands-on tone)	"the five real problems"	Same emphasis — "real" doing the work the Korean "진짜" does
"코드 변경 거의 0"	"≈ 0 code change" / "barely change any code at all"	Two different renderings depending on diagram context vs prose
"마치며"	"Closing — the cost of porting from single-node to cluster is ~0"	Korean prose convention doesn't translate; turn the section closer into a thesis statement
"운영 필수" / "협상 불가능한 결격 사유"	"non-negotiable for production" / "non-negotiable disqualifier"	Standard English ops vocabulary

References

Tyler Jewell, Agentic AI: Why Experience Matters More Than Hype, Akka, 2025
