Appearance
Running Distributed
Taking your mesh from 3 nodes to 100 in production requires careful configuration.
Seed Node Strategy
Use multiple seeds (at least 2-3) to avoid a single point of failure:
cpp
// Seeds should be on separate machines
config.seed_.host_.store(std::make_shared<const std::string>("seed1.cluster.local"));
config.seed_.node_port_.store(10000);If a seed is unreachable, the node retries automatically:
cpp
config.seed_.retry_interval_ms_.store(2000); // Retry every 2 secondsGossip Protocol
Nodes use a gossip protocol for failure detection and state dissemination:
cpp
config.gossip_.cycle_interval_ms_.store(1000); // How often to gossip
config.gossip_.suspect_timeout_s_.store(10); // Time before marking as suspect
config.gossip_.dead_timeout_s_.store(30); // Time before marking as dead
config.gossip_.prune_timeout_s_.store(300); // Time before removing dead nodes
config.gossip_.hybrid_threshold_.store(10); // Max peers for direct ping
config.gossip_.rumor_limit_per_frame_.store(5); // Max rumors per gossip cycleFor a Large Cluster (100+ nodes)
cpp
config.gossip_.cycle_interval_ms_.store(2000); // Reduce gossip frequency
config.gossip_.suspect_timeout_s_.store(15); // Be more tolerant
config.gossip_.hybrid_threshold_.store(20); // More direct pingsNode Configuration
cpp
config.node_.max_connections_.store(5); // Max connections per node
config.node_.max_peers_.store(8); // Max total peers
config.node_.max_hops_.store(10); // Max message propagation hops
config.node_.max_queue_size_.store(5000); // Send queue size
config.node_.max_frame_size_.store(65536); // Max frame size (64KB)
config.node_.read_timeout_seconds_.store(10); // Read timeoutRaft Consensus
For strongly consistent operations across the cluster, the framework uses the Raft consensus algorithm. Raft elects a single leader and replicates log entries to a quorum of followers.
cpp
// Access the Raft engine
auto& raft = app.get_state()->of_raft();
// Check current role
auto role = raft.get_role(); // follower, candidate, or leader
auto term = raft.get_current_term(); // current Raft term
auto leader = raft.get_leader_id(); // UUID of the elected leaderRaft operates on framework-internal state (node membership, cache consistency). Leader election and log replication happen automatically. The Raft engine is managed internally and does not expose a user-facing API for proposing custom state changes.
Raft Configuration
Raft timing is configured indirectly through the gossip and node configuration:
cpp
// Election timeout is derived from gossip.suspect_timeout_s_
// Heartbeat interval is derived from gossip.cycle_interval_ms_Failure Recovery
When a leader fails:
- A new election is triggered by the node with the shortest timeout.
- The candidate requests votes from a quorum of nodes.
- Once elected, the new leader begins replicating entries.
- When the old leader recovers, it re-joins as a follower and catches up via log replication.
Important: During a network partition, the isolated side that retains a quorum elects a new leader. When the partition heals, the isolated node re-joins as a follower — a new election is NOT triggered by restored connectivity.
Inter-Node Port
cpp
config.ports_.node_port_.store(10000); // Default for inter-node traffic
config.ports_.client_port_.store(0); // Client port (0 = same as node port)TLS for Inter-Node Communication
cpp
config.tls_.cert_chain_file_.store(std::make_shared<const std::string>("/etc/certs/server.crt"));
config.tls_.private_key_file_.store(std::make_shared<const std::string>("/etc/certs/server.key"));
config.tls_.ca_file_.store(std::make_shared<const std::string>("/etc/certs/ca.crt"));Monitoring Cluster State
Gossip Members
cpp
auto& gossip = app.get_state()->of_gossip();
// List active members
auto members = gossip.get_active_members();
for (auto& m : members) {
fmt::println("Node {} at {}:{} (state: {})",
m.node_id, m.host, m.nodes_port, static_cast<int>(m.state));
}
// Total known members
auto count = gossip.get_total_members_count();Metrics Collection
cpp
config.node_.metrics_interval_ms_.store(60000); // Report metrics every 60s
config.latency_.enabled_.store(true);
config.latency_.interval_s_.store(5);
config.latency_.history_size_.store(10);Common Pitfalls
- Clock skew — Nodes need synchronized clocks (NTP). Raft and vector clocks depend on accurate time.
- Network partitions — During a partition, the side with a quorum elects a new leader. When the partition heals, the isolated node re-joins as a follower (no new election is triggered).
- Too many seeds — Seeds can become bottlenecks. Keep it to 2-3.
- Logging volume — At 100 nodes, use scope filtering to reduce noise.