Failover Scenarios

The MindLink Chat Engine (MCE) is a distributed, fault-tolerant system made up of one or more servers (nodes) deployed in a cluster. MCE uses an actor-based, automatically load-balanced approach to handling work. In essence, each Chat Room is represented as an Actor, so Chat Room workloads are distributed across the cluster: different nodes become responsible for the workload of different rooms.

In normal operation, this approach distributes work evenly and scales horizontally: to handle more work, add more nodes.

In failure scenarios this can cause some potentially confusing situations. However, the MCE cluster is self-healing, provided that a majority of nodes are operating normally. Nodes that are temporarily disconnected from the cluster will rejoin it when connectivity is restored.

MindLink front-end service applications (MindLink Anywhere) that connect to the MCE cluster are also affected by failover. The MCE cluster represents connected front-end services as Actors. It is the responsibility of the front-end service to keep the Actor alive by periodically signaling to the cluster that it is still connected.
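The keep-alive relationship described above can be illustrated with a minimal sketch. This is not MCE's implementation: the class name, the `timeout_seconds` parameter, and the service identifier are all hypothetical, and the real signaling interval and expiry rules are not given here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FrontEndRegistry:
    """Tracks the last keep-alive signal seen from each front-end service."""

    # Hypothetical: how long an Actor survives without a fresh signal.
    timeout_seconds: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def signal(self, service_id: str, now: float) -> None:
        # The front-end service calls this periodically to say "still connected".
        self.last_seen[service_id] = now

    def is_alive(self, service_id: str, now: float) -> bool:
        # An Actor with no signal inside the timeout window is treated as gone.
        seen = self.last_seen.get(service_id)
        return seen is not None and (now - seen) <= self.timeout_seconds


registry = FrontEndRegistry(timeout_seconds=30.0)
registry.signal("anywhere-1", now=0.0)
print(registry.is_alive("anywhere-1", now=10.0))  # signal is still fresh
print(registry.is_alive("anywhere-1", now=45.0))  # signal is overdue
```

Passing `now` explicitly keeps the sketch deterministic. A real service would take the time from a clock and send its signal on a timer.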

While the details of each failover scenario are described below, the essential behavior is this: if there are connectivity issues in the cluster, some operations will succeed and some will fail. Once full connectivity is restored to a majority of nodes, the cluster will be fully operational.

Single node failure

In the face of a single node failure in a cluster of 3 (or a multi-node failure where f < n/2, with f the number of failed nodes and n the number of nodes in the cluster), the MCE cluster will detect the failure quickly and redistribute any work that resided on the failed node(s).

This kind of failure will manifest as:

  • A Chat Room whose workload is on the failed node(s) may briefly become inaccessible (joining, leaving, and sending messages may fail)
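The f < n/2 condition can be checked with simple integer arithmetic. The sketch below is only an illustration of that inequality. The function names are hypothetical and are not part of MCE.

```python
def has_majority(total_nodes: int, failed_nodes: int) -> bool:
    """Return True when the failure count satisfies f < n/2."""
    if total_nodes <= 0 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("expected total_nodes > 0 and 0 <= failed_nodes <= total_nodes")
    # Integer form of f < n/2, which avoids floating-point division.
    return 2 * failed_nodes < total_nodes


def max_tolerated_failures(total_nodes: int) -> int:
    """Largest f that still satisfies f < n/2."""
    return (total_nodes - 1) // 2


for n in (3, 4, 5):
    print(n, max_tolerated_failures(n))
```

By this rule a cluster of 3 tolerates one failed node, a cluster of 4 still tolerates only one, and a cluster of 5 tolerates two. This is why odd cluster sizes make the most of each added node.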

Database failover

If the underlying database fails over, the entire cluster will be unable to perform any work, because for safety all MCE operations that modify state must be persisted before being accepted. While the cluster nodes still have network communication they will remain active as a cluster, but will be unable to make progress.
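The persist-before-accept rule explains why a database failover stalls the whole cluster. The sketch below assumes a toy in-memory store. `InMemoryStore`, `StoreUnavailable`, and `ChatRoom.send_message` are hypothetical names for illustration and do not reflect MCE's actual types.

```python
from typing import List


class StoreUnavailable(Exception):
    """Raised by the hypothetical store while the database is failing over."""


class InMemoryStore:
    """Stand-in for the database; `available` simulates a failover window."""

    def __init__(self) -> None:
        self.available = True
        self.rows: List[str] = []

    def persist(self, record: str) -> None:
        if not self.available:
            raise StoreUnavailable("database is failing over")
        self.rows.append(record)


class ChatRoom:
    def __init__(self, store: InMemoryStore) -> None:
        self.store = store
        self.messages: List[str] = []

    def send_message(self, text: str) -> bool:
        try:
            self.store.persist(text)  # persist first
        except StoreUnavailable:
            return False  # rejected; in-memory state is left untouched
        self.messages.append(text)  # accept only after the write succeeded
        return True


store = InMemoryStore()
room = ChatRoom(store)
room.send_message("hello")      # accepted
store.available = False         # simulate the failover window
room.send_message("lost?")      # rejected, nothing is recorded
print(room.messages)
```

Because the in-memory state changes only after the write succeeds, the room never holds a message that the database does not.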

This initial failure will manifest as:

  • All Chat Rooms will become inaccessible
  • Managing Chat Rooms or searching Chat Rooms will fail

As the database failover completes and service is restored, some cluster nodes will return to normal operation before others. The workload distribution will remain unchanged, because the communication between cluster nodes is not interrupted.

This partial restoration of service will manifest as:

  • A Chat Room whose workload is on an unrecovered node may briefly become inaccessible
  • A front-end service whose representative Actor is on an unrecovered node may disconnect its user sessions

Once all nodes have recovered, full service will resume.

Additionally, all front-end services require database access in order to view the state of the cluster. In the event of a database failover, a front-end service's connectivity to the database may not be restored quickly enough. This will manifest as:

  • A user session whose workload is on the unrecovered front-end service may be unable to perform any MCE operation
  • New user logon attempts distributed to an unrecovered front-end service may fail

Network failure

In the face of a network failure that results in a partitioned network, the MCE cluster nodes that can form a majority will remain available. Any MCE cluster nodes that cannot form a majority will consider themselves clusterless. It is possible that no MCE cluster nodes can form a majority, in which case no MCE service will be available.

Any front-end services that can reach one or more MCE cluster nodes that form a majority will remain operational.
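The partition behavior above can be sketched as a small calculation over sets of node names. This only models the majority rule as stated here. The function names and node identifiers are hypothetical, and the way MCE actually detects partitions is not covered.

```python
from typing import FrozenSet, Iterable, Optional, Set


def majority_partition(
    cluster_nodes: Set[str], partitions: Iterable[Set[str]]
) -> Optional[FrozenSet[str]]:
    """Return the partition holding a strict majority of the cluster, if any."""
    total = len(cluster_nodes)
    for part in partitions:
        # Partitions are disjoint, so at most one can exceed half the cluster.
        if 2 * len(part) > total:
            return frozenset(part)
    return None


def front_end_operational(
    reachable: Set[str], majority: Optional[FrozenSet[str]]
) -> bool:
    """A front-end stays operational if it reaches any node in the majority."""
    return majority is not None and bool(reachable & majority)


nodes = {"a", "b", "c", "d", "e"}
split = [{"a", "b", "c"}, {"d", "e"}]
majority = majority_partition(nodes, split)
print(front_end_operational({"c"}, majority))       # reaches a majority node
print(front_end_operational({"d", "e"}, majority))  # only reaches clusterless nodes
```

With a three-way split such as `[{"a", "b"}, {"c", "d"}, {"e"}]` no partition exceeds half of the five nodes, so `majority_partition` returns `None` and no front-end is operational.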

If no majority of nodes is available and you wish to restore connectivity for one or more specific nodes, you must manually update the membership table to mark the other node(s) as Dead (Status = 6) or remove their entries.
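One way to see why this manual step helps is to model the membership table as a mapping from node name to status. The sketch rests on an assumption that is not confirmed here: that the majority is counted over entries not marked Dead. Only the value 6 for Dead comes from the text. `STATUS_OTHER`, the function names, and the node names are hypothetical, and the real table's schema is not shown.

```python
from typing import Dict, Set

STATUS_DEAD = 6   # value given in the text above
STATUS_OTHER = 0  # hypothetical stand-in for any non-Dead status


def mark_dead(table: Dict[str, int], nodes: Set[str]) -> Dict[str, int]:
    """Return a copy of the table with the given nodes marked Dead."""
    return {
        name: (STATUS_DEAD if name in nodes else status)
        for name, status in table.items()
    }


def can_form_majority(table: Dict[str, int], reachable: Set[str]) -> bool:
    """Assumption: the majority is counted over entries not marked Dead."""
    live = {name for name, status in table.items() if status != STATUS_DEAD}
    return 2 * len(live & reachable) > len(live)


table = {"node-a": STATUS_OTHER, "node-b": STATUS_OTHER, "node-c": STATUS_OTHER}
reachable = {"node-a"}  # the only node we can still reach

print(can_form_majority(table, reachable))   # one of three live entries
updated = mark_dead(table, {"node-b", "node-c"})
print(can_form_majority(updated, reachable))  # one of one live entry
```

Under that assumption, removing the other entries has the same effect as marking them Dead: the remaining node becomes the whole of the live membership.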