Failover and High Availability


Failover

Here we discuss failover strategies for each tier in the MindLink stack, in terms of complete failure of an entire pool in an active/passive DR configuration.

In an active/passive DR deployment, the MindLink Server is deployed with an identical standby pool in another location. If the primary pool fails, requests from user devices and browsers should be directed to the standby pool.

The configuration of the two pools should be equivalent. For Anywhere, SharePoint and Mobile there is no data that needs to be synchronized between pools.

For the API, the provisioned data files (Agents.xml, Users.xml and Throttle.xml) should be periodically copied between sites. A scheduled copy job is sufficient to copy the files between the installation directories of each pool.

The frequency at which this synchronization occurs depends on how frequently changes are made to the provisioned agents, channels, users and throttles. If regular changes are made, synchronization should occur at least once per day.
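
The scheduled copy job described above could be sketched as follows. This is a minimal illustration rather than a supplied tool: the three file names come from the text, while the function name and the directories passed to it are hypothetical and must be replaced with the installation directories of each pool. The script would be run by whatever scheduler is available, at the frequency chosen above.

```python
import shutil
from pathlib import Path

# File names come from the text above; everything else is an assumption.
PROVISIONED_FILES = ("Agents.xml", "Users.xml", "Throttle.xml")


def sync_provisioned_files(source_dir, target_dir):
    """Copy each provisioned file that exists in source_dir into target_dir.

    Returns the names of the files that were copied.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    copied = []
    for name in PROVISIONED_FILES:
        src = source / name
        if not src.is_file():
            continue  # this file has not been provisioned on the source pool
        # Copy to a temporary name first, then swap it into place, so a
        # failed copy never leaves a half-written file at the destination.
        tmp = target / (name + ".tmp")
        shutil.copy2(src, tmp)
        tmp.replace(target / name)
        copied.append(name)
    return copied
```

Copying via a temporary file is a design choice of this sketch, intended to avoid the standby pool reading a partially written file.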

Scenario: The MindLink pool becomes unavailable

- Administrative actions:

    - Redirect client requests to the new pool.
    - This can be achieved via DNS, load balancer configuration, or otherwise.
    - The load balanced pool name should be redirected to connect to the new pool.
    - For mobile, the individual names of each server should also be redirected to their equivalent nodes on the failover pool.


- Anywhere/SharePoint user experience:

    - The user will be notified that they have been disconnected from their session and reconnection will be attempted.
    - After failure to reconnect for longer than the configured session timeout interval, the user will be instructed to re-log on.
    - If connections are redirected to the failover pool within the session timeout interval, then the user will be instructed to re-log on immediately.
    - When connections have been redirected to the failover pool, users will be able to re-log on.


- Mobile user experience:

    - The user will be notified that they have become disconnected.
    - The app will continue to attempt to reconnect periodically.
    - The user may close and restart the app, at which point they will be given the choice to continue attempting reconnection to their old session or to re-log on.
    - When connections have been redirected to the new pool, the user will be immediately informed that their session has ended and instructed to re-log on.


- Failback

    - Connections should be redirected to the primary pool.
    - Users will be notified immediately that their session has ended and that they should re-log on.
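
The redirection described under the administrative actions above amounts to a table of names: the load balanced pool name plus, for mobile, each individual server name, each paired with its equivalent on the failover pool. The sketch below only builds and validates such a table; it does not change DNS or a load balancer, and the function name and all host names used with it are hypothetical.

```python
def build_redirect_map(primary_pool, standby_pool, primary_nodes, standby_nodes):
    """Pair every primary name with the standby name that should replace it.

    The pool name is always included; the per-node names matter for mobile,
    where each server name must resolve to an equivalent standby node.
    """
    if len(primary_nodes) != len(standby_nodes):
        raise ValueError("each primary node needs an equivalent standby node")
    redirects = {primary_pool: standby_pool}
    redirects.update(zip(primary_nodes, standby_nodes))
    return redirects
```

For example, `build_redirect_map("pool.site-a.example", "pool.site-b.example", ["ml1.site-a.example"], ["ml1.site-b.example"])` yields one entry for the pool name and one for the node.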

High Availability

Here we discuss high-availability strategies for each tier in the MindLink stack, in terms of node failure within a clustered pool.

The MindLink Server may be deployed in a pool configuration to support high availability. This mechanism ensures that there is always a MindLink Server available to service new log on requests.

The administrator should monitor the service health via the HTTP health check service, performance counters, and status of the Windows Service. The load balancer should be configured to maintain the candidate set of nodes by monitoring the HTTP health check service.
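
As an illustration of how a candidate set can be maintained from the HTTP health check service, the sketch below probes one URL per node and keeps the nodes that respond. It rests on assumptions: the health check URL for each node is not given here and must be supplied, and treating any HTTP 2xx response as healthy is a choice of this sketch, not documented MindLink behaviour. A real load balancer would normally do this with its own built-in monitors.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if the health check URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts and HTTP error statuses all count
        # as unhealthy in this sketch.
        return False


def candidate_nodes(health_urls, probe=is_healthy):
    """Return the names of the nodes whose health check currently passes.

    health_urls maps a node name to that node's health check URL.
    """
    return [node for node, url in health_urls.items() if probe(url)]
```

The `probe` parameter exists so the selection logic can be exercised without network access, for example `candidate_nodes(urls, probe=lambda url: True)`.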

Scenario: A MindLink Server node becomes unavailable.

- Administrative actions:

    - Diagnose the fault and restart the service.
    - The load balancer will remove the node from the candidate node list and redirect new log on requests to the other available servers.
    - When the node recovers, the load balancer will begin directing new log on requests to it.


- Anywhere/SharePoint user experience:

    - The user will be notified that they have been disconnected from their session and reconnection will be attempted.
    - After failure to reconnect for longer than the configured session timeout interval, the user will be instructed to re-log on.
    - If the node is restarted within the session timeout interval, then the user will be instructed to re-log on immediately.


- Mobile user experience:

    - The user will be notified that they have become disconnected.
    - The app will continue to attempt to reconnect periodically.
    - The user may close and restart the app, at which point they will be given the choice to continue attempting reconnection or to re-log on to a different node.
    - When the node is restarted, the user will be immediately informed that their session has ended and instructed to re-log on.


Skype for Business

Failover

Here we discuss failover strategies for each tier in the MindLink stack, in terms of complete failure of an entire pool in an active/passive DR configuration.

Skype for Business

Scenario: The next-hop Skype for Business frontend pool becomes unavailable.

- Administrative actions:

    - Reconfigure the MindLink Server to use another pool as the next-hop pool.
    - If using manual configuration of the named next-hop pool, this requires setting the name of the failover pool and restarting the MindLink service.
    - If using auto-provisioning:

        - Mobile/Anywhere/SharePoint: The SfB DNS auto-discovery records should be changed to point to the failover pool. The MindLink service does not require a restart.
        - API: The new next-hop pool should be changed in the published SfB topology. The topology changes will be replicated to the MindLink server. The MindLink service does not require a restart.




- Anywhere/SharePoint/Mobile user experience:

    - The user will be notified that their session has ended and instructed to re-log on.
    - On re-logging on, an SfB session will be established on another node in the frontend pool.


- API behaviour:

    - The agent and channels will become inactive.
    - The agent will attempt to automatically reconnect and eventually become active and re-activate its channels.

Scenario: The user's home Skype for Business frontend pool becomes unavailable.

- Administrative actions:

    - None. When homed users are moved to their failover pool, the next-hop pool will redirect the MindLink Server to the new log on pool.


- Anywhere/SharePoint/Mobile user experience:

    - The user will be notified that their session has ended and instructed to re-log on.
    - On re-logging on, an SfB session will be established on another node in the frontend pool.


- API behaviour:

    - The agent and channels will become inactive.
    - The agent will attempt to automatically reconnect and eventually become active and re-activate its channels.

Scenario: The SfB Persistent Chat Pool becomes unavailable.

SfB persistent chat supports a stretched-pool configuration for DR. As such, recovery of services in at least one site should be the administrative priority.

If the persistent chat pool fails, users will be notified that their session has been disconnected and will be unable to log on until services are restored.

Database Layer (Mobile)

Scenario: The database layer becomes unavailable.

- Administrative actions:

    - Configure the MindLink Server to connect to a new database via the Management Center.
    - Restart the MindLink Server.


- Failover process:

    - The MindLink Server periodically monitors the connectivity to and health of the database layer.
    - When a MindLink Server node identifies that it has not been able to connect to the database layer for a period of time, it will remove itself from the pool.
    - It will terminate all user sessions and refuse future log ons.


- User experience:

    - Logged on users will have their session removed and will be informed that they need to re-log on.
    - Once the database has been reconfigured and services restarted, users will be able to re-log on.
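
The failover process described above can be modelled as a small state tracker: periodic checks are recorded, and once the database has been unreachable for longer than some threshold the node leaves the pool. The sketch below is only an illustration of that logic, not the MindLink implementation. The class name, the `max_outage_seconds` parameter and the caller-supplied timestamps are hypothetical; the length of the "period of time" is not stated here.

```python
class DatabaseHealthMonitor:
    """Track database connectivity and decide when a node should leave the pool."""

    def __init__(self, max_outage_seconds):
        self.max_outage_seconds = max_outage_seconds
        self.outage_started_at = None  # time of the first failed check in a row
        self.in_pool = True

    def record_check(self, connected, now):
        """Record one connectivity check taken at time `now` (in seconds).

        Returns True while the node should remain in the pool.
        """
        if connected:
            # A successful check ends the current outage, if there was one.
            self.outage_started_at = None
            return self.in_pool
        if self.outage_started_at is None:
            self.outage_started_at = now
        if now - self.outage_started_at >= self.max_outage_seconds:
            # At this point the node would terminate user sessions and
            # refuse future log ons.
            self.in_pool = False
        return self.in_pool
```

In this sketch a node that has left the pool stays out even if a later check succeeds, which mirrors the administrative step above of restarting the MindLink Server after the database has been reconfigured.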