Exchange 2013 High Availability demystified
With the emerging global economy, organizations of every kind, anywhere in the world, depend more and more on IT to stay connected, and email is among the most popular and significant ways of doing so. That makes Exchange a business-critical application for the organizations that run it. Microsoft has made substantial changes with every version of Exchange, improving both performance and availability. Exchange 2010 brought us features such as the Database Availability Group (DAG) for high availability within a single datacenter or across several, coupled with dedicated site resilience and disaster recovery capabilities across multiple datacenters.
With Exchange 2013, Microsoft brings exciting enhancements for High Availability to the table, in addition to the architectural changes. This article focuses exclusively on the various aspects of High Availability in Exchange 2013.
High Availability is a convenient, loosely used term. It isn’t a technology or a product feature; rather, it is a state that must be maintained in times of failure or disruption.
Specifically, we try to show how High Availability in Exchange 2013 works behind the scenes to meet your IT and business requirements.
Exchange 2013 uses a combination of workload-sharing solutions spread across the elements listed below:
a. Managed Availability
b. Maintenance mode
c. Best Copy Selection Changes
d. Cmdlet Enhancements
e. DAG Network Auto-Config
f. Auto Reseed
g. Site Resilience
h. Multiple Databases per volume
i. Safety Net
Managed Availability:
The architectural changes in Exchange 2013 call for a stronger HA mechanism. Managed Availability provides a monitoring framework for all Exchange components and takes a broader view of reliability and scalability across servers. It is a proactive mechanism: rather than only raising an alert, it sequences recovery actions and controls when they are performed. In short, Managed Availability is about helping you “recover” from an issue, not about “finding out” its root cause.
Managed Availability runs on every server as two processes:
Exchange Health Manager Service (MSExchangeHMHost.exe), which manages the worker process (starting, stopping, building and executing it as needed). It also keeps the worker process from becoming a single point of failure by assisting in its recovery.
Exchange Health Manager Worker (MSExchangeHMWorker.exe), which performs the runtime tasks.
Let’s now look at how Managed Availability achieves this recovery.
Managed Availability has three components: the probe engine, the monitor and the responder engine. The probe engine collects data from the servers through probes, checks and notification logic. Probes measure the user experience, checks are the infrastructure that analyzes traffic and spikes, and notification logic supplies the rules on which the system decides whether to act.
The probe engine passes this information to the monitor, which looks for patterns in the submitted data. Using predefined business logic, the monitor determines whether a system is healthy, and it also defines the timing and workflow of a recovery. From a system perspective, a monitor has two states, healthy and unhealthy; from an administrator’s perspective, it has four more: degraded, disabled, repairing and unavailable.
The responder engine comes into the picture when a server is labeled “unhealthy”. It is responsible for executing the response action for the detected alert.
Health Set = Probe Engine + Monitor + Responder Engine
NOTE: The probe engine, monitor and responder engine all have threshold values that can be adjusted to suit your environment.
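To make the probe, monitor and responder hand-off more concrete, here is a minimal Python sketch. It is only an illustration of the flow described above, not how Exchange implements it: the class names, the "consecutive failures" threshold rule and the responder action are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Monitor:
    """Turns a stream of probe results into a healthy/unhealthy verdict."""
    failure_threshold: int          # hypothetical: consecutive failures tolerated
    consecutive_failures: int = 0

    def record(self, probe_succeeded: bool) -> str:
        # A success resets the streak; a failure extends it.
        if probe_succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return "unhealthy"
        return "healthy"


@dataclass
class HealthSet:
    """Groups one probe, one monitor and one responder for a component."""
    probe: Callable[[], bool]       # returns True when the check passes
    monitor: Monitor
    responder: Callable[[], None]   # recovery action, e.g. restart a service

    def run_once(self) -> str:
        state = self.monitor.record(self.probe())
        # The responder only acts once the monitor reports "unhealthy".
        if state == "unhealthy":
            self.responder()
        return state
```

Raising or lowering failure_threshold in this sketch plays the role of the adjustable thresholds mentioned in the note above: the higher the value, the longer a failing probe is tolerated before a recovery action runs.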
Maintenance mode:
This functionality lets you designate a server as in-service or out-of-service through the cmdlet Set-ServerComponentState.
The state is tracked separately for each requester:
Health -> MA (Managed Availability) triggered
Sidelined -> operator initiated
Functional -> setup running
Deployment -> machine being configured
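The idea of separate tracking can be sketched in a few lines of Python. This is a toy model, not the Exchange implementation: the requester labels simply mirror the list above, and the rule that a single "Inactive" request outweighs any number of "Active" ones is an assumption made for the example.

```python
from typing import Dict

# Labels mirror the tracking list above; they are illustrative only.
REQUESTERS = ("Health", "Sidelined", "Functional", "Deployment")
STATES = ("Active", "Inactive")


class ComponentState:
    """Remembers what each requester last asked for, independently."""

    def __init__(self) -> None:
        self._by_requester: Dict[str, str] = {}

    def set_state(self, requester: str, state: str) -> None:
        if requester not in REQUESTERS:
            raise ValueError(f"unknown requester: {requester}")
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        self._by_requester[requester] = state

    def effective(self) -> str:
        # Assumption: one "Inactive" request keeps the component out of service.
        if "Inactive" in self._by_requester.values():
            return "Inactive"
        return "Active"
```

Because each requester has its own slot, setup finishing and marking the component "Active" would not silently undo an operator's earlier out-of-service request in this model.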
Best Copy Selection Changes:
In Exchange 2010, during a failover or other database transition, the Primary Active Manager would pick a target server without analyzing that server’s current capacity. If the server was already at capacity, the database would fail to mount there and another server would have to be tried. We could, however, use the cmdlet:
Set-MailboxServer -Identity "ServerName" -MaximumActiveDatabases 20
to ensure that no more than 20 mailbox databases are active on that server at any given point.
In Exchange 2013, the Primary Active Manager keeps track of the number of active databases per server. When the Exchange Replication service is restarted or the Primary Active Manager moves to a different server, this information is rebuilt from the cluster database. This enhancement lets fully loaded servers be ruled out in the first place.
But how does this work behind the scenes? Let’s take a look.
Exchange 2010 used the Best Copy Selection (BCS) algorithm to choose the best copy of a database; Exchange 2013 uses the Best Copy and Server Selection (BCSS) algorithm. The factors considered in BCSS make failover much smarter by putting the health of the hosting server at its focus, so that a failover does not push a server beyond what it can handle. Exchange 2013 evaluates the entire protocol stack on the destination server through Managed Availability health sets. Because it weighs multiple server-level factors in addition to the copy itself, the algorithm is aptly named Best Copy and Server Selection.
The identification checks are as follows:
i. All Healthy: checks for a server hosting a copy of the database that has all monitoring elements in a healthy state.
ii. Up to Normal Healthy: checks for a server hosting a copy of the database that has all monitoring elements with a Normal priority in a healthy state.
iii. All Better than Source: checks for a server hosting a copy of the database that has all monitoring elements in a state better than that of the current server.
iv. Same as Source: checks for a server hosting a copy of the database that has all monitoring elements in the same state as the current server.
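The four checks can be read as a series of progressively weaker filters. The Python sketch below is a simplified model under stated assumptions, not the actual BCSS code: it assumes the checks are tried in the order listed, it measures "better" and "same" by counting unhealthy monitoring elements, and it skips servers that have reached their MaximumActiveDatabases limit. The Server class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Server:
    name: str
    # element name -> (priority label, is_healthy); labels are illustrative
    elements: Dict[str, Tuple[str, bool]]
    active_databases: int = 0
    maximum_active_databases: Optional[int] = None

    def unhealthy_count(self) -> int:
        return sum(1 for _, healthy in self.elements.values() if not healthy)

    def has_capacity(self) -> bool:
        cap = self.maximum_active_databases
        return cap is None or cap > self.active_databases


def all_healthy(target: Server, source: Server) -> bool:
    return target.unhealthy_count() == 0


def up_to_normal_healthy(target: Server, source: Server) -> bool:
    return all(ok for prio, ok in target.elements.values() if prio == "Normal")


def all_better_than_source(target: Server, source: Server) -> bool:
    # Assumption: "better" means fewer unhealthy elements than the source.
    return source.unhealthy_count() > target.unhealthy_count()


def same_as_source(target: Server, source: Server) -> bool:
    # Assumption: "same" means an equal number of unhealthy elements.
    return target.unhealthy_count() == source.unhealthy_count()


CHECKS = [
    ("All Healthy", all_healthy),
    ("Up to Normal Healthy", up_to_normal_healthy),
    ("All Better than Source", all_better_than_source),
    ("Same as Source", same_as_source),
]


def select_target(source: Server, candidates: List[Server]) -> Optional[Tuple[str, str]]:
    """Return (server name, check that matched), or None if nothing qualifies."""
    eligible = [c for c in candidates if c.has_capacity()]
    for check_name, check in CHECKS:
        for candidate in eligible:
            if check(candidate, source):
                return candidate.name, check_name
    return None
```

The strictest check is tried against every eligible server before falling back to the next one, which is why a fully healthy server always wins over one that is merely no worse than the source.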
Cmdlet Enhancements:
Health checks matter not just for human beings but for servers too. This version introduces cmdlets that let you examine the health of your Exchange server in more depth. Several are available, but we will discuss the most significant ones: Get-ServerHealth and Get-HealthReport.
Managed Availability is the new monitoring and recovery framework in this version, and these cmdlets report on it.
Get-ServerHealth displays health status with values such as degraded, repairing, unhealthy, disabled, uninitialized or unavailable.
Get-HealthReport provides a summary health report and takes an Identity parameter instead of InputObject/InputEntries. It returns health values of online, partially online, offline, sidelined, functional or unavailable.
Update-MailboxDatabaseCopy has several parameters that help automate seeding. The parameters that can be used are BeginSeed, MaximumSeedsInParallel, Server and SafeDeleteExistingFiles.
DAG Network Auto-Config
Exchange 2013 provides DAG Network Auto-Config, which means the manual tasks that had to be performed in Exchange 2010 are no longer required. DAG networks are configured automatically based on configuration settings, and the system can distinguish between MAPI and replication networks.
Exchange 2013 automatically collapses DAG networks, provided the configuration settings are correct.
The Exchange Management Shell lets you view DAG network settings in auto-config mode, and the Exchange Administration Center lets you view, create and edit them in manual mode.
Exchange 2013 database AutoReseed:
This feature automatically restores database redundancy after a disk failure by using spare disks available on the system. It supports 8 databases per volume. Its configuration is stored in AD, which means you can enable or disable it there. But how does this feature detect a failure, and how does it perform the restoration? Let’s take a look.
AutoReseed is triggered when it detects a database copy that has been in the “Failed & Suspended” state for 15 minutes. The system first tries to resume the database copy up to 3 times, sleeping 5 minutes between attempts. If the copy still has not resumed by the end of this window, roughly 10 minutes, AutoReseed moves on to the next step of allocating a spare disk. Before doing so, it performs a series of pre-checks:
i. Naming conventions match
ii. Database and log files are on the same volume
iii. Availability of a spare disk
iv. Verify that all copies are in a Failed & Suspended (F&S) state
After these checks, AutoReseed tries to allocate a spare disk up to 5 times, waiting one hour between attempts. It then performs the InPlaceSeed operation.
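The timings above are easier to follow as a timeline. The Python sketch below simulates the workflow with the numbers given in this article (15 minutes of detection, 3 resume attempts 5 minutes apart, 5 spare-disk attempts one hour apart). It is a model for illustration only: the function name and callbacks are hypothetical, and placing the sleeps between attempts rather than before the first one is an assumption.

```python
from typing import Callable, List, Tuple

DETECTION_MINUTES = 15      # copy must stay Failed & Suspended this long
RESUME_ATTEMPTS = 3
RESUME_SLEEP_MINUTES = 5
SPARE_ATTEMPTS = 5
SPARE_SLEEP_MINUTES = 60


def autoreseed_timeline(
    resume_ok: Callable[[int], bool],
    prechecks_ok: Callable[[], bool],
    assign_spare_ok: Callable[[int], bool],
) -> List[Tuple[int, str]]:
    """Return (minute, event) pairs for one simulated AutoReseed run."""
    clock = DETECTION_MINUTES
    events = [(clock, "copy detected as Failed & Suspended")]

    # Step 1: try to resume the copy, sleeping between attempts.
    for attempt in range(1, RESUME_ATTEMPTS + 1):
        if attempt > 1:
            clock += RESUME_SLEEP_MINUTES
        if resume_ok(attempt):
            events.append((clock, f"resume attempt {attempt} succeeded"))
            return events
        events.append((clock, f"resume attempt {attempt} failed"))

    # Step 2: pre-checks (naming, same volume, spare available, F&S state).
    if not prechecks_ok():
        events.append((clock, "pre-checks failed"))
        return events

    # Step 3: try to allocate a spare disk, then reseed in place.
    for attempt in range(1, SPARE_ATTEMPTS + 1):
        if attempt > 1:
            clock += SPARE_SLEEP_MINUTES
        if assign_spare_ok(attempt):
            events.append((clock, f"spare disk allocated on attempt {attempt}"))
            events.append((clock, "InPlaceSeed started"))
            return events
        events.append((clock, f"spare allocation attempt {attempt} failed"))

    events.append((clock, "no spare disk could be allocated"))
    return events
```

Passing callbacks such as lambda attempt: attempt == 2 makes it easy to see how long a given failure pattern would take before reseeding begins in this model.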
This feature also provides a better tracking mechanism for mount paths and Exchange volume paths.
In short, this feature lets you remap multiple databases and reseed them in parallel.
Site Resilience:
Exchange 2010 employed DAC mode for site resilience and disaster recovery. With Exchange 2013, the Exchange product team has taken this to the next level by allowing automatic failover in the event of a failure, so that no manual intervention is needed to perform switchovers during datacenter failures. In Exchange 2010, if the load balancer failed or the VIP of the CAS array wasn’t available, we had to perform a datacenter switchover in most cases, mainly because Exchange 2010 DAGs and CAS arrays were coupled together. With Exchange 2013, if you lose your CAS array and proper planning is in place, your Outlook clients are automatically redirected to a second datacenter that has Client Access servers, and those Client Access servers proxy the requests back to the user’s Mailbox server, which remains unaffected by the outage.
With Exchange 2013, the namespace doesn’t need to move with the DAG. Exchange 2013 uses multiple IP addresses for a namespace, which provides fault tolerance, so the namespace is no longer the single point of failure it was in Exchange 2010. In Exchange 2010, perhaps the biggest single point of failure in the messaging system was the FQDN that you gave to users, because it tells the user where to go. Changing where that FQDN points isn’t easy in the Exchange 2010 paradigm: you have to change DNS and then handle DNS latency, which is challenging in some parts of the world. You also have to deal with name caches in browsers, which typically last about 30 minutes or more.
Multiple Databases per volume:
This feature lets you have both active and passive databases on the same volume.
Exchange 2013 Safety Net
The previous versions, Exchange 2010 and Exchange 2007, provided a feature on the Hub Transport server that facilitated data recovery and supported compliance. All incoming and outgoing messages had to pass through the Hub Transport server before reaching a mailbox.
Transport dumpster is a feature of the Exchange 2010 Hub Transport server that limits data loss when a lossy failover occurs during mail delivery to a DAG. Transport dumpster first appeared with CCR and LCR mailboxes in Exchange 2007.
One limitation of transport dumpster is that it can be used only for replicated mailboxes, not for public folders or mailboxes that aren’t part of a DAG. All Hub Transport servers in the Active Directory sites of the DAG hold the transport dumpster queue for a particular mailbox, and the dumpster is stored inside the mail.que file.
With Exchange 2013, Microsoft replaced transport dumpster with an improved feature: Safety Net.
How Safety Net Works
Safety Net can be thought of as having two parts: Shadow Redundancy and Safety Net Redundancy.
While Safety Net keeps a redundant copy of a message after it has been successfully processed, shadow redundancy keeps a redundant copy of a message while it is in transit. All the concepts of shadow redundancy, such as the transport high availability boundary, primary messages, primary servers, shadow messages and shadow servers, also apply to Safety Net.
The Primary Safety Net exists on the Mailbox server that holds the primary message before the Transport service has completely processed it. Once processing is complete, the primary server moves the message from the active queue to the Primary Safety Net on the same server.
The Shadow Safety Net exists on the Mailbox server that holds the shadow message. Once the shadow server learns that the primary server has successfully processed the primary message, it moves the shadow message from the shadow queue to the Shadow Safety Net on the same server. The Shadow Safety Net requires shadow redundancy to be enabled, which it is by default in Exchange 2013.
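The movement of a message between these queues can be sketched as a small Python model. The class and function names are hypothetical and the model ignores everything except the four queues described above; it only shows which queue holds the message at each stage.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MailboxServer:
    name: str
    active_queue: List[str] = field(default_factory=list)
    shadow_queue: List[str] = field(default_factory=list)
    primary_safety_net: List[str] = field(default_factory=list)
    shadow_safety_net: List[str] = field(default_factory=list)


def accept(primary: MailboxServer, shadow: MailboxServer, message_id: str) -> None:
    """Message in transit: primary copy is queued, shadow copy is kept elsewhere."""
    primary.active_queue.append(message_id)
    shadow.shadow_queue.append(message_id)


def complete_processing(primary: MailboxServer, shadow: MailboxServer, message_id: str) -> None:
    """Message processed: both copies move into their Safety Net queues."""
    # The primary server moves its copy from the active queue to Primary Safety Net.
    primary.active_queue.remove(message_id)
    primary.primary_safety_net.append(message_id)
    # The shadow server, once informed, moves its copy to Shadow Safety Net.
    shadow.shadow_queue.remove(message_id)
    shadow.shadow_safety_net.append(message_id)
```

At every point in this model there are two copies of the message on two different servers, first as primary and shadow messages in transit and then as Primary and Shadow Safety Net entries.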
Similarities between Safety Net and Transport Dumpster
Like transport dumpster, Safety Net is a queue associated with the Transport service on a Mailbox server
It stores copies of messages that have already been successfully processed by the server
As with the dumpster, you can specify how long messages remain in the queue. The default is 2 days
Why Safety Net is better than Transport Dumpster
Unlike transport dumpster, Safety Net applies not only to DAGs but also to public folders and to mailboxes that are not part of a DAG
Because Safety Net is itself redundant, it is never a single point of failure. With both a Primary Safety Net and a Shadow Safety Net available, if the Primary Safety Net is unavailable for more than 12 hours, resubmit requests become shadow resubmit requests and messages are redelivered from the Shadow Safety Net, so delivery is ensured even if one of the Safety Nets fails
Another advantage is that Safety Net does not limit message storage by size, only by duration. For example, if you set 12 days as the duration limit, messages are deleted only after they have been in Safety Net for 12 days
Safety Net does not require manual resubmission of messages. Message resubmission is initiated by the Active Manager component of the Microsoft Exchange Replication service.
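Two of the points above, retention by duration only and the fallback to the Shadow Safety Net, can be illustrated with a short Python sketch. The function names are hypothetical, the 2-day and 12-hour values are the ones quoted in this article, and the intermediate "wait" outcome (the request keeps waiting for the primary until the 12 hours have passed) is an assumption of this model.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

DEFAULT_HOLD_TIME = timedelta(days=2)        # default retention quoted above
PRIMARY_UNAVAILABLE_LIMIT = timedelta(hours=12)


def expire(
    stored: Dict[str, datetime],
    now: datetime,
    hold_time: timedelta = DEFAULT_HOLD_TIME,
) -> Dict[str, datetime]:
    """Keep only messages younger than the hold time; size is never considered."""
    return {
        message_id: stored_at
        for message_id, stored_at in stored.items()
        if hold_time > now - stored_at
    }


def resubmit_source(primary_unavailable_for: Optional[timedelta]) -> str:
    """Decide which Safety Net serves a resubmit request in this model."""
    if primary_unavailable_for is None:
        return "primary"                     # Primary Safety Net is reachable
    if primary_unavailable_for > PRIMARY_UNAVAILABLE_LIMIT:
        return "shadow"                      # becomes a shadow resubmit request
    return "wait"                            # assumption: keep waiting for primary
```

With hold_time set to timedelta(days=12), a message stored three days ago survives expire(), whereas with the default it would already have been removed.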
Email is undoubtedly among the most mission-critical applications in organizations of any size. With a properly architected deployment and the high availability features described above, Microsoft has taken Exchange Server 2013 to new heights in retaining and protecting your corporate messaging data and keeping it secure and available.