HA, SR, RPO, RTO, OLA, SLA and the 9’s of availability…!!!
Businesses of every kind, whether small, privately owned, multinational or service-oriented, demand round-the-clock accessibility between clientele, employees, management and every other component, however minuscule.
Databases and basic communication schemes like e-mail all depend on server-based systems. Productivity therefore depends on the ability of such systems to recover from failures and downtime, whether planned or unplanned: damage from a fire, faulty hardware, or IT-related maintenance activities.
For business goals to be met effectively by the IT infrastructure, everyone in the so-called IT world, not just the Exchange or AD guy, has to look deep into some technicalities that ensure optimal results. Plus, the company management needs to be on the same page as you. For instance, Microsoft Exchange depends on a number of other components like Active Directory, storage, network, DNS and so on. If the network is broken, Exchange is broken, and there isn’t a thing the Exchange team can do about it.
So, here’s something that will complement that effort, and will probably even help you in job interviews and put you on par with the so-called genius of the ITIL folks.
Server systems, including Microsoft Exchange, are designed to be ‘highly available’ so that they continue service despite failure. However, the scope of this availability depends on the requirements specific to the particular enterprise. Any mishap is followed by automatic efforts to rectify it or to switch to an alternative.
Simply put, High Availability (HA) for Exchange means how “highly available” your Exchange infrastructure is and how quickly it comes back up in the event of a failure. In other words: how long will it take for your Exchange 2010 DAG to come back online if one or more members fail?
Site resiliency determines how efficiently your system can function in the event of a catastrophic failure, such as a WAN outage or a natural disaster like an earthquake or flood, that is usually concentrated on the specific geographical location where the primary datacenter resides. It could also be caused by a network failure or power outage. The point here is that you need to decide whether the system you design should fail over to a secondary datacenter and continue servicing automatically or with manual intervention.
In Microsoft Exchange terms, you may have a DAG stretched across multiple datacenters, say one in New York and one in California, each equipped with its own supporting systems and processes, ready to be activated in the event the other goes down.
Difference between High Availability and Site Resilience:
Systems based on high availability are automated or self-driven, as when a passive database copy is activated once the active one fails, whereas in site resilience the corrective effort usually has to be initiated manually.
Demands are often placed to ensure both high availability and site resiliency within the same datacenter. Having read the details on HA and SR above, you know this isn’t logically doable: site resilience protects against the loss of a site, so it needs a second site. High availability basically focuses on how quickly you can get your system back up and running in the event of a failure, and site resiliency is the solution you have in place to deal with possible prolonged outages. You need to sit down with your management and decide whether to activate the ‘site resilient’ solution, or to measure the intensity of the damage done to your HA solution and calculate the time for it to come back up. You also decide how long you can afford to be “down” with no service, and how much data you can afford to lose due to the outage.
Now, to put this in technical terms, you basically develop a document in which you state a Recovery Time Objective (RTO) and a Recovery Point Objective (RPO). We’ll come to that a little further down.
OLA and SLA:
The most important factor in the design of any system is the set of requirements specific to the company or job at hand. It is important to be able to precisely define needs like ‘increased availability’ and the like. It is also necessary to know the functions and responsibilities of the different sections of a company.
It is the Operational Level Agreement (OLA) that specifies the area of control of each department, so that the source or cause of a failure or mistake can be handled by the department responsible. For instance, which section should handle DNS failures, and who should take responsibility when the Exchange server is down? These decisions should be properly documented, agreed upon and signed off by the business unit head or the respective application owner. It is equally important to have this information communicated to the helpdesk or the Incident Management team, which handles all initial communications. Say, have a toll-free number that all IT teams know to call when they need help. The representative who picks up the call should be competent enough to understand technical terminology and to log a support incident that tracks the changes made and the time taken to resolve the issue, and should refer to the OLA document to involve the right people from the right team. It’s when you don’t have this process in place that IT management goes haywire in the event of an outage, not knowing what the issue is, whom to contact or how.
The Service Level Agreement (SLA) is in fact an understanding of the level of service provided to the customer by the IT infrastructure, and hence defines the requirements of the system, like the time constraints on restoring normal functioning after downtime. An SLA may be between the domain users of a company and their internal IT team, or between a client company and a vendor providing a hosted service. For example, the Microsoft Office 365 SLA is between Microsoft and the company whose email they are hosting.
It is often a financial agreement that requires the service provider to pay up if promised service levels are not met or if they aren’t able to meet the promised 9’s of availability.
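To make "not meeting the promised 9’s" concrete, here is a minimal sketch of the arithmetic only. How downtime is actually measured and compensated is defined by each contract; the function names and the sample figures below are hypothetical.

```python
def achieved_availability(downtime_minutes, period_minutes):
    """Return the percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def sla_met(downtime_minutes, period_minutes, promised_percent):
    """True when the achieved availability reaches the promised level."""
    return achieved_availability(downtime_minutes, period_minutes) >= promised_percent


if __name__ == "__main__":
    # Hypothetical example: a 30-day month with one hour of downtime,
    # checked against a promised 99.9% availability.
    month_minutes = 30 * 24 * 60
    print(f"{achieved_availability(60, month_minutes):.3f}%")
    print("SLA met:", sla_met(60, month_minutes, 99.9))
```

The same check works for any measurement period, as long as downtime and period are expressed in the same unit.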
Requirements and the 9’s of Availability:
How do we come to an SLA?
It is impossible to design a system without a definite understanding of the requirements around the availability to be ensured. An SLA may be described in terms of 9’s, but each organisation, or even each technician, interprets them in their own terms. It is easy to calculate what the 9’s amount to:
99.999% = 5.26 minutes / year = 25.9 seconds / month
99.99% = 52.56 minutes / year = 4.32 minutes / month
99.95% = 4.38 hours / year = 21.6 minutes / month
99.9% = 8.76 hours / year = 43.2 minutes / month
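The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year and the 30-day month that the monthly figures above appear to be based on; the function and constant names are illustrative only.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year and a 30-day month.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = 30 * 24 * 60


def downtime_minutes(availability_percent, period_minutes):
    """Return the minutes of downtime the target allows within the period."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return period_minutes * unavailable_fraction


def downtime_table(targets):
    """Return (target, minutes per year, minutes per month) for each target."""
    return [
        (t,
         downtime_minutes(t, MINUTES_PER_YEAR),
         downtime_minutes(t, MINUTES_PER_MONTH))
        for t in targets
    ]


if __name__ == "__main__":
    for target, per_year, per_month in downtime_table([99.999, 99.99, 99.95, 99.9]):
        print(f"{target}% = {per_year:.2f} minutes / year"
              f" = {per_month:.2f} minutes / month")
```

Changing the two constants is enough to recompute the budget for a different calendar convention, such as an average month of 30.44 days.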
However, the number of 9’s alone does not tell you what availability will look like in practice. For instance, how much of that time budget can a database failover or a Patch Tuesday consume, and how much time can be spent on any one breakdown? In fact, it is relating these figures to everyday situations that helps put the requirements in perspective. It might be just one system among numerous others that underperforms. Simply knowing the seconds or minutes lost does not help anticipate the effect on availability in a real-life situation, or on one particular day when something goes wrong.
RTO and RPO:
Since the 9’s of availability are a difficult choice to make, availability is more conveniently described in terms of Recovery Time Objective (RTO) and Recovery Point Objective (RPO). The RTO gives the time limit on recovery from a failure, and the RPO gives the amount of data you can afford to lose due to the failure. This makes it easy to define, for every possible real-life mishap, the time to restore service and the data loss to account for. Take, for instance, the loss of a single disk: the RTO might well be set at 30 seconds and the RPO at less than 30 seconds.
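A per-scenario RTO/RPO document can be sketched as a small data structure plus a check. This is only an illustration: the class and function names are made up, the 30-second figures come from the single-disk example, the RPO is modelled as a simple 30-second ceiling, and the observed values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    scenario: str
    rto_seconds: float  # maximum acceptable time to restore service
    rpo_seconds: float  # maximum acceptable window of lost data


def meets_objective(objective, recovery_seconds, data_loss_seconds):
    """True when an observed recovery stays within both the RTO and the RPO."""
    return (recovery_seconds <= objective.rto_seconds
            and data_loss_seconds <= objective.rpo_seconds)


# Objective taken from the single-disk example in the text.
single_disk = RecoveryObjective("loss of a single disk",
                                rto_seconds=30, rpo_seconds=30)

if __name__ == "__main__":
    # Hypothetical observations from two failover drills.
    print(meets_objective(single_disk, recovery_seconds=20, data_loss_seconds=5))
    print(meets_objective(single_disk, recovery_seconds=45, data_loss_seconds=5))
```

One such entry per failure scenario (single disk, single server, whole datacenter) gives management a concrete list to agree on.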
Activating a site resilient solution is not an easy decision to make. You don’t want to activate your secondary datacenter if it will only take about 2-3 hours for the HA solution to be back up. However, that is completely your decision to make. The availability a system needs often depends directly on the kind of business you do. For example, 100% email availability is vital for users in the majority of industries, such as banking, order processing, and service-based companies that have signed contracts with other companies.
Expectations on application availability:
When placing demands on application availability, or coming to an agreement on the level of service that can be provided, it’s important to understand the inter-connected factors that come into play: people, process and technology. Standard hardware has obvious advantages, and factors like WAN bandwidth can determine the RPO when a primary datacenter is lost. In fact, technology is one of the easier parts, though that is not to undermine the effect the resources at hand have when defining a process tailor-made to meet availability requirements.
Now, the process developed must be inherently simple, meaning that there will definitely be compromises between availability and the number of systems that can be incorporated to handle failovers. Including too many precautionary measures can render a facility far more complex than is actually necessary for normal functioning, which only makes the solution impractical. Having too rigid a process has a similar effect.
An efficient and trained workforce adds to the viability of any high availability or site resilient design. The very choice between the two also depends on the kind of workforce. Factors that influence system design include whether administrators should be available on a 24/7 basis, and the kind and extent of training and experience required, especially in decision-making.
These factors in fact operate in a cycle with the level of service provided. The answer lies in striking the right balance. Any management needs to comprehend the interplay of these factors in achieving the desired output: whether to opt for a datacenter failover in the case of a mere 15-minute WAN outage, who makes such decisions, how much data replication is necessary, and so on.
Hopefully, you have come to an understanding that it is vital to communicate needs and expectations. Before getting into all that, though, you should perhaps take time out for some perspective on the kind of agreements you can settle on, giving proportionate significance to all the factors that contribute to satisfactory performance. Robust hardware, a workable taskforce, and a properly worked out availability requirement, coupled with a suitable or tailor-made system design, should do the trick!
Keywords: RPO, RTO, High availability, site resilience, SLA, OLA, ITIL concepts for IT administrators, ITIL process explained.