Previously I blogged about how Disaster Recovery (DR) had evolved into the modern age (link for part 1 and part 2) and had reached the pinnacle of what can be achieved when using products such as VMware Site Recovery Manager and RecoverPoint. Solutions such as these truly represent the best possible DR solutions and tick every box. So why, I hear you ask, are you reading part 3 of this blog if we have already reached the top?
Well, I would have asked the same question a couple of years ago. Quite simply, "you don't know what you don't know": although I realized back then that DR still had its shortcomings, I didn't know they could ever be solved, because the technology simply did not exist.
This kind of thing happens just about everywhere over time. For example, 20 years ago most people were happy taking pictures with a camera that used film and needed developing. We didn't know there was anything better!
As luck would have it, fast forward a few years, add some very clever Canadian engineering folk from Edmonton, and hey presto, the shortcomings get solved very quickly and efficiently!
So what are these shortcomings?
There are actually quite a few, but the biggest ones are:
- The lack of any return on investment since DR assets are passive
- Non-trivial and lengthy decision process to invoke DR
- Failure to meet the Recovery Time Objective (RTO)
- The ongoing operational complexity of DR testing and invocation.
Let me set out the case for each of the four shortcomings above, and within each topic I shall summarize what I believe the requirement for utopia (aka "Continuous Availability") would look like.
For example, if we look at the camera of 20 years ago, a major shortcoming was that you had to wait until you finished the entire roll of film and then had it developed before you could see the result. Today we have utopia, since we can instantly see the image we just took!
Lack of ROI / Passive Assets
The primary DR failing from a cost and business perspective is the lack of ROI caused by passive assets.
Firstly, think about how much your DR solution has cost you. If you are using some form of replication, it is typically roughly double the cost of the primary solution, since it contains a full complement of assets (storage, network and hosts) at the DR site. These are effectively an "insurance policy" should a significant incident happen, and all of these assets sit idly by at the DR site waiting to be utilized.
Now ask yourself this question. Have you ever invoked Disaster Recovery?
If the answer is yes, then congratulations, you have at some point got a return on the DR investment!
But more than likely the answer is no, which means that up until now, all of the investment that was pumped into the DR solution has not (yet) yielded any return.
So, the bottom line is that even the best DR solution will typically not give you a return on your investment (ROI) until you've actually invoked DR, but for the majority of businesses the likelihood of a catastrophic disaster that would lead them to invoke DR is very low. (Unless they are located in regions like Japan or Taiwan that have fairly regular and highly extreme natural events.)
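To make the economics concrete, here is a rough back-of-the-envelope sketch (in Python, with entirely assumed figures for costs and disaster probability; plug in your own numbers):

```python
# Illustrative only: expected payback of a passive DR investment.
# All figures below are assumptions -- substitute your own costs and risk estimates.

primary_cost = 1_000_000            # cost of the production estate ($)
dr_cost = primary_cost              # a passive DR site roughly doubles the spend
annual_disaster_probability = 0.02  # chance per year of an event severe enough to invoke DR
downtime_cost_avoided = 5_000_000   # business loss avoided if DR is actually invoked

expected_annual_benefit = annual_disaster_probability * downtime_cost_avoided
print(f"DR investment:           ${dr_cost:,.0f}")
print(f"Expected annual benefit: ${expected_annual_benefit:,.0f}")
print(f"Simple payback period:   {dr_cost / expected_annual_benefit:.1f} years")
# With these numbers the passive kit takes ~10 years to "pay back" on paper --
# and until the day DR is actually invoked, the realized return is zero.
```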
Now, before we move on, I know that some of you might be thinking at this point that I am wrong, and that you do in fact use the DR assets actively and therefore get a return on investment. I accept that could be true in some environments, but typically (and without getting technical) this generally means that the DR hosts connect into the production storage over some kind of stretched network. If that is the case then yes, you can use the remote compute assets; however, the storage assets are still not in use. Furthermore, this type of solution leads to a more complex recovery, as there is a whole bunch of network re-configuration to perform should DR ever be invoked. Additionally, it means that the WAN link will use more bandwidth (driving up costs), and the performance in the DR site will be much worse than in the production site, because all of the workload must cross the WAN and incur additional latency, thereby slowing down your application.
REQUIREMENT FOR UTOPIA: True Active/Active assets (compute, network and storage) in both the production and DR sites allowing better load balancing of workload spikes and automatically bursting workloads across both locations. In fact, complete elimination of the term "DR site" since both sites run the same production instance simultaneously.
Non-trivial and Lengthy Decision Process to Invoke DR
As we highlighted earlier, the risk of a situation that will cause you to invoke DR is low, but let's take a minute to establish the circumstances that would cause you to invoke DR in the first place.
By definition, the "DR invocation decision" has only two outcomes (yes or no), and it is only ever made when a given set of circumstances or a specific incident has caused the IT service supporting part of the business to stop in an unplanned fashion, and where it cannot be brought back online immediately and/or without some form of remedial action to the primary system (be it power or physical repair).
Scenarios that could unleash themselves onto your IT infrastructure in an unplanned fashion include power outages, human error, floods, fires and even earthquakes, but interestingly these can be easily categorized into a severity matrix.
For instance, on a scale of 1 to 10 (10 being the worst case), a power outage would be at the lower end of the spectrum (i.e. a 2), whereas something like a fire would be at the upper end of the spectrum (perhaps even a 10).
Clearly, if a severity 10 scenario happened and the production equipment was completely destroyed, then it really is an easy decision to invoke DR. For example, if a fire wipes out your primary datacenter it is a "no brainer" to invoke disaster recovery, as you know the original copy is destroyed. However, when we consider the decision to invoke DR for a low-severity scenario (a power failure perhaps), it is not an easy decision any more, since you would expect the power to come back on within a certain amount of time, and technically there is no long-term damage anywhere.
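To illustrate the point (the scenarios and scores below are my own assumptions, not a formal standard), the severity matrix might look something like this:

```python
# Illustrative severity matrix: the lower the severity, the murkier the DR decision.
severity_matrix = {
    "Power outage":      {"severity": 2,  "decision": "Unclear - power may return before failover completes"},
    "Human error":       {"severity": 3,  "decision": "Unclear - often recoverable in place"},
    "Flood":             {"severity": 7,  "decision": "Likely invoke - equipment may be damaged"},
    "Fire / total loss": {"severity": 10, "decision": "No brainer - the primary copy is destroyed"},
}

for scenario, detail in severity_matrix.items():
    print(f"{scenario:<18} severity {detail['severity']:>2}: {detail['decision']}")
```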
History has shown us that this type of behavior is common; in fact, the majority of customers I have spoken to who have experienced a datacenter-wide power failure actually chose not to invoke disaster recovery and to ride out the failure, which ultimately means the service is offline for the duration. Clearly each scenario is different, but normally there are a whole bunch of reasons companies choose not to invoke disaster recovery. Some of the main ones are outlined below:
Unknown time to repair: For example, if the power is lost at the production site, there is no way to tell how long the power will be down for. Why invoke DR if the power will be back on in an hour or so?
Time to fail over longer than incident duration: This is where it could take a certain amount of time to actually bring all of the systems back online, but in the meantime there is a high probability of the initial incident being rectified (the power comes back on, for example). In this scenario, let's say the failover time is 5 hours plus a decision time of 2 hours. That totals 7 hours, and I suspect in this day and age most power outages do not last for 7 hours, meaning it is not worth failing over, since it will only cause more disruption at a later time when you fail back (which is also an offline, not to mention lengthy, process). A rough sketch after this list puts some numbers on this trade-off. The other challenge here is what happens if you are halfway through a recovery and the initial fault gets rectified. Well, put it this way: it is not a good outcome, as the decision to invoke may have actually caused more delay to recovery!
Failing over means I also need to fail back: Once DR has been invoked, at some time in the future the service will need to return home. In all DR solutions this failback will also incur further business downtime. This dovetails into the first reason too: if the power will come back within a small time window, then sometimes it is not worth the risk. A further problem here is that failback after a real incident is not the same as failing back after a DR test, so generally this type of scenario is untested.
Risk: How certain are you that once DR is invoked the main objectives will be met (i.e. the business restarted without incident) and that the very act of performing network, storage and host failovers will not inject more problems into the IT environment? These are complex questions, and risk becomes a key factor here especially if DR testing has not been completed regularly as there will be many complex unknown items to deal with! When was the last time a DR test was done?
Lack of DR testing: This creates a significant problem, as in some cases there may not be a full DR plan covering all of the services and applications, meaning that a DR recovery process may need to be created on the fly. Again this dovetails into the risk reason, since the smallest mistake here could have significant ramifications for the wider business. Additionally, in the case of such an outage, the technical team will be under pressure to get the system online ASAP, and humans typically do not function well under pressure, thereby further increasing risk!
Lack of resources: DR generally requires specialist skill sets, and depending on the solution and the time of day there simply may not be sufficient staff resources to perform the invocation. The resources required may also be in the wrong location to perform a recovery at all.
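To put some rough numbers on the "is it worth failing over?" question, here is a minimal sketch; every duration below is an assumption and every environment will differ:

```python
# Illustrative comparison of riding out an outage vs invoking DR.
# All durations are assumptions -- substitute your own decision, failover
# and failback estimates.

decision_time_hrs = 2    # time for the business to decide to invoke DR
failover_time_hrs = 5    # time to bring services up at the DR site
failback_time_hrs = 6    # later outage required to return services home
expected_outage_hrs = 4  # how long you expect the original incident (e.g. power loss) to last

ride_it_out = expected_outage_hrs
invoke_dr = decision_time_hrs + failover_time_hrs + failback_time_hrs

print(f"Ride out the incident: ~{ride_it_out} hours of downtime")
print(f"Invoke DR:             ~{invoke_dr} hours of downtime (split across failover and failback)")
# With these numbers, pressing the "invoke DR" button roughly triples the total
# disruption -- which is exactly why so many businesses choose to wait it out.
```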
So, as you can see, when you consider all of these things do you really want to push that "invoke DR button"?? I DARE YOU!!
Furthermore, we must also consider who the decision actually lies with. Typically it is not the storage admin, but rather the wider business that has to make the decision. If you then start to consider the Recovery Time Objective (RTO), it can easily be breached once you take the "decision factor" into account, and all the while, until the decision is made, the business remains offline, potentially losing millions of dollars!
Another interesting side note here is that there is a general misconception in the industry that RTO is measured from the point of the incident, but in actual fact RTO is measured from the time the decision is made, which, as we can see, is an unknown variable.
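A simple way to see the impact of the decision factor (again, with purely illustrative numbers):

```python
# Illustrative: the formally measured RTO vs the outage the business actually feels.
# In this argument the RTO clock starts at the decision point, not at the incident.

incident_to_decision_hrs = 3  # assessment, escalation, sign-off -- the unknown variable
recovery_time_hrs = 4         # the part the RTO actually measures
rto_hrs = 4                   # the documented Recovery Time Objective

total_business_outage = incident_to_decision_hrs + recovery_time_hrs

print(f"Documented RTO:         {rto_hrs} hours (met: recovery took {recovery_time_hrs} hours)")
print(f"Actual business outage: {total_business_outage} hours")
# The RTO can look "met" on paper while the business has been offline for
# almost twice as long as the objective suggests.
```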
REQUIREMENT FOR UTOPIA: The complete removal of a decision in any scenario. In fact the system should always do the correct thing and remain online in an unaffected location whatever the scenario.
Failure to meet the RTO
We touched on this one in the previous section, but let's recap why and how this can happen, and then understand what Utopia looks like here.
Ultimately if the business cannot meet the RTO it generally boils down to two things:
- Complexity of recovery and the number of systems to recover.
- The total amount of human resources required to recover the systems within the allotted timeframe.
The challenge is that any time a human is required to perform certain recovery steps, any number of external factors come into play. Ultimately this means recovery times are unpredictable. Furthermore, if there is no common, coherent replication strategy across the business, it typically means that product sprawl will occur. This is where different applications use several different replication mechanisms, with a further negative effect on recovery times. If you stop and think about this for a minute, suppose we have 10 applications, each with its own replication mechanism. Recovery is not something that people do on a day-to-day basis, so what chance does the average system admin have of performing a complex recovery at the drop of a hat, at any given moment of the day or night?
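As a rough illustration of the sprawl effect (the application names, tools and timings below are all assumed for the sake of the example):

```python
# Illustrative: how heterogeneous replication mechanisms stretch a manual recovery.
apps = [
    # (application, replication mechanism, hands-on recovery time in hours)
    ("ERP",         "array-based replication", 2.0),
    ("Email",       "host-based replication",  1.5),
    ("CRM",         "database log shipping",   2.5),
    ("File shares", "backup/restore",          4.0),
]

context_switch_hrs = 0.5  # extra time each time the admin changes tool and runbook

total = sum(hours for _, _, hours in apps) + context_switch_hrs * len(apps)
mechanisms = len({tool for _, tool, _ in apps})

print(f"{len(apps)} applications across {mechanisms} different replication mechanisms")
print(f"Estimated hands-on recovery time: {total} hours")
# Every additional mechanism means another runbook, another skill set and
# another opportunity for error -- at 2am, under pressure.
```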
REQUIREMENT FOR UTOPIA: A standardized, repeatable, and predictable recovery of any service after a site failure that is always well within the RTO and without the need for human interaction. In fact, running applications actively in more than one location means that if a site fails the application simply continues with no recovery at all thus delivering a true RTO of zero, nil, nothing.
Operational Complexity and DR testing
Q: Why do businesses have DR tests?
I bet that if you asked that question of five different businesses you would probably get five different answers, but the correct answer is a simple one:
A: To prove that the services can run and be recovered in the remote location and meet the Recovery Point Objective (RPO) and RTO requirements.
There could also be a regulatory/compliance reason, but if you think about it (and as I have written about in my previous blog), the fundamental premise of any DR solution is based on a design that has both a production site and a DR site. Since the production environment is only actively running at the production site, it goes without saying that proof is required to ensure recovery is possible at the DR site.
Clearly the other factor here is that up until now (and as I blogged previously) recovery has always been VERY complex. Given the sheer amount of unsuccessful DR testing (especially from tape) giving different outcomes (many of which may have resulted in failure to bring the business back online or, even worse, injected additional failures into the production system), we can clearly understand why DR testing is MANDATORY.
OK, so in my last blog entry (History of DR part 2) I discussed VMware Site Recovery Manager, which, along with products like SRDF and RecoverPoint, takes this problem off the table. However, for everything else this remains a significant problem, including all non-VMware parts of the infrastructure that use legacy solutions.
The other way to look at this is from a cost perspective. For example, how many IT staff need to be taken out of the BAU function to complete the DR testing? How long will it take them to complete and document the DR process? Furthermore, how many times must this be repeated annually to comfortably ensure the business can be recovered within the recovery time? Does your business even have enough staff to conduct all of this testing?
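Putting some rough (entirely assumed) numbers against those questions:

```python
# Illustrative annual cost of DR testing in staff time.
# Headcount, durations and rates are assumptions -- adjust to your environment.

staff_involved = 8   # people pulled out of BAU for each test
days_per_test = 5    # planning, execution and documentation
tests_per_year = 2
day_rate = 600       # loaded daily cost per person ($)

staff_days = staff_involved * days_per_test * tests_per_year
annual_cost = staff_days * day_rate

print(f"{staff_days} staff-days per year spent on DR testing")
print(f"Approximate annual cost: ${annual_cost:,.0f}")
# Roughly 80 staff-days a year just to prove the insurance policy still works --
# and that is before a single real incident has occurred.
```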
The testing methodology also requires careful consideration. For instance, if you have a zero RPO requirement and are using a traditional active/passive product, then you will need to use some form of clone or snap at the DR site to ensure that replication is not interrupted during the testing. Unless you are using RecoverPoint, this adds further complexity into the mix, since the DR hosts are not attached to the disks that would be used in a true DR invocation, which means there are additional steps to consider.
REQUIREMENT FOR UTOPIA: The complete removal of the need to perform DR testing. In fact, if an application is already running actively in two locations simultaneously, then this automatically proves DR will simply work, without any complexity. Additionally, setting up this type of solution cannot be any more complex than setting up an off-the-shelf host cluster in a single data center.
Now that we know the shortcomings, let's consider how we can make this better, to the point where it cannot (knowingly) be improved.
In summary, we need a product that can give us "Continuous Availability" i.e. is capable of the following:
- Allows us to seamlessly use assets at both locations actively (active/active), in a risk-free manner, over hundreds of kilometers
- Automatically makes the correct decision for us, keeping the system online regardless of the scenario
- Ensures recovery happens as quickly as possible, or in some cases with no interruption whatsoever
- Eliminates DR testing
- Is no more complicated than configuring a simple host cluster
Ladies and Gentlemen, please meet VPLEX Metro!
VPLEX Metro has been designed and built from the ground up to deliver "Continuous Availability". Quite simply it means that VPLEX meets and exceeds all of the five points above, with the added benefit of being heterogeneous thereby making this a relevant strategy for virtually every business out there as it can simply be added to most existing storage arrays.
In summary, VPLEX is the reason I decided to write this 3-part blog. As I see it, DR has been on quite a transformative journey over the years, and now it has reached its true pinnacle. Just like you, I have been a passenger on this journey, but I have been fortunate enough to witness first hand a transformation that has truly changed the face of DR forever! Another factor in writing this blog was that I meet with many businesses every day and discuss the supporting technologies behind their wider business continuity plans. All too often I find that many organizations do not believe that their traditional infrastructure is adequate from a recovery perspective (this is further evidenced by the recent global trust survey here http://bit.ly/1gtV2Fo ). VPLEX alleviates this lack of trust and is a simple way to accelerate the move up the BC maturity flagpole, which many businesses are now doing with great success.
Now that just leaves me with one final question, will I be writing a part 4 in a few years' time? Like I said, you don't know what you don't know, but perhaps there are other shortfalls I have not considered. Although I doubt that is the case, I would welcome your thoughts and comments.