Virtualizing your servers can help you to improve your readiness to respond to disasters, such as fires, floods, virus attacks, power outages, and the like. Popular solutions, such as VMWare’s ESX virtualization products, in combination with data replication to a remote facility, or backups using a third party application like vRanger can help speed up your ability to respond to emergencies, or even have fewer emergencies that require IT staff to intervene. This article will discuss a few solutions to help you improve your disaster recovery readiness.
Being able to respond to an emergency or a disaster requires planning before the emergency arises. Planning involves the following: (1) having an up-to-date system design map that explains the major systems in use, their criticality to the organization, and their system requirements; (2) having a policy that identifies what the organization’s expectations are with system uptime, the technical solutions in place to help mitigate risks, and the roles that staff within the organization will play during an emergency; and (3) conducting a risk assessment that reviews the risks, mitigations in place, and unmitigated risks that could cause an outage or disaster.
Once you have a system inventory, policy and risk assessment, you will need to identify user expectations for recovering from a system failure, which will provide a starting point for analyzing how far your systems are from user expectations for recovery. For example, if you use digital tape to perform system backups once weekly, but interviews with users indicate an expectation that data from a particular system can’t be recovered manually if a loss of more than a few hours is experienced, your gap analysis would indicate that your current mitigation is not sufficient.
Now, gentle reader, not all user expectations are reasonable. If you operate a database with many thousands of transactions worth substantial amounts in revenue every minute, but your DR budget is relatively small (or non-existent), users get what they pay for. Systems, like all things, will fail from time to time, no matter the quality of the IT staff or the computer systems themselves. There is truthfully no excuse for not planning for system failures to be able to respond appropriately – but then, I continue to meet people who are not prepared, so…
However, user expectations are helpful to know, because you can use them to gauge how much focus should be placed on recovering from a system failure, and where there are gaps in readiness, seeking to expand your budget or resources to help improve readiness as much as feasible. Virtualization can help.
First, virtualization generally can help to reduce your server hardware budget, as you can run more virtual servers on less physical hardware – especially those Windows servers that don’t really do that much (CPU and memory) most of the time. This, in turn can free up more resources to put towards a DR budget.
Second, virtualization (in combination with a replication technology, either on a storage area network, such as Lefthand, or through another software solution, for example, Doubletake) can help you to make efficient copies of your data to a remote system, which can be used to bring a DR virtual server up to operate as your production system until the emergency is resolved.
Third, virtual servers can be more easily backed up to disk using software solutions like vRanger Pro, which can in turn be backed up to tape or somewhere else entirely.
Virtualization does make recovery easier, but not pain-free. There is still some work required to make this kind of solution work properly, including training, practice, and testing. And you will likely need some expertise to help implement a solution (whether you work with a VMWare, Microsoft, or other vendor for virtualization). On the other hand, not doing this means that you are left to “hope” you can recover when a system failure occurs. Not much of a plan.
Testing and Practice
Once the technology is in place to help recover from a system failure, the most important thing you can do is to practice with this technology and the policy/procedure you have developed to make sure that (a) multiple IT staff can successfully perform a recovery, (b) that you have worked out the bugs in the plan and identified specific technical issues that can be worked on to improve the plan, and (c) that those who will participate in the recovery effort can work effectively under the added stress of performing a recovery with every user hollering “are you done yet?!?”.
Some of the testing should be purely technical: backing a system up and being able to bring it up on a different piece of equipment, and then verifying that the backup copy works like the production system. And some of the testing is discussion-driven: table-top exercises (as discussed on my law web site in more detail here) help staff to discuss scenarios and possible issues.
All of the testing results help to refine your policy, and also give you a realistic view of how effectively you can recover a system from a major failure or disaster. Some systems (like NT 4.0 based systems) will not be recoverable, no matter what you may do. Upgrading to a recent version of Windows, or to some other platform all together, is the best mitigation. In other cases, virtualization won’t be feasible because of current budget constraints, technical expertise, or incompatibility (not all current Windows systems can be virtualized because the system has unique hardware requirements, or otherwise won’t covert to a virtual system). But, there are a fair number of cases where virtualizing will help improve recoverability.
Virtualization can help your organization recover from disasters when the technology is implemented within a plan that is well-designed and well-tested. Feedback? Post a comment.