The following is an article authored by Ivo Mokros, Senior Technology Architect and Director, Cloud Systems Integration for IDS Systems. Ivo has spent many years running mission- and life-critical IT operations. He has been exposed to numerous operational incidents that have put disaster recovery and business continuity plans to the test. In his experience, there is one thing that matters more than anything else in a crisis – being prepared. In this article, Ivo discusses the importance of planning for unexpected outages and setting recovery expectations. Regardless of the size of the organization, information integrity is crucial to the survivability of the organization and, in many cases, its clients. In both the private and public sectors, it is essential to know the impact of an outage and how much of it can be tolerated. For more insight and an opportunity to discuss your organization’s preparedness, contact Ivo at firstname.lastname@example.org.
Understanding Disaster Recovery Objectives – A commentary by Ivo Mokros.
If you have been involved in developing a Disaster Recovery Plan (DRP) or are involved in any aspect of Business Continuity (BC), you have no doubt encountered the terms RTO and RPO. RTO is the Recovery Time Objective, or the maximum amount of time to recover your failed systems, and RPO is the Recovery Point Objective, or the maximum allowable data loss. There are many implications to daily IT operations based on these two numbers; however, they do not tell the entire story. When calculating your disaster recovery objectives, many factors need to be taken into account to produce realistic recovery numbers.
Figure 1: Disaster Recovery Timeline
Let’s start with the RPO. This objective concerns activities leading up to a disaster, and has very little to do with activities after a disaster has occurred.
Your RPO is only as good as your last backup, but it doesn’t stop there. Your backup is at risk if the data is still in harm’s way – if the backup data is stored in your data centre or other facility that could be affected by the same disaster that destroyed your original data.
If your business defines the RPO at 24 hours, it may seem reasonable to back up once a day. However, your backup interval needs to include the time it takes to complete the backup and the amount of time it takes to move the backup offsite. If your backup starts at midnight, takes 5 hours to complete, and an additional 5 hours elapse before your data is taken offsite, your RPO is actually 34 hours: the 24-hour interval between backups plus the 10 hours that pass before each backup is safely offsite.
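The arithmetic in this example can be written out as a short Python sketch. The function name and parameters are illustrative only, and the sketch assumes the backup captures data as it stood when the backup started and that a backup offers no protection until it has reached the offsite location.

```python
from datetime import timedelta


def effective_rpo(backup_interval, backup_duration, offsite_delay):
    """Worst-case data loss for a periodic backup that must travel offsite.

    Assumption: a disaster strikes just before the newest backup arrives
    offsite, so the only usable copy is the one started a full interval
    earlier.
    """
    return backup_interval + backup_duration + offsite_delay


# The scenario from the text: daily backup, 5 hours to run, 5 hours to ship.
rpo = effective_rpo(
    backup_interval=timedelta(hours=24),
    backup_duration=timedelta(hours=5),
    offsite_delay=timedelta(hours=5),
)
print(rpo.total_seconds() / 3600)  # 34.0
```

Shortening any of the three inputs shortens the effective RPO, which is why the time to get data offsite deserves as much attention as the backup schedule itself.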
It is crucial to communicate a realistic data loss scenario to your business – taking into account the time required to get the data offsite, for example – so that proper business continuity plans can be established. Consider doing a structured walkthrough of a disaster scenario with the business so that everyone understands how the recovery process will take place.
Most businesses are becoming less tolerant of data loss, and pressure to reduce RPO intervals is increasing. Although tape backup systems are increasing in speed and capacity, tape is not keeping pace with data growth. Even in scenarios where disk backup is performed to a virtual tape library (VTL) and then copied to tape, short RPOs are difficult to achieve. The logistics involved in moving tapes offsite multiple times per day can make the expense and complexity prohibitive.
One solution to this problem is using a network-based backup, which uses modern compression and deduplication to move data to a remote location over a wide area network. This type of solution may be difficult to architect and set up. IDS DataGuard, a solution provided by IDS Systems, takes the guesswork out of implementing this scenario.
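As a rough feasibility check for any network-based approach, you can estimate how long one backup's worth of changed data would take to cross the WAN. The sketch below is a back-of-the-envelope calculation, not a description of how any particular product works. The function name, the reduction ratio, the link speed, and the efficiency factor are all hypothetical values you would replace with your own measurements.

```python
def wan_transfer_hours(changed_gb, reduction_ratio, link_mbps, efficiency=0.7):
    """Estimate hours needed to send one backup's changed data offsite.

    changed_gb       -- data changed since the last backup, in gigabytes
    reduction_ratio  -- combined compression/deduplication ratio (10 means 10:1)
    link_mbps        -- nominal WAN bandwidth in megabits per second
    efficiency       -- assumed usable fraction of the link (0 to 1)
    """
    if reduction_ratio <= 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("ratio and bandwidth must be positive; efficiency in (0, 1]")
    sent_gb = changed_gb / reduction_ratio
    megabits = sent_gb * 8 * 1000  # decimal gigabytes to megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: 200 GB of daily change, 10:1 reduction, 100 Mbps link.
hours = wan_transfer_hours(changed_gb=200, reduction_ratio=10, link_mbps=100)
print(f"{hours:.2f} hours per transfer")
```

If the estimated transfer time is a large fraction of the interval between backups, the offsite delay will dominate your effective RPO and the link or the backup frequency needs to be revisited.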
Now let’s move on to understanding what happens after a disaster. The RTO is a good starting point, but Figure 1 shows it doesn’t tell the whole story. Focusing on RTO alone does not achieve business continuity.
Your systems are up and running, and data from the last backup has been restored. The big question now is: can your business return to normal operations using data from the last backup? When systems are down, your business may continue to operate using a manual process or other contingency. What happens to this data? Does it have to be entered in sequence before your systems can be returned to normal operation? If so, who will enter the data and how? Many applications may not be able to tolerate data gaps, or there may be other impacts to the business caused by the missing data. The bottom line is, returning your technology infrastructure to a running state is not the end of the story.
Take some time to think about the capacity of your recovery environment. In a disaster situation, it is reasonable to assume that focus will be on keeping the lights on, and any development and other non-production activities will be curtailed. Typically, this translates to a recovery environment that is smaller than your normal operating environment. However, during recovery, your systems are heavily loaded while lost data is re-entered, backlogs are cleared, and so on. This will most likely have an impact on recovery times. Testing and validating the success of the restoration are also factors that can increase recovery times. This is especially true in environments with regulatory compliance concerns, as system validation may be mandated. There are many other factors, but I will stop here and leave the rest as an exercise for you.
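One way to make these factors concrete is to add them up. The following Python sketch is a simplified model under stated assumptions: the phases run one after another, the class and field names are invented for illustration, and the numbers are hypothetical. A reduced recovery environment is represented by a re-entry rate below 1, meaning each elapsed hour clears less than one hour of backlog.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    restore_hours: float     # time to bring systems back and restore the backup
    validation_hours: float  # testing and any mandated system validation
    backlog_hours: float     # hours of manual work waiting to be re-entered
    reentry_rate: float      # backlog hours cleared per elapsed hour

    def hours_to_normal(self) -> float:
        """Elapsed hours until normal operations, assuming sequential phases."""
        if self.reentry_rate <= 0:
            raise ValueError("reentry_rate must be positive")
        reentry_hours = self.backlog_hours / self.reentry_rate
        return self.restore_hours + self.validation_hours + reentry_hours


# Hypothetical numbers: systems restored in 8 hours, 4 hours of validation,
# and 12 hours of backlog cleared at half speed on a smaller environment.
estimate = RecoveryEstimate(
    restore_hours=8, validation_hours=4, backlog_hours=12, reentry_rate=0.5
)
print(estimate.hours_to_normal())  # 36.0
```

In this example the systems are technically back after 8 hours, yet the business does not return to normal for 36. That gap is the part of the timeline that the RTO alone does not capture.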
When planning for business continuity, the maximum tolerable downtime is the critical parameter you need to identify. Your RTO and RPO are only part of the story. Business continuity is a shared responsibility between IT and the business, and both have their roles to play. IDS Systems can help you identify the constraints that are particular to your environment and understand their implications, and can provide services and products to help your business continue to operate under any scenario.