Continuity of Operations (COOP)

A disaster recovery plan (sometimes referred to as a business continuity plan or business process contingency plan) explains how a business will handle potential disasters. Just as a disaster is an event that makes the continuation of normal operations impossible, a disaster recovery plan lays out the steps put in place to minimize the damage caused by a disaster and enable the business to either continue or quickly resume mission-critical activities. Usually, disaster recovery planning involves a review of the continuity of business processes in the face of a disaster caused by degradation, corruption, or destruction of data, applications, and infrastructure. Disaster recovery planning also places significant focus on disaster prevention.

Once you have completed a risk assessment and identified potential threats to your IT infrastructure, the next step is to determine which infrastructure elements are most important to the performance of your business. When all IT systems and networks are performing normally, your business should be fully viable, competitive, and financially solid. When an incident, internal or external, negatively affects the IT infrastructure, your business could be compromised.

WAN Backups and Archives

The goal of backup, archiving, and disaster recovery is to guard and preserve data against loss and corruption. Local data protection has limited effectiveness, which can expose your business to great risk: the same flood, fire, or theft that damages your production data can just as easily destroy the local copies of that data. The emergence of the Internet, along with storage-saving technologies, has made remote data protection more practical and cost-effective, with a much faster resumption of normal operations.

Today, a backup or disaster recovery set (the group of files or data that constitutes a disaster recovery package) can be stored on the other side of the world as easily as next door. With backups and disaster recovery sets, the primary consideration for the wide area network (WAN) is efficiency. The key is making the most of available bandwidth to move the maximum amount of data in the shortest amount of time. This keeps the recovery point objective (RPO), the maximum age of data the business can afford to lose, as short as possible.

Because high-bandwidth WAN connections are too costly for many organizations, several techniques have emerged to reduce the amount of data that must be moved to perform a remote backup or build a disaster recovery set.

A full backup is a critical starting point. But a full backup can take a long time, which can greatly extend the remote RPO. If it takes 36 hours to perform a complete backup across the WAN, the smallest possible RPO would be 36 hours, which is longer than most companies can tolerate.

When backing up to a remote location, often to a remote virtual tape library (VTL), most companies will start with a full backup, then switch to differential backups (which save files changed since the last full backup) or incremental backups (which save files changed since the previous backup of any kind). The technique of “delta differencing” saves just the changed data. Therefore, the initial backup or disaster recovery set of 20 TB may take many hours, but an average delta difference of 15 GB per day can be transferred in just a few hours, well within an acceptable daily backup window.
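The arithmetic behind these transfer windows can be sketched as follows. The 20 TB and 15 GB figures come from the paragraph above; the 10 Mbit/s link speed and the function name transfer_hours are assumptions chosen purely for illustration, and real links rarely sustain their full rated speed.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to move size_gb gigabytes over a link_mbps WAN link.

    efficiency is the fraction of the rated link speed actually achieved
    (1.0 means the full rated speed, which is an optimistic assumption).
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


FULL_SET_GB = 20_000     # the 20 TB initial set from the text
DAILY_DELTA_GB = 15      # the average daily delta from the text
ASSUMED_LINK_MBPS = 10   # hypothetical link speed, not from the text

full_hours = transfer_hours(FULL_SET_GB, ASSUMED_LINK_MBPS)
delta_hours = transfer_hours(DAILY_DELTA_GB, ASSUMED_LINK_MBPS)

print(f"Full set:    {full_hours:,.1f} hours")
print(f"Daily delta: {delta_hours:,.1f} hours")
```

Under these assumptions the daily delta fits inside a few hours, while the full set takes far longer than any daily window. Because a backup that is still in flight protects nothing, the transfer time is also a floor on the achievable remote RPO.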

Another data reduction method is data compression. This technique involves searching for repetitive data segments that can be removed from a file. The mathematical algorithm used to compress the file rebuilds it when the file is read later. Compression usually cuts data volumes in half. But since not all files compress well, the actual compression amount varies with file type.
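How much compression helps depends heavily on the data, which a small experiment can show. The sketch below uses Python's standard zlib module; the sample data and the helper name compression_ratio are made up for illustration, with random bytes standing in for files that are already compressed (such as many image and audio formats).

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower means better compression)."""
    if not data:
        raise ValueError("data must not be empty")
    return len(zlib.compress(data, 6)) / len(data)


# Highly repetitive sample data (hypothetical record text repeated many times).
repetitive = b"patient-record;status=ok;" * 4000

# Random bytes stand in for data that is already compressed and has no repetition left.
random_like = os.urandom(len(repetitive))

# Compression is lossless: decompressing returns the original bytes exactly.
assert zlib.decompress(zlib.compress(repetitive)) == repetitive

print(f"Repetitive data ratio:  {compression_ratio(repetitive):.3f}")
print(f"Random-like data ratio: {compression_ratio(random_like):.3f}")
```

The repetitive sample shrinks to a small fraction of its size, while the random sample barely changes, which is why a single "typical" compression figure should be treated as a rough planning number.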

Use of the data reduction technique known as data de-duplication continues to grow. Data de-duplication saves only one unique copy of a file, block or byte to remote storage.
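A minimal sketch of block-level de-duplication is shown below, assuming fixed-size blocks identified by their SHA-256 hash. The function names, the 4 KB block size, and the in-memory dictionary are illustrative assumptions; commercial products often use variable-size blocks and persistent indexes.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps a block hash to the block's bytes,
    and recipe lists the hashes in the order needed to rebuild the data.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy of a block is kept
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the unique blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Six blocks in total, but only two distinct ones.
block_a = b"A" * 4096
block_b = b"B" * 4096
payload = block_a * 3 + block_b * 2 + block_a

store, recipe = deduplicate(payload)
print(f"Blocks referenced: {len(recipe)}, unique blocks stored: {len(store)}")
```

Only the unique blocks need to cross the WAN; the recipe of hashes is small by comparison and is enough to rebuild the original data at restore time.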

The traditional ideas of full backups or disaster recovery sets are changing. For example, storage administrators realize they don’t need to back up every end user’s MP3 files in a disaster recovery set. More businesses are concentrating on protecting mission-essential applications while ignoring secondary and nonessential file types.

Restoration over the WAN

Because backup data and disaster recovery sets are useless unless they can be restored from a remote location, storage administrators must also be concerned with recovery time objectives (RTO). RTOs can differ from RPOs. A business might need an extremely short RPO to minimize potential data loss, yet be able to tolerate up to 24 hours for recovery. What’s critical is that remote data can be restored within the allotted RTO. In some cases, an organization may temporarily draw additional bandwidth from a service provider in order to meet tight RTOs. Recovery tests can be used to train staff and streamline the recovery process.
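Whether a restore fits inside the RTO is again a bandwidth calculation. The sketch below estimates the sustained throughput a restore needs and how much extra bandwidth would have to be drawn temporarily. The 24-hour RTO comes from the example above; the 500 GB restore size, the 20 Mbit/s existing link, and the function names are hypothetical.

```python
def required_mbps(restore_gb: float, rto_hours: float) -> float:
    """Sustained megabits per second needed to restore restore_gb within rto_hours."""
    if restore_gb < 0:
        raise ValueError("restore_gb must not be negative")
    if rto_hours <= 0:
        raise ValueError("rto_hours must be positive")
    megabits = restore_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / (rto_hours * 3600)


def extra_mbps_needed(restore_gb: float, rto_hours: float, available_mbps: float) -> float:
    """Additional bandwidth to request so the restore meets the RTO (0.0 if none)."""
    return max(0.0, required_mbps(restore_gb, rto_hours) - available_mbps)


RTO_HOURS = 24          # from the example in the text
RESTORE_GB = 500        # hypothetical size of the data to restore
AVAILABLE_MBPS = 20     # hypothetical existing WAN link

needed = required_mbps(RESTORE_GB, RTO_HOURS)
shortfall = extra_mbps_needed(RESTORE_GB, RTO_HOURS, AVAILABLE_MBPS)
print(f"Needed: {needed:.1f} Mbit/s, shortfall: {shortfall:.1f} Mbit/s")
```

This counts transfer time only. Time to detect the failure, provision replacement systems, and verify the restored data also counts against the RTO, which is one reason recovery tests are worth running.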

Implications of Archiving Data

Unlike backups or disaster recovery sets, which are typically accessed only after a problem occurs, archival data can be accessed at any time (albeit infrequently). An example of archival data is patient records, where a physician may look up the patient’s history and medical images only during an annual physical. Remote archives add data protection by placing the data in another location.

With remote archives, WAN bandwidth is not a large concern because the individual files being saved or accessed are tiny relative to the total archive size. For example, a patient’s x-ray image may only be a few megabytes that can be pulled across a low-bandwidth WAN link. However, if the WAN goes down, the archive becomes inaccessible. One way to mitigate the impact of WAN disruptions is to use a local archive platform and mirror to a remote archive for data protection.
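The local-archive-plus-remote-mirror arrangement can be sketched as a toy model. Everything here is hypothetical: the class name MirroredArchive, the dictionaries standing in for the two archive platforms, and the wan_up flag standing in for real link monitoring.

```python
from collections import deque


class MirroredArchive:
    """Toy model of a local archive that mirrors to a remote archive over a WAN."""

    def __init__(self):
        self.local = {}         # stands in for the local archive platform
        self.remote = {}        # stands in for the remote archive
        self.pending = deque()  # keys waiting to be mirrored
        self.wan_up = True      # hypothetical link-status flag

    def put(self, key: str, blob: bytes) -> None:
        """Store locally first, then try to mirror to the remote archive."""
        self.local[key] = blob
        self.pending.append(key)
        self.sync()

    def sync(self) -> int:
        """Mirror queued items if the WAN is up; return how many were sent."""
        if not self.wan_up:
            return 0
        sent = 0
        while self.pending:
            key = self.pending.popleft()
            self.remote[key] = self.local[key]
            sent += 1
        return sent

    def get(self, key: str) -> bytes:
        """Read from the local archive, falling back to the remote copy if needed."""
        if key in self.local:
            return self.local[key]
        if self.wan_up and key in self.remote:
            return self.remote[key]
        raise KeyError(key)


archive = MirroredArchive()
archive.put("xray-001", b"image bytes")
archive.wan_up = False                   # simulate a WAN outage
archive.put("xray-002", b"more bytes")   # still accepted locally
print(f"Readable during outage: {archive.get('xray-002') == b'more bytes'}")
archive.wan_up = True
print(f"Items mirrored after recovery: {archive.sync()}")
```

Reads and writes keep working during a WAN outage because they are served locally, and the mirror catches up once the link returns. The trade-off is that items queued during the outage are protected only by the local copy until they are mirrored.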

Strong Link Data used the Business Continuity Planning Lifecycle approach and ITIL-based processes to create COOP and DR plans, processes, and procedures for the Washington Headquarters Services (WHS). As members of the DCIN Failover Team, we also created and conducted scheduled and random COOP tests for sub-directorates within WHS, covering approximately 40 mission-essential systems and 100 non-mission-essential systems hosted at secure local and geographically dispersed sites. A minimum of 10% of all WHS eBusiness systems were tested monthly to validate their designated COOP capabilities via SOPs and IT Contingency Plans. Mission-essential systems were left running at alternate site locations for 30 days or more to confirm sustainable COOP capability. During COOP exercises and scenarios, we were responsible for updating Standard Operating Procedures and Information Technology Contingency Plans.