Disaster Recovery in Cloud Computing

Site: Saylor Academy
Course: BUS611: Data Management
Book: Disaster Recovery in Cloud Computing

Description

Read these sections to familiarize yourself with disaster recovery. Pay attention to the review of cloud computing and disaster recovery plans, and list the challenges associated with disaster recovery. Finally, compile a list of the different types of disaster recovery platforms.

Organizations that use cloud-based services can back up and store data in a virtual location. You learned that storing data in the cloud creates a faster and more agile organization. Think back to the list you made of the challenges associated with disaster recovery. Use what you learned about those challenges as you read this article and begin this next section.

Abstract

Disaster recovery is a persistent problem in IT platforms. This problem is even more crucial in cloud computing, because Cloud Service Providers (CSPs) have to provide services to their customers even if the data center is down due to a disaster. In the past few years, researchers have shown interest in disaster recovery using cloud computing, and a considerable amount of literature has been published in this area. However, to the best of our knowledge, there is no precise survey providing a detailed analysis of cloud-based disaster recovery. To fill this gap, this paper provides an extensive survey of disaster recovery concepts and research in cloud environments. We present a taxonomy of disaster recovery mechanisms, the main challenges and proposed solutions. We also describe cloud-based disaster recovery platforms and identify open issues related to disaster recovery.

Keywords: cloud computing, disaster recovery, replication, backup, survey


Source: Mohammad Khoshkholghi et al., https://www.ccsenet.org/journal/index.php/cis/article/view/37067
This work is licensed under a Creative Commons Attribution 4.0 License.

Introduction

Cloud computing is becoming more popular in large-scale computing day by day due to its ability to share globally distributed resources. Users around the world can access cloud-based services through the Internet. The biggest IT companies are building data centers on five continents to support different cloud services. The total value of the global cloud computing services market is expected to reach about $241 billion by the end of 2020. Rapid development in cloud computing is motivating more industries to use a variety of cloud services; for instance, nearly 61% of UK businesses rely on some kind of cloud service. However, many security challenges have been raised, such as risk management, trust and recovery mechanisms, which should be taken into account to provide business continuity and better user satisfaction.

Disasters, either man-made or natural, can lead to expensive service disruption. Two different disaster recovery (DR) models can be used to prevent failure in a network or CSP: the traditional and the cloud-based service model. The traditional model can be deployed as either a dedicated infrastructure or a shared approach. Based on speed and cost, customers can choose the appropriate model. In the dedicated approach, an infrastructure is assigned to one customer, so both cost and speed are high. On the other hand, in the shared model (also called the distributed approach) an infrastructure is assigned to multiple users. This approach decreases both the cost and the speed of recovery. As shown in Figure 1, cloud computing is a way to gain the benefits of both the dedicated and the shared model: it can serve DR with low cost and high speed.

Figure 1. Comparison between traditional and cloud DR models

Table 1 shows a comparison between these three DR categories in terms of different features. Cloud computing decreases data synchronization between the primary and backup site and minimizes different kinds of cost, while increasing the independence between users' infrastructure and their DR systems.

DR model | Data synchronization | Independency | Initial cost | Ongoing cost | Cost of potential disasters
Dedicated | High | Low | High | Depends | High
Distributed | Medium | High | Medium | Depends | High
Cloud | Low | High | Low | Depends | Low


Table 1. Disaster recovery models

According to IBM research, only 50% of disasters at IBM are caused by weather; the rest are caused by other events, such as cut power lines, server hardware failures and security breaches. Hence, DR is not only a mechanism for natural events, but for all severe disruptions in cloud systems.

Organizations and businesses can use DR services offered by cloud service providers. Using these services, data protection and service continuity are guaranteed for customers at different levels. Table 2 shows different DR services offered by IBM. In addition, one critical issue in DR mechanisms is how cloud providers themselves can tolerate a disaster to prevent data loss and service disruption of their own data, infrastructure and services. In this paper we investigate both challenges and solutions for DR mechanisms from the cloud provider's point of view. For enterprises, the main goal of DR is business continuity, which means bringing services back online after a disruption. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two important parameters which all recovery mechanisms try to improve. By minimizing RTO and RPO, business continuity can be achieved. RTO is the time duration from disruption until restoration of service, and RPO denotes the amount of data lost after a disaster. Failover delay consists of five steps, depending on the level of backup:

S1: Hardware setup
S2: OS initiation time
S3: Application initiation time
S4: Data/process state restoration time
S5: IP switching time

Therefore, RPO and RTO can be defined as:

\(RPO \propto \frac{1}{F_{b}}\)        (1)

where \(F_{b}\) is the frequency of backup.

\(RTO \propto \text{fraction of } RPO + \sum_{j=S1}^{S5} T_{j}\)       (2)

where \(T_{j}\) is the time spent in failover step \(j\).
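To make these relations concrete, the following snippet is a minimal sketch (not from the paper) that estimates RPO from the backup frequency and RTO as a fraction of RPO plus the five failover step times S1-S5; the function names, step durations and the replay fraction are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): estimating RPO and RTO
# from backup frequency and the five failover steps S1..S5.

def estimate_rpo(backup_frequency_per_hour: float) -> float:
    """RPO is proportional to 1 / backup frequency (Eq. 1).
    Here we assume the worst case: all data since the last backup is lost."""
    return 1.0 / backup_frequency_per_hour  # hours of data at risk

def estimate_rto(rpo_hours: float,
                 step_times_hours: dict,
                 rpo_replay_fraction: float = 0.1) -> float:
    """RTO ~ (fraction of RPO that must be replayed/restored)
             + sum of the failover step times S1..S5 (Eq. 2)."""
    return rpo_replay_fraction * rpo_hours + sum(step_times_hours.values())

if __name__ == "__main__":
    steps = {                      # assumed example durations, in hours
        "S1_hardware_setup": 0.50,
        "S2_os_boot": 0.10,
        "S3_app_start": 0.05,
        "S4_state_restore": 0.25,
        "S5_ip_switch": 0.02,
    }
    rpo = estimate_rpo(backup_frequency_per_hour=4)   # backup every 15 minutes
    rto = estimate_rto(rpo, steps)
    print(f"RPO ~ {rpo:.2f} h, RTO ~ {rto:.2f} h")
```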

IBM SmartCloud recovery service level | Recovery time | Description
Gold | 1 minute | For mission-critical applications
Silver | 30 minutes | For rapid recovery
Bronze | 6 to 24 hours | Assisted failover and failback


Table 2. IBM different DR service level

The rest of this paper is organized as follows: Section 2 briefly introduces cloud computing. Section 3 discusses cloud-based DR in detail. Sections 4 and 5 investigate the main challenges in DR mechanisms and some proposed solutions, respectively. Section 6 introduces some cloud-based DR systems. Section 7 investigates open issues. Finally, the paper ends with the proposed overall DR procedure and a conclusion.

Cloud Computing: A Brief Review

Cloud computing – a long-held dream of computing as a utility – is a promising technique which shifts data and computational services from individual devices to distributed architectures. The concept of the cloud was initially created to describe sets of complex on-demand services offered by commercial providers. With advances in high-bandwidth networks and smartphones, people can upload their information over the Internet at any time. Cloud computing denotes Internet-based distributed computing platforms which are highly scalable and flexible. Their features can change the fashion of conventional information processing. Cloud computing allocates IT resources, such as computational power, storage, software, hardware platforms and applications, to a wide range of consumers possessing a wide range of devices.

Cloud providers (public, private or hybrid clouds) are able to offer seamless on-demand services on a pay-as-you-go model. Therefore, consumers can easily use the services without needing to install anything or worry about the underlying infrastructure. They can focus on their applications and can scale and retrieve the allocated resources directly by interacting with Cloud Service Providers. Virtualization is the key enabling technology through which cloud computing changes the system's view from a piece of hardware to a dynamic and flexible entity. Cloud-based services can be divided into three levels: Infrastructure as a Service (IaaS), Software as a Service (SaaS) and Platform as a Service (PaaS).

According to NIST, the essential features of the cloud can be defined as: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. The taxonomy, advantages and challenges of cloud computing are shown in Table 3 (Cloud taxonomy, online).

Taxonomy | Infrastructure services: storage, compute, services management. Cloud software: data, compute, appliances, file storage, cloud management. Platform services: database, business intelligence, integration, development & testing. Software services: billing, financial, legal, sales, desktop productivity, human resources, content management, backup & recovery, social networks, collaboration.
Advantages | Improved business continuity, on-demand storage and compute power, lower cost of ownership, agility, pay-as-you-go model, increased availability, mobility and collaboration.
Challenges | Security issues, disaster recovery, dependency, latency, transparency, performance concerns, SLA violations.

Table 3. Taxonomy, advantages and challenges of cloud computing

Disaster Recovery

A disaster is an unexpected event in a system's lifetime. It can be caused by nature (like a tsunami or earthquake), by hardware/software failures (e.g., the failure of Heroku VMs hosted on Amazon EC2 in 2011) or even by humans (human error or sabotage). It can lead to serious financial loss or can even put human lives at risk. Hence, between 2% and 4% of the IT budget in large companies is spent on DR every year. Cloud-based DR solutions are an increasing trend because of their ability to tolerate disasters and to achieve reliability and availability. They can be even more useful for small and medium enterprises (SMEs), because they do not have as many resources as big companies do. As shown in Table 4, data level, system level and application level are the three DR levels, defined in terms of system requirements.

DR level | Description
Data level | Security of application data
System level | Reducing recovery time as much as possible
Application level | Application continuity


Table 4. DR levels

DR mechanisms must satisfy five requirements for efficient performance:

  • Minimize RPO and RTO
  • Have a minimal effect on normal system operation
  • Be geographically separated from the primary site
  • Restore the application to a consistent state
  • Guarantee privacy and confidentiality


Disaster Recovery Plan

There are different DR approaches for developing a recovery plan in a cloud system, depending on the nature of the system. In the literature, however, all these approaches are based on redundancy and backup strategies. The redundancy strategy uses separate parallel sites which have the ability to start up the applications after a disaster, whereas the backup strategy uses replication technology. The speed and degree of protection of these approaches depend on the level of DR service, as shown in Table 5. In addition, three different types of replication technology are available: 1. host and VM replication, 2. database replication, 3. storage replication.

Model | Synchronize time | Recovery time | Backup characteristics | Tolerance support
Hot | Seconds | Minutes | Physical mirroring | Very high
Modified Hot | Minutes | 1 hour | Virtual mirroring | High
Warm | Hours | 1-24 hours | Limited physical mirroring | Moderate
Cold | Days | More than 24 hours | Off-site backup | Limited


Table 5. Cloud-based DR models

The objective of disaster recovery planning is to minimize RTO, RPO, cost and latency while considering system constraints such as CPU, network and storage requirements, so DR planning can be considered an optimization problem (a small illustrative sketch follows the two phases below). According to the literature, DR plans include two necessary phases:

  • Matching phase: in this phase, all DR solutions have to be matched to the requirements of each data container (a data container is a data set with identical DR requirements).
  • Plan composition phase: selecting an optimal DR solution which minimizes cost with respect to the required QoS for each data container.
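Viewed as an optimization problem, plan composition can be approximated with a simple exhaustive search: for each data container, keep only the DR solutions whose RTO/RPO meet the container's requirements (matching phase) and then pick the cheapest of those (composition phase). The sketch below is illustrative only; the solution attributes and data structures are assumptions, not part of ENDEAVOUR or the paper.

```python
# Illustrative sketch: match DR solutions to data containers, then pick the
# cheapest solution that satisfies each container's RTO/RPO requirements.

from dataclasses import dataclass

@dataclass
class DRSolution:
    name: str
    rto_hours: float
    rpo_hours: float
    annual_cost: float          # assumed cost model

@dataclass
class DataContainer:
    name: str
    max_rto_hours: float        # required QoS
    max_rpo_hours: float

def compose_plan(containers, solutions):
    plan = {}
    for c in containers:
        # Matching phase: keep solutions that meet the container's requirements.
        feasible = [s for s in solutions
                    if s.rto_hours <= c.max_rto_hours
                    and s.rpo_hours <= c.max_rpo_hours]
        if not feasible:
            raise ValueError(f"No DR solution satisfies {c.name}")
        # Composition phase: choose the minimum-cost feasible solution.
        plan[c.name] = min(feasible, key=lambda s: s.annual_cost).name
    return plan

solutions = [
    DRSolution("hot_site", rto_hours=0.1, rpo_hours=0.01, annual_cost=100_000),
    DRSolution("warm_site", rto_hours=12, rpo_hours=1.0, annual_cost=30_000),
    DRSolution("cold_backup", rto_hours=48, rpo_hours=24, annual_cost=5_000),
]
containers = [
    DataContainer("orders_db", max_rto_hours=1, max_rpo_hours=0.1),
    DataContainer("archive", max_rto_hours=72, max_rpo_hours=24),
]
print(compose_plan(containers, solutions))
```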

ENDEAVOUR is a framework for the DR planning process. As shown in Figure 2, it consists of three modules:

  • Input modules: including DR requirements (such as protection type, RTO, RPO and application latency), a discovery engine (to find configuration information of the primary and secondary sites) and a knowledge repository (replication technologies, instructions and composition formulas).
  • Planning modules: including solution generation (analyzing DR requirements and matching them to replication techniques), ranking (sorting DR plans by attributes such as cost, risk and latency) and global optimization (selecting an optimal DR plan).
  • Output: the output of ENDEAVOUR is an optimal DR plan for each application with details such as target resources and devices and the replication protocol configuration.

Disaster Recovery Challenges

In this section we investigate some common challenges of DR in cloud environments.


Dependency

One of the disadvantages of cloud services is that customers do not have control over the system and their data. Data backups are on the service provider's premises as well. This creates a dependency on CSPs for customers (such as organizations), and loss of data due to a disaster becomes a concern for customers. Dependency also creates another challenge, which is the selection of a trusted service provider.

Figure 2. ENDEAVOUR flowchart


Cost

It is obvious that one of the main factors in choosing the cloud as a DR service is its lower price. So, cloud service providers always seek cheaper ways to provide recovery mechanisms by minimizing different types of cost. The yearly cost of DR systems can be divided into three categories (a simple expected-cost sketch follows this list):

  • Initializing cost: amortized annual cost
  • Ongoing cost: storage cost, data transfer cost and processing cost
  • Cost of potential disasters: the cost of recovered disasters as well as the cost of unrecoverable disasters
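As a rough illustration of how these components combine, the snippet below computes an expected annual DR cost; the formula, probabilities and figures are assumptions for demonstration only and are not taken from the paper.

```python
# Illustrative sketch: expected yearly DR cost as the sum of the categories above.

def annual_dr_cost(amortized_initial, storage, transfer, processing,
                   p_disaster, cost_recovered, p_unrecoverable, cost_unrecoverable):
    ongoing = storage + transfer + processing
    # Expected disaster cost: weighted by the probability of a disaster occurring
    # and, given one, the chance that it cannot be recovered.
    expected_disaster = p_disaster * (
        (1 - p_unrecoverable) * cost_recovered
        + p_unrecoverable * cost_unrecoverable
    )
    return amortized_initial + ongoing + expected_disaster

print(annual_dr_cost(amortized_initial=20_000, storage=6_000, transfer=3_000,
                     processing=4_000, p_disaster=0.05, cost_recovered=50_000,
                     p_unrecoverable=0.1, cost_unrecoverable=500_000))
```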


Failure Detection

Failure detection time strongly affects system downtime, so it is critical to detect and report a failure as soon as possible for fast and correct DR. On the other hand, with multiple backup sites there is a major question: how to distinguish between a network failure and a service disruption.


Security

As mentioned before, a disaster can be natural or human-made. A cyber-terrorism attack is one human-made disaster, and it can be carried out for many reasons. In this case, protection and recovery of important data will be a main goal of DR plans, besides system restoration.


Replication Latency

DR mechanisms rely on replication techniques to make backups. Current replication techniques are classified into two categories: synchronous and asynchronous. Both of them have benefits and flaws. Synchronous replication guarantees very good RPO and RTO, but it is expensive and can also affect system performance because of its large overhead. This issue is more serious in multi-tier web applications, because it can significantly increase the Round Trip Time (RTT) between the primary and backup site. On the other hand, a backup model adopting asynchronous replication is cheaper and imposes low overhead on the system, but the quality of the DR service is decreased. Therefore, trading off cost, system performance and replication latency is an undeniable challenge in cloud disaster solutions.
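To illustrate the trade-off, the following sketch contrasts the two modes: a synchronous write blocks until the backup acknowledges (better RPO, higher latency), while an asynchronous write is queued locally and shipped later (lower latency, larger potential data loss). The classes and timings are illustrative assumptions, not an implementation from the paper.

```python
# Illustrative sketch of the synchronous vs. asynchronous replication trade-off.
from collections import deque

class BackupSite:
    def __init__(self):
        self.blocks = []
    def store(self, block):
        self.blocks.append(block)

class SyncReplicator:
    """Write returns only after the backup site confirms the write (RPO ~ 0,
    but every write pays the primary-to-backup round trip)."""
    def __init__(self, backup):
        self.backup = backup
    def write(self, block):
        self.backup.store(block)      # blocks for the WAN round trip
        return "acknowledged"

class AsyncReplicator:
    """Write returns immediately; blocks queued here are lost if the primary
    fails before flush() runs (RPO ~ flush interval)."""
    def __init__(self, backup):
        self.backup = backup
        self.pending = deque()
    def write(self, block):
        self.pending.append(block)    # no WAN wait on the write path
        return "buffered"
    def flush(self):
        while self.pending:
            self.backup.store(self.pending.popleft())

backup = BackupSite()
SyncReplicator(backup).write(b"order-41")   # slow but immediately durable remotely
async_rep = AsyncReplicator(backup)
async_rep.write(b"order-42")                # fast, but at risk until the next flush
async_rep.flush()
```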


Data Storage

Business data storage is one of the problems of enterprises which can be solved by cloud services. As cloud usage in business and the market grows, enterprises need to store huge amounts of data on cloud-based storage. Compared with conventional data storage devices, cloud storage services save money and are also more flexible. The architecture of a cloud storage system includes four layers: physical storage, infrastructure management, application interface and access layer. In order to satisfy applications and also to guarantee the security of data, computing has to be distributed but storage has to be centralized. Therefore, a storage single point of failure and data loss are critical challenges when storing data with cloud service providers.


Lack of Redundancy

When a disaster happens, the primary site becomes unavailable and the secondary site has to be activated. In this case, there is no longer a backup site for synchronous or asynchronous replication; data and system states can only be stored locally. This is a serious threat to the system. The issue is temporary and disappears after recovery of the primary site. However, to achieve the best DR solutions, especially for high-availability services (such as business data storage), it is better to consider all risky situations.

Disaster Recovery Platforms

In this section, different cloud-based DR systems will be introduced briefly. Also benefits and weaknesses of each system will be discussed.


SecondSite

SecondSite is a disaster-tolerance-as-a-service cloud system. This platform is intended to cope with three challenges: 1. reducing RPO, 2. failure detection, 3. service restoration. For this reason, it uses three techniques:

  • Using storage to keep writes between two checkpoints: checkpoints move between sites periodically; however, if a failure happens within this period, some data will be lost. For this reason, a Distributed Replicated Block Device (DRBD) is used to store replications in both synchronous and asynchronous modes.
  • Using a quorum node to detect and distinguish a real failure: a quorum node has been designed to monitor the primary and backup servers. If replications have not been received by the backup site within the waiting time, the backup site sends a message to the quorum node. In this case, if the quorum node receives a heartbeat from the primary node, the primary server is active and the replication link has a problem; otherwise the backup site will be activated (a small sketch of this arbitration follows the list).
  • Using a backup site: there is a geographically separated backup site which allows groups of virtual machines to be replicated over wide-area Internet links. SecondSite increases the ability to detect failures quickly and to differentiate between network failures and host failures. Using DRBD, storage can be resynchronized to recover the primary site without interrupting VMs at the backup site.
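The quorum logic described above can be summarized in a few lines: when the backup stops receiving replication traffic, it asks the quorum node whether the primary's heartbeat is still arriving; only if both channels are silent does the backup take over. This is a hedged sketch of the idea, not SecondSite's actual code.

```python
# Illustrative sketch of SecondSite-style quorum arbitration: distinguish a
# broken replication link from a real primary failure before failing over.

def decide_failover(replication_alive: bool, quorum_sees_primary_heartbeat: bool) -> str:
    if replication_alive:
        return "no action: replication stream is healthy"
    if quorum_sees_primary_heartbeat:
        # Primary is still up; only the replication link is broken.
        return "do not fail over: repair replication link, resync with DRBD"
    # Neither replication nor heartbeat: treat as a primary-site disaster.
    return "activate backup site (failover)"

print(decide_failover(replication_alive=False, quorum_sees_primary_heartbeat=True))
print(decide_failover(replication_alive=False, quorum_sees_primary_heartbeat=False))
```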

Although SecondSite is not suitable for stateless services, it increases availability for small and medium businesses.


Remus

Remus, based on the Xen hypervisor, is a high-availability cloud service that tolerates disasters using storage replication combined with live VM migration. In this system, the protected software is encapsulated in virtual machines whose whole-system checkpoints are asynchronously replicated to a backup site at high frequency. It is assumed that both replicas are in the same local area network (LAN). Remus pursues three main goals: 1. providing a low-level service to gain generality, 2. transparency, 3. seamless failure recovery.

Remus uses an active primary host and a passive backup host to replicate checkpoints. All writes have to be stored in the backup's RAM until a checkpoint completes. Migrated virtual machines execute on the backup only if a failure is detected. Remus consists of four stages, sketched in the code after the list:

  • Stop running VMs and propagate only changed states into a buffer
  • Transmission of buffered states into backup RAM
  • Send an ACK message to primary host after checkpoint completion
  • Release the network buffer to external clients.
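The four stages map naturally onto a loop; the sketch below is an illustrative outline of that checkpoint cycle (pause, copy dirty state, ship it, wait for the ACK, then release buffered network output), not Remus source code. The vm, backup and network_buffer objects are assumed interfaces introduced only for illustration.

```python
# Illustrative outline of a Remus-style checkpoint epoch (not actual Remus code).
import time

def checkpoint_epoch(vm, backup, network_buffer, epoch_seconds=0.025):
    # vm, backup and network_buffer are hypothetical objects exposing the
    # operations named below; they stand in for hypervisor and network plumbing.
    time.sleep(epoch_seconds)            # let the VM run for one epoch
    vm.pause()                           # Stage 1: briefly stop the VM and
    dirty_state = vm.copy_dirty_pages()  #          capture only the changed state
    vm.resume()                          # VM keeps running speculatively
    backup.receive(dirty_state)          # Stage 2: transmit buffered state to backup RAM
    backup.wait_for_ack()                # Stage 3: backup ACKs checkpoint completion
    network_buffer.release()             # Stage 4: outputs become visible to clients
```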

This system integrates a simple failure detector into the checkpoint process. If checkpoints are not received by the backup site within an epoch, the backup site becomes active; conversely, if the backup's response is not received within a specific period, the primary site assumes a failure at the backup host. However, Remus adds performance overhead, which leads to some latency, because it must ensure consistent replication. In addition, this system needs significant bandwidth.


Romulus

Romulus has been designed as a disaster-tolerant system based on the KVM hypervisor. This platform is an extension of the Remus system. Romulus provides a detailed disaster tolerance algorithm in seven stages, which are:

  • Disk replication and network protection
  • VM checkpoint
  • Checkpoint synchronization
  • Additional disk replication and network protection
  • VM replication
  • Replication synchronization
  • Failure detection and failover.

The first flaw of Remus is that it uses one buffer to replicate writes between the primary host and the backup. A failure in this buffer before the checkpoint is transferred causes an inconsistency between the disk and the VM state, and it can break the fault tolerance of Remus. For this reason, Romulus uses a new buffer to replicate disk writes after each checkpoint. The second flaw is that network egress traffic cannot be released until the checkpoint has been completely transferred to the backup storage host, which can decrease system performance. Romulus uses a new egress traffic buffer to solve this problem. Romulus can tolerate failure in two situations:

On the fly: disk and VM state are replicated into a new writes buffer while the VM is running.

Failover: the ability to recover the service after a disaster.


DT Enabled Cloud Architecture

This is an extended architecture based on the Romulus seven-stage algorithm. It uses a hierarchical tree architecture based on the Eucalyptus IaaS architecture. It provides a disaster-tolerant service that addresses the resource allocation issue, which is a challenge in DT services. The host and backup clusters are monitored by high-availability controllers. Each cluster has three different controllers:

  • Storage controller: To control and manage the cluster storage.
  • Cluster controller: To manage IPs, centralized memory and CPU availability.
  • Node controller: To load, start and stop the VMs.

Different nodes and different clusters can communicate with each other for better resource allocation. For this purpose, the backup cluster controller allocates a VM to a node; the node controller then loads and starts the VM and allocates it to the primary host. Finally, the primary node controller loads and starts the VM.

In this system, VM failover consists of two scenarios. The first scenario is cluster failure; in this situation, the backup cluster is activated. Node failure is the other scenario, in which the cluster controller releases the VMs' IPs and allocates a backup node to host the required VMs. This system is most useful for extended-distance and metropolitan clusters because of its low latency requirements.
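The two failover scenarios can be expressed as a small decision routine; the controller interfaces below are assumptions introduced purely for illustration and do not correspond to actual Eucalyptus APIs.

```python
# Illustrative sketch of the two failover scenarios described above.
# cluster_controller, backup_cluster and backup_node are hypothetical objects.

def handle_failure(event: str, cluster_controller, backup_cluster, backup_node, vms):
    if event == "cluster_failure":
        # Scenario 1: the whole primary cluster is down; switch to the backup cluster.
        backup_cluster.activate(vms)
    elif event == "node_failure":
        # Scenario 2: a single node failed; free its VMs' IPs and rebuild the VMs
        # on a backup node inside the same cluster.
        cluster_controller.release_ips(vms)
        backup_node.compose(vms)
    else:
        raise ValueError(f"unknown event: {event}")
```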


Kemari

Kemari is a cluster system which tries to keep VMs transparently running in the event of hardware failures. Kemari uses a primary-backup approach, so any storage or network event that changes the state of the primary VM must be synchronized to the backup VM. This system gains the benefits of lock-stepping and checkpointing, the two main approaches for synchronizing VM state, which are:

  • Less complexity compared to the lock-stepping approach.
  • No need for any external buffering mechanisms, which can affect output latency.


RUBiS

RUBiS is a cloud architecture that aims at both DR and minimizing costs with respect to the Service Level Agreement. As shown in Figure 5, during ordinary operation a primary data center, including some servers and a database, handles normal traffic. A cloud is in charge of disaster recovery with two types of resources: replication-mode resources, which are active and take backups before a disaster, and failover-mode resources, which are activated only after a disaster. Notably, service providers can rent the inactive resources to other customers to maximize revenue. In the case of a disaster, the leased resources must be released and allocated to the failover procedure.

Figure 5. Overviews of RUBiS system architecture


Taiji

Taiji is a Hypervisor-Based Fault Tolerance (HBFT) prototype which uses a mechanism similar to Remus. However, unlike Remus, which uses separate local disks for replication, Taiji uses Network Attached Storage (NAS). The shared storage may become a single point of failure and is a weakness of this method, so a RAID (Patterson et al., 1988) or commercial NAS (Synology, online) solution should be deployed. On the other hand, because of the shared storage, the need for synchronization is decreased and the file system state is maintained in the event of a disaster.


HS-DRT System

The goal of the HS-DRT system is to protect important data from natural or subversive disasters. This system uses an HS-DRT processor (described as SDDB in section 6, part 5) together with a cloud computing system. Clients are terminals which request some web applications. The HS-DRT processor functions as a web application and also performs encryption, spatial scrambling and fragmentation of data. In the end, data is sent to and stored in a private or public cloud. The system architecture is shown in Figure 6. This system greatly increases the security of data before and after a disaster in cloud environments. However, it has two weaknesses:

  • The performance of the web application will be decreased if the number of duplicated copies increases.
  • This system cannot guarantee consistency between different copies of file data.

Figure 6. The architecture of the HS-DRT system


PipeCloud

This cloud-based multi-tier application system uses the pipelined replication technique (mentioned in the last section) as a DR solution. The PipeCloud architecture is composed of a cloud backup site and a primary data center. The goal of this system is to mirror storage to the backup site and to minimize RPO. The main tasks of PipeCloud are:

  • Replicating all disk writes to a backup site by the replication technique
  • Tracking the order and dependencies of the disk writes
  • Releasing network packets only after storing the disk writes on the backup site.

This system achieves higher throughput and lower response time by decreasing the impact of WAN latency on performance. For this purpose, the system overlaps replication with application processing. It also guarantees zero-data-loss consistency. However, unlike Remus, PipeCloud cannot protect memory state because doing so would lead to a large overhead on the WAN.
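The key rule, releasing a network packet only once every disk write it depends on is safely stored at the backup, can be sketched as follows; the data structures and method names are illustrative assumptions, not PipeCloud's implementation.

```python
# Illustrative sketch of pipelined replication: network replies are held back
# until every disk write they depend on has been acknowledged by the backup site.
from collections import deque

class PipelinedReplicator:
    def __init__(self):
        self.next_write_id = 0
        self.backup_acked_up_to = -1       # highest write id stored at the backup
        self.held_packets = deque()        # (required_write_id, packet)

    def disk_write(self, data) -> int:
        """Ship the write to the backup asynchronously; processing continues."""
        write_id = self.next_write_id
        self.next_write_id += 1
        # ... in a real system, send (write_id, data) to the backup over the WAN ...
        return write_id

    def send_packet(self, packet, depends_on_write_id: int):
        """Queue a client-visible reply until its dependencies are durable remotely."""
        self.held_packets.append((depends_on_write_id, packet))
        self._flush()

    def on_backup_ack(self, write_id: int):
        self.backup_acked_up_to = max(self.backup_acked_up_to, write_id)
        self._flush()

    def _flush(self):
        while self.held_packets and self.held_packets[0][0] <= self.backup_acked_up_to:
            _, packet = self.held_packets.popleft()
            print("releasing packet:", packet)

rep = PipelinedReplicator()
wid = rep.disk_write(b"update row 7")
rep.send_packet("HTTP 200 OK", depends_on_write_id=wid)  # held back
rep.on_backup_ack(wid)                                    # now released
```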


Disaster-CDM

Huge amounts of disaster-related data are generated by governments, organizations, automation systems and even social media. Disaster-CDM aims to provide a Knowledge as a Service (KaaS) framework for disaster cloud data management, which can lead to better disaster preparation, response and recovery.

As shown in Figure 7, this system uses both cloud storage and NoSQL to store data. Disaster-CDM consists of two parts:

  • Knowledge acquisition: obtaining knowledge from a variety of sources, processing it and storing it in data centers.
  • Knowledge delivery service: merging information from diverse databases and delivering knowledge to users.

Figure 7. Disaster-CDM Framework


Distributed Cloud System Architecture

In Silva et al., the authors introduce a cloud system that provides high system dependability based on extensive redundancy. The system has multiple data centers which are geographically separated from each other. Each data center includes both hot and warm physical nodes. VMs are present on both warm and hot physical nodes but run only on the hot nodes. For DR, there is a backup server which stores a copy of each VM.

When a physical node failure occurs, the VMs migrate to a warm physical node. In the case of a disaster which makes a data center unavailable, the backup site transmits the VM copies to another data center. Although this system architecture is expensive, it greatly increases dependability, which can be adequate for Infrastructure as a Service (IaaS) clouds. In addition, the paper introduces a hierarchical approach to model cloud systems based on dependability metrics, as well as disaster occurrence, using the Stochastic Petri Net approach. Figure 8 shows the architecture of this DR system.

Figure 8. Distributed cloud system architecture


Table 7 shows an overall comparison of different cloud-based DR platforms in terms of 10 key properties.

Table 7. Comparing cloud-based DR platforms in terms of different properties