
Microsoft Azure can be used to protect your on-premises assets such as virtual machines, applications, and data. In this article by Marcel van den Berg, the author of Managing Microsoft Hybrid Clouds, you will learn how to use Microsoft Azure to store backup data, replicate data, and even orchestrate the failover and failback of a complete data center.

We will focus on the following topics:

  • High Availability in Microsoft Azure
  • Introduction to geo-replication
  • Disaster recovery using Azure Site Recovery


High availability in Microsoft Azure

One of the most important limitations of Microsoft Azure is the lack of an SLA for single-instance virtual machines. If a virtual machine is not part of an availability set, that instance is not covered by any kind of SLA. The reason for this is that when Microsoft needs to perform maintenance on Azure hosts, in many cases, a reboot is required. A reboot means the virtual machines on that host will be unavailable for a while. So, in order to achieve High Availability for your application, you should have at least two instances of the application running at any point in time. Microsoft is working on some sort of hot patching that enables virtual machines to remain active on hosts being patched. Details were not available at the time of writing.
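If you use the classic Azure PowerShell module, an instance can be placed in an availability set when the virtual machine is created. The following is a minimal sketch; the image, service, set, and VM names are illustrative, and a second instance (for example, web02) would be created the same way with the same -AvailabilitySetName value:

    # Pick an existing Windows image from the subscription's image list.
    $imageName = (Get-AzureVMImage | Where-Object { $_.OS -eq "Windows" })[0].ImageName

    # Create a VM and place it in the "web-avset" availability set.
    New-AzureVMConfig -Name "web01" -InstanceSize Small -ImageName $imageName `
        -AvailabilitySetName "web-avset" |
        Add-AzureProvisioningConfig -Windows -AdminUsername "azureadmin" `
            -Password "P@ssw0rd123!" |
        New-AzureVM -ServiceName "myhybridapp" -Location "West Europe"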

High Availability is a crucial feature that must be an integral part of an architectural design, rather than something that can be “bolted on” to an application afterwards. Designing for High Availability involves leveraging both the development platform and the available infrastructure in order to ensure an application’s responsiveness and overall reliability. The Microsoft Azure Cloud platform offers software developers PaaS extensibility features and network administrators IaaS computing resources that enable availability to be built into an application’s design from the beginning. The good news is that organizations with mission-critical applications can now use core features within the Microsoft Azure platform to deploy highly available, scalable, and fault-tolerant cloud services that are often more cost-effective than traditional on-premises approaches.

Microsoft Failover Clustering support

Windows Server Failover Clustering (WSFC) is not supported on Azure. However, Microsoft does support SQL Server AlwaysOn Availability Groups, although there is currently no support for availability group listeners in Azure. Also, you must work around a DHCP limitation when creating WSFC clusters in Azure. After you create a WSFC cluster using two Azure virtual machines, the cluster name cannot start because it cannot acquire a unique virtual IP address from the DHCP service. Instead, the IP address assigned to the cluster name duplicates the address of one of the nodes. This has a cascading effect that ultimately causes the cluster quorum to fail, because the nodes cannot properly connect to one another.

So, if your application uses Failover Clustering, you will probably not be able to move it to Azure as-is. It might run, but Microsoft will not assist you when you encounter issues.

Load balancing

Besides clustering, we can also create highly available nodes using load balancing. Load balancing is useful for stateless servers: servers that are identical to each other and do not hold unique configuration or data.

When two or more virtual machines deliver the same application logic, you need a mechanism that redirects network traffic to those virtual machines. The Windows Network Load Balancing (NLB) feature in Windows Server is not supported on Microsoft Azure; the Azure load balancer takes over this role. It analyzes incoming network traffic, determines the type of traffic, and routes it to a service.


The Azure load balancer is provided as a cloud service. In fact, this cloud service runs on virtual appliances managed by Microsoft and is completely software-defined. The moment an administrator adds an endpoint, a set of load balancers is instructed to pass incoming network traffic on a certain port to a port on a virtual machine. If a load balancer fails, another one takes over.

Azure load balancing is performed at layer 4 of the OSI model. This means the load balancer is not aware of the application content of the network packets; it simply distributes packets based on network ports.

To load balance over multiple virtual machines, you can create a load-balanced set by performing the following steps:

  1. In the Azure Management Portal, select the virtual machine whose service should be load balanced.
  2. Select Endpoints in the upper menu.
  3. Click on Add.
  4. Select Add a stand-alone endpoint and click on the right arrow.
  5. Fill in a name, select a protocol, and set the public and private ports.
  6. Enable Create a load-balanced set and click on the right arrow.
  7. Next, fill in a name for the load-balanced set.
  8. Fill in the probe port, the probe interval, and the number of probes. The load balancer uses this information to check whether the service is available: it connects to the probe port at the specified interval, and if the specified number of consecutive probes fails to connect, the load balancer stops distributing traffic to this virtual machine.
  9. Click on the check mark.
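The same load-balanced endpoint can be created with the classic Azure PowerShell module. This is a minimal sketch that assumes the cloud service, VM, and set names below are yours; run it once per virtual machine that should join the set:

    # Add an HTTP endpoint to a VM and join it to a load-balanced set,
    # including the probe the load balancer uses to test availability.
    Get-AzureVM -ServiceName "myhybridapp" -Name "web01" |
        Add-AzureEndpoint -Name "http" -Protocol tcp -PublicPort 80 -LocalPort 80 `
            -LBSetName "web-lbset" `
            -ProbeProtocol http -ProbePort 80 -ProbePath "/" `
            -ProbeIntervalInSeconds 15 |
        Update-AzureVM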

The load balancing mechanism available is based on a hash. Microsoft Azure Load Balancer uses a five tuple (source IP, source port, destination IP, destination port, and protocol type) to calculate the hash that is used to map traffic to the available servers.

A second load-balancing mode was introduced in October 2014, called Source IP Affinity (also known as session affinity or client IP affinity). With Source IP Affinity, connections initiated from the same client computer go to the same DIP endpoint.
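Source IP Affinity can be enabled on an existing load-balanced set from PowerShell. A sketch, assuming the service and set names from the earlier example:

    # Change the distribution mode of a load-balanced set to 2-tuple
    # (source IP) affinity; "sourceIPProtocol" gives 3-tuple affinity.
    Set-AzureLoadBalancedEndpoint -ServiceName "myhybridapp" -LBSetName "web-lbset" `
        -Protocol tcp -LocalPort 80 -ProbeProtocolTCP -ProbePort 80 `
        -LoadBalancerDistribution "sourceIP"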

These load balancers provide high availability inside a single data center. If a virtual machine that is part of a load-balanced set fails, the load balancer notices this and removes that virtual machine's IP address from its table.

However, load balancers will not protect against the failure of a complete data center. The domains that are used to direct clients to an application route to a particular virtual IP that is bound to a single Azure data center.

To keep an application accessible even if an Azure region has failed, you can use Azure Traffic Manager. This service can be used for several purposes:

  • To fail over to a different Azure region if a disaster occurs
  • To provide the best user experience by directing network traffic to the Azure region closest to the location of the user
  • To reroute traffic to another Azure region whenever planned maintenance takes place

The main task of Traffic Manager is to map a DNS query to an IP address that is the access point of a service.

This job can be compared, for example, with the job of someone working at the X-ray machines at an airport. You have probably seen those multiple rows of X-ray machines, where the length of each queue differs at any given moment. An officer standing at the entry of the area distributes people over the available X-ray machines so that all queues remain roughly equal in length.

Traffic Manager provides you with a choice of load-balancing methods, including performance, failover, and round-robin. Performance load balancing measures the latency between the client and the cloud service endpoint. Traffic Manager is not aware of the actual load on virtual machines servicing applications.

As Traffic Manager resolves endpoints of Azure cloud services only, it cannot be used for load balancing between an Azure region and a non-Azure region (for example, Amazon EC2) or between on-premises and Azure services.

Traffic Manager performs health checks on a regular basis by querying the endpoints of the services. If an endpoint does not respond, Traffic Manager stops distributing network traffic to that endpoint for as long as the endpoint remains unavailable.

Traffic Manager is available in all Azure regions. Microsoft charges for using this service based on the number of DNS queries that are received by Traffic Manager. As the service is attached to an Azure subscription, you will be required to contact Azure support to transfer Traffic Manager to a different subscription.
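A Traffic Manager profile can also be created from PowerShell. The following sketch creates a failover profile with two cloud service endpoints; the profile and endpoint DNS names are illustrative:

    # Create a Traffic Manager profile that monitors endpoints over HTTP
    # and fails over between two cloud services.
    $tm = New-AzureTrafficManagerProfile -Name "myapp-tm" `
        -DomainName "myapp-tm.trafficmanager.net" `
        -LoadBalancingMethod Failover `
        -MonitorProtocol Http -MonitorPort 80 -MonitorRelativePath "/" -Ttl 30

    $tm | Add-AzureTrafficManagerEndpoint -DomainName "myapp-eu.cloudapp.net" `
            -Type CloudService -Status Enabled |
        Add-AzureTrafficManagerEndpoint -DomainName "myapp-us.cloudapp.net" `
            -Type CloudService -Status Enabled |
        Set-AzureTrafficManagerProfile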

The following table shows the differences between Azure's built-in load balancer and Traffic Manager:

 

                          Load balancer                    Traffic Manager

  Distribution targets    Must reside in the same region   Can be across regions

  Load-balancing method   5-tuple hash,                    Performance, failover,
                          Source IP Affinity               and round-robin

  Level                   OSI layer 4 (TCP/UDP ports)      DNS level (DNS queries)

Third-party load balancers

In certain configurations, the default Azure load balancer might not be sufficient. Several vendors support or are starting to support Azure; one of them is Kemp Technologies.

Kemp Technologies offers a free load balancer for Microsoft Azure. The Virtual LoadMaster (VLM) provides layer 7 application delivery. The virtual appliance has some limitations compared to the commercially available unit. The maximum bandwidth is limited to 100 Mbps and High Availability is not offered. This means the Kemp LoadMaster for Azure free edition is a single point of failure. Also, the number of SSL transactions per second is limited.

One of the use cases in which a third-party load balancer is required is when we use Microsoft Remote Desktop Gateway. As you might know, Citrix has been supporting the use of Citrix XenApp and Citrix XenDesktop running on Azure since 2013. This means service providers can offer cloud-based desktops and applications using these Citrix solutions.

To make this a working configuration, session affinity is required. Session affinity makes sure that the network traffic of a given client is always routed to the same server.

Windows Server 2012 Remote Desktop Gateway uses two HTTP channels, one for input and one for output, which must be routed through the same Remote Desktop Gateway. The Azure load balancer distributes traffic based on a 5-tuple hash, which does not guarantee that both channels use the same server.

However, hardware and software load balancers that support IP affinity, cookie-based affinity, or SSL ID-based affinity (and thus ensure that both HTTP connections are routed to the same server) can be used with Remote Desktop Gateway.

Another use case is load balancing of Active Directory Federation Services (ADFS). Microsoft Azure can be used as a backup for on-premises Active Directory (AD). Suppose your organization is using Office 365. To provide single sign-on, a federation has been set up between the Office 365 directory and your on-premises AD. If your on-premises ADFS fails, external users can no longer authenticate. By using Microsoft Azure for ADFS, you can provide high availability for authentication.

Kemp LoadMaster for Azure can be used to load balance network traffic to ADFS with the required session affinity. To install Kemp LoadMaster, perform the following steps:

  1. Download the Publish Profile settings file from https://windows.azure.com/download/publishprofile.aspx.
  2. Use PowerShell for Azure with the Import-AzurePublishSettingsFile command.
  3. Upload the KEMP-supplied VHD file to your Microsoft Azure storage account.
  4. Publish the VHD as an image.
  5. Use the published image to create virtual machines.

The complete steps are described in the documentation provided by Kemp.
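In outline, and assuming the classic Azure PowerShell module plus illustrative file, account, and image names, the PowerShell side of these steps looks like this:

    # Steps 1-2: authenticate using the downloaded publish settings file.
    Import-AzurePublishSettingsFile "C:\azure\mysubscription.publishsettings"

    # Step 3: upload the KEMP-supplied VHD to your storage account as a page blob.
    Add-AzureVhd -LocalFilePath "C:\kemp\LoadMaster.vhd" `
        -Destination "https://mystorageacct.blob.core.windows.net/vhds/LoadMaster.vhd"

    # Step 4: publish the uploaded VHD as an image.
    Add-AzureVMImage -ImageName "KempVLM" -OS Linux `
        -MediaLocation "https://mystorageacct.blob.core.windows.net/vhds/LoadMaster.vhd"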

Geo-replication of data

Microsoft Azure has geo-replication of Azure Storage enabled by default. This means all of your data is not only stored at three different locations in the primary region, but also replicated and stored at three different locations in the paired region.

However, this replicated data cannot be accessed by the customer. Microsoft has to declare a data center or storage stamp as lost before it will fail over to the secondary location.

In the rare circumstance where a failed storage stamp cannot be recovered, you will experience many hours of downtime. So, you have to make sure you have your own disaster recovery procedures in place.

Zone Redundant Storage

Microsoft offers a third option you can use to store data. Zone Redundant Storage (ZRS) is a mix of the two other options for data redundancy: data is replicated to a secondary facility located in the same region or in the paired region, but instead of the six copies that geo-replicated storage keeps, only three copies of the data are stored. So, ZRS sits between locally redundant storage and geo-replicated storage. The cost of ZRS is about 66 percent of the cost of GRS.

Snapshots of the Microsoft Azure disk

Server virtualization solutions such as Hyper-V and VMware vSphere offer the ability to save the state of a running virtual machine. This can be useful when you’re making changes to the virtual machine but want to have the ability to reverse those changes if something goes wrong.

This feature is called a snapshot. Basically, the virtual disk is preserved by marking it read-only; all writes to the disk after a snapshot has been initiated are stored on a temporary delta disk. When the snapshot is deleted, those changes are committed from the delta disk to the initial disk.
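On on-premises Hyper-V, the Windows Server Hyper-V cmdlets expose this directly. A short sketch with an illustrative VM name:

    # Create a checkpoint (snapshot) before making a risky change.
    Checkpoint-VM -Name "web01" -SnapshotName "before-update"

    # Roll back if the change went wrong...
    Restore-VMSnapshot -VMName "web01" -Name "before-update" -Confirm:$false

    # ...or accept the change and merge the delta disk back into the base disk.
    Remove-VMSnapshot -VMName "web01" -Name "before-update"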

While the Microsoft Azure Management Portal does not have a feature to create snapshots, it is possible to make point-in-time copies of virtual disks attached to virtual machines.

Microsoft Azure Storage supports versioning. Under the hood, this works differently from snapshots in Hyper-V: a snapshot blob of the base blob is created. Snapshots are by no means a replacement for a backup, but it is nice to know that you can save the state and quickly revert if required.
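A blob snapshot can be created with the classic Azure Storage cmdlets and the underlying .NET storage client. A sketch, assuming the storage account, key, container, and VHD names below are yours (for a consistent copy, shut down the virtual machine first):

    # Create a point-in-time snapshot of a VHD blob.
    $ctx = New-AzureStorageContext -StorageAccountName "mystorageacct" `
        -StorageAccountKey "<account key>"

    $blob = Get-AzureStorageBlob -Container "vhds" -Blob "web01-disk0.vhd" -Context $ctx
    $snapshot = $blob.ICloudBlob.CreateSnapshot()   # the read-only snapshot blob
    $snapshot.SnapshotTime                          # timestamp that identifies it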

Introduction to geo-replication

By default, Microsoft replicates all data stored on Microsoft Azure Storage to a secondary location in the paired region. Customers are able to enable or disable this replication; when enabled, customers are charged for it.

When Geo Redundant Storage has been enabled on a storage account, all data is replicated asynchronously. At the secondary location, data is stored on three different storage nodes, so even when two nodes fail, the data is still accessible.

However, before the Read Access Geo Redundant Storage feature was available, customers had no way to actually access the replicated data. The replicated data could only be used by Microsoft when the primary storage could not be recovered.

Microsoft will try everything to restore data in the primary location and avoid a so-called geo-failover process. A geo-failover means that a storage account's secondary location (the replicated data) is configured as the new primary location. The problem is that a geo-failover cannot be done per storage account; it needs to be done at the storage stamp level. A storage stamp has multiple racks of storage nodes, so you can imagine how much data and how many customers are involved when a storage stamp needs to fail over. A failover will affect the availability of applications, and because of the asynchronous replication, some data will be lost when a failover is performed.

Microsoft is working on an API that allows customers to fail over a storage account themselves. When geo-redundant replication is enabled, you only benefit from it when Microsoft has a major issue. Geo-redundant storage is neither a replacement for a backup nor a disaster recovery solution.

Microsoft states that the Recovery Point Objective (RPO) for Geo Redundant Storage is about 15 minutes. That means that if a failover is required, customers can lose about 15 minutes of data. Microsoft does not provide an SLA on how long geo-replication will take.

Microsoft does not give an indication of the Recovery Time Objective (RTO) either. The RTO indicates the time required by Microsoft to make data available again after a major failure that requires a failover. Microsoft once had to deal with a failure of storage stamps; it did not perform a failover, and it took many hours to restore the storage service to a normal level.

In 2013, Microsoft introduced a new feature called Read Access Geo Redundant Storage (RA-GRS). This feature allows customers to perform reads on the replicated data. This increases the read availability from 99.9 percent when GRS is used to above 99.99 percent when RA-GRS is enabled.

Microsoft charges more when RA-GRS is enabled. RA-GRS is an interesting addition for applications that are primarily meant for read-only purposes. When the primary location is not available and Microsoft has not done a failover, writes are not possible.

The availability of the Azure Virtual Machine service is not increased by enabling RA-GRS. While the VHD data is replicated and can be read, the virtual machine itself is not replicated. Perhaps this will be a feature in the future.
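To experiment with RA-GRS, you can create a storage account with the Standard_RAGRS type using the classic Azure PowerShell module; the account name is illustrative:

    # Create a storage account with read-access geo-redundant replication.
    New-AzureStorageAccount -StorageAccountName "mystorageacct" `
        -Location "West Europe" -Type "Standard_RAGRS"

    # With RA-GRS enabled, blobs can also be read (not written) at the
    # secondary endpoint: https://mystorageacct-secondary.blob.core.windows.net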

Disaster recovery using Azure Site Recovery

Disaster recovery has always been among the top priorities for organizations. IT has become a very important, if not mission-critical, factor for doing business. A failure of IT can result in loss of money, customers, orders, and brand value.

There are many situations that can disrupt IT, such as:

  • Hurricanes
  • Floods
  • Earthquakes
  • Disasters such as a failure of a nuclear power plant
  • Fire
  • Human error
  • Outbreak of a virus
  • Hardware or software failure

While these threats are clear and the risk of being hit by one can be calculated, many organizations do not have proper protection against them.

In three different situations, disaster recovery solutions can help an organization to continue doing business:

  • Avoiding a possible failure of IT infrastructure by moving servers to a different location.
  • Avoiding a disaster situation, such as a hurricane or flood, since such situations are generally known well in advance thanks to weather forecasting.
  • Recovering as quickly as possible when a disaster has hit the data center unexpectedly, such as a fire, hardware error, or human error.

Some reasons for not having a proper disaster recovery plan are complexity, lack of time, and ignorance; however, in most cases, a lack of budget and the belief that disaster recovery is expensive are the main reasons. Almost all organizations that have been hit by a major disaster causing unacceptable periods of downtime started to implement a disaster recovery plan, including the technology, immediately after they recovered. In many cases, however, this insight came too late. According to Gartner, 43 percent of companies experiencing disasters never reopen, and 29 percent close within 2 years.

Server virtualization has made disaster recovery a lot easier and cost effective. Verifying that your DR procedure actually works as designed and matches RTO and RPO is much easier using virtual machines.

Since Windows Server 2012, Hyper-V has a feature for asynchronous replication of virtual machine virtual disks to another location. This feature, Hyper-V Replica, is very easy to enable and configure and does not cost extra. Hyper-V Replica is storage agnostic, which means the storage type at the primary site can be different from the storage type used at the secondary site. So, Hyper-V Replica works perfectly when your virtual machines are hosted on, for example, EMC storage at the primary site while an HP solution is used at the secondary site.
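Enabling Hyper-V Replica for a single virtual machine takes only a couple of cmdlets on Windows Server 2012 R2. A sketch with illustrative VM and server names, assuming the replica server has already been configured to accept replication:

    # Enable replication of web01 to a replica server, replicating every 5 minutes.
    Enable-VMReplication -VMName "web01" `
        -ReplicaServerName "replica.contoso.com" -ReplicaServerPort 80 `
        -AuthenticationType Kerberos -ReplicationFrequencySec 300

    # Kick off the initial (full) replication over the network.
    Start-VMInitialReplication -VMName "web01"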

While replication is a must for DR, another very useful feature in DR is automation. As an administrator, you will really appreciate the option to click on a button after deciding to perform a failover, and then sit back and relax. Recovery is mostly a stressful job when your primary location is flooded or burned, and lots of things can go wrong if recovery is done manually.

This is why Microsoft designed Azure Site Recovery. Azure Site Recovery is able to assist in disaster recovery in several scenarios:

  • A customer has two data centers both running Hyper-V managed by System Center Virtual Machine Manager. Hyper-V Replica is used to replicate data at the virtual machine level.
  • A customer has two data centers both running Hyper-V managed by System Center Virtual Machine Manager. NetApp storage is used to replicate between the two sites at the storage level.
  • A customer has a single data center running Hyper-V managed by System Center Virtual Machine Manager.
  • A customer has two data centers both running VMware vSphere. In this case, InMage Scout software is used to replicate between the two data centers; Azure is not used for orchestration.
  • A customer has a single data center not managed by System Center Virtual Machine Manager.

In the single data center scenarios, Microsoft Azure is used as the secondary data center if a disaster makes the primary data center unavailable.

Microsoft has also announced support for a scenario where vSphere is used on-premises and Azure Site Recovery is used to replicate data to Azure. To enable this, InMage software will be used. Details were not available at the time this article was written.

In the first two scenarios described, Site Recovery is used to orchestrate the failover and failback to the secondary location. Management is done using the Azure Management Portal, which is available in any browser supporting HTML5, so a failover can be initiated even from a tablet or smartphone.

Using Azure as a secondary data center for disaster recovery

Azure Site Recovery went into preview in June 2014. For organizations using Hyper-V, there is no direct need to have a secondary data center as Azure can be used as a target for Hyper-V Replica.

Some of the characteristics of the service are as follows:

  • Allows nondisruptive disaster recovery failover testing
  • Automated reconfiguration of the network configuration of guests
  • Storage agnostic: supports any type of on-premises storage supported by Hyper-V
  • Support for VSS to enable application consistency
  • Protects more than 1,000 virtual machines (Microsoft tested with 2,000 virtual machines, and this went well)

Customers do not have to use System Center Virtual Machine Manager to be able to use Site Recovery; it can be used without SCVMM installed. When SCVMM is available, Site Recovery uses information such as the virtual networks it provides to map them to networks available in Microsoft Azure.

Site Recovery does not support sending a copy of the virtual hard disks on removable media to an Azure data center to avoid performing the initial replication over the WAN (seeding). Customers need to transfer all the replication data over the network. ExpressRoute helps to get a much better throughput compared to a site-to-site VPN over the Internet.

Failover to Azure can be as simple as clicking on a single button. Site Recovery will then create new virtual machines in Azure and start the virtual machines in the order defined in the recovery plan. A recovery plan is a workflow that defines the startup sequence of virtual machines. It is possible to stop the recovery plan to allow a manual check, for example. If all is okay, the recovery plan will continue doing its job. Multiple recovery plans can be created.

Microsoft Volume Shadow Copy Services (VSS) is supported, which allows application consistency. Replication of data can be configured at intervals of 30 seconds, 5 minutes, or 15 minutes. Replication is performed asynchronously.

For recovery, 24 recovery points are available. These are like snapshots or point-in-time copies. If the most recent replica cannot be used (for example, because of damaged data), another replica can be used for the restore. You can also configure extended replication, where your Replica server forwards changes that occur on the primary virtual machines to a third server (the extended Replica server). After a planned or unplanned failover from the primary server to the Replica server, the extended Replica server provides further business continuity protection. As with ordinary replication, you configure extended replication by using Hyper-V Manager, Windows PowerShell (using the –Extended option), or WMI.
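Extended replication is configured on the Replica server, pointing the replica copy at a third server. A sketch, assuming illustrative server names and that the third server has been configured to accept replication:

    # Run on the Replica server: forward changes for web01 to a third server.
    Enable-VMReplication -VMName "web01" `
        -ReplicaServerName "dr3.contoso.com" -ReplicaServerPort 80 `
        -AuthenticationType Kerberos

    # Inspect the extended relationship.
    Get-VMReplication -VMName "web01" -ReplicationRelationshipType Extended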

At the moment, only the VHD virtual disk format is supported. Generation 2 virtual machines, which can be created on Hyper-V, are not supported by Site Recovery. Generation 2 virtual machines have a simplified virtual hardware model and support Unified Extensible Firmware Interface (UEFI) firmware instead of BIOS-based firmware. They also support booting from PXE, SCSI hard disk, and SCSI DVD, as well as Secure Boot.

However, on March 19, Microsoft responded to numerous customer requests for Site Recovery support of Generation 2 virtual machines: Site Recovery will soon support Gen 2 VMs. On failover, the VM will be converted to a Gen 1 VM; on failback, the VM will be converted back to Gen 2. This conversion is done until the Azure platform natively supports Gen 2 VMs.

Customers using Site Recovery are charged only for consumption of storage as long as they do not perform a failover or failover test.

Failback is also supported. After running in Microsoft Azure for a while, customers are likely to move their virtual machines back to the on-premises primary data center. Site Recovery will replicate back only the changed data.

Note that customer data is not stored in Microsoft Azure when Hyper-V Recovery Manager is used. Azure is used to coordinate the failover and recovery; to be able to do this, it stores information on network mappings, runbooks, and the names of virtual machines and virtual networks. All data sent to Azure is encrypted.

By using Azure Site Recovery, we can perform service orchestration in terms of replication, planned failover, unplanned failover, and test failover. The entire engine is powered by Azure Site Recovery Manager.

Let’s have a closer look at the main features of Azure Site Recovery. It enables three main scenarios:

  • Test Failover or DR Drills: Enables support for application testing by creating test virtual machines and networks as specified by the user. Without impacting production workloads or their protection, HRM can quickly enable periodic workload testing.
  • Planned Failovers (PFO): For compliance, or in the event of a planned outage, customers can use planned failovers: virtual machines are shut down, final changes are replicated to ensure zero data loss, and then the virtual machines are brought up in order on the recovery site as specified by the recovery plan (RP). More importantly, failback is a single-click gesture that executes a planned failover in the reverse direction.
  • Unplanned Failovers (UFO): In the event of an unplanned outage or a natural disaster, HRM opportunistically attempts to shut down the primary machines if some of the virtual machines are still running when the disaster strikes. It then automates their recovery on the secondary site as specified by the RP.

If your secondary site uses a different IP subnet, Site Recovery is able to change the IP configuration of your virtual machines during the failover.

Part of the Site Recovery installation is the installation of a VMM provider. This component communicates with Microsoft Azure. Site Recovery can be used even if you have a single VMM server to manage both the primary and secondary sites.

Site Recovery does not rely on the availability of any component in the primary site when performing a failover. So, even if the complete site, including the link to Azure, has been destroyed, Site Recovery will still be able to perform the coordinated failover.

Azure Site Recovery to customer-owned sites is billed per protected virtual machine per month, at approximately €12 per month. Microsoft bills for the average number of protected virtual machines per month. So, if you are protecting 20 virtual machines in the first half of the month and 0 in the second half, you will be charged for 10 virtual machines for that month.

When Azure is used as a target, Microsoft will also charge for the consumption of storage during replication. The costs for this scenario are €40.22 per month per protected instance.

As soon as you perform a test failover or an actual failover, Microsoft will charge for the virtual machine CPU and memory consumption.

Summary

This article covered the concepts of High Availability in Microsoft Azure and disaster recovery using Azure Site Recovery, and also gave an introduction to the concept of geo-replication.
