Today, we will explore the disaster recovery basics to understand the common terms in high availability and disaster recovery. We will then discuss SQL Server offering for HA/DR options.
Disaster recovery basics
Disaster recovery (DR) is a set of tools, policies, and procedures, which help us during the recovery of your systems after a disastrous event. Disaster recovery is just a subset of a more complex discipline called business continuity planning, where more variables come in place and you expect more sophisticated plans on how to recover the business operations. With careful planning, you can minimize the effects of the disaster, because you have to keep in mind that it’s nearly impossible to completely avoid disasters.
|The main goal of a disaster recovery plan is to minimize the downtime of our service and to minimize the data loss. To measure these objectives, we use special metrics: Recovery Point and Time Objectives.|
Recovery Time Objective (RTO) is the maximum time that you can use to recover the system. This time includes your efforts to fix the problem without starting the disaster recovery procedures, the recovery itself, proper testing after the disaster recovery, and the communication to the stakeholders. Once a disaster strikes, clocks are started to measure the disaster recovery actions and the Recovery Time Actual (RTA) metric is calculated. If you manage to recover the system within the Recovery Time Objective, which means that RTA < RTO, then you have met the metrics with a proper combination of the plan and your ability to restore the system.
Recovery Point Objective (RPO) is the maximum tolerable period for acceptable data loss. This defines how much data can be lost due to disaster. The Recovery Point Objective has an impact on your implementation of backups, because you plan for a recovery strategy that has specific requirements for your backups. If you can avoid to lose one day of work,
you can properly plan your backup types and the frequency of the backups that you need to
The following image is an illustration of the very concepts that we discussed in the preceding paragraph:
When we talk about system availability, we usually use a percentage of the availability time. This availability is a calculated uptime in a given year or month (any date metric that you need) and is usually compared to a following table of “9s”.
Availability also expresses a tolerable downtime in a given time frame so that the system still meets the availability metric. In the following table, we’ll see some basic availability options with tolerable downtime a year and a day:
|Availability %||Downtime a year||Downtime a day|
|90%||36.5 days||2.4 hours|
|98%||7.3 days||28.8 minutes|
|99%||3.65 days||14.4 minutes|
|99.9%||8.76 hours||1.44 minutes|
|99.99%||52.56 minutes||8.64 seconds|
|99.999%||5.26 minutes||less than 1 second|
This tolerable downtime consists of the unplanned downtime and can be caused by many factors:
- Natural Disasters
- Hardware failures
- Human errors (accidental deletes, code breakdowns, and so on)
- Security breaches
For these, we can have a mitigation plan in place that will help us reduce the downtime to a tolerable range, and we usually deploy a combination of high availability solutions and disaster recovery solutions so that we can quickly restore the operations. On the other hand, there’s a reasonable set of events that require a downtime on your service due to the maintenance and regular operations, which does not affect the availability on your system. These can include the following:
- New releases of the software
- Operating system patching
- SQL Server patching
- Database maintenance and upgrades
Our goal is to have the database online as much as possible, but there will be times when the database will be offline and, from the perspective of the management and operation, we’re talking about several keywords such as uptime, downtime, time to repair, and time between failures, as you can see in the following image:
It’s really critical not only to have a plan for disaster recovery, but also to practice the disaster recovery itself. Many companies follow the procedure of proper disaster recovery plan testing with different types of exercise where each and every aspect of the disaster recovery is carefully evaluated by teams who are familiar with the tools and procedures for a real disaster event. This exercise may have different scope and frequency, as listed in the following points:
- Tabletop exercises usually involve only a small number of people and focus on a specific aspect of the DR plan. This would be a DBA team drill to recover a single SQL Server or a small set of servers with simulated outage.
- Medium-sized exercises will involve several teams to practice team communication and interaction.
- Complex exercises usually simulate larger events such as data center loss, where a new virtual data center is built and all new servers and services are provisioned by the involved teams.
Such exercises should be run on a periodic basis so that all the teams and team personnel are up to speed with the disaster recovery plans.
SQL Server options for high availability and disaster recovery
SQL Server has many features that you can put in place to implement a HA/DR solution that will fit your needs. These features include the following:
- Always On Failover Cluster
- Always On Availability Groups
- Database mirroring
- Log shipping
In many cases, you will combine more of the features together, as your high availability and disaster recovery needs will overlap. HA/DR does not have to be limited to just one single feature. In complex scenarios, you’ll plan for a primary high availability solution and secondary high availability solution that will work as your disaster recovery solution at the same time.
Always On Failover Cluster
An Always On Failover Cluster (FCI) is an instance-level protection mechanism, which is based on top of a Windows Failover Cluster Feature (WFCS). SQL Server instance will be installed across multiple WFCS nodes, where it will appear in the network as a single computer.
All the resources that belong to one SQL Server instance (disk, network, names) can be owned by one node of the cluster and, during any planned or unplanned event like a failure of any server component, these can be moved to another node in the cluster to preserve operations and minimize downtime, as shown in the following image:
Always On Availability Groups
Always On Availability Groups were introduced with SQL Server 2012 to bring a database-level protection to the SQL Server. As with the Always On Failover Cluster, Availability Groups utilize the Windows Failover Cluster feature, but in this case, single SQL Server is not installed as a clustered instance but runs independently on several nodes. These nodes can be configured as Always On Availability Group nodes to host a database, which will be synchronized among the hosts. The replica can be either synchronous or asynchronous, so Always On Availability Groups are a good fit either as a solution for one data center or even distant data centers to keep your data safe. With new SQL Server versions, Always On Availability Groups were enhanced and provide many features for database high availability and disaster recovery scenarios. You can refer to the following image for a better understanding:
Database mirroring is an older HA/DR feature available in SQL Server, which provides database-level protection. Mirroring allows synchronizing the databases between two servers, where you can include one more server as a witness server as a failover quorum.
Unlike the previous two features, database mirroring does not require any special setup such as Failover Cluster and the configuration can be achieved via SSMS using a wizard available via database properties. Once a transaction occurs on the primary node, it’s copied to the second node to the mirrored database. With proper configuration, database mirroring can provide failover options for high availability with automatic client redirection.
Database mirroring is not preferred solution for HA/DR, since it’s marked as a deprecated feature from SQL Server 2012 and is replaced by Basic Availability Groups on current versions.
Log shipping configuration, as the name suggests, is a mechanism to keep a database in sync by copying the logs to the remote server. Log shipping, unlike mirroring, is not copying each single transaction, but copies the transactions in batches via transaction log backup on the primary node and log restore on the secondary node. Unlike all previously mentioned features, log shipping does not provide an automatic failover option, so it’s considered more as a disaster recovery option than a high availability one. Log shipping operates on regular intervals where three jobs have to run:
- Backup job to backup the transaction log on the primary system
- Copy job to copy the backups to the secondary system
- Restore job to restore the transaction log backup on the secondary system
Log shipping supports multiple standby databases, which is quite an advantage compared to database mirroring. One more advantage is the standby configuration for log shipping, which allows read-only access to the secondary database. This is mainly used for many reporting scenarios, where the reporting applications use read-only access and such configuration allows performance offload to the secondary system.
Replication is a feature for data movement from one server to another that allows many different scenarios and topologies.
|Replication uses a model of publisher/subscriber, where the Publisher is the server offering the content via a replication article and subscribers are getting the data.|
The configuration is more complex compared to mirroring and log shipping features, but allows you much more variety in the configuration for security, performance, and topology.
Replication has many benefits and a few of them are as follows:
- Works on the object level (whereas other features work on database or instance level)
- Allows merge replication, where more servers synchronize data between each other
- Allows bi-directional synchronization of data
- Allows other than SQL Server partners (Oracle, for example)
There are several different replication types that can be used with SQL Server, and you can choose them based on the needs for HA/DR options and the data availability requirements on the secondary servers. These options include the following:
- Snapshot replication
- Transactional replication
- Peer-to-peer replication
- Merge replication
We introduced the disaster recovery discipline with the whole big picture of business continuity on SQL Server. Disaster recovery is not only about having backups, but more about the ability to bring the service back to operation after severe failures. We have seen several options that can be used to implement part of disaster recovery on SQL Server–log shipping, replication, and mirroring.
To know more about how to design and use an optimal database management strategy, do checkout the book SQL Server 2017 Administrator’s Guide.