Always On Availability Groups & VMs
As Data Platform Engineers, we are responsible for maintaining the Availability of data to those that are authorized to access it. We have many tools available to us in order to help us meet these needs, from the underlying infrastructure to the Always On features in SQL Server, from virtualization overtaking physical implementations to the rise of cloud IaaS offerings. It is important to know how these technologies interact in order to build a solid, highly available system. Here I will be focusing on the interplay between virtualization and Always On in SQL Server, and how it can hurt you.
Much of the time there is a systems team and a DBA team, and when the DBAs need to build out a new set of SQL Servers, they request X number of virtual servers from the systems team. The servers are handed over and the DBA team works its magic, and then we have our Failover Cluster Instance or Availability Group High Availability solution. But, is it really Highly Available?
The reason I ask is twofold:
- Which physical hosts are your Virtual Machines are located on?
- Which data stores are your virtual disks are located in?
If the answer to either of these questions results in the same answer for any of your Virtual Machines in an Availability Group, or Failover Cluster Instance for that matter. Then you potentially have a massive flaw in your implementation that can affect availability.
If you plan on implementing Always On features in SQL Server, either Availability Groups or Failover Cluster Instances, on a virtual platform, then you need to work with your systems team. When putting the request in for the virtual machines, it is vital that you make the following specifications:
- All virtual machines need to reside on different hosts
- There should be no overlap of virtual disks from different virtual machines in the same data stores
- Anti-Affinity should be configured on all appropriate resources to ensure that, under normal operation, the virtual resources will not co-exist on the same components
The reason for this is simple; if you have resources on the same infrastructure component, and that component fails, then the scope of impact is dramatically increased, resulting in an increased chance that you will suffer a larger outage, and potentially increase your workload to restore service.
What we are aiming for is something that looks a little like this:
Now I am going to have to admit that I am borrowing some terminology from Microsoft Azure here when referring to a "Fault Domain." Simply put, I feel that it conveys the concept of a unit of failure very concisely. A good definition of a Fault Domain can be found here. Essentially, it defines the boundary of impact in the event of the underlying infrastructure failing.
Now, this might seem like common sense to you or I, however it is very easy to end up in one of the following situations when there is a breakdown in communication between the DBAs and the Systems team.
Here we have a situation where multiple Virtual Machines that are part of the SQL Server High Availability configuration are sharing the same host resources.
In this situation, the failure of one component in the VM layer can nullify the SQL Server High Availability configuration. Depending on the number of nodes that exist in the Windows Failover cluster, the impact can be to remove multiple nodes from the cluster, all the way through to a full outage if both the Primary Replica and Automatic Failover Partner are on the host that failed. In short, allowing this configuration to arise increases the amount of risk associated with a deployment.
In this situation, we have the Virtual Machines distributed over the hosts. However there are VMDKs from multiple Virtual Machines in the same datastore.
Now I can hear you all saying, but who would be silly enough to do something like that! Well, you would be amazed. I have seen this situation a number of times in production environments. A lot of the time it stems from the SAN Admin/Systems Team trying to leverage de-duplication and compression technology in storage arrays, without realising just what can happen. You will hear many of the same lines about it being "backed by X number of disks" or "when was the last time a SAN failed?" Both of which are semi-valid; a SAN is designed in such a way to handle multiple component failures. But it does not take into account events like someone accidentally deleting the LUN/Datastore, which I have also seen happen in a production environment.
The impact is relatively easy to quantify. Simply think what would happen if all of the OS/Database Volumes/Transaction Log volumes, for all of the servers in your Availability Group, vanished at the same time. Not a situation that we want to happen.
With SQL Sentry v10, we introduced Performance Advisor for VMware, giving deeper insight on what is going on at the host level and allowing you to see how this could be impacting your guest SQL Servers. When combined with Performance Advisor for SQL Server, it allows us to build alerts that can notify us in the event that either of the two situations above are present in the systems we are monitoring.
How can we detect this? Simple: Custom Conditions.
We can write two custom conditions that pull data from the SQL Sentry repository, gathering together data from the Always On Availability Group monitoring and the VMware monitoring, then identify where we have overlap between replicas at the host and datastore level.
Once we have wrapped these up into Custom Conditions, we have the ability to set up the alerts to notify via email, Event Log, or any of the other actions that can be configured within the SQL Sentry system for alerting. Additionally, we have the resultset from the query included in any message that is sent to support staff, meaning that you will have the details about where there is an issue and the details to go and remediate it.
The result set for situation one can be seen below:
Here we can see that the host, Windows Server Failover Cluster, Availability Group, and the Availability Replicas are all identified. This gives us the information that we need to go to the Systems team and ask them to move specific virtual machines. This particular condition will work for both Hyper-V and VMware.
The result set for situation two, where multiple disks reside in the same datastore, is as follows:
Again you can see that all of the information that you need will be pulled back and presented to you so that it is easy to identify the scale of the work that needs to be completed in order to resolve the situation.
If you have SQL Sentry v10 and have both Performance Advisor for SQL and Performance Advisor for VMware monitoring your environment, then you can download the conditions below. You will just need to extract them from the zip file and then import them into your Conditions list. Details on how to create and import custom conditions can be found in the documentation here. Additionally there is a fantastic blog post by Melissa on creating custom conditions, should you want to write your own.
- SHA1 Checksum : 27F362057F5E1F9CB0FD0B5A8F2CD018941A562D
- MD5 Checksum : 85940D5D318142098E08DB8C9070DF38
- SHA1 Checksum : FED59CAA24B9889F48D8066965D735B38D25F8A1
- MD5 Checksum : DE46DD706EADF4E49A9DB9A8EB09CD66