Diagnose Performance Issues in Microsoft APS and Azure Data Warehouse with SentryOne Alerts
During my time as Product Manager of APS Sentry and DW Sentry at SentryOne, I’ve had the opportunity to talk to many users of Azure Data Warehouse and Microsoft Analytics Platform System (APS), also known as Parallel Data Warehouse (PDW). Likewise, I’ve spent time talking to Program Managers and Field Engineers at Microsoft about these same solutions. I always make a point to ask what they see as the most common issues that impact the performance of these massively parallel processing platforms.
In response I’ve compiled a list of some SentryOne built-in alerts and have authored some custom Advisory Conditions (AC) that many people find useful when monitoring these systems.
These common conditions are:
• Outdated Statistics
• Data Distribution Skew
• Suspended Requests
• High Request Queue Times
• Large Replicated Tables
• Low Available Memory
• High tempdb Growth
• Low Disk Space
• Long Running Requests
Below I’ll provide downloadable Advisory Conditions that you can import to your SentryOne client. Due to the nature of these platforms, some conditions may require minor adjustments to work properly in your environment. I’ll provide those details with the descriptions. A couple of these don’t even require a custom Advisory Condition, but are available out of the box with APS Sentry or DW Sentry. I’ll detail those conditions as well. Note while most of these conditions can be used for both platforms, some conditions may only support Microsoft Analytics Platform System (APS), others only Azure Data Warehouse (ADW). I will denote them accordingly.
Speaking with some team members at Microsoft, I was told the number one root cause of performance issues they see in APS and Azure DW is outdated statistics. This is because, for the most part, updating statistics is a manual process on these platforms. Even if statistics are up to date on the individual compute nodes, they may not be caught up on the control node. This can lead to the control node having a different estimate of how many rows are in a given table, which can then lead to a suboptimal MPP plan, and ultimately data movement that may be much more resource intensive than expected.
This advisory condition provides an effective way to determine if statistics are out of date by comparing row count estimates between the control node and the compute nodes for each table in a database.
By default, this condition will evaluate to true if there is at least 20% discrepancy on any table.
When configuring for your environment be sure to import at the individual target level, not All Targets. Also, be sure to update the Database name to the database you wish to evaluate.
Note that this condition only evaluates row counts and not cardinality. It also does not consider whether statistics exist on more than one column, nor for which column(s) the statistics exist.
The primary benefit of Massively Parallel Processing (MPP) is the fact that you spread the data, and effectively the workload of processing large quantities of data, across several compute nodes. If the data is not evenly distributed across those compute nodes, you are likely to demand more resources from some nodes than others, and not get the most benefit out of the MPP architecture.
APS Sentry and DW Sentry’s DMS Dashboard allows you to easily identify data skew during data movement activity, but wouldn’t it also be helpful to know of the potential for inefficient data movement before a request is made?
This advisory condition allows you to detect data skew of your data at rest, before it impacts your query performance.
By default, this condition will check row counts for each distribution for each table, and alert you if it detects a 10% or more difference between the smallest and largest distributions for a given table.
As with the Outdated Statistics AC, be sure to update the Database name to the database you wish to evaluate.
APS and Azure DW are OLAP systems optimized for handling requests involving large volumes of data quickly as opposed to high frequency, smaller transactional requests. In support of these large OLAP workloads, both these platforms utilize the concept of concurrency slots. APS appliances currently support 32 concurrency slots, while the number supported by Azure DW depends on the Performance Tier of the data warehouse. Requests can also utilize more than one concurrency slot based on their resource class.
More details regarding concurrency slots and performance tiers can be found at:
Concurrency and workload management in SQL Data Warehouse
When there are insufficient concurrency slots for a pending request, it is put into SUSPENDED status; once sufficient slots are available, then it will execute.
This advisory condition checks for any active requests in SUSPENDED status to alert you to concurrency slot saturation. Note this will only be true for active requests in this state. After they execute, they will show a normal completed status. (For detecting this or other similar issues after a query has completed, see my next advisory condition.)
This AC is global and only needs to be set up once under All Targets, since it queries the SentryOne repository database directly instead of any DMV’s on the target.
In addition to concurrency slot saturation, requests to APS or Azure DW can be queued for other reasons, such as blocking and resource contention. Even if you have available concurrency slots, these platforms still utilize locking and other mechanisms just like regular SQL Server. If it is excessive, you may notice a high percentage of the Total Elapsed Time for the request was spent in queue as represented in Queue Duration.
This Advisory Condition queries the SentryOne repository database every 10 minutes and evaluates to true if any completed or running queries in the last 10 minutes have a total queueing time greater than 1 minute (60000 ms). It will report up to 10 device name + request ID combinations by default. As such it only needs to be set up once, globally, like the Suspended Requests AC.
In addition to distributed tables that we’ve already discussed, there is also the option of replicated tables. Given the idea that storage tends to be cheaper than other resources, it may be more cost effective to simply have a complete copy of smaller tables, typically dimension tables, on each compute node rather than distribute them across all the nodes.
That said, tables can grow over time and you may reach a point that the table has reached a size where it makes more sense to distribute it.
This advisory condition will return true if any replicated table in the specified database exceeds 2 GB in size, suggesting it is a candidate for distribution.
The AC checks this condition every 8 hours by default, and triggers any configured actions as soon as it evaluates to True.
Apply this AC for each target, and when applying, edit it to replace Database name with the name of the database you wish to monitor on APS or Azure DW. As with these other AC’s, you can also easily modify the 2 GB threshold, if that is too low or too high for your environment.
These last four conditions are fairly self-explanatory, so I’ll be more brief in their description.
Memory can be a common root cause for slow performance and out-of-memory issues on any SQL Server and Microsoft’s MPP platforms are no exception. In Azure, consider scaling your data warehouse if you find SQL Server memory usage reaching its limits during query execution.
Since tempdb databases reside on the same volumes as the rest of your databases, and tempdb is used in data movement, it can help to track the size of tempdb and verify that is is not growing out of control. As with many of these conditions, you may find a different size threshold to be more appropriate in your environment.
While this condition can be detected by an advisory condition, it is not strictly necessary, as it is also covered by the built-in health alerts for APS. In fact, I included this condition specifically to remind you that I highly recommend enabling alerts for any APS Critical Health Alert. The Volume free space status is CRITICAL alert is fired when volume’s used disk space is beyond 90%.
See the Microsoft doc for more details on all APS Health Alerts.
This last condition, also available out of the box with APS Sentry and DW Sentry, should be configured to detect and alert on any long-running requests. It’s called Distributed Queries: Duration Threshold Max. There is one for APS and one for Azure DW. See the help docs for configuring a max threshold and setting up these alerts.
I've zipped the 10 condition files into a single download. Simply extract to pick and choose which ones you would like to implement. Conditions prefixed with "Azure DW" should be applied specifically to Azure Data Warehouse targets. Those prefixed with "APS" should be implemented for those targets individually as well. Those prefixed "APS/Azure DW" are global and should be applied once each at the global level.
APS and ADW AC Download
Well, hopefully you’ll find these alerts to be useful in your environment when monitoring APS or Azure DW with SentryOne. As always, don’t hesitate to contact the support team with any questions regarding the proper setup and configuration of these alerts. If you have any suggestions for future Advisory Conditions you have created on your own and would like to share with the rest of the community, please do! I’d love to hear how everyone is getting value out of these features and help others find the same value.
Finally, if you are working with Microsoft APS or Azure DW and are looking for a tool to provide greater visibility into these platforms, check out our overview post on these products.
And give it a try! Download your own free evaluation copy here.