Troubleshooting SQL Server, Part 3 : Where are the Waits?

Troubleshooting SQL Server Part 3: Where are the Waits?

In the popular white paper entitled Troubleshooting SQL Server Wait Statistics, I take you through three scenarios in which a wait stats-only approach is insufficient or even harmful to detecting, diagnosing, and resolving performance problems on an instance of SQL Server. The white paper identifies FIVE key steps to effective performance troubleshooting on Microsoft SQL Server (below), with one follow-up step to basically learn from past problems.

SQL Server Troubleshooting Checklist

Troubleshooting Checklist 02

In the previous two blog posts in this series (part 1 | part 2), I discussed ways in which we can take shortcuts in the troubleshooting process (and there aren't many), as well as determining whether the problem was inside of SQL Server (or not) and caused by a SQL Server error (or not).

The Process of Elimination

Now that we've figured out that the problem is definitely inside of SQL Server and is not due to erroneous behavior, we need to find the root cause of the performance issue. Our first check using native tools is to assess the SQL Server's wait statistics. Since wait stats tell us the general category in which SQL Server is waiting, they are an excellent way to narrow down the root cause of a performance issue when troubleshooting.

Note: Wait statistics are an additional arrow in your quiver for performance troubleshooting. They should not be your only method. Unfortunately, some commercially available tools only look at wait stats and, as a consequence, can lead to inaccurate troubleshooting actions and, at times, cause outright misdirection, causing you to waste time and/or overlook serious errors.

Wait statistics are a means of narrowing our search for the root cause of our performance issue. But it's rare that a wait statistic can point to the actual "smoking gun" of a performance problem. Instead, think of figuring out your most prominent waits as the first step in a process of elimination.

Watch our video of troubleshooting SQL Server performance using native tools at SQLSentry.TV!

To get a good feel for your SQL Server's wait statistics, you can take periodic queries against the waits DMV (below), and watch the values over time. The wait statistics are ever-incrementing values showing the amount of time SQL Server is waiting for a particular system resource. Wait stats don't reset except in the case of a SQL Server restart or by manually clearing the values. That means if you are only using native tools, you will need to take a set of baseline measurements, then watch them over time to see which are growing the fastest over a given period of time. This'll give you a good idea of which wait statistics may be most important in your performance troubleshooting scenario.

Let's Go Back to the Best Wait Statistic Sources

The best white paper solely focusing on wait statistics is SQL Server Performance Tuning Using Wait Statistics: A Beginners Guide by Jonathan Kehayias (b | t) and Erin Stellato (b | t), our good friends at SQLskills.com. If you're not up to date on wait statistics, start there. (Anything I add here is simply riffing on a topic they've done better and more in-depth than I can cover here).

There's also the seminal Microsoft white paper on the topic, Performance Tuning Using Waits and Queues, which is good – but also a bit long in the tooth these days. And there's always Books Online information on the underlying DMV sys.dm_os_wait_stats showing all of the waits. (As aside, the best wait statistics info in video form is Paul Randal's (b | t) online Pluralsight training course).

The DMVs relating to wait statistics expose what is going on inside SQL Server itself, both across all sessions in the server and, using other DMVs in modern and future versions of SQL Server, within specific sessions. The primary wait stat DMVs for performance troubleshooting projects include sys.dm_exec_requests, sys.dm_os_waiting_tasks, and sys.dm_os_wait_stats.

What are All of My Wait Statistics, or, Tell Me About sys.dm_os_wait_stats

Wait statistics, in general, are not very granular metrics. Paul's amazing must-read blog post Wait Statistics or Please Tell Me Where it Hurts provides us with an update on Glenn Berry's (blog|twitter) now-famous diagnostic queries, effective for showing your current wait statistics on a given SQL Server.

The great thing about Glenn’s query, shown below in Figure 1, is that it also excludes many of the wait stats that represent the natural, organic processes of SQL Server, which you can safely ignore. For example, SQL Server uses Service Broker for a number of internal and necessary processes, so the DMV query shown below excludes the wait stats associated with those internally-used broker processes. But since Glenn's is in wider distribution, take a look at this example:

Troubleshooting Wait Stats WP, Figure 10Figure 1: Glenn Berry’s Wait Stats Diagnostic Query

The long list of wait statistics named in the WHERE clause of the query are a variety of SQL Server 2014 wait statistics which are safe to ignore because they're benign parts of the normal, background operation of SQL Server. Noise, in other words. Paul takes the time to explain the major, non-noise wait statistics for you in his post. You can read more about those most important and relevant wait statistics there and also review the detailed advanced information on specific problematic wait statistics on the SQLskills website and on the SQLPerformance.com website.

The results of this DMV query show the major categories where SQL Server is experiencing bottlenecks or resource shortages. In this example, the result set from running the query shown in Figure 10 appears below in Figure 2:

Troubleshooting Wait Stats WP, Figure 11Figure 2: Top Wait Stats Query Results

In this case, the number one wait statistic is TRACEWRITE, which indicates that a trace operation is frequently waiting. Note that on this particular server, there are a number of traces that run all the time in the background. Because they are background processes, this wait statistic means those background processes were not affecting real users. So the number one wait type on the list is not a significant issue. Number two on the list, at only 6.48% of waits, is WRITELOG, which indicates that 6.48% of the time that any operation had to wait, they were waiting for the completion of write operations to all of the transaction logs on the SQL Server.

Refer to Paul Randal's canonical wait stats post for a definitive explanation of waits and the interpretation of specific wait statistics.

So now we have a clue as to the general category in which users on this system are waiting for their work to complete. Still, we do not know which specific database or databases are experiencing slow transaction log throughput. Nor do we know if this wait stat is caused by a slow I/O subsystem, a heavy workload, or perhaps long-standing open transactions that could cause write operations to remain open for long periods of time.

To help address this, Paul also shows you how to capture wait statistics over time and how to capture wait statistics for a specific operation using queries and extended events in SQL Server.

What are my Wait Statistics Right Now, or, Sys.dm_os_waiting_tasks

While looking at all of your SQL Server wait statistics focuses on the sys.dm_os_wait_stats DMV, finding the wait statistics currently accruing on SPIDs this very moment requires that we look at the sys.dm_os_waiting_tasks DMV. While the former DMV shows all of the wait statistics and their cumulative wait time, the latter DMV shows current SPIDs, their wait type, and the amount of time spent waiting on that resource.

A typical query of I might use in this case looks like:

SELECT session_id, wait_duration_ms, wait_type, 
  blocking_task_address, blocking_session_id, resource_description
FROM sys.dm_os_waiting_tasks
-- same WHERE clause as above;

The result set, in turn, is directly correlated with each SPID currently alive on the instance:

dm_os_waiting_tasks 01Figure 3: Results of Querying Sys.dm_os_waiting_tasks

So in this example, we can see that our instance of SQL Server has some active SPIDs currently accruing several types of lock (LCK_*) waits, as well as PAGEIOLATCH_EX (exclusive latch waits). You can also see that some of the SPIDs are blocking others in the blocking_session_id column. These might not be the most significant wait statistics on the server spanning all of its uptime, but they are the most significant at the moment.

Who is Invoking Which Commands, or, sys.dm_exec_requests

When you want to find out what each individual session is doing, down to the granularity of learning the specific command being executed, query the DMV sys.dm_exec_requests. This DMV returns the command, resource, wait types, wait time, and status. Status may have values of running, runnable and suspended, depending upon whether they are executing a command, waiting for CPU (and thus experiencing a signal wait), or awaiting a specific system resource (and thus experiencing a resource wait), respectively. Sessions may also have a status of sleeping, indicating that SQL Server is waiting for the next SQL Server command from a given session.

In the following example, shown in Figure 4, we query several columns of the sys.dm_exec_requests DMV, to see the commands, wait types, and accumulated wait times of sessions > 50 (i.e. user sessions). Note, though, that the old “rule of spid 50” is less dependable with each new release of SQL Server, where people would filter out any spid/session_id below 50 as a system process. In recent versions of SQL Server, the most dependable way to determine if a spid is a system process is to check the is_user_process column of sys.dm_exec_sessions.

Troubleshooting Wait Stats WP, Figure 13Figure 4: Results of Querying Sys.dm_os_exec_requests

In the example above, we can see that most sessions have a status of suspended, indicating that the session is awaiting a specific resource (also known as a resource wait). Since the wait_type is either LCK_M_S or LCK_M_X, we can conclude that these sessions are all waiting on locks, either shared (the trailing S value) or exclusive (the trailing X value), to a specific object or objects. An experienced DBA might first think that the transactions holding locks on these rows are poorly written. Furthermore, we see that the blocking_session_id value is 86 in all the columns with LCK waits, indicating that those sessions are all being blocked by themselves. This might also cause a busy DBA to conclude that they should simply kill the blocking session and then move on with their day. In actuality, the root cause in this example could be quickly revealed with one additional quick check.

Summary

As you can see, getting a simple query result set for the most important and significant wait statistics isn't hard. On the other hand, sleuthing those wait statistics all the way down to individual SPIDs and then to individual commands executed by those SPIDs is an order of magnitude more difficult.

In fact, that's one of the great things about SQL Sentry Performance Advisor. It makes all of that hard work a direct and easy process. If you're interested, I'd be delighted to get a trial version of the software to you and walk you through a product demo. Let me know if you're interested!

Have questions about wait statistics? Want more detail on a particular topic? Then leave me a comment!

-Kevin

Connect with me online! Facebook | Twitter | LinkedIn | Blog | SlideShare | YouTube | Google Author

Comments ( 0 )

    Leave A Comment

    Your email address will not be published.