Continuing with the 2016 challenge from Tim Ford that I accepted at the end of last year, this month I wanted to talk about the runaway query – or, more specifically, what you can do before you call your query a runaway query.
MAKE SURE IT IS A RUNAWAY
Sometimes people will start a query, wait five seconds, and then declare that this must be a runaway query. Sometimes a query's runtime can vary based on other things going on in the system, so four seconds on one run and six seconds on another is not necessarily "running forever." Be sure that you've given a query adequate time to start returning results before giving up on it, and remember that Management Studio might seem "stuck" before it starts to render any grid results, especially if the resultset is large. If you feel you've waited a reasonable amount of time, and you've tried both Results to Grid and Results to Text, then…
CHECK FOR SIMPLE THINGS
Once I have determined that I'm not just being impatient, the first thing I do is check the query logic for obvious issues that I am prone to miss – especially if the query text has been changed since the last time it ran well. A few of the main things I check for are:
- An infinite loop – it's very easy to forget the
SET @i += 1;in a counter-based loop, or the
FETCHfrom a cursor.
- A ginormous cartesian product – sometimes you shoot yourself in the foot by adding a table to a join and forgetting some of the join criteria (or all of them, if you like living in the 1970s).
- Missing or less restrictive filter – it's possible you highlighted most of the query, but left off an important
WHEREclause, and now the query returns far more rows than you expected.
- Uncommitted transactions – maybe you highlighted
BEGIN TRANSACTIONand some other statement in one window, or opened the table designer or
Edit Top n Rows, and are now trying to affect that same table in another window. The query will look like it's taking forever, but really it's just blocked by your initial transaction that you forgot to commit or roll back.
- A statistics update – a lot of events on an instance can trigger a stats update on a table or index, and that won't happen until the next time a query runs that needs those statistics. That query could be yours, and the delay could simply be due to an auto-stats update (if your settings are not
ASYNC). In SQL Sentry Plan Explorer PRO, for example, we capture this for you by showing the
[StatMan]call as part of the full query call stack:
(You can monitor for these events in other ways, of course. See Erin Stellato's two posts on tracking these updates, Tracking Automatic Updates to Statistics and Another Way to View Automatic Updates to Statistics.)
BLOCKING AND OTHER WAITS
If it isn't the query logic, next thing I check is
sys.dm_exec_requests, which can quickly tell me if the query is being blocked (look at
blocking_session_id) and, if not, what it's waiting on (
last_wait_type). Another thing I take into account is how many rows are being returned to the client, and the possibility that the delay is actually due to client rendering or network transfer – these are often indicated by a constant wait type of
ASYNC_NETWORK_IO. Plan Explorer can help here as well, since we intentionally discard results instead of pulling them across the wire, making sure that you're not measuring client rendering or data transfer. We now also allow you to see data movement at the operator level, using Live Query Profile, so you can see exactly where your query is spending its effort in real time – as long as you're running your query against SQL Server 2014 or better. See Jason Hall's post on this recent update for more details.
I can also look at
status to see if it is runnable, running, suspended, or sleeping, and double-check that
command is reflecting the statement that I *think* should currently be executing. SQL Sentry software can also help identify blocking proactively and help visualize even the most complex blocking chains (see an example at right). For more insight, see Understanding and Resolving SQL Server Blocking with SQL Sentry. Confession: I stole this image from that post.
Sometimes you will find that you are being blocked by another session, and that session doesn't have an entry in
sys.dm_exec_requests. This doesn't mean you are being blocked by nothing or a ghost session; in a lot of cases, it just means that they have locks on your objects but aren't actively running anything. This is easy to reproduce; just run
BEGIN TRANSACTION; UPDATE dbo.table SET column += 0;, then go to lunch. You can try to get that session's last command (note that it may not be related to the current locks) and which objects they are locking:
DECLARE @SPID INT = <blocking_session_id from your runaway query>; -- see the command they last run - might be related, might not DBCC INPUTBUFFER(@SPID); -- see what that session has locked SELECT s.name, o.name FROM sys.dm_tran_session_transactions AS tst INNER JOIN sys.dm_tran_locks AS tl ON tst.transaction_id = tl.request_owner_id INNER JOIN sys.objects AS o ON tl.resource_associated_entity_id = o.[object_id] INNER JOIN sys.schemas AS s ON o.[schema_id] = s.[schema_id] WHERE tst.session_id = @SPID;
I mentioned uncommitted transactions above, but one way to be sure is to stop the query and try to close the window. If you are dealing with a transaction in this session, you will get:
Check other windows, and also check to make sure you haven't changed Management Studio's default for SET IMPLICIT_TRANSACTIONS from no to yes:
IS IT THE SAME PLAN? IS IT OPTIMAL?
If you've ruled out all of the above, then maybe it's time to start looking into the mechanics of the query itself and, more importantly, its execution plan.
Before you do that, though, you need to make sure that you're actually comparing apples to apples – when the query was running fast (and I assume it is the exact same query), were you getting the same execution plan, or at least the same plan shape? This is where it pays off to be keeping historical performance metrics, like you can with SQL Sentry Performance Advisor for all supported versions of SQL Server, or the Query Store in SQL Server 2016 and Azure SQL Database. Being able to compare query plans that were captured yesterday, last week, or last month can be invaluable in determining the actual reasons behind performance changes in the plan you're looking at today.
If the plan has changed, it could be for a variety of reasons. I go over many of these in two earlier blog posts:
Those posts explain a variety of things that can cause performance differences between different versions of the query, or different execution contexts, or different environments. There are a whole host of other things that could have changed and caused a query to suddenly become slow, even the exact same query, executed by the same user, on the same server. Some off-the-cuff ideas:
- Someone applied a cumulative update, service pack, version upgrade, or compatibility level change (if an upgrade or compat level change, test under the old cardinality estimator using TF 2312)
- Data for your query was evicted from the buffer pool due to concurrent activity
- All data was evicted from the buffer pool, either directly using
DBCC DROPCLEANBUFFERS, or as the result of a server configuration change, service restart, or server reboot
- Your plan is recompiling – this can be caused by a variety of things, including schema changes, index changes, or explicit
DBCC FREEPROCCACHEor similar
- Substantial statistics updates (which can also be caused by a variety of things, such as major data changes)
- Your resources have been constricted due to concurrency, a server or database configuration change, or Resource Governor
There are many more, of course, those are just a few low-hanging fruits.
What you do next will vary. Sometimes the fix is as simple as updating statistics, or creating or changing an index. Sometimes you need to re-work the query itself – fix the logic, remove joins, add filters, or eliminate scalar functions. Sometimes you may need to evict a plan that was generated from an atypical set of parameters. And sometimes your query is better off running under a specific trace flag, with
MAXDOP, with a specific index or join hint, or with
OPTION (RECOMPILE) to thwart parameter sniffing. How you actually *fix* a slow query is a very large topic and probably better scoped to a different blog series.
Not every runaway query is really running away. There are a few things you can check first before tearing your hair out trying to fix the statement, eliminate execution plan operators, muck with indexes, or try the old reboot trick that magically fixes everything. Those things may ultimately be necessary, but in my experience, it's often one of the other things I mention above.