When Troubleshooting Performance Start with the Big Picture
While there are many approaches to performance troubleshooting, focusing on one metric can limit your effectiveness. One of the great benefits of SentryOne's SQL Sentry solution is that it presents you with an overall picture of the server's performance, allowing you to see what's going on as a whole, and then allowing you to drill into the areas that may be of concern.
Looking over this display from SQL Sentry, the Windows OS metrics are shown in graphs down the left side, and the SQL Server metrics on the right. I see a spike in CPU percent used on the left, and I also notice a spike in Key Lookups in the top middle graph on the right. It often happens that high numbers of key lookups cause spikes in CPU used, so this is something I look for regularly.
If I hold the left mouse button and drag across the problem area, the same time window in all the other graphs also highlight, which helps me see any other corresponding metrics that may help me know what's going on at that time. If I then right-click and choose Zoom In, the graph for that time window will expand to fill the available space in all the graphs.
At this point it's a good idea to switch from History mode to Sample mode, to see more detail on that point in time where the problem occurs. Now we can see if the problem was isolated to a single CPU in the CPU graph, we can see more detail on what waits are occurring at that point, and we can see more detail on how memory is being used. We're now getting a clearer picture of what may be slowing down the server.
The next step is to right click and select Top SQL, to see what queries are being run at this time. We can scroll down through the Completed Queries list and see what might be causing the high number of key lookups we noticed at first. It's important to know that the Key Lookups counter shown on the first graph is actually the count from the Access Methods>Probe performance counter, which consists of more than just key lookups. Michael Swart (b/t) provided a very nice write-up in his blog The Last Word On Scans a few years back
, explaining what exactly makes up the count being reported.
As we look at the Query Stats tab we see that at the time we're evaluating there were two batches that executed many times, even though they may not have taken much time each time they ran.
While the first of the two batches only has one statement, it is using an index scan, which doesn't support the high count of key lookups we noticed in the first image. The second batch, however, has multiple statements. One important thing to remember is that it's not necessarily the first query in the batch that contains the code causing the performance problem you're researching. In this case, the query causing the problem is the second one in the batch, as shown in this screen shot.
If we click the Open button on the left of the multi-statement batch, we'll open that batch in Plan Explorer, where we can look at the batch in more detail.
We can see that the query in question could definitely benefit from a better index, one that allows SQL Server to get all the data it needs at one time, and by making this change, we can effectively reduce the performance load on the server, allowing an overall better use of the resources available.