How do you know your SQL Server system is running well? Is it the simple method of noting that no one is complaining? (Sadly, in many businesses, this is exactly how it's done.) The fact is, if you're not monitoring SQL Server performance, you just don't know.
There are many tools available to monitor SQL Server performance, both commercially available (like ours) and 'home grown'. I have a session I've delivered many times (most recently at the Pittsburgh SQL Saturday) where I share a set of scripts I wrote using PowerShell to capture SQL Server performance metrics. These tools allow us to look at the data to tell us where and when we may have performance issues.
The first thing we "experts" will tell you is that you need to baseline. There's a good reason for this – if you don't know what the performance characteristics are of your system when it's running normally, how can you know what isn't normal when things aren't running normally. By establishing baselines you can determine what performance counters are outside normal parameters when something does go wrong. Baselines in SQL Sentry can be set at different intervals, so you can indicate a normal business day baseline, and end of month processing baseline, a nightly backup baseline, etc.
- Get a deeper dive with Kevin Kline's White Paper on benchmarking and baselines.
It's important to understand what the various SQL Server performance counters mean, and fortunately there are a lot of resources for this. Let's take it one counter at a time, starting with the counters common to all Windows servers.
Percent Processor Time tells us how busy the server's CPUs are. It's a basic indicator to help us know that a server is running well within acceptable operating parameters. Normally I'd expect to see this counter in the 20 to 40 percent range. When it jumps above 80% I get very nervous, as that means that activity that requires the processor is probably waiting for resources, and thus is slowing down someone's vital work.
The Available MBytes Memory counter helps me know if server memory is an issue. I can set Max Server Memory settings in SQL Server, which will help SQL Server share the memory nicely with the Windows OS, but there may be other processes on the server besides SQL Server. Capturing this counter allows me to know if other processes are taking memory SQL Server needs to perform well.
When Windows runs out of memory it takes large chunks of memory and swaps it out to disk, to the Paging File. Unfortunately, the slowest operation in all computing is writing to disk, regardless of the physical media involved, so swapping memory to disk is naturally going to slow down the performance of your system. Keeping an eye on this counter will help you know when you are encountering memory issues, and you can then take action to resolve the conflicts.
These two counters tell you how quickly your I/O subsystem is responding to requests for data from the operating system; in other words, latency. The latency values returned are valid regardless of the type of I/O subsystem you're using, whether it's local physical magnetic disk, SAN drives, NAS drives, or solid state drives. Your latency values should normally not be more than 20ms; if you're using SSD, probably not more than 5ms. If you see latency values of a second or more, your I/O subsystem has issues that need to be addressed to keep performance at an acceptable level.
The Processor Queue Length counter tells you the number of threads that are waiting for time on the system processor. If this number is greater than 0, that means that there are more requests per core than the system can handle, and this can be a cause for significant performance issues. I once had a client that had a month-end process that had to be run during the business day, which would take 2.5 to 3 hours to run; when it ran, performance for everyone else on that system would be horribly slow. I looked at the Processor Queue Length counter – normally it would get to no higher than 3 or 4 during the day, but During month-end it jumped to somewhere between 30 and 50. The client was running on a virtual machine with 4 processors, and I asked if they could double that. They did, and the next month-end completed in 45 minutes.
This is another baseline indicator that tells you the rate at which the network adapter is processing data. This number should remain lower than 60% of the maximum sustainable bandwidth of the network.
In addition to the operating system counters, you'll also want to monitor the SQL Server performance counters. If you're observing a default instance the first part of the counter name (to the left of the colon) is SQLServer. If it's a named instance, like INST01, it's MSSQL$ followed by the instance name, so in this case, MSSQL$INST01.
Forwarded Records/sec helps you understand how fragmented your heaps are. A heap is a SQL Server table without a clustered index, and SQL Server uses Row IDs to find the data it's looking for. If it arrives at the page where the data should be, and that data has been moved, SQL Server leaves a pointer there called a forwarding pointer, and the process has to incur an additional I/O operation to get it from the new location. Each time a search for data encounters a forwarding pointer it increments this counter. This may be unavoidable, but if you track this over time and this number starts to increase, you should think about ways to defragment your table (or stop using a heap).
Page Splits/sec also helps you understand how fragmented your tables are. In this case, even if your table is in good shape, when SQL Server adds pages to it, it'll increment this counter. If, however, SQL Server needs to insert a row onto a page, and there isn't room, SQL Server will split the page into multiple pages, move rows from one page to another to balance the pages out, and then insert the row. This is a very expensive and time-consuming process, and this counter will help you understand when this is happening a lot. Properly configuring the free space on each page will help minimize this activity, just note that there are "good" page splits and "bad" page splits, and this counter doesn't differentiate (Jonathan Kehayias of SQLskills has an Extended Events session you can use instead.)
On older SQL Server systems (think SQL Server 2000 and earlier) this counter would help you understand the percentage of times the needed page was already in the buffer pool when it was needed. On today's system it's mostly useless, but many older DBA managers still ask for it, so it doesn't hurt to capture this counter.
Page Life Expectancy is an often mis-interpreted counter. Jonathan Kehayias wrote a blog post called "Finding what queries in the plan cache use a specific index." Jump down to the paragraph entitled "What’s Wrong about Page Life Expectancy >= 300?" and read from there to the end of the post. In the SQL Sentry tool, we use Jonathan's guidelines for indicating what the correct expectation should be. The thing I use this counter for is to see when some process is causing the pages in the buffer pool to be flushed out, and getting to the root of that problem. It may be that person that insists on running the report that has to read all 30 million customer records during the business day, which may require some behavior modification. It could also be that you just ran DBCC CHECKDB, because that brings that counter's value to 0 as well. The important thing is to know your workload and look for those events that are outside normal expectations.
In any multi-user application you're going to have blocked processes, and SQL Server has mechanisms to handle blocked processes well, but when this counter goes outside the normal range (for your system) you'll want to investigate and see what might be causing the issue. There could be excessive blocking due to page escalation, for example, where entire tables are getting locked instead of individual rows or pages.
This counter is there to help you understand how busy your SQL Server system is. By capturing this counter, and using it in your baseline, you can identify variances easily – they might be reported by users, or it might just be extra load on the system because people are asking for more than they usually do.
These counters will increment when SQL Server has to compile or recompile query plans because either the plan in cache is no longer valid, or there's no plan in cache for this query. SQL Server uses a cost-based optimizer that relies on statistics to choose a good query plan, and when those statistics are out-of-date, additional compilations are done unnecessarily. It can be useful to understand the source of this problem, if it is a problem (this might be expected behavior, depending on the workload).
Now, there are a number of ways to monitor your SQL Server performance counters, and I'm more than happy to share my scripts with you, but the most efficient way to capture that data is through the SQL Sentry tool's monitoring service. I highly recommend you request a demo so you can see just how comprehensive the data gathering is, and how the tool can help you establish those baselines. We help you to know just when attention is required, and where to focus your attention when it is required.