The Relativity of Server Health
A few weeks ago, my wife reminded me of a doctor's appointment for a quick checkup. I don't enjoy going to the doctor, but it's important, so I didn't complain much. I was feeling great on the day of the appointment. I had no cough, no sniffles, and no sore throat. I figured I would be out of there in 10 minutes with nothing more than a stern reminder to lose some weight.
The actual experience at the doctor's office was very different. My blood pressure was high, and this triggered a string of additional testing that revealed several other issues. No need to worry, I'm doing ok, but this trip to the doctor uncovered some things that I really need to pay attention to.
This brings me to the topic of server health. I've worked with many different environments over the years, and the concept of a server being "healthy" seems to differ from one place to another.
I fully understand that different industries have different requirements in terms of health. Having I/O reads take 80ms or longer will have a different impact for a company that is trading stocks than it will have for a company that is building television sets. Memory pressure affects an OLTP system differently than it does a reporting system.
My point is that, to me, the concept of health is not relative. This is the lesson I took home from my trip to the doctor. I felt fine, but that didn't mean that I was healthy. Maybe I'm not unhealthy either, but somewhere in between.
We should view server health the same way. We tend to settle and be happy with what is acceptable (also known as "normal"), but there are always improvements that can be made.
I recall a conversation I had with someone who wanted to adjust thresholds on disk latency. The reasoning was that, in this environment, higher disk latency was normal, so the threshold should be adjusted up to compensate. Now, I can get on board with not wanting to be alerted about a situation that you can't do much about just yet, but I don't think lowering the bar is the answer. What if my doctor told me that my blood pressure was high, and I told him, "Well, that is just normal for me"? I think that might start a rather interesting conversation.
My challenge to everyone is to strive for the best server health you can get. Triage critical, high, and medium warnings to tackle the highest-impact issues first. Take action where you can, but temper your alerts rather than adjusting the definition of "healthy."
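To make that distinction concrete, here is a minimal sketch in Python of what "temper your alerts" could look like in practice. It is an illustration only, not how SentryOne or any other tool works: the `Finding` class, the `muted` flag, the severity ranking, and the 20 ms latency limit are all hypothetical names and values chosen for the example. The idea is that muting silences the notification while the finding still counts as unhealthy.

```python
from dataclasses import dataclass

# Hypothetical severity ranking: a lower number means "handle first".
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2}

# Assumed health threshold for disk read latency, in milliseconds.
# Illustrative value only; the point is that it does not move.
READ_LATENCY_LIMIT_MS = 20.0


@dataclass
class Finding:
    server: str
    severity: str
    read_latency_ms: float
    muted: bool = False  # muting silences the alert, not the finding


def is_healthy(finding: Finding) -> bool:
    """Health is judged against the fixed limit, never against 'normal'."""
    return finding.read_latency_ms <= READ_LATENCY_LIMIT_MS


def triage(findings: list[Finding]) -> list[Finding]:
    """Return every unhealthy finding, most severe first."""
    unhealthy = [f for f in findings if not is_healthy(f)]
    return sorted(unhealthy, key=lambda f: SEVERITY_RANK[f.severity])


def alerts_to_send(findings: list[Finding]) -> list[Finding]:
    """Muted findings stay in the triage list but do not page anyone."""
    return [f for f in triage(findings) if not f.muted]


findings = [
    Finding("db3", "medium", 35.0),
    Finding("db1", "critical", 120.0),
    Finding("db2", "high", 80.0, muted=True),  # known issue, fix pending
    Finding("db4", "medium", 5.0),             # within the limit
]

for f in triage(findings):
    status = "muted" if f.muted else "alerting"
    print(f"{f.server}: {f.severity}, {f.read_latency_ms} ms ({status})")
```

In this sketch, db2 stops generating alerts while the fix is pending, but it still shows up in the triage list as unhealthy. Raising `READ_LATENCY_LIMIT_MS` instead would have made it disappear from both.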
The winners of our recent Server Health Challenge did just that, and I believe they'll enjoy more success at work as a result.
If you've read this post, and you're looking for a way to measure the health of your server environment, you are invited to give SentryOne a try. Getting your health overview score only takes a few minutes, and the fully functional trial version is free for up to 30 days.