Validating network blips using SQL Sentry data

Validating network blips using SQL Sentry data

A customer recently asked me how they could detect any trends in network availability, since they knew that when we lose connectivity to a server, we have a client alert that indicates that – at least as far as we're concerned – the server is offline. They were trying to correlate these network blips with real service outages to show their networking team any patterns that might exist to help troubleshoot the issue, since they knew that multiple alerts had fired without any evidence of the SQL Server services being unavailable to users or applications.

In a recent update to SQL Sentry (8.1), a new table was added: dbo.EventSourceConnectionStatusHistory. This is where the Server Health Status uptime bar in our cloud service gets its data from:

Server Health StatusServer Health Status

We can use this data to help the customer visualize their network blips. First, let's understand how many instances of downtime or connectivity loss we've observed over the past, say, 72 hours, for a named instance:

SELECT 
  h.NormalizedStartTimeUtc, 
  h.NormalizedEndTimeUtc,
  duration = DATEDIFF(SECOND, h.NormalizedStartTimeUtc, h.NormalizedEndTimeUtc)
FROM dbo.EventSourceConnectionStatusHistory AS h
INNER JOIN dbo.EventSourceConnection AS c
ON h.EventSourceConnectionID = c.ID
WHERE c.ObjectName = N'Server\Instance'
  AND h.ServiceConnectionStatus = 2 -- offline
  AND h.ServiceConnectionType = 0   -- primary
  AND h.ServiceConnectionException LIKE N'A network-related%'
  AND h.NormalizedStartTimeUtc >= DATEADD(HOUR, -72, SYSUTCDATETIME());

Now, there might just be a few, so that should be enough to line up with any real outages or known service restarts, and spot any that are out of place. But if this returns a lot of data, it may be harder to grasp it by looking at a grid result in Management Studio. So we can visualize it better by joining against a time sequence, and then plotting the outages in Excel or Tableau. Borrowing techniques from one of my posts on generating sets, and assuming we want a marker for every 1-minute period over the last three days:

DECLARE 
  @linkedServer SYSNAME = N'Server\Instance',
  @days         INT = 3, 
  @today        DATE = SYSUTCDATETIME();
 
DECLARE @start DATETIME2 = DATEADD(DAY, -@days, @today);
 
;WITH t(marker) AS
(
  SELECT DATEADD(MINUTE, n-1, @start) 
  FROM (SELECT TOP (DATEDIFF(MINUTE, @start, @today) + 1)
   ROW_NUMBER() OVER (ORDER BY s1.Number) 
   FROM master.dbo.spt_values AS s1
   CROSS JOIN master.dbo.spt_values AS s2) AS x(n)
)
SELECT t.marker, 
  outage = CASE WHEN d.start IS NOT NULL THEN 1 ELSE 0 END
FROM t
LEFT OUTER JOIN
(
  SELECT 
    [start] = h.NormalizedStartTimeUtc, 
    finish  = COALESCE(h.NormalizedEndTimeUtc, SYSUTCDATETIME())
  FROM dbo.EventSourceConnectionStatusHistory AS h
  INNER JOIN dbo.EventSourceConnection AS c
  ON h.EventSourceConnectionID = c.ID
  WHERE h.ServiceConnectionStatus = 2
  AND h.ServiceConnectionType = 0
  AND h.ServiceConnectionException LIKE N'A network-related%'
  AND c.ObjectName = N'Server\Instance'
  AND h.NormalizedStartTimeUtc >= @start
) AS d
ON t.marker >= d.start AND t.marker < d.finish
ORDER BY t.marker;

We could also dump the data from the target instance's error logs – at least through the last six service starts, or more if you've customized the number of error logs you keep – and determine if any of these blips roughly correlate to an actual downtime. Assuming that the server where the SQL Sentry database is located has a linked server to the monitored instance, and that it is configured to use the same name as the ObjectName we store, you could do this to see the downtime caused by up to the last 7 starts/restarts (hopefully there aren't that many in recent history):

DECLARE 
  @linkedServer SYSNAME = N'Server\Instance',
  @days         INT = 2, 
  @today        DATE = SYSUTCDATETIME();
 
DECLARE 
  @start     DATE = DATEADD(DAY, -@days, @today),
  @logNumber INT = 0,
  @delta     INT,
  @rpc       NVARCHAR(MAX) = QUOTENAME(@linkedServer) + N'.master.sys.sp_executesql';
 
EXEC @rpc N'SELECT @delta = DATEDIFF(MINUTE, SYSDATETIME(), SYSUTCDATETIME());',
  N'@delta INT OUTPUT', @delta OUTPUT;
 
SET @rpc = @linkedServer + N'.master.sys.xp_readerrorlog';
 
CREATE TABLE #errlog(LogDate DATETIME, ProcessInfo SYSNAME, [Text] NVARCHAR(MAX));
 
WHILE @logNumber <= 6
BEGIN
  INSERT #errlog(LogDate, ProcessInfo, [Text])
  EXEC @rpc
    @LogNumber = @logNumber, 
    @LogType = 1, 
    @SearchTerm1 = N'SQL Server is now ready for client connections.',
    @SearchTerm2 = NULL,
    @StartDate = @start;
 
  INSERT #errlog(LogDate, ProcessInfo, [Text])
  EXEC @rpc
    @LogNumber = @logNumber, 
    @LogType = 1, 
    @SearchTerm1 = N'SQL Server is terminating',
    @SearchTerm2 = NULL,
    @StartDate = @start;
 
  SET @logNumber += 1;
END
 
;WITH d AS
(
  SELECT LogDate = DATEADD(MINUTE, @delta, LogDate),
    s = CASE WHEN [Text] LIKE N'%terminating%' THEN 'd' ELSE 'u' END, 
    rn = ROW_NUMBER() OVER (PARTITION BY LEFT([Text],17) ORDER BY LogDate)
  FROM #errlog
)
SELECT went_down = log1.LogDate, back_up = log2.LogDate
  FROM d AS log1
  INNER JOIN d AS log2
  ON log1.rn = log2.rn - 1
  AND log1.s = 'd'
  AND log2.s = 'u'
ORDER BY log1.LogDate;
 
DROP TABLE #errlog;

Now, we can get a little more complex and chart these two things together with a single script, and see if all of our network blips are "explained" by actual service restarts.

DECLARE 
  @linkedServer SYSNAME = N'Server\Instance',
  @days         INT = 3, 
  @logNumber    INT = 0,
  @today        DATE = DATEADD(DAY,-11,CONVERT(SMALLDATETIME, SYSUTCDATETIME()));
 
DECLARE 
  @start  DATE = DATEADD(DAY, -@days, @today),
  @delta  INT,
  @cmd    NVARCHAR(MAX) = QUOTENAME(@linkedServer) + N'.master.sys.sp_executesql';
 
EXEC @cmd N'SELECT @delta = DATEDIFF(MINUTE, SYSDATETIME(), SYSUTCDATETIME());',
  N'@delta INT OUTPUT', @delta OUTPUT;
 
SET @cmd = @linkedServer + N'.master.sys.xp_readerrorlog';
 
CREATE TABLE #errlog(LogDate DATETIME, ProcessInfo SYSNAME, [Text] NVARCHAR(MAX));
 
WHILE @logNumber <= 6
BEGIN
  INSERT #errlog(LogDate, ProcessInfo, [Text])
  EXEC @cmd
    @LogNumber = @logNumber, 
    @LogType = 1, 
    @SearchTerm1 = N'SQL Server is now ready for client connections.',
    @SearchTerm2 = NULL,
    @StartDate = @start;
 
  INSERT #errlog(LogDate, ProcessInfo, [Text])
  EXEC @cmd
    @LogNumber = @logNumber, 
    @LogType = 1, 
    @SearchTerm1 = N'SQL Server is terminating',
    @SearchTerm2 = NULL,
    @StartDate = @start;
 
  SET @logNumber += 1;
END
 
;WITH d AS
(
  SELECT LogDate = DATEADD(MINUTE, @delta, LogDate),
    s = CASE WHEN [Text] LIKE N'%terminating%' THEN 'd' ELSE 'u' END, 
    rn = ROW_NUMBER() OVER (PARTITION BY LEFT([Text],17) ORDER BY LogDate)
  FROM #errlog
),
ExplainedOutages(went_down, back_up) AS
(
  SELECT went_down = log1.LogDate, back_up = log2.LogDate
    FROM d AS log1
    INNER JOIN d AS log2
    ON log1.rn = log2.rn - 1
    AND log1.s = 'd'
    AND log2.s = 'u'
), 
t(marker) AS
(
  SELECT DATEADD(MINUTE, n-1, CONVERT(DATETIME,@start)) 
  FROM (SELECT TOP (DATEDIFF(MINUTE, @start, @today) + 1)
   ROW_NUMBER() OVER (ORDER BY s1.Number) 
   FROM master.dbo.spt_values AS s1
   CROSS JOIN master.dbo.spt_values AS s2) AS x(n)
)
SELECT t.marker, 
  ObservedBlip = CASE WHEN d.start IS NOT NULL THEN 0.5 ELSE 1 END,
  KnownOutage = CASE WHEN o.went_down IS NOT NULL THEN 0.5 ELSE 0 END
FROM t
LEFT OUTER JOIN
(
  SELECT 
    [start] = h.NormalizedStartTimeUtc, 
    finish  = COALESCE(h.NormalizedEndTimeUtc, SYSUTCDATETIME())
  FROM dbo.EventSourceConnectionStatusHistory AS h
  INNER JOIN dbo.EventSourceConnection AS c
  ON h.EventSourceConnectionID = c.ID
  WHERE h.ServiceConnectionStatus = 2
  AND h.ServiceConnectionType = 0
  AND h.ServiceConnectionException LIKE N'A network-related%'
  AND c.ObjectName = @linkedServer
  AND h.NormalizedStartTimeUtc >= @start
) AS d
ON t.marker >= d.start AND t.marker <= d.finish
LEFT OUTER JOIN ExplainedOutages AS o
ON t.marker >= o.went_down AND t.marker <= o.back_up
ORDER BY t.marker;
 
DROP TABLE #errlog;

The output of that query is largely unusable as is, since it will create 1,440 rows per day, but this is why we have software other than Management Studio for consuming results. I basically copied the output, pasted into Excel, and was able to make this simple line chart in a couple of minutes (then a couple more to annotate in Photoshop). I'm sure you could do something prettier with tools like Google Charts, Tableau, etc. But for a quick eyeball this was a pretty efficient and productive approach – it immediately shows where we have blips that can be explained away by service restarts, and a blip that cannot be explained by known service disruptions (click to enlarge):

Correlating Observed Blips with Known Outages

You can see that at around 2 PM on the 11th and 10 AM on the 12th, we had service restarts coupled with unsuccessful attempts to reach the instance. But on the 12th at around 2 AM, we had difficulty reaching the instance, but this cannot be explained by a service outage, so the reason must lie elsewhere. This is the data you can take to your network personnel.

There's a little bit of messiness here, because I haven't dealt with Daylight Saving Time (and this is not a simple topic – see my set of tips: Part 1, Part 2, and Part 3), but you can get a clear enough picture as long as you are querying a short span of recent history and not immediately after a time change (and even in that case, you can manually adjust any anomalies by shifting them an hour). Personally, I prefer to have all of my servers in UTC to avoid precisely this kind of problem. I also didn't deal with the case where the target instance is down right now (after all, if the service is down, you can't exactly do things like query its error log).

Comments ( 0 )

    Leave A Comment

    Your email address will not be published.

    This site uses Akismet to reduce spam. Learn how your comment data is processed.