T-SQL Tuesday #56 : SQL Server Assumptions
This month's T-SQL Tuesday is being hosted by Dev Nambi (@DevNambi), who I first met on SQLCruise a few years ago. For this installment, Dev wants us to write about assumptions that we encounter when working with SQL Server, and boy, do I come across a lot of those. Rather than pick just one, I thought I would lightly treat a handful of SQL Server assumptions that I believe are widespread and could be damaging in certain scenarios.
CASEexpression as soon as it hits a match. So the assumption is that this construct will never fail, because SQL Server would never reach the
DECLARE @i INT = 1; SELECT CASE WHEN @i = 1 THEN 0 ELSE 1/0 END;
Since its use is rampant, and it "always" works, I don't really blame people for making this assumption; even the documentation stated as much for a long time. However, there are a few cases (seriously, no pun intended!) where this short circuiting breaks down. One such scenario is where you introduce aggregates to the expression:
DECLARE @i INT = 1; SELECT CASE WHEN @i = 1 THEN 0 ELSE MIN(1/0) END;
Suddenly, we get the following error:
Divide by zero error encountered.
I mentioned this specific scenario in an answer on dba.stackexchange.com, noting that this is not a bug, just that the documentation needed re-wording (which happened, as explained in Connect items #690017 and #691535).
The truth is, in most cases, you can rely on
CASE to short circuit. In fact, I do so all the time, to prevent SQL Server from interpreting my statements in an illogical order (for example, when filters should eliminate rows from impossible conversions – Erland Sommarskog has a great example, and one of the posted workarounds uses a
CASE expression). Here's my own example that is tough to solve without a
CREATE TABLE dbo.t(d VARCHAR(20)); INSERT dbo.t(d) VALUES('20140708'),('Last Friday'); -- fails: SELECT d FROM dbo.t WHERE MONTH(d) = 7; -- fails: SELECT d FROM dbo.t WHERE ISDATE(d) = 1 AND MONTH(d) = 7; -- succeeds: SELECT d FROM dbo.t WHERE CASE WHEN ISDATE(d) = 1 THEN MONTH(d) END = 7;
At the end of the day, you just need to be aware of the scenarios where this assumption becomes brittle; this situation with aggregates isn't the only one. I wrote about this and several other
CASE behaviors here: Dirty Secrets of the CASE Expression.
The issue is that, in some cases, a scan is the most efficient way to retrieve the data. Depending on your index structure and the columns in the query (both those used for output and those used to filter), a seek – in particular, a range scan in disguise – may actually lead to a lot of costly lookups to retrieve the columns the index doesn't cover. This can be easy to fix, but only if you have the flexibility to change or add indexes.
And for small tables or indexes (think filtered indexes here too), performing a seek simply isn't going to offer any performance benefits over just scanning the whole table or index, and may not even be worth the effort of performing exhaustive searches for an "optimal" execution plan.
"Some Seeks are better than some Scans, and some Scans are better than some Seeks."
The takeaway: Don't expend too much effort trying to convert all scans to seeks. You may not be able to make the query any better; you may actually end up making it worse. Focus on the scans that you can prove are actually causing performance problems – are they scans because of a missing index, or an index that doesn't cover, or an implicit conversion, or a function against a predicate column? Is the query actually exhibiting poor performance, or did you just spot a scan in an execution plan and thought it must be a problem?
"It's always worked that way for me, so it will always work that way."
The truth of the matter is that a table, by definition, is an unordered set of rows; you may as well think about it as a bag of marbles. If your query does not include an
ORDER BY, you should not rely on any specific ordering, even if it is what you "always" observe. Here's why: If you don't specify an order, you are essentially telling SQL Server, "I don't care about order." So, SQL Server will feel free to return the results in the most efficient manner possible. The result may coincide with your assumption, but it may not.
I see a lot of people assume that if they issue
TOP (1) against a table, and don't specify an order, that they will get the first row they inserted. Going back to the marble analogy, imagine I dropped a bag of marbles on the floor while you weren't looking, and asked you to tell me which marble hit the floor first. Without some other indicator, you'd just pick an arbitrary marble off the floor. If I wasn't paying attention, I wouldn't know if you were right, either. The optimizer works much the same way: without an explicit ordering, it is free to choose whatever
TOP (1) it feels like. And unless you have some column that indicates order of insert (like an identity column, or a datetime column populated with the current time by default), you couldn't tell SQL Server how to get that row, and can't be sure it got the right row, either.
Many factors can affect the optimizer's choice here, and it can change its mind in direct or collateral response to various events, too. You may get a different plan today than the one you got yesterday, and you may get a different plan than the one your co-worker gets today. This could be caused by changes to the query, changes to the data / statistics / indexes, session settings, isolation level, allocation order or merry-go-round scans, partitioning, parallelism, turning on or off a trace flag,
sp_configure changes, clearing the plan cache, restarting the server, failing over, applying a hotfix/cumulative update/service pack, upgrading to a new major version, and on and on…
Here is a quick and simple example where adding an index can change the result:
CREATE TABLE dbo.People ( PersonID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, Name SYSNAME ); INSERT dbo.People(Name) VALUES('Zeke'),('Bob'),('Frank'); SELECT Name FROM dbo.People; -- clustered index scan, ordered [Zeke, Bob, Frank] CREATE UNIQUE INDEX pn ON dbo.People(Name); SELECT Name FROM dbo.People; -- non-clustered index scan, ordered [Bob, Frank, Zeke]
I've seen multiple scenarios like this, first-hand, where the application is released, everything is coming out in the order of the identity column, users are happy, and then suddenly the plan goes parallel or someone adds an index or the query gets recompiled for some reason. At this point, users are not happy. The developers want some way to restore the old behavior, without touching the query; every time, the ultimate solution is the only one that's feasible: Add
ORDER BY to the query (and sometimes add a column to make the
ORDER BY logical and deterministic).
An abbreviated list of scenarios where this can affect you unless you add an
ORDER BY to the outermost query:
- A view or inline table-valued function with
ORDER BY– the
TOP 100 PERCENTthat even makes this syntax possible is now optimized away, as described here.
- An "ordered"
SELECT INTO) for a temp table, table variable, or return table in a multi-statement table-valued function.
- An UPDATE that matches multiple rows, and updates one row in one scenario, and a different row in another.
ORDER BYor window functions inside a CTE, subquery or derived table (usually in combination with a meaningless
SET ROWCOUNTto determine the "first" or "last" row inserted.
- Even in the case of a hinted index, the index will be used (if possible, else an error in most cases), but just because the index is used doesn't mean the results will be returned sorted by the key(s) in that index. You still need an
ORDER BYto ensure sorting by the index key(s).
I have talked about a few of these scenarios here and here, as have Craig Freedman and Conor Cunningham. These are all old posts; if anything, SQL Server's optimizer has advanced in ways that only mean there are now as many or more reasons to stop expecting a reliable order without specifying
That hasn't exactly panned out.
And not because SQL Server is terrible at self-tuning. Nor because it has become a much more complex platform. I will concede that not all SQL Server-based applications needed a DBA in the first place. But SQL Server does not ship out of the box with scalability in mind across all aspects, and you can tell from several of the default settings. Some examples:
- File locations
Everything defaults to the system drive (C:\), no matter how many drives you have or how fast they are. System databases, tempdb, error logs, the location for new data and log files, and program files like Management Studio, SQLCMD and Profiler. Why couldn't SQL Server setup encourage you to put the more important things on other drives, if they're available? (Bonus points for detecting RAID levels, SSDs, and differentiating between SAN and direct-attached storage.) Since it can't, please don't assume that these pre-populated file locations are optimal for your installation, and consider distributing them from the start, since hanging them later can be a pain.
- New database file sizes and autogrow
The model database dictates the size and autogrow rates of new databases (ones you create, not ones you restore or attach). While the actual size depends on version and edition (3 MB data and 1 MB log for SQL Server 2014 Enterprise Edition), it is unlikely that the defaults are optimal for any database you're creating in this decade. How quickly will you use a measly 3 MB? And the autogrow rates (1 MB for data, 10% for log) leave a lot to be desired as well. Even with instant file initialization enabled, growing a data file from 2MB or 3MB to accommodate your data – 1 MB at a time – is going to be a very painful process. And growing the log file 10% at a time leads to longer and longer growth events, especially since log files can't benefit from instant file initialization, and can also lead to too many virtual log files (VLFs) and heavy VLF fragmentation. Please consider altering the model database so that the initial size and growth rates are reasonable and absolute sizes (no percentages), even if you are on SSDs or other flash-based storage and don't think these issues could ever become critical.
- Recovery model
Like file sizes and autogrow, the default recovery model of new databases is determined by the recovery model of the model database, and this can also vary by version and edition (Express Edition, for example, defaults to simple, while Enterprise Edition default to full). There is no right or wrong here; this is another case of It Depends™. In many systems, you are likely to want a mix of full and simple recovery databases; in those cases, it may make sense to set model to whichever recovery option will be used for the majority of new databases, and change the rest to the option that makes more sense for them. Some people set up policies using Policy-Based Management to identify databases that are set up in the "wrong" recovery model; others use auditing or DDL triggers to just keep an eye on new databases with a plan to review them manually. In any case, this is another area where you might not want to trust SQL Server's defaults for all databases.
The tempdb database is literally the dumping ground for all kinds of system and user activities in SQL Server; not surprisingly, it is often a major performance bottleneck. The defaults for this database are a single 8 MB data file set to 10% autogrow, and a 1 MB log file also set to 10% autogrow. As I've explained above, files this small are just begging for frequent and disruptive autogrow events, and those percentage-based autogrow events have the potential to take longer and longer as time goes on. And since tempdb is re-created to its initial size on restart or failover, the process just starts all over again. In addition to changing these sizes and growth rates to something more logical, other changes you can make from the defaults are to create multiple data files (see this post by Paul Randal (@PaulRandal) to help determine how many), all with the same size, and consider trace flag 1117. You can also consider implementing trace flag 1118, but Paul has some advice about that, too.
- Backup compression and logging
As Jeremiah Peschka (@peschkaj) explains in this blog post, backup compression is all benefit at very little cost, and I really would only hesitate to turn this on in the case where a workload is completely CPU-bound (which is still quite rare, even in the face of SSDs, PCIe storage, and In-Memory tables). So consider setting this as the default for all databases:
USE master; EXEC sp_configure 'backup compression default', 1; RECONFIGURE WITH OVERRIDE;
While you're at it, enable trace flag 3226 – by default, SQL Server logs every successful backup message to the error log, which also pollutes SQL Server Agent history. Hopefully you have some kind of alerting mechanism in place for backups that fail, so success messages are just noise.
- Optimize for ad hoc workloads
This is a server-wide setting that prevents plan cache bloat by requiring that a query plan is requested twice before caching a full version of the plan. Since SQL Server 2008 was released, I have recommended this setting on dozens and dozens of servers, and I have yet to come across a workload that didn't like it. Kimberly Tripp (@KimberlyLTripp) offers much more information and a ton of links to other resources about this setting. Not convinced? Try it out on your test/QA servers:
USE master; EXEC sp_configure 'show advanced options', 1; RECONFIGURE; EXEC sp_configure 'optimize for ad hoc workloads', 1; RECONFIGURE WITH OVERRIDE;
- Server memory
The default for max server memory is an astronomical 2 petabytes; effectively this means there is no cap. Since SQL Server is optimized to use memory aggressively, this means that the server can consume as much memory as it wants, only releasing it when getting pressure from the OS to do so. This often leads to complaints about memory leaks and people wanting SQL Server to stop hogging all of their memory; when in fact the server is doing exactly what they've implicitly told it to do. This gets more complicated when you have multiple instances, or other applications running on the server; in either case you don't want to leave this setting at the default. See this blog post by Jonathan Kehayias (@SQLPoolBoy) for guidance on how much memory to reserve for the operating system and other applications and services.
- Trace flags
SQL Server ships with no trace flags enabled. I've mentioned a few trace flags you should consider turning on; Pieter Vanhove (@Pieter_vanhove) has a good starter set to consider as well.
Note that not all of these default settings and behaviors are bad in all scenarios; I merely wanted to bring them up, since the typical assumption is that the defaults can't possibly be bad in any scenario. Do you change any other defaults on most or all SQL Server instances? I'd love to hear about others I see less frequently.
sapassword to blank? They are better these days at security best practices, but still lacking in other areas. Even today, when I look in Books Online, I often see things that makes me cringe.
Remember when you could do this?
Now, I'm not saying that I envy the documentation writers, or that I want their thankless jobs. There is a ton of content there, and it must be mind-numbing to even think about the prospect of making it "perfect." Microsoft has a pretty good track record of telling us how the product works, but doesn't do so well in the areas of practical usage, guidance, or code samples that promote sound techniques. Even system stored procedures and maintenance plans often fail to promote best practices. Not to mention products like SharePoint and Entity Framework that make way too many schema and query sins to even start to compile a meaningful list.
Some of the most egregious things that newer database developers learn from (and for a lot of these, I'm going to pick on one specific topic with multiple violations):
- SELECT *
While some people seem to think
SELECT *is okay until it becomes a performance problem, most of us probably agree that it is not a best practice to use this syntax in production code for various reasons. Yet there it is, in all its glory, in the BOL topic above (and many other topics as well).
- Default cursors
Microsoft seems to be a big fan of default cursors, in spite of their heavy-handed behaviors. In most cases, you'll want to specify
LOCAL FAST_FORWARDfor forward-only, read-only cursors. Or avoid cursors in the first place. But cursors with no options specified can be found in Books Online, AdventureWorks In-Memory OLTP sample procedures, and by far the most infamous:
sp_msforeachdb. While it is an undocumented and unsupported procedure, its use is quite prevalent; unfortunately, the problems caused by the cursor may lead to serious problems, including silently skipping databases without the user's knownledge. I've complained about this in Connect #264677 and #752629, but it will not be fixed. I've written a replacement without this flaw and with several other features; I've blogged about it here and here.
- No semi-colons
The topic above actually does use semi-colons to terminate statements, but a lot of topics don't, never mind Microsoft's own stored procedures. The problem is that "Not using a statement terminator for Transact-SQL statements" has been deprecated since SQL Server 2008; while it's unlikely to actually be implemented in our lifetime, I've blogged about why we should all make a conscious effort to use statement terminators more consistently.
- Varchar without length
The above topic doesn't do this, but several others do (including this one). The main problem with declaring varchar without length is that it sometimes defaults to 1 character, and others 30; and in some cases it can lead to silent truncation – meaning you can have data loss and not even know it. (I talk more about this here.) Similar ambiguity can affect other variable-length data types, like char, nchar, nvarchar, numeric, decimal and others. If it can take a length, precision or scale, please always specify it explicitly.
- Date query sins
I often wonder where people get ideas to store dates as strings, use ambiguous regional date formats, use
BETWEENand lazy shorthand like
GETDATE() + 1, and then I see gems like this in Microsoft's own documentation (yes, this appears in the topic above too):
AND h.OrderDate BETWEEN '1/1/06' AND '2/1/06' AND d.DueDate BETWEEN CAST ('20060101' AS datetime) + 14 AND CAST ('20060201' AS datetime) + 14;
What don't I like about this? Plenty of things:
You should avoid
BETWEENfor queries involving date/time data types, since rounding can take values like
23:59:59.997and round them up to the next day, and what the end range includes is actually confusing to a lot of people (often I see people believe that the above will include all data from February 1st – or is that January 2nd? – not just the data stamped at midnight). I've blogged about the issues with
BETWEENin more detail, and I am not a big fan of
EOMONTH()for similar reasons.
I alluded to it previously, but you should avoid regional and ambiguous formats like
M/D/Y, since both humans and SQL Server can mistake what you mean. The only truly safe formats for string literals in SQL Server are
YYYY-MM-DDTHH:nn:ss[.ms...](and yes that
Tis important). I've blogged about this here.
The shorthand syntax of adding a numeric value to a
DATETIMEshould be abandoned, since it doesn't work with the new data types. I bet there is a lot of code out there that is on the verge of breaking the moment someone decides to switch to
DATETIME2. I would rather type out an explicit
DATEADD, since clarity and forward compatibility are more important to me than saving a few characters of code or a few milliseconds of typing.
- No schema prefix
The above topic does not do this, but others do, including this one. To be fair, this topic also explains why you should always specify the schema when creating or referencing objects, so this is another case of "do as I say, not as I do." For more details on why you should always specify the schema, see this blog post.
- Alias types
Also known as "user-defined data types," there are several of these in the AdventureWorks sample database, and they seem to be used simply to save a little bit of typing during table and parameter declarations. I'll never understand why you want to specify a column as
dbo.Phonesimply maps to
NVARCHAR(25). (Also not sure why a phone number would ever need to support Unicode characters, but that's probably a different issue.) Much worse is the effort required to change an alias type once it's been used in tables, procedures, and functions – see this blog post for all the gory details.
- Old-style joins
While the ancient inner join syntax (
FROM t1, t2 WHERE t1.x = t2.x) has not been deprecated like the outer join syntax (
=*) has, it's not the clearest either; there is an example of this in the topic mentioned above as well. It discourages the separation of join criteria from filter criteria, and makes it much more likely to accidentally derive the cartesian product. I've even seen multiple scenarios where this has been "fixed" by slapping a
SELECTlist. This is certainly a stylistic preference and a lot more subjective than many of the things I tend to complain about, but in case you're not convinced, I've written a lot more about it.
- Added September 5, 2014: Best practices documents
These things age, and not everything stands up to the test of time. Denny Cherry (@mrdenny) recently wrote a post debunking a lot of things stated in this rather old (2008) blog post put out by MSDN.
I don't mean to crap all over Microsoft's documentation and code samples (never mind production code); all I'm saying is that, just because you see them do something, does not mean it is a good idea to do it yourself.
A lot of people make a lot of assumptions about SQL Server, myself included. I constantly learn things about our favorite platform, and occasionally have to change my opinion about a pre-conceived notion I had. Hopefully you're not making any of the assumptions I've outlined above; and if you are, I've persuaded you to think twice about them.