#EntryLevel : Compression & Data Types
SQL Server Data Compression
Earlier this year, I co-presented a webinar with Kevin Kline (b|t) on SQL Server Data Compression. In this post, I'm going to focus on the segment of that webinar which (1) covered the two types of SQL Server Data Compression and (2) illustrated which types of data will compress well and which will not.
Row compression is essentially smarter, more efficient storage.
- An int uses 4 Bytes of storage, whether it is 2 or 2,147,483,647. With row compression, 2 should use 1 Byte of storage space.
- Trailing zeros, padded, and blank characters are not stored (i.e. take up 0 bytes), where applicable.
For a detailed list of how Row Compression Implementation works for each data type, refer to this article on MSDN.
Page Compression is what I like to refer to as "compression for real this time" as it goes well beyond the smart storage method of row and uses patterns/repeating values to condense the stored data.
First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You'll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).
Figure 1: No Data Compression
Second, take a look at a page of data that has Prefix Compression applied (Figure 2). Some of the values have been removed from the data part of the page and stored once in the compression information metadata section below the page header. These are the prefix values.
Next, spot what happens to some entries that share part of those values. "SSSQLL" is at the top of one section, and below that, "SSSQQ" has been replaced by "4Q", as in the first 4 characters of "SSSQLL" followed by a "Q" equals the original "SSSQQ" in fewer characters. Rather than storing some of these repeating prefixes it can store them once and uses references to complete the values. At this stage, the prefixes are limited to their respective columns. As a result, you'll see that the "4Q" refers to the "SSSSLL" prefix above it in the middle column.
Figure 2: Prefix Compression
Finally, observe what happens when Dictionary Compression is utilized (Figure 3). Unlike prefix, this level allows for compression of duplicated values across the entire page. The "1" across the two right columns corresponds to the "0QQQ".
Figure 3: Dictionary Compression
For more information on Page Compression Implementation, see this article on MSDN.
Once you have an understanding of how compression works with repeated values and prefixes, you can have a better idea of whether your data will compress well.
In the "Yes" example below, I estimated savings on a table and discovered that one index would compress by about 80%:
EXEC sys.sp_estimate_data_compression_savings @schema_name = N'dbo', @object_name = N'EventSourceHistoryDetail', @index_id = NULL, @partition_number = NULL, @data_compression = N'PAGE';
Savings by Index
Upon seeing that % savings, I took a closer look at that index. Shown below, it has three columns:
I then ran a query to see how many of those values were unique:
SELECT COUNT(DISTINCT EventSourceID) AS UniqueEventsSourceIDs ,COUNT(DISTINCT RemoteSequenceID) AS UniqueRemoteSequenceIDs ,COUNT(DISTINCT RemoteObjectID) AS UniqueRemoteObjectIDs ,COUNT(*) AS TotalEventSources FROM dbo.EventSourceHistoryDetail;
Querying for repeating values
Below, you can see in the results that across 253,127 rows, there are only 16 different values for EventSourceID and 241 values for RemoteObjectID. The RemoteObjectID has the potential to have the largest size (256 Bytes). Consequently, being able to refer to those values instead of duplicating them across the quarter million rows will certainly add up to a high % savings in storage space.
Unique Table Data
Furthermore, the RemoteSequenceID is a bigint, which uses twice as much space as an int. Row compression is able to store those values more efficiently since they do not require the maximum space required by a bigint. A value of '2' can use a single byte instead of eight.
Perhaps most noteworthy about this example is the that the data not only fails to compress well, but that the compressed size is greater than the original.
EXEC sys.sp_estimate_data_compression_savings @schema_name = N'dbo', @object_name = N'PerformanceAnalysisPlan', @index_id = NULL, @partition_number = NULL, @data_compression = N'PAGE';
Recall that compression creates an area of metadata for it to work. As a result, that metadata overhead could increase the storage size in some cases.
Savings by Index
Taking a closer look at the index in question reveals that it consists of a plan hash, a binary data type. According to Microsoft documentation, binary data types row compress by removing trailing zeros.
Querying the data shows that each plan hash is a unique value:
SELECT COUNT(DISTINCT PlanHash) AS UniquePlans, COUNT(*) AS TotalPlans FROM dbo.PerformanceAnalysisPlan;
Querying for repeating values
Unique Table Data
If you're not familiar with what a plan hash is, you might be wondering why it doesn't compress the prefix values. Just because the values are unique doesn't mean they don't share prefix values, right? Due to the nature of a plan hash, this isn't the case.
Plan Hash Data
As shown above, the prefix values for the plan hash are rather limited. There's the 0x, but the remainder of the hash has less to offer for prefixes and dictionary compression methods. Notice that the removal of trailing zeros is not helpful here, either. Row compression is unable to store this more efficiently.
While there are a multitude of factors to consider before applying compression to your data, one of the first things you might want to determine is whether or not your data actually compresses well enough to use it. Just because you have a large table (or index), that doesn't mean it will benefit from either data compression method. Having an understanding of what is in your data and how those methods work is an excellent start.
Finally, if you're interested in learning more about how to decide when it's a good idea to use data compression, check out the aforementioned webinar or read my five-part series detailing how I decided where to use it in the SQL Sentry database.