#EntryLevel : Compression & Data Types - SentryOne Team Blog

#EntryLevel : Compression & Data Types

SQL Server Data Compression

Earlier this year, I co-presented a webinar with Kevin Kline (b|t) on SQL Server Data Compression. In this post, I'm going to focus on the segment of that webinar which (1) covered the two types of SQL Server Data Compression and (2) illustrated which types of data will compress well and which will not.

Data Compression Methods

Row Compression

Row compression is essentially smarter, more efficient storage.

  • An int uses 4 Bytes of storage, whether it is 2 or 2,147,483,647. With row compression, 2 should use 1 Byte of storage space.
  • Trailing zeros, padded, and blank characters are not stored (i.e. take up 0 bytes), where applicable.

For a detailed list of how Row Compression Implementation works for each data type, refer to this article on MSDN.

Page Compression

Page Compression is what I like to refer to as "compression for real this time" as it goes well beyond the smart storage method of row and uses patterns/repeating values to condense the stored data.

First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You'll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).

No Data CompressionFigure 1: No Data Compression

Second, take a look at a page of data that has Prefix Compression applied (Figure 2). Some of the values have been removed from the data part of the page and stored once in the compression information metadata section below the page header. These are the prefix values.

Next, spot what happens to some entries that share part of those values. "SSSQLL" is at the top of one section, and below that, "SSSQQ" has been replaced by "4Q", as in the first 4 characters of "SSSQLL" followed by a "Q" equals the original "SSSQQ" in fewer characters. Rather than storing some of these repeating prefixes it can store them once and uses references to complete the values. At this stage, the prefixes are limited to their respective columns. As a result, you'll see that the "4Q" refers to the "SSSSLL" prefix above it in the middle column.

Prefix CompressionFigure 2: Prefix Compression

Finally, observe what happens when Dictionary Compression is utilized (Figure 3). Unlike prefix, this level allows for compression of duplicated values across the entire page. The "1" across the two right columns corresponds to the "0QQQ".

Dictionary CompressionFigure 3: Dictionary Compression

For more information on Page Compression Implementation, see this article on MSDN.

Will It Compress Well?

Yes Example

Once you have an understanding of how compression works with repeated values and prefixes, you can have a better idea of whether your data will compress well.

In the "Yes" example below, I estimated savings on a table and discovered that one index would compress by about 80%:

EXEC sys.sp_estimate_data_compression_savings 
  @schema_name      = N'dbo', 
  @object_name      = N'EventSourceHistoryDetail', 
  @index_id         = NULL,
  @partition_number = NULL,
  @data_compression = N'PAGE';

Estimating Savings

Savings by IndexSavings by Index

Upon seeing that % savings, I took a closer look at that index. Shown below, it has three columns:

Index InfoIndex Info

I then ran a query to see how many of those values were unique:

SELECT COUNT(DISTINCT EventSourceID) AS UniqueEventsSourceIDs
,COUNT(DISTINCT RemoteSequenceID) AS UniqueRemoteSequenceIDs
,COUNT(DISTINCT RemoteObjectID) AS UniqueRemoteObjectIDs
,COUNT(*) AS TotalEventSources
FROM dbo.EventSourceHistoryDetail;

Querying for repeating values

Below, you can see in the results that across 253,127 rows, there are only 16 different values for EventSourceID and 241 values for RemoteObjectID. The RemoteObjectID has the potential to have the largest size (256 Bytes). Consequently, being able to refer to those values instead of duplicating them across the quarter million rows will certainly add up to a high % savings in storage space.

Unique Table DataUnique Table Data

Furthermore, the RemoteSequenceID is a bigint, which uses twice as much space as an int. Row compression is able to store those values more efficiently since they do not require the maximum space required by a bigint. A value of '2' can use a single byte instead of eight.

No Example

Perhaps most noteworthy about this example is the that the data not only fails to compress well, but that the compressed size is greater than the original.

 EXEC sys.sp_estimate_data_compression_savings 
 @schema_name		= N'dbo',  
 @object_name		= N'PerformanceAnalysisPlan',  
 @index_id		= NULL,
 @partition_number	= NULL,
 @data_compression	= N'PAGE';

Estimating Savings

Recall that compression creates an area of metadata for it to work. As a result, that metadata overhead could increase the storage size in some cases.

Savings by IndexSavings by Index

Taking a closer look at the index in question reveals that it consists of a plan hash, a binary data type. According to Microsoft documentation, binary data types row compress by removing trailing zeros.

Index InfoIndex Info

Querying the data shows that each plan hash is a unique value:

 SELECT COUNT(DISTINCT PlanHash) AS UniquePlans,
 COUNT(*) AS TotalPlans
 FROM dbo.PerformanceAnalysisPlan;

Querying for repeating values

Unique Table DataUnique Table Data

If you're not familiar with what a plan hash is, you might be wondering why it doesn't compress the prefix values. Just because the values are unique doesn't mean they don't share prefix values, right? Due to the nature of a plan hash, this isn't the case.

Plan Hash DataPlan Hash Data

As shown above, the prefix values for the plan hash are rather limited. There's the 0x, but the remainder of the hash has less to offer for prefixes and dictionary compression methods. Notice that the removal of trailing zeros is not helpful here, either. Row compression is unable to store this more efficiently.

Summary


While there are a multitude of factors to consider before applying compression to your data, one of the first things you might want to determine is whether or not your data actually compresses well enough to use it. Just because you have a large table (or index), that doesn't mean it will benefit from either data compression method. Having an understanding of what is in your data and how those methods work is an excellent start.

More Information

Finally, if you're interested in learning more about how to decide when it's a good idea to use data compression, check out the aforementioned webinar or read my five-part series detailing how I decided where to use it in the SQL Sentry database.

Comments ( 0 )

    Leave A Comment

    Your email address will not be published.

    This site uses Akismet to reduce spam. Learn how your comment data is processed.