In this post we’re going to evaluate the possibility of concealing poor data type choices with compression. Why are we doing this? If you don’t know yet, life is made of bad ideas. If you know, you know, you know? So, now you know too. If your next question is “Why don’t we simply change the data types?”, you need to get out of here with your so-called logic.
Just kidding. You should always eat your vegetables and design your tables with appropriate data types. With that being said, it’s possible that situations arise where we’re unable to do that, for example, because the database is part of a third-party application.
So, the objective here is to understand some of the mechanisms behind Row and Page compression, and to check whether they can be helpful in those situations where bad decisions were made and we need to be extra creative in order to fix them.
I Think, Therefore I Compress
Using compression can significantly decrease the space required by data on disk. That’s cool. But one might argue that storage these days is not expensive, and that is true, sometimes. Even so, that is not the only advantage of using Row or Page compression. Reducing the space required by the data on a data page allows us to store more rows on each page. Consequently, this also means that we are able to keep more data in memory, and that is awesome. Not only is memory considerably more expensive than storage, but also, if this means that we’re able to fetch data directly from memory instead of going to disk, the positive impact on performance can be quite substantial.
But, as almost always, this witchcraft comes with a tradeoff. In this case, the extra work that is necessary to compress and decompress our data will incur extra CPU cycles. Despite that, if you’re not living on the edge of CPU consumption, you should give it a try. The overhead is barely observable and generally far outweighed by the benefits of compressing the data. Just be aware that some data types, such as nvarchar(max) or smalldatetime, won’t benefit from compression, for different reasons. More details and links to relevant resources regarding this matter are provided below, but you can always use the system stored procedure sp_estimate_data_compression_savings to obtain an estimate of the potential savings.
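As a sketch, estimating the savings for a given table might look like this (using the PostViews table we’ll create later; NULL for the index and partition parameters covers all indexes and partitions):

```sql
-- Estimate how much space ROW compression would save on dbo.PostViews
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'PostViews',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'ROW';
```

The result set includes the current size and the estimated size with the requested compression, which makes it easy to decide whether the rebuild is worth it before touching anything.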
In very loose terms, Row compression handles fixed-length data types as if they had variable length, by reducing the allocated space to the minimum possible, given the underlying values. For example, the number 1 stored as bigint will take all 8 bytes if the column is not compressed, whereas if Row compression is applied, the same value will take only 1 byte. Furthermore, it’s worth mentioning that nulls and zeros take no space, and there are other optimizations to the variable-length data types’ metadata structure that reduce its overhead. For more details on Row compression, check this page.
The first step of Page compression is the previously mentioned Row compression, which is then followed by Prefix compression and Dictionary compression. A structure called Compression Information (CI) is created after the page header and used to support both of these additional steps, as we’ll see.
These compression techniques may not be applied to every single page, since pattern repetition is an essential prerequisite, and, unlike Row compression, they’re applied only to leaf-level pages.
For each column that is being compressed, an anchor value is chosen and stored in the CI. This anchor will be the value that yields the greatest storage savings, typically the longest and most frequently repeated prefix in the column. Then, other occurrences of this value in the column are replaced by a reference to the corresponding prefix. If the value in a row does not exactly match the selected prefix value, a partial match is recorded.
For instance, in the first column of the first row, the value AAABB will be transformed into 4B. This means that the first 4 letters (AAAB) partially matched the anchor’s value (AAABCC), so they’re replaced by the number 4, and the “B” that did not match is kept.
Occurrences that exactly match the anchor’s value store a special kind of null to indicate it. Conversely, values that don’t match the anchor at all are prefixed with a 0, which actually takes slightly more space instead of saving any.
After Prefix compression has been completed, Dictionary compression is applied. It searches for repeated values anywhere on the page and stores them in the CI area. Unlike Prefix compression, it is not restricted to a single column, so it can replace repeated values regardless of where they occur on the page. The following image illustrates how Dictionary compression is applied.
Note that the value 4B has been referenced from different columns of the same page. For more details about Page compression check this page, or the awesome Bob Pusateri’s presentation at the 8KB conference here.
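To get all three steps (Row, Prefix and Dictionary compression), we request PAGE compression on the rebuild. A minimal sketch, assuming the PostViews table introduced below:

```sql
-- PAGE compression implies ROW compression plus the Prefix
-- and Dictionary steps described above
ALTER TABLE dbo.PostViews REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Nonclustered indexes are compressed independently, e.g.:
-- ALTER INDEX [IX_SomeIndex] ON dbo.PostViews
--     REBUILD WITH (DATA_COMPRESSION = PAGE);
```

Note that compression is set per object (and even per partition), so a table and its nonclustered indexes can use different compression settings.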
I’ve Learned So Much From My Mistakes…
Now, the fun part. Let’s consider the following table, containing the columns Id and ViewCount from the Posts table in the StackOverflow2010 database. Both are INT columns, which seems appropriate considering the values they hold.
```sql
CREATE TABLE PostViews
(
    [PostId]    INT NOT NULL,
    [ViewCount] INT NOT NULL,
    CONSTRAINT [PK_Id] PRIMARY KEY CLUSTERED ([PostId])
);

INSERT INTO dbo.PostViews (PostId, ViewCount)
SELECT Id, ViewCount
FROM Posts;
```
The table has 3 729 195 rows, which materialize in 63.1 MB of reserved size and 62.7 MB of actual data size.
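These numbers can be obtained with sp_spaceused, for instance:

```sql
-- Reports rows, reserved, data, index_size and unused space
EXEC sp_spaceused N'dbo.PostViews';
```

We’ll use the reserved and data columns of its output to compare the different versions of the table.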
…That I’m Thinking of Making a Few More
This time, we recreate the table, changing PostId and ViewCount to BIGINT instead of INT. After inserting the exact same data, we can confirm the consequences of our choice:
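A sketch of that recreation, reusing the script from before:

```sql
DROP TABLE dbo.PostViews;

CREATE TABLE PostViews
(
    [PostId]    BIGINT NOT NULL,  -- was INT
    [ViewCount] BIGINT NOT NULL,  -- was INT
    CONSTRAINT [PK_Id] PRIMARY KEY CLUSTERED ([PostId])
);

INSERT INTO dbo.PostViews (PostId, ViewCount)
SELECT Id, ViewCount
FROM Posts;
```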
The same 3 729 195 rows will now require 93 MB.
Nolege Is Power
To test our hypothesis, we need to recreate our table with compression. We’ll enable Row compression, executing the following command:
ALTER TABLE PostViews REBUILD WITH (DATA_COMPRESSION = ROW);
By looking into the underlying data with the DBCC PAGE command, we’ll be able to notice the gains from compression.
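As a sketch of how to do that inspection: DBCC IND lists the pages allocated to the table, and DBCC PAGE dumps one of them (both are undocumented commands, and the page id below is just a placeholder; replace it with one of the PagePID values returned for a data page, i.e. PageType 1):

```sql
-- List the pages of dbo.PostViews (third argument: index id, 1 = clustered)
DBCC IND ('StackOverflow2010', 'dbo.PostViews', 1);

-- Send DBCC PAGE output to the client instead of the error log
DBCC TRACEON (3604);

-- DBCC PAGE (database, file_id, page_id, print_option)
-- print_option 3 shows per-row details, including compression info
DBCC PAGE ('StackOverflow2010', 1, 12345, 3);  -- 12345 is a placeholder page id
```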
Even though the bigint data type has a fixed length of 8 bytes, we can verify that the values 2290772 and 0 take only 3 and 0 bytes, respectively. This has a significant impact on the table size, as expected:
All our records now fit in 40 MB, which is nearly one third less than the uncompressed table with the appropriate data types. If we compress that version of the table, with the int columns, we’ll notice that the results are virtually the same. But now that we know the mechanisms behind compression, it’s easy to understand why.
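If you want to compare the sizes yourself without sp_spaceused, a quick alternative is sys.dm_db_partition_stats:

```sql
-- Used space of dbo.PostViews, in MB (pages are 8 KB each)
SELECT
    OBJECT_NAME(ps.object_id)            AS table_name,
    SUM(ps.used_page_count) * 8 / 1024.0 AS used_space_mb
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID('dbo.PostViews')
GROUP BY ps.object_id;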
As we saw, under certain circumstances, compression can successfully conceal bad design decisions, at least to some extent. But it’s important to stress that there are other punishments for those who live with poor data-type choices that compression can’t fix. This is particularly true when we are not talking about mildly bad ideas, like using bigint instead of int, and start talking about scarier patterns like #WeAreAllVarchar. So, despite this being a valid approach for cases where we can’t change column data types, compression should not be used as an enabler for lazy and bad design decisions.