Clustered ColumnStore Indexes – Space Savings

Over the last weeks I have done a lot of work with Clustered ColumnStore Indexes in SQL Server 2014/2016, and it is really amazing what you can achieve with this technology. In today’s blog post I want to talk a little bit more about the space savings that you can make by using a Clustered ColumnStore Index.

Clustered ColumnStore Indexes

The ColumnStore Index was first introduced back with SQL Server 2012 – but only as a Non-Clustered Index that made your table read-only. I had a lot of customers who were really amazed about this new index, but as soon as I have showed them the limitations they changed their mind.

But in SQL Server 2014 Microsoft provides you with a Clustered ColumnStore Index that is also writeable – or at least it seems that way. But the great thing about it is the fact that it is a *Clustered* ColumnStore index. Therefore you are able to store your data primarily in the highly efficient ColumnStore format, and you don’t need to store the data additionally in the RowStore format. This is a huge advantage over the Non-Clustered ColumnStore Index that was introduced in SQL Server 2012.

But how much space saving can you make by using a Clustered ColumnStore Index? Until now I have always worked with the ContosoRetailDW database from Microsoft in my demonstrations, but this example database is a little bit – let’s say – “small”. I wanted to test the Clustered ColumnStore Index with a realistic, large data set to see how the ColumnStore Index performs and how much space you can save with it.

I therefore decided to use the data set from the TPC-H benchmark. I’m a big fan of the software Benchmark Factory from Dell (which I always use as a load generator in my SQL Server workshops). With this software it is quite easy to run a TPC-E (OLTP) or TPC-H (DWH) workload, and also generate the initial data set for it. And you can use a scaling factor to define how large your database will be.

To make things easy I have started up an Amazon EC2 instance (r3.8xlarge), which gives me 32 Cores, 244 GB RAM, and 2×320 GB SSD drives – enough hardware for my test case. The cloud really rocks here! (I am not being sarcastic! Really!) After I have installed SQL Server 2016 CTP 3.2 I generated the TPC-H database with a scaling factor of 250. This gave me a initial database size of around 403 GB. And trust me, it took some time to generate that database!

Benchmark Factory generates all the tables as Heap Tables – without any traditional Clustered Index. So 403 GB is our baseline. The largest table – H_LineItem – consists of around 1 500 000 000 rows. How much space do you think we can gain when we create a Clustered ColumnStore Index on each table? Let’s try it. (Side note: I am not worrying about Segment Elimination here, so I just created the Clustered ColumnStore Index from the Heap Tables without any predefined sorting order).

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON H_Customer
GO

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON H_LineItem
GO

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON H_Nation
GO

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON H_Order
GO

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON H_Part
GO

CREATE CLUSTERED COLUMNSTORE INDEX cssi ON H_PartSupp
GO

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON H_Region
GO

CREATE CLUSTERED COLUMNSTORE INDEX cssi ON H_Supplier
GO

After the creation of all Clustered ColumnStore Indexes I have check the size of the tables again: the complete tables needed about 60 GB of space altogether! Imagine that: I was able to get a space saving of 343 GB just by creating Clustered ColumnStore Indexes! This is really amazing!

But we are not finished yet: Since SQL Server 2014 you can also enable a so-called Archival Compression, where SQL Server will use the LZ77 algorithm when you write the LOB data pages (that contains the ColumnStore data) out to the storage subsystem. So let’s enable Archival Compression on every table.

ALTER INDEX ccsi ON H_Customer
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX ccsi ON H_LineItem
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX ccsi ON H_Nation
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX ccsi ON H_Order
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX ccsi ON H_Part
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX cssi ON H_PartSupp
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX ccsi ON H_Region
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

ALTER INDEX cssi ON H_Supplier
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
GO

After the rebuild of every ColumnStore Index I checked the size of the tables again: around 49 GB! So the Archival Compression gave me around 11 GB of additional space savings – nice! The following picture gives you an overview of the detailed space savings on every table. You can also download a larger copy of that picture here.

Space Savings with a Clustered ColumnStore Index

Summary

As I always say in my SQL Server workshops: the ColumnStore Index in SQL Server is just a turbo-booster! USE IT! Since SQL Server 2014 there is no reason why you shouldn’t use it. And the Clustered ColumnStore Index gives you a *huge* space saving over traditional Clustered Indexes and/or Heap Tables.

Of course exactly how much space you can gain depends on your data. Just try it with your database in a test environment – I think you will be amazed at what you can achieve here.

Thanks for your time,

-Klaus

4 thoughts on “Clustered ColumnStore Indexes – Space Savings”

  1. Andrea Caldarone

    Hello Klaus,
    have you tried to build a traditional “row-based” clustered indexes on the tables and then rebuilding the indexes with page compression to compare:

    no compression
    page compression on row-based indexes
    ckustered columnstore indexes with columnstore_archive

    it would be interesting.

  2. Klaus Aschenbrenner

    Hello Andrea,

    Thanks for your comment.
    No, I haven’t compared the space savings to a traditional Clustered Index.

    Thanks,

    -Klaus

  3. Hello Klaus,

    Could you please let me know how do you load all the tables from Benchmark Factory. I would like to test with such load in my environment.

    1. Klaus Aschenbrenner

      Hello Sushil,

      You can create a TPC-H project within Benchmark Factory, and that one will create (and populate) the necessary tables for you.

      Thanks,

      -Klaus

Leave a Reply to Andrea Caldarone Cancel Reply

Your email address will not be published. Required fields are marked *