The pain of Real-Time Data Analysis in SQL Server 2016
SQL Server 2016 provides you with a lot of new features and functionality around the ColumnStore index that was introduced back with SQL Server 2012. One of these new features is the ability to create a Filtered Non-Clustered ColumnStore Index on top of a regular Row-Store table.
This might sound a bit surprising at first, but Microsoft positions this feature in the area of Real-Time Data Analysis.
Real-Time Data Analysis – what?
Let’s talk first about Real-Time Data Analysis itself. A lot of people have OLTP systems in place and transfer this data regularly through ETL processes into a Data Warehouse for data analysis purposes. This architecture seems to be legitimate, but has 2 “problems”:
- The data in the Data Warehouse is always out-of-date, depending on the frequency of the running ETL processes. On the other hand Business Analysts always need to see the most up-to-date data…
- Sometimes the Data Warehouse is only sourced from one OLTP database. In that case the implementation of a ETL process and the whole Data Warehouse itself seems to be a huge overhead that costs you a lot of money and time.
For of these 2 reasons, people are now starting to talk about Real-Time Data Analysis. Instead of querying a populated denormalized Data Warehouse database, let’s query the one-and-only OLTP database directly!
ColumnStore Indexes in SQL Server 2016
The problem with a traditional OLTP database is the simple fact that regular Row-Store indexes (Clustered Index, Non-Clustered Index) are slow for data analysis. They are great for OLTP style workloads, but they don’t give you great performance for Data Warehouse queries, which consume a huge amount of data.
Therefore back in SQL Server 2012 Microsoft introduced the ColumnStore Index, which acts as a turbo-booster for Data Warehouse queries. It really is a turbo-booster – trust me! In SQL Server 2012 you had only one Non-Clustered ColumnStore Index available – and that also made your table read-only. Then with SQL Server 2014 Microsoft provided us with a Clustered ColumnStore Index that can be also written to.
And for Real-Time Data Analysis SQL Server 2016 now offers a Non-Clustered ColumnStore Index that can be also written to – and even filtered. Why do we need that new kind of index? The idea behind this index is to sub-divide your table data into cold and hot data portions. The cold data portion is hardly changing, but the hot data portion is still accepting a read/write workload. Let’s have a look at the following picture, which illustrates this concept.
To access the cold data portion for Real-Time Data Analysis in a very efficient way, you can define a Filtered Non-Clustered ColumnStore Index on that table. And because the cold data portion is not changing that much anymore, the overhead of maintaining the Non-Clustered ColumnStore Index is quite small. Let’s check out the following SQL statement.
-- Create a Non-Clustered ColumnStore Index for the "cold" data partition CREATE NONCLUSTERED COLUMNSTORE INDEX idx_ColdData ON Sales.SalesOrderDetail(ProductID) WHERE ModifiedDate < '20140601' GO
Here I have created a Filtered Non-Clustered ColumnStore Index on a traditional Row-Store table Sales.SalesOrderDetail. This means that every record older than June 2016 is part of the ColumnStore Index. So our indexing strategy currently looks like the following:
Let’s execute a few queries against the table Sales.SalesOrderDetail.
-- Scans the complete Clustered Index SELECT * FROM Sales.SalesOrderDetail GO
When you perform a simple SELECT * query without any Filter Predicate, SQL Server scans the complete Clustered Index as you can see in the following picture.
Now let’s run a query where we request data from the cold data portion of the table.
-- Uses the Non-Clustered ColumnStore Index (the "cold" data partition) SELECT ProductID FROM Sales.SalesOrderDetail WHERE ModifiedDate < '20140601' GO
SQL Server can now access the Filtered Non-Clustered ColumnStore Index that we have defined on that table – awesome!
Now let’s query for older data from the cold data portion of our table.
-- Scans the complete Clustered Index ;-( SELECT ProductID FROM Sales.SalesOrderDetail WHERE ModifiedDate < '20140531' GO
When you now look at the execution plan, it gets dirty. SQL Server scans again the *complete* traditional RowStore Clustered Index. SQL Server can’t make use of the Non-Clustered ColumnStore Index that we have defined on the table!
That’s the first lesson learned: the Filtered Non-Clustered ColumnStore Index is only used if the Filter Predicate of the index matches *exactly*! It’s the same prerequisite as with a traditional Row-Store Filtered Non-Clustered Index. This is very important to know! And now let’s go ahead and let’s query for the hot data portion of our table – with the following query.
-- Without a supporting Non-Clustered Index we can't access the "hot" data partition in a very -- efficient way - we have to scan the complete Clustered Index... -- We also have to think here about the Tipping Point in combination with Bookmark Lookups! SELECT ProductID FROM Sales.SalesOrderDetail WHERE ModifiedDate >= '20140601' GO
Without any other supporting Row-Store Index, SQL Server again has to scan the *complete* Row-Store Clustered Index!
The question is now how we can efficiently access the hot data portion of the table? The answer is easy: you have to create a traditional Row-Store Filtered Non-Clustered Index that indexes the hot portion of your table data.
-- Create a traditional Non-Clustered RowStore Index for the "hot" data partition CREATE NONCLUSTERED INDEX idx_HotData ON Sales.SalesOrderDetail(ProductID) WHERE ModifiedDate >= '20140601' GO
When you now execute your query again, SQL Server can now finally use the Row-Store Non-Clustered Index to access the hot data portion.
But the question here is also how do you deal with Bookmark Lookups? If your Filter Predicate is not selective anymore, SQL Server will again scan your complete Clustered Index because of the Tipping Point. How you can elimate this specific problem? By adding the columns from the Bookmark Lookup into the leaf level of the Non-Clustered Index.
And which columns do you query in your SELECT list? You can’t know, because a Business Analyst is doing the data analysis – not you. She can request every possible column from your table. To be safe you have to add *every* column from the table into the leaf level of the Filtered Non-Clustered Row-Store Index. But when you do that, your indexing strategy looks like the following:
We have a traditional Row-Store Clustered Index across the whole table, the cold data portion is indexed with a Non-Clustered ColumnStore Index, and the hot data portion is indexed with a traditional Row-Store Non-Clustered Index. You have *duplicated* your table data! Do you really want to do this? Hmm... That’s the pain of Real-Time Data Analysis – how to efficiently access the hot portion of your table data.
Real-Time Data Analysis sounds great – at first. But when you look at this concept in more detail, there are many performance pitfalls. Especially when it comes to accessing your hot table data. To be honest, I’m not really a big fan of it. Maybe it would be a better approach to split your table into 2 different, separate tables. One table for your cold data, and one table for your hot data (and UNION them together in a view). But in that case, you would even not need a Filtered Non-Clustered ColumnnStore Index...
Please feel free to leave a comment with your thoughts about this specific problem.
Thanks for your time,