The pain of Real-Time Data Analysis in SQL Server 2016

SQL Server 2016 provides you with a lot of new features and functionality around the ColumnStore index that was introduced back with SQL Server 2012. One of these new features is the ability to create a Filtered Non-Clustered ColumnStore Index on top of a regular Row-Store table.

This might sound a bit surprising at first, but Microsoft positions this feature in the area of Real-Time Data Analysis.

Real-Time Data Analysis – what?

Let’s talk first about Real-Time Data Analysis itself. A lot of people have OLTP systems in place and transfer this data regularly through ETL processes into a Data Warehouse for data analysis purposes. This architecture seems to be legitimate, but has 2 “problems”:

  • The data in the Data Warehouse is always out-of-date, depending on the frequency of the running ETL processes. On the other hand Business Analysts always need to see the most up-to-date data…
  • Sometimes the Data Warehouse is only sourced from one OLTP database. In that case the implementation of a ETL process and the whole Data Warehouse itself seems to be a huge overhead that costs you a lot of money and time.

For of these 2 reasons, people are now starting to talk about Real-Time Data Analysis. Instead of querying a populated denormalized Data Warehouse database, let’s query the one-and-only OLTP database directly!

ColumnStore Indexes in SQL Server 2016

The problem with a traditional OLTP database is the simple fact that regular Row-Store indexes (Clustered Index, Non-Clustered Index) are slow for data analysis. They are great for OLTP style workloads, but they don’t give you great performance for Data Warehouse queries, which consume a huge amount of data.

Therefore back in SQL Server 2012 Microsoft introduced the ColumnStore Index, which acts as a turbo-booster for Data Warehouse queries. It really is a turbo-booster – trust me! In SQL Server 2012 you had only one Non-Clustered ColumnStore Index available – and that also made your table read-only. Then with SQL Server 2014 Microsoft provided us with a Clustered ColumnStore Index that can be also written to.

And for Real-Time Data Analysis SQL Server 2016 now offers a Non-Clustered ColumnStore Index that can be also written to – and even filtered. Why do we need that new kind of index? The idea behind this index is to sub-divide your table data into cold and hot data portions. The cold data portion is hardly changing, but the hot data portion is still accepting a read/write workload. Let’s have a look at the following picture, which illustrates this concept.

The table data is divided into a cold and hot data portion

To access the cold data portion for Real-Time Data Analysis in a very efficient way, you can define a Filtered Non-Clustered ColumnStore Index on that table. And because the cold data portion is not changing that much anymore, the overhead of maintaining the Non-Clustered ColumnStore Index is quite small. Let’s check out the following SQL statement.

-- Create a Non-Clustered ColumnStore Index for the "cold" data partition
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_ColdData ON Sales.SalesOrderDetail(ProductID)
WHERE ModifiedDate < '20140601'
GO

Here I have created a Filtered Non-Clustered ColumnStore Index on a traditional Row-Store table Sales.SalesOrderDetail. This means that every record older than June 2016 is part of the ColumnStore Index. So our indexing strategy currently looks like the following:

The cold data portion is indexed through a ColumnStore Index

Let’s execute a few queries against the table Sales.SalesOrderDetail.

-- Scans the complete Clustered Index
SELECT * FROM Sales.SalesOrderDetail
GO

When you perform a simple SELECT * query without any Filter Predicate, SQL Server scans the complete Clustered Index as you can see in the following picture.

Scanning the whole Clustered Index

Now let’s run a query where we request data from the cold data portion of the table.

-- Uses the Non-Clustered ColumnStore Index (the "cold" data partition)
SELECT ProductID FROM Sales.SalesOrderDetail 
WHERE ModifiedDate < '20140601'
GO

SQL Server can now access the Filtered Non-Clustered ColumnStore Index that we have defined on that table – awesome!

Accessing the cold data portion through the Non-Clustered ColumnStore Index

Now let’s query for older data from the cold data portion of our table.

-- Scans the complete Clustered Index ;-(
SELECT ProductID FROM Sales.SalesOrderDetail
WHERE ModifiedDate < '20140531'
GO

When you now look at the execution plan, it gets dirty. SQL Server scans again the *complete* traditional RowStore Clustered Index. SQL Server can’t make use of the Non-Clustered ColumnStore Index that we have defined on the table!

Ups, we are scanning now again the complete Clustered Index

That’s the first lesson learned: the Filtered Non-Clustered ColumnStore Index is only used if the Filter Predicate of the index matches *exactly*! It’s the same prerequisite as with a traditional Row-Store Filtered Non-Clustered Index. This is very important to know! And now let’s go ahead and let’s query for the hot data portion of our table – with the following query.

-- Without a supporting Non-Clustered Index we can't access the "hot" data partition in a very
-- efficient way - we have to scan the complete Clustered Index...
-- We also have to think here about the Tipping Point in combination with Bookmark Lookups!
SELECT ProductID FROM Sales.SalesOrderDetail
WHERE ModifiedDate >= '20140601'
GO

Without any other supporting Row-Store Index, SQL Server again has to scan the *complete* Row-Store Clustered Index!

Ups, we are now scanning again the complete Clustered Index

The question is now how we can efficiently access the hot data portion of the table? The answer is easy: you have to create a traditional Row-Store Filtered Non-Clustered Index that indexes the hot portion of your table data.

-- Create a traditional Non-Clustered RowStore Index for the "hot" data partition
CREATE NONCLUSTERED INDEX idx_HotData ON Sales.SalesOrderDetail(ProductID)
WHERE ModifiedDate >= '20140601'
GO

When you now execute your query again, SQL Server can now finally use the Row-Store Non-Clustered Index to access the hot data portion.

Now we can use a Covering Non-Clustered Index

But the question here is also how do you deal with Bookmark Lookups? If your Filter Predicate is not selective anymore, SQL Server will again scan your complete Clustered Index because of the Tipping Point. How you can elimate this specific problem? By adding the columns from the Bookmark Lookup into the leaf level of the Non-Clustered Index.

And which columns do you query in your SELECT list? You can’t know, because a Business Analyst is doing the data analysis – not you. She can request every possible column from your table. To be safe you have to add *every* column from the table into the leaf level of the Filtered Non-Clustered Row-Store Index. But when you do that, your indexing strategy looks like the following:

Perfect, our table data is duplicated!

We have a traditional Row-Store Clustered Index across the whole table, the cold data portion is indexed with a Non-Clustered ColumnStore Index, and the hot data portion is indexed with a traditional Row-Store Non-Clustered Index. You have *duplicated* your table data! Do you really want to do this? Hmm... That’s the pain of Real-Time Data Analysis – how to efficiently access the hot portion of your table data.

Summary

Real-Time Data Analysis sounds great – at first. But when you look at this concept in more detail, there are many performance pitfalls. Especially when it comes to accessing your hot table data. To be honest, I’m not really a big fan of it. Maybe it would be a better approach to split your table into 2 different, separate tables. One table for your cold data, and one table for your hot data (and UNION them together in a view). But in that case, you would even not need a Filtered Non-Clustered ColumnnStore Index...

Please feel free to leave a comment with your thoughts about this specific problem.

Like or share to get the source code.

Thanks for your time,

-Klaus

8 thoughts on “The pain of Real-Time Data Analysis in SQL Server 2016”

  1. Sunil Agarwal

    Klaus, thanks for playing with the functionality! We value your feedback. I want to hi-light couple of things (a) when you use NCCI as a filtered index, the QO will choose NCCI when it is cost effective. It will automatically include rows from the ‘hot’ portion. (b) Our recommendation is to use filtered condition on clustered index. If ‘hot’
    data is large, SQL will choose scanning the rowstore table instead (c) filtered condition on ‘date’ like you have in example is not recommended as this will be an static date. Ideally, you want to ‘filter’ on some column such as ‘order status’. ForT RTM, we are adding another option called ‘Compression Delay’ if filtered index is not appropriate. Please stay tuned

    Thanks
    Sunil Agarwal
    SQL Server team

    1. Klaus Aschenbrenner

      Hello Sunil,

      Thanks for your great comment! 🙂
      Already looking forward to RTM and thanks for your great work in this area.

      -Klaus

  2. Thinking out loud here. Is it possible to store the hot/cold date and use that date to restructure queries to union those hot and cold partitions such that you took advantage of the indexes in both? While not ideal, it might be less of a pain and give the performance necessary.

    1. Klaus Aschenbrenner

      Hello Paul,

      Thanks for your comment.
      Yes, that would be a viable approach.

      -Klaus

    1. Klaus Aschenbrenner

      Hello Sunil,

      Thanks for the link to the blog posting.
      Creating a Clustered Index on the Filtering column of the NCCI of course works, but it will be a *really* bad Clustered Key, because it violates all the recommendations of a “good” Clustered Key 😉

      -Klaus

Leave a Reply to Klaus Aschenbrenner Cancel Reply

Your email address will not be published. Required fields are marked *