(Be sure to check out the FREE SQLpassion Performance Tuning Training Plan, where you get, week by week via email, all the essential knowledge you need about performance tuning in SQL Server.)
In today's blog post I want to talk about a very specific problem in cardinality estimation: dealing with correlated columns in a search predicate. First we will look at the behavior that SQL Server has used since version 7.0, and then we will take a more detailed look at SQL Server 2014, which implements a completely new way of handling correlated columns during query optimization.
Correlated Columns – what?
Before we go into the details of that specific problem, we first have to clarify what correlated columns are. The Query Optimizer used by SQL Server is based on 4 core assumptions:

1. Independence: data distributions on different columns are independent of each other, unless correlation information is available.
2. Uniformity: within a histogram step, distinct values are evenly spaced and occur with the same frequency.
3. Containment: when two columns are joined or compared, values from one side are assumed to exist on the other side.
4. Inclusion: when a column is compared with a constant, the constant is assumed to actually exist in the column.
I don't want to elaborate on each assumption in detail, because they are described very well in other whitepapers. You will find a link to such a whitepaper in the summary section of this blog post. The assumption we will concentrate on today is the first one: Independence. Independence means that the columns used in a search predicate (the WHERE clause) are independent of each other and return different records when queried individually. They are not correlated with each other. Unfortunately this assumption is not always true. Let's look at a more concrete example where that assumption is violated. Assume the following two queries:
SELECT * FROM Sales.SalesOrderHeader
WHERE SalesOrderID > 74000 AND SalesOrderID < 75000
GO

SELECT * FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20080626' AND OrderDate <= '20080724'
GO
The first query returns 999 rows from the AdventureWorks2012 database by restricting on a specific range of SalesOrderID values. The second query returns 1125 records based on the OrderDate column. But there are 912 rows in total where both search predicates evaluate to TRUE, which means there is a correlation between both columns. The Query Optimizer, however, isn't aware of this correlation.
SQL Server 7.0 – 2012 behavior
Let's now look at how that correlation is handled from SQL Server 7.0 through SQL Server 2012. When you look at the execution plans of both queries, you can see that the Query Optimizer estimates 999.936 rows for the first query and 1125 rows for the second one.
With that information and the cardinality of our table (the number of records the table contains – 31465 in our case), we can calculate the so-called Selectivity of each search predicate. The selectivity is simply a number between 0 and 1: the smaller the selectivity, the fewer records the query returns (0.0 means 0% of the rows are returned, 1.0 means 100% of the rows are returned). We calculate the selectivity of a search predicate by dividing the estimated number of rows by the number of records stored in the table. The selectivity of the first search predicate is therefore 0.03177931034482758620689655172414 (999.936 / 31465 table rows), and the selectivity of the second search predicate is 0.03575401239472429683775623708883 (1125 / 31465 table rows). Now imagine what happens if we combine both search predicates with an AND operator (a so-called Conjunction):
SELECT * FROM Sales.SalesOrderHeader
WHERE SalesOrderID > 74000 AND SalesOrderID < 75000
  AND OrderDate >= '20080626' AND OrderDate <= '20080724'
GO
When you look at the execution plan, the Query Optimizer estimates 35.7517 rows at the Clustered Index Seek (Clustered) operator!
In reality the query returns the 912 rows mentioned earlier. That's a huge discrepancy. The Query Optimizer simply multiplies the selectivity values of both search predicates with the cardinality of our table to produce the final row estimate. SQL Server just assumes that both search predicates return DIFFERENT rows – Independence between both columns is assumed:
0.03177931034482758620689655172414 * 0.03575401239472429683775623708883 * 31465 = 35.7517241379310344827586206896586
Of course, the reality is completely different, because there is a strong correlation between both search predicates. That's why you see the large discrepancy between the Estimated and the Actual Number of Rows. The more individual search predicates you combine in a conjunction, the larger the discrepancy becomes. When the final estimate falls below 1.0 row, the Query Optimizer always estimates at least 1 row – 0 rows are never estimated.
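The pre-2014 independence calculation above can be reproduced with a few lines of arithmetic. Here is a small sketch in Python (the row counts are taken from the execution plans in the example; the variable names are my own):

```python
# Estimated rows for each individual predicate (from the execution plans)
est_rows_id = 999.936    # SalesOrderID range predicate
est_rows_date = 1125.0   # OrderDate range predicate
table_rows = 31465       # cardinality of Sales.SalesOrderHeader

# Selectivity = estimated rows / table cardinality
sel_id = est_rows_id / table_rows
sel_date = est_rows_date / table_rows

# Pre-2014 estimator: assume independence and simply multiply the selectivities
estimate = sel_id * sel_date * table_rows
print(round(estimate, 4))  # -> 35.7517
```

Because the selectivities are multiplied, every additional conjunctive predicate shrinks the estimate further, regardless of any correlation in the data.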
SQL Server 2014 behavior
As you might have heard, SQL Server 2014 ships with a new Cardinality Estimator. As soon as your database is at compatibility level 120, the new Cardinality Estimator is used. Be aware of this when you restore or attach a database from an earlier version of SQL Server – the compatibility level is still the old one! If you want to use the new Cardinality Estimator without changing the compatibility level, you can also use the new Trace Flag 2312. Let's now run the above query with both search predicates under the new Cardinality Estimator by applying Trace Flag 2312 to the query.
SELECT * FROM Sales.SalesOrderHeader
WHERE SalesOrderID > 74000 AND SalesOrderID < 75000
  AND OrderDate >= '20080626' AND OrderDate <= '20080724'
OPTION (RECOMPILE, QUERYTRACEON 2312)
GO
When you look at the execution plan, you can see that the Cardinality Estimation has changed.
The new Cardinality Estimator now estimates 188.898 rows – much more than the 35.75 rows of the old one, but there is still a large gap to the 912 rows that the query actually returns. So which formula does the new Cardinality Estimator use? It uses a so-called Exponential Back-off algorithm: the Query Optimizer takes the first 4 search predicates and sorts them by selectivity, most selective first. The selectivity values are again multiplied with each other, but every subsequent value is softened by raising it to a successively smaller exponent (square root, fourth root, eighth root). Let's look at the formula to understand this behavior:
c0 * (c1 ^ (1/2)) * (c2 ^ (1/4)) * (c3 ^ (1/8))
When we apply this to our concrete example, we can derive the final cardinality with the following calculation:
0.03177931034482758620689655172414 * SQRT(0.03575401239472429683775623708883) * 31465 = 189.075212620762
There is a small discrepancy between our calculation and the estimated 188.898 rows, because the estimate of 999.936 rows was rounded in the execution plan provided by SQL Server. With the exponential back-off algorithm, the Query Optimizer produces a better estimate and narrows the gap between the Estimated and the Actual Number of Rows when the applied search predicates are correlated.
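The exponential back-off calculation generalizes to up to four predicates. Here is a small sketch in Python (the helper function is my own illustration, not part of SQL Server):

```python
def exponential_backoff(selectivities, table_rows):
    """Estimate rows for a conjunction using exponential back-off:
    sort the selectivities ascending (most selective first), keep at
    most 4 of them, and soften each subsequent one with an ever
    larger root: c0 * c1^(1/2) * c2^(1/4) * c3^(1/8)."""
    sels = sorted(selectivities)[:4]
    estimate = float(table_rows)
    for i, sel in enumerate(sels):
        estimate *= sel ** (1.0 / (2 ** i))  # exponents 1, 1/2, 1/4, 1/8
    return estimate

# The two correlated predicates from our example
sel_id = 999.936 / 31465    # SalesOrderID range
sel_date = 1125.0 / 31465   # OrderDate range
print(round(exponential_backoff([sel_id, sel_date], 31465), 2))  # -> 189.08
```

With only two predicates the formula collapses to c0 * SQRT(c1) * table_rows, which is exactly the calculation shown above.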
In this blog post we looked at one specific problem in cardinality estimation in relational databases: handling the correlation between columns that are used in search predicates. Prior to SQL Server 2014, the Query Optimizer used a very simplistic approach, just multiplying the individual selectivity values. This can lead to huge underestimations, which can cause you trouble when a subsequent operator in the execution plan relies on these estimates (e.g. a Sort or Hash operator).
The new Cardinality Estimator of SQL Server 2014 addresses this specific problem with an enhanced mechanism: an exponential back-off formula applied during cardinality estimation, which produces better estimates. But there are still discrepancies compared to the Actual Number of Rows at runtime. If you want to read more about the new Cardinality Estimator of SQL Server 2014, I suggest reading the excellent whitepaper Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator, written by Joe Sack (Blog, Twitter).
Thanks for reading!