Cardinality Estimation for Correlated Columns

(Be sure to checkout the FREE SQLpassion Performance Tuning Training Plan - you get a weekly email packed with all the essential knowledge you need to know about performance tuning on SQL Server.)

In today’s blog posting I want to talk about a very specific problem in cardinality estimation: working with correlated columns in a search predicate. In the first step we will look at the behavior that SQL Server is using since version 7.0, and finally we will have a more detailed look on SQL Server 2014, which implements a new way of handling correlated columns during Query Optimization.

Correlated Columns – what?

Before we go into the details about that specific problem, we must first clarify what correlated columns are. When we look at the Query Optimizer used by SQL Server, the optimizer is based on 4 core assumptions:

  • Independence
  • Uniformity
  • Containment
  • Inclusion

I don’t want to elaborate on each assumption in detail, because they are described in other whitepapers very well. You will find a link to such a whitepaper in the summary section of this blog posting. The assumption on which we will concentrate today is the 1st one – Independence. Independence means that the columns used in a search predicate (the WHERE clause) are independent and returning different records, when queried alone. They are not correlated against each other. Unfortunately this assumption is not always true. Let’s have a look at a more concrete example, where that assumption is violated. Assume the following 2 queries:

In the first query we are returning 999 rows from the AdventureWorks2012 database by restricting on some specific SalesOrderID values. And the second query returns 1125 records based on the OrderDate column.But there are 912 rows where both search predicates evaluate to TRUE, which means we have a correlation between the two columns. But the Query Optimizer isn’t aware of this correlation.

SQL Server 7.0 – 2012 behavior

Let’s have now a look at how that correlation is handled from SQL Server 7.0 to SQL Server 2012. When you look at the execution plan of both queries, you can see that the Query Optimizer estimates 999.936 rows for the first query and 1125 rows for the second.

Cardinality Estimation for a single search prediateCardinality Estimation for a single search prediate

With that information and the cardinality of our table (the number of records our table has – 31465 in our case), we can calculate the so-called Selectivity of this search predicate. The selectivity is just a number between 0 and 1. The smaller the selectivity is, the less records are returned from a query (0.0 = 0% of the rows are returned, 1.0 means 100% of the rows are returned). We can calculate the selectivity of a search predicate by dividing the estimated number of rows by the number of records stored in our table. The selectivity of the first search predicate is therefore 0.03177931034482758620689655172414 (999.936 / 31465 table rows), and the selectivity of the second search predicate is 0.03575401239472429683775623708883 (1125 / 31465 table rows). Now let’s see what happens if we use both search predicates combined with an AND operator (a so-called Conjunction):

When you look at the execution plan, the Query Optimizer estimates at the Clustered Index Seek (Clustered) operator 35.7517 rows!

Wrong Cardinality Estimation for correlated columns

The actual execution returns 912 rows as mentioned earlier. That’s a huge discrepancy. The Query Optimizer just multiplies both selectivity values of our search predicates with the cardinality of our table to produce the final row estimation. SQL Server just assumes that both search predicates are returning DIFFERENT rows – there is an Independence assumed between both columns:

0.03177931034482758620689655172414 * 0.03575401239472429683775623708883 * 31465 = 35.7517241379310344827586206896586

Of course, the reality is completely different, because there is a huge correlation between both search predicates. Therefore you can see the large discrepancy between the Estimated and Actual Number of Rows. The more individual search predicates you are combining with a conjunction in a query, the larger the discrepancy will be. When the final estimation falls below 1.0 row, the Query Optimizer will always estimate 1 row at least – 0 rows are never estimated.

SQL Server 2014 behavior

As you might have heard, SQL Server 2014 includes a new Cardinality Estimator. As soon as your database is in the compatibility mode 120, the new Cardinality Estimator is used. Be aware that when you are restoring or attaching a database from an earlier version of SQL Server – the compatibility is here still the old one! If you want to use the new Cardinality Estimator without changing the compatibility mode, you can also use the new Trace Flag 2312. Let’s run now the above query with both search predicates with the new Cardinality Estimator by applying Trace Flag 2312 to the query.

When you look at the execution plan, you can see that the Cardinality Estimation has changed.

Softened Cardinality Estimation for correlated columns in SQL Server 2014

The new Cardinality Estimator estimates now 188.898 rows. Much more than the 35.75 rows with the old one. But there is still a huge gap to the 912 rows, that the query returns in reality. But which formula is using now the new Cardinality Estimator? The new one uses now a so-called Exponential Back-off algorithm. The Query Optimizer takes the first 4 search predicates, and sorts them by their selectivity. All selectivity values are again multiplied by each other, but the difference is that every subsequent value is softened by taking a larger square root of it. Let’s have a look on the formula to understand this behavior:

c0 * (c1 ^ 1/2) * (c2 ^1/4) * (c3 ^ 1/8)

When we look on our concrete example we can derive the final cardinality by using the following calculation:

0.03177931034482758620689655172414 * SQRT(0.03575401239472429683775623708883) * 31465 = 189.075212620762

There is a small discrepancy in our calculation compared to the estimation of 188.898 rows, because the estimation of 999.936 rows was rounded in the execution plan provided by SQL Server. By using the exponential back-off algorithm, the Query Optimizer makes sure to make a better estimation and tightens the hole between the Estimated and Actual Number of Rows, if there is a correlation between the applied search arguments.

Summary

In this blog posting we have looked on one specific problem during cardinality estimation in relational databases: handling the correlation between columns that are used as search predicates. Prior to SQL Server 2014, the Query Optimizer used a very moderate approach by just multiplying the various selectivity values. This can lead to huge underestimations, which can cause you troubles, when an upfront operator in the execution plan relies on these estimations (e.g. a Sort or Hash operator).

The new Cardinality Estimator of SQL Server 2014 deals with an enhanced mechanism for this specific problem by using an exponential back-off formula during the Cardinality Estimations which produces better estimations. But there are still discrepancies compared to the Actual Number of Rows during runtime. If you want to read more about the new Cardinality Estimator of SQL Server 2014, I suggest to read the excellent whitepaper Optimizing your Query Plans with the SQL Server 2014 Cardinality Estimator written by Joe Sack (Blog, Twitter).

Thanks for reading!

-Klaus

It`s your turn

Your email address will not be published. Required fields are marked *

SQLpassion

Copyright © 2015 by SQLpassion · Klaus Aschenbrenner · Imprint · Offerings · Academy · Contact · Go to Top