Cardinality Estimation Limitations
As you might know, SQL Server uses internally Statistics to estimate how many rows are returned from a specific query. That process is called the Cardinality Estimation. Based on the output from the Cardinality Estimation the Query Optimizer generates an Execution Plan, which is cached and finally executed. In today’s blog posting I’m going to talk about a few limitations that the Cardinality Estimation currently has, and how you can overcome these by applying various techniques.
Cardinality Estimation Errors
As I have written some time ago, SQL Server uses the Histogram and the Density Vector for the Cardinality Estimation, when an execution plan gets compiled. The model that SQL Server is using here is a very static one, with a lot of disadvantages and pitfalls (the implementation details have changed here beginning with SQL Server 2014+).
To give you a concrete example of where the Cardinality Estimation has problems, imagine the following 2 tables: Orders and Country. Every row in the Orders table represents an order placed by a customer (like a fact table in a Data Warehousing scenario), and that table points through a foreign key constraint to the Country table (it acts like a dimension table). And now let’s run a query against both tables to retrieve all sales from UK:
SELECT SalesAmount FROM Country JOIN Orders ON Country.ID = Orders.ID WHERE Name = 'UK'
When you look at the execution plan, you can see that SQL Server has a big problem with the cardinality estimation.
SQL Server estimates 501 rows, and in reality the Clustered Index Seek operator returns 1000 rows. SQL Server uses here the Density Vector of the Statistics Object idx_ID_SalesAmount to make that estimation: the Density Vector is 0.5 (we have only 2 distinct values in that column), and therefore the estimation is 501 (1001 * 0.5). You can fix that specific problem by adding a Filtered Statistics object. This will give SQL Server more information about the data distribution itself, and will also help with the Cardinality Estimation.
CREATE STATISTICS Country_UK ON Country(ID) WHERE Name = 'UK'
When you now look at the execution plan again, you can see that both the estimated and actual numbers of rows are now the same. You will find further information about that specific problem in one of my blog postings.
Another problem that you have with the current Cardinality Estimation in SQL Server is when search predicates are correlated to each other. Imagine the following pseudo SQL query:
SELECT * FROM Products WHERE Company = 'Microsoft' AND Product = 'iPhone'
When you look as a human at this query, you know immediately how many rows are returned: 0! The company Microsoft doesn’t sell iPhones. When you run such a query against SQL Server, the Query Optimizer looks at each search predicate independently:
- In the 1st step the Cardinality Estimation is done for the predicate Company = ‘Microsoft’.
- In the 2nd the Query Optimizer produces a Cardinality Estimation for the other predicate Product =’iPhone’.
And finally both estimations are multiplied by each other to produce the final estimation. When the first predicate produces a cardinality of 0.3 and the 2nd one produces a cardinality of 0.4, the final cardinality will be 0.12 (0.3 * 0.4). The Query Optimizer handles every predicate on its own without any correlation between them.
Paul White has also written a very interesting article about that specific problem, and how you can influence the Query Optimizer with various trace flags in SQL Server.
Statistics and the Cardinality Estimation are very important for accurate and well-performing execution plans in SQL Server. Unfortunately their usage is also limited, especially in some edge cases. Throughout today’s blog posting you have seen how you can use Filtered Statistics to help the Query Optimizer to produce a better Cardinality Estimate, and how you can handle correlated columns in SQL Server.
Thanks for your time,