INTERSECT in SQL Server

In today’s blog posting I want to talk about the INTERSECT set operation in SQL Server. The INTERSECT set operation intersects two sets with each other and returns the rows that appear with identical column values in both sets. The following picture illustrates this concept. 

Intersection between 2 sets

INTERSECT vs. Inner Join

As you will realize, it is nearly the same as an INNER JOIN between 2 tables, but there are some important differences that I want to describe today. Let’s start by creating 2 simple tables, which we will use as the input tables:

-- Create the 1st table
CREATE TABLE t1
(
	Col1 INT,
	Col2 INT,
	Col3 INT
)
GO

-- Create the 2nd table
CREATE TABLE t2
(
	Col1 INT,
	Col2 INT
)
GO

-- Create a unique Clustered Index on both tables
CREATE UNIQUE CLUSTERED INDEX idx_ci ON t1(col1)
CREATE UNIQUE CLUSTERED INDEX idx_ci ON t2(col1)
GO

-- Insert some records into both tables
INSERT INTO t1 VALUES (1, 1, 1), (2, 2, 2), (NULL, 3, 3)
INSERT INTO t2 VALUES (2, 2), (NULL, 3)
GO
GO

As you can see from the T-SQL code, I have also created a unique Clustered Index on both tables, and inserted a few test records. Now let’s intersect the tables with each other:

SELECT Col1, Col2 FROM t1

INTERSECT

SELECT Col1, Col2 FROM t2
GO

SQL Server returns 2 records: the record with the column values 2 and the record with the NULL value. That’s one of the first big differences to an INNER JOIN operation: records with NULL values are not eliminated if they appear in both tables. When you perform an INNER JOIN operation between both tables on column Col1 the record with the NULL value is not returned:

SELECT t1.col1, t1.col2 FROM t1
INNER JOIN t2 ON t2.col1 = t1.col1
GO

The following picture shows how the result sets differ between INTERSECT and the INNER JOIN approaches.

Different results with INTERSECTION and an INNER JOIN

Now let’s analyze the execution plan of the INTERSECT set operation. Because you have a supporting index on column Col1, the Query Optimizer is able to translate the INTERSECT operation to a traditional logical INNER JOIN operation.

INTERSECT leads to an INNER JOIN in the Execution Plan

But here the operator Nested Loop (Inner Join) doesn’t really perform an INNER JOIN operation. Let’s see why not. When you look at the inner side of the Nested Loop operator, you can see that you have a residual predicate on the Clustered Index Seek (Clustered) operator.

A remaining residual predicate

The residual predicate is evaluated on the column Col2, because that column is not part of the navigation structure of the previously created Clustered Index. As I have said at the beginning, SQL Server has to find matching rows that have same values in ALL columns in BOTH tables. With the Clustered Index Seek (Clustered) operator and the residual predicate SQL Server just checks if we have a matching record with the same column values in the table t1. And the Nested Loop operator itself just returns the column values from one table – in our case from table t1.

SQL Server just returns columns from one table - a traditional Semi Join

So the INNER JOIN is just a Left Semi Join: SQL Server checks if we have matching records in the right table – if yes the matching record from the left table is returned. The residual predicate on the Clustered Index Seek (Clustered) operator can be removed by providing an index that contains all necessary columns in the navigation structure, as follows:

-- Create a supporting Non-Clustered Index
CREATE NONCLUSTERED index id_nci ON t1(Col1, Col2)
GO

Now when you look again at the execution plan of the INTERSECT operation you can see that SQL Server performs an Index Seek (NonClustered) operation on the previously created index, and that the residual predicate is not needed anymore.

With the additional index you have removed the residual predicate

Now let’s check what the generated execution plan looks like, when we drop all supporting index structures:

-- Drop all supporting indexes
DROP INDEX id_nci ON t1
DROP INDEX idx_ci ON t1
DROP INDEX idx_ci ON t2
GO

When you perform the INTERSECT between both tables again, now you can see a Nested Loop (Left Semi Join) operation in the execution plan. SQL Server now has to perform the Left Semi Join physically in the execution plan by performing a Table Scan operation on the inner side and making a row-by-row comparison with a residual predicate inside the Nested Loop operator.

A residual predicate at the Left Semi Join

This execution plan isn’t really efficient, because the Table Scan operator on the inner side has to be performed over and over again – for every row that is returned from the outer table. Supporting indexes are very important if we want to perform the INTERSECT set operation as efficiently as possible.

Summary

The INTERSECT set operation is a really scary one, because almost nobody knows it well. When you use it, you have to be aware of the differences between it and an INNER JOIN. And as you have seen it is also very important to have an efficient indexing strategy, so that the Query Optimizer can generate an execution plan that performs well.

Like or share to get the source code.

Thanks for your time,

-Klaus

Leave a Comment

Your email address will not be published. Required fields are marked *