The dangerous beauty of the PIVOT operator in SQL Server

In today’s blog posting I want to discuss one specific T-SQL language construct in SQL Server – the PIVOT operator that was introduced back with SQL Server 2005. I always refer to this language construct as one of the most dangerous ones in SQL Server – and you will very soon see why. Before we get into some specific problems and pitfalls, I first want to give you a basic overview of what can be accomplished with pivoting in SQL Server.

Overview

The basic idea behind the PIVOT operator in SQL Server is that you can rotate rows into columns during a T-SQL query. The operator itself was introduced back with SQL Server 2005 and is mainly used for databases that are built on the principles of an Entity Attribute Value model (EAV). The idea behind an EAV model is that you can extend database entities without performing database schema changes. For that reason an EAV model stores all attributes of an entity in one table as key/value pairs. Let’s have a look at the following table that models a simple key/value pairs table.

CREATE TABLE EAVTable
(
        RecordID INT NOT NULL,
	Element CHAR(100) NOT NULL,
	Value SQL_VARIANT NOT NULL,
	PRIMARY KEY (RecordID, Element)
)
GO

-- Insert some records
INSERT INTO EAVTable (RecordID, Element, Value) VALUES
(1, 'FirstName', 'Klaus'),
(1, 'LastName', 'Aschenbrenner'),
(1, 'City', 'Vienna'),
(1, 'Country', 'Austria'),
(2, 'FirstName', 'Bill'),
(2, 'LastName', 'Gates'),
(2, 'City', 'Seattle'),
(2, 'Country', 'USA')
GO

As you can see, I have inserted 2 database entities into that table, and every entity consists of multiple attributes. Every attribute is just an additional record in that table. If you want to extend the entity with additional attributes, you need to insert additional records into that table but there is no need to perform a database schema change – that’s the “power” of an open database schema…

Querying such an EAV table is of course very hard, because you are dealing with a flat key/value pair structure. For that reason you want to rotate the table contents, and rotate rows into columns. You can perform this rotation natively with the PIVOT operator, or completely manually through a traditional CASE expression. Before we get down to the nasty details of the PIVOT operator I want to show you how you can perform the rotation by hand with T-SQL and some CASE expressions. If you perform the rotation manually, your T-SQL query has to implement 3 different phases:

  1. Grouping Phase
  2. Spreading Phase
  3. Aggregation Phase

In the Grouping Phase we compact our EAV table into distinct database entities. In our case we perform a GROUP BY on the column RecordID. In the 2nd phase, the Spreading Phase, we use multiple CASE expressions to rotate rows into columns. And finally we use the MAX expression in the Aggregation Phase to return one distinct value for every row and column. Let’s have a look at the following T-SQL code.

-- Pivot the data with a handwritten T-SQL statement.
-- Make sure you have an index defined on the grouping column.
SELECT
	RecordID,
	-- Spreading and aggregation phase
	MAX(CASE WHEN Element = 'FirstName' THEN Value END) AS 'FirstName',
	MAX(CASE WHEN Element = 'LastName' THEN Value END) AS 'LastName',
	MAX(CASE WHEN Element = 'City' THEN Value END) AS 'City',
	MAX(CASE WHEN Element = 'Country' THEN Value END) AS 'Country'
FROM EAVTable
GROUP BY RecordID -- Grouping phase
GO

As you can see from the code, it is very easy to identify each phase, and how they are mapped to the T-SQL query. The following picture shows you the result of the query, where we have finally rotated rows into columns.

Manually pivoted data

The PIVOT operator

Beginning with SQL Server 2005 (almost 10 years ago!), Microsoft introduced the PIVOT operator in T-SQL. With that operator you can perform the same transformation (rows to columns) with just one native operator. Sounds very easy and promising, doesn’t it? The following listing shows how you can perform the same transformation natively with the PIVOT operator.

-- Perform the same query with the native PIVOT operator.
-- The grouping column is not specified explicitly, it's the remaining column
-- that is not referenced in the spreading and aggregation elements.
SELECT
	RecordID,
	FirstName,
	LastName,
	City,
	Country
FROM EAVTable
PIVOT(MAX(Value) FOR Element IN (FirstName, LastName, City, Country)) AS t
GO

When you run that query, you will get back the same result as seen in the previous picture. But when you look at the syntax of the PIVOT operator, you will see one big difference compared to the manual approach:

You can only specify the spreading and aggregation elements! There is no way to define the grouping elements explicitly!

The grouping elements are just the remaining columns that you haven’t referenced in the PIVOT operator. In our case we haven’t referenced the column RecordID in the PIVOT operator, therefore this column is used during the Grouping Phase. This leads to interesting side effects if we subsequently change the database schema, for example by adding additional columns to the underlying base table:

-- Add a new column to the table
ALTER TABLE EAVTable ADD SomeData CHAR(1)
GO

When you now run the same query with the PIVOT operator again (make sure you have non NULL values in that column), you will get back a completely different result, because the Grouping Phase is now done on the columns RecordID and SomeData (which we have just added).

The native PIVOT operator can lead to a wrong result

Contrast that with what happens if we re-execute the manually T-SQL query that we wrote to start with. It still returns the same correct result. That’s one of the biggest negative side effects of the PIVOT operator in SQL Server: the grouping elements can’t be defined explicitly. To overcome this problem, it is a best practice to use a table expression and only return the necessary columns from the table. With this approach you will also have no problems if you change the table schema at a later stage, because the additional columns are not returned (by default) from the table expression. Let’s have a look at the following listing.

-- Use a table expression to state explicitly which columns you want to 
-- return from the base table. Therefore you can always control on which
-- columns the PIVOT operator is performing the grouping.
SELECT
	RecordID,
	FirstName,
	LastName,
	City,
	Country
FROM
(
	-- Table Expression
	SELECT RecordID, Element, Value FROM EAVTable
) AS t
PIVOT(MAX(Value) FOR Element IN (FirstName, LastName, City, Country)) AS t1
GO

As you can see from this code, I’m feeding the PIVOT operator through a table expression. And within the table expression you are only selecting the needed columns from the base table. This then means that you can change the table schema in the future without breaking the result of the PIVOT query.

Summary

I hope that this blog posting has shown you why the PIVOT operator can be very dangerous in SQL Server. The syntax itself leads to very efficient code, but as a side effect you can’t specify the grouping elements directly. Therefore you should always make sure to use a table expression to define exactly which columns are fed into the PIVOT operator to make the given result deterministic.

What is your experience with the PIVOT operator? Do you like it or not? If you don’t like it, what would you want to have changed?

Like or share to get the source code.

Thanks for reading!

-Klaus

17 thoughts on “The dangerous beauty of the PIVOT operator in SQL Server”

    1. Hello Brett,

      Thanks for your comment.
      I just checked both execution plans, and they are almost the same: Clustered Index Scan + Stream Aggregate operator.
      With the manual T-SQL solution, the Query Optimizer also added a Compute Scalar operator between the Clustered Index Scan and the Stream Aggregate operator.
      The Compute Scalar operator is used to perform the calculation of the CASE expressions.

      Best regards,

      -Klaus

  1. After adding the extra column, I get the same result with the original Pivot query. (2014 Express)

    1. Hello Filip,

      Thanks for your comment.
      Do you have also updated the newly added column with some data?
      When you have NULL values in the column, then the original PIVOT query gets you the same result.

      Just have a look at the enclosed source code.

      Best regards,

      -Klaus

  2. After adding the tinyint column, add that to the PIVOT and see what happens. Then add the column to the “old” solution with MAX(CASE…).

  3. Hi Klaus,

    I added the SomeData column, and put in ‘C’ for Record 1 and ‘D’ for Record 2, and when I run the first PIVOT query, I still get 2 rows.

  4. Nice article, Klaus, but this is not so dangerous as it may look like. Usually, you are not using the table directly in the PIVOT clause. Usually, a derived table used there. Roughly 100% percent of people will follow the syntax and example of PIVOT from MSDN – https://technet.microsoft.com/en-us/library/ms177410(v=sql.105).aspx or from Craig Freedman’s blog – https://blogs.msdn.microsoft.com/craigfr/2007/07/03/the-pivot-operator/, so this situation will not happen to them.

  5. I have a scenario where I have hundred fields with IP Address.Is it better to use key value pair column and use PIVOT for reporting, or to use a flat table with 100 columns

  6. John Turpin

    Thanks Klaus, good article. Thanks for explaining the hidden nature of the Group By operation when using PIVOT.

  7. This post was elegantly written and I was able to figure out a very similar use case. In my case I have multiple “value types” (Y/N value, Number Value, Text Value) coming in because of how the table was designed so it looked like there were duplicate records, mostly NULL. I fixed it by bringing only one “value”.

Leave a Reply to Brett Gerhardi Cancel Reply

Your email address will not be published. Required fields are marked *