The dangerous beauty of the PIVOT operator in SQL Server
In today’s blog posting I want to discuss one specific T-SQL language construct in SQL Server – the PIVOT operator that was introduced back with SQL Server 2005. I always refer to this language construct as one of the most dangerous ones in SQL Server – and you will very soon see why. Before we get into some specific problems and pitfalls, I first want to give you a basic overview of what can be accomplished with pivoting in SQL Server.
The basic idea behind the PIVOT operator in SQL Server is that you can rotate rows into columns during a T-SQL query. The operator itself was introduced back with SQL Server 2005 and is mainly used for databases that are built on the principles of an Entity Attribute Value model (EAV). The idea behind an EAV model is that you can extend database entities without performing database schema changes. For that reason an EAV model stores all attributes of an entity in one table as key/value pairs. Let’s have a look at the following table that models a simple key/value pairs table.
CREATE TABLE EAVTable ( RecordID INT NOT NULL, Element CHAR(100) NOT NULL, Value SQL_VARIANT NOT NULL, PRIMARY KEY (RecordID, Element) ) GO -- Insert some records INSERT INTO EAVTable (RecordID, Element, Value) VALUES (1, 'FirstName', 'Klaus'), (1, 'LastName', 'Aschenbrenner'), (1, 'City', 'Vienna'), (1, 'Country', 'Austria'), (2, 'FirstName', 'Bill'), (2, 'LastName', 'Gates'), (2, 'City', 'Seattle'), (2, 'Country', 'USA') GO
As you can see, I have inserted 2 database entities into that table, and every entity consists of multiple attributes. Every attribute is just an additional record in that table. If you want to extend the entity with additional attributes, you need to insert additional records into that table but there is no need to perform a database schema change – that’s the “power” of an open database schema…
Querying such an EAV table is of course very hard, because you are dealing with a flat key/value pair structure. For that reason you want to rotate the table contents, and rotate rows into columns. You can perform this rotation natively with the PIVOT operator, or completely manually through a traditional CASE expression. Before we get down to the nasty details of the PIVOT operator I want to show you how you can perform the rotation by hand with T-SQL and some CASE expressions. If you perform the rotation manually, your T-SQL query has to implement 3 different phases:
- Grouping Phase
- Spreading Phase
- Aggregation Phase
In the Grouping Phase we compact our EAV table into distinct database entities. In our case we perform a GROUP BY on the column RecordID. In the 2nd phase, the Spreading Phase, we use multiple CASE expressions to rotate rows into columns. And finally we use the MAX expression in the Aggregation Phase to return one distinct value for every row and column. Let’s have a look at the following T-SQL code.
-- Pivot the data with a handwritten T-SQL statement. -- Make sure you have an index defined on the grouping column. SELECT RecordID, -- Spreading and aggregation phase MAX(CASE WHEN Element = 'FirstName' THEN Value END) AS 'FirstName', MAX(CASE WHEN Element = 'LastName' THEN Value END) AS 'LastName', MAX(CASE WHEN Element = 'City' THEN Value END) AS 'City', MAX(CASE WHEN Element = 'Country' THEN Value END) AS 'Country' FROM EAVTable GROUP BY RecordID -- Grouping phase GO
As you can see from the code, it is very easy to identify each phase, and how they are mapped to the T-SQL query. The following picture shows you the result of the query, where we have finally rotated rows into columns.
The PIVOT operator
Beginning with SQL Server 2005 (almost 10 years ago!), Microsoft introduced the PIVOT operator in T-SQL. With that operator you can perform the same transformation (rows to columns) with just one native operator. Sounds very easy and promising, doesn’t it? The following listing shows how you can perform the same transformation natively with the PIVOT operator.
-- Perform the same query with the native PIVOT operator. -- The grouping column is not specified explicitly, it's the remaining column -- that is not referenced in the spreading and aggregation elements. SELECT RecordID, FirstName, LastName, City, Country FROM EAVTable PIVOT(MAX(Value) FOR Element IN (FirstName, LastName, City, Country)) AS t GO
When you run that query, you will get back the same result as seen in the previous picture. But when you look at the syntax of the PIVOT operator, you will see one big difference compared to the manual approach:
You can only specify the spreading and aggregation elements! There is no way to define the grouping elements explicitly!
The grouping elements are just the remaining columns that you haven’t referenced in the PIVOT operator. In our case we haven’t referenced the column RecordID in the PIVOT operator, therefore this column is used during the Grouping Phase. This leads to interesting side effects if we subsequently change the database schema, for example by adding additional columns to the underlying base table:
-- Add a new column to the table ALTER TABLE EAVTable ADD SomeData CHAR(1) GO
When you now run the same query with the PIVOT operator again (make sure you have non NULL values in that column), you will get back a completely different result, because the Grouping Phase is now done on the columns RecordID and SomeData (which we have just added).
Contrast that with what happens if we re-execute the manually T-SQL query that we wrote to start with. It still returns the same correct result. That’s one of the biggest negative side effects of the PIVOT operator in SQL Server: the grouping elements can’t be defined explicitly. To overcome this problem, it is a best practice to use a table expression and only return the necessary columns from the table. With this approach you will also have no problems if you change the table schema at a later stage, because the additional columns are not returned (by default) from the table expression. Let’s have a look at the following listing.
-- Use a table expression to state explicitly which columns you want to -- return from the base table. Therefore you can always control on which -- columns the PIVOT operator is performing the grouping. SELECT RecordID, FirstName, LastName, City, Country FROM ( -- Table Expression SELECT RecordID, Element, Value FROM EAVTable ) AS t PIVOT(MAX(Value) FOR Element IN (FirstName, LastName, City, Country)) AS t1 GO
As you can see from this code, I’m feeding the PIVOT operator through a table expression. And within the table expression you are only selecting the needed columns from the base table. This then means that you can change the table schema in the future without breaking the result of the PIVOT query.
I hope that this blog posting has shown you why the PIVOT operator can be very dangerous in SQL Server. The syntax itself leads to very efficient code, but as a side effect you can’t specify the grouping elements directly. Therefore you should always make sure to use a table expression to define exactly which columns are fed into the PIVOT operator to make the given result deterministic.
What is your experience with the PIVOT operator? Do you like it or not? If you don’t like it, what would you want to have changed?
Thanks for reading!