The power of Grouping Sets in SQL Server

Have you ever wanted to perform aggregations across multiple columns/dimensions in SQL Server without licensing SSAS (SQL Server Analysis Services)? I am not talking about using the Developer edition in production, nor am I talking about installing a pirate copy of SQL Server.

Mission Impossible? Not really, because it’s possible through the so-called Grouping Sets in SQL Server. I want to give you in this blog posting an overview about grouping sets, what kind of queries you can achieve with them, and what are their performance advantages.

Aggregations with Grouping Sets

Imagine you have an orders table, and you want to perform T-SQL query aggregations across multiple groups. In the context of the Sales.SalesOrderHeader table of the AdventureWorks2012 database, these groupings can be something like the following:

  • A grouping across “everything”
  • GROUP BY SalesPersonID, YEAR(OrderDate)
  • GROUP BY CustomerID, YEAR(OrderDate)
  • GROUP BY CustomerID, SalesPersonID, YEAR(OrderDate)

When you want to perform these individual groupings with a traditional T-SQL query, you need multiple statements, where you perform a UNION ALL between the individual result sets. Let’s have a look at such a query:

SELECT * FROM
(
	-- 1st Grouping Set
	SELECT
		NULL AS 'CustomerID',
		NULL AS 'SalesPersonID', 
		NULL AS 'OrderYear', 
		SUM(TotalDue) AS 'TotalDue' 
	FROM Sales.SalesOrderHeader
	WHERE SalesPersonID IS NOT NULL

	UNION ALL

	-- 2nd Grouping Set
	SELECT
		NULL AS 'CustomerID',
		SalesPersonID, 
		YEAR(OrderDate) AS 'OrderYear', 
		SUM(TotalDue) AS 'TotalDue' 
	FROM Sales.SalesOrderHeader
	WHERE SalesPersonID IS NOT NULL
	GROUP BY SalesPersonID, YEAR(OrderDate)

	UNION ALL

	-- 3rd Grouping Set
	SELECT
		CustomerID,
		NULL AS 'SalesPersonID', 
		YEAR(OrderDate) AS 'OrderYear', 
		SUM(TotalDue) AS 'TotalDue' 
	FROM Sales.SalesOrderHeader
	WHERE SalesPersonID IS NOT NULL
	GROUP BY CustomerID, YEAR(OrderDate)

	UNION ALL

	-- 4th Grouping Set
	SELECT
		CustomerID,
		SalesPersonID,
		YEAR(OrderDate) AS 'OrderYear', 
		SUM(TotalDue) AS 'TotalDue' 
	FROM Sales.SalesOrderHeader
	WHERE SalesPersonID IS NOT NULL
	GROUP BY CustomerID, SalesPersonID, YEAR(OrderDate)
) AS t
ORDER BY CustomerID, SalesPersonID, OrderYear
GO

The approach used by this T-SQL statement has multiple disadvantages:

  • The T-SQL statement itself is huge, because every individual group is one distinct query.
  • The table Sales.SalesOrderHeader has to be accessed 4 times – once for every distinct query.
  • When you look at the execution plan, you can see that SQL Server performs an Index Seek (NonClustered) operation 4 times – once for every query.

You can dramatically simplify the T-SQL code that you need, if you use the grouping sets functionality introduced back in SQL Server 2008. The following code shows you the same query, but this time implemented with grouping sets.

SELECT
	CustomerID, 
	SalesPersonID, 
	YEAR(OrderDate) AS 'OrderYear', 
	SUM(TotalDue) AS 'TotalDue'
FROM Sales.SalesOrderHeader
WHERE SalesPersonID IS NOT NULL
GROUP BY GROUPING SETS
(
	-- Our 4 different grouping sets
	(CustomerID, SalesPersonID, YEAR(OrderDate)),
	(CustomerID, YEAR(OrderDate)),
	(SalesPersonID, YEAR(OrderDate)),
	()
)
GO

As you can see from the code itself, you just specify the needed grouping sets inside the GROUP BY GROUPING SETS clause – everything else is performed transparently by SQL Server. The empty parentheses specify the so-called Empty Grouping Set, the aggregation across the whole table. When you also look at the output of STATISTICS IO, you can see that the table Sales.SalesOrderHeader was accessed only once! That’s a huge difference from the previous manual implementation that we have performed.

Within the execution plan SQL Server uses a Table Spool operator that stores the retrieved data temporarily in TempDb. The data from the work table created in TempDb is afterwards used in the second branch of the execution plan. Therefore the data isn’t rescanned for every group from the base table, which leads to a better performance of the whole execution plan.

Query plan that calculates Grouping Sets

When you look at the execution plan, you can also see that the query plan contains 3 Stream Aggregate operators (highlighted in red, blue, and green). These 3 operators are calculating the individual grouping sets:

  • The blue highlighted operator calculates the grouping set for CustomerID, SalesPersonID, and YEAR(OrderDate).
  • The red highlighted operator calculates the grouping set for SalesPersonID and YEAR(OrderDate). In addition it also calculates the grouping set across “everything”.
  • The green highlighted operator calculates the grouping set for CustomerID and YEAR(OrderDate).

The idea behind the 2 subsequent Stream Aggregate operators is to calculate so-called Super Aggregates – aggregations of aggregations.

Summary

In today’s blog posting I have introduced Grouping Sets to you, a T-SQL enhancement introduced back in SQL Server 2008. As you have seen grouping sets have 2 huge advantages: they simplify your code, and they improve the performance of the query by accessing the base data only once.

I hope that you have now got some good ideas about grouping sets, and it would be really interesting for me, if you can leave a short comment if you are able to use this functionality in your own databases.

Like or share this blog posting to get the source code.

Thanks for your time!

-Klaus

5 thoughts on “The power of Grouping Sets in SQL Server”

  1. Oldie but a goodie, very nice explanation, thanks!

    Another way of retrieving the same results I often see (and until just now, have been using) is the use of a CUBE() or ROLLUP() with various checks against Grouping_ID() or GROUPING() in the HAVING clause, for instance the following (ugly) code I’ve just been trying to fix:


    Group by ROLLUP(ColA, ColB, ColC)
    Having NOT (GROUPING(ColA) = 1
    AND GROUPING(ColB) = 1
    AND GROUPING(ColC) = 1)

    I was simply going to replace the Groupings in the having clause with a single Grouping_ID(ColA, ColB, ColC) != 7, but correct use of Grouping sets is miles better!

  2. I was able to use this to find user visit statistics by user, month, and total across everything SUPER easy. Thank you for the explanation!

    SELECT
    UserName,
    VisitDate,
    COUNT(*) AS Visits
    FROM SiteVisits
    GROUP BY GROUPING SETS (
    (UserName, VisitDate),
    (UserName),
    (VisitDate),
    ()
    )

Leave a Comment

Your email address will not be published. Required fields are marked *