The ambiguity of the ORDER BY in SQL Server
In today’s blog posting I want to talk about a very controversial and complex topic in SQL Server: the ambiguity of the ORDER BY clause.
Views and ORDER BY
Let’s start with a very simple SELECT statement.
-- A very simple SELECT statement SELECT * FROM Person.Person ORDER BY LastName GO
As you can see from the previous listing, we just want to return the rows from the table Person.Person ordered by the column LastName. And because we want to be able to reuse that SQL statement as easily as possible, we finally put it into a view definition, like the following one.
-- This doesn't work CREATE VIEW v_Persons AS SELECT * FROM Person.Person ORDER BY LastName GO
But as you can see, SQL Server is not able to create that view, and just returns an error message:
Msg 1033, Level 15, State 1, Procedure v_Persons, Line 27 The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
The error message tells you that an ORDER BY clause is not allowed in a view when you don’t use a TOP, OFFSET or FOR XML expression. Based on that information in the error message, we can fix the problem very easily by adding a TOP 100 PERCENT clause into the view definition: we just return everything from the table, and we are finally allowed to use the ORDER BY in the view definition.
-- Let's make it work! CREATE VIEW v_Persons AS SELECT TOP 100 PERCENT * FROM Person.Person ORDER BY LastName GO
The creation of the view has now worked without any problems! Now let’s perform a SELECT statement against the view.
SELECT * FROM v_Persons GO
The SELECT statement itself works, but when you look at the returned data crazy things have happened: the returned data is not really sorted by the column LastName – SQL Server has sorted the rows by the column BusinessEntityID – the Clustered Key column on that table!
Is this a bug in SQL Server? No, not really – it’s “by design”! Let’s explain why. First of all you have to know that the ORDER BY clause is used in 2 different contexts in SQL – the programming language itself:
- With the ORDER BY clause you can define in which sort order the rows are returned back to your client application
- In addition the ORDER BY clause is used to define which rows are returned from a TOP expression.
The most important thing you have to know is that you define a so called Set with a view, inline functions, derived tables, sub queries, and common table expressions. A set is a mathematical concept, which is part of the Set Theory on which relational databases – like SQL Server – are based. And a set is by its nature unsorted. Therefore you are not allowed to use the ORDER BY clause in combination with a view definition – as you have seen above. If you try to do it, SQL Server disallows that operation and gives you an error message.
Of course you can use the ORDER BY in combination with the TOP expression. But you are mainly fooling SQL Server and yourself, because the ORDER BY doesn’t tell SQL Server in which sort order the rows should be returned back to the client application. The ORDER BY just tells SQL Server which rows should be returned from the TOP expression. Imagine you use a TOP 10 PERCENT. What is the first 10 percent of that table? You need to specify a sort order to define that in a deterministic way.
And because we have to use the ORDER BY in combination with a TOP 100 PERCENT, the Query Optimizer has not actually introduced an explicit sort operator in the execution plan. A TOP 100 PERCENT just means everything, and therefore the Top operator doesn’t need sorted input in the execution plan as you can see in the following picture.
In this specific example we get back the rows sorted in the order in which they are read from the underlying data structure. It’s up to the Storage Engine of SQL Server in which sort order the rows are returned. In our case we read the rows from the Clustered Index. Therefore we get the rows back sorted by the column BusinessEntityID, which is the Clustered Key value column of that index.
Now let’s change the view definition and return only 10 percent of the rows from the table Person.Person. Again, we have specified an ORDER BY clause.
-- Alter the view ALTER VIEW v_Persons AS SELECT TOP 10 PERCENT * FROM Person.Person ORDER BY LastName GO
When you now look at the result set, you can see that the rows are returned sorted by the column LastName. This makes perfect sense now, because you can see an explicit sort operator in the execution plan that presorts the rows on the column LastName, so that the TOP operator can finally return the top 10 percent of the provided input rows.
You can of course return the top 10 percent of the rows in a different sort order back to your client application by adding a final ORDER BY clause after you have referenced your view.
SELECT * FROM v_Persons ORDER BY FirstName GO
Now when you look at the execution plan, you can see 2 sort operators in the plan.
The first (right) sort operator is used to presort the data for the TOP operator (to return the top 10 percent). And the second (left) sort operator is finally used to define the sort order in which the rows are returned back to the client application. Be aware of all these implications when you try to enforce an ORDER BY clause in a view definition by adding a TOP 100 PERCENT – you mainly fool SQL Server…
TOP without ORDER BY
Another problem that arises is that a TOP expression *without* an ORDER BY clause doesn’t provide you with a deterministic result. I want to give you a concrete example of this specific problem. Let’s imagine the following SELECT statement:
SELECT TOP 1 LastName FROM Person.Person GO
This SQL statement just returns the first row from the table Person.Person with a TOP 1 expression – without explicitly defining a sort order with an ORDER BY clause. The sort order is implicitly based on the index chosen in the execution plan. In this specific example SQL Server returns the value “Abbas” to you as a result, because this is the last name of the first record available in a Non-Clustered Index that was chosen by the Query Optimizer in the execution plan.
Therefore which row is returned as the first one from this query depends on which index is chosen in the execution plan. Imagine now that we disable this Non-Clustered Index.
-- Let's deactivate this index ALTER INDEX [IX_Person_LastName_FirstName_MiddleName] ON Person.Person DISABLE GO
When you afterwards rerun the original SELECT statement, SQL Server returns the value “Sánchez“, because this is the first record of the Clustered-Index that is now chosen in the execution plan. SQL Server just returns the first row with the BusinessEntityID value of 1 from the Clustered Index.
Therefore you are dealing with a non-deterministic result: your result depends on the index chosen in the execution plan! You can very easily make the result of the query deterministic by adding an explicit ORDER BY clause. In that case the ORDER BY clause is used for the TOP expression to make the result deterministic – and as a result you get a Sort (Top N Sort) operator in the execution plan.
It doesn’t matter from which index SQL Server reads the rows in the execution plan – the Sort (TOP N Sort) operator just physically presorts the rows in the execution plan, and returns the first N rows from it – easy, isn’t it?
The ORDER BY clause is not really the easiest concept in SQL – the programming language itself. As you have learned in this blog posting, the ORDER BY clause is used in 2 different contexts, and therefore you always have to consider in which context you are using it. And never ever fool SQL Server and yourself by adding a TOP 100 PERCENT to a view definition – it doesn’t reflect the sort order of the final result set.
Thanks for your time,