An ever-increasing Clustered Key value doesn’t scale!

(Be sure to check out the FREE SQLpassion Performance Tuning Training Plan, where you get all the essential knowledge you need to know about performance tuning on SQL Server delivered week by week via email.)

Do you know all the best practices for choosing a Clustered Key value? A good Clustered Key value should have the following 3 properties:

  • Narrow
  • Static
  • Ever Increasing

Let’s have a more detailed look at all 3 properties, and at why an ever-increasing value doesn’t really scale in SQL Server.

Narrow

A Clustered Key value should be as small as possible. Why? Because it takes up space, and the Clustered Key value also serves as the logical pointer in the leaf level of every Non-Clustered Index. A very wide Clustered Key value therefore also means larger Non-Clustered Indexes. If you have defined a Non-Unique Non-Clustered Index (which is normally the case), the Clustered Key value is also part of the navigation structure of your Non-Clustered Index, so the overhead in your index gets very large. And our goal is to minimize the overhead on our index pages: overhead that we have to pay for in physical storage, and also in the Buffer Pool, where SQL Server caches the index pages read from storage.

Normally you choose a technical key value (like an INT/BIGINT data type) instead of a natural key value. Over the years I have seen Clustered Key value lengths of 100 bytes and more (combinations of LastName, FirstName, SocialSecurityNumber, etc.). Believe me – you are just wasting memory! You don’t have to do that. Choose a technical key, and you are fine.

Static

Because the Clustered Key value is replicated in every Non-Clustered Index, your Clustered Key value should never ever change! Otherwise SQL Server has to transparently maintain EVERY Non-Clustered Index defined on your table in your UPDATE Execution Plan. You are again just introducing additional computing overhead that you don’t need. Use your CPU cycles for more important stuff. As you know, natural key values can change (like a LastName column when someone gets married).

A technical key value (like an INT IDENTITY) can’t change (by default). Therefore the logical pointer (in the form of the Clustered Key value) in all your Non-Clustered Indexes remains stable, without any need to change it – forever!

Ever Increasing

The 3rd and final important property of a “good” Clustered Key value is that the chosen column should give you ever-increasing values. Why? Because you always add new records at the end of your Clustered Index, and therefore you avoid expensive Page Splits (in terms of CPU cycles and Transaction Log overhead) and Index Fragmentation. In 99% of all cases you will be fine with an ever-increasing value like an INT IDENTITY column, but there are some scenarios where this approach can lead to serious scalability problems. Imagine a workload where a lot of different users are permanently inserting into the same table with an ever-increasing Clustered Key value. Just think for a second about Logging/Auditing tables.

Let’s have a more detailed look at what happens internally in SQL Server when you read and write pages in memory. When SQL Server accesses certain memory structures (like pages that are stored in the Buffer Pool), these memory accesses must be synchronized among multiple threads. You can’t write concurrently to the same page in memory, and while one thread writes to a page, no other thread can read that page at the same time. In traditional concurrent programming you solve that problem with mutexes, like a Critical Section: certain code paths are just mutually exclusive. A Latch in SQL Server is almost the same as a Critical Section, and latches are used to synchronize threads/queries with each other. Every time you read a page, the worker thread has to acquire a Shared Latch (SH); every time you write a page, the worker thread has to acquire an Exclusive Latch (EX). And both latches are incompatible with each other.

When you now perform an INSERT statement, the worker thread exclusively latches the page where the INSERT occurs. In the meantime no one else can read from or write to this page. With an ever-increasing Clustered Key value this approach doesn’t really scale, because you are always inserting your records at the end of your Clustered Index. Therefore your parallel threads/queries contend for an Exclusive Latch on the same last page of your Clustered Index. As a side-effect SQL Server executes your INSERT statements serially – one INSERT after the next one. You have hit the famous Last Page Insert Latch Contention. Let’s have a look at the following picture.

Last Page Insert Latch Contention
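
Before we look at solutions: if you suspect this problem on your own system, the accumulated wait statistics are a quick first check. The following query is only a minimal sketch; high PAGELATCH_EX waits (not PAGEIOLATCH_EX, which is I/O related) are the typical symptom of last page contention.

-- Sketch: page latch waits indicate contention on in-memory pages,
-- e.g. on the last page of a Clustered Index
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
ORDER BY wait_time_ms DESC
GO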

With the best practice of an ever-increasing Clustered Key value you have a single hotspot at the end of your Clustered Index. The smaller your records are, the more contention you are introducing here. How can you solve that problem? Easy: spread your INSERT statements across the whole B-Tree structure of the Clustered Index. There are various approaches for achieving that:

  • Use a random Clustered Key value (like a UNIQUEIDENTIFIER). But be aware of the side-effects: a larger logical pointer in EVERY Non-Clustered Index, Page Splits…
  • Implement Hash Partitioning, if you are using the Enterprise Edition of SQL Server (see the sketch after this list).
  • Eliminate latching through the use of In-Memory OLTP, which is part of SQL Server 2014.
  • Use a so-called Reverse Index. Unfortunately SQL Server doesn’t provide that kind of index out of the box like Oracle does, but you can implement it on your own.
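
The following is a minimal sketch of the hash partitioning approach mentioned above; the table, column, and partition names as well as the number of buckets (4) are just assumptions for illustration. A persisted computed column derived from the IDENTITY value spreads the INSERT statements across several B-Tree “last pages” instead of a single one:

-- Sketch: hash partitioning an assumed logging table into 4 buckets
CREATE PARTITION FUNCTION pf_HashBucket (TINYINT)
AS RANGE LEFT FOR VALUES (0, 1, 2)
GO

CREATE PARTITION SCHEME ps_HashBucket
AS PARTITION pf_HashBucket ALL TO ([PRIMARY])
GO

CREATE TABLE dbo.AuditLog
(
	ID BIGINT IDENTITY(1,1) NOT NULL,
	LogText NVARCHAR(256) NOT NULL,
	-- Persisted computed column, used only to spread the INSERTs
	HashBucket AS CAST(ID % 4 AS TINYINT) PERSISTED NOT NULL
)
GO

-- The partitioning column must be part of the (unique) Clustered Index key
CREATE UNIQUE CLUSTERED INDEX CUIX_AuditLog
ON dbo.AuditLog (ID, HashBucket)
ON ps_HashBucket (HashBucket)
GO

With 4 buckets the INSERT workload now contends for 4 different last pages instead of just 1. The trade-off is that range scans on the ID column now have to touch every partition.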

Summary

In 99% of all cases you will be fine with a narrow, static, and ever-increasing Clustered Key value like an INT IDENTITY column. But in some rare scenarios where you need a huge amount of parallel INSERT statements (like Logging/Auditing tables), you can hit Last Page Insert Latch Contention with that approach. If you hit that specific problem, you have to leave your comfort zone and make sure that you spread the INSERT statements across your whole B-Tree structure. You are mainly fighting against a limitation of how multi-threaded access happens in a traditional B-Tree structure.

I hope this blog posting has given you a good insight into why ever-increasing Clustered Key values can hurt the scalability of your tables.

If you are more interested in how to choose the right Clustered Key value, I’m also offering a 1-hour training video through the SQLpassion Online Academy.

Thanks for reading!

-Klaus

Improving Query Performance by using correct Search Arguments

In today’s blog posting I want to talk about a very specific performance problem related to indexing in SQL Server.

The Problem

Imagine the following simple query, which you have already seen hundreds of times in your SQL Server life:

-- Results in an Index Scan
SELECT * FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2005 AND MONTH(OrderDate) = 7
GO

With that simple query, we request sales information for a specific month in a specific year. Not very complicated. Unfortunately that query doesn’t perform very well – even with a Non-Clustered Index on the column OrderDate. When you look at the execution plan, you can see that the Query Optimizer has chosen the Non-Clustered Index that is defined on the column OrderDate, but unfortunately SQL Server performs a complete Scan of the index, instead of performing an efficient Seek operation.

Non Clustered Index Scan

This isn’t really a limitation of SQL Server; it’s simply the way relational databases work and think :-) . As soon as you apply an expression (function calls, calculations) to an indexed column (a so-called Search Argument), the database engine HAS TO SCAN that index instead of performing a seek operation.

The Solution

To get a scalable seek operation in the execution plan, you have to rewrite your query in a way that avoids the calls to the YEAR and MONTH functions:

-- Results in an Index Seek
SELECT * FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20050701' AND OrderDate < '20050801'
GO

As you can see, the rewritten query returns the same result, but we have eliminated the function calls on the indexed column. When you look at the execution plan, you can see that SQL Server performs a seek operation – in this specific case a so-called Partial Range Scan: SQL Server seeks to the first value and scans until it hits the last value of the requested range. If you have to call functions in the context of indexed columns, you should always make sure that the function call is performed on the right hand side of the column in your query. Let’s have a look at a concrete example. The following query casts the indexed column CreditCardID to the data type CHAR(4):

-- Results in an Index Scan
SELECT * FROM Sales.SalesOrderHeader
WHERE CAST(CreditCardID AS CHAR(4)) = '1347'
GO

When you have a more detailed look at the execution plan, you can see that SQL Server again scans the whole Non-Clustered Index. That’s not really scalable as your table gets larger and larger. If you perform that conversion on the right hand side of the indexed column in the query, you again eliminate the function call on the indexed column, and SQL Server is able to perform a seek operation:

-- Results in an Index Seek
SELECT * FROM Sales.SalesOrderHeader
WHERE CreditCardID = CAST('1347' AS INT)
GO

Another nice example where you can run into the same problem is the use of the CASE expression in SQL Server. Let’s have a look at the following query, where the column PersonType is indexed through a Non-Clustered Index:

-- Results in an Index Scan
SELECT
	CASE PersonType
		WHEN 'IN' THEN 1
		WHEN 'EM' THEN 2
		WHEN 'SP' THEN 3
		WHEN 'SC' THEN 4
		WHEN 'VC' THEN 5
		WHEN 'GC' THEN 6
	END AS ConvertedPersonType
FROM Person.Person
GO

Here you are again calling a function on an indexed column, in our case the CASE expression. You can’t directly see that function call in the T-SQL query, but internally it’s nothing more than a function call. Again you get a Scan of the Non-Clustered Index instead of a Seek operation. How can you avoid that problem? You can rewrite the query to get rid of the CASE expression. One option is a join against a Common Table Expression that stores the needed lookup values. Let’s have a look at the rewritten query:

-- Results in an Index Seek
WITH LookupCTE AS
(
	SELECT * FROM
	(
		VALUES
			(N'IN', 1),
			(N'EM', 2),
			(N'SP', 3),
			(N'SC', 4),
			(N'VC', 5),
			(N'GC', 6)
	) tbl (PersonType, Value)
)
SELECT cte.Value FROM Person.Person p
INNER JOIN LookupCTE cte ON cte.PersonType = p.PersonType
GO

With that approach, SQL Server is able to perform a seek operation on the indexed column PersonType.

Summary

As you have seen in this blog posting, it is very important that you do NOT call any function directly or indirectly on your indexed columns. Otherwise SQL Server has to scan your index instead of performing an efficient seek operation. And scans will never ever scale when your table gets more and more rows.

Please feel free to leave a comment, if you want to share other good examples where you have encountered this specific behavior.

Thanks for reading!

-Klaus

Goodbye SQL Server, Hello FileMaker…

The last weeks were very time consuming for me, because I had to make a very tough decision: staying independent and providing high quality SQL Server consulting and training services, or getting employed (again) and being part of a greater team to influence how the IT industry will look in the future.

After a very long time of thinking, I have made my decision, and starting today – on April 1 – I am joining Apple. I will be the lead architect for FileMaker, the relational database on the Apple platform. I will be in charge of the whole dev team, leading the various development teams in FileMaker, and also coordinating the necessary changes within Mac OS X. Our goal will be the integration of FileMaker into the OS, making HFS a transactional file system – the vision of WinFS, which was discarded by Microsoft…

Why me? I have a lot of experience in the area of relational databases (I know how to write basic SELECT * queries, which was also the toughest part of the SQLMCM certification). And of course the other most important factor why Apple has chosen me over other candidates is my hands-on experience in writing operating systems. As you might know, I have written my own OS, which boots from a FAT12 partition (I found some code somewhere on the Internet…), reads a key from the keyboard, and writes it out. Everything done in x32 Protected Mode, without any BIOS calls! Impressive, huh? Apple was really amazed by that piece of software, so we will also evaluate whether that code can be used as the future basis for Mac OS X 15.1.

Of course, Apple screened a lot of different candidates for this position, and I had very hard competition. In the final round there were only 3 of us left, and I felt like a contestant in the TV show “Deutschland sucht den Superstar” (Germany searches for the superstar). I had never heard anything about the other 2 persons; the 1st one was a guy called Bob, and the name of the other one was Conor.

Bob performed really amazingly, but he had too little low-level experience. To understand that a little bit better, I want to give you an example: one task that we had to perform during the final round was to join 2 tables together. A very simple task when you know the concept of a cartesian product. But Bob opened WinDbg and did everything with a graphical hex-dump. He violated the rules, because we were not allowed to use any GUI front-end… And Conor? He talked the whole day about trees and cardinality estimations… So my solution with the cartesian product was chosen over their approaches. I’m the lucky one!

So wish me luck in my new role, and let’s change the way relational databases work.

Thanks for reading and a happy 1st April!

-Klaus

SQL Server Quickie #12 – Nested Loop Operator

Today I have uploaded the 12th SQL Server Quickie to YouTube. This time I’m talking about the Nested Loop Join Operator in SQL Server. You can find the script used during the demonstration here as a download.

Thanks for watching!

-Klaus

The Illusion of Updateable Clustered ColumnStore Indexes

In today’s blog posting I want to concentrate in more detail on the Updateable Clustered ColumnStore Index that is introduced with SQL Server 2014. Before we get into the details, I want to give you a brief overview of their first appearance in SQL Server 2012, and their limitations.

ColumnStore Indexes in SQL Server 2012

The introduction of ColumnStore Indexes in SQL Server 2012 was one of the hottest new features (besides AlwaysOn). If chosen right, they can provide an impressive performance improvement for Data Warehousing workloads. Unfortunately, they had 2 big limitations:

  • There was only support for Non-Clustered ColumnStore Indexes
  • As soon as you had created a ColumnStore Index on a table, the underlying table became read-only, and no changes to the data were allowed anymore

Both limitations were a huge bummer for some customers. Imagine you had a table with 300 GB of traditional row store data. With a ColumnStore Index it’s possible to compress that data down to a size of 30 GB. But SQL Server 2012 only allowed a Non-Clustered ColumnStore Index, meaning you had to store your data twice: once in the traditional row store format, and once in the new ColumnStore format. That’s a huge waste of storage, because your queries will (hopefully) only use your Non-Clustered ColumnStore Index.

And as soon as you had created your Non-Clustered ColumnStore Index, you were not allowed to change the underlying table data anymore – your table was just read only! Of course, there were some workarounds for this problem, like Partition Switching, but you still needed to implement that on your own…

ColumnStore Indexes in SQL Server 2014

Things are changing now with SQL Server 2014, because Microsoft has resolved the above mentioned issues – with some magic and illusion: SQL Server 2014 provides you with an Updateable Clustered ColumnStore Index! Let’s have a more detailed look at how this magic and illusion happens internally in SQL Server.

The first and most important fact is that a direct update of a ColumnStore Index is not possible! It would be too time consuming to do the complete decompression and compression on the fly during your INSERT, UPDATE, and DELETE transactions. Therefore SQL Server 2014 uses help from some magic: Delta Stores and Delete Bitmaps. Let’s have a more detailed look at both concepts.

Every time you run an INSERT statement, the new record isn’t directly inserted into the ColumnStore Index – the record is inserted into a Delta Store. The Delta Store itself is nothing more than a traditional B-Tree structure, with all its pros and cons. When you afterwards read from the ColumnStore Index, SQL Server returns the data from the compressed ColumnStore Index AND also from the Delta Store.

When you run a DELETE statement, again nothing happens in the compressed ColumnStore Index. The only thing that happens is that the record is deleted logically through a Delete Bitmap. Every record in the ColumnStore Index has a corresponding bit in that Delete Bitmap. When you read your ColumnStore Index again, SQL Server just discards the rows that are marked as deleted in the Delete Bitmap.

And running an UPDATE statement just means inserting the new version into the Delta Store, and marking the old version as deleted in the Delete Bitmap. Easy, isn’t it? The following picture (source: http://research.microsoft.com/apps/pubs/default.aspx?id=193599) shows this concept in more detail.

ColumnStore

Because of the Delta Stores and the Delete Bitmap it seems that your ColumnStore Index is updateable, but in reality the compressed ColumnStore data itself is immutable. There is also a background process called the Tuple Mover, which runs regularly and finally pushes your changes asynchronously into the compressed ColumnStore Index.

In addition you can now finally define a ColumnStore Index in SQL Server 2014 as Clustered. This means that you don’t need to keep your data in the traditional row store format anymore. Just create your table and create a Clustered ColumnStore Index on it. With this approach you can achieve huge space savings in your storage, because everything is now compressed. When you work in more detail with the concepts of relational databases, Clustered always means Sorted. But with a Clustered ColumnStore Index this statement is not true: when you create a Clustered ColumnStore Index, there is NO sort order in your data! Just be aware of this tiny little fact ;-)
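
As a small illustration, the following sketch (table and index names are just made up) creates such a Clustered ColumnStore Index and then looks at its Row Groups, where you can also see the Delta Stores described above:

-- Sketch: a table that exists only as a Clustered ColumnStore Index
CREATE TABLE dbo.FactSales
(
	SalesID BIGINT NOT NULL,
	ProductID INT NOT NULL,
	Quantity INT NOT NULL,
	Amount MONEY NOT NULL
)
GO

CREATE CLUSTERED COLUMNSTORE INDEX CCSI_FactSales ON dbo.FactSales
GO

-- Newly inserted rows land in an open Delta Store until the Tuple Mover
-- compresses them into ColumnStore segments
SELECT row_group_id, state_description, total_rows, deleted_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactSales')
GO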

Summary

Updateable Clustered ColumnStore Indexes are a nice magical illusion in SQL Server 2014. Don’t get me wrong: I really like the possibilities introduced with this new feature, but you have to understand how the feature is implemented internally if you want to make the best use of it.

Thanks for reading

-Klaus

Performance improvements in SQL Server 2014

SQL Server 2014 will be an amazing release regarding all the various performance enhancements that are part of it. Over the last months I have already done a huge amount of work and testing on the latest CTP version of SQL Server 2014, and I have already blogged about my experiences. In this blog posting I want to give you an overview of the various performance related enhancements that are introduced.

If you are interested in a more personal face-2-face training about the performance enhancements in SQL Server 2014, I’m also running my one-day SQL Server 2014 Performance Tuning Day around Europe in May & June, where you will learn the nifty details about these various enhancements.

Buffer Pool Extensions

The idea behind Buffer Pool Extensions is very simple: expand the Buffer Pool with a paging file that is stored on very fast storage, like an SSD drive. The Buffer Pool Extensions come in quite handy if you don’t have the ability to physically add more RAM to your database server.

Resource Governor

Resource Governor was first introduced back in SQL Server 2008, but it wasn’t really a mature technology, because you had no possibility to govern I/O operations on the storage level, and you also had no way to limit the size of the Buffer Pool for a specific workload group. With SQL Server 2014 things are changing, because you can now throttle I/O operations. Limiting Buffer Pool usage is still not possible, but hey, who knows what comes in SQL Server 2016 ;-) .
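
A minimal sketch of the new I/O throttling; the pool name and the IOPS values are just assumptions:

-- Sketch: limiting the IOPS per volume for a Resource Governor pool
CREATE RESOURCE POOL ReportingPool
WITH (MIN_IOPS_PER_VOLUME = 50, MAX_IOPS_PER_VOLUME = 500)
GO

ALTER RESOURCE GOVERNOR RECONFIGURE
GO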

Lock Priorities

As you might know, the Enterprise Edition of SQL Server gives you Online operations, or as I call them, “Almost Online Operations”. They are almost online, because internally SQL Server still has to acquire some locks, which can lead to blocking situations. For that reason SQL Server 2014 introduces Lock Priorities, where you can control how SQL Server should react if such a blocking situation occurs.
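
A sketch of how such a Lock Priority looks in T-SQL, using the WAIT_AT_LOW_PRIORITY option of an Online Index Rebuild (the table name and duration are just examples):

-- Sketch: the rebuild waits at low priority for up to 5 minutes and then
-- aborts itself instead of blocking other sessions
ALTER INDEX ALL ON Sales.SalesOrderHeader
REBUILD WITH
(
	ONLINE = ON
	(
		WAIT_AT_LOW_PRIORITY (MAX_DURATION = 5 MINUTES, ABORT_AFTER_WAIT = SELF)
	)
)
GO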

Clustered ColumnStore Indexes

One of the hottest enhancements in SQL Server 2014 is the introduction of Clustered ColumnStore Indexes, an amazing new concept for dealing with ColumnStore data in SQL Server. In addition, the Clustered ColumnStore Index can also be changed directly – without using tricks like Partition Switching.

In-Memory OLTP

With In-Memory OLTP Microsoft claims that the performance of your workload can be improved up to 100x. Awesome! Everything is now stored directly in memory, without touching your physical storage anymore (besides the transaction log, if you want). In addition, In-Memory OLTP is based on so-called Lock-Free Data Structures, meaning that locking, blocking, latching, and spinlocking are just gone. Of course, there are side-effects and even limitations with this promising approach…
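
As a small taste, here is a sketch of a durable Memory Optimized Table; all names are assumptions, and the database needs a MEMORY_OPTIMIZED_DATA filegroup before you can create it:

-- Sketch: a Memory Optimized Table with a hash index on the primary key
CREATE TABLE dbo.SessionState
(
	SessionID INT NOT NULL
		PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
	Payload NVARCHAR(1000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
GO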

Delayed Transactions

It doesn’t matter how good the throughput of your workload is, the final barrier and bottleneck is almost always the transaction log. Because of the Write-Ahead Logging mechanism used by SQL Server, a transaction must always be written physically to the transaction log before the transaction is committed. When your transaction log is on slow storage, your performance and throughput will suffer. For that reason SQL Server 2014 implements so-called Delayed Transactions.
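
A sketch of how you use Delayed Transactions (also called Delayed Durability); the database name is just an example, and be aware that you trade a small window of possible data loss for the additional throughput:

-- Sketch: allow lazy transaction log flushes in this database
ALTER DATABASE AdventureWorks2014 SET DELAYED_DURABILITY = ALLOWED
GO

BEGIN TRANSACTION
	-- ... some data modifications ...
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON)
GO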

Cardinality Estimation

Cardinality Estimation is one of the most important things in a relational database, because these estimations are fed into the Query Optimizer, whose job it is to produce a good-enough execution plan. With SQL Server 2014 Microsoft has rewritten the cardinality estimator completely from scratch to overcome some limitations based on the history of this very important component.
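
The new cardinality estimator is tied to the database compatibility level, and you can fall back to the old behaviour on a per-query basis through a documented trace flag. A minimal sketch (the database name is assumed):

-- Sketch: compatibility level 120 enables the new cardinality estimator
ALTER DATABASE AdventureWorks2014 SET COMPATIBILITY_LEVEL = 120
GO

-- Per-query fallback to the legacy cardinality estimator
SELECT * FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20050701' AND OrderDate < '20050801'
OPTION (QUERYTRACEON 9481)
GO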

Thanks for reading

-Klaus

Buffer Pool Extensions in SQL Server 2014

In today’s blog posting I want to talk about the Buffer Pool Extensions that are introduced with SQL Server 2014. As you know, the Buffer Pool is one of the main memory consumers in SQL Server. When you read data from your storage, the data is cached in the Buffer Pool. SQL Server also caches Execution Plans in the Plan Cache, which is also part of the Buffer Pool. The more physical memory you have, the larger your Buffer Pool will be (configured through the Max Server Memory setting).

A lot of SQL Server customers are dealing with the problem that physical memory is restricted in their database servers: when all memory slots are already occupied, how do you add additional memory to the physical server? Of course, you can migrate to a larger server, but that’s another story… The solution to this specific problem is the introduction of Buffer Pool Extensions in SQL Server 2014. With the help of the Buffer Pool Extensions, SQL Server introduces another layer in the memory hierarchy. Let’s have a look at the following picture:

Memory Hierarchy with Buffer Pool Extensions

As you can see, you have the Buffer Pool itself at the top, which is very fast (regarding latency times), and at the bottom you can see our traditional storage, which is mainly slow. The Buffer Pool Extensions just sit between the two – between the traditional Buffer Pool and our storage. The Buffer Pool Extensions themselves consist of one simple file (the so-called Extension File) that should be stored on very fast storage – like an SSD drive. The Extension File works much like the page file in the Windows OS. Instead of adding additional physical memory to your database server, you just configure an Extension File on an SSD drive – that’s it!

Before I talk about the configuration and the concrete use of the Buffer Pool Extensions, I want to talk a little bit about the architecture and design behind them. The traditional Buffer Pool of SQL Server always differentiates between clean and dirty pages. A clean page is a page which has the same content as the page in the storage. A dirty page has been changed in memory, but hasn’t yet been written to the storage. Roughly every minute the so-called CHECKPOINT process writes dirty pages out to the storage, which means that our dirty page becomes a clean page again.

The Buffer Pool Extensions themselves are only used when the Buffer Pool of SQL Server gets under memory pressure. Memory pressure means that SQL Server needs more memory than is currently available. In that case the Buffer Pool evicts the pages that were least recently used; SQL Server uses a so-called Least Recently Used (LRU) policy here. If you have configured an Extension File, then SQL Server will write these pages into it, instead of writing them directly out to our slow storage. If the page is a dirty one, then the page will also be concurrently written to the physical storage (through an asynchronous I/O operation). Therefore you can’t lose any data when you are dealing with the Buffer Pool Extensions. At some point in time your Extension File will also fill up completely. In that case SQL Server has to evict older pages from the Extension File (again through the LRU policy), and finally writes them to the physical storage. The Extension File just acts as an additional layer between the Buffer Pool and the storage itself.

Let’s now have a look at how we can configure the Buffer Pool Extensions in SQL Server 2014. For this, SQL Server offers you the command ALTER SERVER CONFIGURATION SET BUFFER POOL EXTENSION. Let’s have a more detailed look at how you use it:

ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
(
   FILENAME = 'd:\ExtensionFile.BPE',
   SIZE = 10 GB
)
GO

The first restriction that you hit here is the fact that the Extension File must be at least the same size as the Buffer Pool itself. If you specify a smaller file size, you will get a lovely error message from SQL Server:


Msg 868, Level 16, State 1, Line 1
Buffer pool extension size must be larger than the current memory allocation threshold 1596 MB. Buffer pool extension is not enabled.

The next restriction that you will definitely hit is that you can’t change the size of the Extension File while SQL Server is running. For example, when you want to grow the Extension File, you have to disable the Buffer Pool Extensions and re-enable them with the new size, as the following sketch shows. Of course you will have a performance degradation during this operation, because you are disabling one important caching layer used by SQL Server!
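
A sketch of such a resize operation, reusing the Extension File from above with an assumed new size:

-- Sketch: growing the Extension File requires an OFF/ON cycle
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION OFF
GO

ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
(
   FILENAME = 'd:\ExtensionFile.BPE',
   SIZE = 20 GB
)
GO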

Be aware of this fact when you are planning a deployment of the Buffer Pool Extensions for your production environment!!!

In addition you are also not able to reduce the size of the Extension File; the new file must always be larger than the previous one. Otherwise you again get an error message:


Msg 868, Level 16, State 1, Line 3
Buffer pool extension size must be larger than the current memory allocation threshold 4096 MB. Buffer pool extension is not enabled.

The whole configuration of the Buffer Pool Extensions can also be retrieved through the Dynamic Management View sys.dm_os_buffer_pool_extension_configuration.
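
A quick sketch of querying that DMV; the selected columns are just the ones I find most interesting:

-- Sketch: checking the current state and size of the Buffer Pool Extensions
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration
GO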

When should you use the Buffer Pool Extensions? Microsoft’s recommendation is to use them for read-heavy OLTP workloads. You should not look at the Buffer Pool Extensions when you are dealing with a DWH/BI related workload – an Extension File doesn’t make sense there. And when we are talking about the Extension File itself, you should dedicate a very fast SSD to it! Traditional rotational hard disks are a no-go here!

Thanks for reading!

-Klaus

SQL Server Quickie #11 – Non Clustered Indexes

Today I have uploaded the 11th SQL Server Quickie to YouTube. This time I’m talking about Non Clustered Indexes in SQL Server. You can find the script used during the demonstration here as a download.

Thanks for watching!

-Klaus

Lock Escalations

In today’s blog posting I want to talk about Lock Escalations in SQL Server. Lock Escalation is an optimization technique used by SQL Server to control the number of locks that are held within the Lock Manager of SQL Server. Let’s start with a description of the so-called Lock Hierarchy in SQL Server, because that’s the reason why the concept of Lock Escalation exists in a relational database like SQL Server.

Lock Hierarchy

The following picture shows you the lock hierarchy used by SQL Server:

Lock Hierarchy in SQL Server

As you can see from the picture, the lock hierarchy starts at the database level and goes down to the row level. You always have a Shared Lock (S) on the database level itself. When your query is connected to a database (e.g. USE MyDatabase), this Shared Lock prevents the database from being dropped, and prevents backups from being restored over it. Underneath the database level, you have locks on the table, on the pages, and on the records when you are performing an operation.

When you are executing a SELECT statement, you have an Intent Shared Lock (IS) at the table and page level, and a Shared Lock (S) on the record itself. When you are performing a data modification statement (INSERT, UPDATE, DELETE), you have an Intent Exclusive or Intent Update Lock (IX or IU) at the table and page level, and an Exclusive or Update Lock (X or U) on the changed records. SQL Server always acquires locks from top to bottom to prevent so-called Race Conditions, when multiple threads try to acquire locks concurrently within the locking hierarchy. Imagine now what the lock hierarchy would look like when you perform a DELETE operation against 20,000 rows of a table. Let’s assume that a row is 400 bytes long, which means that 20 records fit onto one 8kb page:

Without Lock Escalation

You have one S lock on the database, 1 IX lock on the table, 1,000 IX locks on the pages (the 20,000 records are spread across 1,000 pages), and finally 20,000 X locks on the records themselves. In sum you have acquired 21,002 locks for the DELETE operation. Every lock in SQL Server needs 96 bytes of memory, so we are looking at around 1.9 MB of locks just for 1 simple query. This will not scale indefinitely when you run multiple queries in parallel. For that reason SQL Server implements the so-called Lock Escalation.
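
If you want to see these locks with your own eyes, you can keep such a DELETE transaction open and look at sys.dm_tran_locks. The following query is a minimal sketch, and the session id is of course just an example:

-- Sketch: summarize the locks held by the session running the open DELETE
SELECT resource_type, request_mode, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = 55 -- assumed session id of the DELETE
GROUP BY resource_type, request_mode
GO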

Lock Escalations

As soon as you have more than 5,000 locks on one level in your locking hierarchy, SQL Server escalates these many fine-grained locks into one simple coarse-grained lock. By default SQL Server *always* escalates to the table level. This means that your locking hierarchy from the previous example looks like the following after the Lock Escalation has been successfully performed.

With Lock Escalation

As you can see, you have only one big lock on the table itself. In the case of the DELETE operation, you have one Exclusive Lock (X) at the table level. This will hurt the concurrency of your database in a very negative way! Holding an Exclusive Lock at the table level means that no other session is able to access that table anymore – every other query will just block. When you are running your SELECT statement in the Repeatable Read Isolation Level, you also hold your Shared Locks until the end of the transaction, which means you will get a Lock Escalation as soon as you have read more than 5,000 rows. The result here is one Shared Lock on the table itself! Your table is temporarily read-only, because every other data modification on that table will be blocked!

There is also a misconception that SQL Server will escalate from the row level to the page level, and finally to the table level. Wrong! Such a code path doesn’t exist in SQL Server! By default SQL Server *always* escalates directly to the table level. An escalation policy to the page level just doesn’t exist. If your table is partitioned (Enterprise Edition only!), then you can configure an escalation to the partition level. But you have to test your data access pattern very carefully here, because a Lock Escalation to the partition level can cause deadlocks. Therefore this option is also not enabled by default.

Since SQL Server 2008 you can also control how SQL Server performs the Lock Escalation – through the ALTER TABLE statement and the property LOCK_ESCALATION. There are 3 different options available:

  • TABLE
  • AUTO
  • DISABLE

-- Controlling Lock Escalation
ALTER TABLE Person.Person
SET
(
	LOCK_ESCALATION = AUTO -- or TABLE or DISABLE
)
GO

The default option is TABLE, which means that SQL Server *always* performs the Lock Escalation to the table level – even when the table is partitioned. If your table is partitioned and you want Partition Level Lock Escalation (because you have tested your data access pattern and you don’t cause deadlocks with it), then you can change the option to AUTO. AUTO means that the Lock Escalation is performed to the partition level if the table is partitioned, and otherwise to the table level. And with the option DISABLE you can completely disable Lock Escalation for that specific table. But disabling Lock Escalation is not the best option, because the Lock Manager of SQL Server can then consume a huge amount of memory if you are not very careful with your queries and your indexing strategy.

Conclusion

Lock Escalation in SQL Server is mainly a nightmare. How will you delete more than 5,000 rows from a table without running into Lock Escalations? You can disable Lock Escalation temporarily, but you have to be very careful with that. Another option (the one I suggest) is to perform your DELETE/UPDATE statements in a loop as separate, smaller transactions: DELETE/UPDATE fewer than 5,000 rows at a time, so that you prevent Lock Escalations, as in the sketch below. As a very nice side-effect your huge transaction will be split into multiple smaller ones, which will also help you with any Auto Growth issues that you may have with your transaction log.
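
A minimal sketch of that looping approach; the table name and the filter are assumptions, and each DELETE runs as its own auto-committed transaction, so the locks are released after every batch:

-- Sketch: delete in small batches to stay below the escalation threshold
WHILE 1 = 1
BEGIN
	DELETE TOP (4000) FROM dbo.SalesArchive
	WHERE OrderDate < '20100101'

	IF @@ROWCOUNT = 0
		BREAK -- nothing left to delete
END
GO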

Thanks for reading

-Klaus

Improving SQL Server Performance by using Instant File Initialization

Today I want to talk about a very specific topic in SQL Server – Instant File Initialization. If you have Instant File Initialization enabled for your SQL Server instance, you can get a tremendous performance improvement – under specific circumstances. Instant File Initialization defines how the SQL Server engine interacts with the Windows OS when allocating new space in data files.

The Problem

When you allocate new space in a data file in the default configuration of SQL Server, SQL Server has to internally call Win32 API functions that zero-initialize the newly allocated NTFS clusters. This means that every byte of newly allocated space is overwritten with zero values (0x0). This behavior prevents the problem of accessing old data that was previously stored physically in the same NTFS clusters. The zero-initialization takes place during the following SQL Server operations:

  • Creating a new database
  • Auto growing a database
  • Restore of a database backup

When you create a database file of 50 GB, SQL Server first has to initialize that new block of data with 50 GB of zero values, and this can take a lot of time. Let’s have a look at the following CREATE DATABASE statement.

-- Create a new 50 GB database
CREATE DATABASE TestDatabase ON PRIMARY
( 
	NAME = N'TestDatabase',
	FILENAME = N'G:\SQL\DATA\TestDatabase.mdf' , 
	SIZE = 51200000KB , 
	FILEGROWTH = 1024KB
)
LOG ON 
(
	NAME = N'TestDatabase_log', 
	FILENAME = N'G:\SQL\Log\TestDatabase_log.ldf' ,
	SIZE = 1024KB , 
	FILEGROWTH = 10%
)
GO

As you can see from the code, I’m creating a database file of 50 GB here. In my default SQL Server configuration this statement takes around 16 seconds, because SQL Server writes 50 GB of zeros to the storage through a Win32 API function. Imagine what happens if you have a corrupt database (e.g. 50 GB in size) and you want to restore a backup. What people normally do in the first step is delete the corrupted database. This means that the database files are gone, and SQL Server has to recreate them during the restore operation:

  1. SQL Server first creates an “empty” database of 50 GB, where the data file is zero-initialized in the NTFS file system
  2. As the final step the backup is restored, and SQL Server writes 50 GB of data into the data files again

As you can see, with this approach you are writing 100 GB of data to your storage! If you just restore your backup *over* the existing files, SQL Server can skip the first step and only writes 50 GB of data to your storage – you have achieved a performance improvement of 100%!

Instant File Initialization

If you don’t want SQL Server to do the zero-initialization of your data files, you can reconfigure SQL Server. If you grant the service account under which SQL Server is running the privilege Perform Volume Maintenance Tasks, SQL Server will skip the zero-initialization of the data files after you have restarted SQL Server. As I have said, this only applies to data files – log files are ALWAYS zero-initialized in SQL Server! There is NO WAY around that!!! Without the zero-initialization of the log file, the crash recovery process would have no idea where to stop when the log file has wrapped around. Crash Recovery stops where it finds zero values in the header of the next log record to be processed.

You can grant the permission Perform Volume Maintenance Tasks to the service account of SQL Server through secpol.msc.

Granting the permission through secpol.msc

After a restart, SQL Server is now able to skip the zero-initialization of data files. When I run the CREATE DATABASE statement from above again, it takes around 250ms – that’s a huge difference! The side-effect? You are able to retrieve the old content that was stored in the allocated NTFS clusters through the DBCC PAGE command:

-- Enable DBCC trace flag 3604
DBCC TRACEON(3604)
GO

-- Dump out a page somewhere in the data file
-- A hex dump is working here
DBCC PAGE (TestDatabase, 1, 1000, 2)
GO

As you can see, I’m just dumping out a page somewhere in my data file. In that case it can happen that SQL Server returns you some garbage data – data that was previously stored in the newly allocated NTFS clusters and that has no relevance to SQL Server:

Page Dump of old data

By granting this permission to SQL Server, you are mainly opening a security hole: users (with the right permissions) are able to retrieve old data that was previously stored in the file system. So you have to think very carefully about whether you grant that permission to SQL Server or not.

If you want to know whether your SQL Server is running with this permission or not, you can enable the trace flags 3004 and 3605. With these trace flags enabled, SQL Server reports in the error log which files are zero-initialized. When you afterwards create a new database and the permission wasn’t granted to SQL Server, you can see from the error log that data AND log files were zero-initialized:

Data & Log Initialization

If SQL Server has the permission Perform Volume Maintenance Tasks, you can see from the error log that ONLY the log file was zero-initialized:

Log Initialization
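
If you want to try that check on your own instance, the following sketch enables the trace flags globally and creates a small test database (the database name is just an example):

-- Sketch: report zero-initialization activity to the error log
DBCC TRACEON (3004, 3605, -1)
GO

CREATE DATABASE ZeroInitTest
GO

-- Read the error log to see which files were zero-initialized
EXEC sp_readerrorlog
GO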

The Windows Internals

But what happens under the hood in the Windows OS when you grant the permission Perform Volume Maintenance Tasks to the service account under which SQL Server is running? With this permission enabled (it’s internally called SE_MANAGE_VOLUME_NAME by the Win32 API), SQL Server is able to call the Win32 API function SetFileValidData. As you can see from the documentation, the process that calls this function has to have the permission SE_MANAGE_VOLUME_NAME. When that function is called by SQL Server, the function just sets the so-called High Watermark of the file – the file is simply expanded WITHOUT overwriting the old content in the underlying NTFS clusters! As the documentation says:

“The SetFileValidData function allows you to avoid filling data with zeros when writing nonsequentially to a file. The function makes the data in the file valid without writing to the file. As a result, although some performance gain may be realized, existing data on disk from previously existing files can inadvertently become available to unintended readers.”

“If SetFileValidData is used on a file, the potential performance gain is obtained by not filling the allocated clusters for the file with zeros. Therefore, reading from the file will return whatever the allocated clusters contain, potentially content from other users. This is not necessarily a security issue at this point, because the caller needs to have SE_MANAGE_VOLUME_NAME privilege for SetFileValidData to succeed, and all data on disk can be read by such users.”

As I have said earlier, it’s mainly a security consideration whether you enable that specific permission for your SQL Server instance or not.

Summary

Should you now enable Instant File Initialization for your SQL Server instance, or not? It dep… When you are both the SQL Server AND the Windows administrator, it’s a good idea to grant that permission, because as a Windows admin you always have access to the file system anyway. But when you have dedicated Windows and SQL Server admins, it can be the case that the Windows admin doesn’t trust you, and that you don’t get that permission for your SQL Server instance. In that case SQL Server will always zero-initialize the data and log files…

Thanks for reading

-Klaus