Troubleshooting THREADPOOL Waits

(Be sure to checkout the FREE SQLpassion Performance Tuning Training Plan – you get a weekly email packed with all the essential knowledge you need to know about performance tuning on SQL Server.)

Today I want to talk about a specified wait type that can be sometimes very hard to troubleshoot: the THREADPOOL wait type. This wait type is specific to the internal thread scheduling mechanism inside SQL Server.

As you might know SQL Server internally uses its own “operating system” to implement thread scheduling and memory management – the SQLOS. The SQLOS provides a set of worker threads that are used to execute queries that are submitted to SQL Server. The problem is that those worker threads can be exhausted sometimes – maybe because of a Locking/Blocking scenario. In this case SQL Server isn’t able to execute any more requests inside the engine, because no free worker threads are available any more.

You can configure through the max worker threads option (through sp_configure) how many worker threads are available to SQLOS. By default the value of this option is 0, which means SQL Server itself decides how many worker threads are used. The number of the available worker threads depends on the processor architecture (x32, x64) and the number of CPUs that you have available. Books Online (see http://msdn.microsoft.com/en-us/library/ms187024.aspx) has the following table that describes the various possible combinations:

Number of CPUs

x32

x64

<= 4 Processors

256

512

8 Processors

288

576

16 Processors

352

704

32 Processors

480

960

 

You can also check through the column max_workers_count in sys.dm_os_sys_info how many worker threads your SQL Server instance is using. With the following example I want to demonstrate now how you can get thread starvation in SQL Server and how you can resolve it.

CAUTION: DON’T DO THE FOLLOWING STEPS ON A PRODUCTION SYSTEM!!!

In the first step we create a new database and a simple table for our sample scenario. I want to be unique as possible; therefore I use unique table and column names ;-)

USE master
GO
 
CREATE DATABASE ThreadPoolWaits
GO
 
USE ThreadPoolWaits
GO
 
–- Create a new test table (this one will be unique on earth – hopefully…)
CREATE TABLE [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]
(
	[MyUniqueColumnName1_F67DAC4A-C202-49BB-829A-071130BF1160] INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
	[MyUniqueColumnName2_438B7184-B476-48A4-B5FA-DC34B99FA0A4] INT
)
GO
 
– Insert a record
INSERT INTO [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]
VALUES (1)
GO

As you can see from the previous listing, our table definition is very simple. In the next step I’m creating a new stored procedure that encapsulates some read workload inside that database.

–- Create a stored procedure that encapsulates a read workload
CREATE
PROCEDURE MyCustomUniqueStoredProcedureName_ReadWorkload
AS
BEGIN
	SELECT * FROM [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]
END
GO

Finally we are beginning a new transaction, making an update to the previous created table, and never committing that transaction:

–- Begin a transaction that never commits…
BEGIN TRANSACTION
	UPDATE [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49] WITH (TABLOCKX)
		SET [MyUniqueColumnName2_438B7184-B476-48A4-B5FA-DC34B99FA0A4] = 2
GO

By now you have a pending transaction inside your SQL Server instance that holds an exclusive table lock because of the TABLOCKX query hint. When you now execute the previous created stored procedure from a different session, the stored procedure is waiting because it needs to acquire a Shared lock for reading the record:

EXEC MyCustomUniqueStoredProcedureName_ReadWorkload
GO

You can also check this Locking/Blocking scenario through the DMV sys.dm_db_tran_locks, which will show you a waiting request:

SELECT
	resource_associated_entity_id,
	request_mode,
	request_status,
	request_session_id
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID(‘ThreadPoolWaits’) AND resource_type = ‘OBJECT’
GO

In this simple scenario with just one waiting query inside SQL Server, nothing special occurs. But how will SQL Server react when you use a massive amount of queries that is larger than the possible max worker threads? Let’s try it. I’m using for this task the ostress.exe utility that is part of the RML Tools that are provided free by Microsoft (see here). In my configuration (x64, 8 CPUs) SQL Server uses internally 576 worker threads. So I’m simulating with ostress.exe 600 concurrent connections to SQL Server through the following command line:

ostress.exe
-Q”EXEC ThreadPoolWaits.dbo.MyCustomUniqueStoredProcedureName_ReadWorkload”
-n600

When you execute that command prompt, it takes a few seconds until ostress.exe has created the 600 worker threads, and nothing special happens. Seems so far so good. Let’s now analyze the situation and create a new connection through SQL Server Management Studio to your SQL Server instance. Oops, the connection can’t be made:

SQL Server isn’t responding anymore!!! This makes sense, because we have now exhausted the maximum available worker threads. Almost all submitted requests to SQL Server are currently waiting for a Shared Lock (LCK_M_S wait type), and all the other ones can’t be enqueued inside SQL Server because no worker threads are available anymore (THREADPOOL wait type). But how can we troubleshoot that scenario now? Restarting SQL Server isn’t always really an option…

Fortunately Microsoft provides since SQL Server 2005 the so-called Dedicated Admin Connection (DAC). With this connection you are able to log into SQL Server even when you have worker thread starvation or high memory pressure, because the DAC has its own

  • Scheduler
  • Memory Node
  • TCP Port

inside SQLOS. Therefore SQL Server is able to accept and serve the DAC connection – even in high sophisticated troubleshooting scenarios like this one. But there is only one available DAC for the whole SQL Server instance, which must be also taken into account! When you want to connect through the DAC, you have to use the following syntax: admin:<servername> where <servername> is the name of your SQL Server instance. So let’s start up a new instance of SQL Server Management Studio and log into SQL Server through the DAC. Please be aware that you don’t connect the Object Explorer through the DAC, because the DAC isn’t supported for the Object Explorer. You can only use a simple query window that connects through the DAC:

When you have successfully connected through the DAC, you are now able to run your diagnostic queries. You must be also aware that the DAC doesn’t support Auto Completion, because Auto Completion uses its own connection in the background – so you have to know the DMVs you want to use for troubleshooting J. In the first step we can check sys.dm_exec_requests which requests are currently waiting inside SQL Server:

SELECT
	r.command,
	r.sql_handle,
	r.plan_handle,
	r.wait_type,
	r.wait_resource,
	r.wait_time,
	r.session_id,
	r.blocking_session_id
FROM sys.dm_exec_requests r INNER JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
WHERE s.is_user_process = 1
GO

In my configuration this query returns me 547 rows, which means 547 user requests (WHERE s.is_user_process = 1) are currently waiting inside SQL Server. But we have started our stored procedure with 600 concurrent users… sys.dm_exec_requests shows you only those requests that have an underlying worker thread inside SQL Server, because those requests are currently executing inside SQL Server. But where are the others? Those other pending requests are only accessible through sys.dm_os_waiting_tasks – they have a wait type of THREADPOOL:

SELECT * FROM sys.dm_os_waiting_tasks
WHERE wait_type = ‘THREADPOOL’
GO

They are just waiting until a new thread from the worker pool gets free. But in our scenario every thread is currently suspended and bound to a user request, therefore those requests will wait forever! You can also see the THREADPOOL wait type only inside sys.dm_os_waiting_tasks and never in sys.dm_exec_requests, because a request in sys.dm_exec_requests is already bound to a worker thread inside SQL Server. When you look back to the output of sys.dm_exec_requests you can also see the columns session_id and blocking_session_id at the end of the result set. Those 2 columns are showing you the blocking chain inside SQL Server:

As you can see almost every session has a blocking_session_id of 56, and the session_id 56 has a blocking_session_id of 52. The session_id 52 is our head blocker! Let’s further analyze the session of the head blocker:

-– Analyze the head blocker session
SELECT
	login_time,
	[host_name],
	[program_name],
	login_name
FROM sys.dm_exec_sessions
WHERE session_id = 52
GO
 
–- Analyze the head blocker connection
SELECT
	connect_time,
	client_tcp_port,
	most_recent_sql_handle
FROM sys.dm_exec_connections
WHERE session_id = 52
GO

The most interesting column is here most_recent_sql_handle from sys.dm_exec_connections which we can use to retrieve the executed SQL statement. When you use the DMF sys.dm_exec_sql_text and pass in the value of the most_recent_sql_handle column you are able to retrieve the executed SQL statement:

SELECT FROM sys.dm_exec_sql_text(0x01001A0015BE5D3170CC4483000000000000000000000000
GO

This SELECT statement will return you the following string:

– Begin a transaction that never commits… BEGIN TRANSACTION UPDATE [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49] WITH (TABLOCKX) SET
[MyUniqueColumnName2_438B7184-B476-48A4-B5FA-DC34B99FA0A4] = 2

This is our initial query where we left our transaction open. By now we have tracked down the problematic query that had leaded to THREADPOOL and LCK_M_S waits, and finally we can kill that session through the DAC:

KILL 52
GO

Now it takes several seconds until the session is killed, and finally our blocking scenario is gone. By now your SQL Server is again able to accept new connections and will work in the usual way. When you are done with troubleshooting through the DAC connection, don’t forget to close that special connection, because there is only one DAC available for the whole SQL Server instance! When you afterwards look into the SQL Server Error Log, you will also see a message like the following one:

New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 1680 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the “max worker threads” configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 96%.

This is also an indication that you had worker thread starvation inside your SQL Server instance. As a side-effect this scenario has also leaded to so-called Deadlocked Schedulers, which Amit Banerjee describes in more detail here. When SQL Server encounters Deadlocked Schedulers, SQL Server will write out a Stack Dump to your SQL Server LOG directory. You can also see a Deadlocked Scheduler inside the SQL Server Error Log:

Sometimes I see customers which just blindly reconfigure the max worker threads setting inside SQL Server, because they think they need more worker threads for their workload. But as with almost every problem in SQL Server, there is some root cause which has leaded to the problem that you are currently seeing. In our scenario the root cause was an uncommitted transaction, which leaded to a blocking scenario, which leaded to thread starvation, which finally leaded to an unresponsive SQL Server. As you can see from this explanation, there could be a very long chain until you find your root cause – so keep that in mind for your next troubleshooting scenarios.

To make it easy for you to reproduce that special scenario, you can download the needed scripts from here.

Thanks for reading!

-Klaus

Query Memory Spills

(Be sure to checkout the FREE SQLpassion Performance Tuning Training Plan – you get a weekly email packed with all the essential knowledge you need to know about performance tuning on SQL Server.)

When you sometimes look at Execution Plans, you can see that the SELECT operator has sometimes a so-called Memory Grant assigned. This Memory Grant is specified in kilobytes and is needed for the query execution, when some operators (like Sort/Hash operators) in the Execution Plans need memory for execution – the so called Query Memory.

This query memory must be granted by SQL Server before the query is actually executed. The Query Optimizer uses the underlying Statistics to determine how much Query Memory must be acquired for a given query. The problem is now, when the Statistics are out-of-date, and SQL Server underestimates the processed rows. In this case, SQL Server will also request to less Query Memory for the given query. But when the query actually executes, the query can’t resize its granted Query Memory, and can’t just request more. The query must operate within the granted Query Memory. In this case, SQL Server has to spill the Sort/Hash-Operation into TempDb, which means that our very fast in-memory operation becomes a very slow physical On-Disk operation. SQL Server Profiler will report those Query Memory Spills through the events Sort Warnings and Hash Warning.

Unfortunately SQL Server 2008 (R2) provides you no events through Extended Events to track down those Memory Spills. In SQL Server 2012 this will change, and you will have additional events available inside Extended Events for troubleshooting this problem. In this posting I will illustrate you with a simple example how you can reproduce a simple Query Memory Spill because of out-of-date statistics. Let’s create a new database and a simple test table inside it:

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO

-- Create a new database
CREATE DATABASE InsufficientMemoryGrants
GO

SUE InsufficientMemoryGrants
GO

-- Create a test table
CREATE TABLE TestTable
(
   Col1 INT IDENTITY PRIMARY KEY,
   Col2 INT,
   Col3 CHAR(4000)
)
GO

-- Create a Non-Clustered Index on column Col2
CREATE NONCLUSTERED INDEX idxTable1_Column2 ON TestTable(Col2)
GO

The table TestTable contains the primary key on the first column, and the second column is indexed through a Non-Clustered Index. The third column is a CHAR(4000) column which isn’t indexed. We will use that column afterwards for an ORDER BY, so that the Query Optimizer must generate an explicit Sort Operator inside the Execution Plan. In the next step I’m just inserting 1500 records, where we have an even data distribution across all the values in the second column – each value exists once in our table.

With that test data prepared we can now execute a simple query, which must use a separate Sort operator in the Execution Plan:

DECLARE @x INT
 
SELECT @x = Col2 FROM TestTable
WHERE Col2 = 2
ORDER BY Col3
GO

This query uses the following Execution Plan:

When you look into SQL Server Profiler and you have enabled the above mentioned events, nothing happens. You can also use the DMV sys.dm_io_virtual_file_stats and the columns num_of_writes and num_of_bytes_written to find out if there was some activity in TempDb for a given query. This works – of course only – when you are the only person who currently uses the given SQL Server instance:

-- Check the activity in TempDb before we execute the sort operation.
SELECT num_of_writes, num_of_bytes_written FROM 
sys.dm_io_virtual_file_stats(DB_ID('tempdb'), 1)
GO

-- Select a record through the previous created Non-Clustered Index from the table.
-- SQL Server retrieves the record through a Non-Clustered Index Seek operator.
-- SQL Server estimates for the sort operator 1 record, which also reflects
-- the actual number of rows.
-- SQL Server requests a memory grant of 1024kb - the sorting is done inside
-- the memory.
DECLARE @x INT

SELECT @x = Col2 FROM TestTable
WHERE Col2 = 2
ORDER BY Col3
GO

-- Check the activity in TempDb after the execution of the sort operation.
-- There was no activity in TempDb during the previous SELECT statement.
SELECT num_of_writes, num_of_bytes_written FROM 
sys.dm_io_virtual_file_stats(DB_ID('tempdb'), 1)
GO

Again you will see no activity in TempDb, which means the output from sys.dm_io_virtual_file_stats is the same before and after executing the query. The query takes on my system around 1ms of execution time.

Now we have a table with 1500 records, means that our table needs 20% + 500 rows of data changes so that SQL Server will update the statistics. If you’re doing the math, we need 800 data modifications in that table (500 + 300). So let’s just insert 799 additional rows where the value of the second column is 2. We are just changing the data distribution and SQL Server WILL NOT update the statistics, because one additional data change is still missing, until Update Statistics is triggered automatically inside SQL Server!

-- Insert 799 records into table TestTable
SELECT TOP 799 IDENTITY(INT, 1, 1) AS n INTO #Nums
FROM
master.dbo.syscolumns sc1

INSERT INTO TestTable (Col2, Col3)
SELECT 2, REPLICATE('x', 4000) FROM #nums
DROP TABLE #nums
GO

When you now execute the same query again, SQL Server will now spill the Sort operation to TempDb, because SQL Server will only request a Query Memory Grant of 1024 kilobytes, which is estimated for just 1 record – the memory grant has the same size as before:

-- Check the activity in TempDb before we execute the sort operation.
SELECT num_of_writes, num_of_bytes_written FROM 
sys.dm_io_virtual_file_stats(DB_ID('tempdb'), 1)
GO

-- SQL Server estimates now 1 record for the sort operation and requests a memory grant of 1.024kb for the query.
-- This is too less, because actually we are sorting 800 rows!
-- SQL Server has to spill the sort operation into TempDb, which now becomes a physical I/O operation!!!
DECLARE @x INT

SELECT @x = Col2 FROM TestTable
WHERE Col2 = 2
ORDER BY Col3
GO

-- Check the activity in TempDb after the execution of the sort operation.
-- There is now activity in TempDb during the previous SELECT statement.
SELECT num_of_writes, num_of_bytes_written FROM
sys.dm_io_virtual_file_stats(DB_ID('tempdb'), 1)
GO

If you check the Estimated Number of Rows in the Execution Plan, they are differing completely from the Actual Number of Rows:

When you track the query execution time you will also see that the execution time increased – in my case it increased up to 200ms, which is a huge difference to the earlier execution time of just 1ms! The DMV sys.dm_io_virtual_file_stats will also report some activity inside TempDb, which is also the evidence that SQL Server spilled the Sort operation into TempDb! SQL Server Profiler will also show you a Sort Warning event.

If you now insert one additional record, and you run the query again, everything is fine, because SQL Server will trigger the Statistics Update and estimate the Query Memory Grant correctly:

-- Insert 1 records into table TestTable
SELECT TOP 1 IDENTITY(INT, 1, 1) AS n INTO #Nums
FROM master.dbo.syscolumns sc1
 
INSERT INTO TestTable (Col2, Col3)
SELECT 2, REPLICATE(‘x’, 2000) FROM #nums
DROP TABLE #nums
GO
 
-- Check the activity in TempDb before we execute the sort operation.
SELECT num_of_writes, num_of_bytes_written FROM
sys.dm_io_virtual_file_stats(DB_ID(‘tempdb’), 1)
GO
 
-- SQL Server has now accurate statistics and estimates 801 rows for the sort operator.
-- SQL Server requests a memory grant of 6.656kb, which is now enough.
-- SQL Server now spills the sort operation not to TempDb.
-- Logical reads: 577
DECLARE @x INT
 
SELECT @x = Col2 FROM TestTable
WHERE Col2 = 2
ORDER BY Col3
GO
 
-- Check the activity in TempDb after the execution of the sort operation.
-- There is now no activity in TempDb during the previous SELECT statement.
SELECT num_of_writes, num_of_bytes_written FROM
sys.dm_io_virtual_file_stats(DB_ID(‘tempdb’), 1)
GO

So this is a very basic example which shows you how you can reproduce Sort Warnings inside SQL Server – not really a magic. Adam Machanic (http://sqlblog.com/blogs/adam_machanic) has done last week at the SQLPASS Summit in Seattle a whole session about Query Memory at a 500 level, where he went into more details on this complicated topic, especially in combination with Parallel Execution Plans.

Thanks for reading!

-Klaus

SQL Server Indexing Day, 18. Jänner 2012 in Wien

(Be sure to checkout the FREE SQLpassion Performance Tuning Training Plan – you get a weekly email packed with all the essential knowledge you need to know about performance tuning on SQL Server.)

Sie wollen SQL Server Abfragen um 100% oder gar 1000% beschleunigt haben? Gleichzeitig wollen Sie Hardware– und Lizenzkosten Ihrer SQL Server Anwendungen und deren Infrastruktur senken? Ein Wunschdenken? – Keineswegs!

Mit einer gut indizierten SQL Server Datenbank können Sie Geschwindigkeitsvorteile von mehreren 100% für bestimmte Abfragen gewinnen. Als positiven Seiteneffekt senken Sie die Hardwareanforderungen an Ihre SQL Server Infrastruktur und dadurch direkt die Hardware– und Lizenzkosten, die für SQL Server Installation anfallen.

Am 18. Jänner 2012 ist es endlich soweit:

Klaus Aschenbrenner – unabhängigier SQL Server Experte und internationaler Konferenzsprecher – führt den SQL Server Indexing Day in Wien durch.

In 1 intensiven Tag vermittelt Ihnen Klaus Aschenbrenner das notwendige Know-How, um Indizes im SQL Server effektiv einsetzen zu können – wie oben genannt – Performance Optimierungen von mehreren 100% durchführen zu können.

In diesem intensiven Workshop-Tag werden Sie im ersten Schritt den Aufbau von SQL Server Datenbanken und der verschiedenen zur Verfügung stehenden Indizes näher kennenlernen. Aufbauend auf diesem Wissen werden Sie eine Reihe von unterschiedlichen Performance Tuning Techniken näher kennenlernen, die Sie durch den Einsatz des jeweiligen Indexes einsetzen können. Abgerundet wird der Tag mit einem Ausblick auf die Index-Wartung, und über welche Fallstricke Sie hier in Produktionsumgebungen stolpern können.

Inhalte:

  • SQL Server 2008 Database Internals
    • Database Structure
    • Table Metadata
    • Data Page Details
    • Record Storage
  • Index Architecture
    • Index Architecture
    • Index Structures
    • Clustered vs. Non-Clustered Indexes
    • Index Dependencies
    • Clustered Key Problems
  • Index Tuning
    • Search Arguments
    • Index Access Methods
    • Covering Index
    • Tipping Point
    • Index Intersection
    • Filtered Indexes
    • Indexed Views
  • Index Maintenance
    • Statistics
    • Fragmentation
    • Index Rebuild
    • Index Reorganize

Weitere Informationen zum Event können Sie sich unter http://SQLpassion.at/Events.html ansehen, bzw. können Sie den dazugehörigen Folder unter http://SQLPassion.at/IndexingDay.pdf herunterladen.

Veranstaltungsort:

Hotel & Palais Strudlhof

Pasteurgasse 1

A-1090 Wien

http://www.strudlhof.at

Inkludierte Leistungen:

  • Workshop von 09:00 – 17:00
  • 2 Kaffeepausen (Vormittag, Nachmittag)
  • Gemeinsames Mittagessen
  • Early-Bird Preis bis zum 31.10.2011: € 349,00 exkl Ust.
  • Danach: € 399,00 exkl Ust.

Anmeldung:

  • Anmeldungen werden unter http://SQLpassion.at/Events.html entgegengenommen.
  • Ausgestellte Rechungen müssen bis zum Beginn der Veranstaltung einbezahlt werden, damit eine Teilnahme an der Veranstaltung gewährleistet werden kann.

-Klaus