The DMV Diaries: sys.dm_os_threads

Confession. I’ve been intending for ages to do a series on the guts of DMVs. We all use them, I often find answers in their inner workings, and they admirably serve the purpose of surfacing things which would otherwise remain buried.

This isn’t a promise to hammer out one per day, or even one per week, but one has to start somewhere. And since I keep getting seduced by SQLOS threading internals, where better to start than the one listing threads, sys.dm_os_threads?

My reference version here is SQL Server 2016 SP1, although I’ll make one interesting comparison to 2014.

The dm_os_threads data source

Broadly speaking, a DMV presents just another iterator that can be plugged into a query plan. The execution engine calls GetRow() repeatedly until it reaches the end, and the iterator emits rows. The only unusual thing is that the ultimate source of the data may have nothing to do with the storage engine.

Now if you asked me to guess where in the world we’d find a list of all threads to iterate over, I would have expected that we’d start with the NodeManager, iterating over all SOS_Nodes, and then for each of them iterating over its collection of associated SystemThreads. After all, we have a guaranteed 1:1 correspondence between threads and SystemThreads, and I figured that all SystemThreads enlist themselves into a parent SOS_Node upon creation. No-brainer, right?

Turns out that this guess would have been completely wrong, and the reason it would have been a broken implementation will become apparent when we look at the started_by_sqlservr column.

To understand the true data source, we have to take a step back into the formative years of a SQLOS thread. Two reference points:

  • Unsung SQLOS: the SystemThread
  • The early life of a SQLOS thread
  • The MiniSOSThreadResourcesMgr revisited

    When a SystemThread gets constructed within the DLLmain thread-attach callback, there are some resources which are so crucial that it’s pointless to continue if they can’t be provided:

    • A memory allocation of 312 bytes to contain the SystemThread object itself
    • The Windows event that will be used in the co-operative context switching mechanism, for which we want a handle ready
    • A memory allocation of 2816 bytes in which we’ll construct a Worker object. Although this thread could eventually be bound to a different Worker, there should be a pool of at least as many Workers as threads.
    • A memory allocation of 984 bytes for an SOS_Task object. For similar reasons, there should be at least one of these floating around per thread.

    The SystemThread allocation itself is handled by the SystemThreadPool, a specialised memory allocator. Because this is stuff that must be in place very early in the boot process, before the SQLOS memory hierarchy is built up, this is one of the few cases where HeapAlloc() is directly invoked.

    The other three are packaged up together in a MiniSOSThreadResources instance, with this neat package being dispensed by the MiniSOSThreadResourcesMgr, another HeapAlloc-happy class.

    These objects don’t live independent lives. Each dispensed MiniSOSThreadResources instance is enlisted in a linked list owned by the MiniSOSThreadResourcesMgr, and contains a pointer to the SystemThread it was created to support. Additionally, the MiniSOSThreadResources instance contains the thread ID (also stored in the SystemThread itself) and a pointer to that thread’s Thread Environment Block (which isn’t).

    Now the creation of a SystemThread, and the association with a MiniSOSThreadResources instance, isn’t something that a programmer has control over, short of removing the thread-attach callback. Whatever reason a thread was created for, and whether or not it plays by SQLOS rules, it will have these accessories if it was created in the SQL Server process. If a thread didn’t start life in the standard entry point function SchedulerManager::ThreadEntryPoint it may not end up enlisted in an SOS_Node, but it sure will be discoverable through the MiniSOSThreadResourcesMgr.

    Here is a fairly simple picture of how these things fit together:

    MiniSOSThreadResources + SystemThread

    The global thread iterator

    You can see where this is heading. When sys.dm_os_threads wants to iterate over all the threads in the process, this global list belonging to the MiniSOSThreadResourcesMgr is the golden data source, since each MiniSOSThreadResources contains a pointer to a SystemThread, and there is a perfect 1:1 correspondence between the two sets of objects.

    The mechanism of iteration serves as a reminder of how to engage hygienically with linked lists, and how the goal of “reasonably accurate and nonblocking” will trump “logically perfect” when it comes to low-level DMVs.

    As a starting point, remember that it isn’t safe to traverse a linked list without exclusively locking it. Now if we wanted to iterate over the members one by one, with an unknown period passing between retrieving items, holding that lock from the first row to the last one is a non-starter, because it could interference in the core workings of SQLOS.

    Instead, here is how things work at high level, leaning on the fact that the list count is stored (“denormalised”) in the list head:

    1. Keep track of how many rows have been produced.
    2. For every GetRow() call, start traversing afresh from the list head.
    3. Every time, go one element further than last time and call that the next item.
    4. Once we have either traversed back around to the list head, or produced the number of rows the list currently claims to contain, we deem it to be the end of the data set.

    Doing it this way may sound wasteful in having to count members from the beginning on every row retrieval – this is what Joe Spolsky dubbed the “Schlemiel the painter” algorithm. However, that small cost comes at the massive benefit of not leaving the structure locked between item retrievals – on each iteration the spinlock is only held up to the point of finding the next target item. Let’s also be clear: not locking it is simply not an option for this source.

    A delightful side effect is of course that we can theoretically skip a member, or read it twice. Those are simply the odds we live with.

    Once the next item (a MiniSOSThreadResources instance) has been found, a “reporting” copy is made of a chunk of its Thread Environment Block and the complete SystemThread. Now everything we need is in those snapshots – or derivable from them – and the risk of inconsistent reads or interference with normal operations is behind us.

    DMV members from the Thread Environment Block

    The following columns in sys.dm_os_threads are served straight from the Thread Environment Block, i.e. user-mode thread state belonging not to SQLOS but to the underlying OS:

    • stack_base_address
    • stack_end_address
    • stack_bytes_committed
    • stack_bytes_used

    Additionally, there are three columns which used to get populated from the TEB in SQL Server 2014, but are now hard-coded to null in 2016. Make of it what you will, but these are ones that were added in later Windows versions and live at higher addresses in the TEB:

    • locale
    • is_impersonating
    • is_waiting_on_loader_lock

    DMV members from the kernel

    Some interesting stuff can only be found out through kernel calls, made using thread handles retrieved through calling OpenThread() using the thread IDs we already have available. Yes, this means that each row retrieved for this DMV involves multiple kernel transitions on Windows, but we’ll have to stay tuned to learn what this really means in SQLPAL.

    From GetThreadTimes() we get the following:

    • creation_time
    • kernel_time
    • usermode_time

    From GetThreadContext() we get instruction_address. Kind of.

    See, for a running thread this isn’t going to be a meaningful answer. And in recognition of that, some clever wag coded up a special case for the current thread, i.e. the one querying the DMV: the hard-coded response coming back is the address of sqldk!SOS_OS::GetThreadControlRegisters(), the function wrapping GetThreadContext().

    Finally, the priority column comes from a straight call to GetThreadPriority(), using the handle we had within the SystemThread all along.

    DMV members from the SystemThread

    The very first one, thread_address, is a “duh” case: it is the address where the actual SystemThread lives. This should of course match self_address which is stored within the SystemThread to support simple checking for an uninitialised of corrupted SystemThread.

    affinity comes straight from the SystemThread, and the related processor_group is derived by reading the full GroupAffinity struct within it. These columns are all straightforward SystemThread members:

    • os_thread_id
    • status – hope to document it some day
    • token
    • fiber_data
    • thread_handle
    • event_handle
    • scheduler_address
    • worker_address
    • fiber_context_address

    Finally, started_by_sqlservr is an interesting piece of derivation. If the list item within the SystemThread, intended to link it into a SOS_Node, is null, we know that the thread didn’t start its life the normal SQLOS way, and hence this column value is derived as false. Otherwise it is true.

    Further reading

    In hunting down evolution of the 64-bit Thread Environment Block layout, I found Geoff Chappell’s excellent reference very helpful. Definitely worth a bookmark.

The DMV diaries: Worker, task, request and session state

We have all been there. You believe that a certain status (e.g. is the order shipped?) lives in a simple database column, only to find that it comes from a view built on a view with all kinds of creative CASE statements. And it may look ugly, but at the end of the day, you have to admit that it successfully serves the purpose of exposing business data in the way that users expect to see it.

Guess what: The “V” in “DMV” exists for a similar reason. Today I’ll be whizzing through the various ways in which the status of a running piece of work is exposed to us in sys.dm_os_workers, sys.dm_os_tasks, sys.dm_exec_requests, and sys.dm_exec_sessions.
Continue reading “The DMV diaries: Worker, task, request and session state”