The DMV Diaries: sys.dm_os_workers

Following hot on the heels of sys.dm_os_threads, today we look into the worker objects built on top of them. This is intended to be supplementary to the official documentation on sys.dm_os_workers, so I’ll comment on that where appropriate, rather than repeating it. My reference version is 2016 SP1.

Basic plumbing of related objects

worker_address is of course simply the address of the Worker class instance we’re looking at. It is bound to the SOS_Scheduler living at scheduler_address and (once out of the SystemThreadDispatcher) to the SystemThread class instance at thread_address.

Once bound to an SOS_Task in the owning scheduler’s WorkDispatcher, it will expose that object’s address in task_address.

If we’re running in fiber mode, fiber_address points to the fiber data containing the user-mode thread state.
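
As a quick sanity check of this plumbing at the DMV level – purely illustrative, and not how the DMV itself is put together – the address columns can be walked with ordinary joins:

-- Tie each worker to its scheduler, thread and (when bound) task via the address columns.
SELECT  w.worker_address,
        w.state,
        s.scheduler_id,
        t.os_thread_id,
        ta.session_id,
        ta.task_state
FROM    sys.dm_os_workers AS w
JOIN    sys.dm_os_schedulers AS s  ON s.scheduler_address = w.scheduler_address
LEFT JOIN sys.dm_os_threads  AS t  ON t.thread_address    = w.thread_address
LEFT JOIN sys.dm_os_tasks    AS ta ON ta.task_address     = w.task_address;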

Now this isn’t intended to be about memory, but memory issues tend to touch everything we look at anyway, so a short diversion is in order.

We saw in my previous post on sys.dm_os_threads that each thread gets an associated MiniSOSThreadResources object which already contains a worker. Beyond that initial thread bootstrapping though, the factory method Worker::CreateWorker() is called to create useful Workers. One of the first things that function does is to allocate the memory (2816 bytes) in which to construct the Worker instance. This memory is provided by a memory object which is specially created for the occasion, and the pointer to the memory object is stored within the Worker; this is what gets exposed as memory_object_address.

What’s interesting in the memory hierarchy is that this is a memory object which both “belongs” to the Worker and is its parent. It is possible for other objects and functions to milk it for further “worker-local” allocations, although that would be an exploration for another day.

State and status

The state enum is a familiar one; here are its integer values:

0 - Init
1 - Running
2 - Runnable
3 - Suspended

As described in The DMV diaries: Task, Request and Session State, the value exposed in the DMV lacks the layered semantics of e.g. task state. Once we go beyond Init, we simply see how the SQLOS scheduler views the worker when in nonpreemptive mode. If the worker is running, it owns the scheduler. If it is runnable, it is owned by the scheduler’s runnable queue. And if it is suspended, the scheduler takes no interest in its movements, unless the worker is waiting on IO or a timer.
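
At the DMV level the enum surfaces as a string rather than the raw integer. Purely as a convenience – and implying nothing about the implementation – here is a throwaway query to see the current spread of worker states:

SELECT  state,
        COUNT(*) AS worker_count
FROM    sys.dm_os_workers
GROUP BY state
ORDER BY worker_count DESC;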

status is an interesting one, and is the source of a bunch of the following is_xxx flags, which break out its individual bits.

Here is the 0-based bit mapping of the flags exposed in the DMV, plus a handful of others that I know to be in use but which aren’t exposed here; a small query cross-checking a few of the exposed bits against their is_xxx columns follows the list.

bit  2 - is_preemptive
bit  3 - is_fiber
bit  4 - is_sick
bit  5 - is_in_cc_exception
bit  6 - is_fatal_exception
bit  7 - is_inside_catch
bit  8 - also involved in exception state
bit 11 - used in scheduling
bit 12 - lazy preemptive (in conjunction with 2)
bit 13 - is_in_polling_io_completion_routine
bit 19 - set in SOS_Task::DoomThread() 
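
As promised, here is a hedged sketch that decodes a few status bits by hand and puts them next to the corresponding is_xxx columns. The bit positions are my observations from the list above, not documented behaviour, so treat this as illustration rather than gospel:

-- Decode bits 2, 3 and 4 of status and compare with the published flag columns.
SELECT  worker_address,
        status,
        CASE WHEN status & 4  <> 0 THEN 1 ELSE 0 END AS bit2_decoded,  is_preemptive,
        CASE WHEN status & 8  <> 0 THEN 1 ELSE 0 END AS bit3_decoded,  is_fiber,
        CASE WHEN status & 16 <> 0 THEN 1 ELSE 0 END AS bit4_decoded,  is_sick
FROM    sys.dm_os_workers;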

There is a second bitmask member in the Worker class, containing flags like “do not suspend”, “is in external code”, “is in exception backout” etc. For whatever reason, the DMV authors didn’t expose any of these flags.

I/O, exception and affinity metrics

pending_io_count is a straightforward member of the Worker instance, as is pending_io_byte_count. And pending_io_byte_average is simply a convenience column derived from the other two, saving you from having to special-case around potential division by zero.
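
To make the derivation concrete, here is a trivial sketch that recomputes the average from the other two columns; the CASE is exactly the kind of division-by-zero guard the DMV saves you from writing:

SELECT  worker_address,
        pending_io_count,
        pending_io_byte_count,
        pending_io_byte_average,
        CASE WHEN pending_io_count = 0 THEN 0
             ELSE pending_io_byte_count / pending_io_count
        END AS recomputed_average
FROM    sys.dm_os_workers
WHERE   pending_io_count > 0;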

exception_num can be either a 16-bit or a 32-bit integer; the width is determined by another flag elsewhere in the worker. exception_severity lives within the worker, but additional information like exception state is found in a separate ExceptionInfo struct, which is the thing pointed to by exception_address.

affinity comes straight from a Worker member, whereas processor_group is derived from an embedded NodeAffinity instance.

Timestamps

Time to talk about time again. Within SQLOS, the vast majority of “Now()” time stamps are stored as integers sourced from one of two domains. Which domain gets used is determined at service startup and – to the best of my knowledge – will not change until shutdown. The two options are:

  1. The QueryPerformanceCounter (QPC), used if an invariant timestamp is available. This is the more predictable and finely grained option, and I’d assume that it applies on most serious systems.
  2. As fallback, timer interrupt ticks can be used. These are very easy and cheap to retrieve from the KUSER_SHARED_DATA structure, but resolution is at the mercy of outside forces, and can be as bad as 15.6ms.

I have previously touched on the two options when discussing the source of getdate() in Milliseconds 10, ticks 3.

So to get to the point, all the below columns expose normalised values, derived by applying either the instance-specific SOS_PublicGlobals::sm_QueryPerformanceFrequencyBase (QPC) or a simple constant factor of 10,000 (interrupt ticks) to the underlying properties; a small query putting them to use follows the list:

  • worker_created_ms_ticks
  • task_bound_ms_ticks
  • wait_started_ms_ticks
  • wait_resumed_ms_ticks
  • start_quantum
  • quantum_used
  • max_quantum
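
Because they are normalised to milliseconds, the timestamp columns can be compared against sys.dm_os_sys_info.ms_ticks, which – to the best of my knowledge – lives in the same clock domain. A hedged sketch turning timestamps into ages:

SELECT  w.worker_address,
        w.state,
        si.ms_ticks - w.worker_created_ms_ticks AS worker_age_ms,
        si.ms_ticks - w.wait_started_ms_ticks   AS ms_since_wait_started  -- only meaningful while suspended
FROM    sys.dm_os_workers AS w
CROSS JOIN sys.dm_os_sys_info AS si;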

Here are a few notes to pad out the official documentation.

wait_started_ms_ticks is set in SOS_Task::PreWait(), i.e. just before actually suspending, and again cleared in SOS_Task::PostWait(). For more about the choreography of suspending, see here.

wait_resumed_ms_ticks is set in SOS_Scheduler::PrepareWorkerForResume(), itself called by the mysteriously named but highly popular SOS_Scheduler::ResumeNoCuzz().

start_quantum is set for the Resuming and InstantResuming case within SOS_Scheduler::TaskTransition(), called by SOS_Scheduler::Switch() as the worker is woken up after a wait.

max_quantum is a high-water mark for the longest time the worker has spent on a single quantum, and quantum_used is the total time the worker has spent in the Running state.

Definitely the most interesting one of the bunch is end_quantum. This is a calculated field, and is simply start_quantum plus the scheduler’s quantum length, which is currently always 4ms.
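
If that description holds, the arithmetic is easy to check from the DMV itself; assuming the 4ms quantum, the delta below should come out as 4 for every worker:

SELECT  worker_address,
        end_quantum - start_quantum AS quantum_length_ms,
        quantum_used,
        max_quantum
FROM    sys.dm_os_workers;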

What makes it interesting is that this calculation is redundant with the quantum target actually stored as a property within the Worker. This has been touched on recently by Paul Randal in a great, thought-provoking blog post, where he mentioned that the quantum end is stored in RDTSC ticks.

My best guess is that RDTSC came to the fore here for the sake of fine granularity and very low cost. Even in the face of clock speed changes, having a bit of variation in the quantum end is probably no big deal, compared with the risk of having to use interrupt ticks with dodgy or completely unusable accuracy. And on the cost front, the classic “is it time to yield yet?” check is really cheap to express when it’s just a case of pulling up the current TSC and comparing it with the precalculated finish line.

Anyhow, when calculating end_quantum for the DMV, we get the additional conversion joy of leaning on SOS_PublicGlobals::sm_CpuTicksPerMillisecond, because the quantum length (at scheduler level) is expressed in CPU ticks.

Odds and ends

last_wait_type, tasks_processed_count and signal_worker_address are fairly straightforward.

context_switch_count is incremented within SOS_Scheduler::TaskTransition() in three of its cases:

  1. Suspending, the normal case where a worker is about to be switched out.
  2. InstantResuming, where the quantum was exhausted but the runnable queue is empty, so the worker gets another quantum without any switch actually taking place.
  3. SwitchToNonPreemptive, where a preemptively scheduled worker rejoins the cooperative ecosystem.

The return_code property surfaces the mechanism by which the results of asynchronous calls propagate back to the calling worker. Whoever makes the waiting worker runnable again (typically upon return of a wait function) writes the result of the wait into this member while the worker remains asleep. After getting back to the runnable queue, and eventually being picked as the next worker to run, SOS_Scheduler::Switch() reads this value and returns it as the return value to the awakened worker. This may propagate through a few layers of function calls, but ultimately it will reach a function that knows what to do with it, e.g. turn a timeout result into an exception.

boost_count is an oddity for dragging a rather useful explanation of priority boosting into the official documentation. Here is my best effort at making sense of this mechanism, whereby a thread waiting on a synchronisation object gets bumped to the head of the runnable queue 1,000 times in a row, but then goes to the back once before being eligible for a boost again.

The idea of applying a priority boost is a familiar one from OS scheduling. This serves to avoid the priority inversion that can be caused in the below scenario:

  1. Many threads wait for a resource
  2. The low priority thread eventually gets its turn, is assigned ownership and made runnable, while others (including high-priority ones) are waiting in line behind it
  3. Because it has low priority to the scheduler, it doesn’t get scheduled
  4. Now the higher priority threads don’t get to run either, because they are waiting on something that won’t get to run and pass the baton to them

While the boosting mechanism does apply as described, we are glossing over the fact that each scheduler’s “runnable queue” may actually consist of a lot of different queues. The detail has changed between 2014 and 2016, but the omitted bit still boils down to this: Upon becoming runnable, a worker gets a place at the head or the tail of its assigned queue, but this mechanism doesn’t affect what queue it goes into. In the face of workload groups, it might still be possible to craft priority inversion.

DMV data source and iteration mechanism

Good news. This one is a lot simpler and more obvious than sys.dm_os_threads, although with a nice twist that made me rethink the object ownership hierarchy.

There isn’t a single global list of all Workers, so we start at the root of all things, the singleton NodeManager, to iterate over all SOS_Nodes. Now on a per-node basis, we indirectly find a way to iterate over Workers associated with that node.

It turns out that while workers are associated with schedulers, the relationship works one way, and schedulers don’t keep lists of their workers, apart from the suspend queues that live within the scheduler (timer lists, IO lists, runnable queues). However, taking one step up in the hierarchy, we do find such a list in the SchedulerManager.

This makes sense when you consider that a worker starts its life being suspended in a SystemThreadDispatcher, which itself lives in a SchedulerManager with a 1:1 relationship between them. The linked list item (right at the start of the Worker object) which enlists it into the SystemThreadDispatcher is the suspend queue entry, and this is the one which moves between different suspend queues, or which belongs to no list at times when the worker is running. There is however a second linked list entry sixteen bytes into the Worker; this one’s list head is in the SchedulerManager.

Each node contains separate SchedulerManagers for normal and hidden schedulers, so the full iteration pattern for the global worker iterator goes like this (a rough DMV-level equivalent follows the list):

  1. Start with the NodeManager
  2. Iterate over all the SOS_Nodes
  3. Per SOS_Node, first iterate over the workers belonging to its “regular” SchedulerManager
  4. When done with these, now iterate over workers belonging to the “hidden” SchedulerManager
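
Here is that rough DMV-level equivalent: a hedged sketch grouping workers by node and by regular versus hidden scheduler. I’m leaning on the documented status column of sys.dm_os_schedulers to tell the two SchedulerManagers apart, which is an approximation of the internal split rather than a faithful reproduction of it:

SELECT  s.parent_node_id,
        CASE WHEN s.status LIKE '%HIDDEN%' THEN 'hidden' ELSE 'regular' END AS scheduler_type,
        COUNT(*) AS workers
FROM    sys.dm_os_workers    AS w
JOIN    sys.dm_os_schedulers AS s ON s.scheduler_address = w.scheduler_address
GROUP BY s.parent_node_id,
         CASE WHEN s.status LIKE '%HIDDEN%' THEN 'hidden' ELSE 'regular' END
ORDER BY s.parent_node_id, scheduler_type;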

Here is an outline of the involved objects.

The global worker iterator

The mechanism of engaging with a retrieved Worker is less clunky than is the case for dm_os_threads, which requires the target objects to be cloned while locked. Workers have their lifetimes controlled by reference counting, so upon finding the next worker, the reference count on the worker is increased before releasing the list’s spinlock. This avoids the worker getting destroyed while we are querying it. Upon moving to the next worker, the reference count on the previous one is decremented, and – in keeping with the reference-counting contract – if this was the last reference, the worker is then destroyed and its memory deallocated.

Well, there you have it for workers. Who can tell where we’ll go next?

The DMV Diaries: sys.dm_os_threads

Confession. I’ve been intending for ages to do a series on the guts of DMVs. We all use them, I often find answers in their inner workings, and they admirably serve the purpose of surfacing things which would otherwise remain buried.

This isn’t a promise to hammer out one per day, or even one per week, but one has to start somewhere. And since I keep getting seduced by SQLOS threading internals, where better to start than the one listing threads, sys.dm_os_threads?

My reference version here is SQL Server 2016 SP1, although I’ll make one interesting comparison to 2014.

The dm_os_threads data source

Broadly speaking, a DMV presents just another iterator that can be plugged into a query plan. The execution engine calls GetRow() repeatedly until it reaches the end, and the iterator emits rows. The only unusual thing is that the ultimate source of the data may have nothing to do with the storage engine.

Now if you asked me to guess where in the world we’d find a list of all threads to iterate over, I would have expected that we’d start with the NodeManager, iterating over all SOS_Nodes, and then for each of them iterating over its collection of associated SystemThreads. After all, we have a guaranteed 1:1 correspondence between threads and SystemThreads, and I figured that all SystemThreads enlist themselves into a parent SOS_Node upon creation. No-brainer, right?

Turns out that this guess would have been completely wrong, and the reason it would have been a broken implementation will become apparent when we look at the started_by_sqlservr column.

To understand the true data source, we have to take a step back into the formative years of a SQLOS thread. A few reference points:

  • Unsung SQLOS: the SystemThread
  • The early life of a SQLOS thread
  • The MiniSOSThreadResourcesMgr revisited

    When a SystemThread gets constructed within the DllMain thread-attach callback, there are some resources which are so crucial that it’s pointless to continue if they can’t be provided:

    • A memory allocation of 312 bytes to contain the SystemThread object itself
    • The Windows event that will be used in the co-operative context switching mechanism, for which we want a handle ready
    • A memory allocation of 2816 bytes in which we’ll construct a Worker object. Although this thread could eventually be bound to a different Worker, there should be a pool of at least as many Workers as threads.
    • A memory allocation of 984 bytes for an SOS_Task object. For similar reasons, there should be at least one of these floating around per thread.

    The SystemThread allocation itself is handled by the SystemThreadPool, a specialised memory allocator. Because this is stuff that must be in place very early in the boot process, before the SQLOS memory hierarchy is built up, this is one of the few cases where HeapAlloc() is directly invoked.

    The other three are packaged up together in a MiniSOSThreadResources instance, with this neat package being dispensed by the MiniSOSThreadResourcesMgr, another HeapAlloc-happy class.

    These objects don’t live independent lives. Each dispensed MiniSOSThreadResources instance is enlisted in a linked list owned by the MiniSOSThreadResourcesMgr, and contains a pointer to the SystemThread it was created to support. Additionally, the MiniSOSThreadResources instance contains the thread ID (also stored in the SystemThread itself) and a pointer to that thread’s Thread Environment Block (which isn’t).

    Now the creation of a SystemThread, and the association with a MiniSOSThreadResources instance, isn’t something that a programmer has control over, short of removing the thread-attach callback. Whatever reason a thread was created for, and whether or not it plays by SQLOS rules, it will have these accessories if it was created in the SQL Server process. If a thread didn’t start life in the standard entry point function SchedulerManager::ThreadEntryPoint it may not end up enlisted in an SOS_Node, but it sure will be discoverable through the MiniSOSThreadResourcesMgr.

    Here is a fairly simple picture of how these things fit together:

    MiniSOSThreadResources + SystemThread

    The global thread iterator

    You can see where this is heading. When sys.dm_os_threads wants to iterate over all the threads in the process, this global list belonging to the MiniSOSThreadResourcesMgr is the golden data source, since each MiniSOSThreadResources contains a pointer to a SystemThread, and there is a perfect 1:1 correspondence between the two sets of objects.

    The mechanism of iteration serves as a reminder of how to engage hygienically with linked lists, and how the goal of “reasonably accurate and nonblocking” will trump “logically perfect” when it comes to low-level DMVs.

    As a starting point, remember that it isn’t safe to traverse a linked list without exclusively locking it. Now if we wanted to iterate over the members one by one, with an unknown period passing between retrieving items, holding that lock from the first row to the last one is a non-starter, because it could interfere with the core workings of SQLOS.

    Instead, here is how things work at high level, leaning on the fact that the list count is stored (“denormalised”) in the list head:

    1. Keep track of how many rows have been produced.
    2. For every GetRow() call, start traversing afresh from the list head.
    3. Every time, go one element further than last time and call that the next item.
    4. Once we have either traversed back around to the list head, or produced the number of rows the list currently claims to contain, we deem it to be the end of the data set.

    Doing it this way may sound wasteful in having to count members from the beginning on every row retrieval – this is what Joel Spolsky dubbed the “Schlemiel the painter” algorithm. However, that small cost buys the massive benefit of not leaving the structure locked between item retrievals – on each iteration the spinlock is only held up to the point of finding the next target item. Let’s also be clear: not locking it is simply not an option for this source.

    A delightful side effect is of course that we can theoretically skip a member, or read it twice. Those are simply the odds we live with.

    Once the next item (a MiniSOSThreadResources instance) has been found, a “reporting” copy is made of a chunk of its Thread Environment Block and the complete SystemThread. Now everything we need is in those snapshots – or derivable from them – and the risk of inconsistent reads or interference with normal operations is behind us.

    DMV members from the Thread Environment Block

    The following columns in sys.dm_os_threads are served straight from the Thread Environment Block, i.e. user-mode thread state belonging not to SQLOS but to the underlying OS; a quick query over them follows the list:

    • stack_base_address
    • stack_end_address
    • stack_bytes_committed
    • stack_bytes_used
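
    Since the stack grows downwards, subtracting stack_end_address from stack_base_address gives the reserved stack size, which can sit next to the committed and used byte counts. A hedged sketch – it assumes the varbinary addresses convert cleanly to bigint, which they do for user-mode addresses:

    SELECT  os_thread_id,
            CONVERT(bigint, stack_base_address) - CONVERT(bigint, stack_end_address) AS stack_bytes_reserved,
            stack_bytes_committed,
            stack_bytes_used
    FROM    sys.dm_os_threads
    ORDER BY stack_bytes_used DESC;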

    Additionally, there are three columns which used to get populated from the TEB in SQL Server 2014, but are now hard-coded to null in 2016. Make of it what you will, but these are ones that were added in later Windows versions and live at higher addresses in the TEB:

    • locale
    • is_impersonating
    • is_waiting_on_loader_lock

    DMV members from the kernel

    Some interesting stuff can only be found out through kernel calls, made using thread handles retrieved through calling OpenThread() using the thread IDs we already have available. Yes, this means that each row retrieved for this DMV involves multiple kernel transitions on Windows, but we’ll have to stay tuned to learn what this really means in SQLPAL.

    From GetThreadTimes() we get the following:

    • creation_time
    • kernel_time
    • usermode_time

    From GetThreadContext() we get instruction_address. Kind of.

    See, for a running thread this isn’t going to be a meaningful answer. And in recognition of that, some clever wag coded up a special case for the current thread, i.e. the one querying the DMV: the hard-coded response coming back is the address of sqldk!SOS_OS::GetThreadControlRegisters(), the function wrapping GetThreadContext().

    Finally, the priority column comes from a straight call to GetThreadPriority(), using the handle we had within the SystemThread all along.
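
    For what it’s worth, here are the kernel-sourced columns side by side – nothing deeper than a convenience query, and I’m making no claims about the units beyond what the documentation offers:

    SELECT  os_thread_id,
            creation_time,
            kernel_time,
            usermode_time,
            priority
    FROM    sys.dm_os_threads
    ORDER BY usermode_time DESC;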

    DMV members from the SystemThread

    The very first one, thread_address, is a “duh” case: it is the address where the actual SystemThread lives. This should of course match self_address, which is stored within the SystemThread to support simple checking for an uninitialised or corrupted SystemThread.

    affinity comes straight from the SystemThread, and the related processor_group is derived by reading the full GroupAffinity struct within it. These columns are all straightforward SystemThread members:

    • os_thread_id
    • status – hope to document it some day
    • token
    • fiber_data
    • thread_handle
    • event_handle
    • scheduler_address
    • worker_address
    • fiber_context_address

    Finally, started_by_sqlservr is an interesting piece of derivation. If the list item within the SystemThread, intended to link it into a SOS_Node, is null, we know that the thread didn’t start its life the normal SQLOS way, and hence this column value is derived as false. Otherwise it is true.
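
    A quick illustration of that derivation: grouping by started_by_sqlservr separates the threads that entered through the normal SQLOS entry point from the ones that didn’t. Illustrative only, of course:

    SELECT  started_by_sqlservr,
            COUNT(*) AS thread_count
    FROM    sys.dm_os_threads
    GROUP BY started_by_sqlservr;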

    Further reading

    In hunting down evolution of the 64-bit Thread Environment Block layout, I found Geoff Chappell’s excellent reference very helpful. Definitely worth a bookmark.

Where do SQL Server tasks come from?

In my previous post I discussed the unsung early years of a SQLOS thread. Now it’s all very well knowing that threads extend themselves with SystemThreads, don Worker outfits, and execute SOS_Tasks, but I keep glossing over where tasks come from.

Gloss no more.

SOS_Task::Param – the seed of a work request

This nested struct only shows up by name in one spot on stack traces, when we see the below call as the crossover point between thread setup boilerplate and the meat of the task:

sqldk!SOS_Task::Param::Execute

This is a simple piece of abstraction, calling a previously specified function pointer with a previously specified parameter value. Those two values are the core data members of the SOS_Task::Param. In itself, such a pair is similar to the pair one would pass to a thread creation call like CreateRemoteThreadEx(). However, among the other data members we find SQLOS-specific things like a pointer to a resource group, the XEvent version of task identity and – if applicable – a parent task. These do pad out the picture a bit.

Now the SOS_Task::Param is a comparatively small structure, but it contains the core of what a task is about. Without it, an SOS_Task is a bag of runtime state without a defined mission, and isn’t “executable”. It’s just a typed block of memory: 984 bytes in the case of SQL Server 2016 SP1, of which the Param takes up 68 bytes.

Ultimately, getting a task run translates into this: Specify what you want done by phrasing it as an SOS_Task::Param, and get “someone” to create and enqueue a task based on it to a WorkDispatcher.

Which came first, the task or the task?

I have lately been spending some delightful hours unpicking the SQL Server boot process. For now, let’s stick with the simple observation that some tasks are started as part of booting. One example would be the (or nowadays “a”) logwriter task, whose early call stack looks like this:

sqlmin!SQLServerLogMgr::LogWriter
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
KERNEL32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Everything up to RunTask() is the thread starting up, getting married to a worker, and dequeuing a task from the WorkDispatcher. Apart from it being enqueued by the boot process, and the task never really completing, there is nothing special about this core piece of system infrastructure. It is just another task, one defined through an SOS_Task::Param with sqlmin!SQLServerLogMgr::LogWriter as its function pointer.

Here is one that is actually a bit more special:

sqldk!SOS_Node::ListenOnIOCompletionPort
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
KERNEL32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

This is of course the network listener that makes it possible for us to talk to SQL Server at all. It is a bit unusual in using preemptive OS scheduling rather than the cooperative SQLOS flavour, but this only affects its waiting behaviour. What makes it very special is that this is the queen bee: it lays the eggs from which other tasks hatch.

The SQL nursery

The above system tasks wear their purpose on their sleeves, because the function pointer in the SOS_Task::Param is, well, to the point. The tasks that run user queries are more abstract, because the I/O completion port listener can’t be bothered with understanding much beyond the rudiments of reading network packets – it certainly can’t be mucking about with fluent TDS skills, SQL parsing and compilation, or permission checks.

So what it does is to enqueue a task that speaks TDS well, pointing it to a bunch of bytes, and sending it on its way. Here is an example of such a dispatch, which shows the “input” side of the WorkDispatcher:

sqldk!WorkDispatcher::EnqueueTask
sqldk!SOS_Node::EnqueueTaskDirectInternal
sqllang!CNetConnection::EnqueuePacket
sqllang!TDSSNIClient::ReadHandler
sqllang!SNIReadDone
sqldk!SOS_Node::ListenOnIOCompletionPort
...

At this point, the Resource Group classifier function will have been run, and the target SOS_Node will have been chosen. Its EnqueueTaskDirectInternal() method picks a suitable SOS_Scheduler in the node, instantiates an SOS_Task based on the work specification in the SOS_Task::Param, and the WorkDispatcher within that scheduler is then asked to accept the task for immediate or eventual dispatching. Side note: it looks as if SOS_Scheduler::EnqueueTask() is inlined within EnqueueTaskDirectInternal, so we can imagine it being in the call stack before WorkDispatcher::EnqueueTask().

In simple terms, the network listener derives a resource group and chooses a node, the node chooses a scheduler, and the scheduler delegates the enqueue to its WorkDispatcher component.

Ideally an idle worker will have been found and assigned to the task. If so, that worker is now put on its scheduler’s runnable queue by the network listener thread, taking it out of the WorkDispatcher. If no worker is available, the task is still enqueued, but in the WorkDispatcher’s wallflower list of workerless tasks, giving us our beloved THREADPOOL wait.
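
Those wallflower tasks are exactly what surfaces as THREADPOOL waits. As a hedged aside, the query below only returns rows while worker starvation is actually in progress:

SELECT  session_id,
        wait_duration_ms,
        wait_type
FROM    sys.dm_os_waiting_tasks
WHERE   wait_type = N'THREADPOOL';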

Let’s however assume that we have a worker. Over on the other side, its underlying thread is still sleeping, but tied to a worker which is now on its scheduler’s runnable queue. It will eventually be woken up to start its assigned task in the normal course of its scheduler’s context switching operations. And what screenplay was thrown over the wall for it to read?

sqllang!CSQLSource::Execute
sqllang!process_request
sqllang!process_commands_internal
sqllang!process_messages
sqldk!SOS_Task::Param::Execute
...

Here we get an extra level of abstraction. The instantiated task runs the TDS parser, which creates and invokes specialised classes based on the contents of the TDS packet. In this case we got a SQL batch request, so a CSQLSource was instantiated to encapsulate the text of the request and set to work on it, taking us down the road of SQL parsing, compilation, optimisation and execution.

Once parallelism comes into play, such a task could instantiate yet more tasks, but that is a story for another day.

Conclusion

From a high enough view, this is quite elegant and simple. A handful of tasks are created to get SQL Server in motion; some of these have the ability to create other tasks, and the machine just chugs on executing all of them.

To tie this up with the previous post, here is the complete lifecycle of a SQLOS thread:

The SQLOS thread lifecycle

If you haven’t lately read Remus Rusanu’s excellent Understanding how SQL Server executes a query, do pay it a visit. This post is intended to complement its opening part, although I’m placing more emphasis on the underlying classes and methods recognisable from stack traces.

Since we’ve now covered the thread lifecycle from birth into its working life, next up I’ll go back and look at what a wait feels like from the worker angle.

The early life of a SQLOS thread

So I have checked off that bucket list item of speaking at a SQLSaturday. In the process of getting my act together, I learned a thing or two about the undocumented youth of SQLOS threads, between birth and entering the workplace. And you didn’t, which seems unfair.

We normally see stack traces while looking at the top of the stack, typically during a wait, at which point the thread is wearing full worker garb, and has been executing a task for a while. Let’s today reflect on those happy times when our thread was in diapers.

Conception and birth

Threads are born because the system as a whole decides they are cute and it wants more of them. The decision is made in the SystemThreadDispatcher, which is a component of a SchedulerManager, itself a component of an SOS_Node, aka CPU node.

We can simplify this: Threads are born into nodes.

Now a thread isn’t created at the moment that it is needed, and it isn’t legally able to perform work right from birth. The idea is to have a reasonable number of grown-up threads in the population, ready to be put to work at short notice. We are just at the first step.

Thread creation is done through a CreateRemoteThreadEx() call, within the function SystemThreadDispatcher::CreateNewSysThreadIfRequired(), which is invoked as a side task by another thread when it leaves the pool of unemployed threads.

The function pointer passed in as thread entry point is SchedulerManager::ThreadEntryPoint(), and the parameter that will be passed to that entry point is a pointer to the target node’s SchedulerManager. In other words, when the function runs, it will be a completely normal instance method call on that SchedulerManager, parameterless except for the This pointer. And since the SchedulerManager knows what node it belongs to, our newborn thread will instinctively be able to crawl into the arms of the maternal SOS_Node.

But I am getting ahead of myself here. Before even running that entry point function, the thread creation callback registered during SQLOS boot (SystemThread::DllMainCallback()) is invoked by the OS runtime in the context of the new thread. And that gives us a SystemThread associated with the thread, meaning it has – among other things – the Windows event that will let it participate in SQLOS context switching.

So the very first thing our newborn thread, cosily wrapped up in a SystemThread, does is to enlist itself in the parent SOS_Node – and by “enlist” I literally mean adding itself to a linked list. Strictly speaking, it enlists the SystemThread, which is now SQLOS’s proxy to the thread: whenever we want to refer to a thread, we do so through a pointer to its SystemThread. Looking at it from one direction, the SystemThread contains a handle to the thread. From the other direction, any running code can find the ambient SystemThread through a thread-local storage lookup.

As it stands, the thread can’t do much useful in polite company yet, other than suspend itself. SystemThread::Suspend() is the most rudimentary of scheduling functions, just calling WaitForSingleObject() on the thread’s personal Event.

When a thread loves a Worker

ThreadEntryPoint now calls SystemThreadDispatcher::ProcessWorker() on the SOS_Node’s SystemThreadDispatcher, i.e. the one within the current SchedulerManager.

The SystemThreadDispatcher shows itself to be a dating agency, keeping separate linked lists of unattached SystemThreads and idle Workers, and pairing them off according to supply and demand.

From the viewpoint of the thread running it, ProcessWorker() means “find me an unattached Worker so we can change the world together”. If there isn’t a spare Worker at this moment though, the thread goes to sleep through the aforementioned SystemThread::Suspend() call, only to be woken up when a fresh young Worker arrives on the dating scene. This moment is celebrated by ProcessWorker() moving on to call SystemThread::RunWorker().

Pairing the two up includes the SystemThread swearing a vow of loyalty to the Worker’s associated SOS_Scheduler. Up to this point, the thread was “in the SystemThreadDispatcher” and associated with an SOS_Node, but not a specific scheduler. From here onwards, the SystemThread and Worker are fully part of the family of workers for that scheduler.

We now move on to SchedulerManager::WorkerEntryPoint() which initialises the Worker, e.g. setting timestamps and the first quantum target, before invoking the first SOS_Scheduler method, ProcessTasks().

Interesting aside regarding waits: The suspension of a thread within the SystemThreadDispatcher isn’t a measured wait, because waiting is measured at the level of workers and schedulers, neither of which have yet entered the picture.

Your task, should you choose to accept it…

Moving into the family home of the Worker, the first stop within ProcessTasks() is a courtesy call on that scheduler’s WorkDispatcher. If the SystemThreadDispatcher was a dating agency for Workers and SystemThreads, the WorkDispatcher is an employment agency for those couples, pairing them up with jobs in the form of SOS_Tasks.

Entering the WorkDispatcher initially, the pair generally wouldn’t find a pending task. At this point they (though the pair is now just viewed as a Worker by the scheduler) are put to sleep through a full-fledged scheduler method, SOS_Scheduler::SuspendNonPreemptive(). This means that the Worker ends up on a suspend queue, specifically the WorkDispatcher’s list of idle workers.

When a task is lobbed over the wall into the scheduler from elsewhere, the WorkDispatcher will assign it to an idle Worker, and that Worker is made runnable. In due course it will be chosen as the next worker to run, continuing with the ProcessTasks() call to run the specific function specified through the task: this is SOS_Scheduler::RunTask() into SOS_Task::Param::Execute().

The task gets executed, through all the joys and heartaches of taskhood, and if parallelism is involved, child tasks may even be spawned. Ultimately though, the task will be done, and the pair return to the WorkDispatcher’s idle list, blocked in SOS_Scheduler::ProcessTasks() but ready for the next challenge.

You want pictures? Sure.

The relationship between SOS_Node, SOS_Scheduler and their dispatching components

(For the sake of honesty, I should note that a node actually has separate SchedulerManagers for normal and hidden schedulers.)

Up next

This takes care of how tasks, workers, and threads interact – at least in thread mode, which is the only mode we probably care about. In the next blog post I will look into how tasks actually get instantiated.

Unsung SQLOS: The SOS_UnfairMutexPair behind CMEMTHREAD waits

As with the droppings of the Questing Beast, we recognise synchronisation code paths by their emissions. But even when not leaving telltale fewmets, these creatures wander among us unseen, unsung, until shutdown doth us part.

The place of the SOS_UnfairMutexPair

Bob Dorr has done a great job of illustrating why thread-safe SQLOS memory objects need serialised access in How It Works: CMemThread and Debugging Them. This has been followed up by the “It just runs faster” coverage of dynamic memory partitioning.

Today I’ll be examining the SOS_UnfairMutexPair, the synchronisation object behind memory allocations. While I’m going to describe synchronisation as a standalone subject, it’s useful to think of it as the CMEMTHREAD wait; to the best of my knowledge nothing other than memory allocations currently uses this class.
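
Since the only externally visible trace of contention on this mutex is the CMEMTHREAD wait type, the obligatory – and purely illustrative – cumulative numbers since instance start look like this:

SELECT  wait_type,
        waiting_tasks_count,
        wait_time_ms,
        signal_wait_time_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type = N'CMEMTHREAD';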

For context, I have previously described a bunch of other synchronisation classes.

One picture which emerges from all these synchronisation flavours is that a developer can choose between busy waiting (eagerly burning CPU), politely going to sleep (thread rescheduling), or a blend between the two. As we’ll see the SOS_UnfairMutexPair goes for a peculiar combination, which was clearly tuned to work with the grain of SQLOS scheduling.

The shape of the object

With the exception of the central atomic lock member, everything comes in pairs here. The class models a pair of waitable objects, each having an associated pair of members to note which scheduler and task currently owns it:

Layout of the SOS_UnfairMutexPair (2016 incarnation)

Although it exposes semantics to acquire either both mutexes or just the second, in its memory allocation guise we always grab the pair, and it effectively acts as a single mutex. I’ll therefore simplify by describing only one half of the pair.

Acquisition: broad outline

The focal point of the mutex’s state – in fact one might say the mutex itself – is the single Spinlock bit within the 32-bit lock member. Anybody who finds it zero, and manages to set it to one atomically, becomes the owner.

Additionally, if you express an interest in acquiring the lock, you need to increment the WaiterCount, whether or not you managed to achieve ownership at the first try. Finally, to release the lock, atomically set the spinlock to zero and decrement the WaiterCount; if the resultant WaiterCount is nonzero, wake up all waiters.

Now one hallmark of a light-footed synchronisation object is that it is very cheap to acquire in the non-contended case, and this class checks that box. If not owned, taking ownership (the method SOS_UnfairMutexPair::AcquirePair()) requires just a handful of instructions, and no looping. The synchronisation doesn’t get in the way until it is needed.

However, if the lock is currently owned, we enter a more complicated world within the SOS_UnfairMutexPair::LongWait() method. This broadly gives us a four-step loop:

  1. If the lock isn’t taken at this moment we re-check it, grab it, simultaneously increment WaiterCount, then exit triumphantly, holding aloft the prize.
  2. Fall back on only incrementing WaiterCount for now, if this is the first time around the loop and the increment has therefore not been done yet.
  3. Now wait on the EventInternal, i.e. yield the thread to the scheduler.
  4. Upon being woken up by the outgoing owner releasing the lock as described above, try again to acquire the lock. Repeat the whole loop.

The unfairness derives from the fact that there is no “first come, first served” rule, in other words the wait list isn’t a queue. This is not a very British class at all, but as we’ll see, there is a system within the chaos.

The finicky detail

Before giving up and waiting on the event, there is a bit of aggressive spinning on the spinlock. As is standard with spinlocks, spinning burns CPU on the optimistic premise that it wouldn’t have to do it for long, so it’s worth a go. However, the number of spin iterations is limited. Here is a slight simplification of the algorithm:

  • If the scheduler owning the lock is the ambient scheduler, restrict to a single spin.
  • Else, give it a thousand tries before sleeping.
  • Each spin involves first trying to grab both the lock and (if not yet done) incrementing WaiterCount. If that doesn’t work, just try and increment the WaiterCount.

This being of course the bit where the class knows a thing or two about SQLOS scheduling: if I am currently running, then no other worker on my scheduler can be running. But if another worker on my scheduler currently holds the lock, it can’t possibly wake up and progress towards releasing it unless *I* go to sleep. Mind you, this is already an edge case, because we’d hope that the owner of this kind of lock wouldn’t go to sleep holding it.

To see how scheduling awareness comes into play, I’m going to walk through a scenario involving contention on such a mutex. If some of the scheduling detail makes you frown, you may want to read Scheduler stories: The myth of the waiter list.

A chronicle of contention

In this toy scenario, we have two schedulers with two active workers each. Three of the four workers will at some point during their execution try and acquire the same mutex, and one of them will try twice. Time flows from left to right, and the numbered callouts are narrated below. A red horizontal bracket represents the period where a worker owns the mutex, which may be a while after the acquisition attempt started.

The mutex acquisition tango
  1. A1 wants to acquire the mutex and, finding it uncontended, gets it straight away.
  2. B2 tries to acquire it, but since it is held by A1, it gives up after a bit of optimistic spinning, going to sleep. This gives B1 a turn on the scheduler.
  3. A1 releases the mutex, and finding that there is a waiter, signals it. This moves B2 off the mutex’s waiter list and onto scheduler B’s runnable queue, so it will be considered eligible for running at the next scheduler yield point.
  4. B1 wants the mutex, and since it isn’t taken, grabs it. Even though B2 has been waiting for a while, it wasn’t running, and it’s just tough luck that B1 gets it first.
  5. A1 wants the mutex again, but now B1 is holding it, so A1 goes to sleep, yielding to A2.
  6. B1 releases the mutex and signals the waiter A1 – note that B2 isn’t waiting on the resource anymore, but is just in a signal wait.
  7. B1 reaches the end of its quantum and politely yields the scheduler. B2 is picked as the next worker to run, and upon waking, the first thing it does is to try and grab that mutex. It succeeds.
  8. A2 reaches a yield point, and now A1 can get scheduled, starting its quantum by trying to acquire the mutex. However, B2 is still holding it, and after some angry spinning, A1 is forced to go to sleep again, yielding back to A2.
  9. B2 releases the mutex and signals the waiting A1, who will hopefully have better luck acquiring it when it wakes up again.

While this may come across as a bit complex, remember that an acquisition attempt (whether immediately successful or not) may also involve spinning on the lock bit. And this spinning manifests as “useful” work which doesn’t show up in spinlock statistics; the only thing that gets exposed is the CMEMTHREAD waiting between the moment a worker gives up and goes to sleep and the moment it is woken up. This may be followed by another bout of unsuccessful and unmeasured spinning.

All in all though, you can see that this unfair acquisition pattern keeps the protected object busy doling out its resource: in this case, a memory object providing blocks of memory. In an alternative universe, the mutex class may well have decided on its next owner at the moment that the previous owner releases it. However, that would mean the allocator does no useful work until the chosen worker has woken up; in the meantime, the unlucky ones on less busy schedulers may have missed an opportunity to get woken up and do a successful acquire/release cycle. So while the behaviour may look unfair from the viewpoint of the longest waiter, it can turn out better for overall system throughput.

Of course, partitioning memory objects reduced the possibility of even having contention. But the fact remains: while any resources whatsoever are shared, we need to consider how they behave in contended scenarios.

Compare-and-swap trivia

Assuming we want the whole pair, as these memory allocations do, there are four atomic operations performed against the lock member:

  • Increment the waiter count: add 0x00010001
  • Increment the waiter count and grab the locks: add 0x80018001
  • Just grab the locks (after the waiter count was previously incremented): add 0x80008000
  • Release the locks and decrement the waiter count: deduct 0x80018001

For the first three, the usual multi-step pattern comes into play:

  1. Retrieve the current value of the lock member
  2. Add the desired value to it, or abandon the operation, e.g. if we find the lock bit set and we’re not planning to spin
  3. Perform the atomic compare-and-swap (lock cmpxchg instruction) to replace the current value with the new one as long as the current value has not changed since the retrieval in step 1
  4. Repeat if not successful

The release is simpler, since we know that the lock bits are set (we own it!) and there is no conditional logic. Here the operation is simple interlocked arithmetic, but two’s complement messes with your mind a bit: the arithmetic operation is the addition of 0x7ffe7fff. Not a typo: that fourth digit is an “e”!

This all comes down to thinking of the lock bit as a sign bit we need to overflow in order to clear. The higher one overflows out of the 32-bit space, but the lower one overflows into the lowest bit of the other count. To demonstrate, we expect 0x80018001 to turn to zero after applying this operation:

    8001 8001
  + 7ffe 7fff
    ---------
(1) 0000 0000
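
If you’d rather not do the hex arithmetic by hand, the same wraparound can be demonstrated with signed 32-bit integers in T-SQL – a throwaway check, nothing more:

-- 0x80018001 read as a signed int is -2147385343; adding 0x7ffe7fff (2147385343)
-- lands back on zero, i.e. both locks released and both waiter counts decremented.
SELECT  CONVERT(int, 0x80018001) AS locked_pair,
        CONVERT(int, 0x7FFE7FFF) AS release_adjustment,
        CONVERT(int, 0x80018001) + CONVERT(int, 0x7FFE7FFF) AS result;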

Final thoughts

So you thought we’ve reached the end of scheduling and bit twiddling? This may turn out to be a perfect opportunity to revisit waits, and to start exploring those memory objects themselves.

I’d like to thank Brian Gianforcaro (b | t) for feedback in helping me confirm some observations.

Context in perspective 6: Taking Sessions to Task

A few weeks ago, Kendra Little did a very entertaining Dear SQL DBA episode about stack dumps. That gave me a light bulb moment when I tied the preamble of her stack dump example to a DBCC command that I’ve had on the back burner for a while. The command puts some session-related objects on stage together, and those objects bridge the gap between a DMV-centric view of the SQL Server execution model and the underlying implementation.

Let’s revisit the purpose of a dump. When you catch one of these happening on your server, it’s sensible to be alarmed, but it is rather negative to say “Oh no, something horrible went wrong!” A more positive angle is that a disturbance was detected, BUT the authorities arrived within milliseconds, pumped the whole building full of Halon gas, and then took detailed forensic notes. Pity if innocent folks were harmed by that response, but there was cause to presume the involvement of a bad hombre, and that is the way we roll around here.

Today we’ll be learning lessons from those forensic notes.

Meet DBCC TEC

Boring old disclaimer: What I am describing here is undocumented, unsupported, likely to change between versions, and will probably make you go blind. In fact, the depth of detail exposed illustrates one reason why Microsoft would not want to document it: if end users of SQL Server found a way to start relying on this not changing, it would hamstring ongoing SQL Server improvement and refactoring.

With that out of the way, let’s dive right into DBCC TEC, a command which can dump a significant chunk of the object tree supporting a SQL Server session. This result is the same thing that shows up within a dump file, namely the output of the CSession::Dump() function – it’s just that you can invoke this through DBCC without taking a dump (cue staring match with Kendra). Until corrected, I shall imagine that TEC stands for Thread Execution Context.

In session 53 on a 2014 instance, I start a batch running the boring old WAITFOR DELAY '01:00:00', and while it is running, I use another session to run DBCC TEC against this session 53. The DBCC command requires traceflag 3604, and I'm going to ignore the first and third parameters today; my SQL batch then looks like this:

 
DBCC TRACEON(3604);
DBCC TEC(0,53,0);

Here is the output I got, reformatted for readability, and with the bulk of the long output buffer omitted:

TEC:


CSession @0x0000000038772470

m_sessionId = 53                      m_cRef = 13                         
m_rgcRefType[0] = 1                   m_rgcRefType[1] = 1                 
m_rgcRefType[2] = 9                   m_rgcRefType[3] = 1
m_rgcRefType[4] = 0                   m_rgcRefType[5] = 1                 
m_pmo = 0x0000000038772040            m_pstackBhfPool = 0x0000000000000000
m_dwLoginFlags = 0x100083e0           m_fBackground = 0
m_eConnResetOption = 0                m_fUserProc = 1                     
m_fConnReset = 0                      m_fIsConnReset = 0                  
m_fInLogin = 0                        m_fAuditLoginSent = 1
m_fAuditLoginFailedSent = 0           m_fReplRelease = 0                  
m_fKill = 0                           m_ulLoginStamp = 110                
m_eclClient = 7                       m_protType = 6
m_hHttpToken = FFFFFFFFFFFFFFFF    

 

CUEnvTransient @0x0000000038772A60

m_ululRowcount = 0                    m_cidLastOpen = 0                   
m_lFetchStat = 0                      m_lRetStat = 0                      
m_lLastError = 0                      m_lPrevError = 0
m_cNestLevel = 0                      m_lHoldRand1 = 0                    
m_lHoldRand2 = 0                      m_cbContextInfo = 0                 
m_rgbContextInfo = 0000000038772C28 



CDbAndSetOpts @0x0000000038772A80

m_fHasDbRef = 0                       m_fDateFirstSet = 0                 
m_fDateFormatSet = 0                  m_lUserOpt1 = 0x14817000            
m_lUserOpt2 = 0x40                    m_lUserOpt1SetMask = 0xfffff7f7
m_idtInsert.objid = 0                 m_idtInsert.state = 0               
m_idtInsert.dbid = 0                  m_llRowcnt = 0                      
m_lStatList = 0                       m_lTextSize = 2147483647
m_lOffsets = 0x0                      m_ulLockTimeout = 4294967295        
m_ulQueryGov = 0                      m_eDtFmt = 1                        
m_eDayDateFirst = 7                   m_eDdLckPri = 0
m_eIsoLvl = 2                         m_eFipsFlag = 0x0                   
m_sLangId = 0


m_pV7LoginRec

0000000000000000:   00000000 00000000 00000000 00000000 00000000  ....................
0000000000000014:   00000000 00000000 00000000 00000000 00000000  ....................
0000000000000028:   00000000 00000000 00000000 00000000 00000000  ....................
000000000000003C:   00000000 00000000 00000000 00000000 00000000  ....................
0000000000000050:   00000000 00000000 00000000 0000               ..............


CPhysicalConnection @0x0000000038772240

m_pPhyConn->m_pmo = 0x0000000038772040                                   
m_pPhyConn->m_pNetConn = 0x0000000038772DF0                              
m_pPhyConn->m_pConnList = 0x0000000038772440                             
m_pPhyConn->m_pSess = 0x00000000387724C8                                 
m_pPhyConn->m_fTracked = -1
m_pPhyConn->m_cbPacketsize = 4096   
m_pPhyConn->m_fMars = 0             
m_pPhyConn->m_fKill = 0



CBatch @0x0000000038773370

m_pSess = 0x0000000038772470          m_pConn = 0x0000000038773240        
m_cRef = 3                            m_rgcRefType[0] = 1                 
m_rgcRefType[1] = 1                   m_rgcRefType[2] = 1
m_rgcRefType[3] = 0                   m_rgcRefType[4] = 0                 
m_pTask = 0x000000003868F468


EXCEPT (null) @0x000000004AD8F240

exc_number = 0                        exc_severity = 0                    
exc_func = sqllang.dll!0x00007FFD60863940                                


Task @0x000000003868F468

CPU Ticks used (ms) = 31              Task State = 4
WAITINFO_INTERNAL: WaitResource = 0x0000000000000000                     
WAITINFO_INTERNAL: WaitType = 0x20000CA
WAITINFO_INTERNAL: WaitSpinlock = 0x0000000000000000                     
SchedulerId = 0x1                     ThreadId = 0xb60                    
m_state = 0                           m_eAbortSev = 0



EC @0x000000003875E9A0

spid = 0                              ecid = 0                            
ec_stat = 0x0                         ec_stat2 = 0x0                        
ec_atomic = 0x0                       ecType = 0
__pSETLS = 0x00000000387732B0         __pSEParams = 0x0000000038773500   


 
SEInternalTLS @0x00000000387732B0

m_flags = 0                                   m_TLSstatus = 3
m_owningTask = 0x00000000336CB088             m_activeHeapDatasetList = 0x00000000387732B0
m_activeIndexDatasetList = 0x00000000387732C0 m_pDbccContext = 0x0000000000000000
m_pAllocFileLimit = 0x0000000000000000        m_dekInstanceIndex = 0x-1
m_pImbiContext = 0x0000000000000000


SEParams @0x0000000038773500

m_lockTimeout = -1                    m_isoLevel = 4096
m_logDontReplicate = 0                m_neverReplicate = 0
m_XactWorkspace = 0x000000003A158040  m_execStats = 0x000000003A877780

CCompExecCtxtBasic @0x000000003875E040


CConnection @0x0000000038773240


CNetConnection @0x0000000038772DF0

m_pNetConn->status = 8              m_pNetConn->bNewPacket = 0          
m_pNetConn->fClose = 0              m_pNetConn->pmo = 0x0000000038772040
m_pNetConn->m_lConnId = 0           client threadId = 0x38f11827

m_pNetConn->m_pFirstInputBuffer

0000000000000000:   16000000 12000000 02000000 00000000 00000100  ....................
0000000000000014:   00007700 61006900 74006600 6f007200 20006400  ..w.a.i.t.f.o.r. .d.
0000000000000028:   65006c00 61007900 20002700 30003100 3a003000  e.l.a.y. .'.0.1.:.0.
000000000000003C:   30003a00 30003000 27003b00 0d000a00           0.:.0.0.'.;.....

m_pNetConn->srvio.outbuff

0000000000000000:   04010025 00350100 e40b0000 00030000 0001fe04  ...%.5..ä.........þ.
0000000000000014:   00000000 fd0000f3 00000000 00000000 00000000  ....ý..ó............
... and so on - this is rather long.



CUEnvTransient @0x000000003875E0B8

m_ululRowcount = 0                  m_cidLastOpen = 0                   
m_lFetchStat = 0                    m_lRetStat = 0                      
m_lLastError = 0                    m_lPrevError = 0
m_cNestLevel = 0                    m_lHoldRand1 = 0                    
m_lHoldRand2 = 0                    m_cbContextInfo = 0                 
m_rgbContextInfo = 000000003875E280 




CDbAndSetOpts @0x000000003875E0D8

m_fHasDbRef = 0                     m_fDateFirstSet = 0                 
m_fDateFormatSet = 0                m_lUserOpt1 = 0x14817000            
m_lUserOpt2 = 0x40                  m_lUserOpt1SetMask = 0xfffff7f7
m_idtInsert.objid = 0               m_idtInsert.state = 0               
m_idtInsert.dbid = 0                m_llRowcnt = 0                      
m_lStatList = 0                     m_lTextSize = 2147483647
m_lOffsets = 0x0                    m_ulLockTimeout = 4294967295        
m_ulQueryGov = 0                    m_eDtFmt = 1                        
m_eDayDateFirst = 7                 m_eDdLckPri = 0
m_eIsoLvl = 2                       m_eFipsFlag = 0x0                   
m_sLangId = 0




CExecLvlIntCtxt @0x000000003875E330

m_pBatch = 0x0000000038773370             m_pxlvl = 0x000000003875F800        
m_pOAInfo = 0x0000000000000000            m_pOAUnInitFunc = 0x0000000000000000  
m_pXPUserData = 0x0000000000000000        m_pRpcInfo = 0x0000000000000000
m_pSharedRpcInfo = 0x0000000000000000     m_pCeiInfo = 0x0000000000000000
m_etts = 0x00000000                       m_eQueryProps = 00000000            
m_ulNotificationSweeperMode = 0           m_MsqlXactYieldMode = 2               
m_dbidPubExec = 0                         m_pBulkCtx = 0x0000000000000000
m_pBulkImp = 0x0000000000000000           m_pInsExec = 0x0000000000000000     
m_pbhfdCurrentlyActive = 0x000000003875E438 


                             
IBHFDispenser @0x000000003875E438

m_pbhfdPrev = 0x0000000000000000          m_pbhf = 0x0000000000000000         
m_bErrAction = 0                          m_ulStmtId = 0                          
m_ulCompStmtId = 1                        m_pShowplanTable = 0x0000000000000000
m_pShowplanXMLCtxt = 0x0000000000000000   m_pStmtAlterTable = 0x0000000000000000
m_pxliStackTop = 0x000000003875F9E0       m_pqni = 0x0000000000000000         
m_ullTransactionDescriptor = 0            m_dwOpenResultSetCount = 1          
m_InternalFlags = 0x00001002        



CExecLvlRepresentation @0x000000003875F890


CStmtCurState @0x000000004AD8E670

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

First observation: this is a treasure trove, or perhaps a garbage dump. Second observation: the output could actually get a lot more verbose, depending on what the session under scrutiny is doing.

All in all, the fact of its inclusion in dump files tells us that it contains information useful to support personnel, and that it helps to add context in a crash investigation.

A glimpse of the object tree

You certainly can’t accuse DBCC TEC of being easy on the eyes. But with a bit of digging (my homework, not yours) it helps to piece together how some classes fit together, tying up tasks, sessions and execution contexts.

Recall that Part 5 demonstrated how one can find the CSession for a given thread. DBCC TEC walks this relationship in the other direction, by deriving tasks and threads – among other things – from sessions.

While this isn’t made obvious in the text dump, the diagram below illustrates the classes involved and their linkages.

The object tree revealed by DBCC TEC

Of course these are very chunky classes, and I’m only including a few members of interest, but you should get the idea.

Observations on DBCC TEC and class relationships

Let’s stick to a few interesting tidbits. CSession is the Big Daddy, and it executes CBatches. Although we’d normally only see a single batch, under MARS there can be more; this manifests as the session holding a linked list of batches, even though normally the list would contain a single item (I simplified it to a scalar on the diagram).

A session keeps state and settings within a CUEnvTransient, which in turn refers to a CDbAndSetOpts. You should recognise many of those attributes from the flattened model we see in DMVs.

However, to make it more interesting, a task (indirectly) has its own CUEnvTransient; this is one reflection of the difference between batch-scoped and session-scoped state. And on top of that, even a session actually has two of them, with the ability to revert to a saved copy.

Although a CBatch may execute CStatements, we don’t see this here. The actual work done by a batch is of course expressed as a SOS_Task, which is a generic SQLOS object with no knowledge of SQL Server.

At this point the DBCC output layout is a bit misleading. Where we see EXCEPT (null) it is SOS_Task state being dumped, and we’re not within the CBatch anymore. However, the SOS_Task is actually just the public interface that SQLOS presents to SQL Server, and some of the state it presents lives in the associated Worker. The exception information lives within an ExcHandler pointed to by the Worker, and things like CPU Ticks used come directly from the Worker. Task State is the same rather convoluted derivation exposed in sys.dm_os_tasks, as outlined in Worker, Task, Request and Session state. ThreadID is an attribute of the SystemThread pointed to by the task’s Worker, and similarly SchedulerID comes indirectly via the Task’s SOS_Scheduler.

WAITINFO_INTERNAL and (where applicable) WAITINFO_EXTERNAL describe cooperative and preemptive waits respectively. The latter can actually get nested, giving us a little stack of waits, as described in Going Preemptive.

Now we start peering deeper into the task, or rather through it into its associated worker’s LocalStorage as described in Part 5. The first two slots, respectively pointing to a CCompExecCtxtBasic and an ExecutionContext, have their contents dumped at this point.

The ExecutionContext points to one instance each of SEInternalTLS and SEParams. Pause here for a moment while I mess with your mind: should the task in question be executing a DBCC command, for instance DBCC TEC, the m_pDbccContext within the SEInternalTLS will be pointing to a structure that gives DBCC its own sense of context.

Another fun observation: while SEInternalTLS claims with a straight face to have an m_owningTask member, I call bull on that one. Probably it did in the past, and some poor developer was required to keep exposing something semi-meaningful, but the address reported is the address of the ambient task, i.e. the one running DBCC TEC. The idea of ownership is simply not modelled in the class.

Then (still down in the LocalStorage of a Task performed by the Batch under scrutiny) we get the CCompExecCtxtBasic we saw in Part 5. This now shows us a few more members, starting with a CConnection which in turn points to a CNetConnection getting its buffers dumped for our viewing pleasure. This CConnection is in fact the same one pointed to directly by the Batch as m_pConn.

Now, as already mentioned, we see the CUEnvTransient with associated CDbAndSetOpts belonging to the task at hand (through its Worker’s LocalStorage->CCompExecCtxtBasic).

The next CCompExecCtxtBasic member is a pointer to a CExecLvlIntCtxt which in turn references an IBHFDispenser; in case you’re wondering, BHF stands for BlobHandleFactory, but I’m not going to spill the beans on that yet, because I am fresh out of beans.

Finally, CCompExecCtxtBasic’s performance ends with the duo of CExecLvlRepresentation and a CStmtCurState, neither of whom has much to say.

Whew. That was the superficial version.

Are we all on the same page?

While we could spend a lot of time on what all these classes really do, that feels like material for multiple separate blog posts. Let’s finish off by looking at the pretty memory layout.

Recall that addresses on an 8k SQLOS page will fall within ranges ending in 0000 – 1FFF, 2000 – 3FFF, 4000 – 5FFF and so on, and that all addresses within a single such page will have been allocated by the same memory object. Now note that the objects which expose a pmo (pointer to memory object) all point to 0x38772040. This sits on (and manages) a page stretching from 0x38772000 to 0x38773FFF.

Have a look through those object addresses now. Do you see how many of them live on that page? This is a great example of memory locality in action. If a standard library’s heap allocation was used, those addresses would be all over the place. Instead, we see that the memory was allocated, and the objects instantiated, with a clear sense of family membership, even though the objects are notionally independent entities.
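To make that page arithmetic concrete, here is a small C++ sketch – purely illustrative, not SQL Server code – which rounds two addresses from the dump above down to their 8k page start; both land on 0x38772000:

#include <cstdint>
#include <cstdio>

// Round an address down to the start of its 8k SQLOS page (0x2000 bytes).
static std::uintptr_t PageStart(std::uintptr_t addr)
{
    return addr & ~static_cast<std::uintptr_t>(0x1FFF);
}

int main()
{
    // Addresses lifted from the DBCC TEC output above.
    std::uintptr_t pmo         = 0x0000000038772040; // the memory object (pmo)
    std::uintptr_t cconnection = 0x0000000038773240; // the CConnection instance

    std::printf("pmo page start:         %p\n", reinterpret_cast<void*>(PageStart(pmo)));
    std::printf("CConnection page start: %p\n", reinterpret_cast<void*>(PageStart(cconnection)));
    // Both resolve to 0x38772000, i.e. the two objects share the
    // 0x38772000 - 0x38773FFF page managed by that memory object.
}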

As a final note on the subject of memory addresses, the eagle-eyed may have noticed that the CPhysicalConnection has an m_pSess pointer which is close to, but not equal to, the address of the CSession at the top of the object tree. This is because it is an ILogonSession interface pointer to this CSession instance, with the interface’s vftable living at an offset of 0x58 within the CSession class. See Part 3 for more background.

Further reading

Go on and re-read Remus Rusanu’s excellent How SQL Server executes a query. If you’ve never read it, do so now – it’s both broad and deep, and you can trust Remus on the details.

While reading it, see in how many places you can replace e.g. “Session” with “CSession”, and you will be that bit closer to Remus’s source code-flavoured subtext.

Context in perspective 5: Living next door TLS

Time to switch context to an old thread, specifically the SystemThread class. I have done this to death in an Unsung SQLOS post, but if you don’t want to sit through that, here are the bits we care about today:

  • All work in SQLOS is performed by good old-fashioned operating system threads.
  • SQLOS notionally extends threads with extra attributes like an associated OS event object, and behaviour like a method for waiting on that event. This bonus material lives in the SystemThread class, which is conceptually a subclass of an operating system thread, although this isn’t literally class inheritance.
  • A SystemThread instance lives in memory which is either allocated on the heap and pointed to by a thread-local storage slot, or it squats right across a range of TLS slots within the Thread Environment Block.

Due to the nature of thread-local storage, at any moment any code running on a SQLOS thread can get a reference to the SystemThread it’s running on; this is described in gory detail in Windows, Mirrors and a sense of Self. Powerful stuff, because this ambient SystemThread indirectly exposes EVERYTHING that defines the current SQL Server context, from connection settings to user identity and beyond.

Life is a box of ogres

LS through the looking glass

Understanding SQLOS takes us only so far, because it is after all just the substrate upon which SQL Server is built. We’ve now reached the point where SQLOS hands SQL Server a blank slate and says “knock yourself out”.

This blank slate is a small but delicious slice of local storage: an array of eighteen pointers living within the SystemThread. SQLOS has no interest in what these are used for or what they mean. As far as it is concerned, it’s just a bunch of bytes of no known type. But of course SQL Server cares a lot more.

Park that thought for a moment to consider that we’re starting to build up a hierarchy of thread-local storage:

  1. Upon an OS context switch, the kernel swaps the value held in the CPU’s GS register to point to the Thread Environment Block of the incoming thread.
  2. Within this Thread Environment Block live the TLS slots that SQLOS takes advantage of. One of these will always point to the currently running SystemThread instance. In other words, when the kernel does a context switch, the change of OS thread brings with it a change in the ambient SystemThread which can be retrieved from TLS.
  3. Now within this SystemThread, an array of eighteen pointer-sized slots is made available for the client application (SQL Server) to build upon.

What does this mean from the viewpoint of SQL Server? Well, even within the parts that don’t care about SQLOS and the underlying OS, code can express and find current thread-specific state – at a useful abstraction level – somewhere within those eighteen slots.
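As a heavily hedged C++ sketch of that chain – the type layout, member names and TLS slot index below are my own illustrative inventions, not the real SQLOS definitions – the lookup might look something like this:

#include <windows.h>

// Purely illustrative shape: the real SystemThread layout is undocumented,
// but the slot array does sit right at the start of the object.
struct SystemThread
{
    void* localStorage[18];   // the eighteen slots SQL Server builds upon
    // ... many more members
};

// Hypothetical TLS slot index; the real index is chosen by SQLOS at startup.
static DWORD g_sqlosTlsSlot = 0;

// Any code running on a SQLOS thread can, in principle, reach its ambient
// SystemThread like this: GS register -> TEB -> TLS slot -> SystemThread.
SystemThread* CurrentSystemThread()
{
    return static_cast<SystemThread*>(TlsGetValue(g_sqlosTlsSlot));
}

// ...and from there, whatever SQL Server chose to park in a given slot.
void* SlotPayload(int slot)
{
    return CurrentSystemThread()->localStorage[slot];
}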

Worker LocalStorage vs thread-local storage

We often skirt around the distinction between a worker and a thread. This is a benign simplification, because instances of the Worker and SystemThread classes are bound together in a 1:1 paired relationship at all times when SQL Server code is running.

The only time the distinction becomes particularly interesting is when we’re in fiber mode, because now a single SystemThread can promiscuously service multiple Workers in a round-robin fashion. I have documented some details in The Joy of Fiber Mode, but of special interest today is the treatment of these local storage slots.

These now become a more volatile part of thread context, and a SQLOS context switch (in this case, a fiber switch) must include copying the Worker’s LocalStorage payload into the SystemThread slots, because there is no guarantee that two different Workers will share the exact same context; in fact it is close to certain they won’t.

So clearly the context that matters to the Task at hand lives within the Worker, and SQLOS makes sure that this is the context loaded into a SystemThread while it runs; it’s just that the SQLOS scheduler has to intervene more in fiber mode, whereas it can let this bit of state freewheel in thread mode.

On to the implementation details then. Unlike the case with the SystemThread, a Worker’s local storage isn’t a chunk of memory within the class, but lives in a heap allocation, represented by an instance of the LocalStorage class embedded within the Worker.

Additionally, while the SystemThread’s slot count is baked into SQLOS, somebody clearly wanted a bit more abstraction in the Worker class, so the slot count becomes an attribute of a LocalStorage instance, which thus consists of a grand total of two members:

  1. The slot count, passed into the LocalStorage constructor
  2. A pointer to the actual chunk of memory living somewhere on the heap, allocated by the LocalStorage constructor, which is a thin wrapper around a standard page allocator (see the sketch just after this list)
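Here is a minimal C++ sketch of such a two-member structure, together with the fiber-switch copy described a little earlier; member names and the allocation mechanics are my guesses for illustration, not the real SQLOS code:

#include <cstdint>
#include <cstring>

// Illustrative only: the layout matches what the debugger output below
// suggests (a count followed by a pointer), nothing more.
struct LocalStorage
{
    std::uint64_t slotCount;  // 0x12 = 18 in the dump below
    void**        slots;      // heap allocation holding the slot array

    explicit LocalStorage(std::uint64_t count)
        : slotCount(count),
          slots(new void*[count]())   // real code goes through a SQLOS page allocator
    {
    }
};

// In fiber mode, a SQLOS context switch has to refresh the SystemThread's
// slots from the incoming Worker's LocalStorage payload; conceptually it is
// no more than this copy.
void LoadWorkerContext(void** systemThreadSlots, const LocalStorage& ls)
{
    std::memcpy(systemThreadSlots, ls.slots, ls.slotCount * sizeof(void*));
}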

Show me some numbers, and don’t spare the horses

On to the fun bit, but you’re going to have to take my word on two things. Firstly, the SystemThread’s local storage slots are right at the start of the object. And secondly, the LocalStorage instance lives at an offset of 0x30 within a Worker, at least on the version I’m using here.

To prepare, I needed to capture the addresses of a bound SystemThread:Worker pair before breaking into the debugger, so I started a request running in session 53, executing nothing but a long WAITFOR – this should keep things static enough while I fumble around running a DMV query in another session:

Capturing the worker and thread addresses of our dummy request

So we’re off yet again to stage a break-in within Windbg. Having done this, everything is frozen in time, and I can poke around to my heart’s content while the business users wonder what just happened to their server. No seriously, you don’t want to do this on an instance that anyone else needs at the same time.

Here is the dump of the first 64 bytes of that Worker. As in Part 4, I’m dumping in pointer format, although only some of these entries are pointers or even necessarily 64 bits wide. In case it isn’t clear, the first column is the address we’re looking at, and the second column is the contents found at that address. The dps command sweetens the deal a bit: if that payload is an address corresponding to a known symbol (e.g. the name of a function), that symbol will be displayed in a third column.

0:075> dps 0x000000003B656160 l0x8
00000000`3b656160  00000000`00000000
00000000`3b656168  00000000`00000000
00000000`3b656170  00000000`3d948170
00000000`3b656178  00000000`3a21a170
00000000`3b656180  00000000`00000001
00000000`3b656188  00000000`41080158
00000000`3b656190  00000000`00000012
00000000`3b656198  00000000`3b656c80

The last two highlighted entries represent the LocalStorage instance, confirming that we do indeed have eighteen (=0x12) slots, and telling us the address where that slot array starts. Let’s see what payload we find there:

0:075> dps 00000000`3b656c80 l0x12
00000000`3b656c80  00000000`3875e040
00000000`3b656c88  00000000`3875e9a0
00000000`3b656c90  00000000`00000000
00000000`3b656c98  00000000`38772608
00000000`3b656ca0  00000000`38773440
00000000`3b656ca8  00000000`00000000
00000000`3b656cb0  00000000`00000000
00000000`3b656cb8  00000000`00000000
00000000`3b656cc0  00000000`00000000
00000000`3b656cc8  00000000`00000000
00000000`3b656cd0  00000000`3a8776a0
00000000`3b656cd8  00000000`00000000
00000000`3b656ce0  00000000`00000000
00000000`3b656ce8  00000000`00000000
00000000`3b656cf0  00000000`00000000
00000000`3b656cf8  00000000`00000000
00000000`3b656d00  00000000`00000000
00000000`3b656d08  00000000`00000000

It seems that only five out of the eighteen are in use. Oh well, that’s neither here nor there. Let’s compare this to the first eighteen quadwords at the bound SystemThread’s address we found in sys.dm_os_threads:

0:075> dps 0x00007FF653311508 l0x12
00007ff6`53311508  00000000`3875e040
00007ff6`53311510  00000000`3875e9a0
00007ff6`53311518  00000000`00000000
00007ff6`53311520  00000000`38772608
00007ff6`53311528  00000000`38773440
00007ff6`53311530  00000000`00000000
00007ff6`53311538  00000000`00000000
00007ff6`53311540  00000000`00000000
00007ff6`53311548  00000000`00000000
00007ff6`53311550  00000000`00000000
00007ff6`53311558  00000000`3a8776a0
00007ff6`53311560  00000000`00000000
00007ff6`53311568  00000000`00000000
00007ff6`53311570  00000000`00000000
00007ff6`53311578  00000000`00000000
00007ff6`53311580  00000000`00000000
00007ff6`53311588  00000000`00000000
00007ff6`53311590  00000000`00000000

This isn’t the same address as the Worker’s local storage array, but the contents are the same, which is consistent with my expectation. I’m highlighting that first entry, because we’ll be visiting it later.

Just out of interest, I’m also going to try and tie things back to memory page observations made in Part 4. Let’s peek at the top of the 8k page that the LocalStorage lives on. Its address is 0x3b656c80, which rounded down to the nearest 8k (=0x2000) gives us a page starting address of 0x3b656000.

0:075> dps 0x000000003B656000
00000000`3b656000  00010002`00000000
00000000`3b656008  00000000`3b656040
00000000`3b656010  00060003`00000001
00000000`3b656018  00000000`000012c0
00000000`3b656020  00000000`00000000
00000000`3b656028  00000000`00000000
00000000`3b656030  00000000`00000001
00000000`3b656038  00000000`00000110
00000000`3b656040  00007ffd`5f95c290 sqldk!CMemObj::`vftable'

The shape of that page header looks familiar. The second quadword is a pointer to the address 0x40 bytes into this very page. And just to hand it to us on a plate, the object starting there reveals itself by its vftable symbol to be a CMemObj.

In other words, this particular bit of LocalStorage is managed by a memory object which lives on that very page – obviously it would be wasteful if a memory object refused to share its page with some of the objects it allocated memory for. Also note that this is a plain CMemObj and not a CMemThread<CMemObj>, i.e. it isn’t natively thread-safe. This may simply confirm that local storage is not intended for sharing across threads; see Dorr for more.
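To illustrate that distinction – and only the distinction; the real classes are SQLOS-internal, and everything below apart from the two names is my own stand-in – think of CMemThread<T> as the flavour that wraps each allocation in synchronisation:

#include <cstdlib>
#include <mutex>

// Stand-in for a plain, single-threaded memory object: no locking at all.
class CMemObj
{
public:
    void* Alloc(std::size_t bytes) { return std::malloc(bytes); }
};

// The thread-safe flavour: same interface, but allocations are serialised,
// so instances can safely be shared across threads.
template <typename TMemObj>
class CMemThread : public TMemObj
{
    std::mutex m_lock;  // stand-in for whatever synchronisation SQLOS really uses
public:
    void* Alloc(std::size_t bytes)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return TMemObj::Alloc(bytes);
    }
};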

Cutting to the chase

So far, this is all rather abstract, and we’re just seeing numbers which may or may not be pointers, pointing to heaven knows what. Let me finish off by illuminating one trail to something we can relate to.

Out of the five local storage slots which contain something, the first one here points to 00000000`3875e040. As it turns out, this is an instance of the CCompExecCtxtBasic class, and you’ll just have to take my word for it today. Anyway, we’re hunting it for its meat, rather than for its name. Have some:

0:075> dps 3875e040
00000000`3875e040  00000000`3875ff00
00000000`3875e048  00000000`387727f0
00000000`3875e050  00000000`3875e0c0
00000000`3875e058  00000000`38772470
00000000`3875e060  00000000`38773560
00000000`3875e068  00000000`3875f310
00000000`3875e070  00000000`38773370
00000000`3875e078  00000000`38773240
00000000`3875e080  00000000`3875e0b8
00000000`3875e088  00000000`3875e330
00000000`3875e090  00000000`38773440
00000000`3875e098  00000000`00000001
00000000`3875e0a0  00000000`3875f890
00000000`3875e0a8  00000000`00000001
00000000`3875e0b0  00000000`4777e110
00000000`3875e0b8  00000000`00000000

Judging by the range of memory addresses we’ve seen so far, most of these are likely themselves pointers. I’ll now lead you by the hand to the highlighted fourth member of CCompExecCtxtBasic, and demonstrate what that member points to:

0:075> dps 00000000`38772470 
00000000`38772470 00007ffd`61d575f8 sqllang!CSession::`vftable'
00000000`38772478 00000000`0000000c
00000000`38772480 00000000`00000035
00000000`38772488 00000000`00000000
00000000`38772490 00000000`00000000
00000000`38772498 00000000`00000000
00000000`387724a0 00000001`00000001
00000000`387724a8 00000001`00000009
00000000`387724b0 00000000`00000000
00000000`387724b8 00000000`00000000
00000000`387724c0 00000000`00000000
00000000`387724c8 00007ffd`61d17b20 sqllang!CSession::`vftable'
00000000`387724d0 00000000`38772040
00000000`387724d8 00000000`38772240
00000000`387724e0 00000000`38772a60
00000000`387724e8 00000000`3875f570

Bam! Given away by the vftable yet again – this is a CSession instance, in other words probably the session object associated with that running thread. As per Part 3, the presence of more than one vftable indicates that we may be dealing with multiple virtual inheritance in the CSession class.

We’ll leave the last word for the highlighted third line, containing the value 0x35 as payload.

What does 0x35 translate to in decimal? Would you believe it, it’s 53, the session_id of the session associated with the thread. If we were feeling bored, we could go and edit it to be another number, essentially tinkering with the identity of that session. But today is not that day.

Putting it all together in a diagram, here then is one trail by which an arbitrary chunk of SQL Server code can find the current session, given nothing more than the contents of the GS register:

Sure, your programming language and framework may very well abstract away this kind of thing. But that doesn’t mean you aren’t relying on it all the time.

Final thoughts

This exercise would qualify as a classic case of spelunking, where we have a general sense of what we’d expect to find, perhaps egged on by a few clues. Just as I’m writing this, I see that Niels Berglund has also been writing about spelunking in Windbg. So many angles, so much interesting stuff to uncover!

The key takeaway should be a reminder of how much one can do with thread-local storage, which forms the third and often forgotten member of this trio of places to put state:

  1. Globally accessible heap storage – this covers objects which are accessible everywhere in a process, including ones where lexical scope attempts to keep them private.
  2. Function parameters and local variables – these have limited lifespans and live in registers or on the stack, but remain private to an invocation of a function unless explicitly shared.
  3. Thread-local storage – this appears global in scope, but is restricted to a single thread, and each thread gets its own version of the “same” global object (as the small example after this list shows). This is a great place to squirrel away the kind of state we’d associate with a session, assuming we leverage a link between sessions and threads.
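Here is a tiny standard C++ illustration of that third option – nothing SQL Server-specific, just the language-level expression of the same idea as the SystemThread slots:

#include <cstdio>
#include <thread>

// Each thread sees its own copy of this "global" variable.
thread_local int g_sessionId = 0;

void Run(int id)
{
    g_sessionId = id;                        // private to this thread
    std::printf("this thread sees session %d\n", g_sessionId);
}

int main()
{
    std::thread a(Run, 53), b(Run, 54);
    a.join();
    b.join();
    // Each thread prints its own value; neither overwrites the other.
}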

I hope that the theme of this series is starting to come together now. One or two more to go, and the next one will cover sessions and their ancillary context in a lot more breadth.

Indirection indigestion, virtual function calls and SQLOS

One of Slava Oks’s classic posts from the 2005 era is A new platform layer in SQL Server 2005 to exploit new hardware capabilities and their trends. I have occasionally revisited it as a kind of SQLOS manifesto, and some things which at first I found mystifying have become clearer over the years.

In many ways, it seems that the more recent SQLOSv2/SQLPAL work is a simple case of continuing with a project that has been organically evolving since the SQL Server 7 User Mode Scheduler, and rooted in classic Stonebraker: just how far can we assume control of core OS functions within an RDBMS process?
Continue reading “Indirection indigestion, virtual function calls and SQLOS”

Scheduler stories: Going Preemptive

SQLOS is built upon the idea of cooperative, AKA non-preemptive, scheduling: out of any given collection of threads belonging to a scheduler, only one will own the scheduler at a given moment. To the degree that these cooperative threads represent the only work done by the underlying CPU, this means that the thread owning the scheduler really owns the CPU. Of course, a CPU will occasionally get side-tracked into doing other work, so the SQLOS scheduler as “virtual CPU” only represents a chunk of the real CPU, but we live in the expectation that this is a suitably large chunk.

John Tenniel's White Rabbit from "Alice in Wonderland"

It is understandable when such competition for the CPU comes from outside of SQL Server, but it can also be instigated from within SQL Server itself. This is the world of preemptive – or if you prefer, antisocial – scheduling.
Continue reading “Scheduler stories: Going Preemptive”

Scheduler stories: Interacting with the Windows scheduler

In the previous post, The joy of fiber mode, we saw how a fiber mode scheduler firmly controls which worker runs on a thread at a given moment. While it can’t positively ensure that the thread in question remains running all the time, the soul of the scheduler lives in that one thread, and as long as the thread runs, the scheduler gets invoked by its team of fiber workers, dispatching them in an appropriate order.
Continue reading “Scheduler stories: Interacting with the Windows scheduler”