In the footsteps of a cooperative wait

In the last two posts, I gradually climbed up the stack trace of a typical SQLOS thread. Here is such a stack trace from thread creation to context switch, omitting the actual meat of SQL Server so we can focus on the scheduling aspect.

Today I’m looking at the upper part of that map, where the thread was deep in thought, but stepped into the open manhole of a latch wait, which – as is typical for most waits – involves an EventInternal as the underlying mechanism.

Slipping into SwitchContext()

The SQLOS scheduler exists in the cracks between user tasks. As we’re well aware, in order for scheduling to happen at all, it is necessary for tasks to run scheduler-friendly code every now and again. In practice this means either calling methods which have the side effect of checking your quantum mileage and yielding if needed, or explicitly yielding yourself when the guilt gets too much.

Now from the viewpoint of the user task, the experience of yielding is no different than the experience of calling any long-running CPU-intensive function: You call a function and it eventually returns. The real difference is that the CPU burned between the call and its return was spent on one or more other threads, while the current thread went lifeless for a bit. But you don’t know that, because you were asleep at the time!

Anyway, for perfectly valid reasons, in the example an EventInternal’s Wait() method decided to go to sleep, or viewed from a different angle, to put its caller to sleep. We know how that story ends. Ultimately the Wait() call will return, but before then, the thread will snake its way into a cooperative context switch involving SignalObjectAndWait().

The recipe

The EventInternal’s Wait() function is one of a handful of blocking functions that ferry threads into the cooperative context switch – or alternatively you can view it as ferrying the CPU across workers. In SQL Server 2017, you’ll start seeing WaitableBase::Wait(), but this is mostly a refactoring, or possibly even un-inlining, of existing code which only now shows up in public symbols.

Getting into a context switch and back out again – i.e. eventually having the context switched back to the current thread – involves, in the polite context of a task, a sequence of three function calls within Wait():

  1. SOS_Task::PreWait() – this sets up wait accounting and publishes the wait_info XEvent.
  2. SOS_Scheduler::SuspendNonPreemptive() – this sets up timeout logic and does a final check for task abort before calling SwitchContext(). The result of SwitchContext() (which is ultimately the result of its call to Switch()) will be passed back up to Wait() as the wait result.
  3. SOS_Task::PostWait() – this performs the actual wait accounting and clears the waiting status of the task.

These are outlined below:

EventInternal Wait sequence
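
To make the recipe concrete, here is a heavily hedged C++ sketch of the blocking path inside Wait(). The class and method names come from public symbols, but every signature and body below is an assumption for illustration only:

// Hedged sketch only: names from public symbols, signatures and bodies assumed.
struct WaitInfo { int waitType = 0; };   // stand-in for the wait_info XEvent payload

struct SOS_Task {
    void PreWait(WaitInfo&)  {}   // set up wait accounting, publish the wait_info XEvent
    void PostWait(WaitInfo&) {}   // perform the accounting, clear the waiting status
};

struct SOS_Scheduler {
    // Sets up timeout logic, checks for task abort, then calls SwitchContext();
    // the value returned is whatever the waker-upper inscribed into the worker.
    int SuspendNonPreemptive(int timeoutMs) { (void)timeoutMs; return 0; }
};

// The three-call recipe at the heart of EventInternal's Wait().
int WaitSketch(SOS_Task& task, SOS_Scheduler& sched, int timeoutMs)
{
    WaitInfo info;
    task.PreWait(info);                                       // 1
    int waitResult = sched.SuspendNonPreemptive(timeoutMs);   // 2
    task.PostWait(info);                                      // 3
    return waitResult;   // the result of Switch(), bubbled all the way up
}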

The elusive SwitchContext() and its uncle Switch()

Okay, I was guilty of a white, or perhaps red, lie by including a call to SwitchContext() in that first diagram. Unless you have a breakpoint on that function, you will probably never see it in a stack trace. This is because it makes a tail call to Switch(), meaning the compiler transfers control to its child Switch() through a jmp instruction rather than a call, thus erasing and reusing the SwitchContext stack frame. Goto is neither dead nor particularly harmful once you’re into assembly language.

But anyway, there is a nice delineation between the two. Switch() is the lowest-level SQLOS function where a thread may enter and block before eventually returning, and this is where the call to SignalObjectAndWait() happens. As input parameters, it receives a pointer to the current Worker and the one that must be switched to. This includes special handling for the case where the two are the same, e.g. if the worker graciously yielded due to quantum exhaustion, but was the only viable worker on the runnable queue, so got rescheduled immediately. In this case (“InstantResuming” in TaskTransition parlance) there is no mucking about with SignalObjectAndWait, and the function simply returns as soon as it can.

Otherwise, the outgoing worker is tidied up with a TaskTransition call of type “Suspending”, and the long-awaited SignalObjectAndWait ceremony is performed. Next thing it knows, SignalObjectAndWait returns because time passed, other workers ran, and Lady Luck – or if you’re lucky, Father Fairness – chose it as the next worker eligible for a quantum. At this point we get a “Resuming” TaskTransition, and the return value inscribed into the worker by its waker-upper, back when it was put back onto the runnable queue, becomes the return value of Switch() and hence SwitchContext().

However, as a last-ditch guard against spurious wakeup, should the SignalObjectAndWait call return without the prescribed sacrament of the ambient SystemThread having its LastSignalledBy set by another, we cry foul and go to sleep again using a simple WaitForSingleObject(). As of 2016, there is now even an accompanying premature_systemthread_wakeup XEvent to herald the outrage.
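
Pulling the last few paragraphs together, a hedged sketch of Switch() might look like the below. The Worker and SystemThread members are assumptions; only SignalObjectAndWait() and WaitForSingleObject() are the real Windows APIs:

#include <windows.h>

struct Worker {
    HANDLE hEvent;         // the worker's personal Windows event
    int    wakeupResult;   // inscribed by whoever put this worker on the runnable queue
};

struct SystemThread {
    void* LastSignalledBy; // set by the signalling thread; null means spurious wakeup
};

int Switch(Worker* outgoing, Worker* incoming, SystemThread* ambient)
{
    if (outgoing == incoming) {
        // "InstantResuming": we were the only viable worker on the runnable
        // queue, so there is no mucking about with SignalObjectAndWait.
        return outgoing->wakeupResult;
    }

    // TaskTransition("Suspending") would be recorded around here.
    ambient->LastSignalledBy = nullptr;
    SignalObjectAndWait(incoming->hEvent,   // wake the incoming worker...
                        outgoing->hEvent,   // ...and atomically sleep on our own event
                        INFINITE, FALSE);

    // Last-ditch guard against spurious wakeup: if nobody set LastSignalledBy,
    // cry foul and go back to sleep (2016+ fires premature_systemthread_wakeup).
    while (ambient->LastSignalledBy == nullptr)
        WaitForSingleObject(outgoing->hEvent, INFINITE);

    // TaskTransition("Resuming") here; the inscribed value becomes the return
    // value of Switch() and hence SwitchContext().
    return outgoing->wakeupResult;
}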

Working backwards then, what does SwitchContext() do? Simple. This is where all the scheduler chores (e.g. checking timer and I/O lists) happen, and crucially, where the next incoming worker is chosen from the runnable queue. Its special case is finding an empty runnable queue, at which point the scheduler’s idle worker is scheduled, which may eventually go to sleep through WaitForSingleObject() on the scheduler’s own idle event. At this point the whole scheduler would be asleep, and it will require another scheduler to wake it up by signalling that idle event.

My, how the runnable queue has grown up. You may have gathered from Bob Dorr’s 2016 updated scheduling algorithms post that the PercentileQueue used in 2012 and 2014 got replaced with something else. What we have in 2016 (and AFAIK in 2017) is the continued use of said PercentileQueue for system schedulers, but the new GroupWorkerQueue for normal ones. This is a thing of beauty, containing a linked list per resource group per scheduler, i.e. partitioned in such a way that little in the way of locking is needed. I would like to highlight that its use doesn’t affect the awarded quantum target, which remains at 4ms, but only which worker gets chosen. One day I might have enough meat to write a whole post about it…

Final thoughts

This touches upon something I scratched at in The Myth of the Waiter List, which can always do with being aired again.

For the most part, a wait list, aka suspend queue, is something that belongs to a synchronisation resource like a latch or a reader-writer lock. Apart from the timer and I/O lists, and the work dispatcher, suspend queues have nothing to do with schedulers: the synchronisation objects that own those suspend queues will move a worker from them to the relevant scheduler’s runnable queue when its time is up. The scheduler really only cares about the runnable queue, and will not waste tears or time worrying about workers suspended on synchronisation objects.

It should be clear from the context, but I have completely ignored fiber mode today. A fiber switch is a different beast entirely, and you can read more about it in The Joy of Fiber Mode.

Yes, there is some repetition from earlier posts, but I hope that covering the same ground multiple times in different ways works as well for you as it does for me. Any questions or observations? I’m all ears.

The early life of a SQLOS thread

So I have checked off that bucket list item of speaking at a SQLSaturday. In the process of getting my act together, I learned a thing or two about the undocumented youth of SQLOS threads, between birth and entering the workplace. And you didn’t, which seems unfair.

We normally see stack traces while looking at the top of the stack, typically during a wait, at which point the thread is wearing full worker garb, and has been executing a task for a while. Let’s today reflect on those happy times when our thread was in diapers.

Conception and birth

Threads are born because the system as a whole decides they are cute and it wants more of them. The decision is made in the SystemThreadDispatcher, which is a component of a SchedulerManager, itself a component of an SOS_Node, aka CPU node.

We can simplify this: Threads are born into nodes.

Now a thread isn’t created at the moment that it is needed, and it isn’t legally able to perform work right from birth. The idea is to have a reasonable number of grown-up threads in the population, ready to be put to work at short notice. We are just at the first step.

Thread creation is done through a CreateRemoteThreadEx() call, within the function SystemThreadDispatcher::CreateNewSysThreadIfRequired(), which is invoked as a side task by another thread when it leaves the pool of unemployed threads.

The function pointer passed in as thread entry point is SchedulerManager::ThreadEntryPoint(), and the parameter that will be passed to that entry point is a pointer to the target node’s SchedulerManager. In other words, when the function runs, it will be a completely normal instance method call on that SchedulerManager, parameterless except for the this pointer. And since the SchedulerManager knows what node it belongs to, our newborn thread will instinctively be able to crawl into the arms of the maternal SOS_Node.
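
As a hedged sketch of that plumbing (the real CreateNewSysThreadIfRequired() surely does a lot more bookkeeping), the shape is the classic void*-to-this thunk:

#include <windows.h>

struct SchedulerManager {
    void ThreadEntryPoint() { /* enlist in the node, go find a Worker, ... */ }
};

// OS thread entry points take a single void*, so a thunk turns that
// parameter back into a this pointer and makes a normal instance call.
static DWORD WINAPI ThreadEntryThunk(LPVOID param)
{
    static_cast<SchedulerManager*>(param)->ThreadEntryPoint();
    return 0;
}

void CreateNewSysThreadSketch(SchedulerManager* mgr)
{
    // CreateRemoteThreadEx() against the current process behaves like
    // CreateThread(), with room for extended attributes if desired.
    HANDLE h = CreateRemoteThreadEx(GetCurrentProcess(), nullptr, 0,
                                    ThreadEntryThunk, mgr, 0, nullptr, nullptr);
    if (h != nullptr)
        CloseHandle(h);   // the thread itself lives on
}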

But I am getting ahead of myself here. Before even running that entry point function, the thread creation callback registered during SQLOS boot (SystemThread::DllMainCallback()) is invoked by the OS runtime in the context of the new thread. And that gives us a SystemThread associated with the thread, meaning it has – among other things – the Windows event that will let it participate in SQLOS context switching.

So the very first thing our newborn thread, cosily wrapped up in a SystemThread, does is to enlist itself in the parent SOS_Node – and by “enlist” I literally mean adding itself to a linked list. Strictly speaking, it enlists the SystemThread, which is now SQLOS’s proxy to the thread: whenever we want to refer to a thread, we do so through a pointer to its SystemThread. Looking at it from one direction, the SystemThread contains a handle to the thread. From the other direction, any running code can find the ambient SystemThread through a thread-local storage lookup.

As it stands, the thread can’t do much useful in polite company yet, other than suspend itself. SystemThread::Suspend() is the most rudimentary of scheduling functions, just calling WaitForSingleObject() on the thread’s personal Event.
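
In sketch form, and with the member name assumed, that rudimentary building block is nothing more than this:

#include <windows.h>

// Hedged sketch: member name assumed, mechanism as described above.
struct SystemThread {
    HANDLE hEvent;   // the personal event acquired via DllMainCallback()

    void Suspend() {
        WaitForSingleObject(hEvent, INFINITE);   // sleep until someone signals us
    }
};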

When a thread loves a Worker

ThreadEntryPoint now calls SystemThreadDispatcher::ProcessWorker() on the SOS_Node’s SystemThreadDispatcher, i.e. the one within the current SchedulerManager.

The SystemThreadDispatcher shows itself to be a dating agency, keeping separate linked lists of unattached SystemThreads and idle Workers, and pairing them off according to supply and demand.

From the viewpoint of the thread running it, ProcessWorker() means “find me an unattached Worker so we can change the world together”. If there isn’t a spare Worker at this moment though, the thread goes to sleep through the aforementioned SystemThread::Suspend() call, only to be woken up when a fresh young Worker arrives on the dating scene. This moment is celebrated by ProcessWorker() moving on to call SystemThread::RunWorker().

Pairing the two up includes the SystemThread swearing a vow of loyalty to the Worker’s associated SOS_Scheduler. Up to this point, the thread was “in the SystemThreadDispatcher” and associated with an SOS_Node, but not a specific scheduler. From here onwards, the SystemThread and Worker are fully part of the family of workers for that scheduler.
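
A hedged sketch of the dating agency’s supply-and-demand logic, with all list handling and locking elided and every name assumed, might run along these lines:

#include <deque>

struct SOS_Scheduler;
struct Worker { SOS_Scheduler* sched = nullptr; };

struct SystemThread {
    Worker* pWorker = nullptr;
    void Suspend() { /* WaitForSingleObject on the personal event */ }
    void RunWorker(Worker* w) {
        pWorker = w;   // the vow of loyalty: from here on, this thread
                       // belongs to w->sched, not just to the node
    }
};

struct SystemThreadDispatcher {
    std::deque<SystemThread*> idleThreads;   // unattached SystemThreads
    std::deque<Worker*>       idleWorkers;   // Workers looking for a thread

    void ProcessWorker(SystemThread* t) {
        if (idleWorkers.empty()) {
            idleThreads.push_back(t);
            t->Suspend();   // sleep; the waker enqueues a Worker for us
        }                   // and removes us from idleThreads
        Worker* w = idleWorkers.front();
        idleWorkers.pop_front();
        t->RunWorker(w);    // happily paired off
    }
};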

We now move on to SchedulerManager::WorkerEntryPoint(), which initialises the Worker, e.g. setting timestamps and the first quantum target, before invoking the first SOS_Scheduler method, ProcessTasks().

Interesting aside regarding waits: The suspension of a thread within the SystemThreadDispatcher isn’t a measured wait, because waiting is measured at the level of workers and schedulers, neither of which have yet entered the picture.

Your task, should you choose to accept it…

Moving into the family home of the Worker, the first stop within ProcessTasks() is a courtesy call on that scheduler’s WorkDispatcher. If the SystemThreadDispatcher was a dating agency for Workers and SystemThreads, the WorkDispatcher is an employment agency for those couples, pairing them up with jobs in the form of SOS_Tasks.

Entering the WorkDispatcher initially, the pair generally wouldn’t find a pending task. At this point they (though the pair is now just viewed as a Worker by the scheduler) are put to sleep through a full-fledged scheduler method, SOS_Scheduler::SuspendNonPreemptive(). This means that the Worker ends up on a suspend queue, specifically the WorkDispatcher’s list of idle workers.

When a task is lobbed over the wall into the scheduler from elsewhere, the WorkDispatcher will assign it to an idle Worker, and the worker is made runnable. In due course it will be chosen as the next worker to run, continuing with the ProcessTasks() call to run the function specified through the task: this is SOS_Scheduler::RunTask() into SOS_Task::Param::Execute().

The task gets executed, through all the joys and heartaches of taskhood, and if parallelism is involved, child tasks may even be spawned. Ultimately though, the task will be done, and the pair return to the WorkDispatcher’s idle list, blocked in SOS_Scheduler::ProcessTasks() but ready for the next challenge.
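
In hedged sketch form, the worker’s whole career inside ProcessTasks() boils down to a loop of this shape (all names, signatures and list handling assumed):

struct SOS_Task {
    void Execute() { /* the actual work: SOS_Task::Param::Execute() territory */ }
};

struct Worker {};

struct WorkDispatcher {
    SOS_Task* TryGetPendingTask() { return nullptr; }    // stub
    void ParkOnIdleList(Worker*) { /* suspend queue of idle workers */ }
};

struct SOS_Scheduler {
    WorkDispatcher dispatcher;

    void SuspendNonPreemptive(Worker* w) { dispatcher.ParkOnIdleList(w); }
    void RunTask(SOS_Task* t) { t->Execute(); }

    void ProcessTasks(Worker* w) {
        for (;;) {
            SOS_Task* task = dispatcher.TryGetPendingTask();
            if (task == nullptr) {
                // No work yet: block on the WorkDispatcher's idle list until
                // a task is lobbed over the wall and we're made runnable.
                SuspendNonPreemptive(w);
                continue;
            }
            RunTask(task);   // the joys and heartaches of taskhood
        }
    }
};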

You want pictures? Sure.

The relationship between SOS_Node, SOS_Scheduler and their dispatching components

(For the sake of honesty, I should note that a node actually has separate SchedulerManagers for normal and hidden schedulers.)

Up next

This takes care of how tasks, workers, and threads interact – at least in thread mode, which is probably the only mode we care about. In the next blog post I will look into how tasks actually get instantiated.

Unsung SQLOS: The SOS_UnfairMutexPair behind CMEMTHREAD waits

As with the droppings of the Questing Beast, we recognise synchronisation code paths by their emissions. But even when not leaving telltale fewmets, these creatures wander among us unseen, unsung, until shutdown doth us part.

The place of the SOS_UnfairMutexPair

Bob Dorr has done a great job of illustrating why thread-safe SQLOS memory objects need serialised access in How It Works: CMemThread and Debugging Them. This has been followed up by the “It just runs faster” coverage of dynamic memory partitioning.

Today I’ll be examining the SOS_UnfairMutexPair, the synchronisation object behind memory allocations. While I’m going to describe synchronisation as a standalone subject, it’s useful to think of it as the CMEMTHREAD wait; to the best of my knowledge nothing other than memory allocations currently uses this class.

For context, I have previously described a bunch of other synchronisation classes.

One picture which emerges from all these synchronisation flavours is that a developer can choose between busy waiting (eagerly burning CPU), politely going to sleep (thread rescheduling), or a blend between the two. As we’ll see, the SOS_UnfairMutexPair goes for a peculiar combination, which was clearly tuned to work with the grain of SQLOS scheduling.

The shape of the object

With the exception of the central atomic lock member, everything comes in pairs here. The class models a pair of waitable objects, each having an associated pair of members to note which scheduler and task currently owns it:

Layout of the SOS_UnfairMutexPair (2016 incarnation)

Although it exposes semantics to acquire either both mutexes or just the second, in its memory allocation guise we always grab the pair, and it effectively acts as a single mutex. I’ll therefore restrict myself to describing half of the pair.

Acquisition: broad outline

The focal point of the mutex’s state – in fact one might say the mutex itself – is the single Spinlock bit within the 32-bit lock member. Anybody who finds it zero, and manages to set it to one atomically, becomes the owner.

Additionally, if you express an interest in acquiring the lock, you need to increment the WaiterCount, whether or not you managed to achieve ownership at the first try. Finally, to release the lock, atomically set the spinlock to zero and decrement the WaiterCount; if the resultant WaiterCount is nonzero, wake up all waiters.
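
Reading ahead to the atomic constants in the compare-and-swap section below, the layout of that lock member can be reconstructed with reasonable confidence, though the member and constant names here are my own:

#include <cstdint>

// Hedged reconstruction: two 16-bit halves, one per mutex in the pair,
// each with a spinlock bit up top and a 15-bit WaiterCount below it.
struct UnfairMutexPairLock {
    volatile uint32_t lock = 0;
    // bit 15 / bit 31   = spinlock bits (the mutexes themselves)
    // bits 0-14 / 16-30 = WaiterCounts

    static constexpr uint32_t BothWaiters = 0x00010001;  // +1 waiter, both halves
    static constexpr uint32_t BothLocks   = 0x80008000;  // both spinlock bits
    static constexpr uint32_t GrabPair    = 0x80018001;  // lock + count, both halves

    bool IsHeld() const { return (lock & BothLocks) != 0; }
};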

Now one hallmark of a light-footed synchronisation object is that it is very cheap to acquire in the non-contended case, and this class checks that box. If not owned, taking ownership (the method SOS_UnfairMutexPair::AcquirePair()) requires just a handful of instructions, and no looping. The synchronisation doesn’t get in the way until it is needed.

However, if the lock is currently owned, we enter a more complicated world within the SOS_UnfairMutexPair::LongWait() method. This broadly gives us a four-step loop (reassembled as a code sketch a little further down):

  1. If the lock isn’t taken at this moment, grab it, simultaneously incrementing WaiterCount, then exit triumphantly, holding aloft the prize.
  2. Fall back on only incrementing WaiterCount for now, if this is the first time around the loop and the increment has therefore not been done yet.
  3. Now wait on the EventInternal, i.e. yield the thread to the scheduler.
  4. Upon being woken up by the outgoing owner releasing the lock as described above, try again to acquire the lock. Repeat the whole loop.

The unfairness derives from the fact that there is no “first come, first served” rule, in other words the wait list isn’t a queue. This is not a very British class at all, but as we’ll see, there is a system within the chaos.

The finicky detail

Before giving up and waiting on the event, there is a bit of aggressive spinning on the spinlock. As is standard with spinlocks, spinning burns CPU on the optimistic premise that it won’t have to do so for long, so it’s worth a go. However, the number of spin iterations is limited. Here is a slight simplification of the algorithm:

  • If the scheduler owning the lock is the ambient scheduler, restrict to a single spin.
  • Else, give it a thousand tries before sleeping.
  • Each spin involves first trying to grab the lock and (if not yet done) increment WaiterCount in a single atomic operation. If that doesn’t work, just try to increment the WaiterCount.

This being of course the bit where the class knows a thing or two about SQLOS scheduling: If I am currently running, then no other worker on my scheduler can be running. But if another worker on my scheduler currently holds the lock, it can’t possibly wake up and progress towards releasing it unless *I* go to sleep. Mind you, this is already an edge case, because we’d hope that the owner of this kind of lock wouldn’t go to sleep holding it.
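
Putting the loop and the spin policy together, here is a hedged reconstruction of LongWait(); the atomic constants are real, everything else is assumption:

#include <atomic>
#include <cstdint>

struct UnfairMutexSketch {
    std::atomic<uint32_t> lock{0};
    void* owningScheduler = nullptr;

    void* AmbientScheduler() { return nullptr; }   // stub: the current scheduler
    void  WaitOnEvent()      {}                    // stub: EventInternal wait

    static constexpr uint32_t Locks   = 0x80008000;
    static constexpr uint32_t Waiters = 0x00010001;
    static constexpr uint32_t Grab    = 0x80018001;

    void LongWait() {
        bool counted = false;   // have we incremented WaiterCount yet?
        for (;;) {
            // If the owner shares my scheduler it can't progress unless
            // I sleep, so spin once; otherwise give it a thousand tries.
            int spins = (owningScheduler == AmbientScheduler()) ? 1 : 1000;
            for (int i = 0; i < spins; ++i) {
                uint32_t cur = lock.load();
                if ((cur & Locks) == 0) {          // free: grab (and count if needed)
                    uint32_t next = cur + (counted ? Locks : Grab);
                    if (lock.compare_exchange_weak(cur, next))
                        return;                    // triumphant exit
                } else if (!counted &&
                           lock.compare_exchange_weak(cur, cur + Waiters)) {
                    counted = true;                // at least register interest
                }
            }
            if (!counted) { lock.fetch_add(Waiters); counted = true; }
            WaitOnEvent();   // yield the thread; retry upon being signalled
        }
    }
};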

To see how scheduling awareness comes into play, I’m going to walk through a scenario involving contention on such a mutex. If some of the scheduling detail makes you frown, you may want to read Scheduler stories: The myth of the waiter list.

A chronicle of contention

In this toy scenario, we have two schedulers with two active workers each. Three of the four workers will at some point during their execution try and acquire the same mutex, and one of them will try twice. Time flows from left to right, and the numbered callouts are narrated below. A red horizontal bracket represents the period where a worker owns the mutex, which may be a while after the acquisition attempt started.

The mutex acquisition tango
  1. A1 wants to acquire the mutex and, finding it uncontended, gets it straight away.
  2. B2 tries to acquire it, but since it is held by A1, it gives up after a bit of optimistic spinning, going to sleep. This gives B1 a turn on the scheduler.
  3. A1 releases the mutex, and finding that there is a waiter, signals it. This moves B2 off the mutex’s waiter list and onto scheduler B’s runnable queue, so it will be considered eligible for running at the next scheduler yield point.
  4. B1 wants the mutex, and since it isn’t taken, grabs it. Even though B2 has been waiting for a while, it wasn’t running, and it’s just tough luck that B1 gets it first.
  5. A1 wants the mutex again, but now B1 is holding it, so A1 goes to sleep, yielding to A2.
  6. B1 releases the mutex and signals the waiter A1 – note that B2 isn’t waiting on the resource anymore, but is just in a signal wait.
  7. B1 reaches the end of its quantum and politely yields the scheduler. B2 is picked as the next worker to run, and upon waking, the first thing it does is to try and grab that mutex. It succeeds.
  8. A2 reaches a yield point, and now A1 can get scheduled, starting its quantum by trying to acquire the mutex. However, B2 is still holding it, and after some angry spinning, A1 is forced to go to sleep again, yielding back to A2.
  9. B2 releases the mutex and signals the waiting A1, who will hopefully have better luck acquiring it when it wakes up again.

While this may come across as a bit complex, remember that an acquisition attempt (whether immediately successful or not) may also involve spinning on the lock bit. And this spinning manifests as “useful” work which doesn’t show up in spinlock statistics; the only thing that gets exposed is the CMEMTHREAD waiting between the moment a worker gives up and goes to sleep and the moment it is woken up. This may be followed by another bout of unsuccessful and unmeasured spinning.

All in all though, you can see that this unfair acquisition pattern keeps the protected object busy doling out its resource: in this case, a memory object providing blocks of memory. In an alternative universe, the mutex class may well have decided on its next owner at the moment that the previous owner releases it. However, this means that the allocator won’t do useful work until the chosen worker has woken up; in the meantime, the unlucky ones on less busy schedulers may have missed an opportunity to get woken up and do a successful acquire/release cycle. So while the behaviour may look unfair from the viewpoint of the longest waiter, it can turn out better for overall system throughput.

Of course, partitioning memory objects reduced the possibility of even having contention. But the fact remains: while any resources whatsoever are shared, we need to consider how they behave in contended scenarios.

Compare-and-swap trivia

Assuming we want the whole pair, as these memory allocations do, there are four atomic operations performed against the lock member:

  • Increment the waiter count: add 0x00010001
  • Increment the waiter count and grab the locks: add 0x80018001
  • Just grab the locks (after the waiter count was previously incremented): add 0x80008000
  • Release the locks and decrement the waiter count: deduct 0x80018001

For the first three, the usual multi-step pattern comes into play (sketched in code after the list):

  1. Retrieve the current value of the lock member
  2. Add the desired value to it, or abandon the operation, e.g. if we find the lock bit set and we’re not planning to spin
  3. Perform the atomic compare-and-swap (lock cmpxchg instruction) to replace the current value with the new one as long as the current value has not changed since the retrieval in step 1
  4. Repeat if not successful
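
A hedged sketch of that pattern with the Windows interlocked API, which compiles down to the lock cmpxchg instruction mentioned above; the 0x80008000 mask and the abandon rule stand in for whatever conditional logic a given operation needs:

#include <windows.h>

bool TryAtomicAdd(volatile LONG* lockMember, LONG delta, bool abandonIfLocked)
{
    for (;;) {
        LONG current = *lockMember;                        // 1. retrieve
        if (abandonIfLocked && (current & 0x80008000))     // 2. ...or abandon
            return false;
        LONG desired = current + delta;
        // 3. swap in 'desired' only if the value is still 'current'
        if (InterlockedCompareExchange(lockMember, desired, current) == current)
            return true;
        // 4. somebody changed it under us; go around again
    }
}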

The release is simpler, since we know that the lock bits are set (we own it!) and there is no conditional logic. Here the operation is simple interlocked arithmetic, but two’s complement messes with your mind a bit: the arithmetic operation is the addition of 0x7ffe7fff. Not a typo: that fourth digit is an “e”!

This all comes down to thinking of the lock bit as a sign bit we need to overflow in order to set to 1. The higher one overflows out of the 32-bit space, but the lower one overflows into the lowest bit of the first count. To demonstrate, we expect 0x80018001 to turn to zero after applying this operation:

    8001 8001
  + 7ffe 7fff
    ---------
(1) 0000 0000

Final thoughts

So you thought we’d reached the end of scheduling and bit twiddling? This may turn out to be a perfect opportunity to revisit waits, and to start exploring those memory objects themselves.

I’d like to thank Brian Gianforcaro (b | t) for feedback in helping me confirm some observations.

Context in perspective 5: Living next door TLS

Time to switch context to an old thread, specifically the SystemThread class. I have done this to death in an Unsung SQLOS post, but if you don’t want to sit through that, here are the bits we care about today:

  • All work in SQLOS is performed by good old-fashioned operating system threads.
  • SQLOS notionally extends threads with extra attributes like an associated OS event object, and behaviour like a method for waiting on that event. This bonus material lives in the SystemThread class, which is conceptually a subclass of an operating system thread, although this isn’t literally class inheritance.
  • A SystemThread instance lives in memory which is either allocated on the heap and pointed to by a thread-local storage slot, or it squats right across a range of TLS slots within the Thread Environment Block.

Due to the nature of thread-local storage, at any moment any code running on a SQLOS thread can get a reference to the SystemThread it’s running on; this is described in gory detail in Windows, Mirrors and a sense of Self. Powerful stuff, because this ambient SystemThread indirectly exposes EVERYTHING that defines the current SQL Server context, from connection settings to user identity and beyond.

Life is a box of ogres

LS through the looking glass

Understanding SQLOS takes us only so far, because it is after all just the substrate upon which SQL Server is built. We’ve now reached the point where SQLOS hands SQL Server a blank slate and says “knock yourself out”.

This blank slate is a small but delicious slice of local storage: an array of eighteen pointers living within the SystemThread. SQLOS has no interest in what these are used for and what they mean. As far as it is concerned, it’s just a bunch of bytes of no known type. But of course SQL Server cares a lot more.

Park that thought for a moment to consider that we’re starting to build up a hierarchy of thread-local storage:

  1. Upon an OS context switch, the kernel swaps the value held in the CPU’s GS register to point to the Thread Environment Block of the incoming thread.
  2. Within this Thread Environment Block live the TLS slots that SQLOS takes advantage of. One of these will always point to the currently running SystemThread instance. In other words, when the kernel does a context switch, the change of OS thread brings with it a change in the ambient SystemThread which can be retrieved from TLS.
  3. Now within this SystemThread, an array of eighteen pointer-sized slots are made available for the client application (SQL Server) to build upon.

What does this mean from the viewpoint of SQL Server? Well, even within the parts that don’t care about SQLOS and the underlying OS, code can express and find current thread-specific state – at a useful abstraction level – somewhere within those eighteen slots.
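
A hedged sketch of the lookup chain, with the slot indexes and names being illustrative assumptions rather than the real SQLOS assignments:

#include <windows.h>

struct SystemThread {
    void* localStorage[18];   // the blank slate handed to SQL Server
};

// SQLOS side: the ambient SystemThread lives in OS thread-local storage,
// which the kernel re-bases via the GS register at every context switch.
DWORD g_sysThreadTlsSlot = TlsAlloc();   // done once during SQLOS boot

SystemThread* GetAmbientSystemThread()
{
    return static_cast<SystemThread*>(TlsGetValue(g_sysThreadTlsSlot));
}

// SQL Server side: thread-specific state at a higher abstraction level is
// one more hop away, somewhere in those eighteen slots.
void* GetSqlLevelContext(int slot)
{
    return GetAmbientSystemThread()->localStorage[slot];
}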

Worker LocalStorage vs thread-local storage

We often skirt around the distinction between a worker and a thread. This is a benign simplification, because instances of Worker and SystemThread classes are bound together in a 1:1 paired relationship during all times when SQL Server code is running.

The only time the distinction becomes particularly interesting is when we’re in fiber mode, because now a single SystemThread can promiscuously service multiple Workers in a round-robin fashion. I have documented some details in The Joy of Fiber Mode, but of special interest today is the treatment of these local storage slots.

These now become a more volatile part of thread context, and a SQLOS context switch (in this case, a fiber switch) must include copying the Worker’s LocalStorage payload into the SystemThread slots, because there is no guarantee that two different Workers will share the exact same context; in fact it is close to certain they won’t.

So clearly the context that matters to the Task at hand lives within the Worker, and SQLOS makes sure that this is the context loaded into a SystemThread while it runs; it’s just that the SQLOS scheduler has to intervene more in fiber mode, whereas it can let this bit of state freewheel in thread mode.
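
In sketch form, the fiber-mode obligation amounts to this (structure names assumed):

#include <cstring>

struct LocalStorage { int slotCount; void** slots; };
struct Worker       { LocalStorage ls; };
struct SystemThread { void* localStorage[18]; };

// What a SQLOS fiber switch must include: loading the incoming Worker's
// slot payload into the SystemThread. In thread mode the 1:1 pairing
// means this state can simply freewheel.
void LoadWorkerContext(SystemThread* t, const Worker* incoming)
{
    std::memcpy(t->localStorage, incoming->ls.slots,
                incoming->ls.slotCount * sizeof(void*));
}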

On to the implementation details then. Unlike the case with the SystemThread, a Worker’s local storage isn’t a chunk of memory within the class, but lives in a heap allocation, represented by an instance of the LocalStorage class embedded within the Worker.

Additionally, while the SystemThread’s slot count is baked into SQLOS, somebody clearly wanted a bit more abstraction in the Worker class, so the slot count becomes an attribute of a LocalStorage instance, which thus consists of a grand total of two members:

  1. The slot count, passed into the LocalStorage constructor
  2. A pointer to the actual chunk of memory living somewhere on the heap and allocated by the LocalStorage constructor, which is a thin wrapper around a standard page allocator
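
That two-member shape translates into a sketch like this, with plain operator new standing in for the page allocator wrapper:

#include <cstddef>

// Hedged sketch of LocalStorage; the real class wraps a page allocator.
class LocalStorage {
public:
    explicit LocalStorage(std::size_t slotCount)
        : m_slotCount(slotCount),
          m_slots(new void*[slotCount]())   // zero-initialised slot array
    {}
    ~LocalStorage() { delete[] m_slots; }

private:
    std::size_t m_slotCount;   // member 1: the slot count
    void**      m_slots;       // member 2: pointer to the heap chunk
};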

Show me some numbers, and don’t spare the horses

On to the fun bit, but you’re going to have to take my word on two things. Firstly, the SystemThread’s local storage slots are right at the start of the object. And secondly, the LocalStorage instance lives at an offset of 0x30 within a Worker, at least on the version I’m using here.

To prepare, I needed to capture the addresses of a bound SystemThread:Worker pair before breaking into the debugger, so I started a request running in session 53, executing nothing but a long WAITFOR – this should keep things static enough while I fumble around running a DMV query in another session:

Capturing the worker and thread addresses of our dummy request

So we’re off yet again to stage a break-in within Windbg. Having done this, everything is frozen in time, and I can poke around to my heart’s content while the business users wonder what just happened to their server. No seriously, you don’t want to do this on an instance that anyone else needs at the same time.

Here is the dump of the first 64 bytes of that Worker. As in Part 4, I’m dumping in pointer format, although only some of these entries are pointers or even necessarily 64 bits wide. In case it isn’t clear, the first column is the address we’re looking at, and the second column is the contents found at that address. The dps command sweetens the deal a bit: if that payload is an address corresponding to a known symbol (e.g. the name of a function), that symbol will be displayed in a third column.

0:075> dps 0x000000003B656160 l0x8
00000000`3b656160  00000000`00000000
00000000`3b656168  00000000`00000000
00000000`3b656170  00000000`3d948170
00000000`3b656178  00000000`3a21a170
00000000`3b656180  00000000`00000001
00000000`3b656188  00000000`41080158
00000000`3b656190  00000000`00000012
00000000`3b656198  00000000`3b656c80

Those last two entries represent the LocalStorage instance, confirming that we do indeed have eighteen (=0x12) slots, and telling us the address where that slot array starts. Let’s see what payload we find there:

0:075> dps 00000000`3b656c80 l0x12
00000000`3b656c80  00000000`3875e040
00000000`3b656c88  00000000`3875e9a0
00000000`3b656c90  00000000`00000000
00000000`3b656c98  00000000`38772608
00000000`3b656ca0  00000000`38773440
00000000`3b656ca8  00000000`00000000
00000000`3b656cb0  00000000`00000000
00000000`3b656cb8  00000000`00000000
00000000`3b656cc0  00000000`00000000
00000000`3b656cc8  00000000`00000000
00000000`3b656cd0  00000000`3a8776a0
00000000`3b656cd8  00000000`00000000
00000000`3b656ce0  00000000`00000000
00000000`3b656ce8  00000000`00000000
00000000`3b656cf0  00000000`00000000
00000000`3b656cf8  00000000`00000000
00000000`3b656d00  00000000`00000000
00000000`3b656d08  00000000`00000000

It seems that only five out of the eighteen are in use. Oh well, that’s neither here nor there. Let’s compare this to the first eighteen quadwords at the bound SystemThread’s address we found in sys.dm_os_threads:

0:075> dps 0x00007FF653311508 l0x12
00007ff6`53311508  00000000`3875e040
00007ff6`53311510  00000000`3875e9a0
00007ff6`53311518  00000000`00000000
00007ff6`53311520  00000000`38772608
00007ff6`53311528  00000000`38773440
00007ff6`53311530  00000000`00000000
00007ff6`53311538  00000000`00000000
00007ff6`53311540  00000000`00000000
00007ff6`53311548  00000000`00000000
00007ff6`53311550  00000000`00000000
00007ff6`53311558  00000000`3a8776a0
00007ff6`53311560  00000000`00000000
00007ff6`53311568  00000000`00000000
00007ff6`53311570  00000000`00000000
00007ff6`53311578  00000000`00000000
00007ff6`53311580  00000000`00000000
00007ff6`53311588  00000000`00000000
00007ff6`53311590  00000000`00000000

This isn’t the same address as the Worker’s local storage array, but the contents are the same, which is consistent with my expectation. Take note of that first entry, because we’ll be visiting it later.

Just out of interest, I’m also going to try and tie things back to memory page observations made in Part 4. Let’s peek at the top of the 8k page that the LocalStorage lives on. Its address is 0x3b656c80, which rounded down to the nearest 8k (=0x2000) gives us a page starting address of 0x3b656000.

0:075> dps 0x000000003B656000
00000000`3b656000  00010002`00000000
00000000`3b656008  00000000`3b656040
00000000`3b656010  00060003`00000001
00000000`3b656018  00000000`000012c0
00000000`3b656020  00000000`00000000
00000000`3b656028  00000000`00000000
00000000`3b656030  00000000`00000001
00000000`3b656038  00000000`00000110
00000000`3b656040  00007ffd`5f95c290 sqldk!CMemObj::`vftable'

The shape of that page header looks familiar. The second quadword is a pointer to the address 0x40 bytes into this very page. And just to hand it to us on a plate, the object starting there reveals itself by its vftable symbol to be a CMemObj.

In other words, this particular bit of LocalStorage is managed by a memory object which lives on its very page – obviously it would be wasteful if a memory object refused to share its page with some of the objects it allocated memory for. Also note that this is a plain CMemObj and not a CMemThread<CMemObj>, i.e. it isn’t natively thread-safe. This may simply confirm that local storage is not intended for sharing across threads; see Dorr for more.

Cutting to the chase

So far, this is all rather abstract, and we’re just seeing numbers which may or may not be pointers, pointing to heaven knows what. Let me finish off by illuminating one trail to something we can relate to.

Out of the five local storage slots which contain something, the first one here points to 00000000`3875e040. As it turns out, this is an instance of the CCompExecCtxtBasic class, and you’ll just have to take my word for it today. Anyway, we’re hunting it for its meat, rather than for its name. Have some:

0:075> dps 3875e040
00000000`3875e040  00000000`3875ff00
00000000`3875e048  00000000`387727f0
00000000`3875e050  00000000`3875e0c0
00000000`3875e058  00000000`38772470
00000000`3875e060  00000000`38773560
00000000`3875e068  00000000`3875f310
00000000`3875e070  00000000`38773370
00000000`3875e078  00000000`38773240
00000000`3875e080  00000000`3875e0b8
00000000`3875e088  00000000`3875e330
00000000`3875e090  00000000`38773440
00000000`3875e098  00000000`00000001
00000000`3875e0a0  00000000`3875f890
00000000`3875e0a8  00000000`00000001
00000000`3875e0b0  00000000`4777e110
00000000`3875e0b8  00000000`00000000

You may recognise by the range of memory addresses we’ve seen so far that most of these are likely themselves pointers. I’ll now lead you by the hand to the fourth member of CCompExecCtxtBasic, and demonstrate what that member points to:

0:075> dps 00000000`38772470 
00000000`38772470 00007ffd`61d575f8 sqllang!CSession::`vftable'
00000000`38772478 00000000`0000000c
00000000`38772480 00000000`00000035
00000000`38772488 00000000`00000000
00000000`38772490 00000000`00000000
00000000`38772498 00000000`00000000
00000000`387724a0 00000001`00000001
00000000`387724a8 00000001`00000009
00000000`387724b0 00000000`00000000
00000000`387724b8 00000000`00000000
00000000`387724c0 00000000`00000000
00000000`387724c8 00007ffd`61d17b20 sqllang!CSession::`vftable'
00000000`387724d0 00000000`38772040
00000000`387724d8 00000000`38772240
00000000`387724e0 00000000`38772a60
00000000`387724e8 00000000`3875f570

Bam! Given away by the vftable yet again – this is a CSession instance, in other words probably the session object associated with that running thread. As per Part 3, the presence of more than one vftable indicates that we may be dealing with multiple virtual inheritance in the CSession class.

We’ll leave the last word for the third line of that dump, containing the value 0x35 as payload.

What does 0x35 translate to in decimal? Would you believe it, it’s 53, the session_id of the session associated with the thread. If we were feeling bored, we could go and edit it to be another number, essentially tinkering with the identity of that session. But today is not that day.

Putting it all together in a diagram, here then is one trail by which an arbitrary chunk of SQL Server code can find the current session, given nothing more than the contents of the GS register:

Sure, your programming language and framework may very well abstract away this kind of thing. But that doesn’t mean you aren’t relying on it all the time.

Final thoughts

This exercise would qualify as a classic case of spelunking, where we have a general sense of what we’d expect to find, perhaps egged on by a few clues. Just as I’m writing this, I see that Niels Berglund has also been writing about spelunking in Windbg. So many angles, so much interesting stuff to uncover!

The key takeaway should be a reminder of how much one can do with thread-local storage, which forms the third and often forgotten member of this trio of places to put state:

  1. Globally accessible heap storage – this covers objects which are accessible everywhere in a process, including ones where lexical scope attempts to keep them private.
  2. Function parameters and local variables – these have limited lifespans and live in registers or on the stack, but remain private to an invocation of a function unless explicitly shared.
  3. Thread-local storage – this appears global in scope, but is restricted to a single thread, and each thread gets its own version of the “same” global object. This is a great place to squirrel away the kind of state we’d associate with a session, assuming we leverage a link between sessions and threads – the little demo below shows the mechanism in miniature.
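
As a minimal illustration of that third point, in modern C++ each thread really does get its own copy of the “same” global, which is exactly what makes it a home for session-flavoured state:

#include <cstdio>
#include <thread>

thread_local int g_sessionId = 0;   // hypothetical per-thread session id

void RunAsSession(int id)
{
    g_sessionId = id;   // private to this thread, global in apparent scope
    std::printf("my session is %d\n", g_sessionId);
}

int main()
{
    std::thread a(RunAsSession, 53);
    std::thread b(RunAsSession, 54);
    a.join();
    b.join();           // each thread printed its own session id
}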

I hope that the theme of this series is starting to come together now. One or two more to go, and the next one will cover sessions and their ancillary context in a lot more breadth.

Indirection indigestion, virtual function calls and SQLOS

One of Slava Oks’s classic posts from the 2005 era is A new platform layer in SQL Server 2005 to exploit new hardware capabilities and their trends. I have occasionally revisited it as a kind of SQLOS manifesto, and some things which at first I found mystifying have become clearer over the years.

In many ways, it seems that the more recent SQLOSv2/SQLPAL work is a simple case of continuing with a project that has been organically evolving since the SQL Server 7 User Mode Scheduler, and rooted in classic Stonebraker: just how far can we assume control of core OS functions within an RDBMS process?

Scheduler stories: The myth of the waiter list

‘Tis the season to be controversial, so let’s take a stroll down memory lane to Ken Henderson’s classic Inside the SQL Server 2000 User Mode Scheduler:

The waiter list maintains a list of workers waiting on a resource. When a UMS worker requests a resource owned by another worker, it puts itself on the waiter list for the resource and enters an infinite wait state for its associated event object. When the worker that owns the resource is ready to release it, it is responsible for scanning the list of workers waiting on the resource and moving them to the runnable list, as appropriate. And when it hits a yield point, it is responsible for setting the event of the first worker on the runnable list so that the worker can run. This means that when a worker frees up a resource, it may well undertake the entirety of the task of moving those workers that were waiting on the resource from the waiter list to the runnable list and signaling one of them to run.

John Tenniel's White Rabbit from "Alice in Wonderland"

The lists behind the legend

I have not gone as far as opening up my rusty copy of SQL Server 2000 to see how Ken’s description fits in there, but I am now pretty certain that the above quote has transmuted over the years into a common misunderstanding about SQLOS scheduling mechanics.

Now nothing Ken said is untrue or particularly out of date. It is just that we often hear “the waiter list” (by implication handling resource waits) described as an attribute of a scheduler, which is not the case.

Let’s revisit when the scheduler code runs, and what it does:

  • A worker will yield, either because it needs to wait for a resource, or because it is eaten up with guilt over reaching the end of its allotted quantum.
  • The act of yielding means that scheduler code (methods on the SOS_Scheduler class) gets invoked.
  • After a bit of housekeeping for the common good of all workers sharing the scheduler, control is transferred back to a worker to do its thing – this may even be the same worker who originally yielded.
  • The housekeeping consists of checking for aborted tasks, processing pending I/Os, and checking for I/O completions and timer list timeouts.

The single most important list that a scheduler owns is the collection of runnable workers, that is, the subset of workers belonging to this scheduler who are not waiting for anything other than CPU. This has variously been described as a list and a queue; I shall be using the term “runnable queue” by convention, but be aware that it is a data structure that has changed over the years and isn’t a simple queue.

A scheduler has one piece of “creative” interaction with this runnable queue, and it comes with only two variables:

  • When a context switch is requested by an outgoing worker owning the scheduler, the scheduler code has to choose which one of potentially multiple workers is going to be its next owner.
  • The incoming worker gets given a quantum expiry date, by which time it is expected to yield.

Core scheduler code running during context switching only dequeues runnable workers, and at such moments a given scheduler only looks at its own runnable queue. In contrast, code running all over the place, including in the context of workers belonging to other schedulers, may enqueue workers on to the runnable queue.

Time for a simple diagram:

Someone to watch over me

What I’m trying to get across here is that each instance of a waitable resource has its own wait list, and the scheduler has no interest in this, because a scheduler only acts upon its runnable queue. Seen from a different angle, once a worker is waiting on a resource, its scheduler doesn’t care, because it can’t and won’t manage the waiting logic of something like a latch. This splits the responsibilities neatly in two:

  • The synchronisation class guarding a resource (which inevitably will be built upon an EventInternal) stands watch over all the workers queueing up to have a ride on that resource. The act of granting access to a worker involves moving the worker from the wait list and getting it on to the runnable queue of that worker’s scheduler, and this is achieved by the synchronisation class.
  • The scheduler, in turn, doesn’t decide who is runnable, but it does get to pick which of the runnable workers (however they reached that state) runs next.
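
As a hedged sketch of that division of labour, with all names assumed and locking elided, the split looks like this:

#include <deque>

struct SOS_Scheduler;
struct Worker { SOS_Scheduler* sched; int wakeupResult; };

struct SOS_Scheduler {
    std::deque<Worker*> runnableQueue;   // the only list the scheduler acts upon

    void Enqueue(Worker* w) {            // callable from anywhere, even from
        runnableQueue.push_back(w);      // workers on other schedulers
    }
    Worker* PickNext() {                 // only runs during a context switch
        if (runnableQueue.empty()) return nullptr;   // idle worker territory
        Worker* w = runnableQueue.front();
        runnableQueue.pop_front();
        return w;
    }
};

struct SyncResource {                    // e.g. a latch, built upon an EventInternal
    std::deque<Worker*> waitList;        // owned by the resource, not a scheduler

    void Release() {
        if (waitList.empty()) return;
        Worker* w = waitList.front();    // grant the resource to a waiter...
        waitList.pop_front();
        w->wakeupResult = 0;             // inscribe a thumbs-up wake-up value
        w->sched->Enqueue(w);            // ...and put it on ITS OWN scheduler's
    }                                    // runnable queue
};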

The I/O and timer lists

There are however two cases where the scheduler decides to make a worker runnable in the normal course of events. One is when a worker was waiting on I/O to complete, where periodic scheduler housekeeping is the mechanism by which SQLOS takes note of the I/O completion. At this point some workers who were on the I/O list may find themselves moved to the runnable queue just before the next worker is picked to be granted ownership of the scheduler – the lucky winner might be one of these workers, or it may be someone else who has been runnable for a while.

The second, and actually more interesting case, is the timer list. In its simplest use case, this is where you will find workers executing T-SQL WAITFOR statements. The list is neatly ordered by timer expiry date, and at each invocation of the scheduler context-switch housekeeping, workers whose timer expiry dates have now passed will be moved to the runnable queue.

What makes a timer list particularly interesting though, is when it implements a resource wait timeout, for instance a lock timeout. In this scenario we actually have a worker waiting on two things simultaneously: a resource and a timer. If the resource is acquired before the timer expires, all is good: the worker goes on to the runnable queue, and upon being woken up it finds a thumbs-up as the return value of its resource acquisition call.

However, should the timer expire before the resource has been acquired, the scheduler will actually venture forth and take the worker off that waiter list before making it runnable and setting an error return value as wake-up call. Think of it as every teenager’s worst nightmare: you’re not home by curfew, so Mom comes to your dodgy party to drag your sorry ass home. And then you wake up with a hangover and a note stuck to your forehead reading “No cake for you”.

Whither next?

I tried to keep this comparatively high-level, but might take a nice little detour into the WorkerTimerRequest some day if time permits.

There you have it. Be home on time and have a thread-safe festive season.

Scheduler stories: Going Preemptive

SQLOS is built upon the idea of cooperative, AKA non-preemptive, scheduling: out of any given collection of threads belonging to a scheduler, only one will own the scheduler at a given moment. To the degree that these cooperative threads represent the only work done by the underlying CPU, this means that the thread owning the scheduler really owns the CPU. Of course, a CPU will occasionally get side-tracked into doing other work, so the SQLOS scheduler as “virtual CPU” only represents a chunk of the real CPU, but we live in the expectation that this is a suitably large chunk.

John Tenniel's White Rabbit from "Alice in Wonderland"

It is understandable when such competition for the CPU comes from outside of SQL Server, but it can also be instigated from within SQL Server itself. This is the world of preemptive – or if you prefer, antisocial – scheduling.

Scheduler Stories: When does your scheduler run?

I am planning to burn a fair number of cycles on SQLOS scheduling internals for the foreseeable future, and with some luck, this turns into an interesting series. OS scheduling is already a subject that belongs “on the other side of the looking glass”, and this only gets more interesting when we look at user-mode SOS_Scheduler scheduling built on top of it.

If I don’t specifically mention a version, my frame of reference is SQL Server 2014. Yes, things changed since then, but the 2012-2014 scheduler is a good starting point, and the fundamental mechanisms I’ll initially cover have changed very little since the User Mode Scheduler (UMS) of SQL Server 7.0.