Scheduler Stories: Interacting with the Windows scheduler

In the previous post, The joy of fiber mode, we saw how a fiber mode scheduler firmly controls which worker runs on a thread at a given moment. While it can’t positively ensure that the thread in question remains running all the time, the soul of the scheduler lives in that one thread, and as long as the thread runs, the scheduler gets invoked by its team of fiber workers, dispatching them in an appropriate order.

John Tenniel's White Rabbit as herald

Two things can interfere with our monopolising of a CPU:

Hardware interrupts.
The scheduling of a different thread on the CPU.

Hardware interrupts, which run in kernel mode and return to user mode quickly, should be nothing more than tiny hiccups in a running thread’s quantum. The other 90% of the interrupt iceberg is also handled in kernel mode, as Deferred Procedure Calls (DPCs, or “bottom halves” to the Linux crowd), but should still only steal small change in terms of CPU cycles. Context switches to another thread are a completely different story, because it could be ages before control returns to our thread, meaning that our fiber scheduler is completely out of commission for a while.

This possibility – a SQLOS scheduler losing the CPU for an extended period – is just one of those things we need to live with, but on a sane server, it shouldn’t be something to be too concerned about. Consider that this happens all the time in virtualised environments, where our vCPU can essentially cease to exist while another VM has a ride on the physical CPU. Thinking it through along those lines, you can see a simple virtualisation hierarchy:

A physical CPU
Optionally, a vCPU, with access mediated by a hypervisor
A chance for a SQLOS thread to run on the (v)CPU, mediated by the underlying operating system
A chance for our worker to own the SQLOS scheduler, mediated by that scheduler and the commonwealth of sibling workers

Still, the best we can do is to stick to a simple mental framework: pretend that our cooperative scheduler does own the CPU 100% of the time, and focus on how it transfers control between its workers, rather than dwelling on the times it loses the CPU entirely.

Windows Events

This passing of control between workers on a scheduler is straightforward in fiber mode, but it’s time to deal with reality and look at how SQLOS switches between workers in thread mode.

Quick review. A Windows Event is a kernel object that can be in one of two states: signalled and non-signalled. I actually find the name “event” slightly confusing, because in natural language only the transition from non-signalled to signalled, i.e. becoming or getting signalled, fits our SQLOS scenario. Another slight bit of confusion is that all kernel objects (everything under the auspices of the Windows Object Manager) implement this signalled/non-signalled state – an Event just builds on top of that.

Keep in mind that this is something totally distinct from the SQLOS EventInternal, although their programming interfaces follow the same pattern. To interact with an Event, we can do two things:

Signal it – this is a method call that sets it to the signalled state
Wait on it – this is a method call that returns immediately if the event is signalled, or some time later when it becomes signalled

Note the emphasis on how the caller experiences waiting: it is just a method call that will return like all method calls. However, if the event isn’t signalled when we call a wait method, the kernel dispatcher is going to have to make a judgement and find something else for the CPU to do in the meantime. So if you knew for a fact that the event you’re about to wait on isn’t currently signalled, calling e.g. WaitForSingleObject() on it translates to “Dear Dispatcher, please switch to another thread, because I have nothing useful to do until some condition becomes true”.
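To make the "it is just a method call" point concrete, here is a small Python sketch. It is an analogy only: threading.Event stands in for a Windows Event (it behaves like a manual-reset one and is not a kernel object), and the function name demo_wait is my own invention.

```python
import threading
import time

def demo_wait():
    """Show the three ways a wait call can come back to its caller."""
    results = {}

    # Case 1: the event is already signalled, so the wait returns at once.
    already_set = threading.Event()
    already_set.set()
    results["immediate"] = already_set.wait(timeout=1.0)

    # Case 2: the event is not signalled; the caller is parked until
    # another thread signals it.
    pending = threading.Event()

    def signaller():
        time.sleep(0.05)   # pretend to do some work first
        pending.set()      # "signal" the event

    t = threading.Thread(target=signaller)
    start = time.monotonic()
    t.start()
    results["eventual"] = pending.wait(timeout=5.0)
    results["waited_seconds"] = time.monotonic() - start
    t.join()

    # Case 3: nobody ever signals, so a timed wait gives up.
    never = threading.Event()
    results["timed_out"] = not never.wait(timeout=0.01)
    return results
```

In all three cases the caller simply sees a function returning. What differs is how long the thread was off the CPU in the meantime, which is exactly the part the dispatcher decides.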

Wake me up before you go-go

The canonical description of the SQLOS worker-switching mechanism in thread mode is still Ken Henderson’s classic Inside the SQL Server 2000 User Mode Scheduler. The explanation below expands on its How UMS takes over scheduling section, referencing the SQL Server 2014 state of affairs.

Every SQLOS worker thread comes with an associated SystemThread object. One could view the SystemThread as a wrapper for the OS thread, although strictly speaking it is just an object instance (i.e. a bunch of member variables that share related instance methods) squirreled away in the thread’s thread-local storage. Either way, any code running within SQL Server can refer to the ambient SystemThread whenever needed, though this tends to be buried a layer or two below the code you’d recognise as e.g. storage engine code, and the interesting interactions are mediated through SOS_Scheduler methods.

One of SystemThread’s member variables is a handle to an auto-reset kernel event, created as part of SystemThread construction, which in turn happens within the thread attach callback upon creation of a new thread. This per-thread event object is at the root of SQLOS’s ability to orchestrate worker switches in thread mode.

Let’s assume that we are in the steady state where a bunch of threads belong to a scheduler, and all but one are asleep. The sleeping ones will be stuck in an infinite-timeout wait on their own event objects, which will only get signalled as part of SQLOS scheduling activity. At some point – due to either blocking or quantum exhaustion – the running worker will need to hand over the baton to another, i.e. put itself to sleep and wake up the chosen successor. This normal handover is handled by calling SignalObjectAndWait(), signalling the incoming worker’s event and waiting on its own event.

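SignalObjectAndWait() bundles the signal and the wait into a single call, i.e. it’s impossible for the wait to happen without the signal, a situation which would render the scheduler frozen. Likewise, the signal won’t happen without the wait – if it did, we’d have two threads acting as if they are the current owners of the scheduler, and even if they didn’t literally run at the same time, they’d corrupt the state of the scheduler. Strictly speaking, the Windows documentation does not guarantee that the two steps form one atomic operation: a thread on another processor can observe the signal before the caller has begun its wait. What matters here is that the caller is committed to both steps.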

Now what we’d like to happen is for the one thread to pause and for the other to start up, much as with the handover in fiber mode. In practice the handover ritual comes down to the outgoing thread telling Windows “I’m going to sleep now, and by the way, signal this other thing which I know will make my sibling thread runnable”. It is still up to the kernel dispatcher to make the decision to switch to that thread, and it may very well schedule some other work before that one gets a turn. However, most of the time the dispatcher has no realistic choice other than to schedule the thread picked, so even though the SQLOS scheduler isn’t really in charge, it is a pretty compelling presence behind the throne. Talk about state capture…
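The baton-passing pattern can be simulated in a few lines of Python. This is a sketch under loud assumptions: AutoResetEvent, signal_and_wait and run_handover are hypothetical names of my own, the "workers" are plain Python threads, and the signal and the wait are two separate user-mode steps rather than one SignalObjectAndWait() call. Because each event latches its signal until consumed, a wakeup that arrives before the wait begins is not lost.

```python
import threading

class AutoResetEvent:
    """User-mode imitation of an auto-reset event: one signal wakes one waiter."""
    def __init__(self):
        self._cond = threading.Condition()
        self._signalled = False

    def set(self):
        with self._cond:
            self._signalled = True
            self._cond.notify()

    def wait(self):
        with self._cond:
            while not self._signalled:
                self._cond.wait()
            self._signalled = False   # auto-reset: consume the signal

def signal_and_wait(to_signal, to_wait):
    """Wake the successor, then sleep on our own event (two steps here)."""
    to_signal.set()
    to_wait.wait()

def run_handover(worker_count=3, rounds=2):
    events = [AutoResetEvent() for _ in range(worker_count)]
    log = []   # order in which workers "owned the scheduler"

    def worker(index):
        own = events[index]
        successor = events[(index + 1) % worker_count]
        own.wait()                        # sleep until handed the baton
        for turn in range(rounds):
            log.append(index)             # do a slice of work
            if turn == rounds - 1:
                successor.set()           # final turn: signal only, then exit
            else:
                signal_and_wait(successor, own)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(worker_count)]
    for t in threads:
        t.start()
    events[0].set()                       # kick things off with worker 0
    for t in threads:
        t.join()
    return log
```

With three workers and two rounds, run_handover() returns [0, 1, 2, 0, 1, 2]: only one worker is ever awake, and each one chooses its successor, even though the operating system does the actual thread switching.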

More events, more signalling, more fun

It is worth picking up on the earlier observation that all kernel objects implement signalled/unsignalled state. One such object is a Windows thread object itself, which spends its whole life unsignalled, only becoming signalled when it exits. It is possible to wait on a thread, i.e. to wait for it to die, although that doesn’t come into play here. I am highlighting this because when folks say in SQLOS context that “the thread gets signalled” it is shorthand for “the event belonging to the SystemThread within that thread’s TLS gets signalled”.

There is another important event that can come into play. Each SOS_Scheduler itself has an associated kernel event, created during scheduler initialisation. This is used to implement an idle sleep for the scheduler itself, when no workers on the scheduler are runnable. Once a scheduler deems itself eligible to go to sleep, the current worker (which I believe to be the special idle worker, a Worker instance embedded within the SOS_Scheduler) calls WaitForSingleObject() not on its own event, but on the scheduler’s event. This is the one case where it requires an outside agent or a wait timeout to get any worker on the scheduler back into action again.

Finally, there are other events used within SQLOS. One such (okay, the only one whose existence I’m certain of) is the boot event. During the course of SQLOS booting, threads are spun up, but having performed basic initialisation, they then wait until all boot activity is complete. This is done by calling WaitForSingleObject() on the global boot event, which eventually gets signalled towards the end of the boot process. This is one of those cases where the name “event” is in fact a perfect fit in plain English.
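The boot event pattern is easy to picture with a manual-reset event, which stays signalled once set and so releases every waiter. The sketch below uses Python's threading.Event as a stand-in; demo_boot_event and the timeline entries are illustrative names of my own and not anything found in SQLOS.

```python
import threading

def demo_boot_event(thread_count=4):
    boot_event = threading.Event()   # manual-reset: stays signalled once set
    timeline = []
    timeline_lock = threading.Lock()

    def record(entry):
        with timeline_lock:
            timeline.append(entry)

    def thread_body(index):
        record(("initialised", index))   # basic per-thread initialisation
        boot_event.wait()                # park until boot is complete
        record(("running", index))

    threads = [threading.Thread(target=thread_body, args=(i,))
               for i in range(thread_count)]
    for t in threads:
        t.start()

    record(("boot complete", None))
    boot_event.set()                     # one signal releases every waiter
    for t in threads:
        t.join()
    return timeline
```

However the threads happen to interleave, every "running" entry in the returned timeline lands after "boot complete". Contrast this with the per-thread events used for handover, where one signal is meant to wake exactly one sleeper.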

Premature wakeup

If you’re a defensive coder, you may wonder about the possibility for chaos caused by a sleeping thread’s associated event accidentally getting signalled – after all, any code within SQL Server can find the event handle and signal it. Well, someone did consider that, because a little paranoia never hurt a systems programmer.

When a thread goes to sleep, the SignalObjectAndWait() or WaitForSingleObject() call will generally reside in SystemThread::Suspend() or SOS_Scheduler::Switch(). These methods should only return when it’s fairly certain that the signalling was intentional and that the wakeup wasn’t due to an Asynchronous Procedure Call being enqueued to the thread – APCs don’t belong in cooperative SQLOS scheduling.

In support of a SQLOS threading convention, the SystemThread class has a member variable containing the address of the last SystemThread to have signalled it, i.e. which sibling thread woke it up. This will always get set to zero by a thread before it goes to sleep, and the signalling thread sets it to its own SystemThread address just before signalling.

Now should the wait call return but we find a zero in this member variable, clearly something is wrong. The response is to write an error log entry (“SysThread 0x%p woken up prematurely” with the SystemThread address) and to go straight back to WaitForSingleObject() on that event. This helps ensure that some edge case bugs get recognised, without allowing the cooperative scheduling rhythm to get broken.
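Here is how I picture that convention, sketched in Python. Everything about it is an assumption for illustration: SystemThreadSketch, signal_from and suspend are made-up names, None plays the role of the zeroed member variable, and a list stands in for the error log. Only the message text is borrowed from the real thing.

```python
import threading
import time

class SystemThreadSketch:
    """Hypothetical stand-in: an event plus a 'last signaller' slot."""
    def __init__(self, name):
        self.name = name
        self.event = threading.Event()
        self.last_signaller = None      # None plays the role of zero
        self.error_log = []

    def signal_from(self, sender):
        # A legitimate waker identifies itself before signalling.
        self.last_signaller = sender
        self.event.set()

    def suspend(self):
        self.last_signaller = None      # cleared before going to sleep
        while True:
            self.event.wait()
            self.event.clear()          # imitate auto-reset behaviour
            if self.last_signaller is not None:
                return self.last_signaller
            # Woken with nobody claiming responsibility: log, sleep again.
            self.error_log.append(
                f"SysThread {self.name} woken up prematurely")

def demo_premature_wakeup():
    sleeper = SystemThreadSketch("0xSLEEPER")
    result = {}

    def body():
        result["woken_by"] = sleeper.suspend()

    t = threading.Thread(target=body)
    t.start()
    sleeper.event.set()                 # rogue signal, no signaller recorded
    deadline = time.monotonic() + 5.0
    while not sleeper.error_log and time.monotonic() < deadline:
        time.sleep(0.001)               # wait for the complaint to be logged
    sleeper.signal_from("0xWAKER")      # now the legitimate handover
    t.join(timeout=5.0)
    return result.get("woken_by"), list(sleeper.error_log)
```

The rogue signal produces one log entry and the sleeper goes straight back to waiting. Only the identified signal lets suspend() return.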

Dynamic priority boosting and an odd trace flag

Generally speaking, it is just stupid to play around with process and thread priorities in the hope of winning at the scheduling game. However, the existence of an unusual trace flag hints at a response to priority-related issues.

As part of the complex logic around the kernel dispatcher’s scheduling choices, threads can temporarily receive dynamic priority boosting by Windows itself. Of the three cases listed in the Windows documentation on priority boosts, only one is relevant to a server process: a thread becoming runnable due to a wait condition being satisfied. One can imagine a system where many threads unrelated to SQL Server are running, managing to starve SQLOS threads by dint of having their base priority set too high, and dynamic priority boosting can help alleviate thread starvation here.

The logwriter task must be peculiarly sensitive to thread starvation, especially in the pre-2016 single-logwriter case. Just consider a whole server brought to its knees due to a bunch of workers waiting for log flushes that don’t happen due to a starved logwriter. So there is this start-up trace flag, TF 8064, which does one thing for the logwriter thread when set, and only in specific circumstances:

If the SQL Server process has dynamic priority boosting disabled…
AND the logwriter thread which has just been spun up has dynamic priority boosting disabled…
THEN enable dynamic priority boosting for the logwriter thread
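Expressed as a decision function, the rule above looks something like the Python sketch below. The function and parameter names are mine, and this only models the condition as described, not SQL Server's actual code. On Windows, the two "disabled" inputs would presumably come from something like GetProcessPriorityBoost() and GetThreadPriorityBoost(), with SetThreadPriorityBoost() doing the enabling, but that is my guess rather than something I have confirmed.

```python
def logwriter_boost_action(trace_flag_8064,
                           process_boost_disabled,
                           thread_boost_disabled):
    """Return True if boosting should be switched on for the logwriter thread."""
    if not trace_flag_8064:
        return False                      # without the flag, nothing changes
    return process_boost_disabled and thread_boost_disabled

def truth_table():
    """Enumerate every input combination alongside the resulting action."""
    rows = []
    for tf in (False, True):
        for proc in (False, True):
            for thread in (False, True):
                rows.append(((tf, proc, thread),
                             logwriter_boost_action(tf, proc, thread)))
    return rows
```

Of the eight combinations, only one leads to any action: trace flag on, process boosting disabled, thread boosting disabled.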

Your guess is as good as mine here, but I think this one may have been designed to provide some relief where SQL Server is deployed on a system that is in trouble, when it may improve throughput slightly under certain circumstances. The fact that this has been around for a while and I’ve never heard of it before suggests that it is very rare that anybody gets in the kind of hole where they’re advised to use it, or if they do, they are embarrassed to talk about it.

If nothing else, it is interesting to consider this as an example of abstraction leakage. For the most part, low-level concerns about OS interaction belong within SQLOS, and consuming code should be oblivious to the underlying OS. However, here we have scheduler-related logic within storage engine code (sqlmin) that has no choice but to acknowledge the substrate and talk to it.

Non-SQLOS waiting

There is one common case that I have previously covered, where a cooperatively scheduled SQLOS thread yields to Windows, but not through an SOS_Scheduler method and neat handover to a sibling thread: when a spinlock goes to sleep in between exponentially increasing bouts of spin waiting.

Here the idea is to give some CPU time to another thread which the kernel dispatcher deems runnable: this may turn out to be a non-SQL Server thread or even (within affinity constraints) a worker from another scheduler which is being starved of CPU. However, by definition it will never be a cooperatively scheduled worker from the current SOS_Scheduler: by calling Sleep() or SwitchToThread(), we are effectively putting the ambient scheduler to sleep, and the only thing that will allow any of its workers to progress will be the acquisition of the contended spinlock.
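As a rough picture of spin-then-yield with exponential backoff, consider the Python sketch below. It is not SQL Server's spinlock: acquire_with_backoff, the initial and maximum spin counts, and the injectable yield_cpu hook are all assumptions of mine, with time.sleep(0) standing in for a Sleep() or SwitchToThread() call.

```python
import threading
import time

def acquire_with_backoff(lock, yield_cpu=lambda: time.sleep(0),
                         initial_spins=4, max_spins=1024):
    """Spin on a lock, yielding to the OS between ever longer spin bouts.

    Returns the number of times we gave up the CPU before succeeding.
    """
    spins = initial_spins
    yields = 0
    while True:
        for _ in range(spins):
            if lock.acquire(blocking=False):
                return yields
        yield_cpu()                       # nothing on our scheduler runs now
        yields += 1
        spins = min(spins * 2, max_spins)  # exponential backoff, capped

def demo_backoff():
    lock = threading.Lock()
    lock.acquire()                        # someone else holds the spinlock
    calls = []

    def fake_yield():
        # Deterministic stand-in for the OS: the holder releases the lock
        # during our third nap.
        calls.append(len(calls))
        if len(calls) == 3:
            lock.release()

    return acquire_with_backoff(lock, yield_cpu=fake_yield)
```

In demo_backoff() the lock comes free during the third yield, so the function returns 3. Making the yield injectable keeps the example deterministic. With the default hook, how long we stay off the CPU is entirely up to the operating system.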

There are also a handful of cases internal to SQLOS thread dispatching where a thread will sleep (as a Windows sleep, not a SQLOS yield) while waiting for action to be performed by another thread, but this isn’t textbook SQLOS that happens all the time.

Going preemptive

Finally, we ought to give some consideration to what happens when a worker within the cooperative scheduling environment goes preemptive. However, this is a big enough subject to warrant a whole post, so that is exactly what I intend doing.
