Scheduler stories: Going Preemptive

SQLOS is built upon the idea of cooperative, AKA non-preemptive, scheduling: out of any given collection of threads belonging to a scheduler, only one will own the scheduler at a given moment. To the degree that these cooperative threads represent the only work done by the underlying CPU, this means that the thread owning the scheduler really owns the CPU. Of course, a CPU will occasionally get side-tracked into doing other work, so the SQLOS scheduler as “virtual CPU” only represents a chunk of the real CPU, but we live in the expectation that this is a suitably large chunk.

John Tenniel's White Rabbit from "Alice in Wonderland"

It is understandable when such competition for the CPU comes from outside of SQL Server, but it can also be instigated from within SQL Server itself. This is the world of preemptive – or if you prefer, antisocial – scheduling.

Today I shall be exploring preemptive scheduling within SQLOS threads, including a dive into xp_cmdshell as an example case.

TL;DR, let’s have it in rhyme

Though rather irregular,
sometimes you need
to abandon the scheduler
ere you proceed

into places one should
only venture alone,
lest Medusa turn ALL
of your friends into stone.

(from The Thread in the Head)

Cooperative scheduling is a relay race: you simply don’t stop without passing over the baton. If you write code which reaches a point where it may have to wait to acquire a resource, this waiting behaviour must be implemented by registering your desire with the resource, and then passing over control to a sibling worker. Once the resource becomes available, it or its proxy lets the scheduler know that you aren’t waiting anymore, and in due course a sibling worker (as the outgoing bearer of the scheduler’s soul) will hand the baton back to you.

This is complicated stuff, and not something that just happens by accident. The textbook scenario for such cooperative waiting is the traditional storage engine’s asynchronous disk I/O behaviour, mediated by page latches. Notionally, if a page isn’t in buffer cache, you want to call some form of Read() method on a database file, a method which only returns once the page has been read from disk. The issue is that other useful work could be getting done during this wait.

Of course, the storage engine, working on top of SQLOS scheduling primitives, implements the wait by transferring control to a different runnable worker. I/O processing housekeeping by the scheduler eventually notes completion of the I/O, and marks the waiting worker as runnable. Further down the line – in accordance with rules of scheduling priority, fairness, and resource governance – the worker will be given ownership of the scheduler again, at which point it can consume the page it was waiting for.
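To make the relay-race mechanics concrete, here is a toy Python simulation of cooperative waiting. None of these names come from SQLOS: generators stand in for workers, a deque for the runnable queue, and a plain dictionary for waiter lists.

```python
from collections import deque

class Scheduler:
    """Toy cooperative scheduler: one worker holds the baton at a time."""
    def __init__(self):
        self.runnable = deque()   # runnable queue
        self.waiters = {}         # resource -> list of waiting workers

    def add(self, worker):
        self.runnable.append(worker)

    def signal(self, resource):
        # the resource (or its proxy) tells the scheduler the wait is over
        self.runnable.extend(self.waiters.pop(resource, []))

    def run(self):
        while self.runnable:
            worker = self.runnable.popleft()   # this worker owns the scheduler
            try:
                waited_on = worker.send(None)  # run until it yields
            except StopIteration:
                continue                       # worker finished its task
            if waited_on is None:
                self.runnable.append(worker)   # plain yield: back of the queue
            else:                              # register as a waiter
                self.waiters.setdefault(waited_on, []).append(worker)

log = []

def reader():
    log.append("reader: page not in cache, registering wait")
    yield "page-latch"                         # pass the baton, wait on the latch
    log.append("reader: signalled, consuming page")

def io_completion(sched):
    log.append("io: read complete, marking waiter runnable")
    sched.signal("page-latch")
    yield                                      # ordinary cooperative yield

sched = Scheduler()
sched.add(reader())
sched.add(io_completion(sched))
sched.run()
```

The reader never spins or blocks: it registers its interest, hands over the baton, and only runs again once the scheduler has been told the wait is over.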

“Going preemptive” is a different, and much simpler, approach to the general problem of a thread needing to call an external function which may involve waiting – the emphasis being on “external”, i.e. code which can’t be expected to play by SQLOS rules. Here the finicky waiting logic is not implemented within SQLOS. Instead, before the external code is invoked, a sibling worker is signalled and given ownership of the scheduler while the current thread continues to run. Assuming that both threads will tend to remain on the same physical CPU, this clearly goes against the spirit of cooperative scheduling, because now the two are knowingly competing for CPU cycles.

In mitigation, the typical use case for a thread going preemptive involves a high probability of waiting around, i.e. not actually burning CPU cycles, so the degree of competition from the “preemptive” thread may be minimal.

So why did I just put “preemptive” in quotation marks? Well, if you think about it, neither thread is more preemptive than the other, and both are equally at the mercy of the underlying OS’s scheduling algorithms. The only difference is that the “preemptive” one thunders on without any concern for cooperative behaviour, whereas the “cooperative” one is playing by cooperative rules, but they both live under the threat of being preemptively scheduled off the CPU. This threat never goes away completely, but the likelihood of preemption for a “cooperative” SQLOS thread (i.e. preemption of the SOS_Scheduler itself) increases when there are more viable “uncooperative” threads out there, whether SQLOS siblings or threads from other processes.

From the viewpoint of the broad community of SQLOS workers, there is a clear case for a thread going preemptive when calling non-SQLOS-aware external code. Failing to do so would mean that none of the sibling threads on your scheduler would be running until your wait is fulfilled. Which is pretty darn uncooperative behaviour.

Implementation

Take xp_cmdshell,
committee of one,
the powerful, petulant,
prodigal son.

Packs his bags,
singing “PREEMPTIVE_OS_PIPEOPS”.
He won’t say where he goes
You can’t tell when he stops.

But he picks a successor
as he steps out the gate.
Though he calls SignalObject(),
he refuses to wait.

The act of going preemptive really is that simple. Having identified the next runnable worker in line (using standard logic), we call SignalObject() on its SystemThread’s associated event and go through the usual rituals to set it up as the current owner of the scheduler.

The ambient thread (the one going preemptive) on the other hand does something odd. It tells SQLOS that it is waiting, which isn’t strictly true: for the immediately foreseeable future, it will remain running or runnable, i.e. schedulable by the OS, but it won’t have the vocabulary to tell SQLOS when it will go to sleep and what it will be waiting on.

All this adds up to some unusual quantum accounting. At the point that the thread goes preemptive, its cooperative quantum ends, exactly as if it yielded cooperatively, and no accounting is done for CPU usage until it reverts to cooperative scheduling. We can easily see the time spent off-scheduler, since this surfaces in the preemptive “wait” of a type self-declared by the thread, but SQLOS doesn’t know how much of this wall clock time was spent burning CPU. (This being the point where Lonny Niederstadt comes up with a brilliant model to square observed SQLOS accounting with Perfmon measurements!)

Playing nice again

His sister awakens,
chairperson du jour,
and the merry-go-round
just goes on as before

while our preemptive pal
feeds on alms and on dribblings
and acts as a tax
on his scheduler siblings.

Dark deeds done, he returns
from the dank and the cold.
And gibbering, slithers back
into the fold.

He will signal no sibling.
Just one last thing to do:
he’ll add himself onto
the runnable queue.

As he sinks into sleep
there’s a smile with the snore,
In the fullness of time
he’ll get signalled once more.

While the thread continues in preemptive mode it is neither on any waiter list nor on the runnable queue. If it goes off and waits for something which never comes to pass, or somehow manages to die (nigh impossible, given that it is wrapped up in structured exception handling within SQLOS code) it won’t directly affect its scheduler or sibling threads.

Eventually though, the work that needed to be done in preemptive mode will be complete, and the thread can return to the cooperative scheduling game. At that time a sibling thread is probably running and owning the scheduler, so the ex-preemptive thread can’t just march in and start ordering siblings around. As such, it declares the end of its preemptive “wait”, adds itself to the runnable queue, and calls WaitForSingleObject() on its SystemThread’s associated event.
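The round trip can be sketched with ordinary Python threads, where a `threading.Event` plays the role of the SystemThread's associated event. Everything else here, names included, is my own scaffolding rather than SQLOS.

```python
import threading
import time
from collections import deque

class Worker:
    """Each worker owns an event, standing in for the SystemThread's event."""
    def __init__(self, name):
        self.name = name
        self.event = threading.Event()

runnable = deque()        # toy runnable queue
log = []
lock = threading.Lock()   # keep log appends tidy across threads

def trace(msg):
    with lock:
        log.append(msg)

def go_preemptive(sibling):
    # hand the scheduler to the next runnable worker, then keep running
    trace("preemptive: signalling sibling, carrying on")
    sibling.event.set()                      # stands in for SignalObject()

def back_to_cooperative(me):
    # no sibling is signalled; just queue up and truly go to sleep
    runnable.append(me)
    trace("preemptive: queued on runnable queue, sleeping")
    me.event.wait()                          # stands in for WaitForSingleObject()
    trace("ex-preemptive: woken cooperatively")

me, sib = Worker("ex-preemptive"), Worker("sibling")

def sibling_loop():
    sib.event.wait()                         # sleeps until handed the scheduler
    trace("sibling: now owns the scheduler")
    while not runnable:                      # regular scheduling eventually
        time.sleep(0.001)                    # finds the returning worker
    runnable.popleft().event.set()           # and wakes it in due course

t = threading.Thread(target=sibling_loop)
t.start()
go_preemptive(sib)    # ... external, non-SQLOS-aware work happens here ...
back_to_cooperative(me)
t.join()
```

Note the asymmetry: going preemptive signals a sibling and carries on running; coming back signals nobody, and simply queues up and sleeps until regular scheduling gets around to it.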

Regular cooperative scheduling logic will wake it up in due course to continue its task.

Case study: xp_cmdshell

I chose this because it covers some really interesting territory. For starters, it is a piece of code that truly does the unknown. But to make it more exciting, it spawns a completely new process containing threads which have absolutely no chance of interacting with the SQL Server process except by competing with it as an isolated application and possibly being a regular SQL Server client. And to top it all, the implementation is very Windows-specific, so it makes a nice case for asking ourselves how the Linux version (if any) would plumb in.

xp_cmdshell falls in a category of system procedures which are exposed to the execution engine through the same interfaces used by T-SQL stored procedures, but essentially represent your chance to call C++ functions, wrapped within the CSpecProc class, from T-SQL. I’m not sure if old-fashioned user-supplied extended stored procedures sit in the same bucket, but I imagine they do. Either way, today we are dealing with a specific one shipped with SQL Server and compiled into sqllang.

As a broad outline, and omitting error handling, security checks, and preemptive/cooperative switches, here is the structure of the method, which of course takes a string parameter representing the command you want to execute:

create Read and Write pipe objects
set up a STARTUPINFO for process creation
{
  StdInput comes from our read pipe (unused)
  StdOutput and StdError go to our write pipe
}
if using proxy account
{
  get saved credentials
  create process using CreateProcessAsUserW
}
else
{
  create process using CreateProcessW
}
close handle to child process
create a BackupResultSet for returning output
read the piped output line by line until process complete
{
  if no_output wasn't specified
  {
    emit text line as result set row
  }
}
close remaining handles

Summarised:

  1. Spawn a process running the requested command line as argument to cmd.exe with the /c switch, with output redirected to a pipe
  2. Parse the pipe’s output into a result set, line by line
  3. Once the pipe runs dry, we’re done

In other words, in terms of control flow and “knowing when done”, internally it is rather like a client calling a stored procedure where it may or may not care about the result set, but the act of generating that result set could have really important side effects.
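That client-calling-a-procedure shape translates into a hedged Python analogue, using `subprocess` in place of the Win32 pipe and process plumbing (the function name `cmdshell` and its parameters are mine, purely for illustration):

```python
import subprocess

def cmdshell(command, no_output=False):
    """Rough shape of the loop outlined above: spawn the command with stdout
    and stderr redirected to a pipe, and turn each output line into a 'row'."""
    rows = []
    proc = subprocess.Popen(command, shell=True,       # cmd.exe /c or /bin/sh -c
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,  # both streams to one pipe
                            text=True)
    for line in proc.stdout:        # the blocking ReadFile() equivalent
        if not no_output:
            rows.append(line.rstrip("\n"))
    proc.wait()                     # the pipe ran dry: the child is done
    return rows
```

With `shell=True`, Python itself hands the command line to `cmd.exe /c` on Windows or `/bin/sh -c` elsewhere, which mirrors the outline rather neatly; `no_output=True` corresponds to draining the pipe without emitting rows.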

Bridging the cooperative/preemptive divide

The bulk of the method is wrapped in an AutoSwitchPreemptive scope; this pattern is reminiscent of C# code wrapped in a using block, where cleanup happens automatically when the lexical scope ends at the end of the block. It does rather what you’d expect, putting the enclosed code into preemptively scheduled mode, and reverting to cooperative mode at the end.

You can also view it from another angle:

  • At the start of the block, a sibling thread is woken up to compete with the ambient thread, and given ownership of the SQLOS scheduler.
  • At the end of the block, the ambient thread goes to sleep, only to be awoken cooperatively by a sibling thread in due course.

Interestingly though, there are sections within the method which temporarily return to cooperative scheduling, and these are enclosed within AutoSwitchNonPreemptive blocks. The total effect is that the method is a preemptive island in a cooperative sea, but the island itself contains a few lakes of cooperative scheduling. This is all terribly polite, because well-behaved code doesn’t wear the uncooperative hat any longer than necessary.
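The using-block analogy translates directly into a Python context manager. This is purely illustrative: `auto_switch` and the mode stack are my own toy constructs, tracking nothing more than which mode we are notionally in.

```python
from contextlib import contextmanager

mode = ["COOPERATIVE"]   # current scheduling mode is mode[-1]
trace = []

@contextmanager
def auto_switch(new_mode):
    """Toy analogue of AutoSwitchPreemptive / AutoSwitchNonPreemptive:
    switch on entry, switch back automatically when the scope ends."""
    mode.append(new_mode)
    trace.append("-> " + new_mode)
    try:
        yield
    finally:
        mode.pop()
        trace.append("<- back to " + mode[-1])

# the xp_cmdshell shape: a preemptive island with cooperative lakes
with auto_switch("PREEMPTIVE"):
    with auto_switch("COOPERATIVE"):
        pass                       # final access check
    # create process, read a line from the pipe ...
    with auto_switch("COOPERATIVE"):
        pass                       # construct a result set row
```

Running this leaves a trace showing each lake returning to the island, and the island finally returning to the cooperative sea.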

The structure of these blocks (lakes of COOPERATIVE on the PREEMPTIVE island) is as follows:

PREEMPTIVE 
{
  prepare for process creation
  COOPERATIVE 
  {
    do final access check
  } // COOPERATIVE
  create process
  do until process completed
  {
    read a line of output from pipe
    if no_output wasn't specified
    {
      COOPERATIVE 
      {
        construct result set row
      } // COOPERATIVE
    }
  }
  if no_output wasn't specified
  {
    COOPERATIVE
    {
      close off the result set
    } // COOPERATIVE
  }  
  close handles
} // PREEMPTIVE

What can we clearly tell from this? For starters, the construction of result sets is internal code that we know to play well within the cooperative ecosystem. Conversely, for a bunch of Windows API calls, even where we don’t expect them to block, we simply do not perform the call while remaining within cooperative mode. The API calls falling within the preemptive blocks are:

  • CreatePipe()
  • GetCurrentProcess()
  • DuplicateHandle()
  • CloseHandle()
  • GetModuleHandleW()
  • GetProcAddress()
  • InitializeProcThreadAttributeList()
  • UpdateProcThreadAttribute()
  • CreateProcessAsUserW()
  • CreateEnvironmentBlock()
  • CreateProcessW()
  • DestroyEnvironmentBlock()
  • DeleteProcThreadAttributeList()
  • ReadFile() – the one we expect to block while reading the StdOut pipe
  • MultiByteToWideChar()

Deeper into preemptive wait types

Something I will explore more in future is SOS_WaitInfo, the class that tracks an instance of a wait. In short, when a worker is at risk of going to sleep, it fills out one of these pink slips to note the wait type; scheduler code then fills in the wait start time in the “For Office Use Only” section. On completion of the wait, scheduler code calculates the elapsed time since the wait start, and adds this to the aggregated wait time for that wait type. For the moment, accept that I’m about to describe something that extends the regular SOS_WaitInfo.

By definition, a cooperatively scheduled worker can’t nest one wait inside another. The worker actually goes to sleep during a cooperative wait, and waking up means that the wait is over, so the next wait will be a different wait instance. However, since a preemptive SQLOS “wait” isn’t really a wait, and can be interrupted by the worker going cooperative and potentially waiting cooperatively, it is possible for a preemptive “wait” to be interrupted and resumed later. On top of that, it is also possible for a section of cooperatively scheduled code within the preemptive section (the “lake on the island”) to have another preemptive island within it!

What this comes down to is that preemptive waits can be nested, and SQLOS has special handling for this situation. Instead of a “waiting” preemptive worker being deemed to have a scalar current wait type, preemptive waits go on a simple stack, allowing nesting up to a depth of six levels – the restriction is due to it being implemented as a six-element array.

The previously described AutoSwitchPreemptive class encompasses an SOS_ExternalAutoWait, which is the preemptive variation on SOS_WaitInfo. Declaring preemptive “wait” start and end points is performed by the AutoSwitchPreemptive constructor and destructor, by passing this SOS_ExternalAutoWait to the ambient task’s PushWait() and PopWait() functions respectively. These methods in turn keep track of the current wait within the stack-flavoured array. Popping a wait updates the scheduler-level wait time stats for the relevant wait type in the same way that ending a cooperative (true) wait would do.
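As a toy model of that wait stack, assuming nothing beyond what is described above (the names `WaitStack`, `push_wait`, and `pop_wait` are mine, not SQLOS's):

```python
import time

MAX_NESTING = 6   # the six-element array mentioned above

class WaitStack:
    """Toy model of the per-task preemptive wait stack: PushWait()/PopWait()
    analogues, with elapsed time aggregated per wait type on pop."""
    def __init__(self):
        self.slots = [None] * MAX_NESTING   # (wait_type, start_time) entries
        self.depth = 0
        self.totals = {}                    # wait type -> aggregated seconds

    def push_wait(self, wait_type):
        if self.depth >= MAX_NESTING:
            raise OverflowError("preemptive waits nest at most six deep")
        self.slots[self.depth] = (wait_type, time.monotonic())
        self.depth += 1

    def pop_wait(self):
        self.depth -= 1
        wait_type, started = self.slots[self.depth]
        self.slots[self.depth] = None
        self.totals[wait_type] = (self.totals.get(wait_type, 0.0)
                                  + time.monotonic() - started)
        return wait_type

ws = WaitStack()
ws.push_wait("PREEMPTIVE_OS_PIPEOPS")
ws.push_wait("PREEMPTIVE_OS_GETPROCADDRESS")   # an island within the lake
ws.pop_wait()
ws.pop_wait()
```

Note how the nested wait's elapsed time is also still ticking against the outer wait, which is the double-counting behaviour suspected below.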

Now while I haven’t confirmed this positively, it appears as if the nesting of a cooperative wait inside a preemptive one, as well as the nesting of a preemptive wait island inside a cooperative lake on a preemptive island, means that the elapsed time is counted against all active waits on the wait stack. In other words, one or more preemptive wait types, plus perhaps a cooperative wait type, can double-count the same chunk of elapsed time. If indeed true, it means that wait times for non-preemptive wait types can always be trusted, but the actual numbers for preemptive wait types should just be taken as a general indication that time was spent outside of cooperative scheduling. Since the amount of CPU burned is actually a more interesting question, and this isn’t noted anywhere, it seems like a non-issue in the bigger scheme of things though.

Finally, I find it instructive that the SOS_ExternalAutoWait constructor comes in three flavours, distinguished by which wait type they declare their waiting against:

  • A non-parameterised one, logging against the MISCELLANEOUS / UNKNOWN wait type
  • A non-parameterised one, logging against PREEMPTIVE_OS_GETPROCADDRESS
  • A parameterised one which allows the caller to declare a wait type

The last one is the one used by xp_cmdshell, which declares a PREEMPTIVE_OS_PIPEOPS wait. But the existence of the other two helps paint the picture of why we so often see PREEMPTIVE_OS_GETPROCADDRESS waits – it isn’t so much that GetProcAddress() is truly the focus of the functions in question, but more a case of this being a bland default. In fairness, when multiple Windows API calls are involved, it would be stupid to wrap each in its own individually labelled preemptive transition, and short of creating another “pipe ops” or “authentication ops” bucket, PREEMPTIVE_OS_GETPROCADDRESS is reasonable shorthand for “do some stuff involving the Windows API”. And MISCELLANEOUS? To the degree that any code still uses this variation, we don’t even get a clear indication that we’re looking at a preemptive wait.
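Python lacks constructor overloading, so a sketch of the three flavours has to fake them with keyword arguments; treat this purely as a restatement of the list above, with every name my own invention:

```python
MISCELLANEOUS = "MISCELLANEOUS"
PREEMPTIVE_OS_GETPROCADDRESS = "PREEMPTIVE_OS_GETPROCADDRESS"

class ExternalAutoWait:
    """Toy model of the three constructor flavours: the two non-parameterised
    C++ constructors become keyword defaults here."""
    def __init__(self, wait_type=None, generic_os_call=False):
        if wait_type is not None:
            self.wait_type = wait_type   # caller declares, as xp_cmdshell does
        elif generic_os_call:
            self.wait_type = PREEMPTIVE_OS_GETPROCADDRESS   # bland default
        else:
            self.wait_type = MISCELLANEOUS   # no clear preemptive indication
```

So `ExternalAutoWait("PREEMPTIVE_OS_PIPEOPS")` corresponds to the xp_cmdshell case, and the two default paths correspond to the non-parameterised flavours.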

A few final thoughts

This was a fair amount of detail, but I should call out that I restricted myself to thread mode here. Going preemptive in fiber mode is a completely different game, and gets more hairy.

Beyond scheduling, xp_cmdshell presents unique security and debugging aspects. Consider how desirable it must appear as an attack vector, especially in those sad situations where SQL Server runs as the local system account. One peculiar xp_cmdshell troubleshooting aid involves the ability to put a breakpoint on the Windows ZwTerminateProcess() call (turned on through a pair of none-of-our-business trace flags) which remains in place until the child process completes.

Since this is now picking away at the notions of thread states and waits, I guess I have set the stage for a follow-up there. Next up I intend to look at what different DMVs try and tell us when they report the state of a worker/thread/request, and how they subtly diverge in suggesting a mental model of SQLOS scheduling internals.
