I used to think of threading as a complicated subject that everybody except me had a grip on. Turns out that it’s actually simple stuff with complicated repercussions, and no, many people don’t really get it, so I was in good company all along. Because I’m heading into some SQLOS thread internals, this is a good time to take stock and revisit a few fundamentals. This will be an idiosyncratic explanation, and one that ignores many complexities inside the black box – CPU internals, the added abstraction of a hypervisor, and programming constructs like thread pooling and tasks – for the sake of focusing on functionality directly exposed to the lower levels of software.
What does your CPU do for a living anyway?
Let’s imagine a computer with a single CPU. Here is the simplest possible picture of that CPU’s day-to-day life:
- Fetch the next instruction from memory
- Execute it, which normally involves some internal data manipulation
- …and optionally may include reading or writing memory or doing hardware I/O
- Decide where to get the next instruction, which would be the next one in memory unless the one just executed was a branching instruction or a return from a subroutine
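That fetch-execute-branch loop can be sketched in a few lines. This is a toy simulation, not how any real CPU is wired: the opcodes, the accumulator register and the little program below are all invented for illustration.

```python
# A toy fetch-execute loop. The instruction set here (LOAD/ADD/STORE/JNZ)
# is made up for this sketch; real CPUs do the same dance in silicon.

def run(program):
    """Execute a list of (opcode, operand) pairs and return data memory."""
    memory = {}   # the CPU's view of data memory
    acc = 0       # a single accumulator register
    pc = 0        # program counter: where the next instruction comes from
    while pc < len(program):
        opcode, operand = program[pc]       # fetch the next instruction
        pc += 1                             # default: the next one in memory
        if opcode == "LOAD":                # execute: internal data manipulation
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "STORE":             # ...optionally touching memory
            memory[operand] = acc
        elif opcode == "JNZ" and acc != 0:  # a branch overrides the default pc
            pc = operand
    return memory

result = run([
    ("LOAD", 3),
    ("ADD", -1),
    ("STORE", "x"),   # x counts down 2, 1, 0 as the loop spins
    ("JNZ", 1),       # jump back to the ADD while acc is non-zero
])
print(result)  # {'x': 0}
```

Note that the branch is nothing special from the CPU's viewpoint: it is just an instruction whose side effect is to overwrite the program counter.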
While the above description is reasonable from the machine language viewpoint, having to write or read code at that level is like suffering through an Entmoot. The takeaway here is that the CPU has a very simple stream of consciousness. In its pathological single-mindedness, it is a textbook junkie, always on the insatiable quest for the next instruction. And the story arising from this stream of instructions is your program’s execution.
Three entities are already taking shape. Firstly there is the CPU itself, forever devouring instructions and data, caring about nothing but the next instruction, and completely lacking in higher abstractions; one might say it’s all id and no ego, let alone superego. Then we have the set of instructions comprising the program, most likely the result of compiling a high-level language. Finally there is the unfolding story told by the execution of the program, which is where the magic happens. Distinguishing between the program and its execution is significant: one can think of those three respectively as the actor, the written script and the performance.
When CPUs get the hiccups
A casual glance would suggest that a CPU is perfectly in control of what it does. Nonetheless it is subject to involuntary forces which can’t be suppressed. Apart from large-scale acts of God like catastrophic power loss, it needs to deal with the everyday occurrence of hardware interrupts, which are as disconcerting as a bout of hiccups: when the impulse strikes, the hiccup has to be performed.
Here we are in the realm of the interrupt handler. Such a handler is just another subroutine, except that it is called not by user code but in response to an interrupt signal. Dealing with the interrupt request requires the CPU to snapshot its state, save that state on the stack and execute the interrupt handler. Upon return from the interrupt handler, the prior state is cleanly restored (one of the CPU’s party tricks), and the program continues running where it left off, blissfully unaware that it just had a seizure. Continuing the acting analogy, we’re now dealing with a filmed performance. The camera stops rolling while the actor goes off to powder his nose; the thread of the story picks up when the camera starts rolling again, and there is no continuity blip in the narrative’s timeline.
These are baby steps towards running multiple threads of execution on a single processor. The background noise of interrupt handling can and does tell a story in its own right, and the work being performed in these interrupt handlers will contribute to the healthy normal running of a system. For instance, a scheduler can be implemented in rudimentary form using an interrupt that fires every few milliseconds, does some housekeeping, and then returns to either the interrupted thread or a different one that is waiting to start or was previously interrupted. Now we’re cooking with fire!
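The timer-driven scheduler can be simulated in miniature. In the sketch below, which assumes nothing from the real Windows scheduler, each "thread" is a Python generator, the "timer interrupt" is simply a step counter, and generator suspension stands in for the CPU snapshotting a thread's state and later restoring it.

```python
# A Fisher-Price scheduler: a timer "interrupt" fires every QUANTUM
# instructions and hands the CPU to the next runnable thread.
# Generator suspension plays the role of saving and restoring CPU state.

QUANTUM = 3  # "milliseconds" between timer interrupts

def worker(name, steps):
    """A 'thread': yields once per instruction it executes."""
    for i in range(steps):
        yield f"{name}:{i}"

def schedule(threads):
    """Round-robin between threads, preempting each after QUANTUM steps."""
    trace = []
    while threads:
        thread = threads.pop(0)      # pick the next runnable thread
        for _ in range(QUANTUM):     # run until the timer interrupt fires
            try:
                trace.append(next(thread))
            except StopIteration:    # this thread has run to completion
                break
        else:
            threads.append(thread)   # preempted: go to the back of the queue
    return trace

trace = schedule([worker("A", 5), worker("B", 4)])
print(trace)  # A and B alternate in quanta of three instructions
```

Neither worker contains any scheduling logic, yet their stories end up interleaved: that interleaving is imposed entirely from the outside, exactly as the timer interrupt imposes it on an oblivious thread.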
So our Fisher-Price scheduler ain’t Windows, but it is a step in a promising direction. Let’s park that thought for a minute and consider what capabilities of the CPU and its associated hardware have been used so far:
- It keeps executing instructions one by one, and if you squint just right, these instructions represent your program
- Your program can – even without any innate altruism – give up the processor to another program due to the magic of hardware interrupts
- The processor has built-in support to handle interrupts cleanly and to return from them without getting tied in a knot
The bounds of the sandpit
I have already hinted that one shouldn’t expect altruism in code. Put differently, your program really ought to assume that my program is out to get it. Good fences make good neighbours and all that. Here the fences are implemented at hardware level through virtual memory addressing; my program is given a range of addresses that map somewhere in physical memory, and your program might be given the exact same range of addresses that actually point somewhere else, but no verbs in their code vocabulary normally grant them the ability to peer over the fence or lob dead cats into each other’s back yards.
At this point it looks like a land registry and possibly a police force are needed to ensure order in our little kingdom, and these organisations must have special powers beyond that granted to ordinary citizens. This is the realm of the operating system, which additionally serves to give us running water, electricity and sewerage.
Let’s now shed some metaphors and assume that our world is built on Windows running on an Intel processor. While similar patterns are implemented on other processors and by other operating systems, it’s time to start relating to the planet that SQL Server currently lives in.
While much useful operating system code runs in the same sandpit as user code, special powers granted to parts of the OS are implemented by having the relevant code run in kernel mode, giving it an extended instruction vocabulary with extra magic incantations that can exert control over everything else in the system should it choose to do so. In theory you and I could also run our code there, although Windows does its darndest to make this difficult and avoids letting us join it in the control room, and with good reason too. Code running here gets to tear down fences, make nuggets of the neighbour’s chickens and potentially move about without being seen – malware nirvana. With great power comes great responsibility, and it makes sense to have a high barrier against arbitrary code being granted that power.
It also makes sense to minimise the fraction of the operating system that needs to run in kernel mode, because that reduces the volume of code that must be held to a gold standard of scrutiny from a security and stability angle. Another advantage of staying out of kernel mode is that switching between kernel mode and user mode is quite expensive. If every use of operating system infrastructure involved a kernel transition, consuming OS services would be unnecessarily heavyweight, rather like buying groceries by sending your shopping list by registered mail to a helpful bureaucrat on a military base. There are a few things that absolutely need to happen in kernel mode on Windows though, for instance:
- Interaction with hardware, at least the close-to-the-metal part of it
- Interaction between processes, except where shared memory between cooperating processes has been set up (hence the performance advantage of SQL Server’s shared memory client library)
- Use of synchronisation constructs that involve the Windows Object Manager
- Context switching
- Page faults, which are a special case of hybrid hardware/software interrupt. The CPU notes that the virtual memory access being requested isn’t mapped in physical memory, and then raises itself an interrupt to deal with it
There is an interesting trade-off here: while wonderful functionality is exposed by the OS kernel, using it involves bureaucratic overhead which you may want to avoid in a high-performance system. Additionally, there are things which it may do in a generally sensible way, but where your specific requirements are at odds with its generality. If you didn’t see it coming, in the lore of SQLOS this is the point where the interested reader is usually referred to the writings of Saint Stonebraker, and who am I to buck the trend?
We’ve decided to offer you a place in John Malkovich
A Windows process can fairly be described as a private memory space in which one or more threads can run, plus the collection of threads associated with it. No other non-administrative process can interfere with it, although its threads have free rein to stomp on each other. Upon creation, a process contains a single thread, and this thread can then set about creating more siblings. Threads can be created and destroyed at any time during the process’s lifetime, and the process ends (tearing down its memory space and accounting objects) when the last thread exits.
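Here is the shape of that lifecycle in miniature, using Python's `threading` module rather than the raw Win32 calls; the division of labour is the same either way. The worker function and the shared list are invented for the example.

```python
import threading

# The process starts life with one thread (the one running this module);
# that initial thread creates siblings, all sharing the process's memory.

results = []
lock = threading.Lock()

def sibling(n):
    with lock:               # the threads share one memory space...
        results.append(n)    # ...so shared state needs coordination

threads = [threading.Thread(target=sibling, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                 # the initial thread waits for its siblings

print(sorted(results))  # [0, 1, 2, 3]
```

The sharing is the point: all four siblings appended to the very same `results` list, something two separate processes could not do without explicitly setting up shared memory.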
There is still a delicious duality here. From the CPU’s viewpoint, there isn’t really such a thing as a thread, let alone the schizoid concept of multiple threads that take turns controlling it. And a thread would be shocked to learn that someone occasionally anaesthetizes it and then wakes it up without its freaking knowledge.
Keep in mind that work done by a thread can get paused by a few distinct mechanisms. Offhand, I can think of five broadly delineated cases:
- It runs code which simply happens to take a long time, but stays in user mode. There is really no distinction here between calling a slow function which you wrote yourself and calling one in a library provided by someone else, which could include purely user-mode components of the OS.
- As a variation on the above, it could be calling an OS function (i.e. a category of library code) which has to transition into kernel mode, but then completes and returns to the caller without doing a context switch. While in kernel mode, some housekeeping unrelated to the call might get done, and this would form part of the background noise of keeping the system running.
- As above, it calls an OS function which transitions into kernel mode, but this function is required to wait before returning. A simple example would be the Windows API call Sleep(5000), which from the caller’s perspective does absolutely nothing, and takes five seconds doing it. From the OS perspective though, it is a licence to schedule another thread, and in fact to keep doing its context switching merry-go-round, preventing the calling thread from being switched in again until five seconds have passed. In other words, the Sleep call is a self-inflicted context switch. A very common variation on this would be a call to WaitForSingleObject on a non-signaled event; however, this is getting ahead of ourselves a bit.
- In the middle of the thread running something, a hardware interrupt fires, and the CPU scuttles off to perform the interrupt handler before picking up the thread again. In a remotely well-behaved system, these interruptions should be very brief though.
- Driven by a hardware interrupt (the regular timer interrupt or perhaps an inter-processor interrupt coming from scheduler code invoked on a sibling processor), the thread scheduling code interrupts our thread and decides that the current thread has had enough time on the CPU since its quantum started, or possibly it has just been preempted by a thread with higher priority. Prior to emerging from kernel mode back into user mode, the interrupted thread’s state is ironed and folded, and control passes to a previously waiting thread. In other words, we have an involuntary context switch due to quantum exhaustion or preemption by something more important.
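The two voluntary flavours of waiting, the timed Sleep and the wait-for-an-event, can be demonstrated from any language. The sketch below uses Python's `time.sleep` and `threading.Event` as stand-ins for the Windows Sleep and WaitForSingleObject calls; the durations and the `night_shift` worker are invented for the example.

```python
import threading
import time

# A voluntary timed wait: the analogue of Sleep(). From the caller's view
# the thread does nothing for the duration; from the OS's view, the CPU
# is free to run someone else in the meantime.
start = time.monotonic()
time.sleep(0.2)              # "I'm done for now; wake me in 200 ms"
voluntary = time.monotonic() - start

# The WaitForSingleObject flavour: block until another thread signals an
# event, rather than until a timer expires.
done = threading.Event()     # starts out non-signaled

def night_shift():
    time.sleep(0.1)          # pretend to do some work...
    done.set()               # ...then signal the waiting thread

threading.Thread(target=night_shift).start()
done.wait()                  # blocks without burning CPU: a context switch
print(f"slept ~{voluntary:.1f}s, then waited for the event")
```

In both cases the waiting thread consumes essentially no CPU while blocked, which is exactly what distinguishes a genuine wait from a busy loop that merely looks patient.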
The famously cooperative aspect of SQLOS scheduling comes into play here. At user-mode level, a SQLOS thread gets to play by grownup rules where it is trusted to choose its own sensible bedtime, and to wake up the night shift before it goes to sleep. Also, since we’re not counting on it getting randomly possessed by a janitorial spirit in the middle of doing its office job, part of the grownup expectation is that it will also make the bed, do the dishes and feed the chickens at a convenient time.
User-mode code running in a process (in other words, any normal code) can reference “global” variables that are actually specific to that thread and not global at all. This is simple to grasp if you squint just right; consider that all humans have a thread-local variable called $MyName. Let’s say I asked a bunch of British DBAs to make the declaration “I, $MyName, undertake before $CurrentMonarch not to cause a data leak either through malice or stupidity”, I wouldn’t need to pass your name as a parameter in the calls: it is understood that everybody will retrieve a different value for $MyName, i.e. the symbol $MyName relates to a different memory location for each thread. This is distinct from $CurrentMonarch, which could be a true global variable, with each thread reading the value from the same memory location.
In broad strokes, thread-local storage (TLS) really is that simple, and one can think of it as a personalised form of a global variable. It’s a completely different beast from a simple instance variable in object-oriented programming though. Consider that multiple instances of a class can be referenced by code running in a single thread, and the TLS variable lives outside of those instances. Conversely, multiple threads can (carefully!) reference a single instance of a class, and each of them will engage with that instance using a personalised view of “global” state. Hence TLS is orthogonal to member variables, unless you are setting out to create a class that models the actual abstraction of a thread. And this is exactly where we’re heading.
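The $MyName analogy translates directly into code. The sketch below uses Python's `threading.local()` as the TLS mechanism (Windows exposes the same idea through TlsAlloc/TlsGetValue and the `__declspec(thread)` storage class); the DBA names and the monarch placeholder are invented for the example.

```python
import threading

# $MyName as thread-local storage: every thread reads the same symbol but
# gets its own private value. CURRENT_MONARCH stays a true global, one
# memory location shared by all threads. (Placeholder value, obviously.)

tls = threading.local()
CURRENT_MONARCH = "the Crown"

declarations = []
lock = threading.Lock()

def dba(name):
    tls.my_name = name          # each thread sets its own copy of my_name
    with lock:
        declarations.append(
            f"I, {tls.my_name}, undertake before {CURRENT_MONARCH} "
            "not to cause a data leak"
        )

threads = [threading.Thread(target=dba, args=(n,)) for n in ("Alice", "Bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()

for d in sorted(declarations):
    print(d)
```

Both threads assign to the very same attribute, `tls.my_name`, yet neither overwrites the other: the symbol is shared, the storage is not, which is precisely the orthogonality to instance variables described above.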
My hope is that this provided one or two moments where either something clicked or you felt moved to violent disagreement – I am very open to being violently disagreed with. We’ll take a deeper dive into thread-local storage next before returning to SQLOS itself.