I am planning to burn a fair number of cycles on SQLOS scheduling internals for the foreseeable future, and with some luck, this turns into an interesting series. OS scheduling is already a subject that belongs “on the other side of the looking glass”, and this only gets more interesting when we look at user-mode SOS_Scheduler scheduling built on top of it.
If I don’t specifically mention a version, my frame of reference is SQL Server 2014. Yes, things have changed since then, but the 2012-2014 scheduler is a good starting point, and the fundamental mechanisms I’ll initially cover have changed very little since the User Mode Scheduler (UMS) of SQL Server 7.0.
What is a scheduler really?
Two prior blog posts have prepared the ground in rather unconventional formats:
- King Arthur, Energizer Bunnies and the search for the SQLOS scheduler suggested a scheduler as a team of Energizer bunnies who share one battery between them.
- The thread in the head: Dr SQLOS explains context switching gets both more technical and more anthropomorphic, but repeats the idea of a team that depends on a code of conduct to keep their work moving along. Oh, and if ever you struggle to keep your word count down, I can recommend writing in rhyme as a useful constraint!
Building on those, here is a mental model that works well for me. Think of the word “scheduler” as a collective noun to describe a set of workers, as in “a scheduler of workers”, interchangeable with “herd” or “clan”. Don’t worry about drawing the line between workers, threads and tasks, because it isn’t material at this point in the game. I’ll tease those apart in due course.
To simplify things initially, we’ll forget about hidden schedulers and assume hard CPU affinity. That gives us an execution environment that looks like this:
- Each CPU is physically tied to a scheduler.
- Therefore, out of all the workers in the system, there is a subset of workers that will only run on that CPU.
- Workers occasionally hand over control of their CPU to a different worker in their scheduler.
- At any given moment, each CPU is expected to be running a worker that does something of interest to the middle or upper layers of SQL Server.
- Some of this useful work will be done on behalf of the worker’s scheduler siblings.
- However, a (hopefully) tiny percentage of a worker’s time is spent within the act of scheduling.
This gives us a nice feel-good definition: A scheduler is a family of workers who can work cooperatively. Since they share some common state and common values, they trust each other, they are willing to expend effort towards the common good, and they accept give-and-take between themselves. This is the cooperation within a scheduler.
Where does that leave workers that don’t belong to that particular clan? Well, even though they have different bags of shared state, the Montagues and the Capulets actually have the exact same values and behaviour. And those values do include a few rules that support cooperation between schedulers.
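To make the family picture a little more concrete, here is a minimal Python sketch of cooperation within one scheduler. It is a toy model and not how SQLOS is implemented: the names `worker` and `run_scheduler` are invented for illustration, generators stand in for workers, and the `while` loop stands in for the CPU simply following whatever code it is handed next. The one thing it tries to capture is that a worker keeps the CPU until it voluntarily gives it up.

```python
from collections import deque

def worker(name, steps):
    """A toy worker: does `steps` slices of work, yielding after each one."""
    for i in range(steps):
        # ... a slice of useful work would happen here ...
        yield f"{name}:{i}"  # voluntarily hand the CPU to a sibling

def run_scheduler(workers):
    """Run one toy scheduler (one CPU) until all of its workers are done."""
    runnable = deque(workers)  # shared state of this family of workers
    trace = []
    while runnable:
        current = runnable.popleft()
        try:
            trace.append(next(current))  # worker runs until it chooses to yield
            runnable.append(current)     # then rejoins the back of the queue
        except StopIteration:
            pass                         # worker finished; it leaves the family
    return trace

trace = run_scheduler([worker("A", 2), worker("B", 3)])
print(trace)  # ['A:0', 'B:0', 'A:1', 'B:1', 'B:2']
```

Nothing in this sketch can take the CPU away from a worker that never yields, which is exactly why the family depends on a code of conduct.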
Abstractions all the way down
Fanciful as such metaphors are, they do fit in the spirit of object-oriented programming, where we try and instantiate encapsulated objects that have defined behaviour. We can talk about things like a scheduler executing a task by binding the encapsulated work request to a worker (which is normally an abstraction for a thread) and then scheduling that worker. Beautiful stuff.
However, this is all just an abstraction that hides the naked truth of CPU nature: from the viewpoint of the CPU, even threads don’t exist, let alone higher abstractions. All a CPU does is to latch on to an instruction and follow the trail of ensuing instructions, which may lead to branches in the ingested code. Sure, it may contain features that support abstractions like threads and processes, but honey badger doesn’t care. Until it gets bitten by a hardware interrupt, it dines on a simple stream of code. And while hardware interrupts do feature in preemptive scheduling, SQLOS only taps into a minimal part of that ecosystem in order to support its cooperative habit.
Our task then is to try and join the dots between the high-level abstractions and the CPU in order to find a mental model that is both technically accurate and useful. This may prove quite a challenge, and your mental model may vary.
The SOS_Scheduler class
We have two good starting points for coming to grips with an SOS_Scheduler. One is sys.dm_os_schedulers, which illuminates a nice subset of the state contained in a scheduler, i.e. some of its member variables. There is of course some degree of interpretation involved, both in terms of how members are documented/labelled and how the journey from farm to table occurs within the code for the DMV itself, but I can’t overstate how great it is that this stuff is exposed at all.
The second angle is to look within sqldk public symbols for SOS_Scheduler methods – you may be familiar with some of these from call stacks. If member variables exposed by sys.dm_os_schedulers are the nouns of Schedulerese, those methods are the imperatives and interrogatives. The full list is pretty long, but here is a subset, biased towards ones that may turn out to be of particular interest:
- AddIOCompletionRequest
- AddPendingIORequest
- CheckForIOCompletion
- EnqueueTask
- GetCurrent
- GetNewWorkerCpuAllocation
- Idle
- IsPotentiallyDeadlocked
- IsShrinkWorkersNecessary
- IsTaskActivityIdle
- IsTimerQueueReadyToIdle
- PrepareWorkerForResume
- ProcessTasks
- Resume
- RunTask
- RunnableQueueInsert
- SuspendNonPreemptive
- SuspendPreemptive
- SwitchContext
In broad strokes, one can point at three categories of these code entry points. Firstly, there is the case where an outsider interacts with the scheduler, which modifies its state (the nouns) but doesn’t immediately affect the code running on the associated CPU. An example would be Resume(), which makes a suspended worker runnable, and thus eligible to be scheduled in due course. I’d like to drive home the point that this work is done by a worker who does not belong to the scheduler that the method is called on.
A second category is where the currently running worker runs a scheduler method, e.g. SuspendNonPreemptive(), which is done at the point where it knows it has to yield to a sibling worker. As an interesting aside – to be explored in MUCH greater depth later – even this call is hidden from the client SQL Server code that runs queries. It just happens to sit in a friendly code path which ultimately wrests control away from the caller by calling into the scheduler.
The third category is the truly internal code where housekeeping and the nitty-gritty of scheduling decisions and mechanisms live: methods like IsTimerQueueReadyToIdle() and SwitchContext().
Another duality is that some methods deal with the scheduler as the parent container of workers, whereas others embody the actual deed of scheduling.
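The three categories can be sketched in a few lines of Python. This is a hypothetical toy: the class `ToyScheduler` and its fields are invented, and although the method names loosely echo Resume(), SuspendNonPreemptive() and SwitchContext(), their bodies are my assumptions about the general shape of such calls, not the real sqldk logic.

```python
from collections import deque

class ToyScheduler:
    """Hypothetical stand-in for a scheduler's state plus three kinds of entry point."""

    def __init__(self):
        self.runnable = deque()  # workers eligible to run
        self.suspended = set()   # workers waiting on something
        self.current = None      # worker that owns the CPU right now

    # Category 1: called by an outsider. It changes the nouns only;
    # whoever is currently running on this scheduler's CPU keeps running.
    def resume(self, worker):
        self.suspended.discard(worker)
        self.runnable.append(worker)

    # Category 2: called by the currently running worker when it has to wait.
    def suspend_non_preemptive(self):
        self.suspended.add(self.current)
        self._switch_context()

    # Category 3: internal mechanism that picks the next worker, if any.
    def _switch_context(self):
        self.current = self.runnable.popleft() if self.runnable else None

s = ToyScheduler()
s.current = "w1"
s.suspended.add("w2")

s.resume("w2")               # an outsider makes w2 runnable
print(s.current)             # w1 still owns the CPU

s.suspend_non_preemptive()   # w1 decides to wait, which hands over to w2
print(s.current, s.suspended)
```

Note that `resume()` never touches `current`: the resumed worker only becomes eligible, and gets the CPU when the running worker next calls into the scheduler.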
And the answer is?
Unlike the underlying preemptive OS scheduler, which can be invoked at any time through interrupt activity, SQLOS scheduler-related methods only run when explicitly invoked by other SQL Server code. We can however talk about “being in the scheduler” in the same sense that control flow for a CPU can be “in the kernel”, even if no user vs kernel mode distinction is involved here. The big distinction is that the SQLOS scheduler is only ever voluntarily entered, even if it is through client code being tricked into it.
So the answer to the question “When does your scheduler run?” is simple: When workers get around to running it. Its motion comes in fits and starts, but as long as the workers occasionally call into its methods in a reasonably sane pattern, it keeps moving. The magic is that we can speak either about the scheduler running the workers or about the workers running the scheduler: both are equally valid viewpoints. Welcome to the other side of the looking glass.
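The idea that workers run the scheduler can be demonstrated with real threads. In the sketch below, which is an illustrative assumption and not SQLOS code, there is no scheduler thread at all. `HandoffScheduler`, `yield_to_sibling` and `retire` are invented names. Each worker thread owns a `threading.Event`, and the only scheduling code that ever executes is whatever the currently running worker calls on its own thread before parking itself.

```python
import threading
from collections import deque

class HandoffScheduler:
    """Toy scheduler with no thread of its own; workers execute its methods."""

    def __init__(self):
        self.runnable = deque()
        self.log = []

    def yield_to_sibling(self, me):
        # Executed by the running worker, on that worker's own thread.
        self.runnable.append(me)
        nxt = self.runnable.popleft()
        if nxt is not me:
            me.go.clear()   # prepare to park before waking the sibling
            nxt.go.set()    # hand the "CPU" to the sibling
            me.go.wait()    # park until some sibling hands it back

    def retire(self, me):
        # A finishing worker passes control on one last time.
        if self.runnable:
            self.runnable.popleft().go.set()

class Worker(threading.Thread):
    def __init__(self, name, sched, steps):
        super().__init__(name=name)
        self.sched, self.steps = sched, steps
        self.go = threading.Event()

    def run(self):
        self.go.wait()  # wait to be scheduled for the first time
        for i in range(self.steps):
            self.sched.log.append((self.name, i))  # a slice of "useful work"
            self.sched.yield_to_sibling(self)
        self.sched.retire(self)

sched = HandoffScheduler()
workers = [Worker("A", sched, 2), Worker("B", sched, 2)]
sched.runnable.extend(workers[1:])  # B starts out runnable
for w in workers:
    w.start()
workers[0].go.set()                 # kick things off by letting A run
for w in workers:
    w.join()
print(sched.log)  # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

If a worker stopped calling `yield_to_sibling`, its siblings would stay parked indefinitely, because there is nobody else around to run the scheduler on their behalf.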
The title of this post is a tribute to When does your OS run? by Gustavo Duarte, who is absolutely wonderful at clarifying arcane subjects. When I grow up, I want to be like him.
There are some touchpoints with my older post Threading for humans. This is no coincidence: although a fair amount of time and detours have featured since, I am really now picking up that old, um, thread.