Where do SQL Server tasks come from?

In my previous post I discussed the unsung early years of a SQLOS thread. Now it’s all very well knowing that threads extend themselves with SystemThreads, don Worker outfits, and execute SOS_Tasks, but I keep glossing over where tasks come from.

Gloss no more.

SOS_Task::Param – the seed of a work request

This nested struct only shows up by name in one spot on stack traces, when we see the below call as the crossover point between thread setup boilerplate and the meat of the task:

sqldk!SOS_Task::Param::Execute

This is a simple piece of abstraction, calling a previously specified function pointer with a previously specified parameter value. Those two values are the core data members of the SOS_Task::Param. In itself, such a pair is similar to the pair one would pass to a thread creation call like CreateRemoteThreadEx(). However, among the other data members we find SQLOS-specific things like a pointer to a resource group, the XEvent version of task identity and – if applicable – a parent task. These do pad out the picture a bit.

Now the SOS_Task::Param is a comparatively small structure, but it contains the core of what a task is about. Without it, an SOS_Task is a bag of runtime state without a defined mission, and isn’t “executable”. It’s just a typed block of memory: 984 bytes in the case of SQL Server 2016 SP1, of which the Param takes up 68 bytes.

Ultimately, getting a task run translates into this: Specify what you want done by phrasing it as an SOS_Task::Param, and get “someone” to create and enqueue a task based on it to a WorkDispatcher.
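
To make that shape concrete, here is a minimal C++ sketch of such a work request: a function pointer, a parameter and an Execute() method. All the names are illustrative stand-ins rather than the real SQLOS definitions, which carry the extra members mentioned above.

#include <cstdio>

// Illustrative stand-in for SOS_Task::Param: the "what to do" of a task.
// The real struct carries extra members (resource group, XEvent identity,
// parent task); only the core pair is modelled here.
struct TaskParam
{
    void (*entryPoint)(void*);   // previously specified function pointer
    void* argument;              // previously specified parameter value

    // The crossover point between thread setup boilerplate and the meat of the task.
    void Execute() const { entryPoint(argument); }
};

// A hypothetical task body, standing in for something like
// sqlmin!SQLServerLogMgr::LogWriter.
void LogWriterLoop(void* arg)
{
    std::printf("log writer started with arg %p\n", arg);
}

int main()
{
    // "Phrase what you want done as a Param"...
    TaskParam param{ &LogWriterLoop, nullptr };

    // ...and eventually some worker dequeues a task built from it and runs it.
    param.Execute();
}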

Which came first, the task or the task?

I have lately been spending some delightful hours unpicking the SQL Server boot process. For now, let’s stick with the simple observation that some tasks are started as part of booting. One example would be the (or nowadays “a”) logwriter task, whose early call stack looks like this:

sqlmin!SQLServerLogMgr::LogWriter
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
KERNEL32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Everything up to RunTask() is the thread starting up, getting married to a worker, and dequeuing a task from the WorkDispatcher. Apart from it being enqueued by the boot process, and the task never really completing, there is nothing special about this core piece of system infrastructure. It is just another task, one defined through an SOS_Task::Param with sqlmin!SQLServerLogMgr::LogWriter as its function pointer.

Here is one that is actually a bit more special:

sqldk!SOS_Node::ListenOnIOCompletionPort
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask
sqldk!SOS_Scheduler::ProcessTasks
sqldk!SchedulerManager::WorkerEntryPoint
sqldk!SystemThread::RunWorker
sqldk!SystemThreadDispatcher::ProcessWorker
sqldk!SchedulerManager::ThreadEntryPoint
KERNEL32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

This is of course the network listener that makes it possible for us to talk to SQL Server at all. It is a bit unusual in using preemptive OS scheduling rather than the cooperative SQLOS flavour, but this only affects its waiting behaviour. What makes it very special is that this is the queen bee: it lays the eggs from which other tasks hatch.

The SQL nursery

The above system tasks wear their purpose on their sleeves, because the function pointer in the SOS_Task::Param is, well, to the point. The tasks that run user queries are more abstract, because the I/O completion port listener can’t be bothered with understanding much beyond the rudiments of reading network packets – it certainly can’t be mucking about with fluent TDS skills, SQL parsing and compilation, or permission checks.

So what it does is enqueue a task that speaks TDS well, point it at a bunch of bytes, and send it on its way. Here is an example of such a dispatch, which shows the “input” side of the WorkDispatcher:

sqldk!WorkDispatcher::EnqueueTask
sqldk!SOS_Node::EnqueueTaskDirectInternal
sqllang!CNetConnection::EnqueuePacket
sqllang!TDSSNIClient::ReadHandler
sqllang!SNIReadDone
sqldk!SOS_Node::ListenOnIOCompletionPort
...

At this point, the Resource Group classifier function will have been run, and the target SOS_Node will have been chosen. Its EnqueueTaskDirectInternal() method picks a suitable SOS_Scheduler in the node, instantiates an SOS_Task based on the work specification in the SOS_Task::Param, and the WorkDispatcher within that scheduler is then asked to accept the task for immediate or eventual dispatching. Side note: it looks as if SOS_Scheduler::EnqueueTask() is inlined within EnqueueTaskDirectInternal, so we can imagine it being in the call stack before WorkDispatcher::EnqueueTask().

In simple terms, the network listener derives a resource group and chooses a node, the node chooses a scheduler, and the scheduler delegates the enqueue to its WorkDispatcher component.

Ideally an idle worker will have been found and assigned to the task. If so, that worker is now put on its scheduler’s runnable queue by the network listener thread, taking it out of the WorkDispatcher. If no worker is available, the task is still enqueued, but in the WorkDispatcher’s wallflower list of workerless tasks, giving us our beloved THREADPOOL wait.
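
As a rough sketch of that branch – idle worker found versus not – here is a toy rendering in C++. Every name is invented and the real structures are far richer, but this decision point is where THREADPOOL comes from.

#include <deque>

struct Worker {};
struct Task {};   // imagine this built from an SOS_Task::Param-style work request

// Illustrative stand-in for the WorkDispatcher inside one scheduler.
struct Dispatcher
{
    std::deque<Worker*> idleWorkers;     // parked workers waiting for work
    std::deque<Task*>   workerlessTasks; // the "wallflower list" -> THREADPOOL
};

struct Scheduler
{
    Dispatcher dispatcher;
    std::deque<Worker*> runnableQueue;   // workers ready to run on this scheduler

    // Hypothetical sketch of accepting a task on this scheduler.
    void EnqueueTask(Task* task)
    {
        if (!dispatcher.idleWorkers.empty())
        {
            // An idle worker is available: bind the task to it (elided) and
            // make the worker runnable; it runs at a future context switch.
            Worker* worker = dispatcher.idleWorkers.front();
            dispatcher.idleWorkers.pop_front();
            runnableQueue.push_back(worker);
            (void)task;
        }
        else
        {
            // No worker available: the task still gets enqueued, and the time
            // it spends here is what surfaces as a THREADPOOL wait.
            dispatcher.workerlessTasks.push_back(task);
        }
    }
};

int main()
{
    Scheduler s;
    Task t;
    s.EnqueueTask(&t);   // no idle workers in this toy setup -> wallflower list
}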

Let’s however assume that we have a worker. Over on the other side, its underlying thread is still sleeping, but tied to a worker which is now on its scheduler’s runnable queue. It will eventually be woken up to start its assigned task in the normal course of its scheduler’s context switching operations. And what screenplay was thrown over the wall for it to read?

sqllang!CSQLSource::Execute
sqllang!process_request
sqllang!process_commands_internal
sqllang!process_messages
sqldk!SOS_Task::Param::Execute
...

Here we get an extra level of abstraction. The instantiated task runs the TDS parser, which creates and invokes specialised classes based on the contents of the TDS packet. In this case we got a SQL batch request, so a CSQLSource was instantiated to encapsulate the text of the request and set to work on it, taking us down the road of SQL parsing, compilation, optimisation and execution.

Once parallelism comes into play, such a task could instantiate yet more tasks, but that is a story for another day.

Conclusion

From a high enough view, this is quite elegant and simple. A handful of tasks are created to get SQL Server in motion; some of these have the ability to create other tasks, and the machine just chugs on executing all of them.

To tie this up with the previous post, here is the complete lifecycle of a SQLOS thread:

The SQLOS thread lifecycle

If you haven’t lately read Remus Rusanu’s excellent Understanding how SQL Server executes a query, do pay it a visit. This post is intended to complement its opening part, although I’m placing more emphasis on the underlying classes and methods recognisable from stack traces.

Since we’ve now covered the thread lifecycle from birth into its working life, next up I’ll go back and look at what a wait feels like from the worker angle.

Speaking at SQL Saturday Manchester

I was surprised to find it within me to submit a session abstract to SQL Saturday 645 in Manchester. Not to mention delighted when my session on SQLOS scheduling got chosen.

This will be my first time speaking at a SQL Saturday, so I’m pretty excited about the experience. The plan is to cover some fundamentals, revisit a few things I have blogged about, and add bits that have probably never been covered anywhere before.

What can possibly go wrong if I get up on stage? Here endeth the SQL section of this blog post.

This can possibly go wrong when I get up on stage

In a previous life, I spent 7.5 years as a cruise ship musician, first playing in a lounge band, then moving to the show band. I ended up as music director, leading the show band from the piano, and vaguely responsible for supervising and scheduling all music around the ship. Fortunately the latter mostly involved letting people do what they’re good at, and I wasn’t in a position of any responsibility during the below incident.

Back in the mists of early 2001, I worked on a ship that was doing a series of musical theme cruises. One particular cruise was really juicy: the star attraction was a Frank Sinatra impersonator, and our normal 7-piece band got expanded to a 20-piece big band for the occasion. The act was great, the stage was satisfyingly full, and those classic arrangements were a joy to play.

The full show, performed early in the week, had come and gone, and regular entertainment went on as usual, except that the star singer was to do a short segment in a variety show on the last night: a magician to open, the singer in the prime spot, and then a short production number from the dance cast.

In line with dinner, which was split between two seatings, shows were done twice each night, at 20:30 and then again at 22:15. Except for the last night, when the second seating had an early 18:45 show before their meal.

Somehow it never occurred to anyone to explicitly mention the flipped showtimes to the headliner. Ordinarily it was no big deal if we didn’t see his face before curtain up, because he still had twenty minutes to get backstage. But around this time we started asking whether he knew about the early show. Frantic phone calls were made, and towards the end of the magician’s act someone managed to get him on the phone, just after he had stepped out of the shower.

With one minute to go, and the hope that he could get dressed and ready in three more, the cruise director shouted at the band to play something – anything – while we figured out what to do. The cast was roused from their backstage lazing and told they might have to go on in two.

It was around this point that things completely fell apart. Like any production deployment, stage management works on reasonable certainty and a pre-arranged set of cues. Improvisation rarely has a happy ending.

Keeping a token brave face, the cruise director did a hand-over to the band, who figured we might as well play “A-train” and see where it went. And the curtain opened up as a half-naked dancer frantically ran across the stage, trailing bits of costume.

The singer turned up a minute after the production cast had finished their bit, which culminated in the stage being littered with the disgorged contents of streamer cannons, unfit for further use. The cruise director, having completely lost the erstwhile brave face, had by then walked on to tell the audience that this was indeed it, and could they please swallow their disappointment as they filled in the feedback forms rating the week’s experience.

The curtain resolutely stayed down. Muffled swearing could be heard backstage. Fingers pointed impotently.

The good news

None of this will happen at SQL Saturday Manchester. I have learned about the dangers of confetti cannons, I will turn up on time, and I probably won’t be in drag.

But if you are interested in the gubbins of context switching, what a SystemThreadDispatcher really does for a living, and you don’t live too far away, do drop by.

Scheduler stories: OLEDB, the external wait that isn’t preemptive

So I’ve been having this ongoing discussion with Lonny Niederstadt (b | t) recently, where he tries to make sense of how CheckDB performs, and I fixate on a possibly irrelevant detail. In other words, he talks over my head, and I talk under his feet. Seems fair to me.

Today is the day to air my funny bits, and with some luck Lonny will continue to take us to interesting places in his own explorations.

Update: Aaaand Lonny has delivered a headful!

The thing about external waits

I’m going to go out on a limb and suggest that none of us are very clear about what external waits are. Code external to SQL Server? External to the CPU? Code that was dropped by aliens?

Previously I dug into preemptive waits in SQLOS, and to be honest, I equated “preemptive” with “external”. For the most part the two go hand in hand after all.

To recap, a preemptive wait isn’t necessarily a wait at all. What happens is that a worker needs to run some code that can’t be trusted to play by cooperative scheduling rules. And rather than put the SQLOS scheduler (and all its sibling workers) at the mercy of that code, the worker detaches itself from the scheduler and cedes control to a sibling runnable worker.

The time period from this moment until the worker returns to cooperative scheduling is labeled as a preemptive wait. During this time, one would hope that the thread did indeed sleep a bit, because it would be directly competing for CPU with its sibling through the mediation of the operating system scheduler. In other words, the time ascribed to a preemptive wait covers an unknown combination of working and sleeping.
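
One way to picture the bookkeeping is as a scope that starts a named clock on entry and stops it on exit, regardless of whether the intervening time was spent working or sleeping. The sketch below is purely illustrative – invented names, made-up wait type, no real scheduler involved:

#include <chrono>
#include <cstdio>
#include <thread>

// Toy time card: starts the clock for a named wait type and stops it on
// destruction. Purely illustrative; not the SQLOS mechanism itself.
struct PreemptiveWaitScope
{
    const char* waitType;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    explicit PreemptiveWaitScope(const char* type) : waitType(type)
    {
        // Conceptually: detach from the cooperative scheduler here and let a
        // sibling runnable worker take over.
    }

    ~PreemptiveWaitScope()
    {
        using namespace std::chrono;
        auto ms = duration_cast<milliseconds>(steady_clock::now() - start).count();
        // Everything between construction and destruction is booked against
        // the wait type, whether the thread was working or sleeping.
        std::printf("%s: %lld ms\n", waitType, static_cast<long long>(ms));
    }
};

int main()
{
    PreemptiveWaitScope wait("PREEMPTIVE_OS_SOMETHING");             // invented label
    std::this_thread::sleep_for(std::chrono::milliseconds(10));      // "external" code
}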

In that blog post, I also covered the possibility of external waits getting nested: during the time where we’re executing code but counting each passing tick as external wait A, it is possible to declare another external wait B, and temporarily double count the same slice of time against both. Even more confusingly, we could temporarily dip back into cooperative scheduling while the external “wait” continues.

First, a fundamental premise. A wait type is just a label – a task name someone decided to fill in on a notional time card. Different pieces of code can use the same label for different things, or if we’re lucky, a given wait type is used in one place only, and its presence pinpoints the exact function that was running.

Today I’m only talking about the OLEDB wait as it manifests in CheckDB, although a similar story may pertain elsewhere. In case you didn’t read the title, OLEDB is an external wait type that isn’t preemptive. But what does this mean?

It means that some of what I said about preemptive waits still applies, most significantly the idea that code can actively be running while advertising itself as being in a wait. This can be seen as a slight slant on what the wait type measures: we are using an existing mechanism as a profiling tool that surfaces how much time is spent in a certain chunk of code.

However, in this case the worker doesn’t go preemptive, i.e. it doesn’t yield the scheduler to a sibling while it does the work advertised as a wait.

Here it really gets weird. This worker is cooperatively scheduled, and has a conscience that tells it not to hog the scheduler. Every so often, it will offer to yield, and if the scheduler accepts the offer the CheckDB worker will properly go to sleep, with this sleep time labeled as SOS_SCHEDULER_YIELD.

But at the same time, the clock is still ticking on the OLEDB wait!

This is an entirely new twist on double counting. We claim to be waiting, but are working, except that sometimes we do stop working, counting a bit of time against both OLEDB and SOS_SCHEDULER_YIELD.

I am not saying that this happens all the time, but this is the way that a certain branch of code is written, at least in SQL Server 2014. The OLEDB “wait” is declared within CQScanRmtScanNew::GetRowHelper(), and during this “wait” we get a call to CUtRowset::GetNextRows(), which itself calls CTRowsetInstance::FGetNextRow(). However, within GetNextRows(), after every 16 invocations a courtesy call is made to YieldAndCheckForAbort(), which may yield the scheduler with an SOS_SCHEDULER_YIELD wait.
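
To render the shape of that loop in toy form – invented names, stubbed row fetching and yielding, and made-up tick counts – the double counting falls out naturally:

#include <cstdio>

// Toy wait clocks, keyed by name, purely to illustrate the double counting.
static long long oledbTicks = 0;
static long long yieldTicks = 0;

bool SchedulerAcceptsYield(int i) { return i % 64 == 0; }   // stand-in decision

void YieldAndCheckForAbort(int i)
{
    if (SchedulerAcceptsYield(i))
    {
        // Pretend we slept 5 ticks. The sleep is booked as SOS_SCHEDULER_YIELD,
        // but the OLEDB clock (which is wall-clock time) keeps running too:
        // the same slice of time is counted against both.
        yieldTicks += 5;
        oledbTicks += 5;
    }
}

bool GetNextRow() { return true; }   // stand-in for fetching one row

void GetRowsUnderOledbWait(int rowsWanted)
{
    // The OLEDB "wait" clock is running for this whole function...
    for (int i = 1; i <= rowsWanted; ++i)
    {
        GetNextRow();
        oledbTicks += 1;              // ...even though we are doing real work...

        if (i % 16 == 0)              // ...and every 16 rows we offer to yield.
            YieldAndCheckForAbort(i);
    }
}

int main()
{
    GetRowsUnderOledbWait(1000);
    std::printf("OLEDB: %lld, SOS_SCHEDULER_YIELD: %lld\n", oledbTicks, yieldTicks);
}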

To visualise broadly:

Visualisation of time spent and time measured

Bonus material: The little preemptive wait that wasn’t

The OLEDB wait is neatly encapsulated in an instance of the CAutoOledbWait class, which in turn contains an SOS_ExternalAutoWait, the same object used by preemptive waits.

Now if we look into SOS_ExternalAutoWait, it comes with three constructors. One gives us a bland UNKNOWN/MISCELLANEOUS wait, presumably on the historic premise that folks didn’t always want to bother categorising their wait activity. Another is fully parameterised, and can emit any supplied wait type. But the third one really catches one’s eye: it’s wired to emit PREEMPTIVE_OS_GETPROCADDRESS.
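
Here is a hypothetical sketch of those three constructors. The real signatures, wait type values and member layout are not what is shown below; the point is just what each constructor emits:

// All names and values here are invented for illustration.
enum WaitType
{
    MISCELLANEOUS,
    PREEMPTIVE_OS_GETPROCADDRESS,
    OLEDB
};

struct ExternalAutoWait   // stand-in for SOS_ExternalAutoWait
{
    WaitType type;

    // 1. The "can't be bothered" constructor: a bland catch-all label.
    ExternalAutoWait() : type(MISCELLANEOUS) {}

    // 2. Fully parameterised: emit whatever wait type the caller supplies.
    explicit ExternalAutoWait(WaitType t) : type(t) {}

    // 3. The eye-catcher: a convenience constructor hard-wired to one specific
    //    wait type, which would explain why PREEMPTIVE_OS_GETPROCADDRESS shows
    //    up far more often than actual GetProcAddress() calls would justify.
    struct GetProcAddressTag {};
    explicit ExternalAutoWait(GetProcAddressTag)
        : type(PREEMPTIVE_OS_GETPROCADDRESS) {}
};

int main()
{
    ExternalAutoWait a;                                          // MISCELLANEOUS
    ExternalAutoWait b{OLEDB};                                   // caller's choice
    ExternalAutoWait c{ExternalAutoWait::GetProcAddressTag{}};   // the misleading one
    return (a.type == MISCELLANEOUS && c.type != b.type) ? 0 : 1;
}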

Clearly PREEMPTIVE_OS_GETPROCADDRESS must serve as a convenient “smoke break” wait type for certain callers, and I’d find it hard to believe that so many people really call GetProcAddress(). So on the premise that nothing in this external wait guarantees it is used preemptively, I am inclined to think that:

  1. When you see this particular wait, you have to read it as MISCELLANEOUS
  2. It ain’t necessarily preemptive

See you next time!

Unsung SQLOS: The SOS_UnfairMutexPair behind CMEMTHREAD waits

As with the droppings of the Questing Beast, we recognise synchronisation code paths by their emissions. But even when not leaving telltale fewmets, these creatures wander among us unseen, unsung, until shutdown doth us part.

The place of the SOS_UnfairMutexPair

Bob Dorr has done a great job of illustrating why thread-safe SQLOS memory objects need serialised access in How It Works: CMemThread and Debugging Them. This has been followed up by the “It just runs faster” coverage of dynamic memory partitioning.

Today I’ll be examining the SOS_UnfairMutexPair, the synchronisation object behind memory allocations. While I’m going to describe synchronisation as a standalone subject, it’s useful to think of it as the CMEMTHREAD wait; to the best of my knowledge nothing other than memory allocations currently uses this class.

For context, I have previously described a bunch of other synchronisation classes in earlier posts.

One picture which emerges from all these synchronisation flavours is that a developer can choose between busy waiting (eagerly burning CPU), politely going to sleep (thread rescheduling), or a blend between the two. As we’ll see the SOS_UnfairMutexPair goes for a peculiar combination, which was clearly tuned to work with the grain of SQLOS scheduling.

The shape of the object

With the exception of the central atomic lock member, everything comes in pairs here. The class models a pair of waitable objects, each having an associated pair of members to note which scheduler and task currently owns it:

Layout of the SOS_UnfairMutexPair (2016 incarnation)

Although it exposes semantics to acquire either both mutexes or just the second, in its memory allocation guise we always grab the pair, and it effectively acts as a single mutex. I’ll therefore restrict myself to describing only half of the pair.

Acquisition: broad outline

The focal point of the mutex’s state – in fact one might say the mutex itself – is the single Spinlock bit within the 32-bit lock member. Anybody who finds it zero, and manages to set it to one atomically, becomes the owner.

Additionally, if you express an interest in acquiring the lock, you need to increment the WaiterCount, whether or not you managed to achieve ownership at the first try. Finally, to release the lock, atomically set the spinlock to zero and decrement the WaiterCount; if the resultant WaiterCount is nonzero, wake up all waiters.

Now one hallmark of a light-footed synchronisation object is that it is very cheap to acquire in the non-contended case, and this class checks that box. If not owned, taking ownership (the method SOS_UnfairMutexPair::AcquirePair()) requires just a handful of instructions, and no looping. The synchronisation doesn’t get in the way until it is needed.

However, if the lock is currently owned, we enter a more complicated world within the SOS_UnfairMutexPair::LongWait() method. This broadly gives us a four-step loop:

  1. If the lock isn’t taken at this moment we re-check it, grab it, simultaneously increment WaiterCount, then exit triumphantly, holding aloft the prize.
  2. Fall back on only incrementing WaiterCount for now, if this is the first time around the loop and the increment has therefore not been done yet.
  3. Now wait on the EventInternal, i.e. yield the thread to the scheduler.
  4. Upon being woken up by the outgoing owner releasing the lock as described above, try again to acquire the lock. Repeat the whole loop.

The unfairness derives from the fact that there is no “first come, first served” rule, in other words the wait list isn’t a queue. This is not a very British class at all, but as we’ll see, there is a system within the chaos.
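
To pin down the mechanics, here is a minimal C++ sketch of the acquire/release protocol described above, modelling just one half of the pair as a lock bit plus a waiter count in a single atomic value. The event is stubbed out, the bounded spinning of the next section is omitted, and all names are mine rather than SQLOS’s:

#include <atomic>
#include <cstdint>

// One half of the pair: top bit = the "spinlock" (ownership) bit, the rest a
// waiter count. The real class packs two such halves into one 32-bit member.
constexpr std::uint32_t kLockBit = 0x80000000u;

struct UnfairMutexHalf
{
    std::atomic<std::uint32_t> lock{0};

    void WaitOnEvent()   { /* sleep on the EventInternal until signalled (stubbed) */ }
    void SignalWaiters() { /* wake everyone sleeping on the event (stubbed) */ }

    void Acquire()
    {
        // Fast path: not owned, so set the lock bit and bump the waiter count
        // in a single compare-and-swap. No loop in the uncontended case.
        std::uint32_t expected = 0;
        if (lock.compare_exchange_strong(expected, kLockBit + 1))
            return;

        // Contended path, roughly SOS_UnfairMutexPair::LongWait().
        bool counted = false;
        for (;;)
        {
            std::uint32_t current = lock.load();
            if ((current & kLockBit) == 0)
            {
                // 1. Free at this instant: grab it (counting ourselves if we
                //    haven't yet) and exit triumphantly, holding aloft the prize.
                std::uint32_t desired = (current | kLockBit) + (counted ? 0 : 1);
                if (lock.compare_exchange_strong(current, desired))
                    return;
                continue;
            }
            if (!counted)
            {
                // 2. First time around: at least register as a waiter.
                if (lock.compare_exchange_strong(current, current + 1))
                    counted = true;
                continue;
            }
            // 3. Yield the thread and sleep until the owner signals us, then
            // 4. loop around and try again. There is no queue, hence "unfair".
            WaitOnEvent();
        }
    }

    void Release()
    {
        // Clear the lock bit and stop counting as a waiter in one atomic step;
        // if anybody is still counted, wake them all up to scramble for it.
        std::uint32_t after = lock.fetch_sub(kLockBit + 1) - (kLockBit + 1);
        if (after != 0)
            SignalWaiters();
    }
};

int main()
{
    UnfairMutexHalf m;
    m.Acquire();   // uncontended: one compare-and-swap and we own it
    m.Release();   // nobody else is counted, so nothing to signal
}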

The finicky detail

Before giving up and waiting on the event, there is a bit of aggressive spinning on the spinlock. As is standard with spinlocks, spinning burns CPU on the optimistic premise that it wouldn’t have to do it for long, so it’s worth a go. However, the number of spin iterations is limited. Here is a slight simplification of the algorithm:

  • If the scheduler owning the lock is the ambient scheduler, restrict to a single spin.
  • Else, give it a thousand tries before sleeping.
  • Each spin involves first trying to grab both the lock and (if not yet done) incrementing WaiterCount. If that doesn’t work, just try and increment the WaiterCount.

This being of course the bit where the class knows a thing or two about SQLOS scheduling: If I am currently running, then no other worker on my scheduler can be running. But if another worker on my scheduler currently holds the lock, it can’t possibly wake up and progress towards releasing it unless *I* go to sleep. Mind you, this is already an edge case, because we’d hope that the owner of this kind of lock wouldn’t go to sleep holding it.
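
Reduced to a sketch with invented names, the spin budget decision is something like this:

#include <cstdio>

struct SchedulerId { int value; };

// Hypothetical rendering of the heuristic: if the lock's owner lives on my own
// scheduler, spinning is pointless (the owner cannot run while I do), so give
// up almost immediately; otherwise optimistically burn some CPU first.
int SpinBudget(SchedulerId owningScheduler, SchedulerId ambientScheduler)
{
    if (owningScheduler.value == ambientScheduler.value)
        return 1;      // the owner can only make progress once I yield
    return 1000;       // owner is on another scheduler; spinning may pay off
}

int main()
{
    std::printf("%d\n", SpinBudget({3}, {3}));   // same scheduler  -> 1
    std::printf("%d\n", SpinBudget({3}, {7}));   // other scheduler -> 1000
}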

To see how scheduling awareness comes into play, I’m going to walk through a scenario involving contention on such a mutex. If some of the scheduling detail makes you frown, you may want to read Scheduler stories: The myth of the waiter list.

A chronicle of contention

In this toy scenario, we have two schedulers with two active workers each. Three of the four workers will at some point during their execution try and acquire the same mutex, and one of them will try twice. Time flows from left to right, and the numbered callouts are narrated below. A red horizontal bracket represents the period where a worker owns the mutex, which may be a while after the acquisition attempt started.

The mutex acquisition tango
  1. A1 wants to acquire the mutex and, finding it uncontended, gets it straight away.
  2. B2 tries to acquire it, but since it is held by A1, it gives up after a bit of optimistic spinning, going to sleep. This gives B1 a turn on the scheduler.
  3. A1 releases the mutex, and finding that there is a waiter, signals it. This moves B2 off the mutex’s waiter list and onto scheduler B’s runnable queue, so it will be considered eligible for running at the next scheduler yield point.
  4. B1 wants the mutex, and since it isn’t taken, grabs it. Even though B2 has been waiting for a while, it wasn’t running, and it’s just tough luck that B1 gets it first.
  5. A1 wants the mutex again, but now B1 is holding it, so A1 goes to sleep, yielding to A2.
  6. B1 releases the mutex and signals the waiter A1 – note that B2 isn’t waiting on the resource anymore, but is just in a signal wait.
  7. B1 reaches the end of its quantum and politely yields the scheduler. B2 is picked as the next worker to run, and upon waking, the first thing it does is to try and grab that mutex. It succeeds.
  8. A2 reaches a yield point, and now A1 can get scheduled, starting its quantum by trying to acquire the mutex. However, B2 is still holding it, and after some angry spinning, A1 is forced to go to sleep again, yielding to A2.
  9. B2 releases the mutex and signals the waiting A1, who will hopefully have better luck acquiring it when it wakes up again.

While this may come across as a bit complex, remember that an acquisition attempt (whether immediately successful or not) may also involve spinning on the lock bit. And this spinning manifests as “useful” work which doesn’t show up in spinlock statistics; the only thing that gets exposed is the CMEMTHREAD waiting between the moment a worker gives up and goes to sleep and the moment it is woken up. This may be followed by another bout of unsuccessful and unmeasured spinning.

All in all though, you can see that this unfair acquisition pattern keeps the protected object busy doling out its resource: in this case, a memory object providing blocks of memory. In an alternative universe, the mutex class could have decided on its next owner at the moment the previous owner releases it. However, that would mean the allocator won’t do useful work until the chosen worker has woken up; in the meantime, workers on less busy schedulers would have missed an opportunity to sneak in and do a quick acquire/release cycle. So while the behaviour may look unfair from the viewpoint of the longest waiter, it can turn out better for overall system throughput.

Of course, partitioning memory objects reduced the possibility of even having contention. But the fact remains: while any resources whatsoever are shared, we need to consider how they behave in contended scenarios.

Compare-and-swap trivia

Assuming we want the whole pair, as these memory allocations do, there are four atomic operations performed against the lock member:

  • Increment the waiter count: add 0x00010001
  • Increment the waiter count and grab the locks: add 0x80018001
  • Just grab the locks (after the waiter count was previously incremented): add 0x80008000
  • Release the locks and decrement the waiter count: deduct 0x80018001

For the first three, the usual multi-step pattern comes into play (sketched in code after the list):

  1. Retrieve the current value of the lock member
  2. Add the desired value to it, or abandon the operation, e.g. if we find the lock bit set and we’re not planning to spin
  3. Perform the atomic compare-and-swap (lock cmpxchg instruction) to replace the current value with the new one as long as the current value has not changed since the retrieval in step 1
  4. Repeat if not successful
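
In C++ terms, and with invented names, the conditional flavour of that pattern looks roughly like this; the plain waiter-count increment (0x00010001) follows the same loop with a different bail-out condition:

#include <atomic>
#include <cstdint>

// Two of the deltas from the list above.
constexpr std::uint32_t kGrabAndCount = 0x80018001u;   // lock bits + waiter counts
constexpr std::uint32_t kGrabOnly     = 0x80008000u;   // lock bits (already counted)

// Sketch of the conditional read/decide/compare-and-swap pattern. A retry only
// happens when another thread changed the value between steps 1 and 3.
bool TryGrabPair(std::atomic<std::uint32_t>& lock, bool alreadyCounted)
{
    for (;;)
    {
        std::uint32_t current = lock.load();                     // step 1: retrieve
        if (current & kGrabOnly)                                  // a lock bit is set:
            return false;                                         // step 2: abandon
        std::uint32_t desired =
            current + (alreadyCounted ? kGrabOnly : kGrabAndCount);
        if (lock.compare_exchange_strong(current, desired))       // step 3: cmpxchg
            return true;
        // step 4: not successful (the value moved under us); go around again
    }
}

int main()
{
    std::atomic<std::uint32_t> lock{0};
    bool got = TryGrabPair(lock, /*alreadyCounted*/ false);   // uncontended: succeeds
    return got ? 0 : 1;
}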

The release is simpler, since we know that the lock bits are set (we own it!) and there is no conditional logic. Here the operation is simple interlocked arithmetic, but two’s complement messes with your mind a bit: the arithmetic operation is the addition of 0x7ffe7fff. Not a typo: that fourth digit is an “e”!

This all comes down to thinking of the lock bit as a sign bit we need to overflow in order to set to 1. The higher one overflows out of the 32-bit space, but the lower one overflows into the lowest bit of the first count. To demonstrate, we expect 0x80018001 to turn to zero after applying this operation:

    8001 8001
  + 7ffe 7fff
    ---------
(1) 0000 0000

Final thoughts

So you thought we’ve reached the end of scheduling and bit twiddling? This may turn out to be a perfect opportunity to revisit waits, and to start exploring those memory objects themselves.

I’d like to thank Brian Gianforcaro (b | t) for feedback in helping me confirm some observations.

SQL Agent and the hairiest Dateadd in town

Like many good things, it started with a #sqlhelp question, this time from Todd Histed (b | t):

Including a bit of back and forth, the issue boiled down to this:

  • A job is scheduled every minute and takes pretty much exactly 55 seconds to run, due to an included WAITFOR delay of 55 seconds
  • On SQL Server 2014, this did reliably run every minute
  • On SQL Server 2016, it skips minutes, effectively running only every two minutes
  • Reducing the WAITFOR to 30 seconds fixed the issue

Now I get unreasonably excited both by date manipulation and by scheduling issues, so this firmly grabbed my attention. After a bit of unwarranted thinking out loud, I figured it is time to go and do some spelunking.

TL;DR

I understand: you’re a busy elf. Here is the quick summary of what I found.

Disclaimer: the different schedule types, e.g. daily vs monthly, have different special case logic; I’m just concerning myself with (intra-) daily schedules here.

Upon completion of a job, the next run time is calculated based on the last scheduled time plus the schedule interval. However, allowance is made for the edge cases where the completed invocation overruns into the next start time. In such a situation, there isn’t a “catch-up” run; instead, the schedule is advanced iteratively until it reaches a future point in time.

However, 2016 introduces a new twist. When applying the “is the proposed next schedule time after Now()?” check, it adds five seconds to Now(). In other words, the question becomes “Is the proposed next schedule time more than five seconds in the future?”
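
Expressed as a sketch – invented function name, intra-day schedules only, and with the grace period hard-coded – the calculation boils down to this:

#include <ctime>

// Illustrative sketch of the next-run calculation for an intra-day schedule:
// advance from the last scheduled time in whole intervals until the proposed
// time is acceptably far in the future. The five-second grace is the 2016 twist.
std::time_t NextRunTime(std::time_t lastScheduledTime,
                        int intervalSeconds,
                        std::time_t now)
{
    const int graceSeconds = 5;              // 2016: "more than five seconds ahead?"
    std::time_t proposed = lastScheduledTime + intervalSeconds;
    while (proposed <= now + graceSeconds)   // no catch-up runs; just skip ahead
        proposed += intervalSeconds;
    return proposed;
}

int main()
{
    std::time_t lastStart = 0;      // the schedule slot at T
    std::time_t completed = 56;     // job finished roughly 56 seconds later
    // The T+60 slot falls within the five-second grace, so it gets skipped.
    return NextRunTime(lastStart, 60, completed) == 120 ? 0 : 1;
}

Run the reported scenario through that: a job on a one-minute schedule completes around 56 seconds after its scheduled start, so the proposed next run is only about four seconds in the future. That fails the five-second test, the schedule skips ahead another minute, and we get a run every two minutes – while a 30-second WAITFOR leaves a comfortable margin and the next slot is accepted.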

This looks more like a bug fix than a bug. Clearly cases were found where erroneous schedule invocations were happening, perhaps caused by some combination of job duration, high schedule frequency, and clock adjustments. Whatever the story, this is the behaviour we now live with going forward.

Scheduling is a notoriously tricky subject, because we tend to have an intuitive understanding of how things “ought to work,” while merrily ignoring a million little edge cases that keep developers on their toes. Just in case you needed proof.

Spelunking time

In the course of researching this, a few more fun details emerged.

Please send two mongooses
Please send two mongeese
Please send a mongoose. While you’re at it, send another.

When date and time manipulation is needed, where do you turn? If you’re SQL Agent, this involves custom structures and functions living in SQLSVC.dll. Internally it works with date and time structures holding separate 32-bit integers for component parts like seconds or months, but it translates between these and the funky “packed decimal” pair we see persisted in tables like sysjobhistory, e.g. the integer values 20180321 and 200000 for 20h00 on 21 March 2018.

What I find really cute is the simple approach to implementing Dateadd. We get two functions here, QScheduleDateModify() and QScheduleTimeModify() to operate on such a date/time structure pair. These will only increment or decrement the structure by one tick of the specified unit, so if you want to add six hours to a time, you call QScheduleTimeModify six times with parameter values indicating Increment and Hours.

This may seem primitive, but I suppose it has the advantage of straightforward implementation: as long as overflow of your chosen unit is dealt with correctly, you never need to deal with weird questions like “how many year boundaries are crossed when I add 23,655,101 seconds to the current date?” It isn’t a generic date library, but then it doesn’t need to be. And by definition, normal calculations for schedule times don’t involve a large number of increments, so there isn’t a case for hyper-optimisation.
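
As a toy rendering of the idea – invented names, hour arithmetic only, and none of the real rollover handling for months and years:

#include <cstdio>

// Toy model of the packed pair seen in sysjobhistory: 20180321 / 200000 means
// 20h00 on 21 March 2018.
struct PackedDateTime
{
    int date;   // YYYYMMDD
    int time;   // HHMMSS
};

// Stand-in for a single "increment by one hour" call: move the structure by
// exactly one tick of the chosen unit, handling overflow into the next larger
// unit (here: hour -> next day, crudely; real code rolls months and years too).
void IncrementOneHour(PackedDateTime& dt)
{
    dt.time += 10000;              // one hour in HHMMSS
    if (dt.time >= 240000)         // overflowed past 23:59:59
    {
        dt.time -= 240000;
        dt.date += 1;
    }
}

int main()
{
    PackedDateTime dt{20180321, 200000};
    for (int i = 0; i < 6; ++i)    // "add six hours" = call the function six times
        IncrementOneHour(dt);
    std::printf("%08d %06d\n", dt.date, dt.time);   // 20180322 020000
}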

At this point you probably expect me to whip out Windbg. Here you go then.

Having created a job that was due to fire, I attached to the SQL Agent process, put a breakpoint on SQLSVC!QScheduleCalcNextOccurrence and waited. At the appointed hour, the job ran and the breakpoint was hit as the schedule’s next occurrence needed to get calculated.

Breakpoint 0 hit
SQLSVC!QScheduleCalcNextOccurrence:
00007fff`0c0d3e20 48894c2408      mov     qword ptr [rsp+8],rcx ss:0000008c`e460c240=0000008ce460c274

Just for orientation, let’s look at the call stack and the state of the CPU registers upon entry:

0:022> k
 # Child-SP          RetAddr           Call Site
00 0000008c`e460c238 00007ff7`51595990 SQLSVC!QScheduleCalcNextOccurrence
01 0000008c`e460c240 00007ff7`51595459 SQLAGENT!SetScheduleNextRunDate+0x1f0
02 0000008c`e460c310 00007ff7`5158f992 SQLAGENT!SetJobNextRunDate+0x1a9
03 0000008c`e460c7f0 00007fff`18514f87 SQLAGENT!JobManager+0xfe2
04 0000008c`e460fc70 00007fff`1851512e MSVCR120!_callthreadstartex+0x17 [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 376]
05 0000008c`e460fca0 00007fff`261913d2 MSVCR120!_threadstartex+0x102 [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 354]
06 0000008c`e460fcd0 00007fff`266754e4 KERNEL32!BaseThreadInitThunk+0x22
07 0000008c`e460fd00 00000000`00000000 ntdll!RtlUserThreadStart+0x34
0:022> r
rax=000000000002ed9c rbx=0000000000000000 rcx=0000008ce460c288
rdx=000000000133c6b1 rsi=0000008cc8b23380 rdi=0000008ce428c278
rip=00007fff0c0d3e20 rsp=0000008ce460c238 rbp=0000008ce460c2a9
 r8=000000000002eda1  r9=0000000000000000 r10=00000000000118a0
r11=0000000000000011 r12=0000000000000001 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na pe cy
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000283
SQLSVC!QScheduleCalcNextOccurrence:
00007fff`0c0d3e20 48894c2408      mov     qword ptr [rsp+8],rcx ss:0000008c`e460c240=0000008ce460c274

Now whether this was C or C++ source code, the Windows x64 ABI tells us that the first parameter passed into the function lives in rcx (visible in the register dump above). I’ll dump a chunk of memory starting at the address it points to, displaying it as 32-bit doublewords:

0:022> dd 8ce460c288 l0x20
0000008c`e460c288  00000004 00000001 00000004 00000005
0000008c`e460c298  00000000 00000000 0133c6b1 05f5bebf
0000008c`e460c2a8  00000190 000399b7 0133c6b1 0002ed9c
0000008c`e460c2b8  0133c6b1 0002ed9c 56b04598 00004634
0000008c`e460c2c8  c8b23390 0000008c 00000000 00000000
0000008c`e460c2d8  00000000 00000000 e4280400 0000008c
0000008c`e460c2e8  00000000 00000000 00000000 00000000
0000008c`e460c2f8  e428c278 0000008c e460c410 0000008c

I’ll push this along – the values of interest are 0133c6b1, 0002ed9c and 05f5bebf. Windbg will oblige us by translating the hexadecimal representation into decimal if we ask nicely:

0:022> ? 133c6b1
Evaluate expression: 20170417 = 00000000`0133c6b1
0:022> ? 0x2ed9c
Evaluate expression: 191900 = 00000000`0002ed9c
0:022> ? 05f5bebf
Evaluate expression: 99991231 = 00000000`05f5bebf

Well, those kinds of numbers look more familiar, especially if you consider that the schedule fired at 19h19 on 17 April.

The rest of the experiment consisted of deliberately keeping the process frozen in the debugger for ten minutes before letting it run again; because this job was scheduled to run every five minutes, this guaranteed missing at least one run. As expected, and confirmed through a breakpoint on QScheduleTimeModify(), we went through three sets of five 1-minute increments to reach a next invocation time that was in the future.

Of course, some intent peering into assembly language was also involved, but I’ll not bore you further.

Where next?

I have a hunch that I’ll be revisiting SQL Agent at some point in the future again. However, the “Context in perspective” series is still hanging in the air, and will be picked up again. Let’s wait and see.

Scheduler stories: The myth of the waiter list

‘Tis the season to be controversial, so let’s take a stroll down memory lane to Ken Henderson’s classic Inside the SQL Server 2000 User Mode Scheduler:

The waiter list maintains a list of workers waiting on a resource. When a UMS worker requests a resource owned by another worker, it puts itself on the waiter list for the resource and enters an infinite wait state for its associated event object. When the worker that owns the resource is ready to release it, it is responsible for scanning the list of workers waiting on the resource and moving them to the runnable list, as appropriate. And when it hits a yield point, it is responsible for setting the event of the first worker on the runnable list so that the worker can run. This means that when a worker frees up a resource, it may well undertake the entirety of the task of moving those workers that were waiting on the resource from the waiter list to the runnable list and signaling one of them to run.

John Tenniel's White Rabbit from "Alice in Wonderland"

The lists behind the legend

I have not gone as far as opening up my rusty copy of SQL Server 2000 to see how Ken’s description fits in there, but I am now pretty certain that the above quote has transmuted over the years into a common misunderstanding about SQLOS scheduling mechanics.

Now nothing Ken said is untrue or particularly out of date. It is just that we often hear “the waiter list” (by implication handling resource waits) described as an attribute of a scheduler, which is not the case.

Let’s revisit when the scheduler code runs, and what it does:

  • A worker will yield, either because it needs to wait for a resource, or because it is eaten up with guilt over reaching the end of its allotted quantum.
  • The act of yielding means that scheduler code (methods on the SOS_Scheduler class) gets invoked.
  • After a bit of housekeeping for the common good of all workers sharing the scheduler, control is transferred back to a worker to do its thing – this may even be the same worker who originally yielded.
  • The housekeeping consists of checking for aborted tasks, processing pending I/Os, and checking for I/O completions and timer list timeouts.

The single most important list that a scheduler owns is the collection of runnable workers, that is, the subset of workers belonging to this scheduler who are not waiting for anything other than CPU. This has variously been described as a list and a queue; I shall be using the term “runnable queue” by convention, but be aware that it is a data structure that has changed over the years and isn’t a simple queue.

A scheduler has one piece of “creative” interaction with this runnable queue, and it comes with only two variables:

  • When a context switch is requested by an outgoing worker owning the scheduler, the scheduler code has to choose which one of potentially multiple workers is going to be its next owner.
  • The incoming worker gets given a quantum expiry date, by which time it is expected to yield.

Core scheduler code running during context switching only dequeues runnable workers, and at such moments a given scheduler only looks at its own runnable queue. In contrast, code running all over the place, including in the context of workers belonging to other schedulers, may enqueue workers on to the runnable queue.
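
A small sketch of that division of labour, using invented names and none of the real data structures:

#include <deque>
#include <mutex>

struct Worker { int id; };

// Illustrative only: the point is the division of labour, not the real
// structures (the actual runnable "queue" is not a simple queue).
struct Scheduler
{
    std::mutex guard;
    std::deque<Worker*> runnable;

    // Called from anywhere, including code running on other schedulers,
    // e.g. a synchronisation class granting a resource to a waiting worker.
    void MakeRunnable(Worker* w)
    {
        std::lock_guard<std::mutex> lock(guard);
        runnable.push_back(w);
    }

    // Called only by this scheduler's own context-switch code: pick the next
    // owner from MY runnable queue; no other scheduler's queue is consulted,
    // and no resource wait list is consulted either.
    Worker* PickNextOwner()
    {
        std::lock_guard<std::mutex> lock(guard);
        if (runnable.empty()) return nullptr;
        Worker* next = runnable.front();   // choice policy elided
        runnable.pop_front();
        return next;
    }
};

int main()
{
    Scheduler s;
    Worker w{42};
    s.MakeRunnable(&w);                // e.g. a latch just granted w its resource
    return s.PickNextOwner() == &w ? 0 : 1;
}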

Time for a simple diagram:

Someone to watch over me

What I’m trying to get across here is that each instance of a waitable resource has its own wait list, and the scheduler has no interest in this, because a scheduler only acts upon its runnable queue. Seen from a different angle, once a worker is waiting on a resource, its scheduler doesn’t care, because it can’t and won’t manage the waiting logic of something like a latch. This splits the responsibilities neatly in two:

  • The synchronisation class guarding a resource (which inevitably will be built upon an EventInternal) stands watch over all the workers queueing up to have a ride on that resource. The act of granting access to a worker involves moving the worker off the wait list and getting it on to the runnable queue of that worker’s scheduler, and this is achieved by the synchronisation class.
  • The scheduler, in turn, doesn’t decide who is runnable, but it does get to pick which of the runnable workers (however they reached that state) runs next.

The I/O and timer lists

There are however two cases where the scheduler decides to make a worker runnable in the normal course of events. One is when a worker was waiting on I/O to complete, where periodic scheduler housekeeping is the mechanism by which SQLOS takes note of the I/O completion. At this point some workers who were on the I/O list may find themselves moved to the runnable queue just before the next worker is picked to be granted ownership of the scheduler – the lucky winner might be one of these workers, or it may be someone else who has been runnable for a while.

The second, and actually more interesting case, is the timer list. In its simplest use case, this is where you will find workers executing T-SQL WAITFOR statements. The list is neatly ordered by timer expiry date, and at each invocation of the scheduler context-switch housekeeping, workers whose timer expiry dates have now passed will be moved to the runnable queue.

What makes a timer list particularly interesting though, is when it implements a resource wait timeout, for instance a lock timeout. In this scenario we actually have a worker waiting on two things simultaneously: a resource and a timer. If the resource is acquired before the timer expires, all is good: the worker goes on to the runnable queue, and upon being woken up it finds a thumbs-up as the return value of its resource acquisition call.

However, should the timer expire before the resource has been acquired, the scheduler will actually venture forth and take the worker off that waiter list before making it runnable and setting an error return value as wake-up call. Think of it as every teenager’s worst nightmare: you’re not home by curfew, so Mom comes to your dodgy party to drag your sorry ass home. And then you wake up with a hangover and a note stuck to your forehead reading “No cake for you”.
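
Sketching that housekeeping step with invented names (and glossing over the real expiry ordering and synchronisation):

#include <list>
#include <vector>

// Illustrative types only; names invented.
struct Worker
{
    bool timedOut = false;          // the "note stuck to your forehead"
};

struct ResourceWaitList             // owned by a latch/lock, not by the scheduler
{
    std::list<Worker*> waiters;
    void Remove(Worker* w) { waiters.remove(w); }
};

struct TimerEntry
{
    long long expiry;               // some monotonic tick count
    Worker* worker;
    ResourceWaitList* waitingOn;    // null for a plain WAITFOR-style sleep
};

// Sketch of the context-switch housekeeping step for the timer list.
void ExpireTimers(std::vector<TimerEntry>& timerList,
                  std::vector<Worker*>& runnableQueue,
                  long long now)
{
    for (auto it = timerList.begin(); it != timerList.end(); )
    {
        if (it->expiry > now) { ++it; continue; }   // real list is expiry-ordered

        if (it->waitingOn != nullptr)
        {
            // Resource wait timed out: Mom drags you off the waiter list, and
            // the acquisition call will return an error when you wake up.
            it->waitingOn->Remove(it->worker);
            it->worker->timedOut = true;
        }
        runnableQueue.push_back(it->worker);        // eligible to run again
        it = timerList.erase(it);
    }
}

int main()
{
    Worker w;
    ResourceWaitList lockWaiters;
    lockWaiters.waiters.push_back(&w);
    std::vector<TimerEntry> timers{{100, &w, &lockWaiters}};
    std::vector<Worker*> runnable;
    ExpireTimers(timers, runnable, /*now*/ 150);
    return (w.timedOut && runnable.size() == 1) ? 0 : 1;
}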

Whither next?

I tried to keep this comparatively high-level, but might take a nice little detour into the WorkerTimerRequest some day if time permits.

There you have it. Be home on time and have a thread-safe festive season.

Scheduler stories: Going Preemptive

SQLOS is built upon the idea of cooperative, AKA non-preemptive, scheduling: out of any given collection of threads belonging to a scheduler, only one will own the scheduler at a given moment. To the degree that these cooperative threads represent the only work done by the underlying CPU, this means that the thread owning the scheduler really owns the CPU. Of course, a CPU will occasionally get side-tracked into doing other work, so the SQLOS scheduler as “virtual CPU” only represents a chunk of the real CPU, but we live in the expectation that this is a suitably large chunk.

John Tenniel's White Rabbit from "Alice in Wonderland"

It is understandable when such competition for the CPU comes from outside of SQL Server, but it can also be instigated from within SQL Server itself. This is the world of preemptive – or if you prefer, antisocial – scheduling.

Scheduler stories: Interacting with the Windows scheduler

In the previous post, The joy of fiber mode, we saw how a fiber mode scheduler firmly controls which worker runs on a thread at a given moment. While it can’t positively ensure that the thread in question remains running all the time, the soul of the scheduler lives in that one thread, and as long as the thread runs, the scheduler gets invoked by its team of fiber workers, dispatching them in an appropriate order.

Scheduler stories: The joy of fiber mode

Probably the funniest thing I had ever seen on stage was a two-hander called “Frank ‘n Stein”. It’s a telling of the classic Frankenstein story, with the physical comedy of two actors having to rotate continuously between a large number of roles, including a whole crowd chasing the monster. This was all made possible by them never leaving the stage, but instead changing characters in front of the audience, using only rudimentary props to help differentiate the characters.

If this is the only thing you remember about fiber mode scheduling, it should see you through.

Scheduler Stories: When does your scheduler run?

I am planning to burn a fair number of cycles on SQLOS scheduling internals for the foreseeable future, and with some luck, this turns into an interesting series. OS scheduling is already a subject that belongs “on the other side of the looking glass”, and this only gets more interesting when we look at user-mode SOS_Scheduler scheduling built on top of it.

If I don’t specifically mention a version, my frame of reference is SQL Server 2014. Yes, things changed since then, but the 2012-2014 scheduler is a good starting point, and the fundamental mechanisms I’ll initially cover have changed very little since the User Mode Scheduler (UMS) of SQL Server 7.0.