In the footsteps of a cooperative wait

In the last two posts, I gradually climbed up the stack trace of a typical SQLOS thread. Here is such a stack trace from thread creation to context switch, omitting the actual meat of SQL Server so we can focus on the scheduling aspect.

Today I’m looking at the upper part of that map, where the thread was deep in thought, but stepped into the open manhole of a latch wait, which – as is typical for most waits – involves an EventInternal as underlying mechanism.

Slipping into SwitchContext()

The SQLOS scheduler exists in the cracks between user tasks. As we’re well aware, in order for scheduling to happen at all, it is necessary for tasks to run scheduler-friendly code every now and again. In practice this means either calling methods which have the side effect of checking your quantum mileage and yielding if needed, or explicitly yielding yourself when the guilt gets too much.

Now from the viewpoint of the user task, the experience of yielding is no different than the experience of calling any long-running CPU-intensive function: You call a function and it eventually returns. The real difference is that the CPU burned between the call and its return was spent on one or more other threads, while the current thread went lifeless for a bit. But you don’t know that, because you were asleep at the time!

Anyway, for perfectly valid reasons, in the example an EventInternal‘s Wait() method decided to go to sleep, or viewed from a different angle, to put its caller to sleep. We know how that story ends. Ultimately the Wait() call will return, but before then, the thread will snake its way into a cooperative context switch involving SignalObjectAndWait().

The recipe

The EventInternal’s Wait() function is one of a handful of blocking functions that ferry threads into the cooperative context switch – or alternatively you can view it as ferrying the CPU across workers. In SQL Server 2017, you’ll start seeing WaitableBase::Wait(), but this is mostly refactoring, or possibly even un-inlining of existing code phrasing which only now shows up in public symbols.

Getting into a context switch and back out again – i.e. eventually having the context switched back to the current thread – in the polite context of a task, includes a sequence of three function calls within Wait():

  1. SOS_Task::PreWait() – this sets up wait accounting and publishes the wait_info XEvent.
  2. SOS_Scheduler::SuspendNonPreemptive() – this sets up timeout logic and does a final check for task abort before calling SwitchContext(). The result of SwitchContext (which is ultimately the result of its call to Switch() will be passed back up to Wait() as the wait result.
  3. SOS_Task::PostWait() – this performs the actual wait accounting and clears the waiting status of the task

These are outlined below:

EventInternal Wait sequence

The elusive SwitchContext() and its uncle Switch()

Okay, I was guilty of a white, or perhaps red, lie by including a call to SwitchContext() in that first diagram. Unless you have a breakpoint on that function, you probably will never see it in a stack trace. This is because it makes a tail call to Switch(), meaning the compiler tranfers control to its child Switch() through a jmp instruction rather than a call, thus erasing and reusing the SwitchContext stack frame. Goto is neither dead nor particularly harmful once you’re into assembly language.

But anyway, there is a nice delineation between the two. Switch() is the lowest-level SQLOS function where a thread may enter and block before eventually returning, and this is where the call to SignalObjectAndWait() happens. As input parameters, it receives a pointer to the current Worker and the one that must be switched to. This includes special handling for the case where the two are the same, e.g. if the worker graciously yielded due to quantum exhaustion, but was the only viable worker on the runnable queue, so got rescheduled immediately. In this case (“InstantResuming” in TaskTransition parlance) there is no mucking about with SignalObjectAndWait, and the function simply returns as soon as it can.

Otherwise, the outgoing worker is tidied up with a TaskTransition call of type “Suspending”, and the long-awaited SignalObjectAndWait ceremony is performed. Next thing it knows, SignalObjectAndWait returns because time passed, other workers ran, and Lady Luck – or if you’re lucky, Father Fairness – chose it as the next worker eligible for a quantum. At this point we get a “Resuming” TaskTransition, and the return value inscribed into the worker by its waker-upper, back when it was put back onto the runnable queue, becomes the return value of Switch() and hence SwitchContext().

However, as a last-ditch guard against against spurious wakeup, should the SignalObjectAndWait call return without the prescribed sacrament of the ambient SystemThread having its LastSignalledBy set by another, we cry foul and go to sleep again using a simple WaitForSingleObject(). As of 2016, there is now even an accompanying premature_systemthread_wakeup XEvent to herald the outrage.

Working backwards then, what does SwitchContext() do? Simple. This is where all the scheduler chores (e.g. checking timer and I/O lists) happen, and crucially, where the next incoming worker is chosen from the runnable queue. Its special case is finding an empty runnable queue, at which point the scheduler’s idle worker is scheduled, which may eventually go to sleep through WaitForSingleObject() on the scheduler’s own idle event. At this point the whole scheduler would be asleep, and it will require another scheduler to wake it up by signalling that idle event.

My, how the runnable queue has grown up. You may have gathered from Bob Dorr’s 2016 updated scheduling algorithms post that the PercentileQueue used in 2012 and 2014 got replaced with something else. What we have in 2016 (and AFAIK in 2017) is the continued use of said PercentileQueue for system schedulers, but the new GroupWorkerQueue for normal ones. This is a thing of beauty, containing a linked list per resource group per scheduler, i.e. partitioned in such a way that little in the way of locking is needed. I would like to highlight that its use doesn’t affect the awarded quantum target, which remains at 4ms, but only which worker gets chosen. One day I might have enough meat to write a whole post about it…

Final thoughts

This touches upon something I scratched at in The Myth of the Waiter List, which can always do with being aired again.

For the most part, a wait list, aka suspend queue, is something that belongs to a synchronisation resource like a latch or a a reader-writer lock. Apart from the timer and I/O lists, and the work dispatcher, suspend queues have nothing to do with schedulers: the synchronisation objects that own those suspend queues will move a worker from them to the relevant scheduler’s runnable queue when its time is up. The scheduler really only cares about the runnable queue, and will not waste tears or time worrying about workers suspended on synchronisation objects.

It should be clear from the context, but I have completely ignored fiber mode today. A fiber switch is a different beast entirely, and you can read more about it in The Joy of Fiber Mode.

Yes, there is some repetition from earlier posts, but I hope that covering the same ground multiple times in different ways works as well for you as it does for me. Any questions or observations? I’m all ears.

Scheduler stories: OLEDB, the external wait that isn’t preemptive

So I’ve been having this ongoing discussion with Lonny Niederstadt (b | t) recently, where he tries to make sense of how CheckDB performs, and I fixate on a possibly irrelevant detail. In other words, he talks over my head, and I talk under his feet. Seems fair to me.

Today is the day to air my funny bits, and with some luck Lonny will continue to take us to interesting places in his own explorations.

Update: Aaaand Lonny has delivered a headful!

The thing about external waits

I’m going to go out on a limb and suggest that none of us are very clear about what external waits are. Code external to SQL Server? External to the CPU? Code that was dropped by aliens?

Previously I dug into preemptive waits in SQLOS, and to be honest, I equated “preemptive” with “external”. For the most part the two go hand in hand after all.

To recap, a preemptive wait isn’t necessarily a wait at all. What happens is that a worker needs to run some code that can’t be trusted to play by cooperative scheduling rules. And rather than put the SQLOS scheduler (and all its sibling workers) at the mercy of that code, the worker detaches itself from the scheduler and cedes control to a sibling runnable worker.

The time period from this moment until the worker returns to cooperative scheduling is labeled as a preemptive wait. During this time, one would hope that the thread did indeed sleep a bit, because it would be directly competing for CPU with its sibling through the mediation of the operating system scheduler. In other words, the time ascribed to a preemptive wait covers an unknown combination of working and sleeping.

In that blog post, I also covered the possibility of external waits getting nested: during the time where we’re executing code but counting each passing tick as external wait A, it is possible to declare another external wait B, and temporarily double count the same slice of time against both. Even more confusingly, we could temporarily dip back into cooperative scheduling while the external “wait” continues.

First, a fundamental premise. A wait type is just a label – a task name someone decided to fill in on a notional time card. Different pieces of code can use the same label for different things, or if we’re lucky, a given wait type is used in one place only, and its presence pinpoints the exact function that was running.

Today I’m only talking about the OLEDB wait as it manifests in CheckDB, although a similar story may pertain elsewhere. In case you didn’t read the title, OLEDB is an external wait type that isn’t preemptive. But what does this mean?

It means that some of what I said about preemptive waits still applies, most significantly the idea that code can actively be running while advertising itself as being in a wait. This can be seen as a slight slant on what the wait type measures: we are using an existing mechanism as a profiling tool that surfaces how much time is spend in a certain chunk of code.

However, in this case the worker doesn’t go preemptive, i.e. it doesn’t yield the scheduler to a sibling while it does the work advertised as a wait.

Here it really gets weird. This worker is cooperatively scheduled, and has a conscience that tells it not to hog the scheduler. Every so often, it will offer to yield, and if the scheduler accepts the offer the CheckDB worker will properly go to sleep, with this sleep time labeled as SOS_SCHEDULER_YIELD.

But at the same time, the clock is still ticking on the OLEDB wait!

This is an entirely new twist on double counting. We claim to be waiting, but are working, except that sometimes we do stop working, counting a bit of time against both OLEDB and SOS_SCHEDULER_YIELD.

I am not saying that this happens all the time, but this is the way that a certain branch of code is written, as least in SQL Server 2014. The OLEDB “wait” is declared within CQScanRmtScanNew::GetRowHelper(), and during this “wait” we get a call to CUtRowset::GetNextRows(), which itself calls CTRowsetInstance::FGetNextRow(). However, within GetNextRows(), after every 16 invocations a courtesy call is made to YieldAndCheckForAbort(), which may yield the scheduler with an SOS_SCHEDULER_YIELD wait.

To visualise broadly:

Visualisation of time spent and time measured

Bonus material: The little preemptive wait that wasn’t

The OLEDB wait is neatly encapsulated in an instance of the CAutoOledbWait class, which in turn contains an SOS_ExternalAutoWait, the same object used by preemptive waits.

Now if we look into SOS_ExternalAutoWait, it comes with three constructors. One gives us a bland UNKNOWN/MISCELLANEOUS wait, presumably on the historic premise that folks didn’t always want to bother categorising their wait activity. Another is fully parameterised, and can emit any supplied wait type. But the third one really catches one’s eye: it’s wired to emit PREEMPTIVE_OS_GETPROCADDRESS.

Clearly PREEMPTIVE_OS_GETPROCADDRESS must serve as a convenient “smoke break” wait type for certain callers, and I’d find it hard to believe that so many people really call GetProcAddress(). So on the premise that nothing in this external wait guarantees it is used preemptively, I am inclined to think that:

  1. When you see this particular wait, you have to read it as MISCELLANEOUS
  2. It ain’t necessarily preemptive

See you next time!

Fishing for wait types in WinDbg

Last night a #sqlhelp question from Monica Rathbun (@SQLEspresso) caught my eye:

@SQLEspresso Twitter question

Now some of us take way too much delight in worrying about obscure wait types, and since I’ve recently been in preemptive territory I thought I should take some degree of interest. Spoiler alert: I did nothing to solve Monica’s problem, but my attempt to figure out where the wait type might emanate from made me realise that this is worth a blog post.

Without getting hung up on the detail, here is a very crude and simple way to hunt for areas of SQL Server that may use a particular wait type. The only prerequisite is that you need to be willing and able to attach Windbg to SQL Server, and that you have public symbols loaded.

In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the pattern 000101d4, 000201d4, 000401d4 and so forth up to 800001d4. Ignore the meaning of when which bit is going to be set; with only sixteen permutations, it’s quick enough to try them all.

Let’s focus on sqllang as the likely source – the below would apply to any other module too.

Upon starting the debugger, the module load addresses are listed right away. In my case, I got this for sqllang:

ModLoad: 00007ffe`23870000 00007ffe`25ad7000   C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Binn\sqllang.dll

So we have a start and end memory address. Take note of the length in bytes, using Windbg as a calculator:

0:063> ? 07ffe`25ad7000 - 7ffe`23870000
Evaluate expression: 36073472 = 00000000`02267000

Great. Now we have everything we need. The s command searches for patterns in a range of memory, and we’ll use the -d flag to make it a doubleword search. First few tries come up empty:

0:063> s -d 0x7ffe`23870000 L0x2267000 000101d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000201d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000401d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000801d4

But now we get one:

0:063> s -d 0x7ffe`23870000 L0x2267000 001001d4
00007ffe`287d39f8  001001d4 8948f633 48602474 68247489  ....3.H.t$`H.t$h

Ignore everything other than the address at the start of the line – we’re not expecting the byte dump to make sense to the human eye. Let’s see what piece of code this belongs to – the uf disassembles the function that this piece of memory falls in.

0:063> uf 0x7ffe`287d39f8

I’m not even going to show you the output, because this one turned out to be a red herring – experience and/or intuition needed to confirm that. But let’s go on…

 
0:063> s -d 0x7ffe`23870000 L0x2267000 002001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 004001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 008001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 010001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 020001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 040001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 080001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 100001d4

Still nothing, but then we strike gold:

 
0:063> s -d 0x7ffe`23870000 L0x2267000 200001d4
00007ffe`246a3fe4  200001d4 244c8d48 32bee840 4890ff1d  ... H.L$@..2...H
00007ffe`246a6694  200001d4 244c8d48 0c0ee840 4890ff1d  ... H.L$@......H
00007ffe`246a8d44  200001d4 244c8d48 e55ee840 4890ff1c  ... H.L$@.^....H
00007ffe`246ac8a4  200001d4 244c8d48 a9fee840 4890ff1c  ... H.L$@......H
00007ffe`246b0404  200001d4 244c8d48 6e9ee840 4890ff1c  ... H.L$@..n...H
... and many more!

Try the uf trick again on the first one:

0:063> uf 0x7ffe`246a3fe4

And we get rewarded with a disassembly of the function sqllang!IWrapInterface<IAccessor>::Release – this one pretty much comes with flashing lights given that IAccessor reeks of COM and we were expecting something involving “RELEASE”. I’ll spare you the bulk of the assembly dump, but would like to highlight the bit that confirms the setup of a preemptive wait type:

00007ffe`246a3fe3 bad4010020      mov     edx,200001D4h
00007ffe`246a3fe8 488d4c2440      lea     rcx,[rsp+40h]
00007ffe`246a3fed e8be321dff      call    sqllang!AutoSwitchPreemptive::AutoSwitchPreemptive (00007ffe`238772b0)

That assignment to the edx register means that the encoded wait type is the second parameter to the AutoSwitchPreemptive constructor. And while it may not always be a recognisable setup, in this case I was already familiar with AutoSwitchPreemptive (see here).

Now this kind of trawling is by no means scientific. The wait type could have been loaded from a memory address, in which case it wouldn’t have been hard-coded in the function. And of course without the code running in context, it doesn’t tell you what kind of call stack it might show up in – only running the relevant code paths and catching the wait through a breakpoint or XEvent will do that. But as a quick and dirty way of hunting for wait type usage in a module up there on the marble slab? Hey, it works for me.