Fishing for wait types revisited – QUERY_TASK_ENQUEUE_MUTEX

This post comes with a generous serving of humble pie. Many months ago I did a write-up about using Windbg to search for likely occurences of wait types within SQL Server modules, taking advantage of the fact that the wait type tends to show up “hard-coded” within a parameter assignment before the call to a wait function – see Fishing for wait types in WinDbg.

Now I must admit I have only sporadically used that technique, but I had some degree of faith in it. So far the only wait type I have encountered that doesn’t show up so obviously is THREADPOOL, a.k.a. SOS_WORKER. This is simply because there is a static SOS_WaitInfo structure for it, and hence we see the address of that structure referenced, i.e. a level of indirection that takes the magic number out of the code.

So last week my ears pricked up – or more accurately, were pulled in the right direction by Randolph West ( b | t ) when Joe Obbish wrote about the QUERY_TASK_ENQUEUE_MUTEX wait encountered during his adventures with The Trillion Row Table.

Trying to find out something about obscure wait types is always an irresistible diversion, and judging by the comments, David Williams ( b | t ) thought the same. Unfortunately I was away from home and SQL Server for a few days, but I figured it would take me a hour at the weekend to find something interesting.

Wherein our hero whips his own stupid ass

Ha! Cracking open Windbg on 2016 SP1 with the s command to look for byte patterns yielded nothing of value. Maybe something has changed with conventions or indirection? Nope, no joy in 2014 either.

In the end, it took the extremely brave step of RTFM, in this case the Windbg online help, to realise where I was going wrong. I was searching for a four-byte pattern by searching for doublewords. Sounds reasonable on the face of it, but what I had missed was that this specifically required the doublewords to be doubleword-aligned, i.e. starting on an address divisible by four. My method only had a 25% chance of working, so it’s sheer luck I ever got good results with it.

Changing to a byte search for four consecutive bytes gave me the non-aligned semantics my taste buds craved, and the results came pouring in.

Worked example

So I am not going to try and say too much about the wait type yet, but I may as well use the opportunity to show the process of finding occurrences in code. Main thing to remember is that searching for four consecutive bytes, rather than a doubleword, requires you to reverse the order of the bytes, i.e. express it explicitly as little endian.

sys.dm_xe_map_values tells us that QUERY_TASK_ENQUEUE_MUTEX has an index (“map_key”) of 719 in 2016 SP1; this translates to 0x02cf in hexadecimal. We’ll be looking for doublewords with one bit set in the most significant word, and 0x02cf in the least significant word, i.e.

0x000102cf
0x000202cf
0x000402cf
0x000802cf
0x001002cf
0x002002cf
etc...

This translates to byte-reversed four byte patterns like cf 02 01 00, cf 02 02 00 etc. Let’s give it a whirl and look in sqlmin. First get the module starting address and length, using Windbg as a hex calculator. Your results will vary.

0:085> lm m sqlmin
start             end                 module name
00007ff9`e6d80000 00007ff9`e924b000   sqlmin     (pdb symbols)          C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\sym\sqlmin.pdb\D38058F49E7C4D62970677E4315F1F1C2\sqlmin.pdb
0:085> ? 7ff9`e924b000 - 7ff9`e6d80000
Evaluate expression: 38580224 = 00000000`024cb000

Cool, now I know that sqlmin occupies 0x024cb000 bytes starting at 0x7ff9`e6d80000. Time to go fishin’.

0:085> s -b 0x7ff9`e6d80000 l0x024cb000 cf 01 40 00

Nada. Move up a bit.

0:085> s -b 0x7ff9`e6d80000 l0x024cb000 cf 02 40 00
00007ff9`e6eca58d  cf 02 40 00 49 8d 8b a0-fb ff ff e8 13 7a 09 00  ..@.I........z..
00007ff9`e76c463b  cf 02 40 00 48 8d 4c 24-28 e8 67 d9 89 ff 48 8b  ..@.H.L$(.g...H.
00007ff9`e7bb8d49  cf 02 40 00 4c 89 65 38-4c 89 65 48 4c 89 65 40  ..@.L.e8L.eHL.e@

Holy carp, the sqlmin are spawning! I’d be scared to meet anyone who can sight-read those byte patterns, but we have addresses that might lie within functions of interest. Let’s see what happens there; long boring assembly dumps reduced to function names:

0:085> uf 7ff9`e6eca58d
sqlmin!CQScanExchangeNew::StartAllProducers
...
0:085> uf 7ff9`e7bb8d49
sqlmin!CBpQScanColumnStoreScan::InitRowsetColumnBatch
...
0:085> uf 7ff9`e76c463b
sqlmin!CXRowset::GetXact
...

Okay, so to prove the point, here is how the wait call manifests in the first function:

00007ff9`e6eca58c bacf024000      mov     edx,4002CFh
00007ff9`e6eca591 498d8ba0fbffff  lea     rcx,[r11-460h]
00007ff9`e6eca598 e8137a0900      call    sqlmin!TAutoMutex<SOS_Mutex,1>::GetAccess (00007ff9`e6f61fb0)

This uses a template TAutoMutex, which wraps a plain old SOS_Mutex as a convenience. Merely instantiating the TAutoMutex as a local variable means that the developer gets mutex acquisition and release for free through variable scoping. Of course, how the mutex works internally is far less interesting than what it protects.

A bird’s-eye view of the functions

Disclaimer time. I am not pretending to be dropping a proper explanation about the blocking scenario that Joe came up against. I only speak three words of Columnstore, and two of them are unprintable, so I lack domain expertise. But I do take quite an interest in things that enqueue tasks, and someone else might be able to make meaningful observations from the below.

From the name of the wait type alone, we can’t tell which of the three functions was the likely blocker and which was the likely blockee. They could well have been two instances of the same function. So let’s just list out which calls each does while holding the mutex – the blocking is certain to sit in a child function, since nothing in these functions by themselves is heavyweight. A cursory glance also doesn’t tell us whether or not the mutex sits in a structure of which a single one gets shared across the workers, but we may as well assume there is only one going around.

CWRowset::GetXact is short and sweet. Get the mutex, call CQueryInstance::AddSubXact, and release the mutex.

CQScanExchangeNew::StartAllProducers is a lot more meaty. Here are the functions called, and spinlocks held briefly, while holding the mutex:

  • CXPort::StartNestedTransactions
  • SUBPDESC_LIST spinlock
  • CQScanExchangeNew::PxproddescCreate within a loop
  • QUERYSCAN spinlock
  • SubprocessMgr::EnqueueMultipleSubprocesses – this is getting close to my main area of interest, and something I will definitely explore in depth down the line!
  • CSpinEventMutex::GetAccess – another one for my to-do list
  • SPIN_EVENT_MUTEX spinlock – yup, you only get to try this after already having access to the preceding item
  • SubprocessMgr::GetFirstSubprocDesc and GetNextSubprocDesc – much of this is in a loop, so we’re iterating over subprocesses
  • SUBPDESC spinlock

CBpQScanColumnStoreScan::InitRowsetColumnBatch is equally meaty, although a far smaller part of it runs while holding the mutex of interest – here are the calls within that portion:

  • A function pointer call I have yet to identify (sorry, limitation of static analysis!)
  • optionally, HkNestedTransactionAlloc
  • CQueryInstance::AddSubXact
  • Another indirect call on a function pointer, but only if that pointer isn’t CQte::NidGet. Hand up if you know what a Nid is šŸ˜‰
  • CBpQScanColumnsStoreScan::SetupSubSegmentScan
    • So all in all, in hand-wavey broad strokes you get a picture of something involving parallel tasks, nested transactions, and sub-segments. Seems safe to say that we are in Joe’s sand box.

      Now if only I can stop thinking about getting myself a Nid for Christmas…

In the footsteps of a cooperative wait

In the last two posts, I gradually climbed up the stack trace of a typical SQLOS thread. Here is such a stack trace from thread creation to context switch, omitting the actual meat of SQL Server so we can focus on the scheduling aspect.

Today I’m looking at the upper part of that map, where the thread was deep in thought, but stepped into the open manhole of a latch wait, which – as is typical for most waits – involves an EventInternal as underlying mechanism.

Slipping into SwitchContext()

The SQLOS scheduler exists in the cracks between user tasks. As we’re well aware, in order for scheduling to happen at all, it is necessary for tasks to run scheduler-friendly code every now and again. In practice this means either calling methods which have the side effect of checking your quantum mileage and yielding if needed, or explicitly yielding yourself when the guilt gets too much.

Now from the viewpoint of the user task, the experience of yielding is no different than the experience of calling any long-running CPU-intensive function: You call a function and it eventually returns. The real difference is that the CPU burned between the call and its return was spent on one or more other threads, while the current thread went lifeless for a bit. But you don’t know that, because you were asleep at the time!

Anyway, for perfectly valid reasons, in the example an EventInternal‘s Wait() method decided to go to sleep, or viewed from a different angle, to put its caller to sleep. We know how that story ends. Ultimately the Wait() call will return, but before then, the thread will snake its way into a cooperative context switch involving SignalObjectAndWait().

The recipe

The EventInternal’s Wait() function is one of a handful of blocking functions that ferry threads into the cooperative context switch – or alternatively you can view it as ferrying the CPU across workers. In SQL Server 2017, you’ll start seeing WaitableBase::Wait(), but this is mostly refactoring, or possibly even un-inlining of existing code phrasing which only now shows up in public symbols.

Getting into a context switch and back out again – i.e. eventually having the context switched back to the current thread – in the polite context of a task, includes a sequence of three function calls within Wait():

  1. SOS_Task::PreWait() – this sets up wait accounting and publishes the wait_info XEvent.
  2. SOS_Scheduler::SuspendNonPreemptive() – this sets up timeout logic and does a final check for task abort before calling SwitchContext(). The result of SwitchContext (which is ultimately the result of its call to Switch() will be passed back up to Wait() as the wait result.
  3. SOS_Task::PostWait() – this performs the actual wait accounting and clears the waiting status of the task

These are outlined below:

EventInternal Wait sequence

The elusive SwitchContext() and its uncle Switch()

Okay, I was guilty of a white, or perhaps red, lie by including a call to SwitchContext() in that first diagram. Unless you have a breakpoint on that function, you probably will never see it in a stack trace. This is because it makes a tail call to Switch(), meaning the compiler tranfers control to its child Switch() through a jmp instruction rather than a call, thus erasing and reusing the SwitchContext stack frame. Goto is neither dead nor particularly harmful once you’re into assembly language.

But anyway, there is a nice delineation between the two. Switch() is the lowest-level SQLOS function where a thread may enter and block before eventually returning, and this is where the call to SignalObjectAndWait() happens. As input parameters, it receives a pointer to the current Worker and the one that must be switched to. This includes special handling for the case where the two are the same, e.g. if the worker graciously yielded due to quantum exhaustion, but was the only viable worker on the runnable queue, so got rescheduled immediately. In this case (“InstantResuming” in TaskTransition parlance) there is no mucking about with SignalObjectAndWait, and the function simply returns as soon as it can.

Otherwise, the outgoing worker is tidied up with a TaskTransition call of type “Suspending”, and the long-awaited SignalObjectAndWait ceremony is performed. Next thing it knows, SignalObjectAndWait returns because time passed, other workers ran, and Lady Luck – or if you’re lucky, Father Fairness – chose it as the next worker eligible for a quantum. At this point we get a “Resuming” TaskTransition, and the return value inscribed into the worker by its waker-upper, back when it was put back onto the runnable queue, becomes the return value of Switch() and hence SwitchContext().

However, as a last-ditch guard against against spurious wakeup, should the SignalObjectAndWait call return without the prescribed sacrament of the ambient SystemThread having its LastSignalledBy set by another, we cry foul and go to sleep again using a simple WaitForSingleObject(). As of 2016, there is now even an accompanying premature_systemthread_wakeup XEvent to herald the outrage.

Working backwards then, what does SwitchContext() do? Simple. This is where all the scheduler chores (e.g. checking timer and I/O lists) happen, and crucially, where the next incoming worker is chosen from the runnable queue. Its special case is finding an empty runnable queue, at which point the scheduler’s idle worker is scheduled, which may eventually go to sleep through WaitForSingleObject() on the scheduler’s own idle event. At this point the whole scheduler would be asleep, and it will require another scheduler to wake it up by signalling that idle event.

My, how the runnable queue has grown up. You may have gathered from Bob Dorr’s 2016 updated scheduling algorithms post that the PercentileQueue used in 2012 and 2014 got replaced with something else. What we have in 2016 (and AFAIK in 2017) is the continued use of said PercentileQueue for system schedulers, but the new GroupWorkerQueue for normal ones. This is a thing of beauty, containing a linked list per resource group per scheduler, i.e. partitioned in such a way that little in the way of locking is needed. I would like to highlight that its use doesn’t affect the awarded quantum target, which remains at 4ms, but only which worker gets chosen. One day I might have enough meat to write a whole post about it…

Final thoughts

This touches upon something I scratched at in The Myth of the Waiter List, which can always do with being aired again.

For the most part, a wait list, aka suspend queue, is something that belongs to a synchronisation resource like a latch or a a reader-writer lock. Apart from the timer and I/O lists, and the work dispatcher, suspend queues have nothing to do with schedulers: the synchronisation objects that own those suspend queues will move a worker from them to the relevant scheduler’s runnable queue when its time is up. The scheduler really only cares about the runnable queue, and will not waste tears or time worrying about workers suspended on synchronisation objects.

It should be clear from the context, but I have completely ignored fiber mode today. A fiber switch is a different beast entirely, and you can read more about it in The Joy of Fiber Mode.

Yes, there is some repetition from earlier posts, but I hope that covering the same ground multiple times in different ways works as well for you as it does for me. Any questions or observations? I’m all ears.

Scheduler stories: OLEDB, the external wait that isn’t preemptive

So I’ve been having this ongoing discussion with Lonny Niederstadt (b | t) recently, where he tries to make sense of how CheckDB performs, and I fixate on a possibly irrelevant detail. In other words, he talks over my head, and I talk under his feet. Seems fair to me.

Today is the day to air my funny bits, and with some luck Lonny will continue to take us to interesting places in his own explorations.

Update: Aaaand Lonny has delivered a headful!

The thing about external waits

I’m going to go out on a limb and suggest that none of us are very clear about what external waits are. Code external to SQL Server? External to the CPU? Code that was dropped by aliens?

Previously I dug into preemptive waits in SQLOS, and to be honest, I equated “preemptive” with “external”. For the most part the two go hand in hand after all.

To recap, a preemptive wait isn’t necessarily a wait at all. What happens is that a worker needs to run some code that can’t be trusted to play by cooperative scheduling rules. And rather than put the SQLOS scheduler (and all its sibling workers) at the mercy of that code, the worker detaches itself from the scheduler and cedes control to a sibling runnable worker.

The time period from this moment until the worker returns to cooperative scheduling is labeled as a preemptive wait. During this time, one would hope that the thread did indeed sleep a bit, because it would be directly competing for CPU with its sibling through the mediation of the operating system scheduler. In other words, the time ascribed to a preemptive wait covers an unknown combination of working and sleeping.

In that blog post, I also covered the possibility of external waits getting nested: during the time where we’re executing code but counting each passing tick as external wait A, it is possible to declare another external wait B, and temporarily double count the same slice of time against both. Even more confusingly, we could temporarily dip back into cooperative scheduling while the external “wait” continues.

First, a fundamental premise. A wait type is just a label – a task name someone decided to fill in on a notional time card. Different pieces of code can use the same label for different things, or if we’re lucky, a given wait type is used in one place only, and its presence pinpoints the exact function that was running.

Today I’m only talking about the OLEDB wait as it manifests in CheckDB, although a similar story may pertain elsewhere. In case you didn’t read the title, OLEDB is an external wait type that isn’t preemptive. But what does this mean?

It means that some of what I said about preemptive waits still applies, most significantly the idea that code can actively be running while advertising itself as being in a wait. This can be seen as a slight slant on what the wait type measures: we are using an existing mechanism as a profiling tool that surfaces how much time is spend in a certain chunk of code.

However, in this case the worker doesn’t go preemptive, i.e. it doesn’t yield the scheduler to a sibling while it does the work advertised as a wait.

Here it really gets weird. This worker is cooperatively scheduled, and has a conscience that tells it not to hog the scheduler. Every so often, it will offer to yield, and if the scheduler accepts the offer the CheckDB worker will properly go to sleep, with this sleep time labeled as SOS_SCHEDULER_YIELD.

But at the same time, the clock is still ticking on the OLEDB wait!

This is an entirely new twist on double counting. We claim to be waiting, but are working, except that sometimes we do stop working, counting a bit of time against both OLEDB and SOS_SCHEDULER_YIELD.

I am not saying that this happens all the time, but this is the way that a certain branch of code is written, as least in SQL Server 2014. The OLEDB “wait” is declared within CQScanRmtScanNew::GetRowHelper(), and during this “wait” we get a call to CUtRowset::GetNextRows(), which itself calls CTRowsetInstance::FGetNextRow(). However, within GetNextRows(), after every 16 invocations a courtesy call is made to YieldAndCheckForAbort(), which may yield the scheduler with an SOS_SCHEDULER_YIELD wait.

To visualise broadly:

Visualisation of time spent and time measured

Bonus material: The little preemptive wait that wasn’t

The OLEDB wait is neatly encapsulated in an instance of the CAutoOledbWait class, which in turn contains an SOS_ExternalAutoWait, the same object used by preemptive waits.

Now if we look into SOS_ExternalAutoWait, it comes with three constructors. One gives us a bland UNKNOWN/MISCELLANEOUS wait, presumably on the historic premise that folks didn’t always want to bother categorising their wait activity. Another is fully parameterised, and can emit any supplied wait type. But the third one really catches one’s eye: it’s wired to emit PREEMPTIVE_OS_GETPROCADDRESS.

Clearly PREEMPTIVE_OS_GETPROCADDRESS must serve as a convenient “smoke break” wait type for certain callers, and I’d find it hard to believe that so many people really call GetProcAddress(). So on the premise that nothing in this external wait guarantees it is used preemptively, I am inclined to think that:

  1. When you see this particular wait, you have to read it as MISCELLANEOUS
  2. It ain’t necessarily preemptive

See you next time!

Fishing for wait types in WinDbg

Last night a #sqlhelp question from Monica Rathbun (@SQLEspresso) caught my eye:

@SQLEspresso Twitter question

Now some of us take way too much delight in worrying about obscure wait types, and since I’ve recently been in preemptive territory I thought I should take some degree of interest. Spoiler alert: I did nothing to solve Monica’s problem, but my attempt to figure out where the wait type might emanate from made me realise that this is worth a blog post.

Without getting hung up on the detail, here is a very crude and simple way to hunt for areas of SQL Server that may use a particular wait type. The only prerequisite is that you need to be willing and able to attach Windbg to SQL Server, and that you have public symbols loaded.

In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the pattern 000101d4, 000201d4, 000401d4 and so forth up to 800001d4. Ignore the meaning of when which bit is going to be set; with only sixteen permutations, it’s quick enough to try them all.

Let’s focus on sqllang as the likely source – the below would apply to any other module too.

Upon starting the debugger, the module load addresses are listed right away. You can also use the lm command at any time afterwards to list them again. In my case, I got this for sqllang:

ModLoad: 00007ffe`23870000 00007ffe`25ad7000   C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Binn\sqllang.dll

So we have a start and end memory address. Take note of the length in bytes, using Windbg as a calculator:

0:063> ? 07ffe`25ad7000 - 7ffe`23870000
Evaluate expression: 36073472 = 00000000`02267000

Update 2017/08/29:
I’m keeping the original version below as a permanent record of a schoolboy error. Searching for a four-byte pattern expressed as a doubleword can in fact bring up hits, BUT only ones which are doubleword aligned, i.e. starting on an address divisible by 4. The correct way to cast the net wide enough is to use the -b flag and searching for four consecutive bytes; this search doesn’t presuppose doubleword alignment. Remembering to byte-reverse the pattern, that first command should have been s -b 0x7ffe`23870000 L0x2267000 d4 01 01 00. I was just lucky to have caught a fish using the -d variation.

Great. Now we have everything we need. The s command searches for patterns in a range of memory, and we’ll use the -d flag to make it a doubleword search. First few tries come up empty:

0:063> s -d 0x7ffe`23870000 L0x2267000 000101d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000201d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000401d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000801d4

But now we get one:

0:063> s -d 0x7ffe`23870000 L0x2267000 001001d4
00007ffe`287d39f8  001001d4 8948f633 48602474 68247489  ....3.H.t$`H.t$h

Ignore everything other than the address at the start of the line – we’re not expecting the byte dump to make sense to the human eye. Let’s see what piece of code this belongs to – the uf disassembles the function that this piece of memory falls in.

0:063> uf 0x7ffe`287d39f8

I’m not even going to show you the output, because this one turned out to be a red herring – experience and/or intuition needed to confirm that. But let’s go on…

 
0:063> s -d 0x7ffe`23870000 L0x2267000 002001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 004001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 008001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 010001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 020001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 040001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 080001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 100001d4

Still nothing, but then we strike gold:

 
0:063> s -d 0x7ffe`23870000 L0x2267000 200001d4
00007ffe`246a3fe4  200001d4 244c8d48 32bee840 4890ff1d  ... H.L$@..2...H
00007ffe`246a6694  200001d4 244c8d48 0c0ee840 4890ff1d  ... H.L$@......H
00007ffe`246a8d44  200001d4 244c8d48 e55ee840 4890ff1c  ... H.L$@.^....H
00007ffe`246ac8a4  200001d4 244c8d48 a9fee840 4890ff1c  ... H.L$@......H
00007ffe`246b0404  200001d4 244c8d48 6e9ee840 4890ff1c  ... H.L$@..n...H
... and many more!

Try the uf trick again on the first one:

0:063> uf 0x7ffe`246a3fe4

And we get rewarded with a disassembly of the function sqllang!IWrapInterface<IAccessor>::Release – this one pretty much comes with flashing lights given that IAccessor reeks of COM and we were expecting something involving “RELEASE”. I’ll spare you the bulk of the assembly dump, but would like to highlight the bit that confirms the setup of a preemptive wait type:

00007ffe`246a3fe3 bad4010020      mov     edx,200001D4h
00007ffe`246a3fe8 488d4c2440      lea     rcx,[rsp+40h]
00007ffe`246a3fed e8be321dff      call    sqllang!AutoSwitchPreemptive::AutoSwitchPreemptive (00007ffe`238772b0)

That assignment to the edx register means that the encoded wait type is the second parameter to the AutoSwitchPreemptive constructor. And while it may not always be a recognisable setup, in this case I was already familiar with AutoSwitchPreemptive (see here).

Now this kind of trawling is by no means scientific. The wait type could have been loaded from a memory address, in which case it wouldn’t have been hard-coded in the function. And of course without the code running in context, it doesn’t tell you what kind of call stack it might show up in – only running the relevant code paths and catching the wait through a breakpoint or XEvent will do that. But as a quick and dirty way of hunting for wait type usage in a module up there on the marble slab? Hey, it works for me.