Fishing for wait types revisited – QUERY_TASK_ENQUEUE_MUTEX

This post comes with a generous serving of humble pie. Many months ago I did a write-up about using Windbg to search for likely occurences of wait types within SQL Server modules, taking advantage of the fact that the wait type tends to show up “hard-coded” within a parameter assignment before the call to a wait function – see Fishing for wait types in WinDbg.

Now I must admit I have only sporadically used that technique, but I had some degree of faith in it. So far the only wait type I have encountered that doesn’t show up so obviously is THREADPOOL, a.k.a. SOS_WORKER. This is simply because there is a static SOS_WaitInfo structure for it, and hence we see the address of that structure referenced, i.e. a level of indirection that takes the magic number out of the code.

So last week my ears pricked up – or more accurately, were pulled in the right direction by Randolph West ( b | t ) when Joe Obbish wrote about the QUERY_TASK_ENQUEUE_MUTEX wait encountered during his adventures with The Trillion Row Table.

Trying to find out something about obscure wait types is always an irresistible diversion, and judging by the comments, David Williams ( b | t ) thought the same. Unfortunately I was away from home and SQL Server for a few days, but I figured it would take me a hour at the weekend to find something interesting.

Wherein our hero whips his own stupid ass

Ha! Cracking open Windbg on 2016 SP1 with the s command to look for byte patterns yielded nothing of value. Maybe something has changed with conventions or indirection? Nope, no joy in 2014 either.

In the end, it took the extremely brave step of RTFM, in this case the Windbg online help, to realise where I was going wrong. I was searching for a four-byte pattern by searching for doublewords. Sounds reasonable on the face of it, but what I had missed was that this specifically required the doublewords to be doubleword-aligned, i.e. starting on an address divisible by four. My method only had a 25% chance of working, so it’s sheer luck I ever got good results with it.

Changing to a byte search for four consecutive bytes gave me the non-aligned semantics my taste buds craved, and the results came pouring in.

Worked example

So I am not going to try and say too much about the wait type yet, but I may as well use the opportunity to show the process of finding occurrences in code. Main thing to remember is that searching for four consecutive bytes, rather than a doubleword, requires you to reverse the order of the bytes, i.e. express it explicitly as little endian.

sys.dm_xe_map_values tells us that QUERY_TASK_ENQUEUE_MUTEX has an index (“map_key”) of 719 in 2016 SP1; this translates to 0x02cf in hexadecimal. We’ll be looking for doublewords with one bit set in the most significant word, and 0x02cf in the least significant word, i.e.

0x000102cf
0x000202cf
0x000402cf
0x000802cf
0x001002cf
0x002002cf
etc...

This translates to byte-reversed four byte patterns like cf 02 01 00, cf 02 02 00 etc. Let’s give it a whirl and look in sqlmin. First get the module starting address and length, using Windbg as a hex calculator. Your results will vary.

0:085> lm m sqlmin
start             end                 module name
00007ff9`e6d80000 00007ff9`e924b000   sqlmin     (pdb symbols)          C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\sym\sqlmin.pdb\D38058F49E7C4D62970677E4315F1F1C2\sqlmin.pdb
0:085> ? 7ff9`e924b000 - 7ff9`e6d80000
Evaluate expression: 38580224 = 00000000`024cb000

Cool, now I know that sqlmin occupies 0x024cb000 bytes starting at 0x7ff9`e6d80000. Time to go fishin’.

0:085> s -b 0x7ff9`e6d80000 l0x024cb000 cf 01 40 00

Nada. Move up a bit.

0:085> s -b 0x7ff9`e6d80000 l0x024cb000 cf 02 40 00
00007ff9`e6eca58d  cf 02 40 00 49 8d 8b a0-fb ff ff e8 13 7a 09 00  ..@.I........z..
00007ff9`e76c463b  cf 02 40 00 48 8d 4c 24-28 e8 67 d9 89 ff 48 8b  ..@.H.L$(.g...H.
00007ff9`e7bb8d49  cf 02 40 00 4c 89 65 38-4c 89 65 48 4c 89 65 40  ..@.L.e8L.eHL.e@

Holy carp, the sqlmin are spawning! I’d be scared to meet anyone who can sight-read those byte patterns, but we have addresses that might lie within functions of interest. Let’s see what happens there; long boring assembly dumps reduced to function names:

0:085> uf 7ff9`e6eca58d
sqlmin!CQScanExchangeNew::StartAllProducers
...
0:085> uf 7ff9`e7bb8d49
sqlmin!CBpQScanColumnStoreScan::InitRowsetColumnBatch
...
0:085> uf 7ff9`e76c463b
sqlmin!CXRowset::GetXact
...

Okay, so to prove the point, here is how the wait call manifests in the first function:

00007ff9`e6eca58c bacf024000      mov     edx,4002CFh
00007ff9`e6eca591 498d8ba0fbffff  lea     rcx,[r11-460h]
00007ff9`e6eca598 e8137a0900      call    sqlmin!TAutoMutex<SOS_Mutex,1>::GetAccess (00007ff9`e6f61fb0)

This uses a template TAutoMutex, which wraps a plain old SOS_Mutex as a convenience. Merely instantiating the TAutoMutex as a local variable means that the developer gets mutex acquisition and release for free through variable scoping. Of course, how the mutex works internally is far less interesting than what it protects.

A bird’s-eye view of the functions

Disclaimer time. I am not pretending to be dropping a proper explanation about the blocking scenario that Joe came up against. I only speak three words of Columnstore, and two of them are unprintable, so I lack domain expertise. But I do take quite an interest in things that enqueue tasks, and someone else might be able to make meaningful observations from the below.

From the name of the wait type alone, we can’t tell which of the three functions was the likely blocker and which was the likely blockee. They could well have been two instances of the same function. So let’s just list out which calls each does while holding the mutex – the blocking is certain to sit in a child function, since nothing in these functions by themselves is heavyweight. A cursory glance also doesn’t tell us whether or not the mutex sits in a structure of which a single one gets shared across the workers, but we may as well assume there is only one going around.

CWRowset::GetXact is short and sweet. Get the mutex, call CQueryInstance::AddSubXact, and release the mutex.

CQScanExchangeNew::StartAllProducers is a lot more meaty. Here are the functions called, and spinlocks held briefly, while holding the mutex:

  • CXPort::StartNestedTransactions
  • SUBPDESC_LIST spinlock
  • CQScanExchangeNew::PxproddescCreate within a loop
  • QUERYSCAN spinlock
  • SubprocessMgr::EnqueueMultipleSubprocesses – this is getting close to my main area of interest, and something I will definitely explore in depth down the line!
  • CSpinEventMutex::GetAccess – another one for my to-do list
  • SPIN_EVENT_MUTEX spinlock – yup, you only get to try this after already having access to the preceding item
  • SubprocessMgr::GetFirstSubprocDesc and GetNextSubprocDesc – much of this is in a loop, so we’re iterating over subprocesses
  • SUBPDESC spinlock

CBpQScanColumnStoreScan::InitRowsetColumnBatch is equally meaty, although a far smaller part of it runs while holding the mutex of interest – here are the calls within that portion:

  • A function pointer call I have yet to identify (sorry, limitation of static analysis!)
  • optionally, HkNestedTransactionAlloc
  • CQueryInstance::AddSubXact
  • Another indirect call on a function pointer, but only if that pointer isn’t CQte::NidGet. Hand up if you know what a Nid is 😉
  • CBpQScanColumnsStoreScan::SetupSubSegmentScan
    • So all in all, in hand-wavey broad strokes you get a picture of something involving parallel tasks, nested transactions, and sub-segments. Seems safe to say that we are in Joe’s sand box.

      Now if only I can stop thinking about getting myself a Nid for Christmas…

Context in perspective 5: Living next door TLS

Time to switch context to an old thread, specifically the SystemThread class. I have done this to death in an Unsung SQLOS post, but if you don’t want to sit through that, here are the bits we care about today:

  • All work in SQLOS is performed by good old-fashioned operating system threads.
  • SQLOS notionally extends threads with extra attributes like an associated OS event object, and behaviour like a method for for waiting on that event. This bonus material lives in the SystemThread class, which is conceptually a subclass of an operating system thread, although this isn’t literally class inheritance.
  • A SystemThread instance lives in memory which is either allocated on the heap and pointed to by a thread-local storage slot, or it squats right across a range of TLS slots within the Thread Environment Block.

Due to the nature of thread-local storage, at any moment any code running on a SQLOS thread can get a reference to the SystemThread it’s running on; this is described in gory detail in Windows, Mirrors and a sense of Self. Powerful stuff, because this ambient SystemThread indirectly exposes EVERYTHING that defines the current SQL Server context, from connection settings to user identity and beyond.

Life is a box of ogres

LS through the looking glass

Understanding SQLOS takes us only so far, because it is after all just the substrate upon which SQL Server is built. We’ve now reached the point where SQLOS hands SQL Server a blank slate and says “knock yourself out”.

This blank slate is a small but delicious slice of local storage: an array of eighteen pointers living within the SystemThread. SQLOS has no interest in what these are used for and what they mean. As far as it is concerned, it’s just a bunch of bytes of no known type. But of course SQL Server cares a lot more.

Park that thought for a moment to consider that we’re starting to build up a hierarchy of thread-local storage:

  1. Upon an OS context switch, the kernel swaps the value held in the CPU’s GS register to point to the Thread Environment Block of the incoming thread.
  2. Within this Thread Environment block lives TLS slots that SQLOS takes advantage of. One of these will always point to the currently running SystemThread instance. In other words, when the kernel does a context switch, the change of OS thread brings with it a change in the ambient SystemThread which can be retrieved from TLS.
  3. Now within this SystemThread, an array of eighteen pointer-sized slots are made available for the client application (SQL Server) to build upon.

What does this mean from the viewpoint of SQL Server? Well, even within the parts that don’t care about SQLOS and the underlying OS, code can express and find current thread-specific state – at a useful abstraction level – somewhere within those eighteen slots.

Worker LocalStorage vs thread-local storage

We often skirt around the distinction between a worker and a thread. This is a benign simplification, because instances of Worker and SystemThread classes are bound together in a 1:1 paired relationship during all times when SQL Server code is running.

The only time the distinction becomes particularly interesting is when we’re in fiber mode, because now a single SystemThread can promiscuously service multiple Workers in a round-robin fashion. I have documented some details in The Joy of Fiber Mode, but of special interest today is the treatment of these local storage slots.

These now become a more volatile part of thread context, and a SQLOS context switch (in this case, a fiber switch) must include copying the Worker’s LocalStorage payload into the SystemThread slots, because there is no guarantee that two different Workers will share the exact same context; in fact it is close to certain they won’t.

So clearly the context that matters to the Task at hand lives within the Worker, and SQLOS makes sure that this is the context loaded into a SystemThread while it runs; it’s just that the SQLOS scheduler has to intervene more in fiber mode, whereas it can let this bit of state freewheel in thread mode.

On to the implementation details then. Unlike the case with the SystemThread, a Worker’s local storage isn’t a chunk of memory within the class, but lives in a heap allocation, represented by an instance of the LocalStorage class embedded within the Worker.

Additionally, while the SystemThread’s slot count is baked into SQLOS, somebody clearly wanted a bit more abstraction in the Worker class, so the slot count becomes an attribute of a LocalStorage instance, which thus consists of a grand total of two members:

  1. The slot count, passed into the LocalStorage constructor
  2. A pointer to the actual chunk of memory living somewhere on the heap and allocated by the LocalStorage constructor, which is a thin wrapper around a standard page allocator

Show me some numbers, and don’t spare the horses

On to the fun bit, but you’re going to have to take my word on two things. Firstly, the SystemThread’s local storage slots are right at the start of the object. And secondly, the LocalStorage instance lives at an offset of 0x30 within a Worker, at least on the version I’m using here.

To prepare, I needed to capture the addresses of a bound SystemThread:Worker pair before breaking into the debugger, so I started a request running in session 53, executing nothing but a long WAITFOR – this should keep things static enough while I fumble around running a DMV query in another session:

Capturing the worker and thread addresses of our dummy request

So we’re off yet again to stage a break-in within Windbg. Having done this, everything is frozen in time, and I can poke around to my heart’s content while the business users wonder what just happened to their server. No seriously, you don’t want to do this on an instance that anyone else needs at the same time.

Here is the dump of the first 64 bytes of that Worker. As in Part 4, I’m dumping in pointer format, although only some of these entries are pointers or even necessarily 64 bits wide. In case it isn’t clear, the first column is the address we’re looking at, and the second column is the contents found at that address. The dps command sweetens the deal a bit: if that payload is an address corresponding to a known symbol (e.g. a the name of a function), that symbol will be displayed in a third column.

0:075> dps 0x000000003B656160 l0x8
00000000`3b656160  00000000`00000000
00000000`3b656168  00000000`00000000
00000000`3b656170  00000000`3d948170
00000000`3b656178  00000000`3a21a170
00000000`3b656180  00000000`00000001
00000000`3b656188  00000000`41080158
00000000`3b656190  00000000`00000012
00000000`3b656198  00000000`3b656c80

Those highlighted last two represent the LocalStorage instance, confirming that we do indeed have eighteen (=0x12) slots, and telling us the address where that slot array starts. Let’s see what payload we find there:

0:075> dps 00000000`3b656c80 l0x12
00000000`3b656c80  00000000`3875e040
00000000`3b656c88  00000000`3875e9a0
00000000`3b656c90  00000000`00000000
00000000`3b656c98  00000000`38772608
00000000`3b656ca0  00000000`38773440
00000000`3b656ca8  00000000`00000000
00000000`3b656cb0  00000000`00000000
00000000`3b656cb8  00000000`00000000
00000000`3b656cc0  00000000`00000000
00000000`3b656cc8  00000000`00000000
00000000`3b656cd0  00000000`3a8776a0
00000000`3b656cd8  00000000`00000000
00000000`3b656ce0  00000000`00000000
00000000`3b656ce8  00000000`00000000
00000000`3b656cf0  00000000`00000000
00000000`3b656cf8  00000000`00000000
00000000`3b656d00  00000000`00000000
00000000`3b656d08  00000000`00000000

It seems that only five out of the eighteen are in use. Oh well, that’s neither here nor there. Let’s compare this to the first eighteen quadwords at the bound SystemThread’s address we found in sys.dm_os_threads:

0:075> dps 0x00007FF653311508 l0x12
00007ff6`53311508  00000000`3875e040
00007ff6`53311510  00000000`3875e9a0
00007ff6`53311518  00000000`00000000
00007ff6`53311520  00000000`38772608
00007ff6`53311528  00000000`38773440
00007ff6`53311530  00000000`00000000
00007ff6`53311538  00000000`00000000
00007ff6`53311540  00000000`00000000
00007ff6`53311548  00000000`00000000
00007ff6`53311550  00000000`00000000
00007ff6`53311558  00000000`3a8776a0
00007ff6`53311560  00000000`00000000
00007ff6`53311568  00000000`00000000
00007ff6`53311570  00000000`00000000
00007ff6`53311578  00000000`00000000
00007ff6`53311580  00000000`00000000
00007ff6`53311588  00000000`00000000
00007ff6`53311590  00000000`00000000

This isn’t the same address as the Worker’s local storage array, but the contents is the same, which is consistent with my expectation. I’m highlighting that first entry, because we’ll be visiting it later.

Just out of interest, I’m also going to try and tie things back to memory page observations made in Part 4. Let’s peek at the top of the 8k page that the LocalStorage lives on. Its address is 0x3b656c80, which rounded down to the nearest 8k (=0x2000) gives us a page starting address of 0x3b656000.

0:075> dps 0x000000003B656000
00000000`3b656000  00010002`00000000
00000000`3b656008  00000000`3b656040
00000000`3b656010  00060003`00000001
00000000`3b656018  00000000`000012c0
00000000`3b656020  00000000`00000000
00000000`3b656028  00000000`00000000
00000000`3b656030  00000000`00000001
00000000`3b656038  00000000`00000110
00000000`3b656040  00007ffd`5f95c290 sqldk!CMemObj::`vftable'

The shape of that page header looks familiar. The second quadword is a pointer to the address 0x40 bytes into this very page. And just to hand it to us on a plate, the object starting there reveals itself by its vftable symbol to be a CMemObj.

In other words, this particular bit of LocalStorage is managed by a memory object which lives on its very page – obviously it would be wasteful if a memory object refused to share its page with some of the objects it allocated memory for. Also note that this is a plain CMemObj and not a CMemThread<CMemObj>, i.e. it isn’t natively thread-safe. This may simply confirm that local storage is not intended for sharing across threads; see Dorr for more.

Cutting to the chase

So far, this is all rather abstract, and we’re just seeing numbers which may or may not be pointers, pointing to heaven knows what. Let me finish off by illuminating one trail to something we can relate to.

Out of the five local storage slots which contain something, the first one here points to 00000000`3b656040. As it turns out, this is an instance of the CCompExecCtxtBasic class, and you’ll just have to take my word for it today. Anyway, we’re hunting it for its meat, rather than for its name. Have some:

0:075> dps 3875e040
00000000`3875e040  00000000`3875ff00
00000000`3875e048  00000000`387727f0
00000000`3875e050  00000000`3875e0c0
00000000`3875e058  00000000`38772470
00000000`3875e060  00000000`38773560
00000000`3875e068  00000000`3875f310
00000000`3875e070  00000000`38773370
00000000`3875e078  00000000`38773240
00000000`3875e080  00000000`3875e0b8
00000000`3875e088  00000000`3875e330
00000000`3875e090  00000000`38773440
00000000`3875e098  00000000`00000001
00000000`3875e0a0  00000000`3875f890
00000000`3875e0a8  00000000`00000001
00000000`3875e0b0  00000000`4777e110
00000000`3875e0b8  00000000`00000000

You may recognise by the range of memory addresses we’ve seen so far that most of these are likely themselves pointers. I’ll now lead you by the hand to the highlighted fourth member of CCompExecCtxtBasic, and demonstrate what that member points to:

0:075> dps 00000000`38772470 
00000000`38772470 00007ffd`61d575f8 sqllang!CSession::`vftable'
00000000`38772478 00000000`0000000c
00000000`38772480 00000000`00000035
00000000`38772488 00000000`00000000
00000000`38772490 00000000`00000000
00000000`38772498 00000000`00000000
00000000`387724a0 00000001`00000001
00000000`387724a8 00000001`00000009
00000000`387724b0 00000000`00000000
00000000`387724b8 00000000`00000000
00000000`387724c0 00000000`00000000
00000000`387724c8 00007ffd`61d17b20 sqllang!CSession::`vftable'
00000000`387724d0 00000000`38772040
00000000`387724d8 00000000`38772240
00000000`387724e0 00000000`38772a60
00000000`387724e8 00000000`3875f570

Bam! Given away by the vftable yet again – this is a CSession instance, in other words probably the session object associated with that running thread. As per Part 3, the presence of more than one vftable indicates that we may be dealing with multiple virtual inheritance in the CSession class.

We’ll leave the last word for the highlighted third line, containing the value 0x35 as payload.

What does 0x35 translate to in decimal? Would you believe it, it’s 53, the session_id of the session associated with the thread. If we were feeling bored, we could go and edit it to be another number, essentially tinkering with the identity of that session. But today is not that day.

Putting it all together in a diagram, here then is one trail by which an arbitrary chunk of SQL Server code can find the current session, given nothing more than the contents of the GS register:

Sure, your programming language and framework may very well abstract away this kind of thing. But that doesn’t mean you aren’t relying on it all the time.

Final thoughts

This exercise would qualify as a classic case of spelunking, where we have a general sense of what we’d expect to find, perhaps egged on by a few clues. Just as I’m writing this, I see that Niels Berglund has also been writing about spelunking in Windbg. So many angles, so much interesting stuff to uncover!

The key takeaway should be a reminder of how much one can do with thread-local storage, which forms the third and often forgotten member of this trio of places to put state:

  1. Globally accessible heap storage – this covers objects which are accessible everywhere in a process, including ones where lexical scope attempts to keep them private.
  2. Function parameters and local variables – these have limited lifespans and live in registers or on the stack, but remain private to an invocation of a function unless explicitly shared.
  3. Thread-local storage – this appears global in scope, but is restricted to a single thread, and each thread gets its own version of the “same” global object. This is a great place to squirrel away the kind of state we’d associate with a session, assuming we leverage a link between sessions and threads.

I hope that the theme of this series is starting to come together now. One or two more to go, and the next one will cover sessions and their ancillary context in a lot more breadth.

Context in perspective 3: Seven vftables for seven interfaces

This is the third part of a series about various places where “context” in some form or another crops up within SQL Server. The story so far:

  • What the CPU sees in you takes us down to a very low level, below such towering abstractions as object instances and threads. Not a prerequisite to make sense of today’s post.
  • A pretty stack frame is like a melody looks at the stack frame of a single invocation of a single function as a self-contained context within a string of other such contexts up the call chain. Although I’ll pick up examples I first brought up there, again you don’t need to have absorbed it to follow this one.

In this episode, I look at one aspect of polymorphic behaviour, whereby the address we use to refer to an object may change depending on what interface the called member function belongs to.

Come, walk with me

Today the world was just an address, a place for me to live in

In Part 2 my first and simplest example of a member function was the below. For a supplied security context token address in the rcx register, this returns the 32-bit “impersonation type”, which is a member variable, presumably of an enumeration type:

sqllang!CSecContextToken::GetImpersonationType:
===============================================
  mov   eax,dword ptr [rcx+1188h]
  ret

The emphasis back then was on the implicit stack context – while only the ret(urn) instruction explicitly references the stack, the function was written for an ecosystem where one can rely on certain stack conventions being followed.

Now I’ll look at it from another angle: how sure am I that it is indeed the CSecContextToken instance’s address that goes into that first rcx parameter? As a reminder, when calling a class instance method in an object-oriented language, the first parameter is a pointer to the instance, i.e. the address of the block of memory storing its state. However, high-level languages normally hide this parameter from you. What the programmer might think of as the method’s first parameter is actually the second, with the compiler inserting a This pointer as a preceding parameter. This is nothing more than syntactic sugar, e.g. translating the equivalent of the “parameterless” function call h = orderDate.HourPart() into a call of the form h = HourPart(&orderDate). In fact, in .Net, extension method syntax exposes the true nature of “instance” methods.

In a simple class, assuming that rcx is indeed the address of the instance would be a safe guess, but this isn’t a simple class.

Try this pair of GetData() overloads for size. Both take a single address as a parameter, and return a different address; being overloads, I’ll include the function addresses to disambiguate them. Firstly this one which returns an address 0x1448 bytes lower than the supplied address:

sqllang!CSecContextToken::GetData (00007ffb`ee810e40):
======================================================
  lea   rax, [rcx-1448h]
  ret

Clearly if rcx is the This pointer, the return value is something that lives before the CSecContextToken in memory. But the function is within the CSecContextToken class, so that doesn’t look very plausible. We’ll have to assume that rcx doesn’t in fact contain the address of the CSecContextToken instance: that address is somewhere on the West side of rcx. (Spoiler alert. The returned address is actually the address of the CSecContextToken.)

Now for the second overload which calls the first as a tail call after a sleight-of-hand adjustment to rcx:

sqllang!CSecContextToken::GetData (00007ffb`ee810e30):
======================================================
  movsxd  rax, dword ptr [rcx-4]
  sub     rcx, rax
  jmp     00007ffb`ee810e40

In English:

  1. For the supplied address, take the 32-bit value four bytes below it and sign-extend it to 64 bits. The use of movsxd rather than the zero-extend sibling movzxd tells us that this is a signed member variable, whether or not negative numbers will actually get used in practice.
  2. Deduct this number from the supplied address.
  3. Call the other overload with this adjusted rcx as its first parameter.

Since one calls the other, clearly the two overloads return the same value from the same object instance, but they do so from different starting point addresses:

  1. The first one deals with an rcx that is a fixed offset of 0x1448 bytes away from the address within the object it must return. This return value may be the address of the CSecContextToken, or it may itself be somewhere within it – there isn’t enough information supplied to be certain.
  2. The second one is a lot more complex. Its rcx parameter points somewhere within a structure which itself contains the offset from the address we need to pass to the first overload. In other words, while the first overload’s offset is known at compile time, the second one requires runtime information saved within the object instance to do that little calculation.

The nitty gritty of multiple inheritance

The broad explanation for all these different ways of accessing the same CSecContextToken is that it implements quite a number of interfaces due to multiple inheritance – seven of them to be precise – and each of them has its own virtual function table. I have previously looked at vftables in Indirection indigestion, virtual function calls and SQLOS, but now we have a more chunky context to chew on.

Let’s just back up that assertion by checking our public symbols in Windbg: x /n sqllang!CSecContextToken::`vftable’ yields the following:

00007ffb`f014e530 sqllang!CSecContextToken::`vftable' = 
00007ffb`eff985e0 sqllang!CSecContextToken::`vftable' = 
00007ffb`f014e568 sqllang!CSecContextToken::`vftable' = 
00007ffb`effd7ea0 sqllang!CSecContextToken::`vftable' = 
00007ffb`eff99080 sqllang!CSecContextToken::`vftable' = 
00007ffb`f014e550 sqllang!CSecContextToken::`vftable' = 
00007ffb`effd7ee0 sqllang!CSecContextToken::`vftable' = 

It turns out that it’s the last one which is of interest to us, because it is the only one which contains an entry for one of our GetData() method, namely the second one. Here is the evidence: dps 7ffb`effd7ee0:

00007ffb`effd7ee0  00007ffb`eead6e70 sqllang!CSecCtxtCacheEntry::GetKey
00007ffb`effd7ee8  00007ffb`ef4f7140 sqllang!CSecContextToken::Serialize
00007ffb`effd7ef0  00007ffb`ee810e30 sqllang!CSecContextToken::GetData

The first and simpler GetData() overload doesn’t show up in a vftable, but the second does. Oddly, the vftable for the second one lives at an offset of +0x1448 into the class instance – you’re going to have to trust me on this one. So the rcx passed into either variation will actually be the same one! But if the virtual version is called, it needs to find its position relative to +0x1448 dynamically, by doing a data lookup. We can confirm that by peeking at what is saved four bytes earlier at +0x1444, and that is indeed the value zero.

The setup of both vftables and dynamic offset lookups is done within the class constructor, with reference to other structures with the symbol of sqllang!CSecContextToken::`vbtable’. This seems to be the glue keeping compile-time and run-time structures in sync, with each vbtable containing the offsets needed to allow one vftable (interface) to reference another, in other words to be cast to it.

When it comes to virtual multiple inheritance and the implementation of multiple interfaces, the previously solid “this” pointer passed into member functions changes its colours. Either it adapts itself with a known compile-time offset or it does so by doing a runtime lookup.

To make an example of the vftable of the second case:

  • Clearly a CSecContextToken can be upcast to a CSecCtxtCacheEntry<CSecContextTokenKeySid,1,CPartitionedRefManager>: the vftable tells us it is one of its base types.
  • However, the act of casting actually adjusts the “this” pointer passed to the function – in this case by adding 0x1448 to it. Then the implementation of that function does further adjustments as needed so it can act upon the intended underlying object rather than on the interface itself.

Inheritance isn’t always this complicated. For single inheritance, even with virtual functions, an object still retains a single vftable, conventionally located at offset zero (in class constructors, you can see these progressively overwriting each other, revealing the hierarchy). But once multiple inheritance and multiple interfaces are involved, each one wants its own vftable, and the fun of static and dynamic address adjustment starts.

There’s a place for us

Long breath. I am not a C++ programmer, and my only prior exposure to multiple inheritance has been of the form “some languages make this possible, but it might fry your brain”. The implementation details sure are frying mine.

Then again, this is another illustration of the kind of thing high level languages do for us. If you can just focus on the right level of abstraction and trust the compiler to get the details right, you are riding a powerful beast. In a handful of cases, the benefit of the abstraction might be outweighed by the performance cost of moving too far from the metal (branch misprediction and memory stalls due to virtual function calls and the interposition of those “this”-adjusting functions), but most code of most people don’t live on that edge.

So without knowing where I was heading, I think SQL Server just taught me a few things about multiple inheritance and the context implications of coding with it. And it seems that those functions that do nothing more than adjust the incoming “this” pointer before delegating to an overload are in fact adjustor thunks. I read about them ages ago, but it went over my head, leaving only the catchy name, but Googling it now seems to make sense.

There is a moral in there. Somewhere. To find your place in the world, sometimes you just have to find your adjustor thunk.

Further reading

Inside the C++ object model by Stanley Lippman covers this kind of ground while staying in C++-rooted explanations. By the nature of the beast, I have been discovering Microsoft-specific implementation details, but now I feel ready to continue working my way through Lippman.

Fishing for wait types in WinDbg

Last night a #sqlhelp question from Monica Rathbun (@SQLEspresso) caught my eye:

@SQLEspresso Twitter question

Now some of us take way too much delight in worrying about obscure wait types, and since I’ve recently been in preemptive territory I thought I should take some degree of interest. Spoiler alert: I did nothing to solve Monica’s problem, but my attempt to figure out where the wait type might emanate from made me realise that this is worth a blog post.

Without getting hung up on the detail, here is a very crude and simple way to hunt for areas of SQL Server that may use a particular wait type. The only prerequisite is that you need to be willing and able to attach Windbg to SQL Server, and that you have public symbols loaded.

In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the pattern 000101d4, 000201d4, 000401d4 and so forth up to 800001d4. Ignore the meaning of when which bit is going to be set; with only sixteen permutations, it’s quick enough to try them all.

Let’s focus on sqllang as the likely source – the below would apply to any other module too.

Upon starting the debugger, the module load addresses are listed right away. You can also use the lm command at any time afterwards to list them again. In my case, I got this for sqllang:

ModLoad: 00007ffe`23870000 00007ffe`25ad7000   C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Binn\sqllang.dll

So we have a start and end memory address. Take note of the length in bytes, using Windbg as a calculator:

0:063> ? 07ffe`25ad7000 - 7ffe`23870000
Evaluate expression: 36073472 = 00000000`02267000

Update 2017/08/29:
I’m keeping the original version below as a permanent record of a schoolboy error. Searching for a four-byte pattern expressed as a doubleword can in fact bring up hits, BUT only ones which are doubleword aligned, i.e. starting on an address divisible by 4. The correct way to cast the net wide enough is to use the -b flag and searching for four consecutive bytes; this search doesn’t presuppose doubleword alignment. Remembering to byte-reverse the pattern, that first command should have been s -b 0x7ffe`23870000 L0x2267000 d4 01 01 00. I was just lucky to have caught a fish using the -d variation.

Great. Now we have everything we need. The s command searches for patterns in a range of memory, and we’ll use the -d flag to make it a doubleword search. First few tries come up empty:

0:063> s -d 0x7ffe`23870000 L0x2267000 000101d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000201d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000401d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000801d4

But now we get one:

0:063> s -d 0x7ffe`23870000 L0x2267000 001001d4
00007ffe`287d39f8  001001d4 8948f633 48602474 68247489  ....3.H.t$`H.t$h

Ignore everything other than the address at the start of the line – we’re not expecting the byte dump to make sense to the human eye. Let’s see what piece of code this belongs to – the uf disassembles the function that this piece of memory falls in.

0:063> uf 0x7ffe`287d39f8

I’m not even going to show you the output, because this one turned out to be a red herring – experience and/or intuition needed to confirm that. But let’s go on…

 
0:063> s -d 0x7ffe`23870000 L0x2267000 002001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 004001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 008001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 010001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 020001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 040001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 080001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 100001d4

Still nothing, but then we strike gold:

 
0:063> s -d 0x7ffe`23870000 L0x2267000 200001d4
00007ffe`246a3fe4  200001d4 244c8d48 32bee840 4890ff1d  ... H.L$@..2...H
00007ffe`246a6694  200001d4 244c8d48 0c0ee840 4890ff1d  ... H.L$@......H
00007ffe`246a8d44  200001d4 244c8d48 e55ee840 4890ff1c  ... H.L$@.^....H
00007ffe`246ac8a4  200001d4 244c8d48 a9fee840 4890ff1c  ... H.L$@......H
00007ffe`246b0404  200001d4 244c8d48 6e9ee840 4890ff1c  ... H.L$@..n...H
... and many more!

Try the uf trick again on the first one:

0:063> uf 0x7ffe`246a3fe4

And we get rewarded with a disassembly of the function sqllang!IWrapInterface<IAccessor>::Release – this one pretty much comes with flashing lights given that IAccessor reeks of COM and we were expecting something involving “RELEASE”. I’ll spare you the bulk of the assembly dump, but would like to highlight the bit that confirms the setup of a preemptive wait type:

00007ffe`246a3fe3 bad4010020      mov     edx,200001D4h
00007ffe`246a3fe8 488d4c2440      lea     rcx,[rsp+40h]
00007ffe`246a3fed e8be321dff      call    sqllang!AutoSwitchPreemptive::AutoSwitchPreemptive (00007ffe`238772b0)

That assignment to the edx register means that the encoded wait type is the second parameter to the AutoSwitchPreemptive constructor. And while it may not always be a recognisable setup, in this case I was already familiar with AutoSwitchPreemptive (see here).

Now this kind of trawling is by no means scientific. The wait type could have been loaded from a memory address, in which case it wouldn’t have been hard-coded in the function. And of course without the code running in context, it doesn’t tell you what kind of call stack it might show up in – only running the relevant code paths and catching the wait through a breakpoint or XEvent will do that. But as a quick and dirty way of hunting for wait type usage in a module up there on the marble slab? Hey, it works for me.