This is the third part of a series about various places where “context” in some form or another crops up within SQL Server. The story so far:
What the CPU sees in you takes us down to a very low level, below such towering abstractions as object instances and threads. Not a prerequisite to make sense of today’s post.
A pretty stack frame is like a melody looks at the stack frame of a single invocation of a single function as a self-contained context within a string of other such contexts up the call chain. Although I’ll pick up examples I first brought up there, again you don’t need to have absorbed it to follow this one.
In this episode, I look at one aspect of polymorphic behaviour, whereby the address we use to refer to an object may change depending on what interface the called member function belongs to.
Today the world was just an address, a place for me to live in
In Part 2 my first and simplest example of a member function was the below. For a supplied security context token address in the rcx register, this returns the 32-bit “impersonation type”, which is a member variable, presumably of an enumeration type:
sqllang!CSecContextToken::GetImpersonationType:
===============================================
mov eax,dword ptr [rcx+1188h]
ret
The emphasis back then was on the implicit stack context – while only the ret(urn) instruction explicitly references the stack, the function was written for an ecosystem where one can rely on certain stack conventions being followed.
Now I’ll look at it from another angle: how sure am I that it is indeed the CSecContextToken instance’s address that goes into that first rcx parameter? As a reminder, when calling a class instance method in an object-oriented language, the first parameter is a pointer to the instance, i.e. the address of the block of memory storing its state. However, high-level languages normally hide this parameter from you. What the programmer might think of as the method’s first parameter is actually the second, with the compiler inserting a This pointer as a preceding parameter. This is nothing more than syntactic sugar, e.g. translating the equivalent of the “parameterless” function call h = orderDate.HourPart() into a call of the form h = HourPart(&orderDate). In fact, in .Net, extension method syntax exposes the true nature of “instance” methods.
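To make that concrete, here is a minimal C++ sketch – the names come from the hypothetical example above, not from any real code – showing an instance method and its desugared equivalent side by side:

#include <cstdio>

struct OrderDate {
    int hour;
    int HourPart() { return hour; }          // "parameterless", but this is parameter 1
};

int HourPart_explicit(OrderDate* This) {     // the same function with the sugar removed
    return This->hour;
}

int main() {
    OrderDate orderDate{14};
    int h1 = orderDate.HourPart();           // h = orderDate.HourPart()
    int h2 = HourPart_explicit(&orderDate);  // h = HourPart(&orderDate)
    printf("%d %d\n", h1, h2);
}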
In a simple class, assuming that rcx is indeed the address of the instance would be a safe guess, but this isn’t a simple class.
Try this pair of GetData() overloads for size. Both take a single address as a parameter, and return a different address; being overloads, I’ll include the function addresses to disambiguate them. Firstly this one which returns an address 0x1448 bytes lower than the supplied address:
sqllang!CSecContextToken::GetData (00007ffb`ee810e40):
======================================================
lea rax, [rcx-1448h]
ret
Clearly if rcx is the This pointer, the return value is something that lives before the CSecContextToken in memory. But the function is within the CSecContextToken class, so that doesn’t look very plausible. We’ll have to assume that rcx doesn’t in fact contain the address of the CSecContextToken instance: that address is somewhere on the West side of rcx. (Spoiler alert. The returned address is actually the address of the CSecContextToken.)
Now for the second overload which calls the first as a tail call after a sleight-of-hand adjustment to rcx:
For the supplied address, take the 32-bit value four bytes below it and sign-extend it to 64 bits. The use of movsxd rather than a zero-extending move tells us that this is a signed member variable, whether or not negative numbers will actually get used in practice.
Deduct this number from the supplied address.
Call the other overload with this adjusted rcx as its first parameter.
Since one calls the other, clearly the two overloads return the same value from the same object instance, but they do so from different starting point addresses:
The first one deals with an rcx that is a fixed offset of 0x1448 bytes away from the address within the object it must return. This return value may be the address of the CSecContextToken, or it may itself be somewhere within it – there isn’t enough information supplied to be certain.
The second one is a lot more complex. Its rcx parameter points somewhere within a structure which itself contains the offset from the address we need to pass to the first overload. In other words, while the first overload’s offset is known at compile time, the second one requires runtime information saved within the object instance to do that little calculation.
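To illustrate the mechanics, here is a rough transliteration of the two shapes into C++. This is a hedged sketch of what the assembly does, not SQL Server’s actual source:

#include <cstdint>

// Shape of the first overload: a compile-time constant offset
// (lea rax, [rcx-1448h])
void* GetData_fixedOffset(char* rcx) {
    return rcx - 0x1448;
}

// Shape of the second overload: the offset is data, sign-extended from the
// doubleword just below the supplied address, before delegating to the first
// (movsxd rax, dword ptr [rcx-4] / sub rcx, rax / jmp first overload)
void* GetData_storedOffset(char* rcx) {
    int32_t displacement = *reinterpret_cast<int32_t*>(rcx - 4);
    return GetData_fixedOffset(rcx - displacement);
}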
The nitty gritty of multiple inheritance
The broad explanation for all these different ways of accessing the same CSecContextToken is that it implements quite a number of interfaces due to multiple inheritance – seven of them to be precise – and each of them has its own virtual function table. I have previously looked at vftables in Indirection indigestion, virtual function calls and SQLOS, but now we have a more chunky context to chew on.
Let’s just back up that assertion by checking our public symbols in Windbg: x /n sqllang!CSecContextToken::`vftable’ yields the following:
It turns out that it’s the last one which is of interest to us, because it is the only one which contains an entry for one of our GetData() methods, namely the second one. Here is the evidence: dps 7ffb`effd7ee0:
The first and simpler GetData() overload doesn’t show up in a vftable, but the second does. Oddly, the vftable for the second one lives at an offset of +0x1448 into the class instance – you’re going to have to trust me on this one. So the rcx passed into either variation will actually be the same one! But if the virtual version is called, it needs to find its position relative to +0x1448 dynamically, by doing a data lookup. We can confirm that by peeking at what is saved four bytes earlier at +0x1444, and that is indeed the value zero.
The setup of both vftables and dynamic offset lookups is done within the class constructor, with reference to other structures carrying the symbol sqllang!CSecContextToken::`vbtable’. This seems to be the glue keeping compile-time and run-time structures in sync, with each vbtable containing the offsets needed to allow one vftable (interface) to reference another, in other words to be cast to it.
When it comes to virtual multiple inheritance and the implementation of multiple interfaces, the previously solid “this” pointer passed into member functions changes its colours. Either it adapts itself with a known compile-time offset or it does so by doing a runtime lookup.
To make an example of the vftable of the second case:
Clearly a CSecContextToken can be upcast to a CSecCtxtCacheEntry<CSecContextTokenKeySid,1,CPartitionedRefManager>: the vftable tells us it is one of its base types.
However, the act of casting actually adjusts the “this” pointer passed to the function – in this case by adding 0x1448 to it. Then the implementation of that function does further adjustments as needed so it can act upon the intended underlying object rather than on the interface itself.
Inheritance isn’t always this complicated. For single inheritance, even with virtual functions, an object still retains a single vftable, conventionally located at offset zero (in class constructors, you can see these progressively overwriting each other, revealing the hierarchy). But once multiple inheritance and multiple interfaces are involved, each one wants its own vftable, and the fun of static and dynamic address adjustment starts.
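The compile-time flavour of this adjustment is easy to see in standalone C++ – the class names below are invented, but the address arithmetic is exactly the effect described above:

#include <cstdio>

struct IFirst  { virtual void f() = 0; };
struct ISecond { virtual void g() = 0; };

struct Both : IFirst, ISecond {
    void f() override {}
    void g() override {}
};

int main() {
    Both b;
    // Each base wants its own vftable, so the two interface views of the same
    // object live at different addresses: the upcast itself adjusts the pointer.
    printf("Both*    %p\n", static_cast<void*>(&b));
    printf("IFirst*  %p\n", static_cast<void*>(static_cast<IFirst*>(&b)));
    printf("ISecond* %p\n", static_cast<void*>(static_cast<ISecond*>(&b)));
}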
There’s a place for us
Long breath. I am not a C++ programmer, and my only prior exposure to multiple inheritance has been of the form “some languages make this possible, but it might fry your brain”. The implementation details sure are frying mine.
Then again, this is another illustration of the kind of thing high level languages do for us. If you can just focus on the right level of abstraction and trust the compiler to get the details right, you are riding a powerful beast. In a handful of cases, the benefit of the abstraction might be outweighed by the performance cost of moving too far from the metal (branch misprediction and memory stalls due to virtual function calls and the interposition of those “this”-adjusting functions), but most code written by most people doesn’t live on that edge.
So without knowing where I was heading, I think SQL Server just taught me a few things about multiple inheritance and the context implications of coding with it. And it seems that those functions that do nothing more than adjust the incoming “this” pointer before delegating to an overload are in fact adjustor thunks. I read about them ages ago, but it went over my head, leaving only the catchy name; Googling it now, it finally makes sense.
There is a moral in there. Somewhere. To find your place in the world, sometimes you just have to find your adjustor thunk.
Further reading
Inside the C++ object model by Stanley Lippman covers this kind of ground while staying in C++-rooted explanations. By the nature of the beast, I have been discovering Microsoft-specific implementation details, but now I feel ready to continue working my way through Lippman.
This is part 2 in a series exploring ways in which a chunk of running SQL Server code gains context, i.e. the mechanisms by which a bunch of instructions get to participate in something bigger and just possibly a bit unpredictable. You can find part 1 here. (Update: and now there is also part 3.)
Today’s instalment on the use of the stack may come across as rather obvious or perhaps “computer science-y”, so I’m going to try and sweeten the deal by using small but perfectly formed nuggets of complete SQL Server functions to illustrate some detail. Detail there will be aplenty, so feel free to skim-read just to get the aroma – that is probably what I would have done if I were in your aromatic shoes. Special thanks to Klaus Aschenbrenner (b1 | b2 | t) for recently filling in a bothersome gap in my knowledge by pointing me to this great resource.
Stacks revisited
Refresher time. In its strict sense, a stack is a LIFO data structure where all interaction occurs through push (insert item at the end) or pop (remove item from that same end) operations. By its nature, a stack grows and shrinks all the time while in use, and it can’t get internally fragmented by having chunks of unused space.
When we talk about “the stack” in Windows, we normally mean the user-mode stack of a specific thread. As functions call each other, the last thing done before transferring control to the callee is to push the address of the next instruction (the return address) on to the stack. Once the callee is finished, the act of returning involves popping that address into the instruction pointer, which transfers control back to the expected point within the caller. Note that this implies a strict requirement for symmetric management of the stack within each function: one push or pop too many, and the control flow goes haywire.
Additionally, the thread stack provides a convenient place to store local variables, within reasonable memory usage limits. The natural flow of stack allocation and cleanup is a perfect fit for variables whose lexical scope and lifetime are restricted to the function invocation.
Every invocation of a function gets its own stack frame. In its most rudimentary form, it consists of only the return address, but it can get arbitrarily big (within reason) as space is allocated for saved registers and local variables. Generally speaking, a function should only work with stack memory within its own stack frame, i.e. on its side of that return address, and anything beyond that has the smell of malicious interference. Special exceptions exist for access to stack-passed parameters and (in the Windows x64 convention) to the shadow space described later. Additionally, a pointer passed from a parent function to a child for an output parameter may very well refer to stack-allocated memory within the parent, and the child can end up legally writing into the stack frame of the parent – after all, as long as it follows good hygiene and doesn’t play buffer overrun games, it shouldn’t care where an output parameter points to.
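That last point is easy to demonstrate with a contrived few lines of C++ – the mechanism, if not the names, is exactly as described:

void child(int* out) {
    *out = 42;          // may well be scribbling within the parent's stack frame
}

void parent() {
    int local = 0;      // lives in parent's stack frame
    child(&local);      // perfectly legal write into the parent's frame
}

int main() { parent(); }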
Traditionally, stacks grow downward in memory, i.e. they start at a high address, with a push operation decrementing the stack pointer, and a pop incrementing it. I tend to visualise high memory addresses as being near the top of a printed page. However, when representing a call chain, standard current usage puts the high memory at the bottom of the page, i.e. the stack grows visually upward.
Whatever language you program in, chances are your compiler does the honours of sorting out the details and not letting you shoot yourself in the foot by corrupting the stack. So this isn’t something where you are normally exposed to the mechanics, unless you look at it from the low-level perspective of assembler code emitted by the compiler.
Calling conventions: the 32-bit vs 64-bit divide
Before getting into details of mechanism, I should point out that stack usage is an area where 64-bit Windows code looks entirely different from 32-bit equivalents. A lot of excellent older resources are 32-bit specific, but I restrict myself to the 64-bit world here.
Specifically, the 32-bit world had quite a number of competing calling conventions, notably cdecl, stdcall, and fastcall. 64-bit Windows specifies a single Application Binary Interface (ABI) based on the old fastcall. This favours parameter passing through registers and thus gets away with much less stack usage, which is a performance bonus, since register accesses are a lot faster than memory access, even disregarding cache misses. Be that as it may, it still makes sense to remember the central position of the stack in function calls.
Leaving aside floating point and SIMD registers, the convention is fairly simple:
Parameter 1 goes in the rcx register
Parameter 2 goes in rdx
Parameter 3 goes in r8
Parameter 4 goes in r9
Further parameters are passed on the stack from right to left (standard C convention, which allows for a variable number of parameters, since the first one will be in a known place)
The function’s return value goes in rax
Additionally, the caller of a function must leave 32 bytes free for the callee’s use, located just before (above) the return address. All functions get this scratchpad – known variously as “shadow space” or “parameter homing space” – for free, but only functions which call child functions need to worry about doing this little allocation. Think of it as a memory nest egg.
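Putting the register assignments and the shadow space together, here is an annotated sketch in C++ terms; the function and its parameters are made up for illustration:

void SevenParams(int a,    // rcx
                 int b,    // rdx
                 int c,    // r8
                 int d,    // r9
                 int e,    // stack, [rsp+28h] from the callee's viewpoint
                 int f,    // stack, [rsp+30h]
                 int g)    // stack, [rsp+38h]
{
    // The caller also reserved the 32 bytes of shadow space at the callee's
    // [rsp+8]..[rsp+20h]; an int return value would travel back in eax.
}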
Finally, while 32-bit code had the freedom to push and pop to their hearts’ content, such adjustment of the stack pointer is now only allowed in a function’s prologue and epilogue, i.e. the very first and very last bit of the emitted code. This is because of another very important convention change. In the 32-bit world, a function always saved the initial value of the stack pointer in the EBP register after pushing the old value of EBP, and this allowed for traversal of the call chain. Such traversal is needed for structured exception handling, as well as for diagnostic call chain dumps, such as the ones we sometimes get exposed to in SQL Server.
However, the mechanism for finding one’s way in the call stack now relies on metadata in the 64-bit world: if something goes wrong while the instruction pointer is at address 0x12345, it is possible for exception dispatching code to know what function we’re in from linker/loader metadata alone. This then allows an appropriate Catch block within the function to be called. And if the exception is uncaught and needs to bubble up? Well, since the stack allocation size of the function is part of that metadata, and the stack pointer doesn’t change during the function body, we can just add a constant to the current stack pointer to find the return address, hence deduce who the caller was, continuing the same game.
Note that all this describes a set of Windows conventions. However, this is still at a level where SQL Server on Linux plays by the Windows rules, up to a certain point within SQLPAL.
A word on Assembler examples
I picked four short SQL Server functions, in gently increasing level of complexity, to illustrate stack frames, finished with the prologue of a bigger function. When you dump functions in Windbg using the uf command, e.g. uf sqllang!CSecContextToken::GetImpersonationAttr, the output includes addresses as loaded, the hexadecimal representation of the instructions, and the human-readable Assembler, for instance:
To remove irrelevant noise, I have stripped my examples down to just the bare Assembler, removing address references except in one place where I substituted the target of a conditional jump with a simple symbol.
Then, to address the elephant in the room: Assembler is verbose but simple. There are only a handful of instructions that do most of the heavy lifting here:
mov dest, source is a simple assignment. Mentally read as “SET @dest = @source”. Square brackets around either operand indicates that we’re addressing a memory location, with the operand designating its address.
lea (load effective address) is essentially a more flexible version of mov, although the square brackets are deceptive: it never accesses memory.
push decrements the stack pointer (rsp) and then writes the operand at the memory location pointed to by the modified rsp.
pop copies the value at the memory location pointed to by rsp into the operand, and then increments rsp.
add operand, value and sub operand, value respectively add or subtract the supplied value to or from the operand.
jmp is an unconditional jump, loading the instruction pointer with the supplied address (note that symbols are just surrogates for addresses). je (jump if equal/zero) is one of a family of conditional jumps.
call does the same as jmp, except that the jump is preceded by first pushing the address of the next instruction (after the call) onto the stack.
Don’t sweat the details, I’ll explain as we go along.
Example 1: no stack manipulation in function
Here we have a function that looks like a straightforward property accessor. It takes one parameter in rcx, which the caller populated, and it is a reasonable assumption that it will be the “This” pointer for a class instance. Certainly it holds a memory address, because we dereference the (32-bit) doubleword 0x1188 bytes beyond it to come up with a return value.
sqllang!CSecContextToken::GetImpersonationType:
===============================================
mov eax,dword ptr [rcx+1188h]
ret
So what does the stack frame look like? Well, by definition the function call gave us the 32-byte shadow space, although we don’t end up using it, and this is followed by the return address. Within this function, nothing uses the stack except the act of returning.
Nice and simple. From the viewpoint of the rsp value, the bit of the stack we are allowed to take an interest in looks like this:
rsp + 0x20 – shadow space
rsp + 0x18 – shadow space
rsp + 0x10 – shadow space
rsp + 0x8 – shadow space
rsp – return address
How things got set up to be this way isn’t any concern of this function.
Example 2: child function as tail call
The next example features a child call to the standard C library memcpy function. However, since this is the very last thing the function does, we see a so-called tail call optimisation: there is no need to call this function for the sake of it returning to nothing more than a ret instruction. By directly jmping to it, we get the same functionality with fewer instructions and fewer memory accesses, and since we’re done with the shadow space, the child can inherit ours. The return address seen by memcpy() isn’t its caller, but its caller’s caller, and that is all fine and dandy.
This stack frame has the same structure as the previous example, although more non-stack parameters come into play in the child function call. Let’s explore them for laughs.
memcpy() has a well-documented signature, and we can simply look up the parameter list, all of which fit into the pass-by-register convention:
*destination goes in rcx. The assignment “mov rcx, rdx” tells us that this is in fact the second parameter passed to GetImpersonationAttr() – or the first explicitly passed parameter after an implicit *This – so we immediately know that GetImpersonationAttr() takes a pointer to the copy target as its parameter, i.e. an output parameter.
*source goes in rdx, and is supplied as the same 0x1188 offset into *This as the previous function, and now we know that the first four bytes of the impersonation attribute structure are in fact the impersonation type. Whatever that means 🙂
num goes into r8d (the lower 32 bits of r8), so clearly the impersonation attribute structure has a size of 0x38 bytes.
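Pulling those three inferences together, here is a speculative C++ reconstruction of what the source might look like. Every name, type and the void return below is a guess derived from the assembly, not SQL Server’s actual declaration:

#include <cstdint>
#include <cstring>

struct ImpersonationAttr {
    int32_t impersonationType;  // first four bytes, per GetImpersonationType
    char    rest[0x34];         // total size 0x38, per the value loaded into r8d
};

class CSecContextToken_sketch {
    char padding[0x1188];       // unknown members preceding the attribute
    ImpersonationAttr m_attr;   // lives at offset 0x1188
public:
    void GetImpersonationAttr(ImpersonationAttr* out) {
        memcpy(out, &m_attr, sizeof(ImpersonationAttr));  // emitted as the tail jmp
    }
};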
Fun fact about tail call optimisation. Textbook explanations of simple recursive functions often work on the premise that you’ll overflow the stack beyond a certain recursion depth. However, many practical recursive functions, including the hoary old factorial example, can have the recursion optimised away into tail calls. The functional programming folks get all frothy about this stuff.
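For the record, here is the hoary old example in C++, restructured with an accumulator so that the recursive call is in tail position and can be compiled into a jmp rather than a call:

unsigned long long factorial(unsigned n, unsigned long long acc = 1) {
    if (n <= 1) return acc;
    return factorial(n - 1, acc * n);  // tail call: nothing left to do upon return
}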
Example 3: stack parameters and child function
In this function, we come up against the need for local storage, meaning a stack allocation is required. This couldn’t be simpler: just pull down the Venetian blind of the stack pointer and scribble on the newly exposed section. Or more concretely, decrement the stack pointer by the amount you need.
This example also involves more than four incoming parameters, hence some will be passed on the stack. And the nature of the function is that it is really just a wrapper for another function, and must pass those same parameters through the stack for the child function. This then also brings the responsibility of pulling down the stack pointer by an extra 32 bytes to provide shadow space for the child function.
Here goes – I added line breaks around pairs of related assignment instructions.
Let’s get the register usage out of the way first: although four of them are invisible in the code, we can infer their presence through convention:
rcx is adjusted before being passed to the child. Assuming that these are both instance methods, this suggests that the CTokenAccessResultCache instance is 0x12A0 bytes into CSecContextToken.
rdx, r8 and r9 are passed straight through to the child function. We can’t tell from this code whether they use the full 64 bits or perhaps only 32, but this would be made more clear from examining the caller or the child.
rax is used as a scratch variable in indirect memory-to-memory copies – such copies can’t be done in a single CPU instruction. However, assuming the child function has a return value, it will be emitted in rax, and that will go straight back to our caller, i.e. the parent inherits the child’s return value.
No other general-purpose registers are touched, so we don’t need to save them.
Boring arithmetic time. What happens here is that seven parameters beyond the four passed in variables get copied from the caller’s stack frame to the child parameter area of our stack frame. The parent saved them at its rsp + 0x20 upwards; with the interposition of the return address, it is our rsp + 0x28 upwards, and then after adjusting rsp by 0x68 it becomes rsp + 0x90 upwards. We dutifully copy them to our rsp + 0x20 upwards, which the child function will again see as rsp + 0x28 upwards.
One interesting sight to call out is the rcx adjustment squeezed in between the “from” and “to” halves of the first parameter copy. This is not messy code, but a classic tiny compiler optimisation. Because two consecutive memory accesses lead to memory stalls, the compiler will attempt to weave non-memory-accessing instructions among these memory accesses where possible. Essentially we get that addition for free, because it can run during a clock cycle when the memory instruction is still stalling, even in cases where the literal expression of the source code logic suggests a different execution order.
A more interesting observation is that all this allocation and copying is just busywork. For whatever reason, the compiler didn’t or couldn’t apply a tail call optimisation here. If it did, the child function could just have inherited the parent’s parameters, apart from the rcx adjustment, yielding this three-liner:
sqllang!CSecContextToken::FGetAccessResult:
===========================================
add rcx,12A0h
jmp sqllang!CTokenAccessResultCache::FGetAccessResult
ret
Example 4: two child functions and saved registers
This doesn’t add much complexity to the previous one, and is a more straightforward read, although it introduces a conditional branch for fun. Hopefully my ASCII art makes it clear that control flow after the je (jump if equal, or in this case the synonym jump if zero), can either continue or skip the second child function call.
sqllang!CSecContextToken::`vector deleting destructor':
=======================================================
mov qword ptr [rsp+8],rbx
push rdi
sub rsp,20h
mov ebx,edx
mov rdi,rcx
call sqllang!CSecContextToken::~CSecContextToken
test bl,1
+-+- je LabelX
| |
| | { if low bit of original edx was set
| -> mov rcx,rdi
| call sqllang!commondelete
| }
|
LabelX:
+--> mov rax,rdi
mov rbx,qword ptr [rsp+30h]
add rsp,20h
pop rdi
ret
The rdi register is used as a scratchpad to save the incoming rcx value for later – first it is used to restore the possibly clobbered rcx value for the second child function call, and then it becomes the return value going into rax. However, since we don’t want to clobber its value for our caller, we first save its old value and then, at the very end, restore it. This is done through a traditional push/pop pair, which in x64 is only allowed in the function prologue and epilogue.
Similarly, ebx (the bottom half of rbx) is used to save the incoming value of the second parameter in edx. However, here we don’t use a push/pop pair, but instead save it in a slot in the shadow space. Why two different mechanisms for two different registers? Sorry, that’s one for the compiler writers to answer.
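As a control-flow summary, here is a hedged C++ rendition of the listing’s logic; Token and commondelete are stand-ins for the real CSecContextToken and sqllang!commondelete:

#include <new>

struct Token { ~Token() {} };

void commondelete(void* p) { ::operator delete(p); }  // stand-in for sqllang!commondelete

Token* vector_deleting_dtor(Token* thisPtr, int flags) {
    thisPtr->~Token();           // call sqllang!CSecContextToken::~CSecContextToken
    if (flags & 1)               // test bl,1 / je LabelX
        commondelete(thisPtr);   // conditionally skipped second child call
    return thisPtr;              // mov rax,rdi supplies the return value
}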
Example 5: The Full Monty
This example is from a much meatier function – somewhere north of 200 lines – so I’m just going to present the prologue and the very beginning of the function proper. For the sake of readability, I’m again adding a few line breaks around related chunks of instructions.
This has a lot more going on, so to save our collective sanity I have prepared a table of stack offsets relative to rsp at function entry, rbp, and the adjusted rsp. Note that the Local Variables section covers a lot of real estate, but I only listed the first few slots. This picture should drive home the notion of why we talk about a stack “frame”: it is a chunk of memory serving as the frame for a single invocation of a function, and is itself the framing context for subservient sections.
For unknown reasons, rbp comes into play here. This is odd, because its value relative to rsp is constant, and all rbp-relative memory references could have been expressed just as easily by making them rsp-relative. Even more mysteriously, the compiler decides that an unaligned (not divisible by 8) offset is just the ticket. Having said that, it doesn’t change anything from a functional viewpoint; it just means that the arithmetic looks more weird.
I’m not going to do this one line by line, because you can translate things fairly easily using the table as a crutch. If I leave you with the general feeling that those translated angry bracket numbers can be coaxed into something sensible, I have achieved a lot.
So let’s just wrap up with a few highlights:
The local variable at [rbp-19h] receives what appears to be a very large number. Or if you consider it signed (see here for more signed integer gossip), it’s just the value -2. The presence of this variable is very common in the SQL Server code base, and presumably is part of some convention.
The value of rbx is again saved in the shadow area, while the parameter-carrying registers r9 and rdx get saved lower down in the stack among the rest of the local variables. rcx gets backed up in a register of its own (r13), and rdx goes to r15 in addition to that stack backup. Why both? Again, a compiler quirk.
The value written at offset 0x17 from rbp is a stack cookie, used as a safety mechanism to detect stack corruption. By its nature, it is an unpredictable value, and when its value is re-checked at the end of the function, detected corruption can be dealt with as an exception.
The assignment of a vftable address to a memory location in the local variable area (line 27), followed by a bunch of further zero-value initialisations of subsequent addresses, is a dead giveaway that we’re constructing a temporary instance of a class with virtual functions, in this case a CShowplanString.
Conclusion and further reading
Gosh, that sure was more detail than I was expecting. And welcome to the end of the post.
I’ll let Gustavo Duarte have the last word again. Thrice.
Journey to the stack is an excellent coverage of the stack in general. Just be aware that it is 32-bit focused.
Waitress: Hello, I’m Diana, I’m your waitress for tonight… Where are you from? Mr and Mrs Hendy: We’re from Room 259. Mr Hendy: Where are you from? Waitress: [pointing to kitchen] Oh I’m from the doors over there…
(Monty Python, “The Meaning of Life”)
When code runs, there is always an implied context. Depending on what level of abstraction you’re thinking at, there are endless angles from which to consider the canvas upon which we paint executing code.
Some examples, roughly in increasing level of abstraction:
What machine is it running on, and what is the global hardware setup for things like memory size and cache configuration?
Within this machine, what CPU is currently executing the code in question?
Within the currently executing function, what state was passed to it through parameters and global variables, and what would a point-in-time snapshot of the function invocation’s current state look like?
Under what credentials is the current thread executing that function, and what rights are associated with those credentials?
What is the bigger technical task being performed, from what source site was this ultimately invoked, and what application-level credentials are modeled for this task?
What business problem is being solved?
In this short series of blog posts, I’m going to cherry-pick a few technical subjects in this line of thinking, conveniently sticking with ones where I actually have something to contribute, because that way the page won’t be blank. And this will of course happen in the (ahem) context of SQL Server.
What is the point-in-time context of a running CPU?
The x64 Intel CPUs we know and love have a state which can broadly be defined as the current values of a set of user-visible registers, each of which is nothing more than a global variable that is only visible to that CPU. Ignoring floating-point and SIMD ones, this leaves us with a handful:
RIP, the instruction pointer, which points to the address of the current instruction (Thanks Klaus for finding an embarrassing typo there!). Normally we only interact with this through its natural incrementing behaviour, or by causing it to leap around through jump/call instructions (planned control flow) or a hardware interrupt (out-of-band interventions).
RSP, the stack pointer. This is also automatically managed through stepwise changes as we do stack-based operations, but we can and do interact with it directly at times.
RAX, RBX, RCX, RDX, RSI, RDI, RBP, and R8 through R15 – these are general-purpose registers that can be used however we like. Some of them have associations with specific instructions, e.g. RDI and RSI for memory copying, but they remain available for general-purpose use. However, beyond that, strong conventions keep their use sane and predictable. See Register Usage on MSDN for a sense of how Windows defines such conventions.
The segment registers CS, DS, ES, FS, GS and SS. These allow a second level of abstraction for memory address translation, although in modern usage we have flat address spaces, and they can mostly be ignored. The big exception here is GS, which both Windows and Linux use to point to CPU- or thread-local structures, a usage which is explicitly supported by Intel’s SWAPGS instruction. However, I’m getting ahead of myself here, because this occurs at a much higher level of abstraction.
Context switching in its most basic form involves nothing more than saving a snapshot of these registers and swapping in other values saved previously. Broadly speaking, this is what the Windows CONTEXT structure is all about. By its nature, this is processor architecture-specific.
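If you want to poke at such a snapshot yourself, Windows lets user-mode code capture one for its own thread. A minimal sketch in C++, assuming the Windows SDK:

#include <windows.h>
#include <cstdio>

int main() {
    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_FULL;
    RtlCaptureContext(&ctx);     // snapshot this thread's own registers
    printf("rip=%llx rsp=%llx rax=%llx\n", ctx.Rip, ctx.Rsp, ctx.Rax);
}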
One interesting thing comes up when you consider how tricky it is to talk about the point-in-time state of a pipelined CPU, since it could be executing multiple instructions at the same time. The answer here is one that will have a familiar ring to database folks: although the incoming stream of instructions is expressed linearly, that clever CPU not only knows how to parallelise sections of them, but it can treat such groups as notionally transactional. In database-friendly terms, only the right stuff commits, even in the face of speculative execution.
This ends up as a battle of wits between the CPU and compiler designers. Any sufficiently clever optimising compiler will reorder instructions in a way which lubricates the axles of instruction-level parallelism, and any sufficiently clever CPU will internally reorder things anyway, irrespective of what the compiler emitted. But fortunately for our sanity, we can still think of the CPU’s PacMan-like progress through those delicious instructions as happening in a single serial stream.
A CPU asks “Who am I?”
It shouldn’t come as a surprise that a single CPU has precious little awareness of its surroundings. In reading and writing memory, it may experience stalls caused by contention with other CPUs, but it has no means – or indeed need – to get philosophical about it.
Our first stopping point is a dive into the very simple Windows API call GetCurrentProcessorNumber(). This does pretty much what it says, but its workings highlight how this isn’t a hardware or firmware intrinsic, but instead something cooked up by the operating system.
Before we get to the implementation, consider how even such a simple thing can twist your brain a bit:
Who is asking the question? Candidate answer: “The thread, executing the code containing the question on the processor which has to answer it.”
Because threads can be switched between processors, the answer may cease to be correct the moment it is received. In fact, it can even become incorrect within GetCurrentProcessorNumber(), before it returns with the wrong answer.
So here in all its three-line glory, is the disassembly of the function from my Windows 8.1 system:
mov eax, 53h
lsl eax, eax
shr eax, 0Eh
ret
This uses the unusual incantation lsl (load segment limit), which dereferences and decodes an entry in the Global Descriptor Table, returning a segment limit for entry 0x53, itself a magic number that is subject to change between Windows versions. Compared to normal assembly code, this is esoteric stuff – for our purposes we only need to know that the Windows kernel puts a different value in here for each processor as it sets it up during the boot process. And it abuses the segment limit bit field by repurposing it, smuggling both the processor number and the kernel group number into it: the processor number is the higher-order chunk here. (If this kind of thing makes your toes curl in a good way, you can actually see the setup being done in systembg.asm in the Windows Research Kernel. Some Googling required.)
This sets the tone for this exploration of context. At any given level, we find that something at a lower level stuffed nuggets of information in a safe – ideally read-only – location for us to read. I should emphasise in this example that even though GetCurrentProcessorNumber() is an OS function, it isn’t a system call requiring an expensive kernel transition. If we wrote our own copy of it within our own DLL, it would be rude in terms of breaking abstraction, but it would have just as much of a right to read that GDT entry as the Windows-supplied function does.
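Just to prove the point, here is what such a rude copy might look like in C++, using MSVC’s __segmentlimit intrinsic to wrap the lsl instruction. The 0x53 selector and the 14-bit shift are the version-specific magic from the disassembly above, so treat this as illustration rather than something to ship:

#include <windows.h>
#include <intrin.h>

DWORD MyCurrentProcessorNumber() {
    // lsl eax, 0x53 / shr eax, 0Eh – the processor number is the high-order chunk
    return static_cast<DWORD>(__segmentlimit(0x53) >> 14);
}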
Let’s visit the kernel in disguise
It’s unavoidable that we would occasionally need to make a system call, and here we encounter another way identity is turned sideways.
Problem statement: No matter how neatly a system call is wrapped up, it is still just a function taking parameters, and any arbitrary code can invoke any system call. This is a Bad Thing from the viewpoint of enforcing restrictions on who can execute what. How does the kernel know whether it ought to fulfil your request to perform a dangerous function if it can’t be sure who you are? Surely it can’t trust your own declaration that you have the authority?
Clearly a trusting kernel is a dead kernel. Here is where we pay another visit to ambient identity. Previously we looked at thread-local storage, where the thread-specific pointer to its user-mode Thread Environment Block is always accessible through the GS register. Now the issue is slightly different: without putting any trust in the content of the TEB, which can be trivially edited by that nasty user-mode code, the kernel needs to have a sense of who is calling into it.
The answer lies yet again in a “secret” storage compartment, in this case one not even exposed to user mode code. Beyond the normal CPU registers I mentioned above, there is a collection of so-called model-specific registers. These are the ones that support lower-level functions like virtual address translation, and even if complete garbage is passed as parameters to a system call, the kernel can find its feet and respond appropriately, e.g. by returning to the caller with a stern error message or even shutting down the offending process entirely.
And here’s the flip side of the coin. In user mode, the locus of identity is a thread, which carries credentials and thread-local storage (for the sake of the user-mode code) and implies a process (for sandbox enforcement by kernel code). In kernel mode though, we cross over into CPU-centric thinking. This is exemplified by what the constant beacon of the GS register gets set to by Windows: in user mode it points to the current thread’s Thread Environment Block, but in kernel mode it changes to point to the current processor’s Processor Control Region, and a similar situation applies in Linux.
Per-processor partitioning of certain thread management functions makes perfect sense, since we’d aim to minimise the amount of global state. Thus each processor would have its own dispatcher state, its own timer list… And hang on, this is familiar territory we know from SQLOS! The only difference is that SQLOS operates on the premise of virtualising a CPU in the form of a Scheduler, whereas the OS kernel deals with physical CPUs, or at least what it honestly believes to be physical CPUs even in the face of CPU virtualisation.
Without even looking at the read-only state passed over to user mode, once a thread calls into the kernel, the kernel can be absolutely sure what that thread is, by virtue of this CPU-centric thinking. “I last scheduled thread 123, and something just called into the kernel from user mode. Ergo, I’m dealing with thread 123.”
We’ll be seeing a few variations on this theme. Whenever thread state (and by extension, session or process state) needs to be protected from corruption, at minimum we need some way of associating a non-overwritable token with that thread, and then saving the state somewhere where the thread can’t get at it except through safe interfaces. For an OS kernel, hardware protection takes care of drawing a line between untrusted code and the kernel. And as we’ll see later, within SQL Server the nature of the interface (T-SQL batch requests) is such that arbitrary code can’t be injected into the application’s process space, and the interface doesn’t allow for uncontrolled privilege escalation.
And all it takes is the ability to squirrel away a single secret.
Gossip hour
In researching this, I came across GetCurrentProcessorNumber() because it is called within a Hekaton synchronisation method that partitions state by CPU. That is itself interesting, since SQLOS tends to encourage partitioning by SQLOS scheduler. A very simple reading would be that this is a symptom of the Hekaton development team having run with the brief to minimise their dependence on existing layers within SQL Server. This is supported by the observation that Hekaton seems to bypass the local storage layer provided within SQLOS workers on top of thread-local storage, directly assigning itself TLS slots from the OS.
In fairness (at least to answer the first point), GetCurrentProcessorNumber() was only added in recent Windows versions, and core SQLOS was developed before that existed. But it is easy to project one’s own experiences of Not Invented Here Syndrome onto others.
So back to “I’m from those doors over there”… In sys.dm_os_threads, we find the column instruction_address, purporting to be the address of the instruction currently executing. Now for a suspended thread, this is a sensible thing to wonder about, but once a thread is running, no outside agent, for instance a DMV-supporting function running on another CPU, has a hope of getting a valid answer. This is documented behaviour for the Windows function GetThreadContext(): “You cannot get a valid context for a running thread”. Then again, any non-running thread will have an instruction address pointing to a SQLOS synchronisation function, which isn’t really interesting in itself without a full stack trace. That leaves the edge case of what value you get for the actual thread which is running sys.dm_os_threads. And the answer is that you get the starting address of sqldk!SOS_OS::GetThreadControlRegisters, the function that wraps GetThreadContext(). Turns out that someone put a special case in there to return that hard-coded magic value, rather than the thread attempting to query itself, which I rather like to think of as an Easter egg. The doors over there indeed.
Part 2 will consist of a look into stack frames. See you there!
Further reading
CPU Rings, Privilege, and Protection by Gustavo Duarte. If you want both technical depth and an easy read, you’d be hard pressed to improve on Gustavo’s extraordinary talent.
Inside KiSystemService by shift32. Although it is written from a 32-bit viewpoint, it goes very deeply into the actual system call mechanism, including how trap frames are set up.
This Alex Ionescu post giving some more technical insight into the (ab)use of segment selectors. If it makes the rest of us feel any better, here we see the co-author of Windows Internals admitting that he, too, had to look up the LSL instruction.
Until an hour or two ago, I had written off the idea of contributing to this month’s T-SQL Tuesday, hosted by Matt Gordon with the theme of “Fixing Old Problems with Shiny New Toys”. I’ve lately had a bee in my bonnet about a new blog series, but the introduction keeps receding into the distance as I continue to encounter the dreaded “further research needed” syndrome. (Update: And here is part 1!)
Now it is 8pm UTC on Valentine’s day with four hours to go before the T-SQL Tuesday deadline, and I’m thinking about Sessions. Which is a good thing, because as of today we can all stop worrying about Flynn.
SESSION_CONTEXT() brings two major innovations. Firstly, it replaces the 128-byte scalar payload of CONTEXT_INFO() with a key-value structure that can accommodate 256kB of data. You can really go to town filling this thing up.
The second change is less glamorous, but possibly more significant: it is possible to set an entry to read-only, meaning that it can safely be used for the kind of contextual payload you don’t want tampered with. This makes me happy, not because I currently have a great need for it, but because it neatly ties in with things I have been thinking about a lot lately.
The rise of the kernel
Something that comes up time and again in multi-layered architectures is the Inner Platform Effect. Just when is it justified to use a programming framework to recreate a function that said framework already fulfils?
An OS kernel is sacred ground. When designed sanely and safely, it doesn’t allow clients (application code) to execute arbitrary code in kernel context. This is because the deeper kernel layer has privileges that could be abused, and user code must be kept at arm’s length within user processes, sandboxed in such a way that they can attack neither the kernel nor each other. Quite simply, this provides the means to enforce the principle of least privilege.
This separation is enforced on the hardware level through things like virtual memory mapping, whereby different processes can’t see each other’s memory. And while the remapping of memory (a simple attack vector) is just another software function exposed by the CPU, the ability to modify these mappings is reserved for a higher privilege level than common application code.
On top of this, we build the notion of threads, each having a distinct identity. This is a far more hazy concept than memory mapping, in that the CPU provides the barest minimum of support functionality to support the illusion. Switching between threads may involve changing memory mappings (when the outgoing and incoming threads belong to different processes), but it always includes changing a tiny bit of thread identity which user code might be able to read, but can’t overwrite. Since user identity, and hence permissions checking, is tied into thread identity, this makes perfect sense. A thread which is allowed to muck about with its own identity, or the identity of other threads, is a security risk.
Now we go and build a multithreaded server application like SQL Server on top of these abstractions. The code which is trusted to have the run of its process space is the code that shipped with the executable, assuming we temporarily blank out the terror of extended stored procedures. This code in turn maintains a cosy environment for user-supplied code in the form of T-SQL, which plays in a memory space consisting of access-controlled global objects (tables), plus some session-scoped objects (temp tables) and batch-scoped ones (variables).
In simple textbook cases, it stops there. Ahmed is restricted to audited querying of North-East region sales data, while Ivanka gets to be security admin.
The database is the application’s kernel
That simple textbook case falls flat when you move permissions and identity out to the application layer, with all application/database users getting represented by a single database user. All understandable, especially in web apps, but now the burden is upon application code, whether inside or outside the database, to find ways of continually answering that thorny “Who Am I?” question.
Outside the database (the web app) this is a solved problem, but externally defined identity and permission sets aren’t accessible to stored procedures and triggers, unless the user identity is pushed through as a parameter. And clearly this is something that can be faked, which brings us neatly back to the potential need for secure session-scoped metadata that is non-editable after being set up.
Think of these stored procedures and triggers as kernel code. We need only the tiniest smidgen of an identifier that represents a trusted identity token. For actual OS kernel code, this can be (and is!) reduced to a single internal CPU register. Anything beyond that can be derived by allocating storage and passing payload through bit by bit, as long as all the communication is over this trusted connection with its identifying metadata token.
The context of context
This frames where I hope to be heading with my next set of blog posts. You already know the punch line: in some form or another, we are always reliant on thread-local storage. It’s just a question of how many extra layers get piled on top of that basic thread abstraction until we get to a SQL Server session.
And then, just when you think you have a good abstraction, along comes a programming pattern that strips the session back to an anonymous connection, and uses SESSION_CONTEXT() to build something new on top. Be that as it may, session context is a great user-visible touch point for some juicy internals!
Occasionally we do as we are told. And when Brent Ozar tells us that we should celebrate the first T-SQL Tuesday of 2017 by writing about SQL Server bugs and enhancement requests, the only appropriate response is “What color, sir?”
In many ways, it seems that the more recent SQLOSv2/SQLPAL work is a simple case of continuing with a project that has been organically evolving since the SQL Server 7 User Mode Scheduler, and rooted in classic Stonebraker: just how far can we assume control of core OS functions within an RDBMS process?
The waiter list maintains a list of workers waiting on a resource. When a UMS worker requests a resource owned by another worker, it puts itself on the waiter list for the resource and enters an infinite wait state for its associated event object. When the worker that owns the resource is ready to release it, it is responsible for scanning the list of workers waiting on the resource and moving them to the runnable list, as appropriate. And when it hits a yield point, it is responsible for setting the event of the first worker on the runnable list so that the worker can run. This means that when a worker frees up a resource, it may well undertake the entirety of the task of moving those workers that were waiting on the resource from the waiter list to the runnable list and signaling one of them to run.
The lists behind the legend
I have not gone as far as opening up my rusty copy of SQL Server 2000 to see how Ken’s description fits in there, but I am now pretty certain that the above quote has transmuted over the years into a common misunderstanding about SQLOS scheduling mechanics.
Now nothing Ken said is untrue or particularly out of date. It is just that we often hear “the waiter list” (by implication handling resource waits) described as an attribute of a scheduler, which is not the case.
Let’s revisit when the scheduler code runs, and what it does:
A worker will yield, either because it needs to wait for a resource, or because it is eaten up with guilt over reaching the end of its allotted quantum.
The act of yielding means that scheduler code (methods on the SOS_Scheduler class) gets invoked.
After a bit of housekeeping for the common good of all workers sharing the scheduler, control is transferred back to a worker to do its thing – this may even be the same worker who originally yielded.
The housekeeping consists of checking for aborted tasks, processing pending I/Os, and checking for I/O completions and timer list timeouts.
The single most important list that a scheduler owns is the collection of runnable workers, that is, the subset of workers belonging to this scheduler who are not waiting for anything other than CPU. This has variously been described as a list and a queue; I shall be using the term “runnable queue” by convention, but be aware that it is a data structure that has changed over the years and isn’t a simple queue.
A scheduler has one piece of “creative” interaction with this runnable queue, and it comes with only two variables:
When a context switch is requested by an outgoing worker owning the scheduler, the scheduler code has to choose which one of potentially multiple workers is going to be its next owner.
The incoming worker gets given a quantum expiry date, by which time it is expected to yield.
Core scheduler code running during context switching only dequeues runnable workers, and at such moments a given scheduler only looks at its own runnable queue. In contrast, code running all over the place, including in the context of workers belonging to other schedulers, may enqueue workers on to the runnable queue.
Time for a simple diagram:
Someone to watch over me
What I’m trying to get across here is that each instance of a waitable resource has its own wait list, and the scheduler has no interest in this, because a scheduler only acts upon its runnable queue. Seen from a different angle, once a worker is waiting on a resource, its scheduler doesn’t care, because it can’t and won’t manage the waiting logic of something like a latch. This splits the responsibilities neatly in two:
The synchronisation class guarding a resource (which inevitably will be built upon an EventInternal) stands watch over all the workers queueing up to have a ride on that resource. The act of granting access to a worker involves moving the worker from the wait list and getting it on to the runnable queue of that worker’s scheduler, and this is achieved by the synchronisation class.
The scheduler, in turn, doesn’t decide who is runnable, but it does get to pick which of the runnable workers (however they reached that state) runs next.
The I/O and timer lists
There are however two cases where the scheduler decides to make a worker runnable in the normal course of events. One is when a worker was waiting on I/O to complete, where periodic scheduler housekeeping is the mechanism by which SQLOS takes note of the I/O completion. At this point some workers who were on the I/O list may find themselves moved to the runnable queue just before the next worker is picked to be granted ownership of the scheduler – the lucky winner might be one of these workers, or it may be someone else who has been runnable for a while.
The second, and actually more interesting case, is the timer list. In its simplest use case, this is where you will find workers executing T-SQL WAITFOR statements. The list is neatly ordered by timer expiry date, and at each invocation of the scheduler context-switch housekeeping, workers whose timer expiry dates have now passed will be moved to the runnable queue.
What makes a timer list particularly interesting though, is when it implements a resource wait timeout, for instance a lock timeout. In this scenario we actually have a worker waiting on two things simultaneously: a resource and a timer. If the resource is acquired before the timer expires, all is good: the worker goes on to the runnable queue, and upon being woken up it finds a thumbs-up as the return value of its resource acquisition call.
However, should the timer expire before the resource has been acquired, the scheduler will actually venture forth and take the worker off that waiter list before making it runnable and setting an error return value as wake-up call. Think of it as every teenager’s worst nightmare: you’re not home by curfew, so Mom comes to your dodgy party to drag your sorry ass home. And then you wake up with a hangover and note stuck to your forehead reading “No cake for you”.
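The dual wait has a rough analogue in everyday C++, where a timed wait similarly reports back whether you acquired the resource or merely woke up with the note on your forehead. A sketch of the pattern, emphatically not SQLOS code:

#include <chrono>
#include <condition_variable>
#include <mutex>

// Returns true if the resource came free in time, false if the timer won.
bool acquire_with_timeout(std::condition_variable& cv, std::mutex& m,
                          bool& resource_free, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m);
    return cv.wait_for(lock, timeout, [&] { return resource_free; });
}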
Whither next?
I tried to keep this comparatively high-level, but might take a nice little detour into the WorkerTimerRequest some day if time permits.
There you have it. Be home on time and have a thread-safe festive season.
Last night a #sqlhelp question from Monica Rathbun (@SQLEspresso) caught my eye:
Now some of us take way too much delight in worrying about obscure wait types, and since I’ve recently been in preemptive territory I thought I should take some degree of interest. Spoiler alert: I did nothing to solve Monica’s problem, but my attempt to figure out where the wait type might emanate from made me realise that this is worth a blog post.
Without getting hung up on the detail, here is a very crude and simple way to hunt for areas of SQL Server that may use a particular wait type. The only prerequisite is that you need to be willing and able to attach Windbg to SQL Server, and that you have public symbols loaded.
In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the pattern 000101d4, 000201d4, 000401d4 and so forth up to 800001d4. Ignore the meaning of when which bit is going to be set; with only sixteen permutations, it’s quick enough to try them all.
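If you’d rather not do that arithmetic by hand, a throwaway C++ loop spits out all sixteen candidates (with 0x01d4 hard-coded for this example):

#include <cstdio>

int main() {
    // Low word is the map value; exactly one bit set in the high-order word.
    for (int bit = 16; bit < 32; ++bit)
        printf("%08x\n", (1u << bit) | 0x01d4);
}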
Let’s focus on sqllang as the likely source – the below would apply to any other module too.
Upon starting the debugger, the module load addresses are listed right away. You can also use the lm command at any time afterwards to list them again. In my case, I got this for sqllang:
Update 2017/08/29:
I’m keeping the original version below as a permanent record of a schoolboy error. Searching for a four-byte pattern expressed as a doubleword can in fact bring up hits, BUT only ones which are doubleword aligned, i.e. starting on an address divisible by 4. The correct way to cast the net wide enough is to use the -b flag and searching for four consecutive bytes; this search doesn’t presuppose doubleword alignment. Remembering to byte-reverse the pattern, that first command should have been s -b 0x7ffe`23870000 L0x2267000 d4 01 01 00. I was just lucky to have caught a fish using the -d variation.
Great. Now we have everything we need. The s command searches for patterns in a range of memory, and we’ll use the -d flag to make it a doubleword search. First few tries come up empty:
0:063> s -d 0x7ffe`23870000 L0x2267000 000101d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000201d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000401d4
0:063> s -d 0x7ffe`23870000 L0x2267000 000801d4
Ignore everything other than the address at the start of the line – we’re not expecting the byte dump to make sense to the human eye. Let’s see what piece of code this belongs to – the uf command disassembles the function that this piece of memory falls in.
0:063> uf 0x7ffe`287d39f8
I’m not even going to show you the output, because this one turned out to be a red herring – experience and/or intuition needed to confirm that. But let’s go on…
0:063> s -d 0x7ffe`23870000 L0x2267000 002001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 004001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 008001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 010001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 020001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 040001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 080001d4
0:063> s -d 0x7ffe`23870000 L0x2267000 100001d4
And we get rewarded with a disassembly of the function sqllang!IWrapInterface<IAccessor>::Release – this one pretty much comes with flashing lights given that IAccessor reeks of COM and we were expecting something involving “RELEASE”. I’ll spare you the bulk of the assembly dump, but would like to highlight the bit that confirms the setup of a preemptive wait type:
That assignment to the edx register means that the encoded wait type is the second parameter to the AutoSwitchPreemptive constructor. And while it may not always be a recognisable setup, in this case I was already familiar with AutoSwitchPreemptive (see here).
Now this kind of trawling is by no means scientific. The wait type could have been loaded from a memory address, in which case it wouldn’t have been hard-coded in the function. And of course without the code running in context, it doesn’t tell you what kind of call stack it might show up in – only running the relevant code paths and catching the wait through a breakpoint or XEvent will do that. But as a quick and dirty way of hunting for wait type usage in a module up there on the marble slab? Hey, it works for me.
We have all been there. You believe that a certain status (e.g. is the order shipped?) lives in a simple database column, only to find that it comes from a view built on a view with all kinds of creative CASE statements. And it may look ugly, but at the end of the day, you have to admit that it successfully serves the purpose of exposing business data in the way that users expect to see it.
SQLOS is built upon the idea of cooperative, AKA non-preemptive, scheduling: out of any given collection of threads belonging to a scheduler, only one will own the scheduler at a given moment. To the degree that these cooperative threads represent the only work done by the underlying CPU, this means that the thread owning the scheduler really owns the CPU. Of course, a CPU will occasionally get side-tracked into doing other work, so the SQLOS scheduler as “virtual CPU” only represents a chunk of the real CPU, but we live in the expectation that this is a suitably large chunk.
It is understandable when such competition for the CPU comes from outside of SQL Server, but it can also be instigated from within SQL Server itself. This is the world of preemptive – or if you prefer, antisocial – scheduling.