Context in perspective 5: Living next door TLS

Time to switch context to an old thread, specifically the SystemThread class. I have done this to death in an Unsung SQLOS post, but if you don’t want to sit through that, here are the bits we care about today:

  • All work in SQLOS is performed by good old-fashioned operating system threads.
  • SQLOS notionally extends threads with extra attributes like an associated OS event object, and behaviour like a method for for waiting on that event. This bonus material lives in the SystemThread class, which is conceptually a subclass of an operating system thread, although this isn’t literally class inheritance.
  • A SystemThread instance lives in memory which is either allocated on the heap and pointed to by a thread-local storage slot, or it squats right across a range of TLS slots within the Thread Environment Block.

Due to the nature of thread-local storage, at any moment any code running on a SQLOS thread can get a reference to the SystemThread it’s running on; this is described in gory detail in Windows, Mirrors and a sense of Self. Powerful stuff, because this ambient SystemThread indirectly exposes EVERYTHING that defines the current SQL Server context, from connection settings to user identity and beyond.

Life is a box of ogres

LS through the looking glass

Understanding SQLOS takes us only so far, because it is after all just the substrate upon which SQL Server is built. We’ve now reached the point where SQLOS hands SQL Server a blank slate and says “knock yourself out”.

This blank slate is a small but delicious slice of local storage: an array of eighteen pointers living within the SystemThread. SQLOS has no interest in what these are used for and what they mean. As far as it is concerned, it’s just a bunch of bytes of no known type. But of course SQL Server cares a lot more.

Park that thought for a moment to consider that we’re starting to build up a hierarchy of thread-local storage:

  1. Upon an OS context switch, the kernel swaps the value held in the CPU’s GS register to point to the Thread Environment Block of the incoming thread.
  2. Within this Thread Environment block lives TLS slots that SQLOS takes advantage of. One of these will always point to the currently running SystemThread instance. In other words, when the kernel does a context switch, the change of OS thread brings with it a change in the ambient SystemThread which can be retrieved from TLS.
  3. Now within this SystemThread, an array of eighteen pointer-sized slots are made available for the client application (SQL Server) to build upon.

What does this mean from the viewpoint of SQL Server? Well, even within the parts that don’t care about SQLOS and the underlying OS, code can express and find current thread-specific state – at a useful abstraction level – somewhere within those eighteen slots.

Worker LocalStorage vs thread-local storage

We often skirt around the distinction between a worker and a thread. This is a benign simplification, because instances of Worker and SystemThread classes are bound together in a 1:1 paired relationship during all times when SQL Server code is running.

The only time the distinction becomes particularly interesting is when we’re in fiber mode, because now a single SystemThread can promiscuously service multiple Workers in a round-robin fashion. I have documented some details in The Joy of Fiber Mode, but of special interest today is the treatment of these local storage slots.

These now become a more volatile part of thread context, and a SQLOS context switch (in this case, a fiber switch) must include copying the Worker’s LocalStorage payload into the SystemThread slots, because there is no guarantee that two different Workers will share the exact same context; in fact it is close to certain they won’t.

So clearly the context that matters to the Task at hand lives within the Worker, and SQLOS makes sure that this is the context loaded into a SystemThread while it runs; it’s just that the SQLOS scheduler has to intervene more in fiber mode, whereas it can let this bit of state freewheel in thread mode.

On to the implementation details then. Unlike the case with the SystemThread, a Worker’s local storage isn’t a chunk of memory within the class, but lives in a heap allocation, represented by an instance of the LocalStorage class embedded within the Worker.

Additionally, while the SystemThread’s slot count is baked into SQLOS, somebody clearly wanted a bit more abstraction in the Worker class, so the slot count becomes an attribute of a LocalStorage instance, which thus consists of a grand total of two members:

  1. The slot count, passed into the LocalStorage constructor
  2. A pointer to the actual chunk of memory living somewhere on the heap and allocated by the LocalStorage constructor, which is a thin wrapper around a standard page allocator

Show me some numbers, and don’t spare the horses

On to the fun bit, but you’re going to have to take my word on two things. Firstly, the SystemThread’s local storage slots are right at the start of the object. And secondly, the LocalStorage instance lives at an offset of 0x30 within a Worker, at least on the version I’m using here.

To prepare, I needed to capture the addresses of a bound SystemThread:Worker pair before breaking into the debugger, so I started a request running in session 53, executing nothing but a long WAITFOR – this should keep things static enough while I fumble around running a DMV query in another session:

Capturing the worker and thread addresses of our dummy request

So we’re off yet again to stage a break-in within Windbg. Having done this, everything is frozen in time, and I can poke around to my heart’s content while the business users wonder what just happened to their server. No seriously, you don’t want to do this on an instance that anyone else needs at the same time.

Here is the dump of the first 64 bytes of that Worker. As in Part 4, I’m dumping in pointer format, although only some of these entries are pointers or even necessarily 64 bits wide. In case it isn’t clear, the first column is the address we’re looking at, and the second column is the contents found at that address. The dps command sweetens the deal a bit: if that payload is an address corresponding to a known symbol (e.g. a the name of a function), that symbol will be displayed in a third column.

0:075> dps 0x000000003B656160 l0x8
00000000`3b656160  00000000`00000000
00000000`3b656168  00000000`00000000
00000000`3b656170  00000000`3d948170
00000000`3b656178  00000000`3a21a170
00000000`3b656180  00000000`00000001
00000000`3b656188  00000000`41080158
00000000`3b656190  00000000`00000012
00000000`3b656198  00000000`3b656c80

Those highlighted last two represent the LocalStorage instance, confirming that we do indeed have eighteen (=0x12) slots, and telling us the address where that slot array starts. Let’s see what payload we find there:

0:075> dps 00000000`3b656c80 l0x12
00000000`3b656c80  00000000`3875e040
00000000`3b656c88  00000000`3875e9a0
00000000`3b656c90  00000000`00000000
00000000`3b656c98  00000000`38772608
00000000`3b656ca0  00000000`38773440
00000000`3b656ca8  00000000`00000000
00000000`3b656cb0  00000000`00000000
00000000`3b656cb8  00000000`00000000
00000000`3b656cc0  00000000`00000000
00000000`3b656cc8  00000000`00000000
00000000`3b656cd0  00000000`3a8776a0
00000000`3b656cd8  00000000`00000000
00000000`3b656ce0  00000000`00000000
00000000`3b656ce8  00000000`00000000
00000000`3b656cf0  00000000`00000000
00000000`3b656cf8  00000000`00000000
00000000`3b656d00  00000000`00000000
00000000`3b656d08  00000000`00000000

It seems that only five out of the eighteen are in use. Oh well, that’s neither here nor there. Let’s compare this to the first eighteen quadwords at the bound SystemThread’s address we found in sys.dm_os_threads:

0:075> dps 0x00007FF653311508 l0x12
00007ff6`53311508  00000000`3875e040
00007ff6`53311510  00000000`3875e9a0
00007ff6`53311518  00000000`00000000
00007ff6`53311520  00000000`38772608
00007ff6`53311528  00000000`38773440
00007ff6`53311530  00000000`00000000
00007ff6`53311538  00000000`00000000
00007ff6`53311540  00000000`00000000
00007ff6`53311548  00000000`00000000
00007ff6`53311550  00000000`00000000
00007ff6`53311558  00000000`3a8776a0
00007ff6`53311560  00000000`00000000
00007ff6`53311568  00000000`00000000
00007ff6`53311570  00000000`00000000
00007ff6`53311578  00000000`00000000
00007ff6`53311580  00000000`00000000
00007ff6`53311588  00000000`00000000
00007ff6`53311590  00000000`00000000

This isn’t the same address as the Worker’s local storage array, but the contents is the same, which is consistent with my expectation. I’m highlighting that first entry, because we’ll be visiting it later.

Just out of interest, I’m also going to try and tie things back to memory page observations made in Part 4. Let’s peek at the top of the 8k page that the LocalStorage lives on. Its address is 0x3b656c80, which rounded down to the nearest 8k (=0x2000) gives us a page starting address of 0x3b656000.

0:075> dps 0x000000003B656000
00000000`3b656000  00010002`00000000
00000000`3b656008  00000000`3b656040
00000000`3b656010  00060003`00000001
00000000`3b656018  00000000`000012c0
00000000`3b656020  00000000`00000000
00000000`3b656028  00000000`00000000
00000000`3b656030  00000000`00000001
00000000`3b656038  00000000`00000110
00000000`3b656040  00007ffd`5f95c290 sqldk!CMemObj::`vftable'

The shape of that page header looks familiar. The second quadword is a pointer to the address 0x40 bytes into this very page. And just to hand it to us on a plate, the object starting there reveals itself by its vftable symbol to be a CMemObj.

In other words, this particular bit of LocalStorage is managed by a memory object which lives on its very page – obviously it would be wasteful if a memory object refused to share its page with some of the objects it allocated memory for. Also note that this is a plain CMemObj and not a CMemThread<CMemObj>, i.e. it isn’t natively thread-safe. This may simply confirm that local storage is not intended for sharing across threads; see Dorr for more.

Cutting to the chase

So far, this is all rather abstract, and we’re just seeing numbers which may or may not be pointers, pointing to heaven knows what. Let me finish off by illuminating one trail to something we can relate to.

Out of the five local storage slots which contain something, the first one here points to 00000000`3b656040. As it turns out, this is an instance of the CCompExecCtxtBasic class, and you’ll just have to take my word for it today. Anyway, we’re hunting it for its meat, rather than for its name. Have some:

0:075> dps 3875e040
00000000`3875e040  00000000`3875ff00
00000000`3875e048  00000000`387727f0
00000000`3875e050  00000000`3875e0c0
00000000`3875e058  00000000`38772470
00000000`3875e060  00000000`38773560
00000000`3875e068  00000000`3875f310
00000000`3875e070  00000000`38773370
00000000`3875e078  00000000`38773240
00000000`3875e080  00000000`3875e0b8
00000000`3875e088  00000000`3875e330
00000000`3875e090  00000000`38773440
00000000`3875e098  00000000`00000001
00000000`3875e0a0  00000000`3875f890
00000000`3875e0a8  00000000`00000001
00000000`3875e0b0  00000000`4777e110
00000000`3875e0b8  00000000`00000000

You may recognise by the range of memory addresses we’ve seen so far that most of these are likely themselves pointers. I’ll now lead you by the hand to the highlighted fourth member of CCompExecCtxtBasic, and demonstrate what that member points to:

0:075> dps 00000000`38772470 
00000000`38772470 00007ffd`61d575f8 sqllang!CSession::`vftable'
00000000`38772478 00000000`0000000c
00000000`38772480 00000000`00000035
00000000`38772488 00000000`00000000
00000000`38772490 00000000`00000000
00000000`38772498 00000000`00000000
00000000`387724a0 00000001`00000001
00000000`387724a8 00000001`00000009
00000000`387724b0 00000000`00000000
00000000`387724b8 00000000`00000000
00000000`387724c0 00000000`00000000
00000000`387724c8 00007ffd`61d17b20 sqllang!CSession::`vftable'
00000000`387724d0 00000000`38772040
00000000`387724d8 00000000`38772240
00000000`387724e0 00000000`38772a60
00000000`387724e8 00000000`3875f570

Bam! Given away by the vftable yet again – this is a CSession instance, in other words probably the session object associated with that running thread. As per Part 3, the presence of more than one vftable indicates that we may be dealing with multiple virtual inheritance in the CSession class.

We’ll leave the last word for the highlighted third line, containing the value 0x35 as payload.

What does 0x35 translate to in decimal? Would you believe it, it’s 53, the session_id of the session associated with the thread. If we were feeling bored, we could go and edit it to be another number, essentially tinkering with the identity of that session. But today is not that day.

Putting it all together in a diagram, here then is one trail by which an arbitrary chunk of SQL Server code can find the current session, given nothing more than the contents of the GS register:

Sure, your programming language and framework may very well abstract away this kind of thing. But that doesn’t mean you aren’t relying on it all the time.

Final thoughts

This exercise would qualify as a classic case of spelunking, where we have a general sense of what we’d expect to find, perhaps egged on by a few clues. Just as I’m writing this, I see that Niels Berglund has also been writing about spelunking in Windbg. So many angles, so much interesting stuff to uncover!

The key takeaway should be a reminder of how much one can do with thread-local storage, which forms the third and often forgotten member of this trio of places to put state:

  1. Globally accessible heap storage – this covers objects which are accessible everywhere in a process, including ones where lexical scope attempts to keep them private.
  2. Function parameters and local variables – these have limited lifespans and live in registers or on the stack, but remain private to an invocation of a function unless explicitly shared.
  3. Thread-local storage – this appears global in scope, but is restricted to a single thread, and each thread gets its own version of the “same” global object. This is a great place to squirrel away the kind of state we’d associate with a session, assuming we leverage a link between sessions and threads.

I hope that the theme of this series is starting to come together now. One or two more to go, and the next one will cover sessions and their ancillary context in a lot more breadth.

Context in perspective 1: What the CPU sees in you

Waitress: Hello, I’m Diana, I’m your waitress for tonight… Where are you from?
Mr and Mrs Hendy: We’re from Room 259.
Mr Hendy: Where are you from?
Waitress: [pointing to kitchen] Oh I’m from the doors over there…

(Monty Python, “The Meaning of Life”)

When code runs, there is always an implied context. Depending on what level of abstraction you’re thinking at, there are endless angles from which to consider the canvas upon which we paint executing code.

Some examples, roughly in increasing level of abstraction:

  • What machine is it running on, and what is the global hardware setup for things like memory size and cache configuration?
  • Within this machine, what CPU is currently executing the code in question?
  • Within the currently executing function, what state was passed to it through parameters and global variables, and what would a point-in-time snapshot of the function invocation’s current state look like?
  • Under what credentials is the current thread executing that function, and what rights are associated with those credentials?
  • What is the bigger technical task being performed, from what source site was this ultimately invoked, and what application-level credentials are modeled for this task?
  • What business problem is being solved?

In this short series of blog posts, I’m going to cherry-pick a few technical subjects in this line of thinking, conveniently sticking with ones where I actually have something to contribute, because that way the page won’t be blank. And this will of course happen in the (ahem) context of SQL Server.

What is the point-in-time context of a running CPU?

The x64 Intel CPUs we know and love have a state which can broadly be defined as the current values of a set of user-visible registers, each of which is nothing more than a global variable that is only visible to that CPU. Ignoring floating-point and SIMD ones, this leaves us with a handful:

  • RIP, the instruction pointer, which points to the address of the current instruction (Thanks Klaus for finding an embarrassing typo there!). Normally we only interact with this through its natural incrementing behaviour, or by causing it to leap around through jump/call instructions (planned control flow) or a hardware interrupt (out-of-band interventions).
  • RSP, the stack pointer. This is also automatically managed through stepwise changes as we do stack-based operations, but we can and do interact with it directly at times.
  • RAX, RBX, RCX, RDX, RSI, RSI, RBP, and R8 through R15 – these are general-purpose registers that can be used however we like. Some of them have associations with specific instructions, e.g. RDI and RSI for memory copying, but they remain available for general-purpose use. However, beyond that, strong conventions keep their use sane and predictable. See Register Usage on MSDN for a sense of how Windows defines such conventions.
  • The segment registers CS, DS, ES, FS, GS and SS. These allow a second level of abstraction for memory address translation, although in modern usage we have flat address spaces, and they can mostly be ignored. The big exception here is GS, which both Windows and Linux uses to point to CPU- or thread-local structures, a usage which is explicitly supported by Intel’s SWAPGS instruction. However, I’m getting ahead of myself here, because this occurs at a much higher level of abstraction.

Context switching in its most basic form involves nothing more than saving a snapshot of these registers and swapping in other values saved previously. Broadly speaking, this is what the Windows CONTEXT structure is all about. By its nature, this is processor architecture-specific.

One interesting thing comes up when you consider how tricky it is to talk about the point-in-time state of a pipelined CPU, since it could be executing multiple instructions at the same time. The answer here is one that will have a familiar ring to database folks: although the incoming stream of instructions is expressed linearly, that clever CPU not only knows how to parallelise sections of them, but it can treat such groups as notionally transactional. In database-friendly terms, only the right stuff commits, even in the face of speculative execution.

This end up as a battle of wits between the CPU and compiler designers. Any suffienctly clever optimising compiler will reorder instructions in a way which lubricates the axles of instruction-level parallelism, and any sufficiently clever CPU will internally reorder things anyway, irrespective of what the compiler emitted. But fortunately for our sanity, we can still think of the CPU’s PacMan-like progress through those delicious instructions as happening in a single serial stream.

A CPU asks “Who am I?”

It shouldn’t come as a surprise that a single CPU has precious little awareness of its surroundings. In reading and writing memory, it may experience stalls caused by contention with other CPUs, but it has no means – or indeed need – to get philosophical about it.

Our first stopping point is a dive into the very simple Windows API call GetCurrentProcessorNumber(). This does pretty much what it says, but its workings highlights how this isn’t a hardware or firmware intrinsic, but instead something cooked up by the operating system.

Before we get to the implementation, consider how even such a simple thing can twist your brain a bit:

  • Who is asking the question? Candidate answer: “The thread, executing the code containing the question on the processor which has to answer it.”
  • Because threads can be switched between processors, the answer may cease to be correct the moment it is received. In fact, it can even become incorrect within GetCurrentProcessorNumber(), before it returns with the wrong answer.

So here in all its three-line glory, is the disassembly of the function from my Windows 8.1 system:

mov   eax, 53h
lsl   eax, eax
shr   eax, 0Eh
ret

This uses the unusual incantation lsl (load segment limit), which dereferences and decodes an entry in the Global Descriptor Table, returning a segment limit for entry 0x53, itself a magic number that is subject to change between Windows versions. Compared to normal assembly code, this is esoteric stuff – for our purposes we we only need to know that the Windows kernel puts a different value in here for each processor as it sets it up during the boot process. And it abuses the segment limit bit field by repurposing it, smuggling both the processor number and the kernel group number into it: the processor number is the higher-order chunk here. (If this kind of thing makes your toes curl in a good way, you can actually see the setup being done in systembg.asm in the Windows Research Kernel. Some Googling required.)

This sets the tone for this exploration of context. At any given level, we find that something at a lower level stuffed nuggets of information in a safe – ideally read-only – location for us to read. I should emphasise in this example that even though GetCurrentProcesor is an OS function, it isn’t a system call requiring an expensive kernel transition. If we wrote our own copy of it within our own DLL, it would be rude in terms of breaking abstraction, but it would have just as much of a right to read that GDT entry as the Windows-supplied function does.

Let’s visit the kernel in disguise

It’s unavoidable that we would occasionally need to make a system call, and here we encounter another way identity is turned sideways.

Problem statement: No matter how neatly a system call is wrapped up, it is still just a function taking parameters, and any arbitrary code can invoke any system call. This is a Bad Thing from the viewpoint of enforcing restrictions on who can execute what. How does the kernel know whether it ought fulfil your request to perform a dangerous function if it can’t be sure who you are? Surely it can’t trust your own declaration that you have the authority?

Clearly a trusting kernel is a dead kernel. Here is where we pay another visit to ambient identity. Previously we looked at thread-local storage, where the thread-specific pointer to its user-mode Thread Environment Block is always accessible through the GS register. Now the issue is slightly different: without putting any trust in the content of the TEB, which can be trivially edited by that nasty user-mode code, the kernel needs to have a sense of who is calling into it.

The answer lies yet again in a “secret” storage compartment, in this case one not even exposed to user mode code. Beyond the normal CPU registers I mentioned above, there is a collection of so-called model-specific registers. These are the ones that support lower-level functions like virtual address translation, and even if complete garbage is passed as parameters to a system call, the kernel can find its feet and respond appropriately, e.g. by returning to the caller with a stern error message or even shutting down the offending process entirely.

And here’s the flip side of the coin. In user mode, the locus of identity is a thread, which carries credentials and thread-local storage (for the sake of the user-mode code) and implies a process (for sandbox enforcement by kernel code). In kernel mode though, we cross over into CPU-centric thinking. This is exemplified by what the constant beacon of the GS register gets set to by Windows: in user mode it points to the current thread’s Thread Environment Block, but in kernel mode it changes to point to the current processor’s Processor Control Region, and a similar situation applies in Linux.

Per-processor partitioning of certain thread management functions makes perfect sense, since we’d aim to minimise the amount of global state. Thus each processor would have its own dispatcher state, its own timer list… And hang on, this is familiar territory we know from SQLOS! The only difference is that SQLOS operates on the premise of virtualising a CPU in the form of a Scheduler, whereas the OS kernel deals with physical CPUs, or at least what it honestly believes to be physical CPUs even in the face of CPU virtualisation.

Without even looking at the read-only state passed over to user mode, once a thread calls into the kernel, the kernel can be absolutely sure what that thread is, by virtue of this CPU-centric thinking. “I last scheduled thread 123, and something just called into the kernel from user mode. Ergo, I’m dealing with thread 123.”

We’ll be seeing a few variations on this theme. Whenever thread state (and by extension, session or process state) needs to be protected from corruption, at minimum we need some way of associating a non-overwritable token with that thread, and then saving the state somewhere where the thread can’t get at it except through safe interfaces. For an OS kernel, hardware protection takes care of drawing a line between untrusted code and the kernel. And as we’ll see later, within SQL Server the nature of the interface (T-SQL batch requests) is such that arbitrary code can’t be injected into the application’s process space, and the interface doesn’t allow for uncontrolled privilege escalation.

And all it takes is the ability to squirrel away a single secret.

Gossip hour

In researching this, I came across GetCurrentProcessorNumber() because it is called within a Hekaton synchronisation method that partitions state by CPU. That is itself interesting, since SQLOS tends to encourage partitioning by SQLOS scheduler. A very simple reading would be that this is a symptom of the Hekaton development team having run with the brief to minimise their dependence on existing layers within SQL Server. This is supported by the observation that Hekaton seems to bypass the local storage layer provided within SQLOS workers on top of thread-local storage, directly assigning itself TLS slots from the OS.

In fairness (at least to answer the first point), GetCurrentProcessorNumber() was only added in recent Windows versions, and core SQLOS was developed before that existed. But it is easy to project one’s own experiences of Not Invented Here Syndrome onto others.

So back to “I’m from those doors over there”… In sys.dm_os_threads, we find the column instruction_address, purporting to be the address of the instruction currently executing. Now for a suspended thread, this is a sensible thing to wonder about, but once a thread is running, no outside agent, for instance a DMV-supporting function running on another CPU, has a hope of getting a valid answer. This is documented behaviour for the Windows function GetThreadContext(): “You cannot get a valid context for a running thread”. Then again, any non-running thread will have an instruction address pointing to a SQLOS synchronisation function, which isn’t really interesting in itself without a full stack trace. That leaves the edge case of what value you get for the actual thread which is running sys.dm_os_threads. And the answer is that you get the starting address of sqldk!SOS_OS::GetThreadControlRegisters, the function that wraps GetThreadContext(). Turns out that someone put a special case in there to return that hard-coded magic value, rather than the thread attempting to query itself, which I rather like to think of as an Easter egg. The doors over there indeed.

Part 2 will consist of a look into stack frames. See you there!

Further reading

CPU Rings, Privilege, and Protection by Gustavo Duarte. If you want both technical depth and an easy read, you’d be hard pressed to improve on Gustavo’s extraordinary talent.
Inside KiSystemService by shift32. Although it is written from a 32-bit viewpoint, it goes very deeply into the actual system call mechanism, including how trap frames are set up.
This Alex Ionescu post giving some more technical insight into the (ab)use of segment selectors. If it makes the rest of us feel any better, here we see the co-author of Windows Internals admitting that he, too, had to look up the LSL instruction.

Windows, mirrors and a sense of self

In my previous post, Threading for humans, I ended with a brief look at TLS, thread-local storage. Given its prominent position in SQLOS, I’d like to take you on a deeper dive into TLS, including some x64 implementation details. Far from being a dry subject, this gets interesting when you look at how TLS helps to support the very abstraction of a thread, as well as practical questions like how cleanly any/or efficiently SQLOS encapsulates mechanisms peculiar to Windows on Intel, or for that matter Windows as opposed to Linux.
Continue reading “Windows, mirrors and a sense of self”