Month: February 2017

Waitress: Hello, I’m Diana, I’m your waitress for tonight… Where are you from?
Mr and Mrs Hendy: We’re from Room 259.
Mr Hendy: Where are you from?
Waitress: [pointing to kitchen] Oh I’m from the doors over there…

(Monty Python, “The Meaning of Life”)

When code runs, there is always an implied context. Depending on what level of abstraction you’re thinking at, there are endless angles from which to consider the canvas upon which we paint executing code.

Some examples, roughly in increasing level of abstraction:

What machine is it running on, and what is the global hardware setup for things like memory size and cache configuration?
Within this machine, what CPU is currently executing the code in question?
Within the currently executing function, what state was passed to it through parameters and global variables, and what would a point-in-time snapshot of the function invocation’s current state look like?
Under what credentials is the current thread executing that function, and what rights are associated with those credentials?
What is the bigger technical task being performed, from what source site was this ultimately invoked, and what application-level credentials are modeled for this task?
What business problem is being solved?

In this short series of blog posts, I’m going to cherry-pick a few technical subjects in this line of thinking, conveniently sticking with ones where I actually have something to contribute, because that way the page won’t be blank. And this will of course happen in the (ahem) context of SQL Server.

What is the point-in-time context of a running CPU?

The x64 Intel CPUs we know and love have a state which can broadly be defined as the current values of a set of user-visible registers, each of which is nothing more than a global variable that is only visible to that CPU. Ignoring floating-point and SIMD ones, this leaves us with a handful:

RIP, the instruction pointer, which points to the address of the current instruction (Thanks Klaus for finding an embarrassing typo there!). Normally we only interact with this through its natural incrementing behaviour, or by causing it to leap around through jump/call instructions (planned control flow) or a hardware interrupt (out-of-band interventions).
RSP, the stack pointer. This is also automatically managed through stepwise changes as we do stack-based operations, but we can and do interact with it directly at times.
RAX, RBX, RCX, RDX, RSI, RDI, RBP, and R8 through R15 – these are general-purpose registers that can be used however we like. Some of them have associations with specific instructions, e.g. RDI and RSI for memory copying, but they remain available for general-purpose use. Beyond that, strong conventions keep their use sane and predictable. See Register Usage on MSDN for a sense of how Windows defines such conventions.
The segment registers CS, DS, ES, FS, GS and SS. These allow a second level of abstraction for memory address translation, although in modern usage we have flat address spaces, and they can mostly be ignored. The big exception here is GS, which both Windows and Linux use to point to CPU- or thread-local structures, a usage which is explicitly supported by Intel’s SWAPGS instruction. However, I’m getting ahead of myself here, because this occurs at a much higher level of abstraction.

Context switching in its most basic form involves nothing more than saving a snapshot of these registers and swapping in other values saved previously. Broadly speaking, this is what the Windows CONTEXT structure is all about. By its nature, this is processor architecture-specific.
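To make the save-and-swap idea concrete, here is a toy Python model. It is only a sketch: the Cpu class, the dictionary-based save areas and the context_switch function are invented for illustration, and a real CONTEXT structure carries far more than these integer registers.

```python
from dataclasses import dataclass, field

# Toy model only: a real CONTEXT structure has many more fields
# (flags, segment registers, floating-point and SIMD state).
GENERAL_PURPOSE = ("rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp",
                   "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15")


@dataclass
class Cpu:
    """A CPU reduced to its user-visible integer registers."""
    regs: dict = field(default_factory=lambda: dict.fromkeys(
        ("rip", "rsp") + GENERAL_PURPOSE, 0))


def context_switch(cpu, outgoing_save_area, incoming_save_area):
    """Save the running register values, then load a previously saved set."""
    outgoing_save_area.clear()
    outgoing_save_area.update(cpu.regs)   # snapshot the outgoing thread
    cpu.regs.update(incoming_save_area)   # resume the incoming thread


cpu = Cpu()
cpu.regs.update(rip=0x1000, rsp=0x7FF0, rax=42)            # "thread A" is running

thread_a = {}                                              # A's save area, still empty
thread_b = dict(cpu.regs, rip=0x2000, rsp=0x8FF0, rax=7)   # B was saved earlier

context_switch(cpu, thread_a, thread_b)
print(hex(cpu.regs["rip"]), cpu.regs["rax"])               # now running B
context_switch(cpu, thread_b, thread_a)
print(hex(cpu.regs["rip"]), cpu.regs["rax"])               # back to A, state intact
```

Neither "thread" can tell it was ever off the CPU: after the second switch, A sees exactly the register values it had before.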

One interesting thing comes up when you consider how tricky it is to talk about the point-in-time state of a pipelined CPU, since it could be executing multiple instructions at the same time. The answer here is one that will have a familiar ring to database folks: although the incoming stream of instructions is expressed linearly, that clever CPU not only knows how to parallelise sections of them, but it can treat such groups as notionally transactional. In database-friendly terms, only the right stuff commits, even in the face of speculative execution.
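The "only the right stuff commits" idea can be sketched as a toy reorder buffer in Python. This is a deliberately crude model with made-up instruction names, not a description of any particular CPU: work finishes in any order, results retire strictly in program order, and anything executed speculatively past a mispredicted branch is thrown away.

```python
from collections import deque


def run(program, completion_order, mispredicted=None):
    """Toy reorder buffer: execute out of order, commit strictly in order.

    program          -- instruction names in program order
    completion_order -- the order in which execution happens to finish
    mispredicted     -- a branch whose speculative successors must be discarded
    """
    pending = deque(program)          # oldest instruction at the left
    done, committed = set(), []
    for name in completion_order:
        done.add(name)
        # Retire from the head only: a finished instruction waits for its elders.
        while pending and pending[0] in done:
            head = pending.popleft()
            committed.append(head)
            if head == mispredicted:
                # Everything still pending is younger than the branch: roll it back.
                # (A real CPU would now fetch the correct path; the toy just stops.)
                pending.clear()
    return committed


program = ["load", "add", "branch", "store", "mul"]
finish_order = ["store", "add", "mul", "load", "branch"]

print(run(program, finish_order))                         # all five, in program order
print(run(program, finish_order, mispredicted="branch"))  # store and mul never commit
```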

This ends up as a battle of wits between the CPU and compiler designers. Any sufficiently clever optimising compiler will reorder instructions in a way which lubricates the axles of instruction-level parallelism, and any sufficiently clever CPU will internally reorder things anyway, irrespective of what the compiler emitted. But fortunately for our sanity, we can still think of the CPU’s PacMan-like progress through those delicious instructions as happening in a single serial stream.

A CPU asks “Who am I?”

It shouldn’t come as a surprise that a single CPU has precious little awareness of its surroundings. In reading and writing memory, it may experience stalls caused by contention with other CPUs, but it has no means – or indeed need – to get philosophical about it.

Our first stopping point is a dive into the very simple Windows API call GetCurrentProcessorNumber(). This does pretty much what it says, but its workings highlight how this isn’t a hardware or firmware intrinsic, but instead something cooked up by the operating system.

Before we get to the implementation, consider how even such a simple thing can twist your brain a bit:

Who is asking the question? Candidate answer: “The thread, executing the code containing the question on the processor which has to answer it.”
Because threads can be switched between processors, the answer may cease to be correct the moment it is received. In fact, it can even become incorrect within GetCurrentProcessorNumber(), before it returns with the wrong answer.

So here, in all its four-line glory, is the disassembly of the function from my Windows 8.1 system:

mov   eax, 53h
lsl   eax, eax
shr   eax, 0Eh
ret

This uses the unusual incantation lsl (load segment limit), which dereferences and decodes an entry in the Global Descriptor Table, returning a segment limit for entry 0x53, itself a magic number that is subject to change between Windows versions. Compared to normal assembly code, this is esoteric stuff – for our purposes we only need to know that the Windows kernel puts a different value in here for each processor as it sets it up during the boot process. And it abuses the segment limit bit field by repurposing it, smuggling both the processor number and the kernel group number into it: the processor number is the higher-order chunk here. (If this kind of thing makes your toes curl in a good way, you can actually see the setup being done in systembg.asm in the Windows Research Kernel. Some Googling required.)
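The decoding step is just a shift, which a few lines of Python can mirror. The shift count comes straight from the shr eax, 0Eh in the disassembly; pack_limit is a hypothetical inverse standing in for whatever the kernel does at boot, and it assumes that anything else sharing the field sits in the low 14 bits.

```python
PROCESSOR_SHIFT = 0x0E   # taken from "shr eax, 0Eh" in the disassembly


def processor_number_from_limit(segment_limit):
    """Mirror the final shift: keep only the higher-order chunk."""
    return segment_limit >> PROCESSOR_SHIFT


def pack_limit(processor_number, low_bits=0):
    """Hypothetical inverse, standing in for the kernel's boot-time setup.

    Assumption: whatever else shares the field lives in the low 14 bits.
    """
    assert 0 <= low_bits < (1 << PROCESSOR_SHIFT)
    return (processor_number << PROCESSOR_SHIFT) | low_bits


limit = pack_limit(processor_number=5, low_bits=3)
print(hex(limit), processor_number_from_limit(limit))
```

Whatever is stored in the low bits simply falls off the end of the shift, which is why the function needs no masking.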

This sets the tone for this exploration of context. At any given level, we find that something at a lower level stuffed nuggets of information in a safe – ideally read-only – location for us to read. I should emphasise in this example that even though GetCurrentProcessorNumber() is an OS function, it isn’t a system call requiring an expensive kernel transition. If we wrote our own copy of it within our own DLL, it would be rude in terms of breaking abstraction, but it would have just as much of a right to read that GDT entry as the Windows-supplied function does.

Let’s visit the kernel in disguise

It’s unavoidable that we would occasionally need to make a system call, and here we encounter another way identity is turned sideways.

Problem statement: No matter how neatly a system call is wrapped up, it is still just a function taking parameters, and any arbitrary code can invoke any system call. This is a Bad Thing from the viewpoint of enforcing restrictions on who can execute what. How does the kernel know whether it ought to fulfil your request to perform a dangerous function if it can’t be sure who you are? Surely it can’t trust your own declaration that you have the authority?

Clearly a trusting kernel is a dead kernel. Here is where we pay another visit to ambient identity. Previously we looked at thread-local storage, where the thread-specific pointer to its user-mode Thread Environment Block is always accessible through the GS register. Now the issue is slightly different: without putting any trust in the content of the TEB, which can be trivially edited by that nasty user-mode code, the kernel needs to have a sense of who is calling into it.

The answer lies yet again in a “secret” storage compartment, in this case one not even exposed to user mode code. Beyond the normal CPU registers I mentioned above, there is a collection of so-called model-specific registers. These are the ones that support lower-level functions like virtual address translation, and even if complete garbage is passed as parameters to a system call, the kernel can find its feet and respond appropriately, e.g. by returning to the caller with a stern error message or even shutting down the offending process entirely.

And here’s the flip side of the coin. In user mode, the locus of identity is a thread, which carries credentials and thread-local storage (for the sake of the user-mode code) and implies a process (for sandbox enforcement by kernel code). In kernel mode though, we cross over into CPU-centric thinking. This is exemplified by what the constant beacon of the GS register gets set to by Windows: in user mode it points to the current thread’s Thread Environment Block, but in kernel mode it changes to point to the current processor’s Processor Control Region, and a similar situation applies in Linux.

Per-processor partitioning of certain thread management functions makes perfect sense, since we’d aim to minimise the amount of global state. Thus each processor would have its own dispatcher state, its own timer list… And hang on, this is familiar territory we know from SQLOS! The only difference is that SQLOS operates on the premise of virtualising a CPU in the form of a Scheduler, whereas the OS kernel deals with physical CPUs, or at least what it honestly believes to be physical CPUs even in the face of CPU virtualisation.

Without even looking at the read-only state passed over to user mode, once a thread calls into the kernel, the kernel can be absolutely sure what that thread is, by virtue of this CPU-centric thinking. “I last scheduled thread 123, and something just called into the kernel from user mode. Ergo, I’m dealing with thread 123.”
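That reasoning is easy to model. The sketch below is a toy in Python with invented class and method names, not how any real kernel is laid out: each CPU has a private record of the thread it last scheduled, and the system call handler consults only that record. The cpu argument stands in for something that is implicit in hardware, namely which processor happens to be executing the kernel entry code.

```python
class ProcessorControlRegion:
    """Per-CPU state that only 'kernel' code in this model ever touches."""

    def __init__(self, number):
        self.number = number
        self.current_thread = None


class Kernel:
    def __init__(self, cpu_count):
        self.pcrs = [ProcessorControlRegion(n) for n in range(cpu_count)]

    def schedule(self, cpu, thread_id):
        """The dispatcher notes which thread it put on which CPU."""
        self.pcrs[cpu].current_thread = thread_id

    def syscall(self, cpu, claimed_thread_id):
        """Return the identity the kernel will act on for this call.

        claimed_thread_id is untrusted input from user mode; it is accepted
        here only to show that it is never consulted.
        """
        return self.pcrs[cpu].current_thread


kernel = Kernel(cpu_count=2)
kernel.schedule(cpu=0, thread_id=123)
kernel.schedule(cpu=1, thread_id=456)

# Thread 123 lies about who it is; the kernel's own bookkeeping wins.
print(kernel.syscall(cpu=0, claimed_thread_id=999))
```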

We’ll be seeing a few variations on this theme. Whenever thread state (and by extension, session or process state) needs to be protected from corruption, at minimum we need some way of associating a non-overwritable token with that thread, and then saving the state somewhere where the thread can’t get at it except through safe interfaces. For an OS kernel, hardware protection takes care of drawing a line between untrusted code and the kernel. And as we’ll see later, within SQL Server the nature of the interface (T-SQL batch requests) is such that arbitrary code can’t be injected into the application’s process space, and the interface doesn’t allow for uncontrolled privilege escalation.

And all it takes is the ability to squirrel away a single secret.

Gossip hour

In researching this, I came across GetCurrentProcessorNumber() because it is called within a Hekaton synchronisation method that partitions state by CPU. That is itself interesting, since SQLOS tends to encourage partitioning by SQLOS scheduler. A very simple reading would be that this is a symptom of the Hekaton development team having run with the brief to minimise their dependence on existing layers within SQL Server. This is supported by the observation that Hekaton seems to bypass the local storage layer provided within SQLOS workers on top of thread-local storage, directly assigning itself TLS slots from the OS.

In fairness (at least to answer the first point), GetCurrentProcessorNumber() was only added in recent Windows versions, and core SQLOS was developed before that existed. But it is easy to project one’s own experiences of Not Invented Here Syndrome onto others.

So back to “I’m from those doors over there”… In sys.dm_os_threads, we find the column instruction_address, purporting to be the address of the instruction currently executing. Now for a suspended thread, this is a sensible thing to wonder about, but once a thread is running, no outside agent, for instance a DMV-supporting function running on another CPU, has a hope of getting a valid answer. This is documented behaviour for the Windows function GetThreadContext(): “You cannot get a valid context for a running thread”. Then again, any non-running thread will have an instruction address pointing to a SQLOS synchronisation function, which isn’t really interesting in itself without a full stack trace. That leaves the edge case of what value you get for the actual thread which is running sys.dm_os_threads. And the answer is that you get the starting address of sqldk!SOS_OS::GetThreadControlRegisters, the function that wraps GetThreadContext(). Turns out that someone put a special case in there to return that hard-coded magic value, rather than the thread attempting to query itself, which I rather like to think of as an Easter egg. The doors over there indeed.

Part 2 will consist of a look into stack frames. See you there!

SESSION_CONTEXT() as Swiss Army knapsack

So the shiny new tool I came across is session context, the family-sized successor to the old CONTEXT_INFO. Aaron Bertrand has written a great blog post about it: Phase out CONTEXT_INFO() in SQL Server 2016 with SESSION_CONTEXT().

SESSION_CONTEXT() brings two major innovations. Firstly, it replaces a 128-byte scalar payload with a key-value structure that can accommodate 256 KB of data. You can really go to town filling this thing up.

The second change is less glamorous, but possibly more significant: it is possible to set an entry to read-only, meaning that it can safely be used for the kind of contextual payload you don’t want tampered with. This makes me happy, not because I currently have a great need for it, but because it neatly ties in with things I have been thinking about a lot lately.
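In T-SQL the entries are written with sp_set_session_context, whose @read_only parameter is what locks a key, and read back with SESSION_CONTEXT(). The semantics can be sketched as a toy Python class. This is a loose model with invented names, it ignores the size limit, and the exception type is a stand-in for the error SQL Server would raise.

```python
class SessionContext:
    """Toy model of a per-session key-value store with lockable keys."""

    def __init__(self):
        self._values = {}
        self._read_only = set()

    def set(self, key, value, read_only=False):
        # Once a key is flagged read-only, later writes in this session fail.
        if key in self._read_only:
            raise PermissionError(f"key {key!r} is read-only for this session")
        self._values[key] = value
        if read_only:
            self._read_only.add(key)

    def get(self, key):
        # Keys that were never set come back as None, playing the role of NULL.
        return self._values.get(key)


ctx = SessionContext()
ctx.set("app_user_id", 42, read_only=True)   # trusted setup code runs first
ctx.set("colour_scheme", "dark")             # ordinary, editable entry

try:
    ctx.set("app_user_id", 1)                # later code tries to tamper
except PermissionError as err:
    print("rejected:", err)

print(ctx.get("app_user_id"), ctx.get("colour_scheme"))
```

The ordering matters: whoever writes the read-only entry first wins, so the trusted setup has to run before any less trusted code gets a turn on the session.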

The rise of the kernel

Something that comes up time and again in multi-layered architectures is the Inner Platform Effect. Just when is it justified to use a programming framework to recreate a function that said framework already fulfils?

An OS kernel is sacred ground. When designed sanely and safely, it doesn’t allow clients (application code) to execute arbitrary code in kernel context. This is because the deeper kernel layer has privileges that could be abused, and user code must be kept at arm’s length within user processes, sandboxed in such a way that they can attack neither the kernel nor each other. Quite simply, we provide the means to enforce the principle of least privilege.

This separation is enforced on the hardware level through things like virtual memory mapping, whereby different processes can’t see each other’s memory. And while the remapping of memory (a simple attack vector) is just another software function exposed by the CPU, the ability to modify these mappings is reserved for a higher privilege level than common application code.

On top of this, we build the notion of threads, each having a distinct identity. This is a far more hazy concept than memory mapping, in that the CPU provides the barest minimum of functionality to support the illusion. Switching between threads may involve changing memory mappings (when the outgoing and incoming threads belong to different processes), but it always includes changing a tiny bit of thread identity which user code might be able to read, but can’t overwrite. Since user identity, and hence permissions checking, is tied into thread identity, this makes perfect sense. A thread which is allowed to muck about with its own identity, or the identity of other threads, is a security risk.

Now we go and build a multithreaded server application like SQL Server on top of these abstractions. The code which is trusted to have the run of its process space is the code that shipped with the executable, assuming we temporarily blank out the terror of extended stored procedures. This code in turn maintains a cosy environment for user-supplied code in the form of T-SQL, which plays in a memory space consisting of access-controlled global objects (tables), plus some session-scoped objects (temp tables) and batch-scoped ones (variables).

In simple textbook cases, it stops there. Ahmed is restricted to audited querying of North-East region sales data, while Ivanka gets to be security admin.

The database is the application’s kernel

That simple textbook case falls flat when you move permissions and identity out to the application layer, with all application/database users getting represented by a single database user. All understandable, especially in web apps, but now the burden is upon application code, whether inside or outside the database, to find ways of continually answering that thorny “Who Am I?” question.

Outside the database (the web app) this is a solved problem, but externally defined identity and permission sets aren’t accessible to stored procedures and triggers, unless the user identity is pushed through as a parameter. And clearly this is something that can be faked, which brings us neatly back to the potential need for secure session-scoped metadata that is non-editable after being set up.

Think of these stored procedures and triggers as kernel code. We need only the tiniest smidgen of an identifier that represents a trusted identity token. For actual OS kernel code, this can be (and is!) reduced to a single internal CPU register. Anything beyond that can be derived by allocating storage and passing payload through bit by bit, as long as all the communication is over this trusted connection with its identifying metadata token.

The context of context

This frames where I hope to be heading with my next set of blog posts. You already know the punch line: in some form or another, we are always reliant on thread-local storage. It’s just a question of how many extra layers get piled on top of that basic thread abstraction until we get to a SQL Server session.

And then, just when you think you have a good abstraction, along comes a programming pattern that strips the session back to an anonymous connection, and uses SESSION_CONTEXT() to build something new on top. Be that as it may, session context is a great user-visible touch point for some juicy internals!