In my previous post, Threading for humans, I ended with a brief look at TLS, thread-local storage. Given its prominent position in SQLOS, I’d like to take you on a deeper dive into TLS, including some x64 implementation details. Far from being a dry subject, this gets interesting when you look at how TLS helps to support the very abstraction of a thread, as well as practical questions like how cleanly and/or efficiently SQLOS encapsulates mechanisms peculiar to Windows on Intel, or for that matter Windows as opposed to Linux.
Mental models and threads
Let’s step right back to fundamentals here. In a superficial reading of early hype around object-orientated programming, mere procedural code constructs were to gain sentience, ending up with not only capabilities (boring) but also an exciting awareness of themselves and the world around them. Hence when a Dog, derived from Animal, was asked to Speak, it knew to “bark” rather than “quack”, and in a collection of Dog instances, each of them had its own Name. Heck, these Dogs might even have gone as far as constructing other Dogs with different traits, keeping track of their pedigrees. Life was good.
While executing methods within the Dog class hierarchy, this is easy to reason about: if I invoke Dog.Run, “it” is fully engaged in the joys of dogness, even if (or especially if) it runs like a dog. However, the Run method will at some point refer to icky things like Muscle.Contract. Sure, we have some lovely encapsulation in the fact that a Dog doesn’t need to know how a Muscle works, and likewise, a Muscle doesn’t need to know that it belongs to Harry the husky. But let’s assume that 95% of the effort expended in running is done by methods in the Muscle class. If we got our sense of self by looking at what class the currently executing code belongs to, there isn’t a lot of dogness going around.
Let’s also consider that our dogs need to have their resource usage tracked. Even if we somehow knew that 95% of CPU is expended by methods of the Muscle class within Dog context, we really want to be able to distinguish between the energy expenditure of Harry the husky as opposed to his fellow husky Harriet, or that chihuahua they met down at the park.
This gets to the nub of the identity issue. We have at least four candidate sources for the identity or context of running user-mode code, and these are at odds with each other:
- The currently running method, i.e. what chunk of source code it belongs to (Muscle.Contract)
- Some higher-level abstraction of the above, whereby the currently running method knows it is being called from another class which is the true locus of identity (the Dog class)
- Building upon that further, a specific instance of a Dog, viz Harry the husky
- The security principal which is executing the code
As a teaser for where this is heading, I’ll reframe the problem as classic SQL Server examples. Firstly, when a latch wait occurs somewhere in the bowels of a LatchBase subclass instance, how does that latch method know to track the wait against an instance of a Worker, or make it known to the world that it is holding up that Worker? And secondly, at a much higher abstraction level, when a task executes a user query and needs to access a table, how does the access methods code know what security principal to do security checks against? We are taking the first steps towards answering these questions here.
The above should have highlighted how tricky things can get when we move beyond language like “it executes the Run method”. What is ultimately the “it” being referred to here? The lowest-level answer is the CPU, which isn’t very satisfying. And the highest-level one, namely the security principal, is problematic in a server application when there are thousands of things being done near-simultaneously by a single service account acting on behalf of a few dozen user accounts, each having a few hundred requests in flight at a given moment.
This question resolves itself at the level of a thread, and specifically when we allow each thread to have a sense of its own identity. A Windows thread encompasses a lot of gubbins which only a kernel developer could love. However, there are also many other attributes which gently cross the line from complex here-be-dragons kernel code towards an abstraction that can be viewed through the kinder lens of object-oriented programming. Some well-known ones:
- The entry point, which is the function called when the thread is scheduled for the first time. One entry point can be shared by many threads, assuming that the code in question is thread-safe.
- The thread’s user-mode stack (each thread also has a kernel-mode stack, but we’re leaving that alone today).
- The thread priority.
- The CPU context, i.e. a snapshot of all CPU registers as it stands right now. This is in fact a tricky example; the context of a running thread is something that exists solely within the CPU, and it is only when the thread is suspended that the Windows thread dispatcher code snapshots it and squirrels it away in a kernel-mode structure.
- The current structured exception handling frame.
Much of a thread’s state lives in kernel space, hence inaccessible to user-mode code. This is necessary to avoid opening the door to easy hacks like granting the thread an infinite quantum or giving it unfettered access across the machine. However, two chunks of state do live in user mode, hence accessible to all code running in the process. Although per-thread privacy can’t be enforced for them and they are therefore accessible to all threads in a process, by convention we don’t casually share references to them across threads. One of these is the thread’s user-mode stack, and the other is a structure called the Thread Environment Block (TEB). Although officially undocumented, its layout is well known, and direct code references to it are in common use, elevating it to the position of “de facto documented”. Significantly, each thread’s TEB is easily accessible by any code running in the context of that thread.
To return to the “who am I?” question, let’s assume we have started a thread, and the entry point somehow has derived a notion of the thread’s identity. In order to expose or publish it across the thread, we have two basic options:
- Make it a parameter to every single method call down the call stack. This would require either a lot of discipline or a language/framework that does it tacitly for us.
- Place it at a well-known location in memory so all potential consuming code knows where to find it.
As we’ll see, the Windows TEB implementation takes the second of the two options.
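To make the contrast between the two options concrete, here is a minimal sketch in C. Everything in it is hypothetical illustration: the function names are mine, and a C11 `_Thread_local` variable merely stands in for "a well-known location in memory"; it is not the TEB mechanism itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Option 1: identity travels as an explicit parameter through every call. */
void muscle_contract_explicit(const char *who, char *out, size_t cap) {
    assert(who != NULL && out != NULL && cap > 0);
    snprintf(out, cap, "%s contracts a muscle", who);
}

/* Option 2: identity sits at a well-known per-thread location.
   t_current_name is a hypothetical variable; each thread gets its own copy. */
static _Thread_local const char *t_current_name = "unnamed";

/* Called once by the thread's entry point, before any other work. */
void set_current_name(const char *name) {
    assert(name != NULL);
    t_current_name = name;
}

/* No identity parameter: the callee looks the identity up for itself. */
void muscle_contract_ambient(char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    snprintf(out, cap, "%s contracts a muscle", t_current_name);
}
```

The second form keeps Muscle's signature free of any notion of Dog, at the cost of a hidden dependency on whoever set the name earlier in the thread's life.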
Inside the Thread Environment Block?
Here, as a type dump from the Windows debugger, are a few choice bits you’ll find in the TEB and some of its child structures. I’m including some members I’ll be referring to, plus some more that should give you a flavour of what else lives there:
0:066> dt _teb
ntdll!_TEB
   +0x000 NtTib            : _NT_TIB
   ...
   +0x040 ClientId         : _CLIENT_ID
   ...
   +0x068 LastErrorValue   : Uint4B
   ...
   +0x108 CurrentLocale    : Uint4B
   ...
   +0x1480 TlsSlots        : [64] Ptr64 Void
   +0x1680 TlsLinks        : _LIST_ENTRY
   ...

0:066> dt _client_id
ntdll!_CLIENT_ID
   +0x000 UniqueProcess    : Ptr64 Void
   +0x008 UniqueThread     : Ptr64 Void

0:066> dt _nt_tib
ntdll!_NT_TIB
   +0x000 ExceptionList    : Ptr64 _EXCEPTION_REGISTRATION_RECORD
   +0x008 StackBase        : Ptr64 Void
   +0x010 StackLimit       : Ptr64 Void
   +0x018 SubSystemTib     : Ptr64 Void
   +0x020 FiberData        : Ptr64 Void
   +0x020 Version          : Uint4B
   +0x028 ArbitraryUserPointer : Ptr64 Void
   +0x030 Self             : Ptr64 _NT_TIB
The TlsLinks member of type _LIST_ENTRY should ring a bell if you read my linked list post. Yup, this is a list head! Makes you itch to see where it leads, but we’ll resist that temptation in favour of looking at three members that I’d like to call out today:
- Self (within _NT_TIB, which is the Thread Information Block): the linear address of the TEB itself. I’ll explain the point of that below.
- UniqueThread (within ClientId): this is the Windows thread ID, that much beloved signature of spinlock acquisition.
- TlsSlots: These 0x200 bytes represent the allocated storage for the first sixty-four TLS slots. Although up to 1088 may be available altogether, the first 64 are absolutely guaranteed, and they happen to live right here inside the TEB as a contiguous array. This makes it very tempting to access TLS directly through the TEB rather than calling the Windows TlsGetValue function, which is the official gateway to TLS.
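If you want to convince yourself that the offsets in the dump hang together, the following C sketch models the two small child structures and checks the arithmetic. The `model_` types and the constant names are my own invention for illustration and are not the Windows header definitions; every Ptr64 is simply modelled as a `uint64_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of _NT_TIB as shown in the debugger dump. */
typedef struct model_nt_tib {
    uint64_t ExceptionList;        /* +0x000 */
    uint64_t StackBase;            /* +0x008 */
    uint64_t StackLimit;           /* +0x010 */
    uint64_t SubSystemTib;         /* +0x018 */
    union {                        /* +0x020: two views of the same bytes */
        uint64_t FiberData;
        uint32_t Version;
    } u;
    uint64_t ArbitraryUserPointer; /* +0x028 */
    uint64_t Self;                 /* +0x030 */
} model_nt_tib;

/* Illustrative model of _CLIENT_ID. */
typedef struct model_client_id {
    uint64_t UniqueProcess;        /* +0x000 */
    uint64_t UniqueThread;         /* +0x008 */
} model_client_id;

/* TEB offsets copied from the dump; the names are hypothetical. */
enum {
    TEB_CLIENT_ID_OFFSET  = 0x40,
    TEB_TLS_SLOTS_OFFSET  = 0x1480,
    TEB_TLS_LINKS_OFFSET  = 0x1680,
    TLS_INLINE_SLOT_COUNT = 64
};

/* Compile-time check that the model agrees with the dump. */
static_assert(offsetof(model_nt_tib, Self) == 0x30, "Self should sit at +0x30");

/* Offset of the Windows thread ID, measured from the start of the TEB. */
size_t teb_thread_id_offset(void) {
    return TEB_CLIENT_ID_OFFSET + offsetof(model_client_id, UniqueThread);
}

/* Size in bytes of the slot array that lives inside the TEB. */
size_t tls_inline_slots_size(void) {
    return TLS_INLINE_SLOT_COUNT * sizeof(uint64_t);
}
```

Sixty-four 8-byte slots come to 0x200 bytes, which is exactly the gap between TlsSlots at +0x1480 and TlsLinks at +0x1680.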
So I just declared that the TEB lives at a well-known location in memory. Yet there is a separate TEB per thread, and all threads share the same address space. What gives?
If you are a child of the eighties, you may recall the segmented memory model of the 16-bit days, when a process would have a separate Code Segment, Data Segment and Stack Segment, and the combination of segment registers and 16-bit addresses helped us achieve a 20-bit address space. Nowadays Windows uses a flat memory model; the segment registers are still there in the background, but for the most part they resolve to the same base address, so address 0 in the data segment is the same as address 0 in the code segment, rendering the question of which segment you’re in irrelevant. The exception here is the GS segment register on x64, or FS if you’re singing along on x86. Here the Windows kernel does a lovely trick. Upon switching in a thread, it points GS to that thread’s TEB, so that reading address 0 with GS as segment base gives you immediate access to the TEB. All your threads can look into the same mirror, and – surprise, surprise! – each one sees something else.
This also contextualises why having a Self member is very convenient. We don’t want to be restricted to doing all TEB access through GS, and the value stored at Self is the “true” linear address we otherwise know by the code name GS:[0]. With all that in mind then, here are three TEB references as seen in x64 assembly in the wild:
- mov ebx, GS:[48h] loads the current Windows thread ID into the ebx register.
- mov rcx, GS:[30h] loads the linear address of the TEB into rcx.
- Assume that we have previously requested a TLS slot and been allocated slot 3. We know that this slot’s location is 0x1498 into the TEB. So assuming rcx has been loaded with the TEB’s linear address as above, mov rdx, [rcx+1498h] loads the value of TLS slot 3 into rdx.
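The three references above can be mimicked in portable C against a mock TEB, i.e. a plain byte buffer that we populate ourselves. This is a sketch for following the offset arithmetic only: the buffer, the helper names and the constants are hypothetical, and nothing here touches a real TEB or the GS register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    OFF_SELF         = 0x30,    /* NtTib.Self */
    OFF_THREAD_ID    = 0x48,    /* ClientId.UniqueThread */
    OFF_TLS_SLOTS    = 0x1480,  /* TlsSlots[0] */
    TLS_INLINE_SLOTS = 64,
    MOCK_TEB_SIZE    = 0x1690   /* enough to cover everything up to TlsLinks */
};

/* Offset of an in-TEB TLS slot: start of the array plus 8 bytes per slot. */
uint32_t tls_slot_offset(uint32_t slot) {
    assert(slot < TLS_INLINE_SLOTS);
    return OFF_TLS_SLOTS + slot * 8u;
}

/* Read 8 bytes at base+offset, in the spirit of mov reg, [base+offset]. */
uint64_t read_qword(const unsigned char *base, uint32_t offset) {
    uint64_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

void write_qword(unsigned char *base, uint32_t offset, uint64_t value) {
    memcpy(base + offset, &value, sizeof value);
}

/* Populate a mock TEB with its own address, a thread ID and one slot value. */
void mock_teb_init(unsigned char *teb, uint64_t thread_id,
                   uint32_t slot, uint64_t slot_value) {
    memset(teb, 0, MOCK_TEB_SIZE);
    write_qword(teb, OFF_SELF, (uint64_t)(uintptr_t)teb);
    write_qword(teb, OFF_THREAD_ID, thread_id);
    write_qword(teb, tls_slot_offset(slot), slot_value);
}

/* Mirrors mov rcx, GS:[30h] followed by mov rdx, [rcx + slot offset];
   gs_base plays the part of the segment base. */
uint64_t read_slot_via_self(const unsigned char *gs_base, uint32_t slot) {
    const unsigned char *self =
        (const unsigned char *)(uintptr_t)read_qword(gs_base, OFF_SELF);
    return read_qword(self, tls_slot_offset(slot));
}
```

Calling `tls_slot_offset(3)` gives 0x1498, the same displacement that appears in the third instruction.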
A toy TLS example
To close the circle, let’s revisit the simple example where each thread has a name in the form of an arbitrary string. As we now understand, this attribute belongs in TLS, because the aim is for any arbitrary code (in whatever modules or classes) to have access to the thread identity. Here is how it could be implemented:
- During process initialisation, we’d ask Windows to allocate a slot; let’s say we get assigned slot 3. Assign the value 3 to a global variable g_MyNameSlotNumber.
- From this moment on, any thread in our process has the right to store something of its own choice in slot 3. It will be referred to as slot 3 by all threads, but each thread’s TEB, hence slot array, is distinct and lives in its own chunk of memory.
- When a new thread is created, it is (somehow!) told its name in the form of a pointer to a string. The thread initialisation code saves this pointer in slot 3.
- From this point on, each thread (irrespective of where it is in its call stack, and what methods are running) can retrieve its name by looking up slot number g_MyNameSlotNumber (=3) in its TLS.
So this is rather artificial – for one thing it is more sensible to store pointers to objects richer than mere strings in TLS – but hopefully you get the idea.
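For the curious, the toy example could be sketched roughly as follows. To keep it portable I have swapped in the POSIX equivalents of the Windows calls, so `pthread_key_create`, `pthread_setspecific` and `pthread_getspecific` stand where a Windows program would use TlsAlloc, TlsSetValue and TlsGetValue. Apart from `g_MyNameSlotNumber`, which comes from the steps above, the function and struct names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { MAX_DEMO_THREADS = 8 };

/* The global from the text: which slot holds the name. Here it is a POSIX key. */
static pthread_key_t g_MyNameSlotNumber;
static pthread_once_t g_slot_once = PTHREAD_ONCE_INIT;

/* Process initialisation: allocate the slot exactly once (Windows: TlsAlloc). */
static void allocate_name_slot(void) {
    int rc = pthread_key_create(&g_MyNameSlotNumber, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Thread initialisation: save the name pointer in the slot (Windows: TlsSetValue). */
int set_my_name(const char *name) {
    pthread_once(&g_slot_once, allocate_name_slot);
    return pthread_setspecific(g_MyNameSlotNumber, name);
}

/* Callable from any depth of the call stack (Windows: TlsGetValue). */
const char *who_am_i(void) {
    pthread_once(&g_slot_once, allocate_name_slot);
    return pthread_getspecific(g_MyNameSlotNumber);
}

struct named_thread {
    const char *name;  /* in: the name the thread is told at creation */
    const char *seen;  /* out: what who_am_i() returned inside that thread */
};

static void *thread_entry(void *arg) {
    struct named_thread *t = arg;
    if (set_my_name(t->name) == 0)
        t->seen = who_am_i();
    return NULL;
}

/* Start one thread per entry and wait for all of them; returns 0 on success. */
int run_named_threads(struct named_thread *threads, int count) {
    pthread_t ids[MAX_DEMO_THREADS];
    int started = 0;
    int rc = 0;
    if (count < 0 || count > MAX_DEMO_THREADS)
        return -1;
    for (; started < count; started++) {
        rc = pthread_create(&ids[started], NULL, thread_entry, &threads[started]);
        if (rc != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(ids[i], NULL);
    return rc;
}
```

Every thread uses the same slot, yet each reads back only the pointer it stored itself, and the name set by the creating thread is left untouched by its children.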
This post was ultimately about thread-local storage, but I hope I got the point across that user-visible TLS is just one aspect of thread identity, and that much other stuff in the Thread Environment Block is a flavour of TLS used internally by the operating system. When we go on to consider the role of TLS in SQLOS, the line between “user TLS” and “system TLS” gets crossed a lot. If you stayed with me this far, you’ll enjoy where we’re heading next.