Unsung SQLOS: linked lists

I’m slowly working towards some more juicy subjects again, but for the moment it is sensible to get fundamentals out of the way. The use of linked lists by SQL Server and SQLOS is a great place to start, because so much is built upon them. Grok this and you’re set for countless hours of fun.

From general exposure to SQLOS scheduling, many of us are familiar with concepts like “the worker is put on the runnable list”. At a high level this is simple to grasp, and the good news is that not much is hidden from us in the implementation detail. Nonetheless, it seems that it’s only systems programmers who deal with these details nowadays, but it’s still useful for the rest of us to get a chance to get comfortable with such internals.

Linked list implementation in SQLOS

A SQLOS doubly linked list follows a common Windows pattern based on a ListEntry structure. This is remarkably simple, containing only two pointer-sized members, i.e. taking up 16 bytes on x64:

flink (forward link) – a pointer to the next entry in the list
blink (backward link) – a pointer to the previous entry in the list

Any number of list entries can be daisy-chained together, with the first one’s blink pointing at the last one, and conversely, the last one’s flink pointing at the first. To gain a foothold into traversing the list, we need one special list entry that can serve as the list head; once a piece of code has a reference to this list head, it becomes possible to iterate through all the list’s members. Although in sense the list head is just another list entry in terms of traversal logic, it is necessary to keep special track of it. If nothing else, given the circular nature of the list, consider this: if we start going around the list from the list head, how would we know we have reached the end if we didn’t take note of the starting point?

Setting it apart from regular list entries, the list head can also be in the peculiar state of designating an empty list. Simple stuff – if flink == blink, or both are set to zero (two possible valid states for a newly initialised list head), the list must clearly be empty. Move along, nothing to see here.

On its own, a set of list entries is pretty meaningless, because although they are linked together, they contain no payload. It isn’t a list of anything; instead it is simply an abstract embodiment of listhood. Let’s put some meat on those bones.

Simplified linked list example

As a simple example, let’s say we want to make a list of three 64-bit integers. Think of it as a overblown array. Each payload object takes 8 bytes of storage, and we want to link a bunch of them together, so we somehow need to graft that payload on to a list entry. Let’s then extend the list entry structure, yielding a class which is laid out in memory like this, with numbers representing the offset from the starting address of
the object:

Offset	Field
0x00	flink
0x08	blink
0x10	payload

I’m using hexademical here because it’s just the normal representation for memory addresses and offsets. Besides, it yields nicer looking round numbers. So putting three of them together along with a list head gives us something like this:

Linked list with simple payload

You can also show the same layout in a way that highlights the compound nature, which is in fact closer to what the source code would look like:

Offset	Member
0x00	listentry
0x10	payload

With me? We’re nearly there. Just remember that this is nothing but a dump of memory locations with descriptive metadata attached. And that is metadata which we can keep in mind and which a compiler knows during the act of compilation, but which won’t necessarily show up in the compiled executable or the memory space of the running process. All that remains of these neatly packaged objects in that process space is a bunch of bytes, but the code that deals with those objects knows the rules of engagement, so all is well.

Turning it up a notch

There is one more twist that shows up in practice. The above example grafts the payload on to the ListEntry structure, and while it works fine, it puts the focus on the ListEntry structure rather than the thing that needs the ability to be part of a list. More seriously, it means that an object can only be a member of one list at a time, which is an artificial restriction. To fast-forward to a concrete example, a SQLOS Worker object can be on a list belonging to (i.e. having a list head within) an SOS_Node, but it could simultaneously be on the waiter list for a memory grant. This is problematic.

The solution lies in inverting the relationship between an object and a ListEntry: instead of embedding the Worker in a ListEntry, we embed a ListEntry in the Worker. Referring back to the diagram above, this means we stop thinking about a ListEntry that contains an integer payload and start thinking about an integer that is prefixed with a ListEntry member linking it to other integers – let’s call it the LinkableInteger class. This is actually very subtle: since the compound class and the ListEntry have the same starting address, we can treat a pointer to the address polymorphically as pointing to either a ListEntry (giving list membership semantics) or a LinkableInteger (giving access to the payload).

The interchangability between ListEntry and LinkableInteger breaks once we put the ListEntry anywhere other than at the start of the object. Let’s change that class layout around so that it starts with the integer payload:

Example of adding a simple payload to a ListEntry

I have retained the linking arrows as indicating that each LinkableInteger has a reference to another, but this isn’t strictly true in the implementation detail. We now have ListEntries referring to each other but buried within the objects they aim to link, so the pointers aren’t pointing at the payload anymore. Fortunately we know the class layout, so a bit of simple arithmetic does the trick: given a LinkableInteger, we can find the next ListEntry in the chain by looking at flink within the current object, i.e. the current object’s address plus eight bytes. The value retrieved can be used directly to keep traversing the list, or if we subtract eight bytes from it, we can get back at the containing object. While this might sound tricky, this -8 adjustment is a simple piece of metadata for the compiler to keep track of, and it gets baked into the object code very cheaply without needing more indirection and/or lookups in runtime metadata. Have a look for documentation of the CONTAINING_RECORD macro to see how this is done in Windows C programming.

With that out of the way, linked list manipulation isn’t that hard. Here are some common operations one can do using a reference to a list head:

Insert an item at the head of the list. We set the blink of the old first entry (with the latter found by following flink in the list head) to the new item, and complete that side of the link by pointing the new item’s flink to the old first entry. Likewise, point the list head’s flink to the new item and the new item’s blink to the list head.
Check if there is anything in the list. This is done by checking whether flink and blink of the list head are equal.
Remove an item from the tail of the list. Here we find that tail entry by following blink from the list head. The list head’s blink is set the value currently held in that tail item’s blink (i.e. the second-to-last item) and the flink of that second-to-last item is set to point to the list head. In the special case where there was only one item on the list, this will have set the list head’s flink and blink to be equal, with both pointing at the list head itself. The address of the object containing the removed ListItem can be returned as a pointer to a dequeued object.

The memory layout of a linked list doesn’t imply specific usage semantics. If we consistently insert at the head and remove from the tail, we have a queue. If we both insert and remove items from the head, we have a stack. And it is possible to have variations of these as well.

Finally, it is clear that insert and remove operations are multi-step, and the list is in an inconsistent state – i.e. not safe to traverse or modify – in the middle of such an operation. For this reason, locking semantics must be implemented. This will typically take the form of a spinlock which must be aquired before trying to access the list for any purpose. The object which owns the list head will then normally have a spinlock as a data member associated with the list head, although it is possible to have one spinlock protect multiple items beyond just a single linked list; this could be a sign of sane design, but conversely it means a coarser locking grain, which can sometimes work against you.

Worker::RemoveFromSuspendQueue – an example from the wild

LinkableInteger is a contrived example, because it is rather overblown to have sixteen bytes of overhead to track and manage an eight-byte payload. I’ll pick up the case of the SQLOS worker here as an example of a fairly large and complex object that participates in linked lists. Actual details are cited as they exist in SQL Server 2014 – it goes without saying that this is just for illustration, because this is specific implementation detail of an undocumented area.

The Worker class is, uh, at least three times bigger than LinkableInteger. In fact, we’re talking about a 2824-byte structure here, containing all kinds of juicy state including 1752 bytes of wait information (you can tell that accounting for waits is a big deal to the people who wrote this class). One of its methods is RemoveFromSuspendQueue, which does just that. For the moment, I’ll sidestep the issue of what the suspend queue is, because we are purely interested in the mechanics of interacting with it as a linked list. Here are the relevant members of the Worker class:

A ListEntry, viz a flink + blink pointer pair
A pointer to the list head of the suspend queue
A pointer to the spinlock which controls access to this suspend queue

Without those last two pointers, the Worker could be inserted into any linked list on the premise that the code owning (or at least knowing how to find) the list head can do what it needs with enlisted Workers. However, if we had nothing but the ListEntry within (i.e. known by) the Worker, it would be impossible for the Worker class’s own methods to interact with the list: it wouldn’t be able to take a spinlock, and it wouldn’t be able to distinguish between sibling ListEntries (workers) and the ListHead itself, buried within another class. So in this case we have an extension to the bare minimum needed to engage with a linked list.

Here is a pseudocode outline of the Worker class’s RemoveFromSuspendQueue method.

if the pointer to the suspend queue list head isn't 0 (i.e. the worker is noted as being on a suspend queue) 
{
	acquire the associated SOS_SUSPEND_QUEUE spinlock pointed to
	if the list head's blink == 0 
	{ 
		release spinlock 
		return error code 
	}
	traverse the list from the tail (counterclockwise per above diagrams), stopping either when we're back at the list head or when we found the current worker
	if the current worker was found 
	{ 
		unlink it from the list
		set the worker's ListEntry flink and blink both to 0
		release spinlock
		return success 
	}
	if we got all the way back to the list head without finding the worker 
	{ 
		release spinlock
		return error code 
	}
}

This is one of the cases where the ListEntry in question is the very first member of the worker, so each flink and blink refers not only to the adjoining ListEntry but can directly be used as the address of the containing worker. Intuitively, this is reminiscent of a clustered index: the link takes you directly where you want to be without having to do any further dereferencing. Continuing the comparison, the retrieval of the ListEntry also gives you a chunk of the containing object within the same cache line, hinting that careful placement and alignment of a ListEntry within a class can be a conscious performance optimisation to leverage L1 cache.

Final thoughts

Although simple, and many levels removed from the abstractions we normally engage with, I hope that this little exploration gave a taste of where things are heading, because it will be the first of a series. Next stop will be the shy but spectacularly important SystemThread class.