A mutex, short for “mutual exclusion”, is arguably the simplest waitable synchronisation construct you can imagine. It exposes methods for acquisition and release, and the semantics are straightforward:
- Initially it isn’t “owned”, and anybody who asks to acquire it is granted ownership
- While owned, anybody else who comes along and tries to acquire it must wait her turn
- When the owner is done with it, the mutex is released, which then transfers ownership to one waiter (if any) and unpends that waiter
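Those three rules map directly onto any mutex API. As a quick illustration, here they are acted out with Python's standard threading.Lock — nothing SQLOS-specific, just the semantics:

```python
import threading

m = threading.Lock()

# Initially it isn't owned: the first acquisition attempt succeeds immediately.
assert m.acquire(blocking=False) is True

# While owned, another attempt must wait; a non-blocking attempt simply fails.
assert m.acquire(blocking=False) is False

# Releasing it transfers ownership to the next acquirer.
m.release()
assert m.acquire(blocking=False) is True
m.release()
```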
A mutex can also validly be referred to as a critical section, in the sense that it protects a critical section of code, or more accurately, data. When programming libraries expose both a mutex and a critical section, as Windows does, it really just reflects different implementations of synchronisation objects with the same semantics. You could also consider a spinlock to be a flavour of mutex: while the name “spinlock” describes the mechanism by which competing threads jostle for exclusive ownership (it can’t be politely waited upon), the idea of mutual exclusion with at most one concurrent owner still applies.
SOS_Mutex class layout and interface
This class is directly derived from EventInternal<SuspendQSlock>, with three modifications:
- The addition of an ExclusiveOwner member.
- The override of the Wait() method to implement mutex-specific semantics, although the main act of waiting is still delegated to the base class method.
- The addition of an AddAsOwner() method, called by Wait(), which crowns the ambient task as the exclusive owner after a successful wait.
Put differently, we can build a mutex from an auto-reset EventInternal by tacking on an owner attribute, making a rule that only the owner has the right to signal the event, and adding assignment of ownership as a fringe benefit of a successful wait. A nonsignalled event means an acquired mutex, and a signalled event means that the next acquisition attempt will succeed without waiting, since nobody currently owns the mutex. The end result is that our SOS_Mutex class exposes the underlying event’s Signal() method and its own take on Wait(). From the viewpoint of the mutex consumer, the result of a successful wait is that it owns the mutex, and it should act honourably by calling Signal() as soon as it is done using the resource that the mutex stands guard over.
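To make that construction concrete, here is a toy Python sketch of a mutex built exactly this way: an auto-reset event plus an owner attribute, with the real waiting delegated to the event. The class and member names merely echo the ones discussed above — this is an illustration of the idea, not SQLOS's actual implementation:

```python
import threading

class AutoResetEvent:
    """Signalled -> exactly one waiter gets through, and the event resets itself."""
    def __init__(self, signalled=True):
        self._cond = threading.Condition()
        self._signalled = signalled

    def wait(self):
        with self._cond:
            while not self._signalled:
                self._cond.wait()
            self._signalled = False      # auto-reset: consume the signal

    def signal(self):
        with self._cond:
            self._signalled = True
            self._cond.notify(1)         # wake at most one waiter

class SOSMutexSketch:
    """A mutex as an auto-reset event plus an owner attribute (names illustrative)."""
    def __init__(self):
        self._event = AutoResetEvent(signalled=True)  # signalled == nobody owns it
        self.exclusive_owner = None

    def wait(self):
        self._event.wait()                            # the hard work lives here
        # The moral equivalent of AddAsOwner(): crown the caller after a good wait.
        self.exclusive_owner = threading.current_thread()

    def signal(self):
        self.exclusive_owner = None
        self._event.signal()                          # unowned again; unpend one waiter

m = SOSMutexSketch()
m.wait()                                              # acquire: succeeds immediately
assert m.exclusive_owner is threading.current_thread()
m.signal()                                            # release: back to unowned
assert m.exclusive_owner is None
```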
All the difficult code lives in the base EventInternal class, and that includes interaction with the underlying OS layers, viz SystemThreads, Workers and SOS_Schedulers. Here we see the benefits of a layered system: the SOS_Mutex can concentrate on implementing mutex-specific semantics, and as long as there is some form of EventInternal implementation written for the platform in question (SQLOS) the mutex code doesn’t need to sweat the details, much like the mutex consumer shouldn’t need to worry about how the mutex is implemented. As such, I strongly suspect that the Linux vs Windows distinction is invisible at the source code level of a class like SOS_Mutex; this likely applies to EventInternal itself.
Internals and trivia
Semantically, waiting on a bare event doesn’t carry the idea that “someone else” is blocking you: you are simply waiting for something to happen. However, if you find yourself having to wait on a mutex, there is a clear guilty party, namely the current owner. As such, while EventInternal indirectly manages the workflow of putting a worker to sleep and later waking it up, SOS_Mutex adds the responsibility of pointing an accusing finger at the blocking task by logging that task’s address (strictly speaking a pointer to that address, i.e. an SOS_Task **) in the worker before it is put to sleep. This can then be exposed to/via sys.dm_os_waiting_tasks, plus of course to any interested internal processes. Upon waking up, just before the call to Wait() finally returns, this is cleared again.
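That blame-logging pattern — record the blocker before sleeping, clear it just before returning — can be sketched minimally as follows. The worker structure and member names here are hypothetical stand-ins, not the real SQLOS ones:

```python
class WorkerSketch:
    """Hypothetical stand-in for a SQLOS worker."""
    def __init__(self):
        self.blocked_by = None   # surfaced via sys.dm_os_waiting_tasks while set

def blame_and_wait(worker, owner_ref, do_wait):
    """Point the accusing finger before sleeping; clear it before returning."""
    worker.blocked_by = owner_ref    # who is blocking us
    try:
        do_wait()                    # the actual wait is delegated (EventInternal's job)
    finally:
        worker.blocked_by = None     # cleared just before Wait() returns

w = WorkerSketch()
seen = []
blame_and_wait(w, "owner-task-address", lambda: seen.append(w.blocked_by))
assert seen == ["owner-task-address"]   # visible for the duration of the wait...
assert w.blocked_by is None             # ...and gone once the wait returns
```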
The ability of a Wait() call on an EventInternal to carry a timeout propagates into SOS_Mutex::Wait(). As such, it becomes possible for a wait attempt to fail due to a timeout, apart from the inevitable failure caused in the base class’s Wait() if the worker is not allowed to suspend but a suspend is necessary.
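The timeout pass-through can be sketched like this. Again the names are illustrative, and a plain condition variable stands in for a real EventInternal; the point is that Wait() returns a success/failure indication rather than always succeeding:

```python
import threading

class TimedMutexSketch:
    """Illustrative only: a mutex whose wait() can fail with a timeout,
    mirroring how SOS_Mutex::Wait() propagates its timeout downwards."""
    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None

    def wait(self, timeout=None):
        with self._cond:
            ok = self._cond.wait_for(lambda: self._owner is None, timeout)
            if ok:
                self._owner = threading.current_thread()
            return ok                     # False == the wait attempt timed out

    def signal(self):
        with self._cond:
            self._owner = None
            self._cond.notify(1)

m = TimedMutexSketch()
assert m.wait(timeout=0.01) is True       # unowned: immediate success
result = []
t = threading.Thread(target=lambda: result.append(m.wait(timeout=0.05)))
t.start(); t.join()
assert result == [False]                  # owned by someone else: the wait times out
m.signal()
```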
Like the bare EventInternal, an SOS_Mutex represents a waiting mechanism and not a wait type. Another parameter to Wait() is the usual pointer to an SOS_WaitInfo structure, and it is this which carries the wait type against which waiting will be logged – this pointer is simply passed through to the base class method.
One last interesting thing about SOS_Mutex is that it is implemented in both sqldk and sqlmin, so strictly speaking there are two separate classes. The implementations are nearly identical, although sqlmin adds an extra parameter to Wait(). In the sqldk version, a waiter is always inserted at the tail of the suspend queue, i.e. the oldest waiter will be the first in line to be given ownership and unpended. However, the sqlmin version allows the consumer to specify whether the suspend queue insert should be at the head or the tail (something implemented in EventInternal, so the parameter is again just passed through), meaning that a consistent LIFO pattern or even FIFO with queue-jumping is an option for sqlmin mutex consumers.
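The effect of head vs tail insertion on wake-up order is easy to demonstrate with a plain double-ended queue — a model of the suspend queue, not its actual implementation:

```python
from collections import deque

def wake_order(arrivals, insert_at_head):
    """Return the order in which pended waiters would be granted ownership.
    Ownership always goes to whoever is at the head when the mutex is signalled."""
    q = deque()
    for waiter in arrivals:
        if insert_at_head:
            q.appendleft(waiter)   # head insert: the newest waiter jumps the queue
        else:
            q.append(waiter)       # tail insert: the oldest waiter is served first
    return [q.popleft() for _ in range(len(q))]

assert wake_order(["A", "B", "C"], insert_at_head=False) == ["A", "B", "C"]  # FIFO
assert wake_order(["A", "B", "C"], insert_at_head=True)  == ["C", "B", "A"]  # LIFO
```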
Mutexes vs spinlocks
It is instructive to see how this argument plays out within the Windows kernel, and then compare it to the situation in SQLOS. Windows kernel spinlocks are the only synchronisation option available in high-IRQL situations within kernel code (e.g. drivers). This basically means that a spinlock has to be used if we’re in a situation where running of the Windows scheduler is suppressed, i.e. when your critical section of code physically can’t yield the processor to another thread.
Fortunately things are more relaxed up in user mode. Within SQLOS we deal with two schedulers: the ambient SQLOS scheduler as well as the Windows scheduler. Up here we have no way of suppressing the Windows scheduler from running, but since we are in complete control of our SQLOS scheduler, suppressing its operation is trivial: simply don’t call methods which will enter the scheduler. To be sure, persistently avoiding this beyond the bounds of one’s fair quantum is dishonourable by the rules of the ecosystem, but it’s not enforced. As such, the choice between a spinlock and a mutex is more a case of style and situational awareness. We spin when we know that the cost of spinning is going to be low to the rest of the ecosystem, compared to the benefit to us of not giving up our quantum. Entering the SOS scheduler while holding a spinlock would be a really bad idea though, because of the social promise that the spinlock will be released for consumption by others very quickly, without us going to sleep holding it. (This is a completely different issue than backing off while attempting to acquire it, by the way.) Here cooperative scheduling means that common sense must prevail. A mutex is always safe to everybody, although it might be suboptimal to the guy who wants to acquire it, but using a spinlock requires a healthy amount of common sense.
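For contrast, here is the classic spinlock shape: acquisition busy-waits instead of suspending, which is only sociable when the critical section is tiny and the holder never goes to sleep while holding it. This is a Python model for illustration only — a real spinlock is built on an atomic test-and-set instruction, not a wrapped lock:

```python
import threading

class SpinlockSketch:
    """A busy-wait lock: competing threads jostle rather than suspending.
    Illustrative sketch; the inner Lock stands in for an atomic flag."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass                         # spin: burn CPU instead of yielding

    def release(self):
        self._flag.release()

counter = 0
lock = SpinlockSketch()

def bump(n):
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1                     # critical section: kept deliberately short
        lock.release()

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 4000                   # every increment was mutually excluded
```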
The plain vanilla SOS_Mutex shows up occasionally as the synchronisation weapon of choice. By putting a breakpoint on sqldk!SOS_Mutex::Wait() I quickly found its use by the MemoryPoolManager class, which controls a global resource and is invoked by multiple consumers. Here are the bottom bits of the thread stacks of a resource monitor and a lazywriter task respectively:
sqldk!SOS_Mutex::Wait
sqldk!MemoryPoolManager::AdjustMemoryUsage
sqldk!MemoryNode::RecalculateTarget
sqldk!ResourceMonitor::CheckIndicators
sqldk!ResourceMonitor::ResourceMonitorTask
sqldk!SOS_Mutex::Wait
sqldk!MemoryPoolManager::AdjustMemoryUsage
sqldk!MemoryNode::RecalculateTarget
sqlmin!BPool::LazyWriter
As noted previously, the fact that an SOS_Mutex was used as the synchronisation mechanism to protect MemoryPoolManager member data isn’t normally visible. What is visible is that the Wait() call above initialises an SOS_WaitInfo structure with a wait type of SOS_MEMORY_USAGE_ADJUSTMENT, and as I write this, the SQLskills wait type library doesn’t have much information about it. This doesn’t tell me that the wait type is underdocumented: it simply means that we have good old-fashioned synchronisation against a global resource where contention has never been a significant issue, and we shouldn’t worry about it until such time as it rears its head.
Because the class is so simple, I have managed to go into a bit more detail than usual. The takeaway should be an underlining of just how much magic is locked up in EventInternal. There are other synchronisation classes built on EventInternal, but SOS_Mutex is the obvious introductory one. Next step is to tackle a more involved synchronisation object before we eventually start exploring the internals of scheduler context switching.