Unsung SQLOS: The SOS_UnfairMutexPair behind CMEMTHREAD waits

As with the droppings of the Questing Beast, we recognise synchronisation code paths by their emissions. But even when not leaving telltale fewmets, these creatures wander among us unseen, unsung, until shutdown doth us part.

The place of the SOS_UnfairMutexPair

Bob Dorr has done a great job of illustrating why thread-safe SQLOS memory objects need serialised access in How It Works: CMemThread and Debugging Them. This has been followed up by the “It just runs faster” coverage of dynamic memory partitioning.

Today I’ll be examining the SOS_UnfairMutexPair, the synchronisation object behind memory allocations. While I’m going to describe synchronisation as a standalone subject, it’s useful to think of it as the CMEMTHREAD wait; to the best of my knowledge nothing other than memory allocations currently uses this class.

For context, I have previously described a bunch of other synchronisation classes.

One picture which emerges from all these synchronisation flavours is that a developer can choose between busy waiting (eagerly burning CPU), politely going to sleep (thread rescheduling), or a blend of the two. As we’ll see, the SOS_UnfairMutexPair goes for a peculiar combination, which was clearly tuned to work with the grain of SQLOS scheduling.

The shape of the object

With the exception of the central atomic lock member, everything comes in pairs here. The class models a pair of waitable objects, each having an associated pair of members to note which scheduler and task currently owns it:

Layout of the SOS_UnfairMutexPair (2016 incarnation)

Although it exposes semantics to acquire either both mutexes or just the second, in its memory allocation guise we always grab the pair, and it effectively acts as a single mutex. I’ll therefore restrict my description to one half of the pair.

Acquisition: broad outline

The focal point of the mutex’s state – in fact one might say the mutex itself – is the single Spinlock bit within the 32-bit lock member. Anybody who finds it zero, and manages to set it to one atomically, becomes the owner.

Additionally, if you express an interest in acquiring the lock, you need to increment the WaiterCount, whether or not you managed to achieve ownership at the first try. Finally, to release the lock, atomically set the spinlock to zero and decrement the WaiterCount; if the resultant WaiterCount is nonzero, wake up all waiters.

Now one hallmark of a light-footed synchronisation object is that it is very cheap to acquire in the non-contended case, and this class checks that box. If not owned, taking ownership (the method SOS_UnfairMutexPair::AcquirePair()) requires just a handful of instructions, and no looping. The synchronisation doesn’t get in the way until it is needed.
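
To make that fast path concrete, here is a minimal sketch of what a single-shot pair acquisition could look like, expressed in portable C++ with std::atomic and using the magic constants quoted in the compare-and-swap section further down. The function name and framing are my own illustration, not the actual SQLOS code.

    #include <atomic>
    #include <cstdint>

    // Constants from the compare-and-swap trivia below: the top bit of each
    // 16-bit half is a lock bit, the remaining 15 bits hold a waiter count.
    constexpr uint32_t kLockBits      = 0x80008000;  // both lock bits
    constexpr uint32_t kLockAndWaiter = 0x80018001;  // grab both locks, register as waiter

    // Single-shot fast path: no loop, just one compare-and-swap attempt.
    bool TryAcquirePairFast(std::atomic<uint32_t>& lockWord)
    {
        uint32_t expected = lockWord.load(std::memory_order_relaxed);
        if ((expected & kLockBits) != 0)
            return false;   // somebody owns it; the caller drops into LongWait()
        // Set both lock bits and bump both waiter counts in one atomic operation.
        return lockWord.compare_exchange_strong(expected, expected + kLockAndWaiter,
                                                std::memory_order_acquire);
    }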

However, if the lock is currently owned, we enter a more complicated world within the SOS_UnfairMutexPair::LongWait() method. This broadly gives us a four-step loop, sketched in code below the list:

  1. If the lock isn’t taken at this moment, re-check and grab it, simultaneously incrementing WaiterCount, then exit triumphantly, holding aloft the prize.
  2. Fall back on only incrementing WaiterCount for now, if this is the first time around the loop and the increment has therefore not been done yet.
  3. Now wait on the EventInternal, i.e. yield the thread to the scheduler.
  4. Upon being woken up by the outgoing owner releasing the lock as described above, try again to acquire the lock. Repeat the whole loop.
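
Here is the promised sketch of that loop. Everything below is hedged: the helper for adding a delta to the lock word and the toy event standing in for the EventInternal are my own inventions, and the spinning described in the next section is left out for clarity.

    #include <atomic>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Toy stand-in for the EventInternal: a signal-all, auto-reset event.
    struct EventLike
    {
        std::mutex m;
        std::condition_variable cv;
        bool signalled = false;
        void Wait()
        {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return signalled; });
            signalled = false;
        }
        void Signal()
        {
            { std::lock_guard<std::mutex> l(m); signalled = true; }
            cv.notify_all();
        }
    };

    // One compare-and-swap attempt: add `delta`, optionally only if the lock bits are clear.
    static bool TryAdd(std::atomic<uint32_t>& lockWord, uint32_t delta, bool onlyIfFree)
    {
        uint32_t expected = lockWord.load(std::memory_order_relaxed);
        if (onlyIfFree && (expected & 0x80008000u) != 0)
            return false;
        return lockWord.compare_exchange_strong(expected, expected + delta,
                                                std::memory_order_acq_rel);
    }

    // Hypothetical shape of LongWait(); the real method also spins before sleeping.
    void LongWaitSketch(std::atomic<uint32_t>& lockWord, EventLike& eventInternal)
    {
        bool registered = false;                  // have we incremented WaiterCount yet?
        for (;;)
        {
            // Step 1: if the lock looks free, grab it (bumping WaiterCount if still needed).
            if (TryAdd(lockWord, registered ? 0x80008000u : 0x80018001u, /*onlyIfFree=*/true))
                return;                           // we own the mutex

            // Step 2: first time around, at least register our interest
            // (the CAS is retried until it sticks, as per the pattern further down).
            while (!registered)
                registered = TryAdd(lockWord, 0x00010001u, /*onlyIfFree=*/false);

            // Step 3: yield the thread to the scheduler until the owner signals the waiters.
            eventInternal.Wait();

            // Step 4: woken up by the outgoing owner; go around and try again.
        }
    }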

The unfairness derives from the fact that there is no “first come, first served” rule; in other words, the wait list isn’t a queue. This is not a very British class at all, but as we’ll see, there is a system within the chaos.

The finicky detail

Before giving up and waiting on the event, a would-be owner does a bit of aggressive spinning on the spinlock. As is standard with spinlocks, this burns CPU on the optimistic premise that it won’t have to do so for long, so it’s worth a go. However, the number of spin iterations is limited. Here is a slight simplification of the algorithm, with a code sketch to follow:

  • If the scheduler owning the lock is the ambient scheduler, restrict to a single spin.
  • Else, give it a thousand tries before sleeping.
  • Each spin involves first trying to grab the lock and (if not yet done) increment WaiterCount in a single atomic operation. If that doesn’t work, just try and increment the WaiterCount.
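
Expressed as a sketch (with the scheduler ids passed in as parameters here; the real class reads the owner’s scheduler from its own members), the spin budget decision could look roughly like this:

    #include <cstdint>

    // Decide how many optimistic spins to attempt before sleeping on the event.
    uint32_t ChooseSpinLimit(uint32_t owningSchedulerId, uint32_t ambientSchedulerId)
    {
        // The owner runs on my scheduler: it cannot make progress while I burn CPU,
        // so one token try is all it gets before I go to sleep and let it run.
        if (owningSchedulerId == ambientSchedulerId)
            return 1;

        // The owner lives elsewhere and might release at any moment: spin a
        // bounded number of times on the optimistic assumption that it will.
        return 1000;
    }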

This is of course the bit where the class knows a thing or two about SQLOS scheduling: If I am currently running, then no other worker on my scheduler can be running. But if another worker on my scheduler currently holds the lock, it can’t possibly wake up and progress towards releasing it unless *I* go to sleep. Mind you, this is already an edge case, because we’d hope that the owner of this kind of lock wouldn’t go to sleep while holding it.

To see how scheduling awareness comes into play, I’m going to walk through a scenario involving contention on such a mutex. If some of the scheduling detail makes you frown, you may want to read Scheduler stories: The myth of the waiter list.

A chronicle of contention

In this toy scenario, we have two schedulers with two active workers each. Three of the four workers will at some point during their execution try and acquire the same mutex, and one of them will try twice. Time flows from left to right, and the numbered callouts are narrated below. A red horizontal bracket represents the period where a worker owns the mutex, which may be a while after the acquisition attempt started.

The mutex acquisition tango
  1. A1 wants to acquire the mutex and, finding it uncontended, gets it straight away.
  2. B2 tries to acquire it, but since it is held by A1, it gives up after a bit of optimistic spinning, going to sleep. This gives B1 a turn on the scheduler.
  3. A1 releases the mutex, and finding that there is a waiter, signals it. This moves B2 off the mutex’s waiter list and onto scheduler B’s runnable queue, so it will be considered eligible for running at the next scheduler yield point.
  4. B1 wants the mutex, and since it isn’t taken, grabs it. Even though B2 has been waiting for a while, it wasn’t running, and it’s just tough luck that B1 gets it first.
  5. A1 wants the mutex again, but now B1 is holding it, so A1 goes to sleep, yielding to A2.
  6. B1 releases the mutex and signals the waiter A1 – note that B2 isn’t waiting on the resource anymore, but is just in a signal wait.
  7. B1 reaches the end of its quantum and politely yields the scheduler. B2 is picked as the next worker to run, and upon waking, the first thing it does is to try and grab that mutex. It succeeds.
  8. A2 reaches a yield point, and now A1 can get scheduled, starting its quantum by trying to acquire the mutex. However, B2 is still holding it, and after some angry spinning, A1 is forced to go to sleep again, yielding back to A2.
  9. B2 releases the mutex and signals the waiting A1, who will hopefully have better luck acquiring it when it wakes up again.

While this may come across as a bit complex, remember that an acquisition attempt (whether immediately successful or not) may also involve spinning on the lock bit. And this spinning manifests as “useful” work which doesn’t show up in spinlock statistics; the only thing that gets exposed is the CMEMTHREAD waiting between the moment a worker gives up and goes to sleep and the moment it is woken up. This may be followed by another bout of unsuccessful and unmeasured spinning.

All in all though, you can see that this unfair acquisition pattern keeps the protected object busy doling out its resource: in this case, a memory object providing blocks of memory. In an alternative universe, the mutex class might have decided on its next owner at the moment the previous owner releases it. However, that would mean the allocator does no useful work until the chosen worker has woken up; in the meantime, the unlucky ones on less busy schedulers may have missed an opportunity to wake up and complete a quick acquire/release cycle. So while the behaviour may look unfair from the viewpoint of the longest waiter, it can turn out better for overall system throughput.

Of course, partitioning memory objects reduced the possibility of even having contention. But the fact remains: while any resources whatsoever are shared, we need to consider how they behave in contended scenarios.

Compare-and-swap trivia

Assuming we want the whole pair, as these memory allocations do, there are four atomic operations performed against the lock member:

  • Increment the waiter count: add 0x00010001
  • Increment the waiter count and grab the locks: add 0x80018001
  • Just grab the locks (after the waiter count was previously incremented): add 0x80008000
  • Release the locks and decrement the waiter count: deduct 0x80018001

For the first three, the usual multi-step pattern comes into play (see the sketch after the list):

  1. Retrieve the current value of the lock member
  2. Add the desired value to it, or abandon the operation, e.g. if we find the lock bit set and we’re not planning to spin
  3. Perform the atomic compare-and-swap (lock cmpxchg instruction) to replace the current value with the new one as long as the current value has not changed since the retrieval in step 1
  4. Repeat if not successful
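
As a hedged illustration of that pattern, here is what the plain waiter-count increment (the 0x00010001 case) might look like in std::atomic terms; again this is my own framing rather than the actual implementation.

    #include <atomic>
    #include <cstdint>

    // Register interest in the pair: add 0x00010001 using the multi-step pattern.
    void IncrementWaiterCounts(std::atomic<uint32_t>& lockWord)
    {
        // Step 1: retrieve the current value of the lock member.
        uint32_t current = lockWord.load(std::memory_order_relaxed);
        for (;;)
        {
            // Step 2: compute the desired new value (no abandon condition for this case).
            uint32_t desired = current + 0x00010001u;

            // Step 3: swap it in atomically, but only if nobody changed the word in the
            // meantime; on x64 this compiles down to a lock cmpxchg instruction.
            if (lockWord.compare_exchange_weak(current, desired, std::memory_order_acq_rel))
                return;

            // Step 4: the CAS failed and `current` now holds the fresh value; repeat.
        }
    }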

The release is simpler, since we know that the lock bits are set (we own it!) and there is no conditional logic. Here the operation is simple interlocked arithmetic, but two’s complement messes with your mind a bit: the arithmetic operation is the addition of 0x7ffe7fff. Not a typo: that fourth digit is an “e”!

This all comes down to thinking of each lock bit as the sign bit of its 16-bit half, which we need to overflow in order to flip back to zero. The higher one overflows out of the 32-bit space, while the lower one overflows into the lowest bit of the other half’s count, which is why the constant’s upper half is 0x7ffe rather than 0x7fff. To demonstrate, we expect 0x80018001 to turn to zero after applying this operation:

    8001 8001
  + 7ffe 7fff
    ---------
(1) 0000 0000
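
And if you’d rather convince yourself in code than on paper, this tiny check (mine, not SQLOS’s) shows the 32-bit wraparound doing the right thing:

    #include <cassert>
    #include <cstdint>

    int main()
    {
        // One owner, no other waiters: both lock bits set, both waiter counts at 1.
        uint32_t lockWord = 0x80018001u;

        // Release: "subtract 0x80018001", expressed as adding its two's complement;
        // the carry out of bit 31 simply falls away in 32-bit arithmetic.
        lockWord += 0x7FFE7FFFu;

        assert(lockWord == 0);   // lock bits cleared, waiter counts back to zero
        return 0;
    }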

Final thoughts

So you thought we’d reached the end of scheduling and bit twiddling? This may turn out to be a perfect opportunity to revisit waits, and to start exploring those memory objects themselves.

I’d like to thank Brian Gianforcaro (b | t) for feedback that helped me confirm some observations.

A spanking new ReaderWriterSpinlock

Well, I never. There is life in the old spinlock yet!

Being a synchronisation fetishist, I took great interest in last week’s blog post by Sanjay Mishra and Arvind Shyamsundar about the ReaderWriterSpinlock added in 2016 CU2. Great story, happy ending, room for a sequel, all good news.

We have already seen recent improvement in the reader-writer lock algorithm, so clearly people are finding more corners to squeeze. Now what, pray tell, is a reader-writer spinlock?
Continue reading “A spanking new ReaderWriterSpinlock”

The Latch Files 2: The spinlock that dares not speak its name

Spinlocks live among us. We see them on duty, in uniform, and greet them by name. When we interact, they show a badge and leave a receipt for the time they eroded from our working day. Or so we’d like to think.

A spinlock headbanging party in full spin

When looking at the 2016 SOS_RWLock, we came across the one-bit spinlock buried within its Count member. Since it protects a very simple wait structure, someone evidently made the decision that it is cheap enough to spin aggressively until acquired, with no backoff logic. This suggests that a low degree of spinlock contention is anticipated, either because few threads are expected to try and acquire the lock simultaneously or because the amount of business to be done while holding the lock is very light and likely to finish quickly.
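
For flavour, a minimal sketch of such an aggressive, no-backoff spin on a single bit might look like the snippet below; this is my own generic illustration, not the actual SOS_RWLock code, and the 64-bit word is simply an assumption for the example.

    #include <atomic>
    #include <cstdint>

    // Spin until the designated lock bit within the count word is acquired;
    // no backoff, no sleeping, just repeated attempts.
    void AcquireEmbeddedSpinlock(std::atomic<uint64_t>& count, uint64_t lockBit)
    {
        for (;;)
        {
            uint64_t current = count.load(std::memory_order_relaxed);
            if ((current & lockBit) == 0 &&
                count.compare_exchange_weak(current, current | lockBit,
                                            std::memory_order_acquire))
                return;          // the bit is ours
            // Otherwise: headbang and immediately try again.
        }
    }

    void ReleaseEmbeddedSpinlock(std::atomic<uint64_t>& count, uint64_t lockBit)
    {
        count.fetch_and(~lockBit, std::memory_order_release);
    }
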
Continue reading “The Latch Files 2: The spinlock that dares not speak its name”

Unsung SQLOS: SOS_WaitableAddress

One of the more amusing words in the SQL Server synchronisation lexicon is “lightweight”. Locks bad. Nolocks good. Latches lightweight. The more spinlocks you eat, the more wait you lose!

If only things were that simple… But hey, I love the poetry of compromise. Check out the SOS_WaitableAddress for one of the many competing definitions of “lightweight”.
Continue reading “Unsung SQLOS: SOS_WaitableAddress”

Unsung SQLOS: the 2016 SOS_RWLock

TSQL2SDAY logo

Talk about serendipity. I’ve been working on a progression of blog posts that would include dealing with the SOS_RWLock in both 2014 and 2016 versions, and today is a perfect excuse to align it with the 2016-themed T-SQL Tuesday hosted by Michael J Swart.

The 2014 incarnation of the SOS_RWLock looked sensible enough, but since we’ve been told it was improved, it’s a great opportunity to see how one goes about tuning a lock algorithm. So with lock-picking tools in hand, follow me to the launch party of the Spring 2016 SQLOS collection to see what the hype is all about. Is the 2014 implementation truly Derelocte?
Continue reading “Unsung SQLOS: the 2016 SOS_RWLock”

Unsung SQLOS: the classic SOS_RWLock

Moving along with our bestiary of synchronisation classes, the SOS_RWLock, a reader-writer lock, feels like a logical next stop. It has been in the news recently, it has fairly simple semantics, and it is built upon primitives that we have already explored, namely spinlocks, linked lists and the EventInternal class. Its implementation is quite a leap from the simple SOS_Mutex and there is more scope for alternative implementations providing the same functionality. And, would you believe it, as called out by Bob Dorr, the 2012/2014 implementation has now been found wanting and got rewritten for 2016. Today we’re looking at the “classic” version though, because we then get the chance to understand the 2016 rewrite in terms of concrete design decisions. (Update: I examine the 2016 update here).
Continue reading “Unsung SQLOS: the classic SOS_RWLock”

Unsung SQLOS: the SOS_Mutex

A mutex, short for “mutual exclusion”, is arguably the simplest waitable synchronisation construct you can imagine. It exposes methods for acquisition and release, and the semantics are straightforward:

  • Initially it isn’t “owned”, and anybody who asks to acquire it is granted ownership
  • While owned, anybody else who comes around to try and acquire it must wait her turn
  • When the owner is done with it, the mutex is released, which then transfers ownership to one waiter (if any) and unpends that waiter

A mutex can also validly be referred to as a critical section, in the sense that it protects a critical section of code, or more accurately, data. When programming libraries expose both a mutex and a critical section, as Windows does, it really just reflects different implementations of synchronisation objects with the same semantics. You could also consider a spinlock to be a flavour of mutex: while the name “spinlock” describes the mechanism by which competing threads jostle for exclusive ownership (it can’t be politely waited upon), the idea of mutual exclusion with at most one concurrent owner still applies.

SOS_Mutex class layout and interface

This class is directly derived from EventInternal<SuspendQSlock>, with three modifications (a structural sketch follows the list):

  1. The addition of an ExclusiveOwner member.
  2. The override of the Wait() method to implement mutex-specific semantics, although the main act of waiting is still delegated to the base class method.
  3. The addition of an AddAsOwner() method, called by Wait(), which crowns the ambient task as the exclusive owner after a successful wait.
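
To picture those three modifications together, here is a hedged structural sketch; the helper types are forward-declared stand-ins and the method bodies are invented, so treat it as shape rather than source.

    class SuspendQSlock;                   // the spinlock type used as template argument
    class SOS_Task;                        // stand-in for whatever represents a task

    template <typename TSpinlock>
    class EventInternal
    {
    public:
        bool Wait(int timeoutMs);          // the base class still does the actual waiting
        // ... signalled state, wait list, embedded spinlock of type TSpinlock ...
    };

    class SOS_Mutex : public EventInternal<SuspendQSlock>
    {
    public:
        bool Wait(int timeoutMs)                       // 2. mutex-specific semantics
        {
            bool acquired = EventInternal<SuspendQSlock>::Wait(timeoutMs);
            if (acquired)
                AddAsOwner();                          // 3. crown the ambient task
            return acquired;
        }

    private:
        void AddAsOwner();                             // records the ambient task
        SOS_Task* ExclusiveOwner = nullptr;            // 1. the added member
    };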

Continue reading “Unsung SQLOS: the SOS_Mutex”

Unsung SQLOS: the EventInternal

Today we’re taking a step towards scheduler territory by examining the EventInternal class, the granddaddy of SQLOS synchronisation objects. At the outset, let’s get one formality out of the way: although it is a template class taking a spinlock type as a template parameter, we only see it instantiated as EventInternal<SuspendQSLock> as of SQL Server 2014. What this means is that spins on its internal spinlock are always going to show up as SOS_SUSPEND_QUEUE.

It’s a very simple class (deceptively so, even) which can implement a few different event flavours, doing its waiting via SQLOS scheduler methods rather than directly involving the Windows kernel. The desire to keep things simple and – as far as possible – keep control flow out of kernel mode is a very common goal for threading libraries and frameworks. .Net is a good frame of reference here, because it is well documented, but the pattern exists within OS APIs too, where the power and generality of kernel-mode code has to be weighed against the cost of getting there.
Continue reading “Unsung SQLOS: the EventInternal”