Moving along with our bestiary of synchronisation classes, the SOS_RWLock, a reader-writer lock, feels like a logical next stop. It has been in the news recently, it has fairly simple semantics, and it is built upon primitives that we have already explored, namely spinlocks, linked lists and the EventInternal class. Its implementation is quite a leap from the simple SOS_Mutex and there is more scope for alternative implementations providing the same functionality. And, would you believe it, as called out by Bob Dorr, the 2012/2014 implementation has now been found wanting and got rewritten for 2016. Today we’re looking at the “classic” version though, because we then get the chance to understand the 2016 rewrite in terms of concrete design decisions. (Update: I examine the 2016 update here).
SOS_RWLock semantics and behaviour
This lock class can best be appreciated by comparing it to a mutex. Like the mutex, a reader-writer lock can only be acquired in exclusive mode by one requestor at a time, but instead of only exposing this exclusive-acquire (Writer) option, it alternatively allows acquisition in shared (Reader) mode. This stuff is completely natural to us database folks of course, because the semantics are a subset of the behaviours we get from familiar database locks.
Basic rules of the road:
- Any number of simultaneous “clients” can share ownership of the lock in Read mode.
- Readers block writers.
- Writers block readers and other writers.
- Blocking means that the requesting worker gets suspended (scheduled off the processor) and accrues a wait of a type specified in the lock acquisition request.
There is one subtlety around the granting of a Read mode request when the lock is currently held in Read mode. If there are no writers waiting on the lock (blocked by the fact that the lock is currently held in Read mode) the request is granted. However, the presence of any waiting Write mode request(s) will mean that the incoming reader gets suspended. This would seem to be a guard against writers being starved of access due to a continual barrage of readers.
The above point fits in with this being a very fair lock implementation, with requests always granted in FIFO order. When the lock is released, the oldest waiter is unpended. If this oldest waiter is a Read mode request, all subsequent consecutive Read mode waiters are also unpended, up to the end of the waiter list or the first waiting Write mode request. In other words, read requests that came after that write request still have to wait for their turn behind the writer.
Recursive acquisition is disallowed; if a given worker is the current exclusive owner and tries to acquire the lock in either Read or Write mode again, a nasty assertion is raised. However, because read grants aren’t tracked against specific grantees (they are only counted) there is no equivalent check to prevent an existing Read mode grantee from getting granted a second concurrent read grant.
The public interface is very similar to that of the SOS_Mutex, although the inherited method names of the EventInternal are now replaced by GetLock() and ReleaseLock(). While GetLock() acts very similarly to the mutex's Wait(), I like the fact that it is a bit softer on the ear: in the uncontended case there will be an immediate successful acquisition and no waiting, so it is fitting that the word "wait" doesn't feature.
GetLock() adds the necessary parameter for request mode, viz Read or Write, but otherwise matches our old friend Wait() in taking a timeout and a pointer to an SOS_WaitInfo structure, which serves as the timesheet upon which any waiting is logged.
ReleaseLock() takes no parameters, and works on the premise that whoever called it is a grantee and has the right to request a release. If the lock was being held in Write mode at the time, it becomes unlocked, and waiters are unpended according to the above rules. If the lock was being held in Read mode, the count of readers is decremented, and if it has reached zero, the lock is also released.
Class member layout
As usual, I am not familiar with the actual member variable names, so these are just my own names, but they are good enough for our purposes. Note that both spinlock and heldMode are 32-bit members, and in both cases four bytes go wasted after them. This would appear to be padding to allow eight-byte alignment for the pointers flink, blink and exclusiveOwner; if compactness were a goal, it would have been possible to shave off eight bytes by rearranging member order:
0x00: spinlock
0x08: flink
0x10: blink
0x18: readerCount
0x1c: waitingWriterCount
0x20: heldMode
0x28: exclusiveOwner
Taking them one at a time:
- The spinlock is of type SOS_RW, and protects the whole structure (including the waiter list) against concurrent updates and dirty reads.
- flink and blink should by now be bread and butter to us. These form the list head for the list of waiters.
- readerCount keeps track of how many readers currently own the lock if we’re in Read mode. This is a very important part of the lock’s state – although we don’t need to track who these readers are, by counting them we can know when the last reader has released it.
- waitingWriterCount is a shortcut that allows a Read mode request for a lock already held in Read mode to take into account whether or not there are waiting writers. This isn’t strictly speaking necessary, since the information can be gleaned from the waiter list itself, but clearly at some point in the design it was deemed a worthwhile optimisation. Spoiler Alert: this doesn’t exist in the 2016 version anymore, so clearly that thinking was revisited.
- heldMode is an enumeration indicating what mode the lock is currently held in. 0 = none (i.e. lock isn’t held at all), 1 = Read, 2 = Write
- When held in Write mode, exclusiveOwner is the address of the Worker who was granted exclusive access. As usual, this is harvested through the ambient SystemThread.
A special structure is used for the list entries of the waiter list, which is composed of a few attributes plus an embedded EventInternal per waiter:
0x00: flink
0x08: blink
0x10: requestMode
0x14: requestGranted
0x18: eventInternal (fully expanded below)
    0x18: signalled
    0x1c: signalMode (init as auto-reset)
    0x20: flink
    0x28: blink
    0x30: count
    0x38: spinlock
These don’t need much explanation beyond a few notes. The wait list entries are linked to the SOS_RWLock itself through their own ListEntry (flink and blink at 0x00), and this can form a queue of arbitrary length. In contrast, the list within the embedded EventInternal (where flink and blink constitute a ListHead and not a ListEntry) is a degenerate case of a linked list: it always contains exactly one waiter, up to the point where the event is signalled and the list becomes empty, just before the whole wait list entry is destroyed, having become irrelevant. So while it could be confusing that each wait list entry contains an EventInternal which itself contains a list of waiters, the actual implementation is much simpler.
The GetLock() method is a wrapper for the meatier GetLongWait(), but there are two cases where the lock can be granted immediately without having to enter the deeper function. These cases are:
- The requested mode is Read, there are no waiting writers, and the lock isn’t currently held in Write mode (i.e. it is either not held at all or held in Read mode)
- The requested mode is Write and the lock isn’t currently held in either mode
One of the first things GetLock() does is to acquire the member spinlock; this will be released within GetLock() itself for the two cases above, but otherwise GetLongWait() is called while the spinlock is still held, and it releases the spinlock at the appropriate time. This is necessary because the code paths in the child method could involve a thread suspension, and we absolutely should not suspend while holding a spinlock – clearly some developer discipline is required in these cases!
GetLongWait() contains the abovementioned wait list entry as a local (stack) variable. Before the still-held spinlock is released, the method does the following:
- If the requested mode is Write, waitingWriterCount is incremented.
- The local wait list entry is inserted at the head of the wait list.
Ignoring edge-case handling for unsuccessful waits, all that remains is to call Wait() on the event embedded in that wait list entry. Once the wait succeeds, the method can be exited, and if the request mode was Write, the caller (GetLock()) can tag the current worker as exclusiveOwner. And that’s it from the perspective of the lucky guy who now has the lock of his dreams.
Except that he has to release it again.
Try and keep in mind that my description of a lock release from the angle of one thread is simultaneously the story of an EventInternal<SuspendQSlock>::Wait() call ending from the angle of one or more other threads trapped in their own GetLongWait() call, assuming that there are current waiters. Best to be sober for this bit in fact.
ReleaseLock() starts like this:
- Acquire the spinlock
- If the current heldMode is Write, i.e. it was just released by the exclusive owner, clear the exclusiveOwner member.
- Otherwise, if the current heldMode is Read, decrement readerCount.
If there are no current waiters, there is a very simple code path: release the spinlock and return. The only wrinkle is that if no readers are now holding the lock, heldMode must be set to None before the spinlock is released.
In the case where there are waiters, the behaviour described above applies. The first waiter will always be unsuspended and heldMode set to the mode of that waiter’s request. If that mode is Read, all consecutive waiters up to the end of the list or the first Write waiter (if any) are unsuspended. If the mode is Write, it will only be that one waiter that is unsuspended, and waitingWriterCount will be decremented.
The mechanism of unpending bears a closer look – this can apply to one or more waiters, hence the involvement of another linked list:
- Since we hold the spinlock, we can do any required linked list manipulation on our waiter list. This doesn’t extend to the actual contents of the EventInternal within a wait list entry, because that is protected by its own spinlock, but we aren’t touching that anyway.
- As a local variable within the ReleaseLock() call, there is a list head onto which all wait list entries to be unsuspended are collected. Unpending a waiter means unhooking its wait list entry from the wait list and adding it to this unsuspend list.
- Once the unsuspend list is populated, we are free to release the spinlock.
- Now we iterate over that local unsuspend list, calling Signal() on each wait list entry’s EventInternal. All the interesting unsuspend gubbins sit in EventInternal as previously described, and I will cover them in much greater depth in a later post.
What fascinates me no end is to think of the allocation of these synchronisation objects. Let’s walk through it together.
The SOS_RWLock itself will typically be an object with global scope, since multiple threads need to know where to look for it and have an interest in the thing it protects. At the other extreme, the wait list entry for a waiter, including the embedded EventInternal, is a local variable living on the stack of the running method GetLongWait(). When that method exits, those local objects go out of lexical scope, which is a compiler concern. However, from the runtime angle, the memory that held them becomes unallocated, because it sits below the bottom of the stack and will get overwritten by future method calls – certainly anybody who still has a reference to that memory location is holding a time bomb. Fortunately it all makes sense intuitively: a thread is only waiting while it is within the GetLongWait() method call, and from the viewpoint of that thread, its erstwhile wait list entry is no more than an empty box once the EventInternal<SuspendQSlock>::Wait() call has returned. Since no more calls or references will be made to that EventInternal object, we don’t need to worry about it.
However, how about the same issue from the angle of others who may want to traverse that wait list, namely other calls to GetLongWait() and the inevitable concurrent call to ReleaseLock()?
Firstly, the ListEntry in question was linked into the wait list during the preparation for the wait. However, the reason for the completion of the wait was the signalling of the EventInternal by ReleaseLock(), and by the time it comes to signalling, the wait list entry has long been unhitched from the SOS_RWLock’s wait list and moved into the unsuspend list local to ReleaseLock(), where no other thread can get at it. Hence we do not need to worry about other concurrent GetLongWait() calls, because the wait list they’ll be dealing with will not contain any ListEntries that are about to go out of scope. The only special protection needed here is the standard linked list integrity measure of the SOS_RWLock’s internal spinlock.
Within ReleaseLock() itself, the signalling of the waiter’s EventInternal is the action which will resume that other thread, and once it resumes, one of the first things it will do is to exit the GetLongWait() call where it got suspended partway through. Exiting this method will deallocate the wait list entry, meaning that we should not make any reference to it after signalling its EventInternal. For this reason, it becomes important to remove the wait list entry from the local unsuspend list within ReleaseLock() just before the act of signalling. It would seem seductively simple to walk the list, signalling each entry as you go along, but it turns out to be important to play by the book, properly dequeueing each ListEntry before using it.
As I’m working through these synchronisation classes, I keep having to remind myself that some of these are virtually unknown in the SQL Server world. Which is odd, considering that spinlocks, latches and SQL Server locks are fairly well understood, and the other internal classes used within SQLOS and SQL Server are just members of the same family that don’t happen to be exposed through DMVs and wait types. Given how much in-depth knowledge there is about other locking constructs, it just feels fair to me to give these little guys their time in the sun as well.