Anatomy and psychology of a SQLOS spinlock

SQL Server spinlocks are famously elusive little beasties, tending to stay in the shadows except when they come out to bother you in swarms. I’m not going to add to the documentation of where specific spinlock types show up, or how to respond to contention on different types; the interested reader likely already knows where to look. Hint: Chris Adkin is quite the spinlock exterminator of the day.

In preparation for Bob Ward’s PASS Summit session, I figured it would make sense to familiarise myself a bit more with spinlock internals, since I have in the past found it frustrating to try and get a grip on it. Fact is, these are actually simple structures that are easy to understand, and as usual, a few public symbols go a long way. Being undocumented stuff, the usual caveats apply that one should not get too attached to implementation details.

Spinlock structure

It doesn’t get any simpler. A SQLOS spinlock is just a four-byte integer, embedded as a member variable in various classes, with two states:

  • Not acquired – the value is zero
  • Acquired – the value is the Windows thread ID of the owner

If it isn’t currently owned, anybody can claim ownership by overwriting that zero with their thread ID. Once that critical bit of code is done with it, ownership is relinquished by setting it back to zero.

I hope I get it

What sets a spinlock apart from other SQL Server synchronisation mechanisms is its lightweight nature (in terms of data and code footprint) and innate optimism. In the non-contended case, acquisition is a simple and efficient code path that requires no branching. Here is what an example looks like in assembler, presupposing that edx contains the thread ID, rcx contains the address of the spinlock, and eax is set to zero:

lock cmpxchg dword ptr [rcx], edx
jnz lblSpinlockNotAcquired

In plain English, this atomically overwrites the spinlock with our thread ID, but only if the value is zero at that exact moment. If we couldn’t take ownership, a conditional branch is taken to lblSpinlockNotAcquired, which will result in a call to SpinToAcquireWithExponentialBackoff. Performance impact of the non-contended case:

  • The cost of one memory access, which may have involved reading one cache line
  • Invalidating that cache line for other processors, which means RAM access the next time anyone else needs the spinlock or other class members sharing its cache line (this suggests that the designers of these classes will likely consider memory layout to minimise false cache line sharing)

Better luck next time

Okay, so let’s say someone else currently owns the spinlock, which is the red rag to the proverbial bull: I want it, and by Jove I shall have it! The initial acquisition attempt consisted of that single compare-exchange embedded in the method that wants to enter the spinlock-protected critical code section, and this is completely independent of the spinlock class. However, in order to progress the acquisition attempt, it now becomes necessary to call out to the template instance method SpinToAcquireWithExponentialBackoff, which takes the thread ID as a parameter.

Upon return from SpinToAcquireWithExponentialBackoff, we WILL own the spinlock. The salient question is just how long it might take, and how much potentially useful work could have been performed on the CPU during that time.

Inside SpinToAcquireWithExponentialBackoff

First things first: being a template method, this exists in a LOT of small variations, although the core functionality remains constant, and the tune goes something like this.

Each acquisition of a contended spinlock – i.e. each invocation of this method due to a collision – gets a SpinInfo structure, which is a local variable living on the thread stack and thus doesn’t require interaction with a memory allocator. This structure contains a few members giving it identity (ID of spinlock class, current worker, node…) as well as metrics like the time when the acquisition attempt started. In case backoff is needed, the SpinInfo is passed down to the (non-template) instance method Backoff.

Regarding the sheer physicality of spinning on the lock, I have at times seen illustrations of code that aggressively repeat a compare-exchange attempt in a tight loop. This would work, but is fortunately not how our spinlocks operate, since that would mess with the memory bus and the cache coherency signals passing between CPUs. Instead, we have an alternation of compare-exchange acquisition attempts and short bursts of “CPU burning” that don’t involve memory – nothing more than N iterations of a loop over the Pause instruction. The primary goal of this loop is to waste a short chunk of time without yielding the CPU, but the use of Pause does mean that a sibling hyperthread running on the same physical core gets a good chunk of the core.

The exponential backoff implementation plays out as follows in pseudocode – obviously those numbers would reflect the outcome of a fair bit of pragmatic tweaking by the SQLOS developers:

SpinsThisIteration = min(max(250, avg_spin_count_on_this_node / 2), 2500)
while spinlock not acquired {
    burn CPU for SpinsThisIteration times
    update aggregate spin count on this node
    try to acquire spinlock
    if not acquired {
        increment backoff count on this SpinInfo
        increment backoff count on the class-specific spinlock metrics for this node
        call SpinlockBase::Backoff
        if this SpinInfo's backoff count % 32 == 0 {
            reset SpinsThisIteration to 250
        } else {
            SpinsThisIteration = max(SpinsThisIteration * 2, 10 000 000)
        }
    }

We’re still being upbeat here. Even though some CPU cycles have now been burned, we kind of hope that it doesn’t get as far as a backoff, i.e. ideally the lock acquisition succeeded after the initial brief wait.

Time to back off

Having reached this point, it’s time to cut our losses and go to sleep without burning CPU. Pseudocode for the Backoff instance method:

    if this is our first backoff, save the start time (fine-grained QPC, not system time) in the SpinInfo
    if it's been >20s since the first backoff {
        output a backoff warning
        mark the current worker as Sick
        This.Sleep(5000)
        clear the current worker's Sick flag
        reset the backoff start time to now so the 20s starts counting again
    } else{
        if backoffs_in_this_acquisition_attempt % 8 == 0 {
            This.Sleep(1)
        } else {
            This.Sleep(0)
        }
    }

This is a fairly simple wrapper for SpinlockBase::Sleep, which takes a millisecond sleep time and in turn wraps Windows API calls doing the actual sleeping. SpinlockBase::Sleep contains handling for TF1531 which seems to supplement the backoff XEvent (guessing it writes to a ring buffer, but I haven’t checked properly); with that ignored, here’s what remains in pseudocode:

    publish the spinlock_backoff XEvent if necessary
    if requested sleep time == 0 {
        invoke Windows API call SwitchToThread()
    } else {
        note current timestamp as QPC
        invoke Windows API call Sleep() with requested sleep time
        // upon waking up
        log the time spent sleeping against the scheduler's stats for the spinlock class

The nuances behind Sleep(0) vs Sleep(1) vs SwitchToThread() are explained here by Joe Duffy. The basic premise is that we want to give the Windows scheduler a chance to schedule another thread, and we need to avoid starving lower priority threads.

Conclusions

It is no coincidence that higher-level SQLOS constructs didn’t feature much here, and that yielding doesn’t involve SQLOS-aware scheduler code. While SQL Server code is waiting on a spinlock, this SQLOS scheduler is out of action, and it won’t schedule another SQLOS thread, hence the CPU is temporarily a write-off for doing anything involving cooperative scheduling. What it can do is to yield the CPU to Windows; it may be that there is some useful non-SQL work that could be running on the CPU.

So why lock up the SQLOS scheduler at all instead of letting it reschedule another SQLOS thread? This looks to be tightly bound up with the intended lightweight nature of the spinlock: since nothing else can run on our scheduler, there is no need to save and subsequently restore state, which would be an expensive operation. This has a very interesting side-effect in terms of the SQLOS object hierarchy: for the whole duration of the acquisition attempt (including backoffs) there is a locked 1:1:1 relationship between SOS scheduler, task, worker and Windows thread, so it’s a moot point which one would be considered the one doing the spinning.

You may also have noticed the inherently partitioned nature of the spinlock statistics themselves, which is another performance optimisation. If a single global set of statistics were kept (the ones you’d find in sys.dm_os_spinlock_stats) there would be foreign memory access and a greater degree of cache line contention involved in its maintenance. As it stands, it is actually subject to lost updates since access isn’t serialised across the schedulers in a node (which would have required a nested spinlock!). However, given the statistical nature of these metrics, this is clearly a sane compromise. On the flip side, the need to aggregate across nodes to construct sys.dm_os_spinlock_stats adds a sliver of extra cost to querying that DMV, but face it, the querying of metrics is hardly the core use case that needs to be optimised.

Finally, a summary of state and footprint:

  • A non-acquired spinlock has no state other than the fact of not being acquired, and no history.
  • The acquisition of a non-contended spinlock is a simple state change and generates no history.
  • The act of acquiring a contended spinlock has a lifecycle and embodies a state machine. The initial state is influenced by prior history of spinning on that scheduler, and the spins, backoffs and sleep time are tracked and added to the scheduler- and spinlock class-specific metrics.

7 thoughts on “Anatomy and psychology of a SQLOS spinlock”

Leave a Reply

Your email address will not be published. Required fields are marked *