Unsung SQLOS: the EventInternal

Today we’re taking a step towards scheduler territory by examining the EventInternal class, the granddaddy of SQLOS synchronisation objects. At the outset, let’s get one formality out of the way: although it is a template class taking a spinlock type as template parameter, we only see it instantiated as EventInternal<SuspendQSLock> as of SQL Server 2014. What this means is that spins on its internal spinlock will always show up as SOS_SUSPEND_QUEUE.

It’s a very simple class (deceptively so even) which can implement a few different event flavours, doing its waiting via SQLOS scheduler methods rather than directly involving the Windows kernel. The desire to keep things simple and – as far as possible – keep control flow out of kernel mode is a very common goal for threading libraries and frameworks. .Net is a good frame of reference here, because it is well documented, but the pattern exists within OS APIs too, where the power and generality of kernel-mode code has to be weighed against the cost of getting there.

EventInternal class layout

The class is nice and compact, so I’ll present its members in full. Not having access to private symbols, I’m making up the member names, but this isn’t controversial stuff.

+0x00: signalled
+0x04: signalMode
+0x08: flink
+0x10: blink
+0x18: count
+0x20: spinlock
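
To make the offsets concrete, here is a minimal sketch of how such a layout could be declared. Everything apart from the offsets is an assumption on my part: the member names are the made-up ones from the listing, the field widths are guessed from the gaps between offsets, and the SuspendQSLock body is a placeholder, since the real spinlock layout isn't covered here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical declarations: only the offsets come from the listing above.
static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit build");

struct ListEntry {
    ListEntry* flink;  // forward link
    ListEntry* blink;  // backward link
};

struct SuspendQSLock {   // placeholder for the real spinlock type
    std::uint64_t word;  // size and contents are assumptions
};

template <typename TSpinlock>
struct EventInternalLayout {
    std::uint32_t signalled;   // +0x00 (width assumed)
    std::uint32_t signalMode;  // +0x04 (width assumed)
    ListEntry     waiters;     // +0x08 flink, +0x10 blink
    std::uint64_t count;       // +0x18 (width assumed)
    TSpinlock     spinlock;    // +0x20
};

using EventInternal = EventInternalLayout<SuspendQSLock>;

// Compile-time checks that the sketch matches the listed offsets.
static_assert(offsetof(EventInternal, signalled)  == 0x00, "signalled");
static_assert(offsetof(EventInternal, signalMode) == 0x04, "signalMode");
static_assert(offsetof(EventInternal, waiters)    == 0x08, "flink");
static_assert(offsetof(ListEntry, blink)          == 0x08, "blink at +0x10");
static_assert(offsetof(EventInternal, count)      == 0x18, "count");
static_assert(offsetof(EventInternal, spinlock)   == 0x20, "spinlock");
```

The static_asserts only prove that the sketch is self-consistent with the listing, not that the real class is declared this way.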

Let’s walk through the members.

Signalled and SignalMode

signalled is a Boolean which carries the publicly visible state of the event. In case you’re not familiar with them, an event is an object which can exist in one of two states: signalled and unsignalled. Superficially this aspect of its surface area is nothing more than a Boolean variable. A simple pedestrian crossing light is a reasonable model here, with signalled meaning “cross” and unsignalled meaning “don’t cross”. As a consumer of the protected resource (the pedestrian crossing) you use the traffic light in one of two ways, based on its current state:

  • Signalled: cross the road
  • Unsignalled: wait until it is signalled, then cross the road

The end result is the same in both cases, in that you end up where you want to be on the other side of the road.

signalMode adds a twist. The behaviour described for the traffic light corresponds to a signal mode of 0, also known as a manual reset event. Here the event stays signalled irrespective of how many consumers pass through it (=successfully wait on it).

A signal mode of 1, however, turns it into an auto-reset event, where the act of successfully waiting on the event resets it to unsignalled. This is now more akin to a turnstile that only lets one person through after being signalled, e.g. by a scan of a valid transport pass or a button press by a security guard.

Interestingly, an event object is also sometimes known as a latch – that’s something to chew on for SQL Server folks. Don’t get hung up about who or what signals it; that is a separate issue altogether. Just keep in mind that the signal mode is a permanent attribute of the event – you construct it as manual reset or auto-reset. Full disclosure: there seems to be at least one more SignalMode (2, used by the related SOS_WaitableAddress), but let’s ignore it today.
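
The difference between the two signal modes is easy to see in a toy model. The sketch below is not the SQLOS implementation: ToyEvent and TryWait are hypothetical names, there is no suspension at all, and a failed wait simply returns false. It only shows that a successful wait consumes the signal in auto-reset mode and leaves it alone in manual reset mode.

```cpp
#include <cassert>

// Mode values 0 and 1 follow the text; the type itself is made up.
enum class SignalMode { ManualReset = 0, AutoReset = 1 };

class ToyEvent {
public:
    // The mode is fixed at construction, as described above.
    explicit ToyEvent(SignalMode mode) : mode_(mode) {}

    void Signal() { signalled_ = true; }

    // Hypothetical non-blocking wait: true means "cross the road now".
    bool TryWait() {
        if (!signalled_) return false;
        if (mode_ == SignalMode::AutoReset) {
            signalled_ = false;  // the turnstile locks again behind us
        }
        return true;
    }

private:
    const SignalMode mode_;
    bool signalled_ = false;
};

// Small self-check illustrating both flavours.
inline void CheckSignalModes() {
    ToyEvent light(SignalMode::ManualReset);
    light.Signal();
    assert(light.TryWait() && light.TryWait());  // stays signalled

    ToyEvent turnstile(SignalMode::AutoReset);
    turnstile.Signal();
    assert(turnstile.TryWait());   // first consumer gets through
    assert(!turnstile.TryWait());  // second one has to wait
}
```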

Other members

flink and blink show that we are yet again in linked list territory. In this case, the event object contains the list head for a list of waiters, also known as its suspend queue. If nothing is waiting on the event, the queue is empty, i.e. flink and blink both point back at the list head itself. Otherwise they point into a ring of ListEntries, each sitting right at the start of a Worker; this is the simple case where the address of the ListEntry is also the address of the containing record.
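
As a rough illustration of the list mechanics, here is a sketch of a circular doubly linked list head in the usual Windows LIST_ENTRY style. The Worker struct is a stand-in with one invented field, and the helper names are mine. The two points to notice are that emptiness is tested against the address of the head (a queue with exactly one waiter also has flink equal to blink), and that getting from a ListEntry to its Worker is a plain cast when the entry is the first member.

```cpp
#include <cassert>

struct ListEntry {
    ListEntry* flink;
    ListEntry* blink;
};

// Stand-in for the real Worker: the ListEntry sits at offset 0.
struct Worker {
    ListEntry link;
    int id;  // invented field, purely for the demonstration
};

inline void InitHead(ListEntry& head) { head.flink = head.blink = &head; }

// An empty list is one whose head points back at itself.
inline bool IsEmpty(const ListEntry& head) { return head.flink == &head; }

inline void InsertTail(ListEntry& head, ListEntry& entry) {
    entry.flink = &head;
    entry.blink = head.blink;
    head.blink->flink = &entry;
    head.blink = &entry;
}

// Valid because link is the first member of a standard-layout struct.
inline Worker* WorkerFromEntry(ListEntry* entry) {
    return reinterpret_cast<Worker*>(entry);
}

inline void CheckSuspendQueue() {
    ListEntry head;
    InitHead(head);
    assert(IsEmpty(head));

    Worker w{};
    w.id = 7;
    InsertTail(head, w.link);
    assert(!IsEmpty(head));
    assert(head.flink == head.blink);  // one waiter: equal, yet not empty
    assert(WorkerFromEntry(head.flink)->id == 7);
}
```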

The act of attempting to wait on an event and then being suspended due to the event not being signalled is what puts a worker on the suspend queue.

count is an odd one. I give it that name because it partly demonstrates an ability to tally up a “credit” of being signalled in auto-reset mode, where a count of e.g. 2 means that two waits will be successful before it becomes unsignalled again. This would be semaphore semantics, but it isn’t completely implemented by EventInternal itself, so I would expect such behaviour to lie in derived classes.

spinlock is straightforward protection for the whole structure, protecting it from inconsistency due to concurrent modifications, and disallowing dirty reads of its members by requiring both readers and writers to acquire the spinlock before engaging with the members.

Programming interface

The main interaction with an EventInternal is through two methods, Signal() and Wait(). There are two additional variations, SignalAll() and WaitAllowPrematureWakeup(), although I won’t be going into those.

From the viewpoint of the typical consumer, Wait() is the big one. There are four possible outcomes when calling it:

  • No waiting is required because the event is signalled, and the call returns immediately. This is a very happy ending from the viewpoint of the worker.
  • A wait is required because the event isn’t in the signalled state. From the viewpoint of the caller, all that this means is that the method call doesn’t return immediately.
  • The worker is detected as having reached the end of its quantum, and is thus suspended, irrespective of whether the event is signalled or not. This “suspension” may be as simple as being immediately rescheduled after some scheduler housekeeping is done, but the important point is that the control flow enters scheduler code, so what happens next is up to the scheduler. This single call to the Wait method can thus yield (no pun intended) two separate pieces of waiting: first a wait to become runnable again, ascribed to SOS_SCHEDULER_YIELD, and then after being scheduled again, the chance to get in line on the event’s suspend queue, with that wait ascribed to the specified wait type. Once again, this isn’t perceived by the worker, except insofar as the Wait method call may not return immediately.
  • No waiting is done, but the method call returns with one of two errors – this will be explored below.

Signal() is called to change the event state, which may unpend waiting workers. There is slightly different behaviour for manual reset and auto-reset modes:

  • For a manual reset event, signalling will always leave it in the signalled state. If there are waiters at the time, they are all unpended, i.e. the workers become runnable. This of course doesn’t mean they run immediately, just that they are eligible to be scheduled again.
  • Signalling an auto-reset event will unpend the worker at the tail of the suspend queue if there are any waiters. If nobody is waiting, the event is left in the signalled state. Additionally, if nobody is waiting and the event is already signalled, the count member is set to 1. (This is where the semantics are unclear to me; incrementing it by one instead would be recognisable as semaphore behaviour.)
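
The Signal() rules above can be written down as a small state machine. This is a model of the described behaviour only, with several assumptions of my own: workers are plain integer ids, the suspend queue is a std::deque whose back is the tail, "unpending" just means returning the id to the caller, the order in which a manual reset event drains its queue is arbitrary, and an auto-reset event that hands its signal to a waiter is assumed to stay unsignalled.

```cpp
#include <cassert>
#include <deque>
#include <vector>

enum class SignalMode { ManualReset = 0, AutoReset = 1 };

// Toy model, not the real class.
struct ToyEvent {
    explicit ToyEvent(SignalMode m) : mode(m) {}
    SignalMode mode;
    bool signalled = false;
    int count = 0;
    std::deque<int> suspendQueue;  // worker ids; back() is the tail
};

// Returns the ids of the workers made runnable by this call.
inline std::vector<int> Signal(ToyEvent& ev) {
    std::vector<int> unpended;
    if (ev.mode == SignalMode::ManualReset) {
        ev.signalled = true;  // always ends up signalled
        while (!ev.suspendQueue.empty()) {  // every waiter becomes runnable
            unpended.push_back(ev.suspendQueue.back());
            ev.suspendQueue.pop_back();
        }
    } else if (!ev.suspendQueue.empty()) {
        // Auto-reset with waiters: only the tail worker is unpended.
        unpended.push_back(ev.suspendQueue.back());
        ev.suspendQueue.pop_back();
    } else if (ev.signalled) {
        ev.count = 1;  // set, not incremented, as observed
    } else {
        ev.signalled = true;  // nobody waiting: remember the signal
    }
    return unpended;
}

inline void CheckSignal() {
    ToyEvent ev(SignalMode::AutoReset);
    ev.suspendQueue = {1, 2};
    std::vector<int> woken = Signal(ev);
    assert(woken.size() == 1 && woken[0] == 2);  // tail worker only
    assert(!ev.signalled);
}
```

Becoming runnable here is just a return value; in the real thing the scheduler still decides when each of those workers gets to run.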

When calling Wait(), the caller actually gets a very sweet deal: one of the method parameters specifies whether the insert to the suspend queue, if any, should be at the head or the tail. Since unsuspending is always done from the tail, this means that consistently inserting at the tail gives us stack behaviour (LIFO) whereas inserting at the head is the more fair-sounding queue (FIFO) behaviour. While the stack model sounds like it could lead to thread starvation and unfairness, it all depends on what the event is used for; one good use case for stack behaviour is when we want to unpend the most recently frozen worker whose stack is most likely to be in memory and especially in L2 cache.
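
The head-versus-tail point is worth a tiny demonstration. In the sketch below the suspend queue is modelled with a std::deque of worker ids (my simplification, as are the names WakeOrder and InsertAt), and waking always takes from the back, i.e. the tail.

```cpp
#include <cassert>
#include <deque>
#include <vector>

enum class InsertAt { Head, Tail };

// Given the order in which workers start waiting, return the order in
// which they would be woken when removal is always from the tail.
inline std::vector<int> WakeOrder(const std::vector<int>& arrivals,
                                  InsertAt where) {
    std::deque<int> queue;
    for (int worker : arrivals) {
        if (where == InsertAt::Head) {
            queue.push_front(worker);
        } else {
            queue.push_back(worker);
        }
    }
    std::vector<int> order;
    while (!queue.empty()) {
        order.push_back(queue.back());  // unsuspend from the tail
        queue.pop_back();
    }
    return order;
}

inline void CheckWakeOrder() {
    const std::vector<int> arrivals{1, 2, 3};
    // Tail insert + tail removal: most recent waiter first (LIFO).
    assert((WakeOrder(arrivals, InsertAt::Tail) == std::vector<int>{3, 2, 1}));
    // Head insert + tail removal: oldest waiter first (FIFO).
    assert((WakeOrder(arrivals, InsertAt::Head) == std::vector<int>{1, 2, 3}));
}
```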

Another parameter of the Wait() method is a wait timeout, which can be anything from zero to infinite. When anything other than infinite (magic value -1 as per common Windows practice) is specified, the caller must be willing to deal with a timeout error return code. In fact, especially when the timeout is set to zero, the Wait() method call becomes more akin to TryAcquire(), although keep in mind that events don’t actually have owners. Finally, the Wait() call also takes a pointer to an SOS_WaitInfo instance, which is the generic “timesheet” on which any waiting is logged by suspension support methods, and this includes a specified wait type. In other words, the event is the mechanism which implements the act of waiting, but the cost of waiting is ascribed not to the event but to the thing which causes us to need the event in the first place.

It is possible for a worker to be in an “ineligible for suspension” state. In this case, it is still allowed to call Wait(), but only with a timeout value of zero – anything else will return an error code.
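
Pulling the timeout and eligibility rules together, the decision Wait() has to make before any suspension happens might look something like the sketch below. This is my reading of the behaviour, not disassembled logic: the enum, the function name, the order of the checks and the use of milliseconds are all assumptions, and the quantum-exhaustion yield is left out entirely.

```cpp
#include <cassert>
#include <cstdint>

// -1 reinterpreted as unsigned, the usual Windows "infinite" value.
constexpr std::uint32_t kInfinite = 0xFFFFFFFFu;

// Hypothetical outcome codes; the real return values are not shown here.
enum class WaitDecision {
    ReturnImmediately,  // event was signalled
    Suspend,            // join the suspend queue and wait
    TimeoutError,       // zero timeout and the event was not signalled
    NotEligibleError    // worker may not suspend, yet asked to wait
};

inline WaitDecision DecideWait(bool signalled, bool eligibleForSuspension,
                               std::uint32_t timeoutMs) {
    // An ineligible worker may only "try": any non-zero timeout is an error.
    if (!eligibleForSuspension && timeoutMs != 0) {
        return WaitDecision::NotEligibleError;
    }
    if (signalled) {
        return WaitDecision::ReturnImmediately;
    }
    if (timeoutMs == 0) {
        return WaitDecision::TimeoutError;  // TryAcquire-like behaviour
    }
    return WaitDecision::Suspend;
}

inline void CheckDecideWait() {
    assert(DecideWait(true, true, kInfinite) == WaitDecision::ReturnImmediately);
    assert(DecideWait(false, true, kInfinite) == WaitDecision::Suspend);
    assert(DecideWait(false, true, 0) == WaitDecision::TimeoutError);
    assert(DecideWait(false, false, kInfinite) == WaitDecision::NotEligibleError);
}
```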

Interaction with ambient state

This is where details raised in prior posts should start to coalesce. It makes sense to think of the EventInternal instance as the locus of “consciousness”: it has state, and as with any C++ instance method, every method call comes loaded with a hidden this parameter. However, we naturally think of the thread/worker as the stream of consciousness of the code, so there are clearly two entities involved. Outside of OS code, we don’t normally have to take the extra step of systematically treating the thread itself as an object interacting with another object. But hey, this is OS code!

When it comes to the nitty-gritty of putting the worker on the suspend queue or checking its eligibility for suspension, the Wait() method needs to access the Worker object. One can imagine a system where the worker address is explicitly passed around every single method call through all levels of client code, but this would be an error-prone hassle. One of the services fulfilled by SQLOS as an OS layer is making these concerns invisible to the user of the API, and as discussed in my SystemThread post, thread-local storage comes to the rescue here.

Within the Wait() method, the ambient SystemThread is retrieved from TLS, and this in turn carries references to its currently bound Worker, SOS_Task, and SOS_Scheduler. These references can then be used to retrieve Worker properties and call the Worker, SOS_Task and SOS_Scheduler instance methods associated with the act of suspension.
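
In standard C++ the same idea can be sketched with a thread_local pointer standing in for the TLS slot. The class names below match the ones used in this post, but their contents, the BindCurrentThread and AmbientWorker helpers, and the eligibility flag are all invented for illustration; the real SystemThread lookup goes through Windows TLS rather than a C++ thread_local.

```cpp
#include <cassert>

// Skeleton stand-ins; the real classes carry far more state.
struct Worker { int id; bool eligibleForSuspension; };
struct SOS_Task { int id; };
struct SOS_Scheduler { int id; };

struct SystemThread {
    Worker* worker;
    SOS_Task* task;
    SOS_Scheduler* scheduler;
};

// Stand-in for the TLS slot holding the ambient SystemThread.
thread_local SystemThread* t_currentThread = nullptr;

inline void BindCurrentThread(SystemThread* thread) {
    t_currentThread = thread;
}

inline Worker* AmbientWorker() {
    return t_currentThread ? t_currentThread->worker : nullptr;
}

struct ToyEvent {
    // Note the signature: no Worker parameter. The event (the hidden
    // "this") finds the calling worker through thread-local state.
    bool CallerMaySuspend() const {
        Worker* worker = AmbientWorker();
        return worker != nullptr && worker->eligibleForSuspension;
    }
};

inline void CheckAmbientState() {
    Worker worker{42, true};
    SOS_Task task{1};
    SOS_Scheduler scheduler{0};
    SystemThread thread{&worker, &task, &scheduler};

    BindCurrentThread(&thread);
    ToyEvent ev;
    assert(ev.CallerMaySuspend());
    assert(AmbientWorker()->id == 42);
    BindCurrentThread(nullptr);  // leave no dangling pointer behind
}
```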

Remember as well that the “client” code running here (SQL Server running on top of SQLOS) is unlikely to interact with a bare EventInternal, and is more apt to deal with another class that in turn contains one. Hence these concerns get progressively more isolated from the developers who need to write the meat of SQL Server code.

Final thoughts

Understanding the EventInternal class is a gateway to various SQLOS synchronisation topics, and it builds on nothing more complicated than linked lists and the idea of thread state carried in the ambient SystemThread. I have now started scratching the surface of scheduling and thread suspension, which are big topics in their own right, and I plan to follow through with the mechanics of suspension in a separate post.

Further reading

I have been looking for an excuse to reference Joe Duffy’s excellent Concurrent Programming on Windows. I’m not going to pretend to have worked through it completely yet, but it’s a great practical reference describing threading and synchronisation objects, historic context, and implementation in Windows. Duffy covers both native Windows and .Net threading, and there are quite a few nuggets of SQL Server-related history lying about in there. If you find this stuff remotely interesting, it’s quite accessible, complementing reference works like Windows Internals.
