How SQL Server counts page refs without counting

While researching my previous post on latch promotion, I came across an odd piece of magic that made me do a double take. And sleep on it. And tear out my hair. I took it on faith that it was more likely a clever algorithm than a brain fart, but I could not stop asking myself…

What were they smoking?

The piece of code in question is really simple, involving the maintenance of the BUF structure’s breferences member, which notionally is a count of how many times the page has been touched since the page’s counters were last reset.

Some background:

  • The counter reset happens about every four seconds and sets breferences to zero.
  • rdtsc is our fastest possible time stamp counter, counting CPU clock cycles. At 2 GHz, the bottom 32 bits roll over roughly every two seconds.
  • The ror operator (rotate right) is a form of bitwise right shift, but the bits shifted out get shifted in at the opposite end. So ror 3 takes the least significant three bits and makes them the most significant three.

The fragment in question is invoked every time a page is touched, and translates to something like this:

uint32 tsc = lower 32 bits of rdtsc;
tsc = tsc ror 3;
if ((tsc & breferences) == 0)
  breferences = breferences * 2 + 1;
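As a minimal C sketch of that fragment (with the rotate spelled out as a helper, and breferences modelled as a bare counter rather than the real BUF member), it might look like this:

```c
#include <stdint.h>

/* Rotate right for 0 < n < 32: bits shifted out at the bottom
   re-enter at the top. */
static uint32_t ror32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* Hypothetical stand-in for the per-touch maintenance code; the
   caller supplies the low 32 bits of rdtsc. */
static void touch_page(uint32_t *breferences, uint32_t tsc_low32) {
    uint32_t tsc = ror32(tsc_low32, 3);
    /* Parentheses matter here: in C, == binds tighter than &. */
    if ((tsc & *breferences) == 0)
        *breferences = *breferences * 2 + 1;
}
```

Note that when breferences is 0, the and always comes out zero, so the very first touch deterministically moves it to 1.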

Oh, the unspeakable cleverness

I’ll spare you my false starts, but I think I finally have it. The first observation is that, on the occasions breferences increments, it does not increment linearly, but instead has an exponential growth pattern. These increments take it through the sequence 0, 1, 3, 7, 15, 31, 63, 127, 255 etc. Or in binary: 0, 1, 11, 111, 1111, 11111, 111111, 1111111, 11111111…

Those numbers are each one less than a power of two. Forget the offset, and think of the number as simply doubling on each increment if it keeps your head clearer – instead of accuracy, we have an order-of-magnitude reference count.

But how on God’s green earth can this be a useful number, and why is there a timestamp in the picture?

The way I reasoned it out for myself in the end is that this partial rotated timestamp is essentially a cheap semi-random number; while clearly not random, it is good enough for our purposes. And our only interest lies in the bit pattern for the least significant N bits, with N growing as breferences grows.

We are in the realm of bargain-basement statistics here. Observe: for a random number, there is a 1 in 2 chance that it is divisible by 2, a 1 in 4 chance that it is divisible by 4, a 1 in 8 chance that it is divisible by 8, etc. Turned sideways, we expect to need two tries to find a number divisible by two, four tries for one divisible by four, and so on. And the bit patterns (1, 11, 111…) we are anding against the modified TSC are exactly those divisibility tests: the and comes out zero precisely when the bottom N bits are all zero, i.e. when the number is divisible by 2^N.
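To convince yourself of the divisibility arithmetic, here is a small exhaustive check of my own (not from the SQL Server code) over all 8-bit values, counting how many satisfy the and-is-zero test for each mask of the form 2^N − 1:

```c
/* For a mask of the form 2^N - 1, the test (x & mask) == 0 means
   "the bottom N bits are zero", i.e. x is divisible by 2^N, and
   exactly 1 in 2^N of all values qualifies. */
static unsigned count_passing(unsigned mask) {
    unsigned hits = 0;
    for (unsigned x = 0; x < 256; x++)
        if ((x & mask) == 0)
            hits++;
    return hits;
}
```

count_passing returns 128 for mask 1, 64 for mask 3 and 32 for mask 7 – exactly the 1-in-2, 1-in-4 and 1-in-8 odds described above.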

Now translate this into something useful:

  • If breferences is 0, we’ll always update it to 1. This is deterministic, and it accurately reflects that the page was touched once.
  • If breferences is 1, let’s assume that it will take two attempts before the increment branch is taken; this is the 1 in 2 chance of the random number being even. The successful attempt updates it to 3; in other words, roughly two further touches have added two to the count.
  • If breferences is 3, it will probably take four tries before it is incremented to its next value, which is 7.
  • Similarly it will likely take eight attempts to increment it to the next value, 15.
  • Et cetera, et cetera.
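The bullet points above are easy to simulate. The sketch below runs the increment logic n times, with a simple LCG standing in for the rdtsc value (an assumption on my part: I am only claiming the live TSC is similarly “random enough” in its low bits):

```c
#include <stdint.h>

/* Simulate n page touches of the probabilistic counter and return
   the final breferences value. The PRNG is a stand-in for rdtsc. */
static uint32_t simulate_touches(int n) {
    uint32_t breferences = 0;
    uint32_t x = 1;                          /* PRNG state in place of the TSC */
    while (n-- > 0) {
        x = x * 1664525u + 1013904223u;      /* Numerical Recipes LCG step */
        uint32_t tsc = (x >> 3) | (x << 29); /* the ror 3 from the fragment */
        if ((tsc & breferences) == 0)
            breferences = breferences * 2 + 1;
    }
    return breferences;
}
```

After 1000 simulated touches, breferences lands on some value of the form 2^k − 1 in the broad vicinity of 1000 – the right order of magnitude, never the exact count.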

So even though breferences can take only 33 distinct values, each value expresses a best-guess reference count, rounded to the nearest power of two. And given what we use it for, this coarseness and slight nondeterminism is perfectly fine. As the expression goes, it’s good enough for jazz.

Yes it’s clever, but what’s the point?

Let’s assume that an older buffer pool implementation literally incremented breferences on every page access. This is all very well and accurate, but it comes at a cost, because each update invalidates a cache line, and that is something we prefer to avoid in a hot data structure: the 64-byte cache line containing breferences must be re-read by the next CPU that accesses the BUF. This invalidation is unavoidable for the latch itself, which is written on every acquisition, but the latch lives in a different cache line within the BUF struct. So even though the issue remains for the latch, we can reduce it massively for metadata members like breferences, and that is still a win.

This touches on another area of highly optimised data structures. Looking at the memory layout of the BUF, we find that the members which rarely change (page pointer, database ID, file ID, page ID etc) sit snugly in one cache line, and this is one which will rarely be invalidated. Then we have another cache line containing breferences, bcputicks etc. This is updated far more frequently, but it still isn’t very hot, comparatively speaking. The decision not to measure acquisition time on every acquisition (the 10% sampling discussed here) fits in this narrative as well. If we always logged acquisition time, we’d have more accurate statistics, but it would come at the cost of more cache line invalidation, and hence the act of measurement would bog us down unnecessarily.

One thing I’m learning here is that we can safely assume that data structures that are known to be highly concurrent will be strewn with Easter eggs of algorithmic cleverness. And Easter egg hunting is a great way to learn from clever people’s experience.

Bonus link: While it isn’t pertinent to our “random number” use of rdtsc, Bruce Dawson made some great observations about just how accurate a modern invariant TSC is.
