#TSQL2SDAY: Sing a song of unsigned ints

[TSQL2SDAY logo]

Occasionally we do as we are told. And when Brent Ozar tells us that we should celebrate the first T-SQL Tuesday of 2017 by writing about SQL Server bugs and enhancement requests, the only appropriate response is “What color, sir?”

The king was in his counting-house

Following Brent’s advice of Googling for “won’t fix” Connect items, an interesting enhancement request soon caught my eye. There are quite a few folks out there bemoaning the lack of unsigned int and bigint data types, for instance here, here, and here. Of course, the subject has also had some good airplay elsewhere, for instance in this DBA StackExchange post.

Let’s take a look at what is being asked here. Using the 32-bit integer as an example, we currently have a data type that can accept a range between negative two billion and two billion. But if negative numbers aren’t required, we can use those same 32 bits to store numbers between zero and four billion. Why, goes the question, throw away that perfectly useful upper range by always reserving space for a negative range we may not need?
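To make those bookends concrete, here is a minimal C sketch (C rather than T-SQL, since T-SQL has no unsigned integer types to demonstrate with):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* The same 32 bits, carved up two different ways. */
    printf("signed:   %" PRId32 " to %" PRId32 "\n", INT32_MIN, INT32_MAX);
    printf("unsigned: %" PRIu32 " to %" PRIu32 "\n", (uint32_t)0, UINT32_MAX);
    /* signed:   -2147483648 to 2147483647
       unsigned: 0 to 4294967295          */
    return 0;
}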

There are four strands to the argument, all explored quite passionately in the posts mentioned above:

  • Many programming languages, and some competing RDBMSs, offer such data types in both signed and unsigned variations. Why can’t SQL Server?
  • In mitigation, it’s hard to come up with non-artificial scenarios where you need numbers higher than two billion, but you can guarantee you’ll never exceed four billion.
  • However, sometimes your data type is defined externally, either by a non-negotiable application “requirement” or by the need to migrate data between SQL Server and another RDBMS.
  • Given the relative complexity of implementing the change, Microsoft isn’t inclined to fix it.

A data type for ants? It needs to be at least TWO times bigger!

In most cases where we’re concerned about running out of numbers, a mere doubling of the available range is unlikely to solve much. If you’re currently living close to the edge, you really want to move the edge much further away than that.

We may of course find the odd domain where doubling the range of an integer type would make a useful difference. As a contrived example, let’s say we need to measure time at millisecond resolution up to a maximum value of one minute. A 16-bit signed integer would only take us to 32.767 seconds, but for the same storage space an unsigned variant would take us slightly past the 60-second mark.
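A quick C sketch of those numbers (my contrived one-minute requirement, nothing more):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Milliseconds in 16 bits: signed tops out short of 33 seconds,
       while unsigned comfortably clears the one-minute mark. */
    printf("signed max:   %d ms\n", (int)INT16_MAX);       /* 32767, i.e. 32.767 s */
    printf("unsigned max: %u ms\n", (unsigned)UINT16_MAX); /* 65535, i.e. 65.535 s */
    return 0;
}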

Is this in itself a compelling argument? I guess it depends on how many real-world examples we could find where this would really have made a difference.

I’d be slightly more sympathetic to the idea of “Every bit is sacred, every bit is great”. But if it is thoughts of bit packing that keep you awake at night, perhaps you should spend fewer hours on the end-user side of relational database design.

So far, then, the strongest case for wanting new unsigned types seems to be integration with external systems that somehow require it.

Two’s complement, three’s a crowd

As a refresher, the implementation issue comes down to a convention in binary arithmetic, where negative numbers are represented in two’s complement form. When treated as a signed integer, the most significant bit of the binary word (of whatever size) becomes the sign bit; basic arithmetic operations remain as for the unsigned case, but the semantics around magnitude comparison and overflow detection change.
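Before we get to the toy examples, here is a two-line illustration in C: the same eight bits read through two different types (memcpy reinterprets the bits without any arithmetic conversion):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint8_t bits = 0xFF;           /* 1 1 1 1 1 1 1 1 */
    int8_t  as_signed;
    memcpy(&as_signed, &bits, 1);  /* reinterpret, don't convert */

    printf("unsigned reading: %u\n", (unsigned)bits); /* 255 */
    printf("signed reading:   %d\n", (int)as_signed); /* -1  */
    return 0;
}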

Let’s look at some toy examples in an eight-bit integer world, i.e. the parallel universe where tinyint is signed.

decimal 0   =  0 0 0 0 0 0 0 0
decimal 1   =  0 0 0 0 0 0 0 1
decimal -1  =  1 1 1 1 1 1 1 1
decimal 255 = *cannot be represented in eight bits*

In the real world where tinyint is in fact unsigned, things look slightly different:

decimal 0   =  0 0 0 0 0 0 0 0
decimal 1   =  0 0 0 0 0 0 0 1
decimal 255 =  1 1 1 1 1 1 1 1
decimal -1  = *cannot be represented in any unsigned int*

The magic lies in the fact that the CPU doesn’t even care about the distinction between signed and unsigned. It applies the same simple logic to the same bit patterns, and it is up to user code to enforce semantics. So for the sum of binary 1 + 11111111, where the latter could be either decimal 255 or decimal -1, the addition would play out as follows, spilling into the carry bit:

     0 0 0 0 0 0 0 1
+    1 1 1 1 1 1 1 1
     ===============
   1 0 0 0 0 0 0 0 0

If treated as signed, the carry bit can be ignored, and we see that the sum of 1 and -1 is indeed 0. If unsigned, we have a carry, which we may need to treat as an overflow. (I’m glossing over the distinction between Carry and Overflow – this is described very nicely here.)
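The same sum rendered as a C sketch: unsigned eight-bit arithmetic is defined to wrap modulo 256, so both readings land on the same all-zeros bit pattern, exactly as the CPU sees it:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t u = (uint8_t)(1u + 255u); /* carry out of bit 7: wraps to 0 */
    int8_t  s = (int8_t)(1 + -1);     /* signed reading of the same sum: 0 */

    printf("unsigned 1 + 255 = %u\n", (unsigned)u); /* 0 */
    printf("signed   1 + -1  = %d\n", (int)s);      /* 0 */
    return 0;
}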

Another big gotcha is magnitude comparisons. When unsigned, 11111111 > 0, but when signed, 11111111 < 0. This just gets more interesting when you need to allow implicit casts to do comparisons between signed and unsigned types. I find it reminiscent of the head-scratching around string collation: sometimes "A" = "a" and sometimes it isn't, and sometimes dissimilar things have different relative ordering in different contexts.
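A small C sketch of the gotcha, including the implicit-promotion trap when the two meet:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t u = 0xFF; /* 255 */
    int8_t  s = -1;   /* the same bit pattern: 1 1 1 1 1 1 1 1 */

    printf("unsigned 0xFF > 0: %s\n", (u > 0) ? "true" : "false"); /* true  */
    printf("signed   0xFF > 0: %s\n", (s > 0) ? "true" : "false"); /* false */

    /* Mix the two and implicit promotion decides for you: both operands
       are widened to int, so this comes out as 255 == -1, i.e. false. */
    printf("u == s: %s\n", (u == s) ? "true" : "false");
    return 0;
}

What this comes down to is that different code paths are required for the program-level handling of signed vs unsigned types. It isn't fundamentally that complicated, but code paths need to earn their right to be created and maintained. This is how I interpret the below Connect comment by Microsoft's Jim Hogg: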

On the cons side … mixing signed/unsigned types in C or C++ seems like it should be simple enough. It’s not. It opens a small tarpit of hard-to-find mistakes – most due to the complex rules for implicit promotions/widenings. SQL, alas, already has an even more complex set of implicit casting rules. Adding unsigned ints, I fear, would confuse us all even more.

He doesn’t touch on the further need for updates to the TDS protocol etc. to add a bunch of new types, but one can imagine that this would be a significant cost too.

Let me just explain widening, since this is about to become salient in the next section. When casting an integer to a wider integer, e.g. 32 bits to 64 bits, one must commit to filling the unused high-order bits in one of two ways:

  • For unsigned semantics, you zero-extend it, i.e. fill the upper 32 bits with zeros.
  • For signed semantics, you sign-extend it, i.e. fill the upper 32 bits with the most significant bit of the lower 32.
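In C the declared type picks the extension for you; a minimal sketch:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t u = 0xFFFFFFFFu; /* all 32 bits set */
    int32_t  s = -1;          /* the same 32-bit pattern */

    uint64_t zext = u; /* zero-extended: upper half filled with zeros   */
    int64_t  sext = s; /* sign-extended: upper half copies the sign bit */

    printf("zero-extended: 0x%016" PRIX64 "\n", zext);           /* 0x00000000FFFFFFFF */
    printf("sign-extended: 0x%016" PRIX64 "\n", (uint64_t)sext); /* 0xFFFFFFFFFFFFFFFF */
    return 0;
}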

Simple enough, yes? Just keep a clear head, and I’m sure it will turn out alright.

If a bug falls in an unused code path, does it make a sound?

So, on the subject of that “tarpit of hard-to-find mistakes”, let’s progress from the Enhancement Requests section of the blog spec to the Bug Parade. With the proviso that I’m describing something that isn’t an acknowledged bug, likely because the code path doesn’t (currently) come into play, here is a description of a tiny little bug within SQL Server that fits Jim’s description of a tarpit.

Windows has a constant called INFINITE with the value 0xFFFFFFFF, i.e. all 32 bits set. This is used as a magic value for the dwMilliseconds parameter of wait functions like WaitForSingleObject(). You can think of its value as either the maximum unsigned int or -1, and I guess many people mentally interchange the two options. SQL Server uses this constant when calling such APIs, but it also inherits the “all bits set” pattern for tick counts that are intended to represent infinity, e.g. the notional quantum end time of a preemptive thread.

From within a particular SQL Server method (which for the moment will not be named), here is the condensed fragment demonstrating the bug; this has remained unchanged from 2014 to 2016. Upon entering the fragment, the 32-bit r8d register contains a millisecond timeout value.

mov  eax, r8d
or   rbx,0FFFFFFFFFFFFFFFFh
cmp  rax, rbx
je   ...

Only three relevant assembly instructions leading into a conditional branch, so I’ll explain them one at a time.

mov eax, r8d. Simple stuff: assign the value of r8d to eax. As a side effect, the higher-order 32 bits of rax (where eax is the lower 32) are cleared.

or rbx, 0FFFFFFFFFFFFFFFFh. Roundabout way of assigning that constant to rbx. This phrasing is run-of-the-mill compiler optimisation: the OR with a sign-extended 8-bit immediate compiles down to a four-byte instruction, whereas a mov with the literal 64-bit constant would need eight bytes for the immediate plus another one or two for the opcode.

cmp rax, rbx followed by je (jump if equal). Is rax == rbx? If so we’ll take the conditional branch. Hmm, but this ain’t gonna happen. RBX has all 64 bits set, whereas RAX will always have its upper 32 bits clear.

I think the intention here was “Do {XYZ} if timeout == INFINITE”. However, irrespective of how this looks in the source code, somebody phrased the comparison against the 64-bit variation of INFINITE, and you’ll never match that against anything expressed in 32 bits.

Unless…

Let’s say you were thinking about INFINITE as equal to -1 and you declared your types as signed. Now before comparing the incoming 32-bit value against the 64-bit constant, the compiler would helpfully sign-extend the 32-bit value. Semantically we would then be successfully matching the 32-bit -1 with the 64-bit -1.
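To make that concrete, here is a hedged C reconstruction of the two shapes; the variable names are mine, and on the signed side I’m assuming the two’s complement conversion behaviour that every mainstream platform exhibits:

#include <stdint.h>
#include <stdio.h>

#define INFINITE 0xFFFFFFFF /* the Windows constant: all 32 bits set */

int main(void)
{
    uint32_t timeout = INFINITE;

    /* The buggy shape: the unsigned 32-bit value is zero-extended, so it
       can never equal the 64-bit all-ones constant. */
    if ((uint64_t)timeout == 0xFFFFFFFFFFFFFFFFULL)
        printf("unsigned: matched\n");
    else
        printf("unsigned: never matches\n"); /* always this branch */

    /* The signed shape: -1 sign-extends to 64-bit -1, and the intended
       "timeout == INFINITE" match succeeds. */
    int32_t signed_timeout = (int32_t)timeout; /* -1 on two's complement */
    if ((int64_t)signed_timeout == -1LL)
        printf("signed: matched\n"); /* always this branch */

    return 0;
}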

Conclusion

This little bug can be seen as a classic mixed-domain problem. That timeout parameter can be a positive integer or a magic number, and this magic number has different values in its 32-bit and 64-bit incarnations. It so happens that treating the parameter as a signed integer yields the expected widening behaviour, but it is a moot point whether or not we think about it as a signed number, because either way we have to expend the mental effort.

Don’t get me wrong: at the systems programming level, code is chock full of cleverness, and this is a normal and valid state of affairs. But such cleverness makes type safety hard. Ultimately it comes down to type design, doesn’t it? Sure, even in the database world we are used to thinking in terms of what can fit into e.g. a 32-bit integer. But getting too obsessive about it is a sign of wearing the System Programmer Hat when we are actually playing at database design.

How often in the real world does a user interface remind you to “Please enter a number between 0 and 4294967295”? Or for that matter “Please enter a number between -2147483648 and 2147483647”? These are all just big numbers that act as edge-of-the-universe bookends to ranges which we would likely want to constrain using business rules anyway.

Final verdict: I’m sure I could find the occasional use for unsigned ints and bigints. But the thought of keeping developers, testers and product managers away from implementing more exciting things tarnishes the appeal a bit.

Thank you for reading, and thanks again to Brent for an inspiring subject.

3 thoughts on “#TSQL2SDAY: Sing a song of unsigned ints”

  1. And for perfect symmetry, let’s also have signed tinyints.

    Yes, I’ve seen a use for that interpretation of a lovely byte.

    1. Hi Paul,

      Ah yes, the signy-int – just the ticket for storing 8-bit audio in a database!

Seriously though, I have come across that one too, but only as something specified in an outside data source that needed to be ingested. In that case I just reinterpreted the bits as unsigned on the way in, i.e. negative numbers now appear to fall in the 128-255 range. This was possible because there was no need to do magnitude comparisons between these values, and it was okay to treat them as opaque bit patterns. In fact, if I remember correctly, the negative range wasn’t even used in practice – it’s quite possible that someone just came up with the use of that type in an outside system because it was there as a default choice, and not because negative numbers were needed.

Incidentally, that reinterpretation hack is just the flip side of reseeding identities to the lowest negative number when the positive range runs out. In that case, if you reinterpret the bit pattern as unsigned, you’re really now using the high positive range that the data type doesn’t explicitly provide. Which works fine for opaque unique numbers, and only falls apart if you need to do magnitude comparisons.
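A minimal C sketch of the reinterpretation I mean (illustrative values only):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int8_t  incoming = -96; /* a signed byte from the outside system */
    uint8_t stored;
    memcpy(&stored, &incoming, 1); /* keep the bits, drop the sign semantics */

    printf("%d comes in, %u gets stored\n", (int)incoming, (unsigned)stored);
    /* -96 comes in, 160 gets stored */
    return 0;
}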

      Care to share your signed 8-bit experience? I’m honestly curious to identify a case where that is truly the “right” data type for the job at hand. Maybe something like temperature measurement where fractional precision isn’t needed and the likely domain is constrained enough?

      Regards
      Ewald
