Month: April 2017

0:022> k # Child-SP RetAddr Call Site 00 0000008c`e460c238 00007ff7`51595990 SQLSVC!QScheduleCalcNextOccurrence 01 0000008c`e460c240 00007ff7`51595459 SQLAGENT!SetScheduleNextRunDate+0x1f0 02 0000008c`e460c310 00007ff7`5158f992 SQLAGENT!SetJobNextRunDate+0x1a9 03 0000008c`e460c7f0 00007fff`18514f87 SQLAGENT!JobManager+0xfe2 04 0000008c`e460fc70 00007fff`1851512e MSVCR120!_callthreadstartex+0x17 [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 376] 05 0000008c`e460fca0 00007fff`261913d2 MSVCR120!_threadstartex+0x102 [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 354] 06 0000008c`e460fcd0 00007fff`266754e4 KERNEL32!BaseThreadInitThunk+0x22 07 0000008c`e460fd00 00000000`00000000 ntdll!RtlUserThreadStart+0x34 0:022> r rax=000000000002ed9c rbx=0000000000000000 rcx=0000008ce460c288 rdx=000000000133c6b1 rsi=0000008cc8b23380 rdi=0000008ce428c278 rip=00007fff0c0d3e20 rsp=0000008ce460c238 rbp=0000008ce460c2a9 r8=000000000002eda1 r9=0000000000000000 r10=00000000000118a0 r11=0000000000000011 r12=0000000000000001 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei ng nz na pe cy cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000283 SQLSVC!QScheduleCalcNextOccurrence: 00007fff`0c0d3e20 48894c2408 mov qword ptr [rsp+8],rcx ss:0000008c`e460c240=0000008ce460c274

0:022> dd 8ce460c288 l0x20 0000008c`e460c288 00000004 00000001 00000004 00000005 0000008c`e460c298 00000000 00000000 0133c6b1 05f5bebf 0000008c`e460c2a8 00000190 000399b7 0133c6b1 0002ed9c 0000008c`e460c2b8 0133c6b1 0002ed9c 56b04598 00004634 0000008c`e460c2c8 c8b23390 0000008c 00000000 00000000 0000008c`e460c2d8 00000000 00000000 e4280400 0000008c 0000008c`e460c2e8 00000000 00000000 00000000 00000000 0000008c`e460c2f8 e428c278 0000008c e460c410 0000008c

0:022> ? 133c6b1 Evaluate expression: 20170417 = 00000000`0133c6b1 0:022> ? 0x2ed9c Evaluate expression: 191900 = 00000000`0002ed9c 0:022> ? 05f5bebf Evaluate expression: 99991231 = 00000000`05f5bebf

It’s SQLbits this week, and I’m as excited as a query going through the compile semaphore for the first time, so I’m taking a break from the heavy stuff and gossiping about bugs instead.

The thing about linked lists

I’ve previously done a blog post on doubly linked lists, but here is all you really need to know:

A list entry consists of two pointers, flink (forward link) and blink (backward link).
This structure is embedded at a known offset within the object that wants to be within the list. Because the offset is known, no extra pointer is required to point to the containing object.
The list head is also a flink/blink pair, living within the object that owns the list. An empty list is denoted by both flink and blink pointing to the list head iself.
Because list manipulation needs multiple updates, it can’t be done atomically, and inevitably the list is protected by a spinlock. Even reading it safely requires acquiring the spinlock first, and this is one of the reasons why singly linked lists feature in Hekaton.
Traversing a list involves getting the address of the next entry – flink or blink, depending on direction- and then doing something with the containing record. This may be as simple as checking an attribute and returning that object if it is the desired item, or continuing the traversal if it is not.

On the implementation detail side, you’ll normally find that a function to retrieve a list entry will return a null pointer, i.e. the value zero, when it finds an empty list, because it can’t validly return the address of the list head itself. Consuming code will then treat the null pointer as a signal value that nothing was found, and it will most certainly not try and dereference that value by treating it as an address.

Examples from the wild

These beasties are sometimes easy to recognise from a data dump alone. Here is what the list head for an empty list looks like, in this case from the very CBatch instance I was looking at while researching last week’s post:

00000000`38773408  00000000`38773408
00000000`38773410  00000000`38773408

Two pointer-sized members, the first one pointing to itself, and the second also pointing to the first, or strictly speaking to the pair starting with the first.

Then we have a list with a single entry, which is also easy to recognise by sight: both flink and blink of the list head point to the same list entry, and both flink and blink of that list entry point back to the list head. Interestingly, from those two bits of information alone, it’s impossible to tell which is the list head and which is the list item.

Here is the listhead within a CSession instance pointing at the list item for the single CBatch within that session:

00000000`38772f40  00000000`38773378
00000000`38772f48  00000000`38773378

And on the list entry side of the equation, here is the very start of that CBatch instance, with the list entry starting at the second member:

00000000`38773370  00007ffd`61d17b80 sqllang!CBatch::`vftable
00000000`38773378  00000000`38772f40
00000000`38773380  00000000`38772f40

This is a lovely textbook example. Firstly the vftable symbol confirms that we’re dealing with a CBatch, but the list entry starts at offset 8 and not 0. What this means is that one gets at the CBatch instance linked into this particular list by deducting 8 from the list entry’s address.

One way to cause a stack dump

In my previous post, I referenced one of Kendra Little’s Dear SQL DBA episodes, where she demonstrates a stack dump. The exception that led to the stack dump in question was a particular variation on a null reference:

Access violation occurred writing address 0000000000000008

Now I must admit that I didn’t get around to running Kendra’s repro and checking for myself, but this looks like a slam dunk for a dereference problem on a linked list while simultaneously demonstrating a clever safety convention.

Let’s start with the safety convention. The “null” of a null pointer isn’t a magic value, but in real-life implementation is simply zero, which is a perfectly valid virtual address. However, on the premise that trying to access address zero or addresses near it probably indicates a program error, the OS will map that page in such a way that trying to access it causes an access violation. This is not a bug or an accident, but a damn clever feature! Robert Love explains it very nicely over here for Linux, and it applies equally to Windows.

Now recall the convention that trying to retrieve the head or tail of an empty list will – by convention – bring you back a null pointer. When iterating, a related convention may also return a zero when you’ve gone all the way around and come back to the list head. Clearly the onus is on the developer to recognise that null pointer and not dereference it, but attempting to do so sets in motion the safety feature of an access violation, which can then be neatly caught through standard exception handling, for instance yielding a diagnostic stack dump.

Finding a dereference of address zero doesn’t suggest much in itself – zeroes can come from lots of sources. But the address 8 is far more evocative. If you are doing linked list traversal, the normal direction is through blink, which is accessed by adding eight to the address of the list item. And let’s say that you got the value zero as the address of your list item, but didn’t check the value before doing that dereference, you’d be dereferencing address 8. Hey, there can be other ways to get to address 8, but this is a strong candidate.

I should of course emphasise that diagnostic code such as DBCC PAGE jumps through hoops to avoid standard safety features like taking out locks while remaining “safe enough”, so finding a null pointer bug in such an area is just a reinforcement of the idea that this is unsupported, undocumented, and doesn’t necessarily live by the platinum standards expected in the rest of the product.

Lurking bugs that don’t bite

Ah, the thrill of the chase. There is a perverse delight in reading through a bit of disassembly and finding that logical hole where a safety check was only half-done. Here is an example from the beginning of a function (which will remain unnamed) to the point where we have a null-reference bug – I added simple labels and ASCII-art arrows for control flow.

         mov  rax,qword ptr [rcx+58h]
         add  rcx,50h
         cmp  rax,rcx
 +------ je   *1*
 |
 |       text rax,rax ; is rax == 0?
 +------ je   *1*
 |
 |       add  rax, 0FFFFFFFFFFFFFFF8h ; i.e. subtract 8
 |    +- jmp  *2*
 v    |
*1*   |  xor  eax,eax ; i.e. mov rax,0
      v
     *2* cmp  dword ptr [rax+1ACh], r8d ; <-- may dereference address 1AC

It's all in the last two instructions. A path exists where rax may contain the value 0, and while you'd normally expect the zero-check before the dereference, that zero-check is missing.

Now unlike the DBCC PAGE example, which has been known to go wrong, this is in a "production" function which doesn't appear to blow up all the time, i.e. the preconditions for the bug coming to light simply aren't there. In practical terms, at the point this function is called, clearly the list in question is never empty. Does that kind of code pattern even count as a bug? I guess the only thing that makes it stand out is the acknowledgement that we may find rax to be zero, or even force it to zero, right before the dereference. But this is probably inherited from an inlined function, and it is only in the disassembly that we see the juxtaposition so clearly.

For my part, I shall continue to delight in finding oddities like this. But at the same time it reminds me that I have a lot to be humble about: if I can spot these little things in the code produced by seriously good programmers, just think of the horrors they would be able to find in my code.

Month: April 2017

SQL Agent and the hairiest Dateadd in town

TL;DR

Spelunking time

Where next?

SQL Server doubly linked lists revisited

The thing about linked lists

Examples from the wild

One way to cause a stack dump

Lurking bugs that don’t bite