davidgow.net: The Stack Clash Vulnerability and the Ancient Pharaohs

So, this is the tale of a largely self-inflicted woe, which results from two somewhat disparate things: the Stack Clash vulnerability, and the 1999 computer game Pharaoh.

Screenshot of Pharaoh runnning under Linux (ingame)

First, some backstory. I've been quietly implementing some of the Win32 API on top of Linux (and, for much of the WM and graphics code, libSDL 2.0), in order to get some old Windows games running. Wine actually runs these games pretty close to perfectly, but as many of you may know, it implements the Windows APIs a little too accurately — it, for example, actually changes the screen resolution rather than emulating it with an offscreen buffer.

Ultimately, the way that slx_win32 — being the name of this monstrosity — and indeed wine work are the same: they load the .exe file into memory, resolve any imports (functions loaded from other .dlls), and then jump to the entry point. The Windows program is now running exactly like any other Linux process, save for a few special changes to the environment (such as the allocation of a Thread Information Block (TIB) and the associated setting of an LDT entry and segment register): all of the translation is just done by the Windows API functions we implement.

Except that's not entirely true: x86 Linux sometimes behaves slightly differently to Windows, and sometimes programs will rely on this behaviour. To make things just a little bit more exciting, while Pharaoh was written to run on Windows 98 (which hasn't been receiving updates since 2006), Linux is still being updated today. And that ultimately led to some confusion when Pharaoh, which previously had got to a misrendered[1] title screen, suddenly started segfaulting on startup.

A bit of debugging quickly showed that the faulting function was something called __alloca_probe, a call to which, it turns out, is inserted automatically by Microsoft® Visual C++® whenever there is a function which allocates more than a page of stack space. Note that the (limited) documentation out there for _alloca_probe is pretty misleading — it mostly describes some stack alignment code, and tells you to call _chkstk when your function actually has more than a page of stack space. This seems to be misleading, as the __alloca_probe function is definitely doing what most of the documentation out there is saying ought to be done by _chkstk. Maybe this is a case of IDA Pro misnaming the function, but it's just as likely to be a case of stale documentation.

In any case, the purpose of this __alloca_probe function is to ensure that the stack is large enough for the function to execute, and the way it does this is quite clever. While there is a maximum stack size (either determined through something in the Windows PE header, or via a limit set by uname), it would be wasteful for the operating system to allocate the maximum amount memory (and address space) for the stack every time a thread starts. Both Windows and Linux are lazy in this regard: they allocate the bare minimum amount of stack space, and leave some number of pages after the stack as guard pages[2]. These pages are configured such that the usermode process doesn't have any access to them: any attempt to read or write to them (or execute code from them) will trigger an exception. The OS, however, knowing that this is a guard page, uses this as a hint to grow the stack: it allocates new pages to cover where the access was, and moves the guard pages in case the stack grows futher. This allows the stack to grow dynamically.

While this is clever, it has one oversight: if a function wishes to allocate a very large object on the stack — one larger than the guard pages — and then accesses the lowest part of it, the access can skip over all of the guard pages, accessing unmapped memory and causing a segfault. Ouch! This is where __alloca_probe comes in: it grows the stack deliberately, by looping over every page in the allocation and trying to read it, thus always hitting the guard page.

Screenshot of Pharaoh running under Linux (opening cinematic)

For some reason, though, this wasn't working: clearly the stack wasn't growing when the guard pages were being touched. About halfway through __alloca_probe, the guard pages ceased to function. This was definitely a stack problem, as the game would work if I forced the stack to grow from within the Linux side of my code, either by calling alloca() in a function and returning from it, or by malloc()ing a new stack, and using some inline assembly to set the stack pointer to this new stack.

But, remove those hacks, and the game still crashed in __alloca_probe. Clearly, something in __alloca_probe works on Windows (and possibly used to work on Linux), but was now failing to grow the stack. What's more, it was working under wine under the same version of Linux. To understand why, we need to look at the implementation of __alloca_probe in a bit more detail.

The MSVC implementation of __alloca_probe accepts one input, the number of bytes of stack space to allocate, in the eax register. It preserves the value of all other registers (save the stack pointer, esp, which points to the top of the newly grown stack), and returns to the previous location. Unfortunately, this is trickier than it seems: in order to be able to return successfully (using the ret instruction), the return address has to be copied from the old top of the stack to the new top. Doing this, however, requires either the return address, or the old top of the stack to be remembered, and this requires another register, for a total of three:

one to store the remaining space to allocate,
one to store the “new” stack pointer — the address to be touched per-page, and
one to store the old stack pointer.

Microsoft does something interesting in their implementation: the remaining space to allocate remains in eax, the “new” stack pointer is stored in ecx, and the old stack pointer remains in esp until it is updated to its final value at the end of the function. To free up ecx for this use, it's saved on the (old) stack, and restored once all pages have been touched, and eax is therefore free to use as a temporary.

This all seems eminently sensible, so why doesn't it work? Could it be that Windows (particularly NT-based versions) always allocates address space on 64K boundaries? That detail had caused problems before[3] — could it be that Pharaoh was assuming this was the case, and only touching every 64K to match this size, rather than 4K to match the Linux page size? Alas, that wasn't it — __alloca_probe is indeed using a 4K page size.

The problem, it turns out, is that the stack pointer — esp — is only being updated at the end. This means that when the guard pages are being touched, they're technically outside the “valid” range of the stack. Not only does this make tools like valgrind complain, it also (it turns out) interferes with stack growth on modern Linux distros. Does Linux, for some reason, only allow guard page accesses to grow the stack if esp is also updated accordingly? Quick searches of the internet were not finding any documentation as to such, but all of my searches for linux guard page gave some detail-sparse “yes, Linux has guard pages” pages, and lots of results for a vulnerability named stack clash which — at first glance — did not seem to be related.

The stack clash vulnerability hypothesises an attack where the guard pages are jumped over — by use of a buffer overrun or similar attack on a stack-allocated buffer — in the hope that there is another piece of mapped memory below the stack, which would then be overwritten. Growing the stack into another allocation would normally cause a segfault: when a guard page access triggers the stack growth, the process will be killed if there is no room for the stack to grow due to another allocation. By skipping over the guard page, though, the kernel has no way of knowing that this access should be a part of the stack, rather than a legitimate access to the page.

Now, there's no kernel-level fix for stack clash: indeed, it's a fundamental design problem with the whole idea of automatically growing stacks. In fact, the best mitigation is actually to compile userspace code with the -fstack-check option, which basically just adds an equivalent to __alloca_probe to functions which use greater than a page of memory, just like MSVC has been doing for decades. There is, however, a kernel patch which attempts to mitigate it slightly by (basically) increasing the number of guard pages. (If you're curious, LWN.net has a good article on the subject.)

So how does increasing the size of the guard region cause this to fail? If anything, it seems like the sort of thing which would help, rather than hurt. The answer is, frighteningly enough, that this should have never worked. It turns out that Linux has never supported growing the stack (more than ~64K) below the stack pointer, esp.

So: why did it work before (and, while we're at it, why does it work in wine)? The answer actually was worked out by the Microsoft .NET CoreCLR team. When the stack clash fix increased the size of the guard pages, it caused programs to appear to use more stack space, as the guard pages were being counted as a part of the allocation. The Linux code checks to see if the page being accessed was part of an allocation before checking if it's too far below esp, so accesses within the guard page were considered fine, and the stack would grow without esp ever being checked. Because, with the stack clash patch, it was changed so that the guard pages were not part of the allocation (and so wouldn't count against stack or memory limits), the esp check would now occur even for accesses to guard pages. And now it's all broken

wine, on the other hand, bypasses all of this by just implementing its own guard pages and stack growth. It uses signal handlers to intercept the segfault which occurs when writing below its stack, and manually grows the stack with another call (via some internal wine allocation functions) to mmap(). This is clearly the right fix for this sort of issue, but it's also a lot of work to implement. So how can I hack around it without doing that much work?

In fact, I have three ways of hacking around this:

Just allocate a new block of memory that's big enough, and move esp there to make it the stack.
Force the stack to grow by using a call to alloca() — which works on Linux — and then returning, freeing the memory but leaving the stack at its greater size.
And re-implement __alloca_probe to move esp slowly when probing pages, so that it's never more than 64K behind the pages being touched.

Ultimately, at the moment I'm using the 2^nd option to pre-grow stacks to the SizeOfStackReserve field in the PE Header, which proves to be enough. Pharaoh works fine with a thread stack of 1MB.

But, just for fun, it'd be nice to try the 3^rd option. So, I got out my assembler, and wrote a replacement function. Apart from a few false starts, where I worked out why it was such a pain to have very few working registers and the stack moving out for underneath, I got a version which worked. Initially, I had the problem that my reimplementation was slightly larger than the original (which is 46 bytes long), so wouldn't fit in the executable, which I “fixed” by allocating a new block of executable memory at runtime, and patching the original with a JMP to the new implementation after loading the executable, but before actually starting it. (Actually, hooking functions has proved a really useful tool to have at my disposal when debugging this project, so I have a nice helper function which writes these JMP instructions for me.)

In the end, though, I was able to get the implementation down to 42 bytes, where I could just replace the original. If you're curious, here it is:

%define PAGE_SIZE 0x1000

__alloca_probe_new:

; ecx will store the original esp-4
; [ecx+0] will store the original ecx,
; [ecx+4] will store the return address
push ecx
mov ecx, esp
; Memory to allocate in eax
cmp eax, PAGE_SIZE
jb lastpage
loadpage:
sub esp, PAGE_SIZE
test [esp], eax
sub eax, PAGE_SIZE
cmp eax, PAGE_SIZE
ja loadpage
lastpage:
sub esp, eax
pop eax ; We need to decrease the stack size by eight, and probe it.
pop eax ; ...and again so we're decreasing it by eight
; We need to restore ecx and the return address
mov eax, [ecx+4]
push eax ; return addr
mov ecx, [ecx+0] ; restore original ecx
ret

It's easy to assemble this and use it to replace the existing __alloca_probe function, either by patching the original executable on disk, or at runtime by just memcpy()ing it over the original. The function is self-contained and relocatable, so should be pretty easy to use if you're curious. In my version of Pharaoh (v1.0), it's found at 0x00529C50 in memory (or 0x00129C50 on-disk in the .exe file). In the GOG.com version of Pharaoh Gold, it's found at 0x00561F30 (or 0x00161F30 on disk). And in the Pharaoh Demo, it's found at 0x0051B390 (or 0x0011B390 on disk).

Of course, this is really more of an intellectual exercise than anything, as just increasing the stack size on startup works as well in practice, and the “purest” way of fixing it would be to implement our own stack growth and guard pages à la wine. But, it's fun. And who knows, maybe — just maybe — someone will find it useful.

— David

[1]: The misrendering of the title screen is another fun bug: all of the buttons were not displaying, though the entire game would work when run under Valgrind. Valgrind — after some bugfixing — wasn't showing any memory errors, though. It ultimately turned out that the HeapAlloc() function, which I'd implemented as a simple wrapper around malloc(), was returning pointers > 2 GiB. Since 32-bit Windows traditionally used the top 2 GiB for the kernel, usermode pointers were always allocated below 2 GiB. This was “fixed” by using glibc's mallopt() function to disable the use of mmap() for allocations. In the future, I could fix this properly by finishing a separate heap implementation on top of VirtualAlloc(), my implementation of which uses Linux's MAP_32BIT option to only allocate from the bottom 2 GiB of address space.

[2]: I'm actually being a bit loose with the definition of guard page in this article: particularly when we get to how Linux handles them later. It might be worth looking them up (along with MAP_GROWSDOWN) if you're interested: at the very least, there are some fun facts out there. Did you know that Linux has only had guard pages for the stack since 2010?

[3]:Age of Empires has an interesting malloc() implementation, which calls the VirtualAlloc() function with an address which is not 64K aligned, and then proceeds to rely on the OS having rounded the address down, accessing memory which is actually slightly outside the bounds of the allocation. This meant that, before my VirtualAlloc() function could call mmap() to perform the actual allocation, it needed to not only round the address and size to a 64K boundary, it also needed to keep track of separate user-visible allocation addresses versus the (only 4K-aligned) underlying allocations provided by Linux.

The Stack Clash vulnerability and the Ancient Pharaohs