First, some backstory. I've been quietly implementing some of the Win32 API on top of Linux (and, for much of the WM and graphics code, libSDL 2.0), in order to get some old Windows games running. Wine actually runs these games pretty close to perfectly, but as many of you may know, it implements the Windows APIs a little too accurately — it, for example, actually changes the screen resolution rather than emulating it with an offscreen buffer.
Ultimately, the way that
slx_win32 — being the name of this monstrosity — and indeed
wine work are the same: they load the
.exe file into memory, resolve any imports (functions
loaded from other
.dlls), and then jump to the entry point. The Windows program is now running
exactly like any other Linux process, save for a few special changes to the environment (such as the
allocation of a Thread Information
Block (TIB) and the associated setting of an LDT entry and
segment register): all of the translation is just done by the Windows API functions we implement.
Except that's not entirely true: x86 Linux sometimes behaves slightly differently to Windows, and sometimes programs will rely on this behaviour. To make things just a little bit more exciting, while Pharaoh was written to run on Windows 98 (which hasn't been receiving updates since 2006), Linux is still being updated today. And that ultimately led to some confusion when Pharaoh, which previously had got to a misrendered title screen, suddenly started segfaulting on startup.
A bit of debugging quickly showed that the faulting function was something called
a call to which, it turns out, is inserted automatically by Microsoft® Visual C++® whenever there
is a function which allocates more than a page of stack space. Note that the (limited) documentation out
_alloca_probe is pretty misleading — it mostly describes some stack alignment code, and
tells you to call
when your function actually has more than a page of stack space. This seems
to be misleading, as the
__alloca_probe function is definitely doing what most of the
documentation out there is saying ought to be done by
_chkstk. Maybe this is a case of IDA Pro
misnaming the function, but it's just as likely to be a case of stale documentation.
In any case, the purpose of this
__alloca_probe function is to ensure that the stack is large enough
for the function to execute, and the way it does this is quite clever. While there is a maximum stack size
(either determined through something in the Windows PE header, or via a limit set by
uname), it would
be wasteful for the operating system to allocate the maximum amount memory (and address space) for the stack
every time a thread starts. Both Windows and Linux are lazy in this regard: they allocate the bare
minimum amount of stack space, and leave some number of pages after the stack as guard pages. These
pages are configured such that the usermode process doesn't have any access to them: any attempt to read or
write to them (or execute code from them) will trigger an exception. The OS, however, knowing that this is a
guard page, uses this as a hint to grow the stack: it allocates new pages to cover where the access was, and
moves the guard pages in case the stack grows futher. This allows the stack to grow dynamically.
While this is clever, it has one oversight: if a function wishes to allocate a very large object on the stack
— one larger than the guard pages — and then accesses the lowest part of it, the access can skip
over all of the guard pages, accessing unmapped memory and causing a segfault. Ouch! This is where
__alloca_probe comes in: it grows the stack deliberately, by looping over every page in the allocation
and trying to read it, thus always hitting the guard page.
For some reason, though, this wasn't working: clearly the stack wasn't growing when the guard pages were
being touched. About halfway through
__alloca_probe, the guard pages ceased to function. This was
definitely a stack problem, as the game would work if I forced the stack to grow from within the Linux side of
my code, either by calling
alloca() in a function and
returning from it, or by
malloc()ing a new stack, and using some inline assembly to set the stack
pointer to this new stack.
But, remove those hacks, and the game still crashed in
__alloca_probe. Clearly, something in
__alloca_probe works on Windows (and possibly used to work on Linux), but was now failing to grow the
stack. What's more, it was working under
wine under the same version of Linux.
To understand why, we need to look at the implementation of
__alloca_probe in a bit more
The MSVC implementation of
__alloca_probe accepts one
input, the number of bytes of stack space to allocate, in the eax register. It preserves the value of
all other registers (save the stack pointer, esp, which points to the top of the newly grown stack),
and returns to the previous location. Unfortunately, this is trickier than it seems: in order to be able to
return successfully (using the
ret instruction), the return address has to be copied from the old top
of the stack to the new top.
Doing this, however, requires either the return address, or the old top of the stack to be remembered, and this
requires another register, for a total of three:
- one to store the remaining space to allocate,
- one to store the “new” stack pointer — the address to be touched per-page, and
- one to store the old stack pointer.
Microsoft does something interesting in their implementation: the remaining space to allocate remains in eax, the “new” stack pointer is stored in ecx, and the old stack pointer remains in esp until it is updated to its final value at the end of the function. To free up ecx for this use, it's saved on the (old) stack, and restored once all pages have been touched, and eax is therefore free to use as a temporary.
This all seems eminently sensible, so why doesn't it work? Could it be that Windows (particularly NT-based
versions) always allocates address space on 64K boundaries? That detail
had caused problems before — could it be that Pharaoh was assuming this
was the case, and only touching every 64K to match this size, rather than 4K to match the Linux page size? Alas,
that wasn't it —
__alloca_probe is indeed using a 4K page size.
The problem, it turns out, is that the stack pointer — esp — is only being updated at the end. This means that when the guard pages are being touched, they're technically outside the “valid” range of the stack. Not only does this make tools like valgrind complain, it also (it turns out) interferes with stack growth on modern Linux distros. Does Linux, for some reason, only allow guard page accesses to grow the stack if esp is also updated accordingly? Quick searches of the internet were not finding any documentation as to such, but all of my searches for linux guard page gave some detail-sparse “yes, Linux has guard pages” pages, and lots of results for a vulnerability named stack clash which — at first glance — did not seem to be related.
The stack clash vulnerability hypothesises an attack where the guard pages are jumped over — by use of a buffer overrun or similar attack on a stack-allocated buffer — in the hope that there is another piece of mapped memory below the stack, which would then be overwritten. Growing the stack into another allocation would normally cause a segfault: when a guard page access triggers the stack growth, the process will be killed if there is no room for the stack to grow due to another allocation. By skipping over the guard page, though, the kernel has no way of knowing that this access should be a part of the stack, rather than a legitimate access to the page.
Now, there's no kernel-level fix for stack clash: indeed, it's a fundamental design problem with the
whole idea of automatically growing stacks. In fact, the best mitigation is actually to compile userspace code
-fstack-check option, which basically just adds an equivalent to
__alloca_probe to functions which use greater than a page of memory, just like MSVC has been doing for decades. There is, however, a kernel
patch which attempts to mitigate it slightly by (basically) increasing the number of guard pages.
(If you're curious, LWN.net has a good article on the subject.)
So how does increasing the size of the guard region cause this to fail? If anything, it seems like the sort of thing which would help, rather than hurt. The answer is, frighteningly enough, that this should have never worked. It turns out that Linux has never supported growing the stack (more than ~64K) below the stack pointer, esp.
So: why did it work before (and, while we're at it, why does it work in
The answer actually was worked out by the Microsoft
.NET CoreCLR team. When the stack clash fix increased the size of the guard pages, it caused programs
to appear to use more stack space, as the guard pages were being counted as a part of the allocation. The Linux
code checks to see if the page being accessed was part of an allocation before checking if it's too far below
esp, so accesses within the guard page were considered fine, and the stack would grow without
esp ever being checked. Because, with the stack clash patch, it was changed so that the
guard pages were not part of the allocation (and so wouldn't count against stack or memory limits), the
esp check would now occur even for accesses to guard pages. And now it's all broken
wine, on the other hand, bypasses all of this by just implementing its own guard pages and stack
growth. It uses signal handlers to intercept the segfault which occurs when writing below its stack, and
manually grows the stack with another call (via some internal wine allocation functions) to
This is clearly the right fix for this sort of issue, but it's also a lot of work to implement. So how
can I hack around it without doing that much work?
In fact, I have three ways of hacking around this:
- Just allocate a new block of memory that's big enough, and move esp there to make it the stack.
- Force the stack to grow by using a call to
alloca()— which works on Linux — and then returning, freeing the memory but leaving the stack at its greater size.
- And re-implement
__alloca_probeto move esp slowly when probing pages, so that it's never more than 64K behind the pages being touched.
Ultimately, at the moment I'm using the 2nd option to pre-grow stacks to the SizeOfStackReserve field in the PE Header, which proves to be enough. Pharaoh works fine with a thread stack of 1MB.
But, just for fun, it'd be nice to try the 3rd option. So, I got out my assembler, and wrote a
replacement function. Apart from a few false starts, where I worked out why it was such a pain to have very few
working registers and the stack moving out for underneath, I got a version which worked. Initially, I had the
problem that my reimplementation was slightly larger than the original (which is 46 bytes long), so wouldn't fit in the executable, which
I “fixed” by allocating a new block of executable memory at runtime, and patching the original with
JMP to the new implementation after loading the executable, but before actually starting it.
(Actually, hooking functions has proved a really useful tool to have at my disposal when debugging this
project, so I have a nice helper function which writes these
JMP instructions for me.)
In the end, though, I was able to get the implementation down to 42 bytes, where I could just replace the original. If you're curious, here it is:
%define PAGE_SIZE 0x1000 __alloca_probe_new: ; ecx will store the original esp-4 ; [ecx+0] will store the original ecx, ; [ecx+4] will store the return address push ecx mov ecx, esp ; Memory to allocate in eax cmp eax, PAGE_SIZE jb lastpage loadpage: sub esp, PAGE_SIZE test [esp], eax sub eax, PAGE_SIZE cmp eax, PAGE_SIZE ja loadpage lastpage: sub esp, eax pop eax ; We need to decrease the stack size by eight, and probe it. pop eax ; ...and again so we're decreasing it by eight ; We need to restore ecx and the return address mov eax, [ecx+4] push eax ; return addr mov ecx, [ecx+0] ; restore original ecx ret
It's easy to assemble this and use it to replace the existing
__alloca_probe function, either by
patching the original executable on disk, or at runtime by just
memcpy()ing it over the original.
The function is self-contained and relocatable, so should be pretty easy to use if you're curious. In my version of
Pharaoh (v1.0), it's found at
0x00529C50 in memory (or
0x00129C50 on-disk in the
.exe file). In the GOG.com version of
Pharaoh Gold, it's found at
0x00161F30 on disk). And in the Pharaoh Demo, it's found at
0x0011B390 on disk).
Of course, this is really more of an intellectual exercise than anything, as just increasing the stack size
on startup works as well in practice, and the “purest” way of fixing it would be to implement our
own stack growth and guard pages à la
wine. But, it's fun. And who knows, maybe — just maybe
— someone will find it useful.
: The misrendering of the title screen is another fun bug: all of the buttons were
not displaying, though the entire game would work when run under Valgrind. Valgrind — after some bugfixing —
wasn't showing any memory errors, though. It ultimately turned out that the HeapAlloc() function,
which I'd implemented as a simple wrapper around malloc(), was
returning pointers > 2 GiB. Since 32-bit Windows traditionally used the top 2 GiB for the kernel, usermode pointers
were always allocated below 2 GiB. This was “fixed” by using glibc's mallopt()
function to disable the use of
mmap() for allocations. In the future, I could fix this properly by finishing a
separate heap implementation on top of VirtualAlloc(), my
implementation of which uses Linux's
MAP_32BIT option to only allocate from the bottom 2 GiB of address
: I'm actually being a bit loose with the definition of guard page in this
article: particularly when we get to how Linux handles them later. It might be worth looking them up (along with
MAP_GROWSDOWN) if you're interested: at the very least, there are some fun facts out there. Did you
know that Linux has only had guard pages for the stack since
Empires has an interesting
malloc() implementation, which calls the
VirtualAlloc() function with an address which is not 64K aligned, and then proceeds to rely on the
OS having rounded the address down, accessing memory which is actually slightly outside the bounds of the allocation.
This meant that, before my
VirtualAlloc() function could call
mmap() to perform the actual
allocation, it needed to not only round the address and size to a 64K boundary, it also needed to keep track of separate
user-visible allocation addresses versus the (only 4K-aligned) underlying allocations provided by Linux.