FEX 2306 Tagged!
Another interesting month of changes for FEX-Emu! While this release is shorter than last, this also only has a month of work rather than two. We had some great work done this month, including a bunch of plumbing that most people won’t notice. Let’s see what changed!
Adds support for hardware TSO memory emulation prctl
Emulating the x86 memory model is the number one thing that slows down FEX emulation today. Apple Silicon supports this memory model in hardware which is why Rosetta on MacOS can get amazing performance. With some recent changes from the Asahi developers, FEX can ask the hardware to enable the TSO emulation bit. If the kernel reports back that hardware TSO memory is enabled, FEX can take a more lax approach to its memory emulation, getting an automatic speedup for Asahi systems.
Additionally not only is this a speed-up, it’s required for correct emulation. When this feature isn’t supported by the hardware, FEX needs to emulate the memory model using atomics and LRCPC instructions. This absolutely demolishes performance so it is usually recommended to disable the emulation to get “free” performance. This issue with this is that it can crash instability in the most awkward and peculiar of ways. We even found out in this last month that Unity games with their complex buffer management are highly likely to crash due to old cachelines of data hanging around. The only fix is to use emulate the memory model using hardware TSO flags or our atomic/LRCPC path. Sadly’s ARM Cortex’s hardware LRCPC implementation is barely any faster than atomics.
Support ArchLinux rootfs
FEX has now added an ArchLinux rootfs to its repository of images. This is the first non-Ubuntu rootfs image that we are providing! Please let us know if you have any issues with it! Next up is likely a Fedora image for all the Fedora ARM users out there!
Steamwebhelper crash fix
New beta versions of Steam has started relying on AT_EXECFN existing. FEX didn’t previously emulate this auxv value which was causing it to crash. With this fixed, steamwebhelper is now working again.
More AVX work!
This has been a long time coming and we’re almost there finally. After these changes FEX only needs to fix a few implementation bugs in the string operations, and implement the XSAVE instructions before allowing AVX emulation. In addition to that, FEX is also almost able to supprot AVX2 with the only instructions that need to be implemented is the gather load instructions!
Implements support for XGETBV
This is a fairly simple instruction as it lets the application query which CPU features are enabled. Necessary for an application to check before enabling any AVX usage.
Handle PCMPESTRM, PCMPISTRM and AVX variants
These are the remaining string instructions that FEX has implemented. While mostly implemented there are still a couple of edge case behaviours that aren’t quite correct and just need to be fixed.
Implement support for deferring asynchronous signals
This change has been a long time coming to make FEX’s JIT faster in the face of handling asynchronous signals. This issue is that FEX needs to enter code regions that are effectively “uninterruptible” until it is complete. This is basically a reentrancy problem where a piece of code executing could lock a mutex, or non-atomically updating a container’s data, then when a signal occurs it will jump out of the code and potentially come back to this corrupted state.
As an initial workaround to this problem, FEX would just disable all signals in each of these “signal-deferring regions.” This had the overhead that every region would have a system call going in to it and then another one coming out of it. If how frequently these regions happened was little then it would be a non-issue; but as is commonly the case, FEX’s JIT needs to be wrapped in this signal blocks. If a game is executing a bunch of code, this means we can be doing thousands of additional system calls per second which adds up as direct overhead.
With this new change, FEX marks that it is in a signal-deferring region with some very cheap memory accesses and if a signal doesn’t occur the overhead is negligible. In the case that a signal does occur, it will get stored to a queue, FEX will finish its signal-deferring region, and then come back to handle the signal.
This mostly works because asynchronous signals don’t have guarantees about the timeliness of the signal being delivered. Sadly this can result in signal queue depths being subtly incorrect but we are monitoring the situation to know if any game is affected. All in all this finally makes it so FEX can be straced without being overwhelmed and improves stutter problems!
Grand Theft Auto 5 AVX fix
FEX was accidentally reporting support for BMI1 and BMI2 CPU instructions. These extensions have a requirement that AVX must be implemented for these. This was causing Grand Theft Auto 5 to crash early trying to use AVX. We will now only report these extensions if emulated AVX is supported, which fixes this game.
Make vfork wait for the process to exit
FEX’s previous implementation of vfork actually behaved like fork. The difference between these two syscalls is fairly subtle. In the case of vfork, the parent process will end up sleeping until the child either exits or executes execve. We were instead treating this like a fork, where the parent continues immediately without waiting. While no known issues were encountered, it is good to ensure this behaviour is correct for future work.
Getdents optimization
This classic syscall is used for querying directory contents, FEX needs to emulate this syscall since 64-bit applications now use a new syscall called getdents64. FEX’s original implementation was fairly slow due to a misunderstanding as to how this syscall worked. It would create a temporary working buffer and copy data around a couple of times. With the new implementation it is able to use the buffer provided by the application and doing some minor fixups to make the overhead fairly light now. This improves performance when an application is doing heavy folder scanning, which mostly means it improves Proton startup time.
Minor optimizations
There were a handful of minor optimizations that improve performance so minorly that it falls within noise, but is nice to have.
Optimize ARM64 thunk trampolines
This is a very small optimization that changes an indirect load in to a PC relative load, removing a single data dependency.
Minor x87 FCMOV optimization
FEX was duplicating a mask from a GPR in to a vector register using two instruction and now it only uses one instruction.
Optimize ADC/ADD OF flag calculation
This was a small mistake where a bitwise negate was using two instructions instead of one.
Optimize EFLAG unpacking
Each time FEX was unpacking the EFLAG register it was using four instructions per bit of the flag. This has now been improved to only two. Cutting flag unpacking to 82% of its original size in some edge cases.
Supported emulated Linux kernel version up to 6.2
FEX used to max out the reported kernel version up to 5.18. Now we can report up to 6.2 with this change. 6.3 is going to be harder since it introduces a new prctl that FEX needs to work around.
Video game showcase
As said previously about Unity engine games having issues without TSO emulation. Here is a clip of Hollow Knight running under FEX full speed on a Lenovo X13s. Even with the overhead of emulating the x86-TSO memory model, this game runs remarkably well. With x86-TSO emulation disabled this game would have crashed a few seconds in.
See the 2306 Release Notes or the detailed change log in Github.