FEX 2505 Tagged
Well, look who it is: another month and we have another release. This time around we have a smattering of fixes all over the place, including boring x87 quirks! Let’s get into it!
Fixes squashfs/erofs mounting with newer libfuse
Over the past couple of months we have gotten reports about erofs and squashfs rootfs mounting being broken. This was seemingly setup-specific, because we had trouble reproducing it on our end. This month we were finally able to reproduce the issue and track down what was going on. It looks like newer releases of the FUSE library have changed how they wait on child processes to exit. A weird interaction with how FEX launches and starts erofsfuse/squashfsfuse was breaking this behaviour. It was a trivial fix that just required someone to be able to reproduce it!
Memory leak fixes around thread creation and teardown
This month we encountered a game with a very peculiar behaviour. It creates a VLC video decoder object every tick in Unreal Engine. This VLC object in turn creates six threads, and then the game tears it down. This usually isn’t too bad, but when the game is doing it roughly 720 times a second it spirals pretty quickly. This apparently went unnoticed by the original game developers because thread creation is /relatively/ light on a real x86 system, but under FEX thread creation gets sticky due to opaque threading APIs and other overheads.
We initially noticed this as a massive memory leak; it turns out this behaviour is really good at finding memory leaks in FEX’s thread creation code. After a couple of days of bug squashing, we are no longer leaking memory on thread creation and teardown! This improves FEX’s memory usage in Steam and other applications that frequently spin up worker threads and shut them down.
Sadly, the in-game Bink videos in the Linux port of RUINER are broken even on x86 devices, so we recommend using the Proton version, which never had this problem anyway.
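The thread churn described above is easy to approximate in isolation. The following is a minimal sketch, not the game's code or FEX's test suite: `ChurnThreads` is a hypothetical helper that creates a batch of short-lived threads per "tick" and joins them, which is the kind of pattern that makes per-thread leaks show up quickly.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stress helper (not part of FEX): for each tick, create
// `threads_per_tick` short-lived threads and join them all before the next
// tick. Returns how many thread bodies actually ran.
std::size_t ChurnThreads(std::size_t ticks, std::size_t threads_per_tick) {
  std::atomic<std::size_t> ran{0};
  for (std::size_t t = 0; t < ticks; ++t) {
    std::vector<std::thread> workers;
    workers.reserve(threads_per_tick);
    for (std::size_t i = 0; i < threads_per_tick; ++i) {
      // Each worker does almost nothing; the cost is creation and teardown.
      workers.emplace_back([&ran] { ran.fetch_add(1, std::memory_order_relaxed); });
    }
    for (auto& w : workers) {
      w.join();
    }
  }
  return ran.load();
}
```

Calling something like `ChurnThreads(720, 6)` in a loop while watching the process's resident memory is one way to check whether thread setup and teardown leaks; steady growth over time would suggest a leak somewhere in that path.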
Fix a couple of syscalls
Recently there has been a new x86/x86-64 emulator for RISC-V devices that has been showing good progress. We have been comparing notes with their developer about certain implementations and they found some bugs in our 32-bit syscall emulation! It’s unknown if these actually fix any games, but a bug is a bug and it’s good to squash them even if we aren’t directly affected today. Thanks to the developer for finding these and fixing them!
Rewrite x87 transcendentals to use Cephes softfloat library directly
This one is fairly boring because it doesn’t affect most people, but it has some interesting implications. Before this change FEX would use the compiler-provided long double type to emulate x87’s transcendental operations. This includes atan2l, cosl, exp2l, log2l, sinl, and tanl, a very small number of functions. We originally implemented these transcendental operations this way because we knew that glibc implements them using 128-bit precision to match the precision that long double provides on AArch64 Linux. This is an okay assumption to make initially, until we start running in different environments. It turns out that if you’re running on a musl system, these transcendentals may only be implemented using 64-bit precision, resulting in less precision than x87 needs. Or if you’re running on a WIN32/Wine implementation, long double is only defined to be 64-bit instead of 128-bit! There are a couple of other situations in which we can’t control what the long double implementation will do.
In order to ensure consistency in FEX’s emulation, we have now dropped usage of long double in the x87 transcendental operations and use the Cephes softfloat library directly. This ensures that we are always operating at the full 128-bit precision that we were previously expecting and unifies the results that we get. In the future we may even go down to the Cephes 80-bit softfloat library, but we first need to ensure that its implementation actually provides enough precision. We might even get a small speed boost from that!
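To see why long double is a shaky foundation, it helps to look at what the type actually is on a given toolchain. The sketch below is only an illustration with hypothetical helper names; it is not code from FEX. It reports the mantissa width the compiler gives long double. Note that this only catches the case where the type itself is narrow; it says nothing about how precisely the C library's `sinl` and friends are implemented, which is the separate musl concern mentioned above.

```cpp
#include <limits>
#include <string>

// Hypothetical helper: describe long double by its mantissa width in bits.
// x87 extended precision has a 64-bit mantissa; IEEE binary128 has 113 bits;
// plain binary64 (double) has 53 bits.
std::string DescribeLongDouble() {
  constexpr int digits = std::numeric_limits<long double>::digits;
  if (digits == 113) return "binary128 (113-bit mantissa)";
  if (digits == 64) return "x87 extended (64-bit mantissa)";
  if (digits == 53) return "binary64 (53-bit mantissa)";
  return "other (" + std::to_string(digits) + "-bit mantissa)";
}

// True when the type is at least as wide as the x87 mantissa. This is a
// necessary condition for emulating x87 with long double, not a sufficient one.
constexpr bool LongDoubleCoversX87() {
  return std::numeric_limits<long double>::digits >= 64;
}
```

On a toolchain where long double is just a 64-bit double, `LongDoubleCoversX87()` is false, and no libm implementation could recover the missing bits. Using a softfloat library sidesteps the question entirely.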
Move Interpreter fallback ABI dispatchers to unified location
Speaking of x87, when we hit an instruction that we can’t emulate directly in the JIT, we do “interpreter” fallbacks out to C code. This isn’t actually an interpreter; it is a legacy name from when we had an IR interpreter. Anyway, we were emitting a LOT of redundant ABI handling code whenever we needed to leave the JIT for this handful of x87 operations. This code does roughly the same thing for every thread and every operation, resulting in a ton of wasted memory and the icache getting hammered. We now emit the ABI handlers once and share them between all the threads. This actually improves performance even though there is an additional jump to get out of the JIT. So hopefully some x87 games get a minor performance improvement!
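A back-of-the-envelope model shows why sharing the handlers saves so much code memory. Everything in this sketch is hypothetical: the struct, the function names, and especially the byte sizes are placeholders for illustration, not measurements from FEX.

```cpp
#include <cstddef>

// Hypothetical sizes, chosen only to illustrate the trade-off.
struct EmitSizes {
  std::size_t abi_handler_bytes;  // register spill/restore plus the call out to C
  std::size_t jump_bytes;         // a branch from JIT code to a shared handler
};

// Old scheme: every fallback site in every thread carries its own full copy
// of the ABI handling code.
constexpr std::size_t InlineBytes(EmitSizes s, std::size_t threads,
                                  std::size_t sites_per_thread) {
  return threads * sites_per_thread * s.abi_handler_bytes;
}

// New scheme: one copy per kind of handler, shared by all threads, plus a
// small jump at every fallback site.
constexpr std::size_t SharedBytes(EmitSizes s, std::size_t threads,
                                  std::size_t sites_per_thread,
                                  std::size_t handler_kinds) {
  return handler_kinds * s.abi_handler_bytes +
         threads * sites_per_thread * s.jump_bytes;
}
```

With the made-up values `EmitSizes{128, 4}`, 8 threads, 1000 fallback sites per thread, and 16 handler kinds, the inline scheme comes to 1,024,000 bytes while the shared scheme comes to 34,048 bytes. The real numbers will differ, but the shape of the saving is the same: the large handler cost stops scaling with threads and sites.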
More multiblock improvements
“Multiblock” is a JIT technique that lets us emit larger functions of code at any given moment. One of the problems this can cause is that we do more analysis and code discovery, which can hit invalid code. A bunch of little fixes have landed to make FEX more tolerant of hitting invalid code so that it can continue JITing. This improves the robustness of this feature so it is less likely to break a game!
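The general idea of tolerating invalid code during discovery can be sketched with a toy example. This is not FEX's frontend: the two-byte instruction set (`NOP`, `JMP`, `JCC`, `RET`), the `Discovery` struct, and `DiscoverBlocks` are all invented for illustration. The point is only that a block which decodes to garbage gets recorded and dropped, while the other discovered blocks are kept.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Toy ISA (hypothetical): every instruction is two bytes, opcode then operand.
enum ToyOp : uint8_t { NOP = 0x01, JMP = 0x02, JCC = 0x03, RET = 0x04 };

struct Discovery {
  std::set<std::size_t> valid_blocks;    // block entry offsets that decoded cleanly
  std::set<std::size_t> invalid_blocks;  // entries that ran into undecodable bytes
};

// Worklist-based block discovery that keeps going when one path is invalid.
Discovery DiscoverBlocks(const std::vector<uint8_t>& code, std::size_t entry) {
  Discovery out;
  std::set<std::size_t> seen;
  std::vector<std::size_t> worklist{entry};
  while (!worklist.empty()) {
    const std::size_t start = worklist.back();
    worklist.pop_back();
    if (!seen.insert(start).second) continue;  // already visited

    std::vector<std::size_t> successors;
    bool valid = true;
    bool done = false;
    std::size_t pc = start;
    while (!done) {
      if (pc + 1 >= code.size()) {  // ran off the end of the buffer
        valid = false;
        break;
      }
      const uint8_t op = code[pc];
      const uint8_t arg = code[pc + 1];
      pc += 2;
      switch (op) {
        case NOP: break;
        case JMP: successors.push_back(arg); done = true; break;
        case JCC:  // conditional: both the target and the fallthrough are entries
          successors.push_back(arg);
          successors.push_back(pc);
          done = true;
          break;
        case RET: done = true; break;
        default: valid = false; done = true; break;  // unknown opcode
      }
    }
    if (valid) {
      out.valid_blocks.insert(start);
      for (std::size_t s : successors) worklist.push_back(s);
    } else {
      out.invalid_blocks.insert(start);  // drop this block, keep the others
    }
  }
  return out;
}
```

For the byte sequence `{JCC, 4, 0xFF, 0, RET, 0}` starting at offset 0, the fallthrough path at offset 2 hits an unknown opcode and is marked invalid, while the blocks at offsets 0 and 4 are still discovered. A real JIT has to deal with far more than this toy does, but the containment principle is the same.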
Misc cleanups and fixes for WIN32/WINE
Not much to say here other than we are still hammering on our WINE arm64ec support. It’s relatively stable at this point and these just keep improving what’s already working pretty well!
See the 2505 Release Notes or the detailed change log on GitHub.