FEX 2211 Tagged!
A lot of good changes this month for our users. Both performance and compatibility improvements to be had!
Segment register index optimization
This optimization has been a long time coming. Sitting in pull-request limbo since back in April. This is an optimization to cache segment register addresses so the JIT can more optimally generate memory accesses. While segment registers are mostly gone with x86-64, 32-bit segment registers are used fairly commonly with some instructions completely implicitly. This just adds overhead to fetch the LDT and GDT entries for something that typically doesn’t change very quickly.
With this optimization in place, we get an average of 4.3% uplift in 32-bit Bytemark. This performance improvement will be directly felt when running 32-bit applications.
48-bit Proton Experimental fixes
For a while now FEX has worked with Proton 7.0 and older, but we have had issues running Proton Experimental in some cases. This was a tricky problem to nail down but we had some good leads. If your ARM device was running its kernel with 48-bit Virtual address space (VA) enabled then Proton Experimental wouldn’t work. On the other-hand if your kernel is compiled using a 36-bit VA then it would run fine. After a few days of debugging, it turns out that Proton/Wine allocates the lowest 32MB of its stack space, and the kernel by default allocates a 128MB space for the application.
When an application is ran natively the stack is allocated at the fixed location in memory. FEX was failing to allocate the stack at the correct location. When Wine’s preloader eventually ran; FEX will have allocated JIT code at that fixed location, which Wine would then map over, zeroing the memory and breaking the FEX JIT. The preloader has done this for a long time and it was by pure chance that we weren’t breaking older versions of Wine and Proton.
With this problem fixed in FEX, we are now able to run triple-A games on AArch64. Just like the following images of God of War running on Snapdragon 888.
Even more IR changes preparing for AVX emulation
Once again this month we have a absolute ton of commits from Lioncash working on making our JIT be ready for AVX emulation. Around 25 commits working towards this, with only about four more IR vector operations to support AVX with.
Once the JITs support 256-bit operations, we can start working towards emulating the instructions themselves.
Fix thunk crashing due to insufficient stack space
When FEX starts we potentially need to allocate all memory inside of the 48-bit VA space to match how x86-64 only has 47-bits. This intersects with our stack space allocation which is supposed to autogrow, but we allocated it instead. Now we give the full 128MB stack space to FEX so it won’t crash anymore.
Implements support for remaining BCD instructions
Thanks to @wannacu for implementing the remaining handful of 32-bit BCD instructions. DAA, DAS, AAA, AAS, AAM, AAD were all missing in FEX’s implementation. While BCD is fairly uncommonly used these days, they still managed to find an application that uses these instructions. With these implemented, FEX should have all of the BCD instructions finally implemented.
Implement gpuvis timeline profiler support
While not majorly important for users, this is a very good interface for developers wanting to watch why a game has stuttered and for how long code took to compile. This lets us take advantage of the same interface that GPU profiling events are using to see why a game missed a vsync.
This isn’t enabled by default out of concern for taking too much CPU time, so it needs to be enabled with the ENABLE_FEXCORE_PROFILER cmake option.
Fix ROR OF flag calculation
This is a fairly minor bug since not many things rely on the OF flag specifically. But in our testing of new Proton games, we found out that Denuvo Anti-Tamper is relying on this edge case behaviour and we messed it up. While this gets Denuvo running slightly farther, it still doesn’t quite work under FEX.
Fixes FPREM1 C2 flag calculation
FPREM1 will return a flag if the number was too large to calculate in one step. Which is usually not the case. Since we are calculating the full remainder we will never set say we return a partial remainder. This solves an infinite loop in Mono applications that are using SIN/COS math operations.
Claim X87 transcenental ops are in range
X87 will set a flag if a program tries to operate on a value that is out of range for trancendental SIN/COS/TAN operations. FEX-Emu doesn’t actually detect these for performance reasons, so instead claim these are always in range. While not always true, if they are out of range then we weren’t detecting them anyway. Fixes an issue where glibc would do some fixups to try and bring the value in range, resulting in invalid results.
Add missing thunk library versions
This fixes an issue where FEX thunks would try to dlopen development libraries, which are missing on most user’s devices.
Fixes indirect thunks with 8+ arguments
This fixes a quite bad crash with OpenGL and Vulkan thunking where every function with 8 or more arguments would be likely to break. Fixes thunks for a bunch of games.
Add support for disabling thunks in application configurations
This is useful for narrowing down thunk compatibility issues in certain applications. While it is still not recommended to enable thunks globally, this allows more flexibility with tinkering with it
Implements four more auxv values
FEX implements most of these values for applications to pull but in some cases we didn’t have these setup. Specifically AT_PLATFORM is required so ldconfig can work correctly. AT_HWCAP/AT_HWCAP2 is used for an application to check for CPU features, and AT_RANDOM is a 128-bit random number that the kernel provides.
Quite a few more things that were changed this month, but this report has been going on long enough.
See the 2211 Release Notes or the detailed change log in Github.