FEX 2305 Tagged!
Welcome back to another release of FEX-Emu! We cancelled last month’s release because a large amount of code churn was happening, and skipping it was the only way to keep the release stable. Now we’re back with an even lengthier release this month, so buckle up because a large number of changes went in.
More AVX Work!
These last two months have been a wild ride towards implementing AVX. @Lioncache has been burning down a ton of instructions to get everything in place for AVX emulation.
New instructions implemented
That’s a whole bunch of instructions implemented! We have now implemented nearly all the instructions required for AVX. The two major instructions remaining before AVX can be exposed are the SSE4.2 instructions VPCMPISTRI and VPCMPESTRM. These two instructions also have AVX versions, so implementing them is required in order to support AVX.
We are getting really close and once this feature is done, we can quickly move on to finishing support for AVX2, F16C, and the fused multiply-accumulate extensions. At that point our CPU emulation will be effectively “feature-complete” for everything that games will care about in the short-term. Exciting times!
llvm-mingw and WINE support
This is a very big change that has been coming down the pipe for a while now. We have been mostly working behind the scenes to get FEX-Emu wired up so that it can be compiled as a Windows shared library. This last month is where this work has finally come to a head and most of the work is in place for this.
How this works is that FEX-Emu compiles a shared library and a static library called FEXCore. This is where all the CPU emulation happens, and it tries to be mostly OS agnostic, while everything that is Linux specific lives in the frontend called FEXInterpreter. It is FEXCore that can now be compiled as a Windows AArch64 PE library. While this isn’t useful to end users today, it means that WINE can link to this library for emulating x86/x86-64 on AArch64 platforms. It should be noted that there are still some Linux assumptions strewn about the code, so this isn’t a generic solution for emulation on a true Windows platform. We’re writing this support specifically for WINE today.
Converting away from C++ containers that allocate memory
This is the significant change that caused us to cancel last month’s release. While @Neobrain was writing code to support 32-bit library thunking, they had discovered a very big problem. FEX-Emu has long overridden the glibc memory allocation routines in order for us to ensure that FEX can allocate memory when emulating 32-bit applications. We discovered that this overriding also extends to system libraries that we load in after the fact. This meant that any time libGL would allocate memory, it would end up being a 64-bit pointer and there was nothing we could do about it.
The workaround for this problem is to stop overriding the system allocators, which will allow shared libraries to allocate memory that can safely be used by the 32-bit guest. But this also has the problem that FEX would then run out of memory when executing 32-bit applications. This is due to a quirk that FEX-Emu needs to allocate all the memory on the system before executing 32-bit applications.
The new workaround is to replace usage of every C++ container that allocates memory with FEX’s own container that will use its own allocator. This was an exceedingly invasive change that touches almost everything in our codebase. With the pain done, FEX can now use its own internal allocators while system libraries use the regular glibc allocator as expected. See our documentation for more about the limitations of this.
Re-enable glibc allocator hooking
Okay, the previous paragraph was a ruse; FEX-Emu needed to actually override the glibc allocator again. In this case FEX-Emu will actually have three allocators active at any given moment.
- FEX-Emu uses jemalloc for its internal allocator.
- The system allocator is overridden with another jemalloc allocator.
- The guest application’s glibc allocator is untouched.
The problems start occurring when a pointer is shared between thunks and the guest application. If one allocator tries to free a pointer that came from a different allocator, then fireworks occur. The way around this is to use a jemalloc function to determine whether it owns the pointer, and then choose which allocator should free it. This is particularly painful with X11 thunking because pointers are passed between client and server in a very laissez-faire fashion. This may not stay around in the future but it is a necessary evil for now.
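To make the ownership check a bit more concrete, here is a tiny Python model of the idea. It is only a sketch under assumptions: `Arena`, `owns`, and `free_pointer` are hypothetical names made up for illustration, and a Python set stands in for whatever query jemalloc actually provides.

```python
class Arena:
    """Toy allocator that remembers which addresses it handed out."""

    def __init__(self, name, base):
        self.name = name
        self._next = base   # next address to hand out (simple bump allocation)
        self._live = set()  # addresses currently owned by this arena

    def malloc(self, size):
        addr = self._next
        self._next += max(size, 1)
        self._live.add(addr)
        return addr

    def owns(self, addr):
        return addr in self._live

    def free(self, addr):
        # Freeing a pointer from another allocator is the "fireworks" case.
        if addr not in self._live:
            raise ValueError(f"{self.name}: freeing foreign pointer {addr:#x}")
        self._live.remove(addr)


def free_pointer(addr, arenas):
    """Free addr through whichever arena owns it and return that arena's name."""
    for arena in arenas:
        if arena.owns(addr):
            arena.free(addr)
            return arena.name
    raise ValueError(f"no arena owns {addr:#x}")
```

Calling `free_pointer(ptr, [fex_arena, host_arena])` routes each pointer back to the allocator that created it, no matter which side of the thunk ends up releasing it.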
JIT Optimizations and improvements
Reclaim static assigned registers on 32-bit
This allows us to use 8 more general purpose registers and 8 more floating point registers with 32-bit applications. Depending on the game this can improve performance by a decent margin. We have seen upwards of 20% performance uplift in various games due to it.
Fix Visual C++ redistributable crashing
This was a really annoying bug, where every.single.time. that Proton would run, it would try to install the C++ runtime at least four times. The user would be required to kill the processes after they were installed. This was fairly egregious because we had thought it was fixed months ago and didn’t realize that it wasn’t actually fixed. Depending on the version of the Visual C++ redistributable and Proton it would still occur.
Root causing this issue revealed that the redistributable uses Windows’ structured exception handling to catch the case when it passes a null pointer to strlen, which results in a SIGSEGV on the Linux side. FEX was incorrectly saving and restoring state when this occurred, which caused it to loop infinitely and crash. Now that this is fixed, the redistributables install correctly and Proton doesn’t try doing it on every single run.
Implement REP MOVS as a memcpy
This instruction behaves like a fairly fast memory copy on the CPU. We now convert this over to an internal memory copy operation. Similar to last month, where we converted an instruction to a memset, implementing this instruction as an IR operation makes it many times faster. In real games this usually translates to an FPS improvement of a few percent, which is a nice uplift.
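For readers unfamiliar with the instruction, here is a rough Python model of what a forward REP MOVSB (direction flag clear) does, next to the same work done as one bulk copy. This is a sketch for illustration only: the function names and the flat `bytearray` standing in for memory are assumptions, and it says nothing about how FEX's IR operation is actually written.

```python
def rep_movsb(mem, dst, src, count):
    """Byte-by-byte model of REP MOVSB with the direction flag clear."""
    while count > 0:
        mem[dst] = mem[src]
        dst += 1
        src += 1
        count -= 1
    return dst, src, count  # final destination index, source index, counter


def rep_movsb_bulk(mem, dst, src, count):
    """Same result for non-overlapping ranges, done as a single copy."""
    mem[dst:dst + count] = mem[src:src + count]
    return dst + count, src + count, 0
```

The two versions agree when the source and destination ranges don't overlap. With an overlapping forward copy where the destination sits just after the source, the byte-by-byte version keeps re-reading bytes it has just written, so a faithful replacement has to account for that case.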
Fix restoring of AVX state
While AVX is not actually being utilized today (except due to a bug), @AndreRH found out that we were accidentally failing to restore AVX register state when a signal handler returned. It’s surprising that this wasn’t noticed earlier, but it could have resulted in some really bad floating point state.
Remove double syscall overhead on filesystem accesses
When an application accesses a file, FEX first needs to check whether that file exists in the overlayfs-style rootfs image we provide. If it does, we redirect the access so the file is opened from the rootfs instead of the host filesystem. We had an issue where, if the file didn’t exist in the rootfs, we would accidentally check for it a second time before accessing the host file. This meant that one syscall turned into three. With this fix in place we are now only turning it into two.
If you’re running a rootfs image off of a particularly slow drive (or a network share) then this can shave a decent amount of time off of load times. This was particularly noticeable when running a Proton game under Steam because they will access a ton of files before starting up.
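The syscall counts above can be reproduced with a small Python model. This is a sketch under assumptions: `CountingFS`, `open_redirected`, and the `redundant_check` switch are hypothetical names invented for illustration, and each fake `exists` or `open` call stands in for one syscall.

```python
class CountingFS:
    """Fake filesystem that counts how many lookups ("syscalls") it serves."""

    def __init__(self, files):
        self.files = set(files)
        self.calls = 0

    def exists(self, path):
        self.calls += 1
        return path in self.files

    def open(self, path):
        self.calls += 1
        if path not in self.files:
            raise FileNotFoundError(path)
        return path  # a real open would return a file descriptor


def open_redirected(fs, rootfs, path, redundant_check=False):
    """Open path from the rootfs if it exists there, otherwise from the host."""
    candidate = rootfs + path
    if fs.exists(candidate):
        return fs.open(candidate)
    if redundant_check:
        fs.exists(candidate)  # models the accidental second lookup
    return fs.open(path)
```

With `redundant_check=True`, a file that only exists on the host costs three calls. With the default it costs two, which matches the before and after described above.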
Adds default DRM ioctl interface
This is a fairly basic change. Instead of breaking when hitting an unknown ioctl, pass it to the kernel and hope for the best. This is mostly so Asahi and other drivers can test things under FEX without pushing patches to us for downstream support.
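The shape of this change is roughly a dispatch table with a pass-through default. The Python below is a sketch only: `dispatch_drm_ioctl`, `known_handlers`, and `forward_to_kernel` are hypothetical names for illustration, not FEX's real interface.

```python
def dispatch_drm_ioctl(cmd, arg, known_handlers, forward_to_kernel):
    """Use a dedicated handler when one exists, otherwise pass the call through."""
    handler = known_handlers.get(cmd)
    if handler is not None:
        return handler(arg)
    # Unknown ioctl: hand it to the kernel unchanged and hope for the best.
    return forward_to_kernel(cmd, arg)
```

The trade-off is that an unknown ioctl whose argument layout differs between guest and host is passed along without any translation, which is why this is aimed at driver testing.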
Add support for thunking Wayland
This doesn’t affect most users today, but adding support for thunking Wayland means that applications which use it will be able to sanely use this thunk in the future. SDL applications today might be able to take advantage of it, but it is fairly fresh. We’re looking forward to the inevitable Wayland and WINE utilization to let things move away from X11.
Fixed 32-bit clock_nanosleep
There was a fairly nasty implementation detail where a 32-bit application trying to sleep with this syscall would actually consume 100% of a CPU core. While this is fairly uncommon, the fix allows the game Alwa’s Awakening to not burn a CPU core while running.
Add a bunch of functions to FEX’s ARMEmitter
Not really a user facing feature but our code emitter has gained a bunch of new instruction support. This will be used in the future for our AVX2 implementation and various things. So it’s good to have.
Video game showcase
As a reward for getting through this extra long report, here is this month’s game showcase. This month is the surrealistic rhythm game Thumper. The game runs exceptionally well and is quite fun to play. I would definitely recommend checking it out; it even supports VR for an “immersive” experience.