FEX 2403 Tagged!

Welcome back to another new tagged version of FEX-Emu! This month we have quite a few important bug fixes and optimizations, so let’s get right into it!

Steam fix

As of Steam’s February 27th update there was a fairly major change to how Steam starts its embedded Chromium instance. With this most recent change it now runs inside of the Steam Linux Runtime environment. In turn Steam has disabled the sandbox feature of the Chromium instance because it is incompatible. FEX was already disabling this sandbox and forcibly passing in the argument to disable it.

Chromium really didn’t like the argument being passed in twice, and it crashed early as a result. We have now removed our application profile and let Steam configure the arguments as required.

As a side effect of Steam updating their version of Chromium, some users have noted that they are experiencing problems with GPU acceleration on Raspberry Pi systems. This is seemingly a video driver problem and unrelated to this crash that was fixed. It is currently unknown if we can fix this problem, as it is working fine on Tegra and Snapdragon systems.

Rootfs images updated

FEX’s rootfs images have been updated to include the latest versions of Mesa, gfxreconstruct, and Renderdoc. The major change here is having Mesa updated to 24.0 as the other two packages are mostly for developers.

Fix a potential hang on forking with memory allocations

We have fixed a known hang that occurs when a process is forking while another thread is allocating memory. This tended to show up when running Proton applications. While this fixes one hang, we still have another one that sporadically happens that we haven’t tracked down. While the occurrence is relatively rare, it’s good to watch the process trees if the program is stuck waiting on a futex.
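
To illustrate the general class of problem (not FEX’s actual fix, which lives in its C++ code), here is a minimal Python sketch. It assumes a single allocator-style lock, here the hypothetical `alloc_lock`, and uses `os.register_at_fork` to hold that lock across the fork so the child never inherits a lock owned by a thread that doesn’t exist on its side. Every name in it is made up for the example.

```python
import os
import threading
import time

# Hypothetical stand-in for an allocator's internal mutex.
alloc_lock = threading.Lock()


def allocate_in_background(stop: threading.Event) -> None:
    """Simulate another thread that keeps taking the allocator lock."""
    while not stop.is_set():
        with alloc_lock:
            pass
        time.sleep(0.001)


# Take the lock right before fork() and release it on both sides afterwards.
# Without this, the child could be created while the background thread holds
# the lock, and nothing in the child would ever release it.
os.register_at_fork(
    before=alloc_lock.acquire,
    after_in_parent=alloc_lock.release,
    after_in_child=alloc_lock.release,
)


def fork_while_allocating() -> int:
    """Fork while a second thread uses the lock; return the child's exit code."""
    stop = threading.Event()
    worker = threading.Thread(target=allocate_in_background, args=(stop,))
    worker.start()
    try:
        pid = os.fork()
        if pid == 0:
            # Child: only the forking thread exists here.
            usable = alloc_lock.acquire(timeout=2)
            os._exit(0 if usable else 1)
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status)
    finally:
        stop.set()
        worker.join()
```

Calling `fork_while_allocating()` returns the child’s exit code, which is 0 when the child was able to take the lock. A child that is stuck on a lock like this is the kind of thing that shows up as a process waiting on a futex.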

A bunch of CPU optimizations

As per usual, this last month has added a bunch of CPU optimizations! We have noted up to a 14% performance improvement in one benchmark and an average of around 4% in Geekbench. We need to commend our developers for hammering out these optimizations; even a small optimization can have a big impact on games that abuse a particular feature.

  • Use FlagM SETF8/16 for INC/DEC
  • Optimize LOCK DEC
  • Optimize ADC/SBC
  • Fuse add+cmn into adds
  • Misc other optimizations
  • Optimize less than 32-bit add and sub

Small timestamp counter scaling

Recently we found out that some games rely on an x86 CPU’s RDTSC instruction operating at GHz frequencies. While making this assumption is not a good idea, it is relatively common for x86 CPUs to have a really high frequency cycle counter. Most laptop CPUs operate in the 1-2 GHz range, while desktop CPUs can go up to 3 GHz in our testing.

Unreal Engine 5 has a new work graph system that spin-loops CPUs for a fixed number of cycles, expecting to not spin for very long. This is relatively okay at 1 GHz, since it is only a few nanoseconds, but when an ARM CPU’s cycle counter gets added to the mix it starts encountering problems.

The primary problem here is that all 64-bit Snapdragon processors ship with a fixed rate 19.2 MHz cycle counter. This continues all the way to their latest flagship, the Snapdragon 8 Gen 3. Additionally, other ARM devices we have tested, like the Nvidia Jetson ARM boards and Apple M1, also ship a similarly low frequency cycle counter. So while Unreal Engine will only spin-loop for a measly 1597 cycles, on an ARM board this takes ~51,000 nanoseconds but on an x86 PC it only takes 591 nanoseconds! This was causing games to burn CPU time unnecessarily and run slower than they should!

To compensate for these slow cycle counters on most ARM devices, we are now scaling the value we return to applications by multiplying it by 128! This makes Snapdragon cycle counters behave more like a 2457 MHz cycle counter, but with a 128 cycle granularity. This improves the FPS in Tekken 8 and will also improve performance in all other Unreal Engine 5 games. There may be other games affected as well!
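
The arithmetic behind that scaling can be sketched in a few lines of Python. The 19.2 MHz frequency, the 128 multiplier and the 1597 cycle spin come from the text above; the function names are made up for this example and are not FEX’s API.

```python
import math

HOST_COUNTER_HZ = 19_200_000  # Snapdragon cycle counter frequency, from the text
TSC_SCALE = 128               # multiplier applied to the value, from the text


def scaled_tsc(host_ticks: int, scale: int = TSC_SCALE) -> int:
    """Value reported to the guest for a given raw host counter value."""
    return host_ticks * scale


def effective_frequency_hz(host_hz: int, scale: int = TSC_SCALE) -> int:
    """Frequency the guest appears to observe after scaling."""
    return host_hz * scale


def host_ticks_for_spin(guest_cycles: int, scale: int = TSC_SCALE) -> int:
    """Host ticks that must elapse before the guest sees at least guest_cycles.

    The reported value only moves in steps of `scale`, so round up.
    """
    return math.ceil(guest_cycles / scale)


print(effective_frequency_hz(HOST_COUNTER_HZ))  # 2457600000, i.e. ~2457 MHz
print(host_ticks_for_spin(1597, scale=1))       # 1597 host ticks without scaling
print(host_ticks_for_spin(1597))                # 13 host ticks with scaling
```

The reported counter still only advances once per host tick, which is where the 128 cycle granularity comes from: the guest sees the value jump by 128 each time.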

As a side note, a 1 GHz cycle counter is now mandated by the ARMv8.6 and ARMv9.1 specs! So this problem will soon go away as new SoCs get onto the market.

Introduce ARM64EC static register allocation mapping

As part of the ongoing effort to support WINE’s Arm64EC code, we have changed the order in which our registers get allocated to more closely match what the Arm64EC ABI wants for register layout. Matching what Arm64EC wants for the register layout means that when the JIT jumps out to some code, we shuffle less data around, which gives a performance improvement. The Linux side of the code doesn’t need this, so this only happens when building as a WINE module.
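
As a rough illustration of why a matching layout helps, the Python sketch below counts how many register moves a transition would need under two different static mappings. Both mappings are illustrative assumptions covering only four guest registers; they are not copied from FEX or from the Arm64EC ABI documentation.

```python
from typing import Dict

# Assumed target layout expected by the code we jump out to (illustrative only).
ABI_LAYOUT: Dict[str, str] = {"rax": "x8", "rcx": "x0", "rdx": "x1", "r8": "x2"}

# Hypothetical JIT mapping that ignores the target layout.
UNALIGNED_JIT_LAYOUT: Dict[str, str] = {"rax": "x4", "rcx": "x5", "rdx": "x6", "r8": "x7"}

# Hypothetical JIT mapping chosen to match the target layout.
ALIGNED_JIT_LAYOUT: Dict[str, str] = dict(ABI_LAYOUT)


def moves_needed(jit_layout: Dict[str, str], abi_layout: Dict[str, str]) -> int:
    """Count guest registers that live in a different host register than the ABI wants."""
    return sum(
        1
        for guest_reg, host_reg in jit_layout.items()
        if abi_layout.get(guest_reg, host_reg) != host_reg
    )


print(moves_needed(UNALIGNED_JIT_LAYOUT, ABI_LAYOUT))  # 4
print(moves_needed(ALIGNED_JIT_LAYOUT, ABI_LAYOUT))    # 0
```

Every move that disappears is work the JIT no longer has to do on each jump out of translated code.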

32-bit thunking improvements

This last month brought an exciting milestone for 32-bit thunking! We have landed support for thunking Wayland on 32-bit, which means that with our previously implemented Vulkan thunking, we can now run some games using Wayland plus Vulkan, with Zink thrown into the mix! In particular we have been able to test that Super Meat Boy works in this configuration! We still have more work to do before X11 and GLX work with 32-bit thunking, so stay tuned!

Memory leak fix

FEX had an issue with long running processes leaking memory. This showed up in applications that would start hundreds of threads and tear them down over and over. Steam is one of these long running processes that would starve the system of memory if left open overnight. This is because the program spins up helper threads fairly aggressively and then shuts them down.

We have fixed one major memory leak but we still have a few more to go before it’s nailed down!

Syscall passthrough optimization

One important thing to be wary of when running games is syscall overhead. Every time an ioctl or other syscall is made, FEX can incur significant overhead compared to running the application natively. Additionally if we are passing syscalls through to glibc helpers then this can add more overhead and sometimes introduce bad behaviour.

This month we spent some time looking at how syscalls are handled when we know that we can pass the data directly to the kernel. This allows us to more quickly add new system calls when the kernel adds them, and ensures they are as fast as possible. With this optimization in place, FEX now directly emits small syscall handlers per syscall and jumps directly to the kernel if possible. This lowers CPU overhead for the most common syscalls, thus removing emulation overhead. While FEX’s syscalls were already fairly low overhead, this just improves the situation further!
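
The idea of one small handler per syscall can be sketched conceptually in Python. FEX itself emits machine code from its JIT, so this is only a model of the dispatch shape; `fake_host_syscall`, `emulated_example`, the syscall numbers and every other name here are hypothetical.

```python
from typing import Callable, Dict, Iterable

Handler = Callable[..., int]


def make_passthrough(number: int, host_syscall: Handler) -> Handler:
    """Build one tiny handler that forwards a single syscall number unchanged."""
    def handler(*args: int) -> int:
        return host_syscall(number, *args)
    return handler


def build_table(
    passthrough: Iterable[int],
    emulated: Dict[int, Handler],
    host_syscall: Handler,
) -> Dict[int, Handler]:
    """Per-syscall dispatch table: generated passthroughs plus hand-written handlers."""
    table = {number: make_passthrough(number, host_syscall) for number in passthrough}
    table.update(emulated)  # syscalls that need real emulation keep their own handler
    return table


def fake_host_syscall(number: int, *args: int) -> int:
    """Stand-in for the kernel: returns the syscall number plus the sum of its arguments."""
    return number + sum(args)


def emulated_example(*args: int) -> int:
    """Stand-in for a syscall that cannot be passed straight through."""
    return -1


table = build_table([1, 2], {3: emulated_example}, fake_host_syscall)
print(table[1](10, 20))  # 31, forwarded directly to the stand-in kernel
print(table[3](5))       # -1, handled by the emulated path
```

In this model, supporting a new passthrough syscall only means adding its number to the list, which mirrors the point above about being able to add new system calls quickly.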

Video game showcase

See the 2403 Release Notes or the detailed change log on GitHub.

Written on March 4, 2024