FEX 2406 Tagged!
A little late this month but we have a new FEX release has finally landed. This month we have some good optimization and fixes so let’s get right in to it.
A bunch of JIT optimizations
This last month is finally the culmination of preparation work over the past few months of cleanups in the FEX JIT. The new register allocator has landed in FEX which is significantly better than our previous RA. Our prior implementation was meant to be a temporary solution when FEX initially started as a project and as with most temporary code, it became permanent. It was excessively slow, best case it ran in quadratic time, worst case it could take INFINITE time which resulted in significant stutters or hangs. This new implementation by Alyssa now runs in two passes in linear time, significantly improving performance and also removing a ton of bad design decisions from the first implementation.
In addition to the new RA, we also have a bunch of little optimizations spread around that improves performance all over the place. One of the bigger performance improvements for people with new hardware is enabling the AFP extension and RPRES if supported. Apple supports these in their latest SoC and the newer Cortex also supports them. This improves scalar SSE performance by quite a bit. We won’t dive in to these too much but the various optimizations can improve performance from 2% to 12% in testing. We’re marching ever closer to running applications at near native speeds now.
Add support for 32-bit OpenGL thunking
This is a big feature! 32-bit thunking has been a long time coming and has crossed some significant hurdles towards actually working! One of the biggest CPU time sinks with games is the amount of time we need to spend in the video driver when running a game. “Thunking” allows us to remove that overhead and jump directly in to the AArch64 libGL directly and remove a bunch of emulation overhead. We have done a bunch of testing with this but we expect there will still be some bugs that need to be worked out. As for fun performance improvements, we have seen one game go from 150FPS up to 270FPS, so it’s worth trying in some cases.
As a note though, this is only 32-bit OpenGL thunking. 32-bit Vulkan drivers still need to go down the emulation path, so things like DXVK in older 32-bit games won’t get these performance improvements.
Default TSO emulation options changed
Over the course of the past couple months we have been testing the new TSO memory model emulation toggles and during this time we have determined the cost of accurately emulating Vector and Memory copy memory atomics to be too high for most hardware. The good news is that from all the testing we have done, this doesn’t actually cause any problems in any known games. So from this release onward we are by default disabling TSO emulation on these operations. We may come back and visit this once hardware ships that has FEAT_LRCPC3 which adds new instructions for Vector TSO loadstores.
Users with an older configuration can go in to FEXConfig to toggle these options off and enjoy the free speed benefits of not doing accurate emulation today! As a note, Apple Silicon’s TSO hardware emulation bit doesn’t suffer the same performance degradation so once Asahi Linux supports this for users then they get accurate emulation and speed!
Unaligned Half-barrier TSO Enabled is still recommended to keep enabled as that can cause significant bugs
Fix fstatat/statx with NOFOLLOW And JIT bugs
During a livestream one of our users encountered a bug in FEX-Emu that was breaking Darwinia. After diving in to the game to figure out what it was doing, it actually turned out to be three separate bugs that broke the game. The first bug fix with fstatat and statx syscalls were around edge case behaviour with the NOFOLLOW flag. The game was attempting to find the directory that the executable was living in and being smart in a way that broke FEX.
The other bugs were behaviour in our optimization passes where we broke x86 SIB addressing in a couple ways. We have since added unittests for these two bugs but if you would like to read more you can check out ConstProp fixes for Darwinia
With these bugs fixed the game now runs correctly under FEX-Emu without issue!
More ARM64EC improvements
This was some cleanup work for helping more easily integrating with what upstream WINE is doing for ARM64EC support. While still not entirely usable for end-users yet, it is steadily improving and can run real games if the environment is setup correctly. A lot of good work here and we’re hoping for more testing going forward.
NVIDIA Orin CPU errata!
Over the past month or two we had noticed that the NVIDIA Orin platform with its Cortex-A78AE CPU cores were running games markedly worse than our Snapdragon 8cx Gen 3 platform with Cortex-A78C cores. While these CPU cores are not identical between platforms, they are both based on the Cortex-A78 CPU core design so they should be relatively close. The NVIDIA Orin runs its cores at a 2.2Ghz clock frequency, while the Snapdragon runs its cores at 2.4Ghz. Nearly a 9% clock speed difference wasn’t accounting for the performance delta we were seeing!
The game we were testing was the PC port of Sonic Adventure 2: Battle; On Orin the board could only achieve 18FPS, while on Snapdragon we were easily hitting 60FPS with headroom to go higher if VSync was disabled. We were stunned by this absolute performance difference and couldn’t nail down the difference being due to different drivers.
Turns out we only needed to look at the Cortex-A78AE Software Developer Errata Notice to find out why.
1951502
Atomic instructions with acquire semantics might not be ordered with respect to older stores with release semanticsUnder certain conditions, atomic instructions with acquire semantics might not be ordered with respect to older instructions with release semantics. The older instruction could either be a store or store atomic.
This erratum can be avoided by inserting a DMB ST before acquire atomic instructions without release semantics. This can be implemented through execution of the following code at EL3 as soon as possible after boot:
This then goes on to talking about some code that programs the CPU so that it injects these DMB ST instructions before atomic acquires automatically! This is why this platform has been so weird for performance testing for years! This massive hardware errata basically deletes any advantage that the FEAT_LRCPC extension gives FEX and goes back to emulating atomics using half-barriers similar to how FEX already does it around unaligned atomics!
We are now looking to move off of this NVIDIA Orin platform as quickly as possible, it was already old and now that we have identified a significant problem around atomic performance it is higher priority. Luckily over the last few months we have great new hardware announcements. The Snapdragon X Elite devices are shipping soon, NVIDIA has announced a new Jetson AGX Thor platform, The NVIDIA Grace server platform is starting to become available, and Apple has some new M4 devices that will be interesting! Ideally we will get a new platform that we can plug a Radeon GPU in since it is a huge boon to our testing performance, but depending we may not have that luxury. We’ll see as we move on to new and better platforms!
Video game showcase
Instead of a video showcase from FEX this month, go checkout Asahi Lina’s Youtube Page. She recently did a couple of live streams fixing issues with the Asahi Linux MicroVM solution for running FEX-Emu on on Apple Silicon! She showcases a bunch of games while covering some of the more technical problems involved with getting FEX-Emu running on that platform.
Be warned, these are very long streams.
See the 2406 Release Notes or the detailed change log in Github.