FEX 2310 Tagged!
Welcome back to another monthly release for FEX-Emu. You might be thinking that after last month’s optimizations we wouldn’t have much to show this month. Well, you would be wrong! We optimized even more! Let’s get into it!
More instruction optimizations!
As stated last month, we introduced Instruction Count CI, which has allowed us to do targeted optimizations of our code. Once again we have optimized so many instructions that it would be impossible to go through each individual change. Check our detailed change log if you want to see all the instructions optimized. Let’s just look at the final benchmark numbers compared to last month.
Let’s talk about the Geekbench 5.4 results first, since they don’t look very impressive at first glance. While we are only showing a ~13% performance improvement, this number is an aggregate of multiple smaller benchmarks. Looking at the breakdown of all the subtests, some have improved by up to 66%! This is of course because some benchmarks lean on instructions that we optimized more heavily than others. Luckily this improvement also carries over to video games.
The Bytemark improvements are a bit hard to make out: some numbers have hardly changed at all, while a couple stand out as huge improvements. This mostly comes down to some very specific instruction optimizations that significantly improved performance in a couple of tests, while the rest don’t benefit as much.
With this month’s optimizations and last month’s combined, the results end up being significantly more interesting. Some Geekbench results are showing an average of 50% to 65% higher performance, sometimes even higher. Some benchmark results show nearly 2x the performance compared to before! These numbers translate very well to gaming performance, where some games have more than doubled their FPS over the past couple of months.
We’re not slowing down either. We still have a ton of optimizations to go on our march to get our emulation close to native performance.
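To see how a 66% gain in one subtest can shrink to a low-teens aggregate, here is a small sketch. It assumes the aggregate behaves like a geometric mean of per-subtest speedups, which may not match Geekbench's exact scoring, and the ratios in `ExampleRatios` are made-up illustrative values, not measured FEX results.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of per-subtest speedup ratios (new score / old score).
double GeometricMean(const std::vector<double>& ratios) {
  assert(!ratios.empty());
  double log_sum = 0.0;
  for (double r : ratios) {
    assert(r > 0.0);  // ratios must be positive for the logarithm
    log_sum += std::log(r);
  }
  return std::exp(log_sum / static_cast<double>(ratios.size()));
}

// Hypothetical subtest ratios: one big winner and several near-unchanged tests.
std::vector<double> ExampleRatios() {
  return {1.66, 1.05, 1.02, 1.10, 1.00, 1.08};
}
```

With these illustrative inputs, `GeometricMean(ExampleRatios())` comes out to roughly 1.13, even though one subtest improved by 66%.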
Support preserve_all for interpreter fallbacks
We’re calling out this particular optimization for three reasons.
- It improves performance of x87 heavy code
- It only works with the super recently released Clang 17
- Wine packages in FEX’s rootfs use x87 heavily in some instances
Let’s talk about what this optimization is and how it improves performance. Clang 17 added support for a new function calling ABI called preserve_all. x86 has supported this ABI for a very long time, but it is a new addition for Arm64. This ABI breaks from the regular AAPCS64 convention: if a function using it needs more registers, it must first save pretty much any register it wants to touch, unlike AAPCS64, where a function has a bunch of registers free to use. This is beneficial for FEX’s JIT, since we can save significant time by not saving any state when we need to jump out of the JIT and execute x87 softfloat code.
In particular, this manifests as upwards of a 200% performance improvement in some microbenchmarks around x87 code! While this advantage is quite significant, the only way to take advantage of it is to compile FEX with Clang 17. Since this compiler release came out only last month, pretty much no distros have adopted it, so it is unlikely to be used soon. In a few months’ time, or years depending on the distro, distros should naturally upgrade their compiler stack and the performance improvement will come for free.
As a fairly major side note to this excursion, FEX has found that the 32-bit Wine packages compiled in Canonical’s repository use x87 heavily in some instances. This causes some really bad performance issues with some 32-bit games and installers. Where you can, it is recommended to use Proton instead, since it compiles its 32-bit libraries with SSE optimizations, which work significantly better.
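As a rough sketch of what opting in to this ABI looks like in source, the snippet below marks a helper with Clang's `preserve_all` attribute when the compiler reports support for it, and falls back to the default calling convention otherwise. The macro and function names are hypothetical and are not FEX's actual fallback code; the helper just adds two integers as a stand-in for a softfloat routine.

```cpp
#include <cassert>
#include <cstdint>

// Use preserve_all only when the compiler knows the attribute.
#if defined(__has_attribute)
#  if __has_attribute(preserve_all)
#    define SKETCH_PRESERVE_ALL __attribute__((preserve_all))
#  endif
#endif
#ifndef SKETCH_PRESERVE_ALL
#  define SKETCH_PRESERVE_ALL  // default ABI: the caller must assume more registers are clobbered
#endif

// Hypothetical stand-in for a softfloat fallback. With preserve_all, the
// callee is responsible for saving nearly every register it uses.
SKETCH_PRESERVE_ALL uint64_t HypotheticalFallbackAdd(uint64_t a, uint64_t b) {
  return a + b;
}

// Hypothetical hot path: because the callee preserves registers, the caller
// does not need to spill its live state around each call.
uint64_t CallFromHotLoop(uint64_t iterations) {
  uint64_t acc = 0;
  for (uint64_t i = 0; i < iterations; ++i) {
    acc = HypotheticalFallbackAdd(acc, i);
  }
  return acc;
}
```

The trade-off is that the cost moves into the callee, which pays off when the caller has a lot of live state and the callee touches few registers.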
FEX-Emu may look to provide its own wine packages in the future with this same optimization in place to help alleviate some of this burden. Until then it is recommended to use FEX’s x87 reduced precision mode to try and alleviate some of the overhead.
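To give a feel for the accuracy trade-off, here is a minimal sketch. It assumes the reduced precision mode evaluates x87 operations with 64-bit doubles instead of the 80-bit extended format (the post does not spell this out), and the function name is hypothetical. A double has a 53-bit significand, while x87 extended precision has 64 bits, so very small terms that would survive an x87 addition are rounded away in a double.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Significand width of a 64-bit IEEE-754 double (53, counting the implicit bit).
constexpr int kDoubleSignificandBits = std::numeric_limits<double>::digits;
// Significand width of the x87 80-bit extended format, stated as a constant
// because long double is not the 80-bit format on every host.
constexpr int kX87SignificandBits = 64;

// Returns true if adding 2^-exponent to 1.0 changes the result when the sum
// is stored as a 64-bit double. volatile forces each value to be rounded to
// double rather than kept in a wider register.
bool SmallTermSurvivesInDouble(int exponent) {
  assert(exponent >= 0 && exponent < 1000);
  volatile double one = 1.0;
  volatile double tiny = std::ldexp(1.0, -exponent);
  volatile double sum = one + tiny;
  return sum != 1.0;
}
```

For example, 1 + 2^-60 fits in a 64-bit significand but not in a 53-bit one, so `SmallTermSurvivesInDouble(60)` is false while `SmallTermSurvivesInDouble(52)` is true.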
Fixes a bug when chrooting into rootfs
Quite a few months ago, FEX-Emu changed some behaviour around chrooting into the FEX rootfs. While chrooting isn’t generally advised, if a user wants to modify the rootfs then it’s the only option. We provide some scripts inside of our rootfs images to facilitate this, but they have been broken for a few months.
We have now fixed this bug in both FEX-Emu and the scripts inside of our rootfs images. So if you want to modify packages inside of the image you will now be able to do so again. Make sure to update your image to get the new scripts!
Remove x86-64 JIT and Interpreter
This has been a long time coming in the FEX-Emu project. We have had support for an IR interpreter and x86-64 host JIT for compatibility testing since the project’s inception. It has always been the case that if these CPU backends got in the way of the ARM64 JIT, they would be removed.
That time has finally come. Due to some upcoming changes around how flags are represented in FEX’s JIT, and the general burden of implementing FEX’s IR operations three times (often undoing an x86->Arm64 translation to go back to x86), maintaining these backends has been deemed too much of a burden and they have been removed. This is a necessary step for our ARM64 JIT to gain more performance as we continue working to make it better.
We are looking forward to future ARM platforms that can take Radeon GPUs through PCIe slots to regain a platform which can test RADV directly, but until that point we will have to make do with our current devices.
Instruction Count CI on x86-64 hosts
While we removed our x86-64 JIT, we do have a fun addition to our instruction count CI. Developers who don’t have an Arm64 device handy can now still run the Instruction Count CI and attempt to optimize implementations. This is as simple as building FEX on an x86-64 device with the Vixl disassembler and simulator enabled, and you will be able to optimize to your heart’s content!
We’ve got a need for JIT speed! Let’s go fast!
Implement first optimizations using 128-bit SVE
This is a fairly minor change but previously FEX was not using any 128-bit SVE instructions. This is primarily because there aren’t really any SVE supporting devices in the consumer market, even though Snapdragon hardware theoretically supports it. 128-bit SVE adds a couple of optimizations that we can use.
- Wide-element shifts
- Index instruction for generating simple index masks
While these are fairly simple initially, they take some translations from six instructions down to one or two, depending on the case. It is a small step, but it is good to note that FEX is now taking advantage of SVE if it is available!
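For readers unfamiliar with these two SVE operations, here is a scalar model of what they compute on a 128-bit vector of 8-bit lanes. This is only an illustration of the instruction semantics as we understand them, not FEX's code, and the type and function names are hypothetical. Real SVE performs each operation in a single vector instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16x8 = std::array<uint8_t, 16>;  // 128 bits as sixteen 8-bit lanes
using Vec2x64 = std::array<uint64_t, 2>;  // 128 bits as two 64-bit lanes

// Scalar model of INDEX: lane i receives base + i * step, wrapped to 8 bits.
Vec16x8 ModelIndex(uint8_t base, uint8_t step) {
  Vec16x8 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = static_cast<uint8_t>(base + i * step);
  }
  return out;
}

// Scalar model of a wide-element left shift: every 8-bit lane takes its shift
// amount from the 64-bit lane that overlaps it. Shifts of 8 or more give zero.
Vec16x8 ModelLslWide(const Vec16x8& src, const Vec2x64& shifts) {
  Vec16x8 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    const uint64_t shift = shifts[i / 8];
    out[i] = shift >= 8 ? 0 : static_cast<uint8_t>(src[i] << shift);
  }
  return out;
}
```

Without these instructions, a translator has to build the index vector or broadcast and clamp the shift amount by hand, which is plausibly where the extra instructions came from.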
Adds WOW64 frontend
This has been a long time coming, with us adding initial mingw support back in FEX-2305. FEXCore now supports being built with a brand new WOW64 WINE frontend. While currently not being utilized, this will allow WINE to integrate FEX directly into its WOW64 layer for running both x86 and x86-64 applications on Arm64 host devices.
This is a very substantial change to how WINE integrates with FEX, since today FEX-Emu just runs the full x86-64 WINE process and eats the overhead of emulating everything WINE needs to do. With the WOW64 layer now implemented, a bunch of the WINE code can be Arm64 native code, and when it needs to execute application code it just jumps back to the emulator. This is similar to how Windows natively handles its emulation through its “XTA” layer. Sadly, today this is only wired up to work through the 32-bit x86 part of the layer. We still need to get set up to support WINE when it inevitably supports WOW64 for x86_64->Arm64.
Big shout out to ByLaws for implementing support for this! We look forward to future WINE integration work landing!
Implement thunking support for wayland-client and zink
We have some improvements to thunking this month! As we are working towards supporting thunking more code, we implemented some features to get wayland-client thunking wired up. While this support is early, it is enough to get Super Meat Boy up and running using wayland and zink overrides within a Wayland environment. We look forward to additional thunking improvements going forward so that performance can be improved everywhere.