FEX 2301 Tagged!

Happy new year! A new month brings a new release of FEX-Emu to ring in the new year.

A large amount of work landed this last month, showing that FEX-Emu isn’t slowing down even through the holiday season.

AVX emulation work continues

An absolute ton of work landed this last month towards bringing up AVX emulation. In total, around 185 new AVX instructions were implemented in FEX-Emu’s backend. At this point it is becoming easier to talk about the number of missing instructions than about what is implemented.

According to FEX-Emu’s instruction decoder tables, we have around 60 more instructions to implement before we can start advertising the feature. Of course, as with anything programming related, the last 10% is going to take the longest to implement.

A huge shoutout to @lioncash for smashing out these implementations so quickly. The amount of work going into this is extensive.

As a side note for users looking forward to this feature: the implementation currently requires hardware that supports both SVE and SVE2 with a 256-bit register width. This means that Fujitsu A64FX, Neoverse-V1, and all current consumer class Cortex chips are incapable of taking advantage of AVX once it is complete. This is a future-proofing implementation for when hardware becomes available that supports what FEX-Emu needs.
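To make the hardware requirement concrete, here is a minimal sketch of the check in C++. The struct, function, and constant names are hypothetical and are not FEX-Emu's actual code. Two things here are assumptions rather than statements from this post: the requirement is read as "at least 256 bits", and the feature sets listed for each chip reflect our understanding of the publicly documented hardware.

```cpp
#include <cstdint>

// Hypothetical description of a host CPU's vector features.
struct HostFeatures {
  bool HasSVE;
  bool HasSVE2;
  uint32_t SVEBits; // SVE vector length in bits, 0 if SVE is absent
};

// Assumption: "256-bit register width" is treated as "at least 256 bits".
constexpr bool CanUseAVXPath(const HostFeatures& Host) {
  return Host.HasSVE && Host.HasSVE2 && Host.SVEBits >= 256;
}

// Illustrative feature sets for the chips named above (assumed values).
constexpr HostFeatures A64FX{true, false, 512};          // wide SVE, no SVE2
constexpr HostFeatures NeoverseV1{true, false, 256};     // 256-bit SVE, no SVE2
constexpr HostFeatures ConsumerCortex{true, true, 128};  // SVE2, but only 128-bit
```

Each of the three fails a different part of the check, which is why none of them qualify even though all of them have some form of SVE.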

Implement a new AArch64 code emitter

One standout performance bottleneck has been how quickly FEX-Emu can emit AArch64 binary code to memory. Until now FEX-Emu used vixl, a project from ARM/Linaro, for this. vixl is a suite of tools including assemblers, simulators, and disassemblers, and many open source projects use it. It is a very nice project that eases the developer’s burden when writing a JIT that targets ARM devices. Sadly, profiling showed that FEX-Emu spends a decent amount of time inside of vixl code because of how large it is. Even with Link-Time-Optimization enabled in our code, we can’t reduce the overhead incurred from vixl.

With this in mind, FEX-Emu decided to create its own AArch64 code emitter tailored to what the project needs, which is high performance and low overhead.
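For a sense of what "low overhead" can mean here, the following is a minimal sketch of a direct emitter in C++. It is not FEX-Emu's emitter: the class and method names are hypothetical, and it covers only three instructions. The idea it illustrates is that each call computes one 32-bit A64 instruction word from its operands and appends it, with very little code executed per instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal emitter; names are illustrative, not FEX-Emu's API.
class TinyEmitter {
public:
  explicit TinyEmitter(size_t ExpectedInstructions) {
    Code.reserve(ExpectedInstructions); // avoid reallocation while emitting
  }

  // ADD Xd, Xn, Xm (64-bit, shifted-register form with LSL #0).
  void add(uint32_t Rd, uint32_t Rn, uint32_t Rm) {
    Emit(0x8B000000u | ((Rm & 31u) << 16) | ((Rn & 31u) << 5) | (Rd & 31u));
  }

  void nop() { Emit(0xD503201Fu); } // NOP
  void ret() { Emit(0xD65F03C0u); } // RET (returns via X30)

  // Emitted instruction words, in program order.
  const std::vector<uint32_t>& Words() const { return Code; }

private:
  // Every A64 instruction is exactly one 32-bit word.
  void Emit(uint32_t Instruction) { Code.push_back(Instruction); }

  std::vector<uint32_t> Code;
};
```

Calling `add(0, 1, 2)` followed by `ret()` produces the two words for `add x0, x1, x2` and `ret`. A real emitter writes into executable memory and handles labels, relocations, and many more encodings, all of which this sketch leaves out.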

As seen in the chart above, the difference in how long it takes to emit code between vixl and our new emitter is significant, with the Cortex-X1 taking only 68.7% of the time and the smaller Cortex-A55 taking only 60.2% of the time. The Cortex-A55 seeing more of a win shows that vixl executes so much code to emit code that it is effectively saturating the icache and BTB of the poor little CPU core.

Code emission performance isn’t the only story that matters here though. We need to show how much of an improvement this makes once the rest of the translation from x86 code is included.

Although code emission is only a portion of the total time spent translating x86 code, the new emitter gives a fairly massive ~8% reduction in time spent JITing. This will manifest as reduced stutter when users are running games, and generally faster execution for short-lived applications.

We’re not stopping there of course. Look forward to the coming months as we spend more time optimizing our JIT so it runs even faster!

Initial 32-bit thunk support

A tricky thing about FEX-Emu’s emulation is that it translates 32-bit x86 applications to run inside of a 64-bit process space. This makes thunking hard, which is why we don’t currently support thunking of libraries when running 32-bit applications. This month’s changes are the initial work required to start supporting this use case.

While not wired up to any library currently, we are quickly working towards getting Vulkan and OpenGL wired up to this interface so we can accelerate older 32-bit games.

Various JIT optimizations

There have been various JIT optimizations this month which will each improve performance a small amount. These aren’t benchmarked since the improvements are so small that they are likely to fall into single digit noise.

Optimize inline syscall spilling

When FEX handles a syscall inline in our JIT, we were spilling all of our registers to memory. With this optimization working correctly, we now spill only what is required, making inline syscalls faster.
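The general idea of spilling only what is required can be sketched with register bitmasks. This is a generic illustration and not a description of how FEX-Emu decides what to spill. The names are hypothetical, and the rule used (spill a register only if it is both live across the syscall and clobbered by it) is an assumption made for this example.

```cpp
#include <cstdint>

// Bit N set means hypothetical JIT register N is in the set.
using RegMask = uint32_t;

// Assumed rule: a register needs spilling only if its value is still
// needed after the syscall (live) AND the syscall path overwrites it.
RegMask RegsToSpill(RegMask Live, RegMask Clobbered) {
  return Live & Clobbered;
}

// Number of registers in a mask, i.e. number of spills it would cost.
int CountRegs(RegMask Mask) {
  int Count = 0;
  for (; Mask != 0; Mask &= Mask - 1) { // clear the lowest set bit
    ++Count;
  }
  return Count;
}
```

With a live set of `0x00F0` (four registers) and a clobbered set of `0x003C`, only the two registers in `0x0030` get spilled instead of all four.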

Optimize generic spilling and filling

When jumping out of the JIT to C code, we need to spill both general purpose registers and vector registers to the stack. With this optimization in place we now generate roughly half as many instructions when doing so.

Optimize SVE register spilling and filling

While not utilized today, this cuts the number of instructions required for spilling SVE registers to a quarter. It should be quite nice for future hardware.

Zip elements for PHSUB instructions

These horizontal vector instructions behave a little weirdly and our original JIT implementation wasn’t quite optimal. Previously we were doing explicit element inserts to combine the final result. Now we are using the AArch64 Zip instructions which are significantly more efficient.
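To show why a zip can replace element inserts, here is a small C++ model of PHSUBD. It is a sketch of the semantics only and not FEX-Emu's IR or generated code: the function names are hypothetical, and the split into "per-source half results combined by a zip" is our assumption about the shape of the operation. `Zip1_2D` models the AArch64 `ZIP1` instruction with 64-bit elements, which takes the low 64 bits of each input.

```cpp
#include <array>
#include <cstdint>

// Four 32-bit lanes of a 128-bit vector. Unsigned so subtraction wraps.
using Vec4 = std::array<uint32_t, 4>;

// Reference semantics of x86 PHSUBD: adjacent pairs are subtracted,
// destination pairs fill the low half, source pairs fill the high half.
Vec4 PhsubdReference(const Vec4& Dst, const Vec4& Src) {
  return {Dst[0] - Dst[1], Dst[2] - Dst[3], Src[0] - Src[1], Src[2] - Src[3]};
}

// Pairwise differences of one input, left in the low two lanes.
Vec4 PairwiseSubLow(const Vec4& V) {
  return {V[0] - V[1], V[2] - V[3], 0, 0};
}

// Model of ZIP1 with 64-bit elements: low 64 bits of N, then low 64 bits of M.
Vec4 Zip1_2D(const Vec4& N, const Vec4& M) {
  return {N[0], N[1], M[0], M[1]};
}

// One zip combines both half results; no per-element inserts are needed.
Vec4 PhsubdViaZip(const Vec4& Dst, const Vec4& Src) {
  return Zip1_2D(PairwiseSubLow(Dst), PairwiseSubLow(Src));
}
```

For any inputs, `PhsubdViaZip` matches `PhsubdReference`, with the final combine done by a single operation.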

Fix global application configurations

This was a bug where we accidentally broke application configurations shipped with the fex-emu package. In particular this caused steamwebhelper to break. With this resolved, Steam will work correctly again.

Fix misspelled library names in Thunks Database

While a fairly minor fix, this can have a profound impact on users that are using our thunking infrastructure. Our XCB thunks were incorrectly named, which meant that if users were enabling XCB thunks independently of Vulkan/GL, they wouldn’t have actually been enabled. With this typo fixed, this is no longer a concern.

Note that if Vulkan or GL thunks were enabled, this likely wouldn’t have been an issue since X11 would have loaded xcb independently anyway.

Misc

There was a bunch more this month that was smaller and spread out. We don’t want to take up too much of your time so if you want to see more, make sure to check out the detailed change log!

See the 2301 Release Notes or the detailed change log on GitHub.

Written on January 6, 2023