FEX 2504 Tagged

Good weekend emulation and PC gaming enthusiasts! Another month has passed and we have implemented a decent number of features, optimizations, and bug fixes. A lot of good things we cooked up, so let’s jump right in to it!

Slay the Spire audio fix

Over the past few months of optimizing code we had noticed that Slay the Spire’s audio had broken at one point. While trying to bisect the issue we couldn’t find when it had broken, was it always the case? After a weekend of struggling to debug it we finally tracked down exactly what was going wrong. A couple months ago now we implemented an optimization to remove x87 stack management from JIT code if we know the state of the stack. This is a great optimization because stack management is very costly. Turns out we had accidentally implemented F{INC,DEC}STP instructions backwards in one code path, pushing and popping the stack in the opposite directions! This had gone unnoticed because this isn’t a very common operation, and it also only broke under certain analysis conditions.

With that fixed and new unittests to ensure it never breaks again, Slay the Sphire’s audio is now working wonderfully!

Windows PE volatile metadata support

Volatile Metadata is an interesting feature that Microsoft implemented in MSCV 2019 and default enabled. This causes their compiler to generate metadata that the compiler identifies as “volatile”. This data is then used with their ARM64 Prism emulator to avoid costly emulation of the x86 memory model. As FEX developers and users are well aware, emulating the x86 memory model is the number one most costly feature and is required for games to function in a lot of cases. ARM has implemented FEAT_LRCPC{1,2,3} as a way to improve the performance of emulating the memory model. It’s a nice gesture from ARM but the performance is basically a joke on all shipping hardware. This is why Apple has hardware x86-TSO support and their performance is the best in the industry, side-stepping the problem entirely.

Microsoft’s compiler approach lessens the burden on our ARM hardware by just avoiding the problem all together and just having their JIT skip the emulation if the compiler says it is safe to do so. Because this feature has been implemented and enabled by default since 2019, there is actually a large number of games that have this metadata! FEX when running under a WINE-arm64ec environment will now use this metadata as well in order to improve emulation performance!

Hopefully once package maintainers start shipping Arm64ec WINE and FEX packages then this will be easy for users to get effectively free performance!

Optimizations as always

This month we optimized the JIT as we always do. This time around we have optimized SHA256 instructions, they are now using ARM’s equivalent instructions which provide roughly the same functionality. While these don’t tend to get heavily used for gaming, it’s nice when some loading code does some trivial hash checking to ensure the data is valid. With this new optimization we have roughly doubled the performance of SHA256 during emulation.

In addition we have implemented support for ARM’s FEAT_FRINTTS extension. This provides a handful of instructions for more quickly rounding float values in to integer restricted ranges. This is a fairly common operation and can result in games like Factorio getting a small performance improvement because of it.

Theres more optimizations and bug fixes in the JIT this month but we would be here forever if we talked about all of them. So here’s a list of of the various bits.

  • AVX 128-bit operation zeroing now use XZR GPR stores
  • Fix scalar reciprocal on AFP hardware
  • Optimize SVE maskmov loadstores
  • Strict BT flag generation
  • Remove trivial moves in CMPXCHG
  • Remove mask in register CMPXCHG
  • VPALIGNR trivial move optimization
  • Clear DF/RF flag on signal
  • Minor optimization for moving GPRs in to x87 registers
  • Spill PF/AF in one instruction instead of two
  • Fix x18 GPR saving
  • Tie VBSL source
  • Remove masking in SHLD
  • Fix address calculation again for 32-bit applications

See the 2504 Release Notes or the detailed change log in Github.

Written on April 4, 2025