FEX 2605 Tagged

We’re not even going to do an intro this month. We got ourselves some new hardware to play with, so let’s go!

Various JIT fixes and performance improvements

We can’t keep getting away with it. An emulator optimizing and fixing bugs in its JIT? Unheard of. This month we have various improvements littered around. We optimized some more x87 instructions in our reduced precision path again, this time hitting FPATAN, FYL2X, FSCALE, and F2XM1, with the typical 2x-4x improvement depending on the instruction. We’re starting to run out of x87 instructions to optimize at this point and 32-bit games can only run so fast!

Thanks to a new contributor bringing up this issue, we learned that cmpxchg8b/16b was setting some CPU flags incorrectly! This is a pretty big mistake and likely caused some games to spin on memory aggressively. Luckily these two instructions are relatively rare, so it wouldn’t be a common sight.
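To make the spinning failure mode concrete, here is a minimal sketch of the architectural result of CMPXCHG8B, ignoring atomicity. On x86 the instruction only writes ZF and leaves the other arithmetic flags alone. The post doesn't say which flags FEX got wrong, so this is a reference model rather than a reproduction of the bug, and every name in it (`ModelCmpXchg8b`, `CmpXchg8bResult`, `AtomicIncrementModel`) is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kZF = 1u << 6; // ZF is bit 6 of EFLAGS

// Hypothetical register/flag bundle for this sketch; not a FEX type.
struct CmpXchg8bResult {
  uint64_t edx_eax; // EDX:EAX after the instruction
  uint32_t eflags;  // EFLAGS after the instruction
};

// Architectural result of CMPXCHG8B (atomicity is not modelled here).
inline CmpXchg8bResult ModelCmpXchg8b(uint64_t& memory, uint64_t edx_eax,
                                      uint64_t ecx_ebx, uint32_t eflags) {
  if (memory == edx_eax) {
    memory = ecx_ebx;               // success: store ECX:EBX
    return {edx_eax, eflags | kZF}; // ZF = 1, other flags untouched
  }
  return {memory, eflags & ~kZF};   // failure: load memory, ZF = 0
}

// Guest-style retry loop. It only terminates if ZF is reported correctly.
inline uint64_t AtomicIncrementModel(uint64_t& memory) {
  uint64_t expected = 0; // deliberately stale first guess
  for (;;) {
    const CmpXchg8bResult r = ModelCmpXchg8b(memory, expected, expected + 1, 0);
    if (r.eflags & kZF) {
      assert(memory == expected + 1);
      return expected + 1;
    }
    expected = r.edx_eax; // retry with the value the instruction handed back
  }
}
```

A retry loop like `AtomicIncrementModel` is the usual shape of code built on these instructions, which is why a wrong flag can turn into spinning rather than an outright crash.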

Another bug fixed was with the SSE MAXPS and MAXPD instructions not breaking ties correctly on Inf/NaN inputs, which could have caused some issues. Some x87 operations had similarly trivial problems in their denormal result checks, which were also resolved. Additionally, pushing and popping of 16-bit segment registers wasn’t actually working correctly. No idea who uses that feature on 64-bit CPUs today, but we at least know it’s fixed!
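For reference, the x86 tie-breaking rule for one MAXPS lane is small enough to write down: the second source operand wins unless the first is strictly greater. That means a NaN in either operand, or two zeros of opposite sign, yields the second operand. The sketch below models a single float lane. `ModelMaxpsLane` is a hypothetical name and this is not FEX’s implementation.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// One lane of x86 MAXPS: src2 is returned unless src1 is strictly greater.
// Any comparison involving NaN is false, so NaN inputs also select src2.
inline float ModelMaxpsLane(float src1, float src2) {
  return src1 > src2 ? src1 : src2;
}

// Usage: the cases where operand order matters.
inline void MaxpsTieExamples() {
  const float nan = std::numeric_limits<float>::quiet_NaN();
  const float inf = std::numeric_limits<float>::infinity();
  assert(ModelMaxpsLane(nan, inf) == inf);               // NaN in src1: src2 wins
  assert(std::isnan(ModelMaxpsLane(inf, nan)));          // NaN in src2: src2 wins
  assert(std::signbit(ModelMaxpsLane(0.0f, -0.0f)));     // zeros tie: src2 (-0.0)
  assert(!std::signbit(ModelMaxpsLane(-0.0f, 0.0f)));    // zeros tie: src2 (+0.0)
}
```

The rule is not symmetric in its operands, so a host max instruction that propagates NaNs or orders signed zeros can’t be used as a drop-in replacement without extra handling.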

This month we also optimized the MMX PSHUFW instruction, finding more cases in which we can reduce a heavy table lookup into either one or two instructions. We also modified the x87 class of FIST instructions to access memory without TSO emulation, which improves performance.
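As a rough illustration of the PSHUFW idea, the sketch below gives the reference semantics of the instruction (each 2-bit field of the immediate selects a source word) and a classifier that spots two immediates a JIT could plausibly lower without a table lookup. The categories and names here (`ShuffleKind`, `ClassifyPshufw`) are assumptions for illustration only. The post doesn’t list which cases FEX actually special-cases.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mmx = std::array<uint16_t, 4>; // four 16-bit words of an MMX register

// Reference semantics: bits [2i+1:2i] of imm8 select the source word for dst[i].
inline Mmx ModelPshufw(const Mmx& src, uint8_t imm8) {
  Mmx dst{};
  for (int i = 0; i < 4; ++i) {
    dst[i] = src[(imm8 >> (2 * i)) & 3];
  }
  return dst;
}

// Hypothetical classification of immediates that need no general shuffle.
enum class ShuffleKind { Identity, Broadcast, Generic };

inline ShuffleKind ClassifyPshufw(uint8_t imm8) {
  if (imm8 == 0xE4) { // selectors 3,2,1,0: every word stays in place
    return ShuffleKind::Identity;
  }
  if (imm8 == (imm8 & 3) * 0x55) { // all four selectors are equal
    return ShuffleKind::Broadcast;
  }
  return ShuffleKind::Generic; // would fall back to the general path
}

// Usage: a broadcast immediate replicates one word across the register.
inline void PshufwExample() {
  const Mmx src{10, 20, 30, 40};
  assert(ClassifyPshufw(0xAA) == ShuffleKind::Broadcast);
  assert((ModelPshufw(src, 0xAA) == Mmx{30, 30, 30, 30}));
}
```

Because the immediate is known at JIT time, a classification like this costs nothing at runtime.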

Fixed a crash with ARM64ec and controllers

This month we found a nasty bug that cropped up in an edge case. We only really noticed this because SDL behaves differently when a DualSense controller is plugged in, for whatever reason.

The core of the issue comes down to a feature in ARM64ec called a suspend doorbell. When WINE wants to pause an emulated thread so that it is in a safe location, it sets a doorbell which FEX watches, and FEX cooperatively suspends the thread if it is set. Turns out we had a bug in the code that would always crash in this particular edge case, but because it is so uncommon we never really noticed it.

When a controller was connected, WINE was asking for our emulated threads to suspend so frequently that it basically became a unit test by itself and immediately caught the issue. Easy fix, but it took a couple of days to figure out this wacky race condition between multiple threads and signals getting passed around!
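For readers unfamiliar with the pattern, a doorbell-style cooperative suspend can be sketched as a flag that one side sets and the running thread polls at safe points. The version below is a single-threaded toy under heavy assumptions: the real ARM64ec mechanism, its data layout, and FEX’s handling are not described in this post, and all names (`EmulatedThread`, `RequestSuspend`, `PollDoorbell`, `RunBlocks`) are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical per-thread state; not the real ARM64ec or FEX layout.
struct EmulatedThread {
  std::atomic<bool> doorbell{false}; // set by whoever wants the thread paused
  bool suspended = false;            // set once the thread has parked itself
  int blocks_executed = 0;
};

// Suspender side: ring the doorbell and let the thread notice it.
inline void RequestSuspend(EmulatedThread& t) {
  t.doorbell.store(true, std::memory_order_release);
}

// Emulator side, called only at safe points. Returns true if the thread parked.
inline bool PollDoorbell(EmulatedThread& t) {
  if (!t.doorbell.exchange(false, std::memory_order_acquire)) {
    return false;
  }
  t.suspended = true;
  return true;
}

// Runs up to `budget` translated blocks, checking the doorbell before each one.
// Returns how many blocks ran during this call.
inline int RunBlocks(EmulatedThread& t, int budget) {
  assert(budget >= 0);
  int ran = 0;
  while (ran < budget && !PollDoorbell(t)) {
    ++t.blocks_executed;
    ++ran;
  }
  return ran;
}
```

The interesting bugs live in what this toy leaves out: the window between ringing the doorbell and the thread reaching a safe point, with signals arriving in the middle.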

Query DCZID_EL0 on ARM64ec so CLZERO works

This is a fairly simple fix. We were forgetting to test whether ARM’s DC ZVA instruction matches CLZERO behaviour, causing a crash in applications that unconditionally use the instruction. This just got lost with the initial implementation and we didn’t really notice it because it’s fairly uncommon in today’s games. Luckily we have some microbenchmarks that abuse it and found the issue!
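The check itself boils down to decoding two fields of DCZID_EL0: bit 4 (DZP) says whether DC ZVA is prohibited, and bits [3:0] (BS) give log2 of the block size in 4-byte words. A minimal sketch of that decode follows, assuming the register value has already been read (on AArch64 that is an `mrs` of `dczid_el0`) and that CLZERO is treated as zeroing one 64-byte cacheline. The function names are hypothetical, not FEX’s.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kClzeroBytes = 64; // CLZERO zeroes one 64-byte cacheline

// Hypothetical decoded view of DCZID_EL0 for this sketch.
struct DczidInfo {
  bool zva_permitted;   // DZP (bit 4) clear
  uint64_t block_bytes; // 4 bytes per word, 2^BS words
};

inline DczidInfo DecodeDczid(uint64_t dczid_el0) {
  const bool prohibited = ((dczid_el0 >> 4) & 1) != 0; // DZP
  const uint64_t bs = dczid_el0 & 0xF;                 // log2(words per block)
  return {!prohibited, uint64_t{4} << bs};
}

// DC ZVA can stand in for CLZERO only if it is allowed and zeroes exactly 64 bytes.
inline bool CanUseDcZvaForClzero(uint64_t dczid_el0) {
  const DczidInfo info = DecodeDczid(dczid_el0);
  return info.zva_permitted && info.block_bytes == kClzeroBytes;
}

// Usage: BS = 4 means 16 words, which is 64 bytes.
inline void DczidExample() {
  assert(CanUseDcZvaForClzero(0x04));
  assert(!CanUseDcZvaForClzero(0x14)); // same size, but DZP set
}
```

When the check fails, an emulator would need some other fallback, such as plain stores of zero.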

Snapdragon X2 Elite fixes

This month we finally got our hands on some new Snapdragon X2 Elite platforms for testing! While they don’t yet support Linux out of the box, Qualcomm is pushing commits each kernel release to move the needle. With kernel 7.1 or 7.2 we might even have a working GPU!

Since this was the first time we actually got our hands on the platform, we had to deal with some new bugs and features we noticed in the hardware.

  • RNDRRS is still broken just like X1E, so we disable the RNG feature
  • Still ships a 19.2MHz cycle counter, so it’s ARMv9.0-a compliant hardware
    • ARMv9.1-a mandates a 1GHz cycle counter
  • It supports SVE2 (with 128-bit registers) and SME, and KVM virtualization!
  • The hardware supports some new atomic behaviour!

Let’s dive into that last bullet point, because it’s highly relevant to FEX. As usual for emulating x86, the hardest problem is emulating the memory model. Regular loads and stores turn into acquire load and release store instructions, which programmers tend to call “atomic” operations. Even the C++ standard wraps these in std::atomic<>. Then of course any x86 instruction with a LOCK prefix is an atomic memory operation that does a read-modify-write (RMW) of the location in memory.

x86 has this extremely cool feature where the alignment of your accesses in memory doesn’t really matter, as long as they don’t cross a cacheline, which is always 64 bytes in size. If you cross a cacheline, then you invoke your CPU’s wrath and engage what is known as a split-lock, which is very slow and also slows down other processes on your system. Chips and Cheese actually had a really nice article recently on the performance of these split-locks. TL;DR: the latency of an atomic goes from something like 1-2 nanoseconds inside of a cacheline up to around 700 nanoseconds when a split-lock is engaged (or worse, depending on hardware).

FEX needs to emulate these split-locks in a way that works on ARM, but it turns out that the latest ARM CPUs are even worse off. When your atomic operation crosses a 16-byte granularity, the operation will raise a SIGBUS. FEX then needs to catch this SIGBUS and emulate the split-lock functionality. This happens dramatically more frequently because of the 16-byte granularity instead of 64-byte granularity. An additional problem FEX has is that emulating the split-lock isn’t actually safe on ARM, because we will always have the potential to tear.
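To put a number on “dramatically more frequently”, here is a small sketch that tests whether an access straddles a power-of-two granule and counts, for every byte offset within one 64-byte cacheline, how many start positions would straddle. It is plain arithmetic, not FEX code, and the helper names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if [addr, addr + size) straddles a boundary of `granule` bytes.
// `granule` must be a power of two.
inline bool CrossesGranule(uint64_t addr, size_t size, uint64_t granule) {
  assert(granule != 0 && (granule & (granule - 1)) == 0);
  return (addr & (granule - 1)) + size > granule;
}

// Of the 64 possible start offsets in a cacheline, how many straddle a granule?
inline int CountCrossingOffsets(size_t size, uint64_t granule) {
  int count = 0;
  for (uint64_t offset = 0; offset < 64; ++offset) {
    if (CrossesGranule(offset, size, granule)) {
      ++count;
    }
  }
  return count;
}

// Usage: a 4-byte access straddles at 3 of 64 offsets with a 64-byte granule,
// but at 12 of 64 offsets with a 16-byte granule.
inline void GranuleExample() {
  assert(CountCrossingOffsets(4, 64) == 3);
  assert(CountCrossingOffsets(4, 16) == 12);
}
```

Under the simplifying assumption of uniformly distributed misaligned offsets, that is four times as many faulting positions for a 4-byte access, and each one costs a signal round trip.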

Luckily! Qualcomm has partially solved this problem on X2E! With the latest iteration of their hardware, RMW atomic operations will now only SIGBUS when crossing a 64-byte cacheline granularity instead of every 16 bytes! This is a boon to correctness for the common case, which is to be commended. Somewhat interestingly, although they support RMW atomic operations without SIGBUS within a cacheline now, the acquire load and release store instructions still SIGBUS when crossing a 16-byte granularity. This is sadly the more common experience and what tends to cause performance issues with hand-coded memcpy/memset functions inside of games.
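The behaviour described above can be summarised as a per-operation fault granule. The sketch below encodes only what this post states: 16 bytes for everything on the earlier hardware, and on X2E 64 bytes for RMW atomics but still 16 bytes for acquire loads and release stores. The enum and function names are hypothetical, and this is a model of the description, not a hardware query.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class AtomicKind { ReadModifyWrite, AcquireLoadOrReleaseStore };
enum class Hardware { Granule16Everywhere, SnapdragonX2Elite };

// Granule (in bytes) that an access must stay within to avoid SIGBUS,
// per the behaviour described in the post.
inline uint64_t FaultGranule(Hardware hw, AtomicKind kind) {
  if (hw == Hardware::SnapdragonX2Elite && kind == AtomicKind::ReadModifyWrite) {
    return 64;
  }
  return 16;
}

inline bool WouldSigbus(Hardware hw, AtomicKind kind, uint64_t addr, size_t size) {
  const uint64_t granule = FaultGranule(hw, kind);
  assert(size > 0 && size <= granule);
  return (addr & (granule - 1)) + size > granule;
}

// Usage: a 4-byte RMW at offset 14 faults with the 16-byte rule but not on X2E.
inline void X2eExample() {
  using K = AtomicKind;
  assert(WouldSigbus(Hardware::Granule16Everywhere, K::ReadModifyWrite, 14, 4));
  assert(!WouldSigbus(Hardware::SnapdragonX2Elite, K::ReadModifyWrite, 14, 4));
  assert(WouldSigbus(Hardware::SnapdragonX2Elite, K::AcquireLoadOrReleaseStore, 14, 4));
}
```

An RMW at offset 62 still faults on both, which is the true cacheline split that nobody can emulate without tearing yet.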

We also still don’t have a way to correctly emulate split-locks crossing a 64-byte granularity without tearing, but maybe the ARM-lords will grant us CASP cacheline straddling split-lock support some day.

See the 2605 Release Notes or the detailed change log on GitHub.

Written on May 8, 2026