FEX 2604 Tagged

We were a little bit late with this month's release. Turns out getting distracted hunting bugs for a week does that. Let’s jump into what has changed!

More memory savings

This month we landed some memory-saving changes, which are vitally important for 8GB and 16GB systems. Primarily, we have now enabled our Dynamic L1 lookup cache and disabled our L2 lookup caches by default. We talked about these options in more detail in the FEX-2511 release post; changing these defaults can save hundreds of megabytes.

Additionally, we have fixed a pseudo-leak in one of our thread-pool allocators. It wasn’t quite a real leak because each thread only ever held a single allocation, but the pool is supposed to share allocations between threads, so its memory usage ballooned pretty heavily for games that create a lot of threads. For our test game, ENDER LILIES: Quietus of the Knights, the fix took this pool from 409MB of memory down to 6MB.

Another change that occurred this month is that FEX is now more aware of Transparent Huge Pages (THP) potentially causing us to consume more memory than expected. When the THP operating mode was set to always instead of madvise, we were consuming significantly more RAM than expected. Arch Linux currently defaults to always, which caught us by surprise in our testing. FEX will now actively ask for THP or non-THP buffers depending on their use case, which can dramatically reduce memory usage for our sparse buffers on systems that default to always. As a side effect, our JIT code buffer now always asks for a THP buffer, which cut iTLB misses in half in our testing and dramatically reduces pressure on the CPU’s L2 TLB lookups.
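To illustrate what asking for THP or non-THP buffers can look like on Linux, here is a minimal C++ sketch built on mmap and madvise. The names MapWithThpHint, HugePagePolicy, Mapping and Unmap are hypothetical and are not FEX's actual allocator interface; the sketch only assumes that the buffers are anonymous mappings and that the hint is given with MADV_HUGEPAGE or MADV_NOHUGEPAGE.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical policy enum for this sketch, not a FEX type.
enum class HugePagePolicy { Prefer, Avoid };

struct Mapping {
  void* ptr;          // nullptr when mmap failed
  std::size_t size;   // size of the mapping in bytes
  bool hint_applied;  // false if the kernel rejected the madvise hint
};

// Map an anonymous read/write buffer and tell the kernel whether
// transparent huge pages are wanted for it.
inline Mapping MapWithThpHint(std::size_t size, HugePagePolicy policy) {
  assert(size > 0);
  void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (ptr == MAP_FAILED) {
    return {nullptr, 0, false};
  }

  // Dense, hot buffers opt in; sparse buffers opt out.
  const int advice = policy == HugePagePolicy::Prefer ? MADV_HUGEPAGE
                                                      : MADV_NOHUGEPAGE;
  // The hint can fail (for example on kernels built without THP);
  // the mapping itself is still usable in that case.
  const bool hint_applied = madvise(ptr, size, advice) == 0;
  return {ptr, size, hint_applied};
}

inline void Unmap(const Mapping& mapping) {
  if (mapping.ptr != nullptr) {
    munmap(mapping.ptr, mapping.size);
  }
}
```

With the system mode set to always, MADV_NOHUGEPAGE opts a sparse buffer out of huge pages; with the mode set to madvise, MADV_HUGEPAGE opts a dense buffer in.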

A smattering of bug fixes and performance improvements

As usual, we have a large number of bug fixes and performance improvements. Each one is small and there are too many to list them all, but we do have some highlights.

Inline SIN/COS/TAN for x87 reduced precision

One of the most costly things that our JIT can do is x87 emulation and jumping out of the JIT for a helper. Unfortunately, they tend to come hand-in-hand. This month we have optimized these three transcendental operations so they no longer jump out of the JIT, which has sped up the operations by an average of 3.7x! This makes games that hit these x87 transcendentals, like Bayonetta and Fallout: New Vegas, go quite a bit faster, improving their playability on a larger set of systems.

Additional changes are as follows:

Performance

Replace a code invalidation mutex with our hand-rolled implementation that is dramatically faster
Wire up FEAT_MOPS support. The Samsung Exynos 2600 is one of the first SoCs with support
Rearrange some Arm64EC dispatcher code for performance
Optimize a vector broadcast a game was hitting
Skip ELF parsing when code caching is disabled

Bug fixes

Fix prefetch encoded nop instructions
Ensure MXCSR is saved and restored correctly on signal
Reset relocation data on JIT restart

Work around a Docker seccomp filter bug

A user has been tinkering with FEX inside of a Docker environment and uncovered an issue where FEX was crashing in really bizarre ways. We eventually tracked this down to some syscalls returning broken results due to a bug in Docker’s seccomp filter rules. It turns out that their filter follows neither the AAPCS64 nor the SystemV x86-64 ABI rules around zero extending arguments that are smaller than the register size. Under those rules it is on the callee to do the zero extension of the argument, so any garbage in the upper bits of the source register should get ignored.

Because Docker’s seccomp filter only ever compares the values passed to system calls at the full size of the register, 8/16/32-bit arguments with garbage in the upper bits can fail the comparison, and the syscall incorrectly returns -EPERM for perfectly valid data. We manually worked around the one instance we saw causing problems locally, but Docker needs to audit their seccomp filters and correctly handle this for a real fix!
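The difference between the two comparison strategies can be sketched in a few lines of C++. This is only an illustration of the failure mode described above, not Docker's actual filter code; FullWidthMatch, TruncatedMatch and kDirtyRegister are hypothetical names, and the example assumes a 32-bit argument passed in a 64-bit register.

```cpp
#include <cstdint>

// Compares the whole 64-bit register against the allowed value.
// Garbage in the upper 32 bits makes a valid argument look wrong.
constexpr bool FullWidthMatch(std::uint64_t reg, std::uint32_t allowed) {
  return reg == allowed;
}

// Truncates the register to the argument's declared width first,
// which is what the callee is expected to do under both ABIs.
constexpr bool TruncatedMatch(std::uint64_t reg, std::uint32_t allowed) {
  return static_cast<std::uint32_t>(reg) == allowed;
}

// A register whose low 32 bits hold the valid argument 2 and whose
// upper 32 bits hold leftover garbage.
constexpr std::uint64_t kDirtyRegister = 0xDEADBEEF00000002ULL;

static_assert(!FullWidthMatch(kDirtyRegister, 2),
              "full-width compare rejects the valid argument");
static_assert(TruncatedMatch(kDirtyRegister, 2),
              "truncating compare accepts the valid argument");
```

A filter that behaves like FullWidthMatch would reject this call even though the argument the kernel actually consumes is valid.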

Add option to FEXGetConfig to show fault granularity

One of the major struggles with emulating x86’s TSO is that our memory accesses fault when unaligned, which adds dramatic overhead compared to x86, where accesses basically never fault due to alignment problems. This is slightly improved on newer ARM CPUs, where the FEAT_LSE2 extension removes a percentage of faults by allowing unaligned accesses inside of a 16-byte granule. With this new run-time test, we can visualize when instructions are going to fault to showcase how bad it is.

First a system that doesn’t support FEAT_LSE2

Then a system that supports FEAT_LSE2

The green pips show which byte-aligned memory accesses complete without faulting or causing problems, while the red pips show where a fault will occur and we end up needing to backpatch the code with a memory fence, or simulate the operation in the signal handler. As seen, there is still quite a bit of red on the graphs even with the best hardware for this. Meanwhile, if we had a similar test for x86, all the pips would be green except the 128-bit result, which behaves the same on both architectures (except for vector accesses, which this doesn’t test).

We added this test capability so that if any hardware in the future does decide to fix this performance and correctness problem, then we get a very quick test we can run to detect it.
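As a rough model of the alignment rule discussed in this section, the C++ sketch below predicts which offsets inside a 16-byte granule would be fault-free for a given access size. It is a simplification under stated assumptions, not the test FEXGetConfig runs: it assumes an access is fine when naturally aligned, and that FEAT_LSE2 additionally allows any access that fits entirely inside one 16-byte aligned granule. The function names and the '+'/'x' rendering are hypothetical stand-ins for the green and red pips.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// True when the address is a multiple of the access size.
constexpr bool IsNaturallyAligned(std::uint64_t addr, std::uint64_t size) {
  return addr % size == 0;
}

// True when every byte of the access lies inside one 16-byte granule.
constexpr bool FitsIn16ByteGranule(std::uint64_t addr, std::uint64_t size) {
  return (addr % 16) + size <= 16;
}

// Simplified model: aligned accesses never fault; with FEAT_LSE2,
// unaligned accesses are also fine if they stay inside the granule.
constexpr bool AccessIsFaultFree(std::uint64_t addr, std::uint64_t size,
                                 bool has_lse2) {
  return IsNaturallyAligned(addr, size) ||
         (has_lse2 && FitsIn16ByteGranule(addr, size));
}

// Render one row of pips for offsets 0..15: '+' for fault-free,
// 'x' for an access that would fault under this model.
inline std::string PipRow(std::uint64_t size, bool has_lse2) {
  assert(size > 0);
  std::string row;
  for (std::uint64_t offset = 0; offset < 16; ++offset) {
    row += AccessIsFaultFree(offset, size, has_lse2) ? '+' : 'x';
  }
  return row;
}
```

Calling PipRow for sizes 1, 2, 4, 8 and 16 with has_lse2 set to false and then true gives a text approximation of the two graphs: even with FEAT_LSE2, any access that straddles a 16-byte boundary is still marked as faulting.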

See the 2604 Release Notes or the detailed change log on GitHub.

Written on April 9, 2026