FEX 2607 Tagged

Time certainly goes by quicker than you’d expect. We even managed to skip last month’s release because we were busy doing other things. Let’s take that as our cue and fly through the changes that we made over these past two months!

Optimizations and fixes for 256-bit SVE2 hardware

While this hardware doesn’t exist yet, we know it will inevitably arrive at some point. Although we switched gears a couple of years ago to implement AVX using 128-bit operations, we never removed the 256-bit code. In fact, we’ve been spending a good amount of effort fixing bugs in it and optimizing it so that it won’t be broken once hardware ships. We have now extensively validated that all AVX instructions zero-extend their results as expected, and we optimized a number of instructions so that they generate faster code. We think that in the common case this path will now be faster than our 128-bit emulation, but there is definitely still more work to do. SVE2.1 provides a significant improvement to how shuffles operate, and we haven’t implemented those yet. Because there isn’t any 256-bit SVE2 hardware on the market, we might even require SVE2.1 or SVE2.2 for this class of hardware. We’ll of course still maintain our 128-bit path for lower-end hardware.
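
To make the zero-extension rule concrete: a VEX-encoded 128-bit instruction writes the low 128 bits of its destination and clears bits 255:128, whereas a legacy SSE encoding leaves the upper half untouched. The Python sketch below models a 256-bit register as a plain integer. The helper names are made up for illustration and are not FEX code.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1

def write_xmm_legacy_sse(ymm: int, result128: int) -> int:
    """Legacy SSE encoding: bits 255:128 of the register are preserved."""
    upper = ymm & MASK256 & ~MASK128
    return upper | (result128 & MASK128)

def write_xmm_vex128(ymm: int, result128: int) -> int:
    """VEX.128 encoding: bits 255:128 of the register are zeroed."""
    # The previous register contents (ymm) are deliberately ignored.
    return result128 & MASK128

# A register whose upper half holds stale data from an earlier 256-bit op.
old = (0xDEADBEEF << 128) | 0x1234

kept = write_xmm_legacy_sse(old, 0x5678)
cleared = write_xmm_vex128(old, 0x5678)

assert kept >> 128 == 0xDEADBEEF   # upper half survives
assert cleared >> 128 == 0         # upper half is zero
```

An emulator that computes the low 128 bits correctly but forgets the clearing step would behave like the first function, which is the kind of mismatch the validation work is meant to catch.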

Various JIT fixes/changes

Once again, there are too many to go through individually, so let’s throw them in a list. Most of these are just bug fixes, but there are a handful of optimizations as well.

  • Fix corruption with back-to-back PMOVMSKB instructions and full SMC detection
    • Fixes Vivado in this situation
  • Handle incorrect LOCK prefixed instructions correctly
  • Allow larger CPU context state
    • Fixes compilation error with musl
  • Fix vsyscall page tracking
    • Was accidentally a NOEXEC page
  • Only use DC ZVA for Ampere when clearing AVX state
    • It’s the same or slower on other hardware
  • Fix zero extension in VCVTPS2PH
  • Fix CRC32 with high 8-bit registers
  • Fix 64-bit LODS instruction with address size override
  • Fix Mafia 3 in Arm64EC Proton
  • Fix Bioshock and other 32-bit games that disable DEP on Arm64EC Proton
    • Requires bleeding-edge Proton or WINE
  • Optimize x87 FYL2X/FPREM/FPREM1
  • Fix incorrect RSP update on 16-bit LEAVE instruction
  • Handle some new CPL0 only instructions correctly
  • Fix JIT allocations ending up in 32-bit VA space
    • Fixes some 32-bit games under Proton
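
As an illustration of the CRC32 item in the list above: in the `CRC32 r32, r/m8` form without a REX prefix, register encodings 4 to 7 name the high-byte registers AH, CH, DH, and BH, so the source byte is bits 15:8 of the full register rather than bits 7:0. The Python sketch below models that byte selection together with the CRC-32C (Castagnoli) byte step the instruction performs. It is a simplified model with made-up helper names, not FEX's implementation.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reflected form of 0x1EDC6F41

def crc32c_u8(crc: int, byte: int) -> int:
    """One CRC32 r32, r/m8 step: fold a single byte into the accumulator."""
    crc = (crc ^ (byte & 0xFF)) & 0xFFFFFFFF
    for _ in range(8):
        crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc

def read_low8(reg64: int) -> int:
    """AL/CL/DL/BL: bits 7:0 of the full register."""
    return reg64 & 0xFF

def read_high8(reg64: int) -> int:
    """AH/CH/DH/BH: bits 15:8 of the full register."""
    return (reg64 >> 8) & 0xFF

rax = 0xAB00  # AH = 0xAB, AL = 0x00

# Reading the wrong byte silently produces a different checksum.
assert crc32c_u8(0, read_high8(rax)) != crc32c_u8(0, read_low8(rax))

# Sanity check against the standard CRC-32C check value for "123456789".
crc = 0xFFFFFFFF
for b in b"123456789":
    crc = crc32c_u8(crc, b)
assert crc ^ 0xFFFFFFFF == 0xE3069283
```

The instruction itself applies no initial or final inversion. The `0xFFFFFFFF` values in the sanity check are the conventional CRC-32C wrapper that software adds around it.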

Implement support for a unixlib under Proton/WINE

For a long time, FEX’s WINE support has worked by building all of FEX as a DLL file that WINE loads at runtime. This has worked pretty well for us, but it has constrained some of our design decisions at times, pushing us toward less optimal or hacky paths. We also have some upcoming changes that would be dramatically harder to implement without a unixlib, so we decided to start using one. FEX for WINE now ships two DLL files and two unixlib SO files.

Right now we are duplicating some code between the unixlib and the FEX WINE DLL file to make sure everything works correctly, but within a few months we are likely to remove the duplicated code and rely entirely on the unixlib. Currently all the functionality the unixlib provides is optional, but we recommend using it because it may eventually become required.

Partial support for CUDA thunking

Some people were asking for CUDA thunking on the DGX Spark, so we decided to implement partial support for it. Coverage isn’t 100%, but depending on the workload it may work. We know today that applications with the CUDA runtime statically linked won’t work well. Other than that, give it a try and you might be surprised at what works.


See the 2607 Release Notes or the detailed change log on GitHub.

Written on July 2, 2026