FEX 2303 Tagged!

Oh jeez, another month already? I guess it’s time for another FEX-Emu release. Let’s pick a commit, spin the roulette wheel, and hope for the best! Surely that’s how releases work?

Rootfs images are now on a new CDN!

While this is something that doesn’t directly impact FEX when running applications, it’s a problem that most of our users need to deal with when installing FEX. Our previous CDN which was hosting our x86 images had a fair number of problems that couldn’t be solved. The main issue that affected users was that it was slow to download the images and depending where you were in the world, it could have an unstable connection. This resulted in gigabyte sized files taking forever to download or never at all!

This month we have switched our CDN to a service that has worldwide data replication across multiple dataservers. This improves the speed in which users can download our prebuilt images. Going from an average of 20MB/s to over 300MB/s is a significant boost. In addition to that, the connection is significantly more stable to the far corners of the world. Also something that doesn’t affect users at all is that this new CDN is actually significantly lower cost than what we are currently using. This was unexpected but it’s a nice bonus that this CDN is an improvement is every regard, including cost.

This month’s code changes

With that out of the way, onward to this month’s changes.

Optimize REP STOS instruction in to inline memset

This is an instruction that x86 offers that behaves similarly to a memory set operation. It behaves slightly differently since this allows you to set the memory by element size, and also you can choose to direction in which the memory is set. In particular this instruction tends to get used for zeroing out memory. Latest x86 CPUs have even optimized this instruction in order to be fast as possible. Previously FEX had decomposed this instruction in to a complex series of code blocks that was inefficient for our JIT and everything surrounding it. Now we instead convert this to a single IR operation called MemSet which exposes the semantics of how the instruction works. Allowing our IR to be cleaner and the backend to decompose it in a more optimal fashion. Currently we emit a a fairly trivial loop that handles this memory set operation. ARM has recently announced that future CPUs are going to support a memory set instruction that is very similar to the 8-bit REP STOS which will make this implementation even faster!

As seen by this graph, FEX is no where near a native implementation. It’s important to note that even without writing “optimal” codegen, this change has still given FEX up to an 11% performance improvement on its implementation. This was primarily focused around improving the IR, we can now optimize the code that the JIT emits significantly more easily! Getting closer to native is likely something to come in the future.

Add config option hide hypervisor CPUID bit

We encountered the first game that has anti-virtual machine code and refuses to run if it thinks it is running in a VM. While FEX isn’t a virtual machine, we expose this CPUID bit so software that cares can use it as hint to query FEX specific CPUID information. Now that this game has stumbled upon this issue, we added a configuration profile to disable this CPUID bit for the game. If any other games also pick up on this issue then we will need more profiles.

Proton and pressure-vessel startup optimizations

One of this months efforts have been about improving the time it takes for Proton to startup. pressure-vessel is the project that is used to setup the Proton execution environment which takes a while overall. One of the hardest things about Proton is that it executes thousands of programs and does an absolute ton of filesystem accesses. ARM devices typically don’t have the highest performance filesystems, which makes one part of this hard, but also FEX’s filesystem overlay adds overhead to this. Additionally one of FEX’s shortcomings currently is that every application execution must JIT fresh code every time it restarts. Since pressure-vessel starts so many programs, a lot of the time is just spent emitting code to memory. There were a few optimizations that went towards making this faster this month.

With the couple of optimizations in place we managed to shave a second off of the start-up time. Cutting the execution from 9.7 seconds down to 8.7 seconds. Or in the case of running on an Apple M1, execution is now down to 7 seconds. Almost all of this time improvement comes from faster syscall wrapping and the remaining CPU time is code JIT and execution. It’ll only get faster in the future!

Fix a race condition with syscall emulation

While this is a fairly minor change, we fixed a race condition around system calls which would consistently cause crashes when Steam was starting up. Every piece of work that improves stability just makes the whole emulation experience so much better and needs to be celebrated!

Signal frame improvements!

A significant problem with using FEX is the debugging experience when something breaks. We spent a good amount of time this month improving how FEX sets up its signal frames when the guest application hits a fault. Since we weren’t following traditional signal frame generation, tooling around backtracing was broken in most cases. We have now reworked this so that libSegFault will now work to give FEX a backtrace of the application’s state when it crashes.

We will be shipping a new rootfs which includes x86 and x86-64 libraries for libSegFault so that if users want to debug a crashing application, they can try and get a backtrace.

AVX work continues

Another month, another bunch of AVX work that has been implemented.

Instructions implemented

  • VPHSUBSW
  • VHSUBPD/VHSUBPS
  • VPERMILPD/VPERMILPS
  • VPERMD/VPERMPS
  • VPHADDSW
  • VPTEST
  • VPMOVSD/VPMOVSS
  • VSHUFPD/VSHUFPS
  • VPSHUFD/VPSHUFHW/VPSHUFLW
  • VPSHUFB
  • VPALIGNR
  • VEXTRACTF128/VEXTRACTI128
  • VPBLENDVB/VBLENDVPD/VBLENDVPS
  • VBLENDPD/VPBLENDW

As you can see a lot of new instructions are now implemented. This now leaves us with about thirty more instructions that need to be implemented before we can start avertising the features on SVE2-256bit supporting hardware. This is significant as we keep finding more and more games that are requiring AVX to run

ARM emitter cleanups

Another change that isn’t user facing but is always nice to point out some janitorial tasks that have been done. When we switched over to using our own code emitter there were some design choices and implementations that weren’t quite optimal. This usually culminates as developer pain when using the emitter but was a necessary evil since we wanted to get rid of VIXL’s assembler as fast as possible. @Lioncache spent some time this month cleaning up a lot of the dirty code in the emitter, in some cases making it slightly faster as well. This is always greatly appreciated as it reduces maintenance burden when working in the JIT.

They also implemented an absolute ton of new instruction emitter functions which previously didn’t exist. While we don’t use these yet, we will likely use them at some point which will make our lives easier in the future.

New development machines for our developers

Just recently a new Snapdragon laptop has gotten working OpenGL and Vulkan drivers up and running! We are gifting each of our developers one of these great machines in order to ensure we have testing platforms for all the OpenGL 4, DXVK, and VKD3D applications we want to be running! Kudos to all the developers that worked on bringing this hardware up so quickly!

Video game showcase

Something new we are trying out is starting to show video game showcases to show how well FEX-Emu runs. The first video is one of our developers poking around in the starting area of Crysis Remastered. This game runs quite well on the new Lenovo X13s with a few hiccups here and there.

Will it run Crysis? Yes it will!

See the 2303 Release Notes or the detailed change log in Github.

Written on March 6, 2023