FEX 2209 Tagged!

A lot of miscellaneous work this month isn’t directly user facing, but we still have a few topics that are worth covering.

Simplify StealMemory functions

A fairly significant change this month is reducing the time it takes FEX to set up its memory upon load. FEX needs to do an initial setup of the memory when an application loads because the memory layouts of x86-64, x86, and AArch64 differ significantly.

Depending on the architecture of the application, FEX needs to allocate a large amount of memory to emulate the x86/x86-64 memory behaviour.

On 32-bit x86

  • We need to allocate all memory above the 32-bit memory space
    • This is because we emulate 32-bit applications as 64-bit AArch64 applications

On 64-bit x86-64

  • We need to allocate all memory in the upper half of the 48-bit virtual address space
    • This is because AArch64 supports the full 48-bit space for the user
    • x86-64 userspace only receives 47-bit
    • Applications rely on not receiving 48-bit pointers!
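As a rough illustration of the two cases above (not FEX's actual code), the address ranges that must be kept away from the guest can be worked out with a little arithmetic. The names `AddressRange`, `RangeToReserve`, and `FitsInGuest` are hypothetical, and the sketch assumes a host with a 48-bit user address space.

```cpp
#include <cstdint>

// Half-open range [begin, end) of host virtual addresses.
struct AddressRange {
  uint64_t begin;
  uint64_t end;
};

// Assumption: the host gives userspace a full 48-bit address space.
constexpr unsigned HOST_VA_BITS = 48;

// Returns the range that has to be claimed up front so the guest never sees it.
// guest_va_bits is 32 for x86 guests and 47 for x86-64 guests.
constexpr AddressRange RangeToReserve(unsigned guest_va_bits) {
  return AddressRange{uint64_t{1} << guest_va_bits, uint64_t{1} << HOST_VA_BITS};
}

// True if a pointer value is one the guest could legally have received.
constexpr bool FitsInGuest(uint64_t ptr, unsigned guest_va_bits) {
  return ptr < (uint64_t{1} << guest_va_bits);
}
```

For a 32-bit guest this gives everything from 4 GiB upward, and for a 64-bit guest only the region that needs bit 47 set.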

From this graph showing the amount of CPU time spent in each routine, we can see a significant reduction in time to execute. For the 32-bit and 64-bit specific operations this results in a ~70x and ~181x reduction in execution time, respectively!

How well does this improve execution time in practice though?

This graph shows the total time it takes to run applications from start to finish. The smallest test applications have shaved off around 75% to 85% of their execution time. The biggest improvement comes from Proton setting up its execution environment. Proton’s underlying execution environment is called pressure-vessel, which executes hundreds of background applications while setting up. This is one of the worst cases for FEX since each independent application execution needs to JIT new code and handle all of its state setup. In this case the execution time drops from around 21 seconds to around 17 seconds! This can really be felt when executing back-to-back Proton instances while testing games!

While this is a significant step in the right direction, FEX still has a ways to go to hit the native execution time of pressure-vessel which can take as little as one second.

More AVX work

A bunch more work has gone into supporting AVX emulation. This is still preliminary backend work for now.

  • HostRunner
    • Handle upper YMM lanes in sigsegv handler
  • InterpreterOps
    • Extend SSAData size to accommodate 256-bit operations
  • VectorOps
    • Extend VAnd/VBic/VOr/VXor
    • Extend VMov
    • Extend VectorImm
    • Extend VectorZero
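
To give a feel for what extending a vector op to 256 bits involves, here is a minimal interpreter-style sketch. It is not taken from FEX: `VectorReg` and this `VAnd` signature are hypothetical, and the only assumption is that an op is told whether it works on 16 bytes (128-bit) or 32 bytes (256-bit).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical register storage: large enough for one 256-bit value.
struct VectorReg {
  uint8_t bytes[32];
};

// Hypothetical bitwise-AND handler.
// op_size is 16 for a 128-bit operation and 32 for a 256-bit operation.
// In this sketch, bytes above op_size are left as zero in the result.
inline VectorReg VAnd(const VectorReg& a, const VectorReg& b, size_t op_size) {
  VectorReg result{};
  for (size_t i = 0; i < op_size && i < sizeof(result.bytes); ++i) {
    result.bytes[i] = a.bytes[i] & b.bytes[i];
  }
  return result;
}
```

The same shape would apply to the other bitwise ops in the list, with only the operator in the loop body changing.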

Thunks

X11

Some fairly minor changes here that improve usability of thunks with Proton. We added more Xlibint functions to the thunks which fixes X11 thunking with DXVK. X11 is required for both Vulkan and OpenGL thunking so having this working is necessary when running those games.

Another necessary change for supporting thunks with Wine/Proton is more aggressively supporting X11 functions which require variadic arguments. There are quite a few of these functions sprinkled around that require this. While we supported these functions with open-coded support up to 7 arguments, we need to support at least up to 14 arbitrary arguments in some instances. We now have some assembly code in place which can support an arbitrary number of arguments by packing these in memory the expected way. While this only works for 64-bit integers, it’s all that we need for X11.
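
The packing idea can be sketched in plain C++ rather than assembly. This is only an illustration under the same restriction mentioned above (every argument is a 64-bit integer); `PackedArgs`, `PackVariadic`, and `SumPacked` are hypothetical names and not part of FEX or Xlib.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed form: each argument occupies one 64-bit slot.
using PackedArgs = std::vector<uint64_t>;

// Collects `count` variadic arguments into contiguous memory.
// Assumption: the caller passes every argument as a uint64_t.
inline PackedArgs PackVariadic(size_t count, ...) {
  PackedArgs packed;
  packed.reserve(count);
  va_list args;
  va_start(args, count);
  for (size_t i = 0; i < count; ++i) {
    packed.push_back(va_arg(args, uint64_t));
  }
  va_end(args);
  return packed;
}

// Stand-in for the receiving side: it walks the slots without
// needing a fixed, open-coded argument count.
inline uint64_t SumPacked(const uint64_t* slots, size_t count) {
  uint64_t sum = 0;
  for (size_t i = 0; i < count; ++i) {
    sum += slots[i];
  }
  return sum;
}
```

Because the receiver only sees a pointer and a count, nothing changes when a call site grows from 7 to 14 arguments. Callers must pass each value explicitly as `uint64_t`, for example `PackVariadic(2, uint64_t{1}, uint64_t{2})`.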

With both of these features implemented, OpenGL and Vulkan thunking both work with Proton.

VDSO

While this is implemented as a thunk on the FEX side, it behaves slightly differently than normal thunks. It will always be enabled as long as FEX can load the VDSO-host.so library installed on the system. Due to the nature of VDSO, all applications have a VDSO region provided by the kernel at all times. FEX wants to provide fast emulation of this “library” since applications abuse it heavily for performance. This was noticed when running Proton games: they call clock_gettime very heavily, which was causing significant CPU overhead. Applications were calling this VDSO function hundreds or thousands of times a second. The VDSO thunk now significantly lowers the amount of time spent in the kernel for timing functions.
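
For context, the call pattern in question looks roughly like the sketch below. It is a generic example rather than code from any particular game; `MonotonicNanos` and `PollClock` are hypothetical helpers. On Linux, libc typically services `clock_gettime` through the VDSO when one is available, which is why the speed of that path matters so much.

```cpp
#include <cstdint>
#include <time.h>

// Reads the monotonic clock and returns it in nanoseconds.
inline uint64_t MonotonicNanos() {
  timespec ts{};
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return static_cast<uint64_t>(ts.tv_sec) * 1000000000ULL +
         static_cast<uint64_t>(ts.tv_nsec);
}

// Polls the clock `iterations` times, the way a frame-timing loop might.
// Returns false if the clock was ever observed going backwards.
inline bool PollClock(int iterations) {
  uint64_t previous = MonotonicNanos();
  for (int i = 0; i < iterations; ++i) {
    const uint64_t now = MonotonicNanos();
    if (now < previous) {
      return false;
    }
    previous = now;
  }
  return true;
}
```

When every one of these calls turns into a real syscall, a loop like this spends most of its time crossing into the kernel.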

getdents syscall emulation

AArch64 doesn’t support the getdents syscall, but in most cases applications don’t use it. This is because there is a much more modern syscall called getdents64 that everything uses now. Applications compiled against older toolchains are still likely to use the classic syscall. Since AArch64 doesn’t have the classic version, we now emulate it entirely using getdents64, which fixes running applications from CentOS 7.
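
The core of such an emulation is rewriting each record from the getdents64 layout into the legacy layout. The sketch below is an assumption-laden illustration, not FEX's implementation. It works on in-memory buffers only, assumes a 64-bit guest (so the legacy `d_ino` and `d_off` fields are 8 bytes wide), and ignores the guest buffer size limit that a real handler has to respect. `AppendDirent64` and `ConvertToLegacy` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Field offsets of a linux_dirent64 record.
constexpr size_t D64_INO = 0, D64_OFF = 8, D64_RECLEN = 16, D64_TYPE = 18, D64_NAME = 19;
// Field offsets of a legacy linux_dirent record as seen by a 64-bit guest.
constexpr size_t OLD_INO = 0, OLD_OFF = 8, OLD_RECLEN = 16, OLD_NAME = 18;

constexpr size_t AlignUp8(size_t v) { return (v + 7) & ~size_t{7}; }

// Builds one linux_dirent64 record; only used to create demo input.
inline void AppendDirent64(std::vector<uint8_t>& buf, uint64_t ino, int64_t off,
                           uint8_t type, const std::string& name) {
  const uint16_t reclen = static_cast<uint16_t>(AlignUp8(D64_NAME + name.size() + 1));
  const size_t base = buf.size();
  buf.resize(base + reclen, 0);
  std::memcpy(&buf[base + D64_INO], &ino, 8);
  std::memcpy(&buf[base + D64_OFF], &off, 8);
  std::memcpy(&buf[base + D64_RECLEN], &reclen, 2);
  buf[base + D64_TYPE] = type;
  std::memcpy(&buf[base + D64_NAME], name.data(), name.size());
}

// Rewrites a buffer of linux_dirent64 records into legacy linux_dirent records.
inline std::vector<uint8_t> ConvertToLegacy(const std::vector<uint8_t>& in) {
  std::vector<uint8_t> out;
  size_t pos = 0;
  while (pos + D64_NAME <= in.size()) {
    uint16_t in_reclen = 0;
    std::memcpy(&in_reclen, &in[pos + D64_RECLEN], 2);
    if (in_reclen < D64_NAME + 1 || pos + in_reclen > in.size()) {
      break;  // malformed or truncated record
    }
    // Find the length of the NUL-terminated name inside this record.
    const size_t max_name = in_reclen - D64_NAME;
    size_t name_len = 0;
    while (name_len < max_name && in[pos + D64_NAME + name_len] != 0) {
      ++name_len;
    }
    // Header + name + NUL + trailing d_type byte, rounded up to 8 bytes.
    const uint16_t out_reclen = static_cast<uint16_t>(AlignUp8(OLD_NAME + name_len + 2));
    const size_t base = out.size();
    out.resize(base + out_reclen, 0);
    std::memcpy(&out[base + OLD_INO], &in[pos + D64_INO], 8);
    std::memcpy(&out[base + OLD_OFF], &in[pos + D64_OFF], 8);
    std::memcpy(&out[base + OLD_RECLEN], &out_reclen, 2);
    std::memcpy(&out[base + OLD_NAME], &in[pos + D64_NAME], name_len);
    // The legacy layout stores d_type in the last byte of the record.
    out[base + out_reclen - 1] = in[pos + D64_TYPE];
    pos += in_reclen;
  }
  return out;
}
```

A 32-bit guest would need narrower `d_ino` and `d_off` fields, and a full handler would also have to stop when the guest's buffer is full and resume from the right directory offset on the next call.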

Misc

  • Fix compiling without jemalloc
    • Thunks are unsupported without jemalloc but we need to keep it compiling
  • Consolidate generated files to one file per platform
    • Nice code cleanup for developers
  • Minor cleanups for signature-based function pointer thunking
  • Support direct thunk config in configuration files
    • This improves the user experience with enabling thunks for application configurations
    • No need for two files to describe one thing now

See the 2209 Release Notes or the detailed change log on GitHub.

Written on September 5, 2022