FEX 2308 Tagged!
Whoa jeez, another month already? We’ve had our heads down working hard this last month, trying to make FEX-Emu the greatest x86/x86-64 emulator on Linux. A huge focus this month is optimizations because of course what we want is to go fast. We’re all cats and we’ve got the zoomies.
Every day we’re optimizing
As said before, this month has been an absolute mess of optimizations as we’ve been optimizing the project as thoroughly as possible. We could spend another month talking about the optimizations that we did this last month, so let’s blast through what we did. First let’s show a graph for how much FEX has improved over this last month.
Look at those numbers! Some benchmarks from bytemark have cracked the 200% mark! While a couple benchmarks do have regressions, we’re pretty sure that we know what they are and they will be rectified soon. These are the sorts of optimizations that can be felt in real games though.
So let’s quickly run through some of the optimizations we made this last month.
Switch to using half-barriers for memory accesses
When ARM hits an unaligned atomic memory access, we previously wrapped that load or store in two slow barrier instructions. We can now safely use only one barrier on one half of the instruction! This makes unaligned accesses quite a bit quicker.
Optimize x87 memory accesses
This removes a couple instructions when we access 80-bit floats.
Only clear icache for code
Some large code blocks can generate a decent amount of metadata that doesn’t need an icache clear. Skipping the clear for that metadata can remove a bit of stutter.
Const prop BFI operation
Sometimes when a BFI instruction has constant operands, we can compute the result ahead of time and remove the BFI instruction entirely.
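To make the idea concrete, here is a minimal Python sketch of what folding a bitfield insert could look like. The `bfi` function follows the AArch64 BFI semantics (copy the low `width` bits of the source into the destination at bit `lsb`); `fold_bfi` and its "None means not a constant" convention are hypothetical names for illustration and are not FEX's actual IR pass.

```python
MASK64 = (1 << 64) - 1

def bfi(dest, src, lsb, width):
    """Insert the low `width` bits of `src` into `dest` at bit `lsb`."""
    field = ((1 << width) - 1) << lsb
    return ((dest & ~field) | ((src << lsb) & field)) & MASK64

def fold_bfi(dest, src, lsb, width):
    """Hypothetical constant folding: operands are ints when known at
    compile time and None otherwise. Returns the folded constant, or
    None when the BFI has to stay in the instruction stream."""
    if isinstance(dest, int) and isinstance(src, int):
        return bfi(dest, src, lsb, width)
    return None

# Both operands known: the BFI collapses to a single constant.
folded = fold_bfi(0xFF00, 0xAB, 0, 8)
# One operand unknown: nothing to fold.
not_folded = fold_bfi(None, 0xAB, 0, 8)
```

When both inputs are known, the whole operation turns into loading one constant, which is where the saved instruction comes from.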
Optimize vector TSO loadstores
Vector operations typically need an additional add on the address if the offset can’t fit in the instruction encoding’s immediate field. We missed the optimization for the case in which the immediate offset CAN actually fit. This commonly removes an instruction per vector loadstore.
Use TST instead of CMN
Sometimes these instructions hit a slow path on Cortex-A57 so a minor win there.
Optimize xor reg, reg
x86’s universally agreed upon instruction for generating zero in a register is xor. This instruction isn’t actually optimal on ARM hardware. We now emit a move of constant zero, which gets optimized to a register rename on most ARM hardware.
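As a rough illustration, the check for this idiom can be as small as comparing the two operands. The sketch below is Python with made-up tuple "instructions"; `lower_xor` and the op names are hypothetical and only show the shape of the decision, not FEX's real JIT code.

```python
def lower_xor(dst, src):
    """Return a list of hypothetical ARM-like ops for x86 `xor dst, src`."""
    if dst == src:
        # Zeroing idiom: the result is always zero, so emit a move of
        # constant zero instead of an exclusive-or. The x86 flags are
        # also constant here (ZF=1, PF=1, SF=CF=OF=0); this sketch
        # leaves flag handling out.
        return [("mov", dst, 0)]
    # General case: a real exclusive-or is still needed.
    return [("eor", dst, dst, src)]

zeroing = lower_xor("eax", "eax")
general = lower_xor("eax", "ebx")
```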
More instructions optimized
These mostly just make the implementations use fewer instructions, which makes them faster. There will be way more of this in the coming month.
- rotate flag calculations
- 8-bit, 16-bit rcr
- PF flag calculation optimization
- Optimize RFLAGS packing
- Optimize ADD/ADC OF flag packing
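For a feel of what "fewer instructions" means for something like PF, here is a small Python sketch. x86 sets PF when the low 8 bits of a result contain an even number of set bits. The first function is the obvious definition; the second is one well-known way to fold the byte down with shifts and XORs. Both function names are made up for this example, and neither is claimed to be the sequence FEX emits.

```python
def parity_flag(result):
    """Reference definition: PF is 1 when the low byte has an even
    number of set bits."""
    low = result & 0xFF
    return 1 if bin(low).count("1") % 2 == 0 else 0

def parity_flag_folded(result):
    """Same answer using three shift+xor steps: bit 0 ends up holding
    the XOR of all eight low bits, and PF is its inverse."""
    x = result & 0xFF
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return (x & 1) ^ 1

# The two versions agree for every possible low byte (and ignore bit 8+).
agree = all(parity_flag(v) == parity_flag_folded(v) for v in range(512))
```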
Fixes bug in SSE4.2 pcmpestri
This was causing Java applications to crash. Now that we fixed a different bug last month, we now have Java working to an extent. It still crashes on shutdown which is interesting and not all games are expected to work. But good luck testing random Java games!
Pack NZCV flags
This is the first step towards FEX generating x86 flags in a more optimal way. These flags match the ARM flags fairly closely and can be emulated in a more optimal way if we pack them together. This is likely what causes the regression in bytemark, but since this is an intermediate step it is expected to go away with the next optimization after this. Look forward to future optimizations that make this faster!
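A small Python sketch of the packing idea, assuming the natural mapping of SF to N, ZF to Z, CF to C and OF to V, and using the bit positions ARM uses for NZCV (bits 31 down to 28). How FEX actually stores the packed value is not described here, so treat the helper names and layout choice as assumptions. The sketch also ignores that ARM’s carry after a subtraction is inverted relative to x86’s CF.

```python
# ARM keeps N, Z, C, V in bits 31, 30, 29, 28 of the flags register.
N_BIT, Z_BIT, C_BIT, V_BIT = 31, 30, 29, 28

def pack_nzcv(sf, zf, cf, of):
    """Pack four 0/1 x86 flags into one value laid out like ARM NZCV."""
    return (sf << N_BIT) | (zf << Z_BIT) | (cf << C_BIT) | (of << V_BIT)

def unpack_nzcv(packed):
    """Recover (sf, zf, cf, of) from the packed value."""
    return tuple((packed >> bit) & 1 for bit in (N_BIT, Z_BIT, C_BIT, V_BIT))

packed = pack_nzcv(1, 0, 1, 0)
flags = unpack_nzcv(packed)
```

Keeping the four flags together in one value is what lets them be moved around as a unit instead of as four separate one-bit stores.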
Remove weak symbol declarations in thunks
A bug that cropped up in thunks has been a crash that occurs when trying to use thunks from Ubuntu’s PPA system. This has been a major thorn in FEX’s side for months because once you rebuild the project locally, it would never reproduce. The problem stems from the fact that clang would decide that it can inline a “weak” symbol if its implementation is visible. This would only occur on Canonical’s ARM builders, potentially due to whatever device they use to compile the code on. This would cause our thunks to crash almost immediately if a user tried them from the PPA system. We have now worked around this clang quirk and this will now fix thunks when enabled from the PPA system.
Mingw build work
As part of FEX’s effort towards supporting running as a WINE DLL, we have been slowly adding support for compiling FEXCore as a Windows DLL. This month we have removed a bunch of Linux assumptions and API usages from FEXCore and moved them to the frontend FEXInterpreter application. In doing so, FEXCore can now be compiled using llvm-mingw as a WINE-specific DLL. This is completely unusable for users today but sets the groundwork for what will eventually become a WoW64 integration. We have also added mingw building of FEXCore to our CI to ensure it doesn’t get broken. To be clear, even though this work allows us to compile as a Windows DLL, this doesn’t allow us to run under Windows. FEX still does a bunch of things that are Linux specific inside of the code.
Here is another improvement that doesn’t affect our users but deserves a shoutout for what it does for our developers. @Lioncache has spent a good amount of time this last month adding missing instructions and aliases to our AArch64 code emitter. While our code emitter covers a decent amount of the AArch64 instruction space, it takes time to ensure full coverage. Whenever we’re writing code for our JIT and an instruction is missing, it slows down whatever we are working on. So kudos for improving our coverage because it makes everyone’s lives easier.
Implement missing accept4, recvmmsg, sendmmsg for 32-bit socketcall
In a recent update, the Steam client started using accept4 for some background thing. This would cause it to spam a bunch of logs when failing to accept some connection. This was a simple fix for a few missing system calls, and Steam is no longer complaining loudly.
Fix variadic packing in X11 thunking
WINE had broken X11 thunking for all of FEX’s history without any indication as to why. We never had time to look in to this, but this last month we finally hit a game that crashed, which made this easier to debug. This bug occurred because WINE is one of the few applications that pass more than seven arguments through a few variadic API calls. This triggered a bug in FEX’s variadic repacking code once we started packing the arguments on to the stack. With this fixed, WINE X11 thunking now works in significantly more games. This means that both OpenGL and Vulkan applications can be thunked under WINE.
Fixes dead context store elimination pass
This optimization pass removes redundant stores to FEX’s CPU context state. It usually doesn’t save much and likely won’t affect many things, but it can improve performance for some edge cases in FEX’s JIT.
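For readers who haven’t met this kind of pass before, here is a minimal Python sketch of dead store elimination over a single straight-line block. The tuple-based "IR" (`("store", offset, value)` and `("load", offset)`) and the function name are invented for this example. It also assumes the last store to each offset must be kept because the context is live at the end of the block. FEX’s real pass works on its own IR and has more to consider.

```python
def eliminate_dead_stores(block):
    """Drop stores that are overwritten before anything reads them.

    Walks the block backwards, tracking which context offsets are
    stored again later without an intervening load.
    """
    overwritten = set()
    kept = []
    for op in reversed(block):
        kind, offset = op[0], op[1]
        if kind == "store":
            if offset in overwritten:
                continue  # a later store wins and nobody reads this one
            overwritten.add(offset)
        elif kind == "load":
            # A read makes any earlier store to this offset live again.
            overwritten.discard(offset)
        kept.append(op)
    kept.reverse()
    return kept

block = [
    ("store", 0, "a"),   # dead: offset 0 is stored again before any load
    ("store", 8, "b"),   # live: offset 8 is loaded before its next store
    ("store", 0, "c"),
    ("load", 8),
    ("store", 8, "d"),
]
optimized = eliminate_dead_stores(block)
```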
Fix 16-bit POPA instruction
This instruction was accidentally zero extending the 16-bit value in to the 32-bit register. We now insert the 16 bits as expected, leaving the upper half of the register untouched. This fixes an issue with OpenAL in some cases.
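The difference between the two behaviours is easy to see with plain integers. This Python sketch uses two made-up helper names to contrast the buggy zero extension with the expected insert, where a 16-bit write keeps the upper 16 bits of the 32-bit register.

```python
def write16_zero_extend(old32, value16):
    """Buggy behaviour: the upper 16 bits of the register are lost."""
    return value16 & 0xFFFF

def write16_insert(old32, value16):
    """Expected behaviour: only the low 16 bits are replaced."""
    return (old32 & 0xFFFF0000) | (value16 & 0xFFFF)

before = 0xDEADBEEF
buggy = write16_zero_extend(before, 0x1234)
fixed = write16_insert(before, 0x1234)
```

Any code that keeps live data in the upper half of a register across a 16-bit POPA would see that data wiped by the zero-extending version.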