FEX 2503 Tagged
Here we are again, another month and some more cool changes with FEX. Let’s dive in and see what has changed!
Fix 3DNow! reciprocal precision on FEAT_RPRES supporting hardware
This change is kind of fun due to the nature of how reciprocals work on modern x86 hardware. With SSE and AVX, reciprocal and reciprocal square root are two instructions that are designed by specification to only provide 12-bits worth of precision for the initial estimate. If you want more precision then that, you use a Newton-Raphson refinement instruction, or eat the cost of some floating point division instructions.
While this is nice for x86; In ARM64 land our reciprocal instructions are limited to 8-bit precision. In order to emulate x86 precision we must ALWAYS do a refinement step, or eat the cost of the float division. For our initial implementation we had just ate the cost of the float division and called it a day. As we were crying that we needed more precision, ARM decided to bless us with the FEAT_RPRES extension. This extension is perfect for our needs, it increases the precision of the reciprocal instructions to 12-bits just like x86! So we wired this up in the end of 2023 with the hope that some hardware would support it. Today we finally have the Qualcomm Snapdragon Elite SoC and Apple’s M4 that supports these extension! This means those platforms get faster reciprocals for free.
In our quickness to implementing this instruction, we had forgotten one key player in this story. 3DNow! turns out is a special little extension that decided its reciprocals need to have 14-bit or 15-bit precision depending on instruction. This doesn’t work with RPRES but due to the implementations being shared in our JIT, we had inadvertently reduced precision on those platforms! Thankfully Paulo found this problem and hunted it down. Now for 3DNow! we are using an estimate plus Newton-Raphson refinement step in order to get the precision we need!
Hopefully all those POD Gold players out there appreciate the precision improvements!
Fix FEXServer background startup
This has been a fairly long running issue that was hard to reproduce. When new users were attempting to use FEX, they would try running their initial application and get a cryptic Failure to setup client message. Turns out how we were detecting that the FEXServer was ready would break in certain conditions when squashfs or erofs images were in use! We fixed this bug and now this problem is finally resolved, hopefully no more confused new users.
Enable multiblock option by default
This feature has been around for quite a while, but due to some bugs in our JIT it wasn’t ever quite safe to enable. Plus we had some major JIT performance concerns before optimizations in our JIT landed late last year. With a whole bunch of changes in place, we have now decided to turn this option on by default!
This option basically makes our JIT compile more code at once, allowing the JIT to stretch its legs a bit and gain some free performance. There may still be some small bugs in it, but without actually dogfooding it we’ll never find them. Let us know how it goes and may your games run faster than ever before.
Optimize SHA instructions by using ARM SHA instructions
This month there were a few optimizations around the x86 SHA extension. Emulating this extension is a bit peculiar because the instructions between the two architectures don’t quite match up. This requires a bit of noodling to figure out exactly how to get the ARM versions of the instructions to behave like the x86 instructions. With these optimizations in place, we now optimize SHA1RNDS4, SHA1MSG2, and SHA256MSG2 using roughly equivalent ARM instructions!
The only SHA instruction remaining that isn’t optimized is SHA256RNDS2, which is quite a bit different than the two ARM instructions that match the functionality. If anyone wants a brain teaser to implement this optimization, this can be a fun target to try and optimize!
Add Mangohud FEX profile sampling stats
When testing games running under FEX sometimes it is difficult to understand if a game’s performance is due to FEX getting in the way or something unrelated. This month we implemented a sampling based stats mechanism that external applications can read and get some insight in to what is happening. Primarily we have implemented this as a Mangohud config option that exposes details about FEX’s JIT. This allows us to directly correlate FEX’s JIT with a game’s FPS which is quite handy when running things. In particular we expose a few data points:
- SIGBUS events - Is the game hammering unaligned memory accesses?
- SMC events - Is the game constantly invalidating code?
- Softfloat events - Is the game hammering x87 or other transcendentals?
- JIT time - Total time spent having FEX’s JIT actually emitting code rather than executing
These stats can be seen in the following image, where we can see the game is executing roughly 33 million softfloat events per second, maximizing a CPU core’s usage and not able to hit 60FPS. This allows to to see that maybe we should try enabling the x87 reduced precision option to get some performance back.
This does require some work to setup today. It requires building Mangohud with the new FEX option enabled, it requires building FEX with the new sampling option enabled, and it also requires toggling a config option to turn it on as well. But we do invite enthusiasts that know what they’re doing to enable these options and see if there’s interesting stats in their games
Add a Tracy backend
This option is completely developer focused as Tracy stats requires a developer focusing for some optimizations inside of FEX’s code. This is a new timeline profiler backend for developers to see where code is spending time. We had already had GPUVis support wired up in FEX for quite a while, but Tracy is quite nice because the user interface is easier to handle. Additionally Tracy’s per-event overhead is usually lower that GPUVis since it uses a ringbuffer in userspace while GPUVis relies entirely on ftrace.
See the 2503 Release Notes or the detailed change log in Github.