FEX 2311 Tagged!
Another month gone by and another FEX release out the door! This month was a little quieter than usual, as most of our team spent a week in Spain to take part in XDC 2023! We still had the rest of the month to get some work done though, so let’s get to the changes!
Small bug fixes
This month we fixed a couple of bugs which could have caused spurious crashes! In fact, while testing some upcoming performance optimizations, we fixed a few unrelated bugs that were crashing Steam periodically! It’s always nice to see a bunch of small fixes that improve the software, even if none of them is a single big fix.
- Fix register corruption when jumping out of JIT
- Fixes double munmap which would cause spurious pointer unmaps
- Fixes crashes when a program would shut down a thread
- Implements support for ARM’s new RPRES feature and fixes a bug in that implementation
  - RPRES gives us the ability to do reciprocals in one instruction instead of using ARM’s divide instruction
  - The bug would have caused invalid data to be returned
  - Luckily, no CPU supports this yet
- Fixes an issue with *at syscalls not working with absolute paths
  - This broke Proton’s pressure-vessel in weird and unique ways
- Fixes a bug with the named enum argument parser
  - This is used to override CPU features with the FEX_HOSTFEATURES option, so it is typically not hit
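To illustrate the rule behind the *at fix above: POSIX says that when the path passed to a call like openat is absolute, the directory file descriptor is ignored. The sketch below models that rule on plain strings. ResolveAtPath and its dir_path parameter (standing in for the directory a dirfd refers to) are hypothetical names for illustration, not FEX’s actual code.

```cpp
#include <string>

// Hypothetical helper for illustration only, not FEX's implementation.
// `dir_path` stands in for the directory that a dirfd refers to.
std::string ResolveAtPath(const std::string& dir_path, const std::string& pathname) {
  // An absolute pathname makes the *at call ignore dirfd entirely.
  if (!pathname.empty() && pathname.front() == '/') {
    return pathname;
  }
  // A relative pathname is resolved against the directory behind dirfd.
  if (dir_path.empty() || dir_path.back() == '/') {
    return dir_path + pathname;
  }
  return dir_path + "/" + pathname;
}
```

An emulator that always prepends the dirfd directory would turn an absolute path into a path that doesn’t exist, which is the kind of failure that is easy to miss until a program relies on it.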
32-bit thunking infrastructure
32-bit thunking is not yet in place and still isn’t fully integrated this month, but some of the code working towards that goal has been landing. To do 32-bit thunking the right way, we are spending a bunch of time ensuring that we have a proper clang-based data layout analysis system in place. This analysis will let us automatically translate data structures from 32-bit into 64-bit, and will also alert us when something needs to be translated manually. This needs to be in place because otherwise we could end up unknowingly corrupting data, which would be a nightmare to track down. As of this month we have the ability to annotate our thunk definitions and start having clang do the work for us. While not complete, the work so far has been enough to get thunking working for 32-bit Super Meat Boy! It’s getting there!
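To show why layout analysis matters, here is a minimal sketch of the same logical struct as a 32-bit guest and a 64-bit host would lay it out. The struct names, fields, and the ToHost function are all made up for illustration and are not taken from FEX’s thunk definitions. The sketch assumes a 64-bit host where pointers are 8 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Layout as a 32-bit guest sees it: pointers and `unsigned long` are 4 bytes.
struct GuestImage32 {
  uint32_t pixels;  // guest pointer, stored as a 32-bit address
  int32_t width;
  int32_t height;
  uint32_t stride;  // guest `unsigned long`
};

// Layout a 64-bit host library expects for the same logical struct.
struct HostImage64 {
  void* pixels;
  int32_t width;
  int32_t height;
  uint64_t stride;  // host `unsigned long`
};

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit host");
// Same fields, different sizes and offsets: a plain memcpy would corrupt data.
static_assert(sizeof(GuestImage32) == 16, "guest layout");
static_assert(sizeof(HostImage64) == 24, "host layout");
static_assert(offsetof(GuestImage32, stride) == 12, "guest stride offset");
static_assert(offsetof(HostImage64, stride) == 16, "host stride offset");

// Hypothetical field-by-field translation of the kind that could be generated
// automatically once the layouts of both sides are known.
HostImage64 ToHost(const GuestImage32& guest) {
  HostImage64 host{};
  host.pixels = reinterpret_cast<void*>(static_cast<uintptr_t>(guest.pixels));
  host.width = guest.width;
  host.height = guest.height;
  host.stride = guest.stride;
  return host;
}
```

The static_asserts are the interesting part: they spell out the mismatch that an automated analysis has to detect for every structure crossing the 32-bit to 64-bit boundary.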
NZCV usage preparation
A big performance improvement that FEX is working on is to use the ARM CPU’s own NZCV flags to directly emulate the x86 flags when possible. This is a long and arduous task, but the performance improvements will be huge once the code lands! A bunch of prep work has landed this month to start down this path, but we’re going to need to let this sit in the oven for a bit longer. Check back next month to see if we get there!
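As a rough illustration of why this mapping takes care, the sketch below models the flags of a 32-bit subtract. N, Z and V line up with x86’s SF, ZF and OF, but ARM sets its carry flag when a subtract does not borrow, while x86 sets CF when it does. The types and function names here are hypothetical and only model the architectural behaviour; they are not FEX’s code.

```cpp
#include <cstdint>

struct NZCV { bool n, z, c, v; };
struct X86Flags { bool sf, zf, cf, of; };

// Models the flags an AArch64 SUBS produces for 32-bit operands.
NZCV ArmSubs(uint32_t a, uint32_t b) {
  const uint32_t result = a - b;
  NZCV f{};
  f.n = (result >> 31) != 0;
  f.z = result == 0;
  f.c = a >= b;  // ARM: carry set means NO borrow occurred
  // Signed overflow: operands differ in sign and the result's sign differs from a.
  f.v = (((a ^ b) & (a ^ result)) >> 31) != 0;
  return f;
}

// Maps those flags onto x86's view of the same subtract.
X86Flags X86FromSub(NZCV f) {
  // x86 CF means a borrow DID occur, so the carry has to be inverted.
  return X86Flags{f.n, f.z, !f.c, f.v};
}
```

For an add the carry needs no inversion, so any scheme that keeps x86 flags in NZCV has to track which convention the current carry value follows.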
Minor optimizations
With XDC being in the middle of the month, most of the bigger work was delayed, so we have a bunch of smaller things this month!
- Minor optimization to bfi/bfxil
  - Removes one or two instructions for some instruction translations
- Optimize atomic fetch operations into non-fetching atomics if the result isn’t used
  - Removes a couple of instructions if the resulting fetch data isn’t used
- Implements support for ARM’s new AFP extension
  - Currently disabled until we can audit the codebase to ensure we aren’t corrupting anything
  - Lets us remove an insert after every scalar operation to match SSE behaviour
- Optimize palignr that behaves like a move
  - Compilers shouldn’t use this, but now we optimize it to a move
- Optimize pblendw
  - A fairly uncommon instruction, but now its implementation is basically as fast as it can be
- Optimize blendps
  - We had already optimized blendpd last month, so this time was to optimize the 32-bit version
  - Fairly commonly used, so this should improve perf in some games
- Optimize dpps and dppd
  - These instructions do a dot product and a broadcast of their result, but we couldn’t find a game using them heavily
  - So while this is now optimal, it is unlikely to affect any real game
- Optimize some 3DNow! instructions
  - 3DNow! is a really old instruction extension that is basically only used in some really old games
  - All of these instruction implementations are basically as fast as we can make them now, which is good!
- Optimize direction flag pointer offset calculation
  - This converts a three-instruction calculation down to one and stops using a ternary selection
  - This calculation is needed for x86’s repeat instructions, which typically show up in memcpy and memset
  - Used a lot, but only a minimal improvement
- A few other random bits and bobs!
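To make the direction flag item above more concrete, here is a minimal sketch of the idea. It assumes the direction flag is kept as +1 or -1 rather than as a 0/1 bit; that representation is an assumption for illustration and not necessarily what FEX does internally. Both function names are hypothetical.

```cpp
#include <cstdint>

// Select-based form: DF is a 0/1 bit, and the pointer step is chosen with a
// ternary. On ARM this typically needs a negate, a compare and a select.
int64_t OffsetWithSelect(uint64_t df_bit, int64_t element_size) {
  return df_bit ? -element_size : element_size;
}

// Assumed alternative: DF is stored as +1 (forward) or -1 (backward), so the
// step is one multiply, or one shift when the element size is a power of two.
int64_t OffsetWithSignedDF(int64_t df_sign, int64_t element_size) {
  return df_sign * element_size;
}
```

Both forms produce the same step for every element size a repeat instruction can use (1, 2, 4 or 8 bytes). The second one just moves the cost to the rare instructions that change the flag.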
AVX optimizations!
While no hardware supports our AVX implementation today, we have optimized a smattering of instruction translations so they are ready once hardware supports what we need.
- 256-bit VExtr, VFCADD, VURAvg, VFDiv, VSMax, VSMin, VUMax, VUMin
- Removes a bunch of truncating moves
  - If we know an AVX instruction is operating at 128-bit width, we can remove a redundant move which speeds things up!
Video game showcase
See the 2311 Release Notes or the detailed change log on GitHub.