FEX 2408 Tagged

In the beginning, there were integers.

And robots wanted precise math, and so the x87 floating point unit was created.

And robots wanted faster math, and so SSE was created to replace x87, and it was good.

Speeding up x87

Although x87 is slow and deprecated, it hasn’t disappeared. 64-bit games will use SSE or even AVX for floating point math, but older 32-bit binaries – compiled decades ago – are filled with x87.

FEX aims to support your entire game catalogue. Last release, we added AVX to support the newest games. This release, we’ve circled back to the oldest. Old games ought to run well on new hardware, but if they use x87, performance can nosedive. Why? Two x87 quirks: 80-bit precision and the stack.

Floating point numbers are typically 32-bit or 64-bit. 32-bit is faster, while 64-bit enhances precision for numerical computing. Our target Arm hardware supports both 32-bit and 64-bit, but x87 adds an unfortunate third mode: 80-bit. Ostensibly, the extra bits of precision in intermediate calculations minimize the accumulated error of the final result.

Is that necessary? Careful code can mitigate rounding error without the massive 80-bit hammer, thanks to techniques like the Kahan summation algorithm. New code doesn’t miss the 80-bit hardware.
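For the curious, here is a minimal sketch of Kahan summation in Python, whose floats are 64-bit. The function names are ours, chosen for illustration. The idea is to carry the rounding error of each addition in a separate compensation term and feed it back into the next addition.

```python
def naive_sum(values):
    """Plain left-to-right summation; rounding error accumulates."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan (compensated) summation using only 64-bit floats."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # re-apply the error from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total


if __name__ == "__main__":
    values = [0.1] * 1_000_000
    print("naive:", naive_sum(values))
    print("kahan:", kahan_sum(values))
```

Summing a million copies of 0.1 this way should land much closer to 100000 than the naive loop does, with no 80-bit intermediates involved.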

Sadly, the FEX team can’t afford a time machine. They’re not cheap anymore. We’ll see what happens with Moore’s Law eighteen months ago. So we can’t teach game developers in 2005 how to make do with 64-bit floats. All we can do is slowly emulate 80-bit floats in software, or substitute 64-bit and hope the game won’t notice.

The second problem with x87 is more obscure. In a typical instruction set, each instruction specifies which registers it accesses. By contrast, x87 arranges registers in a stack. Instead of a destination register, x87 instructions push to the stack. Instead of source registers, sources are indexed relative to the top of the stack. Unlike 80-bit floats, stack machines are alive and well in virtual machines. Like 80-bit floats, they complicate emulation.
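To make the stack addressing concrete, here is a toy model in Python. It is only a sketch: the class and method names are hypothetical, values are plain 64-bit floats, and it ignores tag bits, exceptions, and everything else a real x87 tracks. What it does show is the key property: there are eight physical slots, and every operand is named relative to a moving top pointer.

```python
class X87Stack:
    """Toy model of the x87 register stack (illustrative only)."""

    def __init__(self):
        self.regs = [0.0] * 8  # eight physical register slots
        self.top = 0           # index of the slot currently acting as ST(0)

    def push(self, value):
        """Like FLD: move the top down one slot, then write the value."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        """Read ST(0), then move the top up one slot."""
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):
        """Read ST(i), addressed relative to the current top."""
        return self.regs[(self.top + i) % 8]

    def faddp(self):
        """Like FADDP: ST(1) += ST(0), then pop."""
        self.regs[(self.top + 1) % 8] += self.regs[self.top]
        self.pop()


if __name__ == "__main__":
    s = X87Stack()
    s.push(1.0)
    s.push(2.0)
    print(s.st(0), s.st(1))  # the same physical slots, named by position
    s.faddp()
    print(s.st(0))
```

Notice that `st(1)` refers to a different physical slot after every push or pop. An emulator that does not know `top` ahead of time has to compute that index at run time.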

Arm instructions specify their registers directly, but we don’t know which registers an x87 instruction will use without knowing the stack top. Previously, FEX worked around this mismatch by keeping the emulated x87 stack in memory instead of registers. Arm can indirectly index memory, so this works. However, it’s slow. Because Arm is a RISC architecture, this approach requires multiple load/store instructions for every x87 arithmetic operation.

We can do better.

Instead of single instructions, we can translate entire x87 code blocks. That gives us the full context of each instruction. In “good” conditions, that lets us statically determine the stack layout, so we can translate stack access to real Arm floating point registers. The stack loads and stores disappear from our generated code.
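Here is a rough sketch of the idea in Python. It is not FEX's implementation: the three-operation instruction set, the `translate_block` function, and the Arm-flavoured pseudo-assembly strings it emits are all assumptions made up for illustration. The translator tracks the stack at translation time, so each ST(i) resolves to a fixed register name. If a block reaches below what it pushed itself, the layout is unknown in this toy and it gives up, which stands in for falling back to the slow path.

```python
def translate_block(ops):
    """Resolve stack-relative operands to fixed registers, or return None.

    ops is a list of tuples in a toy x87-like instruction set:
      ("fld", addr)   push the value at addr
      ("fadd", i)     ST(0) += ST(i)
      ("fstp", addr)  store ST(0) to addr and pop
    """
    virtual_stack = []                        # virtual_stack[0] holds ST(0)
    free_regs = [f"d{n}" for n in range(8)]   # hypothetical register pool
    out = []
    for op in ops:
        kind = op[0]
        if kind == "fld":
            if not free_regs:
                return None                   # toy model: out of registers
            reg = free_regs.pop(0)
            virtual_stack.insert(0, reg)
            out.append(f"ldr {reg}, [{op[1]}]")
        elif kind == "fadd":
            i = op[1]
            if i >= len(virtual_stack):
                return None                   # depends on stack state at entry
            st0 = virtual_stack[0]
            out.append(f"fadd {st0}, {st0}, {virtual_stack[i]}")
        elif kind == "fstp":
            if not virtual_stack:
                return None                   # depends on stack state at entry
            reg = virtual_stack.pop(0)
            free_regs.append(reg)
            out.append(f"str {reg}, [{op[1]}]")
        else:
            return None                       # unknown op: give up
    return out


if __name__ == "__main__":
    block = [("fld", "a"), ("fld", "b"), ("fadd", 1),
             ("fstp", "out"), ("fstp", "scratch")]
    for line in translate_block(block):
        print(line)
```

The only memory traffic left in the output is what the block itself asked for. No emulated stack slot is loaded or stored along the way.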

There’s another trick we can play. Sometimes games will copy 80-bit floats without performing any calculations. Translating these copies naïvely is slow due to 80-bit emulation overhead. However, we can analyze multiple instructions together to detect 80-bit copies and translate to efficient Arm code.
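As a sketch of what such a multi-instruction pattern match might look like, here is a tiny peephole pass in Python. The op names (`fld80`, `fstp80`, `copy10`) and the `fuse_copies` function are hypothetical, and this is not how FEX structures its passes. It relies on one simple observation: an 80-bit load followed immediately by an 80-bit store-and-pop leaves the stack unchanged and does no arithmetic, so the pair can be treated as a plain 10-byte memory copy.

```python
def fuse_copies(ops):
    """Replace adjacent 80-bit load/store-and-pop pairs with a raw copy.

    ops is a list of tuples in a toy instruction set:
      ("fld80", src)   push the 80-bit value at src
      ("fstp80", dst)  store ST(0) as 80-bit to dst and pop
    Any other tuple is passed through untouched.
    """
    out = []
    i = 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if cur[0] == "fld80" and nxt is not None and nxt[0] == "fstp80":
            # Load then immediate store-and-pop: a 10-byte memory copy.
            out.append(("copy10", cur[1], nxt[1]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


if __name__ == "__main__":
    ops = [("fld80", "src"), ("fstp80", "dst"), ("fld80", "x"), ("fadd", 1)]
    print(fuse_copies(ops))
```

Only the first pair is fused here. The second load feeds real arithmetic, so it still needs the full 80-bit treatment.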

These optimizations combine for a surprisingly large speed-up. To illustrate: a hot block in Psychonauts swizzles a 4x4 matrix. That’s light on arithmetic but heavy on x87 overhead. These optimizations reduce the translated code from 2340 to 165 instructions. That’s a 93% reduction!

Big thanks to Paulo for taming x87, available in this month’s FEX release.

[redacted]

“I’m working on a branch that’s 10% faster than upstream. Should I mention that in the blog?”

“No, we don’t want to ruin the surprise for next month’s release.”

Written on August 12, 2024