FEX 2409 Tagged

FEX-2409 is now released… with a big performance boost.

I’m tired, carry me.

Little differences between x86 and Arm can cause big performance penalties for emulators. None more so than flags.

What are those?

Flags are bits representing the processor state. For example, if an operation results in zero, the “zero flag” is set. Both x86 and Arm have flags, so for emulating x86 on Arm, we map x86 flags to Arm flags to reduce emulation overhead. That is possible because x86 and Arm have similar flags. By contrast, architectures like RISC-V lack flags, slowing down x86-on-RISC-V emulators.

Many arithmetic operations set flags. Programs can then conditionally jump (“branch”) according to the flags. On x86, the flags are thus the building blocks of if statements and loops. To check if two variables are equal, x86 code subtracts them and checks the zero flag. To check if one variable is less than another, x86 code subtracts and checks the negative flag. This pattern – subtracting, setting flags, and discarding the actual result – is so common that it has a special instruction: CMP (“CoMPare”).

If the story ended here, emulation would be easy. Unfortunately, we need to talk about the carry flag.

After an addition, the carry flag indicates if the result overflowed. Programs can then check the carry flag to detect overflows. The flag can also be input to another addition to implement 128-bit additions.

Subtractions are similar. In hardware, subtractions are additions with an operand negated. Because they are additions in hardware, subtractions set the carry flag. Precisely how is the carry flag defined for subtraction? There are two competing conventions.

The first sets the flag when there is a borrow, by mathematical analogy with addition. x86 uses this “borrow flag” convention, as it seems more natural.

The second option sets the flag when there is not a borrow. Isn’t that backwards? It turns out that adding a (two’s complement) negated operand overflows exactly when the subtraction does not borrow. This “true carry” convention matches actual hardware behaviour, while the “borrow” x86 convention requires extra gates to invert carry. Arm uses the “true carry” convention to save a few gates.

Which convention should FEX use?

We could store the x86 carry flag in the Arm carry flag. Unfortunately, that requires an extra instruction after each subtraction to invert carry to get the borrow flag.

The counter-intuitive alternative is storing the opposite of the x86 flag. That requires an extra invert after every addition, but it eliminates the invert after subtraction.

Either we pay after additions or after subtractions. Which should we pick?

While addition is common, using the flags from an addition is not. Flags are typically used with comparisons, which are subtractions. Therefore, the inverted convention usually wins. This month, Alyssa adjusted FEX to invert carries, speeding up typical workloads by a few percent.

After tackling the carry flag, Alyssa optimized FEX’s translations of address modes, push/pop, AVX load/stores, and more. Overall, benchmarks are upwards of 10% faster since the last release.

A Qt change

What about more user-visible changes? If you use the FEXConfig tool to configure the emulator, you’re in for a treat. While it works, this ImGui-based tool isn’t exactly known for its convenience. In between his work optimizing the [redacted] out of FEX’s [redacted], Tony rewrote FEXConfig as a simple Qt application, improving aesthetics, usability, and accessibility all in one go. Here’s a preview:

Besides look and feel, we’ve polished first-time setup for logging, library forwarding, and RootFS images. We’ve also made tweaking various emulation settings a bit nicer. Users of our Ubuntu PPA can simply update to unlock these improvements without any further action.

But with so much optimization, who needs speedhacks anymore?

Written on September 5, 2024