Performance comparison of FPU with software emulation

A general answer will obviously be very vague, because performance depends on so many factors.

However, based on my understanding, in processors that do not implement floating point (FP) operations in hardware, a software implementation will typically be 10 to 100 times slower (or even worse, if the implementation is bad) than integer operations, which are always implemented in hardware on CPUs.

The exact performance will depend on a number of factors, such as the features of the integer hardware: some CPUs lack an FPU but have features in their integer arithmetic that help implement a fast software emulation of FP calculations.
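To give a feel for why a single emulated FP operation costs dozens of integer instructions, here is a minimal sketch of a single-precision (IEEE 754 binary32) multiplication done with integer operations only. It is an illustration, not the code of any particular emulation library: the names `soft_mul`, `float_to_bits` and `bits_to_float` are made up for this example, and it assumes normal, finite inputs and a normal result (zero, subnormals, infinities, NaN, overflow and underflow are not handled).

```python
import struct

def float_to_bits(x):
    """Round x to binary32 and return its 32-bit pattern as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def soft_mul(a_bits, b_bits):
    """Multiply two binary32 bit patterns using integer operations only.

    Assumption: inputs and result are normal, finite numbers.
    """
    # Unpack sign, biased exponent and mantissa (with the implicit leading 1).
    sign = (a_bits ^ b_bits) & 0x80000000
    exp = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - 127
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000

    # 24-bit x 24-bit integer multiply; the product lies in [2^46, 2^48).
    prod = man_a * man_b

    # Normalize so that the result mantissa has its leading 1 at bit 23.
    if prod & (1 << 47):
        shift = 24
        exp += 1
    else:
        shift = 23
    man = prod >> shift

    # Round to nearest, ties to even, based on the discarded bits.
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (man & 1)):
        man += 1
        if man == 1 << 24:  # rounding carried out of the mantissa
            man >>= 1
            exp += 1

    # Repack, dropping the implicit leading 1 again.
    return sign | (exp << 23) | (man & 0x7FFFFF)

product = bits_to_float(soft_mul(float_to_bits(1.5), float_to_bits(2.5)))
print(product)
```

Even this stripped-down version needs unpacking, a wide multiply, normalization, rounding and repacking; a complete implementation also has to deal with the special cases left out here, which is where much of the remaining cost comes from.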

The paper mentioned by njuffa (Cristina Iordache and Ping Tak Peter Tang, "An Overview of Floating-Point Support and Math Library on the Intel XScale Architecture") supports this. For the Intel XScale processor, they list the following latencies (excerpt):

integer addition or subtraction:  1 cycle
integer multiplication:           2-6 cycles
fp addition (emulated):           34 cycles
fp multiplication (emulated):     35 cycles

So this would result in a factor of roughly 10-30 between integer and emulated FP arithmetic (about 34 for addition and about 6-18 for multiplication). The paper also mentions that the GNU implementation (the one the GNU compiler uses by default) is about 10 times slower, which gives a total factor of about 100-300.
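The arithmetic behind these factors can be spelled out as a small calculation. This is only a sketch based on the four latencies quoted in the excerpt; the helper name `slowdown` and the constant `GNU_FACTOR` (the "about 10 times slower" figure, taken as exactly 10) are assumptions made for this illustration.

```python
# Latencies in cycles as (min, max), taken from the excerpt above.
int_add = (1, 1)
int_mul = (2, 6)
fp_add_emulated = (34, 34)
fp_mul_emulated = (35, 35)

# Assumption: "about 10 times slower" is treated as exactly 10.
GNU_FACTOR = 10

def slowdown(fp, integer):
    """Return (lowest, highest) ratio of emulated FP latency to integer latency."""
    return fp[0] / integer[1], fp[1] / integer[0]

add_ratio = slowdown(fp_add_emulated, int_add)
mul_ratio = slowdown(fp_mul_emulated, int_mul)

for name, (low, high) in [("addition", add_ratio), ("multiplication", mul_ratio)]:
    print(f"{name}: {low:.1f}x to {high:.1f}x slower than integer")
    print(f"  with the slower library: {low * GNU_FACTOR:.0f}x to {high * GNU_FACTOR:.0f}x")
```

These are latency ratios for individual operations only; the slowdown of a whole program depends on its mix of operations.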

Finally, note that the above is for the case where the FP emulation is compiled into the program by the compiler. Some operating systems (e.g. Linux and Windows CE) also have an FP emulation in the OS kernel. The advantage is that even code compiled without FP emulation (i.e. using FPU instructions) can run on a processor without an FPU: the kernel will transparently emulate unsupported FPU instructions in software. However, this emulation is even slower (by about another factor of 10) than a software emulation compiled into the program, because of the additional overhead. Obviously, this case is only relevant on processor architectures where some processors have an FPU and some do not (such as x86 and ARM).

Note: This answer compares the performance of (emulated) FP operations with integer operations on the same processor. Your question might also be read to be about the performance of (emulated) FP operations compared to hardware FP operations (not sure what you meant). However, the result would be about the same, because if FP is implemented in hardware, it is typically (almost) as fast as integer operations.
