Performance comparison of FPU with software emulation
While I know (or so I have been told) that floating-point coprocessors work faster than any software implementation of floating-point arithmetic, I have no feel for how large that difference is, in orders of magnitude.

The answer probably depends on the application and on the hardware it runs on, anywhere from microprocessors to supercomputers. I am particularly interested in computer simulations.

Can you point out articles or papers for this question?

Shofar answered 2/3, 2013 at 11:50 Comment(3)
Typically between 1 and 3 orders of magnitude, depending on the operation. – Commemorate
Performance of floating-point emulation will vary widely, based on the integer capabilities and performance of the target processor. A fast integer multiply is crucial for good performance of division, square root, etc. A recent example is the FLIP library, flip.gforge.inria.fr, which is targeted at a VLIW CPU. Performance data and links to relevant papers are available from the above URL. – Mildred
A slightly older paper that would be of interest: Cristina Iordache and Ping Tak Peter Tang, "An Overview of Floating-Point Support and Math Library on the Intel XScale Architecture", in Proceedings IEEE Symposium on Computer Arithmetic, pages 122-128, 2003. For a sample emulation code that you could time yourself, check out the single-precision reciprocal code I posted in reply to this question: #9011661 – Mildred

A general answer will obviously be very vague, because performance depends on so many factors.

However, based on my understanding, in processors that do not implement floating point (FP) operations in hardware, a software implementation will typically be 10 to 100 times slower (or even worse, if the implementation is bad) than integer operations, which are always implemented in hardware on CPUs.

The exact performance will depend on a number of factors, such as the features of the integer hardware - some CPUs lack an FPU, but have features in their integer arithmetic that help implement a fast software emulation of FP calculations.

The paper mentioned by njuffa, Cristina Iordache and Ping Tak Peter Tang, "An Overview of Floating-Point Support and Math Library on the Intel XScale Architecture", supports this. For the Intel XScale processor it lists the following latencies (excerpt):

integer addition or subtraction:  1 cycle
integer multiplication:           2-6 cycles
fp addition (emulated):           34 cycles
fp multiplication (emulated):     35 cycles

So this would result in a factor of about 10-30 between integer and FP arithmetic. The paper also mentions that the GNU implementation (the one the GNU compiler uses by default) is about 10 times slower still, for an overall factor of 100-300.

Finally, note that the above applies when the FP emulation is compiled into the program by the compiler. Some operating systems (e.g. Linux and Windows CE) also provide FP emulation in the OS kernel. The advantage is that even code compiled without FP emulation (i.e. using FPU instructions) can run on a processor without an FPU - the kernel transparently emulates unsupported FPU instructions in software. However, this emulation is even slower (by roughly another factor of 10) than a software emulation compiled into the program, because of the trap-and-emulate overhead on every FP instruction. Obviously, this case is only relevant on processor architectures where some processors have an FPU and some do not (such as x86 and ARM).

Note: This answer compares the performance of (emulated) FP operations with integer operations on the same processor. Your question might also be read as asking about the performance of emulated FP operations compared to hardware FP operations (not sure which you meant). However, the result would be about the same, because if FP is implemented in hardware, it is typically (almost) as fast as integer operations.

Thromboplastin answered 23/3, 2013 at 10:0 Comment(0)
