C++: practical computational complexity of <cmath> sqrt()
What is the difference in CPU cycles (or, in essence, in 'speed') between

 x /= y;

and

 #include <cmath>
 x = sqrt(y);

EDIT: I know the operations aren't equivalent, I'm just arbitrarily proposing x /= y as a benchmark for x = sqrt(y)

Shaver answered 30/7, 2011 at 16:9 Comment(14)
What....? How are these related?Neglect
It highly depends on compiler, configuration and target CPU.Hesta
Huh? Those are entirely different operations. What's the difference between printf and new? In any case, just compare the assembly!Baro
I imagine the performance of sqrt is dependent at least partly on the value of xUrial
Even with y, these snippets are still completely different.Urial
While comparing two different operations may sound strange, it is definitely possible (even if platform dependent and quite difficult to do right). Knowing the approximate relative speed of basic floating point operations is important when doing low-level optimizations. Sometimes you can solve the same problem, e.g. (an artificial example), either by multiplying 4 times and dividing 3 times, or by multiplying 2 times and performing square root 2 times.Rosariarosario
@Mat a point of reference. If I have experience with the speed of one thing (because it is ubiquitous), then I can assess the speed of another thing when I know its speed relative to the first.Shaver
@Suma: You cannot compare the two unless you're given a concrete implementation of the division and the sqrt. Even if both are translated to x86 machine code, their performance may differ across processors (e.g. old vs new).Hesta
Guys, while not completely clear, I believe this to be a real question. @Matt: on less powerful systems that don't have dedicated hardware, sqrt is generally 10x slower than div. On any hardware from this decade, they are very close, and quite often get pipelined together into similar floating point performance. You can search for CPU timings on your particular processor to get a better feel.Sample
Here friweb.hu/instlatx64 you can find measured timings of all x86 instructions (ns and ticks). E.g. for the Core 2 Duo E6700, the latency (L) of the x87 sqrt operation is 29 ticks for 32-bit float, 58 ticks for 64-bit double, and 69 ticks for 80-bit long double; the SSE/SSE2 times for 32/64-bit packed floating point are the same (29 and 58 ticks). For FP divide: 32-bit = 18 ticks; 64-bit = 32 ticks; 80-bit = 38 ticks; 32/64-bit are the same for x87 and SSE/SSE2. Your operation also loads and stores a value, which must be counted additionally. This should be The answer, but some closed this good Q.Clinkerbuilt
@ybun I'm sorry if the question is ambiguous, but I'm just trying to get a general sense of how much sqrt() may slow down my code.Shaver
That last comment doesn't really make sense. If you need a square root, you'll have to pay the cost of calculating it. sqrt doesn't "slow down your code". Whether sqrt will be "fast" or not depends on your compiler & your platform.Shaitan
@Shaitan But in some situations calculating a square root can be avoided.Shaver
This is a ridiculous question - there's no data or basis for it whatsoever. It's far more likely that the rest of Matt's code will be sinking him, not square root. I'd vote to close.Neglect

The answer to your question depends on your target platform. Assuming you are using a common x86 CPU, I can give you this link http://instlatx64.atw.hu/ It is a collection of measured instruction latencies (how long the CPU takes to produce a result once it has its arguments) and pipelining behavior for many x86 and x86_64 processors. If your target is not x86, you can try to measure the cost yourself or consult your CPU's documentation.

First, get a disassembly of your operations (from the compiler, e.g. gcc: gcc file.c -O3 -S -o file.asm, or by disassembling the compiled binary, e.g. with the help of a debugger). Remember that your operation also loads and stores a value, which must be counted additionally.
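For example, applying that gcc recipe to a one-line wrapper makes it easy to see what the compiler actually emits (the file and function names here are just placeholders):

```cpp
// sqrt_probe.cpp -- compile with: g++ -O3 -S sqrt_probe.cpp -o sqrt_probe.asm
// On x86_64, -O3 typically yields a sqrtsd instruction plus a branch into
// the libm sqrt for negative inputs (errno handling); adding -fno-math-errno
// usually leaves just the bare instruction.
#include <cmath>

double probe(double y) {
    return std::sqrt(y);
}
```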

Here are two examples from friweb.hu:

For the Core 2 Duo E6700, the latency (L) of SQRT (x87, SSE, and SSE2 versions alike):

  • 29 ticks for 32-bit float; 58 ticks for 64-bit double; 69 ticks for 80-bit long double;

and of floating-point DIVIDE:

  • 18 ticks for 32-bit; 32 ticks for 64-bit; 38 ticks for 80-bit

For newer processors the cost is lower and is almost the same for DIV and SQRT, e.g. for an Intel Sandy Bridge CPU:

Floating-point SQRT is

  • 14 ticks for 32 bit; 21 ticks for 64 bit; 24 ticks for 80 bit

Floating-point DIVIDE is

  • 14 ticks for 32 bit; 22 ticks for 64 bit; 24 ticks for 80 bit

SQRT is even a tick faster for 64-bit.

So: for older CPUs, sqrt itself is 30-50% slower than fdiv; for newer CPUs the cost is the same, and the cost of both operations has dropped compared with older CPUs. Longer floating-point formats need more time: e.g. 64-bit takes about 2x as long as 32-bit, but 80-bit is cheap compared with 64-bit.

Also, newer CPUs have vector operations (SSE, SSE2, AVX) that run at the same speed as scalar (x87) ones. Vectors hold 2-4 values of the same type. If you can restructure your loop to apply the same operation to several FP values at once, you will get more performance from the CPU.
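A sketch of the kind of loop compilers can vectorize as described (the function name is made up; compile with -O3, and -fno-math-errno helps gcc use the packed instruction):

```cpp
#include <cmath>
#include <cstddef>

// A plain elementwise loop: with -O3 (plus -fno-math-errno or -ffast-math)
// gcc and clang typically emit packed sqrtps/vsqrtps here, computing
// 4-8 single-precision square roots per instruction instead of one.
void sqrt_array(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sqrt(in[i]);
}
```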

Clinkerbuilt answered 30/7, 2011 at 16:46 Comment(2)
I'm sure it's implied, but I'm assuming that <cmath> sqrt takes advantage of these CPU optimizations?Shaver
C++ cmath uses the same sqrt() as the C version in math.h. But internally sqrt() may do a bit more than just the FSQRT asm instruction, e.g. error handling. Also, sometimes gcc will not inline the call to sqrt(), so function-call overhead will be there too. You need to check the disassembly of YOUR function and grep it for machine instructions with "sqrt" in their names. Also try the option -ffast-math.Clinkerbuilt

If the square root function isn't implemented in dedicated hardware, most library implementations calculate it using Newton's method, which converges quadratically.

Newton's method is an iterative method: you make an initial guess, calculate a trial result, and use that for the next guess. You repeat until you think you have a result that's "close enough." It so happens that you can prove how many iterations you need with square root. Every time through the cycle the number of correct digits roughly doubles, so most implementations will converge to the precision limit of a double in well under ten iterations.

If you work through the update step, you'll see that each iteration of Newton's method costs two subtractions, one multiplication, and one division.
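As an illustration (not the actual libm code), the iteration for f(x) = x² - y can be written as:

```cpp
#include <cmath>

// Newton's method on f(x) = x*x - y: the update
//   x <- x - (x*x - y) / (2*x)
// costs two subtractions, one multiplication (plus a cheap doubling),
// and one division, and roughly doubles the correct digits each pass.
double newton_sqrt(double y, double guess) {
    double x = guess;
    for (int i = 0; i < 64; ++i) {                   // safety cap on iterations
        double next = x - (x * x - y) / (2.0 * x);
        if (next == x) break;                        // converged to machine precision
        x = next;
    }
    return x;
}
```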

Neglect answered 30/7, 2011 at 16:12 Comment(9)
Could you explain "converges quadratically"?Baro
@Neglect So does <cmath> implement SQRT using Newtons method, or does it take advantage of the CPU optimizations that others have mentioned?Shaver
This question is numerical methods. It belongs here. @Matt, I don't know about your particular implementation. Your C++ compiler might insert instructions for a machine optimized version.Neglect
@duffy oh wow, so that's ~8 * (2 subs., 1 mult., and 1 div.). Unless the CPU is optimizing considerably beyond that, that's more computation than I was hoping. Thanks.Shaver
You're being foolish. This is a micro-optimization. Do you have any data to suggest that the majority of your application's time is being spent calculating square roots? If not, forget about it until you do. I doubt that you have any data whatsoever to motivate this question.Neglect
@duffy Well, the square roots are being calculated recursively. The application is in neural networks, so the program just runs the same loop repeatedly, feeding new input to the network at each iteration. Sure this is micro-optimization, but if I micro-optimize every aspect of the algorithm, which is practical given the relative shortness of the code, I think it could lead to worthwhile gains.Shaver
The key there is "I think". Measure it - profile your code and be sure. You might be surprised by the outcome.Neglect
@duffy What's the best way to do that? If I run the program for long enough, can I just measure how many iterations it completes in x amount of time? Or will I need some particular kind of software to actually measure the compute time?Shaver
@KerrekSB Quadratic convergence means, roughly, that each iteration the number of digits of accuracy doubles. E.g., iteration 1 has error 0.1, iteration 2 has error 0.01, iteration 3 has error 0.0001, iteration 4 has error 0.00000001.Glaudia

As a general rule of thumb: both floating point division and square root are considered slow operations (compared to fast ones like addition or multiplication). Square root can be expected to be approximately the same speed as or somewhat slower than a division (i.e. roughly 1x-2x lower performance). E.g. on the Pentium Pro:

Division and square root have a latency of 18 to 36 and 29 to 69 cycles, respectively

To get more accurate answer, you need to dig into architecture manual for your platform or perform a benchmark.

Note: many modern platforms also offer an inverse square root, which is approximately as fast as sqrt but often more useful (e.g. with invsqrt you can compute both a sqrt and a div with one multiplication each).
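A sketch of that invsqrt trick, with the hardware instruction emulated by scalar math (a real rsqrtss gives only ~12 bits of precision and needs a Newton refinement step before use):

```cpp
#include <cmath>

// Given r = 1/sqrt(y), one multiply recovers sqrt(y) = y * r,
// and since r*r = 1/y, the division x / y becomes x * (r * r).
struct SqrtAndDiv { double root, quot; };

SqrtAndDiv via_invsqrt(double x, double y) {
    double r = 1.0 / std::sqrt(y);   // stand-in for a hardware invsqrt
    return { y * r, x * (r * r) };
}
```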

Rosariarosario answered 30/7, 2011 at 16:35 Comment(3)
For Sandy Bridge from Intel both operations take exactly the same time. So, now, sqrt is not 2x slower than div.Clinkerbuilt
OK. Adjusted. It would be possible to include exact timings for many platforms, but I think the question wants just a "gut feeling"; in the rare situations where you really need exact data, it is more important to know where or how you can get it.Rosariarosario
Two exact examples give some feeling to me.Clinkerbuilt
