C++: practical computational complexity of <cmath> sqrt()
What is the difference in CPU cycles (or, in essence, in 'speed') between

 x /= y;

and

 #include <cmath>
 x = sqrt(y);

EDIT: I know the operations aren't equivalent, I'm just arbitrarily proposing x /= y as a benchmark for x = sqrt(y)

Shaver answered 30/7, 2011 at 16:9 Comment(14)
What....? How are these related?Neglect
It highly depends on compiler, configuration and target CPU.Hesta
Huh? Those are entirely different operations. What's the difference between printf and new? In any case, just compare the assembly!Baro
I imagine the performance of sqrt is dependent at least partly on the value of xUrial
Even with y, these snippets are still completely different.Urial
While comparing two different operations may sound strange, it is definitely possible (even if platform dependent and quite difficult to do right). Knowing the approximate relative speed of basic floating point operations is important when doing low-level optimizations. Sometimes you can solve the same problem, e.g. (an artificial example), either by multiplying 4 times and dividing 3 times, or by multiplying 2 times and performing square root 2 times.Rosariarosario
@Mat a point of reference. If I have experience with the speed of one thing (because it is ubiquitous), then I can assess the speed of another thing when I know its speed relative to the first.Shaver
@Suma: You cannot compare the two unless you're given a concrete implementation of the division and the sqrt. Even if both are translated to x86 machine code, their performance may differ across processors (e.g. old vs new).Hesta
Guys, while not completely clear, I believe this to be a real question. @Matt: on less powerful systems that don't have dedicated hardware, sqrt is generally 10x slower than div. On any hardware from this decade, they are very close, and quite often get pipelined together into similar floating point performance. You can search for CPU timings on your particular processor to get a better feel.Sample
Here friweb.hu/instlatx64 you can find measured timings of all x86 instructions (ns and ticks). E.g. for the Core 2 Duo E6700, the latency (L) of the x87 sqrt operation is 29 ticks for 32-bit float, 58 ticks for 64-bit double, and 69 ticks for 80-bit long double; the SSE/SSE2 times for 32/64-bit packed floating point are the same (29 and 58 ticks). For FP divide: 32-bit = 18 ticks; 64-bit = 32 ticks; 80-bit = 38 ticks; 32/64-bit are the same for x87 and SSE/SSE2. Your operation also loads and stores a value, which must be counted additionally. This should be The answer, but some closed this good Q.Clinkerbuilt
@ybun I'm sorry if the question is ambiguous, but I'm just trying to get a general sense of how much sqrt() may slow down my code.Shaver
That last comment doesn't really make sense. If you need a square root, you'll have to pay the cost of calculating it. sqrt doesn't "slow down your code". Whether sqrt will be "fast" or not depends on your compiler & your platform.Shaitan
@Shaitan But in some situations calculating a square root can be avoided.Shaver
This is a ridiculous question - there's no data or basis for it whatsoever. It's far more likely that the rest of Matt's code will be sinking him, not square root. I'd vote to close.Neglect

The answer to your question depends on your target platform. Assuming you are using a common x86 CPU, I can give you this link http://instlatx64.atw.hu/ It is a collection of measured instruction latencies (how long the CPU takes to produce a result once it has its arguments) and pipelining behavior for many x86 and x86_64 processors. If your target is not x86, you can try to measure the cost yourself or consult your CPU's documentation.

First, get a disassembly of your operations (from the compiler, e.g. gcc: gcc file.c -O3 -S -o file.asm, or by disassembling the compiled binary, e.g. with the help of a debugger). Remember that your operation also loads and stores a value, which must be counted additionally.
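For example, applying that gcc recipe to a one-line wrapper makes it easy to see what the compiler actually emits (the file and function names here are just placeholders):

```cpp
// sqrt_probe.cpp -- compile with: g++ -O3 -S sqrt_probe.cpp -o sqrt_probe.asm
// On x86_64, -O3 typically yields a sqrtsd instruction plus a branch into
// the libm sqrt for negative inputs (errno handling); adding -fno-math-errno
// usually leaves just the bare instruction.
#include <cmath>

double probe(double y) {
    return std::sqrt(y);
}
```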

Here are two examples from friweb.hu:

For the Core 2 Duo E6700, the latency (L) of SQRT (x87, SSE, and SSE2 versions alike):

  • 29 ticks for 32-bit float; 58 ticks for 64-bit double; 69 ticks for 80-bit long double;

and of floating-point DIVIDE:

  • 18 ticks for 32-bit; 32 ticks for 64-bit; 38 ticks for 80-bit

For newer processors the cost is lower and is almost the same for DIV and SQRT, e.g. for an Intel Sandy Bridge CPU:

Floating-point SQRT is

  • 14 ticks for 32 bit; 21 ticks for 64 bit; 24 ticks for 80 bit

Floating-point DIVIDE is

  • 14 ticks for 32 bit; 22 ticks for 64 bit; 24 ticks for 80 bit

SQRT is even a tick faster for 64-bit.

So: for older CPUs, sqrt itself is 30-50% slower than fdiv; for newer CPUs the cost is the same, and the cost of both operations has dropped compared with older CPUs. Longer floating-point formats need more time: e.g. 64-bit takes about 2x as long as 32-bit, but 80-bit is cheap compared with 64-bit.

Also, newer CPUs have vector operations (SSE, SSE2, AVX) that run at the same speed as scalar (x87) ones. Vectors hold 2-4 values of the same type. If you can restructure your loop to apply the same operation to several FP values at once, you will get more performance from the CPU.
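A sketch of the kind of loop compilers can vectorize as described (the function name is made up; compile with -O3, and -fno-math-errno helps gcc use the packed instruction):

```cpp
#include <cmath>
#include <cstddef>

// A plain elementwise loop: with -O3 (plus -fno-math-errno or -ffast-math)
// gcc and clang typically emit packed sqrtps/vsqrtps here, computing
// 4-8 single-precision square roots per instruction instead of one.
void sqrt_array(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sqrt(in[i]);
}
```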

Clinkerbuilt answered 30/7, 2011 at 16:46 Comment(2)
I'm sure it's implied, but I'm assuming that <cmath> sqrt takes advantage of these CPU optimizations?Shaver
C++ cmath uses the same sqrt() as the C version in math.h. But internally sqrt() may do a bit more than just the FSQRT asm instruction, e.g. error handling. Also, sometimes gcc will not inline the call to sqrt(), so function-call overhead will be there too. You need to check the disassembly of YOUR function and grep it for machine instructions with "sqrt" in their names. Also try the option -ffast-math.Clinkerbuilt

If the square root function isn't implemented in dedicated hardware, most library implementations calculate it using Newton's method, which converges quadratically.

Newton's method is an iterative method: you make an initial guess, calculate a trial result, and use that for the next guess. You repeat until you think you have a result that's "close enough." It so happens that you can prove how many iterations you need with square root. Every time through the cycle the number of correct digits roughly doubles, so most implementations will converge to the precision limit of a double in well under ten iterations.

If you work through the update step, you'll see that each iteration of Newton's method costs two subtractions, one multiplication, and one division.
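As an illustration (not the actual libm code), the iteration for f(x) = x² - y can be written as:

```cpp
#include <cmath>

// Newton's method on f(x) = x*x - y: the update
//   x <- x - (x*x - y) / (2*x)
// costs two subtractions, one multiplication (plus a cheap doubling),
// and one division, and roughly doubles the correct digits each pass.
double newton_sqrt(double y, double guess) {
    double x = guess;
    for (int i = 0; i < 64; ++i) {                   // safety cap on iterations
        double next = x - (x * x - y) / (2.0 * x);
        if (next == x) break;                        // converged to machine precision
        x = next;
    }
    return x;
}
```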

Neglect answered 30/7, 2011 at 16:12 Comment(9)
Could you explain "converges quadratically"?Baro
@Neglect So does <cmath> implement SQRT using Newtons method, or does it take advantage of the CPU optimizations that others have mentioned?Shaver
This question is numerical methods. It belongs here. @Matt, I don't know about your particular implementation. Your C++ compiler might insert instructions for a machine optimized version.Neglect
@duffy oh wow, so that's ~8 * (2 subs., 1 mult., and 1 div.). Unless the CPU is optimizing considerably beyond that, that's more computation than I was hoping. Thanks.Shaver
You're being foolish. This is a micro-optimization. Do you have any data to suggest that the majority of your application's time is being spent calculating square roots? If not, forget about it until you do. I doubt that you have any data whatsoever to motivate this question.Neglect
@duffy Well, the square roots are being calculated recursively. The application is in neural networks, so the program just runs the same loop repeatedly, feeding new input to the network at each iteration. Sure this is micro-optimization, but if I micro-optimize every aspect of the algorithm, which is practical given the relative shortness of the code, I think it could lead to worthwhile gains.Shaver
The key there is "I think". Measure it - profile your code and be sure. You might be surprised by the outcome.Neglect
@duffy What's the best way to do that? If I run the program for long enough, can I just measure how many iterations it completes in x amount of time? Or will I need some particular kind of software to actually measure the compute time?Shaver
@KerrekSB Quadratic convergence means, roughly, that each iteration the number of digits of accuracy doubles. E.g., iteration 1 has error 0.1, iteration 2 has error 0.01, iteration 3 has error 0.0001, iteration 4 has error 0.00000001.Glaudia

As a general rule of thumb: both floating point division and square root are considered slow operations (compared to fast ones like addition or multiplication). Square root can be expected to be approximately the same speed as or somewhat slower than a division (i.e. roughly 1x-2x lower performance). E.g. on the Pentium Pro:

Division and square root have a latency of 18 to 36 and 29 to 69 cycles, respectively

To get more accurate answer, you need to dig into architecture manual for your platform or perform a benchmark.

Note: many modern platforms also offer an inverse square root, which is approximately as fast as sqrt but often more useful (e.g. with invsqrt you can compute both a sqrt and a div with one multiplication each).
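A sketch of that invsqrt trick, with the hardware instruction emulated by scalar math (a real rsqrtss gives only ~12 bits of precision and needs a Newton refinement step before use):

```cpp
#include <cmath>

// Given r = 1/sqrt(y), one multiply recovers sqrt(y) = y * r,
// and since r*r = 1/y, the division x / y becomes x * (r * r).
struct SqrtAndDiv { double root, quot; };

SqrtAndDiv via_invsqrt(double x, double y) {
    double r = 1.0 / std::sqrt(y);   // stand-in for a hardware invsqrt
    return { y * r, x * (r * r) };
}
```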

Rosariarosario answered 30/7, 2011 at 16:35 Comment(3)
For Sandy Bridge from Intel both operations take exactly the same time. So, now, sqrt is not 2x slower than div.Clinkerbuilt
OK. Adjusted. It would be possible to include exact timings for many platforms, but I think the question wants just a "gut feeling"; in the rare situations where you really need exact data, it is more important to know where or how you can get it.Rosariarosario
Two exact examples give some feeling to me.Clinkerbuilt
