How slow (how many cycles) is calculating a square root?

Asked 11/10, 2011 at 9:39 Answered 19/3, 2014 at 0:51

How slow (how many cycles) is calculating a square root? This came up in a molecular dynamics course where efficiency is important and taking unnecessary square roots had a noticeable impact on the running time of the algorithms.

Cracked answered 11/10, 2011 at 9:39 Comment(6)

As compared to what other operations in the same program? – Diathesis 11/10, 2011 at 9:41

@Piskvor, that's exactly the kind of answer I had in mind, see below. – Cracked 11/10, 2011 at 9:42

@Tomalak, can you cite a reference? – Cracked 11/10, 2011 at 22:37

@Doug: I can cite the Holy Bible of Hilarious Irony; does that count? – Icicle 11/10, 2011 at 22:58

It depends on how smart you are. If you do it right it's roughly equivalent to doing a divide of the same precision. – Ethical 19/3, 2014 at 0:53

If you just need to know whether the point (x1, y1) is inside the circle (xc,yc) with radius r, do not do this: if r < ( (xc-x1)^2 + (yc-y1)^2 )^ 0.5 Instead just say: if r^2 < (xc-x1)^2 + (yc-y1)^2. This is 2800 faster!!! – Colonel 24/7, 2015 at 1:14

Based on Agner Fog's instruction tables for Core 2 65nm when comparing the SSE performance with FSQRT, FDIV, FMUL and FADD, it is about equal, but looks faster because it can't do 80bit math. SSE has a super fast approximate reciprocal and approximate reciprocal sqrt though.

On Core2 45nm, FSQRT and FDIV root got faster, while FADD and FMUL haven't changed. Once again SSE performance is about the same.

Intel Core 2 (Merom, 65nm)

Instruction Operands Latency Reciprocal
throughput

FSQRT 6 - 69

FADD(P) r 3 1

FMUL(P) r 5 2

FDIV(R)(P) r 6 - 38 ^d 5 - 37 ^d

ADDSS/D xmm, xmm 3 1

ADDPS/D xmm, xmm 3 1

MULSS xmm, xmm 4 1

MULSD xmm, xmm 5 1

MULPS xmm, xmm 4 1

MULPD xmm, xmm 5 1

DIVSS xmm, xmm 6 - 18 ^d 5 - 17 ^d

DIVSD xmm, xmm 6 - 32 ^d 5 - 31 ^d

DIVPS xmm, xmm 6 - 18 ^d 5 - 17 ^d

DIVPD xmm, xmm 6 - 32 ^d 5 - 31 ^d

SQRTSS/PS xmm, xmm 6 - 29 6 - 29

SQRTSD/PD xmm, xmm 6 - 58 6 - 58

RSQRTSS/PS xmm, xmm 3 2

Intel Core 2 (Wolfdale, 45nm)

Instruction Operands Latency Reciprocal
throughput

FSQRT 6 - 20

FADD(P) r 3 1

FMUL(P) r 5 2

FDIV(R)(P) r 6 - 21 ^d 5 - 20 ^d

ADDSS/D xmm, xmm 3 1

ADDPS/D xmm, xmm 3 1

MULSS xmm, xmm 4 1

MULSD xmm, xmm 5 1

MULPS xmm, xmm 4 1

MULPD xmm, xmm 5 1

DIVSS xmm, xmm 6 - 13 ^d 5 - 12 ^d

DIVSD xmm, xmm 6 - 21 ^d 5 - 20 ^d

DIVPS xmm, xmm 6 - 13 ^d 5 - 12 ^d

DIVPD xmm, xmm 6 - 21 ^d 5 - 20 ^d

SQRTSS/PS xmm, xmm 6 - 13 5 - 12

SQRTSD/PD xmm, xmm 6 - 20 5 - 19

RSQRTSS/PS xmm, xmm 3 2

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

^d Round divisors or low precision give low values.

Instruction	Operands	Latency	Reciprocal throughput
FSQRT		6 - 69
FADD(P)	r	3	1
FMUL(P)	r	5	2
FDIV(R)(P)	r	6 - 38 ^d	5 - 37 ^d
ADDSS/D	xmm, xmm	3	1
ADDPS/D	xmm, xmm	3	1
MULSS	xmm, xmm	4	1
MULSD	xmm, xmm	5	1
MULPS	xmm, xmm	4	1
MULPD	xmm, xmm	5	1
DIVSS	xmm, xmm	6 - 18 ^d	5 - 17 ^d
DIVSD	xmm, xmm	6 - 32 ^d	5 - 31 ^d
DIVPS	xmm, xmm	6 - 18 ^d	5 - 17 ^d
DIVPD	xmm, xmm	6 - 32 ^d	5 - 31 ^d
SQRTSS/PS	xmm, xmm	6 - 29	6 - 29
SQRTSD/PD	xmm, xmm	6 - 58	6 - 58
RSQRTSS/PS	xmm, xmm	3	2

Instruction	Operands	Latency	Reciprocal throughput
FSQRT		6 - 20
FADD(P)	r	3	1
FMUL(P)	r	5	2
FDIV(R)(P)	r	6 - 21 ^d	5 - 20 ^d
ADDSS/D	xmm, xmm	3	1
ADDPS/D	xmm, xmm	3	1
MULSS	xmm, xmm	4	1
MULSD	xmm, xmm	5	1
MULPS	xmm, xmm	4	1
MULPD	xmm, xmm	5	1
DIVSS	xmm, xmm	6 - 13 ^d	5 - 12 ^d
DIVSD	xmm, xmm	6 - 21 ^d	5 - 20 ^d
DIVPS	xmm, xmm	6 - 13 ^d	5 - 12 ^d
DIVPD	xmm, xmm	6 - 21 ^d	5 - 20 ^d
SQRTSS/PS	xmm, xmm	6 - 13	5 - 12
SQRTSD/PD	xmm, xmm	6 - 20	5 - 19
RSQRTSS/PS	xmm, xmm	3	2

Harmattan answered 11/10, 2011 at 12:48 Comment(2)

Right but the whole point was to compare not just raw clock cycles but to see how it actually affects performance. However, this is a good answer. – Cracked 13/10, 2011 at 2:20

@DougTreadwell well it's pretty bad, especially because of the ultra low throughput, it can completely kill the performance of a loop – Harmattan 13/10, 2011 at 15:18

Square root is about 4 times slower than addition using -O2, or about 13 times slower without using -O2. Elsewhere on the net I found estimates of 50-100 cycles which may be true, but it's not a relative measure of cost that is very useful, so I threw together the code below to make a relative measurement. Let me know if you see any problems with the test code.

The code below was run on an Intel Core i3 under Windows 7 operating system and was compiled in DevC++ (which uses GCC). Your mileage may vary.

#include <cstdlib>
#include <iostream>
#include <cmath>

/*
Output using -O2:

1 billion square roots running time: 14738ms

1 billion additions running time   : 3719ms

Press any key to continue . . .

Output without -O2:

10 million square roots running time: 870ms

10 million additions running time   : 66ms

Press any key to continue . . .

Results:

Square root is about 4 times slower than addition using -O2,
            or about 13 times slower without using -O2
*/

int main(int argc, char *argv[]) {

    const int cycles = 100000;
    const int subcycles = 10000;

    double squares[cycles];

    for ( int i = 0; i < cycles; ++i ) {
        squares[i] = rand();
    }

    std::clock_t start = std::clock();

    for ( int i = 0; i < cycles; ++i ) {
        for ( int j = 0; j < subcycles; ++j ) {
            squares[i] = sqrt(squares[i]);
        }
    }

    double time_ms = ( ( std::clock() - start ) / (double) CLOCKS_PER_SEC ) * 1000;

    std::cout << "1 billion square roots running time: " << time_ms << "ms" << std::endl;

    start = std::clock();

    for ( int i = 0; i < cycles; ++i ) {
        for ( int j = 0; j < subcycles; ++j ) {
            squares[i] = squares[i] + squares[i];
        }
    }

    time_ms = ( ( std::clock() - start ) / (double) CLOCKS_PER_SEC ) * 1000;

    std::cout << "1 billion additions running time   : " << time_ms << "ms" << std::endl;

    system("PAUSE");
    return EXIT_SUCCESS;
}

Cracked answered 11/10, 2011 at 9:42 Comment(4)

@ZacharyKraus: if you look at the edit history you'll see that my comment applied to the original version of the question, which lacked important detail as to CPU, compiler, platform, etc. Douglas was kind enough to subsequently update the answer so that it now includes all the relevant details, which makes the answer a lot more useful. I'll happily delete my comment as it is no longer relevant to the current version of the answer. – Emporium 17/3, 2014 at 23:57

Sorry about that I didn't understand what the comment was for I will delete mine as well since I am not sure it adds anything to the post either. – Shinar 19/3, 2014 at 0:48

You're not measuring speed of sqrt(), but speed of sqrt() + loop management + memory access. Loop management alone is an addition, a compare and a jump. So, statement that "sqrt is 4 times slower than addition" is a wrong conclusion. – Norge 1/2, 2017 at 15:17

Your time includes 1 billion int additions and 1 billion stack memory accesses. – Foresight 31/7, 2023 at 16:9

Square root takes several cycles, but it takes orders of magnitude more to access memory if it is not in cache. Therefore, trying to avoid computations by fetching pre-computed results from memory may actually be detrimental to performance.

It's difficult to say in the abstract whether you might gain or not, so if you want to know for sure, try and benchmark both approaches.

Here's a great talk on the matter by Eric Brummer, Compiler Developer on MSVC: http://channel9.msdn.com/Events/Build/2013/4-329

Jillianjillie answered 19/3, 2014 at 0:51 Comment(5)

Ah, but fetching pre-computed results from memory has saved a hardware demo in at least one case. – Ethical 19/3, 2014 at 0:54

Can you give an example of fetching a pre-computed result. I don't understand what you mean. But i will definitely check that slide show out later. It looks exciting. – Shinar 19/3, 2014 at 3:0

I simply mean avoiding having to compute something (like a square root) on the fly by computing it beforehand, putting the result somewhere in memory (in an array, hashtable, whatever), and accessing that result when you need it in your actual computation. The access might actually be much slower than the actual square root. – Jillianjillie 19/3, 2014 at 14:5

@Jillianjillie It really depends on the scenario. First of all, even if you don't store it pre-computed, you need to get the original value from somewhere. There's a memory access involved there, but you could store the sqrt-ed value alongside (or instead of) the original. If memory size is an issue, you can also replace the original value with the sqrt-ed value, since calculating the square is much cheaper than the square root. It all just depends on the scenario. – Ingenuity 15/2, 2015 at 13:57

@ZacharyKraus remember that reading something that is not in cache, and thus must be fetched from memory, can be up to 50 or 100 times slower. A cache read can take for example 2 o 3 cycles, which is basically free, but from memory can be up to 100 or 200 cycles. – Prink 1/4, 2020 at 15:6

Intel Core 2 (Merom, 65nm)

Intel Core 2 (Wolfdale, 45nm)

Recommended topics

Hot tags