Why does the compiler generate additional sqrts in the compiled assembly code?

I'm trying to profile the time it takes to compute a sqrt using the following simple C code, where readTSC() is a function to read the CPU's cycle counter.

double sum = 0.0;
int i;
tm = readTSC();
for ( i = 0; i < n; i++ )
   sum += sqrt((double) i);
tm = readTSC() - tm;
printf("%lld clocks in total\n",tm);
printf("%15.6e\n",sum);

However, when I printed out the assembly code using

gcc -S timing.c -o timing.s

on an Intel machine, the result (shown below) was surprising.

Why are there two sqrts in the assembly code, one using the sqrtsd instruction and the other using a function call? Is it related to loop unrolling, with the compiler trying to execute two sqrts in one iteration?

And how should I understand the line

ucomisd %xmm0, %xmm0

Why does it compare %xmm0 to itself?

//----------------start of for loop----------------
call    readTSC
movq    %rax, -32(%rbp)
movl    $0, -4(%rbp)
jmp .L4
.L6:
cvtsi2sd    -4(%rbp), %xmm1
// 1. use sqrtsd instruction
sqrtsd  %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp  .L8
je  .L5
.L8:
movapd  %xmm1, %xmm0
// 2. use C function call
call    sqrt
.L5:
movsd   -16(%rbp), %xmm1
addsd   %xmm1, %xmm0
movsd   %xmm0, -16(%rbp)
addl    $1, -4(%rbp)
.L4:
movl    -4(%rbp), %eax
cmpl    -36(%rbp), %eax
jl  .L6
//----------------end of for loop----------------
call    readTSC
Landaulet answered 24/4, 2015 at 17:51 Comment(1)
That's got to be un-optimized code. Optimized code lays out the branches properly (with no taken branches in the non-NaN case), and leaves out the je, since it is always taken once jp has ruled out the unordered case. - Ania

The call to the library sqrt function is there for error handling. See the glibc documentation, section 20.5.4 "Error Reporting by Mathematical Functions": math functions set errno for compatibility with systems that don't have IEEE754 exception flags. Related: the math_error(7) man page.

As an optimization, it first tries to perform the square root with the inlined sqrtsd instruction, then compares the result against itself using the ucomisd instruction, which sets the flags as follows:

CASE (RESULT) OF
   UNORDERED:    ZF,PF,CF  111;
   GREATER_THAN: ZF,PF,CF  000;
   LESS_THAN:    ZF,PF,CF  001;
   EQUAL:        ZF,PF,CF  100;
ESAC;

In particular, comparing a QNaN to itself returns UNORDERED, and a QNaN is what sqrtsd produces for a negative input. The jp branch catches this case and sends execution to the library call, which sets errno. The je check is redundant: once jp has ruled out the unordered case, a value always compares equal to itself.


Also note that gcc has a -fno-math-errno option which will sacrifice this error handling for speed. This option is part of -ffast-math, but can be used on its own without enabling any result-changing optimizations.

sqrtsd on its own correctly produces NaN for negative and NaN inputs, and sets the IEEE754 Invalid flag for negative inputs. The check and branch are only there to preserve the errno-setting semantics, which most code doesn't rely on.

-fno-math-errno is the default on Darwin (OS X), where the math library never sets errno, so functions can be inlined without this check.
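C99 exposes which reporting mechanism a platform uses through the `math_errhandling` macro in <math.h>, so you can check what your own build claims. A small sketch, with hypothetical helper names; what the macro reports under particular compiler flags is up to the C library's headers, so treat the result as a hint.

```c
#include <math.h>

/* Hypothetical helper: does this build's <math.h> promise errno-based reporting? */
static int math_uses_errno(void)
{
    return (math_errhandling & MATH_ERRNO) != 0;
}

/* Same question for IEEE 754 exception flags. */
static int math_uses_exceptions(void)
{
    return (math_errhandling & MATH_ERREXCEPT) != 0;
}
```

On a platform whose math library never sets errno, I would expect `math_uses_errno()` to return 0.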

Kiersten answered 24/4, 2015 at 17:58 Comment(5)
Note that -ffast-math does more than just sacrifice error handling for speed. In particular, it also breaks IEEE 754 compliance, so use it with care and only if you know what you're doing. See also #7421165 - Lunnete
@Lunnete Yes, in general. However, in this case, that's all it does. - Kiersten
Yes, that is right. I just feel that every mention of the fast-math flag should carry a warning label, which is why I added that comment. - Lunnete
Is there a non-dangerous (no -ffast-math) way to get just a sqrtsd without the error-handling nonsense (so you just get NaN)? - Puddling
@harold: -fno-math-errno eliminates the test and is safer. - G
