libsvm compiled with AVX vs no AVX

I compiled a libsvm benchmarking app which does svm_predict() 100 times on the same image using the same model. The libsvm is compiled statically (MSVC 2017) by directly including svm.cpp and svm.h in my project.

EDIT: adding benchmark details

for (int i = 0; i < counter; i++)
{
    // Time a single svm_predict() call.
    auto t1 = std::chrono::high_resolution_clock::now();
    double label = svm_predict(model, input);
    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    total_time += duration;

    // Printing happens outside the timed region.
    std::cout << "\n\n\n" << sum << " label:" << label << " duration:" << duration << "\n\n\n";
}

This is the loop that I benchmark without any major modifications to the libsvm code.
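For comparison, here is a minimal sketch of a harness that times the whole batch of calls at once with std::chrono::steady_clock and keeps a checksum so the calls cannot be optimized away. The names fake_predict, BenchResult and bench are hypothetical; fake_predict only stands in for svm_predict(model, input) and is not part of libsvm.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical stand-in for svm_predict(model, input); not part of libsvm.
double fake_predict(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v * v;
    return sum;
}

struct BenchResult {
    double mean_us;   // mean time per call, in microseconds
    double checksum;  // sum of all results; keeps the calls observable
};

// Times `iterations` calls in a single interval, so clock overhead and
// console I/O are not interleaved with the measured work.
template <typename F>
BenchResult bench(F&& call, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    double checksum = 0.0;
    const auto t1 = clock::now();
    for (int i = 0; i < iterations; ++i) checksum += call();
    const auto t2 = clock::now();
    const double total_us =
        std::chrono::duration<double, std::micro>(t2 - t1).count();
    return {total_us / iterations, checksum};
}
```

A call such as bench([&] { return fake_predict(x); }, 100) would return the mean time per call. Timing the batch as a whole is a design choice that tends to reduce noise compared with summing 100 separate short measurements.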

After 100 runs, the average time per call is 4.7 ms, with no difference whether or not I enable AVX instructions. To make sure the compiler generates the expected instructions, I used the Intel Software Development Emulator to check the instruction mix.

with AVX:
*isa-ext-AVX                                                    36578280
*isa-ext-SSE                                                           4
*isa-ext-SSE2                                                          4
*isa-set-SSE                                                           4
*isa-set-SSE2                                                          4
*scalar-simd                                                    36568174
*sse-scalar                                                            4
*sse-packed                                                            4
*avx-scalar                                                     36568170
*avx128                                                             8363
*avx256                                                             1765

The other part

without AVX:
*isa-ext-SSE                                                       11781
*isa-ext-SSE2                                                   36574119
*isa-set-SSE                                                       11781
*isa-set-SSE2                                                   36574119
*scalar-simd                                                    36564559
*sse-scalar                                                     36564559
*sse-packed                                                        21341

I would expect to get some performance improvement. I know that the avx128/256/512 instructions are not used that much, but still. I have an i7-8550U CPU; do you think that if I ran the same test on a Skylake i9 X-series I would see a bigger difference?

EDIT I added the instruction mix for each binary

With AVX:

ADD                                                             16868725
AND                                                                   49
BT                                                                     6
CALL_NEAR                                                       14032515
CDQ                                                                    4
CDQE                                                                3601
CMOVLE                                                                 6
CMOVNZ                                                                 2
CMOVO                                                                 12
CMOVZ                                                                  6
CMP                                                             25417120
CMPXCHG_LOCK                                                           1
CPUID                                                                  3
CQO                                                                   12
DEC                                                                   68
DIV                                                                    1
IDIV                                                                  12
IMUL                                                                3621
INC                                                              8496372
JB                                                                   325
JBE                                                                    5
JL                                                                  7101
JLE                                                                38338
JMP                                                              8416984
JNB                                                                    6
JNBE                                                                   3
JNL                                                                  806
JNLE                                                                  61
JNS                                                                    1
JNZ                                                             22568320
JS                                                                     2
JZ                                                               8465164
LEA                                                             16829868
MOV                                                             42209230
MOVSD_XMM                                                              4
MOVSXD                                                              1141
MOVUPS                                                                 4
MOVZX                                                               3684
MUL                                                                   12
NEG                                                                   72
NOP                                                                 4219
NOT                                                                    1
OR                                                                    14
POP                                                                 1869
PUSH                                                                1870
REP_STOSD                                                              6
RET_NEAR                                                            1758
ROL                                                                    5
ROR                                                                   10
SAR                                                                    8
SBB                                                                    5
SETNZ                                                                  4
SETZ                                                                  26
SHL                                                                 1626
SHR                                                                  519
SUB                                                                 6530
TEST                                                             5616533
VADDPD                                                               594
VADDSD                                                           8445597
VCOMISD                                                                3
VCVTSI2SD                                                           3603
VEXTRACTF128                                                           6
VFMADD132SD                                                           12
VFMADD231SD                                                            6
VHADDPD                                                                6
VMOVAPD                                                               12
VMOVAPS                                                             2375
VMOVDQU                                                                1
VMOVSD                                                          11256384
VMOVUPD                                                              582
VMULPD                                                               582
VMULSD                                                           8451540
VPXOR                                                                  1
VSUBSD                                                           8407425
VUCOMISD                                                            3600
VXORPD                                                              2362
VXORPS                                                              3603
VZEROUPPER                                                             4
XCHG                                                                   8
XGETBV                                                                 1
XOR                                                              8414763
*total                                                         213991340

Part2

No AVX:
ADD                                                             16869910
ADDPD                                                               1176
ADDSD                                                            8445609
AND                                                                   49
BT                                                                     6
CALL_NEAR                                                       14032515
CDQ                                                                    4
CDQE                                                                3601
CMOVLE                                                                 6
CMOVNZ                                                                 2
CMOVO                                                                 12
CMOVZ                                                                  6
CMP                                                             25417408
CMPXCHG_LOCK                                                           1
COMISD                                                                 3
CPUID                                                                  3
CQO                                                                   12
CVTDQ2PD                                                            3603
DEC                                                                   68
DIV                                                                    1
IDIV                                                                  12
IMUL                                                                3621
INC                                                              8496369
JB                                                                   325
JBE                                                                    5
JL                                                                  7392
JLE                                                                38338
JMP                                                              8416984
JNB                                                                    6
JNBE                                                                   3
JNL                                                                  803
JNLE                                                                  61
JNS                                                                    1
JNZ                                                             22568317
JS                                                                     2
JZ                                                               8465164
LEA                                                             16829548
MOV                                                             42209235
MOVAPS                                                              7073
MOVD                                                                3603
MOVDQU                                                                 2
MOVSD_XMM                                                       11256376
MOVSXD                                                              1141
MOVUPS                                                              2344
MOVZX                                                               3684
MUL                                                                   12
MULPD                                                               1170
MULSD                                                            8451546
NEG                                                                   72
NOP                                                                 4159
NOT                                                                    1
OR                                                                    14
POP                                                                 1865
PUSH                                                                1866
REP_STOSD                                                              6
RET_NEAR                                                            1758
ROL                                                                    5
ROR                                                                   10
SAR                                                                    8
SBB                                                                    5
SETNZ                                                                  4
SETZ                                                                  26
SHL                                                                 1626
SHR                                                                  516
SUB                                                                 6515
SUBSD                                                            8407425
TEST                                                             5616533
UCOMISD                                                             3600
UNPCKHPD                                                               6
XCHG                                                                   8
XGETBV                                                                 1
XOR                                                              8414745
XORPS                                                               2364
*total                                                         214000270
Disjoined answered 14/2, 2019 at 11:59 Comment(9)
By MVS you mean MSVC? Note that the number of SSE2 instructions in the no-AVX version is the same as the number of AVX instructions in the AVX version. Also scalar-simd is about the same in both cases. I'm not sure whether these classes of instructions are inclusive or exclusive of each other, but I think they can be inclusive of each other. This would mean that most of these SIMD instructions are actually used in scalar mode whether you enable AVX or not.Haland
@HadiBrais what do you mean by scalar mode ? Can you please elaborate ? Yes I mean MSVCDisjoined
Did you have a look at the disassembly of the library and saw a significant amount of vector instructions like (v)addps, (v)mulps, etc? If instead you mostly have addss and mulss instructions, then that code is not vectorized.Colonialism
@Colonialism I edited the question and added the instruction mixDisjoined
What benchmark did you compile? Obviously auto-vectorization is going to depend on what exact loops you're using.Profiteer
@PeterCordes I added the code for the loop I'm benchmarking; it will not tell you much because svm_predict() is actually implemented by libsvm. But if you look at the instruction mix there are some packed SSE/AVX instructions in the AVX case, so I guess that auto-vectorization is working?Disjoined
But you said you're including svm.cpp in your project, so that's being compiled, too. And yes, auto-vectorization of something is clearly working, but maybe just of an initialization loop. It's only a couple thousand instructions vs. 36M scalar instructions either way. So obviously it depends on exactly which SVM library function you use. It appears that this one doesn't make meaningful use of SIMD. (Oh, I was confusing SVM with SVML, Intel's Short Vector Math Library, stuff like _mm_sin_ps(), which isn't open source.)Profiteer
@PeterCordes can you please explain a bit the difference between vectorized and scalar instructions? Can SSE/AVX instructions also be scalar instructions? I'm a bit confused, as you can see.Disjoined
See chtz's answer. x86 uses SSE1/2 or AVX for scalar FP math, using just the low element of XMM vector registers. It's somewhat better than x87 (more registers, and flat register set), but it's still only one result per instruction.Profiteer

Almost all of the arithmetic instructions you list operate on scalars; e.g., (V)SUBSD means SUBtract Scalar Double. The V prefix essentially just means that the AVX (VEX) encoding is used (this also zeroes the upper bits of the destination vector register, which the legacy SSE encodings don't do). Given the instructions you listed, there should be barely any runtime difference.

Modern x86 uses SSE1/2 or AVX for scalar FP math, using just the low element of XMM vector registers. It's somewhat better than x87 (more registers, and flat register set), but it's still only one result per instruction.
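To make the scalar vs. packed distinction concrete, here is a small sketch using SSE2 intrinsics from <emmintrin.h>. The function names add_low_scalar and add_packed are made up for illustration. On x86-64, _mm_add_sd typically compiles to ADDSD (VADDSD when AVX is enabled) and _mm_add_pd to ADDPD (VADDPD), though the exact code the compiler emits may differ.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar: only the low 64-bit element of the XMM register is added,
// so one instruction produces one result.
double add_low_scalar(double a, double b) {
    const __m128d va = _mm_set_sd(a);  // low = a, high = 0
    const __m128d vb = _mm_set_sd(b);  // low = b, high = 0
    return _mm_cvtsd_f64(_mm_add_sd(va, vb));
}

// Packed: both 64-bit elements are added by a single instruction,
// so one instruction produces two results.
void add_packed(const double* a, const double* b, double* out) {
    assert(a && b && out);
    const __m128d va = _mm_loadu_pd(a);  // loads a[0], a[1]
    const __m128d vb = _mm_loadu_pd(b);  // loads b[0], b[1]
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```

Switching the scalar version from the SSE encoding to the AVX encoding does not change how many results each instruction produces, which is why the two builds in the question run at about the same speed.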

There are a few thousand packed SIMD instructions, vs. ~36 million scalar instructions, so only a relatively unimportant part of the code got auto-vectorized and could benefit from 256-bit vectors.
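As a rough illustration of why a hot loop can stay scalar, compare the two sketches below. I am assuming here that the hot loop looks like a dot product over sparse (index, value) lists, which would fit the many CMP/JNZ/INC instructions in the mix; I have not checked this against the actual hot loop. SparseNode, sparse_dot and dense_dot are illustrative names, not libsvm's API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sparse element; a list is terminated by index == -1.
struct SparseNode {
    int index;
    double value;
};

// Which pointer advances depends on the data, so the loop has no fixed
// stride and compilers generally leave it as scalar code.
double sparse_dot(const SparseNode* px, const SparseNode* py) {
    assert(px && py);
    double sum = 0.0;
    while (px->index != -1 && py->index != -1) {
        if (px->index == py->index) {
            sum += px->value * py->value;
            ++px;
            ++py;
        } else if (px->index > py->index) {
            ++py;
        } else {
            ++px;
        }
    }
    return sum;
}

// Contiguous data with a fixed stride: a candidate for packed SIMD,
// provided the compiler is allowed to reorder the floating-point sum.
double dense_dot(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```

Even the dense version is a floating-point reduction, so a compiler usually needs a relaxed floating-point model before it will vectorize it. If the real hot loop resembles sparse_dot, wider vectors or a different CPU would not be expected to help much.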

Colonialism answered 14/2, 2019 at 14:18 Comment(10)
Yes, MSVC knows to use VZEROUPPER to avoid penalties like Why is this SSE code 6 times slower without VZEROUPPER on Skylake? from false dependencies. Possibly if you used an AVX intrinsic without compiling with -arch:AVX you could trick it into not doing that, at least with older versions of the compiler I think I remember getting it to emit mixed SSE/AVX code without -arch:AVX. But hopefully it stops you from shooting yourself in the foot. (Unless maybe you tell it to tune for KNL Xeon Phi, where vzeroupper is slow and not helpful.)Profiteer
I'm not entirely confident that this answer is correct. If you look at the number of multiply instructions, with AVX, there are about 8M VMULSD. But without AVX, there are about 42M MULSD. How can they have the same performance? The same applies to SUB, but this is a much cheaper instruction.Haland
@HadiBrais I agree, that is confusing. Maybe the SSE version got more loops unrolled. Or the AVX version does not include some functions which are present in the SSE version. The numbers significant for the run-time are the numbers of executed instructions (the first two lists of the OP).Colonialism
@HadiBrais: I don't trust the OP's per-instruction numbers. One of them has vastly more total instructions than the other, so it was probably for more iterations. e.g. 126M cmp vs. 25M. So probably 5x as many loop iterations, matching the 5x difference in scalar instructions. There are some differences, like the ratio of JZ to JNZ, though. Ratio of 2.6 with AVX vs. 2.1 with only SSE.Profiteer
@Colonialism My understanding is that all of the lists show the number of dynamic instructions. It's just that the first two lists categorize all the instructions into potentially overlapping classes.Haland
@HadiBrais Also, mulss actually has better throughput than subss/addss on many recent architectures. This is more a question of how much multiplication/addition circuits are on the CPU, not how "complicated" the instruction is. (Not that this matters for this question).Colonialism
@Colonialism It depends on the processor. Anyway, it's not clear at all why would they have the same performance. I'm suspecting that they have different bottlenecks. It's difficult to judge without thoroughly examining the hot loops in the code.Haland
@HadiBrais I'd highly suggest using /Qvec-report (e.g. /Qvec-report:2) to see what is and isn't getting vectorized and why.Unknow
@HadiBrais it seems you were right about the total instruction count: I had run it for a different number of loops. I updated the instruction counters and now they look better. I also included the total number of instructions. ThanksDisjoined
@PeterCordes Brilliant hunch dude, you are right. Now I'm confident that the answer is correct.Haland
