SSE4.2 & STTNI - is PcmpEstrM really twice as slow as PcmpIstrM?

I'm experimenting with the SSE4.2 STTNI instructions and got a strange result: PcmpEstrM (explicit-length strings) runs about twice as slow as PcmpIstrM (implicit-length strings).

  • On my i7 3610QM the difference is 2366.2 ms (E) vs. 1202.3 ms (I), i.e. E is 97% slower.
  • On an i5 3470 the difference is smaller but still significant: 3206.2 ms (E) vs. 2623.2 ms (I), i.e. E is 22% slower.

Both are "Ivy Bridge" parts, so it is strange that the gap differs so much between them (at least I can't see any relevant technical difference in their specs - http://www.cpu-world.com/Compare_CPUs/Intel_AW8063801013511,Intel_CM8063701093302/).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists the same throughput (11) and latency (3) for both PcmpEstrM and PcmpIstrM, so I expected similar performance from both.

Q: Is the difference I'm seeing expected by design, or am I using these instructions in the wrong way?

Below is my dummy test scenario (VS 2012). The logic is simple: scan 16 MB of text to find a matching character. Since neither the haystack nor the needle contains a zero terminator, I expect E and I to perform similarly.

PS: I tried posting this question on Intel's dev forum, but it was flagged as spam :(

#include "stdafx.h"
#include <windows.h>
#include <stdio.h>      // printf, sprintf_s
#include <string.h>     // memset
#define BEGIN_TIMER(NAME)                       \
    {                                           \
        LARGE_INTEGER   __freq;                 \
        LARGE_INTEGER   __t0;                   \
        LARGE_INTEGER   __t1;                   \
        double          __tms;                  \
        const char*     __tname = NAME;         \
        char            __tbuf[0xff];           \
        \
        QueryPerformanceFrequency(&__freq);     \
        QueryPerformanceCounter(&__t0);         
#define END_TIMER()                             \
        QueryPerformanceCounter(&__t1);         \
        __tms = (__t1.QuadPart - __t0.QuadPart) * 1000.0 / __freq.QuadPart; \
        sprintf_s(__tbuf, sizeof(__tbuf), "%-32s = %6.1f ms\n", __tname, __tms ); \
        OutputDebugStringA(__tbuf);             \
        printf("%s", __tbuf);                   \
    }
// 4.1.3 Aggregation Operation
#define SSE42_AGGOP_BITBASE         2
#define SSE42_AGGOP_EQUAL_ANY       (00b << SSE42_AGGOP_BITBASE)
#define SSE42_AGGOP_RANGES          (01b << SSE42_AGGOP_BITBASE)
#define SSE42_AGGOP_EQUAL_EACH      (10b << SSE42_AGGOP_BITBASE)
#define SSE42_AGGOP_EQUAL_ORDERED   (11b << SSE42_AGGOP_BITBASE)
int _tmain(int argc, _TCHAR* argv[])
{
    int cIterations = 1000000;
    int cCycles = 1000;
    int cchData = 16 * cIterations;
    char* testdata = new char[cchData + 16];

    memset(testdata, '*', cchData);
    testdata[cchData - 1] = '+';
    testdata[cchData] = '\0';
    BEGIN_TIMER("PcmpIstrM") {
        for( int i = 0; i < cCycles; i++ ) {
            __asm {
                    push        ecx
                    push        edx
                    push        ebx
                    mov         edi, testdata
                    mov         ebx, cIterations
                    mov         al, '+'
                    mov         ah, al
                    movd        xmm1, eax               // low word of xmm1 = pattern byte twice
                    pshuflw     xmm1, xmm1, 0           // broadcast the low word across the low qword
                    movlhps     xmm1, xmm1              // ... and copy the low qword to the high qword
                loop_pcmpistri:
                    PcmpIstrM   xmm1, [edi], SSE42_AGGOP_EQUAL_EACH
                    add         edi, 16
                    sub         ebx, 1
                    jnz         loop_pcmpistri
                    pop         ebx
                    pop         edx
                    pop         ecx
            }
        }
    } END_TIMER();
    BEGIN_TIMER("PcmpEstrM") {
        for( int i = 0; i < cCycles; i++ ) {
            __asm {
                    push        ecx
                    push        edx
                    push        ebx
                    mov         edi, testdata
                    mov         ebx, cIterations
                    mov         al, '+'
                    mov         ah, al
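                    movd        xmm1, eax               // low word of xmm1 = pattern byte twice
                    pshuflw     xmm1, xmm1, 0           // broadcast the low word across the low qword
                    movlhps     xmm1, xmm1              // ... and copy the low qword to the high qword
                    mov         eax, 15                 // explicit length of the xmm1 operand (15 of 16 bytes)
                    mov         edx, 15                 // explicit length of the memory operand (15 of 16 bytes)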
                loop_pcmpestri:
                    PcmpEstrM   xmm1, [edi], SSE42_AGGOP_EQUAL_EACH
                    add         edi, 16
                    sub         ebx, 1
                    jnz         loop_pcmpestri
                    pop         ebx
                    pop         edx
                    pop         ecx
            }
        }
    } END_TIMER();
    delete[] testdata;
    return 0;
}
Lyndel answered 5/1, 2014 at 16:7 Comment(7)
Have you tried implementing this with rep strings? On IvyBridge they should have been improved. - Genus
Can't tell right now, but past experience (and I believe this was already tested on IvyBridge) shows that REP performs close to a plain 'for' loop, and for string comparison it is algorithmically worse: to call REP you need to specify a count, which requires computing the string length first, which takes time and is not needed in most cases ... - Lyndel
Agner's tables state that pcmpestrm is 8 µops whereas pcmpistrm is just 3 µops with a register operand. This could explain the difference. - Boles
Thanks for the reference to Agner's tables (I didn't know about them). The information there seems to be the key. Can you post an answer so that I can accept it? - Lyndel
mov eax, '++++' would be a lot faster. Or imul eax, eax, 0x01010101 to broadcast a byte before movd. Then use pshufd to broadcast to a vector. Also, you don't need to push/pop registers inside inline asm. MSVC does that for you. - Hirsh
@Leeor: Do you have any evidence that repe/ne scasb or cmpsb have optimized (vectorized) microcode in any CPU? ERMSB only applies to rep stosb and rep movsb, not the comparison stuff, and Agner Fog's tables list rep scas as taking ~2n cycles on IvB (vs. 1/16B for rep stos best case). - Hirsh
@Xtra: See also other links in the x86 tag wiki. - Hirsh
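
The byte broadcast mentioned in the comments (multiplying by 0x01010101 before the movd) can be sketched in plain C++. This is only an illustration: the helper name `broadcast_byte` is made up here and does not appear in the question's code.

```cpp
#include <cstdint>

// Replicate one byte into all four bytes of a 32-bit word.
// Multiplying by 0x01010101 adds c, c<<8, c<<16 and c<<24; the four copies
// never overlap, so no carries occur. (Hypothetical helper, for illustration.)
inline std::uint32_t broadcast_byte(unsigned char c) {
    return static_cast<std::uint32_t>(c) * 0x01010101u;
}
```

The resulting word could then be moved into an XMM register and shuffled across the vector, as the comment suggests with movd and pshufd.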

According to Agner Fog's instruction tables, pcmpestrm takes 8 µops, whereas pcmpistrm takes 3 µops on most architectures. This should explain the performance difference you observe. Consider rewriting your code so that you can use pcmpistrm instead of pcmpestrm where possible.
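
As a rough illustration of that suggestion, the two forms can be put side by side with the compiler intrinsics `_mm_cmpistrm` and `_mm_cmpestrm` from `<nmmintrin.h>`. This is a sketch, not the asker's benchmark: the function names `mask_implicit` and `mask_explicit` and the `SSE42_FN` and `EQUAL_EACH_MASK` macros are made up here, the `target` attribute is an assumption for GCC/Clang builds without `-msse4.2`, and the code checks results rather than timing anything.

```cpp
#include <nmmintrin.h>   // SSE4.2 string intrinsics: _mm_cmpistrm, _mm_cmpestrm

// Assumption: on GCC/Clang, enable SSE4.2 per function so no extra flag is needed.
#if defined(__GNUC__) || defined(__clang__)
#define SSE42_FN __attribute__((target("sse4.2")))
#else
#define SSE42_FN
#endif

// Unsigned bytes, "equal each" aggregation, result returned as a bit mask.
#define EQUAL_EACH_MASK (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_BIT_MASK)

// Implicit-length form: each operand ends at its first zero byte, if any.
SSE42_FN unsigned mask_implicit(char needle, const char* block16) {
    __m128i pattern = _mm_set1_epi8(needle);
    __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block16));
    __m128i mask = _mm_cmpistrm(pattern, data, EQUAL_EACH_MASK);
    return static_cast<unsigned>(_mm_cvtsi128_si32(mask));
}

// Explicit-length form: the same length is passed for both operands.
SSE42_FN unsigned mask_explicit(char needle, const char* block16, int len) {
    __m128i pattern = _mm_set1_epi8(needle);
    __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block16));
    __m128i mask = _mm_cmpestrm(pattern, len, data, len, EQUAL_EACH_MASK);
    return static_cast<unsigned>(_mm_cvtsi128_si32(mask));
}
```

For a 16-byte block without zero bytes, `mask_implicit` and `mask_explicit` with a length of 16 should return the same mask, so the implicit form can stand in for the explicit one in that case. With a length of 15, as in the question's loop, byte 15 lies past the end of both operands, and the "equal each" mode reports such positions as matching.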

Boles answered 28/8, 2017 at 20:3 Comment(2)
Anything more than 4 uops is microcoded, so that's a qualitative front-end difference. stackoverflow.com/questions/26907523/…. Also, Agner does list the throughputs as 4 vs. 3 (close to Intel's numbers), but the impact on front-end throughput of surrounding code is something Intel's latency / throughput numbers don't help you figure out. - Hirsh
See the "performance analysis" section of this answer for a summary of how latency / front-end / execution-ports are the three main "dimensions" for characterizing a sequence of instructions (with no branches or cache misses). - Hirsh
