Why is memcmp so much faster than a for loop check?

E

3

43

Why is memcmp(a, b, size) so much faster than:

for(i = 0; i < nelements; i++) {
    if a[i] != b[i] return 0;
}
return 1;

Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.

Emulous answered 14/1, 2014 at 5:42 Comment(2)

Compile with -S to see the assembly-language output and find out. On x86, as others have mentioned, there are instructions for this, but often these can be vectorized. – Uglify 21/4, 2018 at 3:31

But what optimization level are you using? Many compilers can unroll that loop. – Uglify 21/4, 2018 at 3:39

M

59

memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.

As a "builtin"

GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.

On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).

Given the following code:

int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}

gcc version 3.4.4 on Cygwin generates the following assembly:

; (prologue)
mov     esi, [ebp+arg_0]    ; Move first pointer to esi
mov     edi, [ebp+arg_4]    ; Move second pointer to edi
mov     ecx, [ebp+arg_8]    ; Move length to ecx

cld                         ; Clear DF, the direction flag, so comparisons happen
                            ; at increasing addresses
cmp     ecx, ecx            ; Special case: If length parameter to memcmp is
                            ; zero, don't compare any bytes.
repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags
                            ; Repeat this while equal ZF is set
setz    al                  ; Set al (return value) to 1 if ZF is still set
                            ; (all bytes were equal).
; (epilogue)

Reference:

cmpsb instruction

As a library function

Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.

In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:

The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:

ENTRY(memcmp)
    .type   memcmp, @gnu_indirect_function
    LOAD_RTLD_GLOBAL_RO_RDX
    HAS_CPU_FEATURE (SSSE3)
    jnz 2f
    leaq    __memcmp_sse2(%rip), %rax
    ret 

2:  HAS_CPU_FEATURE (SSE4_1)
    jz  3f  
    leaq    __memcmp_sse4_1(%rip), %rax
    ret 

3:  leaq    __memcmp_ssse3(%rip), %rax
    ret 

END(memcmp)

In the Linux kernel

Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.

Megaera answered 14/1, 2014 at 5:43 Comment(11)

rep cmpsb, that is. – Sargent 14/1, 2014 at 5:45

@Sargent Wow, thanks I almost continued writing my whole answer as if this were memcpy. – Megaera 14/1, 2014 at 5:46

It would be interesting to profile the builtin version with non builtin version (-fno-builtin). At some point the builtin version was much slower. I don't know if it has improved. – Rawinsonde 14/1, 2014 at 13:12

I was reading the bug report, and it appears that it's quite complicated and there are lots of corner cases. (including constant lengths, etc.) – Megaera 14/1, 2014 at 13:52

IIRC instructions like rep cmpsb are actually quite slow. gcc now generates a call to the libc version of memcmp, which (in glibc) has an optimized asm implementation (using SIMD, not rep cmpsb). – Malefactor 12/11, 2014 at 15:9

That is not universally true. Modern CPUs have a "fast string operations" feature that puts the rep * versions back at the top. The Linux kernel detects your whether your CPU supports this feature and live patches the appropriate code in place. (Although that may be only for movsb and friends) – Megaera 12/11, 2014 at 16:23

I'll try and dig up some documentation, but Googling for that quoted string above should turn up something good. – Megaera 12/11, 2014 at 16:24

As mentioned by @MarcGlisse , 4.8 with -O3 makes a jmp memcmp. This is also mentioned at: https://mcmap.net/q/226424/-intrinsic-memcmp, which links to: gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 – Enchantment 24/4, 2015 at 20:38

All: I have thoroughly updated my answer with the current state of affairs. Feedback is welcome. – Megaera 28/1, 2016 at 12:31

Marc Glisse is correct; but "fast strings" only applies to rep, not repz/repnz. rep movsb / rep stosb are fast (especially with ERMSB on Ivybridge+), but repz cmpsb isn't. See agner.org/optimize for instruction tables: On Skylake, repz cmps has a runtime of >=2n cycles, taking >= 8n uops. (Where n is the element count, rcx if it goes to the end, i.e. 1 byte per 2 cycles for cmpsb.) But rep movs has a best-case of 1/32B (copy 32 bytes per cycle). – Pendragon 27/11, 2017 at 10:8

Inlining rep cmpsb is a poor choice except for very short blocks to save the function-call overhead. It's not fast. It can also avoid a branch mispredict, though, again a possible advantage for short blocks. – Pendragon 27/11, 2017 at 10:9

C

1

Is memcmp a CPU instruction or something?

It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.

Coupe answered 14/1, 2014 at 5:44 Comment(0)

I

1

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.

intrinsic memcmp

Isaisaac answered 14/1, 2014 at 5:45 Comment(2)

memcmp is a GCC builtin, not intrinsic. intrinsic typically refers to C-level access to specific CPU instructions. – Megaera 14/1, 2014 at 6:7

and intrinsic is what they're called in Visual C++ – Isaisaac 14/1, 2014 at 15:42

As a "builtin"

As a library function

In the Linux kernel

Recommended topics

Hot tags