Why is memcmp so much faster than a for loop check?
Asked Answered
E

3

43

Why is memcmp(a, b, size) so much faster than:

for(i = 0; i < nelements; i++) {
    if a[i] != b[i] return 0;
}
return 1;

Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.

Emulous answered 14/1, 2014 at 5:42 Comment(2)
Compile with -S to see the assembly-language output and find out. On x86, as others have mentioned, there are instructions for this, but often these can be vectorized.Uglify
But what optimization level are you using? Many compilers can unroll that loop.Uglify
M
59

memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.

As a "builtin"

GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.

On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).

Given the following code:

int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}

gcc version 3.4.4 on Cygwin generates the following assembly:

; (prologue)
mov     esi, [ebp+arg_0]    ; Move first pointer to esi
mov     edi, [ebp+arg_4]    ; Move second pointer to edi
mov     ecx, [ebp+arg_8]    ; Move length to ecx

cld                         ; Clear DF, the direction flag, so comparisons happen
                            ; at increasing addresses
cmp     ecx, ecx            ; Special case: If length parameter to memcmp is
                            ; zero, don't compare any bytes.
repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags
                            ; Repeat this while equal ZF is set
setz    al                  ; Set al (return value) to 1 if ZF is still set
                            ; (all bytes were equal).
; (epilogue) 

Reference:

As a library function

Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.

In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:

The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:

ENTRY(memcmp)
    .type   memcmp, @gnu_indirect_function
    LOAD_RTLD_GLOBAL_RO_RDX
    HAS_CPU_FEATURE (SSSE3)
    jnz 2f
    leaq    __memcmp_sse2(%rip), %rax
    ret 

2:  HAS_CPU_FEATURE (SSE4_1)
    jz  3f  
    leaq    __memcmp_sse4_1(%rip), %rax
    ret 

3:  leaq    __memcmp_ssse3(%rip), %rax
    ret 

END(memcmp)

In the Linux kernel

Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.

Megaera answered 14/1, 2014 at 5:43 Comment(11)
rep cmpsb, that is.Sargent
@Sargent Wow, thanks I almost continued writing my whole answer as if this were memcpy.Megaera
It would be interesting to profile the builtin version with non builtin version (-fno-builtin). At some point the builtin version was much slower. I don't know if it has improved.Rawinsonde
I was reading the bug report, and it appears that it's quite complicated and there are lots of corner cases. (including constant lengths, etc.)Megaera
IIRC instructions like rep cmpsb are actually quite slow. gcc now generates a call to the libc version of memcmp, which (in glibc) has an optimized asm implementation (using SIMD, not rep cmpsb).Malefactor
That is not universally true. Modern CPUs have a "fast string operations" feature that puts the rep * versions back at the top. The Linux kernel detects your whether your CPU supports this feature and live patches the appropriate code in place. (Although that may be only for movsb and friends)Megaera
I'll try and dig up some documentation, but Googling for that quoted string above should turn up something good.Megaera
As mentioned by @MarcGlisse , 4.8 with -O3 makes a jmp memcmp. This is also mentioned at: https://mcmap.net/q/226424/-intrinsic-memcmp, which links to: gcc.gnu.org/bugzilla/show_bug.cgi?id=43052Enchantment
All: I have thoroughly updated my answer with the current state of affairs. Feedback is welcome.Megaera
Marc Glisse is correct; but "fast strings" only applies to rep, not repz/repnz. rep movsb / rep stosb are fast (especially with ERMSB on Ivybridge+), but repz cmpsb isn't. See agner.org/optimize for instruction tables: On Skylake, repz cmps has a runtime of >=2n cycles, taking >= 8n uops. (Where n is the element count, rcx if it goes to the end, i.e. 1 byte per 2 cycles for cmpsb.) But rep movs has a best-case of 1/32B (copy 32 bytes per cycle).Pendragon
Inlining rep cmpsb is a poor choice except for very short blocks to save the function-call overhead. It's not fast. It can also avoid a branch mispredict, though, again a possible advantage for short blocks.Pendragon
C
1

Is memcmp a CPU instruction or something?

It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.

Coupe answered 14/1, 2014 at 5:44 Comment(0)
I
1

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.

intrinsic memcmp

Isaisaac answered 14/1, 2014 at 5:45 Comment(2)
memcmp is a GCC builtin, not intrinsic. intrinsic typically refers to C-level access to specific CPU instructions.Megaera
and intrinsic is what they're called in Visual C++Isaisaac

© 2022 - 2024 — McMap. All rights reserved.