Implementing memcmp
C

The following is the Microsoft CRT implementation of memcmp:

int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);

    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }

    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}

It basically performs a byte-by-byte comparison.

My question is in two parts:

  1. Is there any reason not to alter this to an int-by-int comparison until count < sizeof(int), then do a byte-by-byte comparison for what remains?
  2. If I were to do 1, are there any potential/obvious problems?

Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.

Cuneo answered 16/2, 2011 at 14:31 Comment(4)
This isn't quite true for the most part: assuming you compile with optimizations on, it's going to turn into a compiler intrinsic rather than call the CRT's implementation. – Weber
Added the C tag, as this is really a C question. – Kierkegaardian
When optimizing, one question to think about is how large the data has to be before any significant improvement is seen. Sometimes the overhead of executing the function takes up more time than the actual comparison. – Polypody
Is this really the implementation? I would have thought they'd be using SIMD instructions (SSE). – Kovacs

You could do it as an int-by-int comparison or an even wider data type if you wish.

The two things you have to watch out for (at a minimum) are overhang at the start and at the end, and whether the two areas have different alignments.

Some processors run slower if you access values without following their alignment rules (some even crash if you try it).

So your code could probably do char comparisons up to an int-aligned boundary, then int comparisons, then char comparisons again for the tail; but, again, the relative alignment of the two areas will matter.

Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
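
A sketch of that method (my_memcmp is an invented name, and the unsigned long loads assume the platform tolerates this kind of type-punned access, which a production implementation would have to guarantee through compiler-specific means):

#include <stddef.h>
#include <stdint.h>

/* Sketch: char comparisons up to an alignment boundary, word comparisons
 * in the middle, char comparisons at the end. Only the "identically
 * aligned" case takes the fast path; everything else falls through to
 * the plain byte loop. */
int my_memcmp(const void *buf1, const void *buf2, size_t count)
{
    const unsigned char *p1 = buf1;
    const unsigned char *p2 = buf2;

    if (((uintptr_t)p1 % sizeof(unsigned long)) ==
        ((uintptr_t)p2 % sizeof(unsigned long))) {

        /* Leading overhang: bytes up to the first aligned address. */
        while (count > 0 && ((uintptr_t)p1 % sizeof(unsigned long)) != 0) {
            if (*p1 != *p2)
                return *p1 - *p2;
            p1++; p2++; count--;
        }

        /* Aligned word-at-a-time comparisons. */
        while (count >= sizeof(unsigned long)) {
            if (*(const unsigned long *)p1 != *(const unsigned long *)p2)
                break; /* mismatch inside this word: locate it byte-wise below */
            p1 += sizeof(unsigned long);
            p2 += sizeof(unsigned long);
            count -= sizeof(unsigned long);
        }
    }

    /* Trailing overhang (and the mismatching word, if we broke out above). */
    while (count > 0) {
        if (*p1 != *p2)
            return *p1 - *p2;
        p1++; p2++; count--;
    }
    return 0;
}

The trailing byte loop doubles as the mismatch locator: when a word compare breaks out, the loop finds the first differing byte, so the return value matches the plain byte-by-byte version.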

Target answered 16/2, 2011 at 14:34 Comment(2)
Wouldn't that optimization also require that both pointers overhang at the start by the same amount? – Thither
That optimisation is possible, but I suspect that with all the casting involved (to uintptr_t from void* for the alignment check, to int* for the comparisons, back to char*...) it would be quicker to just drop down to assembly and hack it together... – Cultural

The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.

It's also more complicated, and thus more likely to contain a bug.
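
For what it's worth, the "same alignment" test itself is tiny (a hypothetical helper, not from any particular library; word must be a power of two, e.g. sizeof(long)):

#include <stddef.h>
#include <stdint.h>

/* The word-at-a-time fast path is only safe/cheap when both pointers
 * are misaligned by the same amount. */
static int same_alignment(const void *a, const void *b, size_t word)
{
    return ((uintptr_t)a & (word - 1)) == ((uintptr_t)b & (word - 1));
}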

Hedley answered 16/2, 2011 at 14:36 Comment(5)
Aligned access of both pointers is required for performance (especially when using 128-bit SSE). See "Importance of alignment even on x86 machines" (admittedly from 2004, but it should give you some idea of the lurking terrors). – Yashmak
@bobbogo: misaligned SSE access is much faster on modern Intel processors than it was in 2004; aligned access should still be preferred when possible, but I have seen many situations where misaligned access was faster than aligned access + software fixup. – Publishing
@Stephen Canon: Yeah, YMMV as usual. Presumably for small areas it doesn't really matter how inefficient you are. Is there a limit on the memory block size where aligned access + software fixup becomes faster than without the fixup on the newest CPUs? – Yashmak
@bobbogo: Not really, no. There are certain highly specific scenarios where it's faster, but in general, unaligned SSE access (even when it crosses a cacheline or page boundary) is quite fast on i5/i7 cores. (Note that aligned access is still faster, so you should always make sure that at least one of your accesses is aligned; it's just that when you're reading from multiple arrays, you should generally align those you can and use misaligned access on the others, instead of trying to do something more clever.) – Publishing
Just to offer some concrete numbers, from the Intel optimization manual, on i7 for data in the L1 cache: misaligned loads that do not cross a cacheline boundary are exactly as fast as aligned loads (single-cycle throughput), and misaligned loads that do cross a cacheline boundary take about 4 cycles (compare to ~20 cycles on earlier microarchitectures). – Publishing

Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
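
For example, with unsigned long loads on a little-endian machine, the first differing byte can be recovered from the XOR of the two words. A sketch (__builtin_ctzl is a GCC/Clang extension, and w1 and w2 must actually differ, since ctz of zero is undefined):

#include <stddef.h>

/* After a word compare fails, find the first differing byte so the
 * return value matches byte-wise memcmp. p1/p2 point at the bytes the
 * words were loaded from. On a big-endian target you would count
 * leading, not trailing, zeros. */
static int diff_in_word(unsigned long w1, unsigned long w2,
                        const unsigned char *p1, const unsigned char *p2)
{
    size_t byte = (size_t)__builtin_ctzl(w1 ^ w2) / 8; /* lowest differing byte */
    return p1[byte] - p2[byte]; /* difference as unsigned chars, like memcmp */
}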

Sago answered 17/2, 2011 at 5:16 Comment(0)

If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).
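
In other words, split count up front, something like this trivial sketch:

size_t words = count / sizeof(int);  /* full int-sized comparisons   */
size_t tail  = count % sizeof(int);  /* leftover bytes, done as char */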

Flan answered 16/2, 2011 at 14:41 Comment(0)

Is that really their implementation? I have other issues besides not doing it int-wise:

  • casting away constness.
  • does that return statement work? unsigned char - unsigned char = signed int?

int-at-a-time only works if the pointers are aligned, or if you can read a few bytes from the front of each so that both become aligned: if both are one byte before an alignment boundary, you can read one char from each and then go int-at-a-time. But if they are aligned differently (e.g. one aligned, one not), there is no way to do this.

memcmp is at its most inefficient (i.e. it takes the longest) when the buffers actually compare equal (it has to go all the way to the end) and the data is long.

I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.
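
On the constness complaint, the cast-free shape looks like this (a minimal byte-wise sketch with invented names; local const pointers are introduced instead of casting const away):

#include <stddef.h>

/* No const is cast away: the void* parameters are only converted,
 * never modified through. */
int memcmp_bytes(const void *buf1, const void *buf2, size_t count)
{
    const unsigned char *p1 = buf1;
    const unsigned char *p2 = buf2;

    while (count--) {
        if (*p1 != *p2)
            return *p1 - *p2;
        p1++;
        p2++;
    }
    return 0;
}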

Kierkegaardian answered 16/2, 2011 at 14:49 Comment(4)
The "casting away constness" is only to allow the incrementing of the pointer by char sized chunks. That is also what's going on in the return statement -- the char versions aren't being dereferenced and compared, they are simply undergoing pointer subtraction. Also keep in mind w.r.t. the const ness that this memcmp was probably written before const was added to the language, and someone just tacked on const to the method signature to update it for C++. EDIT: Also, you really didn't answer the OP's question.Weber
@Billy there is no need to cast away constness you just create const char* pointers locally then use them throughout, and can increment them too. And yes, I did answer the OP's question, stating about the possible issues with alignment if you compared word-at-a-time which is what he proposed.Kierkegaardian
There certainly is reason to cast away const ness if the code is written before const exists. I see that you sort of do answer the question now, but it's buried inside your bashing of one particular implementation to the point that I didn't see it.Weber
unsigned char - unsigned char = int is exactly what's required, and yes, it does work. There's an implicit promotion of arithmetic arguments narrower than int to int.Publishing

Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.

Pseudocode:

while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
  fetch source bytes
  fetch destination bytes
  perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.
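
A C rendering of that pseudocode might look like the following sketch; __builtin_prefetch is a GCC/Clang extension, and CHUNK is an invented, unmeasured tuning constant:

#include <stddef.h>

enum { CHUNK = 64 }; /* invented tuning constant: one typical cache line */

/* Byte-wise helper used for each chunk and for the remainder. */
static int cmp_bytes(const unsigned char *p1, const unsigned char *p2, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return p1[i] - p2[i];
    return 0;
}

int memcmp_chunked(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;

    while (n >= CHUNK) {
        __builtin_prefetch(p1 + CHUNK); /* hint: fetch the next source chunk */
        __builtin_prefetch(p2 + CHUNK); /* hint: fetch the next destination chunk */
        int r = cmp_bytes(p1, p2, CHUNK); /* compare the current chunk */
        if (r != 0)
            return r;
        p1 += CHUNK; p2 += CHUNK; n -= CHUNK;
    }
    return cmp_bytes(p1, p2, n); /* byte-by-byte remainder */
}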

For more information, search the web for "Data Driven Design" and "data oriented programming".

Some processors, such as the ARM family, allow for conditional execution of instructions in 32-bit (non-Thumb) mode. The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
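
As an illustration, equality testing can be phrased as accumulation instead of a branch per byte (a sketch that only answers equal/not-equal, not memcmp's ordering result):

#include <stddef.h>

static int regions_equal(const unsigned char *p1, const unsigned char *p2,
                         size_t n)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= p1[i] ^ p2[i]; /* accumulates any difference, no data-dependent branch */
    return diff == 0;
}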

See also loop unrolling.
See also assembly language.

You can gain a lot of performance by tailoring the algorithm to a specific processor, but you lose portability.

Polypody answered 16/2, 2011 at 17:31 Comment(0)

The code you found is just a debug implementation of memcmp; it's optimized for simplicity and readability, not for performance.

The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible. Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.

Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:

  • What is the minimum guaranteed buffer alignment?
  • Can you read any padding bytes past the end of a buffer without triggering an access violation?
  • May the buffer parameters be identical?
  • May the buffer size be 0?
  • Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
  • ...

If performance is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly listing that they generate for a source file (for example, gcc -S or MSVC's /FA switch). You could take that code and adapt it to your needs.

Caprifoliaceous answered 25/11, 2012 at 14:17 Comment(0)

Many processors implement this as a single instruction. If you can guarantee which processor you're running on, it can be implemented with a single line of inline assembler.
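
On x86, for example, the instruction is repe cmpsb. A sketch for x86-64 with GCC-style extended asm follows (and note the comment below about string operations being slow on modern cores; this is about brevity, not speed):

#include <stddef.h>

int memcmp_repe(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;

    if (n == 0)
        return 0;
    /* repe cmpsb compares bytes at [rsi]/[rdi], advancing both and
     * decrementing rcx, until a mismatch or rcx reaches zero. */
    __asm__ volatile("repe cmpsb"
                     : "+S"(p1), "+D"(p2), "+c"(n)
                     :
                     : "cc", "memory");
    /* p1/p2 now point one past the last bytes compared; if everything
     * matched, those bytes were equal and the difference is 0. */
    return p1[-1] - p2[-1];
}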

Umpteen answered 16/2, 2011 at 16:33 Comment(1)
Just because it's a single assembly op doesn't mean it's faster! x86 string operations are slow, as they expand to a large number of micro-ops out of a slow instruction-decode ROM. – Positive
