memcmp but need to compare block with fixed value
Asked Answered
A

6

9

I need to compare a block of memory to a fixed value in C. Can I do this with memcmp? Something like:

memcmp (starting_address , fixed_value , num_byte)

I need fixed_value to be a fixed value not the starting address of a block.

  1. Writing the fixed value to an entire temporary memory block is not an option because I have limited space.
  2. Using a loop to write and check memory one by one is not an option because it's very slow.

If it's not possible can anyone tell me a solution that is as fast (or faster) than memcmp?

Thanks,

EDIT: Let's say I have 5GB of memory that holds 0's. And I'm trying to make sure they're all 0's. Is it safe to check the first byte of the block then do this:

memcmp (starting_address , starting_address + ONE_BYTE , FIVE_GB); ?

EDIT: This is why I need to use memcmp and not a user defined loop:

This code took 546 clock ticks to run:

memset(0x80000000 , 0x1 , 0x10000000);
memset(0x90000000 , 0x1 , 0x10000000);
memcmp(0x80000000 , 0x90000000 , 0x10000000);

vs this one that took 7669 clock ticks:

unsigned int i;
int flag = 0;
int *p = 0x80000000;
int *q = 0x90000000;
while(p < 0x90000000)
{
    if(*p++ != *q++)
    {
        flag = 1;
    }
}
Archenemy answered 10/6, 2013 at 22:37 Comment(19)
"Using a loop to write and check memory one by one is not an option because it's very slow." What do you think memcmp is going to do?Yarndyed
Have you tried timing to see how long memcmp takes in comparison to a for loop you've written yourself before you came to the conclusion that memcmp is faster? Have you tried reading and comparing blocks of 32 or 64 bits at a time in a for loop?Factitious
@CarlNorum: For loops aren't even close to memcmp/memcpy performance in my experience. Modern processors have efficient instructions for data handling in memory (REP MOVSB comes to mind) and there's extra loop overhead. There are faster ways still in asm, since memcmp/memcpy is designed to handle generic cases, like when memory involved is not DWORD-aligned.Surf
no starting_address - ONE_BYTE will point below the start of starting_address at the first access.Sweet
It should be memcmp(start, start+1, size-1) but the interesting question is whether memcmp has well-defined behavior when the inputs overlap. I was going to instinctively say no, but when I tried to find an authoritative source stating that it's undefined, I couldn't come up with one.Unmixed
Thanks. Modified the code.Archenemy
@CarlNorum: memcmp is more efficient. See the second edit.Archenemy
But those aren't the same! Your second example isn't even valid code.Yarndyed
You forgot to use break. Searching large chunks of memory is memory bus-bound, not execution-bound. Very hard to write an inefficient version, but forgetting break is a good way.Sousaphone
Why care about performance if it is not even correct ?Sweet
Two famous quotes: "you can't get any less oprimised than 'wrong'" and "I can make it run arbitrarily fast if it doesn't have to produce the right result" :-)Grandsire
@CharlesBurns: I just wrote a quick test, and I can beat memcpy easily with a 64-bit compare in a tight loop. (0.036462 s vs 0.024012 s) for OP's buffer size of 0x10000000.Yarndyed
Just editted my code. Still memcmp is much faster.Archenemy
@CarlNorum: how did you implement the loop?Archenemy
It's going to look bad here in a comment, but here goes: uint64_t *p = (uint64_t *)buffer; uint64_t compare; memset(&compare, 1, sizeof compare); for (i = 0; i < SIZE/sizeof compare; i++) { if (p[i] != compare) break; }Yarndyed
I had #define SIZE 0x10000000 elsewhere.Yarndyed
@CarlNorum: Thanks! I tried this code: unsigned int i; int *p = 0x80000000; while(p < 0x90000000) { if(*p++ != 0x01010101) {break;}} which is essentially your code but with a while loop and it does it in 546 ticks, exactly as in memcmp. Can you please write the code in a post so I can accept it as a solution? Thanks.Archenemy
Uhm... 564 ticks? To iterate up to 256MB of memory? Unless you hit a mismatch very early only, then something tells me that your compiler is saying "Hey, look at this! All these writes are not necessary. Let's not do them and say we did. Then we can spend 564 ticks chillin'"Anticosti
@NikBougalis: The code is on a softprocessor on an FPGA so it gets converted to logic. I wouldn't be surprised if through some gate level optimization the time was significatly reduced. The 546 ticks was for the worst case scenario of no mismatches. I did try to insert mismatches at several places to make sure the loop catches them (to test the compiler) and it did catch them.Archenemy
Y
3

I just tested this loop on my Mac, and it beats memcmp:

uint64_t *p = (uint64_t *)buffer1;
uint64_t compare;
memset(&compare, 1, sizeof compare);
for (i = 0; i < length/sizeof compare; i++)
{
    if (p[i] != compare)
        break;
}

Complete example code:

#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <time.h>
#include <stdlib.h>
#include <stdint.h>

// from: http://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html
void timeval_subtract(struct timeval *result, struct timeval *x, struct timeval *y)
{
    /* Perform the carry for the later subtraction by updating y. */
    if (x->tv_usec < y->tv_usec)
    {
        int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
        y->tv_usec -= 1000000 * nsec;
        y->tv_sec += nsec;
    }

    if (x->tv_usec - y->tv_usec > 1000000)
    {
        int nsec = (x->tv_usec - y->tv_usec) / 1000000;
        y->tv_usec += 1000000 * nsec;
        y->tv_sec -= nsec;
    }

    /* Compute the time remaining to wait. tv_usec is certainly positive. */
    result->tv_sec = x->tv_sec - y->tv_sec;
    result->tv_usec = x->tv_usec - y->tv_usec;
}

int main(int argc, char **argv)
{
    struct rusage before;
    struct rusage after;
    struct timeval diff;
    size_t i;

    size_t length = strtoull(argv[1], NULL, 0);

    char *buffer1 = malloc(length);
    char *buffer2 = malloc(length);

    printf("filling...");
    fflush(stdout);
    memset(buffer1, 1, length);
    memset(buffer2, 1, length);
    printf(" done\n");

    getrusage(RUSAGE_SELF, &before);
    uint64_t *p = (uint64_t *)buffer1;
    uint64_t compare;
    memset(&compare, 1, sizeof compare);
    for (i = 0; i < length/sizeof compare; i++)
    {
        if (p[i] != compare)
            break;
    }
    if (i == length/sizeof compare)
        i = 0;
    getrusage(RUSAGE_SELF, &after);

    printf("\nloop (returned %zu):\n", i);
    timeval_subtract(&diff, &after.ru_utime, &before.ru_utime);
    printf("User:   %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    timeval_subtract(&diff, &after.ru_stime, &before.ru_stime);
    printf("System: %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    getrusage(RUSAGE_SELF, &before);
    i = memcmp(buffer1, buffer2, length);
    getrusage(RUSAGE_SELF, &after);

    printf("\nmemcmp (returned %zu):\n", i);
    timeval_subtract(&diff, &after.ru_utime, &before.ru_utime);
    printf("User:   %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    timeval_subtract(&diff, &after.ru_stime, &before.ru_stime);
    printf("System: %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    return 0;
}

And run results:

$ make
clang -Wall -Wextra -Werror -O3 -g -o example example.c
./example 0x10000000
filling... done

loop (returned 0):
User:   0.024078 s
System: 0.000011 s

memcmp (returned 0):
User:   0.036752 s
System: 0.000017 s

Maybe you can do something similar?

Note: For those concerned about cache warming, I also tried with the memcmp before the loop and got the same results.

Yarndyed answered 11/6, 2013 at 0:8 Comment(2)
Thanks! I was wrong to think that memcmp outperfoms a user-defined loop, at least for what I was trying to accomplish.Archenemy
@Arash No, you were not wrong. This is one simple minded example. But for buffers in general, with different alignments, sizes etc., memcmp will outperform a trivial implementation. The memcmp implementation accounts for alignments, cache line size, and other optimizations. I would suggest that you give this another thought.Malcommalcontent
H
3

One solution:

Create a buffer containing all the same value and compare against it iteratively. That way you get the advantage of an efficient memcmp implementation without having to write too much code:

static char val[4096]; // tune the size of the buffer if desired
/* at initialization: memset(val, 0x01, sizeof(val)) */

char *start, *ptr, *end;
// initialize start and end
for(ptr = start; ptr < end-sizeof(val); ptr += sizeof(val)) {
    if(memcmp(ptr, val, sizeof(val))
        goto not_all_val;
}
if(memcmp(ptr, val, end - ptr))
    goto not_all_val;

/* all val */
puts("all val");
return;

not_all_val:
puts("not all val");
return;

Performance comparison:

10000 iterations of memcmp(buf, buf2, 4000000) (two buffers of the same length, same as the first method in the question):

real    0m7.480s
user    0m7.375s
sys 0m0.103s

10000 iterations of character-by-character comparison over 4000000 bytes (same as the second method):

real    0m27.004s
user    0m26.908s
sys 0m0.094s

10000 iterations of the proposed method above over 4000000 bytes:

real    0m3.194s
user    0m3.151s
sys 0m0.042s

YMMV (I'm on a three-year-old Macbook Pro), but this method is twice as fast as comparing a complete buffer (likely due to superior cache performance), and nearly ten times as fast as character-by-character comparison.

Hettiehetty answered 10/6, 2013 at 23:24 Comment(0)
Y
3

I just tested this loop on my Mac, and it beats memcmp:

uint64_t *p = (uint64_t *)buffer1;
uint64_t compare;
memset(&compare, 1, sizeof compare);
for (i = 0; i < length/sizeof compare; i++)
{
    if (p[i] != compare)
        break;
}

Complete example code:

#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <time.h>
#include <stdlib.h>
#include <stdint.h>

// from: http://www.gnu.org/software/libc/manual/html_node/Elapsed-Time.html
void timeval_subtract(struct timeval *result, struct timeval *x, struct timeval *y)
{
    /* Perform the carry for the later subtraction by updating y. */
    if (x->tv_usec < y->tv_usec)
    {
        int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
        y->tv_usec -= 1000000 * nsec;
        y->tv_sec += nsec;
    }

    if (x->tv_usec - y->tv_usec > 1000000)
    {
        int nsec = (x->tv_usec - y->tv_usec) / 1000000;
        y->tv_usec += 1000000 * nsec;
        y->tv_sec -= nsec;
    }

    /* Compute the time remaining to wait. tv_usec is certainly positive. */
    result->tv_sec = x->tv_sec - y->tv_sec;
    result->tv_usec = x->tv_usec - y->tv_usec;
}

int main(int argc, char **argv)
{
    struct rusage before;
    struct rusage after;
    struct timeval diff;
    size_t i;

    size_t length = strtoull(argv[1], NULL, 0);

    char *buffer1 = malloc(length);
    char *buffer2 = malloc(length);

    printf("filling...");
    fflush(stdout);
    memset(buffer1, 1, length);
    memset(buffer2, 1, length);
    printf(" done\n");

    getrusage(RUSAGE_SELF, &before);
    uint64_t *p = (uint64_t *)buffer1;
    uint64_t compare;
    memset(&compare, 1, sizeof compare);
    for (i = 0; i < length/sizeof compare; i++)
    {
        if (p[i] != compare)
            break;
    }
    if (i == length/sizeof compare)
        i = 0;
    getrusage(RUSAGE_SELF, &after);

    printf("\nloop (returned %zu):\n", i);
    timeval_subtract(&diff, &after.ru_utime, &before.ru_utime);
    printf("User:   %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    timeval_subtract(&diff, &after.ru_stime, &before.ru_stime);
    printf("System: %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    getrusage(RUSAGE_SELF, &before);
    i = memcmp(buffer1, buffer2, length);
    getrusage(RUSAGE_SELF, &after);

    printf("\nmemcmp (returned %zu):\n", i);
    timeval_subtract(&diff, &after.ru_utime, &before.ru_utime);
    printf("User:   %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    timeval_subtract(&diff, &after.ru_stime, &before.ru_stime);
    printf("System: %ld.%06d s\n", diff.tv_sec, diff.tv_usec);

    return 0;
}

And run results:

$ make
clang -Wall -Wextra -Werror -O3 -g -o example example.c
./example 0x10000000
filling... done

loop (returned 0):
User:   0.024078 s
System: 0.000011 s

memcmp (returned 0):
User:   0.036752 s
System: 0.000017 s

Maybe you can do something similar?

Note: For those concerned about cache warming, I also tried with the memcmp before the loop and got the same results.

Yarndyed answered 11/6, 2013 at 0:8 Comment(2)
Thanks! I was wrong to think that memcmp outperfoms a user-defined loop, at least for what I was trying to accomplish.Archenemy
@Arash No, you were not wrong. This is one simple minded example. But for buffers in general, with different alignments, sizes etc., memcmp will outperform a trivial implementation. The memcmp implementation accounts for alignments, cache line size, and other optimizations. I would suggest that you give this another thought.Malcommalcontent
G
2

memcmp with an address is the best option for comparing blocks of memory. Whether you used a block inline or used the memory address of a block, you'd still have to store the block somewhere.

You can create such a block at compile time with something like:

int block[] = {3, 1, 4, 1, 5, 9};

and then just use block in your memcmp.

If you're just wanting to ensure a block of memory is set to specific values, use the for loop solution. Any other solution you come up with is going to have to do the same thing, check the entire block.

An alternative, if it's a really huge block of memory and it's taking too long, is to remove the requirement altogether. By that, I mean re-engineer your algorithms so that it becomes unnecessary. Let's say you have a 1G block.

An example: don't set them all to zeroes. Instead only set the bit at the front you're actively using and maintain a separate length variable to indicate how much you're using.

A heavily optimised memcmp may outperform a user loop but you may also find that it doesn't, simply because it has to cater to the general case - your specific case of checking against zero may allow a compiler to introduce optimisations that defeat memcmp.

As with all optimisations, measure, don't guess!

Grandsire answered 10/6, 2013 at 22:40 Comment(1)
I think OP wants to check if a given buffer is set all to the same single byte value.Yarndyed
M
1

One option would be to start with the source code for memcmp, and modify it to compare against a fixed buffer, iteratively. That way you will retain the optimizations built into memcmp, avoid the overhead of an external loop, and still achieve your goal. You could have a function like the following:

int memcmp2(const void *s1, size_t n1, const void *s2, size_t n2);

Where n1 is the size of buffer s1, and n2 that of s2.

Malcommalcontent answered 10/6, 2013 at 23:4 Comment(0)
F
1

If you have no control over who writes to that memory block, then there can't possibly exist a smart algorithm that would allow efficient comparison against a single value. You will need to iterate over the whole block and you can't skip even a single word. The only thing you can do is compare more data at once, possibly utilizing machine instructions that can process multiple words at once.

If you do have control over that memory and only you can write to it, then you can be smarter about determining what's in there. For example by "wasting" some space to hold a bit-pattern that determines which words are zeroes. For example, if your words are 32-bit, then you will have a separate memory block where you keep as many words that sum up to the same amount of bits as there are words in your actual memory block. In this case, this is going to cost you 1 byte per 32 bytes of usable memory, which isn't terrible. If you really do need byte-granularity, then the cost is much higher: 1 per 8. But you usually don't need that; you can narrow down the search once you've found a word that isn't zeroed, and search only in that one for the first non-zero byte.

Funch answered 10/6, 2013 at 23:18 Comment(0)
S
1

If, after running memcmp(), why would you expect memory changes? If the memory belongs only to your process nothing will modify it. If this is shared memory the problem becomes very different.

As an alternate suggestion I was thinking of using memset() to put all of the memory a known value- which you've already done in less than 546 ticks.

The reason is: memset() will set memory to a known value in one pass - making a second pass thru the same memory to validate takes circa twice as long.

Selden answered 10/6, 2013 at 23:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.