How can I search for substring in a buffer that contains null?
C

2

7

Using C, I need to find a substring inside a buffer that may contain nulls.

haystack = "Some text\0\0\0\0 that has embedded nulls"
needle   = "has embedded"

I need to return the start of the substring, or NULL, similar to strstr():

request_segment_end = mystrstr(request_segment_start, boundary);

Are there any existing implementations that you know of?

Update

I found an implementation of memmem on Google Code Search, which I've copied here verbatim, untested:

 /*
 * memmem.c
 *
 * Find a byte string inside a longer byte string
 *
 * This uses the "Not So Naive" algorithm, a very simple but
 * usually effective algorithm, see:
 *
 * http://www-igm.univ-mlv.fr/~lecroq/string/
 */

#include <string.h>

void *memmem(const void *haystack, size_t n, const void *needle, size_t m)
{
        const unsigned char *y = (const unsigned char *)haystack;
        const unsigned char *x = (const unsigned char *)needle;

        size_t j, k, l;

        if (m > n || !m || !n)
                return NULL;

        if (1 != m) {
                if (x[0] == x[1]) {
                        k = 2;
                        l = 1;
                } else {
                        k = 1;
                        l = 2;
                }

                j = 0;
                while (j <= n - m) {
                        if (x[1] != y[j + 1]) {
                                j += k;
                        } else {
                                if (!memcmp(x + 2, y + j + 2, m - 2)
                                    && x[0] == y[j])
                                        return (void *)&y[j];
                                j += l;
                        }
                }
        } else
                do {
                        if (*y == *x)
                                return (void *)y;
                        y++;
                } while (--n);

        return NULL;
}
Cray answered 15/3, 2011 at 7:35 Comment(2)
What is "boundary" in this case? - Ninurta
You could implement one of the methods described in ESMAJ. - Poohpooh
G
6

It doesn't make sense to me for a "string" to contain null characters. C strings are null-terminated, so the first null marks the end of the string. Besides, what's to say that there aren't more characters after the null terminator that follows the word "nulls"?

If you mean to search in a buffer, then that makes more sense to me. You'd search the buffer treating null bytes like any other byte and rely only on the lengths. I don't know of any existing implementations, but it should be easy to whip up a simple naive one. Of course, use a better search algorithm here as needed.

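#include <string.h>

char *search_buffer(char *haystack, size_t haystacklen, char *needle, size_t needlelen)
{   /* warning: O(n^2) */
    size_t searchlen;
    if (needlelen > haystacklen)
        return NULL; /* needle cannot fit; also avoids size_t wraparound below */
    searchlen = haystacklen - needlelen + 1; /* number of candidate start positions */
    for ( ; searchlen-- > 0; haystack++)
        if (!memcmp(haystack, needle, needlelen))
            return haystack;
    return NULL;
}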

char haystack[] = "Some text\0\0\0\0 that has embedded nulls";
size_t haylen = sizeof(haystack)-1; /* exclude null terminator from length */
char needle[] = "has embedded";
size_t needlen = sizeof(needle)-1; /* exclude null terminator from length */
char *res = search_buffer(haystack, haylen, needle, needlen);
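
As one example of a cheap improvement (it does not change the worst case), the scan can use memchr to jump straight to bytes equal to the needle's first byte and only then call memcmp. This is just a sketch under assumptions: find_bytes is a hypothetical name, and an empty needle is treated as matching at the start, the way strstr does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: first occurrence of needle in haystack, or NULL.
   Both lengths are explicit, so embedded '\0' bytes are ordinary data. */
const void *find_bytes(const void *haystack, size_t haystacklen,
                       const void *needle, size_t needlelen)
{
    const unsigned char *h = haystack;
    const unsigned char *n = needle;
    const unsigned char *last;   /* last position where a match can start */

    assert(haystack != NULL && needle != NULL);   /* precondition of this sketch */
    if (needlelen == 0)
        return haystack;         /* empty needle matches at the start */
    if (needlelen > haystacklen)
        return NULL;

    last = h + (haystacklen - needlelen);
    while (h <= last) {
        /* Jump to the next byte equal to the needle's first byte. */
        h = memchr(h, n[0], (size_t)(last - h) + 1);
        if (h == NULL)
            return NULL;
        if (memcmp(h, n, needlelen) == 0)
            return h;
        h++;                     /* false candidate, keep scanning */
    }
    return NULL;
}
```

It is called the same way as search_buffer above, e.g. find_bytes(haystack, haylen, needle, needlen).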
Garin answered 15/3, 2011 at 8:19 Comment(3)
You are right, my terminology was incorrect. I meant to search a buffer for a given string. Thank you for your code contribution. Busy testing it now. - Cray
Just a note to clean up some details on an otherwise good answer: the algorithm is a linear search and it's actually O(n), not O(n^2). I don't think you can do any better unless the data you're searching is ordered somehow. Also, I think there is an off-by-one error: the first comparison is done after the haystack pointer has been incremented, so the first position is not checked. - Salerno
It's actually worst-case O(m * n), where m is the length of the "needle" and n is the length of the "haystack" -- it's a linear search of n steps, but each step calls memcmp(), which itself contains a linear search of m steps. There are other algorithms that run in worst-case O(m + n). Also, I'm not seeing a problem with the haystack pointer -- AFAICT the code correctly finds a needle at the beginning of the haystack. - Tref
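
The worst-case O(m + n) bound mentioned in the last comment is reached by, for example, the Knuth-Morris-Pratt algorithm. The comment does not name a specific algorithm, so that choice is an assumption here, and kmp_memmem is a hypothetical name. A minimal sketch that works on raw bytes with explicit lengths might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Knuth-Morris-Pratt search over raw bytes. Lengths are explicit, so
   embedded '\0' bytes are treated like any other value.
   Returns the first match, or NULL if there is none (in this sketch
   NULL is also returned if the failure table cannot be allocated). */
const void *kmp_memmem(const void *haystack, size_t n,
                       const void *needle, size_t m)
{
    const unsigned char *h = haystack;
    const unsigned char *p = needle;
    size_t *fail;
    size_t i, k;

    assert(haystack != NULL && needle != NULL);   /* precondition of this sketch */
    if (m == 0)
        return haystack;          /* empty needle matches at the start */
    if (m > n)
        return NULL;

    fail = calloc(m, sizeof *fail);
    if (fail == NULL)
        return NULL;

    /* fail[i] = length of the longest proper prefix of p[0..i]
       that is also a suffix of p[0..i]. fail[0] stays 0. */
    for (i = 1, k = 0; i < m; i++) {
        while (k > 0 && p[i] != p[k])
            k = fail[k - 1];
        if (p[i] == p[k])
            k++;
        fail[i] = k;
    }

    /* Scan the haystack once; k = number of needle bytes matched so far. */
    for (i = 0, k = 0; i < n; i++) {
        while (k > 0 && h[i] != p[k])
            k = fail[k - 1];
        if (h[i] == p[k])
            k++;
        if (k == m) {
            free(fail);
            return h + (i + 1 - m);
        }
    }

    free(fail);
    return NULL;
}
```

The better worst case costs one allocation of m table entries per call. Real code would probably want to report an allocation failure separately from "not found".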
L
8

You can use memmem if you are on a system that has it, like Linux (it is a GNU extension). It works like strstr, but operates on bytes and requires the lengths of both "strings", since it does not rely on null terminators.

#include <string.h>

void *memmem(const void *haystack, size_t haystacklen, const void *needle, size_t needlelen);
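
For illustration, here is a minimal sketch of calling it on a buffer like the one in the question. On glibc, _GNU_SOURCE has to be defined before any header is included for the declaration to be visible. find_offset is a hypothetical helper name, not part of any library; it only turns the returned pointer into an offset.

```c
#define _GNU_SOURCE   /* must come before any #include so glibc declares memmem */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of needle inside buf, or -1 if absent. */
long find_offset(const void *buf, size_t buflen,
                 const void *needle, size_t needlelen)
{
    const char *hit;

    assert(buf != NULL && needle != NULL);   /* precondition of this sketch */
    hit = memmem(buf, buflen, needle, needlelen);
    return hit ? (long)(hit - (const char *)buf) : -1L;
}
```

With the question's haystack stored in a char array hay (length taken as sizeof hay - 1 so the embedded nulls are counted), find_offset(hay, sizeof hay - 1, "has embedded", 12) should return 19.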
Lindahl answered 15/3, 2011 at 8:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.