strstr() for a string that is NOT null-terminated
Asked Answered
M

4

35

How do I do the in-place equivalent of strstr() for a counted string (i.e. not null-terminated) in C?

Midyear answered 21/12, 2011 at 3:11 Comment(10)
You'll have to write your own version.Bearberry
Which string isn't null-terminated? The string being searched, or the sub-string?Tova
@TimCooper: The one being searched (haystack).Midyear
@SethCarnegie: It's not exactly trivial... I could try KMP or something if I really need to, but I'd rather avoid it if I can.Midyear
You can steal the implementation of strnstr() from BSD. But look out for this bug: mikeash.com/pyblog/dont-use-strnstr.htmlSollie
@mkb: ... Thanks for the link, I'll take a look but that link makes me a bit nervous haha.Midyear
Have you looked at Boyer-moore string searching? It works quite fine without terminating \0s and is O(4*n).Necropolis
@moshbear: I had heard about it once but I know little about it and I'd totally forgotten it -- I'll look at that, thanks!Midyear
Yes, Boyer-Moore is very good. It's a bit more complicated than KMP, so if you don't find a ready-made implementation, it may not be worthwhile to cook one up yourself. If performance is really important, I would however recommend BM, especially if needles are long. (Best case, BM looks only at every mth character.)Fridafriday
glibc has memmem (needle and haystack both counted), I'm sure there will be a public domain implementation out there too.Deyo
F
8

If you're afraid of O(m*n) behaviour - basically, you needn't, such cases don't occur naturally - here's a KMP implementation I had lying around which I've modified to take the length of the haystack. Also a wrapper. If you want to do repeated searches, write your own and reuse the borders array.

No guarantees for bug-freeness, but it seems to still work.

int *kmp_borders(char *needle, size_t nlen){
    if (!needle) return NULL;
    int i, j, *borders = malloc((nlen+1)*sizeof(*borders));
    if (!borders) return NULL;
    i = 0;
    j = -1;
    borders[i] = j;
    while((size_t)i < nlen){
        while(j >= 0 && needle[i] != needle[j]){
            j = borders[j];
        }
        ++i;
        ++j;
        borders[i] = j;
    }
    return borders;
}

char *kmp_search(char *haystack, size_t haylen, char *needle, size_t nlen, int *borders){
    size_t max_index = haylen-nlen, i = 0, j = 0;
    while(i <= max_index){
        while(j < nlen && *haystack && needle[j] == *haystack){
            ++j;
            ++haystack;
        }
        if (j == nlen){
            return haystack-nlen;
        }
        if (!(*haystack)){
            return NULL;
        }
        if (j == 0){
            ++haystack;
            ++i;
        } else {
            do{
                i += j - (size_t)borders[j];
                j = borders[j];
            }while(j > 0 && needle[j] != *haystack);
        }
    }
    return NULL;
}

char *sstrnstr(char *haystack, char *needle, size_t haylen){
    if (!haystack || !needle){
        return NULL;
    }
    size_t nlen = strlen(needle);
    if (haylen < nlen){
        return NULL;
    }
    int *borders = kmp_borders(needle, nlen);
    if (!borders){
        return NULL;
    }
    char *match = kmp_search(haystack, haylen, needle, nlen, borders);
    free(borders);
    return match;
}
Fridafriday answered 21/12, 2011 at 5:23 Comment(0)
T
8

See if the function below works for you. I haven't tested it thoroughly, so I would suggest you do so.

char *sstrstr(char *haystack, char *needle, size_t length)
{
    size_t needle_length = strlen(needle);
    size_t i;
    for (i = 0; i < length; i++) {
        if (i + needle_length > length) {
            return NULL;
        }
        if (strncmp(&haystack[i], needle, needle_length) == 0) {
            return &haystack[i];
        }
    }
    return NULL;
}
Tova answered 21/12, 2011 at 3:22 Comment(7)
That's actually similar to what I'm currently using, but it's O(mn), whereas (I'm assuming) strstr is O(m + n). So I'm looking for something that's not ridiculously slow like my version. :-) But +1 anyway, since the idea works.Midyear
@Mehrdad: Might also be worth it to take a peek at this implementation: src.gnu-darwin.org/src/lib/libc/string/strnstr.c.htmlTova
Wow, I guess I was wrong then... so strstr is typically defined to be an O(mn) operation?? Thanks for pointing that out... then I'll probably accept this in a bit, since it's the exact substitute for the question.Midyear
@Mehrdad: I cleaned up my solution a bit, if you would like to take a look at it again.Tova
@Mehrdad C does not specify/define the O() of strstr().Weatherley
You could avoid the per-loop check for if (i + needle_length > length) by just reducing length appropriately so the for loop condition is correct. if (needle_length > length) return NULL;, then length -= needle_length;, and the for loop becomes for (i = 0; i <= length; i++) (changing < to <= to ensure you check the last possible position). The first if check is only necessary because you use size_t; using ssize_t would remove the need for that check. Good compiler might be smart enough to eliminate if for you, but why write complex code and hope if you don't need to.Imbecilic
It would also be possible to optimize an implementation like this by: 1) Using memcmp to avoid NUL checking of strncmp. 2) Using memchr to find the first character of the needle in the haystack, so you're not making a strncmp/memcmp function call for every character (likely reducing the number of calls by a factor of 50x-100x). Or skip memchr and just test the first character manually before calling strncmp/memcmp.Imbecilic
F
8

If you're afraid of O(m*n) behaviour - basically, you needn't, such cases don't occur naturally - here's a KMP implementation I had lying around which I've modified to take the length of the haystack. Also a wrapper. If you want to do repeated searches, write your own and reuse the borders array.

No guarantees for bug-freeness, but it seems to still work.

int *kmp_borders(char *needle, size_t nlen){
    if (!needle) return NULL;
    int i, j, *borders = malloc((nlen+1)*sizeof(*borders));
    if (!borders) return NULL;
    i = 0;
    j = -1;
    borders[i] = j;
    while((size_t)i < nlen){
        while(j >= 0 && needle[i] != needle[j]){
            j = borders[j];
        }
        ++i;
        ++j;
        borders[i] = j;
    }
    return borders;
}

char *kmp_search(char *haystack, size_t haylen, char *needle, size_t nlen, int *borders){
    size_t max_index = haylen-nlen, i = 0, j = 0;
    while(i <= max_index){
        while(j < nlen && *haystack && needle[j] == *haystack){
            ++j;
            ++haystack;
        }
        if (j == nlen){
            return haystack-nlen;
        }
        if (!(*haystack)){
            return NULL;
        }
        if (j == 0){
            ++haystack;
            ++i;
        } else {
            do{
                i += j - (size_t)borders[j];
                j = borders[j];
            }while(j > 0 && needle[j] != *haystack);
        }
    }
    return NULL;
}

char *sstrnstr(char *haystack, char *needle, size_t haylen){
    if (!haystack || !needle){
        return NULL;
    }
    size_t nlen = strlen(needle);
    if (haylen < nlen){
        return NULL;
    }
    int *borders = kmp_borders(needle, nlen);
    if (!borders){
        return NULL;
    }
    char *match = kmp_search(haystack, haylen, needle, nlen, borders);
    free(borders);
    return match;
}
Fridafriday answered 21/12, 2011 at 5:23 Comment(0)
S
5

I just came across this and I'd like to share my implementation. It think it quite fast a I don't have any subcalls.

It returns the index in the haystack where the needle is found or -1 if it was not found.

/* binary search in memory */
int memsearch(const char *hay, int haysize, const char *needle, int needlesize) {
    int haypos, needlepos;
    haysize -= needlesize;
    for (haypos = 0; haypos <= haysize; haypos++) {
        for (needlepos = 0; needlepos < needlesize; needlepos++) {
            if (hay[haypos + needlepos] != needle[needlepos]) {
                // Next character in haystack.
                break;
            }
        }
        if (needlepos == needlesize) {
            return haypos;
        }
    }
    return -1;
}
Subtile answered 5/5, 2015 at 12:50 Comment(1)
Should of went ahead and made it Boyer-Moore while you were at it ;)Topminnow
I
0

I used this method

int memsearch(char* dataset, int datasetLength, char* target, int targetLen){
    for(int i = 0; i < datasetLength; i++){
            if(dataset[i] == target[0]){
                    int found = 1;
                    for(int j = 0; j < targetLen; j++){
                            int k = i + j;
                            if(k >= datasetLength || target[j] != dataset[k]){
                                    found = 0;
                                    break;
                            }
                    }
                    if(found) return i;
            }
    }
    return -1;
}
Intersidereal answered 2/10, 2018 at 1:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.