How to check if a string starts with another string in C?
Asked Answered
R

10

115

Is there something like startsWith(str_a, str_b) in the standard C library?

It should take pointers to two strings that end with nullbytes, and tell me whether the first one also appears completely at the beginning of the second one.

Examples:

"abc", "abcdef" -> true
"abcdef", "abc" -> false
"abd", "abdcef" -> true
"abc", "abc"    -> true
Roundhouse answered 22/1, 2011 at 22:15 Comment(3)
I think your 3rd example should have a true result.Byyourleave
possible duplicate of #15515588Ulda
Does this answer your question? How to check if string starts with certain string in C?Average
S
93

Apparently there's no standard C function for this. So:

bool startsWith(const char *pre, const char *str)
{
    size_t lenpre = strlen(pre),
           lenstr = strlen(str);
    return lenstr < lenpre ? false : memcmp(pre, str, lenpre) == 0;
}

Note that the above is nice and clear, but if you're doing it in a tight loop or working with very large strings, it does not offer the best performance, as it scans the full length of both strings up front (strlen). Solutions like wj32's or Christoph's may offer better performance (although this comment about vectorization is beyond my ken of C). Also note Fred Foo's solution which avoids strlen on str (he's right, it's unnecessary if you use strncmp instead of memcmp). Only matters for (very) large strings or repeated use in tight loops, but when it matters, it matters.

Shaggy answered 22/1, 2011 at 22:15 Comment(7)
I should mention that the usual thing would be for the string to be the first parameter, and the prefix to the be second. But I kept them as above because the seemed to be how your question was framed... The order is entirely up to you, but I really should have done it the other way 'round -- most string functions take the full string as the first argument, the substring as the second.Shaggy
This is an elegant solution, but it does have some performance issues. An optimized implementation would never look at more than min(strlen(pre), strlen(str)) characters from each string, nor would it ever look beyond the first mismatch. If the strings were long, but early mismatches were common, it would be very lightweight. But since this implementation takes the full length of both strings right up front, it forces worst-case performance, even if the strings differ in the very first character. Whether this matters really depends on the circumstances, but it's a potential problem.Beseem
@TomKarzes: Absolutely, I've gotten spoiled by languages/environments where string length is a known value rather than one we have to go figure out. :-) wj32's solution offers much better performance. Only matters for (very) large strings or tight loops, but when it matters, it matters.Shaggy
@TomKarzes You can substitute memcmp for strncmp here and it's faster. There's no UB because both strings are known to have at least lenpre bytes. strncmp checks each byte of both strings for NUL, but the strlen calls already guaranteed that there aren't any. (But it still has the performance hit you mentioned, when pre or str are longer than the actual common initial sequence.)Illiberal
@JimBalter - Very good point! Since using memcmp above wouldn't be appropriating from another answer here, I went ahead and changed it in the answer.Shaggy
P.S. This (now) may be the fastest answer on some machines with some strings, because strlen and memcmp can be implemented with very fast hardware instructions, and the strlens may put the strings into the cache, avoiding a double memory hit. On such machines, strncmp could be implemented as two strlens and a memcmp just like this, but it would be risky for a library writer to do so, as that could take much longer on long strings with short common prefixes. Here that hit is explicit, and the strlens are only done once each (Fred Foo's strlen + strncmp would do 3).Illiberal
P.P.S. This is even more effective if the function is inlined and the length of one or more argument is already known -- e.g., a constant. Consider checking several different prefixes against the same string -- one strlen for the target string, plus a strlen (unless constant) and memcmp for each prefix (and not even that if the prefix is longer than the target).Illiberal
F
227

There's no standard function for this, but you can define

bool prefix(const char *pre, const char *str)
{
    return strncmp(pre, str, strlen(pre)) == 0;
}

We don't have to worry about str being shorter than pre because according to the C standard (7.21.4.4/2):

The strncmp function compares not more than n characters (characters that follow a null character are not compared) from the array pointed to by s1 to the array pointed to by s2."

Flung answered 22/1, 2011 at 22:17 Comment(3)
Why is the answer no? Clearly, the answer is yes, it's called strncmp.Honoria
^ It should be obvious why the answer is no. An algorithm that employs strncmp and strlen is not "called strncmp".Illiberal
It's not a direct startsWith() function that returns a boolean.Magnetize
S
93

Apparently there's no standard C function for this. So:

bool startsWith(const char *pre, const char *str)
{
    size_t lenpre = strlen(pre),
           lenstr = strlen(str);
    return lenstr < lenpre ? false : memcmp(pre, str, lenpre) == 0;
}

Note that the above is nice and clear, but if you're doing it in a tight loop or working with very large strings, it does not offer the best performance, as it scans the full length of both strings up front (strlen). Solutions like wj32's or Christoph's may offer better performance (although this comment about vectorization is beyond my ken of C). Also note Fred Foo's solution which avoids strlen on str (he's right, it's unnecessary if you use strncmp instead of memcmp). Only matters for (very) large strings or repeated use in tight loops, but when it matters, it matters.

Shaggy answered 22/1, 2011 at 22:15 Comment(7)
I should mention that the usual thing would be for the string to be the first parameter, and the prefix to the be second. But I kept them as above because the seemed to be how your question was framed... The order is entirely up to you, but I really should have done it the other way 'round -- most string functions take the full string as the first argument, the substring as the second.Shaggy
This is an elegant solution, but it does have some performance issues. An optimized implementation would never look at more than min(strlen(pre), strlen(str)) characters from each string, nor would it ever look beyond the first mismatch. If the strings were long, but early mismatches were common, it would be very lightweight. But since this implementation takes the full length of both strings right up front, it forces worst-case performance, even if the strings differ in the very first character. Whether this matters really depends on the circumstances, but it's a potential problem.Beseem
@TomKarzes: Absolutely, I've gotten spoiled by languages/environments where string length is a known value rather than one we have to go figure out. :-) wj32's solution offers much better performance. Only matters for (very) large strings or tight loops, but when it matters, it matters.Shaggy
@TomKarzes You can substitute memcmp for strncmp here and it's faster. There's no UB because both strings are known to have at least lenpre bytes. strncmp checks each byte of both strings for NUL, but the strlen calls already guaranteed that there aren't any. (But it still has the performance hit you mentioned, when pre or str are longer than the actual common initial sequence.)Illiberal
@JimBalter - Very good point! Since using memcmp above wouldn't be appropriating from another answer here, I went ahead and changed it in the answer.Shaggy
P.S. This (now) may be the fastest answer on some machines with some strings, because strlen and memcmp can be implemented with very fast hardware instructions, and the strlens may put the strings into the cache, avoiding a double memory hit. On such machines, strncmp could be implemented as two strlens and a memcmp just like this, but it would be risky for a library writer to do so, as that could take much longer on long strings with short common prefixes. Here that hit is explicit, and the strlens are only done once each (Fred Foo's strlen + strncmp would do 3).Illiberal
P.P.S. This is even more effective if the function is inlined and the length of one or more argument is already known -- e.g., a constant. Consider checking several different prefixes against the same string -- one strlen for the target string, plus a strlen (unless constant) and memcmp for each prefix (and not even that if the prefix is longer than the target).Illiberal
C
40

I'd probably go with strncmp(), but just for fun a raw implementation:

int starts_with(const char *restrict string, const char *restrict prefix)
{
    while(*prefix)
    {
        if(*prefix++ != *string++)
            return 0;
    }

    return 1;
}

Gives:

Which is expected since there is control flow in loop:

$ gcc -fopt-info-missed -std=c11  -O3 -c generic.c
generic.c:3:11: missed: couldn't vectorize loop
generic.c:3:11: missed: not vectorized: control flow in loop.
Coffee answered 22/1, 2011 at 23:45 Comment(6)
I like this best - there's no reason to scan either of the strings for a length.Byyourleave
I would probably go with strlen+strncmp too, but although it does in fact work, all the controversy over it's vague definition is putting me off. So I'll use this, thanks.Poly
This is likely to be slower than strncmp, unless your compiler is really good at vectorization, because glibc writers sure are :-)Bekha
This version should be faster than the strlen+strncmp version if the prefix doesn't match, especially if there are already differences in the first few characters.Sheela
If the string is constant, a good compiler knows its length already so this could again be slower...Macmillan
^That optimization would only apply if the function is inlined.Illiberal
O
5

Use strstr() function. Stra == strstr(stra, strb)

Reference

The strstr() function finds the first occurrence of string2 in string1. The function ignores the null character (\0) that ends string2 in the matching process.

https://www.ibm.com/docs/en/i/7.4?topic=functions-strstr-locate-substring

Oglesby answered 22/1, 2011 at 22:30 Comment(4)
that seems to be somewhat backwards way of doing it - you'll go through whole stra even though it should be clear from very short initial segment if strb is a prefix or not.Heptagonal
Premature optimization is the root of all evil. I think this is the best solution, if it is not time critical code or long strings.Valois
@ilw It's a famous saying by famous computer scientists -- google it. It's often misapplied (as it is here) ... see joshbarczak.com/blog/?p=580Illiberal
I'm with Frank personally. Unix Philosophy: clarity is better than cleverness.Magnetize
C
4

I'm no expert at writing elegant code, but...

int prefix(const char *pre, const char *str)
{
    char cp;
    char cs;

    if (!*pre)
        return 1;

    while ((cp = *pre++) && (cs = *str++))
    {
        if (cp != cs)
            return 0;
    }

    if (!cs)
        return 0;

    return 1;
}
Chihli answered 22/1, 2011 at 22:30 Comment(0)
N
3

I noticed the following function definition in the Linux Kernel. It returns true if str starts with prefix, otherwise it returns false.

/**
* strstarts - does @str start with @prefix?
* @str: string to examine
* @prefix: prefix to look for.
*/
bool strstarts(const char *str, const char *prefix)
{
     return strncmp(str, prefix, strlen(prefix)) == 0;
}
Niven answered 15/7, 2022 at 22:4 Comment(3)
How is this different from Fred Foo's answer, apart from the order of arguments?Propaganda
The obvious difference is that I provided a reference to the code that I did not write. The code was added to the Linux Kernel in 2009 [1], 2 years before Fred Foo's answer was posted. So you should question Fred Foo's answer, not mine. [1]: github.com/torvalds/linux/commit/…Niven
this solution is rather obvious, Linus was not the first to write it either. Note that Christoph's solution is simpler and probably more efficient, and the accepted solution is clumsy.Propaganda
A
1

Optimized (v.2. - corrected):

uint32 startsWith( const void* prefix_, const void* str_ ) {
    uint8 _cp, _cs;
    const uint8* _pr = (uint8*) prefix_;
    const uint8* _str = (uint8*) str_;
    while ( ( _cs = *_str++ ) & ( _cp = *_pr++ ) ) {
        if ( _cp != _cs ) return 0;
    }
    return !_cp;
}
Afoul answered 5/11, 2014 at 0:58 Comment(3)
voting negative: startsWith("\2", "\1") returns 1, startsWith("\1", "\1") also returns 1Roundhouse
This decision will not use optimisations in clang, since not use instrisincs.Braden
^ intrinsics don't help here, especially if the target string is much longer than the prefix.Illiberal
S
0

Because I ran the accepted version and had a problem with a very long str, I had to add in the following logic:

bool longEnough(const char *str, int min_length) {
    int length = 0;
    while (str[length] && length < min_length)
        length++;
    if (length == min_length)
        return true;
    return false;
}

bool startsWith(const char *pre, const char *str) {
    size_t lenpre = strlen(pre);
    return longEnough(str, lenpre) ? strncmp(str, pre, lenpre) == 0 : false;
}
Soluble answered 26/10, 2014 at 1:13 Comment(0)
B
0

Or a combination of the two approaches:

_Bool starts_with(const char *restrict string, const char *restrict prefix)
{
    char * const restrict prefix_end = prefix + 13;
    while (1)
    {
        if ( 0 == *prefix  )
            return 1;   
        if ( *prefix++ != *string++)
            return 0;
        if ( prefix_end <= prefix  )
            return 0 == strncmp(prefix, string, strlen(prefix));
    }  
}

EDIT: The code below does NOT work because if strncmp returns 0 it is not known if a terminating 0 or the length (block_size) was reached.

An additional idea is to compare block-wise. If the block is not equal compare that block with the original function:

_Bool starts_with_big(const char *restrict string, const char *restrict prefix)
{
    size_t block_size = 64;
    while (1)
    {
        if ( 0 != strncmp( string, prefix, block_size ) )
          return starts_with( string, prefix);
        string += block_size;
        prefix += block_size;
        if ( block_size < 4096 )
          block_size *= 2;
    }
}

The constants 13, 64, 4096, as well as the exponentiation of the block_size are just guesses. It would have to be selected for the used input data and hardware.

Barracuda answered 19/9, 2018 at 14:15 Comment(3)
These are good ideas. Note though that the first one is technically undefined behavior if the prefix is shorter than 12 bytes (13 including NUL) because the language standard does not define the result of calculating an address outside the string other than the immediately following byte.Illiberal
@JimBalter: Could you add a reference? If the pointer is dereferenced and is after the terminating 0 then the deferenced pointer value is undefined. But why should the address itself be undefined? It is just a calculation.Barracuda
There was a general bug however: The block_size incrementation must be after the pointer incrementation. Now fixed.Barracuda
P
0

I use this macro:

#define STARTS_WITH(string_to_check, prefix) (strncmp(string_to_check, prefix, ((sizeof(prefix) / sizeof(prefix[0])) - 1)) ? 0:((sizeof(prefix) / sizeof(prefix[0])) - 1))

It returns the prexif length if the string starts with the prefix. This length is evaluated compile time (sizeof) so there is no runtime overhead.

Procopius answered 1/9, 2021 at 6:20 Comment(1)
(I am quite certain that) optimizers would evaluate strlen for literals anyway so there is very little use for the sizeof. Additionally, hiding a sizeof (that only works properly if the respective object is in scope) within a macro is a very easy way to introduce horrendous bugs.Kipton

© 2022 - 2024 — McMap. All rights reserved.