split char string with multi-character delimiter in C
S

5

8

I want to split a char * string on a multi-character delimiter. I know that strtok() can split a string, but it only works with single-character delimiters.

I want to split the string on a substring such as "abc", or any other substring. How can that be achieved?

Savell answered 22/4, 2015 at 6:2 Comment(3)
possible duplicate of How to extract the string if we have have more than one delimiters?Epistrophe
I have one more query, how can I compare this str value in an if statement? for example if I have char *str = "abc" and I got a substring value from a long string and want to compare this substring value with *str: if(str == substr)Savell
got it, strcmp is used for this purpose! Thanks again everyone!Savell
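As the last comment notes, == on two char * values compares the addresses, not the characters. A minimal sketch of the comparison follows; the helper name same_text is made up for illustration and does not come from the thread.

```c
#include <string.h>

/* Hypothetical helper (not from the thread): returns 1 when the two
 * strings hold the same characters, 0 otherwise. */
int same_text(const char *a, const char *b) {
    /* a == b would only test whether both pointers hold the same address;
     * strcmp compares the characters and returns 0 when they all match. */
    return strcmp(a, b) == 0;
}
```

With char buf[] = "abc"; and const char *str = "abc";, the test buf == str is false because the two are distinct objects, while same_text(buf, str) returns 1.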
N
9

Finding the point at which the desired sequence occurs is pretty easy, since strstr does exactly that:

char str[] = "this is abc a big abc input string abc to split up";
char *pos = strstr(str, "abc");

So, at that point, pos points to the first location of abc in the larger string. Here's where things get a little ugly. strtok has a nasty design where it 1) modifies the original string, and 2) stores a pointer to the "current" location in the string internally.

If we didn't mind doing roughly the same, we could do something like this:

#include <stdio.h>
#include <string.h>

char *multi_tok(char *input, const char *delimiter) {
    static char *string;
    if (input != NULL)
        string = input;

    if (string == NULL)
        return string;

    char *end = strstr(string, delimiter);
    if (end == NULL) {
        char *temp = string;
        string = NULL;
        return temp;
    }

    char *temp = string;

    *end = '\0';
    string = end + strlen(delimiter);
    return temp;
}

This does work. For example:

int main() {
    char input [] = "this is abc a big abc input string abc to split up";

    char *token = multi_tok(input, "abc");

    while (token != NULL) {
        printf("%s\n", token);
        token = multi_tok(NULL, "abc");
    }
}

produces roughly the expected output:

this is
 a big
 input string
 to split up

Nonetheless, it's clumsy, difficult to make thread-safe (you have to make its internal string variable thread-local) and generally just a crappy design. Using (for one example) an interface something like strtok_r, we can fix at least the thread-safety issue:

typedef char *multi_tok_t;

char *multi_tok(char *input, multi_tok_t *string, char *delimiter) {
    if (input != NULL)
        *string = input;

    if (*string == NULL)
        return *string;

    char *end = strstr(*string, delimiter);
    if (end == NULL) {
        char *temp = *string;
        *string = NULL;
        return temp;
    }

    char *temp = *string;

    *end = '\0';
    *string = end + strlen(delimiter);
    return temp;
}

multi_tok_t init() { return NULL; }

int main() {
    multi_tok_t s=init();

    char input [] = "this is abc a big abc input string abc to split up";

    char *token = multi_tok(input, &s, "abc");

    while (token != NULL) {
        printf("%s\n", token);
        token = multi_tok(NULL, &s, "abc");
    }
}

I guess I'll leave it at that for now though--to get a really clean interface, we really want to reinvent something like coroutines, and that's probably a bit much to post here.
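Short of coroutines, one way to get a somewhat cleaner interface is to stop modifying the input and hand back each token as a pointer plus a length. This is only a sketch of that idea, not the method used above; the names span_iter, span_iter_init and span_next are invented here, and the delimiter is assumed to be non-empty.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical iterator state: all names here are illustrative. */
typedef struct {
    const char *cur;    /* start of the next token, NULL when exhausted */
    const char *delim;  /* multi-character delimiter, assumed non-empty */
} span_iter;

span_iter span_iter_init(const char *input, const char *delim) {
    span_iter it = { input, delim };
    return it;
}

/* Stores the next token's start and length in *start and *len.
 * Returns 1 if a token was produced, 0 when the input is exhausted.
 * The input string is never written to. */
int span_next(span_iter *it, const char **start, size_t *len) {
    if (it->cur == NULL)
        return 0;

    const char *end = strstr(it->cur, it->delim);
    *start = it->cur;
    if (end == NULL) {
        /* last token: runs to the end of the string */
        *len = strlen(it->cur);
        it->cur = NULL;
    } else {
        *len = (size_t)(end - it->cur);
        it->cur = end + strlen(it->delim);
    }
    return 1;
}
```

Each token can be printed with printf("%.*s\n", (int)len, start). Because nothing is overwritten, the input may even be a string literal, and the state lives in the caller's span_iter rather than in a static variable.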

Nupercaine answered 22/4, 2015 at 6:39 Comment(4)
How would one adopt this to do the same with an LPSTR string pointer? I know I can replace all the native string functions with their far pointer equivalents (_fstrlen etc.), but the input and output need to be LPSTR strings.Pinchcock
@TobiasTimpe: LPSTR is just an alias for char *.Nupercaine
Yes but I'm working on Win16 with LPSTR actually being a far pointer and I can't really convert the input to a normal char array because that would be too large.Pinchcock
@TobiasTimpe: The equivalence mostly works in both directions, so if you change all the instances of char * in this code to LPSTR, you're probably pretty close to having it work (but I haven't had a Win16 SDK installed for years, so I can't test that).Nupercaine
N
2

EDIT: Taking the suggestions from Alan and Sourav into account, I have written some basic code for this using strstr.

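#include <stdio.h>
#include <string.h>

int main(void)
{
  char str[] = "This is abc test abc string";

  char *in = str;
  const char *delim = "abc";
  char *token;

  do {
    /* find the next occurrence of the whole delimiter */
    token = strstr(in, delim);

    if (token)
      *token = '\0';   /* terminate the current piece */

    printf("%s\n", in);

    /* advance only if a delimiter was found; NULL + n is undefined */
    if (token)
      in = token + strlen(delim);

  } while (token != NULL);

  return 0;
}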
Normally answered 22/4, 2015 at 6:6 Comment(5)
You're right, but I think that's not what the OP wants. He wants to consider "abc" as a single delimiter. :-)Retinite
That's fine, but the first part of your answer may be misleading. Please consider removing that. :-)Retinite
Don't think the second solution will work either. strsep doesn't look just for "abc". It looks for any permutation of the characters in "abc". Try this string in your program as an example: "This is bac test ac string". Probably need to use strstr instead.Acyclic
@AlanAu [Just to add my two cents] ...and strsep() is not standard C, IMHO.Retinite
@AlanAu : Have implemented the logic using strstr, thanks for the inputs.Normally
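The point raised in the comments can be checked directly. The sketch below uses strtok rather than strsep, since strtok is standard C and treats its delimiter argument as a set of characters in the same way; the two function names are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Counts the pieces strtok() produces. strtok treats delims as a SET of
 * single characters, so "abc" means 'a' or 'b' or 'c'. Modifies buf. */
int count_strtok_pieces(char *buf, const char *delims) {
    int n = 0;
    for (char *t = strtok(buf, delims); t != NULL; t = strtok(NULL, delims))
        n++;
    return n;
}

/* Counts occurrences of the whole substring sub, using strstr(). */
int count_substring(const char *s, const char *sub) {
    int n = 0;
    size_t step = strlen(sub);
    if (step == 0)
        return 0;  /* an empty substring is not a usable delimiter */
    for (const char *p = strstr(s, sub); p != NULL; p = strstr(p + step, sub))
        n++;
    return n;
}
```

For the string "This is bac test ac string", count_substring finds no occurrence of "abc", yet count_strtok_pieces still cuts the string into three pieces, because "bac" and "ac" consist only of characters from the set.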
R
1

You can easily write your own parser using strstr() to achieve this. The basic algorithm may look like this:

  • use strstr() to find the first occurrence of the whole delimiter string
  • mark the index
  • copy from the start up to the marked index; that will be your expected token.
  • to parse the input for subsequent entries, advance the start of the string by token length + length of the delimiter string.
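The steps above can be sketched as follows. This is one possible reading of the list, not the answerer's own code; the function name next_piece and its parameters are assumptions. Each token is copied into a caller-supplied buffer, so the input is left unmodified.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the token starting at *cursor into out
 * (at most outsz - 1 characters, always NUL-terminated) and advances
 * *cursor past the delimiter. Returns 1 if a token was copied, 0 when
 * done. delim is assumed non-empty and outsz must be at least 1. */
int next_piece(const char **cursor, const char *delim, char *out, size_t outsz) {
    if (*cursor == NULL)
        return 0;

    /* find the whole delimiter and mark where it starts */
    const char *mark = strstr(*cursor, delim);
    size_t len = mark ? (size_t)(mark - *cursor) : strlen(*cursor);

    /* copy from the start up to the mark (truncating if out is too small) */
    size_t n = len < outsz - 1 ? len : outsz - 1;
    memcpy(out, *cursor, n);
    out[n] = '\0';

    /* advance by token length + delimiter length, or stop after the last token */
    *cursor = mark ? mark + strlen(delim) : NULL;
    return 1;
}
```

Starting with const char *cur = "one<>two<>three"; and calling next_piece(&cur, "<>", buf, sizeof buf) in a loop yields "one", "two" and "three", after which the function returns 0.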
Retinite answered 22/4, 2015 at 6:3 Comment(0)
I
1

I wrote a simple implementation that is thread-safe:

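#include <stdlib.h>
#include <string.h>

struct split_string {
    int len;
    char** str;
};
typedef struct split_string splitstr;

/* Splits string on every occurrence of delimiter (assumed non-empty).
 * The input is not modified; the caller frees each ret->str[i],
 * then ret->str, then ret. Returns NULL if allocation fails. */
splitstr* split(const char* string, const char* delimiter) {
    splitstr* ret = malloc(sizeof(splitstr));
    if (ret == NULL)
        return NULL;
    ret->str = NULL;
    ret->len = 0;
    const char* pos;
    const char* oldpos = string;
    size_t newsize;
    size_t dlen = strlen(delimiter);
    do {
        pos = strstr(oldpos, delimiter);
        if (pos) {
            newsize = (size_t)(pos - oldpos);
        } else {
            newsize = strlen(oldpos);
        }
        char* newstr = malloc(newsize + 1);
        char** grown = realloc(ret->str, (ret->len + 1) * sizeof(char*));
        if (grown != NULL)
            ret->str = grown;
        if (newstr == NULL || grown == NULL) {
            /* out of memory: release everything collected so far */
            free(newstr);
            for (int i = 0; i < ret->len; i++)
                free(ret->str[i]);
            free(ret->str);
            free(ret);
            return NULL;
        }
        memcpy(newstr, oldpos, newsize);
        newstr[newsize] = '\0';
        ret->str[ret->len++] = newstr;
        /* only advance when a delimiter was found; NULL + dlen is undefined */
        if (pos)
            oldpos = pos + dlen;
    } while (pos != NULL);
    return ret;
}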

To use:

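splitstr* ret = split(contents, "\n");
for (int i = 0; ret != NULL && i < ret->len; i++) {
    printf("Element %d: %s\n", i, ret->str[i]);
    free(ret->str[i]);   /* each token was allocated by split() */
}
if (ret != NULL)
    free(ret->str);
free(ret);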
Insanity answered 16/6, 2021 at 17:49 Comment(0)
H
0

A modified strsep implementation that supports multi-byte delimiters:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * Split a string into tokens
 * 
 * @in: The string to be searched
 * @delim: The string to search for as a delimiter
 */
char *strsep_m(char **in, const char *delim) {
  char *token = *in;

  if (token == NULL)
    return NULL;
    
  char *end = strstr(token, delim);
  
  if (end) {
    *end = '\0';
    end += strlen(delim);
  }
  
  *in = end;
  return token;
}

int main() {
  char input[] = "a##b##c";
  char delim[] = "##";
  char *token = NULL;
  char *cin = (char*)input;
  while ((token = strsep_m(&cin, delim)) != NULL) {
    printf("%s\n", token);
  }
}
Herder answered 6/3, 2022 at 11:59 Comment(1)
Unlike strtok(), the code will produce 3 tokens for "##foo##", which may or may not be expected.Hogg
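To illustrate the comment: for "##foo##" the function returns an empty token, then "foo", then another empty token. If strtok-like behaviour is wanted instead, one option is a small wrapper that skips empty tokens. In this sketch strsep_m is repeated from the answer so the block stands alone; the wrapper name next_nonempty is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Same logic as strsep_m in the answer, repeated for completeness. */
char *strsep_m(char **in, const char *delim) {
    char *token = *in;
    if (token == NULL)
        return NULL;
    char *end = strstr(token, delim);
    if (end) {
        *end = '\0';
        end += strlen(delim);
    }
    *in = end;
    return token;
}

/* Hypothetical wrapper: like strsep_m, but skips empty tokens, which is
 * closer to how strtok() treats leading, trailing and repeated delimiters. */
char *next_nonempty(char **in, const char *delim) {
    char *token;
    while ((token = strsep_m(in, delim)) != NULL) {
        if (*token != '\0')
            return token;
    }
    return NULL;
}
```

Whether empty tokens should be kept depends on the data: for CSV-like input they usually carry meaning, so the unwrapped strsep_m may be the better fit there.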
