Parse string into array based on spaces or "double quotes strings"

Asked 11/3, 2012 at 22:50 Answered 11/11, 2017 at 9:19

Im trying to take a user input string and parse is into an array called char *entire_line[100]; where each word is put at a different index of the array but if a part of the string is encapsulated by a quote, that should be put in a single index. So if I have

char buffer[1024]={0,};
fgets(buffer, 1024, stdin);

example input: "word filename.txt "this is a string that shoudl take up one index in an output array";

tokenizer=strtok(buffer," ");//break up by spaces
        do{
            if(strchr(tokenizer,'"')){//check is a word starts with a "
            is_string=YES;
            entire_line[i]=tokenizer;// if so, put that word into current index
            tokenizer=strtok(NULL,"\""); //should get rest of string until end "
            strcat(entire_line[i],tokenizer); //append the two together, ill take care of the missing space once i figure out this issue

              }  
        entire_line[i]=tokenizer;
        i++;
        }while((tokenizer=strtok(NULL," \n"))!=NULL);

This clearly isn't working and only gets close if the double quote encapsulated string is at the end of the input string but i could have input: word "this is text that will be user entered" filename.txt Been trying to figure this out for a while, always get stuck somewhere. thanks

Spawn answered 11/3, 2012 at 22:50 Comment(1)

Another thing you need to consider is the possibility that the data may need to include a double-quote character, so you might have to incorporate some sort of escaping of that character. Of course, it might be OK to put that support off until you have the basic quoted string stuff working, but it's something to keep in mind... – Curium 11/3, 2012 at 23:39

The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it's still not great due to lack of re-entrance and recursion ability, which is why we invented strsep for BSD way back when.)

Your best bet in this case is to build your own simple state-machine:

char *p;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;

for (p = buffer; *p != '\0'; p++) {
    c = (unsigned char) *p; /* convert to unsigned char for is* functions */
    switch (state) {
    case DULL: /* not in a word, not in a double quoted string */
        if (isspace(c)) {
            /* still not in a word, so ignore this char */
            continue;
        }
        /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
        if (c == '"') {
            state = IN_STRING;
            start_of_word = p + 1; /* word starts at *next* char, not this one */
            continue;
        }
        state = IN_WORD;
        start_of_word = p; /* word starts here */
        continue;

    case IN_STRING:
        /* we're in a double quoted string, so keep going until we hit a close " */
        if (c == '"') {
            /* word goes from start_of_word to p-1 */
            ... do something with the word ...
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_STRING or we handled the end above */

    case IN_WORD:
        /* we're in a word, so keep going until we get to a space */
        if (isspace(c)) {
            /* word goes from start_of_word to p-1 */
            ... do something with the word ...
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_WORD or we handled the end above */
    }
}

Note that this does not account for the possibility of a double quote inside a word, e.g.:

"some text in quotes" plus four simple words p"lus something strange"

Work through the state machine above and you will see that "some text in quotes" turns into a single token (that ignores the double quotes), but p"lus is also a single token (that includes the quote), something is a single token, and strange" is a token. Whether you want this, or how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool like flex.

Also, when the for loop exits, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (meaning there was no close-double-quote).

Vaccinate answered 11/3, 2012 at 23:40 Comment(2)

Out of curiosity, why call continue instead of break? If a developer were to add code after the switch they might be confused by the results because it's being short circuited by the continue calls. – Milano 17/12, 2016 at 23:40

@ktbiz: just a style preference. In real code I'd probably break when there's more work to do regardless of the state, and continue when it's time to move to the next input character, although the details always wind up varying. – Vaccinate 18/12, 2016 at 0:50

Torek's parts of parsing code are excellent but require little more work to use.

For my own purpose, I finished c function.
Here I share my work that is based on Torek's code.

#include <stdio.h>
#include <string.h>
#include <ctype.h>
size_t split(char *buffer, char *argv[], size_t argv_size)
{
    char *p, *start_of_word;
    int c;
    enum states { DULL, IN_WORD, IN_STRING } state = DULL;
    size_t argc = 0;

    for (p = buffer; argc < argv_size && *p != '\0'; p++) {
        c = (unsigned char) *p;
        switch (state) {
        case DULL:
            if (isspace(c)) {
                continue;
            }

            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1; 
                continue;
            }
            state = IN_WORD;
            start_of_word = p;
            continue;

        case IN_STRING:
            if (c == '"') {
                *p = 0;
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;

        case IN_WORD:
            if (isspace(c)) {
                *p = 0;
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;
        }
    }

    if (state != DULL && argc < argv_size)
        argv[argc++] = start_of_word;

    return argc;
}
void test_split(const char *s)
{
    char buf[1024];
    size_t i, argc;
    char *argv[20];

    strcpy(buf, s);
    argc = split(buf, argv, 20);
    printf("input: '%s'\n", s);
    for (i = 0; i < argc; i++)
        printf("[%u] '%s'\n", i, argv[i]);
}
int main(int ac, char *av[])
{
    test_split("\"some text in quotes\" plus four simple words p\"lus something strange\"");
    return 0;
}

See program output:

input: '"some text in quotes" plus four simple words p"lus something strange"'
[0] 'some text in quotes'
[1] 'plus'
[2] 'four'
[3] 'simple'
[4] 'words'
[5] 'p"lus'
[6] 'something'
[7] 'strange"'

Scalawag answered 13/11, 2014 at 16:29 Comment(2)

May I ask, why you mark *p = 0 ? it alters the original buffer – Queenqueena 7/4, 2020 at 3:5

To make to a null-terminated string from parsed token – Scalawag 7/4, 2020 at 13:34

I wrote a qtok function some time ago that reads quoted words from a string. It's not a state machine and it doesn't make you an array but it's trivial to put the resulting tokens into one. It also handles escaped quotes and trailing and leading spaces:

#include <stdio.h>
#include <ctype.h>
#include <assert.h>

// Strips backslashes from quotes
char *unescapeToken(char *token)
{
    char *in = token;
    char *out = token;

    while (*in)
    {
        assert(in >= out);

        if ((in[0] == '\\') && (in[1] == '"'))
        {
            *out = in[1];
            out++;
            in += 2;
        }
        else
        {
            *out = *in;
            out++;
            in++; 
        }
    }
    *out = 0;
    return token;
}

// Returns the end of the token, without chaning it.
char *qtok(char *str, char **next)
{
    char *current = str;
    char *start = str;
    int isQuoted = 0;

    // Eat beginning whitespace.
    while (*current && isspace(*current)) current++;
    start = current;

    if (*current == '"')
    {
        isQuoted = 1;
        // Quoted token
        current++; // Skip the beginning quote.
        start = current;
        for (;;)
        {
            // Go till we find a quote or the end of string.
            while (*current && (*current != '"')) current++;
            if (!*current) 
            {
                // Reached the end of the string.
                goto finalize;
            }
            if (*(current - 1) == '\\')
            {
                // Escaped quote keep going.
                current++;
                continue;
            }
            // Reached the ending quote.
            goto finalize; 
        }
    }
    // Not quoted so run till we see a space.
    while (*current && !isspace(*current)) current++;
finalize:
    if (*current)
    {
        // Close token if not closed already.
        *current = 0;
        current++;
        // Eat trailing whitespace.
        while (*current && isspace(*current)) current++;
    }
    *next = current;

    return isQuoted ? unescapeToken(start) : start;
}

int main()
{
    char text[] = "   \"some text in quotes\"    plus   four simple words p\"lus something strange\" \"Then some quoted \\\"words\\\", and backslashes: \\ \\ \"  Escapes only work insi\\\"de q\\\"uoted strings\\\"   ";

    char *pText = text;

    printf("Original: '%s'\n", text);
    while (*pText)
    {
        printf("'%s'\n", qtok(pText, &pText));
    }

}

Outputs:

Original: '   "some text in quotes"    plus   four simple words p"lus something strange" "Then some quoted \"words\", and backslashes: \ \ "  Escapes only work insi\"de q\"uoted strings\"   '
'some text in quotes'
'plus'
'four'
'simple'
'words'
'p"lus'
'something'
'strange"'
'Then some quoted "words", and backslashes: \ \ '
'Escapes'
'only'
'work'
'insi\"de'
'q\"uoted'
'strings\"'

Rechaba answered 18/11, 2015 at 12:37 Comment(0)

I think the answer to your question is actually fairly simple, but I'm taking on an assumption where it seems the other responses have taken a different one. I'm assuming that you want any quoted block of text to be separated out on its own regardless of spacing with the rest of the text being separated by spaces.

So given the example:

"some text in quotes" plus four simple words p"lus something strange"

The output would be:

[0] some text in quotes

[1] plus

[2] four

[3] simple

[4] words

[5] p

[6] lus something strange

Given that this is the case, only a simple bit of code is required, and no complex machines. You would first check if there is a leading quote for the first character and if so tick a flag and remove the character. As well as removing any quotes at the end of the string. Then tokenize the string based on quotation marks. Then tokenize every other of the strings obtained previously by spaces. Tokenize starting with the first string obtained if there was no leading quote, or the second string obtained if there was a leading quote. Then each of the remaining strings from the first part will be added to an array of strings interspersed with the strings from the second part added in place of the strings they were tokenized from. In this way you can get the result listed above. In code this would look like:

#include<string.h>
#include<stdlib.h>

char ** parser(char * input, char delim, char delim2){
    char ** output;
    char ** quotes;
    char * line = input;
    int flag = 0;
    if(strlen(input) > 0 && input[0] == delim){
        flag = 1;
        line = input + 1;
    }
    int i = 0;
    char * pch = strchr(line, delim);
    while(pch != NULL){
        i++;
        pch = strchr(pch+1, delim);
    }
    quotes = (char **) malloc(sizeof(char *)*i+1);
    char * token = strtok(input, delim);
    int n = 0;
    while(token != NULL){
        quotes[n] = strdup(token);
        token = strtok(NULL, delim);
        n++;
    }
    if(delim2 != NULL){
        int j = 0, k = 0, l = 0;
        for(n = 0; n < i+1; n++){
            if(flag & n % 2 == 1 || !flag & n % 2 == 0){
                char ** new = parser(delim2, NULL);
                l = sizeof(new)/sizeof(char *);
                for(k = 0; k < l; k++){
                    output[j] = new[k];
                    j++;
                }
                for(k = l; k > -1; k--){
                    free(new[n]);
                }
                free(new);
            } else {
                output[j] = quotes[n];
                j++;
            }
        }
        for(n = i; n > -1; n--){
            free(quotes[n]);
        }
        free(quotes);
    } else {
        return quotes;
    }
    return output;
}

int main(){
    char * input;
    char ** result = parser(input, '\"', ' ');

    return 0;
}

(May not be perfect, I haven't tested it)

Anthropology answered 11/11, 2017 at 9:19 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags