Obtaining zero-length string from strtok()

I have a CSV file containing data such as

value;name;test;etc

which I'm trying to split using strtok(string, ";"). However, the file can contain zero-length fields, like this:

value;;test;etc

which strtok() skips. Is there a way to stop strtok() from skipping zero-length fields like this?

Periwig answered 16/9, 2013 at 12:2 Comment(5)
Is strsep() available on your platform? The usage is very similar to strtok(), but it returns empty fields correctly.Chitin
@MartinR probably. I'm using Fedora w/ Linux 3.10.10.Periwig
So that could be an alternative. But even that would not handle delimiters inside quoted text like aaa;bbb;"ddd;eee";fff correctly.Chitin
@MartinR fortunately I don't need this functionality right now. I'm gonna try using strsep().Periwig
Which programming language are you using? You could include that in your title.Viscometer
C
9

A possible alternative is to use the BSD function strsep() instead of strtok(), if available. From the man page:

The strsep() function is intended as a replacement for the strtok() function. While the strtok() function should be preferred for portability reasons (it conforms to ISO/IEC 9899:1990 ("ISO C90")) it is unable to handle empty fields, i.e., detect fields delimited by two adjacent delimiter characters, or to be used for more than a single string at a time. The strsep() function first appeared in 4.4BSD.

A simple example (also copied from that man page):

char *token, *string, *tofree;

tofree = string = strdup("value;;test;etc");
while ((token = strsep(&string, ";")) != NULL)
    printf("token=%s\n", token);

free(tofree);

Output:

token=value
token=
token=test
token=etc

so empty fields are handled correctly.

Of course, as others have already said, none of these simple tokenizer functions handles delimiters inside quotation marks correctly, so if that is an issue, you should use a proper CSV parsing library.
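To illustrate what quote handling involves, here is a minimal sketch of a splitter that keeps empty fields and treats text inside double quotes as part of the field. The name split_quoted and its parameters are hypothetical, not part of any library. It strips the surrounding quotes, does not understand escaped or doubled quotes, and drops anything between a closing quote and the next delimiter, so it is no substitute for a real CSV parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place on delim and stores up to max
   field pointers in fields. A field that starts with '"' runs to the next
   '"', so delimiters inside the quotes are kept. Returns the number of
   fields stored. */
size_t split_quoted(char *line, char delim, char **fields, size_t max)
{
    assert(line != NULL && fields != NULL && delim != '\0');
    size_t n = 0;
    char *p = line;
    while (n < max) {
        char *end;
        if (*p == '"') {
            fields[n++] = ++p;            /* field starts after the opening quote */
            end = strchr(p, '"');
            if (end == NULL)
                break;                    /* unterminated quote: take the rest */
            *end++ = '\0';                /* cut the field at the closing quote */
            while (*end != '\0' && *end != delim)
                end++;                    /* skip to the next delimiter */
            if (*end == '\0')
                break;
        } else {
            fields[n++] = p;
            end = strchr(p, delim);
            if (end == NULL)
                break;                    /* last field */
        }
        *end = '\0';                      /* terminate the field */
        p = end + 1;                      /* continue after the delimiter */
    }
    return n;
}
```

Called on a writable copy of aaa;bbb;"ddd;eee";fff with ';' as the delimiter, this stores four fields, the third being ddd;eee.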

Chitin answered 16/9, 2013 at 12:50 Comment(0)
M
4

There is no way to make strtok() behave differently. From the man page:

A sequence of two or more contiguous delimiter bytes in the parsed string is considered to be a single delimiter. Delimiter bytes at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always nonempty strings.

But what you can do is look at the characters immediately before the returned token: strtok() overwrites the delimiter that ends a token with '\0', so walking backwards from the token and counting '\0' and delimiter characters tells you how many fields were skipped. Source info:

This end of the token is automatically replaced by a null-character, and the beginning of the token is returned by the function.

And a code sample to show what I mean.

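char* aStr = ...;
char* ptr = NULL;

ptr = strtok (...);

// Count the delimiters directly in front of the token (assumes ptr != NULL).
// strtok() writes '\0' only over the delimiter that ended the previous token
// and leaves skipped ones in place, so test for both (';' is the delimiter here).
char* back = ptr;
int count = -1; // the separator belonging to this token is not a skipped field
while (back > aStr && (back[-1] == '\0' || back[-1] == ';')) { // ==, not =
  back--;
  count++;
}
// count is now the number of empty fields skipped (it stays -1 for the first token)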

(Written without an IDE or testing, so the implementation may be wrong, but the idea stands.)

Mullens answered 16/9, 2013 at 12:12 Comment(3)
Sounds fair. I'm gonna try this approach.Periwig
I will be grateful for comments about downvotes, I'd like to correct this implementation if there's something wrong with it.Mullens
If strsep() isn't available, I like this approach but the solution has a few problems. First, as @CoolNamesAllTaken mentions, strtok() does not necessarily replace all delimiters with the end of string characters. So your while loop needs to check for the delimiter and null. Also your check for null is using assignment.Roping
C
2

No, you can't. From "man strtok":

A sequence of two or more contiguous delimiter characters in the parsed string is considered to be a single delimiter. Delimiter characters at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always nonempty strings.
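A small sketch makes the quoted behavior concrete. The helper name count_strtok_tokens is made up for this illustration; it simply counts how many tokens strtok() reports for a buffer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() reports for s.
   The buffer is modified in place, so pass a writable array,
   not a string literal. */
int count_strtok_tokens(char *s, const char *delims)
{
    assert(s != NULL && delims != NULL);
    int n = 0;
    for (char *tok = strtok(s, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

For a writable copy of "value;;test;etc" and the delimiter string ";", this returns 3 rather than 4, because the empty field between the two adjacent semicolons is never reported.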

You could also run into problems if your data contains the delimiter inside quotes or any other "escape".

I think the best solution is to get a CSV parsing library or write your own parsing function.
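If you do write your own, a splitter that keeps empty fields can be quite short. The following is a minimal sketch using only standard C (strcspn), so it does not depend on strsep(). The name split_fields and its parameters are assumptions for the example, and it does not handle quoting or escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place on any character in delims and
   stores up to max field pointers in fields. Empty fields are stored as
   pointers to empty strings. Returns the number of fields stored. */
size_t split_fields(char *line, const char *delims, char **fields, size_t max)
{
    assert(line != NULL && delims != NULL && fields != NULL);
    size_t n = 0;
    char *p = line;
    while (n < max) {
        size_t len = strcspn(p, delims);  /* length of the current field */
        fields[n++] = p;
        if (p[len] == '\0')
            break;                        /* reached the end of the line */
        p[len] = '\0';                    /* terminate the field */
        p += len + 1;                     /* step past the delimiter */
    }
    return n;
}
```

With a writable copy of "value;;test;etc" and ";" as the delimiter set, this stores four fields, the second of which is an empty string.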

Centigrade answered 16/9, 2013 at 12:14 Comment(2)
Writing my own parsing function was something I was trying to avoid so far.Periwig
Well, avoiding that is actually a good idea. There is this library, which is recommended in another StackOverflow thread: sourceforge.net/projects/libcsvCentigrade
C
0

From recent experience, it looks like strtok() does not necessarily replace all delimiters with the end of string characters, but rather replaces the first delimiter it finds with an end of string character and skips the following delimiters but leaves them in place.

This means that in the nominal case (no zero-length strings before delimiters), every call to strtok() after the first call to strtok() will return a pointer to a string that begins after a \0 character.

In the case where strtok() reads zero-length strings between delimiters, strtok() will return a pointer to a string that begins after a delimiter character that has not been replaced with \0.
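The buffer contents after tokenizing can be checked directly. This is a small sketch; the helper name last_strtok_token is invented for the illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: runs strtok() over s until it is exhausted and
   returns the last token found, or NULL if there were none. */
char *last_strtok_token(char *s, const char *delims)
{
    assert(s != NULL && delims != NULL);
    char *last = NULL;
    for (char *tok = strtok(s, delims); tok != NULL; tok = strtok(NULL, delims))
        last = tok;
    return last;
}
```

For a writable buffer holding "a,,b" and the delimiter string ",", the returned pointer refers to "b". The character just before it is still ',', and the one before that is the '\0' that replaced the first comma.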

Here is my solution for finding out whether strtok() has skipped a zero-length string between delimiters.

// Previous code is needed to point strtok() at a string and start ingesting from it.
char *field_string = strtok(NULL, ","); // the delimiter argument is a string, not a char
// This check is not valid after the first call to strtok() for a given buffer,
// because the previous character would lie outside the string's memory.
if (field_string == NULL) {
    // no more tokens
} else if (*(field_string - 1) == '\0') {
    // no delimiters were skipped
} else {
    // one or more delimiters were skipped
}
Czarist answered 26/8, 2022 at 9:16 Comment(1)
Answers must answer the question; you shouldn't post a comment on a different answer as an answer of its own.Examine
R
0

If strsep() isn't available, this is my solution, which builds on @Dariusz's approach. In my case I wanted to create an array of pointers to tokens with indexes that matched the field positions in the original string.

Also, I was using commas instead of semicolons as the delimiter. This is what my code looks like:

char *ptr = strtok(line, ",");
char *arr[20]; // Your max may vary.  Checking for overflows is an exercise for the reader
int i = 0;

while (ptr != NULL)
{
    // Go back 2 to check for extra delimiters, but never point before line
    char *back = (ptr - line >= 2) ? ptr - 2 : line;

    // Not all delimiters are converted to nulls, so check for commas too.
    // The bounds test comes first so *back is only read inside the buffer.
    while ((back > line) && (*back == '\0' || *back == ','))
    {
      arr[i++] = back; // Some of these may be pointers to comma strings
      back--;
    }
    arr[i++] = ptr;
    
    ptr = strtok(NULL, ",");
}
Roping answered 3/3, 2023 at 17:20 Comment(0)
