Escape character in C#'s Split()
I am parsing some delimiter-separated values, where ? is specified as the escape character in case the delimiter appears as part of one of the values.

For instance: if : is the delimiter and a certain field has the value 19:30, this needs to be written as 19?:30.

Currently, I use string[] values = input.Split(':'); in order to get an array of all values, but after learning about this escape character, this won't work anymore.

Is there a way to make Split take escape characters into account? I have checked the overload methods, and there does not seem to be such an option directly.
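To make the problem concrete, here is a minimal sketch of what the plain Split call does with an escaped delimiter. The class and method names (NaiveSplitDemo, NaiveSplit) and the sample input are made up for illustration.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every ':' and has no notion of the '?' escape character,
    // so an escaped delimiter ("?:") is still treated as a field boundary.
    public static string[] NaiveSplit(string input)
    {
        return input.Split(':');
    }
}
```

NaiveSplitDemo.NaiveSplit("start:19?:30:end") returns four elements (start, 19?, 30, end) where three were intended (start, 19:30, end).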

Sheepshearing answered 2/10, 2015 at 11:43 Comment(0)
E
17
string[] substrings = Regex.Split("aa:bb:00?:99:zz", @"(?<!\?):");

which gives:

aa
bb
00?:99
zz

Alternatively, since you will probably want to unescape ?: at some point anyway, replace that sequence in the input with a placeholder token, split, and then replace the placeholder with : in each value.

(This requires the System.Text.RegularExpressions namespace to be used.)
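Both ideas from this answer can be sketched roughly as follows. This assumes, as in the question, that ? is only ever used to escape the delimiter. The class name, method names and the placeholder character '\u0001' are illustrative choices, not something the question specifies, and the placeholder must never occur in real input.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedSplit
{
    // Split on ':' unless it is directly preceded by '?',
    // then turn each remaining "?:" back into a plain ':'.
    public static string[] SplitAndUnescape(string input)
    {
        return Regex.Split(input, @"(?<!\?):")
                    .Select(field => field.Replace("?:", ":"))
                    .ToArray();
    }

    // Alternative without Regex: swap "?:" for a placeholder, split normally,
    // then swap the placeholder back to ':'.
    // ASSUMPTION: '\u0001' never appears in the input.
    public static string[] SplitViaPlaceholder(string input)
    {
        const char placeholder = '\u0001';
        return input.Replace("?:", placeholder.ToString())
                    .Split(':')
                    .Select(field => field.Replace(placeholder, ':'))
                    .ToArray();
    }
}
```

For "aa:bb:00?:99:zz" both methods return aa, bb, 00:99, zz.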

Evanescent answered 2/10, 2015 at 11:48 Comment(12)
What about aa:bb:00??:99:zz? - Tulle
You would get "00??:9", which seems OK as there is no splittable token. - Evanescent
...unless you treat ?? as "escape of escape". - Tulle
It depends; if we assume ? is acting like \ in C, then the result should be 00?? and 9. - Transmutation
Serg makes a great point, but as far as I can tell from my specifications, escape characters never get escaped. It seems like they deliberately picked ? because they know that it will never appear as part of a value. - Sheepshearing
@LeeWhite If ? can never appear as part of a value, it would be far better to use it as the field delimiter, because you wouldn't need to worry about escaping -- or is it too late to do this? - Macegan
@Macegan Fair point, but even if this could be changed, I would not be in a position to have influence over this. - Sheepshearing
In that case, assuming no :: appears, you could use the vanilla split: foo.Replace(':', '?').Replace("??", ":").Split(new[] { '?' }); - Evanescent
@AlexK.: I just found out that there is a problem with the solution you gave. When the escape character does not occur, and @"[^?]:" is matched, the character prior to the delimiter is removed from the value. I'm guessing this is because the character matches [^?], hence is part of the split characters sequence. Do you happen to know a nice way to overcome this problem? - Sheepshearing
Can you give an example of this text? - Evanescent
Your example, for instance. It returns {"a", "b", "00?:9", "zz"}. - Sheepshearing
Whoops, missed that; updated with a negative look-behind instead. - Evanescent
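If ?? did need to be treated as an "escape of escape", as raised in the comments above, one possible sketch is to split only on a : that is preceded by an even number of ? characters, and then collapse every "? plus next character" pair. The asker's specification says this case never occurs, so this is purely illustrative. It relies on .NET supporting variable-length look-behind, and the class and field names are invented.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class DoubledEscapeSplit
{
    // ':' counts as a delimiter only when preceded by an even number
    // (including zero) of '?' characters.
    private static readonly Regex Delimiter =
        new Regex(@"(?<=(?<!\?)(?:\?\?)*):");

    // '?' followed by any single character stands for that character.
    private static readonly Regex EscapePair =
        new Regex(@"\?(.)", RegexOptions.Singleline);

    public static string[] Split(string input)
    {
        return Delimiter.Split(input)
                        .Select(field => EscapePair.Replace(field, "$1"))
                        .ToArray();
    }
}
```

Under these rules "aa:bb:00??:99:zz" splits into aa, bb, 00?, 99, zz, while "aa:bb:00?:99:zz" still gives aa, bb, 00:99, zz.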
G
1

This kind of stuff is always fun to code without using Regex.

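The following does the trick, with one caveat: the escape character always escapes. There is no logic to check that it is followed by something valid to escape, such as ?;, and whatever character follows it is dropped from the output (unless that character is another escape character, which is kept). So the string one?two;three??;four?;five will be split into onewo, three?, fourfive.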

 public static IEnumerable<string> Split(this string text, char separator, char escapeCharacter, bool removeEmptyEntries)
 {
     string buffer = string.Empty;
     bool escape = false;

     foreach (var c in text)
     {
         if (!escape && c == separator)
         {
             if (!removeEmptyEntries || buffer.Length > 0)
             {
                 yield return buffer;
             }

             buffer = string.Empty;
         }
         else
         {
             if (c == escapeCharacter)
             {
                 escape = !escape;

                 if (!escape)
                 {
                     buffer = string.Concat(buffer, c);
                 }
             }
             else
             {
                 if (!escape)
                 {
                     buffer = string.Concat(buffer, c);
                 }

                 escape = false;
             }
         }
     }

     if (buffer.Length != 0)
     {
         yield return buffer;
     }
 }
Groscr answered 2/10, 2015 at 12:55 Comment(1)
And also usually incorrect. buffer should be StringBuilder in order not to create loads of temporary string instances, and ?: should append : instead of ignoring it, and empty substrings are treated incorrectly. - Aneurysm
R
1

Based on InBetween's answer, but using a StringBuilder and making sure that whatever character comes directly after the escape character is included instead of removed.

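// Needs: using System.Collections.Generic; using System.Text;
// As an extension method, it must be declared inside a static class.
public static IEnumerable<string> Split(this string text, char separator, char escapeCharacter, bool removeEmptyEntries)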
{
    var buffer = new StringBuilder();
    bool escape = false;

    foreach (var c in text)
    {
        if (!escape && c == separator)
        {
            if (!removeEmptyEntries || buffer.Length > 0)
            {
                yield return buffer.ToString();
            }

            buffer.Clear();
        }
        else
        {
            if (c == escapeCharacter && !escape)
            {
                escape = true;
            }
            else
            {
                buffer.Append(c);
                escape = false;
            }
        }
    }

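    // Emit the last field; a trailing empty field is kept unless removeEmptyEntries is set.
    if (!removeEmptyEntries || buffer.Length > 0)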
    {
        yield return buffer.ToString();
    }
}

Which means that Split("aa:bb:00?:99:zz", ':', '?', false); gives you:

aa
bb
00:99
zz

and Split("Comma here:\\,,no comma:,slash:\\\\,slash and comma:\\\\\\,", ',', '\\', false); gives you:

Comma here:,
no comma:
slash:\
slash and comma:\,
Revisal answered 19/4, 2024 at 15:7 Comment(0)
T
-2

No, there's no way to do that with Split itself. You will need to use a regex (the exact pattern depends on how you want your "escape character" to behave). In the worst case, I suppose you'll have to do the parsing manually.

Tulle answered 2/10, 2015 at 11:49 Comment(1)
You can comment with 4k rep, or give an answer. - Singe
