Is there a lazy `String.Split` in C#

All string.Split methods seem to return an array of strings (string[]).

I'm wondering if there is a lazy variant that returns an IEnumerable<string>, so that for large strings (or an infinite-length IEnumerable<char>), when one is only interested in the first few substrings, one saves computational effort as well as memory. It could also be useful if the string is produced by a device or program (network, terminal, pipes), so that the entire string is not necessarily available right away and one can already process the first occurrences.

Is there such a method in the .NET Framework?

Rigorous answered 27/1, 2015 at 19:33 Comment(2)
C# does not have a standard library. You seem to be referring to the .NET Framework, which is not specific to C#, VB.NET or any other particular language. - Brebner
@JohnSaunders: modified... - Rigorous
B
6

You could easily write one:

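using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Note: the built-in instance method string.Split(params char[]) takes precedence
    // over an extension method with the same signature, so call this as
    // StringExtensions.Split(text, ',') or give it a different name.
    public static IEnumerable<string> Split(this string toSplit, params char[] splits)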
    {
        if (string.IsNullOrEmpty(toSplit))
            yield break;

        StringBuilder sb = new StringBuilder();

        foreach (var c in toSplit)
        {
            if (splits.Contains(c))
            {
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
    }
}

Clearly, I haven't tested it for parity with string.Split, but I believe it should work just about the same.

As Servy notes, this doesn't split on strings. That's not as simple, and not as efficient, but it's basically the same pattern.

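// Belongs in the same static class; needs System, System.Linq and System.Text.
public static IEnumerable<string> Split(this string toSplit, string[] separators)
{
    if (string.IsNullOrEmpty(toSplit))
        yield break;

    StringBuilder sb = new StringBuilder();
    foreach (var c in toSplit)
    {
        // Append first, so the character after a separator is not dropped.
        sb.Append(c);

        var s = sb.ToString();
        // A separator is complete when the buffer ends with it.
        var sep = separators.FirstOrDefault(
            i => i.Length > 0 && s.EndsWith(i, StringComparison.Ordinal));
        if (sep != null)
        {
            // Yield the buffer without the trailing separator.
            yield return s.Substring(0, s.Length - sep.Length);
            sb.Clear();
        }
    }

    if (sb.Length > 0)
        yield return sb.ToString();
}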
Bolide answered 27/1, 2015 at 19:51 Comment(2)
This only splits on characters, not strings. - Simons
Note that when given an empty string, this will produce an empty enumerable, whereas string.Split returns a string[1] { "" } (an array with an empty string). - Rafflesia
C
4

There is no such thing built-in. Regex.Matches is lazy if I interpret the decompiled code correctly. Maybe you can make use of that.
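
A minimal sketch of the Regex.Matches idea, assuming the laziness described above holds. RegexTokenizer.Tokens is a hypothetical name, not a framework API. Instead of splitting on separators it matches the tokens themselves, so empty entries are skipped (comparable to StringSplitOptions.RemoveEmptyEntries). Avoid reading Count on the MatchCollection, since that forces all matches to be found.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizer
{
    // Hypothetical helper: yields each match of tokenPattern as it is found.
    // tokenPattern describes a token, e.g. "[^,]+" for comma-separated input.
    public static IEnumerable<string> Tokens(string input, string tokenPattern)
    {
        // Enumerating the MatchCollection finds matches one at a time.
        foreach (Match m in Regex.Matches(input, tokenPattern))
        {
            yield return m.Value;
        }
    }
}
```

With this, RegexTokenizer.Tokens(text, "[^,]+").First() should only scan as far as the end of the first token.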

Or, you simply write your own split function.

Actually, you could imagine most string functions generalized to arbitrary sequences. Often even to sequences of T, not just char. The BCL does not emphasize that generalization at all. There is no Enumerable.Subsequence, for example.
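
To illustrate that generalization, here is a sketch of a split over any IEnumerable<T>, which also covers an unbounded IEnumerable<char>. SequenceExtensions and SplitSequence are made-up names for this example. Each chunk is handed out as soon as its closing separator has been read, so the rest of the source is never touched unless the caller asks for more.

```csharp
using System;
using System.Collections.Generic;

public static class SequenceExtensions
{
    // Hypothetical generic split: cuts the source at every element for which
    // isSeparator returns true and yields the chunks in between.
    public static IEnumerable<IReadOnlyList<T>> SplitSequence<T>(
        this IEnumerable<T> source, Func<T, bool> isSeparator)
    {
        var chunk = new List<T>();
        foreach (var item in source)
        {
            if (isSeparator(item))
            {
                // Hand out the finished chunk and start a fresh list,
                // so chunks already yielded are never modified.
                yield return chunk;
                chunk = new List<T>();
            }
            else
            {
                chunk.Add(item);
            }
        }

        // The tail after the last separator (possibly empty),
        // mirroring string.Split without RemoveEmptyEntries.
        yield return chunk;
    }
}
```

For a string, text.SplitSequence(c => c == ',') yields lists of char; new string(chunk.ToArray()) turns a chunk back into a string.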

Conscionable answered 27/1, 2015 at 19:36 Comment(9)
I wish .NET had included an "immutable array of T" type; String could then simply be shorthand for "immutable array of char". I know there are many times I would have used "immutable array of Byte" or "immutable array of Int32" if they existed, and would expect generalization would be useful in many other cases as well. - Headwards
@supercat: true, that's how Haskell handles strings. It enables generalizing a lot of string methods to lists... - Rigorous
@Headwards There's IReadOnlyList<T>. A string could be an IReadOnlyList<char>, it's just that it wasn't around in .NET 1.0. - Simons
@Servy: but if I recall correctly, in .NET 1.0 a string wasn't an IEnumerable<char> either. So one can slightly modify the design I guess? - Rigorous
@Simons a string can't have virtual methods. That would allow for arbitrary change of semantics. That's a very brittle model for such a fundamental type. Also, under the old CAS security model that would open up all kinds of holes. - Conscionable
@Conscionable string is sealed. Even if it had virtual methods, you couldn't inherit from it. You could also avoid having virtual methods by explicitly implementing the interface. You could also do other things like create an implicit conversion to that type, in which you returned an internal wrapper around the char[] that did implement that interface, without having string implement the interface. - Simons
@Simons I was talking about potentially using IReadOnlyList<char> instead of string. A hypothetical scenario. I thought we were talking about that. BCL code could never accept an IReadOnlyList<char> instead of a sealed string. - Conscionable
@Conscionable I thought you just meant having string implement IReadOnlyList<char> so that you could treat it as a list when you wanted to. - Simons
@Servy: The IReadOnlyList<T> interface is rather anemic, and provides neither a promise of immutability nor an efficient means of exporting a range of items to an array. Code which receives a String can safely assume its contents won't change, but there's no nice equivalent for a sequence of Byte or a sequence of Int32. - Headwards
C
4

Nothing built-in, but feel free to rip my Tokenize method:

 /// <summary>
/// Splits a string into tokens.
/// </summary>
/// <param name="s">The string to split.</param>
/// <param name="isSeparator">
/// A function testing if a code point at a position
/// in the input string is a separator.
/// </param>
/// <returns>A sequence of tokens.</returns>
IEnumerable<string> Tokenize(string s, Func<string, int, bool> isSeparator = null)
{
    if (isSeparator == null) isSeparator = (str, i) => !char.IsLetterOrDigit(str, i);

    int startPos = -1;

    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        if (!isSeparator(s, i))
        {
            if (startPos == -1) startPos = i;
        }
        else if (startPos != -1)
        {
            yield return s.Substring(startPos, i - startPos);
            startPos = -1;
        }
    }

    if (startPos != -1)
    {
        yield return s.Substring(startPos);
    }
}
Complemental answered 27/1, 2015 at 19:45 Comment(0)
P
1

There is no built-in method to do this as far as I know. But that doesn't mean you can't write one. Here is a sample to give you an idea:

public static IEnumerable<string> SplitLazy(this string str, params char[] separators)
{
    List<char> temp = new List<char>();
    foreach (var c in str)
    {
        if (separators.Contains(c))
        {
            // Emit the pending token; a separator is never added to the buffer,
            // so leading or repeated separators produce no entries.
            if (temp.Any())
            {
                yield return new string(temp.ToArray());
                temp.Clear();
            }
        }
        else
        {
            temp.Add(c);
        }
    }
    if(temp.Any()) { yield return new string(temp.ToArray()); }
}

Of course this doesn't handle all cases and can be improved.

Passade answered 27/1, 2015 at 19:45 Comment(1)
This only splits on characters, not strings. - Simons
L
1

I wrote this variant, which also supports StringSplitOptions and count. It behaved the same as string.Split in all test cases I tried. The nameof operator is specific to C# 6 and can be replaced by "count".

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    /// <summary>
    /// Splits a string into substrings that are based on the characters in an array. 
    /// </summary>
    /// <param name="value">The string to split.</param>
    /// <param name="options"><see cref="StringSplitOptions.RemoveEmptyEntries"/> to omit empty array elements from the array returned; or <see cref="StringSplitOptions.None"/> to include empty array elements in the array returned.</param>
    /// <param name="count">The maximum number of substrings to return.</param>
    /// <param name="separator">A character array that delimits the substrings in this string, an empty array that contains no delimiters, or null. </param>
    /// <returns></returns>
    /// <remarks>
    /// Delimiter characters are not included in the elements of the returned array. 
    /// If this instance does not contain any of the characters in separator the returned sequence consists of a single element that contains this instance.
    /// If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the <see cref="Char.IsWhiteSpace"/> method.
    /// </remarks>
    public static IEnumerable<string> SplitLazy(this string value, int count = int.MaxValue, StringSplitOptions options = StringSplitOptions.None, params char[] separator)
    {
        if (count <= 0)
        {
            if (count < 0) throw new ArgumentOutOfRangeException(nameof(count), "Count cannot be less than zero.");
            yield break;
        }

        Func<char, bool> predicate = char.IsWhiteSpace;
        if (separator != null && separator.Length != 0)
            predicate = (c) => separator.Contains(c);

        if (string.IsNullOrEmpty(value) || count == 1 || !value.Any(predicate))
        {
            yield return value;
            yield break;
        }

        bool removeEmptyEntries = (options & StringSplitOptions.RemoveEmptyEntries) != 0;
        int ct = 0;
        var sb = new StringBuilder();
        for (int i = 0; i < value.Length; ++i)
        {
            char c = value[i];
            if (!predicate(c))
            {
                sb.Append(c);
            }
            else
            {
                if (sb.Length != 0)
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
                else
                {
                    if (removeEmptyEntries)
                        continue;
                    yield return string.Empty;
                }

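                if (++ct >= count - 1)
                {
                    // The last entry is the unsplit remainder of the string.
                    if (removeEmptyEntries)
                        while (++i < value.Length && predicate(value[i]));
                    else
                        ++i;
                    if (i < value.Length)
                    {
                        // Also covers a remainder that is a single character.
                        sb.Append(value, i, value.Length - i);
                        yield return sb.ToString();
                    }
                    else if (!removeEmptyEntries)
                    {
                        // The string ended with a separator.
                        yield return string.Empty;
                    }
                    yield break;
                }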
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
        else if (!removeEmptyEntries && predicate(value[value.Length - 1]))
            yield return string.Empty;
    }

    public static IEnumerable<string> SplitLazy(this string value, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, StringSplitOptions.None, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, StringSplitOptions options, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, options, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, int count, params char[] separator)
    {
        return value.SplitLazy(count, StringSplitOptions.None, separator);
    }
}
Leslee answered 23/11, 2015 at 10:24 Comment(0)
F
0

I wanted the functionality of Regex.Split, but in a lazily evaluated form. The code below just runs through all matches in the input string and, for patterns without capturing groups, produces the same results as Regex.Split:

public static IEnumerable<string> Split(string input, string pattern, RegexOptions options = RegexOptions.None)
{
    // Always compile - we expect many executions
    var regex = new Regex(pattern, options | RegexOptions.Compiled);

    int currentSplitStart = 0;
    var match = regex.Match(input);

    while (match.Success)
    {
        yield return input.Substring(currentSplitStart, match.Index - currentSplitStart);

        currentSplitStart = match.Index + match.Length;
        match = match.NextMatch();
    }

    yield return input.Substring(currentSplitStart);
}

Note that using this with the pattern parameter @"\s" will give you the same results as string.Split().

Frons answered 17/10, 2017 at 1:1 Comment(1)
Just a note for readers: when using this code in production, move the Regex definition out of the method scope. Otherwise regex compilation will occur on every Split execution. - Galba
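A sketch of what the comment above suggests, assuming a fixed whitespace pattern: the Regex is built once in a static field and reused by every call. WhitespaceSplitter is a hypothetical name; the loop is the same as in the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WhitespaceSplitter
{
    // Constructed (and compiled) once, not on every call to Split.
    private static readonly Regex Separator = new Regex(@"\s", RegexOptions.Compiled);

    public static IEnumerable<string> Split(string input)
    {
        int start = 0;
        for (Match m = Separator.Match(input); m.Success; m = m.NextMatch())
        {
            // Text between the previous separator and this one.
            yield return input.Substring(start, m.Index - start);
            start = m.Index + m.Length;
        }

        // Remainder after the last separator.
        yield return input.Substring(start);
    }
}
```

If the pattern has to stay a parameter, an overload that accepts an already constructed Regex would serve the same purpose.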
F
0

A lazy split that does not create temporary strings.

Each chunk of the string is copied with a single call to the mscorlib method String.Substring.

public static IEnumerable<string> LazySplit(this string source, StringSplitOptions stringSplitOptions, params string[] separators)
{
    var sourceLen = source.Length;
    var removeEmpty = (stringSplitOptions & StringSplitOptions.RemoveEmptyEntries) != 0;

    // True if the given separator starts at position index of source.
    bool IsSeparator(int index, string separator)
    {
        var separatorLen = separator.Length;

        // An empty separator never matches (it would otherwise loop forever).
        if (separatorLen == 0 || sourceLen < index + separatorLen)
        {
            return false;
        }

        for (var i = 0; i < separatorLen; i++)
        {
            if (source[index + i] != separator[i])
            {
                return false;
            }
        }

        return true;
    }

    var indexOfStartChunk = 0;

    for (var i = 0; i < source.Length; i++)
    {
        foreach (var separator in separators)
        {
            if (IsSeparator(i, separator))
            {
                // Skip empty chunks only when RemoveEmptyEntries is requested.
                if (i > indexOfStartChunk || !removeEmpty)
                {
                    yield return source.Substring(indexOfStartChunk, i - indexOfStartChunk);
                }

                i += separator.Length;
                indexOfStartChunk = i--;
                break;
            }
        }
    }

    // Always emit the tail, so a source without any separator yields itself.
    if (indexOfStartChunk < sourceLen || !removeEmpty)
    {
        yield return source.Substring(indexOfStartChunk);
    }
}
Follmer answered 29/5, 2018 at 17:3 Comment(0)