Best way to split string into lines
D

13

190

How do you split a multi-line string into lines?

I know this way

var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

looks a bit ugly and loses empty lines. Is there a better solution?

Diocletian answered 2/10, 2009 at 7:49 Comment(3)
Possible duplicate of Easiest way to split a string on newlines in .NET?Zygapophysis
Yes, you use the exact line delimiter present in the file, e.g. just "\r\n" or just "\n" rather than using either \r or \n and ending up with a load of blank lines on windows-created files. What system uses LFCR line endings, btw?Acroterion
@CaiusJard LFCR is used in RISC OS... It was used in some early microcomputers of the late 70s and early 80s, but it does not seem relevant anymore.Rangy
P
226
  • If it looks ugly, just remove the unnecessary ToCharArray call.

  • If you want to split by either \n or \r, you've got two options:

    • Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:

      var result = text.Split(new [] { '\r', '\n' });
      
    • Use a regular expression, as indicated by Bart:

      var result = Regex.Split(text, "\r\n|\r|\n");
      
  • If you want to preserve empty lines, why do you explicitly tell C# to throw them away? (StringSplitOptions parameter) – use StringSplitOptions.None instead.
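To make the difference between the two options concrete, here is a minimal sketch that splits the same Windows-style text both ways. The class and method names (SplitComparison, ByChars, ByRegex) are made up for illustration; only String.Split and Regex.Split come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class SplitComparison
{
    // Splits on every single '\r' or '\n', so "\r\n" counts as two separators.
    public static string[] ByChars(string text) =>
        text.Split(new[] { '\r', '\n' });

    // The alternation tries "\r\n" first, so a Windows line ending is one separator.
    public static string[] ByRegex(string text) =>
        Regex.Split(text, "\r\n|\r|\n");

    public static void Main()
    {
        var text = "a\r\n\r\nb"; // "a", one empty line, "b"
        Console.WriteLine(ByChars(text).Length); // 5
        Console.WriteLine(ByRegex(text).Length); // 3
    }
}
```

The char-array version reports three empty entries where the text has only one empty line, which is the "empty lines for Windows-style line endings" problem mentioned in the first option.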

Prolocutor answered 2/10, 2009 at 7:53 Comment(26)
Removing ToCharArray will make code platform-specific (NewLine can be '\n')Diocletian
@Kon you should use Environment.NewLine if that is your concern. Or do you mean the origin of the text, rather than the location of execution?Incendiarism
@Will: on the off chance that you were referring to me instead of Konstantin: I believe (strongly) that parsing code should strive to work on all platforms (i.e. it should also read text files that were encoded on different platforms than the executing platform). So for parsing, Environment.NewLine is a no-go as far as I’m concerned. In fact, of all the possible solutions I prefer the one using regular expressions since only that handles all source platforms correctly.Prolocutor
lol didn't notice the name similarity. I agree completely in this case.Incendiarism
@Hamish Well just look at the documentation of the enum, or look in the original question! It’s StringSplitOptions.RemoveEmptyEntries.Prolocutor
Ah I see, my bad, I was looking within RegexOptions; have not had my coffee yet.Kironde
How about the text that contains '\r\n\r\n'. string.Split will return 4 empty lines, however with '\r\n' it should give 2. It gets worse if '\r\n' and '\r' are mixed in one file.Sanctuary
@SurikovPavel Use the regular expression. That is definitely the preferred variant, as it works correctly with any combination of line endings.Prolocutor
A minor point - I usually go with the verbatim string literal in the second argument to Regex.Split, i.e. - var result = Regex.Split(text, @"\r\n|\r|\n"); In this case it works either way because the C# compiler interprets \n and \r in the same way that the regular expression parser does. In the general case though it might cause problems.Brackish
Just adding my 2c worth. Since the OP wants to keep blank lines, you can't write a parser that works for any type of environment and/or handles mixed cases (i.e. the RegEx), because if you have '\n\r' how do you know it's one 'break' instead of two that are just encoded wrong? If it's the latter, it would be two blank lines, but if it's the former, it would only be one. You have to ask what is the source of the encodings. If the source is on the same platform as the parser (regardless of what platform it is) then you can use Environment.NewLine as the source is known.Glaydsglaze
@MarqueIV There are different possible answers to this, all valid. One is to expect and require consistent text files. Another one is to not accept "\r" on its own as a line separator (because, let’s face it, no system has used this convention in well over a decade): the only actually used conventions are "\r\n" and "\n". In fact, your example ("\n\r") has never been a valid line break anywhere. Either read it as two line breaks or throw an error, but certainly don’t treat it as a single line break.Prolocutor
First things first, my text was a typo. Use '\r\n' and my point is still the same: you can't write a universal parser on a system if you're required to keep blank lines. Note that by adding the restriction that you're not accepting '\r' by itself, and you only want to use '\n' to detect new lines, you no longer have a universal parser, essentially proving my point that without such limitations, it can't (easily*) be done, and chances are it doesn't need to be in the first place. (*It can, by playing with RegEx ordering and such, but that just makes it much slower.)Glaydsglaze
@MarqueIV I think you misread my comment: since "\r" is never used as a delimiter, you can easily write a universal parser that accepts all actually used delimiters. It’s done by simply splitting on "\r\n|\n". There’s no need for anything more fancy than that. But, honestly, in practice there’s nothing wrong with the regex code shown in my answer, and it will work just fine with a file that mixes different styles of line breaks, including the obsolete "\r".Prolocutor
If you have input that has mixed styles like you said, there's no way to differentiate between '\n\r' and '\n' and '\r' without making the assumption that there will never be an '\r', and when you make that assumption, then you've removed the condition that I just mentioned that causes the ambiguity. Plus, you can't make that assumption anyway as there are plenty of embedded hardware systems that use '\r'. That's why terminals give you three choices for line breaks. You need to know your input up front. I guess we'll just have to disagree and each use what works for us.Glaydsglaze
@MarqueIV That’s why my previous comment says “in practice” it works. You’re arguing from a pretty unlikely case. Yes, obviously such cases are ambiguous but I contend that they are not relevant enough to care, and these ambiguities are fundamentally unresolvable, anyway: no parsing strategy will work since the ambiguity is then in the data itself, not in the parsing process.Prolocutor
But I believe you just made my point for me. That's exactly why I just use Environment.NewLine by default, and only use something like the RegEx solution if you venture outside the realm of the more-likely scenarios. It happens, but as they say, a giant time-killer is implementing solutions for things that might happen, rather than things that do. Sure, plan for the future of course (i.e. don't design yourself into a corner where you can't make the change later), but don't actually implement a future until you actually need to. In other words, I don't think our points are that far off.Glaydsglaze
@MarqueIV “That's exactly why I just use Environment.NewLine” — but that’s the worst thing you can do because now you start breaking lots of actual files, whereas my solution breaks approximately zero actually existing files. Check out how many modern text editors use only the system’s newline for line breaks (hint: none do).Prolocutor
Nothing is broken if you're never planning on getting anything that doesn't match your platform's encoding. If you know that (just like you know there may never be a '\r') then you're optimizing your results, not wasting time running things through a RegEx engine that don't need to be, which can kill a time-critical application. If you will have multiple encodings, then use the RegEx. You just can't do universal. Again, I don't think we're arguing the same point. You've made yours and I've made a different one. Tangential, but not in contradiction.Glaydsglaze
@MarqueIV I honestly have trouble understanding your use-case: You don’t need to go beyond your current platform to encounter text files that use different line ending conventions. I know for a fact that my current system contains files with different conventions (I edited one just yesterday, and I only know about the diverging line endings because diff flagged them). This isn’t “planning for the future”, it’s making code robust for the here and now.Prolocutor
Plus, taking a step back, one could argue that if you do need blank lines but don't enforce a standard for line encodings, then you're just asking for trouble anyway. After all, if you skip blank lines, you can write a universal parser, rendering this entire convo thread obsolete! :)Glaydsglaze
And in your case, I'd argue the 'platform' is you using editing tools that may have differing line endings, hence you getting your diff. But if you're using a known format for instance, from another system, and not something manually edited, then there's no need to plan for that case and you can increase throughput of processing by not. Again, we're not arguing the same point!. Time and place. If you're taking in user-editable files, then I 100% agree with you. But if you're taking in system-generated files from a known system on the same platform, then I stand by my original statement. :)Glaydsglaze
@MarqueIV No, nothing was mangled. The files have different (but internally consistent) line endings because they were created by different people, on different platforms. Yet they end up on my machine. And I want to emphasise that we are very much arguing the same point, because I’m fundamentally not understanding where your potential use-case exists. I simply don’t see when it would be more useful, and produce fewer problems, to split on a platform hard-coded newline rather than using my heuristic, which I (and clearly many others) have found to work in 100% of real files.Prolocutor
"Created by different people, on different platforms". That is a different use-case than something say from a web service where the line endings are predictable and consistent. And if that system is on the same platform, then you can use Environment.NewLine and crush the performance of RegEx. Again, time and place. I plan for, but don't implement solutions for things until they happen. Just like the code, developer productivity is also increased.Glaydsglaze
To hopefully appease you, if you're saying you need a system that has to detect blank lines, and you are taking files created on platforms with differing line endings, and you're guaranteeing you will never get '\r' by itself and/or your line endings will be consistent in the same file (which you can't if it's edited on machines with two different line endings and all line endings aren't updated), then I agree... the regex works. But I'm saying if you can't make those guarantees, it won't because you then won't be able to differentiate between '\n\r' and '\n' and '\r'. Make sense?Glaydsglaze
In fairness, nothing will work in that case, not just RegEx because there is no standard for the line endings on the parser, which brings me back to one of my earlier points, if you are saying blank lines are important to you, then you must define what represents a blank line or you can't answer the above question (without those other guarantees that is.)Glaydsglaze
More precision might help: it is not possible to write a parser to handle a combination of all cases, the RE here will handle combinations of any two cases in one file.Gluten
M
164
using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}
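One detail worth knowing about this loop is how it treats a terminator at the very end of the text. The sketch below wraps the same StringReader loop in a helper so the line counts can be checked; ReadLineDemo and ReadAll are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical demo class, not part of any library.
public static class ReadLineDemo
{
    // Collects every line that StringReader.ReadLine returns for the text.
    public static List<string> ReadAll(string text)
    {
        var lines = new List<string>();
        using (var sr = new StringReader(text))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }

    public static void Main()
    {
        Console.WriteLine(ReadAll("example").Count);         // 1
        Console.WriteLine(ReadAll("example\r\n").Count);     // 1 (no extra empty line)
        Console.WriteLine(ReadAll("example\r\n\r\n").Count); // 2
        Console.WriteLine(ReadAll("").Count);                // 0
    }
}
```

By contrast, String.Split with StringSplitOptions.None returns a trailing empty entry for "example\r\n", so the two approaches differ by one element whenever the text ends with a line terminator.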
Mawkin answered 29/7, 2011 at 13:17 Comment(3)
This is the cleanest approach, in my subjective opinion.Sidereal
Any idea in terms of performance (compared to string.Split or Regex.Split)?Sarraceniaceous
I like this solution a lot, but I found a minor problem: when the last line is empty, it's ignored (only the last one). So, "example" and "example\r\n" will both produce only one line while "example\r\n\r\n" will produce two lines. This behavior is discussed here: github.com/dotnet/runtime/issues/27715Roughhew
N
83

Update: See here for an alternative/async solution.


This works great and is faster than Regex:

input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)

It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

Regex.Split(input, "\r\n|\r|\n")

Regex.Split(input, "\r?\n|\r")

Except that Regex turns out to be about 10 times slower. Here's my test:

Action<Action> measure = (Action func) => {
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++) {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);

measure(() =>
    Regex.Split(input, "\r\n|\r|\n")
);

measure(() =>
    Regex.Split(input, "\r?\n|\r")
);

Output:

00:00:03.8527616

00:00:31.8017726

00:00:32.5557128

and here's the Extension Method:

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        return str.Split(new[] { "\r\n", "\r", "\n" },
            removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines
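
The point about putting "\r\n" first can be checked directly. This is a small sketch, assuming the documented String.Split behaviour that, when several separators match at the same position, the one listed first in the array wins; SeparatorOrderDemo and CountParts are names invented for the example.

```csharp
using System;

// Hypothetical demo class, not part of any library.
public static class SeparatorOrderDemo
{
    // Number of substrings produced when splitting on the given separators.
    public static int CountParts(string text, string[] separators) =>
        text.Split(separators, StringSplitOptions.None).Length;

    public static void Main()
    {
        var text = "a\r\nb";

        // "\r\n" listed first: the pair is consumed as one separator.
        Console.WriteLine(CountParts(text, new[] { "\r\n", "\r", "\n" })); // 2

        // "\r" listed first: '\r' and '\n' are consumed separately,
        // leaving a spurious empty entry between them.
        Console.WriteLine(CountParts(text, new[] { "\r", "\n", "\r\n" })); // 3
    }
}
```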
Nonconductor answered 8/8, 2014 at 4:21 Comment(6)
Please add some more details to make your answer more useful for readers.Revest
Done. Also added a test to compare its performance with Regex solution.Nonconductor
Somewhat faster pattern due to less backtracking with the same functionality if one uses [\r\n]{1,2}England
@OmegaMan That has some different behavior. It will match \n\r or \n\n as single line-break which is not correct.Nonconductor
@Nonconductor I won't argue with you, but if the data has line feeds in multiple numbers...there most likely is something wrong with the data; let us call it an edge case.England
@OmegaMan How is Hello\n\nworld\n\n an edge case? It is clearly one line with text, followed by an empty line, followed by another line with text, followed by an empty line.Tamathatamaulipas
N
37

You could use Regex.Split:

string[] tokens = Regex.Split(input, @"\r?\n|\r");

Edit: added |\r to account for (older) Mac line terminators.

Newmark answered 2/10, 2009 at 7:53 Comment(6)
This won’t work on OS X style text files though, since these use only \r as line ending.Prolocutor
@Konrad Rudolph: AFAIK, '\r' was used on very old MacOS systems and is almost never encountered anymore. But if the OP needs to account for it (or if I'm mistaken), then the regex can easily be extended to account for it of course: \r?\n|\rNewmark
@Bart: I don’t think you’re mistaken but I have repeatedly encountered all possible line endings in my career as a programmer.Prolocutor
@Konrad, you're probably right. Better safe than sorry, I guess.Newmark
Less backtracking and same functionality with [\r\n]{1,2}England
@ΩmegaMan: That will lose empty lines, e.g. \n\n.Sisterhood
I
11

If you want to keep empty lines just remove the StringSplitOptions.

var result = input.Split(System.Environment.NewLine.ToCharArray());
Imbibition answered 2/10, 2009 at 7:57 Comment(1)
NewLine can be '\n' and input text can contain "\n\r".Diocletian
L
7
string[] lines = input.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
Landbert answered 11/8, 2013 at 5:58 Comment(0)
N
5

I had this other answer, but this one, based on Jack's answer, might be preferred since it works asynchronously (lines are yielded lazily, one at a time), although it is slightly slower.

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (removeEmptyLines && String.IsNullOrWhiteSpace(line))
                {
                    continue;
                }
                yield return line;
            }
        }
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines

Test:

Action<Action> measure = (Action func) =>
{
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++)
    {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
);

measure(() =>
    input.GetLines()
);

measure(() =>
    input.GetLines().ToList()
);

Output:

00:00:03.9603894

00:00:00.0029996

00:00:04.8221971
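
The very small second timing is likely explained by deferred execution: calling an iterator method only creates the enumerator, and no line is read until something enumerates it. The sketch below illustrates this with a counter; DeferredDemo, GetLinesCounted and LinesRead are hypothetical names added purely for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical demo class, not part of any library.
public static class DeferredDemo
{
    // Counts how many lines the iterator body has actually produced.
    public static int LinesRead;

    public static IEnumerable<string> GetLinesCounted(string str)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                LinesRead++;
                yield return line;
            }
        }
    }

    public static void Main()
    {
        var lazy = GetLinesCounted("a\nb\nc");
        Console.WriteLine(LinesRead); // 0: the iterator body has not started

        var list = lazy.ToList();     // enumeration runs the body
        Console.WriteLine(LinesRead); // 3
    }
}
```

For a fair comparison with String.Split, the iterator therefore has to be enumerated inside the measured delegate, as the third measurement does with ToList().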

Nonconductor answered 16/12, 2016 at 3:31 Comment(4)
I do wonder if this is because you aren't actually inspecting the results of the enumerator, and therefore it isn't getting executed. Unfortunately, I'm too lazy to check.Quintilla
Yes, it actually is!! When you add .ToList() to both the calls, the StringReader solution is actually slower! On my machine it is 6.74s vs. 5.10sVowelize
That makes sense. I still prefer this method because it lets me get lines asynchronously.Nonconductor
Maybe you should remove the "better solution" header on your other answer and edit this one...Vowelize
H
3

Split a string into lines without any allocation.

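// Extension methods must live in a static class; the class name here is arbitrary.
public static class LineEnumeratorExtensions {

    public static LineEnumerator GetLines(this string text) {
        return new LineEnumerator( text.AsSpan() );
    }

}

// Public, because a public method cannot return a less accessible type.
public ref struct LineEnumerator {

    // Text that has not been enumerated yet.
    private ReadOnlySpan<char> Text { get; set; }
    // The current line, including its trailing "\n" or "\r\n" if it has one.
    public ReadOnlySpan<char> Current { get; private set; }

    public LineEnumerator(ReadOnlySpan<char> text) {
        Text = text;
        Current = default;
    }

    // Lets foreach use this struct directly (pattern-based enumeration).
    public LineEnumerator GetEnumerator() {
        return this;
    }

    public bool MoveNext() {
        if (Text.IsEmpty) return false;

        var index = Text.IndexOf( '\n' ); // \r\n or \n
        if (index != -1) {
            Current = Text.Slice( 0, index + 1 );
            Text = Text.Slice( index + 1 );
            return true;
        } else {
            // Last line without a terminator.
            Current = Text;
            Text = ReadOnlySpan<char>.Empty;
            return true;
        }
    }

}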
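
As a usage sketch, a ref struct like this can be consumed with a plain foreach, because foreach only needs a GetEnumerator method, a MoveNext method and a Current property. The example below is a self-contained variant rather than the code above: the names SpanLines, EnumerateSpanLines and CountLines are made up, and unlike the original it strips the "\n" or "\r\n" from each line, which is an assumption about what most callers want.

```csharp
using System;

// Hypothetical demo class, not part of any library.
public static class SpanLineDemo
{
    public static SpanLines EnumerateSpanLines(this string text) =>
        new SpanLines(text.AsSpan());

    // Counts lines without creating any strings.
    public static int CountLines(string text)
    {
        int count = 0;
        foreach (ReadOnlySpan<char> line in text.EnumerateSpanLines())
            count++;
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountLines("a\r\nb\nc")); // 3
    }
}

// Variant enumerator: Current never contains the line terminator.
public ref struct SpanLines
{
    private ReadOnlySpan<char> _rest;
    public ReadOnlySpan<char> Current { get; private set; }

    public SpanLines(ReadOnlySpan<char> text)
    {
        _rest = text;
        Current = default;
    }

    public SpanLines GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_rest.IsEmpty) return false;

        int index = _rest.IndexOf('\n');
        ReadOnlySpan<char> line;
        if (index == -1)
        {
            // Last line without a terminator.
            line = _rest;
            _rest = ReadOnlySpan<char>.Empty;
        }
        else
        {
            line = _rest.Slice(0, index);
            _rest = _rest.Slice(index + 1);
        }

        // Drop the '\r' of a "\r\n" pair so only the line text remains.
        if (!line.IsEmpty && line[line.Length - 1] == '\r')
            line = line.Slice(0, line.Length - 1);

        Current = line;
        return true;
    }
}
```

Like the original, this only treats '\n' (optionally preceded by '\r') as a line break; a lone '\r' is not recognised.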
Hanahanae answered 30/1, 2021 at 14:40 Comment(2)
Interesting! Should it implement IEnumerable<>?Diocletian
@KonstantinSpirin it can not implement IEnumerable<> because 1) ref structs cannot implement interfaces and 2) 'ReadOnlySpan<char>' can not be used as a type argument in IEnumerable<> because it is a ref structHalbeib
D
2

Slightly twisted, but an iterator block to do it:

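public static IEnumerable<string> Lines(this string Text)
{
    string newLine = Environment.NewLine;
    int start = 0;   // index at which the current line begins
    int nIndex;
    while ((nIndex = Text.IndexOf(newLine, start, StringComparison.Ordinal)) != -1)
    {
        yield return Text.Substring(start, nIndex - start);
        start = nIndex + newLine.Length;   // skip the whole terminator, not just one char
    }
    // Remainder after the last terminator (or the whole text if there was none).
    yield return Text.Substring(start);
}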

You can then call:

var result = input.Lines().ToArray();
Diviner answered 2/10, 2009 at 8:8 Comment(0)
G
2
    private string[] GetLines(string text)
    {

        List<string> lines = new List<string>();
        using (MemoryStream ms = new MemoryStream())
        {
            StreamWriter sw = new StreamWriter(ms);
            sw.Write(text);
            sw.Flush();

            ms.Position = 0;

            string line;

            using (StreamReader sr = new StreamReader(ms))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
            sw.Close();
        }



        return lines.ToArray();
    }
Gazelle answered 6/8, 2015 at 0:55 Comment(2)
This worked really well for parsing a custom file format I wrote. Your code is much faster reading 500+ lines compared to string.Split - big difference! Thanks!Cryptogam
You can remove MemoryStream / StreamWriter / StreamReader and just use StringReader! It has a ReadLine method too.Halbeib
C
2

It's tricky to handle mixed line endings properly. As we know, the line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF \u000D\u000A, so this combination should only emit a single line. Unix uses a single \u000A, and very old Macs used a single \u000D character. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:

  • each and every CR or LF character should skip to the next line EXCEPT...
  • ...if a CR is immediately followed by LF (\u000D\u000A) then these two together skip just one line.
  • String.Empty is the only input that returns no lines (any character entails at least one line)
  • The last line must be returned even if it has neither CR nor LF.

The preceding rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line breaking function that dutifully implements these guidelines to correctly handle any arbitrary sequence or combination of CR/LF. The enumerated lines do not contain any CR/LF characters. Empty lines are preserved and returned as String.Empty.

/// <summary>
/// Enumerates the text lines from the string.
///   ⁃ Mixed CR-LF scenarios are handled correctly
///   ⁃ String.Empty is returned for each empty line
///   ⁃ No returned string ever contains CR or LF
/// </summary>
public static IEnumerable<String> Lines(this String s)
{
    int j = 0, c, i;
    char ch;
    if ((c = s.Length) > 0)
        do
        {
            for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                ;

            yield return s.Substring(i, j - i);
        }
        while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
}

Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce the exact same results.

public static IEnumerable<String> Lines(this String s)
{
    using (var tr = new StringReader(s))
        while (tr.ReadLine() is String L)
            yield return L;
}
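
The four rules listed earlier can be spot-checked against StringReader.ReadLine with a few short inputs. This is only a sketch; LineRulesDemo and ReadLines are hypothetical names, and the helper simply collects whatever ReadLine returns.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical demo class, not part of any library.
public static class LineRulesDemo
{
    // Collects every line that StringReader.ReadLine returns for s.
    public static List<string> ReadLines(string s)
    {
        var result = new List<string>();
        using (var reader = new StringReader(s))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                result.Add(line);
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(ReadLines("").Count);       // 0: empty input yields no lines
        Console.WriteLine(ReadLines("a").Count);      // 1: last line needs no CR/LF
        Console.WriteLine(ReadLines("a\r\nb").Count); // 2: CR-LF is a single break
        Console.WriteLine(ReadLines("a\n\rb").Count); // 3: LF then CR are two breaks
        Console.WriteLine(ReadLines("a\r\rb").Count); // 3: each lone CR is a break
    }
}
```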
Campinas answered 6/2, 2019 at 18:22 Comment(0)
R
2

Late to the party, but I've been using a simple collection of extension methods for just that, which leverage TextReader.ReadLine():

public static class StringReadLinesExtension
{
    public static IEnumerable<string> GetLines(this string text) => GetLines(new StringReader(text));
    public static IEnumerable<string> GetLines(this Stream stm) => GetLines(new StreamReader(stm));
    public static IEnumerable<string> GetLines(this TextReader reader) {
        // using ensures the reader is disposed even if the caller stops enumerating early
        using (reader) {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}

Using the code is really trivial:

// If you have the text as a string...
var text = "Line 1\r\nLine 2\r\nLine 3";
foreach (var line in text.GetLines())
    Console.WriteLine(line);
// You can also use streams like
var fileStm = File.OpenRead(@"c:\tests\file.txt"); // verbatim string, so \t and \f are not read as escapes
foreach(var line in fileStm.GetLines())
    Console.WriteLine(line);

Hope this helps someone out there.

Rangy answered 30/5, 2022 at 19:32 Comment(0)
H
1

Split a string into lines without allocating a new string for each line.

static IEnumerable<ReadOnlyMemory<char>> GetLines(this string text, string newLine) 
{
    if (text.Length == 0)
        yield break;

    var memory = text.AsMemory();
    int index;

    while ((index = memory.Span.IndexOf(newLine)) != -1) 
    {
        yield return memory.Slice(0, index);
        memory = memory.Slice(index + newLine.Length);
    }

    yield return memory;
}

Example of use

foreach (ReadOnlyMemory<char> line in GetLines(text, "\r\n"))
{
   // use the line variable or if needed...
   // alternative use

   ReadOnlySpan<char> span = line.Span;
   string str = line.Span.ToString();
}
Halbeib answered 10/12, 2023 at 2:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.