Best way to split string into lines

Asked 2/10, 2009 at 7:49 Answered 10/12, 2023 at 2:35

190

How do you split a multi-line string into lines?

I know this way

var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

looks a bit ugly and loses empty lines. Is there a better solution?

Diocletian answered 2/10, 2009 at 7:49 Comment(3)

Possible duplicate of Easiest way to split a string on newlines in .NET? – Zygapophysis 13/5, 2019 at 8:40

Yes, you use the exact line delimiter present in the file, e.g. just "\r\n" or just "\n" rather than using either \r or \n and ending up with a load of blank lines on windows-created files. What system uses LFCR line endings, btw? – Acroterion 2/2, 2022 at 6:45

@CaiusJard LFCR is used in RISC OS... It was used in some early microcomputers of the late 70s and early 80s, but it does not seem relevant anymore. – Rangy 30/5, 2022 at 21:30

226

If it looks ugly, just remove the unnecessary ToCharArray call.
If you want to split by either \n or \r, you've got two options:
- Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:
```
var result = text.Split(new [] { '\r', '\n' });
```
- Use a regular expression, as indicated by Bart:
```
var result = Regex.Split(text, "\r\n|\r|\n");
```
If you want to preserve empty lines, why do you explicitly tell C# to throw them away? (StringSplitOptions parameter) – use StringSplitOptions.None instead.
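
To make the difference between the two options concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration. It assumes Windows-style input with one empty line in the middle and uses StringSplitOptions.None semantics (the default for the char-array overload).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: three lines, the second one empty, with Windows line endings.
string text = "first\r\n\r\nthird";

// Splitting on individual characters treats '\r' and '\n' as separate breaks,
// so every "\r\n" pair contributes an extra empty entry.
string[] byChars = text.Split(new[] { '\r', '\n' });

// The regex tries "\r\n" first, so each CR-LF pair counts as a single break.
string[] byRegex = Regex.Split(text, @"\r\n|\r|\n");

Console.WriteLine(byChars.Length); // 5
Console.WriteLine(byRegex.Length); // 3
```

The regex result keeps exactly one empty entry for the genuinely empty line, which is usually what you want when blank lines matter.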

Prolocutor answered 2/10, 2009 at 7:53 Comment(26)

Removing ToCharArray will make code platform-specific (NewLine can be '\n') – Diocletian 2/10, 2009 at 9:11

@Kon you should use Environment.NewLine if that is your concern. Or do you mean the origin of the text, rather than the location of execution? – Incendiarism 20/1, 2011 at 17:3

@Will: on the off chance that you were referring to me instead of Konstantin: I believe (strongly) that parsing code should strive to work on all platforms (i.e. it should also read text files that were encoded on different platforms than the executing platform). So for parsing, Environment.NewLine is a no-go as far as I’m concerned. In fact, of all the possible solutions I prefer the one using regular expressions since only that handles all source platforms correctly. – Prolocutor 20/1, 2011 at 17:14

lol didn't notice the name similarity. I agree completely in this case. – Incendiarism 20/1, 2011 at 18:32

@Hamish Well just look at the documentation of the enum, or look in the original question! It’s StringSplitOptions.RemoveEmptyEntries. – Prolocutor 19/10, 2011 at 16:41

Ah I see, my bad, I was looking within RegexOptions; have not had my coffee yet. – Kironde 19/10, 2011 at 17:37

How about the text that contains '\r\n\r\n'. string.Split will return 4 empty lines, however with '\r\n' it should give 2. It gets worse if '\r\n' and '\r' are mixed in one file. – Sanctuary 27/4, 2012 at 18:52

@SurikovPavel Use the regular expression. That is definitely the preferred variant, as it works correctly with any combination of line endings. – Prolocutor 27/4, 2012 at 23:28

A minor point - I usually go with the verbatim string literal in the second argument to Regex.Split, i.e. - var result = Regex.Split(text, @"\r\n|\r|\n"); In this case it works either way because the C# compiler interprets \n and \r in the same way that the regular expression parser does. In the general case though it might cause problems. – Brackish 15/11, 2017 at 22:18

Just adding my 2c worth. Since the OP wants to keep blank lines, you can't write a parser that works for any type of environment and/or handles mixed cases (i.e. the RegEx), because if you have '\n\r' how do you know it's one 'break' instead of two that are just encoded wrong? If it's the latter, it would be two blank lines, but if it's the former, it would only be one. You have to ask what the source of the encodings is. If the source is on the same platform as the parser (regardless of what platform it is) then you can use Environment.NewLine as the source is known. – Glaydsglaze 20/8, 2018 at 20:27

@MarqueIV There are different possible answers to this, all valid. One is to expect and require consistent text files. Another one is to not accept "\r" on its own as a line separator (because, let’s face it, no system has used this convention in well over a decade): the only actually used conventions are "\r\n" and "\n". In fact, your example ("\n\r") has never been a valid line break anywhere. Either read it as two line breaks or throw an error, but certainly don’t treat it as a single line break. – Prolocutor 21/8, 2018 at 7:56

First things first, my text was a typo. Use '\r\n' and my point is still the same: you can't write a universal parser on a system if you're required to keep blank lines. Note that by adding the restriction that you're not accepting '\r' by itself and only want to use '\n' to detect new lines, you no longer have a universal parser, which essentially proves my point that without such limitations it can't (easily*) be done, and chances are it doesn't need to be in the first place. (*It can, by playing with RegEx ordering and such, but that just makes it much slower.) – Glaydsglaze 21/8, 2018 at 8:42

@MarqueIV I think you misread my comment: since "\r" is never used as a delimiter, you can easily write a universal parser that accepts all actually used delimiters. It’s done by simply splitting on "\r\n|\n". There’s no need for anything more fancy than that. But, honestly, in practice there’s nothing wrong with the regex code shown in my answer, and it will work just fine with a file that mixes different styles of line breaks, including the obsolete "\r". – Prolocutor 21/8, 2018 at 9:23

If you have input that has mixed styles like you said, there's no way to differentiate between '\n\r' and '\n' and '\r' without making the assumption that there will never be an '\r', and when you make that assumption, then you've removed the condition that I just mentioned that causes the ambiguity. Plus, you can't make that assumption anyway as there are plenty of embedded hardware systems that use '\r'. That's why terminals give you three choices for line breaks. You need to know your input up front. I guess we'll just have to disagree and each use what works for us. – Glaydsglaze 21/8, 2018 at 9:39

@MarqueIV That’s why my previous comment says “in practice” it works. You’re arguing from a pretty unlikely case. Yes, obviously such cases are ambiguous but I contend that they are not relevant enough to care, and these ambiguities are fundamentally unresolvable, anyway: no parsing strategy will work since the ambiguity is then in the data itself, not in the parsing process. – Prolocutor 21/8, 2018 at 9:45

But I believe you just made my point for me. That's exactly why I just use Environment.NewLine by default, and only use something like the RegEx solution if you venture outside the realm of the more-likely scenarios. It happens, but as they say, a giant time-killer is implementing solutions for things that might happen, rather than things that do. Sure, plan for the future of course (i.e. don't design yourself into a corner where you can't make the change later), but don't actually implement a future until you actually need to. In other words, I don't think our points are that far off. – Glaydsglaze 21/8, 2018 at 14:23

@MarqueIV “That's exactly why I just use Environment.NewLine” — but that’s the worst thing you can do because now you start breaking lots of actual files, whereas my solution breaks approximately zero actually existing files. Check out how many modern text editors use only the system’s newline for line breaks (hint: none do). – Prolocutor 21/8, 2018 at 14:23

Nothing is broken if you're never planning on getting anything that doesn't match your platform's encoding. If you know that (just like you know there may never be a '\r') then you're optimizing your results, not wasting time running things through a RegEx engine that don't need to be, which can kill a time-critical application. If you will have multiple encodings, then use the RegEx. You just can't do universal. Again, I don't think we're arguing the same point. You've made yours and I've made a different one. Tangential, but not in contradiction. – Glaydsglaze 21/8, 2018 at 14:25

@MarqueIV I honestly have trouble understanding your use-case: You don’t need to go beyond your current platform to encounter text files that use different line ending conventions. I know for a fact that my current system contains files with different conventions (I edited one just yesterday, and I only know about the diverging line endings because diff flagged them). This isn’t “planning for the future”, it’s making code robust for the here and now. – Prolocutor 21/8, 2018 at 14:27

Plus, taking a step back, one could argue that if you do need blank lines but don't enforce a standard for line encodings, then you're just asking for trouble anyway. After all, if you skip blank lines, you can write a universal parser, rendering this entire convo thread obsolete! :) – Glaydsglaze 21/8, 2018 at 14:27

And in your case, I'd argue the 'platform' is you using editing tools that may have differing line endings, hence you getting your diff. But if you're using a known format for instance, from another system, and not something manually edited, then there's no need to plan for that case and you can increase throughput of processing by not. Again, we're not arguing the same point!. Time and place. If you're taking in user-editable files, then I 100% agree with you. But if you're taking in system-generated files from a known system on the same platform, then I stand by my original statement. :) – Glaydsglaze 21/8, 2018 at 14:33

@MarqueIV No, nothing was mangled. The files have different (but internally consistent) line endings because they were created by different people, on different platforms. Yet they end up on my machine. — And I want to emphasise that we are very much arguing the same point, because I’m fundamentally not understanding where your potential use-case exists. I simply don’t see when it would be more useful, and produce less problems, to split on a platform hard-coded newline rather than using my heuristic, which I (and clearly many others) have found to work in 100% of real files. – Prolocutor 21/8, 2018 at 14:34

"Created by different people, on different platforms". That is a different use-case than something say from a web service where the line endings are predictable and consistent. And if that system is on the same platform, then you can use Environment.NewLine and crush the performance of RegEx. Again, time and place. I plan for, but don't implement solutions for things until they happen. Just like the code, developer productivity is also increased. – Glaydsglaze 21/8, 2018 at 14:36

To hopefully appease you, if you're saying you need a system that has to detect blank lines, and you are taking files created on platforms with differing line endings, and you're guaranteeing you will never get '\r' by itself and/or your line endings will be consistent in the same file (which you can't if it's edited on machines with two different line endings and all line endings aren't updated), then I agree... the regex works. But I'm saying if you can't make those guarantees, it won't because you then won't be able to differentiate between '\n\r' and '\n' and '\r'. Make sense? – Glaydsglaze 21/8, 2018 at 14:46

In fairness, nothing will work in that case, not just RegEx because there is no standard for the line endings on the parser, which brings me back to one of my earlier points, if you are saying blank lines are important to you, then you must define what represents a blank line or you can't answer the above question (without those other guarantees that is.) – Glaydsglaze 21/8, 2018 at 14:50

More precision might help: it is not possible to write a parser to handle a combination of all cases, the RE here will handle combinations of any two cases in one file. – Gluten 23/9, 2018 at 10:46

164

using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}

Mawkin answered 29/7, 2011 at 13:17 Comment(3)

This is the cleanest approach, in my subjective opinion. – Sidereal 21/10, 2013 at 9:41

Any idea in terms of performance (compared to string.Split or Regex.Split)? – Sarraceniaceous 25/1, 2019 at 7:49

I like this solution a lot, but I found a minor problem: when the last line is empty, it's ignored (only the last one). So, "example" and "example\r\n" will both produce only one line while "example\r\n\r\n" will produce two lines. This behavior is discussed here: github.com/dotnet/runtime/issues/27715 – Roughhew 28/1, 2022 at 21:4
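
The trailing-line behavior described in the comment above can be checked with a short sketch. ReadAllLines is a hypothetical helper written only for this illustration; the inputs are the ones quoted in the comment, and string.Split is included as a contrast.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: collects every line that StringReader.ReadLine returns.
static List<string> ReadAllLines(string text)
{
    var lines = new List<string>();
    using (var reader = new StringReader(text))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            lines.Add(line);
    }
    return lines;
}

int noTerminator = ReadAllLines("example").Count;
int oneTerminator = ReadAllLines("example\r\n").Count;
int twoTerminators = ReadAllLines("example\r\n\r\n").Count;

// Split, by contrast, reports an empty entry after a final terminator.
int splitCount = "example\r\n".Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None).Length;

Console.WriteLine($"{noTerminator} {oneTerminator} {twoTerminators} {splitCount}"); // 1 1 2 2
```

So if a final empty line is significant for your format, the StringReader approach and the Split approach will disagree by one entry.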

Update: See here for an alternative/async solution.

This works great and is faster than Regex:

input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)

It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

Regex.Split(input, "\r\n|\r|\n")

Regex.Split(input, "\r?\n|\r")

Except that Regex turns out to be about 10 times slower. Here's my test:

Action<Action> measure = (Action func) => {
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++) {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);

measure(() =>
    Regex.Split(input, "\r\n|\r|\n")
);

measure(() =>
    Regex.Split(input, "\r?\n|\r")
);

Output:

00:00:03.8527616

00:00:31.8017726

00:00:32.5557128

and here's the Extension Method:

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        return str.Split(new[] { "\r\n", "\r", "\n" },
            removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines
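
The point about putting "\r\n" first can be illustrated with a small sketch (the sample string and variable names are made up for this example). As far as I understand string.Split, when several separators match at the same position, the one listed first in the array wins.

```csharp
using System;

string sample = "a\r\nb";

// "\r\n" listed first: the pair is consumed as a single separator.
string[] pairFirst = sample.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

// "\r" listed first: it matches before "\r\n" is tried, and the leftover "\n"
// then counts as a second separator, producing an empty entry in between.
string[] crFirst = sample.Split(new[] { "\r", "\r\n", "\n" }, StringSplitOptions.None);

Console.WriteLine(pairFirst.Length); // 2
Console.WriteLine(crFirst.Length);   // 3
```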

Nonconductor answered 8/8, 2014 at 4:21 Comment(6)

Please add some more details to make your answer more useful for readers. – Revest 8/8, 2014 at 4:47

Done. Also added a test to compare its performance with Regex solution. – Nonconductor 8/8, 2014 at 18:50

Somewhat faster pattern due to less backtracking with the same functionality if one uses [\r\n]{1,2} – England 27/2, 2015 at 17:23

@OmegaMan That has some different behavior. It will match \n\r or \n\n as single line-break which is not correct. – Nonconductor 27/2, 2015 at 22:13

@Nonconductor I won't argue with you, but if the data has line feeds in multiple numbers...there most likely is something wrong with the data; let us call it an edge case. – England 28/2, 2015 at 1:2

@OmegaMan How is Hello\n\nworld\n\n an edge case? It is clearly one line with text, followed by an empty line, followed by another line with text, followed by an empty line. – Tamathatamaulipas 9/8, 2015 at 10:59

You could use Regex.Split:

string[] tokens = Regex.Split(input, @"\r?\n|\r");

Edit: added |\r to account for (older) Mac line terminators.

Newmark answered 2/10, 2009 at 7:53 Comment(6)

This won’t work on OS X style text files though, since these use only \r as line ending. – Prolocutor 2/10, 2009 at 8:1

@Konrad Rudolph: AFAIK, '\r' was used on very old MacOS systems and is almost never encountered anymore. But if the OP needs to account for it (or if I'm mistaken), then the regex can easily be extended to account for it of course: \r?\n|\r – Newmark 2/10, 2009 at 8:37

@Bart: I don’t think you’re mistaken but I have repeatedly encountered all possible line endings in my career as a programmer. – Prolocutor 2/10, 2009 at 13:24

@Konrad, you're probably right. Better safe than sorry, I guess. – Newmark 2/10, 2009 at 13:28

Less backtracking and same functionality with [\r\n]{1,2} – England 27/2, 2015 at 17:22

@ΩmegaMan: That will lose empty lines, e.g. \n\n. – Sisterhood 21/3, 2019 at 8:21

If you want to keep empty lines just remove the StringSplitOptions.

var result = input.Split(System.Environment.NewLine.ToCharArray());

Imbibition answered 2/10, 2009 at 7:57 Comment(1)

NewLine can be '\n' and input text can contain "\n\r". – Diocletian 2/10, 2009 at 9:26

string[] lines = input.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

Landbert answered 11/8, 2013 at 5:58 Comment(0)

I had this other answer, but this one, based on Jack's answer, ~~is significantly faster~~ might be preferred since it yields lines lazily as they are enumerated, although it is slightly slower.

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (removeEmptyLines && String.IsNullOrWhiteSpace(line))
                {
                    continue;
                }
                yield return line;
            }
        }
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines

Test:

Action<Action> measure = (Action func) =>
{
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++)
    {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
);

measure(() =>
    input.GetLines()
);

measure(() =>
    input.GetLines().ToList()
);

Output:

00:00:03.9603894

00:00:00.0029996

00:00:04.8221971

Nonconductor answered 16/12, 2016 at 3:31 Comment(4)

I do wonder if this is because you aren't actually inspecting the results of the enumerator, and therefore it isn't getting executed. Unfortunately, I'm too lazy to check. – Quintilla 19/10, 2017 at 16:54

Yes, it actually is!! When you add .ToList() to both the calls, the StringReader solution is actually slower! On my machine it is 6.74s vs. 5.10s – Vowelize 2/11, 2017 at 12:20

That makes sense. I still prefer this method because it lets me get lines asynchronously. – Nonconductor 6/11, 2017 at 4:41

Maybe you should remove the "better solution" header on your other answer and edit this one... – Vowelize 6/11, 2017 at 9:22
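
The near-zero middle timing is explained by deferred execution: calling an iterator method only creates the enumerator, and no line is read until something enumerates it. A minimal sketch of that effect follows. The linesRead counter and the sample input are additions for illustration and are not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

int linesRead = 0; // incremented each time the iterator body produces a line

// Same shape as the StringReader-based iterator, plus the counter.
IEnumerable<string> GetLines(string str)
{
    using (var sr = new StringReader(str))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            linesRead++;
            yield return line;
        }
    }
}

IEnumerable<string> lazy = GetLines("a\nb\nc");
int afterCall = linesRead;   // still 0: nothing has been read yet

List<string> firstTwo = lazy.Take(2).ToList();
int afterTake = linesRead;   // 2: only the requested lines were read

Console.WriteLine($"{afterCall} {afterTake} {firstTwo.Count}"); // 0 2 2
```

This is also the practical advantage of the lazy version: a caller that only needs the first few lines never pays for splitting the rest.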

Split a string into lines without any allocation.

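public static LineEnumerator GetLines(this string text) {
    return new LineEnumerator( text.AsSpan() );
}

// Public so that the public extension method above can return it.
public ref struct LineEnumerator {

    // Remaining part of the input that has not been enumerated yet.
    private ReadOnlySpan<char> Text { get; set; }
    // Current line; note that it still includes its trailing "\n" or "\r\n".
    public ReadOnlySpan<char> Current { get; private set; }

    public LineEnumerator(ReadOnlySpan<char> text) {
        Text = text;
        Current = default;
    }

    // Returning itself lets foreach consume the struct directly.
    public LineEnumerator GetEnumerator() {
        return this;
    }

    public bool MoveNext() {
        if (Text.IsEmpty) return false;

        var index = Text.IndexOf( '\n' ); // matches both \r\n and \n endings
        if (index != -1) {
            Current = Text.Slice( 0, index + 1 );
            Text = Text.Slice( index + 1 );
            return true;
        } else {
            // No terminator left: the rest of the text is the last line.
            Current = Text;
            Text = ReadOnlySpan<char>.Empty;
            return true;
        }
    }
}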

Hanahanae answered 30/1, 2021 at 14:40 Comment(2)

Interesting! Should it implement IEnumerable<>? – Diocletian 1/2, 2021 at 2:12

@KonstantinSpirin it can not implement IEnumerable<> because 1) ref structs cannot implement interfaces and 2) 'ReadOnlySpan<char>' can not be used as a type argument in IEnumerable<> because it is a ref struct – Halbeib 9/12, 2023 at 23:48
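
As a point of comparison, newer runtimes ship a built-in span-based line enumerator, MemoryExtensions.EnumerateLines (assuming .NET 6 or later). The sketch below is not the answer's code; the sample string and counters are made up. Unlike the enumerator above, the spans it yields exclude the line terminators, and it may treat some additional Unicode newline characters as breaks, so check its documentation before relying on it for strict formats.

```csharp
using System;

string text = "alpha\r\nbeta\ngamma";
int count = 0;
int totalChars = 0;

// EnumerateLines walks the string as spans, so no per-line string
// is allocated unless the caller explicitly creates one.
foreach (ReadOnlySpan<char> line in text.AsSpan().EnumerateLines())
{
    count++;
    totalChars += line.Length; // terminators are not part of the span
}

Console.WriteLine($"{count} lines, {totalChars} characters"); // 3 lines, 14 characters
```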

Slightly twisted, but an iterator block to do it:

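public static IEnumerable<string> Lines(this string Text)
{
    string newLine = Environment.NewLine;
    int start = 0;   // index where the current line begins
    int nIndex;      // index of the next line terminator
    while ((nIndex = Text.IndexOf(newLine, start, StringComparison.Ordinal)) != -1)
    {
        yield return Text.Substring(start, nIndex - start);
        start = nIndex + newLine.Length;   // skip the whole terminator, not just one char
    }
    // Remainder after the last terminator (the whole text if there was none).
    yield return Text.Substring(start);
}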

You can then call:

var result = input.Lines().ToArray();

Diviner answered 2/10, 2009 at 8:8 Comment(0)

    private string[] GetLines(string text)
    {

        List<string> lines = new List<string>();
        using (MemoryStream ms = new MemoryStream())
        {
            StreamWriter sw = new StreamWriter(ms);
            sw.Write(text);
            sw.Flush();

            ms.Position = 0;

            string line;

            using (StreamReader sr = new StreamReader(ms))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
            sw.Close();
        }



        return lines.ToArray();
    }

Gazelle answered 6/8, 2015 at 0:55 Comment(2)

This worked really well for parsing a custom file format I wrote. Your code is much faster reading 500+ lines compared to string.Split - big difference! Thanks! – Cryptogam 7/10, 2022 at 20:25

You can remove MemoryStream / StreamWriter / StreamReader and just use StringReader! It has a ReadLine method too. – Halbeib 10/12, 2023 at 2:9

It's tricky to handle mixed line endings properly. As we know, the line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF \u000D\u000A, so this combination should count as only a single line break. Unix uses a single \u000A, and very old Macs used a single \u000D character. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:

- Each and every CR or LF character should skip to the next line, EXCEPT...
- ...if a CR is immediately followed by LF (\u000D\u000A), then these two together skip just one line.
- String.Empty is the only input that returns no lines (any character entails at least one line).
- The last line must be returned even if it has neither CR nor LF.

The preceding rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line-breaking function that implements these rules to handle any sequence or combination of CR/LF correctly. The enumerated lines do not contain any CR/LF characters. Empty lines are preserved and returned as String.Empty.

/// <summary>
/// Enumerates the text lines from the string.
///   ⁃ Mixed CR-LF scenarios are handled correctly
///   ⁃ String.Empty is returned for each empty line
///   ⁃ No returned string ever contains CR or LF
/// </summary>
public static IEnumerable<String> Lines(this String s)
{
    int j = 0, c, i;
    char ch;
    if ((c = s.Length) > 0)
        do
        {
            for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                ;

            yield return s.Substring(i, j - i);
        }
        while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
}

Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce the exact same results.

public static IEnumerable<String> Lines(this String s)
{
    using (var tr = new StringReader(s))
        while (tr.ReadLine() is String L)
            yield return L;
}

Campinas answered 6/2, 2019 at 18:22 Comment(0)

Late to the party, but I've been using a simple collection of extension methods for just that, which leverage TextReader.ReadLine():

public static class StringReadLinesExtension
{
    public static IEnumerable<string> GetLines(this string text) => GetLines(new StringReader(text));
    public static IEnumerable<string> GetLines(this Stream stm) => GetLines(new StreamReader(stm));
    public static IEnumerable<string> GetLines(this TextReader reader) {
        // "using" disposes the reader even if the caller stops enumerating early.
        using (reader) {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}

Using the code is really trivial:

// If you have the text as a string...
var text = "Line 1\r\nLine 2\r\nLine 3";
foreach (var line in text.GetLines())
    Console.WriteLine(line);
// You can also use streams like
var fileStm = File.OpenRead(@"c:\tests\file.txt"); // verbatim string, so \t and \f are not read as escapes
foreach(var line in fileStm.GetLines())
    Console.WriteLine(line);

Hope this helps someone out there.

Rangy answered 30/5, 2022 at 19:32 Comment(0)

Split a string into lines without any allocation.

static IEnumerable<ReadOnlyMemory<char>> GetLines(this string text, string newLine) 
{
    if (text.Length == 0)
        yield break;

    var memory = text.AsMemory();
    int index;

    while ((index = memory.Span.IndexOf(newLine)) != -1) 
    {
        yield return memory.Slice(0, index);
        memory = memory.Slice(index + newLine.Length);
    }

    yield return memory;
}

Example of use

foreach (ReadOnlyMemory<char> line in GetLines(text, "\r\n"))
{
   // use the line variable or if needed...
   // alternative use

   ReadOnlySpan<char> span = line.Span;
   string str = line.Span.ToString();
}

Halbeib answered 10/12, 2023 at 2:35 Comment(0)
