How do you split a multi-line string into lines?
I know this way:
var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
It looks a bit ugly and loses empty lines. Is there a better solution?
If it looks ugly, just remove the unnecessary ToCharArray call.
If you want to split by either \n or \r, you've got two options:
Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:
var result = text.Split(new [] { '\r', '\n' });
Use a regular expression, as indicated by Bart:
var result = Regex.Split(text, "\r\n|\r|\n");
If you want to preserve empty lines, why do you explicitly tell C# to throw them away? (StringSplitOptions parameter) – use StringSplitOptions.None instead.
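A minimal sketch of the difference between the two options above. The class name, method names, and test string are made up for illustration; only String.Split and Regex.Split come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOptionsDemo
{
    // Splits on single characters, so "\r\n" counts as two separators
    // and leaves an empty entry between them.
    public static string[] SplitOnChars(string text) =>
        text.Split(new[] { '\r', '\n' });

    // The alternation tries "\r\n" first, so it counts as one line break.
    public static string[] SplitWithRegex(string text) =>
        Regex.Split(text, "\r\n|\r|\n");

    public static void Main()
    {
        string text = "one\r\ntwo\nthree";
        Console.WriteLine(SplitOnChars(text).Length);   // 4: "one", "", "two", "three"
        Console.WriteLine(SplitWithRegex(text).Length); // 3: "one", "two", "three"
    }
}
```

With StringSplitOptions.RemoveEmptyEntries the first variant would also return three entries, but it would then drop genuinely empty lines as well.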
Environment.NewLine is a no-go as far as I’m concerned. In fact, of all the possible solutions I prefer the one using regular expressions since only that handles all source platforms correctly. – Prolocutor
StringSplitOptions.RemoveEmptyEntries. – Prolocutor
Regex.Split, i.e. var result = Regex.Split(text, @"\r\n|\r|\n"); In this case it works either way because the C# compiler interprets \n and \r in the same way that the regular expression parser does. In the general case though it might cause problems. – Brackish
"\r" on its own as a line separator (because, let’s face it, no system has used this convention in well over a decade): the only actually used conventions are "\r\n" and "\n". In fact, your example ("\n\r") has never been a valid line break anywhere. Either read it as two line breaks or throw an error, but certainly don’t treat it as a single line break. – Prolocutor
"\r" is never used as a delimiter, so you can easily write a universal parser that accepts all actually used delimiters; it’s done by simply splitting on "\r\n|\n". There’s no need for anything more fancy than that. But, honestly, in practice there’s nothing wrong with the regex code shown in my answer, and it will work just fine with a file that mixes different styles of line breaks, including the obsolete "\r". – Prolocutor
diff flagged them). This isn’t “planning for the future”, it’s making code robust for the here and now. – Prolocutor
using (StringReader sr = new StringReader(text)) {
string line;
while ((line = sr.ReadLine()) != null) {
// do something
}
}
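One thing worth knowing about the ReadLine loop above is how it treats a trailing line terminator. The sketch below wraps the same loop in a hypothetical helper (ReadLines is not part of the answer) and compares line counts with a plain Split on "\r\n".

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Collects every line that StringReader.ReadLine returns for the text.
    public static List<string> ReadLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }

    public static void Main()
    {
        Console.WriteLine(ReadLines("example").Count);         // 1
        Console.WriteLine(ReadLines("example\r\n").Count);     // 1: the final terminator adds no line
        Console.WriteLine(ReadLines("example\r\n\r\n").Count); // 2
        // Split reports a trailing empty entry for the same input.
        Console.WriteLine("example\r\n".Split(new[] { "\r\n" }, StringSplitOptions.None).Length); // 2
    }
}
```

Whether the trailing empty entry is wanted depends on the caller, so the two approaches are not interchangeable for text that ends with a line break.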
string.Split or Regex.Split)? – Sarraceniaceous
"example" and "example\r\n" will both produce only one line while "example\r\n\r\n" will produce two lines. This behavior is discussed here: github.com/dotnet/runtime/issues/27715 – Roughhew
This works great and is faster than Regex:
input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:
Regex.Split(input, "\r\n|\r|\n")
Regex.Split(input, "\r?\n|\r")
Except that Regex turns out to be about 10 times slower. Here's my test:
Action<Action> measure = (Action func) => {
var start = DateTime.Now;
for (int i = 0; i < 100000; i++) {
func();
}
var duration = DateTime.Now - start;
Console.WriteLine(duration);
};
var input = "";
for (int i = 0; i < 100; i++)
{
input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}
measure(() =>
input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);
measure(() =>
Regex.Split(input, "\r\n|\r|\n")
);
measure(() =>
Regex.Split(input, "\r?\n|\r")
);
Output:
00:00:03.8527616
00:00:31.8017726
00:00:32.5557128
and here's the Extension Method:
public static class StringExtensionMethods
{
public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
{
return str.Split(new[] { "\r\n", "\r", "\n" },
removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
}
}
Usage:
input.GetLines() // keeps empty lines
input.GetLines(true) // removes empty lines
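The point about putting "\r\n" first can be checked with a small sketch. The class and method names are hypothetical; the separator arrays and the regex are the ones from the answer, with one array reordered for comparison.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SeparatorOrderDemo
{
    // "\r\n" listed first: a CR-LF pair is consumed as one separator.
    public static string[] CrLfFirst(string s) =>
        s.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

    // "\r\n" listed last: "\r" matches first, then "\n" matches separately.
    public static string[] CrLfLast(string s) =>
        s.Split(new[] { "\r", "\n", "\r\n" }, StringSplitOptions.None);

    public static void Main()
    {
        string input = "a\r\nb\rc\nd";
        Console.WriteLine(CrLfFirst(input).Length); // 4: "a", "b", "c", "d"
        Console.WriteLine(CrLfLast(input).Length);  // 5: "a", "", "b", "c", "d"
        // Same result as the regex variant for this input.
        Console.WriteLine(CrLfFirst(input).SequenceEqual(Regex.Split(input, "\r?\n|\r"))); // True
    }
}
```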
[\r\n]{1,2} – England
\n\r or \n\n as single line-break which is not correct. – Nonconductor
Hello\n\nworld\n\n an edge case? It is clearly one line with text, followed by an empty line, followed by another line with text, followed by an empty line. – Tamathatamaulipas
You could use Regex.Split:
string[] tokens = Regex.Split(input, @"\r?\n|\r");
Edit: added |\r to account for (older) Mac line terminators.
\r as line ending. – Prolocutor
[\r\n]{1,2} – England
If you want to keep empty lines just remove the StringSplitOptions.
var result = input.Split(System.Environment.NewLine.ToCharArray());
string[] lines = input.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
I had this other answer, but this one, based on Jack's answer, might be preferred since it enumerates the lines lazily (each line is read only when the caller asks for it), although it is slightly slower.
public static class StringExtensionMethods
{
public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
{
using (var sr = new StringReader(str))
{
string line;
while ((line = sr.ReadLine()) != null)
{
if (removeEmptyLines && String.IsNullOrWhiteSpace(line))
{
continue;
}
yield return line;
}
}
}
}
Usage:
input.GetLines() // keeps empty lines
input.GetLines(true) // removes empty lines
Test:
Action<Action> measure = (Action func) =>
{
var start = DateTime.Now;
for (int i = 0; i < 100000; i++)
{
func();
}
var duration = DateTime.Now - start;
Console.WriteLine(duration);
};
var input = "";
for (int i = 0; i < 100; i++)
{
input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}
measure(() =>
input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
);
measure(() =>
input.GetLines()
);
measure(() =>
input.GetLines().ToList()
);
Output:
00:00:03.9603894
00:00:00.0029996
00:00:04.8221971
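The near-zero second timing is what deferred execution predicts: calling an iterator method only creates the enumerator, and no line is read until something enumerates it. A minimal sketch, assuming a hypothetical LinesRead counter added purely to make this visible:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyIteratorDemo
{
    // Hypothetical counter: how many lines the iterator has produced so far.
    public static int LinesRead;

    public static IEnumerable<string> GetLines(string str)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                LinesRead++;
                yield return line;
            }
        }
    }

    public static void Main()
    {
        var lazy = GetLines("a\nb\nc");  // creates the enumerator, reads nothing
        Console.WriteLine(LinesRead);     // 0
        var list = lazy.ToList();         // forces full enumeration
        Console.WriteLine(LinesRead);     // 3
    }
}
```

For a comparison with Split, the third measurement (with ToList) is therefore the meaningful one.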
Split a string into lines without any allocation.
public static LineEnumerator GetLines(this string text) {
return new LineEnumerator( text.AsSpan() );
}
internal ref struct LineEnumerator {
private ReadOnlySpan<char> Text { get; set; }
public ReadOnlySpan<char> Current { get; private set; }
public LineEnumerator(ReadOnlySpan<char> text) {
Text = text;
Current = default;
}
public LineEnumerator GetEnumerator() {
return this;
}
public bool MoveNext() {
if (Text.IsEmpty) return false;
var index = Text.IndexOf( '\n' ); // finds the LF of both \r\n and \n
if (index != -1) {
// Note: Current keeps its line terminator (\n or \r\n); trim it if unwanted
Current = Text.Slice( 0, index + 1 );
Text = Text.Slice( index + 1 );
return true;
} else {
Current = Text;
Text = ReadOnlySpan<char>.Empty;
return true;
}
}
}
IEnumerable<>? – Diocletian
IEnumerable<> because 1) ref structs cannot implement interfaces and 2) 'ReadOnlySpan<char>' cannot be used as a type argument in IEnumerable<> because it is a ref struct – Halbeib
Slightly twisted, but an iterator block to do it:
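public static IEnumerable<string> Lines(this string text)
{
    string newLine = Environment.NewLine;
    int start = 0; // index where the current line begins
    int next;      // index of the next line terminator
    while ((next = text.IndexOf(newLine, start, StringComparison.Ordinal)) != -1)
    {
        yield return text.Substring(start, next - start);
        start = next + newLine.Length; // skip the whole terminator, not just one char
    }
    yield return text.Substring(start); // remainder after the last terminator
}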
You can then call:
var result = input.Lines().ToArray();
private string[] GetLines(string text)
{
List<string> lines = new List<string>();
using (MemoryStream ms = new MemoryStream())
{
StreamWriter sw = new StreamWriter(ms);
sw.Write(text);
sw.Flush();
ms.Position = 0;
string line;
using (StreamReader sr = new StreamReader(ms))
{
while ((line = sr.ReadLine()) != null)
{
lines.Add(line);
}
}
sw.Close();
}
return lines.ToArray();
}
MemoryStream / StreamWriter / StreamReader and just use StringReader! It has a ReadLine method too. – Halbeib
It's tricky to handle mixed line endings properly. As we know, the line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF \u000D\u000A, so this combination should only emit a single line. Unix uses a single \u000A, and very old Macs used a single \u000D character. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:
Each CR or LF character ends the current line, except that if a CR is immediately followed by an LF (\u000D\u000A) then these two together skip just one line.
String.Empty is the only input that returns no lines (any character entails at least one line).
The preceding rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line breaking function that implements these guidelines to handle any arbitrary sequence or combination of CR/LF. The enumerated lines do not contain any CR/LF characters. Empty lines are preserved and returned as String.Empty.
/// <summary>
/// Enumerates the text lines from the string.
/// ⁃ Mixed CR-LF scenarios are handled correctly
/// ⁃ String.Empty is returned for each empty line
/// ⁃ No returned string ever contains CR or LF
/// </summary>
public static IEnumerable<String> Lines(this String s)
{
int j = 0, c, i;
char ch;
if ((c = s.Length) > 0)
do
{
for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
;
yield return s.Substring(i, j - i);
}
while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
}
Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce the exact same results.
public static IEnumerable<String> Lines(this String s)
{
using (var tr = new StringReader(s))
while (tr.ReadLine() is String L)
yield return L;
}
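The claim that both functions produce the same results can be spot-checked with a small sketch. The hand-written scanner is copied from the answer under the hypothetical name ScanLines, and the StringReader version is renamed ReadLines; the test inputs are arbitrary examples, so this is a sanity check rather than a proof.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinesEquivalenceDemo
{
    // The hand-written scanner from the answer, logic unchanged.
    public static IEnumerable<string> ScanLines(string s)
    {
        int j = 0, c, i;
        char ch;
        if ((c = s.Length) > 0)
            do
            {
                // Advance j to the next CR/LF or to the end of the string.
                for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                    ;
                yield return s.Substring(i, j - i);
            }
            // Step past the terminator; a CR followed by LF is skipped as a pair.
            while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
    }

    // The StringReader-based version from the answer.
    public static IEnumerable<string> ReadLines(string s)
    {
        using (var tr = new StringReader(s))
            while (tr.ReadLine() is string line)
                yield return line;
    }

    public static bool SameLines(string s) =>
        ScanLines(s).SequenceEqual(ReadLines(s));

    public static void Main()
    {
        foreach (var s in new[] { "", "x\r\n", "\n\n", "a\rb\r\nc\n\rd" })
            Console.WriteLine(SameLines(s)); // True for each input
    }
}
```

For the last input both versions return "a", "b", "c", "", "d": the stray "\n\r" is read as two line breaks, not one.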
Late to the party, but I've been using a simple collection of extension methods for just that, which leverage TextReader.ReadLine():
public static class StringReadLinesExtension
{
    public static IEnumerable<string> GetLines(this string text) => GetLines(new StringReader(text));
    public static IEnumerable<string> GetLines(this Stream stm) => GetLines(new StreamReader(stm));
    public static IEnumerable<string> GetLines(this TextReader reader) {
        // using disposes the reader even if the caller stops enumerating early
        using (reader) {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
Using the code is really trivial:
// If you have the text as a string...
var text = "Line 1\r\nLine 2\r\nLine 3";
foreach (var line in text.GetLines())
Console.WriteLine(line);
// You can also use streams like
var fileStm = File.OpenRead(@"c:\tests\file.txt"); // verbatim string, so \t and \f are not treated as escapes
foreach(var line in fileStm.GetLines())
Console.WriteLine(line);
Hope this helps someone out there.
Split a string into lines without allocating a new string for each line.
static IEnumerable<ReadOnlyMemory<char>> GetLines(this string text, string newLine)
{
if (text.Length == 0)
yield break;
var memory = text.AsMemory();
int index;
while ((index = memory.Span.IndexOf(newLine)) != -1)
{
yield return memory.Slice(0, index);
memory = memory.Slice(index + newLine.Length);
}
yield return memory;
}
Example of use
foreach (ReadOnlyMemory<char> line in GetLines(text, "\r\n"))
{
// use the line variable or if needed...
// alternative use
ReadOnlySpan<char> span = line.Span;
string str = line.Span.ToString();
}
\r or \n and ending up with a load of blank lines on Windows-created files. What system uses LFCR line endings, btw? – Acroterion