Best way to split string into lines with maximum length, without breaking words

I want to break a string up into lines of a specified maximum length, without splitting any words, if possible (if there is a word that exceeds the maximum line length, then it will have to be split).

As always, I am acutely aware that strings are immutable and that one should preferably use the StringBuilder class. I have seen examples where the string is split into words and the lines are then built up using the StringBuilder class, but the code below seems "neater" to me.

I said "best" in the title rather than "most efficient" because I am also interested in the "eloquence" of the code. The strings will never be huge, generally splitting into two or three lines, and it won't be happening for thousands of lines.

Is the following code really bad?

private static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
    stringToSplit = stringToSplit.Trim();
    var lines = new List<string>();

    while (stringToSplit.Length > 0)
    {
        if (stringToSplit.Length <= maximumLineLength)
        {
            lines.Add(stringToSplit);
            break;
        }

        var indexOfLastSpaceInLine = stringToSplit.Substring(0, maximumLineLength).LastIndexOf(' ');
        lines.Add(stringToSplit.Substring(0, indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine : maximumLineLength).Trim());
        stringToSplit = stringToSplit.Substring(indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine + 1 : maximumLineLength);
    }

    return lines.ToArray();
}
Slimsy answered 13/3, 2014 at 3:23 Comment(0)
S
14

How about this as a solution:

IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
    var words = stringToSplit.Split(' ').Concat(new [] { "" });
    return
        words
            .Skip(1)
            .Aggregate(
                words.Take(1).ToList(),
                (a, w) =>
                {
                    var last = a.Last();
                    while (last.Length > maximumLineLength)
                    {
                        a[a.Count() - 1] = last.Substring(0, maximumLineLength);
                        last = last.Substring(maximumLineLength);
                        a.Add(last);
                    }
                    var test = last + " " + w;
                    if (test.Length > maximumLineLength)
                    {
                        a.Add(w);
                    }
                    else
                    {
                        a[a.Count() - 1] = test;
                    }
                    return a;
                });
}

I have since reworked this and prefer the following:

IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
    var words = stringToSplit.Split(' ');
    var line = words.First();
    foreach (var word in words.Skip(1))
    {
        var test = $"{line} {word}";
        if (test.Length > maximumLineLength)
        {
            yield return line;
            line = word;
        }
        else
        {
            line = test;
        }
    }
    yield return line;
}
Soembawa answered 13/3, 2014 at 4:1 Comment(8)
This is certainly clever, but it does not split words that exceed the maximum line length and therefore does not satisfy the criteria. I also suspect it's harder for most people to figure out what it's doing at a glance - it was for me. - Slimsy
@ToboldHornblower - It does indeed split words that exceed the maximum length. The while loop in the middle does that. Also, I'm considering writing an extension method to clean up this logic to make it simpler as I find myself writing this kind of function quite a lot. - Soembawa
I ran it on some test data and it did not split the long words. My apologies if I missed something, I will try it again. - Slimsy
Just tried it with one word (ThisIsAReallLongWordThatShouldEasilyExceedTheMaximumPermissibleLineLength) and it does not split it, it prints it out. Same happens when it is part of a sentence. - Slimsy
Ah, I see the issue - you split the word if it is not the first or the last word in the sentence, otherwise, you leave it as is. - Slimsy
Wow, good pick up. I did my testing with large words in the middle. I didn't even think to test the case that the large word was at the beginning or the end. - Soembawa
It turns out it was failing on the last word being the long word. I've fixed it now. Should be 100%. - Soembawa
Nice work! It's certainly made me more aware of Aggregate(), but it is around 4 to 5 times slower than the code I posted when I compare them, and I still think it's not immediately obvious what it's doing. I'm going to mark it as the answer anyway, because it does work and it made me think. - Slimsy
C
25

Even though this post is 3 years old, I wanted to offer a solution that uses Regex to accomplish the same thing.

If you want the string split so that the resulting text can be displayed, you can use this:

public string SplitToLines(string stringToSplit, int maximumLineLength)
{
    return Regex.Replace(stringToSplit, @"(.{1," + maximumLineLength +@"})(?:\s|$)", "$1\n");
}

If on the other hand you need a collection you can use this:

public MatchCollection SplitToLines(string stringToSplit, int maximumLineLength)
{
    return Regex.Matches(stringToSplit, @"(.{1," + maximumLineLength +@"})(?:\s|$)");
}

NOTES

Remember to import regex (using System.Text.RegularExpressions;)

You can use string interpolation on the match:
$@"(.{{1,{maximumLineLength}}})(?:\s|$)"

The MatchCollection works almost like an Array

Matching example with explanation here
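
If you need plain strings rather than Match objects, the collection version above can be projected through capture group 1, which holds the line without the trailing whitespace. This is only a minimal sketch: the class name RegexLineSplitter is made up for illustration, and it assumes the same pattern as above. As far as I can tell, the pattern does not split a word that is longer than the limit, so the leading part of such a word falls outside every match.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
static class RegexLineSplitter
{
    public static List<string> SplitToLines(string stringToSplit, int maximumLineLength)
    {
        // Same pattern as above, built with string interpolation.
        var pattern = $@"(.{{1,{maximumLineLength}}})(?:\s|$)";

        return Regex.Matches(stringToSplit, pattern)
            .Cast<Match>()                    // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value)   // group 1 excludes the trailing whitespace
            .ToList();
    }
}
```

For example, RegexLineSplitter.SplitToLines("aaa bbb ccc", 7) gives the two lines "aaa bbb" and "ccc".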

Cosme answered 12/4, 2017 at 4:52 Comment(1)
To convert MatchCollection to array: var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b").Cast<Match>().Select(m => m.Value).ToArray(); Found at: https://mcmap.net/q/111935/-converting-a-matchcollection-to-string-array - Unpile
A
9

I don't think your solution is too bad. I do, however, think you should break up your ternary into an if else because you are testing the same condition twice. Your code might also have a bug. Based on your description, it seems you want lines <= maxLineLength, but your code counts the space after the last word and uses it in the <= comparison resulting in effectively < behavior for the trimmed string.

Here is my solution.

private static IEnumerable<string> SplitToLines(string stringToSplit, int maxLineLength)
{
    string[] words = stringToSplit.Split(' ');
    StringBuilder line = new StringBuilder();
    foreach (string word in words)
    {
        // A non-empty line ends with a trailing space, so this sum equals the
        // trimmed length the line would have after appending the word.
        if (word.Length + line.Length <= maxLineLength)
        {
            line.Append(word + " ");
        }
        else
        {
            // Flush the current line before starting a new one.
            if (line.Length > 0)
            {
                yield return line.ToString().Trim();
                line.Clear();
            }
            // Hard-split any word that is longer than a whole line.
            string overflow = word;
            while (overflow.Length > maxLineLength)
            {
                yield return overflow.Substring(0, maxLineLength);
                overflow = overflow.Substring(maxLineLength);
            }
            line.Append(overflow + " ");
        }
    }
    yield return line.ToString().Trim();
}

It is a bit longer than your solution, but it should be more straightforward. It also uses a StringBuilder so it is much faster for large strings. I performed a benchmarking test for 20,000 words ranging from 1 to 11 characters each split into lines of 10 character width. My method completed in 14ms compared to 1373ms for your method.
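
For anyone who wants to repeat a measurement like this, here is a rough sketch of a timing harness. It is not the code I used for the numbers above; the class name SplitBenchmark, the fixed seed and the filler character are all assumptions made for illustration. It builds a string of random-length words and times one full enumeration of whatever splitter you pass in.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

// Hypothetical harness, for illustration only.
static class SplitBenchmark
{
    // Builds a space-separated string of wordCount words, each 1 to 11 characters long.
    public static string BuildInput(int wordCount, int seed = 42)
    {
        var random = new Random(seed);
        var builder = new StringBuilder();
        for (int i = 0; i < wordCount; i++)
        {
            if (i > 0) builder.Append(' ');
            builder.Append('x', random.Next(1, 12)); // upper bound is exclusive
        }
        return builder.ToString();
    }

    // Times one full enumeration of the splitter's output, in milliseconds.
    public static long TimeMilliseconds(
        Func<string, int, IEnumerable<string>> splitter, string input, int maximumLineLength)
    {
        var stopwatch = Stopwatch.StartNew();
        splitter(input, maximumLineLength).Count(); // Count() forces lazy iterators to run
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

A call such as SplitBenchmark.TimeMilliseconds(SplitToLines, SplitBenchmark.BuildInput(20000), 10) would time one run. Build in Release mode and repeat the call a few times, since the first run includes JIT compilation.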

Assimilate answered 13/3, 2014 at 5:48 Comment(1)
I appreciate your observation of my testing the same condition twice, that is something I will have to address. - Slimsy
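
One way the two points raised in this answer could be addressed in the question's code is sketched below. The class name LineSplitter and the argument check are additions for illustration, so treat this as an assumption-laden sketch rather than a definitive rewrite. The ternary becomes a single if/else, and the search for the last space starts at index maximumLineLength, so a space sitting exactly at the limit still allows a full-width line.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper class, for illustration only.
static class LineSplitter
{
    public static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
    {
        if (maximumLineLength < 1)
            throw new ArgumentOutOfRangeException(nameof(maximumLineLength));

        stringToSplit = stringToSplit.Trim();
        var lines = new List<string>();

        while (stringToSplit.Length > maximumLineLength)
        {
            // Search backwards from index maximumLineLength (inclusive): a space found
            // there means the first maximumLineLength characters form a complete line.
            int breakAt = stringToSplit.LastIndexOf(' ', maximumLineLength);
            if (breakAt > 0)
            {
                lines.Add(stringToSplit.Substring(0, breakAt).TrimEnd());
                stringToSplit = stringToSplit.Substring(breakAt + 1).TrimStart();
            }
            else
            {
                // No usable space: the word is too long and has to be cut.
                lines.Add(stringToSplit.Substring(0, maximumLineLength));
                stringToSplit = stringToSplit.Substring(maximumLineLength);
            }
        }

        if (stringToSplit.Length > 0)
        {
            lines.Add(stringToSplit);
        }
        return lines;
    }
}
```

With this sketch, SplitToLines("aaa bbb ccc", 7) gives "aaa bbb" and "ccc", and SplitToLines("abcdefghij kl", 4) gives "abcd", "efgh", "ij" and "kl".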
C
2

Try this (untested)

    private static IEnumerable<string> SplitToLines(string value, int maximumLineLength)
    {
        var words = value.Split(' ');
        var line = new StringBuilder();

        foreach (var word in words)
        {
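            // Only flush a non-empty line; otherwise an over-long first word would emit an empty line.
            if (line.Length > 0 && (line.Length + word.Length) >= maximumLineLength)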
            {
                yield return line.ToString();
                line = new StringBuilder();
            }

            line.AppendFormat("{0}{1}", (line.Length>0) ? " " : "", word);
        }

        yield return line.ToString();
    }
Cosette answered 13/3, 2014 at 3:54 Comment(4)
That's in line with the examples I have seen of taking the splitting into words approach. The one problem here is that any word that is longer than the maximum line length (as unlikely as that may be) will not be split (which will violate the constraint that no line exceed the maximum). - Slimsy
You could just add that as a special case/condition and deal with it how you like. Throw exception? (If it's not allowed?) - Cosette
You have two issues with this code (1) it is putting a space at the beginning of each line and (2) it always misses the last line. - Soembawa
Much better, but still doesn't split single words that are longer than the maximum line length. - Soembawa
A
2
  • ~6x faster than the accepted answer
  • More than 1.5x faster than the Regex version in Release Mode (dependent on line length)
  • Optionally keep the space at the end of the line or not (the regex version always keeps it)
static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength, bool removeSpace = true)
{
    int start = 0;
    int end = 0;
    for (int i = 0; i < stringToSplit.Length; i++)
    {
        char c = stringToSplit[i];
        if (c == ' ' || c == '\n')
        {
            if (i - start > maximumLineLength)
            {
                string substring = stringToSplit.Substring(start, end - start);
                start = removeSpace ? end + 1 : end; // + 1 to remove the space on the next line
                yield return substring;
            }
            else
                end = i;
        }
    }
    yield return stringToSplit.Substring(start); // remember last line
}

Here is the example code used to test speeds (again, run on your own machine and test in Release mode to get accurate timings) https://dotnetfiddle.net/h5I1GC
Timings on my machine in Release mode, .NET 4.8:

Accepted Answer: 667ms
Regex: 368ms
My Version: 117ms
Armchair answered 5/4, 2022 at 12:40 Comment(2)
In VS debug mode, dotnetfiddle, and RoslynPad, your solution is slower than the regex version, almost twice as slow. (Your benchmark code is a good example of how not to use Regex: there is RegexOptions.Compiled, and one can cache the Regex when heavy usage is expected; those two together almost double the speed.) Also, the Regex gives a slightly different result. Still, your code is faster in Release mode, maybe due to the absence of debug symbols or heavy optimisation by the compiler. Interesting. - Bobby
@Bobby I took the approach of reading the string at the lowest level, as a char array. At its rawest, there is a single iteration over the string and subsets of an array. I couldn't understand how it could be slower than any other option, which is why I tested in Release mode. - Armchair
C
0

My requirement was to have a line break at the last space before the 30 char limit. So here is how I did it. Hope this helps anyone looking.

private string LineBreakLongString(string input)
{
    const int limit = 30;

    // Nothing to break if the whole string already fits within the limit.
    if (input.Length <= limit)
    {
        return input;
    }

    // Find the last space before the limit, searching backwards from index limit - 1.
    int breakAt = input.LastIndexOf(' ', limit - 1);
    if (breakAt < 0)
    {
        // No space before the limit: leave the input unchanged rather than split a word.
        return input;
    }

    return input.Substring(0, breakAt) + System.Environment.NewLine + input.Substring(breakAt).Trim();
}
Clouet answered 9/5, 2019 at 17:10 Comment(0)
O
-1

An approach using recursive method and ReadOnlySpan (Tested)

public static void SplitToLines(ReadOnlySpan<char> stringToSplit, int index, ref List<string> values)
{
   if (stringToSplit.IsEmpty || index < 1) return;
   var nextIndex = stringToSplit.IndexOf(' ');
   var slice = stringToSplit.Slice(0, nextIndex < 0 ? stringToSplit.Length : nextIndex);

   if (slice.Length <= index)
   {
      values.Add(slice.ToString());
      nextIndex++;
   }
   else
   {
      values.Add(slice.Slice(0, index).ToString());
      nextIndex = index;
   }

   if (stringToSplit.Length <= index) return;
   SplitToLines(stringToSplit.Slice(nextIndex), index, ref values);
}
Onyx answered 30/9, 2020 at 13:43 Comment(0)
