How to split string preserving whole words?

I need to split a long sentence into parts while preserving whole words. Each part should have a given maximum number of characters (including spaces, dots, etc.). For example:

int partLength = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";

Output:

1 part: "Silver badges are awarded for"
2 part: "longer term goals. Silver badges are"
3 part: "uncommon."
Elaterid answered 9/12, 2010 at 12:37 Comment(3)
Are you trying to implement a word-wrap algorithm? - Frayda
Your example was wrong, by the way :) Part 2 shouldn't contain "are", as my solution shows. - Taka
Step 1: split using the given length. Step 2: use a condition to check each word. - Averyaveryl
T
24

Try this:

    static void Main(string[] args)
    {
        int partLength = 35;
        string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
        string[] words = sentence.Split(' ');
        var parts = new Dictionary<int, string>();
        string part = string.Empty;
        int partCounter = 0;
        foreach (var word in words)
        {
            if (part.Length + word.Length < partLength)
            {
                part += string.IsNullOrEmpty(part) ? word : " " + word;
            }
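            else
            {
                // Skip the empty initial part when the very first word does not fit.
                if (!string.IsNullOrEmpty(part))
                {
                    parts.Add(partCounter, part);
                    partCounter++;
                }
                part = word;
            }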
        }
        parts.Add(partCounter, part);
        foreach (var item in parts)
        {
            Console.WriteLine("Part {0} (length = {2}): {1}", item.Key, item.Value, item.Value.Length);
        }
        Console.ReadLine();
    }
Taka answered 9/12, 2010 at 12:55 Comment(1)
Small change if the first word is longer than partLength: if (!string.IsNullOrEmpty(part)) parts.Add(partCounter, part); - Jerkwater
K
17

I knew there had to be a nice LINQ-y way of doing this, so here it is for the fun of it:

var input = "The quick brown fox jumps over the lazy dog.";
var charCount = 0;
var maxLineLength = 11;

var lines = input.Split(' ', StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(w => (charCount += w.Length + 1) / maxLineLength)
    .Select(g => string.Join(" ", g));

// That's all :)

foreach (var line in lines) {
    Console.WriteLine(line);
}

Obviously this code works only as long as the query is not parallel, since it depends on charCount to be incremented "in word order".
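
If the side effect on the captured counter is a concern, a plain loop avoids it entirely. The following is only a minimal sketch, not the LINQ query above: WordWrapSketch and WrapWords are hypothetical names, and it assumes that a single word longer than the limit should simply get a line of its own.

```csharp
using System;
using System.Collections.Generic;

public static class WordWrapSketch
{
    // Hypothetical helper: greedy wrap that keeps every line at or below
    // maxLength, except for single words that are themselves longer than
    // maxLength, which are placed on a line of their own.
    public static List<string> WrapWords(string text, int maxLength)
    {
        var lines = new List<string>();
        var current = string.Empty;

        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (current.Length == 0)
            {
                // The first word on a line always goes in.
                current = word;
            }
            else if (current.Length + 1 + word.Length <= maxLength)
            {
                // The word fits, counting its separating space.
                current += " " + word;
            }
            else
            {
                // The line is full: emit it and start a new one.
                lines.Add(current);
                current = word;
            }
        }

        if (current.Length > 0)
        {
            lines.Add(current);
        }
        return lines;
    }
}
```

Calling WordWrapSketch.WrapWords(input, 35) on the sentence from the question should give three lines, with "are" moved to the last one, because the length check is made against the actual current line rather than a running total.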

Keeling answered 9/12, 2010 at 13:0 Comment(3)
Looks like you need to change g to g.ToArray() in the string.Join call. - Samarskite
There's a bug in this, see @JonLord's answer below: https://mcmap.net/q/685894/-how-to-split-string-preserving-whole-words - Cattery
@Keeling maybe you need to change the Split call for .NET Framework v4.5 from input.Split(' ', StringSplitOptions.RemoveEmptyEntries) to input.Split(new []{' '}, StringSplitOptions.RemoveEmptyEntries). - Grindery
A
13

I've been testing Jon's and Lessan's answers, but they don't work properly if your max length needs to be absolute, rather than approximate. As their counter increments, it doesn't count the empty space left at the end of a line.

Running their code against the OP's example, you get:

1 part: "Silver badges are awarded for " - 29 Characters
2 part: "longer term goals. Silver badges are" - 36 Characters
3 part: "uncommon. " - 13 Characters

The "are" on line two should be on line three. This happens because the counter does not include the 6 unused characters at the end of line one.
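
To make the drift visible, the grouping key can be computed word by word. This is just an illustrative sketch of the running-count-divided-by-max idea; GroupKeyDemo and GroupKeys are hypothetical names, not part of either answer.

```csharp
using System;
using System.Collections.Generic;

public static class GroupKeyDemo
{
    // Reproduces the grouping key used by the LINQ approach: a running
    // character count (word length + 1 for the space) divided by max.
    public static List<int> GroupKeys(string text, int max)
    {
        var keys = new List<int>();
        var charCount = 0;
        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            charCount += word.Length + 1;
            keys.Add(charCount / max);   // integer division picks the "line"
        }
        return keys;
    }
}
```

For the OP's sentence and a max of 35, the running count is 7, 14, 18, 26, 30, 37, 42, 49, 56, 63, 67, 77, so the keys are 0,0,0,0,0,1,1,1,1,1,1,2. The second "are" (running count 67) still gets key 1, even though line one only used 29 of its 35 characters, which is how line two ends up at 36.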

I came up with the following modification of Lessan's answer to account for this:

public static class ExtensionMethods
{
    public static string[] Wrap(this string text, int max)
    {
        var charCount = 0;
        var lines = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return lines.GroupBy(w => (charCount += (((charCount % max) + w.Length + 1 >= max) 
                        ? max - (charCount % max) : 0) + w.Length + 1) / max)
                    .Select(g => string.Join(" ", g.ToArray()))
                    .ToArray();
    }
}
Ancestry answered 10/7, 2013 at 12:53 Comment(2)
string[] texts = text.Wrap(50); works perfectly, thanks. - Dormie
Still has a bug. Pass it the string "The quick brown fox jumps over the lazy" and a max of 20. It should return 2 lines of length 19, but it returns 3 lines. There is room for 'fox' on the first line, making room for the rest of the string on the second line. Perhaps a simpler-to-understand non-LINQ version would be less cool but actually produce working code? Three people in this question alone have tried and failed ;) - Mete
R
8

It seems like everyone is using some form of "Split then rebuild the sentence"...

I thought I would take a stab at this the way my brain would logically think about doing this manually, which is:

  • Split on length
  • Go backwards to the nearest space and use that chunk
  • Remove the used chunk and start over

The code ended up being a little more complex than I was hoping for, however I believe it handles most (all?) edge cases - including words that are longer than maxLength, when the words end exactly on the maxLength, etc.

Here's my function:

private static List<string> SplitWordsByLength(string str, int maxLength)
{
    List<string> chunks = new List<string>();
    while (str.Length > 0)
    {
        if (str.Length <= maxLength)                    //if remaining string is less than length, add to list and break out of loop
        {
            chunks.Add(str);
            break;
        }

        string chunk = str.Substring(0, maxLength);     //Get maxLength chunk from string.

        if (char.IsWhiteSpace(str[maxLength]))          //if next char is a space, we can use the whole chunk and remove the space for the next line
        {
            chunks.Add(chunk);
            str = str.Substring(chunk.Length + 1);      //Remove chunk plus space from original string
        }
        else
        {
            int splitIndex = chunk.LastIndexOf(' ');    //Find last space in chunk.
            if (splitIndex != -1)                       //If space exists in string,
                chunk = chunk.Substring(0, splitIndex); //  remove chars after space.
            str = str.Substring(chunk.Length + (splitIndex == -1 ? 0 : 1));      //Remove chunk plus space (if found) from original string
            chunks.Add(chunk);                          //Add to list
        }
    }
    return chunks;
}

Test usage:

string testString = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
int length = 35;

List<string> test = SplitWordsByLength(testString, length);

foreach (string chunk in test)
{
    Console.WriteLine(chunk);  
}

Console.ReadLine();
Recognizee answered 22/8, 2019 at 14:26 Comment(0)
M
7

Split the string on a space, then build up new strings from the resulting array, stopping before your limit for each new segment.

Untested pseudo-code:

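// Requires: using System.Collections.Generic;
// Example values taken from the question.
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
int myLimit = 35;

string[] words = sentence.Split(new char[] {' '});
IList<string> sentenceParts = new List<string>();
sentenceParts.Add(string.Empty);

int partCounter = 0;    

foreach (var word in words)
{
  string current = sentenceParts[partCounter];

  // Start a new part when appending " " + word would exceed the limit.
  // An empty part always accepts the word, so no empty parts are emitted.
  if(current.Length > 0 && current.Length + 1 + word.Length > myLimit)
  {
     partCounter++;
     sentenceParts.Add(string.Empty);
     current = string.Empty;
  }

  // Join with a single space and avoid a trailing space on each part.
  sentenceParts[partCounter] = current.Length == 0 ? word : current + " " + word;
}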
Montgomery answered 9/12, 2010 at 12:41 Comment(0)
S
3

Expanding on Jon's answer above: I needed to switch g with g.ToArray(), and also change max to (max + 2) to get an exact wrapping on the max'th character.

public static class ExtensionMethods
{
    public static string[] Wrap(this string text, int max)
    {
        var charCount = 0;
        var lines = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return lines.GroupBy(w => (charCount += w.Length + 1) / (max + 2))
                    .Select(g => string.Join(" ", g.ToArray()))
                    .ToArray();
    }
}

And here is sample usage as NUnit tests:

[Test]
public void TestWrap()
{
    Assert.AreEqual(2, "A B C".Wrap(4).Length);
    Assert.AreEqual(1, "A B C".Wrap(5).Length);

    Assert.AreEqual(2, "AA BB CC".Wrap(7).Length);
    Assert.AreEqual(1, "AA BB CC".Wrap(8).Length);

    Assert.AreEqual(2, "TEST TEST TEST TEST".Wrap(10).Length);
    Assert.AreEqual(2, "  TEST TEST TEST TEST  ".Wrap(10).Length);
    Assert.AreEqual("TEST TEST", "  TEST TEST TEST TEST  ".Wrap(10)[0]);
}
Samarskite answered 16/2, 2011 at 16:38 Comment(0)
M
2

At first I was thinking this might be a Regex kind of thing but here's my shot at it:

List<string> parts = new List<string>();
int partLength = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";

string[] pieces = sentence.Split(' ');
StringBuilder tempString = new StringBuilder("");

foreach(var piece in pieces)
{
    if(piece.Length + tempString.Length + 1 > partLength) 
    {
        parts.Add(tempString.ToString());
        tempString.Clear();        
    }
    tempString.Append(" " + piece); 
}
Mook answered 9/12, 2010 at 12:44 Comment(0)
E
1

Joel, there is a little bug in your code that I've corrected here:

public static string[] StringSplitWrap(string sentence, int MaxLength)
{
        List<string> parts = new List<string>();

        string[] pieces = sentence.Split(' ');
        StringBuilder tempString = new StringBuilder("");

        foreach (var piece in pieces)
        {
            if (tempString.Length > 0 && piece.Length + tempString.Length + 1 > MaxLength)
            {
                parts.Add(tempString.ToString());
                tempString.Clear();
            }
            tempString.Append((tempString.Length == 0 ? "" : " ") + piece);
        }

        if (tempString.Length>0)
            parts.Add(tempString.ToString());

        return parts.ToArray();
}
Exacting answered 3/1, 2012 at 9:12 Comment(0)
T
1

This works:

int partLength = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
List<string> lines =
    sentence
        .Split(' ')
        .Aggregate(new [] { "" }.ToList(), (a, x) =>
        {
            var last = a[a.Count - 1];
            if ((last + " " + x).Length > partLength)
            {
                a.Add(x);
            }
            else
            {
                a[a.Count - 1] = (last + " " + x).Trim();
            }
            return a;
        });

It gives me:

Silver badges are awarded for 
longer term goals. Silver badges 
are uncommon. 
Traynor answered 23/2, 2017 at 0:45 Comment(0)
L
0

While CsConsoleFormat† was primarily designed to format text for console, it supports generating plain text as well.

var doc = new Document().AddChildren(
  new Div("Silver badges are awarded for longer term goals. Silver badges are uncommon.") {
    TextWrap = TextWrapping.WordWrap
  }
);
var bounds = new Rect(0, 0, 35, Size.Infinity);
string text = ConsoleRenderer.RenderDocumentToText(doc, new TextRenderTarget(), bounds);

And, if you actually need trimmed strings like in your question:

List<string> lines = text.Trim()
  .Split(new[] { Environment.NewLine }, StringSplitOptions.None)
  .Select(s => s.Trim())
  .ToList();

In addition to word wrap on spaces, you get proper handling of hyphens, zero-width spaces, no-break spaces etc.

† CsConsoleFormat was developed by me.

Loaf answered 1/3, 2018 at 23:25 Comment(0)
