C# Fastest way to remove extra white spaces

28

59

What is the fastest way to replace runs of extra white space with a single white space?
e.g.

from

foo      bar 

to

foo bar
Haemolysis answered 22/6, 2011 at 15:27 Comment(3)
Fastest to write, smallest line of code, understandable/maintainable LOC, CPU time, other?Galvanotropism
possible duplicate of How to replace multiple white spaces with one white spaceBatton
Regex by SLaks=1407ms, StringBuilder by Blindy=154ms, Array=130ms, NoIf=91ms. Source code and test results below in my answer.Rifkin
R
56

The fastest way? Iterate over the string and build a second copy in a StringBuilder character by character, only copying one space for each group of spaces.

The easier to type Replace variants will create a bucket load of extra strings (or waste time building the regex DFA).
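
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the original benchmark code; the names `SpaceCollapser` and `CollapseSpaces` are made up for the example, and it treats only the ' ' character as a space.

```csharp
using System.Text;

public static class SpaceCollapser
{
    // Copies the input character by character, emitting only the first
    // space of each run of consecutive spaces.
    public static string CollapseSpaces(string input)
    {
        // The result is never longer than the input, so size the buffer up front.
        var sb = new StringBuilder(input.Length);
        bool previousWasSpace = false;

        foreach (char c in input)
        {
            bool isSpace = c == ' ';
            if (isSpace && previousWasSpace)
                continue; // skip the extra spaces in a run

            sb.Append(c);
            previousWasSpace = isSpace;
        }

        return sb.ToString();
    }
}
```

With the question's input, `SpaceCollapser.CollapseSpaces("foo      bar ")` yields `"foo bar "`; a single trailing space is kept because it is not part of a run.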

Edit with comparison results:

Using http://ideone.com/NV6EzU, with n=50 (had to reduce it on ideone because it took so long they had to kill my process), I get:

Regex: 7771ms.

Stringbuilder: 894ms.

This is as expected: Regex is horribly inefficient for something this simple.

Rhombohedral answered 22/6, 2011 at 15:30 Comment(14)
A Compiled regex will execute as fast as anything you can write yourselfFusillade
@SLaks, another sweeping generalization disproved!Rhombohedral
Compiling Regex only makes sense if you're processing lots of strings in a batch.Unicellular
@L_7337, That's what happens to 3 year (almost 4 now) old posts, links sometimes stop working. You'll just have to take my word for it.Rhombohedral
@Blindy: Could you then maybe add the code you used directly to your answer? Thank you in advance!Geotaxis
Blindy, I'll upvote if you can add the source code from your link that is now broken. Links are bad in StackOverflow. It's always best to post source code.Concinnate
You don't need the source, I described the algorithm in enough detail. The linked program is just for the speed comparison.Rhombohedral
@Rhombohedral What about string resultString = string.Join(" ", sourceString.Split(' ').Where(s => s != ""));?Waltner
Split is also ridiculously slow, you're putting pressure on the heap for no reason by allocating that array.Rhombohedral
I've reinstated the link from web.archive.org ;)Speedwell
It's going to take years to get into top 10 answers here, but I have a version with no ifs which seems to be measurably faster than the sb version.Rifkin
I can't see how you implemented my algorithm in your code, but if you have any new StringBuilder() in it, try moving it outside so it's only created once, and use Clear() between runs. I have a feeling most of that difference is the repeated creation of the object.Rhombohedral
My results (code below) are Regex by SLaks 1407ms, StringBuilder by Blindy 154ms, Array 130ms, NoIf 91ms.Rifkin
You should try the regex version in .net 5, they did some crazy improvements on that front.Rhombohedral
F
54

You can use a regex:

static readonly Regex trimmer = new Regex(@"\s\s+");

s = trimmer.Replace(s, " ");

For added performance, pass RegexOptions.Compiled.
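
A sketch of the compiled variant, assuming the same pattern as above; the class and method names are illustrative only. Whether `RegexOptions.Compiled` pays off probably depends on how many strings are processed, since compilation adds a one-time construction cost.

```csharp
using System.Text.RegularExpressions;

public static class RegexTrimmer
{
    // Built once and reused; RegexOptions.Compiled trades slower
    // construction for faster matching afterwards.
    private static readonly Regex Trimmer =
        new Regex(@"\s\s+", RegexOptions.Compiled);

    // Replaces every run of two or more whitespace characters with one space.
    // A lone tab or newline is left unchanged by this pattern.
    public static string Collapse(string s) => Trimmer.Replace(s, " ");
}
```

Note that `\s\s+` only matches runs of at least two whitespace characters, so a single tab between words stays a tab.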

Fusillade answered 22/6, 2011 at 15:29 Comment(3)
@Navid: Replace \s with a space: new Regex(@"  +") (two space characters)Fusillade
It's kind of funny that this is the accepted answer given his initial claim about performance.Rhombohedral
My test results (code below) are Regex by SLaks=1407ms, StringBuilder by Blindy=154ms, Array=130ms, NoIf=91ms.Rifkin
E
37

A bit late, but I have done some benchmarking to get the fastest way to remove extra whitespaces. If there are any faster answers, I would love to add them.

Results:

  1. NormalizeWhiteSpaceForLoop: 156 ms (by Me - From my answer on removing all whitespace)
  2. NormalizeWhiteSpace: 267 ms (by Alex K.)
  3. RegexCompiled: 1950 ms (by SLaks)
  4. Regex: 2261 ms (by SLaks)

Code:

public class RemoveExtraWhitespaces
{
    public static string WithRegex(string text)
    {
        return Regex.Replace(text, @"\s+", " ");
    }

    public static string WithRegexCompiled(Regex compiledRegex, string text)
    {
        return compiledRegex.Replace(text, " ");
    }

    public static string NormalizeWhiteSpace(string input)
    {
        if (string.IsNullOrEmpty(input))
            return string.Empty;

        int current = 0;
        char[] output = new char[input.Length];
        bool skipped = false;

        foreach (char c in input.ToCharArray())
        {
            if (char.IsWhiteSpace(c))
            {
                if (!skipped)
                {
                    if (current > 0)
                        output[current++] = ' ';

                    skipped = true;
                }
            }
            else
            {
                skipped = false;
                output[current++] = c;
            }
        }

        return new string(output, 0, current);
    }

    public static string NormalizeWhiteSpaceForLoop(string input)
    {
        int len = input.Length,
            index = 0,
            i = 0;
        var src = input.ToCharArray();
        bool skip = false;
        char ch;
        for (; i < len; i++)
        {
            ch = src[i];
            switch (ch)
            {
                case '\u0020':
                case '\u00A0':
                case '\u1680':
                case '\u2000':
                case '\u2001':
                case '\u2002':
                case '\u2003':
                case '\u2004':
                case '\u2005':
                case '\u2006':
                case '\u2007':
                case '\u2008':
                case '\u2009':
                case '\u200A':
                case '\u202F':
                case '\u205F':
                case '\u3000':
                case '\u2028':
                case '\u2029':
                case '\u0009':
                case '\u000A':
                case '\u000B':
                case '\u000C':
                case '\u000D':
                case '\u0085':
                    if (skip) continue;
                    src[index++] = ch;
                    skip = true;
                    continue;
                default:
                    skip = false;
                    src[index++] = ch;
                    continue;
            }
        }

        return new string(src, 0, index);
    }
}

Tests:

[TestFixture]
public class RemoveExtraWhitespacesTest
{
    private const string _text = "foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                    
 moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  
bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo foo                  bar                  foobar                     moo ";
    private const string _expected = "foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo foo bar foobar moo ";

    private const int _iterations = 10000;

    [Test]
    public void Regex()
    {
        var result = TimeAction("Regex", () => RemoveExtraWhitespaces.WithRegex(_text));
        Assert.AreEqual(_expected, result);
    }

    [Test]
    public void RegexCompiled()
    {
        var compiledRegex = new Regex(@"\s+", RegexOptions.Compiled);
        var result = TimeAction("RegexCompiled", () => RemoveExtraWhitespaces.WithRegexCompiled(compiledRegex, _text));
        Assert.AreEqual(_expected, result);
    }

    [Test]
    public void NormalizeWhiteSpace()
    {
        var result = TimeAction("NormalizeWhiteSpace", () => RemoveExtraWhitespaces.NormalizeWhiteSpace(_text));
        Assert.AreEqual(_expected, result);
    }

    [Test]
    public void NormalizeWhiteSpaceForLoop()
    {
        var result = TimeAction("NormalizeWhiteSpaceForLoop", () => RemoveExtraWhitespaces.NormalizeWhiteSpaceForLoop(_text));
        Assert.AreEqual(_expected, result);
    }

    public string TimeAction(string name, Func<string> func)
    {
        var timer = Stopwatch.StartNew();
        string result = string.Empty;
        for (int i = 0; i < _iterations; i++)
        {
            result = func();
        }

        timer.Stop();
        Console.WriteLine(string.Format("{0}: {1} ms", name, timer.ElapsedMilliseconds));
        return result;
    }
}
Embosser answered 2/6, 2016 at 12:34 Comment(4)
I've just added a version with no branching (no switch, no if) that does only space, and seems to be faster than NormalizeWhiteSpaceForLoop.Rifkin
cool. Add your code as an answer and i'll run the benchmarking.Embosser
What's the point of these methods if they don't remove the last space in the string?Raccoon
These methods will replace duplicate whitespaces (more than one whitespace after each other with one whitespace). So if there is one whitespace at the end of the string, this will not be removed.Embosser
E
14
string q = " Hello     how are   you           doing?";
string a = String.Join(" ", q.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries));
Edmea answered 16/10, 2014 at 21:41 Comment(1)
short solution with no regex and works as a charm - cheers!Tomato
F
12

I use the methods below. They handle all whitespace characters (not only spaces), trim both leading and trailing whitespace, collapse runs of whitespace, and replace every whitespace character with a space (so we have a uniform space separator). And these methods are fast.

public static String CompactWhitespaces( String s )
{
    StringBuilder sb = new StringBuilder( s );

    CompactWhitespaces( sb );

    return sb.ToString();
}

public static void CompactWhitespaces( StringBuilder sb )
{
    if( sb.Length == 0 )
        return;

    // set [start] to first not-whitespace char or to sb.Length

    int start = 0;

    while( start < sb.Length )
    {
        if( Char.IsWhiteSpace( sb[ start ] ) )
            start++;
        else 
            break;
    }

    // if [sb] has only whitespaces, then return empty string

    if( start == sb.Length )
    {
        sb.Length = 0;
        return;
    }

    // set [end] to last not-whitespace char

    int end = sb.Length - 1;

    while( end >= 0 )
    {
        if( Char.IsWhiteSpace( sb[ end ] ) )
            end--;
        else 
            break;
    }

    // compact string

    int dest = 0;
    bool previousIsWhitespace = false;

    for( int i = start; i <= end; i++ )
    {
        if( Char.IsWhiteSpace( sb[ i ] ) )
        {
            if( !previousIsWhitespace )
            {
                previousIsWhitespace = true;
                sb[ dest ] = ' ';
                dest++;
            }
        }
        else
        {
            previousIsWhitespace = false;
            sb[ dest ] = sb[ i ];
            dest++;
        }
    }

    sb.Length = dest;
}
Fennie answered 20/5, 2013 at 14:54 Comment(1)
Worth noting that a single newline, '\n', will be replaced with a space, ' '.Koehn
C
9
string text = "foo       bar";
text = Regex.Replace(text, @"\s+", " ");
// text = "foo bar"

This solution works with spaces, tabs, and newline. If you want just spaces, replace '\s' with ' '.

Cristoforo answered 22/6, 2011 at 15:34 Comment(1)
Don't re-parse the regex every time.Fusillade
F
8

I needed one of these for larger strings and came up with the routine below.

Any consecutive white-space (including tabs, newlines) is replaced with whatever is in normalizeTo. Leading/trailing white-space is removed.

It's around 8 times faster than a RegEx with my 5k->5mil char strings.

internal static string NormalizeWhiteSpace(string input, char normalizeTo = ' ')
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;

    int current = 0;
    char[] output = new char[input.Length];
    bool skipped = false;

    foreach (char c in input.ToCharArray())
    {
        if (char.IsWhiteSpace(c))
        {
            if (!skipped)
            {
                if (current > 0)
                    output[current++] = normalizeTo;

                skipped = true;
            }
        }
        else
        {
            skipped = false;
            output[current++] = c;
        }
    }

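    // Drop the trailing separator, if one was written. The current > 0 check
    // keeps whitespace-only input (e.g. " ") from producing a negative length.
    return new string(output, 0, skipped && current > 0 ? current - 1 : current);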
}
Firth answered 29/7, 2014 at 19:46 Comment(2)
This solution throws an exception if input = " " (single space), on return line. I changed to something like this, now it's working fine.Batton
Works great. Fast, no extra allocations. On the start, IsNullOrWhiteSpace check is needed, due to " " on input error.Lw
B
5
string yourWord = "beep boop    baap beep   boop    baap             beep";

yourWord = yourWord.Replace("  ", " |").Replace("| ", "").Replace("|", "");
Breakfront answered 27/2, 2013 at 0:3 Comment(1)
Provided you don't have a "|" in yourWord.Parallax
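
To see why the chained Replace calls work, it may help to split them into steps. The sketch below uses a hypothetical helper name (`PipeTrick.Collapse`) and, as the comment above notes, assumes the input contains no '|' character.

```csharp
public static class PipeTrick
{
    public static string Collapse(string s)
    {
        // Step 1: every pair of spaces becomes " |", so a run of spaces
        // turns into " | | ..." (plus one plain space if the run was odd).
        string step1 = s.Replace("  ", " |");

        // Step 2: removing each "| " drops the spaces between markers,
        // leaving one space followed by at most one marker.
        string step2 = step1.Replace("| ", "");

        // Step 3: remove any marker that is left over.
        return step2.Replace("|", "");
    }
}
```

For example, `"a    b"` (four spaces) becomes `"a | |b"`, then `"a |b"`, then `"a b"`.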
J
5

I've tried using StringBuilder to:

  1. remove extra whitespace substrings
  2. accept characters from looping over the original string, as Blindy suggests

Here's the best balance of performance & readability I've found (using 100,000 iteration timing runs). Sometimes this tests faster than a less-legible version, at most 5% slower. On my small test string, regex takes 4.24x as much time.

public static string RemoveExtraWhitespace(string str)
    {
        var sb = new StringBuilder();
        var prevIsWhitespace = false;
        foreach (var ch in str)
        {
            var isWhitespace = char.IsWhiteSpace(ch);
            if (prevIsWhitespace && isWhitespace)
            {
                continue;
            }
            sb.Append(ch);
            prevIsWhitespace = isWhitespace;
        }
        return sb.ToString();
    }
Jaredjarek answered 16/6, 2015 at 1:17 Comment(1)
For better memory usage, you can give StringBuilder initial capacity.Mccann
D
4

It's not fast, but if simplicity helps, this works:

while (text.Contains("  ")) text=text.Replace("  ", " ");
Derosier answered 11/4, 2016 at 21:47 Comment(0)
L
4

This piece of code works well. I have not measured the performance.

string text = "   hello    -  world,  here   we go  !!!    a  bc    ";
string.Join(" ", text.Split().Where(x => x != ""));
// Output
// "hello - world, here we go !!! a bc"
Luisluisa answered 28/12, 2017 at 22:44 Comment(0)
R
4

I've tried with an array and with no if.

Results

PS C:\dev\Spaces> dotnet run -c release
// .NETCoreApp,Version=v3.0
Seed=7, n=20, s.Length=2828670
Regex by SLaks            1407ms, len=996757
StringBuilder by Blindy    154ms, len=996757
Array                      130ms, len=996757
NoIf                        91ms, len=996757
All match!

Methods

private static string WithNoIf(string s)
{
    var dst = new char[s.Length];
    uint end = 0;
    char prev = char.MinValue;
    for (int k = 0; k < s.Length; ++k)
    {
        var c = s[k];
        dst[end] = c;

        // We'll move forward if the current character is not ' ' or if prev char is not ' '
        // To avoid 'if' let's get diffs for c and prev and then use bitwise operations to get 
        // 0 if n is 0 or 1 if n is non-zero
        uint x = (uint)(' ' - c) | (uint)(' ' - prev); // non-zero if either diff is non-zero (OR, unlike +, cannot wrap around to 0)

        end += ((x | (~x + 1)) >> 31) & 1; // https://mcmap.net/q/75504/-check-if-a-number-is-non-zero-using-bitwise-operators-in-c by ruslik
        prev = c;
    }
    return new string(dst, 0, (int)end);
}
private static string WithArray(string s)
{
    var dst = new char[s.Length];
    int end = 0;
    char prev = char.MinValue;
    for (int k = 0; k < s.Length; ++k)
    {
        char c = s[k];
        if (c != ' ' || prev != ' ') dst[end++] = c;
        prev = c;
    }
    return new string(dst, 0, end);
}
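
The test code below calls `WithRegex` and `WithSb`, which are not shown in this answer. A plausible reconstruction might look like the following, assuming they follow the regex and StringBuilder answers above and handle only the space character (so that all four results can match). The bodies and the class name are assumptions, not the author's code.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class AssumedBaselines
{
    // Assumed: a cached regex that collapses runs of two or more spaces.
    private static readonly Regex MultiSpace = new Regex(" {2,}");

    public static string WithRegex(string s) => MultiSpace.Replace(s, " ");

    // Assumed: the character-by-character StringBuilder copy.
    public static string WithSb(string s)
    {
        var sb = new StringBuilder(s.Length);
        char prev = char.MinValue;
        foreach (char c in s)
        {
            // Append unless this is a space directly after another space.
            if (c != ' ' || prev != ' ') sb.Append(c);
            prev = c;
        }
        return sb.ToString();
    }
}
```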

Test code

public static void Main()
{
    const int n = 20;
    const int seed = 7;
    string s = GetTestString(seed);

    var fs = new (string Name, Func<string, string> Func)[]{
        ("Regex by SLaks", WithRegex),
        ("StringBuilder by Blindy", WithSb),
        ("Array", WithArray),
        ("NoIf", WithNoIf),
    };

    Console.WriteLine($"Seed={seed}, n={n}, s.Length={s.Length}");
    var d = new Dictionary<string, string>(); // method, result
    var sw = new Stopwatch();
    foreach (var f in fs)
    {
        sw.Restart();
        var r = "";
        for( int i = 0; i < n; i++) r = f.Func(s);
        sw.Stop();
        d[f.Name] = r;
        Console.WriteLine($"{f.Name,-25} {sw.ElapsedMilliseconds,4}ms, len={r.Length}");
    }
    Console.WriteLine(d.Values.All( v => v == d.Values.First()) ? "All match!" : "Not all match! BAD");
}

private static string GetTestString(int seed)
{
    // by blindy from https://mcmap.net/q/74235/-c-fastest-way-to-remove-extra-white-spaces
    var rng = new Random(seed);
    // random 1mb+ string (it's slow enough...)
    StringBuilder ssb = new StringBuilder(1 * 1024 * 1024);
    for (int i = 0; i < 1 * 1024 * 1024; ++i)
        if (rng.Next(5) == 0)
            ssb.Append(new string(' ', rng.Next(20)));
        else
            ssb.Append((char)(rng.Next(128 - 32) + 32));
    string s = ssb.ToString();
    return s;
}
Rifkin answered 18/11, 2019 at 12:59 Comment(0)
G
2

try this:

System.Text.RegularExpressions.Regex.Replace(input, @"\s+", " ");
Glycoside answered 22/6, 2011 at 15:30 Comment(1)
Don't re-parse the regex every time.Fusillade
W
2

A few requirements are not clear in this question which deserve some thought.

  1. Do you want a single leading or trailing white space character?
  2. When you replace all white space with a single character, do you want that character to be consistent? (i.e. many of these solutions would replace \t\t with \t and '  ' with ' '.)

This is a very efficient version which replaces all white space with a single space and removes any leading and trailing white space prior to the for loop.

  public static string WhiteSpaceToSingleSpaces(string input)
  {
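    // Trim first so the length check and the loop both see the trimmed text.
    input = input.Trim();

    if (input.Length < 2) 
        return input;

    StringBuilder sb = new StringBuilder(input.Length);

    // After Trim() the first character cannot be white space, so keep it as-is.
    sb.Append(input[0]);
    bool lastCharWhiteSpace = false;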

    for (int i = 1; i < input.Length; i++)
    {
        bool whiteSpace = char.IsWhiteSpace(input[i]);

        //Skip duplicate whitespace characters
        if (whiteSpace && lastCharWhiteSpace)
            continue;

        //Replace all whitespace with a single space.
        if (whiteSpace)
            sb.Append(' ');
        else
            sb.Append(input[i]);

        //Keep track of the last character's whitespace status
        lastCharWhiteSpace = whiteSpace;
    }

    return sb.ToString();
  }
Williamwilliams answered 1/7, 2016 at 19:40 Comment(0)
S
2

I don't know if it's the fastest way, but I use this and it works for me:

    /// <summary>
    /// Remove all extra spaces and tabs between words in the specified string!
    /// </summary>
    /// <param name="str">The specified string.</param>
    public static string RemoveExtraSpaces(string str)
    {
        str = str.Trim();
        StringBuilder sb = new StringBuilder();
        bool space = false;
        foreach (char c in str)
        {
            if (char.IsWhiteSpace(c)) { space = true; } // IsWhiteSpace already covers tab (char 9)
            else { if (space) { sb.Append(' '); } sb.Append(c); space = false; }
        }
        return sb.ToString();
    }
Scandian answered 29/3, 2019 at 2:19 Comment(0)
U
1

This is funny, but on my PC the method below is just as fast as Sergey Povalyaev's StringBuilder approach (~282ms for 1000 reps, 10k src strings). Not sure about memory usage though.

string RemoveExtraWhiteSpace(string src, char[] wsChars){
   return string.Join(" ",src.Split(wsChars, StringSplitOptions.RemoveEmptyEntries));
}

Obviously it works okay with any chars - not just spaces.

This is not what the OP asked for, but if what you really need is to replace specific consecutive characters in a string with only one instance, you can use this relatively efficient method:

    string RemoveDuplicateChars(string src, char[] dupes){  
        var sd = (char[])dupes.Clone();  
        Array.Sort(sd);

        var res = new StringBuilder(src.Length);

        for(int i = 0; i<src.Length; i++){
            if( i==0 || src[i]!=src[i-1] || Array.BinarySearch(sd,src[i])<0){
                res.Append(src[i]); 
            }
        }
        return res.ToString();
    }
Unicellular answered 6/6, 2014 at 10:26 Comment(2)
This won't work correctly, for example: RemoveDuplicateChars("aa--sdf a", new char[] { 'a' }) will return: "--sdf " while it should return "a--sdf a".Burrstone
Okay, the RemoveDuplicateChars was not the best name for this method, but if you look at the OP's question you can see that the goal was to replace in the source string any number of what can be considered a whitespace character with a single space. so my remark re "any chars" means that. I'll edit my answer to make it more obvious.Unicellular
S
1
public string GetCorrectString(string IncorrectString)
    {
        string[] strarray = IncorrectString.Split(' ');
        var sb = new StringBuilder();
        foreach (var str in strarray)
        {
            if (str != string.Empty)
            {
                sb.Append(str).Append(' ');
            }
        }
        return sb.ToString().Trim();
    }
Stabler answered 24/3, 2015 at 6:27 Comment(0)
Z
1

I just whipped this up, haven't tested it yet though. But I felt this was elegant, and avoids regex:

    /// <summary>
    /// Removes extra white space.
    /// </summary>
    /// <param name="s">
    /// The string
    /// </param>
    /// <returns>
    /// The string, with only single white-space groupings. 
    /// </returns>
    public static string RemoveExtraWhiteSpace(this string s)
    {
        if (s.Length == 0)
        {
            return string.Empty;
        }

        var stringBuilder = new StringBuilder();
        var whiteSpaceCount = 0;
        foreach (var character in s)
        {
            if (char.IsWhiteSpace(character))
            {
                whiteSpaceCount++;
            }
            else
            {
                whiteSpaceCount = 0;
            }

            if (whiteSpaceCount > 1)
            {
                continue;
            }

            stringBuilder.Append(character);
        }

        return stringBuilder.ToString();
    }
Zinc answered 20/10, 2016 at 20:54 Comment(0)
A
1

Am I missing something here? I came up with this:

// Input: "HELLO     BEAUTIFUL       WORLD!"
private string NormalizeWhitespace(string inputStr)
{
    // First split the string on the spaces but exclude the spaces themselves
    // Using the input string the length of the array will be 3. If the spaces
    // were not filtered out they would be included in the array
    var splitParts = inputStr.Split(' ').Where(x => x != "").ToArray();

    // The normalized string is accumulated here.
    var retVal = "";

    // Now iterate over the parts in the array and add them to the return
    // string. If the current part is not the last part, add a space after.
    for (int i = 0; i < splitParts.Length; i++)
    {
        retVal += splitParts[i];
        if (i != splitParts.Length - 1)
        {
            retVal += " ";
        }
    }
    return retVal;
}
// Would return "HELLO BEAUTIFUL WORLD!"

I know I am creating a second string here to return it, as well as creating the splitParts array. I just figured this is pretty straightforward. Maybe I am not taking into account some of the potential scenarios.

Activator answered 25/5, 2017 at 15:13 Comment(1)
"I know I am creating a second string here" -- you're actually creating a string for each word, and an array to hold them all together. That's a lot of memory allocation that the garbage collector will have to freeze your threads to clean up.Rhombohedral
H
1

I know this is really old, but the easiest way to compact whitespace (replace any recurring whitespace character with a single "space" character) is as follows:

    public static string CompactWhitespace(string astring)
    {
        if (!string.IsNullOrEmpty(astring))
        {
            bool found = false;
            StringBuilder buff = new StringBuilder();

            foreach (char chr in astring.Trim())
            {
                if (char.IsWhiteSpace(chr))
                {
                    if (found)
                    {
                        continue;
                    }

                    found = true;
                    buff.Append(' ');
                }
                else
                {
                    if (found)
                    {
                        found = false;
                    }

                    buff.Append(chr);
                }
            }

            return buff.ToString();
        }

        return string.Empty;
    }
Hansom answered 21/9, 2017 at 19:40 Comment(1)
Granted, you can use the same logic within a "for" loop instead, for a slight performance increase (because C# won't need to instantiate an enumerator which the "foreach" loop requires)Hansom
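
The `for` loop variant mentioned in the comment could look like the sketch below. The class and method names are made up for illustration, and no measurement is claimed here: the C# compiler typically lowers `foreach` over a string to an indexed loop anyway, so any difference may be negligible.

```csharp
using System.Text;

public static class WhitespaceCompactor
{
    public static string CompactWhitespaceFor(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        string trimmed = text.Trim();
        var buff = new StringBuilder(trimmed.Length);
        bool found = false; // true while inside a run of whitespace

        for (int i = 0; i < trimmed.Length; i++)
        {
            char chr = trimmed[i];
            if (char.IsWhiteSpace(chr))
            {
                if (found)
                    continue; // already emitted a space for this run

                found = true;
                buff.Append(' ');
            }
            else
            {
                found = false;
                buff.Append(chr);
            }
        }

        return buff.ToString();
    }
}
```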
R
1

I'm not very familiar with C#, hence my code is not an elegant/most efficient one. I came here to find an answer that fits my use case, but I couldn't find one (or I couldn't figure out one).

For my use case, I needed to normalize all the White Spaces (WS:{space, tab, cr lf}) with the following conditions:

  • WS can come in any combination
  • Replace a sequence of WS with the most significant WS
  • Tabs need to be retained in some cases (in a tab-separated file, for example, where repeated tabs also need to be preserved). But in most cases they have to be converted into spaces.

So here's a sample input and an expected output (Disclaimer: my code is tested only on this example)



        Every night    in my            dreams  I see you, I feel you
    That's how    I know you go on

Far across the  distance and            places between us   



You            have                 come                    to show you go on


to be converted into

Every night in my dreams I see you, I feel you
That's how I know you go on
Far across the distance and places between us
You have come to show you go on

Here's my code

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main(string text)
    {
        bool preserveTabs = false;

        //[Step 1]: Clean up white spaces around the text
        text = text.Trim();
        //Console.Write("\nTrim\n======\n" + text);

        //[Step 2]: Reduce repeated spaces to single space. 
        text = Regex.Replace(text, @" +", " ");
        // Console.Write("\nNo repeated spaces\n======\n" + text);

        //[Step 3]: Handle tabs. Tabs need to be treated with care because 
        //in some files tabs have special meaning (e.g. tab-separated files)
        if(preserveTabs)
        {
            text = Regex.Replace(text, @" *\t *", "\t");
        }
        else
        {
            text = Regex.Replace(text, @"[ \t]+", " ");
        }
        //Console.Write("\nTabs preserved\n======\n" + text);

        //[Step 4]: Reduce repeated new lines (and other white spaces around them)
                  //into a single new line.
        text = Regex.Replace(text, @"([\t ]*(\n)+[\t ]*)+", "\n");
        Console.Write("\nClean New Lines\n======\n" + text);    
    }
}

See this code in action here: https://dotnetfiddle.net/eupjIU

Rambouillet answered 16/2, 2019 at 1:48 Comment(0)
P
1

What if you adapt a well-known algorithm, in this case to compare "similar" strings: case-insensitive, ignoring runs of white space, and tolerating nulls? Do not trust benchmarks: this one was put into a data-comparison-intensive task over approx. 1/4 GB of data, and the speed-up was around 100% on the whole operation (commented-out part vs. this algorithm, 5/10 min). Some of the other approaches here differed by less than around 30%. I would say that building the best algorithm requires going down to the disassembly and checking what the compiler does with it in both release and debug builds. There is also a simpler full-trim, about half the size, posted as an answer to a similar (C) question; that one is still case sensitive.

public static bool Differs(string srcA, string srcB)
{
    //return string.Join(" ", (a?.ToString()??String.Empty).ToUpperInvariant().Split(new char[0], StringSplitOptions.RemoveEmptyEntries).ToList().Select(x => x.Trim()))
    //    != string.Join(" ", (b?.ToString()??String.Empty).ToUpperInvariant().Split(new char[0], StringSplitOptions.RemoveEmptyEntries).ToList().Select(x => x.Trim()));

    if (srcA == null) { if (srcB == null) return false; else srcA = String.Empty; } // A == null + B == null same or change A to empty string
    if (srcB == null) { if (srcA == null) return false; else srcB = String.Empty; }
    int dstIdxA = srcA.Length, dstIdxB = srcB.Length; // number of chars not yet read; both strings are read from the end
    int planSpaceA = 0, planSpaceB = 0; // state automaton: 0 initial, 1 after non-WS, 2 WS seen after non-WS
    bool validA, validB; // is there a current (non-WS) char left to compare ?
    char chA = '\0', chB = '\0';

spaceLoopA:
        if (validA = (dstIdxA > 0)) {
            chA = srcA[--dstIdxA];
            switch (chA) {
                case '!': case '"': case '#': case '$': case '%': case '&': case '\'': case '(': case ')': case '*': case '+': case ',': case '-':
                case '.': case '/': case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': case ':':
                case ';': case '<': case '=': case '>': case '?': case '@': case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
                case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T':
                case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z': case '[': case '\\': case ']': case '^': case '_': case '`': // a-z are upper-cased below by clearing bit 0x20
                case '{': case '|': case '}': case '~':
                    break; // ASCII except lowercase
                case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i':
                case 'j': case 'k': case 'l': case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
                case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z':
                    chA = (Char)(chA & ~0x20);
                    break;
                case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
                case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
                case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
                case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
                case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
                    if (planSpaceA == 1) planSpaceA = 2; // cycle here to address multiple WS before non-WS part
                    goto spaceLoopA;
                default:
                    chA = Char.ToUpper(chA);
                    break;
        }}
spaceLoopB:
        if (validB = (dstIdxB > 0)) { // 2nd string / same logic
            chB = srcB[--dstIdxB];
            switch (chB) {
                case '!': case '"': case '#': case '$': case '%': case '&': case '\'': case '(': case ')': case '*': case '+': case ',': case '-':
                case '.': case '/': case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': case ':':
                case ';': case '<': case '=': case '>': case '?': case '@': case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
                case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T':
                case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z': case '[': case '\\': case ']': case '^': case '_': case '`': // a-z are upper-cased below by clearing bit 0x20
                case '{': case '|': case '}': case '~':
                    break; // ASCII except lowercase
                case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i':
                case 'j': case 'k': case 'l': case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
                case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z':
                    chB = (Char)(chB & ~0x20);
                    break;
                case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
                case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
                case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
                case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
                case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
                    if (planSpaceB == 1) planSpaceB = 2;
                    goto spaceLoopB;
                default:
                    chB = Char.ToUpper(chB);
                    break;
        }}
        if (validA) { // a non-WS char remains in A
            if (!validB) return true; // ...but none in B - they are different
            if (planSpaceA != planSpaceB) return true; // only one side had WS between this char and the previous one (0 init / 1 last non-WS / 2 last was WS)
            if (chA != chB) return true; // both have another char to compare, are they different ?
        } else { // A is done; any WS skipped at the front of either string is ignored
            return validB; // non-WS remains in B => different; both done => same except some WS around
        }
        planSpaceA = 1; planSpaceB = 1;
        goto spaceLoopA; // performs better
    }
Painkiller answered 4/5, 2020 at 12:5 Comment(0)
S
0

You could use IndexOf to first find where the whitespace sequences start, then use the Replace method to change the whitespace to "". From there, you can use the index you grabbed and place one whitespace character in that spot.
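
One possible reading of this idea, as a minimal sketch: it assumes only plain spaces need collapsing and cuts out each run with Remove rather than Replace. The names IndexOfCollapser and CollapseRuns are hypothetical. Every Remove and Insert call allocates a new string, so this is meant to illustrate the approach, not to be fast.

```csharp
using System;

public static class IndexOfCollapser
{
    // Hypothetical helper: collapses each run of two or more spaces into one space.
    public static string CollapseRuns(string text)
    {
        // Find the start of the first run of at least two spaces.
        int start = text.IndexOf("  ", StringComparison.Ordinal);
        while (start >= 0)
        {
            // Walk to the end of the run.
            int end = start;
            while (end < text.Length && text[end] == ' ')
            {
                end++;
            }

            // Cut the whole run out, then put a single space back at the same index.
            text = text.Remove(start, end - start).Insert(start, " ");

            // Continue searching after the space that was just inserted.
            start = text.IndexOf("  ", start + 1, StringComparison.Ordinal);
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(CollapseRuns("foo      bar")); // prints "foo bar"
    }
}
```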

Surfbird answered 22/6, 2011 at 15:30 Comment(3)
This will involve lots of wasted String instances.Fusillade
True, I'm not familiar with any quick ways unfortunately.Surfbird
Looking into those now, this should be useful for the program I'm writing currently.Surfbird
V
0

For those who just want to copy-paste and go on:

    private string RemoveExcessiveWhitespace(string value)
    {
        if (value == null) { return null; }

        var builder = new StringBuilder();
        var ignoreWhitespace = false;
        foreach (var c in value)
        {
            if (!ignoreWhitespace || c != ' ')
            {
                builder.Append(c);
            }
            ignoreWhitespace = c == ' ';
        }
        return builder.ToString();
    }
Vertebral answered 18/3, 2018 at 2:48 Comment(0)
R
0

My version (improved from Stian's answer). Should be very fast.

public static string TrimAllExtraWhiteSpaces(this string input)
{
    if (string.IsNullOrEmpty(input))
    {
        return input;
    }

    var current = 0;
    char[] output = new char[input.Length];
    var charArray = input.ToCharArray();

    for (var i = 0; i < charArray.Length; i++)
    {
        if (!char.IsWhiteSpace(charArray[i]))
        {
            if (current > 0 && i > 0 && char.IsWhiteSpace(charArray[i - 1]))
            {
                output[current++] = ' ';
            }
            output[current++] = charArray[i];
        }
    }

    return new string(output, 0, current);
}
Raccoon answered 31/8, 2021 at 22:41 Comment(0)
B
-1

There is no need for complex code! Here is some simple code that will remove any duplicates:

public static String RemoveCharOccurence(String s, char[] remove)
{
    String s1 = s;
    foreach(char c in remove)
    {
        s1 = RemoveCharOccurence(s1, c);
    }

    return s1;
}

public static String RemoveCharOccurence(String s, char remove)
{
    StringBuilder sb = new StringBuilder(s.Length);

    Boolean removeNextIfMatch = false;
    foreach(char c in s)
    {
        if(c == remove)
        {
            if(removeNextIfMatch)
                continue;
            else
                removeNextIfMatch = true;
        }
        else
            removeNextIfMatch = false;

        sb.Append(c);
    }

    return sb.ToString();
}
Burrstone answered 21/11, 2014 at 22:3 Comment(2)
This does not answer OP's question. He wants to replace multiple white space characters (spaces, tabs, new lines) with a single "white space" - probably just space. Your code will not handle a space followed by a tab (and another space and/or newline) as expected - it will not replace that sequence with a single space.Unicellular
Besides, this is not efficient if you need to remove multiple chars - it will require N passes. So this is definitely not the fastest way. If you only need a single char de-dupe then your 2nd method is okayUnicellular
R
-1

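It's very simple, just use the Regex.Replace() method (string.Replace() would treat "\\s+" as literal text, not as a pattern):

string words = "Hello     world!";
words = Regex.Replace(words, @"\s+", " "); // requires using System.Text.RegularExpressions;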

Output >>> "Hello world!"
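
A pattern such as \s+ is only interpreted by the Regex class. Below is a minimal sketch, assuming the pattern is reused many times and is therefore worth caching in a static field; the names WhitespaceRegexDemo, WhitespaceRun and Collapse are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceRegexDemo
{
    // Built once and reused; RegexOptions.Compiled trades start-up cost for faster matching.
    private static readonly Regex WhitespaceRun = new Regex(@"\s+", RegexOptions.Compiled);

    // Replaces every run of white space (spaces, tabs, line breaks) with one space.
    public static string Collapse(string input)
    {
        return WhitespaceRun.Replace(input, " ");
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("Hello     world!"));        // prints "Hello world!"
        Console.WriteLine(Collapse("tabs\t\tand\r\nnewlines")); // prints "tabs and newlines"
    }
}
```

Whether the compiled option pays off depends on how many strings are processed, as noted in the comments on the accepted answer.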

Reeher answered 11/7, 2016 at 12:40 Comment(0)
H
-2

Simplest way I can think of:

Text = Text.Replace("  ", " ").Replace("  ", " ");
// Replace 2 spaces with 1 space, twice (this collapses runs of up to 4 spaces)
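
Two passes only collapse runs of up to four spaces. Here is a minimal sketch that repeats the replacement until no double space remains; the names SpaceCollapser and CollapseSpaces are hypothetical, and it handles plain spaces only.

```csharp
using System;

public static class SpaceCollapser
{
    // Hypothetical helper: keeps replacing two spaces with one until none are left.
    public static string CollapseSpaces(string text)
    {
        while (text.Contains("  "))
        {
            // Each pass roughly halves the length of every run of spaces.
            text = text.Replace("  ", " ");
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(CollapseSpaces("foo      bar")); // prints "foo bar"
    }
}
```

Each pass allocates a new string, so this stays simple rather than fast.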
Hartmann answered 13/10, 2021 at 6:11 Comment(0)
