Quickly remove unnecessary whitespace from a (very large) string
Asked Answered
C

3

6

I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.

One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.

I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace:

Function MyTrim(s As String) As String
    Do While InStr(s, "  ") > 0
        s = Replace$(s, "  ", " ")
    Loop
    MyTrim = Trim$(s)
End Function

I tried this method and it was "worked", but was painfully slow:

Len In:  44930886 
Len Out: 35322469
Runtime: 247.6 seconds

Is there a faster way to remove whitespace from a "very large" string?

Colet answered 10/2, 2018 at 13:36 Comment(0)
H
6

I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.

A Regex replace has a good chance of that.

Option Explicit

Sub Test(ByVal text As String)

  Static Regex As Object
  If Regex Is Nothing Then
    Set Regex = CreateObject("VBScript.RegExp")
    Regex.Global = True
    Regex.MultiLine = True
  End If

  Regex.Pattern = " +" ' space, one or more times

  Dim result As String: result = Regex.Replace(text, " ")
  Debug.Print Len(result), Left(result, 20)
End Sub

With an input string of 45 million characters takes about a second.

Runner:

Sub Main()

  Const ForReading As Integer = 1
  Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
  Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
  Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
  Dim text As String: text = file.ReadAll()
  Set file = Nothing
  Set fso = Nothing
  Debug.Print Len(text), Left(text, 20)

  Test (text)

End Sub

Test data creator (C#):

var substring = "××\n× ××   ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);

BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32). (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)

Hoard answered 10/2, 2018 at 18:13 Comment(2)
Excellent, your Test procedure works amazingly: 44930886 bytes in 1.6 seconds. I found that using your method of opening the file with FileSystemObject [after I changed FormatUTF16 to 1; the only option that worked with my file] takes about twice as long as good ol' Open fName For Input As #1: myText= Input(LOF(1), #1): Close #1, but that's couple seconds is irrelevant either way since my question was talking about 4-hour runtimes! I appreciate it; this opens up a new door with my overall project... Thanks! (also to @florent-b!)Colet
(This was the first question I answered myself, but it turn out better than I expected... should I "flag" my own answer as "lame"? lol)Colet
C
2

This question is a lot more interesting than the answer makes it seem because there should be nothing wrong with the solution proposed by OP, as the algorithm is plenty efficient in theory.

It turns out, that the problem here is the poor implementation of VBA's inbuilt Replace function under the hood, which causes it to completely choke on large strings with many replacements.

It is easily possible to manually implement a Replace function with linear runtime that far outperforms the inbuilt function for large strings. An example of such an implementation is presented here:

'Works like the inbuilt 'Replace', but is much faster on large strings
'with many replacements
'This function is the renamed function `ReplaceFast` from here:
'https://github.com/guwidoe/VBA-StringTools
Public Function Replace(ByRef str As String, _
                        ByRef sFind As String, _
                        ByRef sReplace As String, _
               Optional ByVal lStart As Long = 1, _
               Optional ByVal lCount As Long = -1, _
               Optional ByVal lCompare As VbCompareMethod _
                                       = vbBinaryCompare) As String
    Const methodName As String = "Replace"
    If lStart < 1 Then Err.Raise 5, methodName, _
        "Argument 'lStart' = " & lStart & " < 1, invalid"
    If lCount < -1 Then Err.Raise 5, methodName, _
        "Argument 'lCount' = " & lCount & " < -1, invalid"
    lCount = lCount And &H7FFFFFFF
    
    If lCompare <> vbBinaryCompare Or Len(str) < 10000 Or lCount < 10000 Then
        Replace = VBA.Replace(str, sFind, sReplace, lStart, lCount, lCompare)
        Exit Function
    End If

    If Len(str) = 0 Or Len(sFind) = 0 Then
        Replace = Mid$(str, lStart)
        Exit Function
    End If

    Dim lenFind As Long:         lenFind = Len(sFind)
    Dim lenReplace As Long:      lenReplace = Len(sReplace)
    If lenFind = 0 Then Exit Function
    
    Static lFindPositions() As Long
    If (Not Not lFindPositions) = 0 Then ReDim lFindPositions(0 To 32767)
    Dim numFinds As Long
    Dim k As Long:         k = InStr(lStart, str, sFind, lCompare)
    
    On Error GoTo catch
    Do Until k = 0 Or lCount = numFinds
        lFindPositions(numFinds) = k
        numFinds = numFinds + 1
        k = InStr(k + lenFind, str, sFind, lCompare)
    Loop
    On Error GoTo 0
    GoTo continue
catch:
    ReDim Preserve lFindPositions(LBound(lFindPositions) To _
                                      UBound(lFindPositions) * 4)
    Resume
continue:
    Dim bufferSizeChange As Long
    bufferSizeChange = numFinds * (lenReplace - lenFind) - lStart + 1

    If Len(str) + bufferSizeChange < 0 Then Exit Function

    Replace = Space$(Len(str) + bufferSizeChange)

    Dim i As Long
    Dim j As Long:              j = 1
    Dim lastOccurrence As Long: lastOccurrence = lStart
    Dim count As Long:          count = 1

    For k = 0 To numFinds - 1
        If count > lCount Then Exit For
        i = lFindPositions(k)
        Dim diff As Long: diff = i - lastOccurrence
        If diff > 0 Then _
            Mid$(Replace, j, diff) = Mid$(str, lastOccurrence, diff)
        j = j + diff
        If lenReplace <> 0 Then
            Mid$(Replace, j, lenReplace) = sReplace
            j = j + lenReplace
        End If
        count = count + 1
        lastOccurrence = i + lenFind
    Next k
    If j <= Len(Replace) Then Mid$(Replace, j) = Mid$(str, lastOccurrence)
End Function

Just pasting this code into the project should fix the performance issues of the original code without changing anything about it at all, just by overriding the inbuilt Replace function.

In my testing, the original code should only take about 3 seconds to process a string similar to OPs example when the improved Replace function is present, about a 100 times improvement!:

Sub DemoMyTrim()
    Const LEN_INPUT_STR = 45000000
    
    Dim inputStr As String
    inputStr = String(LEN_INPUT_STR * 2 / 3, "a") & Space(LEN_INPUT_STR / 3)
    
    Dim t As Single: t = Timer()
    
    Dim outStr As String: outStr = MyTrim(inputStr)

    Debug.Print "Trimming took " & Timer() - t & " seconds."
    Debug.Print "Len Out: " & Len(outStr)
End Sub

Public Function MyTrim(ByRef s As String) As String
    MyTrim = s
    Do While InStr(MyTrim, "  ") > 0
        MyTrim = Replace(MyTrim, "  ", " ")
    Loop
End Function

While this is very interesting, it is still slower than the RegEx solution proposed by the accepted answer.

Since the accepted answer uses Regex which is not available on Mac, I want to present another alternative that is even faster than the original algorithm with the improved Replace function and still makes do with VBA inbuilt functions that are available on any platform.

This is possible with another function from the LibStringTools library:

'Replaces consecutive occurrences of 'substring' that repeat more than 'limit'
'times with exactly 'limit' consecutive occurrences
'Source:
'https://github.com/guwidoe/VBA-StringTools
Public Function LimitConsecutiveSubstringRepetition( _
                                           ByRef str As String, _
                                  Optional ByRef subStr As String = vbNewLine, _
                                  Optional ByVal limit As Long = 1, _
                                  Optional ByVal Compare As VbCompareMethod _
                                                          = vbBinaryCompare) _
                                           As String
    Const methodName As String = "LimitConsecutiveSubstringRepetition"
    Static recursionDepth As Long
    
    If limit < 0 Then Err.Raise 5, methodName, _
        "Argument 'limit' = " & limit & " < 0, invalid"

    Dim lenSubStr As Long:      lenSubStr = Len(subStr)
    Dim lenStr As Long:         lenStr = Len(str)
    
    LimitConsecutiveSubstringRepetition = str
    If lenStr = 0 Or lenSubStr = 0 Or lenStr < lenSubStr * (limit + 1) Then
        Exit Function
    End If
    
    If lenSubStr = 1 Then
        Dim alSubStr As String: alSubStr = String$(limit + 1, subStr)
    Else
        alSubStr = Space$(lenSubStr * (limit + 1))
        Mid$(alSubStr, 1) = subStr
        If limit + 1 > 1 Then Mid$(alSubStr, lenSubStr + 1) = alSubStr
    End If
    Dim lenAlSubStr As Long:    lenAlSubStr = Len(alSubStr)
    
    Dim i As Long:              i = InStr(1, str, alSubStr, Compare)
    Dim j As Long:              j = 1
    Dim lastOccurrence As Long: lastOccurrence = 1 - lenSubStr
    Dim copyChunkSize As Long
    
    If i = 0 Then Exit Function

    Do Until i = 0
        i = i + lenSubStr * limit
        lastOccurrence = lastOccurrence + lenSubStr
        copyChunkSize = i - lastOccurrence
        Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
            Mid$(str, lastOccurrence, copyChunkSize)
        j = j + copyChunkSize
        Do
            lastOccurrence = i
            i = InStr(lastOccurrence + lenSubStr, str, subStr, Compare)
        Loop Until i - lastOccurrence <> lenSubStr
        If i = 0 Then Exit Do
        If limit > 0 Then i = InStr(i, str, alSubStr, Compare)
    Loop
    copyChunkSize = lenStr - lastOccurrence - lenSubStr + 1
    Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
        Mid$(str, lastOccurrence + lenSubStr)
    If j + copyChunkSize - 1 < Len(LimitConsecutiveSubstringRepetition) Then _
        LimitConsecutiveSubstringRepetition = _
            Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)

    Do Until InStr(1, LimitConsecutiveSubstringRepetition, alSubStr, Compare) = 0
        Dim s As String: s = LimitConsecutiveSubstringRepetition
        lenStr = Len(s)
        If lenSubStr = 2 And limit = 0 _
        And StrComp(Left$(subStr, 1), Right$(subStr, 1), Compare) <> 0 Then
            i = InStr(1, s, alSubStr, Compare)
            j = 1
            lastOccurrence = 1
            Dim leftChar As String:  leftChar = Left$(subStr, 1)
            Dim rightChar As String: rightChar = Right$(subStr, 1)
            Do Until i = 0
                Dim l As Long: l = i
                Dim r As Long: r = i + 1
                Do
                    l = l - 1
                    r = r + 1
                    If l < 1 Then Exit Do
                Loop Until StrComp(Mid$(s, l, 1), leftChar, Compare) <> 0 _
                        Or StrComp(Mid$(s, r, 1), rightChar, Compare) <> 0
                copyChunkSize = l + 1 - lastOccurrence
                If copyChunkSize > 0 Then _
                    Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
                        Mid$(s, lastOccurrence, copyChunkSize)
                j = j + copyChunkSize
                lastOccurrence = r
                i = InStr(r, s, alSubStr, Compare)
            Loop
            copyChunkSize = lenStr - r + 1
            Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
                        Mid$(s, lastOccurrence)
        Else
            Dim lSubStr As String:  lSubStr = Left$(alSubStr, lenSubStr * limit)
            Dim lenlSubStr As Long: lenlSubStr = Len(lSubStr)
            Dim minL As Long:       minL = 1
            Dim maxR As Long
            i = InStr(1, s, alSubStr, Compare)
            j = 1
            lastOccurrence = 1
            Do Until i = 0
                Dim susChunk As String
                susChunk = Space$(lenAlSubStr * 2 - 2 + lenlSubStr)
                l = i
                r = i + lenAlSubStr
                maxR = InStr(r, s, alSubStr, Compare) - 1
                If maxR = -1 Then maxR = lenStr
                Dim lenLeft As Long, lenRight As Long
                Do
                    If l - lenSubStr + 1 < minL Then
                        lenLeft = l - minL
                    Else
                        lenLeft = lenSubStr - 1
                    End If
                    If r + lenSubStr - 2 > maxR Then
                        lenRight = maxR - r + 1
                    Else
                        lenRight = lenSubStr - 1
                    End If
                    If lenLeft + lenRight < lenSubStr Then Exit Do
                    Mid$(susChunk, 1, lenLeft) = Mid$(s, l - lenLeft, lenLeft)
                    If lenlSubStr > 0 Then _
                        Mid$(susChunk, lenLeft + 1, lenlSubStr) = lSubStr
                    Mid$(susChunk, lenLeft + lenlSubStr + 1, lenRight) = _
                        Mid$(s, r, lenRight)
                    susChunk = Left$(susChunk, lenLeft + lenRight + lenlSubStr)
                    Dim n As Long: n = InStr(1, susChunk, alSubStr, Compare)
                    If n = 0 Then Exit Do
                    l = l + n - lenLeft - 1
                    r = r + n + lenSubStr - 1 - lenLeft
                Loop
                copyChunkSize = l - lastOccurrence
                If copyChunkSize > 0 Then _
                    Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
                        Mid$(s, lastOccurrence, copyChunkSize)
                j = j + copyChunkSize
                If limit > 0 Then
                    Mid$(LimitConsecutiveSubstringRepetition, j, lenlSubStr) = _
                        Left(alSubStr, lenlSubStr)
                    j = j + lenlSubStr
                    Mid$(s, r - lenlSubStr, lenlSubStr) = lSubStr
                End If
                minL = maxR + 1
                lastOccurrence = r
                i = InStr(r - lenlSubStr, s, alSubStr, Compare)
            Loop
            copyChunkSize = lenStr - r + 1
            Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
                        Mid$(s, lastOccurrence)
        End If
        
        If j + copyChunkSize - 1 < Len(LimitConsecutiveSubstringRepetition) Then _
            LimitConsecutiveSubstringRepetition = _
                Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
    Loop
End Function

Using this function, the desired result can be achieved as follows:

Dim inputStr as String
'... somehow populate input string

dim outStr as String
outStr = LimitConsecutiveSubstringRepetition(inputStr, " ", 1)

Depending on the structure of the input string, this can be even faster than the RegEx method in some cases.

Cp answered 11/3 at 19:17 Comment(3)
Very instructive aspects +:) ... For multiple spaces (only blanks), it may be interesting as a side note that Application.WorksheetFunction.Trim(), which only goes up to 2 ^15 characters, is also extremely fast up to this limit (would require an additional action, as it also trims the left and right string ends).Compete
Thanks for the comment @T.M., definitely an interesting point but I feel like this was already covered in @ashleedawg's answer, hence why I didn't cover it.Cp
Of course tl;dr; greetings to ViennaCompete
C
1

In VBA, the size of a String is limited to approximately 2 Billion Characters. The "Replace-Loop" method above took 247 seconds for a 45 Million character string, which is over 4 minutes.

Theoretically, that means a 2 Billion character string would take at least 3 hours — if it even finished without crashing — so it's not exactly practical.

Excel has a built-in worksheet function Trim which is not the same as VBA's Trim function.

Worksheet function Trim removes all spaces from text except for single spaces between words.

The problem is that Trim, like all functions called with Application.WorksheetFunction, has a size limit of 32,767 characters, and this [unfortunately] applies even when calling the function from VBA with a string that's not even in a cell.

However, we can still use the function if we use it to loop through our "gigantic string" in sections, like this:

EDIT: Don't even bother with this crap (my function, below)! See the RegEx answer above.

Function bigTrim(strIn As String) As String

    Const maxLen = 32766
    Dim loops As Long, x As Long
    loops = Int(Len(strIn) / maxLen)
    If (Len(strIn) / maxLen) <> loops Then loops = loops + 1

    For x = 1 To loops
        bigTrim = bigTrim & _
            Application.WorksheetFunction.Trim(Mid(strIn, _
            ((x - 1) * maxLen) + 1, maxLen))
    Next x

End Function

Running this function on the same string that was used with the "Replace-Loop" method yielded much better results:

Len In:  44930886 
Len Out: 35321845
Runtime: 33.6 seconds

That's more than 7x faster than the "Replace-Loop" method, and managed to remove 624 spaces that were somehow missed by the other method.

(I though about looking into why the first method missed characters, but since I know my string isn't missing anything, and the point of this exercise was to save time, that would be silly!)

Colet answered 10/2, 2018 at 13:36 Comment(4)
There's 2 issues with your solution. It will not work if a serie of white-space is separated by the Mid and string concatenation is an expressive operation which should no be used within a loop. You could easily trim a string of 40000000 characters with a regular expression in about a second: re.Pattern = "\s+", Trim(re.Replace(strIn, " ")).Roughneck
@FlorentB. That sounds like it has the makings of an answer.Submergible
OMG *I didn't even think of RegEx (probably partly since I haven't had much luck figuring it out in the past!) I will try that now and report back...!Colet
well this has gone on my favs list.Submergible

© 2022 - 2024 — McMap. All rights reserved.