This question is a lot more interesting than the answer makes it seem because there should be nothing wrong with the solution proposed by OP, as the algorithm is plenty efficient in theory.
It turns out, that the problem here is the poor implementation of VBA's inbuilt Replace
function under the hood, which causes it to completely choke on large strings with many replacements.
It is easily possible to manually implement a Replace
function with linear runtime that far outperforms the inbuilt function for large strings. An example of such an implementation is presented here:
'Works like the inbuilt 'Replace', but is much faster on large strings
'with many replacements
'This function is the renamed function `ReplaceFast` from here:
'https://github.com/guwidoe/VBA-StringTools
Public Function Replace(ByRef str As String, _
ByRef sFind As String, _
ByRef sReplace As String, _
Optional ByVal lStart As Long = 1, _
Optional ByVal lCount As Long = -1, _
Optional ByVal lCompare As VbCompareMethod _
= vbBinaryCompare) As String
Const methodName As String = "Replace"
If lStart < 1 Then Err.Raise 5, methodName, _
"Argument 'lStart' = " & lStart & " < 1, invalid"
If lCount < -1 Then Err.Raise 5, methodName, _
"Argument 'lCount' = " & lCount & " < -1, invalid"
lCount = lCount And &H7FFFFFFF
If lCompare <> vbBinaryCompare Or Len(str) < 10000 Or lCount < 10000 Then
Replace = VBA.Replace(str, sFind, sReplace, lStart, lCount, lCompare)
Exit Function
End If
If Len(str) = 0 Or Len(sFind) = 0 Then
Replace = Mid$(str, lStart)
Exit Function
End If
Dim lenFind As Long: lenFind = Len(sFind)
Dim lenReplace As Long: lenReplace = Len(sReplace)
If lenFind = 0 Then Exit Function
Static lFindPositions() As Long
If (Not Not lFindPositions) = 0 Then ReDim lFindPositions(0 To 32767)
Dim numFinds As Long
Dim k As Long: k = InStr(lStart, str, sFind, lCompare)
On Error GoTo catch
Do Until k = 0 Or lCount = numFinds
lFindPositions(numFinds) = k
numFinds = numFinds + 1
k = InStr(k + lenFind, str, sFind, lCompare)
Loop
On Error GoTo 0
GoTo continue
catch:
ReDim Preserve lFindPositions(LBound(lFindPositions) To _
UBound(lFindPositions) * 4)
Resume
continue:
Dim bufferSizeChange As Long
bufferSizeChange = numFinds * (lenReplace - lenFind) - lStart + 1
If Len(str) + bufferSizeChange < 0 Then Exit Function
Replace = Space$(Len(str) + bufferSizeChange)
Dim i As Long
Dim j As Long: j = 1
Dim lastOccurrence As Long: lastOccurrence = lStart
Dim count As Long: count = 1
For k = 0 To numFinds - 1
If count > lCount Then Exit For
i = lFindPositions(k)
Dim diff As Long: diff = i - lastOccurrence
If diff > 0 Then _
Mid$(Replace, j, diff) = Mid$(str, lastOccurrence, diff)
j = j + diff
If lenReplace <> 0 Then
Mid$(Replace, j, lenReplace) = sReplace
j = j + lenReplace
End If
count = count + 1
lastOccurrence = i + lenFind
Next k
If j <= Len(Replace) Then Mid$(Replace, j) = Mid$(str, lastOccurrence)
End Function
Just pasting this code into the project should fix the performance issues of the original code without changing anything about it at all, just by overriding the inbuilt Replace
function.
In my testing, the original code should only take about 3 seconds to process a string similar to OPs example when the improved Replace
function is present, about a 100 times improvement!:
Sub DemoMyTrim()
Const LEN_INPUT_STR = 45000000
Dim inputStr As String
inputStr = String(LEN_INPUT_STR * 2 / 3, "a") & Space(LEN_INPUT_STR / 3)
Dim t As Single: t = Timer()
Dim outStr As String: outStr = MyTrim(inputStr)
Debug.Print "Trimming took " & Timer() - t & " seconds."
Debug.Print "Len Out: " & Len(outStr)
End Sub
Public Function MyTrim(ByRef s As String) As String
MyTrim = s
Do While InStr(MyTrim, " ") > 0
MyTrim = Replace(MyTrim, " ", " ")
Loop
End Function
While this is very interesting, it is still slower than the RegEx solution proposed by the accepted answer.
Since the accepted answer uses Regex which is not available on Mac, I want to present another alternative that is even faster than the original algorithm with the improved Replace
function and still makes do with VBA inbuilt functions that are available on any platform.
This is possible with another function from the LibStringTools library:
'Replaces consecutive occurrences of 'substring' that repeat more than 'limit'
'times with exactly 'limit' consecutive occurrences
'Source:
'https://github.com/guwidoe/VBA-StringTools
Public Function LimitConsecutiveSubstringRepetition( _
ByRef str As String, _
Optional ByRef subStr As String = vbNewLine, _
Optional ByVal limit As Long = 1, _
Optional ByVal Compare As VbCompareMethod _
= vbBinaryCompare) _
As String
Const methodName As String = "LimitConsecutiveSubstringRepetition"
Static recursionDepth As Long
If limit < 0 Then Err.Raise 5, methodName, _
"Argument 'limit' = " & limit & " < 0, invalid"
Dim lenSubStr As Long: lenSubStr = Len(subStr)
Dim lenStr As Long: lenStr = Len(str)
LimitConsecutiveSubstringRepetition = str
If lenStr = 0 Or lenSubStr = 0 Or lenStr < lenSubStr * (limit + 1) Then
Exit Function
End If
If lenSubStr = 1 Then
Dim alSubStr As String: alSubStr = String$(limit + 1, subStr)
Else
alSubStr = Space$(lenSubStr * (limit + 1))
Mid$(alSubStr, 1) = subStr
If limit + 1 > 1 Then Mid$(alSubStr, lenSubStr + 1) = alSubStr
End If
Dim lenAlSubStr As Long: lenAlSubStr = Len(alSubStr)
Dim i As Long: i = InStr(1, str, alSubStr, Compare)
Dim j As Long: j = 1
Dim lastOccurrence As Long: lastOccurrence = 1 - lenSubStr
Dim copyChunkSize As Long
If i = 0 Then Exit Function
Do Until i = 0
i = i + lenSubStr * limit
lastOccurrence = lastOccurrence + lenSubStr
copyChunkSize = i - lastOccurrence
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(str, lastOccurrence, copyChunkSize)
j = j + copyChunkSize
Do
lastOccurrence = i
i = InStr(lastOccurrence + lenSubStr, str, subStr, Compare)
Loop Until i - lastOccurrence <> lenSubStr
If i = 0 Then Exit Do
If limit > 0 Then i = InStr(i, str, alSubStr, Compare)
Loop
copyChunkSize = lenStr - lastOccurrence - lenSubStr + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(str, lastOccurrence + lenSubStr)
If j + copyChunkSize - 1 < Len(LimitConsecutiveSubstringRepetition) Then _
LimitConsecutiveSubstringRepetition = _
Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
Do Until InStr(1, LimitConsecutiveSubstringRepetition, alSubStr, Compare) = 0
Dim s As String: s = LimitConsecutiveSubstringRepetition
lenStr = Len(s)
If lenSubStr = 2 And limit = 0 _
And StrComp(Left$(subStr, 1), Right$(subStr, 1), Compare) <> 0 Then
i = InStr(1, s, alSubStr, Compare)
j = 1
lastOccurrence = 1
Dim leftChar As String: leftChar = Left$(subStr, 1)
Dim rightChar As String: rightChar = Right$(subStr, 1)
Do Until i = 0
Dim l As Long: l = i
Dim r As Long: r = i + 1
Do
l = l - 1
r = r + 1
If l < 1 Then Exit Do
Loop Until StrComp(Mid$(s, l, 1), leftChar, Compare) <> 0 _
Or StrComp(Mid$(s, r, 1), rightChar, Compare) <> 0
copyChunkSize = l + 1 - lastOccurrence
If copyChunkSize > 0 Then _
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence, copyChunkSize)
j = j + copyChunkSize
lastOccurrence = r
i = InStr(r, s, alSubStr, Compare)
Loop
copyChunkSize = lenStr - r + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence)
Else
Dim lSubStr As String: lSubStr = Left$(alSubStr, lenSubStr * limit)
Dim lenlSubStr As Long: lenlSubStr = Len(lSubStr)
Dim minL As Long: minL = 1
Dim maxR As Long
i = InStr(1, s, alSubStr, Compare)
j = 1
lastOccurrence = 1
Do Until i = 0
Dim susChunk As String
susChunk = Space$(lenAlSubStr * 2 - 2 + lenlSubStr)
l = i
r = i + lenAlSubStr
maxR = InStr(r, s, alSubStr, Compare) - 1
If maxR = -1 Then maxR = lenStr
Dim lenLeft As Long, lenRight As Long
Do
If l - lenSubStr + 1 < minL Then
lenLeft = l - minL
Else
lenLeft = lenSubStr - 1
End If
If r + lenSubStr - 2 > maxR Then
lenRight = maxR - r + 1
Else
lenRight = lenSubStr - 1
End If
If lenLeft + lenRight < lenSubStr Then Exit Do
Mid$(susChunk, 1, lenLeft) = Mid$(s, l - lenLeft, lenLeft)
If lenlSubStr > 0 Then _
Mid$(susChunk, lenLeft + 1, lenlSubStr) = lSubStr
Mid$(susChunk, lenLeft + lenlSubStr + 1, lenRight) = _
Mid$(s, r, lenRight)
susChunk = Left$(susChunk, lenLeft + lenRight + lenlSubStr)
Dim n As Long: n = InStr(1, susChunk, alSubStr, Compare)
If n = 0 Then Exit Do
l = l + n - lenLeft - 1
r = r + n + lenSubStr - 1 - lenLeft
Loop
copyChunkSize = l - lastOccurrence
If copyChunkSize > 0 Then _
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence, copyChunkSize)
j = j + copyChunkSize
If limit > 0 Then
Mid$(LimitConsecutiveSubstringRepetition, j, lenlSubStr) = _
Left(alSubStr, lenlSubStr)
j = j + lenlSubStr
Mid$(s, r - lenlSubStr, lenlSubStr) = lSubStr
End If
minL = maxR + 1
lastOccurrence = r
i = InStr(r - lenlSubStr, s, alSubStr, Compare)
Loop
copyChunkSize = lenStr - r + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(s, lastOccurrence)
End If
If j + copyChunkSize - 1 < Len(LimitConsecutiveSubstringRepetition) Then _
LimitConsecutiveSubstringRepetition = _
Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
Loop
End Function
Using this function, the desired result can be achieved as follows:
Dim inputStr as String
'... somehow populate input string
dim outStr as String
outStr = LimitConsecutiveSubstringRepetition(inputStr, " ", 1)
Depending on the structure of the input string, this can be even faster than the RegEx method in some cases.
Test
procedure works amazingly:44930886 bytes in 1.6 seconds
. I found that using your method of opening the file with FileSystemObject [after I changedFormatUTF16
to1
; the only option that worked with my file] takes about twice as long as good ol'Open fName For Input As #1: myText= Input(LOF(1), #1): Close #1
, but that's couple seconds is irrelevant either way since my question was talking about 4-hour runtimes! I appreciate it; this opens up a new door with my overall project... Thanks! (also to @florent-b!) – Colet