Is there any way I can speed up this VBA algorithm?
Asked Answered
N

3

11

I am looking to implement a VBA trie-building algorithm that is able to process a substantial English lexicon (~50,000 words) in a relatively short amount of time (less than 15-20 seconds). Since I am a C++ programmer by practice (and this is my first time doing any substantial VBA work), I built a quick proof-of-concept program that was able to complete the task on my computer in about half a second. When it came time to test the VBA port however, it took almost two minutes to do the same -- an unacceptably long amount of time for my purposes. The VBA code is below:

Node Class Module:

Public letter As String
Public next_nodes As New Collection
Public is_word As Boolean

Main Module:

Dim tree As Node

Sub build_trie()
    Set tree = New Node
    Dim file, a, b, c As Integer
    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file
    Do While Not EOF(file)
        Dim line As String
        Line Input #file, line
        wordlist.add line
    Loop
    For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wordlist.Item(a))
            Dim match As Boolean
            match = False
            Dim char As String
            char = Mid(wordlist.Item(a), b, 1)
            For c = 1 To current.next_nodes.Count
                If char = current.next_nodes.Item(c).letter Then
                    Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next c
            If Not match Then
                Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next a
End Sub

My question then is simply, can this algorithm be sped up? I saw from some sources that VBA Collections are not as efficient as Dictionarys and so I attempted a Dictionary-based implementation instead but it took an equal amount of time to complete with even worse memory usage (500+ MB of RAM used by Excel on my computer). As I say I am extremely new to VBA so my knowledge of both its syntax as well as its overall features/limitations is very limited -- which is why I don't believe that this algorithm is as efficient as it could possibly be; any tips/suggestions would be greatly appreciated.

Thanks in advance

NB: The lexicon file referred to by the code, "corncob_caps.txt", is available here (download the "all CAPS" file)

Nyeman answered 7/10, 2011 at 21:9 Comment(3)
Project specifications require VBA. Trust me, if I could use my C++ implementation instead, I would.Nyeman
If you can tell me a little more about what you're actually DOING with the code I bet I can help. The node info, you're building an XML file from a txt file? Or what? I'm just kind of confused as to what the code is doing so I can't really break it down in my head to make it more efficient.Alleviative
@chris and @Boann both mentioned what should be the most important thing, which is to iterate collections with For...Each and not by index. Collections have the performance characteristics of lists: #4828463. I'm surprised that moving Dims around had any impact.Live
H
18

There are a number of small issues and a few larger opportunities here. You did say this is your first vba work, so forgive me if I'm telling you things you already know

Small things first:
Dim file, a, b, c As Integer declares file, a and b as variants. Integer is 16 bit sign, so there may be risk of overflows, use Long instead.

DIM'ing inside loops is counter-productive: unlike C++ they are not loop scoped.

The real opportunity is:

Use For Each where you can to iterate collections: its faster than indexing.

On my hardware your original code ran in about 160s. This code in about 2.5s (both plus time to load word file into the collection, about 4s)

Sub build_trie()
    Dim t1 As Long
    Dim wd As Variant
    Dim nd As Node

    Set tree = New Node
    ' Dim file, a, b, c As Integer  : declares file, a, b as variant
    Dim file As Integer, a As Long, b As Long, c As Long     ' Integer is 16 bit signed

    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file

    ' no point in doing inside loop, they are not scoped to the loop
    Dim line As String
    Dim match As Boolean
    Dim char As String
    Dim new_node As Node

    Do While Not EOF(file)
        'Dim line As String
        Line Input #file, line
        wordlist.Add line
    Loop


    t1 = GetTickCount
    For Each wd In wordlist ' for each is faster
    'For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wd)
            'Dim match As Boolean
            match = False
            'Dim char As String
            char = Mid$(wd, b, 1)
            For Each nd In current.next_nodes
            'For c = 1 To current.next_nodes.Count
                If char = nd.letter Then
                'If char = current.next_nodes.Item(c).letter Then
                    Set current = nd
                    'Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next nd
            If Not match Then
                'Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.Add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next wd

    Debug.Print "Time = " & GetTickCount - t1 & " ms"
End Sub

EDIT:

loading the word list into a dynamic array will reduce load time to sub second. Be aware that Redim Preserve is expensive, so do it in chunks

    Dim i As Long, sz As Long
    sz = 10000
    Dim wordlist() As String
    ReDim wordlist(0 To sz)

    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file

    i = 0
    Do While Not EOF(file)
        'Dim line As String
        Line Input #file, line
        wordlist(i) = line
        i = i + 1
        If i > sz Then
            sz = sz + 10000
            ReDim Preserve wordlist(0 To sz)
        End If
        'wordlist.Add line
    Loop
    ReDim Preserve wordlist(0 To i - 1)

then loop through it like

    For i = 0 To UBound(wordlist)
        wd = wordlist(i)
Hoar answered 7/10, 2011 at 22:54 Comment(2)
Amazing, this works perfectly! The funny thing is though, while the For Eachs do help, eliminating the re-Dims is actually having a much bigger impact because the algorithm was still slow after putting in just the For Eachs (which I had already done since Boann suggested that earlier) -- which, as you pointed out, I'm doing because I'm used to C++'s scoping rules. Also thanks for pointing out the other things (like my comma'd Dim declaration and Long vs Integer); again, that was the C++ side of me taking over again. ;)Nyeman
@Nyeman glad to help. I'd be interested to know what performance you ultimately getHoar
L
3

I'm out of practice with VBA, but IIRC, iterating the Collection using For Each should be a bit faster than going numerically:

Dim i As Variant
For Each i In current.next_nodes
    If i.letter = char Then
        Set current = i
        match = True
        Exit For
    End If
Next node

You're also not using the full capabilities of Collection. It's a Key-Value map, not just a resizeable array. You might get better performance if you use the letter as a key, though looking up a key that isn't present throws an error, so you have to use an ugly error workaround to check for each node. The inside of the b loop would look like:

Dim char As String
char = Mid(wordlist.Item(a), b, 1)
Dim node As Node
On Error Resume Next
Set node = Nothing
Set node = current.next_nodes.Item(char)
On Error Goto 0
If node Is Nothing Then
    Set node = New Node
    current.next_nodes.add node, char
Endif
Set current = node

You won't need the letter variable on class Node that way.

I didn't test this. I hope it's all right...

Edit: Fixed the For Each loop.


Another thing you can do which will possibly be slower but will use less memory is use an array instead of a collection, and resize with each added element. Arrays can't be public on classes, so you have to add methods to the class to deal with it:

Public letter As String
Private next_nodes() As Node
Public is_word As Boolean

Public Sub addNode(new_node As Node)
    Dim current_size As Integer
    On Error Resume Next
    current_size = UBound(next_nodes) 'ubound throws an error if the array is not yet allocated
    On Error GoTo 0
    ReDim next_nodes(0 To current_size) As Node
    Set next_nodes(current_size) = new_node
End Sub

Public Function getNode(letter As String) As Node
    Dim n As Variant
    On Error Resume Next
    For Each n In next_nodes
        If n.letter = letter Then
            Set getNode = n
            Exit Function
        End If
    Next
End Function

Edit: And a final optimization strategy, get the Integer char value with the Asc function and store that instead of a String.

Labarum answered 7/10, 2011 at 21:36 Comment(2)
Will give it a shot. As I mentioned I already tried a Dictionary which also used key/value lookup but it provided no performance benefit, so we'll see if a key/value Collection is different. ThanksNyeman
But ReDim next_nodes(0 To current_size) As Node will just clear everything in next_nodes every time you call addNode!Intercede
T
-1

You really need to profile it, but if you think Collections are slow maybe you can try using dynamic arrays?

Tintype answered 7/10, 2011 at 21:37 Comment(1)
I heard about those online but it seems that VBA does not allow arrays to be Public members of a class module. The apparent workaround, to use Variants and Property functions, doesn't seem to work with dynamic arrays because you can't ReDim a Variant. But then again, I really don't know what I'm talking about -- if I'm wrong please correct me/maybe show some sample Node code.Nyeman

© 2022 - 2024 — McMap. All rights reserved.