Erlang: What is most-wrong with this trie implementation?
Asked Answered
M

6

6

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.

At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.

This is the trie:add implementation:

add([], Trie) ->
    dict:store(stop, true, Trie);

add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val }
    %         error -> { dict:new(), Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).

And you can see the rest, along with an example of how it's used (at the bottom), here:

http://gist.github.com/263513

Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.

So, what am I doing most-wrong? And how might I make it a bit less-wrong?

Mullet answered 26/12, 2009 at 18:43 Comment(3)
Ha. I did that in PHP a few years ago.Prosopopoeia
How big is your input word list?Antecedent
My word list is 200,000 words (or 2.5 megs).Mullet
A
4

You could implement this functionality by simply storing the words in an ets table:

% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).  
> ets:insert(words, [{"zebra"}]).    

% check if word exists
> ets:lookup(words, "zed").          
[{"zed"}]

% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]

If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.

If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).

Antecedent answered 26/12, 2009 at 20:23 Comment(2)
Alright, so I've been tinkering with ets, but I've been getting issues with "bad argument". Maybe you know a simple solution? Question is here: #1965490Mullet
Alright, I've tinkered with the proplist implementation too... And it has hit an issue which causes the shell to hang. I've asked that here: #1982757 (ps: thanks for all your help - it's much appreciated).Mullet
S
4

I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.

Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.

Saied answered 26/12, 2009 at 21:22 Comment(2)
To check how much memory a dict() construct takes, call: erts_debug:flat_size(dict:new()). It returns 46 words, which is 184 bytes on a 32-bit system, or 368 bytes on a 64-bit one.Antecedent
Thanks for the suggestion... Although I've run into a strange problem where the shell hangs when I create my proplist-based-trie, which I've asked here: #1982757 - could you, by any chance, offer any insight there?Mullet
J
2

An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)

Joiejoin answered 29/12, 2009 at 2:34 Comment(0)
P
2

Look into DAWGs. They're much more compact than tries.

Planking answered 4/1, 2011 at 1:15 Comment(0)
C
1

I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts. http://www.erlang.org/doc/man/digraph.html

Clementina answered 26/12, 2009 at 18:43 Comment(0)
U
1

If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.

Unsatisfactory answered 26/12, 2009 at 23:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.