Word-separating algorithm

Asked 4/8, 2009 at 23:10 Answered 5/8, 2009 at 5:13

What is the algorithm, seemingly in use on domain parking pages, that takes a spaceless run of words (e.g. "thecarrotofcuriosity") and more or less correctly breaks it down into its constituent words (e.g. "the carrot of curiosity")?

Darin answered 4/8, 2009 at 23:10 Comment(2)

Any particular programming language? – Haematogenesis 4/8, 2009 at 23:12

#195510 – Lombardi 24/10, 2010 at 0:49

Start with a basic trie data structure representing your dictionary. As you iterate through the characters of the string, search your way through the trie with a set of pointers rather than a single pointer; the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and any set element that cannot be advanced by the letter is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie pointer to the set (keeping track of the list of words seen so far for that set element). Finally, once all characters have been processed, return the word list of any set element that sits at the root of the trie. If there is more than one, the string can be broken up in multiple ways (such as "therapistforum", which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]), and which one is returned is undefined. If there is none, the string cannot be split into dictionary words at all.

Or, in whacked-up pseudocode (Java-style foreach, tuples in parens, sets in braces, cons written head :: tail, [] is the empty list):

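List<String> breakUp(String str, Trie root) {
    // Each element pairs the words found so far (newest first) with a trie position.
    Set<(List<String>, Trie)> set = {([], root)};
    for (char c : str) {
        Set<(List<String>, Trie)> newSet = {};
        for (List<String> ls, Trie t : set) {
            Trie tNext = t.follow(c);
            if (tNext != null) {
                // Keep extending the current word...
                newSet.add((ls, tNext));
                if (tNext.isWord()) {
                    // ...and also start a fresh word from the root.
                    newSet.add((tNext.getWord() :: ls, root));
                }
            }
        }
        set = newSet;
    }
    for (List<String> ls, Trie t : set) {
        // Being at the root means the string ended exactly on a word boundary.
        // ls was built by consing, so reverse it to get reading order.
        if (t == root) return reverse(ls);
    }
    return null;
}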
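For anyone who wants something runnable, here is a minimal Python sketch of the same set-of-pointers idea. The names TrieNode, build_trie and break_up are made up for this illustration, and unlike the pseudocode it returns every split (in reading order) rather than an arbitrary one. Treat it as a sketch under those assumptions, not a tuned implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if the path from the root spells a word


def build_trie(words):
    """Build a trie from an iterable of dictionary words."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def break_up(text, root):
    """Return every way to split text into dictionary words."""
    # Each state: (words completed so far, trie node, start index of the word in progress).
    states = [((), root, 0)]
    for i, ch in enumerate(text):
        new_states = []
        for words, node, start in states:
            nxt = node.children.get(ch)
            if nxt is None:
                continue  # this candidate cannot consume the letter, so drop it
            new_states.append((words, nxt, start))
            if nxt.is_word:
                # A word ends here: record it and restart from the root.
                new_states.append((words + (text[start:i + 1],), root, i + 1))
        states = new_states
    # Only states sitting at the root ended exactly on a word boundary.
    return [list(words) for words, node, _ in states if node is root]
```

With a trie built from "the", "therapist", "rapist" and "forum", break_up("therapistforum", trie) yields both parses mentioned above. The number of live states can grow quickly for ambiguous input, so a real version would probably want to prune or score candidates.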

Let me know if I need to clarify or I missed something...

Oaxaca answered 5/8, 2009 at 5:13 Comment(2)

Good algorithm - and more or less what I was thinking about. Did you give it a try? – Comforter 5/8, 2009 at 7:32

you could probably do better with some dynamic programming – Clipfed 5/8, 2009 at 14:17

I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
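A rough Python sketch of the breadth-first idea follows. It makes a simplifying assumption that is not in the answer above: the words must cover the whole string, so the search either finds a complete split (the one with the fewest words) or gives up. The function name bfs_split and the max_word_len cutoff are invented for this example.

```python
from collections import deque


def bfs_split(text, dictionary, max_word_len=20):
    """Breadth-first search over string positions.

    Returns a split of text into dictionary words using the fewest
    words, or None if no complete split exists.
    """
    # parent[pos] is the position the search came from when it first reached pos.
    parent = {0: None}
    queue = deque([0])
    while queue:
        pos = queue.popleft()
        if pos == len(text):
            break  # reached the end of the string
        last_end = min(len(text), pos + max_word_len)
        for end in range(pos + 1, last_end + 1):
            if end not in parent and text[pos:end] in dictionary:
                parent[end] = pos
                queue.append(end)
    if len(text) not in parent:
        return None
    # Walk the parent links back from the end to recover the words.
    words = []
    end = len(text)
    while parent[end] is not None:
        start = parent[end]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

The dictionary argument can be any set of strings, for example the lines of /usr/share/dict/words lowercased. Because each position is visited at most once, this stays cheap even for long strings.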

Jeaninejeanlouis answered 4/8, 2009 at 23:16 Comment(0)

I'd imagine these sites do it something like this:

1. Get a list of words for your target language.
2. Remove "useless" words like "a", "the", ...
3. Run through the list and check which of the words are substrings of the domain name.
4. Take the most common words of the remaining list (or the ones with the highest AdSense rating, ...).

Of course that leads to nonsense for expertsexchange, but what else would you expect there...
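The steps above could be sketched in Python roughly as follows. The function name candidate_keywords, the word_counts mapping (word to how common it is) and the top_n cutoff are assumptions made for illustration; any frequency source would do.

```python
def candidate_keywords(domain, word_counts, stop_words, top_n=3):
    """Return up to top_n dictionary words found inside domain, most common first.

    word_counts maps each word to a commonness score (hypothetical data);
    stop_words is the set of "useless" words to ignore.
    """
    hits = [
        word for word in word_counts
        if word not in stop_words and word in domain
    ]
    # Rank by commonness; break ties alphabetically so the result is stable.
    hits.sort(key=lambda w: (-word_counts[w], w))
    return hits[:top_n]
```

Note that this only picks keywords; it does not try to recover the actual word boundaries, which is exactly why overlapping matches in a name like expertsexchange come out oddly.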

Backler answered 4/8, 2009 at 23:37 Comment(0)

(Disclaimer: I did not try this myself, so take it merely as food for experimentation. The choice of 4-grams is mostly out of the blue, based only on my experience that 3-grams won't work all too well; 5-grams and longer might work better, though you will have to deal with a pretty large table. It is also simplistic in that it does not take the end of the string into account; if it otherwise works for you, you'd probably need to think about fixing the endings.)

This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.

So, first: take a lot of human-readable texts. For each text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation; assumes that [] is hashtable-like indexing and that nonexistent indexes return 0):

for(i=0;i<=length(str)-5;i++) {
  // take 4-character substring starting at position i
  subs2 = substring(str, i, 4);
  if(has_space(subs2)) {
    // widen to 5 characters so that 4 letters remain once the space is dropped
    subs = substring(str, i, 5);
    delete_space(subs);
    // record where the space sat relative to this 4-gram
    yes_space[subs][position(space, subs2)]++;
  } else {
    subs = subs2;
    no_space[subs]++;
  }
}

This builds the tables that help decide whether a given 4-gram should have a space inserted in it or not.

Then, take your string to split, I denote it as xstr, and do:

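for(i=0;i<=length(xstr)-4;i++) {
  subs = substring(xstr, i, 4);
  // every unbroken sighting of this 4-gram votes against a space at each position
  for(j=0;j<4;j++) {
    do_insert_space_here[i+j] -= no_space[subs];
  }
  // every sighting with a space at offset j votes for a space there
  for(j=0;j<4;j++) {
    do_insert_space_here[i+j] += yes_space[subs][j];
  }
}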

Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
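As a starting point for experimenting, here is a small Python sketch of the two loops above. The names train and split are invented for this example, the loop bounds are chosen so that every full window is visited, and windows containing more than one space are simply skipped (an assumption; the pseudocode does not say what to do with them). It is untuned and has the same weakness near the ends of the string.

```python
from collections import defaultdict

N = 4  # n-gram length used in the answer


def train(text):
    """Build the no_space and yes_space tables from one training text."""
    no_space = defaultdict(int)
    yes_space = defaultdict(lambda: [0] * N)
    for i in range(len(text) - N):
        window = text[i:i + N]
        if " " in window:
            # Take one extra character so N letters remain without the space.
            gram = text[i:i + N + 1].replace(" ", "", 1)
            if len(gram) == N and " " not in gram:
                yes_space[gram][window.index(" ")] += 1
        else:
            no_space[window] += 1
    return no_space, yes_space


def split(text, no_space, yes_space):
    """Insert a space before each position whose vote total is positive."""
    score = [0] * len(text)
    for i in range(len(text) - N + 1):
        gram = text[i:i + N]
        votes_for = yes_space[gram] if gram in yes_space else [0] * N
        for j in range(N):
            score[i + j] += votes_for[j] - no_space.get(gram, 0)
    out = []
    for i, ch in enumerate(text):
        if i > 0 and score[i] > 0:
            out.append(" ")
        out.append(ch)
    return "".join(out)
```

To train on several texts, the tables from each call to train would need to be summed; with a single toy text such as "ab cd", split("abcd", ...) puts the space back between "ab" and "cd".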

Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)

Luthanen answered 5/8, 2009 at 0:23 Comment(0)
