Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

Asked 3/10, 2010 at 21:43 Answered 4/10, 2010 at 0:50

I wondered how you would go about tokenizing strings in English (or other Western languages) if the whitespace were removed.

The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'

In the novel, the Sheep Man is translated as saying things like:

"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."

So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.

What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?

Specifically, Python-wise, how would you structure a (forgiving) translation flow? I'm not asking for a complete answer, just how you would go about breaking the problem down.

I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers. Thanks!

Templia answered 3/10, 2010 at 21:43 Comment(0)

I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
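A minimal sketch of the dictionary-lookup idea, assuming a tiny made-up word list in place of a real English dictionary. The names `WORDS` and `segmentations` are hypothetical, not from the original work code. It returns every possible split, which makes the ambiguity visible:

```python
from functools import lru_cache

# Toy word list standing in for a real English dictionary (an assumption).
WORDS = frozenset({"hi", "hit", "here", "there", "to", "get", "her", "together"})

def segmentations(text, words=WORDS):
    """Return every way to split `text` into dictionary words."""
    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to split an empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in words:  # set membership is O(1) on average
                for rest in split_from(end):
                    results.append([candidate] + rest)
        return results
    return split_from(0)

print(segmentations("hithere"))   # [['hi', 'there'], ['hit', 'here']]
print(segmentations("together"))  # [['to', 'get', 'her'], ['together']]
```

Picking one split out of several is the part that needs grammar or statistics; the lookup itself only tells you which splits are possible.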

Cassation answered 3/10, 2010 at 22:12 Comment(3)

You might try a radix tree for letter by letter lookups. It would also allow you to easily tell if there were longer potential matches. – Cardenas 3/10, 2010 at 22:39
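A rough sketch of that lookup, using a plain (uncompressed) trie rather than a true radix tree to keep it short. `TrieNode`, `build_trie` and `matches_from` are hypothetical names, and the word list is a toy stand-in for a real dictionary:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a letter to the next node
        self.is_word = False  # True if the path from the root spells a word

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root

def matches_from(root, text, start):
    """Return every dictionary word beginning at text[start], shortest first."""
    found = []
    node = root
    for end in range(start, len(text)):
        node = node.children.get(text[end])
        if node is None:
            break  # no dictionary word continues with this letter
        if node.is_word:
            found.append(text[start:end + 1])
    return found

trie = build_trie(["hi", "hit", "here", "there", "to", "together"])
print(matches_from(trie, "hithere", 0))   # ['hi', 'hit']
print(matches_from(trie, "together", 0))  # ['to', 'together']
```

One walk down the tree yields all candidate words at a position, and hitting a missing child tells you immediately that no longer match exists.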

Oh, wow. I had thought of something like that (a tree with 26 children on each node, was my thought), but my boss said it was a ludicrous idea. I gotta stop listening to him. >:( – Cassation 3/10, 2010 at 23:0

+1 I had actually thought about something like this a while ago. I even ran into the same problems: "hi there" == "hithere" == "hit here". "to get her" == "together" == "together". – Print 4/10, 2010 at 0:40

First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.

Once you have the words, you have two possible approaches:

You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.

On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.

I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
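To illustrate the corpus-based approach, here is a small sketch that scores candidate splits with word-pair probabilities. The counts are invented for the example (a real version would compute them from a large corpus), the smoothing is a crude add-one scheme, and it uses a memoized search instead of an explicit automaton. `UNIGRAMS`, `BIGRAMS`, `log_prob` and `best_split` are hypothetical names:

```python
import math
from functools import lru_cache

# Made-up counts standing in for statistics from a large corpus (assumption).
UNIGRAMS = {"hi": 50, "hit": 20, "here": 40, "there": 60}
BIGRAMS = {("hi", "there"): 30, ("hit", "here"): 2}
TOTAL = sum(UNIGRAMS.values())

def log_prob(word, prev):
    """Log P(word | prev), with add-one smoothing on the toy counts."""
    if prev is None:  # first word: fall back to its unigram frequency
        return math.log(UNIGRAMS[word] / TOTAL)
    pair = BIGRAMS.get((prev, word), 0)
    return math.log((pair + 1) / (UNIGRAMS[prev] + len(UNIGRAMS)))

def best_split(text):
    """Return (score, words) for the most probable split, or None if none exists."""
    @lru_cache(maxsize=None)
    def best_from(start, prev):
        if start == len(text):
            return (0.0, ())
        best = None
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word not in UNIGRAMS:
                continue
            rest = best_from(end, word)
            if rest is None:
                continue  # the remainder cannot be split into known words
            score = log_prob(word, prev) + rest[0]
            if best is None or score > best[0]:
                best = (score, (word,) + rest[1])
        return best
    return best_from(0, None)

print(best_split("hithere")[1])  # ('hi', 'there')
```

With these toy counts "hi there" wins over "hit here" because the pair is far more frequent; different corpus statistics could of course flip that.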

Unsay answered 3/10, 2010 at 22:16 Comment(0)

I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.

Print answered 4/10, 2010 at 0:41 Comment(0)

This is just some quick code I wrote that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution:

textstring = '''likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo.'''

# Placeholder word list; swap in a lookup against a real English dictionary or API
english_words = {"like", "we", "said"}

def in_english_dict(word):
    """Return True if word exists as a dictionary entry (placeholder lookup)."""
    return word.lower() in english_words

indiv_characters = list(textstring)  # splits string into individual characters

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test the accumulated characters against the dictionary
    if in_english_dict(teststring):
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# at the end, assemble a sentence by putting a space between each word
sentence = ' '.join(sequential_indiv_word_list)

There are some more issues to be worked out. If the accumulated string never matches a dictionary entry, this obviously won't work, because it just keeps adding more characters and never matches again. However, since your demo string has some spaces, you could have it recognize these too and automatically start over at each of them.

Also, you need to account for punctuation. Write conditionals like:

if cur_char in (',', '.'):
    # do action to start a new "word" automatically, e.g. reset the buffer
    teststring = ''
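One possible way to handle the spaces and punctuation up front is to cut the text into runs of letters and single separator characters before doing any dictionary matching. This is only a sketch: `split_chunks` and `tokenize` are hypothetical names, and `segment` is assumed to be whatever word splitter you end up with (the example just passes each run through unchanged).

```python
import re

LETTER_RUN = r"[A-Za-z']+"  # letters plus apostrophes, so "we'll" stays together

def split_chunks(text):
    """Split text into letter runs and single non-letter characters."""
    return re.findall(LETTER_RUN + r"|[^A-Za-z']", text)

def tokenize(text, segment):
    """Apply `segment` (an assumed word splitter) to each letter run."""
    tokens = []
    for chunk in split_chunks(text):
        if re.fullmatch(LETTER_RUN, chunk):
            tokens.extend(segment(chunk))
        elif not chunk.isspace():
            tokens.append(chunk)  # keep punctuation as its own token
    return tokens

sample = "likewesaid, we'lldowhatwecan."
print(split_chunks(sample))
# ['likewesaid', ',', ' ', "we'lldowhatwecan", '.']
print(tokenize(sample, lambda run: [run]))
# ['likewesaid', ',', "we'lldowhatwecan", '.']
```

This way each run of letters is restarted automatically at every space, comma, period or hyphen, and an unmatched run cannot swallow the rest of the sentence.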

Icecold answered 4/10, 2010 at 0:50 Comment(0)
