Detect most likely words from text without spaces / combined words

How could I detect and split words from a combined string?

Example:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]
Alena asked 1/2, 2010 at 0:55 Comment(5)
I don't have any experience in this area, but you might want to start with a look at the Natural Language Toolkit (nltk.org). – Tailback
And "filesaveas" could be split as "files ave as", not just "file save as". It would be tough to split on word possibilities alone unless you have a specialized vocabulary... – Sallyanne
And ["c","dim","age"], ["cd", "i", "mage"], etc... it's not easy to get this right without knowing the context. It gets even harder to pick the right option when there are lots of domain specific terms and abbreviations which are rare in normal text but common in the typical expected inputs. You'd probably need to train the algorithm.Joelie
#8870761 – Refinery
Does this answer your question? Need help understanding this Python Viterbi algorithm – Riedel

Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Return the best (frequency product, word list) split of `text`,
    # memoizing results so each suffix is solved only once.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    # Try every prefix as the first word and recurse on the remainder.
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word)
        if freq:
            remainder_freq, remainder_split = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])
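
On Python 3 you could also let functools.lru_cache handle the memoization instead of threading an explicit cache dict through the calls. A minimal sketch of the same recursion (split_text_cached is a made-up name here, and it reads the module-level WORD_FREQUENCIES so that its only argument stays hashable):

from functools import lru_cache

@lru_cache(maxsize=None)
def split_text_cached(text):
    # Same dynamic program; lru_cache memoizes each suffix.
    # Tuples (not lists) are returned so cached values stay immutable.
    if not text:
        return 1, ()
    best_freq, best_split = 0, ()
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = WORD_FREQUENCIES.get(word)
        if freq:
            remainder_freq, remainder_split = split_text_cached(remainder)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq, best_split = freq, (word,) + remainder_split
    return best_freq, best_split

print(split_text_cached('filesaveas'))

--> (1.3653e-08, ('file', 'save', 'as'))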
Dominique answered 1/2, 2010 at 10:45 Comment(0)

I don't know of any library for it, but it shouldn't be hard to implement basic functionality; a rough Python sketch follows the worked example below.

  1. Get a word list, such as UNIX's words file (/usr/share/dict/words).
  2. Stuff the contents of your word list into a trie.
  3. Take the string you want to split and follow its path in the trie. Each time you reach a valid word, create a new branch that searches for a word starting from the point of the string you got to. Once you finish your current branch, backtrack to the one you created, as in a depth-first search.
  4. Disambiguate the resulting lists manually, using heuristics or through a natural language parser.

Example:

  1. Word: "filesaveasstring"
  2. First valid word is "file". Try matching "saveas". First valid word is "save". Try matching "asstring". First valid word is "as". Try matching "string". First valid word is "string". Matched until end; put the [file save as string] into your results list.
  3. Backtrack to matching "string" - no other possibilities. Backtrack to "asstring". First unvisited valid word is "ass". Try matching "tring". No possible matches. Backtrack to "asstring". No possible matches. Backtrack to "filesaveasstring".
  4. First unvisited match is "files". Try to match "aveasstring". First match is "ave". Try matching "asstring" (same results as steps 2/3), adding [files ave as string] to your results list and backtrack to the start.
  5. Try matching "filesaveasstring". No unvisited matches. Done.
  6. Select the most likely from [[file save as string] [files ave as string]] using a heuristic or a natural language parser.
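
As a rough illustration of steps 1-3 above, here is a minimal Python sketch using nested dicts as the trie (the tiny word list is made up for the example; a real implementation would load something like /usr/share/dict/words):

def build_trie(words):
    # Nested dicts; the '$' key marks the end of a valid word.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def segment(text, trie):
    # Depth-first search: each time a word boundary is reached in the
    # trie, branch into segmenting the remainder of the string.
    results = []

    def dfs(start, path):
        if start == len(text):
            results.append(path)
            return
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                return          # no dictionary word continues this prefix
            if '$' in node:     # a valid word ends here: branch
                dfs(i + 1, path + [text[start:i + 1]])

    dfs(0, [])
    return results

trie = build_trie(['file', 'files', 'save', 'ave', 'as', 'string'])
print(segment('filesaveasstring', trie))

--> [['file', 'save', 'as', 'string'], ['files', 'ave', 'as', 'string']]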
Inbreeding answered 1/2, 2010 at 1:17 Comment(1)
What's the runtime of this? I somehow am not able to generalize this problem enough to find its accurate Big-O complexity. – Voluntarism

I don't know a library that does this, but it's not too hard to write if you have a list of words:

# Load a word list and lowercase it for set-membership lookups.
with open('words.txt') as wordFile:
    words = set(s.lower() for s in wordFile.read().split())

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        # Consumed the whole string: record this segmentation.
        if not stringLeft:
            found.append(wordsSoFar)
        # Try every prefix that is a known word, then recurse on the rest.
        for pos in range(1, len(stringLeft) + 1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

This will return all possible ways to split the string into the given words.

Example:

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]
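
To choose among the candidates automatically, one option is to score each split by its word frequencies, as in the dynamic programming answer above. A minimal sketch, assuming a frequencies dict of relative word frequencies (the name and the sample values here are made up):

import math

# Made-up relative frequencies; a real table would come from corpus counts.
frequencies = {'file': 0.00123, 'files': 0.00124, 'save': 0.002,
               'ave': 0.00001, 'as': 0.00555}

def most_likely_split(candidates):
    # Sum log-frequencies rather than multiplying to avoid underflow;
    # unseen words get a tiny floor probability.
    def score(split):
        return sum(math.log(frequencies.get(w, 1e-12)) for w in split)
    return max(candidates, key=score)

>>> most_likely_split(splitString('filesaveas'))
['file', 'save', 'as']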
Integrator answered 1/2, 2010 at 1:17 Comment(0)

You can look at this example, although it's written in Scala. It can split any sentence that contains no spaces between the words:

Nonspaced-Sentence-Tokenizer

Primaveria answered 16/8, 2020 at 11:5 Comment(0)

I know this question is marked for Python, but I needed a JavaScript implementation. Going off of the previous answers, I figured I'd share my code. It seems to work decently.

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Try longer prefixes first, so as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //True if we exited via break, i.e. the whole input was matched
        words:validWords
    };
}

Note: "_dictionary" is expected to be an array of words sorted by frequency. I am using a wordlist from Project Gutenberg.

Sorrows answered 28/8, 2017 at 0:13 Comment(0)
