How to split text without spaces into list of words

Asked 15/1, 2012 at 14:10 Answered 5/7, 2023 at 6:14

154

Input: "tableapplechairtablecupboard..." many words

What would be an efficient algorithm to split such text to the list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

First thing that cames to mind is to go through all possible words (starting with first letter) and find the longest word possible, continue from position=word_position+len(word)

P.S.
We have a list of all possible words.
Word "cupboard" can be "cup" and "board", select longest.
Language: python, but main thing is the algorithm itself.

Breathed answered 15/1, 2012 at 14:10 Comment(7)

@Breathed - Your "longest possible" criterion implied that it was for compound words. And in that case, what would happen if the string were "carpetrel". Would it be "carpet", or "petrel"? – Lindahl 15/1, 2012 at 14:16

Just put all your words in frozenset for fastest retrieval. words = frozenset(list_of_words) – Thirsty 15/1, 2012 at 14:17

There is many dictitonary words in your string:

['able', 'air', 'apple', 'boa', 'boar', 'board', 'chair', 'cup', 'cupboard', 'ha', 'hair', 'lea', 'leap', 'oar', 'tab', 'table', 'up']

– Mervinmerwin 15/1, 2012 at 14:24

I know there are a lot of words, but at some point you will not be able to get the next word, if you chose the wrong one before – Breathed 15/1, 2012 at 14:29

Using a parsing error to take decisions can be dangerous. What if there is really an error in your input string? Or there's a word that you don't have? – Nickles 15/1, 2012 at 15:16

Exact duplicate of Detect most likely words from text without spaces / combined words, please close. – Ferrite 17/4, 2014 at 8:31

Does this answer your question? Need help understanding this Python Viterbi algorithm – Bondsman 27/12, 2019 at 19:46

276

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.

Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.

Polygnotus answered 25/7, 2012 at 4:23 Comment(12)

Nice function! Using github.com/dwyl/english-words/blob/master/words.txt on "namethecompanywherebonniewasemployedwhenwestarteddating" results in "name the comp anywhere bonnie was employed when we started dating". Do you know why? It is because of the dictionary? – Danitadaniyal 13/8, 2016 at 7:39

It's an interesting approach, but this algorithm is approximate, not too simple, and its performance depends on rather strong assumptions (knowledge of frequencies, independence, etc.) Under what conditions would one prefer this algorithm to more traditional approaches, like for example the obvious DP implementation - precise and relatively straightforward, with O(|max length of word| * |text length|) runtime? – Stamford 17/10, 2016 at 6:6

This is excellent. I've turned it into a pip package: pypi.python.org/pypi/wordninja pip install wordninja – Hustler 20/4, 2017 at 23:0

@Danitadaniyal your words.txt contains "comp": ``` $ grep "^comp$" words.txt comp ``` and it's sorted alphabetically. this code assumes it's sorted in decreasing frequency of appearance (which is common for n-gram lists like this). if you use a properly sorted list, your string comes out fine: ``` >>> wordninja.split('namethecompanywherebonniewasemployedwhenwestarteddating') ['name', 'the', 'company', 'where', 'bonnie', 'was', 'employed', 'when', 'we', 'started', 'dating'] ``` – Hustler 29/6, 2017 at 16:18

It will mainly depend on the training corpus, a lot of ambiguity can be generated when removing spaces. hishowme can be divided into hi show me and his how me (actual output from wordninja) I would careful adapt the corpus to your needs and be careful when using for decreasing noise in a free text, since it can end up creating even more – Botany 12/11, 2019 at 9:8

As an improvement, you can also consider the probability of some words appearing next to others, namely with n-grams – Botany 12/11, 2019 at 9:9

@Botany Is this what called as Machine Learning algorithm? – Brill 23/2, 2021 at 13:3

@Generic Human this is very interesting! by the way how did you create the dictionary? are there any tools that can be used for this? thanks – Aciculate 26/2, 2021 at 22:40

Very useful indeed. Just need an extra bit of code to deal with numbers within strings, e.g. 'imd2011london50kmbufferdeprivationindexv4' to get 'imd 2011 London 50 km buffer deprivation index v 4'. I have actually written a workaround with a second custom list of words but it's so amateurish I would want to spoil the solution by posting it! – Concepcion 4/3, 2021 at 20:26

does this algorithm prefers shorter words above long words regardless that the longer words have better rank? – Immortality 16/5, 2021 at 10:11

@Generic Human Thanks a lot for the solution. I have a situation where the input string have some random letters inserted between the legit words. For example in "btmicrosoffilesecurebthomelogindropboxupdatetofficedropbox" the "t" is missing from the end of "microsof" while the "t" after update is an addition. Can you think of a way to skip random words and get the most likely to be words out of the string. Please help. – Similar 15/6, 2022 at 4:44

Hi. Your quick-and-dirty list of words is now a broken link! – Malamud 12/4 at 15:12

Based on the excellent work in the top answer, I've created a pip package for easy use.

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

To install, run pip install wordninja.

The only differences are minor. This returns a list rather than a str, it works in python3, it includes the word list and properly splits even if there are non-alpha chars (like underscores, dashes, etc).

Thanks again to Generic Human!

https://github.com/keredson/wordninja

Hustler answered 20/4, 2017 at 23:6 Comment(6)

@Hustler - First of all, thanks for the solution. It does behave well. However, it strips the special characters like "-" etc. It sometimes does not give the proper split like take a long string say - "WeatheringPropertiesbyMaterial Trade Name Graph 2-1. Color Change, E, after Arizona, Florida, Cycolac®/Geloy® Resin Systems Compared to PVC.[15] 25 20 15 ∆E 10 5 0 PVC, White PVC, Brown C/G, BrownC/G. Capstock is the material used as the surface layer applied to the exterior surface of a profile extrusion. Geloy® resin capstock over a Cycolac® substrate provides outstanding weatherability.[25]" – Champ 11/2, 2019 at 14:54

can you open an issue in GH? – Hustler 23/2, 2019 at 19:23

can i ignore local names .. is there any way to provide a list of words to ignore? – Qualm 26/2, 2021 at 11:12

good code, but it seems to ignore accent characters lm.split('knihyokoních') = ['knihy', 'ok', 'on', 'ch'] í is lost. another thing is that if i put some my words on top of list they seems to be ignored while splitting this phrase. strange – Immortality 16/5, 2021 at 10:6

what is the purpose of the regex splitter? i suggest to change it to this _SPLIT_RE = re.compile("[^'\w'']+") to support accents eg 'knihyokoních' or to allow it also as some custom parameter. because the current version splits also by accent chars, and lost them. – Immortality 16/5, 2021 at 19:50

@Hustler Thanks a lot for the solution. I have a situation where the input string have some random letters inserted between the legit words. For example in "btmicrosoffilesecurebthomelogindropboxupdatetofficedropbox" the "t" is missing from the end of "microsof" while the "t" after update is an addition. Can you think of a way to skip random words and get the most likely to be words out of the string. Please help. – Similar 15/6, 2022 at 4:40

Here is solution using recursive search:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

yields

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

Suter answered 15/1, 2012 at 14:49 Comment(1)

works "out of the box", thank you! I think also to use trie structure as miku said, not just set of all words. Thanks anyway! – Breathed 15/1, 2012 at 16:19

Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following:

Advance pointer (in the concatenated string)
Lookup and store the corresponding node in the trie
If the trie node has children (e.g. there are longer words), go to 1.
If the node reached has no children, a longest word match happened; add the word (stored in the node or just concatenated during trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over

Mistassini answered 15/1, 2012 at 14:25 Comment(5)

If the target is to consume the entire string, you would need to backtrack, "tableprechaun" then has to be split after "tab". – Heliochrome 15/1, 2012 at 14:47

Plus for mentioning trie, but I also agree with Daniel, that backtracking needs to be done. – Breathed 15/1, 2012 at 16:24

@Daniel , longest-match search doesn't need backtracking, no. What makes you think that? And what's wrong with the algorithm above? – Leek 15/1, 2012 at 23:36

@Devin The fact that for "tableprechaun" the longest match from the start is "table", leaving "prechaun", which can't be split into dictionary words. So you have to take the shorter match "tab" leaving you with a "leprechaun". – Heliochrome 15/1, 2012 at 23:40

@Daniel , Sorry, yes. I misunderstood the problem. The corrected algorithm should keep track of all possible tree positions at once -- AKA linear-time NFA search. Or else backtrack, sure, but that's worst-case exponential time. – Leek 16/1, 2012 at 10:0

The answer by Generic Human is great. But the best implementation of this I've ever seen was written Peter Norvig himself in his book 'Beautiful Data'.

Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).

The data is a bit better - both in terms of size and in terms of precision (he uses a word count rather than a simple ranking)
More importantly, it's the logic behind n-grams that really makes the approach so accurate.

The example he provides in his book is the problem of splitting a string 'sitdown'. Now a non-bigram method of string split would consider p('sit') * p ('down'), and if this less than the p('sitdown') - which will be the case quite often - it will NOT split it, but we'd want it to (most of the time).

However when you have the bigram model you could value p('sit down') as a bigram vs p('sitdown') and the former wins. Basically, if you don't use bigrams, it treats the probability of the words you're splitting as independent, which is not the case, some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together in a lot of instances and confuses the splitter.

Here's the link to the data (it's data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/

and here's the link to the code: http://norvig.com/ngrams/ngrams.py

These links have been up a while, but I'll copy paste the segmentation part of the code here anyway

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo 
def segment2(text, prev='<S>'): 
    "Return (log P(words), words), where words is the best segmentation." 
    if not text: return 0.0, [] 
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first)) 
                  for first,rem in splits(text)] 
    return max(candidates) 

def combine(Pfirst, first, (Prem, rem)): 
    "Combine first and rem results into one (probability, words) pair." 
    return Pfirst+Prem, [first]+rem

Serapis answered 27/5, 2016 at 15:16 Comment(2)

This works well, but when I try to apply this on my entire dataset, it keeps saying RuntimeError: maximum recursion depth exceeded in cmp – Autophyte 9/7, 2018 at 5:45

ngrams definitely will give you an accuracy boost w/ an exponentially larger frequency dict, memory and computation usage. btw the memo function is leaking memory like a sieve there. should clear it between calls. – Hustler 26/7, 2018 at 18:22

Unutbu's solution was quite close but I find the code difficult to read, and it didn't yield the expected result. Generic Human's solution has the drawback that it needs word frequencies. Not appropriate for all use case.

Here's a simple solution using a Divide and Conquer algorithm.

It tries to minimize the number of words E.g. find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionnary)
The optimal solution is not unique, the implementation below returns a solution. find_words('charactersin') could return ['characters', 'in'] or maybe it will return ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions.
In this implementation solutions are memoized so that it runs in a reasonable time.

The code:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionnary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

This will take about about 5sec on my 3GHz machine:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

the reis masses of text information of peoples comments which is parsed from h t m l but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple e t c in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction t h x a lot

Cleaner answered 23/1, 2014 at 12:33 Comment(1)

There's no reason to believe a text cannot end in a single-letter word. You should consider one split more. – Peltate 25/2, 2016 at 20:8

Here is the accepted answer translated to JavaScript (requires node.js, and the file "wordninja_words.txt" from https://github.com/keredson/wordninja):

var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    })
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen)
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});


function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        })
    })
    return list;
}
module.exports = split;


function _split(s) {
    var cost = [0];

    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                var ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        })
        return minPair;
    }

    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c == cost[i])
            console.log("Alert: " + c);

        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                if (out[-1] == "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                    out[-1] = s.slice(i - k, i) + out[-1];
                    newToken = false;
                }
            }
        }

        if (newToken) {
            out.push(s.slice(i - k, i))
        }

        i -= k

    }
    return out.reverse();
}

Addle answered 7/11, 2018 at 4:50 Comment(0)

This will help

from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')

Heartbreaking answered 28/7, 2020 at 15:40 Comment(0)

If you precompile the wordlist into a DFA (which will be very slow), then the time it takes to match an input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).

This is effectively a more general version of the trie algorithm which was mentioned earlier. I only mention it for completeless -- as of yet, there's no DFA implementation you can just use. RE2 would work, but I don't know if the Python bindings let you tune how large you allow a DFA to be before it just throws away the compiled DFA data and does NFA search.

Leek answered 15/1, 2012 at 14:33 Comment(1)

especially plus for re2, didn't use it before – Breathed 15/1, 2012 at 16:18

Many Thanks for the help in https://github.com/keredson/wordninja/

A small contribution of the same in Java from my side.

The public method splitContiguousWords could be embedded with the other 2 methods in the class having ninja_words.txt in same directory(or modified as per the choice of coder). And the method splitContiguousWords could be used for the purpose.

public List<String> splitContiguousWords(String sentence) {

    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        if (token != "\'" && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("\'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}


private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                      Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}

Verified answered 23/11, 2019 at 15:28 Comment(2)

what if we don't have list of words? – Soundproof 11/3, 2020 at 19:35

If I have understood correctly the query: Hence in the above approach, the public method accepts a sentence of type String which is split based on a first level with regex. And for the list of ninja_words it is available for download from the git repo. – Verified 9/4, 2020 at 4:42

It seems like fairly mundane backtracking will do. Start at the beggining of the string. Scan right until you have a word. Then, call the function on the rest of the string. Function returns "false" if it scans all the way to the right without recognizing a word. Otherwise, returns the word it found and the list of words returned by the recursive call.

Example: "tableapple". Finds "tab", then "leap", but no word in "ple". No other word in "leapple". Finds "table", then "app". "le" not a word, so tries apple, recognizes, returns.

To get longest possible, keep going, only emitting (rather than returning) correct solutions; then, choose the optimal one by any criterion you choose (maxmax, minmax, average, etc.)

Grunenwald answered 15/1, 2012 at 14:20 Comment(1)

@Breathed , backtracking search is an exponential-time algorithm. What is "good" about it? – Leek 15/1, 2012 at 23:38

If you have an exhaustive list of the words contained within the string:

word_list = ["table", "apple", "chair", "cupboard"]

Using list comprehension to iterate over the list to locate the word and how many times it appears.

string = "tableapplechairtablecupboard"

def split_string(string, word_list):

    return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()

The function returns a string output of words in order of the list table table apple chair cupboard

Odontoid answered 6/8, 2019 at 19:38 Comment(0)

Here is a sample code (based on some examples) with vanilla (pure) JavaScript. Make sure to add words base (sample.txt) to use:

 async function getSampleText(data) {
        await fetch('sample.txt').then(response => response.text())
            .then(text => {
                const wordList = text;
                // Create a regular expression for splitting the input string.
                const splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");

                // Initialize the variables for storing the maximum word length and the word costs.
                let maxWordLen = 0;
                let wordCost = {};

                // Split the word list into an array of words.
                const words = wordList.split('\n');

                // Calculate the word costs based on the word list.
                words.forEach((word, index) => {
                    wordCost[word] = Math.log((index + 1) * Math.log(words.length));
                });

                // Find the maximum word length.
                words.forEach((word) => {
                    if (word.length > maxWordLen) {
                        maxWordLen = word.length;
                    }
                });

                console.log(maxWordLen);
                //console.log(split(process.argv[2]));

                /**
                 * Split the input string into an array of words.
                 * @param {string} s The input string.
                 * @return {Array} The array of words.
                 */
                function split(s) {
                    const list = [];
                    s.split(splitRegex).forEach((sub) => {
                        _split(sub).forEach((word) => {
                            list.push(word);
                        });
                    });
                    return list;
                }

                /**
                 * Split the input string into an array of words.
                 * @private
                 * @param {string} s The input string.
                 * @return {Array} The array of words.
                 */
                function _split(s) {
                    const cost = [0];

                    /**
                     * Find the best match for the i first characters, assuming cost has been built for the i-1 first characters.
                     * @param {number} i The index of the character to find the best match for.
                     * @return {Array} A pair containing the match cost and match length.
                     */
                    function best_match(i) {
                        const candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
                        let minPair = [Number.MAX_SAFE_INTEGER, 0];
                        candidates.forEach((c, k) => {
                            let ccost;
                            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                                ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
                            } else {
                                ccost = Number.MAX_SAFE_INTEGER;
                            }
                            if (ccost < minPair[0]) {
                                minPair = [ccost, k + 1];
                            }
                        });
                        return minPair;
                    }

                    // Build the cost array.
                    for (let i = 1; i < s.length + 1; i++) {
                        cost.push(best_match(i)[0]);
                    }

                    // Backtrack to recover the minimal-cost string.
                    const out = [];
                    let i = s.length;
                    while (i > 0) {
                        const c = best_match(i)[0];
                        const k = best_match(i)[1];
                        if (c === cost[i]) {
                            console.log("Done: " + c);
                        }

                        let newToken = true;
                        if (s.slice(i - k, i) !== "'") {
                            if (out.length > 0) {
                                if (out[-1] === "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                                    out[-1] = s.slice(i - k, i) + out[-1];
                                    newToken = false;
                                }
                            }
                        }

                        if (newToken) {
                            out.push(s.slice(i - k, i));
                        }

                        i -= k;
                    }
                    return out.reverse();
                }

                console.log(split('Thiswasaveryniceday'));
            })

    }

    getSampleText();

Alpinist answered 7/1, 2023 at 23:57 Comment(0)

Use Enchant Library. The best option. Check out : https://www.youtube.com/watch?v=Q3UR-uBWGfY&t=206s

# Import the enchant library for spell-checking
import enchant
def split_merged_words(word_to_split):
    splitted_words = []

    # "en_US" is the language code for English
    dictionary = enchant.Dict("en_US")
 
    word = word_to_split
    length_of_word = len(word)
    i = 0
    while i < length_of_word:
        for j in range(length_of_word, i, -1):
            word_to_check = word[i:j]
            if dictionary.check(word_to_check):
                splitted_words.append(word_to_check)
                i = j
                break
    return splitted_words

merged_words = input("Enter the merged words: ")
words = split_merged_words(merged_words)
print("The splitted words:", words)

Enter the merged words: tableapplechairtablecupboard

The splitted words: ['table', 'apple', 'chair', 'table', 'cupboard']

Quarterback answered 5/7, 2023 at 6:14 Comment(0)

-1

You need to identify your vocabulary - perhaps any free word list will do.

Once done, use that vocabulary to build a suffix tree, and match your stream of input against that: http://en.wikipedia.org/wiki/Suffix_tree

Obau answered 15/1, 2012 at 14:25 Comment(14)

How would this work in practice? After building the suffix tree, how would you know what to match? – Oba 16/12, 2013 at 20:12

@JohnKurlak Like any other deterministic finite automaton - the end of a complete word is an accepting state. – Obau 16/12, 2013 at 20:18

Doesn't that approach require backtracking? You didn't mention backtracking in your answer... – Oba 16/12, 2013 at 20:51

Why not? What happens if you have "tableprechaun", as mentioned below? It will match the longest word that it can, "table", and then it won't find another word. It will have to backtrack back to "tab" and then match "leprechaun". – Oba 16/12, 2013 at 20:56

@JohnKurlak Multiple "branches" can be live at the same time. In effect, you push a token into the tree for every letter which is a possible word start, and the same letter may advance other live tokens. – Obau 16/12, 2013 at 20:59

That seems like a lot to keep track of in order to advance a token. It would require some thinking to not repeat the same work, so I think a dynamic programming approach would be a lot simpler/faster. – Oba 16/12, 2013 at 21:5

@JohnKurlak In what way is that a lot to keep track of? It's one token per potential word. There's no repetition of work as long as you can code a simple loop. As to a dynamic programming approach, I can't evaluate that in the abstract, as dynamic programming isn't a single algorithm, but I strongly doubt that would be simpler or more efficient, simply because this approach is not obviously suboptimal. – Obau 16/12, 2013 at 21:10

I guess I still don't understand your solution. If the words that can be made from "tableprechaun" are "tab", "able", "table", "pre", and "leprechaun", do you have a token at "t", "a", "p", and "l" in the tree? – Oba 16/12, 2013 at 21:40

@JohnKurlak You would insert a token for each instance of t, a, p, and l. The tokens move if the following letter is an available transition; if the next letter is not an available transition, the token is discarded. I would have thought this would be elementary to any computer scientist. If you're familiar with machine models, I suggest you start with reading about finite automata on wikipedia. – Obau 16/12, 2013 at 21:47

Each instance in the entire tree? When do the tokens move? – Oba 16/12, 2013 at 21:48

I'm having trouble understanding your solution because you haven't explained what the tokens are or what they represent. It would probably save us both time if you explicitly laid out your algorithm in your answer. – Oba 16/12, 2013 at 21:54

@JohnKurlak It would probably save us time if you read up about this algorithm yourself. I have already explained what the tokens represent, when they are inserted, and when they move or are eliminated. If you need me to coach you or tutor you, I'm willing to do so for a fee. – Obau 16/12, 2013 at 21:56

@JohnKurlak Also, I'm aware that you have decided to vent your frustration by downvoting this answer. – Obau 16/12, 2013 at 21:57

"if you read up about this algorithm yourself" ... I would love to read about the algorithm, but you didn't name one. – Oba 17/12, 2013 at 5:53

-1

Based on unutbu's solution I've implemented a Java version:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }

    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

Input: "tableapplechairtablecupboard"

Output: [table, apple, chair, table, cupboard]

Input: "tableprechaun"

Output: [tab, leprechaun]

Foundry answered 13/3, 2014 at 5:38 Comment(0)

-1

For German language there is CharSplit which uses machine learning and works pretty good for strings of a few words.

https://github.com/dtuggener/CharSplit

Fondue answered 28/6, 2018 at 9:10 Comment(0)

-1

Expanding on @miku's suggestion to use a Trie, an append-only Trie is relatively straight-forward to implement in python:

class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None

        return node

We can then build a Trie-based dictionary from a set of words:

dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)

Which will produce a tree that looks like this (* indicates beginning or end of a word):

* -> a*
 \\\ 
  \\\-> p -> e -> a*
   \\              \-> n -> u -> t*
    \\
     \\-> b -> u -> t*
      \\             \-> t*
       \\                 \-> e*
        \\                     \-> r*
         \
          \-> n -> u -> t*

We can incorporate this into a solution by combining it with a heuristic about how to choose words. For example we can prefer longer words over shorter words:

def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [ s ]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [ s[:possible+1] ] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case 
    return []

We can use this function like this:

>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]

Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than 2 times for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst-case.

The final result is only a handful of inspections:

'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word

A benefit of this solution is in the fact that you know very quickly if longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against a dictionary. It also makes getting to an unsolvable answer comparatively cheap to other implementations.

The downsides of this solution are a large memory footprint for the trie and the cost of building the trie up-front.

Dissertate answered 23/8, 2018 at 18:57 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

The idea

The code

The results

Optimization

Recommended topics

Hot tags