Split sentence without spaces in Python (NLTK?)

I have a set of concatenated words and I want to split them into lists of words.

For example:

split_word("acquirecustomerdata")
=> ['acquire', 'customer', 'data']

I found pyenchant, but it's not available for 64-bit Windows.

Then I tried to split each string into substrings and compare them against WordNet to find an equivalent word. For example:

from nltk.corpus import wordnet as wn
from nltk.metrics import edit_distance

def split_word(word):
    result = list()
    while len(word) > 2:
        i = 1
        found = True
        while found:
            i = i + 1
            # look up the current prefix in WordNet
            synsets = wn.synsets(word[:i])
            for s in synsets:
                if edit_distance(s.name().split('.')[0], word[:i]) == 0:
                    found = False
                    break
        # note: the inner loop never terminates if no prefix matches a lemma
        result.append(word[:i])
        word = word[i:]
    print(result)

But this solution is unreliable and far too slow, so I'm looking for your help.

Thank you

Mussorgsky answered 30/6, 2016 at 13:55 Comment(4)
If you're doing word detection, then "tome" might come out of that (it hides inside "customer"). I'd say fix the data source that gave you concatenated words.Reardon
As @cricket_007 suggested, word detection can be extremely difficult (often requiring machine learning and AI algorithms) and introduces a whole wealth of natural-language ambiguities; your data source should be fixed.Uveitis
What they both said. Why don't you explain how you end up with the stuck-together words that you need to split? There's probably an easier way to get where you're going.Leilanileininger
In fact, I just want to clean my data. But I will do it manually if it's not easily feasible. Thanks all.Mussorgsky

Check the Word Segmentation Task from Norvig's work.

from collections import Counter
import nltk

# requires the Brown corpus: nltk.download('brown')
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('acquirecustomerdata'))
#['acquire', 'customer', 'data']
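
A side note, not part of the original answer: segment as written re-solves the same suffixes over and over, so it is exponential in the length of the input. Norvig's own version memoizes the function; with the standard library (and the same splits and Pwords as above), a sketch could look like:

from functools import lru_cache

@lru_cache(maxsize=None)   # cache the best segmentation of each suffix
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    candidates = ([first] + segment(rest)
                  for (first, rest) in splits(text, 1))
    return max(candidates, key=Pwords)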

For a better solution than this, you can use a bigram/trigram language model.
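
A minimal sketch of that idea, in the spirit of Norvig's bigram model (cPword, Pwords2, and the '<S>' start token are illustrative names, not from the code above): condition each word's probability on its predecessor, and back off to the unigram estimate when a pair was never seen.

from collections import Counter
import nltk

# reuses the Brown corpus from above; requires nltk.download('brown')
WORDS = list(nltk.corpus.brown.words())
COUNTS = Counter(WORDS)
BIGRAMS = Counter(zip(WORDS, WORDS[1:]))   # counts of adjacent word pairs
N = sum(COUNTS.values())

def cPword(word, prev):
    "P(word | prev), backing off to the unigram estimate for unseen pairs."
    if BIGRAMS[(prev, word)] and COUNTS[prev]:
        return BIGRAMS[(prev, word)] / COUNTS[prev]
    return COUNTS[word] / N

def Pwords2(words, prev='<S>'):
    "Probability of a word sequence under the bigram model."
    result = 1
    for w in words:
        result *= cPword(w, prev)
        prev = w
    return result

Swapping key=Pwords for key=Pwords2 in segment then scores candidate segmentations with word-pair statistics. Unseen words still get probability 0 here; Norvig's write-up adds smoothing (a length-based penalty for unknown words) to handle that.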

More examples at: Word Segmentation Task

Benito answered 10/7, 2016 at 12:7 Comment(0)

There is a library called "wordsegment" that you can use: http://www.grantjenks.com/docs/wordsegment/

pip install wordsegment

from wordsegment import load, segment
load()   # loads the word-frequency data the segmenter needs
segment("acquirecustomerdata")

Output:

['acquire', 'customer', 'data']
Hideaway answered 9/11, 2021 at 18:49 Comment(0)

If you have a list of all possible words, you can use something like this:

import re

word_list = ["go", "walk", "run", "jump"]  # list of all possible words
# sort longest-first so longer words win over their own prefixes, and
# escape each word in case it contains regex metacharacters
pattern = re.compile("|".join(re.escape(word)
                              for word in sorted(word_list, key=len, reverse=True)))

s = "gowalkrunjump"
result = re.findall(pattern, s)   # ['go', 'walk', 'run', 'jump']
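
One caveat worth adding (my note, not part of the answer): re.findall silently skips characters that no word matches, so the result can look plausible even when the string isn't fully covered by word_list. A sketch of a full-coverage check (split_known is an illustrative helper name):

import re

word_list = ["go", "walk", "run", "jump"]
# longest-first so longer words win over their own prefixes
alternation = "|".join(re.escape(w) for w in sorted(word_list, key=len, reverse=True))

def split_known(s):
    "Split s into dictionary words, or return None if s isn't fully covered."
    if re.fullmatch("(?:%s)+" % alternation, s):
        return re.findall(alternation, s)
    return None

print(split_known("gowalkrunjump"))   # ['go', 'walk', 'run', 'jump']
print(split_known("gowalkxrun"))      # None: 'x' matches no word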
Vocalic answered 30/6, 2016 at 14:18 Comment(1)
"I have a set of concatenated word"Amalgam

You can also use a library called wordninja.

Just install it:

pip install wordninja

and use it like this:

import wordninja as wn
print(wn.split("thisworkswithnospacesbetweensentences"))

which will produce this output:

['this', 'works', 'with', 'no', 'spaces', 'between', 'sentences']
Daliladalis answered 16/6, 2023 at 13:45 Comment(0)
