How can I split multiple joined words?
I have an array of 1000 or so entries, with examples below:

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

I would like to be able to split these into their respective words, as:

wicked weather
liquid weather
drive our trucks
go compact
slim projector

I was hoping a regular expression might do the trick. But since there is no boundary to stop on, and no capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?

I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?

Phenolphthalein answered 12/10, 2008 at 2:37 Comment(5)
Note that a naive implementation would return "wick ed weather"Fluor
hey optimal solutions, i saw your response on an EMR question and was wondering if i could contact you with some questions regarding healthcare IT?Unaneled
Also see Python and Ruby implementations at #11448359Niklaus
Does this answer your question? How to split text without spaces into list of wordsOversew
↑ Looks very good @Oversew !Phenolphthalein

The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)

import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower()) 
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

Testing it:

>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'

To be practical you'll likely want a couple refinements:

  • Add logs of probabilities, don't multiply probabilities. This avoids floating-point underflow.
  • Your inputs will in general use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the above exponential search algorithm.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor.
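
A minimal sketch of both refinements, assuming the dictionary, total and max_word_length built above; the unknown-word penalty here is an arbitrary placeholder to illustrate the idea, not a tuned smoothing scheme:

import math

def log_word_prob(word):
    count = dictionary.get(word, 0)
    if count:
        return math.log(count / total)
    # Crude smoothing hack: give unknown substrings a tiny, length-penalized
    # log-probability so a segmentation always exists but real words still win.
    return math.log(1.0 / total) - 5.0 * len(word)

def viterbi_segment_log(text):
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((best[j] + log_word_prob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        best.append(score_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, best[-1]
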
Miscegenation answered 26/1, 2009 at 23:10 Comment(13)
Isn't that the algorithm used to sort out DNA sequences?Victorie
I dunno, but the general idea of Viterbi (finding the most likely sequence of hidden states given a sequence of observations) -- that ought to have uses with DNA too.Miscegenation
en.wikipedia.org/wiki/… says they sometimes use hidden Markov models for sequence alignment, and sequence alignment is the basic task in shotgun sequencing: en.wikipedia.org/wiki/Bioinformatics#Sequence_analysis -- so I guess you're right, at least sort of!Miscegenation
I've got a question about this one. I like the solution, however as I've been trying to reproduce it in C#, I ran into an issue. Since the first for loop ranges from 1 to len+1 and the inner loop ranges from max(i-maxWordLen) to i, this means that the 1st word that it looks at is from 0 to 1 (length of 2). If a sentence like "Ithinkclearly" is put in there, the "I" will not be recognized as a distinct word but it will rather catch "It". Am I correct in my logic? Also, what language is this written in?Candler
@mj_, it's in Python, and range(low, high) means [low, low+1, ..., high-1] -- that is, the high bound is not included. The loop for j in range(0, 1) looks at j=0 only.Miscegenation
Darius, how come then the word "I" is not selected as opposed to "its". "I" has a greater frequency. This approach doesn't backtrack as far as I can tell in the event of it not being able to match subsequent letters. Am I correct?Candler
You're right that it doesn't backtrack. It does produce the segmentation with the highest total score (product of single-word probabilities). It's not clear to me what concrete problem you're running into -- you have 'hink' in your dictionary and you're getting "it hink clearly"? The results with single-word frequencies can be improved, at the cost of significantly greater computation, by going to a second-order model where you're looking at conditional probabilites: P(word | preceding_word). See norvig.com/ngrams (which I should update the answer with a link to).Miscegenation
For a perfect result, you will need a couple more refinements: 1) You need to return a very small probability for non-words instead of zero, otherwise if some part of the phrase never appears in the dictionary it will ruin the results for the whole phrase. I used (log(1/total)-max_word_len-1)*(j-i) for the logarithm of non-words probability (this also penalizes longer non-words to prevent eating up valid words). 2) You need to keep tuples of (-non_words_len, -non_words_count, prob) in probs to minimize non-recognized sequences and merge adjacent non-word chunks.Basicity
This answer is like a "best of" for StackOverflow.Andizhan
What is big.txt?Lavalava
@Lavalava just any big text file full of English words to build a dictionary from. You might use War and Peace from Project Gutenberg, say.Miscegenation
Hi @IlyaSemenov, could you help me to understand smoothing formulae you used? Many thanksLudovika
English words list is here. raw.githubusercontent.com/dwyl/english-words/master/words.txtZipper

Can a human do it?

farsidebag
far sidebag
farside bag
far side bag

Not only do you have to use a dictionary, you might have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...)

For how to do statistics that might be helpful, I turn you to Dr. Peter Norvig, who addresses a different, but related problem of spell-checking in 21 lines of code: http://norvig.com/spell-correct.html

(he does cheat a bit by folding every for loop into a single line.. but still).

Update This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is now expected to be some text representative of your domain or English in general. I used big.txt from Norvig, linked above, and catted a dictionary to it, to cover missing words).

A combination of two words will most of the time beat a combination of 3 words, unless the frequency difference is enormous.
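
To make that concrete, here is a toy calculation with made-up frequencies (mimicking the product-of-probabilities score in find_word_seq_score below): every extra word multiplies in another number far below 1, so the two-word split wins unless its words are vanishingly rare.

# Hypothetical relative frequencies, for illustration only.
p = {'wicked': 1e-5, 'weather': 3e-5, 'wick': 2e-6, 'ed': 4e-4}
print(p['wicked'] * p['weather'])          # 3.0e-10  (two words)
print(p['wick'] * p['ed'] * p['weather'])  # 2.4e-14  (three words, much smaller)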


I posted this code with some minor changes on my blog

http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/ and also wrote a little about the underflow bug in this code.. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before: http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/


Output on your words, plus a few of my own -- notice what happens with "orcore":

perl splitwords.pl big.txt words
answerveal: 2 possibilities
 -  answer veal
 -  answer ve al

wickedweather: 4 possibilities
 -  wicked weather
 -  wicked we at her
 -  wick ed weather
 -  wick ed we at her

liquidweather: 6 possibilities
 -  liquid weather
 -  liquid we at her
 -  li quid weather
 -  li quid we at her
 -  li qu id weather
 -  li qu id we at her

driveourtrucks: 1 possibilities
 -  drive our trucks

gocompact: 1 possibilities
 -  go compact

slimprojector: 2 possibilities
 -  slim projector
 -  slim project or

orcore: 3 possibilities
 -  or core
 -  or co re
 -  orc ore

Code:

#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT,$TOTAL);
{
  my( $dict_file, $word_file ) = @ARGV;
  ($dict_file && $word_file) or die(Usage);

  {
    my $DICT;
    ($DICT, $TOTAL) = get_word_stats($dict_file);
    %DICT = %$DICT;
  }

  {
    open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

    foreach my $word (<$WORDS>) {
      chomp $word;
      my $arr = find_matches($word);


      local $_;
      # Schwartzian Transform
      my @sorted_arr =
        map  { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map  {
          [ $_, find_word_seq_score(@$_) ]
        }
        @$arr;


      print_results( $word, @sorted_arr );
    }

    close $WORDS;
  }
}


sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return  @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    unless( $length ){
      push @$found_parses, $words_sofar;

      return @$found_parses if wantarray;
      return  $found_parses;
    }

    foreach my $i ( 2..$length ){
      my $prefix = substr($string, 0, $i);
      my $suffix = substr($string, $i, $length-$i);

      if( exists $DICT{$prefix} ){
        my @words = ( @$words_sofar, $prefix );
        find_matches_rec( $suffix, @words, @$found_parses );
      }
    }

    return @$found_parses if wantarray;
    return  $found_parses;
}


## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}

sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/= undef;
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
      foreach ( split(/\b/, $_) ) {
        $dict{$_} += 1;
        $total++;
      }
    }

    close $DICT;

    return (\%dict, $total);
}

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word,  @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
      print ' -  ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}
Brownedoff answered 12/10, 2008 at 2:37 Comment(8)
Can this be run on Windows XP? How do I get Perl loaded. I obviously need to get out more (in terms of other languages)! :)Phenolphthalein
Yeah, you are looking for something called ActivePerl , which is the windows distribution. I didn't use any modules, so you don't need to add anything to the standard build. Just find a good representative dictionary.Brownedoff
+1 - I don't know Perl but I gave you +1 for going above and beyond the call of duty. Nice!Ursuline
I modified the code to try and make it more maintainable. Although it was fairly decent to start with.Fluor
I wouldn't have modified it, if it wasn't already a community wiki post.Fluor
yeah I modified it too much myself (didn't know there's a wiki switchover). Thanks for the edits -- alas, better SE practice leads to worse readability. I like the earlier version better for instructional purposes, but folks can find it on the blog anyway, so keeping your edits for comparison.Brownedoff
There is obviously a flaw in this code, it only comes up with 'expert sex change' as the second most likely result.Evangelina
I also have the same issue, but I want a solution in Python :(Chenee

pip install wordninja

>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']
Khotan answered 19/9, 2019 at 11:46 Comment(2)
Better to mention the answer that gave birth to the mentioned wordninja package: https://mcmap.net/q/156149/-how-to-split-text-without-spaces-into-list-of-wordsDido
You might mention that this is viterbi in a stable package form.Amoretto

The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the remainder of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural since backtracking needs to happen when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what is a word and prints out solutions as it finds them (some strings can be broken out into multiple possible sets of words, for example wickedweather could be parsed as "wicked we at her"). If you just want one set of words you will need to determine the rules for selecting the best set, perhaps by selecting the solution with fewest number of words or by setting a minimum word length.

#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
my %words; # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
  chomp;
  $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
  chomp;
  my @words;
  find_words(lc($_));
}

sub find_words {
  # Print every way $string can be parsed into whole words
  my $string = shift;
  my @words = @_;
  my $length = length $string;

  foreach my $i ( 1 .. $length ) {
    my $word = substr $string, 0, $i;
    my $remainder = substr $string, $i, $length - $i;
    # Some dictionaries contain each letter as a word
    next if ($i == 1 && ($word ne "a" && $word ne "i"));

    if (defined($words{$word})) {
      push @words, $word;
      if ($remainder eq "") {
        print join(' ', @words), "\n";
        return;
      } else {
        find_words($remainder, @words);
      }
      pop @words;
    }
  }

  return;
}
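
For readers more comfortable with Python than Perl, here is a rough sketch (not the answer author's code) of the same recursive, backtracking enumeration:

def find_words(string, dictionary, so_far=()):
    # Print every way `string` can be parsed into whole dictionary words.
    if not string:
        print(' '.join(so_far))
        return
    for i in range(1, len(string) + 1):
        word, remainder = string[:i], string[i:]
        # Many word lists contain every single letter as a "word".
        if i == 1 and word not in ('a', 'i'):
            continue
        if word in dictionary:
            find_words(remainder, dictionary, so_far + (word,))

words = {'wicked', 'wick', 'ed', 'we', 'at', 'her', 'weather', 'a', 'i'}
find_words('wickedweather', words)   # prints all four splits, e.g. "wicked weather"
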
Paronym answered 12/10, 2008 at 4:12 Comment(2)
haven't run it, but it reads like a better solution than BKB's since it produces all possibilities.Brownedoff
This works like magic. Exactly what I have been looking for, thank you so much. I am trying to translate into PHP. If there is a PHP version, please share it here.Grindlay

I think you're right in thinking that it's not really a job for a regular expression. I would approach this using the dictionary idea - look for the longest prefix that is a word in the dictionary. When you find that, chop it off and do the same with the remainder of the string.

The above method is subject to ambiguity, for example "drivereallyfast" would first find "driver" and then have trouble with "eallyfast". So you would also have to do some backtracking if you ran into this situation. Or, since you don't have that many strings to split, just do by hand the ones that fail the automated split.
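
A minimal Python sketch of that longest-prefix-first idea with the back-off described above (not the answer author's code; `words` is assumed to be a plain set of lowercase dictionary words):

def split_longest_first(s, words):
    if not s:
        return []
    # Try the longest prefix first; fall back to shorter ones if the
    # remainder cannot itself be split into dictionary words.
    for i in range(len(s), 0, -1):
        prefix = s[:i]
        if prefix in words:
            rest = split_longest_first(s[i:], words)
            if rest is not None:
                return [prefix] + rest
    return None  # no valid split found

print(split_longest_first('drivereallyfast', {'drive', 'driver', 'really', 'fast'}))
# ['drive', 'really', 'fast'] -- backs off after 'driver' leads to a dead end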

Hove answered 12/10, 2008 at 2:40 Comment(2)
Gotta locate a dictionary file to hit against.Phenolphthalein
Thanks! I am going to get this and that Perl together, see what happens.Phenolphthalein

This is related to a problem known as identifier splitting or identifier name tokenization. In the OP's case, the inputs seem to be concatenations of ordinary words; in identifier splitting, the inputs are class names, function names or other identifiers from source code, and the problem is harder. I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question while looking for identifier splitters (like I was, not long ago), I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.

Splitting identifiers is deceptively difficult. Programmers commonly use abbreviations, acronyms and word fragments when naming things, and they don't always use consistent conventions. Even when identifiers do follow some convention such as camel case, ambiguities can arise.

Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:

# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']

Using the examples from the OP's question:

# spiral wickedweather liquidweather  driveourtrucks gocompact slimprojector
wickedweather: ['wicked', 'weather']
liquidweather: ['liquid', 'weather']
driveourtrucks: ['driveourtrucks']
gocompact: ['go', 'compact']
slimprojector: ['slim', 'projector']

As you can see, it is not perfect. It's worth noting that Ronin has a number of parameters and adjusting them makes it possible to split driveourtrucks too, but at the cost of worsening performance on program identifiers.

More information can be found in the GitHub repo for Spiral.
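
Spiral can also be called from Python directly; as far as I recall the entry point is spiral.ronin.split, but double-check the README of the repo linked above if this doesn't match your installed version:

# Hedged sketch of Spiral's Python API (verify against the project README).
from spiral import ronin

for s in ['wickedweather', 'driveourtrucks', 'J2SEProjectTypeProfiler']:
    print(s, ronin.split(s))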

Amaranthine answered 23/3, 2018 at 0:16 Comment(0)

So I spent like 2 days on this answer, since I need it for my own NLP work. My answer is derived from Darius Bacon's answer, which itself was derived from the Viterbi algorithm. I also abstracted it to take each word in a message, attempt to split it, and then reassemble the message. I expanded Darius's code to make it debuggable. I also swapped out the need for "big.txt", and use the wordfreq library instead.

Some comments stress the need to use a non-zero word frequency for non-existent words. I found that using any frequency higher than zero would cause "itseasyformetosplitlongruntogetherblocks" to undersplit into "itseasyformetosplitlongruntogether blocks". The algorithm in general tends to either oversplit or undersplit various test messages depending on how you combine word frequencies and how you handle missing word frequencies. I played around with many tweaks until it behaved well. My solution uses a 0.0 frequency for missing words. It also adds a reward for word length (otherwise it tends to split words into characters). I tried many length rewards, and the one that seems to work best for my test cases is word_frequency * (e ** word_length).

There were also comments warning against multiplying word frequencies together. I tried adding them, using the harmonic mean, and using 1-freq instead of the 0.00001 form. They all tended to oversplit the test cases. Simply multiplying word frequencies together worked best. I left my debugging print statements in there, to make it easier for others to continue tweaking.

Finally, there's a special case where if your whole message is a word that doesn't exist, like "Slagle's", then the function splits the word into individual letters. In my case, I don't want that, so I have a special return statement at the end to return the original message in those cases.

import numpy as np
from wordfreq import get_frequency_dict

word_prob = get_frequency_dict(lang='en', wordlist='large')
max_word_len = max(map(len, word_prob))  # 34

def viterbi_segment(text, debug=False):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        new_probs = []
        for j in range(max(0, i - max_word_len), i):
            substring = text[j:i]
            length_reward = np.exp(len(substring))
            freq = word_prob.get(substring, 0) * length_reward
            compounded_prob = probs[j] * freq
            new_probs.append((compounded_prob, j))
            
            if debug:
                print(f'[{j}:{i}] = "{text[lasts[j]:j]} & {substring}" = ({probs[j]:.8f} & {freq:.8f}) = {compounded_prob:.8f}')

        prob_k, k = max(new_probs)  # max over a list of tuples compares first elements first, so this picks the entry with the largest compounded probability
        probs.append(prob_k)
        lasts.append(k)

        if debug:
            print(f'i = {i}, prob_k = {prob_k:.8f}, k = {k}, ({text[k:i]})\n')


    # when text is a word that doesn't exist, the algorithm breaks it into individual letters.
    # in that case, return the original word instead
    if len(set(lasts)) == len(text):
        return text

    words = []
    k = len(text)
    while 0 < k:
        word = text[lasts[k]:k]
        words.append(word)
        k = lasts[k]
    words.reverse()
    return ' '.join(words)

def split_message(message):
    new_message = ' '.join(viterbi_segment(wordmash, debug=False) for wordmash in message.split())
    return new_message

messages = [
    'tosplit',
    'split',
    'driveourtrucks',
    "Slagle's",
    "Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector",
    'itseasyformetosplitlongruntogetherblocks',
]

for message in messages:
    print(f'{message}')
    new_message = split_message(message)
    print(f'{new_message}\n')
Output:

tosplit
to split

split
split

driveourtrucks
drive our trucks

Slagle's
Slagle's

Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector
Slagle's wicked weather liquid weather drive our trucks go compact slim projector

itseasyformetosplitlongruntogetherblocks
its easy for me to split long run together blocks
Unequaled answered 29/11, 2021 at 23:9 Comment(0)

A simple solution with Python: install the wordsegment package with pip install wordsegment.

$ echo thisisatest | python -m wordsegment
this is a test
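
The package can also be used from Python; if I remember the API correctly it is load() followed by segment() (check the wordsegment docs if this has changed):

# Sketch of the wordsegment Python API as I recall it.
from wordsegment import load, segment

load()                           # loads the bundled unigram/bigram data
print(segment('thisisatest'))    # ['this', 'is', 'a', 'test']
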
Quita answered 30/8, 2019 at 21:59 Comment(0)
pip install wordninja

import wordninja
n = wordninja.split('bettergood')
m = wordninja.split("coffeeshop")
print(n, m)

wordlist = ['hello', 'coffee', 'shop', 'better', 'good']
mat = 'coffeeshop'
expected = []
for i in wordlist:
    if i in mat:
        expected.append(i)
print(expected)

Output:

['better', 'good'] ['coffee', 'shop']
['coffee', 'shop']
Hoard answered 12/10, 2008 at 2:37 Comment(0)

Well, the problem itself is not solvable with just a regular expression. A solution (probably not the best) would be to get a dictionary and do a regular expression match for each word in the dictionary against each word in the list, adding a space whenever successful. Certainly this would not be terribly quick, but it would be easy to program and faster than doing it by hand.
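
A rough sketch of that brute-force pass, just to illustrate the idea (it is naive and, as a comment on the question points out, can produce splits like "wick ed weather" unless you try longer words first):

import re

def naive_split(s, dictionary):
    # For every dictionary word (longest first), insert a space after any
    # occurrence that starts on a word boundary and is followed by more letters.
    for w in sorted(dictionary, key=len, reverse=True):
        s = re.sub(r'\b' + re.escape(w) + r'(?=\w)', w + ' ', s)
    return s

print(naive_split('wickedweather', ['wicked', 'weather', 'liquid']))  # 'wicked weather'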

Scott answered 12/10, 2008 at 2:41 Comment(0)

A dictionary-based solution would be required. This might be simplified somewhat if you have a limited dictionary of words that can occur; otherwise, words that form the prefix of other words are going to be a problem (for example, "wick" at the start of "wickedweather" invites the spurious split "wick ed weather").

Adams answered 12/10, 2008 at 2:41 Comment(0)

There is a Python package released by Santhosh Thottingal called mlmorph which can be used for morphological analysis.

https://pypi.org/project/mlmorph/

Examples:

from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")

Gives

[('കേരളം<np><genitive>', 179)]

He also wrote a blog on the topic https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/

Anglicize answered 16/1, 2019 at 5:34 Comment(0)

This will work if they are camelCase. JavaScript!!!

function spinalCase(str) {
  let lowercase = str.trim()
  let regEx = /\W+|(?=[A-Z])|_/g
  let result = lowercase.split(regEx).join("-").toLowerCase()

  return result;
}

spinalCase("AllThe-small Things");
Kierkegaardian answered 11/4, 2020 at 15:19 Comment(1)
Okay, first we create a function which accepts a value. let lowercase = str.trim(): .trim() doesn't touch whitespace inside the string, it only removes whitespace from the start, the end, or both. let regEx = /\W+|(?=[A-Z])|_/g is a regular expression where \W+ matches runs of characters that are not letters, digits or underscores (so spaces and hyphens are caught), _ matches underscores explicitly, | means OR, and (?=[A-Z]) is a positive lookahead that splits before each capital letter. Then result runs the split, joins the pieces with a hyphen, and converts it all toLowerCase.Kierkegaardian

One of the solutions could be recursion (the same can be converted into dynamic programming):

static List<String> wordBreak(
    String input,
    Set<String> dictionary
) {

  List<List<String>> result = new ArrayList<>();
  List<String> r = new ArrayList<>();

  helper(input, dictionary, result, "", 0, new Stack<>());

  for (List<String> strings : result) {
    String s = String.join(" ", strings);
    r.add(s);
  }

  return r;
}

static void helper(
    final String input,
    final Set<String> dictionary,
    final List<List<String>> result,
    String state,
    int index,
    Stack<String> stack
) {

  if (index == input.length()) {

    // add the last word
    stack.push(state);

    for (String s : stack) {
      if (!dictionary.contains(s)) {
        return;
      }
    }

    result.add((List<String>) stack.clone());

    return;
  }

  if (dictionary.contains(state)) {
    // bifurcate
    stack.push(state);
    helper(input, dictionary, result, "" + input.charAt(index),
           index + 1, stack);

    String pop = stack.pop();
    String s = stack.pop();

    helper(input, dictionary, result, s + pop.charAt(0),
           index + 1, stack);

  }
  else {
    helper(input, dictionary, result, state + input.charAt(index),
           index + 1, stack);
  }

  return;
}

The other possible solution would be to use a trie data structure.
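
A minimal Python sketch of that trie idea (my own illustration, not from the answer): store the dictionary in a trie so each scan from a position only keeps walking while some word could still match.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def word_break(s, trie, start=0, path=(), out=None):
    out = [] if out is None else out
    if start == len(s):
        out.append(' '.join(path))
        return out
    node, i = trie, start
    # Walk the trie as long as the current prefix is still viable.
    while i < len(s) and s[i] in node:
        node = node[s[i]]
        i += 1
        if '$' in node:  # s[start:i] is a dictionary word; branch here
            word_break(s, trie, i, path + (s[start:i],), out)
    return out

trie = build_trie(['drive', 'our', 'trucks', 'go', 'compact'])
print(word_break('driveourtrucks', trie))  # ['drive our trucks']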

Decastere answered 20/10, 2020 at 16:50 Comment(0)

Use the Enchant library. The best option. Check out: https://www.youtube.com/watch?v=Q3UR-uBWGfY&t=206s

# Import the enchant library for spell-checking
import enchant
def split_merged_words(word_to_split):
    splitted_words = []
    dictionary = enchant.Dict("en_US")
    word = word_to_split
    length_of_word = len(word)
    i = 0
    while i < length_of_word:
        for j in range(length_of_word, i, -1):
            word_to_check = word[i:j]
            if dictionary.check(word_to_check):
                splitted_words.append(word_to_check)
                i = j
                break
        else:
            # No dictionary word starts at position i: keep the single character
            # and advance, otherwise this loop would never terminate.
            splitted_words.append(word[i])
            i += 1
    return splitted_words

merged_words = input("Enter the merged words: ")
words = split_merged_words(merged_words)
print("The splitted words:", words)
Personalism answered 4/7, 2023 at 4:47 Comment(0)

I may get downmodded for this, but have the secretary do it.

You'll spend more time on a dictionary solution than it would take to manually process. Further, you won't possibly have 100% confidence in the solution, so you'll still have to give it manual attention anyway.

Purehearted answered 12/10, 2008 at 2:49 Comment(1)
man.. now I really want to downvote you! :-) We tried a similar approach to filtering naughty search queries once.. spent more time building a nice interface a secretary (PR person, in my case) would use, than I would on a classifier.Brownedoff
