How to split a string into words. Ex: "stringintowords" -> "String Into Words"?
Asked Answered
M

10

21

What is the right way to split a string into words ? (string doesn't contain any spaces or punctuation marks)

For example: "stringintowords" -> "String Into Words"

Could you please advise what algorithm should be used here ?

! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.

Myelencephalon answered 12/8, 2010 at 11:10 Comment(1)
The only way that you could split that string into words is to use a dictionary. Although this would probably be quite resource intensive.Cravens
W
15

As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:

(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).

(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).

Wallie answered 12/8, 2010 at 23:48 Comment(0)
P
23

Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate

S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).

This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.

Palaeozoic answered 12/8, 2010 at 15:16 Comment(3)
How can one split the words? Suppose, the dict contains "by, bygone, gone, days" and the word string is "bygonedays". I want the maximum number of splits - so the output must be "by gone days" and not "bygone days"Legatee
The original question did not ask to obtain the maximum number of words. If you want that, just track this number in each table entry.Prendergast
To actually find the splitting, we could not just store the splits in S that set to true. For example, for word "splitting", there can be "split" and "splitting", which makes a bool array: [f,f,f,f,true,f,f,f,true], so in the end, according to your alg, we may end up saying that: "split" and "ting" is the solution(though 'ting' is not a valid word). Maybe instead of storing bool value in the array, we can store a list, in which contains all the valid splits until now, at last, we can simply check the last slot of the array to get the solutions.Entertainer
W
15

As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:

(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).

(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).

Wallie answered 12/8, 2010 at 23:48 Comment(0)
C
5

There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.

In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.

Cassock answered 12/8, 2010 at 11:35 Comment(1)
I think it's too complicated to be an out-of-the-box solution which is obviously expected :)Monosepalous
M
4

If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.

For example: windowsteamblog (of http://windowsteamblog.com/ fame)

  • windows team blog
  • window steam blog
Morula answered 12/8, 2010 at 11:21 Comment(4)
Agreed that a dictionary is needed, but why do you think it'll be that inefficient? This is a typical application for Tries...Inofficious
@Jérémie, ok, maybe inefficient wasn't the right choice of words, perhaps "bloody slow" would be better =)Morula
window steam blog would never be a website! i was really rooting for it, too, but nope: msft. =(Zacatecas
This should be a comment, as it's not an answer.Reyna
P
3

Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.

You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Coefficient Theorem, with x and y both being 1, this is equal to pow(2, n-1).

Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).

If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:

Needle:       isashort    
Haystack:     This is a short phrase    
Preprocessed: thisisashortphrase   
NumSpaces   : 000011233333444444 

And your answer would come from:

location = prepocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)

Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.

Peel answered 12/8, 2010 at 15:44 Comment(0)
C
1

This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).

Cellist answered 16/5, 2011 at 14:44 Comment(0)
H
0

This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.

With fairly-sized dictionary this is going to be insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.

Hara answered 12/8, 2010 at 11:14 Comment(2)
Actually, it needn't be nearly as expensive as the knapsack problem. You can apply dynamic programming techniques to substantially reduce the search space.Cassock
Yes, agreeing with Nick Johnson: this is a standard, simple O(n^2) dynamic programming problem. Throwing in an NP-complete problem is like trying to slice bread with a jackhammer!!!Inofficious
W
0

Create a list of possible words, sort it from long words to short words.

Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.

Watford answered 12/8, 2010 at 11:15 Comment(0)
U
0

A simple Java solution which has O(n^2) running time.

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
Uncrowned answered 2/3, 2015 at 12:16 Comment(0)
G
-1

I was looking at the problem and thought maybe I could share how I did it. It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:

string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);

/** this way, one does not check the dictionary to check for word validity 
 *  on every substring; It would only be queried once and for all, 
 *  eliminating multiple travels to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();

validwords = validwords.sort(length, desc);

array segments = [];
while(mainword != ""){
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.remove(v);
            x = 0;
        }
    }

    /**
     * remove the first character if any of valid words do not match, then start again
     * you may need to add the first character to the result if you want to
     */
    mainword = mainword.substring(1);
}

string result = segments.join(" ");
Geostatic answered 23/12, 2012 at 15:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.