The answer by Generic Human is great. But the best implementation of this I've ever seen was written Peter Norvig himself in his book 'Beautiful Data'.
Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).
- The data is a bit better - both in terms of size and in terms of precision (he uses a word count rather than a simple ranking)
- More importantly, it's the logic behind n-grams that really makes the approach so accurate.
The example he provides in his book is the problem of splitting a string 'sitdown'. Now a non-bigram method of string split would consider p('sit') * p ('down'), and if this less than the p('sitdown') - which will be the case quite often - it will NOT split it, but we'd want it to (most of the time).
However when you have the bigram model you could value p('sit down') as a bigram vs p('sitdown') and the former wins. Basically, if you don't use bigrams, it treats the probability of the words you're splitting as independent, which is not the case, some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together in a lot of instances and confuses the splitter.
Here's the link to the data (it's data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/
and here's the link to the code: http://norvig.com/ngrams/ngrams.py
These links have been up a while, but I'll copy paste the segmentation part of the code here anyway
import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10
def memo(f):
"Memoize function f."
table = {}
def fmemo(*args):
if args not in table:
table[args] = f(*args)
return table[args]
fmemo.memo = table
return fmemo
def test(verbose=None):
"""Run some tests, taken from the chapter.
Since the hillclimbing algorithm is randomized, some tests may fail."""
import doctest
print 'Running tests...'
doctest.testfile('ngrams-test.txt', verbose=verbose)
################ Word Segmentation (p. 223)
@memo
def segment(text):
"Return a list of words that is the best segmentation of text."
if not text: return []
candidates = ([first]+segment(rem) for first,rem in splits(text))
return max(candidates, key=Pwords)
def splits(text, L=20):
"Return a list of all possible (first, rem) pairs, len(first)<=L."
return [(text[:i+1], text[i+1:])
for i in range(min(len(text), L))]
def Pwords(words):
"The Naive Bayes probability of a sequence of words."
return product(Pw(w) for w in words)
#### Support functions (p. 224)
def product(nums):
"Return the product of a sequence of numbers."
return reduce(operator.mul, nums, 1)
class Pdist(dict):
"A probability distribution estimated from counts in datafile."
def __init__(self, data=[], N=None, missingfn=None):
for key,count in data:
self[key] = self.get(key, 0) + int(count)
self.N = float(N or sum(self.itervalues()))
self.missingfn = missingfn or (lambda k, N: 1./N)
def __call__(self, key):
if key in self: return self[key]/self.N
else: return self.missingfn(key, self.N)
def datafile(name, sep='\t'):
"Read key,value pairs from file."
for line in file(name):
yield line.split(sep)
def avoid_long_words(key, N):
"Estimate the probability of an unknown word."
return 10./(N * 10**len(key))
N = 1024908267229 ## Number of tokens
Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
#### segment2: second version, with bigram counts, (p. 226-227)
def cPw(word, prev):
"Conditional probability of word, given previous word."
try:
return P2w[prev + ' ' + word]/float(Pw[prev])
except KeyError:
return Pw(word)
P2w = Pdist(datafile('count_2w.txt'), N)
@memo
def segment2(text, prev='<S>'):
"Return (log P(words), words), where words is the best segmentation."
if not text: return 0.0, []
candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
for first,rem in splits(text)]
return max(candidates)
def combine(Pfirst, first, (Prem, rem)):
"Combine first and rem results into one (probability, words) pair."
return Pfirst+Prem, [first]+rem
['able', 'air', 'apple', 'boa', 'boar', 'board', 'chair', 'cup', 'cupboard', 'ha', 'hair', 'lea', 'leap', 'oar', 'tab', 'table', 'up']
– Mervinmerwinparsing error
to take decisions can be dangerous. What if there is really an error in your input string? Or there's a word that you don't have? – Nickles