Stemming algorithm that produces real words

I need to take a paragraph of text and extract a list of "tags" from it. Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of Porter Stemmer algorithm (I'm writing in PHP by the way):

http://tartarus.org/~martin/PorterStemmer/php.txt

This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".

I've tried "Snowball" (suggested within another Stack Overflow thread).

http://snowball.tartarus.org/demo.php

For my example (community / communities), Snowball stems to "communiti".

Question

Are there any other stemming algorithms that will do this? Has anyone else solved this problem?

My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.
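
A rough sketch of that idea in PHP, grouping by stem and keeping the shortest surface form (this assumes the linked Porter implementation exposes something like PorterStemmer::Stem() - adjust the call to whatever your copy actually provides):

<?php
// Group words by their stem and keep the shortest surface form per group.
// PorterStemmer::Stem() is assumed from the implementation linked above.
require_once 'PorterStemmer.php';

function pickDisplayTags(array $words): array {
    $byStem = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        $stem = PorterStemmer::Stem($word);
        // Keep the shortest surface form seen so far for this stem.
        if (!isset($byStem[$stem]) || strlen($word) < strlen($byStem[$stem])) {
            $byStem[$stem] = $word;
        }
    }
    return array_values($byStem);
}

print_r(pickDisplayTags(['Community', 'Communities']));
// Both share one stem, so the shorter form "community" is what gets displayed.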

Performative answered 10/10, 2008 at 10:43 Comment(2)
The easiest way I think would be to just store both values, the full word and the stem.Gaffney
Stem the word, normalize the word, get the metaphone of the word, and finally also keep a copy of the full word. Using all four of these, compare them to other words and you should find your match.Erasmo
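
A rough sketch of that multi-key comparison, again assuming PorterStemmer::Stem() from the implementation linked in the question; metaphone() is PHP's built-in phonetic key:

<?php
// Keep the stem, a normalized form, the metaphone and the original word,
// and treat two words as the same tag if any of the derived keys match.
require_once 'PorterStemmer.php';

function wordKeys(string $word): array {
    $normalized = strtolower(trim($word));
    return [
        'word'       => $word,
        'normalized' => $normalized,
        'stem'       => PorterStemmer::Stem($normalized),
        'metaphone'  => metaphone($normalized),
    ];
}

function sameTag(array $a, array $b): bool {
    foreach (['normalized', 'stem', 'metaphone'] as $key) {
        if ($a[$key] === $b[$key]) {
            return true;
        }
    }
    return false;
}

var_dump(sameTag(wordKeys('Community'), wordKeys('communities')));  // likely true via the stem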

The core issue here is that stemming algorithms operate purely on the language's spelling rules, with no real understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can see two basic ways to do this:

  1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
  2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)

Personally, I would use a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, then assume that the most common word is the one that should be used. (E.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general, and building it from the stemmer's input will give results customized to your texts. The primary drawback is the space required, which is generally not an issue these days.
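
A minimal sketch of that dynamic dictionary, assuming PorterStemmer::Stem() from the question's linked implementation (array_key_first() needs PHP 7.3+):

<?php
// Record every word together with its stem and keep frequency counts, then use
// the most frequent surface form per stem as the display word.
require_once 'PorterStemmer.php';

$counts = [];   // stem => [word => frequency]

function observe(string $word, array &$counts): void {
    $word = strtolower($word);
    $stem = PorterStemmer::Stem($word);
    $counts[$stem][$word] = ($counts[$stem][$word] ?? 0) + 1;
}

function displayForm(string $word, array $counts): string {
    $word = strtolower($word);
    $stem = PorterStemmer::Stem($word);
    if (empty($counts[$stem])) {
        return $word;                       // nothing recorded yet, fall back to the word itself
    }
    arsort($counts[$stem]);                 // most frequent surface form first
    return array_key_first($counts[$stem]);
}

foreach (['communities', 'communities', 'community'] as $w) {
    observe($w, $counts);
}
echo displayForm('Community', $counts);     // "communities" - the more common form wins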

Cachexia answered 10/10, 2008 at 11:22 Comment(4)
This seems like a good idea. I think having an automated system will be beneficial, so working on the "most common" word being the one to use seems a simple solution - and easy to implement. Many thanks.Performative
This approach is a good one, and I've used it in the past. One brief note, though: stemming algorithms don't (usually) operate on a phonetic basis, they're written based on the grammar of the language, not the sound of the words. For details, I recommend reading snowball.tartarus.org/texts/introduction.html , particularly section 2 - "Some ideas underlying stemming"Ebby
Ah, true. I was sloppy in my use of "phonetic" and have edited my answer to state that it's based on spelling rules.Cachexia
Great idea! Is there a database that contains the mapping between the stemmed versions of words and the original or stable version? A link to such a database would be of great help!Rhoden

If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge of endings like -ies, -ed, etc., and of exceptional word forms like "written". It maps an input word form to its lemma, which is guaranteed to be a "real" word.

There are many lemmatizers for English, though I've only used morpha. Morpha is just a big lex file which you can compile into an executable. Usage example:

$ cat test.txt 
Community
Communities
$ cat test.txt | ./morpha -uc
Community
Community

You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
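
Since the question is about PHP, a hypothetical way to wrap the morpha binary from PHP (assuming ./morpha has been compiled and accepts the -uc flags shown above) could look like this:

<?php
// Pipe words through the morpha binary, mirroring the shell example above.
// Assumes ./morpha exists and behaves as in the demo; this is untested glue code.
function lemmatize(array $words): array {
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']];
    $proc = proc_open('./morpha -uc', $spec, $pipes);
    if (!is_resource($proc)) {
        return $words;                      // fall back to the originals if morpha can't start
    }
    fwrite($pipes[0], implode("\n", $words) . "\n");
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return explode("\n", trim($output));
}

print_r(lemmatize(['Community', 'Communities']));   // both should come back as "Community"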

Indigenous answered 5/3, 2009 at 15:26 Comment(7)
I am a beginner, so please don't mind if I am wrong. I just want to know if it works for cases like converting "collection" to "collect" (I hope you get what I am trying to convey - converting the words to a more general/shorter form)Rhoden
@Rhoden I'm sure Morpha does not convert 'collection' into 'collect'. Btw, how would you formally define 'more general/shorter form'?Indigenous
My need is to convert words with similar meaning into a single word, so that the similarity between words can be established. For example, "collection" and "collect" have to be converted to "collect". Using WordNet I can find the hypernym, but if I use stemmers they produce a word which has no meaning, and hence I can't find a hypernym for that word.Rhoden
UW has uploaded the morpha stemmer to Maven Central if you plan to use it from a Java application.Estellaestelle
Is there any morpha library for python?Lemus
NLTK for python :#772418Jeanejeanelle
This is the right answer. The accepted answer is not correct, and anyone coming to this page should follow this advice instead.Ytterbium

I don't know if this is too late, but there is only one PHP stemming script I know of that produces real words: http://phpmorphy.sourceforge.net/ – it took me ages to find it. Other stemmers have to be compiled, and even then they only work according to the Porter algorithm, which produces stems, not lemmas (i.e. community = communiti). phpMorphy works perfectly well, it's easy to install and initialize, and it has English, Russian, German, Ukrainian and Estonian dictionaries. It also comes with a script that you can use to compile other dictionaries. The documentation is in Russian, but put it through Google Translate and it should be easy to follow.
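
A rough sketch of how that might be used (the paths, language code, class, method and constant names below are recalled from phpMorphy's bundled example and should be checked against its own documentation):

<?php
// Look up the base form (lemma) of a word with phpMorphy. The API calls here
// are assumptions from memory of the library - verify them against the docs.
require_once 'phpmorphy/src/common.php';

$morphy = new phpMorphy(
    'phpmorphy/dicts',                      // directory containing the compiled dictionaries
    'en_EN',                                // dictionary language code
    ['storage' => PHPMORPHY_STORAGE_FILE]
);

// phpMorphy works on uppercase input and may return several candidate lemmas.
$lemmas = $morphy->getBaseForm(mb_strtoupper('communities'));
print_r($lemmas);                           // expected to include COMMUNITY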

Clearcole answered 6/6, 2012 at 11:52 Comment(0)
