Algorithm to compare similarity of English sentences

Asked 15/7, 2011 at 8:37 Answered 27/4, 2013 at 9:7

I have a collection of sentences, and I need to analyse them to see how similar they are.

Are there any established algorithms to do this?

I care about:

containing the same words (ignoring inflexions for now)
containing the same words in a similar order

I've used Levenshtein distance and n-grams for spelling correction before, although I'm not sure whether they carry over to comparing whole sentences.

Naively, I don't care about spelling differences (typos can be treated as different words), although it would be nice to account for them.

Perhaps some hybrid of splitting the sentence at spaces and applying one of the above (or other) algorithms to the resulting words would be a starting point.
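To make that hybrid concrete, here is a rough sketch, not an established method: split each sentence into words, then score shared words with Jaccard overlap and word order with Levenshtein distance computed over words instead of characters. The function names and the normalisation by the longer sentence's length are my own assumptions.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def jaccard(a, b):
    """Share of distinct words common to both token lists (order ignored)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def word_levenshtein(a, b):
    """Levenshtein distance where the edit unit is a whole word."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def order_similarity(a, b):
    """Word-level edit distance scaled to [0, 1]; 1.0 means identical order."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_levenshtein(a, b) / longest


s1 = tokenize("The cat sat on the mat")
s2 = tokenize("The mat sat on the cat")
print(jaccard(s1, s2))           # same words, so the overlap is 1.0
print(order_similarity(s1, s2))  # lower, because two words swapped places
```

The two scores answer the two criteria separately: the first ignores order entirely, the second penalises words that moved or changed. Typos still count as different words here.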

What options are available? Any advice?

Thanks!

Relive asked 15/7, 2011 at 8:37 Comment(0)

This paper compares several sentence similarity measures. Perhaps you can use one of them as is, or modify it for your needs.

Otherwise, "sentence similarity measure" is a good search term to google.

Soucy answered 15/7, 2011 at 9:45 Comment(1)

@Andrew Actually, I just googled it because the question caught my interest :) I'm not familiar with the topic. I understand that your problem may lie in the technical details, which that paper mostly ignores (making it resistant to spelling mistakes, handling inflexions, etc.). It helps that English words are barely inflected. – Soucy 15/7, 2011 at 10:36

To ignore inflections you should look into stemming algorithms: http://en.wikipedia.org/wiki/Porter_stemmer

They reduce words to a common stem (which is not always a dictionary word), so that inflected forms of the same word compare as equal.
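As a toy illustration of the idea only: the sketch below strips a few common English suffixes. It is not the Porter algorithm, which applies a much longer, ordered set of rules; the suffix list and the minimum stem length of 3 are assumptions picked for this example.

```python
# Hypothetical, deliberately tiny suffix list; the real Porter stemmer uses
# several rule phases with conditions on the remaining stem.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stemmed_tokens(sentence):
    """Split on whitespace and stem every word."""
    return [crude_stem(w) for w in sentence.split()]


print(stemmed_tokens("She walks dogs"))
print(stemmed_tokens("She walked dog"))
```

After stemming, both sentences yield the same token list, so any word-based comparison treats them as identical. A rule set this small also over-strips (for example "this" becomes "thi"), which is why a tested stemmer implementation is the better choice in practice.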

Hafer answered 27/4, 2013 at 9:7 Comment(0)
