I have a collection of sentences, and I need to analyse them to see how similar they are.
Are there any established algorithms to do this?
I care about:
- containing the same words (ignoring inflexions for now)
- containing the same words in a similar order
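To make the two criteria concrete, here is a minimal sketch that assumes plain whitespace tokenisation and Jaccard overlap: word sets for "same words", and adjacent word pairs as a rough proxy for "similar order". The function names are made up for illustration and this is only one possible reading of the criteria.

```python
def words(sentence):
    # Assumption: lowercase and split on whitespace; no stemming, no punctuation handling.
    return sentence.lower().split()

def jaccard(a, b):
    # Size of the intersection divided by size of the union of two collections.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty collections are treated as identical
    return len(a & b) / len(a | b)

def word_overlap(s1, s2):
    # Criterion 1: same words, order ignored.
    return jaccard(words(s1), words(s2))

def bigram_overlap(s1, s2):
    # Criterion 2 (rough proxy): shared pairs of adjacent words.
    w1, w2 = words(s1), words(s2)
    return jaccard(zip(w1, w1[1:]), zip(w2, w2[1:]))
```

With these definitions, "the cat sat" and "sat the cat" get a full score for word overlap but a lower score for bigram overlap, because only the pair ("the", "cat") appears in both.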
I've used Levenshtein distance and n-grams for spelling before, although I'm not sure whether these translate to my purposes.
Naively, I don't care about spelling differences, so typos can be treated as different words, although it would be nice to account for them.
Perhaps some hybrid of splitting the sentence at spaces and applying one of the above (or another) algorithm would be a starting point.
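A rough sketch of that hybrid idea, assuming whitespace splitting and Levenshtein distance applied to whole words instead of characters. The names `word_edit_distance` and `word_similarity` are hypothetical, and a typo simply counts as a different word here.

```python
def word_edit_distance(s1, s2):
    # Levenshtein distance where the unit is a word, not a character.
    a, b = s1.split(), s2.split()
    prev = list(range(len(b) + 1))  # distance from an empty prefix of a
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1  # exact match only; typos are different words
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[-1]

def word_similarity(s1, s2):
    # Normalise to [0, 1] by the longer sentence's word count (an assumption).
    longest = max(len(s1.split()), len(s2.split()))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(s1, s2) / longest
```

For example, "the quick brown fox" and "the quick red fox" differ by one substituted word, so the distance is 1 and the normalised similarity is 0.75.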
What options are available? Any advice?