Using WordNet to determine semantic similarity between two texts?

1

5

How can you determine the semantic similarity between two texts in Python using WordNet?

The obvious preprocessing would be removing stop words and stemming, but then what?

The only way I can think of is to calculate the WordNet path distance between each pair of words across the two texts, which is the standard approach for unigrams. But these are large (400-word) natural-language documents, and their words follow no particular order or structure beyond what English grammar imposes. So which words would you compare between the texts, and how would you do this in Python?

Episcopalian answered 13/7, 2012 at 2:35 Comment(2)
I would iterate over all the words, compare each one to the word at the same index in the other text using Levenshtein distance, and try to minimize that distance. - Mongo
The two texts are not organised by a similar index. One could be a Wikipedia page on dogs and the other a page on cats, for instance. - Episcopalian
11

One thing that you can do is:

  1. Remove the stop words.
  2. Find the words whose sets of synonyms and antonyms overlap most with those of other words in the same document. Let's call these "the important words".
  3. Compare the sets of important words from the two documents. The more the sets overlap, the more semantically similar your documents are.
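
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `TOY_LEXICON` and `STOP_WORDS` are tiny hypothetical stand-ins for WordNet lookups and a real stop-word list (with NLTK you would probably build `related_terms` from `wordnet.synsets(word)` and each lemma's `antonyms()`), the `top_n` cutoff is arbitrary, and Jaccard overlap is just one possible way to measure how close the two sets are.

```python
import re

# Tiny illustrative stop-word list (assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "it", "that"}

# Hypothetical toy lexicon standing in for WordNet synonym/antonym lookups.
TOY_LEXICON = {
    "dog": {"canine", "hound", "pet"},
    "hound": {"dog", "canine"},
    "puppy": {"dog", "pet"},
    "cat": {"feline", "pet", "kitten"},
    "kitten": {"cat", "feline", "pet"},
    "loyal": {"faithful", "disloyal"},
    "faithful": {"loyal", "disloyal"},
}


def related_terms(word, lexicon=TOY_LEXICON):
    """Return the word itself plus its synonyms and antonyms."""
    return {word} | lexicon.get(word, set())


def content_words(text):
    """Step 1: lowercase, tokenize on letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def important_words(text, top_n=3, lexicon=TOY_LEXICON):
    """Step 2: keep the words whose related-term sets overlap most with
    those of the other words in the same document."""
    words = sorted(set(content_words(text)))
    scores = {}
    for w in words:
        rel = related_terms(w, lexicon)
        scores[w] = sum(
            len(rel & related_terms(other, lexicon)) for other in words if other != w
        )
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(words, key=lambda w: (-scores[w], w))
    return {w for w in ranked[:top_n] if scores[w] > 0}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0  # no important words on either side: treat as no evidence
    return len(a & b) / len(a | b)


def document_similarity(text_a, text_b, top_n=3):
    """Step 3: compare the two sets of important words."""
    return jaccard(important_words(text_a, top_n), important_words(text_b, top_n))


if __name__ == "__main__":
    doc_a = "The dog is a loyal hound and a faithful pet"
    doc_b = "The cat is a kitten"
    print(important_words(doc_a))
    print(document_similarity(doc_a, doc_b))
```

With real 400-word documents you would likely want a larger `top_n`, and exact set overlap may be too strict; matching important words through their related terms instead of by identity is one possible refinement.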

There is another way: compute sentence trees from the sentences in each document, then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind that it was written a long time ago for a class, so it is extremely hacky, to say the least).
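
Separately from that class code, here is a rough sketch of what the comparison step might look like. It is an assumption on my part, not something the class code does: each parse tree is a nested tuple `(label, child, ...)` with plain strings as leaves, and the two forests are compared by the overlap of their production counts (a weighted Jaccard score). The names `productions`, `forest_profile`, and `forest_similarity` are made up for illustration; if you parse with NLTK, `nltk.Tree` objects already provide a `productions()` method you could count instead.

```python
from collections import Counter


def productions(tree):
    """Yield (parent_label, child_labels) for every internal node of a tree
    given as a nested tuple: (label, child, child, ...), leaves are strings."""
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, child_labels)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def forest_profile(forest):
    """Count how often each production occurs across all trees in a forest."""
    counts = Counter()
    for tree in forest:
        counts.update(productions(tree))
    return counts


def forest_similarity(forest_a, forest_b):
    """Weighted Jaccard overlap of production counts, between 0.0 and 1.0."""
    a, b = forest_profile(forest_a), forest_profile(forest_b)
    shared = sum((a & b).values())  # Counter & keeps the minimum counts
    total = sum((a | b).values())   # Counter | keeps the maximum counts
    return shared / total if total else 0.0


if __name__ == "__main__":
    t1 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBZ", "barks")))
    t2 = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBZ", "sleeps")))
    print(forest_similarity([t1], [t2]))
```

Because lexical productions such as `NN -> dog` are counted too, this mixes structural and word-level overlap; dropping the leaf-level productions would compare syntax alone.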

Hope this helps

Fecundate answered 13/7, 2012 at 3:26 Comment(3)
+1 Good ideas. I'm looking at your code, but I don't see how to compare sentence trees. Presumably it should only take around 15 lines of code with NLTK in Python, no? - Episcopalian
I never got to that point, but it should be a straight shot from the output of my code. - Fecundate
It depends on how you want to compare sentence trees, but it shouldn't take too much code. - Fecundate
