What NLP tools to use to match phrases having similar meaning or semantics

I am working on a project that requires me to match a phrase or keyword with a set of similar keywords. I need to perform semantic analysis to do this.

An example:

Relevant QT:
cheap health insurance
affordable health insurance
low cost medical insurance
health plan for less
inexpensive health coverage

Common Meaning:
low cost health insurance

Here the phrase under the Common Meaning column should match the phrases under the Relevant QT column. I looked at a bunch of tools and techniques for this. S-Match seemed very promising, but I have to work in Python, not in Java. Latent Semantic Analysis also looks good, but I think it's more suited to classifying documents against a keyword than to matching keywords against each other. I am somewhat familiar with NLTK. Could someone provide some insight on what direction I should proceed in and which tools I should use?

Petrick answered 3/8, 2012 at 15:9 Comment(2)
What's the scope of your project? If you're dealing with a few core keywords or senses, it may be easy enough to specify word equivalence classes by hand (e.g. a word list of phrases meaning "low cost health insurance"). – Pollak
I have to extract semantically similar words, like "low cost health insurance", from a group of around 200,000 words. I am thinking I will have to apply clustering after running an initial algorithm on these words to generate a set of center words, each of which matches the semantically similar words in its cluster. The whole procedure is unsupervised. – Petrick

If you have a large corpus available in which these words occur, you can train a model to represent each word as a vector. For instance, you can use deep learning via word2vec's skip-gram and CBOW models, which are implemented in the gensim software package.
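
A minimal training sketch with gensim (parameter names follow gensim 4.x; the toy sentences below are placeholders for your real corpus):

from gensim.models import Word2Vec

# Each sentence is a list of tokens; in practice, feed in your own phrases.
sentences = [
    ['cheap', 'health', 'insurance'],
    ['affordable', 'health', 'insurance'],
    ['low', 'cost', 'medical', 'insurance'],
    ['inexpensive', 'health', 'coverage'],
]

# sg=1 selects the skip-gram model; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)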

In the word2vec model, each word is represented by a vector. You can then measure the semantic similarity between two words by taking the cosine of the vectors representing them. Semantically similar words should have a high cosine similarity, for instance:

model.similarity('cheap','inexpensive') = 0.8

(The value is made up, just for illustration.)
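
With gensim 4.x the call is model.wv.similarity('cheap', 'inexpensive'). The cosine itself is easy to compute from the raw vectors, continuing the sketch above:

import numpy as np

def cosine(u, v):
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Matches model.wv.similarity('cheap', 'inexpensive') up to rounding.
print(cosine(model.wv['cheap'], model.wv['inexpensive']))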

Also, from my experiments, summing the vectors of a relatively small number of words (up to 3 or 4) preserves the semantics, for instance:

vector1 = model['cheap']+model['health']+model['insurance']
vector2 = model['low']+model['cost']+model['medical']+model['insurance']

similarity(vector1,vector2) = 0.7

(Again, just for illustration.)
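
Concretely, with the toy model and the cosine helper from above (word vectors live on model.wv in gensim 4.x):

# Build a phrase vector by summing its word vectors.
vector1 = model.wv['cheap'] + model.wv['health'] + model.wv['insurance']
vector2 = (model.wv['low'] + model.wv['cost']
           + model.wv['medical'] + model.wv['insurance'])

# Cosine similarity between the two phrase vectors.
print(cosine(vector1, vector2))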

You can then use this semantic similarity between words as the distance measure for generating your clusters.
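
One way to do that, sketched with scikit-learn's k-means (the cluster count is a placeholder; choose it for your data):

import numpy as np
from sklearn.cluster import KMeans

words = list(model.wv.index_to_key)               # vocabulary (gensim 4.x)
vectors = np.array([model.wv[w] for w in words])

# Normalizing the rows makes Euclidean k-means behave like clustering
# by cosine similarity.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

labels = KMeans(n_clusters=5, n_init=10).fit_predict(vectors)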

Mockery answered 24/12, 2014 at 21:54 Comment(0)

When Latent Semantic Analysis refers to a "document", it basically means any set of words that is longer than one word. You can use it to compute the similarity between two documents, between two words, or between a word and a document. So you could certainly use it for your chosen application.
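
A small sketch of this with gensim's LSI implementation, treating each phrase as a tiny document (num_topics=2 is only sensible for this toy corpus):

from gensim import corpora, models, similarities

phrases = ['cheap health insurance',
           'affordable health insurance',
           'low cost medical insurance']
texts = [p.split() for p in phrases]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Latent semantic indexing (LSA) over the bag-of-words corpus.
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow])

# Cosine similarity of a new phrase against every phrase in the corpus.
query = lsi[dictionary.doc2bow('low cost health insurance'.split())]
print(list(index[query]))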

Other semantic-similarity algorithms may be useful as well.

Northey answered 27/12, 2014 at 18:54 Comment(0)

I'd start by taking a look at WordNet. It will give you real synonyms and other word relations for hundreds of thousands of terms. Since you tagged nltk: NLTK provides bindings for WordNet, and you can use it as the basis for domain-specific solutions.
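
For example, a minimal synonym lookup through the NLTK bindings (run nltk.download('wordnet') once beforehand):

from nltk.corpus import wordnet as wn

# Collect every lemma from every synset of 'cheap'.
synonyms = {lemma.name() for synset in wn.synsets('cheap')
            for lemma in synset.lemmas()}
print(synonyms)   # includes, among others, 'inexpensive'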

Still within NLTK, check out the discussion of the method similar() in the introduction to the NLTK book, and the class nltk.text.ContextIndex that it's based on. (All pretty simple still, but it might be all you really need.)
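
A quick sketch of similar(), using NLTK's Brown corpus purely for illustration (run nltk.download('brown') once beforehand):

import nltk
from nltk.corpus import brown

text = nltk.Text(brown.words())
# Prints words that occur in contexts similar to those of 'cheap'.
text.similar('cheap')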

Spectrophotometer answered 6/1, 2013 at 14:1 Comment(2)
The NLTK links give a 404. – Yearbook
Thanks for the heads-up, @rwst! Somewhere in the last ten years the NLTK website made the "www" part of the domain obligatory. Fixed now. – Spectrophotometer
