How to detect that two sentences are similar?
R

3

28

I want to compute how similar two arbitrary sentences are to each other. For example:

  1. A mathematician found a solution to the problem.
  2. The problem was solved by a young mathematician.

I can use a tagger, a stemmer, and a parser, but I don't know how to detect that these sentences are similar.

Rang answered 21/4, 2013 at 16:4 Comment(2)
Have you considered asking this sort of thing at Linguistics.SE? I find that NLP questions tend to get a better treatment there.Asmodeus
@Asmodeus but it's a programming / algorithmic question!Gowen
D
33

These two sentences are not just similar, they are almost paraphrases, i.e., two alternative ways of expressing the same meaning. It is also a very simple case of paraphrase, in which both utterances use the same words, the only difference being that one is in the active voice while the other is in the passive. (The two sentences are not exactly paraphrases because in the second sentence the mathematician is "young". This additional information makes the semantic relation between the two sentences asymmetric. In these cases, you would say that the second utterance "entails" the first one, or in other words that the first can be inferred from the second.)

From the example it is not possible to understand whether you are actually interested in paraphrase detection, textual entailment or in sentence similarity in general, which is an even broader and fuzzier problem. For example, is "people eat food" more similar to "people eat bread" or to "men eat food"?

Both paraphrase detection and text similarity are complex, open research problems in Natural Language Processing, with a large and active community of researchers working on them. It is not clear how deep your interest in this topic goes, but consider that even though many brilliant researchers have spent, and still spend, their whole careers trying to crack it, we are still very far from finding sound solutions that just work in general.

Unless you are interested in a very superficial solution that would only work in specific cases and that would not capture syntactic alternation (as in this case), I would suggest that you look into the problem of text similarity in more depth. A good starting point would be the book "Foundations of Statistical Natural Language Processing", which provides a very well organised presentation of most statistical natural language processing topics. Once you have clarified your requirements (e.g., under what conditions is your method supposed to work? what levels of precision/recall are you after? what kinds of phenomena can you safely ignore, and which ones do you need to account for?) you can start looking into specific approaches by diving into recent research work. Here, a good place to start would be the online archives of the Association for Computational Linguistics (ACL), which is the publisher of most research results in the field.

Just to give you something practical to work with, a very rough baseline for sentence similarity would be the cosine similarity between two binary vectors representing the sentences as bags of words. A bag of words is a very simplified representation of text, commonly used for information retrieval, in which you completely disregard syntax and only represent a sentence as a vector whose size is the size of the vocabulary (i.e., the number of words in the language) and whose component "i" is valued "1" if the word at position "i" in the vocabulary appears in the sentence, and "0" otherwise.
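To make that baseline concrete, here is a minimal sketch in plain Python. The tokenization is deliberately naive (lowercase, strip the trailing period, split on whitespace); a real system would normalize punctuation properly and use a fixed vocabulary. For binary vectors, the cosine reduces to the size of the word-set intersection divided by the product of the square roots of the set sizes, so no explicit vectors are needed:

```python
import math

def bag_of_words(sentence):
    """Return the set of lowercased tokens in a sentence (naive tokenization)."""
    return set(sentence.lower().rstrip(".").split())

def cosine_similarity(s1, s2):
    """Cosine similarity between the binary bag-of-words vectors of two sentences."""
    b1, b2 = bag_of_words(s1), bag_of_words(s2)
    if not b1 or not b2:
        return 0.0
    # For binary vectors, the dot product is the size of the intersection,
    # and each vector's norm is the square root of its set's size.
    return len(b1 & b2) / (math.sqrt(len(b1)) * math.sqrt(len(b2)))

a = "A mathematician found a solution to the problem."
b = "The problem was solved by a young mathematician."
print(f"{cosine_similarity(a, b):.2f}")  # prints 0.53
```

Note that the two example sentences score only about 0.53 even though one nearly paraphrases the other, which illustrates how crude this baseline is: it rewards shared vocabulary, not shared meaning.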

Dormitory answered 21/4, 2013 at 16:42 Comment(4)
but cosine similarity will treat these sentences as the same: "I drink milk but I don't drink alcoholic drinks" and "I don't drink milk but I drink alcoholic drinks"!Gowen
@RavinderPayal, that is exactly what natural language understanding is meant to solve.Bergamo
@amit_kumar yeah, and this specific problem can be solved by mapping verbs to nouns, and tokenization.Gowen
I do not agree. That is just NLP and POS tagging, which deal only with syntax. NLU is about semantics and pragmatics. NLU engines are state-of-the-art work, and we still do not have a solution that generalizes across all domains.Bergamo
V
7

A more modern approach (in 2021) is to use a machine learning NLP model. There are pre-trained models for exactly this task, many of them derived from BERT, so you don't have to train your own model (though you could if you wanted to). Here is a code example that uses the excellent Huggingface Transformers library with PyTorch. It's based on this example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# A BERT model fine-tuned for paraphrase detection on the
# Microsoft Research Paraphrase Corpus (MRPC)
model_name = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

sequence_0 = "A mathematician found a solution to the problem."
sequence_1 = "The problem was solved by a young mathematician."

# Encode the sentence pair and classify it; no gradients are
# needed since we are only running inference
tokens = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
with torch.no_grad():
    classification_logits = model(**tokens)[0]

# Softmax turns the logits into probabilities over the two classes
results = torch.softmax(classification_logits, dim=1).tolist()[0]

classes = ["not paraphrase", "is paraphrase"]
for i in range(len(classes)):
    print(f"{classes[i]}: {round(results[i] * 100)}%")
Vasodilator answered 18/5, 2021 at 11:34 Comment(2)
For two identical sentences, it still reports a difference. For example, with sequence_0 = "Take one tablet by mouth" and sequence_1 = "Take one tablet by mouth" it gives 94% similar. Why?Matsumoto
Also, this approach has the same shortcoming that Ravinder points out above, namely that it doesn't do a great job with structural intent: "I like pie more than cake" and "I like cake more than pie" are said to paraphrase each other in spite of having opposite meanings.Cris
G
1

In some cases, it is possible to automatically transform sentences into discourse representation structures (DRS) that represent their meanings. If two sentences produce the same discourse representation structure, then it is likely that they have similar meanings.

Godfree answered 25/12, 2016 at 4:36 Comment(0)