How to compare two strings by meaning?

3

6

I want the users of my Node.js application to write down ideas, which are then stored in a database. So far so good, but I don't want redundant entries in that table, so I decided to check for similarity using this package: https://www.npmjs.com/package/string-similarity-js

Do you know a way to compare two strings by meaning? For example, I would like a high similarity score for "using public transport" vs "driving by train", a pair that scores very poorly with the package above.

Fides answered 19/12, 2019 at 16:54 Comment(3)
What you've just described is a PhD level problem involving AI. - Schaffer
Turns out that natural language understanding is and has been one of the most difficult problems in computing. - Broddie
This is a good question - welcome to the community. Sorry about all the lonely people during Christmas spending their time downvoting questions. This is not a PhD level issue anymore - this is a solved problem for all practical purposes. Use the answer by JxCode along with a lighter model like Google Universal Sentence embeddings to group similar text using cosine similarity. - Luggage
9

To compare two strings by meaning, the strings first need to be converted to tensors (embeddings), and then the distance or similarity between those tensors is evaluated. Many algorithms can be used to convert strings to tensors, and the right choice depends on the domain of interest. The Universal Sentence Encoder is a general-purpose sentence encoder that projects all inputs into one shared vector space. Cosine similarity can then be used to see how close two words are in meaning.
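
For reference, here is a minimal sketch of the cosine similarity step on plain arrays, with no TensorFlow involved. The function name `cosineSimilarity` and the three vectors are made up for illustration; they are not real embeddings from any model.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// The result lies in [-1, 1]; values near 1 mean the vectors point the same way.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors (assumptions, not model output).
const v1 = [1, 2, 3];
const v2 = [2, 4, 6];  // same direction as v1
const v3 = [3, 0, -1]; // orthogonal to v1: 1*3 + 2*0 + 3*(-1) = 0

console.log(cosineSimilarity(v1, v2)); // very close to 1
console.log(cosineSimilarity(v1, v3)); // 0
```

Cosine distance is simply 1 minus this value, so a smaller distance corresponds to a higher similarity.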

Example

Although king and kind are close in Hamming distance (they differ by only one character), they are very different in meaning. Queen and king, on the other hand, look unrelated as strings (almost all of their characters differ) but are close in meaning. Therefore the distance (in meaning) between king and queen should be smaller than between king and kind, as demonstrated in the following snippet.

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/universal-sentence-encoder"></script>

<script>

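(async () => {
  // Load the Universal Sentence Encoder (exposed as the global `use` by the script tag above)
  const model = await use.load();
  // embed() returns one row per input; unstack() splits the result into one tensor per string
  const embeddings = (await model.embed(['queen', 'king', 'kind'])).unstack()
  // Cosine distance: a smaller value means closer in meaning
  tf.losses.cosineDistance(embeddings[0], embeddings[1], 0).print() // queen vs king: 0.39812755584716797
  tf.losses.cosineDistance(embeddings[1], embeddings[2], 0).print() // king vs kind: 0.5585797429084778
})()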
</script>
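
The character-level side of this comparison can be checked with a small edit-distance function. This is only a sketch: it uses Levenshtein distance rather than Hamming distance, because Hamming distance is only defined for strings of equal length and queen and king differ in length. The function name `levenshtein` is my own.

```javascript
// Levenshtein (edit) distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 characters of a and the first j of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i characters into an empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // delete a[i-1]
        curr[j - 1] + 1,   // insert b[j-1]
        prev[j - 1] + cost // substitute, or keep if the characters match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("king", "kind"));  // 1
console.log(levenshtein("king", "queen")); // 5
```

At the character level king is much closer to kind than to queen, which is the opposite of the ordering the embeddings give.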
Marianmariana answered 22/1, 2020 at 15:56 Comment(0)
5

Comparing the meaning of two strings is still an area of ongoing research. If you really want to solve the problem (or get really good performance from your language model), you should consider getting a PhD.

For an out-of-the-box solution at the time of writing: I found this GitHub repo, which implements Google's BERT model and uses it to get the embeddings of two sentences. In theory, two sentences share the same meaning if their embeddings are similar.

https://github.com/UKPLab/sentence-transformers

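# The following is simplified from their README.md
import scipy.spatial.distance
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Two single-sentence lists to compare
S1 = ['A man is eating a food.']
S2 = ['A man is eating pasta.']

# encode() takes a list of sentences and returns one embedding per sentence
s1_embedding = embedder.encode(S1)
s2_embedding = embedder.encode(S2)

# cdist expects two 2-D collections of vectors, so the encoded lists are passed directly;
# [0] is the row of cosine distances from the S1 sentence to each sentence in S2
dist = scipy.spatial.distance.cdist(s1_embedding, s2_embedding, "cosine")[0]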
Example output (copied from their README.md)

Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating a piece of bread. (Score: 0.8518)
A man is eating a food. (Score: 0.8020)
A monkey is playing drums. (Score: 0.4167)
A man is riding a horse. (Score: 0.2621)
A man is riding a white horse on an enclosed ground. (Score: 0.2379)
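
To tie this back to the Node.js question: once each sentence has an embedding, a ranking like the one above is just a sort by cosine similarity. The sketch below shows that step in JavaScript. The helper names `cosineSimilarity` and `rankBySimilarity` are my own, and the 3-dimensional vectors are invented toy values, not real BERT output, so the scores will not match the README numbers.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every corpus entry against the query embedding and
// return the topK entries, most similar first.
function rankBySimilarity(queryEmbedding, corpus, topK = 5) {
  return corpus
    .map(({ text, embedding }) => ({
      text,
      score: cosineSimilarity(queryEmbedding, embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy corpus: the embeddings are made up for illustration only.
const corpus = [
  { text: "A monkey is playing drums.", embedding: [0.1, 0.9, 0.2] },
  { text: "A man is eating a food.", embedding: [0.9, 0.1, 0.0] },
  { text: "A man is riding a horse.", embedding: [0.0, 0.2, 0.9] },
];
const queryEmbedding = [1.0, 0.2, 0.0]; // stands in for "A man is eating pasta."

for (const { text, score } of rankBySimilarity(queryEmbedding, corpus)) {
  console.log(`${text} (Score: ${score.toFixed(4)})`);
}
```

In a real application the embeddings would come from a model and would normally be computed once per stored sentence and cached, rather than recomputed for every query.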
Orelle answered 19/12, 2019 at 17:9 Comment(1)
Excellent advice "... you should consider get a PhD ..." - Doings
0

@edkeveked's answer works, but it uses vanilla JS in the browser.

I had to do the same thing in Node.js, since I was working on the backend.

It worked well for me, so let me share how I did it.

First, install the following packages:

npm install @tensorflow/tfjs @tensorflow-models/universal-sentence-encoder

Import the packages

const tf = require("@tensorflow/tfjs")
const sentenceEncoder = require("@tensorflow-models/universal-sentence-encoder")

Then load the model. In my case, I stored the promise outside the function and awaited it inside the function call. This way, loading starts when the script loads, and once the model has finished loading, awaiting the promise inside the function resolves immediately.

const modelPromise = sentenceEncoder.load()

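Create the embeddings for both strings and then calculate the cosine similarity:

const calculateSemanticSimilarity = async ({ text_a, text_b }) => {
  const model = await modelPromise

  // One embedding tensor per input string
  const embeddings = (await model.embed([text_a, text_b])).unstack()

  // tf.losses.cosineDistance returns a distance (lower means more similar),
  // so convert it to a similarity (higher means more similar)
  const distance = tf.losses.cosineDistance(embeddings[0], embeddings[1], 0).dataSync()[0]
  return 1 - distance
}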

module.exports = { calculateSemanticSimilarity }
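
To use this for the original problem (rejecting redundant ideas), something like the following sketch could work. `findSimilarIdea` is a hypothetical helper, the 0.8 threshold is an assumption you would need to tune on your own data, and it expects a similarity function where a higher score means more similar. If your function returns a cosine distance instead, flip the comparison. The `wordOverlap` function is only a stand-in so the sketch runs without downloading a model; it is not semantic.

```javascript
// Hypothetical helper: returns the first stored idea whose similarity to
// `newIdea` reaches `threshold`, or null if none does.
async function findSimilarIdea(newIdea, storedIdeas, similarityFn, threshold = 0.8) {
  for (const stored of storedIdeas) {
    const score = await similarityFn({ text_a: newIdea, text_b: stored });
    if (score >= threshold) {
      return { match: stored, score };
    }
  }
  return null;
}

// Stand-in similarity (word-overlap / Jaccard score), NOT semantic.
// In a real app, pass a model-backed function with the same
// ({ text_a, text_b }) => Promise<number> shape instead.
const wordOverlap = async ({ text_a, text_b }) => {
  const a = new Set(text_a.toLowerCase().split(/\s+/));
  const b = new Set(text_b.toLowerCase().split(/\s+/));
  const shared = [...a].filter((word) => b.has(word)).length;
  return shared / (a.size + b.size - shared);
};

const storedIdeas = ["use public transport more", "plant trees"];

findSimilarIdea("use public transport", storedIdeas, wordOverlap, 0.5)
  .then((result) => console.log(result)); // matches "use public transport more" with score 0.75
```

Note that this calls the similarity function once per stored idea, so with a real model it embeds every stored idea again on each check. For a larger table it would be better to store each idea's embedding alongside it and compare vectors directly.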
Venter answered 28/11, 2023 at 9:42 Comment(0)
