BLEU score implementation for sentence similarity detection

Asked 22/3, 2011 at 11:22 Answered 20/9, 2021 at 13:5

Solved java algorithm nlp text-processing machine-translation

I need to calculate a BLEU score to determine whether two sentences are similar. The articles I have read mostly cover BLEU as a measure of machine translation accuracy, but I need a BLEU score that measures the similarity between two sentences in the same language (both sentences are in English). Thanks in anticipation.

Heliotropin answered 22/3, 2011 at 11:22 Comment(0)

Well, if you just want to calculate the BLEU score, it's straightforward. Treat one sentence as the reference translation and the other as the candidate translation.

Sello answered 22/3, 2011 at 15:56 Comment(0)

For sentence level comparisons, use smoothed BLEU

The standard BLEU score used for machine translation evaluation (BLEU-4) is only really meaningful at the corpus level, since any sentence that does not have at least one 4-gram match will be given a score of 0.

This happens because, at its core, BLEU is just the geometric mean of the n-gram precisions, scaled by a brevity penalty that prevents very short sentences with some matching material from receiving inappropriately high scores. Since the geometric mean is calculated by multiplying together all the terms included in the mean, a zero for any one of the n-gram precisions makes the entire score zero.
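For reference, this definition is usually written as below. The symbols are not from the answer and follow the common notation: p_n is the n-gram precision of order n, w_n its weight (typically 1/N, with N = 4 for BLEU-4), c the candidate length and r the reference length.

```latex
\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]
```

With uniform weights the exponential term is exactly the geometric mean of p_1 through p_N, so a single p_n = 0 drives the whole score to zero.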

If you want to apply BLEU to individual sentences, you're better off using smoothed BLEU (Lin and Och 2004 - see sec. 4), whereby you add 1 to each of the n-gram counts before you calculate the n-gram precisions. This will prevent any of the n-gram precisions from being zero, and thus will result in non-zero values even when there are not any 4-gram matches.
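A minimal Java sketch of this add-one smoothing, for illustration only. It is not the Phrasal implementation; the class and method names are hypothetical, and it assumes lowercased whitespace tokenization and a single reference. It smooths every n-gram order, as described above, although some published variants leave the unigram precision unsmoothed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class: sentence-level BLEU with add-one smoothing.
final class SmoothedBleu {

    // Count every n-gram of order n in the token array.
    static Map<String, Integer> ngramCounts(String[] tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    // Score a candidate sentence against one reference using n-grams up to maxN.
    static double score(String candidate, String reference, int maxN) {
        // Assumption: lowercase and split on whitespace; real tokenizers do more.
        String[] cand = candidate.trim().toLowerCase().split("\\s+");
        String[] ref = reference.trim().toLowerCase().split("\\s+");

        double logSum = 0.0;
        for (int n = 1; n <= maxN; n++) {
            Map<String, Integer> candCounts = ngramCounts(cand, n);
            Map<String, Integer> refCounts = ngramCounts(ref, n);
            int matched = 0;
            int total = 0;
            for (Map.Entry<String, Integer> e : candCounts.entrySet()) {
                total += e.getValue();
                // Clip each match count by how often the n-gram occurs in the reference.
                matched += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
            }
            // Add-one smoothing on both counts keeps every precision above zero.
            double precision = (matched + 1.0) / (total + 1.0);
            logSum += Math.log(precision) / maxN; // uniform weights 1/maxN
        }

        // Brevity penalty: 1 if the candidate is longer than the reference.
        double bp = cand.length > ref.length
                ? 1.0
                : Math.exp(1.0 - (double) ref.length / cand.length);
        return bp * Math.exp(logSum);
    }
}
```

Calling SmoothedBleu.score(sentenceA, sentenceB, 4) returns a value in (0, 1]. Because the added counts are large relative to the n-gram totals of a short sentence, even unrelated short sentences get a clearly non-zero score, so the values are best used to rank pairs against each other rather than read as absolute similarity.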

Java Implementation

You'll find a Java implementation of both BLEU and smoothed BLEU in the Stanford machine translation package Phrasal.

Alternatives

As Andreas already mentioned, you might want to use an alternative scoring metric such as Levenshtein's string edit distance. However, one problem with using the traditional Levenshtein edit distance to compare sentences is that it operates on characters and isn't explicitly aware of word boundaries.

Other alternatives include:

Word Error Rate - This is essentially the Levenshtein distance applied to a sequence of words rather than a sequence of characters. It's widely used for scoring speech recognition systems.
Translation Edit Rate (TER) - This is similar to word error rate, but it allows for an additional swap edit operation for adjacent words and phrases. This metric has become popular within the machine translation community since it correlates better with human judgments than other sentence similarity measures such as BLEU. The most recent variant of this metric, known as Translation Edit Rate Plus (TERp), allows for matching of synonyms using WordNet as well as paraphrases of multiword sequences (e.g. "died" ~= "kicked the bucket").
METEOR - This metric first calculates an alignment that allows for arbitrary reordering of the words in the two sentences being compared. If there are multiple possible ways to align the sentences, METEOR selects the one that minimizes crisscrossing alignment edges. Like TERp, METEOR allows for matching of WordNet synonyms and paraphrases of multiword sequences. After alignment, the metric computes the similarity between the two sentences by using the number of matching words to calculate an F-α score, a weighted harmonic mean of precision and recall, which is then scaled by a penalty for the amount of word order scrambling present in the alignment.
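Of the alternatives listed, Word Error Rate is the simplest to sketch: a word-level edit distance divided by the number of words in the reference. The Java below is a minimal illustration, not taken from any of the packages mentioned; the class and method names are hypothetical, and tokenization is plain whitespace splitting.

```java
// Hypothetical helper class: word error rate via word-level edit distance.
final class WordErrorRate {

    // Levenshtein distance over word sequences (insert, delete, substitute each cost 1).
    static int editDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty sequence takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int substitute = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // WER = word-level edit distance / number of reference words.
    static double wer(String hypothesis, String reference) {
        String[] hyp = hypothesis.trim().split("\\s+");
        String[] ref = reference.trim().split("\\s+");
        return (double) editDistance(hyp, ref) / ref.length;
    }
}
```

Unlike BLEU, lower is better here: 0 means the word sequences are identical, and the value can exceed 1 when the hypothesis is much longer than the reference.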

Rothberg answered 23/3, 2011 at 17:56 Comment(0)

Here you go: http://code.google.com/p/lingutil/

Blake answered 8/11, 2011 at 16:5 Comment(0)

Maybe the (Levenshtein) edit distance is also an option, or the Hamming distance (which is only defined for sequences of equal length). Either way, the BLEU score is also appropriate for the job; it measures the similarity of one sentence against a reference, so it only makes sense when both are in the same language, as in your problem.

Eliathan answered 22/3, 2011 at 23:8 Comment(0)

You can use the Moses multi-bleu script, which also supports multiple references: https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl

Kosel answered 16/1, 2015 at 19:26 Comment(0)

Implementing BLEU yourself is discouraged; SacreBLEU is a standard implementation.

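from datasets import load_metric  # deprecated in newer releases in favor of evaluate.load("sacrebleu")
metric = load_metric("sacrebleu")
result = metric.compute(predictions=["the cat sat on the mat"], references=[["the cat is on the mat"]])
print(result["score"])  # sacreBLEU reports BLEU on a 0-100 scale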

Tighe answered 20/9, 2021 at 13:5 Comment(0)
