TL;DR:
>>> import nltk
>>> hypothesis = ['This', 'is', 'cat']
>>> reference = ['This', 'is', 'a', 'cat']
>>> references = [reference] # list of references for 1 sentence.
>>> list_of_references = [references] # list of references for all sentences in corpus.
>>> list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
>>> nltk.translate.bleu_score.corpus_bleu(list_of_references, list_of_hypotheses)
0.6025286104785453
>>> nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
0.6025286104785453
(Note: You have to pull the latest version of NLTK on the develop branch in order to get a stable version of the BLEU score implementation.)
In Long:
Actually, if there's only one reference and one hypothesis in your whole corpus, both corpus_bleu() and sentence_bleu() should return the same value, as shown in the example above.
In the code, we can see that sentence_bleu() is really just a thin wrapper around corpus_bleu():
def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
smoothing_function=None):
return corpus_bleu([references], [hypothesis], weights, smoothing_function)
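Note the smoothing_function parameter in that signature. For a short hypothesis like the one in the TL;DR, the 3-gram and 4-gram precisions are zero, and recent NLTK versions will warn and push the unsmoothed score towards zero. A minimal sketch using NLTK's built-in SmoothingFunction (method1 is just one of the several Chen & Cherry (2014) methods it provides; pick whichever suits your setup):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [['This', 'is', 'a', 'cat']]
hypothesis = ['This', 'is', 'cat']

# method1 adds a small count to zero n-gram matches so the geometric
# mean of the precisions doesn't collapse to 0.
chencherry = SmoothingFunction()
sentence_bleu(references, hypothesis, smoothing_function=chencherry.method1)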
And if we look at the parameters of sentence_bleu():
def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
smoothing_function=None):
"""
:param references: reference sentences
:type references: list(list(str))
:param hypothesis: a hypothesis sentence
:type hypothesis: list(str)
:param weights: weights for unigrams, bigrams, trigrams and so on
:type weights: list(float)
:return: The sentence-level BLEU score.
:rtype: float
"""
The input for sentence_bleu's references is a list(list(str)). So if you have a sentence string, e.g. "This is a cat", you have to tokenize it to get a list of strings, ["This", "is", "a", "cat"]. And since it allows for multiple references, it has to be a list of lists of strings; e.g. if you have a second reference, "This is a feline", your input to sentence_bleu() would be:
references = [ ["This", "is", "a", "cat"], ["This", "is", "a", "feline"] ]
hypothesis = ["This", "is", "cat"]
sentence_bleu(references, hypothesis)
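If you start from raw strings, one way to get these token lists is NLTK's own tokenizer (this assumes the punkt tokenizer models have been downloaded, e.g. via nltk.download('punkt')):

>>> from nltk import word_tokenize
>>> word_tokenize("This is a cat")
['This', 'is', 'a', 'cat']

You can also vary the weights parameter from the docstring above to score only the lower n-gram orders. The expected values in the comments below are worked out by hand for this toy example, assuming no smoothing:

# BLEU-1: all 3 hypothesis tokens occur in a reference, so p1 = 1 and the
# score reduces to the brevity penalty, exp(1 - 4/3) ~ 0.72.
sentence_bleu(references, hypothesis, weights=(1,))

# BLEU-2: p2 = 1/2 ("is cat" appears in neither reference), so the score
# is exp(1 - 4/3) * sqrt(0.5) ~ 0.51.
sentence_bleu(references, hypothesis, weights=(0.5, 0.5))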
When it comes to corpus_bleu()'s list_of_references parameter, it's basically a list of whatever sentence_bleu() takes as references:
def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
smoothing_function=None):
"""
:param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
:type list_of_references: list(list(list(str)))
:param hypotheses: a list of hypothesis sentences
:type hypotheses: list(list(str))
:param weights: weights for unigrams, bigrams, trigrams and so on
:type weights: list(float)
:return: The corpus-level BLEU score.
:rtype: float
"""
Other than looking at the doctests within nltk/translate/bleu_score.py, you can also take a look at the unit tests in nltk/test/unit/translate/test_bleu_score.py to see how to use each of the components within bleu_score.py.
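To make the nesting of list_of_references concrete, here's a sketch of a two-sentence corpus (the second sentence pair is invented purely for illustration):

from nltk.translate.bleu_score import corpus_bleu

# One entry per sentence in the corpus; each entry is itself a list of
# reference token lists for that sentence.
list_of_references = [
    [['This', 'is', 'a', 'cat'], ['This', 'is', 'a', 'feline']],  # sentence 1
    [['It', 'is', 'raining', 'today']],                           # sentence 2
]
hypotheses = [
    ['This', 'is', 'cat'],             # hypothesis for sentence 1
    ['It', 'is', 'raining', 'today'],  # hypothesis for sentence 2
]
corpus_bleu(list_of_references, hypotheses)

Note that corpus_bleu() pools the n-gram counts over all sentences before taking the geometric mean, so the result is generally not the average of the per-sentence sentence_bleu() scores.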
By the way, since sentence_bleu is imported as bleu in nltk.translate.__init__.py (https://github.com/nltk/nltk/blob/develop/nltk/translate/__init__.py#L21), using

from nltk.translate import bleu

would be the same as:

from nltk.translate.bleu_score import sentence_bleu
and in code:
>>> from nltk.translate import bleu
>>> from nltk.translate.bleu_score import sentence_bleu
>>> from nltk.translate.bleu_score import corpus_bleu
>>> bleu == sentence_bleu
True
>>> bleu == corpus_bleu
False
Comments on the question:
"… nltk in order to have a stabilized version of BLEU. Actually, how you're using the function is not really correct, will explain in an answer =)" – Simpkins
"BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences. Maybe the line of questioning is not relevant to the metric." – Adriatic