bleuEvaluationScore

Evaluate translation or summarization with BLEU similarity score

Description

The BiLingual Evaluation Understudy (BLEU) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

score = bleuEvaluationScore(candidate,references) returns the BLEU similarity score between the specified candidate document and the reference documents. The function computes n-gram overlaps between candidate and references for n-gram lengths one through four, with equal weighting. For more information, see BLEU Score.

score = bleuEvaluationScore(candidate,references,'NgramWeights',ngramWeights) uses the specified n-gram weighting, where ngramWeights(i) corresponds to the weight for n-grams of length i. The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation.

Examples

Create an array of tokenized documents and extract a summary using the extractSummary function.

str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)
summary = 
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .

Specify the reference documents as a tokenizedDocument array.

str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);

Calculate the BLEU score between the summary and the reference documents using the bleuEvaluationScore function.

score = bleuEvaluationScore(summary,references)
score = 0.7825

This score indicates fairly good similarity; a BLEU score close to one indicates strong similarity.

Create an array of tokenized documents and extract a summary using the extractSummary function.

str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)
summary = 
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .

Specify the reference documents as a tokenizedDocument array.

str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);

Calculate the BLEU score between the candidate document and the reference documents using the default options. The bleuEvaluationScore function, by default, uses n-grams of length one through four with equal weights.

score = bleuEvaluationScore(summary,references)
score = 0.7825

Given that the summary document differs by only one word from one of the reference documents, this score suggests lower similarity than you might expect. This behavior occurs because the function uses n-grams that are too long for such a short document.

To address this, use shorter n-grams by passing a shorter 'NgramWeights' vector. Calculate the BLEU score again using only unigrams and bigrams by setting the 'NgramWeights' option to a two-element vector, and treat unigrams and bigrams equally by specifying equal weights.

score = bleuEvaluationScore(summary,references,'NgramWeights',[0.5 0.5])
score = 0.8367

This score indicates higher similarity than before.

Input Arguments

Candidate document, specified as a tokenizedDocument scalar, a string array, or a cell array of character vectors. If candidate is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.

Reference documents, specified as a tokenizedDocument array, a string array, or a cell array of character vectors. If references is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a tokenizedDocument array.

N-gram weights, specified as a row vector of finite nonnegative values, where ngramWeights(i) corresponds to the weight for n-grams of length i. The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation. The function normalizes the n-gram weights to sum to one.

Tip

If the number of words in candidate is smaller than the number of elements in ngramWeights, then the resulting BLEU score is zero. To ensure that bleuEvaluationScore returns nonzero scores for very short documents, set ngramWeights to a vector with fewer elements than the number of words in candidate.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
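
The following is a minimal sketch of the tip above, not an excerpt from the product examples. The three-word candidate and four-word reference are hypothetical, and both are passed as row vectors of words. According to the tip, the default call should return zero because the candidate has fewer words than the four default n-gram weights. The second call uses two weights, which the function normalizes to sum to one.

```matlab
% Hypothetical short documents, given as row vectors of words.
candidate  = ["the" "quick" "fox"];
references = ["the" "quick" "brown" "fox"];

% Default options: four n-gram weights, but only three candidate words.
% Per the tip, this score is expected to be zero.
scoreDefault = bleuEvaluationScore(candidate,references)

% Two weights (fewer than the number of candidate words).
% [1 1] is normalized by the function to [0.5 0.5].
scoreShort = bleuEvaluationScore(candidate,references,'NgramWeights',[1 1])
```

For very short documents, choosing the weight vector length from the candidate length is one way to avoid an all-zero evaluation.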

Output Arguments

BLEU score, returned as a scalar value in the range [0,1] or NaN.

A BLEU score close to zero indicates poor similarity between candidate and references. A BLEU score close to one indicates strong similarity. If candidate is identical to one of the reference documents, then score is 1. If candidate and references are both empty documents, then score is NaN. For more information, see BLEU Score.

Algorithms

BLEU Score

The BiLingual Evaluation Understudy (BLEU) scoring algorithm [1] evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

To compute the BLEU score, the algorithm uses n-gram counts, clipped n-gram counts, modified n-gram precision scores, and a brevity penalty.

The clipped n-gram count function $\text{Count}_{\text{clip}}$ truncates, if necessary, the count of each n-gram so that it does not exceed the largest count observed for that n-gram in any single reference. The clipped count function is given by

$$\text{Count}_{\text{clip}}(n\text{-gram}) = \min\bigl(\text{Count}(n\text{-gram}),\ \text{MaxRefCount}(n\text{-gram})\bigr),$$

where Count(n-gram) denotes the count of the n-gram in the candidate and MaxRefCount(n-gram) is the largest count of that n-gram observed in a single reference document.
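
To make the clipping step concrete, here is a small sketch for unigrams using only base MATLAB functions. The toy documents and all variable names are assumptions chosen for illustration; this is not how bleuEvaluationScore is implemented internally.

```matlab
% Hypothetical toy data: the candidate repeats "the" more often
% than any single reference does.
candidate  = ["the" "the" "the" "cat"];
references = {["the" "cat" "sat"], ["the" "the" "dog"]};

% Count(n-gram): occurrences of each distinct unigram in the candidate.
[words,~,idx] = unique(candidate);
candCount = accumarray(idx(:),1)';

% MaxRefCount(n-gram): largest count in any single reference.
maxRefCount = zeros(size(candCount));
for k = 1:numel(references)
    for j = 1:numel(words)
        refCount = sum(references{k} == words(j));
        maxRefCount(j) = max(maxRefCount(j),refCount);
    end
end

% Countclip(n-gram): candidate counts truncated at the reference maximum.
clippedCount = min(candCount,maxRefCount);

% Modified unigram precision for this single candidate.
p1 = sum(clippedCount)/sum(candCount);
```

Here "the" occurs three times in the candidate but at most twice in a single reference, so its count is clipped from 3 to 2 and the modified unigram precision is 3/4 rather than 4/4.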

The modified n-gram precision scores are given by

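$$p_n = \frac{\sum_{C \in \{\text{candidates}\}} \sum_{n\text{-gram} \in C} \text{Count}_{\text{clip}}(n\text{-gram})}{\sum_{C' \in \{\text{candidates}\}} \sum_{n\text{-gram}' \in C'} \text{Count}(n\text{-gram}')},$$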

where n corresponds to the n-gram length and {candidates} is the set of sentences in the candidate documents.

Given a vector of n-gram weights w, the BLEU score is given by

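$$\text{bleuScore} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log \bar{p}_n\right),$$

where N is the largest n-gram length, the entries in $\bar{p}$ correspond to the geometric averages of the modified n-gram precisions, and BP is the brevity penalty given by

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$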

where c is the length of the candidate document and r is the length of the reference document with length closest to the candidate length.
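
The pieces above can be combined in a short base-MATLAB sketch that does not call bleuEvaluationScore. It uses the tokens of the summary and references from the Examples section. The variable names and the way n-grams are joined into strings are assumptions made for illustration, and the toolbox function may be implemented differently.

```matlab
% Tokens from the Examples section (punctuation is its own token).
candidate = ["The" "fast" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."];
references = {
    ["The" "quick" "brown" "animal" "jumped" "over" "the" "lazy" "dog" "."]
    ["The" "quick" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."]};
weights = [1 1 1 1]/4;              % equal weights for n = 1,...,4

docs = [{candidate}; references];   % candidate first, then the references
N = numel(weights);
p = zeros(1,N);
for n = 1:N
    % Build the n-grams of every document by joining n consecutive words.
    grams = cell(size(docs));
    for d = 1:numel(docs)
        w = docs{d};
        g = w(1:end-n+1);
        for k = 2:n
            g = g + " " + w(k:end-n+k);
        end
        grams{d} = g;
    end

    % Candidate counts and the largest count in any single reference.
    [uniqueGrams,~,idx] = unique(grams{1});
    candCount = accumarray(idx(:),1);
    maxRefCount = zeros(size(candCount));
    for d = 2:numel(docs)
        for j = 1:numel(uniqueGrams)
            maxRefCount(j) = max(maxRefCount(j),sum(grams{d} == uniqueGrams(j)));
        end
    end

    % Modified n-gram precision from the clipped counts.
    p(n) = sum(min(candCount,maxRefCount))/sum(candCount);
end

% Brevity penalty: r is the reference length closest to the candidate length.
c = numel(candidate);
refLengths = cellfun(@numel,references);
[~,closest] = min(abs(refLengths - c));
r = refLengths(closest);
if c > r
    BP = 1;
else
    BP = exp(1 - r/c);
end

score = BP*exp(sum(weights.*log(p)))
```

For these documents the precisions are 9/10, 7/9, 6/8, and 5/7, and the brevity penalty is 1 because the candidate and references all have 10 tokens, so the sketch gives approximately 0.7825, matching the score in the example. Setting weights to [1 1]/2 gives approximately 0.8367, matching the two-weight example.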

References

[1] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Introduced in R2020a