bleuEvaluationScore

Evaluate translation or summarization with BLEU similarity score

Syntax

score = bleuEvaluationScore(candidate,references)

score = bleuEvaluationScore(candidate,references,Name=Value)

Description

The BiLingual Evaluation Understudy (BLEU) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

score = bleuEvaluationScore(candidate,references) returns the BLEU similarity score between the specified candidate document and the reference documents. The function computes n-gram overlaps between candidate and references for n-gram lengths one through four, with equal weighting. For more information, see BLEU Score.

score = bleuEvaluationScore(candidate,references,Name=Value) specifies additional options using one or more name-value arguments.

Examples

Evaluate Summary

Create an array of tokenized documents and extract a summary using the extractSummary function.

str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)

summary = 
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .

Specify the reference documents as a tokenizedDocument array.

str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);

Calculate the BLEU score between the summary and the reference documents using the bleuEvaluationScore function.

score = bleuEvaluationScore(summary,references)

score = 
0.7825

This score indicates a fairly good similarity. A BLEU score close to one indicates strong similarity.

Specify N-Gram Weights

Create an array of tokenized documents and extract a summary using the extractSummary function.

str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)

summary = 
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .

Specify the reference documents as a tokenizedDocument array.

str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);

Calculate the BLEU score between the candidate document and the reference documents using the default options. The bleuEvaluationScore function, by default, uses n-grams of length one through four with equal weights.

score = bleuEvaluationScore(summary,references)

score = 
0.7825

The summary differs by only one word from one of the reference documents, so this score might suggest a lower similarity than expected. This behavior occurs because the function uses n-grams that are too long for such a short document.

To address this, use shorter n-grams by setting the 'NgramWeights' option to a shorter vector. Calculate the BLEU score again using only unigrams and bigrams by setting the 'NgramWeights' option to a two-element vector. Treat unigrams and bigrams equally by specifying equal weights.

score = bleuEvaluationScore(summary,references,'NgramWeights',[0.5 0.5])

score = 
0.8367

This score suggests a better similarity than before.

Input Arguments

`candidate` — Candidate document
`tokenizedDocument` scalar | string array | cell array of character vectors

Candidate document, specified as a tokenizedDocument scalar, a string array, or a cell array of character vectors. If candidate is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.

`references` — Reference documents
`tokenizedDocument` array | string array | cell array of character vectors

Reference documents, specified as a tokenizedDocument array, a string array, or a cell array of character vectors. If references is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a tokenizedDocument array.
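As a minimal sketch of the input form described above that does not use tokenizedDocument, the candidate and a single reference can each be passed as a string row vector with one word per element. The word lists are hand-tokenized versions of sentences from the examples, and the variable name reference is chosen here for illustration.

```matlab
% Candidate and reference as string row vectors, one word per element.
candidate = ["The" "fast" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."];
reference = ["The" "quick" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."];

% A string array describes a single reference document. To evaluate
% against several references, use a tokenizedDocument array instead.
score = bleuEvaluationScore(candidate,reference)
```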

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: bleuEvaluationScore(candidate,references,IgnoreCase=true) evaluates the BLEU similarity score, ignoring case.

`NgramWeights` — N-gram weights
`[0.25 0.25 0.25 0.25]` (default) | row vector of finite nonnegative values

N-gram weights, specified as a row vector of finite nonnegative values, where NgramWeights(i) corresponds to the weight for n-grams of length i. The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation. The function normalizes the n-gram weights to sum to one.

Tip

If the number of words in candidate is smaller than the number of elements in NgramWeights, then the resulting BLEU score is zero. To ensure that bleuEvaluationScore returns nonzero scores for very short documents, set NgramWeights to a vector with fewer elements than the number of words in candidate.
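The following sketch illustrates the tip with a hypothetical three-word candidate. The word lists are made up for illustration, and the expected behavior noted in the comments follows from the tip rather than from recorded output.

```matlab
% Three-word candidate, shorter than the four default n-gram weights.
candidate = ["the" "lazy" "dog"];
reference = ["The" "quick" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."];

% The default weights use n-grams of length one through four, so the
% score is expected to be zero for this candidate.
scoreDefault = bleuEvaluationScore(candidate,reference)

% Two weights restrict the evaluation to unigrams and bigrams, which a
% three-word candidate can contain.
scoreShort = bleuEvaluationScore(candidate,reference,NgramWeights=[0.5 0.5])
```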

`IgnoreCase` — Option to ignore case
`0` (`false`) (default) | `1` (`true`)

Option to ignore case, specified as one of these values:

0 (false) – use case-sensitive comparisons between candidates and references.
1 (true) – compare candidates and references ignoring case.
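To see the effect of this option, the sketch below compares a hypothetical candidate and reference that differ only in the capitalization of the first word. The word lists are illustrative assumptions. With IgnoreCase=true the two documents count as identical, so the second score should be 1, while the case-sensitive score should be lower.

```matlab
% Candidate and reference differ only in the case of the first word.
candidate = ["the" "quick" "brown" "fox" "jumped"];
reference = ["The" "quick" "brown" "fox" "jumped"];

% Case-sensitive comparison (default): "the" does not match "The".
scoreCaseSensitive = bleuEvaluationScore(candidate,reference)

% Case-insensitive comparison: every n-gram matches.
scoreIgnoreCase = bleuEvaluationScore(candidate,reference,IgnoreCase=true)
```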

Output Arguments

`score` — BLEU score
scalar

BLEU score, returned as a scalar value in the range [0,1] or NaN.

A BLEU score close to zero indicates poor similarity between candidate and references. A BLEU score close to one indicates strong similarity. If candidate is identical to one of the reference documents, then score is 1. If candidate and references are both empty documents, then score is NaN. For more information, see BLEU Score.

Algorithms

BLEU Score

The BiLingual Evaluation Understudy (BLEU) scoring algorithm [1] evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

To compute the BLEU score, the algorithm uses n-gram counts, clipped n-gram counts, modified n-gram precision scores, and a brevity penalty.

The clipped n-gram counts function $\mathrm{Count}_{\mathrm{clip}}$, if necessary, truncates the n-gram count for each n-gram so that it does not exceed the largest count observed in any single reference for that n-gram. The clipped counts function is given by

$\mathrm{Count}_{\mathrm{clip}}(\text{n-gram}) = \min(\mathrm{Count}(\text{n-gram}), \mathrm{MaxRefCount}(\text{n-gram})),$

where $\mathrm{Count}(\text{n-gram})$ denotes the n-gram counts and $\mathrm{MaxRefCount}(\text{n-gram})$ is the largest n-gram count observed in a single reference document for that n-gram.
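As a small illustration of clipping, consider a degenerate candidate that repeats one word, in the spirit of the example discussed in [1]. The sketch below uses only base MATLAB string operations. The variable names are chosen for illustration and are not part of bleuEvaluationScore.

```matlab
% Degenerate candidate that repeats a single unigram seven times.
candidate = ["the" "the" "the" "the" "the" "the" "the"];
references = { ...
    ["the" "cat" "is" "on" "the" "mat"], ...
    ["there" "is" "a" "cat" "on" "the" "mat"]};

% Count("the") in the candidate.
countCandidate = sum(candidate == "the");

% MaxRefCount("the"): largest count in any single reference.
maxRefCount = max(cellfun(@(r) sum(r == "the"), references));

% Clipped count: the candidate gets credit for at most maxRefCount matches.
countClip = min(countCandidate, maxRefCount)   % 2, even though the candidate has 7
```

Without clipping, every unigram in this candidate would count as a match, giving a perfect unigram precision for a useless output.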

The modified n-gram precision scores are given by

$p_{n} = \frac{\sum_{C \in \{\mathrm{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{C' \in \{\mathrm{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')},$

where n corresponds to the n-gram length and $\{\mathrm{Candidates}\}$ is the set of sentences in the candidate documents.
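The sketch below computes the modified bigram precision for the candidate and references used in the examples, treating the candidate as a single sentence. It is an illustrative base MATLAB implementation of the formula above, not the internal code of bleuEvaluationScore. The helper toNgrams and all variable names are assumptions made for this example.

```matlab
candidate = ["The" "fast" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."];
references = { ...
    ["The" "quick" "brown" "animal" "jumped" "over" "the" "lazy" "dog" "."], ...
    ["The" "quick" "brown" "fox" "jumped" "over" "the" "lazy" "dog" "."]};
n = 2;  % n-gram length

% Hypothetical helper: join each run of n consecutive words into one string.
toNgrams = @(words) join(reshape( ...
    words((1:numel(words)-n+1)' + (0:n-1)), [], n), " ", 2);

% Count each distinct candidate n-gram.
[uniqueNgrams, ~, idx] = unique(toNgrams(candidate));
candCounts = accumarray(idx, 1);

% For each distinct n-gram, find the largest count in any single reference.
maxRefCounts = zeros(size(candCounts));
for k = 1:numel(references)
    refNgrams = toNgrams(references{k});
    refCounts = arrayfun(@(g) sum(refNgrams == g), uniqueNgrams);
    maxRefCounts = max(maxRefCounts, refCounts);
end

% Modified precision: clipped matches divided by total candidate n-grams.
clippedCounts = min(candCounts, maxRefCounts);
p = sum(clippedCounts) / sum(candCounts)   % 7/9 for bigrams
```

Only the bigrams "The fast" and "fast brown" are missing from both references, so seven of the nine candidate bigrams match.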

Given a vector of n-gram weights w, the BLEU score is given by

$\mathrm{bleuScore} = BP \cdot \exp\left(\sum_{n = 1}^{N} w_{n} \log \bar{p}_{n}\right),$

where N is the largest n-gram length, the entries in $\bar{p}$ correspond to the geometric averages of the modified n-gram precisions, and $BP$ is the brevity penalty given by

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - \frac{r}{c}} & \text{if } c \leq r \end{cases},$

where c is the length of the candidate document and r is the length of the reference document with length closest to the candidate length.
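As a worked check, the formulas above reproduce the scores reported in the examples. The precisions below were counted by hand from the example candidate and references: 9 of 10 unigrams, 7 of 9 bigrams, 6 of 8 trigrams, and 5 of 7 four-grams match. The sketch assumes that $\bar{p}_{n}$ equals the modified precision $p_{n}$ for a single candidate document. The variable names are illustrative.

```matlab
% Hand-counted modified precisions for n = 1,...,4.
p = [9/10 7/9 6/8 5/7];

% Candidate length and closest reference length (both have 10 tokens).
c = 10;
r = 10;

% Brevity penalty.
if c > r
    BP = 1;
else
    BP = exp(1 - r/c);   % equals 1 here because c == r
end

% Default weights: n-grams of length one through four, equally weighted.
w = [0.25 0.25 0.25 0.25];
scoreDefault = BP * exp(sum(w .* log(p)));

% Unigrams and bigrams only, equally weighted.
w2 = [0.5 0.5];
scoreBigram = BP * exp(sum(w2 .* log(p(1:2))));

fprintf("%.4f %.4f\n", scoreDefault, scoreBigram)   % prints 0.7825 0.8367
```

The trigram and four-gram precisions are the lowest of the four, which is why dropping them raises the score for this short document.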

References

[1] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Version History

Introduced in R2020a
