The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scoring algorithm [1] calculates the
similarity between a candidate document and a collection of reference documents. Use the
ROUGE score to evaluate the quality of document translation and summarization models.
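N-gram Co-Occurrence Statistics (ROUGE-N)
Given an n-gram length n, the ROUGE-N metric between a candidate document and a single reference document is given by
$$\text{ROUGE-N}(\text{candidate},\text{reference}) = \frac{\sum_{r_i \in \text{reference}} \sum_{\text{n-gram} \in r_i} \text{Count}(\text{n-gram},\text{candidate})}{\sum_{r_i \in \text{reference}} \text{numNgrams}(r_i)}$$
where the elements $r_i$ are sentences in the reference document, Count(n-gram,candidate) is the number of times the specified n-gram occurs in the candidate document, and numNgrams($r_i$) is the number of n-grams in the specified reference sentence $r_i$.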
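For sets of multiple reference documents, the ROUGE-N metric is given by
$$\text{ROUGE-N}(\text{candidate},\text{references}) = \max_k \left\{ \text{ROUGE-N}(\text{candidate},\text{references}_k) \right\}$$
where $\text{references}_k$ is the kth reference document.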
To use the ROUGE-N metric, set the 'ROUGEMethod' option to 'n-grams'.
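The ROUGE-N definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the toolbox implementation: the function names are hypothetical, documents are assumed to be lists of tokenized sentences, and matches are clipped so that a reference n-gram is matched at most as many times as it occurs in the candidate (an assumption borrowed from common ROUGE implementations).

```python
from collections import Counter


def ngrams(sentence, n):
    """Return the list of n-grams (as tuples) in a tokenized sentence."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]


def rouge_n(candidate, reference, n):
    """ROUGE-N between two documents, each a list of tokenized sentences."""
    cand_counts = Counter(g for sent in candidate for g in ngrams(sent, n))
    ref_counts = Counter(g for sent in reference for g in ngrams(sent, n))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Assumption: clipped matching, so a reference n-gram is matched at most
    # as many times as it occurs in the candidate.
    matched = sum(min(count, cand_counts[g]) for g, count in ref_counts.items())
    return matched / total


def rouge_n_multi(candidate, references, n):
    """Best ROUGE-N score over several reference documents."""
    return max(rouge_n(candidate, ref, n) for ref in references)


candidate = [["the", "cat", "sat", "on", "the", "mat"]]
reference = [["the", "cat", "was", "on", "the", "mat"]]
print(rouge_n(candidate, reference, 2))  # 3 of the 5 reference bigrams match: 0.6
```

Because the denominator counts reference n-grams only, the score is a recall measure: adding unrelated words to the candidate does not lower it.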
Longest Common Subsequence (ROUGE-L)
Given a sentence $d = [w_1, \ldots, w_m]$ and a sentence $s = [s_1, \ldots, s_n]$, where the elements $w_i$ and $s_i$ correspond to words, the subsequence $[w_{i_1}, \ldots, w_{i_k}]$ is a common subsequence of d and s if $w_{i_j} = s_{l_j}$ for $j = 1, \ldots, k$ and $i_1 < \cdots < i_k$, $l_1 < \cdots < l_k$, where k is the length of the subsequence. The subsequence is a longest common subsequence (LCS) if the subsequence length k is maximal.
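A longest common subsequence of two sentences can be found with the standard dynamic-programming table. The sketch below is a minimal illustration with a hypothetical function name; it returns one LCS, although several of equal length may exist.

```python
def lcs(d, s):
    """Return one longest common subsequence of two word lists."""
    m, n = len(d), len(s)
    # length[i][j] is the LCS length of d[:i] and s[:j].
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if d[i - 1] == s[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Walk back through the table to recover the words of one LCS.
    words = []
    i, j = m, n
    while i > 0 and j > 0:
        if d[i - 1] == s[j - 1]:
            words.append(d[i - 1])
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return words[::-1]


print(lcs("police killed the gunman".split(), "police kill the gunman".split()))
# ['police', 'the', 'gunman']
```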
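Given a candidate document and a single reference document, the union of the longest common subsequences is given by
$$\text{LCS}_\cup(\text{candidate},\text{reference}) = \bigcup_{r_i \in \text{reference}} \left\{ w \mid w \in \text{LCS}(\text{candidate}, r_i) \right\}$$
where $\text{LCS}(\text{candidate}, r_i)$ is the set of longest common subsequences in the candidate document and the sentence $r_i$ from the reference document.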
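The ROUGE-L metric is an F-score measure. To calculate it, first calculate the recall and precision scores given by
$$R_{lcs}(\text{candidate},\text{reference}) = \frac{\sum_{r_i \in \text{reference}} \left| \text{LCS}_\cup(\text{candidate}, r_i) \right|}{\text{numWords}(\text{reference})}$$
$$P_{lcs}(\text{candidate},\text{reference}) = \frac{\sum_{r_i \in \text{reference}} \left| \text{LCS}_\cup(\text{candidate}, r_i) \right|}{\text{numWords}(\text{candidate})}$$
where $\left| \text{LCS}_\cup(\text{candidate}, r_i) \right|$ is the number of words in the union of the longest common subsequences of the candidate document and the reference sentence $r_i$, and numWords is the number of words in the specified document.
Then, the ROUGE-L metric between a candidate document and a single reference document is given by the F-score measure
$$\text{ROUGE-L}(\text{candidate},\text{reference}) = \frac{(1+\beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$
where the parameter $\beta$ controls the relative importance of the precision and recall. Because the ROUGE score favors recall, $\beta$ is typically set to a high value.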
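For sets of multiple reference documents, the ROUGE-L metric is given by
$$\text{ROUGE-L}(\text{candidate},\text{references}) = \max_k \left\{ \text{ROUGE-L}(\text{candidate},\text{references}_k) \right\}$$
where $\text{references}_k$ is the kth reference document.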
To use the ROUGE-L metric, set the 'ROUGEMethod' option to 'longest-common-subsequences'.
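The following Python sketch puts the ROUGE-L pieces together: the union of longest common subsequences per reference sentence, the recall and precision scores, and the F-score. It is a sketch under stated assumptions rather than the toolbox implementation: the function names are hypothetical, documents are lists of tokenized sentences, and only one LCS is traced for each sentence pair.

```python
def lcs_reference_positions(ref_sent, cand_sent):
    """Positions in ref_sent covered by one LCS of ref_sent and cand_sent."""
    m, n = len(ref_sent), len(cand_sent)
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_sent[i - 1] == cand_sent[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    positions = set()
    i, j = m, n
    while i > 0 and j > 0:
        if ref_sent[i - 1] == cand_sent[j - 1]:
            positions.add(i - 1)
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return positions


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between two documents (lists of tokenized sentences)."""
    hits = 0
    for ref_sent in reference:
        # Union, over all candidate sentences, of the reference words
        # that take part in a longest common subsequence.
        union = set()
        for cand_sent in candidate:
            union |= lcs_reference_positions(ref_sent, cand_sent)
        hits += len(union)
    if hits == 0:
        return 0.0
    recall = hits / sum(len(sent) for sent in reference)
    precision = hits / sum(len(sent) for sent in candidate)
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)


reference = [["w1", "w2", "w3", "w4", "w5"]]
candidate = [["w1", "w2", "w6", "w7", "w8"], ["w1", "w3", "w8", "w9", "w5"]]
# The union covers w1, w2, w3, w5: recall 4/5 and precision 4/10.
print(rouge_l(candidate, reference, beta=1.0))
```

With a very large beta the score approaches the recall (0.8 in this example), which is the sense in which a high beta favors recall.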
Weighted Longest Common Subsequence (ROUGE-W)
Given a weighting function f such that f has the property f(x+y)>f(x)+f(y) for any positive integers x and y, define $\text{WLCS}(\text{candidate},\text{reference})$ to be the length of the longest consecutive matches encountered in the candidate document and a single reference document scored by the weighting function f. For more information about calculating this value, see [1].
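The ROUGE-W metric is an F-score measure that requires the recall and precision scores given by
$$R_{wlcs}(\text{candidate},\text{reference}) = f^{-1}\!\left( \frac{\text{WLCS}(\text{candidate},\text{reference})}{f(\text{numWords}(\text{reference}))} \right)$$
$$P_{wlcs}(\text{candidate},\text{reference}) = f^{-1}\!\left( \frac{\text{WLCS}(\text{candidate},\text{reference})}{f(\text{numWords}(\text{candidate}))} \right)$$
where $\text{WLCS}(\text{candidate},\text{reference})$ is the weighted longest common subsequence score of the two documents, $f^{-1}$ is the inverse of the weighting function f, and numWords is the number of words in the specified document.
The ROUGE-W metric between a candidate document and a single reference document is given by the F-score measure
$$\text{ROUGE-W}(\text{candidate},\text{reference}) = \frac{(1+\beta^2)\, R_{wlcs}\, P_{wlcs}}{R_{wlcs} + \beta^2 P_{wlcs}}$$
where the parameter $\beta$ controls the relative importance of the precision and recall. Because the ROUGE score favors recall, $\beta$ is typically set to a high value.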
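For sets of multiple reference documents, the ROUGE-W metric is given by
$$\text{ROUGE-W}(\text{candidate},\text{references}) = \max_k \left\{ \text{ROUGE-W}(\text{candidate},\text{references}_k) \right\}$$
where $\text{references}_k$ is the kth reference document.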
To use the ROUGE-W metric, set the 'ROUGEMethod' option to 'weighted-longest-common-subsequences'.
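The weighted score can be computed with a dynamic program that tracks the length of the current run of consecutive matches, following the procedure described in [1]. The Python sketch below is illustrative only: the function names are hypothetical, each document is treated as one flat list of words, and the weighting function f(k) = k^2 (which satisfies f(x+y) > f(x) + f(y)) is an assumed choice.

```python
def wlcs(x, y, f):
    """Weighted LCS score of word lists x and y under weighting function f."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    run = [[0] * (n + 1) for _ in range(m + 1)]  # consecutive matches ending at (i, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = run[i - 1][j - 1]
                # Extending a run of length k to k + 1 adds f(k + 1) - f(k).
                score[i][j] = score[i - 1][j - 1] + f(k + 1) - f(k)
                run[i][j] = k + 1
            elif score[i - 1][j] > score[i][j - 1]:
                score[i][j] = score[i - 1][j]
            else:
                score[i][j] = score[i][j - 1]
    return score[m][n]


def rouge_w(candidate_words, reference_words, beta=1.0,
            f=lambda k: k**2, f_inv=lambda v: v**0.5):
    """ROUGE-W F-score for two flat word lists (assumed weighting f(k) = k^2)."""
    if not candidate_words or not reference_words:
        return 0.0
    w = wlcs(reference_words, candidate_words, f)
    recall = f_inv(w / f(len(reference_words)))
    precision = f_inv(w / f(len(candidate_words)))
    if recall + precision == 0:
        return 0.0
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)


reference_words = list("ABCDEFG")
consecutive = list("ABCDHIK")  # four matches in a row
scattered = list("AHBKCID")    # the same four matches, separated by gaps
print(rouge_w(consecutive, reference_words), rouge_w(scattered, reference_words))
```

Both candidates share a longest common subsequence of length four with the reference, but the consecutive one scores higher, which is the behavior the weighting function is meant to produce.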
Skip-Bigram Co-Occurrence Statistics (ROUGE-S)
A skip-bigram is an ordered pair of words in a sentence allowing for arbitrary gaps between them. That is, given a sentence $c_i = [c_{i1}, \ldots, c_{im}]$ from a candidate document, where the elements $c_{ij}$ correspond to the words in the sentence, the pair of words $(c_{ij_1}, c_{ij_2})$ is a skip-bigram if $j_1 < j_2$.
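As a small illustration of the definition, the skip-bigrams of a tokenized sentence are its in-order word pairs. The helper name below is hypothetical.

```python
from itertools import combinations


def skip_bigrams(sentence):
    """All word pairs (w_j1, w_j2) with j1 < j2 in a tokenized sentence."""
    # combinations() keeps the original word order within each pair.
    return list(combinations(sentence, 2))


pairs = skip_bigrams(["police", "killed", "the", "gunman"])
print(len(pairs))  # a four-word sentence has 6 skip-bigrams
```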
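The ROUGE-S metric is an F-score measure. To calculate it, first calculate the recall and precision scores given by
$$R_{skip}(\text{candidate},\text{reference}) = \frac{\sum_{r_i \in \text{reference}} \sum_{\text{skip-bigram} \in r_i} \text{Count}(\text{skip-bigram},\text{candidate})}{\sum_{r_i \in \text{reference}} \text{numSkipBigrams}(r_i)}$$
$$P_{skip}(\text{candidate},\text{reference}) = \frac{\sum_{r_i \in \text{reference}} \sum_{\text{skip-bigram} \in r_i} \text{Count}(\text{skip-bigram},\text{candidate})}{\sum_{c_i \in \text{candidate}} \text{numSkipBigrams}(c_i)}$$
where the elements $r_i$ and $c_i$ are sentences in the reference document and candidate document, respectively, Count(skip-bigram,candidate) is the number of times the specified skip-bigram occurs in the candidate document, and numSkipBigrams(s) is the number of skip-bigrams in the sentence s.
Then, the ROUGE-S metric between a candidate document and a single reference document is given by the F-score measure
$$\text{ROUGE-S}(\text{candidate},\text{reference}) = \frac{(1+\beta^2)\, R_{skip}\, P_{skip}}{R_{skip} + \beta^2 P_{skip}}$$
where the parameter $\beta$ controls the relative importance of the precision and recall.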
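For sets of multiple reference documents, the ROUGE-S metric is given by
$$\text{ROUGE-S}(\text{candidate},\text{references}) = \max_k \left\{ \text{ROUGE-S}(\text{candidate},\text{references}_k) \right\}$$
where $\text{references}_k$ is the kth reference document.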
To use the ROUGE-S metric, set the 'ROUGEMethod' option to 'skip-bigrams'.
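A minimal Python sketch of the ROUGE-S calculation follows. It is not the toolbox implementation: the function names are hypothetical, documents are lists of tokenized sentences, and matches are clipped to the number of occurrences in the candidate (an assumption).

```python
from collections import Counter
from itertools import combinations


def skip_bigram_counts(document):
    """Count the skip-bigrams over all sentences of a tokenized document."""
    return Counter(pair for sent in document for pair in combinations(sent, 2))


def rouge_s(candidate, reference, beta=1.0):
    """ROUGE-S F-score between two documents (lists of tokenized sentences)."""
    cand = skip_bigram_counts(candidate)
    ref = skip_bigram_counts(reference)
    # Assumption: clipped matching of reference skip-bigrams.
    matched = sum(min(count, cand[pair]) for pair, count in ref.items())
    if matched == 0:
        return 0.0
    recall = matched / sum(ref.values())
    precision = matched / sum(cand.values())
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)


def rouge_s_multi(candidate, references, beta=1.0):
    """Best ROUGE-S score over several reference documents."""
    return max(rouge_s(candidate, ref, beta) for ref in references)


reference = ["police killed the gunman".split()]
print(rouge_s(["police kill the gunman".split()], reference))   # 3 of 6 pairs match: 0.5
print(rouge_s(["the gunman kill police".split()], reference))   # 1 of 6 pairs match
```

Unlike a plain bigram comparison, the first candidate still receives credit for "police the" and "police gunman" even though a different word sits between them.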
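Skip-Bigram and Unigram Co-Occurrence Statistics (ROUGE-SU)
To also include unigram co-occurrence statistics in the ROUGE-S metric, introduce unigram counts into the recall and precision scores for ROUGE-S. This is equivalent to including start tokens in the candidate and reference documents, since
$$\sum_{r_i \in \text{reference}} \sum_{\text{skip-bigram} \in r_i} \text{Count}(\text{skip-bigram},\text{candidate}) + \sum_{r_i \in \text{reference}} \sum_{\text{unigram} \in r_i} \text{Count}(\text{unigram},\text{candidate}) = \sum_{r_i \in \text{reference}} \sum_{\text{skip-bigram} \in r_i^*} \text{Count}(\text{skip-bigram},\text{candidate}^*)$$
where Count(unigram,candidate) is the number of times the specified unigram appears in the candidate document, and $r_i^*$ and $\text{candidate}^*$ denote the reference sentence and the candidate document augmented with start tokens, respectively.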
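For sets of multiple reference documents, the ROUGE-SU metric is given by
$$\text{ROUGE-SU}(\text{candidate},\text{references}) = \max_k \left\{ \text{ROUGE-S}(\text{candidate}^*,\text{references}_k^*) \right\}$$
where $\text{references}_k^*$ is the kth reference document with sentences augmented with start tokens, and $\text{candidate}^*$ is the candidate document augmented in the same way.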
To use the ROUGE-SU metric, set the 'ROUGEMethod' option to 'skip-bigrams-and-unigrams'.
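The start-token equivalence used for ROUGE-SU can be checked numerically. The Python sketch below is illustrative only: the "<s>" marker and all function names are hypothetical, and matches are clipped to the candidate counts (an assumption). Prepending a start token to every sentence adds one skip-bigram (start, word) per word, so the extra matches are exactly the unigram matches.

```python
from collections import Counter
from itertools import combinations

START = "<s>"  # hypothetical start-token marker; must not occur in the text


def add_start_tokens(document):
    """Prepend the start token to every tokenized sentence."""
    return [[START] + sent for sent in document]


def skip_bigram_counts(document):
    return Counter(pair for sent in document for pair in combinations(sent, 2))


def unigram_counts(document):
    return Counter(word for sent in document for word in sent)


def matched(ref_counts, cand_counts):
    """Clipped number of reference items that also occur in the candidate."""
    return sum(min(count, cand_counts[item]) for item, count in ref_counts.items())


reference = ["police killed the gunman".split()]
candidate = ["police kill the gunman".split()]

# Left-hand side: skip-bigram matches plus unigram matches.
lhs = (matched(skip_bigram_counts(reference), skip_bigram_counts(candidate))
       + matched(unigram_counts(reference), unigram_counts(candidate)))
# Right-hand side: skip-bigram matches after adding start tokens.
rhs = matched(skip_bigram_counts(add_start_tokens(reference)),
              skip_bigram_counts(add_start_tokens(candidate)))
print(lhs, rhs)  # 6 6
```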