extractSummary

Extract summary from documents

Since R2020a

Description

summary = extractSummary(documents) chooses a subset of the input documents to serve as a summary, and returns them as a tokenizedDocument array.

[summary,scores] = extractSummary(documents) also returns the importance scores used for selecting the summary documents. In this case, scores(i) represents the score for summary(i).

[summary,scores] = extractSummary(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Create an array of tokenized documents.

str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

Extract a summary of the documents using the extractSummary function. The function, by default, chooses 1/10 of the input documents, rounding up.

summary = extractSummary(documents)
summary = 
  tokenizedDocument:

   10 tokens: The quick brown fox jumped over the lazy dog .

To specify a larger summary, use the 'SummarySize' option. Extract a three-document summary.

summary = extractSummary(documents,'SummarySize',3)
summary = 
  3x1 tokenizedDocument:

    10 tokens: The quick brown fox jumped over the lazy dog .
     7 tokens: The fox jumped over the dog .
     9 tokens: There seem to be animals jumping other animals .

Create an array of tokenized documents.

str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

Extract a three-document summary. The second output scores contains the summary document importance scores.

[summary,scores] = extractSummary(documents,'SummarySize',3)
summary = 
  3x1 tokenizedDocument:

    10 tokens: The quick brown fox jumped over the lazy dog .
    10 tokens: There seem to be animals jumping over other animals .
     7 tokens: The fox jumped over the dog .

scores = 3×1

    0.2426
    0.2174
    0.1911

Visualize the scores in a bar chart.

figure
bar(scores)
xlabel("Summary Document")
ylabel("Score")
title("Summary Document Importance")

To summarize a single document, split the document into an array of sentences, and use the extractSummary function.

Create a string scalar containing the document.

str = ...
    "There is a quick fox. The fox is brown. There is a dog which " + ...
    "is lazy. The dog is very lazy. The fox jumped over the dog. " + ...
    "The quick brown fox jumped over the lazy dog.";

Split the string into sentences using the splitSentences function.

str = splitSentences(str)
str = 6x1 string
    "There is a quick fox."
    "The fox is brown."
    "There is a dog which is lazy."
    "The dog is very lazy."
    "The fox jumped over the dog."
    "The quick brown fox jumped over the lazy dog."

Create a tokenized document array containing the sentences.

documents = tokenizedDocument(str)
documents = 
  6x1 tokenizedDocument:

     6 tokens: There is a quick fox .
     5 tokens: The fox is brown .
     8 tokens: There is a dog which is lazy .
     6 tokens: The dog is very lazy .
     7 tokens: The fox jumped over the dog .
    10 tokens: The quick brown fox jumped over the lazy dog .

Extract a summary from the sentences using the extractSummary function. To return a summary with three documents, set the 'SummarySize' option to 3. To ensure that the summary documents appear in the same order as the input documents, set the 'OrderBy' option to 'position'.

summary = extractSummary(documents,'SummarySize',3,'OrderBy','position')
summary = 
  3x1 tokenizedDocument:

     6 tokens: There is a quick fox .
     7 tokens: The fox jumped over the dog .
    10 tokens: The quick brown fox jumped over the lazy dog .

To reconstruct the sentences into a single document, convert the documents to string using the joinWords function and join the sentences using the join function.

sentences = joinWords(summary);
summaryStr = join(sentences)
summaryStr = 
"There is a quick fox . The fox jumped over the dog . The quick brown fox jumped over the lazy dog ."

To remove the spaces that tokenization introduced around the punctuation characters, use the replace function.

punctuationRight = ["." "," "’" ")" ":" "?" "!"];
summaryStr = replace(summaryStr," " + punctuationRight,punctuationRight);

punctuationLeft = ["(" "‘"];
summaryStr = replace(summaryStr,punctuationLeft + " ",punctuationLeft)
summaryStr = 
"There is a quick fox. The fox jumped over the dog. The quick brown fox jumped over the lazy dog."

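The steps of this example can be collected into one function. The following is a minimal sketch, not part of the toolbox: the name summarizeText and the argument numSentences are hypothetical, and the punctuation lists are simply the ones used above.

```matlab
function summaryStr = summarizeText(str,numSentences)
% summarizeText  Hypothetical helper (not a toolbox function).
% Returns an extractive summary of the string scalar str as a single
% string containing numSentences sentences in their original order.

% Split the text into sentences and treat each sentence as a document.
sentences = splitSentences(str);
documents = tokenizedDocument(sentences);

% Select the summary sentences, keeping the input order.
summary = extractSummary(documents, ...
    'SummarySize',numSentences,'OrderBy','position');

% Rebuild a single string from the selected sentences.
summaryStr = join(joinWords(summary));

% Remove the spaces that tokenization introduced around punctuation.
punctuationRight = ["." "," "’" ")" ":" "?" "!"];
summaryStr = replace(summaryStr," " + punctuationRight,punctuationRight);
punctuationLeft = ["(" "‘"];
summaryStr = replace(summaryStr,punctuationLeft + " ",punctuationLeft);
end
```

Calling summarizeText(str,3) on the original string scalar from this example should follow the same path as the individual steps shown above.
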
Input Arguments

Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: extractSummary(documents,'ScoringMethod','lexrank') extracts a summary from documents and sets the scoring method option to 'lexrank'.

Scoring method used for extractive summarization, specified as the comma-separated pair consisting of 'ScoringMethod' and one of the following:

  • 'textrank' – Use the TextRank algorithm.

  • 'lexrank' – Use the LexRank algorithm.

  • 'mmr' – Use the MMR algorithm.

Query document for MMR scoring, specified as the comma-separated pair consisting of 'Query' and a tokenizedDocument scalar, a string array of words, or a cell array of character vectors. If 'Query' is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.

This option only has an effect when 'ScoringMethod' is 'mmr'.
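
As an illustration of the 'Query' option, the following sketch scores a small document array with MMR. The query words are arbitrary and chosen only for illustration. No output is shown because the selection depends on the computed scores.

```matlab
% Example input: five short documents.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Query given as a row vector of words (a 1-by-2 string array).
query = ["lazy" "dog"];

% Score the documents with MMR using the query and keep two of them.
[summary,scores] = extractSummary(documents, ...
    'ScoringMethod','mmr', ...
    'Query',query, ...
    'SummarySize',2);
```

The query could equally be passed as a tokenizedDocument scalar, for example tokenizedDocument("lazy dog").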

Size of summary, specified as the comma-separated pair consisting of 'SummarySize' and one of the following:

  • Scalar in the range (0,1) – Extract the specified proportion of input documents, rounding up. In this case, the number of summary documents is ceil(SummarySize*numDocuments), where numDocuments is the number of input documents.

  • Positive integer – Extract a summary with the specified number of documents. If SummarySize is greater than or equal to the number of input documents, then the function returns the input documents sorted according to the 'OrderBy' option.

  • Inf – Return the input documents sorted according to the 'OrderBy' option.

Data Types: double
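
As a quick check of the proportional case, this sketch evaluates the rounding formula by hand. The variable names are illustrative only; the formula is the one stated above, and the value 1/10 is the proportion used by default in the first example.

```matlab
% Number of summary documents for a proportional SummarySize.
numDocuments = 5;      % for example, a five-document input array
proportion = 1/10;     % proportion of documents to extract

% 5 documents at 1/10 gives 0.5, which rounds up to 1.
numSummary = ceil(proportion*numDocuments);

% The same input with a proportion of 0.5 gives 2.5, which rounds up to 3.
numSummaryHalf = ceil(0.5*numDocuments);
```

This agrees with the first example, where a five-document input yields a one-document summary.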

Order of documents in summary, specified as the comma-separated pair consisting of 'OrderBy' and one of the following:

  • 'score' – Order documents by their score according to the 'ScoringMethod' option.

  • 'position' – Maintain the document order from the input.
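
To compare the two orderings, the following sketch requests the same three-document summary twice. The variable names are illustrative, and no output is shown because it depends on the computed scores.

```matlab
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Order the summary documents by their scores.
[summaryByScore,scoresByScore] = extractSummary(documents, ...
    'SummarySize',3,'OrderBy','score');

% Keep the summary documents in their input order.
[summaryByPosition,scoresByPosition] = extractSummary(documents, ...
    'SummarySize',3,'OrderBy','position');
```

In both calls, the scores output follows the same order as the corresponding summary output, so scoresByPosition need not be sorted by value.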

Output Arguments

Extracted summary, returned as a tokenizedDocument array. The summary is a subset of documents, and is sorted according to the 'OrderBy' option.

Summary document scores, returned as a vector, where scores(i) is the score of the ith summary document according to the 'ScoringMethod' option. The scores are sorted according to the 'OrderBy' option.

Version History

Introduced in R2020a