bagOfWords
Bag-of-words model
Description
A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection.
bagOfWords does not split text into words. To create an array of tokenized documents, see tokenizedDocument.
Creation
Description
bag = bagOfWords creates an empty bag-of-words model.
bag = bagOfWords(documents) counts the words appearing in documents and returns a bag-of-words model.
bag = bagOfWords(uniqueWords,counts) creates a bag-of-words model using the words in uniqueWords and the corresponding frequency counts in counts.
Input Arguments
documents
— Input documents
tokenizedDocument
array | string array of words | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
uniqueWords
— Unique word list
string vector | cell array of character vectors
Unique word list, specified as a string vector or a cell array of character vectors. If uniqueWords contains <missing>, then the function ignores the missing values. The size of uniqueWords must be 1-by-V, where V is the number of columns of counts.
Example: ["an" "example" "list"]
Data Types: string | cell
counts
— Frequency counts of words
matrix of nonnegative integers
Frequency counts of words corresponding to uniqueWords, specified as a matrix of nonnegative integers. The value counts(i,j) corresponds to the number of times the word uniqueWords(j) appears in the ith document. counts must have numel(uniqueWords) columns.
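For instance, in a hypothetical two-document model over a three-word vocabulary, counts(2,3) is the number of times uniqueWords(3) appears in the second document. A minimal sketch (the words and counts here are made up for illustration):

```matlab
% Hypothetical example: two documents, three unique words.
uniqueWords = ["bag" "of" "words"];
counts = [1 1 1;   % document 1 contains each word once
          0 0 2];  % document 2 contains "words" twice
bag = bagOfWords(uniqueWords,counts);
```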
Properties
Counts
— Word counts per document
sparse matrix
Word counts per document, specified as a sparse matrix.
NumDocuments
— Number of documents seen
nonnegative integer
Number of documents seen, specified as a nonnegative integer.
NumWords
— Number of unique words in model
nonnegative integer
Number of unique words in the model, specified as a nonnegative integer.
Vocabulary
— Unique words in model
string vector
Unique words in the model, specified as a string vector.
Data Types: string
Object Functions
encode | Encode documents as matrix of word or n-gram counts
tfidf | Term Frequency–Inverse Document Frequency (tf-idf) matrix
topkwords | Most important words in bag-of-words model or LDA topic
addDocument | Add documents to bag-of-words or bag-of-n-grams model
removeDocument | Remove documents from bag-of-words or bag-of-n-grams model
removeEmptyDocuments | Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
removeWords | Remove selected words from documents or bag-of-words model
removeInfrequentWords | Remove words with low counts from bag-of-words model
join | Combine multiple bag-of-words or bag-of-n-grams models
wordcloud | Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
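These functions are often chained. A sketch of a typical preprocessing workflow, assuming documents is a tokenizedDocument array (the threshold of 2 is an arbitrary illustrative choice):

```matlab
bag = bagOfWords(documents);          % count words per document
bag = removeInfrequentWords(bag,2);   % drop words appearing 2 times or fewer
M = tfidf(bag);                       % weight the counts by tf-idf
```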
Examples
Create Bag-of-Words Model
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" ... ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
View the top 10 words and their total counts.
tbl = topkwords(bag,10)
tbl=10×2 table
Word Count
_______ _____
"thy" 281
"thou" 234
"love" 162
"thee" 161
"doth" 88
"mine" 63
"shall" 59
"eyes" 56
"sweet" 55
"time" 53
Create Bag-of-Words Model from Unique Words and Counts
Create a bag-of-words model using a string array of unique words and a matrix of word counts.
uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts)
bag = 
  bagOfWords with properties:

          Counts: [4x7 double]
      Vocabulary: ["a" "an" "another" "example" "final" "sentence" "third"]
        NumWords: 7
    NumDocuments: 4
Import Text from Multiple Files Using a File Datastore
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.
readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);
Create an empty bag-of-words model.
bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0
Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag
.
while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end
View the updated bag-of-words model.
bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From" "fairest" "creatures" "we" "desire" "increase" "," "That" "thereby" "beauty's" "rose" "might" "never" "die" "But" "as" "the" "riper" "should" ... ] (1x276 string)
        NumWords: 276
    NumDocuments: 4
Remove Stop Words from Bag-of-Words Model
Remove the stop words from a bag-of-words model by inputting a list of stop words to removeWords
. Stop words are words such as "a", "the", and "in" which are commonly removed from text before analysis.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeWords(bag,stopWords)
newBag = 
  bagOfWords with properties:

          Counts: [2x4 double]
      Vocabulary: ["example" "short" "sentence" "second"]
        NumWords: 4
    NumDocuments: 2
Most Frequent Words of Bag-of-Words Model
Create a table of the most frequent words of a bag-of-words model.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" ... ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
Find the top five words.
T = topkwords(bag);
Find the top 20 words in the model.
k = 20;
T = topkwords(bag,k)
T=20×2 table
Word Count
________ _____
"thy" 281
"thou" 234
"love" 162
"thee" 161
"doth" 88
"mine" 63
"shall" 59
"eyes" 56
"sweet" 55
"time" 53
"beauty" 52
"nor" 52
"art" 51
"yet" 51
"o" 50
"heart" 50
⋮
Create Tf-idf Matrix
Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" ... ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix. View the first 10 rows and columns.
M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 3.8918 2.4720 2.5520
0 0 0 0 0 4.5287 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
Create Word Cloud from Bag-of-Words Model
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" "contracted" … ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
Visualize the bag-of-words model using a word cloud.
figure
wordcloud(bag);
Create Bag-of-Words Model in Parallel
If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel; otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.
Create a list of filenames. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet.
filenames = [
    "exampleSonnet1.txt"
    "exampleSonnet2.txt"
    "exampleSonnet3.txt"
    "exampleSonnet4.txt"];
Create a bag-of-words model from a collection of files. Initialize an empty bag-of-words model and then loop over the files and create a bag-of-words model for each file.
bag = bagOfWords;
numFiles = numel(filenames);
parfor i = 1:numFiles
    filename = filenames(i);
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end
Starting parallel pool (parpool) using the 'Processes' profile ... Connected to parallel pool with 4 workers.
Combine the bag-of-words models using join.
bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From" "fairest" "creatures" "we" "desire" "increase" "," "That" "thereby" "beauty's" "rose" "might" "never" "die" "But" "as" "the" "riper" "should" ... ] (1x276 string)
        NumWords: 276
    NumDocuments: 4
Tips
If you intend to use a held-out test set for your work, then partition your text data before using bagOfWords. Otherwise, the bag-of-words model may bias your analysis.
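For example, one way to hold out a test set is to split the tokenized documents before creating the model. A minimal sketch, assuming documents is a tokenizedDocument array; the 90/10 split ratio and the use of randperm are arbitrary illustrative choices:

```matlab
numDocuments = numel(documents);
idx = randperm(numDocuments);                   % shuffle document indices
numTrain = floor(0.9*numDocuments);             % hold out 10% for testing
trainDocuments = documents(idx(1:numTrain));
testDocuments = documents(idx(numTrain+1:end));
bag = bagOfWords(trainDocuments);               % model sees only training data
```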
Version History
Introduced in R2017b
See Also
bagOfNgrams | addDocument | removeDocument | removeInfrequentWords | removeWords | removeEmptyDocuments | topkwords | encode | tfidf | tokenizedDocument