removeInfrequentWords

Remove words with low counts from bag-of-words model

Description

newBag = removeInfrequentWords(bag,count) removes from the bag-of-words model bag the words that appear at most count times in total, summed over all documents. By default, the function is case sensitive.

newBag = removeInfrequentWords(bag,count,'IgnoreCase',true) removes the words that appear at most count times in total ignoring case. If words differ only by case, then the corresponding counts are merged.

Examples

Remove the words that appear two times or fewer from a bag-of-words model.

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"
    "another example"
    "a short example"]);
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

        NumWords: 8
          Counts: [4×8 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"    "another"]
    NumDocuments: 4

Remove the words that appear two times or fewer from the bag-of-words model.

count = 2;
newBag = removeInfrequentWords(bag,count)
newBag = 
  bagOfWords with properties:

        NumWords: 3
          Counts: [4×3 double]
      Vocabulary: ["example"    "a"    "short"]
    NumDocuments: 4
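
The 'IgnoreCase' syntax can be tried on a variant of the same data. The following is a minimal sketch, not part of the original example: the mixed-case documents are made up for illustration, and no output is shown because the casing of the merged vocabulary entries is not stated on this page. Under the description above, words such as "example" and "Example" should have their counts merged before the threshold is applied.

```matlab
% Illustrative documents (assumed data): same words as above, mixed case.
documents = tokenizedDocument([
    "An example of a short sentence"
    "a second Short sentence"
    "another Example"
    "A short example"]);
bag = bagOfWords(documents);

count = 2;

% Case-sensitive (default): "example" and "Example" are counted separately.
newBagCaseSensitive = removeInfrequentWords(bag,count);

% Case-insensitive: counts of words that differ only by case are merged
% before they are compared with the threshold.
newBagIgnoreCase = removeInfrequentWords(bag,count,'IgnoreCase',true);

% Compare how many words survive in each case.
numWordsCaseSensitive = newBagCaseSensitive.NumWords
numWordsIgnoreCase = newBagIgnoreCase.NumWords
```

Comparing the two NumWords values shows how much of the vocabulary is retained only because counts were merged across case variants.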

Input Arguments

bag: Input bag-of-words model, specified as a bagOfWords object.

count: Count threshold for removing words, specified as a positive integer. The function removes the words whose total number of occurrences is less than or equal to count.
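
To make the threshold rule concrete, the sketch below recomputes the total count of each word from the Counts property and keeps only the words whose total exceeds the threshold. It is an illustrative check of the rule as described on this page, not a description of how the function is implemented; the variable names totals and expectedVocabulary are introduced here for illustration.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"
    "another example"
    "a short example"]);
bag = bagOfWords(documents);

% Total occurrences of each vocabulary word, summed over all documents.
totals = full(sum(bag.Counts,1));

count = 2;

% Words with totals <= count are removed, so words with totals > count remain.
expectedVocabulary = bag.Vocabulary(totals > count);

newBag = removeInfrequentWords(bag,count);

% Check that the manual rule and the function agree on the retained words.
sameWords = isequal(sort(newBag.Vocabulary),sort(expectedVocabulary))
```

Note that the comparison is inclusive: a word that appears exactly count times is removed.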

Version History

Introduced in R2017b