wordTokenize

Tokenize text into words using BERT tokenizer

Since R2023b

    Description

    words = wordTokenize(tokenizer,str) tokenizes the text in str into words using the specified Bidirectional Encoder Representations from Transformers (BERT) tokenizer.

    Examples

    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Tokenize the text "Bidirectional Encoder Representations from Transformers" into words using the wordTokenize function.

    str = "Bidirectional Encoder Representations from Transformers";
    words = wordTokenize(tokenizer,str)
    words = 1x1 cell array
        {["Bidirectional"    "Encoder"    "Representations"    "from"    "Transformers"]}
    
    

    Input Arguments

    tokenizer

    BERT tokenizer, specified as a bertTokenizer object.

    str

    Input text, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell

    Output Arguments

    words

    Tokenized words, returned as a cell array of string arrays, with one string array per piece of input text.

    Data Types: cell
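
    The one-cell-per-input structure is easiest to see with a multi-element input. A brief sketch, assuming the pretrained BERT-Base support package required by the bert function is installed:

    [~,tokenizer] = bert;
    str = ["An example of a short sentence."; "A second short sentence."];
    words = wordTokenize(tokenizer,str);
    numel(words)   % 2, one cell per element of str
    words{1}       % string array of words from the first sentence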

    Algorithms

    WordPiece Tokenization

    The WordPiece tokenization algorithm [2] splits words into subword units and maps common sequences of characters and subwords to a single integer. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows the model to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

    These steps outline how to create a WordPiece tokenizer:

    1. Initialize vocabulary — Create an initial vocabulary of the unique characters in the data.

    2. Count token frequencies — Iterate through the training data and count the frequencies of each token in the vocabulary.

    3. Merge most frequent pairs — Identify the most frequent pair of tokens in the vocabulary and merge them into a single token. Update the vocabulary accordingly.

    4. Repeat counting and merging — Repeat the counting and merging steps until the vocabulary reaches a predefined size or until tokens can no longer merge.
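
    To illustrate the counting step, this sketch tallies adjacent token pairs in a toy corpus; the corpus, the variable names, and the "+" key separator are invented for illustration and are not part of any toolbox API. The most frequent pair found here would be the next merge candidate in step 3.

    % Toy corpus: each element is a word split into single-character tokens.
    corpus = {["h" "u" "g"], ["p" "u" "g"], ["h" "u" "g" "s"]};
    pairCounts = dictionary(string.empty, double.empty);
    for i = 1:numel(corpus)
        toks = corpus{i};
        for j = 1:numel(toks)-1
            pair = toks(j) + "+" + toks(j+1);
            if isKey(pairCounts, pair)
                pairCounts(pair) = pairCounts(pair) + 1;
            else
                pairCounts(pair) = 1;
            end
        end
    end
    % Identify the most frequent pair. "u+g" appears in all three words,
    % so it is the unique maximum regardless of dictionary ordering.
    [~,idx] = max(values(pairCounts));
    allPairs = keys(pairCounts);
    allPairs(idx)   % "u+g"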

    These steps outline how a WordPiece tokenizer tokenizes new text:

    1. Split text — Split text into individual words.

    2. Identify OOV words — Identify any OOV words that are not present in the pretrained vocabulary.

    3. Replace OOV words — Replace each OOV word with subword tokens from the vocabulary, for example by iteratively matching the longest vocabulary token at the start of the remaining characters of the word.
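
    The replacement step above is commonly implemented as greedy longest-prefix matching. The sketch below is illustrative only; the function name, the "##" continuation prefix, and the tiny vocabulary are assumptions and not part of the wordTokenize API.

    function tokens = wordpieceSplit(word, vocab)
    % Greedily match the longest vocabulary token at each position of word.
    % Pieces after the first carry the "##" continuation prefix.
    tokens = strings(1,0);
    pos = 1;
    while pos <= strlength(word)
        piece = "";
        for last = strlength(word):-1:pos
            candidate = extractBetween(word, pos, last);
            if pos > 1
                candidate = "##" + candidate;
            end
            if any(vocab == candidate)
                piece = candidate;
                break
            end
        end
        if piece == ""
            tokens = "[UNK]";   % no prefix matches: whole word is unknown
            return
        end
        tokens(end+1) = piece; %#ok<AGROW>
        pos = last + 1;
    end
    end

    For example, with the toy vocabulary ["trans" "##form" "##ers"], calling wordpieceSplit("transformers", vocab) returns ["trans" "##form" "##ers"].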

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b