subwordTokenize

Tokenize text into subwords using BERT tokenizer

Since R2023b

    Description

    subwords = subwordTokenize(tokenizer,str) tokenizes the text in str into subwords using the specified Bidirectional Encoder Representations from Transformers (BERT) tokenizer. This syntax automatically adds special tokens to the input.

    subwords = subwordTokenize(tokenizer,str1,str2) tokenizes the sentence pair str1,str2 into subwords. This syntax automatically adds special tokens to the input.

    subwords = subwordTokenize(___,AddSpecialTokens=tf) specifies whether to add special tokens to the input.

    Examples

    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Tokenize the text "Bidirectional Encoder Representations from Transformers" into subwords using the subwordTokenize function.

    str = "Bidirectional Encoder Representations from Transformers";
    subwords = subwordTokenize(tokenizer,str)
    subwords = 1×1 cell array
        {["[CLS]"    "bid"    "##ire"    "##ction"    "##al"    "en"    "##code"    "##r"    "representations"    "from"    "transformers"    "[SEP]"]}
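
The AddSpecialTokens syntax can be tried on the same text. The following is a minimal sketch, assuming Text Analytics Toolbox and the BERT-Base support files are available; the variable names withSpecial, withoutSpecial, numWith, and numWithout are illustrative only.

```matlab
% Load only the tokenizer; the network output is not needed here.
[~,tokenizer] = bert;
str = "Bidirectional Encoder Representations from Transformers";

% Default behavior: special tokens are added to the input.
withSpecial = subwordTokenize(tokenizer,str);

% Same text, with special tokens suppressed.
withoutSpecial = subwordTokenize(tokenizer,str,AddSpecialTokens=false);

% Compare the number of subword tokens in each result.
numWith = numel(withSpecial{1})
numWithout = numel(withoutSpecial{1})
```

With AddSpecialTokens=false, the result is expected to omit the "[CLS]" and "[SEP]" tokens that appear in the default output above.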
    
    

    Input Arguments

    tokenizer: Tokenizer, specified as a bertTokenizer object.

    str: Input text, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell
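
As an illustration of string array input, the sketch below tokenizes the two-sentence example value. It assumes that the function returns one cell per input string, which this page does not state explicitly; numTokens is an illustrative variable name.

```matlab
% Load only the tokenizer from the pretrained BERT-Base model.
[~,tokenizer] = bert;

% Two documents supplied as a 2-by-1 string array.
str = ["An example of a short sentence."; "A second short sentence."];
subwords = subwordTokenize(tokenizer,str);

% Count the subword tokens produced for each input string
% (assumes one cell per element of str).
numTokens = cellfun(@numel,subwords)
```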

    str1,str2: Input sentence pairs, specified as string arrays, character vectors, or cell arrays of character vectors of the same size.

    If you specify str1 and str2, then the function returns concatenated tokenized subwords.

    Data Types: char | string | cell
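
A sentence pair can be tokenized as sketched below. Under the usual BERT convention, the concatenated result starts with the start token and separates the two sentences with the separator token; treat that layout as an assumption to check against the actual output rather than documented behavior. The names tokens and numSeparators are illustrative.

```matlab
% Load only the tokenizer from the pretrained BERT-Base model.
[~,tokenizer] = bert;

str1 = "An example of a short sentence.";
str2 = "A second short sentence.";

% Tokenize the pair into a single concatenated token sequence.
subwords = subwordTokenize(tokenizer,str1,str2);
tokens = subwords{1};

% Count how many separator tokens appear in the concatenated sequence.
numSeparators = nnz(tokens == tokenizer.SeparatorToken)
```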

    tf: Flag to add padding, start, unknown, and separator tokens to input, specified as 1 (true) or 0 (false). Set this flag using the AddSpecialTokens name-value argument.

    Output Arguments

    subwords: Tokenized subwords, returned as a cell array of string arrays.

    Data Types: cell
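
The output in the Examples section is displayed as a 1×1 cell array that wraps a string array, so the tokens are reached with brace indexing. A minimal sketch, with illustrative variable names tokens, firstToken, and joined:

```matlab
% Load only the tokenizer from the pretrained BERT-Base model.
[~,tokenizer] = bert;
subwords = subwordTokenize(tokenizer,"Bidirectional Encoder Representations from Transformers");

% Extract the string array of subword tokens for the first input.
tokens = subwords{1};

% Inspect the first token and view all tokens as one space-delimited string.
firstToken = tokens(1)
joined = join(tokens," ")
```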

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b