encodeTokens

Convert tokens to token codes

Since R2023b

    Description

    [tokenCodes,segments] = encodeTokens(tokenizer,tokens) encodes tokens using the specified tokenizer and returns the token codes and segments. This syntax automatically adds special tokens to the input.

    [tokenCodes,segments] = encodeTokens(tokenizer,tokens1,tokens2) encodes the sentence pair tokens1,tokens2. This syntax automatically adds special tokens to the input.

    [tokenCodes,segments,idx] = encodeTokens(___) also returns the mapping between the input and the encoded output.

    ___ = encodeTokens(___,AddSpecialTokens=tf) specifies whether to add special tokens to the input.

    Examples

    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Encode the tokens "Bidirectional", "Encoder", "Representations", "from", and "Transformers" using the encodeTokens function.

    tokens = ["Bidirectional" "Encoder" "Representations" "from" "Transformers"];
    [tokenCodes,segments] = encodeTokens(tokenizer,tokens);

    View the token codes.

    tokenCodes
    tokenCodes = 1×1 cell array
        {[102 7227 7443 7543 2390 4373 16045 2100 15067 2014 19082 103]}
    
    

    View the segments.

    segments
    segments = 1×1 cell array
        {[1 1 1 1 1 1 1 1 1 1 1 1]}
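
    The sentence-pair syntax can be exercised in the same way. The following is a minimal sketch, assuming the pretrained tokenizer returned by bert is available; the token arrays tokens1 and tokens2 are hypothetical inputs chosen for illustration and do not come from this page. No output is shown because the codes depend on the tokenizer vocabulary.

```matlab
% Sketch: encode a sentence pair with the pretrained BERT tokenizer.
% Assumes Text Analytics Toolbox and the BERT-Base model are installed.
[~,tokenizer] = bert;

% Hypothetical example inputs (not taken from the original page).
tokens1 = ["Is" "this" "a" "question"];
tokens2 = ["Yes" "it" "is"];

% Special tokens are added automatically with this syntax.
[tokenCodes,segments] = encodeTokens(tokenizer,tokens1,tokens2);

% Each cell holds one encoded sequence. The segment vector has one
% entry per token code and records which input the code came from.
codes = tokenCodes{1};
segs = segments{1};
sameLength = numel(codes) == numel(segs);
disp(sameLength)
```

    Inspecting segs next to codes shows where the first input ends and the second begins.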
    
    

    Input Arguments

    tokenizer: Tokenizer, specified as a bertTokenizer or bpeTokenizer object.

    tokens: Input tokens, specified as a tokenizedDocument array, string array, or cell array of character vectors.

    tokens1,tokens2: Input sentence pairs, specified as tokenizedDocument arrays, string arrays, or cell arrays of character vectors.

    AddSpecialTokens: Flag to add padding, start, unknown, and separator tokens to the input, specified as 1 (true) or 0 (false).

    Output Arguments

    tokenCodes: Token codes, returned as a cell array of vectors of positive integers. The token codes index into the tokenizer vocabulary.

    Data Types: cell
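
    To connect the token codes to the tokenizer properties, the following sketch compares the first and last codes of an encoded sequence against the StartCode and SeparatorCode properties. It reuses the tokens from the Examples section and assumes the pretrained tokenizer returned by bert; the variable names are illustrative.

```matlab
% Sketch: check that the added special tokens match the tokenizer properties.
% Assumes Text Analytics Toolbox and the BERT-Base model are installed.
[~,tokenizer] = bert;

tokens = ["Bidirectional" "Encoder" "Representations" "from" "Transformers"];
tokenCodes = encodeTokens(tokenizer,tokens);

% tokenCodes is a cell array, so extract the numeric vector first.
codes = tokenCodes{1};

% With special tokens added (the default for this syntax), the sequence
% is expected to begin with the start code and end with the separator code.
startsWithStartCode = codes(1) == tokenizer.StartCode;
endsWithSeparatorCode = codes(end) == tokenizer.SeparatorCode;
disp([startsWithStartCode endsWithSeparatorCode])
```

    For the output shown in the Examples section, the first code is 102 and the last is 103, which match the StartCode and SeparatorCode values of that tokenizer.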

    segments: Segment indices, returned as a cell array of vectors of positive integers.

    The segments indicate which input each token code comes from. For each element s of the cell array, the value s(i) indicates which input produced the ith token code in the corresponding element of tokenCodes. If you specify only tokens (a single input rather than a sentence pair), then each element of segments is a vector of ones.

    Data Types: cell

    idx: Mapping between the token codes and the input tokens, returned as a vector of positive integers, with NaN in positions that correspond to special tokens.

    The value idx(i) indicates which input token corresponds to tokenCodes(i). If AddSpecialTokens is 1 (true) and tokenCodes(i) is the padding, start, unknown, or separator code of the tokenizer, then idx(i) is NaN.

    Data Types: double
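
    The effect of AddSpecialTokens on the codes and on the mapping can be seen by encoding the same tokens twice. This is a minimal sketch under the same assumptions as the Examples section (pretrained tokenizer from bert); the variable names are illustrative, and the mapping is only displayed, not indexed, so the sketch does not depend on its exact container type.

```matlab
% Sketch: compare encodings with and without special tokens.
% Assumes Text Analytics Toolbox and the BERT-Base model are installed.
[~,tokenizer] = bert;

tokens = ["Bidirectional" "Encoder" "Representations" "from" "Transformers"];

% Default behavior: special tokens are added.
[codesWith,~,idxWith] = encodeTokens(tokenizer,tokens);

% Same input, but with special tokens switched off.
[codesWithout,~,idxWithout] = encodeTokens(tokenizer,tokens,AddSpecialTokens=false);

% The two encodings are expected to differ only by the added special codes.
numExtraCodes = numel(codesWith{1}) - numel(codesWithout{1});
disp(numExtraCodes)

% Display both mappings. Per the idx description, entries for special
% tokens are NaN when special tokens are added.
disp(idxWith)
disp(idxWithout)
```

    Because several token codes can map back to one input token (subword splitting), repeated values in the mapping are expected.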

    Algorithms

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b