onehotencode

Encode data labels into one-hot vectors

    Description

    B = onehotencode(A,featureDim) encodes data labels in categorical array A into a one-hot encoded array B. The function replaces each element of A with a numeric vector of length equal to the number of unique classes in A along the dimension specified by featureDim. The vector contains a 1 in the position corresponding to the class of the label in A, and a 0 in every other position. Any <undefined> values are encoded to NaN values.

    tblB = onehotencode(tblA) encodes categorical data labels in table tblA into a table of one-hot encoded numeric values. The function replaces the single variable of tblA with as many variables as the number of unique classes in tblA. Each row in tblB contains a 1 in the variable corresponding to the class of the label in tblA, and a 0 in all other variables.

    ___ = onehotencode(___,typename) encodes the labels into numeric values of data type typename. Use this syntax with any of the input and output arguments in previous syntaxes.

    ___ = onehotencode(___,'ClassNames',classes) also specifies the names of the classes to use for encoding. Use this syntax when A or tblA does not contain categorical values, when you want to exclude any class labels from being encoded, or when you want to encode the vector elements in a specific order. Any label in A or tblA of a class that does not exist in classes is encoded to a vector of NaN values.

    Examples

    Encode a categorical vector of class labels into one-hot vectors representing the labels.

    Create a column vector of labels, where each row of the vector represents a single observation. Convert the labels to a categorical array.

    labels = ["red";"blue";"red";"green";"yellow";"blue"];
    labels = categorical(labels);

    View the order of the categories.

    categories(labels)
    ans = 4x1 cell
        {'blue'  }
        {'green' }
        {'red'   }
        {'yellow'}
    
    

    Encode the labels into one-hot vectors by using the onehotencode function. Expand the labels into vectors in the second dimension to encode the classes. Each column of onehotLabels corresponds to a unique label.

    onehotLabels = onehotencode(labels,2)
    onehotLabels = 6×4
    
         0     0     1     0
         1     0     0     0
         0     0     1     0
         0     1     0     0
         0     0     0     1
         1     0     0     0
    
    

    Each observation in labels is now a row vector with a 1 in the position corresponding to the category of the class label, and a 0 in all other positions. The function encodes the labels in the same order as the categories, so that a 1 in position 1 represents the first category in the list (in this case, blue). For example, because the second row in onehotLabels has a 1 in the first column, that observation is in the blue category.
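
    The positional rule described above can be reproduced by hand, which may help when checking results. The following is only a sketch, not how onehotencode is implemented; the names idx, numClasses, and manualLabels are introduced here for illustration. It assumes every label is defined: an <undefined> label would give a row of zeros here, whereas onehotencode returns NaN values.

```matlab
% Recreate the labels from the example above.
labels = categorical(["red";"blue";"red";"green";"yellow";"blue"]);

% double() of a categorical array returns the category index of each element,
% following the order of categories(labels).
idx = double(labels);                       % 6x1, values in 1..4
numClasses = numel(categories(labels));     % 4

% Compare each index against 1..numClasses (implicit expansion) to build
% one row per observation and one column per category.
manualLabels = double(idx == 1:numClasses); % 6x4

% Check the hand-built matrix against onehotencode.
sameAsFunction = isequal(manualLabels, onehotencode(labels,2))
```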

    You can also use dummyvar to encode the labels. dummyvar creates dummy variables, which in this case are the same as the encoded labels onehotLabels. For a comparison between the functions onehotencode and dummyvar, see Alternative Functionality.
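
    As a quick check of the statement above, the two functions can be compared directly on the same labels. This is a minimal sketch; the variable names dummyLabels and sameEncoding are illustrative only.

```matlab
% Recreate the labels from the example above.
labels = categorical(["red";"blue";"red";"green";"yellow";"blue"]);

% Encode with onehotencode along the second dimension.
onehotLabels = onehotencode(labels,2);

% Create dummy variables from the same categorical vector.
dummyLabels = dummyvar(labels);

% Compare the two encodings element by element.
sameEncoding = isequal(onehotLabels, dummyLabels)
```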

    Encode a categorical vector of area codes into one-hot vectors representing the codes.

    Create a numeric row vector of area codes, where each column of the vector represents a single observation. Convert the numeric vector to a categorical vector.

    codes = [802 802 603 802 603 802];
    categCodes = categorical(codes);

    View the order of the categories.

    categories(categCodes)
    ans = 2x1 cell
        {'603'}
        {'802'}
    
    

    Encode the area codes into one-hot vectors by using the onehotencode function. Expand the codes into vectors in the first dimension, so that each row corresponds to a unique label.

    labels = onehotencode(categCodes,1)
    labels = 2×6
    
         0     0     1     0     1     0
         1     1     0     1     0     1
    
    

    Each observation in labels is now a column vector with a 1 in the position corresponding to the category of the area code, and a 0 in all other positions. The function encodes the area codes in the same order as the categories, so that a 1 in position 1 (first row) represents the first category in the list.
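
    Choosing the first dimension for a row vector and the second dimension for a column vector should give results that are transposes of each other. The sketch below illustrates this with the area codes; rowEncoded, colEncoded, and isTranspose are names introduced here for illustration.

```matlab
% Recreate the categorical row vector of area codes.
codes = [802 802 603 802 603 802];
categCodes = categorical(codes);

% Row-vector input: expand along dimension 1 (one row per class).
rowEncoded = onehotencode(categCodes,1);      % 2x6

% Column-vector input: expand along dimension 2 (one column per class).
colEncoded = onehotencode(categCodes.',2);    % 6x2

% The two layouts carry the same information.
isTranspose = isequal(rowEncoded, colEncoded.')
```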

    One-hot encode a table of categorical values.

    Create a table of categorical data labels. Each row in the table contains a single observation.

    color = ["blue";"red";"blue";"green";"yellow";"red"];
    color = categorical(color);
    color = table(color);

    One-hot encode the table of class labels by using the onehotencode function.

    color = onehotencode(color)
    color=6×4 table
        blue    green    red    yellow
        ____    _____    ___    ______
    
         1        0       0       0   
         0        0       1       0   
         1        0       0       0   
         0        1       0       0   
         0        0       0       1   
         0        0       1       0   
    
    

    Each column of the table represents a class. The function encodes the data labels with a 1 in the column of the corresponding class, and a 0 everywhere else.
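
    If a later step needs a plain numeric matrix rather than a table, the encoded table can be converted with table2array, and the class order can be read from the variable names. This is a sketch only; tblA, tblB, classNames, and encArray are illustrative names.

```matlab
% Build the single-variable table from the example above.
color = categorical(["blue";"red";"blue";"green";"yellow";"red"]);
tblA = table(color);

% One-hot encode the table.
tblB = onehotencode(tblA);

% The variable names of the encoded table are the class names, in category order.
classNames = tblB.Properties.VariableNames

% Convert the encoded table to a 6x4 numeric matrix.
encArray = table2array(tblB);
```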

    Encode data labels when not all classes in the data are relevant by using only a subset of the classes.

    Create a row vector of data labels, where each column of the vector represents a single observation.

    pets = ["dog","fish","cat","dog","cat","bird"];

    Define the list of classes to encode. These classes are a subset of the classes in the observations.

    animalClasses = ["bird";"cat";"dog"];

    One-hot encode the observations into the first dimension, so that each row of encPets corresponds to a unique class. Specify the classes to encode.

    encPets = onehotencode(pets,1,"ClassNames",animalClasses)
    encPets = 3×6
    
         0   NaN     0     0     0     1
         0   NaN     1     0     1     0
         1   NaN     0     1     0     0
    
    

    Observations of a class not in the list of classes to encode are encoded to a vector of NaN values.
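
    Because excluded observations show up as NaN columns, they can be located and, if desired, dropped with isnan. The following sketch assumes that dropping them is the intended handling; isExcluded, excludedPets, and encKept are names introduced here for illustration.

```matlab
% Recreate the encoding from the example above.
pets = ["dog","fish","cat","dog","cat","bird"];
animalClasses = ["bird";"cat";"dog"];
encPets = onehotencode(pets,1,"ClassNames",animalClasses);

% An observation was excluded if its encoded column contains NaN values.
isExcluded = any(isnan(encPets),1);     % 1x6 logical

% List the labels that were not encoded.
excludedPets = pets(isExcluded)

% Keep only the observations that belong to one of the listed classes.
encKept = encPets(:,~isExcluded);       % 3x5
```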

    Encode a table that contains several types of class variables by encoding each variable separately.

    Create a table containing observations of several types of categorical data.

    color = ["blue";"red";"blue";"green";"yellow";"red"];
    color = categorical(color);
    
    pets = ["dog";"fish";"cat";"dog";"cat";"bird"];
    pets = categorical(pets);
    
    location = ["USA";"CAN";"CAN";"USA";"AUS";"USA"];
    location = categorical(location);
    
    data = table(color,pets,location)
    data=6×3 table
        color     pets    location
        ______    ____    ________
    
        blue      dog       USA   
        red       fish      CAN   
        blue      cat       CAN   
        green     dog       USA   
        yellow    cat       AUS   
        red       bird      USA   
    
    

    Use a for-loop to one-hot encode each table variable and append it to a new table containing the encoded data.

    encData = table();
    
    for i=1:width(data)
     encData = [encData onehotencode(data(:,i))];
    end
    
    encData
    encData=6×11 table
        blue    green    red    yellow    bird    cat    dog    fish    AUS    CAN    USA
        ____    _____    ___    ______    ____    ___    ___    ____    ___    ___    ___
    
         1        0       0       0        0       0      1      0       0      0      1 
         0        0       1       0        0       0      0      1       0      1      0 
         1        0       0       0        0       1      0      0       0      1      0 
         0        1       0       0        0       0      1      0       0      0      1 
         0        0       0       1        0       1      0      0       1      0      0 
         0        0       1       0        1       0      0      0       0      0      1 
    
    

    Each row of encData encodes the three different categorical classes for each observation.

    Compare the encoded data created by using onehotencode to the dummy variables created by using dummyvar. Because the dummyvar function does not accept table inputs, combine the class variables into the cell array group.

    group = {color,pets,location};
    dummyData = dummyvar(group)
    dummyData = 6×11
    
         1     0     0     0     0     0     1     0     0     0     1
         0     0     1     0     0     0     0     1     0     1     0
         1     0     0     0     0     1     0     0     0     1     0
         0     1     0     0     0     0     1     0     0     0     1
         0     0     0     1     0     1     0     0     1     0     0
         0     0     1     0     1     0     0     0     0     0     1
    
    

    The encoded data encData and the dummy variables dummyData have the same encoding but different data types: encData is a table, whereas dummyData is a numeric matrix. For more information on the differences between the onehotencode and dummyvar functions, see Alternative Functionality.
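
    The comparison can also be made programmatically by converting the encoded table to a matrix. This is a minimal sketch that rebuilds the variables from the example; sameValues is a name introduced here for illustration.

```matlab
% Rebuild the data from the example above.
color    = categorical(["blue";"red";"blue";"green";"yellow";"red"]);
pets     = categorical(["dog";"fish";"cat";"dog";"cat";"bird"]);
location = categorical(["USA";"CAN";"CAN";"USA";"AUS";"USA"]);
data = table(color,pets,location);

% Encode each table variable separately and concatenate the results.
encData = table();
for i = 1:width(data)
    encData = [encData onehotencode(data(:,i))]; %#ok<AGROW>
end

% Create dummy variables from the same grouping variables.
dummyData = dummyvar({color,pets,location});

% Compare the values after converting the table to a numeric matrix.
sameValues = isequal(table2array(encData), dummyData)
```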

    Input Arguments

    A: Array of data labels to encode, specified as a categorical array, a numeric array, or a string array.

    • If A is a categorical array, the elements of the one-hot encoded vectors follow the order of categories(A).

    • If A is not a categorical array, you must specify the classes to encode using the 'ClassNames' name-value argument. The function encodes the vectors in the order that the classes appear in classes.

    • If A contains undefined values or values not present in classes, the function encodes those values as a vector of NaN values. In this case, typename must be 'double' or 'single', because integer and logical types cannot represent NaN.

    Data Types: categorical | numeric | string
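
    The NaN behavior for undefined values can be seen with a small array. This is only a sketch; the value set passed to categorical is chosen here so that one element becomes <undefined>.

```matlab
% "c" is not in the value set ["a";"b"], so the third element is <undefined>.
A = categorical(["a";"b";"c"],["a";"b"]);

% Encode along the second dimension with the default output type.
B = onehotencode(A,2)

% The row for the undefined element contains only NaN values.
undefinedRowIsNaN = all(isnan(B(3,:)))
```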

    featureDim: Dimension to expand to encode the labels, specified as a positive integer.

    featureDim must specify a singleton dimension of A, or be larger than n where n is the number of dimensions of A.
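
    For a matrix of labels, neither of the first two dimensions is singleton, so a higher dimension has to be chosen. The sketch below uses a hypothetical 2-by-2 label matrix and expands it along the third dimension.

```matlab
% A 2x2 matrix of labels: ndims(A) is 2 and no dimension is singleton.
A = categorical(["a" "b"; "b" "a"]);

% Expand along dimension 3, which is larger than ndims(A).
B = onehotencode(A,3);

% One page per class: B(:,:,1) marks "a" and B(:,:,2) marks "b".
sizeB = size(B)
pageA = B(:,:,1);
```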

    tblA: Table of data labels to encode, specified as a table. The table must contain a single variable and one row for each observation. Each entry must contain a categorical scalar, a numeric scalar, or a string scalar.

    • If tblA contains categorical values, the elements of the one-hot encoded vectors match the order of the categories; for example, the same order as categories(tblA{:,1}).

    • If tblA does not contain categorical values, you must specify the classes to encode using the 'ClassNames' name-value argument. The function encodes the vectors in the order that the classes appear in classes.

    • If tblA contains undefined values or values not present in classes, the function encodes those values as NaN values. In this case, typename must be 'double' or 'single', because integer and logical types cannot represent NaN.

    Data Types: table

    typename: Data type of the encoded labels, specified as a character vector or a string scalar.

    • If the classification label input is a categorical array, a numeric array, or a string array, then the encoded labels are returned as an array of data type typename.

    • If the classification label input is a table, then the encoded labels are returned as a table where each entry has data type typename.

    Valid values of typename are floating point, signed and unsigned integer, and logical types.

    Example: 'int64'

    Data Types: char | string
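
    As a brief illustration of the typename argument, the same labels can be encoded into different output types. This is a sketch under the assumption that the labels contain no undefined values, so that non-floating-point types are allowed.

```matlab
% Labels with two classes (categories are ordered as blue, red).
labels = categorical(["red";"blue";"red"]);

% Encode as single-precision values.
encSingle = onehotencode(labels,2,'single');
typeSingle = class(encSingle)

% Encode as logical values.
encLogical = onehotencode(labels,2,'logical');
typeLogical = class(encLogical)
```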

    classes: Classes to encode, specified as a cell array of character vectors, a string vector, a numeric vector, or a two-dimensional character array.

    • If the input A or tblA does not contain categorical values, then you must specify classes. You can also use the classes argument to exclude any class labels from being encoded, or to encode the vector elements in a specific order.

    • If A or tblA contains undefined values or values not present in classes, the function encodes those values to a vector of NaN values. In this case, typename must be 'double' or 'single', because integer and logical types cannot represent NaN.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | string | cell
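
    The classes argument can also fix the order of the encoded elements. The following sketch lists all four pet classes in a custom order, so that the first row of the result corresponds to "dog"; the name orderedClasses is introduced here for illustration.

```matlab
% String labels, one observation per column.
pets = ["dog","fish","cat","dog","cat","bird"];

% Custom class order: row 1 is dog, row 2 is cat, row 3 is bird, row 4 is fish.
orderedClasses = ["dog";"cat";"bird";"fish"];

% Encode along the first dimension using the custom order.
encPets = onehotencode(pets,1,'ClassNames',orderedClasses)
```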

    Output Arguments

    B: Encoded labels, returned as a numeric array.

    tblB: Encoded labels, returned as a table.

    Each row of tblB contains the one-hot encoded label for a single observation, in the same order as in tblA. Each row contains a 1 in the variable corresponding to the class of the label in tblA, and a 0 in all other variables.

    Alternative Functionality

    To encode data labels, you can also use dummyvar, which creates dummy variables from grouping variables. The following table compares the onehotencode and dummyvar functions for different use cases.

    Use case: Encoding multiple variables

    • When to use onehotencode: Use onehotencode in a loop. For an example, see One-Hot Encode Table with Several Variables.

    • When to use dummyvar: Specify the input argument group as a cell array or positive integer matrix. For examples, see Create Dummy Variables from Multiple Grouping Variables and Create Dummy Variables from Numeric Grouping Variables.

    Use case: Encoding a variable in cell array format

    • When to use onehotencode: Convert the cell array variable to a categorical array.

    • When to use dummyvar: Specify the input argument group as a cell array containing one or more grouping variables.

    Use case: Encoding noncategorical data labels

    • When to use onehotencode: Specify the data labels as a categorical array or specify the classes to encode using the ClassNames name-value argument. For an example, see One-Hot Encode Subset of Classes.

    • When to use dummyvar: You do not need to convert the data labels, because dummyvar accepts noncategorical grouping variables as input.

    Use case: Encoding an array of data labels

    • When to use onehotencode: Specify the dimension to expand (featureDim).

    • When to use dummyvar: The software automatically determines the dimension to expand. dummyvar returns dummy variables as a numeric array with columns created from the columns of the input grouping variables.
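
    For the cell array use case, the conversion step for onehotencode might look like the sketch below. The cell array c and its contents are hypothetical.

```matlab
% Labels stored as a cell array of character vectors.
c = {'cat';'dog';'cat'};

% Convert to a categorical array first, then encode along the second dimension.
catLabels = categorical(c);
enc = onehotencode(catLabels,2)   % columns follow categories(catLabels): cat, dog
```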

    In many cases, you do not need to use the onehotencode or dummyvar function for encoding. Most Statistics and Machine Learning Toolbox™ functions can operate directly on categorical response data. Most classification and regression functions also accept categorical predictors.

    Introduced in R2021b