Center and scale data in the Live Editor
The Normalize Data task lets you interactively normalize data by choosing centering and scaling methods, such as z-score. The task automatically generates MATLAB® code for your live script.
Using this task, you can:
Customize how to center and scale data in a workspace variable such as a table or timetable.
Automatically visualize the input data compared to the normalized data.
Output the centering and scaling values used to compute the normalization.
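For orientation, the following is a minimal sketch of what an equivalent programmatic call might look like. It assumes the `normalize` function with its default z-score method and its three-output syntax, which may not be available in older releases. The sample vector `A` is hypothetical, and the code the task actually generates may differ.

```matlab
% Hypothetical sample data; any numeric vector would do.
A = [2 4 4 4 5 5 7 9];

% With no method specified, normalize uses z-score.
% C and S are the centering and scaling values used for the normalization.
[N, C, S] = normalize(A);

% The normalized data can be reproduced from the returned values.
Ncheck = (A - C) ./ S;
```

Returning `C` and `S` makes it possible to apply the same normalization to other data later.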
To add the Normalize Data task to a live script in the MATLAB Editor:
On the Live Editor tab, select Task > Normalize Data.
In a code block in the script, type a relevant keyword, such as normalize. Select Normalize Data from the suggested command completions.
Input data— Valid input data from workspace
This task operates on data of type double. The data can be contained in a vector or in table variables.
When providing a table or timetable for the input data, specify All supported variables to normalize all variables with a supported type. To choose specific supported variables to normalize, select Specified variables and then select the variables individually.
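The variable selection described above roughly corresponds to the `DataVariables` option of the `normalize` function. The sketch below assumes that correspondence. The table `T` and its variable names are hypothetical.

```matlab
% Hypothetical table with two numeric variables and one text variable.
T = table([1;2;3;4], [10;20;30;40], ["a";"b";"c";"d"], ...
    'VariableNames', {'Temp','Pressure','Label'});

% Normalize only the Temp variable; other variables pass through unchanged.
Tsel = normalize(T, 'DataVariables', 'Temp');

% Normalize every numeric variable, selected with a function handle.
Tall = normalize(T, 'DataVariables', @isnumeric);
```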
Normalization method— Method and parameters for normalizing data
Center and scale| ...
Specify the method and related parameters for normalizing data using one of the following options.
Center and scale to have mean 0 and standard deviation 1.
Center and scale to have median 0 and median absolute deviation 1.
Scale data by p-norm, where p is a positive numeric scalar (default is 2) or Inf.
Rescale range of data to an interval of the form [a b], specified by left and right range limits (default is 0 for the left limit and 1 for the right limit).
Center and scale data to have median 0 and interquartile range 1.
Center to have mean 0.
Center to have median 0.
Shift center by specified numeric value.
Shift center using values in a numeric array or in a table whose variable names match the specified table variables from the input data.
Scale data by standard deviation.
Scale data by median absolute deviation.
Scale data by first element of data.
Scale data by interquartile range.
Scale data by dividing by a numeric value (numeric scalar, default is 1).
Scale data using values in a numeric array or in a table whose variable names match the specified table variables from the input data.
Both center and scale data using parameters from the centering and scaling methods.
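As a rough guide, several of the options above can be reproduced with the `normalize` function. The mapping from task options to function arguments shown here is an assumption, and the sample data is hypothetical. Combining `'center'` and `'scale'` in one call may require a recent release.

```matlab
A = [1 2 3 4 100];   % hypothetical data with one large value

Nz  = normalize(A);                      % mean 0, standard deviation 1
Nzr = normalize(A, 'zscore', 'robust');  % median 0, median absolute deviation 1
Nn  = normalize(A, 'norm', 1);           % scale by the 1-norm
Nr  = normalize(A, 'range', [-1 1]);     % rescale to the interval [-1 1]
Nmi = normalize(A, 'medianiqr');         % median 0, interquartile range 1
Nc  = normalize(A, 'center', 'median');  % center to have median 0
Ns  = normalize(A, 'scale', 'first');    % scale by the first element
Ncs = normalize(A, 'center', 'mean', 'scale', 'std');  % center and scale
```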
For a random variable X with mean μ and standard deviation σ, the z-score of a value x is $z = \frac{x - \mu}{\sigma}$. For sample data with mean $\bar{X}$ and standard deviation S, the z-score of a data point x is $z = \frac{x - \bar{X}}{S}$.
z-scores measure the distance of a data point from the mean in terms of the standard deviation. The standardized data set has mean 0 and standard deviation 1, and it retains the shape properties of the original data set (same skewness and kurtosis).
The median absolute deviation (MAD) of a data set is the median value of the absolute deviations from the median of the data: $\mathrm{MAD} = \operatorname{median}\left(|x - \operatorname{median}(x)|\right)$. Therefore, the MAD describes the variability of the data in relation to the median.
The MAD is generally preferred over using the standard deviation of the data when the data contains outliers (very large or very small values) because the standard deviation squares deviations from the mean, giving outliers an unduly large impact. Conversely, the deviations of a small number of outliers do not affect the value of the MAD.
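The effect of an outlier on the two measures can be checked with a small sketch. The data and the helper `madFun` are hypothetical; `madFun` computes the MAD directly from its description as the median of the absolute deviations from the median.

```matlab
x  = [1 2 3 4 5];     % hypothetical clean data
xo = [1 2 3 4 500];   % same data with one outlier

% Median of the absolute deviations from the median.
madFun = @(v) median(abs(v - median(v)));

madClean = madFun(x);    % MAD of the clean data
madOut   = madFun(xo);   % MAD with the outlier present
stdClean = std(x);       % standard deviation of the clean data
stdOut   = std(xo);      % standard deviation with the outlier present
```

For this data, the MAD is the same with and without the outlier, while the standard deviation grows by more than an order of magnitude.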
The general definition for the p-norm of a vector v that has N elements is $\|v\|_p = \left[\sum_{k=1}^{N} |v_k|^p\right]^{1/p}$, where p is any positive real value or Inf. Some common values of p are:
If p is 1, then the resulting 1-norm is the sum of the absolute values of the vector elements.
If p is 2, then the resulting 2-norm gives the vector magnitude or Euclidean length of the vector.
If p is Inf, then $\|v\|_\infty = \max_k |v_k|$.
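The common cases listed above can be checked numerically against the built-in `norm` function. The vector `v` and the helper `pnorm` are hypothetical; `pnorm` is a direct sketch of the general p-norm for a finite positive p.

```matlab
v = [3 -4 0];   % hypothetical vector

n1   = sum(abs(v));       % 1-norm: sum of absolute values
n2   = sqrt(sum(v.^2));   % 2-norm: Euclidean length
nInf = max(abs(v));       % Inf-norm: largest absolute value

% General p-norm for any finite positive p.
pnorm = @(v, p) sum(abs(v).^p)^(1/p);
n3 = pnorm(v, 3);
```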
Rescaling changes the distance between the minimum and maximum values in a data set by stretching or squeezing the points along the number line. The z-scores of the data are preserved, so the shape of the distribution remains the same.
The equation for rescaling data X to an arbitrary interval [a b] is $X_{\text{rescaled}} = a + \left[\frac{X - \min X}{\max X - \min X}\right](b - a)$.
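A minimal sketch of rescaling by hand, compared against `normalize` with the `'range'` method. The data and the interval limits are hypothetical. The manual expression is the standard min-max formula and is assumed to match what the function computes.

```matlab
X = [2 4 6 10];   % hypothetical data
a = -1;           % left range limit
b = 1;            % right range limit

% Shift so the minimum is 0, divide by the range, then stretch to [a b].
Xr = a + (X - min(X)) ./ (max(X) - min(X)) .* (b - a);

% The same result from the normalize function.
Xn = normalize(X, 'range', [a b]);

% Z-scores before and after rescaling, for comparison.
zBefore = normalize(X);
zAfter  = normalize(Xr);
```

Because rescaling is a linear transformation with a positive slope, `zBefore` and `zAfter` agree to within rounding error.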
The interquartile range (IQR) of a data set describes the range of the middle 50% of values when the values are sorted. If the median of the data is Q2, the median of the lower half of the data is Q1, and the median of the upper half of the data is Q3, then $\mathrm{IQR} = Q3 - Q1$.
The IQR is generally preferred over looking at the full range of the data when the data contains outliers (very large or very small values) because the IQR excludes the largest 25% and smallest 25% of values in the data.
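The quartile description above can be followed directly for a data set with an even number of points. The data here is hypothetical. Built-in quantile functions may use a different interpolation rule, so their results can differ slightly for other data sets.

```matlab
x = sort([7 1 3 9 5 11 13 15]);   % hypothetical data, sorted
n = numel(x);                     % even number of points

Q1 = median(x(1:n/2));            % median of the lower half
Q3 = median(x(n/2+1:end));        % median of the upper half
iqrValue = Q3 - Q1;

% Replace the largest value with an outlier and recompute.
xo = x;
xo(end) = 1500;
iqrOutlier = median(xo(n/2+1:end)) - median(xo(1:n/2));
```

In this example the IQR is unchanged by the outlier, because the largest value does not affect the median of the upper half.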