Pairwise distance between pairs of observations
D = pdist(X,Distance,DistParameter) returns the distance by using the method specified by Distance and DistParameter. You can specify DistParameter only when Distance is 'seuclidean', 'minkowski', or 'mahalanobis'.
Compute the Euclidean distance between pairs of observations, and convert the distance vector to a matrix using squareform.
Create a matrix with three observations and two variables.
rng('default') % For reproducibility
X = rand(3,2);
Compute the Euclidean distance.
D = pdist(X)
D = 1×3
0.2954 1.0670 0.9448
The pairwise distances are arranged in the order (2,1), (3,1), (3,2). You can easily locate the distance between observations i and j by using squareform.
Z = squareform(D)
Z = 3×3
0 0.2954 1.0670
0.2954 0 0.9448
1.0670 0.9448 0
squareform returns a symmetric matrix where Z(i,j) corresponds to the pairwise distance between observations i and j. For example, you can find the distance between observations 2 and 3.
Z(2,3)
ans = 0.9448
Pass Z to the squareform function to reproduce the output of the pdist function.
y = squareform(Z)
y = 1×3
0.2954 1.0670 0.9448
The outputs y from squareform and D from pdist are the same.
Create a matrix with three observations and two variables.
rng('default') % For reproducibility
X = rand(3,2);
Compute the Minkowski distance with the default exponent 2.
D1 = pdist(X,'minkowski')
D1 = 1×3
0.2954 1.0670 0.9448
Compute the Minkowski distance with an exponent of 1, which is equal to the city block distance.
D2 = pdist(X,'minkowski',1)
D2 = 1×3
0.3721 1.5036 1.3136
D3 = pdist(X,'cityblock')
D3 = 1×3
0.3721 1.5036 1.3136
Define a custom distance function that ignores coordinates with NaN values, and compute pairwise distance by using the custom distance function.
Create a matrix with three observations and two variables.
rng('default') % For reproducibility
X = rand(3,2);
Assume that the first element of the first observation is missing.
X(1,1) = NaN;
Compute the Euclidean distance.
D1 = pdist(X)
D1 = 1×3
NaN NaN 0.9448
If observation i or j contains NaN values, the function pdist returns NaN for the pairwise distance between i and j. Therefore, D1(1) and D1(2), the pairwise distances (2,1) and (3,1), are NaN values.
Define a custom distance function naneucdist that ignores coordinates with NaN values and returns the Euclidean distance.
function D2 = naneucdist(XI,XJ)
%NANEUCDIST Euclidean distance ignoring coordinates with NaNs
n = size(XI,2);
sqdx = (XI-XJ).^2;
nstar = sum(~isnan(sqdx),2); % Number of pairs that do not contain NaNs
nstar(nstar == 0) = NaN; % To return NaN if all pairs include NaNs
D2squared = sum(sqdx,2,'omitnan').*n./nstar; % Correction for missing coordinates
D2 = sqrt(D2squared);
Compute the distance with naneucdist by passing the function handle as an input argument of pdist.
D2 = pdist(X,@naneucdist)
D2 = 1×3
0.3974 1.1538 0.9448
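For a short distance function, an anonymous function can stand in for a separate function file. The following is a minimal sketch, not a definitive pattern. It assumes, as naneucdist does, that pdist calls the handle with one observation as a row vector and the other observations as the rows of a matrix, and that your release supports implicit expansion (R2016b or later). The name eucfun is hypothetical.

```matlab
rng('default') % For reproducibility
X = rand(3,2);

% Hypothetical anonymous distance function: plain Euclidean distance.
% ZI is assumed to be a 1-by-n row and ZJ a k-by-n matrix.
eucfun = @(ZI,ZJ) sqrt(sum((ZI-ZJ).^2,2));

Dcustom = pdist(X,eucfun);
Dbuiltin = pdist(X,'euclidean');

% The two results should agree to within floating-point round-off
maxDiff = max(abs(Dcustom - Dbuiltin));
```

If maxDiff is close to zero, the handle reproduces the built-in metric. For non-sparse data the built-in name is generally the faster choice.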
X — Input data
Input data, specified as a numeric matrix of size m-by-n. Rows correspond to individual observations, and columns correspond to individual variables.
Data Types: single | double
Distance — Distance metric
Distance metric, specified as a character vector, string scalar, or function handle, as described in the following table.
Value | Description |
---|---|
'euclidean' | Euclidean distance (default). |
'squaredeuclidean' | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
'seuclidean' | Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, S = std(X,'omitnan'). Use DistParameter to specify a different value for S. |
'mahalanobis' | Mahalanobis distance using the sample covariance of X, C = cov(X,'omitrows'). Use DistParameter to specify a different value for C, where the matrix C is symmetric and positive definite. |
'cityblock' | City block distance. |
'minkowski' | Minkowski distance. The default exponent is 2. Use DistParameter to specify a different exponent, which must be a positive scalar. |
'chebychev' | Chebychev distance (maximum coordinate difference). |
'cosine' | One minus the cosine of the included angle between points (treated as vectors). |
'correlation' | One minus the sample correlation between points (treated as sequences of values). |
'hamming' | Hamming distance, which is the percentage of coordinates that differ. |
'jaccard' | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |
'spearman' | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
@distfun | Custom distance function handle. A distance function has the form function D2 = distfun(ZI,ZJ) % calculation of distance ... If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle. |
For definitions, see Distance Metrics.
When you use 'seuclidean', 'minkowski', or 'mahalanobis', you can specify an additional input argument DistParameter to control these metrics. You can also use these metrics in the same way as the other metrics with a default value of DistParameter.
Example: 'minkowski'
DistParameter — Distance metric parameter values
Distance metric parameter values, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify Distance as 'seuclidean', 'minkowski', or 'mahalanobis'.
If Distance is 'seuclidean', DistParameter is a vector of scaling factors for each dimension, specified as a positive vector. The default value is std(X,'omitnan').
If Distance is 'minkowski', DistParameter is the exponent of Minkowski distance, specified as a positive scalar. The default value is 2.
If Distance is 'mahalanobis', DistParameter is a covariance matrix, specified as a numeric matrix. The default value is cov(X,'omitrows'). DistParameter must be symmetric and positive definite.
Example: 'minkowski',3
Data Types: single | double
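The three parameterized metrics can be exercised side by side. This is a minimal sketch on random data, not a recommended workflow. The variable names (S, C, Dse, Dmk, Dmh) are illustrative, and the explicit S and C simply restate the documented defaults.

```matlab
rng('default') % For reproducibility
X = rand(5,3);

% Standardized Euclidean distance with explicit scaling factors
% (here the same as the default, std(X,'omitnan'))
S = std(X,'omitnan');
Dse = pdist(X,'seuclidean',S);

% Minkowski distance with exponent 3
Dmk = pdist(X,'minkowski',3);

% Mahalanobis distance with an explicit covariance matrix
% (here the same as the default, cov(X,'omitrows'))
C = cov(X,'omitrows');
Dmh = pdist(X,'mahalanobis',C);

% Passing the default parameter explicitly should not change the result
sameSe = max(abs(Dse - pdist(X,'seuclidean'))) < 1e-12;
sameMh = max(abs(Dmh - pdist(X,'mahalanobis'))) < 1e-12;
```

To use a different scale or covariance, for example one estimated from a separate reference sample, replace S or C with a positive vector or a symmetric positive definite matrix of matching size.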
D — Pairwise distances
Pairwise distances, returned as a numeric row vector of length m(m–1)/2, corresponding to pairs of observations, where m is the number of observations in X.
The distances are arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1), that is, the lower-left triangle of the m-by-m distance matrix in column order. The pairwise distance between observations i and j is in D((i-1)*(m-i/2)+j-i) for i < j. The diagonal (i = j) is not stored, because the distance from an observation to itself is zero.
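The index formula can be checked against squareform for every pair with i < j. The sketch below uses illustrative variable names (k, ok) and random data, and it only confirms the ordering stated above.

```matlab
rng('default') % For reproducibility
X = rand(5,2);
m = size(X,1);
D = pdist(X);
Z = squareform(D);

% Compare D(k) with Z(i,j) for every pair i < j
ok = true;
for i = 1:m-1
    for j = i+1:m
        k = (i-1)*(m-i/2) + j - i; % Position of pair (i,j) in D
        ok = ok && (D(k) == Z(i,j));
    end
end
```

For m = 5, the pair (2,3) gives k = (1)*(5-1) + 1 = 5, the first entry after the four distances that involve observation 1.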
You can convert D into a symmetric matrix by using the squareform function. Z = squareform(D) returns an m-by-m matrix where Z(i,j) corresponds to the pairwise distance between observations i and j.
If observation i or j contains NaNs, then the corresponding value in D is NaN for the built-in distance functions.
D is commonly used as a dissimilarity matrix in clustering or multidimensional scaling. For details, see Hierarchical Clustering and the function reference pages for cmdscale, cophenet, linkage, mdscale, and optimalleaforder. These functions take D as an input argument.
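As an illustration of passing D downstream, the following sketch builds a hierarchical cluster tree and a classical multidimensional scaling configuration from the same distances. The data, the 'average' linkage method, and the two-cluster cutoff are assumptions chosen for the example, not recommendations.

```matlab
rng('default') % For reproducibility
X = [rand(5,2); rand(5,2)+3]; % Two well-separated groups (illustrative data)

D = pdist(X);                   % Pairwise Euclidean distances
tree = linkage(D,'average');    % Hierarchical cluster tree built from D
T = cluster(tree,'MaxClust',2); % Cut the tree into two clusters

% Classical multidimensional scaling on the same distances,
% passed here in square form
Y = cmdscale(squareform(D));
```

Because the two groups are far apart relative to their spread, T should assign the first five observations to one cluster and the last five to the other.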
A distance metric is a function that defines a distance between two observations. pdist supports various distance metrics: Euclidean distance, standardized Euclidean distance, Mahalanobis distance, city block distance, Minkowski distance, Chebychev distance, cosine distance, correlation distance, Hamming distance, Jaccard distance, and Spearman distance.
Given an m-by-n data matrix X, which is treated as m (1-by-n) row vectors $x_1, x_2, \ldots, x_m$, the various distances between the vector $x_s$ and $x_t$ are defined as follows:
Euclidean distance
$$d_{st}^2 = (x_s - x_t)(x_s - x_t)^{\top}$$
The Euclidean distance is a special case of the Minkowski distance, where p = 2.
Standardized Euclidean distance
$$d_{st}^2 = (x_s - x_t)V^{-1}(x_s - x_t)^{\top}$$
where V is the n-by-n diagonal matrix whose jth diagonal element is $(S(j))^2$, where S is a vector of scaling factors for each dimension.
Mahalanobis distance
$$d_{st}^2 = (x_s - x_t)C^{-1}(x_s - x_t)^{\top}$$
where C is the covariance matrix.
City block distance
$$d_{st} = \sum_{j=1}^{n} |x_{sj} - x_{tj}|$$
The city block distance is a special case of the Minkowski distance, where p = 1.
Minkowski distance
$$d_{st} = \left(\sum_{j=1}^{n} |x_{sj} - x_{tj}|^{p}\right)^{1/p}$$
For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.
Chebychev distance
$$d_{st} = \max_{j} |x_{sj} - x_{tj}|$$
The Chebychev distance is a special case of the Minkowski distance, where p = ∞.
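The Minkowski family and the two scaled Euclidean variants can be checked numerically for a single pair of observations. The sketch below computes each distance by hand from the coordinate differences and compares it with the first element of the pdist output, which corresponds to the pair (2,1). Variable names and data are illustrative, and the helper first is hypothetical.

```matlab
rng('default') % For reproducibility
X = rand(6,3);
xs = X(1,:);
xt = X(2,:);
d = abs(xs - xt);            % Coordinate-wise absolute differences

dCity = sum(d);              % Minkowski with p = 1
dEuc  = sqrt(sum(d.^2));     % Minkowski with p = 2
dMink = sum(d.^3)^(1/3);     % Minkowski with p = 3
dCheb = max(d);              % Limit p = Inf

S = std(X);                  % Scaling factors, so V = diag(S.^2)
dSeuc = sqrt(sum((d./S).^2));
C = cov(X);                  % Sample covariance
dMah = sqrt((xs - xt)/C*(xs - xt)');

first = @(v) v(1);           % Hypothetical helper: pair (2,1) of a pdist result
err = abs([dCity - first(pdist(X,'cityblock')), ...
           dEuc  - first(pdist(X,'euclidean')), ...
           dMink - first(pdist(X,'minkowski',3)), ...
           dCheb - first(pdist(X,'chebychev')), ...
           dSeuc - first(pdist(X,'seuclidean')), ...
           dMah  - first(pdist(X,'mahalanobis'))]);
maxErr = max(err);
```

A maxErr near machine precision indicates that the hand-computed values match the built-in metrics for this pair.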
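Cosine distance
$$d_{st} = 1 - \frac{x_s x_t^{\top}}{\sqrt{(x_s x_s^{\top})(x_t x_t^{\top})}}$$
Correlation distance
$$d_{st} = 1 - \frac{(x_s - \bar{x}_s)(x_t - \bar{x}_t)^{\top}}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)^{\top}}\,\sqrt{(x_t - \bar{x}_t)(x_t - \bar{x}_t)^{\top}}}$$
where $\bar{x}_s = \frac{1}{n}\sum_{j} x_{sj}$ and $\bar{x}_t = \frac{1}{n}\sum_{j} x_{tj}$.
Hamming distance
$$d_{st} = \frac{\#(x_{sj} \neq x_{tj})}{n}$$
Jaccard distance
$$d_{st} = \frac{\#\left[(x_{sj} \neq x_{tj}) \cap \left((x_{sj} \neq 0) \cup (x_{tj} \neq 0)\right)\right]}{\#\left[(x_{sj} \neq 0) \cup (x_{tj} \neq 0)\right]}$$
Spearman distance
$$d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)^{\top}}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)^{\top}}\,\sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)^{\top}}}$$
where
$r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mj}$, as computed by tiedrank.
$r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $x_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$.
$\bar{r}_s = \frac{1}{n}\sum_{j} r_{sj}$.
$\bar{r}_t = \frac{1}{n}\sum_{j} r_{tj}$.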
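The cosine, correlation, Hamming, and Jaccard distances can likewise be computed by hand for one pair and compared with pdist. This is a sketch under the usual textbook definitions of these metrics. The small integer matrix A and all variable names are illustrative.

```matlab
rng('default') % For reproducibility
X = rand(6,4);
xs = X(1,:);
xt = X(2,:);

% Cosine distance: one minus the cosine of the angle between xs and xt
dCos = 1 - (xs*xt')/sqrt((xs*xs')*(xt*xt'));

% Correlation distance: the same computation on mean-centered vectors
xsc = xs - mean(xs);
xtc = xt - mean(xt);
dCorr = 1 - (xsc*xtc')/(sqrt(xsc*xsc')*sqrt(xtc*xtc'));

% Hamming and Jaccard distances on small integer data
A = [1 0 2 0; 1 3 0 0];
differ = A(1,:) ~= A(2,:);                % Coordinates that differ
nonzero = (A(1,:) ~= 0) | (A(2,:) ~= 0);  % Coordinates where either is nonzero
dHam = sum(differ)/size(A,2);             % 2 of 4 coordinates differ
dJac = sum(differ & nonzero)/sum(nonzero);% 2 of 3 nonzero coordinates differ

first = @(v) v(1); % Hypothetical helper: pair (2,1) of a pdist result
err = abs([dCos  - first(pdist(X,'cosine')), ...
           dCorr - first(pdist(X,'correlation')), ...
           dHam  - pdist(A,'hamming'), ...
           dJac  - pdist(A,'jaccard')]);
maxErr = max(err);
```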
Usage notes and limitations (C/C++ code generation):
The distance input argument value (Distance) must be a compile-time constant. For example, to use the Minkowski distance, include coder.Constant('Minkowski') in the -args value of codegen.
The distance input argument value (Distance) cannot be a custom distance function.
The generated code of pdist uses parfor (MATLAB Coder) to create loops that run in parallel on supported shared-memory multicore platforms. If your compiler does not support the Open Multiprocessing (OpenMP) application interface, or you disable the OpenMP library, MATLAB® Coder™ treats the parfor-loops as for-loops. To find supported compilers, see Supported Compilers. To disable the OpenMP library, set the EnableOpenMP property of the configuration object to false. For details, see coder.CodeConfig (MATLAB Coder).
For more information on code generation, see Introduction to Code Generation and General Code Generation Workflow.
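The constant-distance requirement and the OpenMP switch might look as follows in a build script. This is a sketch only: the entry-point function myPdist and its file are hypothetical, the input size (any number of rows, two columns) is an assumption, and running it requires MATLAB Coder and a supported compiler.

```matlab
% Contents of a hypothetical entry-point file, myPdist.m:
%
%   function D = myPdist(X,distance) %#codegen
%   D = pdist(X,distance);
%   end

% Input type: double matrix with a variable number of rows and 2 columns
Xtype = coder.typeof(0,[Inf 2],[1 0]);

% Fix the distance at compile time by passing it as a constant
codegen myPdist -args {Xtype,coder.Constant('minkowski')}

% Optionally generate code without OpenMP, so that the parfor-loops
% are treated as for-loops
cfg = coder.config('lib');
cfg.EnableOpenMP = false;
codegen -config cfg myPdist -args {Xtype,coder.Constant('minkowski')}
```

Because Distance is a constant in the generated code, a separate build is needed for each metric you want to call.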
Usage notes and limitations (GPU code generation):
The supported distance input argument values (Distance) for optimized CUDA code are 'euclidean', 'squaredeuclidean', 'seuclidean', 'cityblock', 'minkowski', 'chebychev', 'cosine', 'correlation', 'hamming', and 'jaccard'.
Distance cannot be a custom distance function.
Distance must be a compile-time constant.
Usage notes and limitations (GPU arrays):
The Distance argument must be specified as a character vector.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
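A GPU call might look like the sketch below. It assumes that Parallel Computing Toolbox is installed and that a supported GPU is available. The data size and variable names are illustrative.

```matlab
rng('default') % For reproducibility
X = rand(100,3);

Xg = gpuArray(X);           % Move the data to the GPU
Dg = pdist(Xg,'euclidean'); % Distance given as a character vector
D = gather(Dg);             % Copy the result back to host memory
```

Note that the metric is written with single quotes, 'euclidean', so that it is a character vector and not a string scalar.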
See also: cluster | clusterdata | cmdscale | cophenet | dendrogram | inconsistent | linkage | pdist2 | silhouette | squareform