Calculate pairwise distance between sequences
D = seqpdist(Seqs)
D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...)
D = seqpdist(Seqs, ...'Method', MethodValue, ...)
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...)
D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...)
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...)
D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...)
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...)
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...)
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...)
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...)
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...)
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...)
Seqs  Any of the following: a cell array of sequences, a vector of structures, or a matrix of sequences.

MethodValue  Character vector or string that specifies the method to calculate pairwise distances. Default is 'Jukes-Cantor'.
IndelsValue  Character vector or string that specifies how to treat sites with gaps. Default is 'score'.
OptArgsValue  Character vector or cell array that specifies one or more input arguments required or accepted by the distance method specified by the Method property.
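PairwiseAlignmentValue  Controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Choices are true or false. Default is true when the input sequences do not all have the same length, and false when they do.
Tip: If your input sequences are the same length, seqpdist assumes they are aligned. If they are not aligned, either align them before passing them to seqpdist (for example, using the multialign function) or set PairwiseAlignment to true.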

UseParallelValue  Controls whether the pairwise distances are calculated using parfor-loops. When true, and Parallel Computing Toolbox™ is installed and a parpool is open, computation occurs in parallel. If no parpool is open but automatic creation is enabled in the Parallel Preferences, the default pool opens automatically and computation occurs in parallel. If Parallel Computing Toolbox is installed but no parpool is open and automatic creation is disabled, or if Parallel Computing Toolbox is not installed, computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.
SquareFormValue  Controls the conversion of the output into a square matrix. Choices are true or false (default).
AlphabetValue  Character vector or string specifying the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).
ScoringMatrixValue  Either of the following:
Note: If you need to compile

ScaleValue  Positive value that specifies the scale factor used to return the score in arbitrary units. If the scoring matrix information also provides a scale factor, then both are used. 
GapOpenValue  Positive integer that specifies the penalty for opening a gap in the alignment. Default is 8.
ExtendGapValue  Positive integer that specifies the penalty for extending a gap. Default is equal to GapOpenValue.
D  Vector that contains biological distances between each pair of sequences stored in the M elements of Seqs.
D = seqpdist(Seqs) returns D, a vector containing biological distances between each pair of sequences stored in the M sequences of Seqs, a cell array of sequences, a vector of structures, or a matrix of sequences.
D is a 1-by-(M*(M-1)/2) row vector corresponding to the M*(M-1)/2 pairs of sequences in Seqs. The output D is arranged in the order ((2,1),(3,1),...,(M,1),(3,2),...,(M,2),...,(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).
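As a quick sanity check of the indexing formula, the following sketch enumerates the pairs in the documented order and confirms that the formula gives each pair's position. It does not call seqpdist; the value of M and the helper name pairIndex are illustrative assumptions.

```matlab
% Sketch: map a pair (I,J), I > J, to its position in the output vector D.
% M and pairIndex are illustrative; no toolbox is required.
M = 5;                                        % hypothetical number of sequences
pairIndex = @(I,J) (J-1)*(M - J/2) + I - J;   % formula from the text

% Enumerate pairs in the documented order (2,1),(3,1),...,(M,1),(3,2),...
k = 0;
for J = 1:M-1
    for I = J+1:M
        k = k + 1;
        assert(pairIndex(I,J) == k);          % formula agrees with the ordering
    end
end
fprintf('Checked %d pairs\n', k);             % k equals M*(M-1)/2
```

With a real result, D(pairIndex(I,J)) would then give the distance between sequences I and J.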
D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...) calls seqpdist with optional properties that use property name/property value pairs. Specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:
D = seqpdist(Seqs, ...'Method', MethodValue, ...) specifies a method to compute distances between each sequence pair. Choices are shown in the following tables.
Methods for Nucleotides and Amino Acids
Method  Description

p-distance  Proportion of sites at which the two sequences are different. p is close to 1 for poorly related sequences, and p is close to 0 for similar sequences. d = p
Jukes-Cantor (default)  Maximum likelihood estimate of the number of substitutions between two sequences. For nucleotides: d = -3/4 log(1 - p * 4/3). For amino acids: d = -19/20 log(1 - p * 20/19).
alignment-score  Distance (d) between two sequences (1, 2) is computed from the pairwise alignment score between the two sequences (score12), and the pairwise alignment score between each sequence and itself (score11, score22) as follows: d = (1 - score12/score11) * (1 - score12/score22). In the rare case where the score between the two sequences is greater than the score of a sequence aligned with itself, d = 0.
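To make the relationship between the p-distance and the Jukes-Cantor distance concrete, here is a minimal sketch that computes both for two short, already aligned nucleotide sequences. It assumes the standard Jukes-Cantor correction d = -b*log(1 - p/b) with b = 3/4 for nucleotides (b = 19/20 for amino acids). The sequences are made-up data, and the way seqpdist handles gaps or ambiguous symbols is not reproduced here.

```matlab
% Sketch: p-distance and Jukes-Cantor correction for two aligned sequences.
% s1 and s2 are made-up illustrative data of equal length, without gaps.
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';
p  = mean(s1 ~= s2);          % proportion of sites that differ
b  = 3/4;                     % 3/4 for nucleotides; 19/20 for amino acids
d  = -b * log(1 - p/b);       % Jukes-Cantor distance (natural logarithm)
fprintf('p = %.3f, d = %.4f\n', p, d);
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.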
Methods with No Scoring of Gaps (Nucleotides Only)
Method  Description

Tajima-Nei  Maximum likelihood estimate considering the background nucleotide frequencies. It can be computed from the input sequences or given by setting OptArgs to [gA gC gG gT]. gA, gC, gG, gT are scalar values for the nucleotide frequencies.
Kimura  Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution.
Tamura  Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. GC content can be computed from the input sequences or given by setting OptArgs to the proportion of GC content (scalar value from 0 to 1).
Hasegawa  Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the OptArgs property to [gA gC gG gT].
Nei-Tamura  Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the OptArgs property to [gA gC gG gT].
Methods with No Scoring of Gaps (Amino Acids Only)
Method  Description

Poisson  Assumes that the number of amino acid substitutions at each site has a Poisson distribution.
Gamma  Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter a. Set a using the OptArgs property. Default is 2.
You can also specify a user-defined distance function using @, for example, @distfun. The distance function must have the form:
function D = distfun(S1, S2, OptArgsValue)
The distfun function takes the following arguments:
S1, S2: Two sequences of the same length (nucleotide or amino acid).
OptArgsValue: Optional problem-dependent arguments.
The distfun function returns a scalar that represents the distance between S1 and S2.
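A user-defined distance function of this form might look like the sketch below, which returns the proportion of mismatching sites. The function body is an illustrative assumption, not a method shipped with the toolbox; varargin stands in for the optional OptArgsValue arguments so that the function works whether or not they are supplied.

```matlab
% Sketch of a user-defined distance function (save as distfun.m).
% Returns the proportion of sites at which S1 and S2 differ.
function D = distfun(S1, S2, varargin)
    % S1, S2: sequences of the same length; varargin: unused optional args
    D = sum(S1 ~= S2) / numel(S1);
end
```

It would then be passed to seqpdist as the Method value, for example seqpdist(Seqs, 'Method', @distfun).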
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...) specifies how to treat sites with gaps. Choices are:
score (default): Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
pairwise-del: For every pairwise comparison, ignores the sites with gaps.
complete-del: Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provided a multiple alignment as the input Seqs.
D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...) passes one or more arguments required or accepted by the distance method specified by the Method property. Use a character vector or cell array to pass one or more input arguments. For example, provide the nucleotide frequencies for the Tajima-Nei distance method, instead of computing them from the input sequences.
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...) controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Default is:
true: When the input sequences do not all have the same length.
false: When all input sequences have the same length.
If your input sequences have the same length, seqpdist assumes they are aligned. If they are not aligned, do one of the following:
Align the sequences before passing them to seqpdist, for example, using the multialign function.
Set PairwiseAlignment to true when using seqpdist.
D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...) specifies whether to use parfor-loops when calculating the pairwise distances. When true, and Parallel Computing Toolbox is installed and a parpool is open, computation occurs in parallel. If no parpool is open but automatic creation is enabled in the Parallel Preferences, the default pool opens automatically and computation occurs in parallel. If Parallel Computing Toolbox is installed but no parpool is open and automatic creation is disabled, or if Parallel Computing Toolbox is not installed, computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...) controls the conversion of the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Choices are true or false (default). Setting SquareForm to true is the same as using the squareform function in Statistics and Machine Learning Toolbox™.
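The conversion from the vector form to the square form can also be sketched by hand, which shows how the documented pair order maps onto the lower triangle. The vector D below is made-up data for M = 4, and this is only an illustration of the layout, not how seqpdist or squareform is implemented.

```matlab
% Sketch: expand a pairwise-distance vector into a symmetric square matrix.
% D is made-up data in the order (2,1),(3,1),(4,1),(3,2),(4,2),(4,3).
M = 4;
D = [1 2 3 4 5 6];
Dsq = zeros(M);
Dsq(tril(true(M), -1)) = D;   % fill the lower triangle column by column
Dsq = Dsq + Dsq.';            % mirror into the upper triangle
disp(Dsq)
```

Because MATLAB stores matrices column by column, the logical mask visits the lower-triangle entries in exactly the order used by the output vector.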
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...) specifies the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).
The remaining input properties are available when the Method property equals 'alignment-score' or the PairwiseAlignment property equals true.
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...) specifies the scoring matrix to use for the global pairwise alignment. Default is:
'NUC44': When AlphabetValue equals 'NT'.
'BLOSUM50': When AlphabetValue equals 'AA'.
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...) specifies the scale factor used to return the score in arbitrary units. Choices are any positive value. If the scoring matrix information also provides a scale factor, then both are used.
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...) specifies the penalty for opening a gap in the alignment. Choices are any positive integer. Default is 8.
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...) specifies the penalty for extending a gap in the alignment. Choices are any positive integer. Default is equal to GapOpenValue.
Read amino acid alignment data into a MATLAB structure.
seqs = fastaread('pf00002.fa');
For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.
dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250');
Force the realignment of each sequence pair, ignoring the provided multiple alignment.
dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);
Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.
dist = seqpdist(seqs,'Method','jukes-cantor',...
                'Indels','score',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);