Using HMMs for Profile Analysis of a Protein Family

Open Script

This example shows how HMM profiles are used to characterize protein families. Profile analysis is a key tool in bioinformatics. The common pairwise comparison methods are usually not sensitive and specific enough for analyzing distantly related sequences. In contrast, Hidden Markov Model (HMM) profiles provide a better alternative to relate a query sequence to a statistical description of a family of sequences. HMM profiles use a position-specific scoring system to capture information about the degree of conservation at various positions in the multiple alignment of these sequences. HMM profile analysis can be used for multiple sequence alignment, for database searching, to analyze sequence composition and pattern segmentation, and to predict protein structure and locate genes by predicting open reading frames.

Accessing PFAM Databases

Start this example with an already built HMM of a protein family. Retrieve the model for the well-known 7-fold transmembrane receptor from the Sanger Institute database. The PFAM key number is PF00002. Also retrieve the pre-aligned sequences used to train this model. More information about the PFAM database can be found at http://pfam.xfam.org/.

hmm_7tm = gethmmprof(2);
seed_seqs = gethmmalignment(2,'type','seed');

For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in public repositories is frequently curated and updated; therefore the results of this example might be slightly different when you use up-to-date datasets.

load('gpcrfam.mat','hmm_7tm','seed_seqs')

Models and alignments can also be stored and parsed in later directly from the files using the pfamhmmread, fastaread and multialignread functions.

Display the names and contents of the first three loaded sequences using the seqdisp command.

seqdisp(seed_seqs([1 2 3]),'row',70)

ans =

  23×81 char array

    '>VIPR2_HUMAN/123-371                                                             '
    '  1  YILVKAIYTL GYSVS.LMSL ATGSIILCLF .RKLHCTR.N YIHLNLFLSF ILRAISVLVK .DDVLYSSS.'
    ' 71  GTLHCPD... .......... .......... ....QPSSW. ..V.GCKLSL VFLQYCIMAN FFWLLVEGLY'
    '141  LHTLLVA... ...MLPP.RR CFLAYLLIGW GLPTVCIGAW TAAR...... .........L YLED......'
    '211  ......TGC. WDTN.DHSVP W....WVIRI PILISIIVNF VLFISIIRIL LQKLT..... .SPDVGGNDQ'
    '281  SQY....... .......... .......... ....KRLAKS TLLLIPLFGV HYMV..FAVF PISI...S.S'
    '351  KYQILFELCL GSF....QGL VV                                                    '
    '                                                                                 '
    '>VIPR_CARAU/100-348                                                              '
    '  1  FRSVKIGYTI GHSVS.LISL TTAIVILCMS .RKLHCTR.N YIHMHLFVSF ILKAIAVFVK .DAVLYDVIQ'
    ' 71  ESDNCS.... .......... .......... .....TASV. ....GCKAVI VFFQYCIMAS FFWLLVEGLY'
    '141  LHALLAVS.. ...FFSE.RK YFWWYILIGW GGPTIFIMAW SFAK...... .........A YFND......'
    '211  ......VGC. WDIIENSDLF W....WIIKT PILASILMNF ILFICIIRIL RQKIN..... .CPDIGRNES'
    '281  NQY....... .......... .......... ....SRLAKS TLLLIPLFGI NFII..FAFI PENI...K.T'
    '351  ELRLVFDLIL GSF....QGF VV                                                    '
    '                                                                                 '
    '>VIPR1_RAT/140-386                                                               '
    '  1  YNTVKTGYTI GYSLS.LASL LVAMAILSLF .RKLHCTR.N YIHMHLFMSF ILRATAVFIK .DMALFNSG.'
    ' 71  EIDHCS.... .......... .......... .....EASV. ....GCKAAV VFFQYCVMAN FFWLLVEGLY'
    '141  LYTLLAVS.. ...FFSE.RK YFWGYILIGW GVPSVFITIW TVVR...... .........I YFED......'
    '211  ......FGC. WDTI.INSSL W....WIIKA PILLSILVNF VLFICIIRIL VQKLR..... .PPDIGKNDS'
    '281  SPY....... .......... .......... ....SRLAKS TLLLIPLFGI HYVM..FAFF PDNF...K.A'
    '351  QVKMVFELVV GSF....QGF VV                                                    '

More information regarding how to store the profile HMM information in a MATLAB® structure is found in the help for hmmprofstruct.

Profile HMM Alignment

To test the profile HMM alignment tool you can re-align the sequences from the multiple alignment to the HMM model. First erase the periods in sequences used to format the downloaded aligned sequences. Doing this removes the alignment information from the sequences.

seqs = strrep({seed_seqs.Sequence},'.','');
names = {seed_seqs.Header};

Now align all the proteins to the HMM profile.

fprintf('Aligning sequences ')
scores = zeros(numel(seqs),1);
aligned_seqs = cell(numel(seqs),1);
for sn=1:numel(seqs)
    fprintf('.')
    [scores(sn),aligned_seqs{sn}]=hmmprofalign(hmm_7tm,seqs{sn});
end
fprintf('\n')

Aligning sequences ................................

Next, send the results to the system web browser to better explore the new multiple alignment. Columns marked with * at the bottom indicate when the model was in a "match" or "delete" state.

hmmprofmerge(aligned_seqs,names,scores)

You can also explore the alignment from the command window; the hmmprofmerge function with one output argument places the aligned sequences into a char array.

str = hmmprofmerge(aligned_seqs);
str(1:10,1:80)

ans =

  10×80 char array

    'YILVKAIYTLGYSVS.LMSLATGSIILCLF.RKLHCTR.NYIHLNLFLSFILRAISVLVK.DDVLYSSSG-TLH......'
    'FRSVKIGYTIGHSVS.LISLTTAIVILCMS.RKLHCTR.NYIHMHLFVSFILKAIAVFVK.DAVLYDVIQESDN......'
    'YNTVKTGYTIGYSLS.LASLLVAMAILSLF.RKLHCTR.NYIHMHLFMSFILRATAVFIK.DMALFNSG-EIDH......'
    'FGAIKTGYTIGHSLS.LISLTAAMIILCIF.RKLHCTR.NYIHMHLFMSFIMRAIAVFIK.DIVLFESG-ESDH......'
    'YLSVKALYTVGYSTS.LVTLTTAMVILCRF.RKLHCTR.NFIHMNLFVSFMLRAISVFIK.DWILYAEQD-SSH......'
    'FSTVKIIYTTGHSIS.IVALCVAIAILVAL.RRLHCPR.NYIHTQLFATFILKASAVFLK.DAAIFQGDS-TDH......'
    'LSTLKQLYTAGYATS.LISLITAVIIFTCF.RKFHCTR.NYIHINLFVSFILRATAVFIK.DAVLFSDET-QNH......'
    'FDRLGMIYTVGYSVS.LASLTVAVLILAYF.RRLHCTR.NYIHMHLFLSFMLRAVSIFVK.DAVLYSGATLDEA......'
    'FERLYVMYTVGYSIS.FGSLAVAILIIGYF.RRLHCTR.NYIHMHLFVSFMLRATSIFVK.DRVVHAHIGVKEL......'
    'ALNLFYLTIIGHGLS.IASLLISLGIFFYF.KSLSCQR.ITLHKNLFFSFVCNSVVTIIH.LTAVANNQALVAT......'

Looking for Similarity with Sequence Comparison

Having a profile HHM which describes this family has several advantages over plain sequence comparison. Suppose that you have a new oligonucleotide that you want to relate to the 7-transmembrane receptor family. For this example, get a protein sequence from NCBI and extract the aminoacid sequence.

mousegpcr = getgenpept('NP_783573');
Bai3 = mousegpcr.Sequence;

This sequence is also provided in the MAT-file gpcrfam.mat.

load('gpcrfam.mat','mousegpcr')
Bai3 = mousegpcr.Sequence;

seqdisp(Bai3,'row',70)

ans =

  22×82 char array

    '   1  MKAVRNLLIY IFSTYLLVMF GFNAAQDFWC STLVKGVIYG SYSVSEMFPK NFTNCTWTLE NPDPTKYSIY'
    '  71  LKFSKKDLSC SNFSLLAYQF DHFSHEKIKD LLRKNHSIMQ LCSSKNAFVF LQYDKNFIQI RRVFPTDFPG'
    ' 141  LQKKVEEDQK SFFEFLVLNK VSPSQFGCHV LCTWLESCLK SENGRTESCG IMYTKCTCPQ HLGEWGIDDQ'
    ' 211  SLVLLNNVVL PLNEQTEGCL TQELQTTQVC NLTREAKRPP KEEFGMMGDH TIKSQRPRSV HEKRVPQEQA'
    ' 281  DAAKFMAQTG ESGVEEWSQW SACSVTCGQG SQVRTRTCVS PYGTHCSGPL RESRVCNNTA LCPVHGVWEE'
    ' 351  WSPWSLCSFT CGRGQRTRTR SCTPPQYGGR PCEGPETHHK PCNIALCPVD GQWQEWSSWS HCSVTCSNGT'
    ' 421  QQRSRQCTAA AHGGSECRGP WAESRECYNP ECTANGQWNQ WGHWSGCSKS CDGGWERRMR TCQGAAVTGQ'
    ' 491  QCEGTGEEVR RCSEQRCPAP YEICPEDYLI SMVWKRTPAG DLAFNQCPLN ATGTTSRRCS LSLHGVASWE'
    ' 561  QPSFARCISN EYRHLQHSIK EHLAKGQRML AGDGMSQVTK TLLDLTQRKN FYAGDLLVSV EILRNVTDTF'
    ' 631  KRASYIPASD GVQNFFQIVS NLLDEENKEK WEDAQQIYPG SIELMQVIED FIHIVGMGMM DFQNSYLMTG'
    ' 701  NVVASIQKLP AASVLTDINF PMKGRKGMVD WARNSEDRVV IPKSIFTPVS SKELDESSVF VLGAVLYKNL'
    ' 771  DLILPTLRNY TVVNSKVIVV TIRPEPKTTD SFLEIELAHL ANGTLNPYCV LWDDSKSNES LGTWSTQGCK'
    ' 841  TVLTDASHTK CLCDRLSTFA ILAQQPREIV MESSGTPSVT LIVGSGLSCL ALITLAVVYA ALWRYIRSER'
    ' 911  SIILINFCLS IISSNILILV GQTQTHNKSI CTTTTAFLHF FFLASFCWVL TEAWQSYMAV TGKIRTRLIR'
    ' 981  KRFLCLGWGL PALVVATSVG FTRTKGYGTD HYCWLSLEGG LLYAFVGPAA AVVLVNMVIG ILVFNKLVSR'
    '1051  DGILDKKLKH RAGQMSEPHS GLTLKCAKCG VVSTTALSAT TASNAMASLW SSCVVLPLLA LTWMSAVLAM'
    '1121  TDKRSILFQI LFAVFDSLQG FVIVMVHCIL RREVQDAFRC RLRNCQDPIN ADSSSSFPNG HAQIMTDFEK'
    '1191  DVDIACRSVL HKDIGPCRAA TITGTLSRIS LNDDEEEKGT NPEGLSYSTL PGNVISKVII QQPTGLHMPM'
    '1261  SMNELSNPCL KKENTELRRT VYLCTDDNLR GADMDIVHPQ ERMMESDYIV MPRSSVSTQP SMKEESKMNI'
    '1331  GMETLPHERL LHYKVNPEFN MNPPVMDQFN MNLDQHLAPQ EHMQNLPFEP RTAVKNFMAS ELDDNVGLSR'
    '1401  SETGSTISMS SLERRKSRYS DLDFEKVMHT RKRHMELFQE LNQKFQTLDR FRDIPNTSSM ENPAPNKNPW'
    '1471  DTFKPPSEYQ HYTTINVLDT EAKDTLELRP AEWEKCLNLP LDVQEGDFQT EV                   '

First, using local alignment compare the new sequence to one of the sequences in the multiple alignment. For instance use the first sequence, in this case the human protein 'VIPR2'. The Smith-Waterman algorithm (swalign) can make use of scoring matrices. Scoring matrices can capture the probability of substitution of symbols. The sequences in this example are known to be only distantly related, so BLOSUM30 is a good choice for the scoring matrix.

VIPR2 = seqs{1};
[sc_aa_affine, alignment] = swalign(Bai3,VIPR2,'ScoringMatrix',...
                   'blosum30','gapopen',5,'extendgap',3,'showscore',true);

sc_aa_affine

sc_aa_affine =

   69.6000

By looking at the scoring space, apparently, both sequences are related. However, this relationship could not be inferred from a dot plot.

Bai3_aligned_region = strrep(alignment(1,:),'-','');
seqdotplot(VIPR2,Bai3_aligned_region,7,2)
ylabel('VIPR2'); xlabel('Bai3');

Is either of these two examples enough evidence to affirm that these sequences are related? One way to test this is to randomly create a fake sequence with the same distribution of amino acids and see how it aligns to the family. Notice that the score of the local alignment between the fake sequence and the VIPR2 protein is not significantly lower than the score of the alignment between the Bia3 and VIPR2 proteins. To ensure reproducibility of the results of this example, we reset the global random generator.

rng(0,'twister');
fakeSeq = randseq(1000,'FROMSTRUCTURE',aacount(VIPR2));
sc_fk_affine = swalign(fakeSeq,VIPR2,'ScoringMatrix','blosum30',...
                       'gapopen',5,'extendgap',3,'showscore',true)

sc_fk_affine =

   60.4000

In contrast, when you align both sequences to the family using the trained profile HMM, the score of aligning the target sequence to the family profile is significantly larger than the score of aligning the fake sequence.

sc_aa_hmm = hmmprofalign(hmm_7tm,Bai3)
sc_fk_hmm = hmmprofalign(hmm_7tm,fakeSeq)

sc_aa_hmm =

  214.5286


sc_fk_hmm =

  -49.1624

Exploring Profile HMM Alignment Options

Similarly to the swalign alignment function, when you use profile alignments you can visualize the scoring space using the showscore option to the hmmprofalign function.

Display Bai3 aligned to the 7tm_2 family.

hmmprofalign(hmm_7tm,Bai3,'showscore',true);
title('log-odds score for best path: Bai3');

Display the "fake" sequence aligned to the 7tm_2 family.

hmmprofalign(hmm_7tm,fakeSeq,'showscore',true);
title('log-odds score for best path: fake sequence');

Display Bai3 globally aligned to the 7tm_2 family.

[sc_aa_hmm,align,ptrs] = hmmprofalign(hmm_7tm,Bai3);
Bai3_hmmaligned_region = Bai3(min(ptrs):max(ptrs));
hmmprofalign(hmm_7tm,Bai3_hmmaligned_region,'showscore',true);
title('log-odds score for best path: Bai3 aligned globally');

Align tandemly repeated domains.

naa = numel(Bai3_hmmaligned_region);
repeats = randseq(1000,'FROMSTRUCTURE',aacount(Bai3)); %artificial example
repeats(200+(1:naa)) = Bai3_hmmaligned_region;
repeats(500+(1:naa)) = Bai3_hmmaligned_region;
repeats(700+(1:naa)) = Bai3_hmmaligned_region;
hmmprofalign(hmm_7tm,repeats,'showscore',true);
title('log-odds score for best path: Bai3 tandem repeats');

Searching for Fragment Domains

In MATLAB®, you can search for fragment domains by manually activating the B->M and M->E transition probabilities of the HMM model.

hmm_7tm_f = hmm_7tm;
hmm_7tm_f.BeginX(3:end)=.002;
hmm_7tm_f.MatchX(1:end-1,4)=.002;

Create a random sequence, or fragment model, with a small insertion of the Bai3 protein:

fragment = randseq(1000,'FROMSTRUCTURE',aacount(Bai3));
fragment(501:550) = Bai3_hmmaligned_region(101:150);

Try aligning the random sequence with the inserted peptide to both models, the global and fragment model:

hmmprofalign(hmm_7tm,fragment,'showscore',true);
title('log-odds score for best path: PF00002 global ');
hmmprofalign(hmm_7tm_f,fragment,'showscore',true);
title('log-odds score for best path: PF00002 fragment domains');

Exploring the Profile HMMs

The function showhmmprof is an interactive tool to explore the profile HMM. Try right and left mouse clicks over the model figures. There are three plots for each model: (1) the symbol emission probabilities in the Match states, (2) the symbol emission probabilities in the Insert states, and (3) the Transition probabilities.

showhmmprof(hmm_7tm,'scale','logodds')

An alternative method to explore a profile HMM is by creating a sequence logo from the multiple alignment. A sequence logo displays the frequency of bases found at each position within a given region, usually for a binding site. Using the hmm_7tm sequences, consider the portion of the Parathyroid hormone-related peptide receptor (precursor) found at the n-terminus of the PTRR_Human sequence. The seqlogo allows a quick visual comparison of how well this region is conserved across the 7tm family.

seqlogo(str,'startat',1,'endat',20,'alphabet','AA')

Profile Estimation

Profile HMMs can also be estimated from a multiple alignment. As new sequences related to the family are found, it is possible to re-estimate the model parameters.

hmm_7tm_new = hmmprofestimate(hmm_7tm,str)

hmm_7tm_new = 

  struct with fields:

                   Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.19'
       ModelDescription: '7 transmembrane receptor (Secretin family)'
            ModelLength: 243
               Alphabet: 'AA'
          MatchEmission: [243×20 double]
         InsertEmission: [243×20 double]
           NullEmission: [0.0768 0.0418 0.0396 0.0305 0.0201 … ] (1×20 double)
                 BeginX: [244×1 double]
                 MatchX: [242×4 double]
                InsertX: [242×2 double]
                DeleteX: [242×2 double]
        FlankingInsertX: [2×2 double]
                  LoopX: [2×2 double]
                  NullX: [2×1 double]

In case your sequences are not pre-aligned, you can also utilize the multialign function before estimating a new HMM profile. It is possible to refine the HMM profile by re-aligning the sequences to the model and re-estimating the model iteratively until you converge to a locally optimal model.

aligned_seqs  = multialign(seqs);
hmm_7tm_ma = hmmprofestimate(hmmprofstruct(270),aligned_seqs)
showhmmprof(hmm_7tm_ma,'scale','logodds')
close; close; % close insertion emission prob. and transition prob.

hmm_7tm_ma = 

  struct with fields:

        ModelLength: 270
           Alphabet: 'AA'
      MatchEmission: [270×20 double]
     InsertEmission: [270×20 double]
       NullEmission: [0.0500 0.0500 0.0500 0.0500 0.0500 … ] (1×20 double)
             BeginX: [271×1 double]
             MatchX: [269×4 double]
            InsertX: [269×2 double]
            DeleteX: [269×2 double]
    FlankingInsertX: [2×2 double]
              LoopX: [2×2 double]
              NullX: [2×1 double]

Align all sequences to the new model.

fprintf('Aligning sequences ')
scores = zeros(numel(seqs),1);
aligned_seqs = cell(numel(seqs),1);
for sn=1:numel(seqs)
    fprintf('.')
    [scores(sn),aligned_seqs{sn}]=hmmprofalign(hmm_7tm_ma,seqs{sn});
end
fprintf('\n')

str = hmmprofmerge(aligned_seqs);
str(1:10,1:80)

Aligning sequences ................................

ans =

  10×80 char array

    'YILVKAIYTLGYSVSLMSLATGSIILCLF.RKLHCTRNYIHLNLFLSFILRAISVLVKDDVLYSS---SGTLHCP-....'
    'FRSVKIGYTIGHSVSLISLTTAIVILCMS.RKLHCTRNYIHMHLFVSFILKAIAVFVKDAVLYDVIQ--ESDNCS-....'
    'YNTVKTGYTIGYSLSLASLLVAMAILSLF.RKLHCTRNYIHMHLFMSFILRATAVFIKDMALFNS---GEIDHCS-....'
    'FGAIKTGYTIGHSLSLISLTAAMIILCIF.RKLHCTRNYIHMHLFMSFIMRAIAVFIKDIVLFES---GESDHCH-....'
    'YLSVKALYTVGYSTSLVTLTTAMVILCRF.RKLHCTRNFIHMNLFVSFMLRAISVFIKDWILYAE---QDSSHCF-....'
    'FSTVKIIYTTGHSISIVALCVAIAILVAL.RRLHCPRNYIHTQLFATFILKASAVFLKDAAIFQG---DSTDHCS-....'
    'LSTLKQLYTAGYATSLISLITAVIIFTCF.RKFHCTRNYIHINLFVSFILRATAVFIKDAVLFSD---ETQNHCL-....'
    'FDRLGMIYTVGYSVSLASLTVAVLILAYF.RRLHCTRNYIHMHLFLSFMLRAVSIFVKDAVLYSGATLDEAERLTE....'
    'FERLYVMYTVGYSISFGSLAVAILIIGYF.RRLHCTRNYIHMHLFVSFMLRATSIFVKDRVVHAHIGVKELESLIM....'
    'ALNLFYLTIIGHGLSIASLLISLGIFFYF.KSLSCQRITLHKNLFFSFVCNSVVTIIHLTAVANNQALVATNP---....'

Show the aligned sequences in the browser.

hmmprofmerge(aligned_seqs,names,scores)