Hidden Markov model

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.


1. How Viterbi algorithm works?

In 1967 Andrew Viterbi formulated an algorithm to compute the optimal (most likely) state sequence in a hidden Markov model given a sequence of observed outputs.

The key idea is not finding all possible state sequences—just the most likely state sequence. At an arbitrary point before all of the observations have been seen we cannot know what is the most likely sequence. What is currently the most likely sequence might be eliminated completely when the next observations comes along if there is no transition that extends the previously most likely sequence with the observation. Consequently the currently most promising sequence might seem very unlikely if the latest transition has a very low probability or impossible if there is no such transition. In a situation where the only transition that could possibly account for the latest observation is from one state.

The necessity is to maintain a sequence that ends in that state so that is possible to extend it with the latest observation. Thus it is important to maintain for every state the most likely sequence that ends in that state. There may be many other ways of ending in that state but since we are looking for the most likely sequence there is nothing to be gained by maintaining anything but the one that has the highest probability. The Viterbi algorithm therefore works by maintaining for each state:

a. The most likely sequence of states that ends in that state, and
b. The probability of that sequence. Sometimes the probability will be zero.

2. What is forward-backward procedure?

The forward/backward probabilities are used in the HMM algorithm. The probabilities occur when evaluating by breaking the sequence probability into two pieces via use of a single hidden variable treated as a Bayesian parameter.

In forward probabilities we are not concerned with a sequence of states. Instead, we want to know the sum of the probabilities of all paths that account for the sequence of observations. In principle the final state can be any of the states. Let α1(t) be the probability P (w1, t-1,st=si). Clearly if we know αι (t) for all states i we can calculate the probabilities of αι (t+1) for all states i by simply extending computing and summing all extensions of the previous states. This is precisely what we do for the forward-probabilities algorithm. To get P (w1, t) therefore we simply sum the probabilities of all sequences ending in each of the states.

Backward probabilities work in exactly the same way as forward probabilities but we start from the final observation and work backwards which is useful for solving the training problem.

3. Give examples of usage of HMMs in bioinformatics.

1) Multiple Sequence Alignment

A frequent bioinformatics problem is to assess if a “new” sequence belongs to a family of homologous sequences, using a given multiple alignment of the sequences of the family. In this framework, a frequently used concept is the consensus sequence, i.e. the sequence having in each position the residue that, among those of the multiple alignment, occurs most frequently in that position. A related concept is that of a profile: instead of assigning to each position the most frequent residue, assigning a profile to a sequence amounts to assign to each position of the sequence a set of “scores”, each one to a residue that can occur in that position. More formally, the profile is a matrix, whose dimensions are the number of positions and the number of possible residues, and that for each position along the multiple alignment, assigns a score to each possible element in such position.

To solve the above mentioned problem, a first technique to judge the total score obtained by aligning (using a suitable choice of the score matrix and of the gap penalties) the new sequence to the consensus sequence obtained from the multiple alignment. A better technique is to judge the total score obtained by aligning (using the score matrix inside the profile and a suitable choice of the gap penalties) the new sequence to the profile obtained from the multiple alignment. An even better technique is to use a “profile HMM”, an implementation of the HMM which combines the idea of the profile (Gribskov 1987) with the idea of the HMM, and has been specially designed for dealing with multiple sequence alignment.

The major advantages of the profile HMM with respect to profile analysis are that in profile analysis the scores are given heuristically, while HMMs strive to use statistically consistent formulas, and that producing a good profile HMMs requires less skill and manual intervention than producing good standard profiles.

2) Genetic Mapping

One of the earliest applications of HMMs in bioinformatics has been the use of a nonstationary HMM for genetic mapping (Lander 1987), i.e. the estimation of some kind of distance between loci of known (or at least presumed) order along the chromosome.

Lander and Green initially obtained linkage maps providing experimental linkage data based on pedigrees; afterwards, in order to obtain radiation maps Slonim et al. (1997) used a nonstationary HMM starting from experimental radiation data based on gamma irradiation breaks.

3) Gene Finding

Many extensions to the original “pure” HMM have been developed for gene finding. Henderson et al. designed separate HMM modules, each one appropriate for a specific region of DNA. Kulp et al. and Burge et al. used a generalized HMM (GHMM or “hidden semiMarkov Model”) that allows more than one emission for each internal state. Durbin et al. introduced a model called “pair HMM”, which is like a standard HMM except that the emission consists in a pair of aligned sequences. Meyer and Durbin presented a new method that predicts the gene structure starting from two homologous DNA sequences, identifying the conserved subsequences. Pachter et al., following a similar idea, proposed a generalized pair HMM (GPHMM) that combines the GHMM and the pair HMM approaches, in order to improve the gene finding comparing orthologous sequences.

Pedersen and Hein introduced an evolutionary Hidden Markov Model (EHMM), based on a suitable coupling of an HMM and a set of evolutionary models based on a phylogenetic tree.

4) Secondary Structure Protein Prediction

HMMs are also employed to predict the secondary structure of a protein (i.e. the type of the local three-dimensional structure, usually alpha-helix, beta-sheet, or coil), an important step for predicting the global three-dimensional structure.

Asai et al. first used a simple HMM for the secondary structure prediction, while Goldman et al. in the HMM approach exploited some evolutionary information contained in protein sequence alignments.

5) Signal Peptide Prediction

Signal peptide prediction, i.e., the determination of the protein destination address contained in the peptide first tract is often of paramount importance both for diseases analysis and for drug design. Juncker et al. proposed a successful method, using a standard HMM, to predict lipoprotein signal peptides in Gram-negative eubacteria. The method was tested against a neural network model.

Schneider and Fechner provided a thorough review on the use of HMMs and of three other methods for the signal peptide prediction. Zhang and Wood created a profile HMM for signal peptide prediction, by means of a novel approach to the use of the HMMER package, together with a suitable tuning of some critical parameters.

6) Transmembrane Protein Prediction

A direct measurement of the complete 3D structure of a transmembrane protein is now feasible only in very few cases. On the other hand, for many practical purposes (such as drug design), it is already very useful to simply know at least the transmembrane protein topology and to this end a number of models are available to predict such topology. The secondary structure of the transmembrane tracts of most proteins is of alpha helix type; important exceptions are the so-called beta-barrels restricted to the outer membrane of gram-negative bacteria and of mitochondria.

Some researchers (Sonnhammer 1998, Tusnady 1998, Krogh 2001, Kahsay 2005) specialised the HMM architecture to predict the topology of helical transmembrane proteins. Kahsay et al. used unconventional pseudo counts that they obtained from a modified Dirichlet formula.

Other researchers (Martelli 2002, Liu 2003, Bagos 2004) specialised the HMM architecture to predict the topology of beta-barrel transmembrane proteins. Martelli et al. trained the model with the evolutionary information computed from multiple sequence alignment, while Bagos et al. adopted the conditional Maximum Likelihood proposed by Krogh.

7) Epitope Prediction

A preliminary step in inducing an immune response is the binding of a peptide to a Major Histocompatibility Complex (MHC) molecule, either of class I (as in viral infections or cancer) or of class II (as in bacterial infections). Since, however, most peptides cannot bind to an MHC molecule, it is important to predict which are the epitopes, i.e., the peptides that can bind to an MHC molecule. Mamitsuka advocated the use of supervised learning (for both class I and II) to improve the performance of HMMs. A different approach (Noguchi 2002, Kato 2003), to improve the performance of HMMs in predicting class I epitopes, combines HMM with a new algorithm, the “successive state splitting”(SSS) algorithm. Yu et al. provided a thorough comparative study of several methods, as binding motifs, binding matrices, hidden Markov models (HMM), or artificial neural networks (ANN).

Udaka et al. , in order to improve the prediction of the binding ability of a peptide to an MHC Class I molecule, used an iterative strategy for the “Query Learning Algorithm” (Abe 1998), which trains a set of HMMs by means of the socalled “Qbag” algorithm. More specifically the algorithm, within any iteration, indicates the peptides for which the prevision is more uncertain, so that their binding ability is measured, and then fed back, for learning, to the model.

8) Phylogenetic Analysis

Phylogenetic analysis aims to find probabilistic models of phylogeny and to obtain evolutionary trees of different organisms from a set of molecular sequences. Felsenstein and Churchill in order to account for the fact that evolution speed varies among positions along the sequences, allowed in their model for three possible speed values as hidden states of the HMM.The optimization is performed by minimizing a suitable objective function by means of Newton-Raphson method.

Thorne et al. proposed an evolutionary phylogeny model that uses an HMM to combine the primary structure with a known or estimated secondary structure. Siepel and Haussler provided a thorough review, and considered also HMMs of higher order. Husmeier used a generalisation of standard HMMs (the so-called factorial HMM), where emissions are due to the combined effect of two internal states belonging to two different hidden Markov chains, the first state representing the tree topology, and the second state the selective pressure.

Mitchinson treated simultaneously alignment and phylogeny by means of the so-called tree-HMM that combines a profile-HMM with a probabilistic model of phylogeny, enhancing it with a number of heuristic approximate algorithms. An iterative version with further enhancements, particularly successful in identifying distant homologs, is described by Qian and Goldstein


1) Felsenstein J, Churchill GA (1996). A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol; 13: 93-104.

2) Gribskov M, McLachlan AD, Eisenberg D (1987). Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci USA; 84: 4355-8.

3) Lander ES, Green P (1987). Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A; 84: 2363-7.

4) Slonim D, Kruglyak L, Stein L, Lander E (1997). Building human genome maps with radiation hybrids. J Comput Biol; 4: 487-504.

5) Meyer IM, Durbin R(2002). Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics; 18: 1309-18.

6) Krogh A, Mian IS, Haussler D (1994). A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res 22: 4768-78.

7) Henderson J, Salzberg S, Fasman KH(1997). Finding genes in DNA with a Hidden Markov Model. J Comput Biol; 4: 127-41.

8) Kulp D, Haussler D, Reese MG, Eeckman FH(1996). A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol; 4: 134-42.

9) Burge CB, Karlin S(1998). Finding the genes in genomic DNA. Curr Opin Struct Biol; 8: 346-54.

10) Pachter L, Alexandersson M, Cawley S(2001). Applications of generalized pair hidden Markov models to alignment and gene finding problems, In: Lengauer T, Sankoff D, Istrail S, Pevzner P, Waterman M Eds, Proceedings of the Fifth International Conference on Computational Biology (RECOMB). Vol. 1, ACM Press, New York,: 241-8.

11) Pedersen JS, Hein J(2003). Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics; 19: 219-27.

12) Schneider G, Fechner U. Advances in the prediction of protein targeting signals (2004). Proteomics; 4: 1571-80.

13) Asai K, Hayamizu S, Handa K (1993). Prediction of protein secondary structure by the hidden Markov model. Comput Appl Biosci; 9: 141-46.

14) Mitchison GJ 9(1999). A probabilistic treatment of phylogeny and sequence alignment. J Mol Evol; 49: 11-22.

15) Goldman N, Thorne JL, Jones DT (1996). Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J Mol Biol; 263: 196-208.

16) Juncker AS, Willenbrock H, Von Heijne G, Brunak S, Nielsen H, Krogh A (2003). Prediction of lipoprotein signal peptides in Gramnegative bacteria. Protein Sci; 12: 1652-62.

17) Zhang Z, Wood WI(2003). A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics; 19: 307-8.

18) Sonnhammer EL, von Heijne G, Krogh A(1998). A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol; 6: 175-82.

19) Durbin R, Eddy S, Krogh A, Mitchison G(1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.

20) Majoros WH, Pertea M, Salzberg SL(2005). Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics;21: 1782-8.

21) Lukashin AV, Borodovsky M (1998). GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res; 26: 1107-15.

22) Tusnady GE, Simon I(1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol; 283: 489-506.

23) Krogh A, Larsson B, von Heijne G, Sonnhammer EL(2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol; 305: 567-80.

24) Siepel A, Haussler D (2005). Phylogenetic hidden Markov models. In Nielsen R Ed, Statistical Methods in Molecular Evolution Springer, New York; 325-351.

25) Noguchi H, Kato R, Hanai T, Matsubara Y, Honda H, Brusic V, Kobayashi T(2002). Hidden Markov model-based prediction of antigenic peptides that interact with MHC class II molecules. J Biosci Bioeng; 94: 264-70.

26) Kato R, Noguchi H, Honda H, Kobayashi T (2003). Hidden Markov model-based approach as the first screening of binding peptides that interact with MHC class II molecules. Enzyme Microb Technol; 33: 472-81.

27) Husmeier D (2005). Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics; 21: II166-72.

28) Kahsay RY, Gao G, Liao L (2005). An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics; 21: 1853-8.

29) Yu K, Petrovsky N, Schonbach C, Koh JY, Brusic V (2002). Methods for prediction of peptide binding to MHC molecules: a comparative study. Mol Med; 8: 137-48.

30) Udaka K, Mamitsuka H, Nakaseko, Abe N (2002). Prediction of MHC Class I Binding Peptides by a Query Learning Algorithm Based on Hidden Markov Models. J Biol Phys; 28: 183-94.

31) Abe N, Mamitsuka H (1998). Query learning strategies using boosting and bagging. Proceedings of the Fifteenth International Conference on Machine Learning; 1-9.

32) Thorne JL, Goldman N, Jones DT (1996). Combining protein evolution and secondary structure. Mol Biol Evol; 13: 666-73.

33) Qian B, Goldstein RA (2004). Performance of an iterated T-HMM for homology detection. Bioinformatics; 20: 2175-80.

34) Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ (2004). A Hidden Markov Model method, capable of predicting and discriminating beta-barrel outer membrane proteins. BMC Bioinformatics; 5: 29.

35) Yoon B-J, Vaidyanathan PP (2004). HMM with auxiliary memory: a new tool for modeling RNA secondary structures. Proceedings of the Thirty-Eighth Asilomar Conference on Signals, Systems, and Computers, Monterey, CA,; 2: 1651-5.

36) Martelli PL, Fariselli P, Krogh A, Casadio R (2002). A sequence-profilebased HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics; 18: S46-53.

37) Liu Q, Zhu YS, Wang BH, Li YX (2003). A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins. Comput Biol Chem; 27: 69-76.

38) Krogh A (1997). Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol; 5: 179-86.