Sequence alignment: identifying regions of similarity


Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotides (for DNA and RNA) or amino acid residues (for proteins) are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Pairwise sequence alignment is the process of aligning two sequences and is the basis of database similarity searching; multiple sequence alignment is the method of finding similarity among more than two sequences.
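As a small illustration of the row-and-column representation described above, the following Python sketch stores two aligned sequences as equal-length rows, with "-" marking a gap, and counts the columns in which both rows carry the same residue. The sequences and the function name `column_identity` are made up for illustration.

```python
# Two aligned sequences stored as equal-length rows; '-' marks a gap.
# The sequences are invented for illustration, not real data.
ALIGNMENT = [
    "GATTACA-G",
    "GAT-ACATG",
]


def column_identity(rows):
    """Fraction of columns where every row has the same residue.

    Columns containing a gap never count as identical.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    matches = sum(
        1 for col in zip(*rows)
        if "-" not in col and len(set(col)) == 1
    )
    return matches / len(rows[0])


if __name__ == "__main__":
    for row in ALIGNMENT:
        print(row)
    print(f"identical columns: {column_identity(ALIGNMENT):.2f}")
```

Reading the alignment column by column, rather than sequence by sequence, is what makes identical or similar characters line up.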

Short and similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable, or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms that produce high-quality sequence alignments, and in adjusting the final results to reflect patterns that are difficult to represent algorithmically. Computational approaches to sequence alignment generally fall into two categories:

  • Global alignments: Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming.
  • Local alignments: These identify regions of similarity within long sequences that are often widely divergent overall. They are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The standard technique is the Smith-Waterman algorithm, which is also based on dynamic programming.
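The difference between the two categories can be seen in a score-only sketch. The Python function below is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty (the parameter values are arbitrary choices, not taken from any published scheme). With `local=False` it fills the matrix in the Needleman-Wunsch manner; with `local=True` it follows the Smith-Waterman rule of never letting a cell fall below zero and reporting the best cell anywhere in the matrix.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global (Needleman-Wunsch style), end-to-end score.
    local=True:  local (Smith-Waterman style), best-scoring region.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    x, y = "WWWACGTWWW", "YYYACGTYYY"   # shared motif, unrelated flanks
    print("global:", alignment_score(x, y))
    print("local: ", alignment_score(x, y, local=True))
```

For two sequences that share only a short motif, the global score is dragged down by the unrelated flanking residues, while the local score reflects the motif alone.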





A variety of computational algorithms have been applied to the sequence alignment problem. They include

  • slow but formally optimal methods such as dynamic programming, and
  • faster heuristic or probabilistic methods, which do not guarantee to find the best match.


If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations, and gaps as insertions or deletions introduced in one or both of the sequences. In sequence alignments of proteins, the degree of similarity between the amino acids occupying a particular position gives a rough measure of how conserved a particular region or sequence motif is. The absence of substitutions, or the presence of only very conservative substitutions, in a particular region of the sequence suggests that the region has structural or functional importance. Although DNA and RNA nucleotides are more similar to each other than amino acids are, the conservation of base pairs can indicate a similar functional or structural role.


Pairwise sequence alignment is a fundamental component of bioinformatics and is extremely successful in finding structural, functional, and evolutionary similarity between two given sequences. It provides an inference about the relatedness of two sequences. Homologous sequences are those that derive from a common ancestor, which usually shows up as close similarity. Sequence homology is a conclusion drawn from sequence comparison, whereas sequence similarity is the actual observation made after sequence alignment. In the case of protein sequences, pairwise alignments are often used to infer homology, although this approach can be rather imprecise.

There are two sequence alignment strategies:

  • local alignment, and
  • global alignment.

Three types of algorithm can perform both local and global alignments:

  • Dot matrix method: The dot matrix method is useful for visually identifying similar regions, but lacks the sophistication of the other two methods.
  • Dynamic programming: Dynamic programming is an exhaustive and quantitative method for finding optimal alignments. It works in three steps: (1) producing a sequence-versus-sequence matrix, (2) accumulating scores in the matrix, and (3) tracing back through the matrix in reverse order to identify the highest-scoring path. The scoring steps involve the use of scoring matrices and gap penalties.
  • Word method: Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment but are considerably more efficient than dynamic programming. They are especially useful in large-scale database searches, where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence.
Scoring matrices describe the statistical probabilities of one residue being substituted by another. PAM and BLOSUM are the most commonly used matrices for aligning protein sequences.

  • The PAM matrices rely on an evolutionary model and extrapolate probability values from alignments of close homologs to more divergent ones. In contrast, the BLOSUM matrices are derived directly from actual alignments.
  • The PAM and BLOSUM series numbers have opposite meanings. High-numbered PAM matrices are used to align divergent sequences and low-numbered PAM matrices to align closely related sequences, whereas high-numbered BLOSUM matrices suit closely related sequences and low-numbered BLOSUM matrices suit divergent ones.
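The three dynamic programming steps described above, together with a scoring scheme and a gap penalty, can be sketched in Python as follows. This is a minimal illustration, not a production aligner: the `substitution` function is a toy nucleotide scoring scheme with invented values (it is not PAM or BLOSUM), the gap penalty is linear, and the names `substitution` and `global_align` are hypothetical.

```python
PURINES = {"A", "G"}


def substitution(x, y):
    """Toy nucleotide scores (illustrative values, not PAM or BLOSUM)."""
    if x == y:
        return 2
    same_class = (x in PURINES) == (y in PURINES)
    return -1 if same_class else -2  # transition vs. transversion


def global_align(a, b, gap=-3):
    """Return (aligned_a, aligned_b, score) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: build the sequence-versus-sequence matrix.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: accumulate scores cell by cell.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Step 3: trace back from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1]
                + substitution(a[i - 1], b[j - 1])):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


if __name__ == "__main__":
    for line in global_align("GATTACA", "GATACA"):
        print(line)
```

When several paths tie for the best score, this traceback prefers the diagonal move, so it reports only one of the possibly many optimal alignments.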

In practice, if one is uncertain about which matrix to use, it is advisable to try several and keep the one that gives the best alignment result. The statistical significance of a pairwise sequence similarity can be tested using a randomization test, in which the scores obtained from shuffled sequences follow an extreme value distribution.
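A randomization test of this kind might look like the sketch below. It is a simplified illustration under several assumptions: the score is a basic Smith-Waterman style local score with arbitrary match, mismatch, and gap values, one sequence is shuffled a fixed number of times, and the significance is estimated by simple counting rather than by fitting an extreme value distribution to the shuffled scores. The function names are hypothetical.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only two matrix rows in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            cell = max(
                0,
                prev[j - 1] + (match if x == y else mismatch),
                prev[j] + gap,
                curr[j - 1] + gap,
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffle_test(a, b, trials=200, seed=0):
    """Return (observed_score, empirical_p_value) from shuffling b."""
    rng = random.Random(seed)       # fixed seed keeps the sketch repeatable
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)        # same composition, random order
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    print(shuffle_test("ACGTTGCAACGTAGCT", "ACGTTGCATCGTAGCT"))
```

Shuffling preserves the length and composition of the sequence, so a small estimated value suggests that the observed score depends on residue order rather than on composition alone.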


Multiple sequence alignment is an essential technique in many bioinformatics applications. Many algorithms have been developed to approach optimal alignments. Some programs are exhaustive in nature; others are heuristic. Because exhaustive programs are not feasible in most cases, heuristic programs are commonly used. These include:

  • Progressive approaches: The progressive approach is a stepwise assembly of multiple alignments according to pairwise alignment similarity. A prominent example is Clustal, which is characterized by adjustable scoring matrices and gap penalties as well as by the application of weighting schemes. The major shortcoming of the program is its "greediness", which refers to the fixation of errors made in the early steps of computation. To remedy the problem, T-Coffee and DbClustal have been developed; they combine both global and local alignments to generate more sensitive alignments. Another improvement on the traditional progressive approach is to use graph-based profiles, as in POA, which is designed to avoid the problem of error fixation. PRALINE is profile based and can restrict the alignment using protein structure information, and can thus be more accurate than Clustal.
  • Iterative approaches: The iterative approach works by repetitive refinement of suboptimal alignments.
  • Block-based approaches: Block-based methods focus on identifying regional similarities.
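The stepwise, similarity-driven assembly used by progressive approaches can be hinted at with a small Python sketch that decides only the order in which sequences would be added. It rests on loud simplifications: similarity is estimated from shared k-tuples (words) rather than from real pairwise alignments, and the order is a simple greedy chain rather than the guide tree that programs such as Clustal build. The function names and example sequences are invented.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a cheap stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def joining_order(seqs, k=3):
    """Most similar pair first, then repeatedly the nearest remaining sequence."""
    names = list(seqs)
    if len(names) < 2:
        return names
    sim = {
        frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    first = max(combinations(names, 2), key=lambda p: sim[frozenset(p)])
    order = list(first)
    remaining = [n for n in names if n not in order]
    while remaining:
        nxt = max(
            remaining,
            key=lambda n: max(sim[frozenset((n, m))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    example = {
        "s1": "ACGTACGTAA",
        "s2": "ACGTACGTAC",
        "s3": "TTTTGGGGCC",
    }
    print(joining_order(example))
```

Because each sequence is committed in this order and earlier decisions are never revisited, the sketch also shows where the "greediness" of the progressive approach comes from.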

It is important to keep in mind that no alignment program is absolutely guaranteed to find the correct alignment, especially when the number of sequences is large and the level of divergence is high. The alignments resulting from automated alignment programs often contain errors. The best approach is to perform the alignment using a combination of multiple alignment programs. The result can be further refined manually. Protein-encoding DNA sequences should preferably be aligned at the protein level first, after which the protein alignment can be converted back to a DNA alignment.
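Converting a protein alignment back to a DNA alignment amounts to replacing each residue with its codon and each gap with three gap characters. The Python sketch below illustrates the idea for a single sequence, assuming the coding DNA is in frame, contains exactly one codon per residue, and carries no stop codon; the function name `thread_codons` is hypothetical.

```python
def thread_codons(aligned_protein, dna):
    """Map one aligned protein row back onto its coding DNA.

    Each residue becomes its codon; each '-' becomes '---'.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for c in aligned_protein if c != "-")
    if len(dna) % 3 != 0 or len(codons) != residues:
        raise ValueError("DNA must contain exactly one codon per residue")
    next_codon = iter(codons)
    return "".join(
        "---" if c == "-" else next(next_codon) for c in aligned_protein
    )


if __name__ == "__main__":
    print(thread_codons("M-K", "ATGAAA"))
```

Applying the same function to every row of a protein alignment yields DNA rows of equal length in which gaps never split a codon.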


1.) Sequence alignments are useful in bioinformatics for

  • identifying sequence similarity,
  • producing phylogenetic trees, and
  • developing homology models of protein structures.

2.) Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.

3.) In database searches such as BLAST, statistical methods can determine the likelihood that a particular alignment between sequences or sequence regions arose by chance, given the size and composition of the database being searched. These values can vary depending on the search space.

4.) Biological applications:

  • Sequenced RNA, such as ESTs (expressed sequence tags) and full-length mRNAs, can be aligned to a sequenced genome to find where genes are located and to obtain information about RNA editing and splicing. Sequence alignment is also a part of genome assembly, where sequences are aligned to find overlaps so that long stretches of sequence can be formed.
  • Another use is Single Nucleotide Polymorphism (SNP) analysis, where sequences from different individuals are aligned to find single base pairs that are often different in a population.

5.) Non-biological applications:

  • The methods of sequence alignment have also found applications in other fields, most notably in natural language processing and in the social sciences. Techniques that generate the set of elements from which words are selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.
  • The field of historical and comparative linguistics has also used sequence alignment to partially automate the comparative method by which linguists reconstruct languages.
  • Business and marketing research has also adopted multiple sequence alignment techniques as a tool for analyzing series of purchases over time.


Thus, a sequence alignment is a scheme for writing one sequence on top of another such that the residues in one position are deemed to have a common evolutionary origin. If the same letter occurs in both sequences, the position is taken to be conserved in evolution. If the letters differ, it is assumed that the two derive from an ancestral letter, which could be one of the two or neither. Sequence alignment is therefore one of the most important methods for uncovering the structural, functional, and evolutionary history of sequences. Alignments also reveal consensus sequences and their relationships, with the help of the different algorithms and scoring matrices used in the method.
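The column-by-column reading described above, in which identical letters mark conserved positions and differing letters mark change, can be sketched in Python for any number of aligned rows. The sketch takes the most common letter in each column as the consensus and records the positions where the rows disagree, which is also the basic idea behind spotting candidate SNPs in sequences from different individuals. The function name and example rows are invented, and gaps are treated as ordinary characters for simplicity.

```python
from collections import Counter


def summarize_columns(rows):
    """Return (consensus, variable_positions) for equal-length aligned rows.

    Positions are 0-based. When letters tie in a column, the one seen
    first (in the earliest row) is used for the consensus.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    consensus, variable = [], []
    for pos, col in enumerate(zip(*rows)):
        counts = Counter(col)
        consensus.append(counts.most_common(1)[0][0])
        if len(counts) > 1:          # the rows disagree at this column
            variable.append(pos)
    return "".join(consensus), variable


if __name__ == "__main__":
    individuals = ["ACGT", "ACGA", "ACGT"]
    print(summarize_columns(individuals))
```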