Gene finding essentially means identifying stretches of genomic DNA sequence that are biologically functional (1). Computational gene finding is the branch concerned with developing computational methods to locate protein-coding regions and regulatory signals in unprocessed genomic DNA sequence data (2). Given an uncharacterized strand of DNA sequence, computational gene finding methods aim to address a few major questions (3).
Gone are the days when gene finding relied on labour-intensive wet-lab experiments on living cells and organisms, followed by difficult statistical analysis of homologous recombination to determine the order of genes on a given chromosome. All of this effort, and the data it generated, yielded only a rough GENETIC MAP giving the relative locations of known genes (4).
Advances in computational biology have made gene finding a much easier task. With more comprehensive genome sequencing tools and powerful computational resources, GENE FINDING AND FUNCTION PREDICTION is now largely a computational problem (5). Although function prediction still needs in vivo experiments for confirmation (6), in silico techniques are fast taking over.
In practice, though, gene finding is not merely a matter of locating a gene in a strand of DNA. This is especially true in eukaryotes, where coding regions (exons) and regulatory regions such as promoters are interspersed among many non-signal regions and non-coding regions (introns). Locating a gene therefore becomes a much tougher job, and the different functional components of a gene must be taken into consideration as well.
The three major approaches used for gene finding are:
a) Extrinsic or homology-based approach: This approach looks for sequence similarity between the target genome and already sequenced genomes available in databases. It uses local alignment tools (the Smith-Waterman algorithm, BLAST, FASTA) to search databases of mRNAs, protein products, cDNAs and ESTs (3). When the target is compared with ESTs from the same organism, regions corresponding to processed mRNAs can be identified (1). The higher the similarity, the more probable it is that the target sequence is a gene. However, this method is expensive, needs a large amount of data to be available already, and cannot predict genes whose proteins are not in the library; moreover, the boundaries of the similar regions are ill defined.
b) Ab-initio approach: In the simplest terms, ab-initio gene finding is based on statistical properties ascribed to a given genome (7). It searches for the signals of protein-coding genes. These can be real signals, that is, specific sequences indicating a gene downstream, such as a promoter sequence carrying a transcription factor binding site, or they can be statistical properties ascribed to specific sequences. For this reason the approach differs between the two major classes of organism, prokaryotes and eukaryotes. The ab-initio approach only predicts the possibility of a gene (4) and needs external evidence to establish its functionality.
c) Comparative genomics approach: With so many genomes already sequenced, this newer approach is fast gaining ground in gene finding. As the name suggests, it predicts genes by comparing the genomes of related species, relying on the evolutionary pressure that conserves functional genes: functional genes are assumed to accumulate fewer mutations because their function must be preserved. The approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN (4). It is also used for projecting annotations between genomes.
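As an illustration of the local alignment that underlies the homology-based approach in (a), the following is a minimal Python sketch of the Smith-Waterman algorithm with traceback. The function name and the scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for the example, not parameters given in this essay; tools such as BLAST and FASTA use faster heuristics and substitution matrices rather than this exhaustive form.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not tuned parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTACGTTT", "GGACGTACGGG")
print(score, top, bottom)
```

The higher the score relative to the length of the aligned region, the stronger the similarity; deciding where a similar region truly begins and ends is the ill-defined part noted in (a).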
Gene finding tools use statistical models such as Hidden Markov models (HMMs) and combine content measures with the signals associated with probable genes to predict the possibility of a gene. Some successful gene finders are GLIMMER and GeneMark for prokaryotes and GENSCAN and GeneID for eukaryotes. Success in eukaryotic gene finding and function prediction has nevertheless been limited. The major reason is the natural complexity of the genetic material of eukaryotes, which means that many associated factors must also be considered during computational gene prediction.
Prokaryotic gene finding and function prediction:
To start with, prokaryotes have relatively small genomes, around 0.5 to 10 million bp (A), and so their genes are densely packed.
Prokaryotic signals of a probable gene, such as alternative start codons, the Shine-Dalgarno (SD) sequence, stem-loop structures and the RNA polymerase binding site in the promoter, are well characterized, along with sequences such as the Pribnow box and transcription factor (TF) binding sites. This makes gene and function prediction systematic and relatively easy.
Protein-coding sequences in prokaryotes appear as continuous ORFs, a few hundred to a few thousand base pairs long.
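Since 3 (UAA, UGA, UAG) of the 64 possible triplet codons are stop codons, a stop codon is expected about once every 20-25 codons in random sequence (64/3 is roughly 21). An uninterrupted ORF much longer than this is therefore unlikely to arise by chance, which makes this probability very useful for prokaryotic gene prediction.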
There is a certain periodicity in the appearance of some conserved sequences in prokaryotic genomes, which also helps in gene prediction.
These features make prokaryotic gene prediction relatively simple, so algorithms with well-designed scoring schemes can predict genes and their functions quite accurately.
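The ORF and stop-codon arguments above can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: it scans only the forward strand, accepts only ATG as a start codon, and the names `find_orfs`, `prob_no_stop` and the `min_codons` parameter are hypothetical. The probability function assumes that all 64 codons are equally likely and independent, which real genomes do not satisfy exactly.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA forms of UAA, UAG, UGA
START_CODON = "ATG"                  # alternative start codons are ignored here


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop, including the start codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


def prob_no_stop(n_codons):
    """Chance that n random codons contain no stop codon (uniform-codon assumption)."""
    return (61 / 64) ** n_codons


print(find_orfs("CCATGAAACCCGGGTAACC"))
print(prob_no_stop(100))
```

Under the uniform-codon assumption, a run of 100 codons without a stop occurs by chance less than 1% of the time, which is why long ORFs are treated as candidate genes in prokaryotes.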
But prokaryotic gene prediction is not free from problems. The major ones are the overlapping nature of a few genes (B) and the difficulty of predicting translation start sites. Custom and advanced scoring matrices that consider all these parameters have nevertheless made prokaryotic gene prediction much more accurate. In addition, comparing the predictions against a BLAST search afterwards helps eliminate sequencing errors.
Predicting eukaryotic genes and gene function, by contrast, is considerably more complex, owing to the complexity of eukaryotic genomes.
Eukaryotic gene and gene function prediction:
Unlike prokaryotic genomes, eukaryotic genomes are huge, 10^7 to 10^10 bp, and their coding density is low (C).
The regulatory signals, splice sites and promoter sequences are complex and less well characterized. The most frequently used signals are CpG islands and poly-A tails (D).
The eukaryotic genome consists of coding regions called exons embedded in stretches of non-coding regions called introns. The two are similar in fundamental structure, but functionally introns do not code for any amino acids.
The haploid genome size varies widely among eukaryotes. Non-coding regions are one reason for this; a further paradox is the relative similarity in proteome size among very dissimilar eukaryotes (E/F).
Unlike in prokaryotes, eukaryotic mRNA is not translated while it is still being transcribed. The pre-mRNA first undergoes splicing, an editing process that removes the intronic regions and joins the protein-coding regions, which may have been divided among many smaller exons in the genome, into a compact functional message (G).
The nucleosomal organization of the eukaryotic genome sequesters some genetic information. Predicting such sequestered genes also requires specific algorithms for prediction and analysis (H).
Higher eukaryotes have more complex gene regulation processes, such as alternative splicing. These processes generate complexity in gene products from a limited number of genes, and they also need to be addressed during computational function prediction.
Some eukaryotic coding and non-coding sequences are more highly conserved than others. These are signals for gene and function prediction, but they are most often located far from the actual protein-coding genes.
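Of the eukaryotic signals listed above, the CpG island is the simplest to sketch in code. The usual measure is the ratio of observed to expected CpG dinucleotides, (number of CpG x length) / (number of C x number of G), together with the GC content of the window. The Python below is a minimal sketch; the function names are hypothetical, and the default thresholds (at least 200 bp, GC content above 50%, observed/expected ratio above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria, which are an assumption here and not taken from this essay.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply assumed threshold criteria to a single window of sequence."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


print(cpg_stats("CG" * 100))
print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```

In practice a gene finder would slide such a window along the genome and merge adjacent positive windows; that step is omitted here.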
Ab-initio methods depend on the genomic sequence and its associated information for prediction. They treat gene finding and function prediction as a problem of statistical probability, based on scores ascribed to specific associated signals.
Many pattern recognition methods are used to detect these signals. Earlier tools mostly used zero-order Markov models, whereas Hidden Markov models are now the most frequently used.
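To make the idea of a zero-order Markov model concrete, the sketch below scores a sequence by summing per-base log-odds between a "coding" base composition and a background composition, treating every base as independent of its neighbours. All frequencies are invented for illustration and are not trained on any genome; the function names and the window size are likewise hypothetical.

```python
import math

# Hypothetical base frequencies, invented for illustration only.
CODING = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(seq, model=CODING, null=BACKGROUND):
    """Sum of per-base log2 likelihood ratios under two zero-order models.

    Positive totals favour `model`, negative totals favour `null`.
    Bases other than A, C, G, T raise KeyError in this sketch.
    """
    return sum(math.log2(model[base] / null[base]) for base in seq.upper())


def sliding_scores(seq, window=6):
    """Score every window of the given length along the sequence."""
    return [log_odds(seq[i:i + window]) for i in range(len(seq) - window + 1)]


print(log_odds("GGCGCC"), log_odds("ATATTA"))
print(sliding_scores("ATATATGCGCGC"))
```

Because a zero-order model ignores context, it cannot capture codon structure or the order in which gene features occur, which is one motivation for the higher-order and hidden-state models mentioned above.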
Ab-initio methods allow the prediction of novel genes, that is, genes unlike any that are known. However, ab-initio techniques are generally not effective at detecting alternatively spliced forms or interleaved or overlapping genes. They also have difficulty identifying exon/intron boundaries accurately. Almost all ab-initio gene finders generate large numbers of false positive predictions, arising from overfitted models learned on small training sets. With these caveats in mind, we embark on the study of Hidden Markov models for finding genes in complex eukaryotic genomes.
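As a preview of how a Hidden Markov model labels a sequence, here is a minimal Python sketch of Viterbi decoding for a toy two-state model (N for non-coding, C for coding). Every probability in it is invented for illustration; real gene finders use many more states (exons, introns, splice sites, and so on) with parameters trained on annotated genomes. The function and table names are hypothetical.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy two-state model)

# All probabilities below are invented for illustration, not trained values.
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich background
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # GC-rich "coding" state
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of N and C."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log probabilities avoid numerical underflow on long sequences.
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = v[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("A" * 10 + "G" * 10))
```

With these toy parameters the decoder switches state only when a run of GC-rich bases is long enough to outweigh the cost of the transition, which is the same trade-off a full gene-finding HMM makes at exon/intron boundaries.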