This deterministic process, sometimes referred to as the classic version of the Central Dogma of Molecular Biology, dictates the course of gene expression and the path toward cell development. The regulation of this process is fundamental for organisms at all levels of complexity.
From adaptation to changing environments to the development of new functions, gene regulation provides several advantages to the cell. By choosing which genes to express at a particular time, an organism can increase its versatility and adaptability to its environment. Gene regulation also allows for more efficient use of resources. For example, if an organism is able to metabolize both lactose and glucose, but only glucose is available, the organism can shut down the machinery responsible for the uptake and metabolism of lactose, thereby conserving energy. Regulated gene expression allows for adaptation to stimuli in the environment, such as exposure to extreme heat or cold, and gives the organism protection from these potential hazards. In higher eukaryotes, gene regulation assists the differentiation process, which produces specific cell types such as liver, kidney, and lung cells. Control over gene expression can be exerted by the cell at the transcriptional, post-transcriptional, translational, and post-translational levels, as well as epigenetically through the addition of specific chemical groups to nucleic acids. This wide range of regulation types allows cells to construct very complicated and extensive regulation mechanisms.
Nucleic acid (NA)-binding proteins are involved in gene regulation at many levels, and understanding how they perform their tasks is crucial to understanding the regulatory process. DNA-binding proteins are an integral part of the gene regulation process and are also responsible for DNA repair. A particular subclass of these proteins, transcription factors, helps control both the initiation and the level of transcription, which is currently the most studied and best understood type of regulation. RNA-binding proteins are directly involved with activities such as protein synthesis, regulation of gene expression, and RNA splicing and editing, as well as other post-transcriptional activities. Both DNA- and RNA-binding proteins are essential to the replication of specific types of viruses. Explaining the behavior of these proteins requires an understanding of the interactions that occur between specific residues on proteins and the nucleic acids to which they bind. Prediction of NA-binding residues can provide practical assistance in the functional annotation of NA-binding proteins. Predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins. Identifying these residues is a complex and difficult problem. The characteristic traits of a residue that enable binding are largely unknown. Whether or not certain characteristics of its neighbors affect a residue's binding capability is also poorly understood, further complicating the issue. Because of this, machine learning has often been employed in an attempt to discover precisely which residues confer binding functionality.
Current computer modeling techniques fall into two categories: sequence- and structure-based.
Sequence-based methods commonly use evolutionary information collected from a statistical analysis of transcription factor binding sites. In general, these methods begin with the collection of DNA sequences known to bind specific transcription factors, followed by statistical analysis of these sequences. One common analysis protocol involves the creation of a multiple sequence alignment (MSA), which can then be used to build a position-specific scoring matrix (PSSM). Genomic sequence is subsequently scanned with the PSSM to identify additional binding sequences for this transcription factor. The main drawback of these methods is that they require many examples in order to gain useful information, and a sufficient number of examples is often unavailable. In contrast, structure-based methods use three-dimensional structures obtained by X-ray crystallography and NMR to describe protein-DNA interactions at the atomic level and can clearly explain binding recognition and binding affinity (80). An ever-increasing number of experimentally solved three-dimensional structures provides new models for protein-DNA interaction.
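The PSSM protocol described above can be illustrated with a minimal sketch. It assumes that the binding sites have already been aligned to equal length, that the background base frequencies are uniform, and that a pseudocount of 1 is acceptable; the function names (build_pssm, scan, best_hit) are hypothetical and are not taken from any of the cited tools.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=None):
    """Build a log-odds PSSM from aligned, equal-length binding sites.

    Assumption: a uniform background (0.25 per base) unless one is supplied.
    """
    background = background or {base: 0.25 for base in ALPHABET}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for base in ALPHABET:
            # Pseudocounts keep unseen bases from producing log(0).
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background[base])
        pssm.append(scores)
    return pssm

def scan(sequence, pssm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in ALPHABET for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pssm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pssm):
    """Return the (start, score) pair with the highest log-odds score."""
    return max(scan(sequence, pssm), key=lambda hit: hit[1])
```

With only a handful of example sites the column frequencies are dominated by the pseudocount, which is one concrete way to see the drawback noted above: the matrix becomes informative only when many binding sequences are available.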
These structures allow for a much clearer explanation of binding recognition and binding affinity. Much effort has been focused on the development of statistical potentials for protein structure and protein-protein interaction prediction (105; 186; 106; 104). In this work, we have applied the aforementioned technique to improve the recognition of protein-DNA interactions. Although there is no one-to-one recognition code between amino acids and DNA bases, protein-DNA interaction preferences have been found, and knowledge-based potentials imply that such interaction propensities can be used for protein-DNA binding prediction. A statistical potential, based on hydrogen bonding and hydrophobic contacts, has been used to describe interaction preferences between the 20 amino acids and 4 bases. This potential was subsequently used to evaluate the binding affinity of a protein-DNA complex (110). Only short-distance contacts were included in that computation. A distance-dependent statistical potential, which takes into account both long-range interactions up to 15 Å and multi-body interactions, has also been built (103). A grid-based potential, which has a different spatial partition than a distance-dependent potential, has been shown to perform quite well in evaluating binding affinity, predicting cooperative binding, and predicting binding specificity of protein-DNA complexes (132; 87). In these works, Cα atoms were used to represent the position of the amino acid, and two-body interactions (those involving one amino acid and one DNA base) were considered.
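To make the idea of a knowledge-based potential concrete, the sketch below derives a simple contact potential from observed amino acid-base pairs. It is a generic log-odds (inverse Boltzmann) construction, not the potential of any cited work: the expected count is assumed to be the product of the marginal frequencies, the pseudocount is an arbitrary choice, and the function names are hypothetical. A distance-dependent potential would, under the same assumptions, repeat the calculation separately for each distance bin.

```python
import math
from collections import Counter

def contact_potential(contacts, pseudocount=0.5):
    """Derive a log-odds contact potential from observed contacts.

    contacts: iterable of (amino_acid, base) pairs seen within some
    distance cutoff in known protein-DNA complexes.
    Returns a dict mapping (amino_acid, base) to an energy-like score;
    negative values indicate pairs seen more often than expected.
    """
    contacts = list(contacts)
    pair_counts = Counter(contacts)
    aa_counts = Counter(aa for aa, _ in contacts)
    base_counts = Counter(base for _, base in contacts)
    total = len(contacts)
    potential = {}
    for aa in aa_counts:
        for base in base_counts:
            # Expected count if amino acids and bases paired independently.
            expected = aa_counts[aa] * base_counts[base] / total
            observed = pair_counts[(aa, base)]
            ratio = (observed + pseudocount) / (expected + pseudocount)
            potential[(aa, base)] = -math.log(ratio)
    return potential

def score_complex(contacts, potential):
    """Sum the potential over the contacts of one complex (lower is better)."""
    return sum(potential.get(pair, 0.0) for pair in contacts)
```

Pairs that never occur in the training contacts are scored as zero here, which is a simplifying assumption; a real potential would need a principled treatment of sparse counts.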
CpG island (CpGI) methylation is a type of epigenetic modification of DNA. It occurs in eukaryotes and, in human DNA, is based on the addition of a methyl group to the base cytosine. The methyl group is added to the number 5 carbon of the pyrimidine ring of a cytosine that is followed by a guanine (CpG). This action is performed by enzymes called DNA methyltransferases, of which there are two types. DNA methyltransferase 1 (Dnmt1) performs maintenance methylation, during which methyl groups are added to cytosines of the daughter strand in exactly the same pattern as the parent strand. DNA methyltransferases 3a and 3b (Dnmt3a, Dnmt3b) are involved in de novo methylation, during which cytosines at new positions in the DNA strand are methylated, which changes the methylation pattern in a localized region of the DNA. This new methylation pattern is a reversible, heritable change that does not alter the DNA sequence itself or the genotype. This type of modification is described as epigenetic.
The generally accepted definition of CpGIs involves three criteria (65; 150). First, the GC content of a CpGI must be approximately 60% or greater. Secondly, the ratio of the observed to the expected number of CpG dinucleotides should be above approximately 0.65. Thirdly, the length of the island itself should fall within the range of 200 to 3000 nucleotides. This last characteristic is a matter of some debate and varies greatly within the literature. CpGIs account for approximately 1% of the human genome and were initially thought to be located primarily in the 5' region of expressed genes in higher eukaryotes. While CpGIs overlap with the promoter region of 50-60% of human genes, including most housekeeping genes, recent high-throughput and genome-wide studies indicate that as many as 50% of CpGIs occur inter- or intragenically in human DNA (77). Conventional wisdom says that in the human genome CpGs within promoter regions are usually unmethylated, while most CpG dinucleotides in non-coding regions are methylated (28) (there are several exceptions, including X-chromosome inactivation (47)). It has been shown recently, however, that more than one third of CpGs within transcribed regions of Arabidopsis DNA may be methylated (183).
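The three CpGI criteria given above can be expressed as a short check. The sketch assumes the commonly used definition of the observed/expected ratio, namely the number of CpG dinucleotides divided by (number of C × number of G / sequence length); the text does not spell this formula out, so it should be read as an assumption. The thresholds are the ones quoted above, and the function names are hypothetical.

```python
def cpg_island_stats(seq):
    """Return (gc_content, observed_expected_ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides on this strand
    gc_content = (c_count + g_count) / n
    # Assumed definition: expected CpG count = (#C * #G) / length.
    expected = c_count * g_count / n
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_content, obs_exp

def is_cpg_island(seq, min_gc=0.60, min_obs_exp=0.65,
                  min_len=200, max_len=3000):
    """Apply the GC content, observed/expected, and length criteria."""
    gc_content, obs_exp = cpg_island_stats(seq)
    return (min_len <= len(seq) <= max_len
            and gc_content >= min_gc
            and obs_exp > min_obs_exp)
```

Because the length bounds are passed as parameters, the debated third criterion can be varied without changing the rest of the check.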
Methylation of CpGIs can affect gene expression in the following manner (33): Methylation of an island upstream (5') of a gene by DNA methyltransferase signals the methyl-CpG-binding protein (MeCP) components of a histone deacetylase complex (HDAC). The action of deacetylation by HDAC of the lysine residues of histone restores their positive charge, which increases the affinity of histone for the negatively charged backbone of DNA. HDACs are associated with the formation of heterochromatin through this process. The formation of heterochromatin generally down-regulates DNA transcription by blocking access for transcription factors to the promoter region of genes.
A number of previous attempts have been made to predict CpGI methylation. Feltus et al. (59) used hierarchical clustering and seven DNA sequence patterns to distinguish between methylation-prone and methylation-resistant CpGIs with an accuracy of 82%. They later used a linear discriminant method and achieved 87% accuracy (58). Fang et al. (56) used an SVM with DNA sequence properties and transcription factor binding site (TFBS) information to reach a result of 85% accuracy, 77% sensitivity, and 86% specificity. Bhasin et al. (26) attempted to predict specific cytosines within CpGIs that were methylated. This group used an SVM classifier and sequence composition attributes and reported 75% accuracy, 72% sensitivity, and 77% specificity as their best results. The most comprehensive paper written on this subject is that of Bock et al. (29), in which they use 1184 DNA attributes to classify 132 CpGIs from human chromosome 21 as methylation-prone or methylation-resistant. In their study they find that specific sequence patterns, DNA repeats, and DNA structure have a high correlation to methylation. They report that approximately 66% of the attributes that are differentially distributed between methylated and unmethylated CpGIs are 4-mer DNA sequence patterns, both strand- and non-strand-specific. Additionally, other work has shown that certain DNA motifs may affect susceptibility to aberrant methylation in human fibroblast clones (59). Further evidence indicates that repetitive sequences can affect the methylation status of CpGIs (107; 178). These studies highlight the importance of sequence patterns and characteristics in CpGI methylation.
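Two ideas from the studies above lend themselves to a brief illustration: 4-mer sequence pattern attributes, and the accuracy, sensitivity, and specificity figures used to report results. The sketch below is not the feature pipeline of any cited study. It assumes that "non-strand-specific" means a k-mer and its reverse complement are counted as one pattern, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def kmer_frequencies(seq, k=4, strand_specific=True):
    """Relative frequencies of the k-mers in seq.

    With strand_specific=False, each k-mer is merged with its reverse
    complement (the alphabetically smaller of the two is used as the key).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not set(kmer) <= set("ACGT"):
            continue  # skip k-mers containing ambiguous bases
        if not strand_specific:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # fraction of positives recovered
        "specificity": tn / (tn + fp),  # fraction of negatives recovered
    }
```

Reporting sensitivity and specificity alongside accuracy matters when the two classes are unbalanced, since a classifier can reach high accuracy by favoring the larger class.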
Each of the abovementioned areas of study focuses on a particular component of the nucleic acid binding process. While important, none of these individual components gives us a broad understanding of how the system of transcriptional regulation operates as a whole. To achieve this, much focus has been placed in recent years on the study of gene regulation or transcriptional networks, with the goal of understanding how these networks evolve and function. These works include the creation of logical models, continuous models, and single-molecule methods (for a review see (83)). Statistical models of protein-protein interaction networks have also been created in order to study their evolution (22). This type of global approach is necessary to understand how the individual components function in concert to regulate gene expression.
Recently, the topics of network organization and robustness within biological networks have come into the spotlight. How do gene regulation and protein-protein interaction networks manage to stay robust to genetic changes, some of which are deleterious or mutational in nature? How does an organism maintain fitness? Parallels between gene regulation networks and communication networks have been drawn, focusing on failure and attack tolerance in biological networks (5). In 2005, Wagner discussed two main hypotheses on the mechanistic causes of robustness: redundancy and distributed robustness (162). He pointed out that while there is evidence that duplicate genes play an important role in an organism's tolerance to change, many systems, including metabolic and gene regulation networks, show no gene redundancy but are still able to tolerate removal of highly-connected nodes. Subsequently, the transcription factor network in yeast was shown to possess a distributed node degree distribution (18), which is thought to lend a level of robustness to the scale-free gene regulation network. However, the same co-regulation network architecture was not found in E. coli (17), which highlights the possibility that there are multiple pathways for achieving fitness. In 2007, Wagner and Wright observed that many regulator-target gene pairs in more than a dozen biological networks had intermediate regulators between them. These 'alternative routes' could be a possible cause of robustness (163).
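The notion of 'alternative routes' can be illustrated on a small directed network. The sketch below takes one simplified reading of the idea: for each direct regulator-target edge, it looks for a single intermediate regulator that is itself regulated by the regulator and also regulates the target. This is an assumption made for illustration and is not the procedure used in (163); longer indirect paths are ignored, and the function name is hypothetical.

```python
def pairs_with_alternative_routes(edges):
    """Find direct regulator-target pairs that also have a two-step route.

    edges: iterable of (regulator, target) pairs in a directed network.
    Returns a dict mapping each such pair to its set of intermediates.
    """
    edges = list(edges)
    targets = {}
    for regulator, target in edges:
        targets.setdefault(regulator, set()).add(target)
    result = {}
    for regulator, target in edges:
        intermediates = {
            mid for mid in targets.get(regulator, ())
            if mid not in (regulator, target)
            and target in targets.get(mid, ())
        }
        if intermediates:
            result[(regulator, target)] = intermediates
    return result
```

For the toy edge list [("A", "C"), ("A", "B"), ("B", "C"), ("C", "D")], only the pair ("A", "C") has an alternative route, through the intermediate "B".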
Several evolutionary models have been created to study robustness. Ciliberti et al. showed that robustness is an evolvable trait (39). Crombach and Hogeweg showed that the evolution of gene regulation dynamics can lead to increased efficiency of creating well-adapted offspring, while maintaining a robustness to most mutations (46). Krishnan showed that robustness evolves along with networks as an emergent property even in the absence of selective pressure
In recent years, researchers have produced a body of work that has given us a clearer (albeit more complicated) picture of how diseases such as cancer come to be, how they develop, and how they can be treated. The roles of genetics in the form of single nucleotide polymorphisms or SNPs (52), epigenetics (137), miRNA (161), copy number variation (90), chromatin structure (32), and protein biomarkers (64) in cancer have been shown. While great scientific advances have been made in the understanding and treatment of this disease in the last 50 years, we still do not have clear knowledge of the 'how' and 'why'. Given a set of initial conditions in the body defined by genetics, lifestyle, environmental exposure, etc., cancer begins and proceeds to develop through an evolutionary process. This results in all cancers having unique characteristics (71). Clearly, cancer is a multi-dimensional problem for which we have an enormous amount of data. Gaining knowledge from the existing data, however, is a nontrivial task. In the last several years, bioinformatics and computational biology have made a variety of contributions to disease analysis using existing data in an attempt to increase our understanding. Popular topics include the discovery, prediction, and analysis of genes related to disease (166), statistical analysis of SNPs and disease (100), the prediction and discovery of new drug targets (149), the development of the disease ontology and its application to the human genome (122; 123), the analysis of protein-protein interaction networks as they relate to disease (76), and many others. Of particular interest is the development of 'disease networks' (69; 181), which are in most cases bipartite graphs describing disease-disease as well as disease-gene relationships (see Figure 3). These edges may signify one or more shared genes, metabolic pathways, miRNAs, or a number of other data types.
The disease network reveals the interconnected nature of various diseases, which raises the question: can we gain new knowledge of a disease such as cancer by studying 'connected', non-cancer diseases? Many diseases and conditions, including obesity (92; 151), various infections (9), diabetes (164), and possibly even psychological stress (66), have been reported to have some relationship to cancer. Often the relationship type is unknown or only partially known, which indicates that a deeper understanding of these relationships is needed. Furthermore, these relationships have not yet been explored as a whole, but rather as individual links. Due to the complicated nature of many diseases, which may involve the failure of multiple levels of biological function including DNA repair, gene regulation, epigenetic and histone modifications, metabolic pathways, etc., elucidation of disease relationships requires a systematic and computational solution. Though there may be a plethora of data available to quantify this problem, the data itself does nothing for us unless we can turn that data into knowledge (a similar problem arose after the sequencing of the human genome). Merely combining sources of data is not sufficient. We must identify patterns within the data, which is manually infeasible when the number of data points and characteristics to be compared is large. Clearer understanding could be gained by finding, among all attributes of a relationship, those that characterize it most accurately. Several existing machine learning algorithms can help achieve this, including multiple instance learning (48), positive/unlabeled (PU) learning (102), Bayesian inference (27), the alternating decision tree, or ADTree (60), and others. In the past we have used the ADTree algorithm to analyze methylation patterns on DNA (37) and to predict DNA-binding proteins (95).
In both cases, this algorithm helped us to understand which characteristics have the most influence on determining the class to which the examples belonged. A similar method of 'rule discovery' is needed in the case of the disease network. Of course, the rules may be heavily dependent upon the types of disease in question (e.g., metabolic, infectious, autoimmune, and genetic). By analyzing a combination of available genetic, epigenetic, and proteomic data, one should be able to use these algorithms to enrich the edges between cancer and other diseases in the disease network, as well as to predict new edges within disease clusters.
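As a small illustration of the disease network discussed above, the sketch below starts from bipartite disease-gene associations and derives disease-disease edges, each annotated with the genes the two diseases share. It covers only the shared-gene edge type; edges based on metabolic pathways or miRNAs would, under the same assumption, be built from analogous association tables. The function name and the placeholder disease and gene labels in the usage note are hypothetical.

```python
from itertools import combinations

def project_disease_network(disease_genes):
    """Project disease-gene associations onto disease-disease edges.

    disease_genes: dict mapping a disease name to an iterable of
    associated gene identifiers (the bipartite graph).
    Returns a dict mapping (disease_a, disease_b) to the set of shared
    genes, for every pair that shares at least one gene.
    """
    edges = {}
    for d1, d2 in combinations(sorted(disease_genes), 2):
        shared = set(disease_genes[d1]) & set(disease_genes[d2])
        if shared:
            edges[(d1, d2)] = shared
    return edges
```

For example, with the placeholder input {"disease_A": {"GENE1", "GENE2"}, "disease_B": {"GENE2", "GENE3"}, "disease_C": {"GENE4"}}, the only edge produced links disease_A and disease_B through GENE2. The size of each shared set is one simple candidate attribute for the kind of edge enrichment described above.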
There have been several previous attempts at predicting gene-disease association. Özgür et al. used SVM, text mining, and several network metrics to rank potential disease genes (124). Gonzalez et al. predicted atherosclerosis-related genes by creating a PPI and adding weights to certain proteins based on text mining of PubMed abstracts (70). Xu et al. used a KNN classifier to predict hereditary disease genes from OMIM over the human PPI network with an overall accuracy of 76%. They found that these hereditary disease proteins tended to have a larger number of interactions and tended to have more shared neighbors than non-disease proteins (174). Wu et al. developed CIPHER, a software tool that prioritizes disease genes (171).