Prediction Of Replication Origins In Viral Genome Biology Essay


The herpesvirus family includes some well-known pathogenic viruses such as herpes simplex, varicella-zoster, Epstein-Barr, and cytomegalovirus. Some of these viruses are believed to pose major risks in immunosuppressive post-transplantation therapies, while others have been associated with life-threatening diseases such as AIDS and various cancers (Bennett et al., 2001; Biswas et al., 2001; Labrecque et al., 1995; Vital et al., 1995). Examples of the 80 or more herpesviruses that infect a variety of animal species are the herpes simplex viruses (HSV-1 and HSV-2), which cause cold sores and genital tract infections in humans; Epstein-Barr virus (EBV), associated with infectious mononucleosis and with two human cancers, Burkitt's lymphoma and nasopharyngeal carcinoma; cytomegalovirus (CMV), causing animal and human diseases, particularly in immunodeficient individuals; varicella-zoster virus (VZV), producing chickenpox in children and shingles in adults; and Marek's herpesvirus, which causes malignant avian lymphoma (see p. 709 in Kornberg and Baker, 1992). A number of the animal herpesviruses are of agricultural concern. For example, alcelaphine herpesvirus 1, indigenous to the wildebeest, is a causative agent of the fatal lymphoproliferative disease malignant catarrhal fever in cattle and deer (Bridgen, 1991).

As DNA replication is the central step in the reproduction of many viruses, understanding the molecular mechanisms involved in DNA replication is of great importance in developing strategies to control the growth and spread of viruses (Delecluse and Hammerschmidt, 2000). There are many differences between the replication processes of different viruses. These differences are imposed by the biology of the host cell and the nature of the virus genome. In general, virus replication involves three broad stages that are carried out by all types of virus: initiation of infection, replication and expression of the genome, and finally, the release of mature virions from the infected cell (Cann, 2001). The genomes of herpesviruses are linear double-stranded DNA molecules ranging in size from 120 kb to more than 200 kb. Although different herpesviruses display a wide variety of tissue tropisms and vary enormously in the way they interact with their natural hosts, one common feature of the biology of all herpesviruses is the mechanism by which they replicate their genomes during the lytic phase of the replication cycle. Lytic DNA replication in herpesviruses occurs by a mechanism that generates long head-to-tail concatemers of viral genomes, which are cleaved to unit-length genomes during encapsidation. This common mode of lytic DNA replication reflects a conserved set of viral genes encoding the basic components of the replication machinery. Another common feature of herpesvirus biology is the capacity to remain latent in the infected host; the mechanism by which the viral genomes are maintained during latency apparently differs considerably among the herpesviruses. The cells that harbor latent genomes differ from virus to virus, and perhaps the more intimate relationship between viral and host chromosomal replication during latency accounts for the greater diversity of mechanisms.

Replication origins are places on the DNA molecules where replication processes are initiated. For Epstein-Barr virus, one of these replication origins has been shown to associate with cellular proteins that regulate the initiation of DNA synthesis in human cells (Sugden, 2002). This suggests that these replication origins are also important locations for studying possible mechanisms of infecting human host cells. Knowledge of the locations of these replication origins will aid the development of antiviral agents that block viral DNA replication or interfere with the infection process. As replication origins in DNA are considered major sites for regulating genome replication in general, labor-intensive laboratory procedures have been used to search for replication origins in various organisms (e.g., see Hamzeh, 1990; Zhu, 1998; Newlon and Theis, 2002). With the increasing availability of genomic DNA sequence data, the value of using computational methods to predict likely locations of replication origins before the experimental search has already been recognized.

Herpesviruses utilize two different types of replication origins during lytic and latent infections. For each type of origin, the number and locations in the genome vary from one kind of herpesvirus to another. Most herpesviruses have one or two origins of each type. It has been documented in various studies (e.g. Masse et al., 1992; Hamzeh, 1990; Dykes, 1997) that the nucleotide sequences around the replication origins are specific to the individual viruses. Yet the presence of clusters of direct or inverted repetitive sequences, including palindromes, is quite common in both types of origins in many members of the herpesvirus family (see Chew et al., 2005 and references therein). For example, as early as 1977, Hirsch et al. reported that the origin of replication of herpes simplex virus DNA might be in the region of the repeats. The main origin of replication of the DNA of pseudorabies virus (a kind of herpesvirus) is located in the region of the molecule bearing the inverted repeats (Tamar et al., 1980).

1.2 Special word patterns to predict replication origins

Close inverted repeats are segments of double-stranded DNA consisting of two arms of similar sequence, one inverted and complemented relative to the other, around a central, usually nonhomologous spacer (see Fig. 2.a). On a single strand, we call the left arm of a close inverted repeat the left stem and the right arm the right stem. The last base of the left stem and the first base of the right stem are called the left center and the right center, respectively. The space between the two stems is called the gap. The number of bases in either stem is the stem length, denoted L, and the number of bases in the gap is the gap length (see Fig. 2.b). A DNA palindrome is a special case of a close inverted repeat: one without a gap. If a close inverted repeat contains a few insertions, deletions, or substitutions, we call it a close approximate inverted repeat.

5'...TCTTGT n...n ACAAGA...3'

Figure 2.a Close inverted repeats

5'...TCTTGT AGGC ACAAGA...3'
Left stem: TCTTGT; gap: AGGC; right stem: ACAAGA

Figure 2.b Nomenclature of close inverted repeats
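
The definitions above can be made concrete with a minimal sketch that finds exact close inverted repeats for a fixed stem length and a maximum gap length. It is only an illustration: it does not allow mismatches, so it does not find close approximate inverted repeats, and the function names and the 0-based coordinates are assumptions made for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the inverted and complemented copy of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_close_inverted_repeats(seq, stem_len, max_gap):
    """Return (left_stem_start, gap_len) pairs for exact inverted repeats.

    Positions are 0-based. Only exact stems are matched (no insertions,
    deletions, or substitutions), which is a simplifying assumption.
    """
    hits = []
    n = len(seq)
    for start in range(n - 2 * stem_len + 1):
        # The right stem must equal the reverse complement of the left stem.
        target = reverse_complement(seq[start:start + stem_len])
        for gap in range(max_gap + 1):
            right_start = start + stem_len + gap
            if right_start + stem_len > n:
                break
            if seq[right_start:right_start + stem_len] == target:
                hits.append((start, gap))
    return hits
```

For the sequence of Fig. 2.b, the left stem TCTTGT and the right stem ACAAGA are reverse complements separated by the 4-base gap AGGC; a palindrome such as GAATTC is reported with a gap length of 0.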

Close repeats are short repeats separated by a spacer of several nucleotides (Rocha and Blanchard, 2002) (see Fig. 3 for an illustration). The arrows under the sequence indicate the sequence that is repeated. Linguistically, an example of a direct repeat is "bye-bye". Close approximate repeats are close repeats that contain a few errors.

5'...TCTTGT n...n TCTTGT...3'

Figure 3 Close repeats

1.3 Current computational methods to predict replication origins

Predicting replication origins in bacterial, archaeal, and eukaryotic genomes

A number of computational methods have been developed for predicting replication origins in bacterial, archaeal, and eukaryotic genomes. These algorithms exploit certain characteristic sequence features found around the replication origins. For example, Lobry (1996) employs the GC skew plot to predict replication origins and termini in bacterial genomes. The skew (G-C)/(G+C), where G and C respectively stand for the percentages of guanine and cytosine bases in a sliding window, switches polarity in the vicinity of the replication origin and terminus, with the leading strand manifesting a positive skew. Salzberg et al. (1998) predict the replication origins for a number of bacterial and archaeal genomes by identifying 7-mers and/or 8-mers whose orientation is preferentially skewed around the replication origins. Zhang and Zhang (2005) use the Z-curve method successfully to identify several replication origins in bacterial and archaeal genomes. The Z-curve of any given DNA sequence is a three-dimensional curve that uniquely represents the sequence, so that unusual compositional features, such as those around a replication origin, can sometimes be recognized visually. Mackiewicz et al. (2004) apply three methods, based on DNA asymmetry, the distribution of DnaA boxes, and dnaA gene location, to identify putative replication origins in 112 bacterial chromosomes. They find that DNA asymmetry is the most universal method of putative oriC identification and that better prediction can be achieved when this method is applied together with the others. For eukaryotic DNA, Breier et al. (2004) develop the Oriscan algorithm to predict replication origins in the S. cerevisiae genome by searching for sequences similar to a training set of 26 known yeast origins pinpointed by site-directed mutagenesis. Oriscan uses both the origin recognition complex binding site and its flanking regions to identify candidates, and then ranks potential origins by their likelihood of activity. More recently, wavelet-based multi-scale analysis of DNA strand asymmetries has also been developed (Brodie of Brodie et al., 2005; Touchon et al., 2005) for detecting mammalian DNA replication origins. It is important to note that a prediction method designed for one kind of genome may not work well on others, because differences in DNA replication mechanisms between organisms naturally lead to differences in sequence features around their replication origins.
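
To illustrate the GC skew idea mentioned above, the following minimal sketch computes (G-C)/(G+C) in sliding windows and reports the windows at which the skew changes sign. The window size, step, and function names are hypothetical choices made for this example; it is not the procedure of Lobry (1996), only the basic calculation.

```python
def gc_skew(seq, window, step):
    """Return (window_start, skew) pairs with skew = (G - C) / (G + C)."""
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window without G or C carries no skew information.
        skew = (g - c) / (g + c) if g + c else 0.0
        skews.append((start, skew))
    return skews


def polarity_switches(skews):
    """Return the window starts at which the sign of the skew flips."""
    switches = []
    last_sign = 0
    for start, skew in skews:
        sign = (skew > 0) - (skew < 0)
        if sign and last_sign and sign != last_sign:
            switches.append(start)
        if sign:
            last_sign = sign
    return switches
```

On a toy sequence such as `"GGGA" * 5 + "CCCA" * 5` with a window of 8 and a step of 4, the skew is positive in the first half and negative in the second, and a single switch is reported where the composition changes.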

Predicting replication origins in viruses

Early studies have suggested that replication origins in herpesvirus genomes often lie around regions of the DNA sequence with unusually high concentrations of palindromes (Weller et al., 1985; Reisman et al., 1985; Masse et al., 1992). Based on these observations, Leung et al. (2005) suggest using scan statistics to locate statistically significant clusters of palindromes. Chew et al. (2005) have further developed palindrome-based scoring schemes for predicting known replication origins in complete herpesvirus genomes. They offer two more refined schemes for quantifying palindrome concentration to improve the sensitivity of the prediction. One of these schemes, the base weighted scheme (BWS1), scores each palindrome according to how rarely it is expected to occur in a nucleotide sequence generated randomly as a first-order Markov chain, and is found to be the most sensitive for the herpesviruses. Their approach is to slide a window of about 0.5% of the genome length over the sequence. As the window moves along, a score that reflects the concentration of palindromes in the window is calculated. The top-scoring windows are then selected as the predicted likely locations of replication origins. Because strong family-wide sequence similarities around the origins are lacking, these prediction methods, designed for relatively large and complex dsDNA viruses like the herpesviruses with genomes of over 100,000 base pairs, are based on various sequence statistics rather than on the actual nucleotide sequences around replication origins.

Lin et al. (2003) have observed that in some herpesvirus genomes, the nucleotide sequences around replication origins are richer in A and T bases. This is not surprising because DNA replication typically requires the binding of an assembly of enzymes (e.g., helicases) to locally unwind the DNA helical structure and pull apart the two complementary strands (see Chapter 1 in Kornberg and Baker, 1992; Bramhill and Kornberg, 1998). Higher AT content around the origins makes the two complementary DNA strands bond less strongly to each other, which makes it easier for the strands to be pulled apart and for replication to begin. Indeed, Segurado et al. (2003) have used a sliding window approach to find "islands" of high AT content within the Schizosaccharomyces pombe genome. They measure base composition using sliding windows of different sizes and find that the AT content of windows in regions containing replication origins is significantly higher than that of windows in regions without origins. Chew et al. (2005) have also reported using sliding windows of AT percentages on herpesviruses. Using the windows with the top AT percentages, they are able to predict 65% of the replication origins in their dataset. Moreover, this method successfully identified four origins not predicted by BWS1, suggesting that the AT percentage may be a useful sequence feature to incorporate into the set of replication origin prediction tools for dsDNA viruses.
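
A minimal sketch of the sliding-window AT percentage idea is given below. The window size, the step, and the number of windows retained are hypothetical parameters for illustration, not the settings used by Segurado et al. (2003) or Chew et al. (2005).

```python
def at_fraction(chunk):
    """Fraction of A and T bases in an uppercase DNA string."""
    return (chunk.count("A") + chunk.count("T")) / len(chunk)


def top_at_windows(seq, window, step, k):
    """Return the k windows with the highest AT fraction as (start, fraction)."""
    scored = [
        (at_fraction(seq[start:start + window]), start)
        for start in range(0, len(seq) - window + 1, step)
    ]
    # Highest AT fraction first; ties are broken by the smaller start position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(start, fraction) for fraction, start in scored[:k]]
```

For example, on `"GC" * 10 + "AT" * 5 + "GC" * 10` with a window of 10 and a step of 5, the single top window starts at position 20, where the AT-only block begins.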

Chew et al. (2007) propose a way to better quantify the variation of AT content in genome sequences. Their score-based excursion approach identifies segments of a genome with high AT concentration, called high-scoring segments, which are predicted as potential replication origin sites in herpesviruses. The excursion approach has the advantages of not requiring a preset sliding window size and of having rigorous criteria for evaluating the statistical significance of high-scoring segments. After checking the high-scoring segments against known replication origins in herpesviruses, they found that the AT excursion method successfully identifies several replication origins not previously predicted by the palindrome-based method. AT excursions would therefore be a valuable complement to the palindrome-based methods (Chew et al., 2005).
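
The general idea of a score-based excursion can be sketched as follows: each base receives a score, a running total is reset to zero whenever it would become negative, and a segment running from the start of an excursion to its peak is reported when the peak is high enough. The scores (+1 for A or T, -2 for any other base) and the threshold used here are assumptions chosen for illustration; they are not the values or the significance criteria of Chew et al. (2007).

```python
def at_excursions(seq, at_score=1.0, other_score=-2.0, threshold=5.0):
    """Return (start, peak_position, peak_value) for high-scoring AT segments.

    Positions are 0-based and inclusive. The scores and the threshold are
    illustrative assumptions, not values taken from the literature.
    """
    segments = []
    value = 0.0
    start = None
    peak, peak_pos = 0.0, None
    for i, base in enumerate(seq):
        step = at_score if base in ("A", "T") else other_score
        if value == 0.0 and step > 0:
            start = i  # a new excursion begins at this base
        value = max(0.0, value + step)
        if value > peak:
            peak, peak_pos = value, i
        if value == 0.0:
            # The excursion has ended; keep it if its peak was high enough.
            if start is not None and peak >= threshold:
                segments.append((start, peak_pos, peak))
            start, peak, peak_pos = None, 0.0, None
    if start is not None and peak >= threshold:
        segments.append((start, peak_pos, peak))
    return segments
```

Unlike the sliding-window calculation, no window size is fixed in advance: the length of each reported segment is determined by the sequence itself.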

Cruz-Cano et al. (2007) introduced a replication origin prediction scheme based on Support Vector Machines (SVMs). SVMs can learn from the characteristics of the known replication origins of the genomes in the training data set and then predict where the replication origins of a new genome are likely to be. They also explore whether features not related to repetitive structures can help with the classification. Their work addresses two ideas. First, new features, some related to repetitive structures and some not, are presented. Second, SVMs are used to extract the most important of these features for predicting the location of replication origins in herpesviruses and caudoviruses. The general construct of SVMs allows any number of selected sequence characteristics to be included as input variables. They train the SVMs with different numbers of input variables containing information about the known replication origin locations and the dinucleotide representation ratios. The SVMs provided sensitivity and positive predictive value superior or at least equal to those of the previous methods, sometimes requiring only a few simple variables as inputs.

2. Proposal

2.1 Sequence features fusion in predicting replication origins in viral genomes

Sequence features such as palindromes (Chew et al., 2005) and AT content (Chew et al., 2007) have been used separately to predict replication origins in herpesviruses. Empirical studies have suggested that close approximate repeats are also found near replication origins in viral genomes (Weller et al., 1985; Reisman et al., 1985; Masse et al., 1992; Hirsch, 1977; Lehman, 1999; Dutch, 1992). We propose to integrate these genomic sequence features to improve the prediction of likely locations of replication origins in herpesviruses. (See Fig. 4 for an illustration.)

Figure 4 Procedure of predicting.

2.1.1 Generalized linear model:

Our first attempt will be to use a generalized linear model to combine information on the spatial abundance of close approximate inverted repeats, close approximate repeats, and AT content to predict the likely locations of replication origins in herpesviruses. We outline the four key steps of this approach below.

Step 1: Locating close approximate inverted repeats and close approximate repeats

For each genomic sequence we run the palindrome program, which is part of EMBOSS [European Molecular Biology Open Software Suite (Rice et al., 2000)], to extract the positions and lengths of close approximate inverted repeats. Running this program requires a judicious choice of three parameters: (i) the minimum stem length, (ii) the maximum gap length between the two stems, and (iii) the number of mismatches allowed.

We use REPuter (Kurtz et al., 1999; Kurtz et al., 2001) to locate close approximate repeats. To assess the significance of a repeat, we compute its E-value, i.e. the number of repeats of the same length or longer, and with the same number of errors or fewer, that one would expect to find in a random DNA sequence of the same length. Three parameters should be chosen before running the program: the maximum number of computed repeats (the repeats with the smallest E-values are shown), the minimal repeat size (repeats must have at least the given length), and the error distance (repeats are searched up to the given Hamming or edit distance). After a REPuter run has finished, the result page gives an overview of the number, length, and location of the repeats in the uploaded sequence.

Step 2: Choosing scoring scheme for close approximate inverted repeats, and close approximate repeats

Each of these close approximate inverted repeats and close approximate repeats will be assigned a score according to a suitable scoring scheme chosen in this step.

We intend to extend the palindrome BWS scoring scheme described by Chew et al. (2005) to close approximate inverted repeats. The basic idea behind the palindrome scheme BWSm, where m denotes the order of the Markov chain used to model the genomic sequence, is that a higher score should be given to a rarer palindrome (that is, a palindrome that has a lower probability of occurring by chance). We assess the probability of occurrence of a particular palindrome based on Markov-type sequence models [see chapter 3 in Durbin (1998)]. We take the negative logarithm of the probability of a palindrome to give it a positive score, which is higher when its probability is lower. Furthermore, we need to introduce novel scoring schemes for close approximate repeats to quantify their spatial abundance in a genomic sequence.
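
The scoring idea for m = 1 can be sketched as follows: a first-order Markov chain is estimated from the genome itself, and a word is scored by the negative logarithm of its probability under that chain. This follows only the description given above; the way the chain is estimated (plain relative frequencies, no pseudocounts) and the function names are assumptions for illustration and may differ from the implementation of Chew et al. (2005).

```python
import math
from collections import Counter


def fit_markov1(seq):
    """Estimate base frequencies and first-order transition probabilities."""
    base_counts = Counter(seq)
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])  # bases that have a successor
    init = {b: c / len(seq) for b, c in base_counts.items()}
    trans = {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}
    return init, trans


def neg_log_prob_score(word, init, trans):
    """Score a word by -log of its probability under the fitted chain.

    Every word that actually occurs in the fitted sequence has a nonzero
    probability; a word with an unseen transition raises KeyError.
    """
    log_p = math.log(init[word[0]])
    for a, b in zip(word, word[1:]):
        log_p += math.log(trans[(a, b)])
    return -log_p
```

In a sequence made of repeated `AAAC` units, for instance, the word `AC` receives a higher score than `AA` because the transition from A to C is rarer than the transition from A to A.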

Step 3: Computing window scores for close approximate inverted repeats and close approximate repeats and AT content in each model

The entire genomic sequence is partitioned into N non-overlapping windows of equal size, with N to be suitably chosen. For each window, we total the scores of the close approximate inverted repeats occurring inside this window. Likewise, we total the scores of the close approximate repeats in this window. The window score for AT content is the percentage of A and T bases in this window. A close approximate inverted repeat or close approximate repeat is counted in a window if its left center lies in that window.
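
The window scores of Step 3 can be sketched as below. Each repeat is represented here as a hypothetical (left_center, score) pair with a 0-based left center; when the genome length is not a multiple of N, the windows differ in size by at most one base, which is an implementation choice of this sketch.

```python
from bisect import bisect_right


def window_bounds(genome_len, n_windows):
    """Boundaries of n_windows non-overlapping windows covering the genome."""
    return [w * genome_len // n_windows for w in range(n_windows + 1)]


def window_totals(bounds, features):
    """Sum feature scores per window; features are (left_center, score) pairs."""
    totals = [0.0] * (len(bounds) - 1)
    for left_center, score in features:
        # A feature belongs to the window that contains its left center.
        totals[bisect_right(bounds, left_center) - 1] += score
    return totals


def window_at_percent(seq, bounds):
    """Percentage of A and T bases in each window."""
    percents = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = seq[lo:hi]
        percents.append(100.0 * (chunk.count("A") + chunk.count("T")) / len(chunk))
    return percents
```

The same `window_totals` function can be applied once to the scored close approximate inverted repeats and once to the scored close approximate repeats, giving two of the three window scores; `window_at_percent` gives the third.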

Step 4: Building up a generalized linear model

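Based on these window scores, a generalized linear model is built to predict the locations of replication origins. Consider data obtained in N windows. Let $Y_i$ be the ith binary response variable defined by

$Y_i = 1$ if the ith window contains a replication origin, and $Y_i = 0$ otherwise.

Let $\pi_i = P(Y_i = 1)$. Associated with this response are the values $x_{i1}, \ldots, x_{i6}$ of 6 explanatory variables, where $x_{i1}$, $x_{i2}$, $x_{i3}$ denote the ith window scores of close approximate inverted repeats, close approximate repeats, and AT content respectively, and $x_{i4} = x_{i1} x_{i2}$, $x_{i5} = x_{i1} x_{i3}$, $x_{i6} = x_{i2} x_{i3}$ are their pairwise products. A generalized linear model can be specified as follows:

$g(\pi_i) = \beta_0 + \sum_{j=1}^{6} \beta_j x_{ij}$

where $g$ is the link function and $\beta_j$ ($j$ = 1, 2, …, 6) are regression coefficients to be estimated. We consider two link functions, the logistic link $g(\pi) = \log(\pi / (1 - \pi))$ and the log-linear link $g(\pi) = \log \pi$. We will use a subset of the viral genomes with known replication origins to fit the models, and then use the remaining viral genomes with known replication origins to validate the fitted models.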
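
As a rough illustration of Step 4 under the logistic link, the sketch below assumes that the three interaction variables are the pairwise products of the three window scores and that the model includes an intercept; neither detail is confirmed by the text. The fitting routine is plain gradient ascent on the Bernoulli log-likelihood with a hypothetical learning rate and step count, written only to keep the example self-contained; a statistical package would normally be used instead.

```python
import math


def design_row(inv_score, rep_score, at_score):
    """Return [1, x1, x2, x3, x1*x2, x1*x3, x2*x3] for one window."""
    x1, x2, x3 = inv_score, rep_score, at_score
    return [1.0, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3]


def sigmoid(z):
    """Numerically stable inverse of the logistic link."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(beta, row):
    """Estimated probability that a window contains a replication origin."""
    return sigmoid(sum(b * x for b, x in zip(beta, row)))


def fit_logistic(rows, labels, rate=0.1, steps=5000):
    """Fit the coefficients by gradient ascent (illustrative only)."""
    beta = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, y in zip(rows, labels):
            err = y - predict(beta, row)
            for k, x in enumerate(row):
                grad[k] += err * x
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta
```

In this sketch the window scores are assumed to be rescaled to comparable ranges (for example to the interval from 0 to 1) before fitting, since unscaled scores would make a fixed learning rate unreliable.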

There are certain advantages of this approach over the existing ones. First, it makes it possible to consider several sequence features simultaneously rather than one feature at a time as in existing approaches (Chew et al., 2005; Chew et al., 2007). Furthermore, the values of the regression coefficients $\beta_j$ ($j$ = 1, 2, 3) tell us the relative strengths of these explanatory variables, and the coefficients $\beta_j$ ($j$ = 4, 5, 6) represent the interactions among the sequence features.

2.1.2 Further potential refinements

It is generally known that a genomic sequence is far from homogeneous. How does one take into account the heterogeneity of the genomic sequence in these window scores? We propose to try different approaches, such as hidden Markov models (Churchill, 1989; Churchill, 1992), the change-point method (Braun and Muller, 1998), or the entropy method (Li, 2001), to segment the genomic sequence into homogeneous segments. We will then look into how to correct the window scores according to their local background.

2.2 Exploration of motifs around replication origins

2.2.1 In search of over-(or under-)represented motifs around replication origins

Over-(or under-)represented motifs in the context of larger sequences have been variously implicated in biological functions and mechanisms. Over-represented motifs are good candidates for playing a functional role in the sequences, while under-representation hints that if the motif were present, it would have a harmful dysregulatory effect (Frith et al., 2004). Leung et al. (1996) found that clusters of some of the most over- and under-represented 4- and 5-words in some herpesvirus genomes coincide with functional sites such as the origins of replication and regulatory signals of individual viruses. Thus we could identify over- or under-represented motifs around known replication origins and then search for similar motifs in other herpesvirus genomes. It is reasonable to conjecture that replication origins lie near such motifs.

One measure of over- or under-representation is as follows. Let $f_X$ denote the frequency of the nucleotide X (A, C, G, or T) in the sequence at hand, $f_{XY}$ the frequency of the dinucleotide XY, $f_{XYZ}$ the frequency of the trinucleotide XYZ, and so on. A standard assessment of dinucleotide bias is through an odds ratio calculation, namely $\rho_{XY} = f_{XY} / (f_X f_Y)$. For $\rho_{XY}$ sufficiently larger (smaller) than 1, the XY pair is considered over-(under-)represented compared with a random association of mononucleotides. Classical statistical tests of the contingency-table type can be used to assess significance (Hollander and Wolfe, 1973).
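
The dinucleotide odds ratio, that is, the frequency of XY divided by the product of the frequencies of X and Y, can be computed directly from counts, as in the minimal sketch below. Frequencies are taken on the given strand only, which is an assumption of this example; the function name is hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {XY: f_XY / (f_X * f_Y)} for every dinucleotide present in seq."""
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for pair, count in di.items():
        x, y = pair
        f_xy = count / (n - 1)  # there are n - 1 overlapping dinucleotides
        ratios[pair] = f_xy / ((mono[x] / n) * (mono[y] / n))
    return ratios
```

Values well above 1 point to over-represented dinucleotides and values well below 1 to under-represented ones; dinucleotides that never occur are simply absent from the returned dictionary.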

There are many approaches and tools for finding over- or under-represented motifs (Apostolico et al., 2000; Schbath, 1997; Alberto et al., 2004). VERBUMCULUS is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences (Alberto et al., 2004). The inner core of VERBUMCULUS rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words that make it possible to limit drastically, and a priori, the set of over- or under-represented candidate words of all lengths in a given sequence, thereby making it more feasible to detect and visualize such words in a fast and practically useful way. This tool can find under- and over-represented words within both a single genetic sequence and a family of sequences. Thus we can use this tool to search for over- or under-represented motifs near known replication origins in herpesvirus genome sequences.

2.2.2 Alignment approaches to find motifs around the replication origins

Some sequence motifs around the replication origins are essential for DNA replication (Dean et al., 1992). Hence identifying conserved motifs around known replication origins in viral genomes may help to predict the locations of replication origins. We can use alignment approaches to find such motifs.

Different alignment tools

Many alignment tools are available. For global alignment, whose goal is to align complete sequences, there are dynamic programming based and iterative methods. MSA (Multiple Sequence Alignment) and ClustalW (Thompson et al., 1997) are based on dynamic programming. Simulated annealing (MSASA, Kim et al., 1994) and genetic algorithms (SAGA, Notredame and Higgins, 1996, and RAGA) are iterative methods. For local alignment, whose aim is to locate similar regions shared by the sequences, there are Gibbs sampling based (GIBBS, Lawrence et al., 1993), hidden Markov model based (HMMER, Eddy, 1998), and EM based (MEME, Bailey and Elkan, 1995) tools.

Gibbs sampling

Lawrence et al. (1993) described a Gibbs sampling strategy for multiple alignment. We will use this strategy to perform multiple alignments of herpesvirus sequences.

We align fragments of herpesvirus genome sequences whose replication origin locations are known. Each fragment is 2 map units long (a map unit, abbreviated mu, is 1% of the genome length) and is centered on the center of a known replication origin. Our problem is to locate and describe the motif shared by all or most of the fragments, while its starting position in each fragment is unknown. We assume that the motif appears exactly once in each sequence fragment and has a fixed length.

The steps of the Gibbs sampling algorithm are as follows. Given N sequences, with the ith sequence of length Li, and a desired motif width W:

Step 1) Choose a starting position in each sequence at random: a1 in seq 1, a2 in seq 2, …, aN in sequence N .

Step 2) Choose a sequence at random from the set (say, seq 1).

Step 3) Make a weight matrix model of width W from the sites in all sequences except the one chosen in step 2.

Step 4) Assign a probability to each position in seq 1 using the weight matrix model constructed in step 3: p = {p1, p2, p3, …, pL1-W+1}

Step 5) Sample a starting position in seq 1 based on this probability distribution and set a1 to this new position.

Step 6) Choose a sequence at random from the set (say, seq 2).

Step 7) Make a weight matrix model of width W from the sites in all sequences except the one chosen in step 6.

Step 8) Assign a probability to each position in seq 2 using the weight matrix model constructed in step 7.

Step 9) Sample a starting position in seq 2 based on this probability distribution and set a2 to this new position.

Step 10) Repeat until convergence
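
The steps above can be sketched as a short site sampler. This is a simplified illustration rather than the implementation of Lawrence et al. (1993): the pseudocount, the use of a motif-to-background likelihood ratio as the sampling weight, the fixed number of iterations in place of a convergence test, and all function names are assumptions made for this example.

```python
import random

BASES = "ACGT"


def build_pwm(sites, width, pseudocount=1.0):
    """Column-wise base probabilities from a list of equal-length sites."""
    pwm = []
    for col in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def site_weight(segment, pwm, background):
    """Likelihood ratio of a segment under the motif model vs. background."""
    weight = 1.0
    for col, base in enumerate(segment):
        weight *= pwm[col][base] / background[base]
    return weight


def gibbs_sample(seqs, width, iterations=2000, seed=0):
    """Return one sampled motif start (0-based) per sequence."""
    rng = random.Random(seed)
    joined = "".join(seqs)
    background = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in BASES}
    # Step 1: choose a random starting position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Steps 2 and 6: choose one sequence at random to hold out.
        held = rng.randrange(len(seqs))
        # Steps 3 and 7: weight matrix from the sites in all other sequences.
        sites = [s[a:a + width]
                 for k, (s, a) in enumerate(zip(seqs, starts)) if k != held]
        pwm = build_pwm(sites, width)
        # Steps 4 and 8: weight every candidate start in the held-out sequence.
        seq = seqs[held]
        weights = [site_weight(seq[a:a + width], pwm, background)
                   for a in range(len(seq) - width + 1)]
        # Steps 5 and 9: sample a new start in proportion to those weights.
        starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the sampler is stochastic, it is usually run several times from different random starts, and the alignment with the best overall score is kept; that outer loop is omitted here.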

Using this Gibbs sampling method, we can locate and describe the motif around replication origins. We can then use a position weight matrix (PWM) to search for the same motif in other herpesvirus genome sequences and predict that replication origins lie near its occurrences. The PWM is a widely used way to represent DNA motifs. In a PWM, there is a matrix element for each possible base at every position in the motif; the score for any particular sequence is the sum of the matrix values for that sequence (Stormo, 2000). A PWM has 4 rows, one for each nucleotide, and its number of columns equals the length of the motif. The score of a given sequence $S = s_1 s_2 \ldots s_W$ matching the model can be calculated by $P(S) = \prod_{i=1}^{W} p_i(s_i)$. Here, $p_i(s_i)$ is the probability of base $s_i$ in the ith position of the motif, so the score is the probability that the sequence is generated by the PWM representing the motif. A simplified way to calculate the score is to take the logarithm, so that the above formula can be written as $\log P(S) = \sum_{i=1}^{W} \log p_i(s_i)$ (Wu et al., 2004).
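
Scanning a sequence with a PWM under the logarithmic score can be sketched as follows. The PWM is stored here as a list of per-position dictionaries (one column per motif position), which is a representation chosen for this example; all probabilities are assumed to be strictly positive, for instance because pseudocounts were used when the matrix was built.

```python
import math


def log_score(segment, pwm):
    """Sum of log probabilities of the observed base at each motif position."""
    return sum(math.log(pwm[i][base]) for i, base in enumerate(segment))


def best_match(seq, pwm):
    """Return (start, score) of the highest-scoring window; start is 0-based."""
    width = len(pwm)
    starts = range(len(seq) - width + 1)
    start = max(starts, key=lambda a: log_score(seq[a:a + width], pwm))
    return start, log_score(seq[start:start + width], pwm)
```

In practice one would keep every window whose score exceeds a chosen cutoff rather than only the single best match; the cutoff is a further parameter that would have to be calibrated on genomes with known origins.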

Predicting replication origins in other viral families similar to herpesviruses

After building models and assessing suitable approaches for predicting replication origins in herpesviruses, we will apply the best model to predict replication origins in other similar viral families, for instance poxviruses, baculoviruses, and iridoviruses, which are all enveloped double-stranded DNA viruses. Poxviruses are large double-stranded DNA viruses with genomes ranging from 130 to 380 kbp (Moss, 2001). The baculoviruses are a family of large rod-shaped viruses with circular double-stranded genomes ranging from 80 to 180 kbp. Iridoviridae is a family of viruses with dsDNA genomes ranging from 150 to 280 kbp. Since these viral families share similar physical characteristics, we hope that the approaches for predicting replication origins in herpesviruses can be extended to these viruses.