Codon usage


Codon usage is an intangible ruler that directs the process of gene expression from genetic code to protein synthesis. Previous studies have revealed that synonymous codon usage is not random (i.e., the codons for a particular amino acid are not used equally) and is driven by different factors in different organisms (Sharp and Li, 1986; Bulmer, 1991). Mutational bias and natural selection are hypothesized to be the two major factors that shape codon usage within and between the genomes of different organisms. Mutational bias is a global force that acts on all sequences and drives changes in base composition and the evolution of the whole genome. Substantial variation in base composition has been observed at a variety of taxonomic levels, even among genotypes within a species. For example, mutational bias is the dominant factor shaping codon usage in some prokaryotes with extremely GC-poor or GC-rich genomes (Ikemura, 1981; Sharp et al., 1986; Duret and Mouchiroud, 1999) and in highly complex organisms such as humans (Karlin and Mrazek, 1996). On the other hand, translational selection has been reported to be the main factor shaping codon usage in prokaryotes and in the Saccharomyces cerevisiae and Caenorhabditis elegans genomes (Sharp et al., 1993). In these species, highly expressed genes display stronger codon usage bias than weakly expressed genes, because selection favors optimal codons that maximize the rate and efficiency of translation.

Dinoflagellates, a large and diverse group of eukaryotic flagellated microalgae, are important primary producers and play an important role in the aquatic food chain in marine and freshwater environments (Rizzo, 2003). However, only about half of the dinoflagellate species are photosynthetic. Several species of these eukaryotic algae, such as Alexandrium spp., can produce toxins and cause harmful algal blooms that affect the marine ecosystem (Hegaret et al., 2007). Molecular systematic analysis has shown that dinoflagellates belong to the alveolates, forming a monophyletic group with ciliates and apicomplexans (Lidie et al., 2005). Dinoflagellates show a remarkable diversity of lifestyles, including free-living, heterotrophic, parasitic, endosymbiotic, and freshwater taxa (Uribe et al., 2008). In addition, these algae have several biological characteristics that are not found in any other organisms, such as liquid crystalline DNA in the chromosomes, an extranuclear spindle that passes through the nuclear envelope, and chromosomes that remain permanently condensed throughout cell proliferation (Hackett et al., 2005). Therefore, the study of the molecular mechanisms that regulate cell proliferation, toxicity, and photosynthesis in dinoflagellates is of critical importance for understanding their biology. However, there are still very few studies of the molecular biology of dinoflagellates.

Interestingly, several species of photosynthetic dinoflagellates that contain peridinin in their photosynthetic organelles (plastids) have a different organelle genome structure from other eukaryotic algae. Most of the chloroplast-specific genes of peridinin-containing dinoflagellates have been transferred to the cell nucleus, and only a small fraction of them are encoded on minicircles, small plasmid-like molecules that each contain one or two polypeptide genes (Laatsch et al., 2004). Here, chloroplast-specific genes are defined as genes that are homologous to genes of other photosynthetic eukaryotic algae, in which they are exclusively encoded in the chloroplast genome. Analysis of chloroplast-specific genes in dinoflagellates is important for understanding the evolution of these unusual eukaryotic algae.

Dinoflagellates typically possess large genomes, ranging from approximately 3 pg (Amphidinium carterae) to more than 200 pg (Lingulodinium polyedra) per haploid cell. This characteristic makes dinoflagellates unlikely candidates for whole-genome sequencing, whatever their importance in terms of evolution and ecology. One possible solution to this limitation is to exploit publicly available expressed sequence tags (ESTs) (Rispe et al., 2007). Single-pass sequencing of random cDNA clones to generate ESTs is a rapid, powerful, and cost-effective method for massive cloning of cDNA as well as for large-scale characterization of cDNA sequences to help decipher genome sequences. Although the sequences generated by this method are incomplete and prone to error, EST collections (about two-thirds of them of human origin) grow much faster than any other genomic sequence information and have become a powerful means of gene discovery (Kuo et al., 2004). Recently, several dinoflagellate EST libraries were generated to study gene expression, plastid evolution, and circadian-controlled genes (Hackett et al., 2005; Uribe et al., 2008; Tanikawa et al., 2004). These EST data are a useful resource for future investigations of coding genes at the whole-genome level.

It is known that the analysis of codon usage patterns has both practical and theoretical importance for understanding the basics of molecular biology. In the present study, we investigated synonymous codon usage in the coding sequences of the dinoflagellate Alexandrium tamarense using EST data. A. tamarense, which contains peridinin, was chosen as a model system because it is one of the best-studied dinoflagellates and can form toxic blooms and cause paralytic shellfish poisoning through the production of neurotoxins (Hackett et al., 2005). It should be noted that the A. tamarense haploid cell contains approximately 143 chromosomes, with a genome size of 200 pg/cell. The codon usage bias of this alga was investigated using multivariate statistical analysis, variance analysis, and correspondence analysis.

Materials and methods

EST data and clustering

A total of 10865 ESTs from the toxic dinoflagellate A. tamarense CCMP 1598 were retrieved from the NCBI dbEST database. These ESTs correspond to two different libraries (start and normalized cDNA libraries) created by the same authors (Hackett et al., 2005) from cultures of A. tamarense. Most of the ESTs obtained (10770 ESTs, 98.9%) were single sequence reads from the 3' end, including the 3' untranslated region (UTR). The accession numbers of these EST sequences were CF751845-CF751962, CF774560-CF774855, CF947047-CF948546, CK431405-CK433904, CK782344-CK786698, CV553867-CV555405 and CX769195-CX769771. The sequences were then clustered and assembled into putatively unique transcripts using UIcluster v.2-1.1 (Trivedi et al., 2002) with default parameters. A minimum match of 95% over 40 overlapping bases was required to assemble two sequences into one cluster.

Prediction of coding sequences in ESTs

The coding sequence fragments in the Alexandrium tamarense transcripts were then predicted using the recently developed FrameDP v.1.0.3 (Gouzy et al., 2009), a self-training integrative pipeline based on FrameD for predicting the position of the translated region in ESTs. Unlike FrameD, FrameDP can use blastx results to generate training sequences and then calculate a training matrix from them without human curation. The training sequences were automatically extracted from the A. tamarense unique transcripts using regions showing significant identity over a given length with the UniProt database, based on blastx hits filtered with E-value < 10^-4 and at least 40% identity over 100 amino acids. After running the detection procedure on the entire set of clustered ESTs (unigenes), a collection of putative CDSs of Alexandrium tamarense was generated. To minimize sampling errors, only CDSs with a length of at least 150 bp were used in this paper. In addition, CDSs were excluded from the analysis if they came from unigenes containing more than two CDSs.
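The two filtering rules described above (a minimum CDS length and a cap on the number of CDSs per unigene) can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_cds`, the `(unigene_id, cds_sequence)` input format, and the parameter names are hypothetical and are not part of FrameDP.

```python
from collections import defaultdict


def select_cds(predictions, min_length=150, max_cds_per_unigene=2):
    """Keep CDSs that pass the length and per-unigene filters.

    predictions: iterable of (unigene_id, cds_sequence) pairs (assumed format).
    """
    by_unigene = defaultdict(list)
    for unigene_id, cds in predictions:
        by_unigene[unigene_id].append(cds)

    kept = []
    for unigene_id, cds_list in by_unigene.items():
        # Drop every CDS of a unigene that carries too many predicted CDSs.
        if len(cds_list) > max_cds_per_unigene:
            continue
        # Keep only CDSs that reach the minimum length in base pairs.
        kept.extend((unigene_id, cds) for cds in cds_list if len(cds) >= min_length)
    return kept
```

Both thresholds are left as explicit parameters so that the cutoffs can be adjusted without touching the logic.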

Identification of plastid-associated and ribosomal genes

To identify plastid-associated and ribosomal genes, the gene sequences were compared with the NCBI non-redundant (nr) database using the blastx algorithm (Altschul et al., 1997). Queries were performed with the stand-alone BLAST program. The nr database and the BLAST program were downloaded from NCBI in October 2009. Sequences that showed significant similarity (E < 10^-6) to ribosomal genes were then confirmed by manual inspection. Plastid-associated genes were taken from the results of Hackett et al. (2004).
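As an illustration of the E-value screening step, the sketch below filters BLAST hits by threshold. It assumes the hits were saved in the common 12-column tab-separated layout (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the text does not state which output format was actually used, and the function name `filter_blast_hits` is hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-6, min_identity=0.0, min_length=0):
    """Yield (query, subject, evalue) for hits that pass the thresholds.

    Assumes 12-column tab-separated BLAST output; comment lines start with '#'.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])   # percent identity
        length = int(fields[3])       # alignment length
        evalue = float(fields[10])
        if evalue < max_evalue and identity >= min_identity and length >= min_length:
            yield query, subject, evalue
```

Hits that pass such a filter would still need the manual confirmation described above; the filter only narrows the candidate list.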

Codon usage analysis

The frequencies of the 59 codons that encode 18 amino acids (excluding Met (AUG), Trp (UGG), and the termination codons (UAA, UAG, UGA)) were determined for all the selected A. tamarense genes. Moreover, four codon usage indices, relative synonymous codon usage (RSCU), G+C content at the third position of synonymous codons (GC3s), effective number of codons (Nc), and codon adaptation index (CAI), were used to analyze codon usage in this research (Russell and Sharp, 2001).

RSCU is defined as the ratio of the observed frequency of a codon to the frequency expected if all the synonymous codons for that amino acid were used equally (Sharp et al., 1986). RSCU values are independent of amino acid usage and are very useful for comparing synonymous codon usage variation among genes. GC3s is defined as the frequency of G or C nucleotides at the third position of synonymously variable sense codons (i.e., excluding Met, Trp, and termination codons) (Peden, 1999). The Nc value measures the magnitude of codon bias for an individual CDS (Wright, 1990). The Nc value of a CDS ranges from 20, for a CDS with extreme bias that uses only one codon per amino acid, to 61, for a CDS with no bias that uses synonymous codons equally. The CAI value measures synonymous codon bias for an individual CDS by comparison with a set of known highly expressed genes, such as those encoding ribosomal proteins and elongation factors (Sharp and Li, 1987). The CAI value of a CDS lies between 0 and 1.0. Highly expressed genes are expected to have greater codon bias than lowly expressed genes and therefore higher CAI values. CAI values in this study were calculated using the codon usage of ribosomal protein genes as a reference.
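The definitions of RSCU, GC3s, and CAI given above can be made concrete with a short sketch. It is not CodonW's implementation. It assumes the standard genetic code, and in the CAI it replaces zero counts in the reference set with a pseudocount of 0.5, a choice made here for illustration. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code in TCAG order (an assumption for these genes).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families: stop codons and single-codon amino acids (Met, Trp) excluded.
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)
FAMILIES = {aa: codons for aa, codons in FAMILIES.items() if len(codons) > 1}


def count_codons(cds):
    """Count in-frame codons of a CDS; codons with ambiguous bases are skipped."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3)
                   if cds[i:i + 3] in CODON_TABLE)


def rscu(counts):
    """Observed count divided by the count expected under equal synonymous usage."""
    values = {}
    for codons in FAMILIES.values():
        total = sum(counts.get(c, 0) for c in codons)
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total if total else 0.0
    return values


def gc3s(counts):
    """Fraction of synonymous codons with G or C at the third position."""
    synonymous = [c for codons in FAMILIES.values() for c in codons]
    total = sum(counts.get(c, 0) for c in synonymous)
    gc = sum(counts.get(c, 0) for c in synonymous if c[2] in "GC")
    return gc / total if total else 0.0


def cai(counts, reference_counts, pseudocount=0.5):
    """Geometric mean of w = x / max(x in family), with x taken from the reference set."""
    log_sum, n = 0.0, 0
    for codons in FAMILIES.values():
        ref = [reference_counts.get(c, 0) or pseudocount for c in codons]
        best = max(ref)
        for c, x in zip(codons, ref):
            k = counts.get(c, 0)
            log_sum += k * log(x / best)
            n += k
    return exp(log_sum / n) if n else 0.0
```

For a toy CDS such as `"GGCGGCGGCGGG"`, `rscu(count_codons(...))` gives GGC a value of 3.0 and GGG a value of 1.0, since the four Gly codons would each be expected once under equal usage.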

Correspondence analysis on RSCU

The relationship between variables and samples can be analyzed by correspondence analysis (COA; Greenacre, 1984). COA is a multivariate method that aims to summarize data structures in a high-dimensional space by projecting them onto low-dimensional subspaces while losing as little information as possible (Semon et al., 2006). In this case, COA was carried out on the RSCU values of all the genes: each gene was placed in a 59-dimensional space (excluding Met, Trp, and stop codons) according to its usage of the 59 synonymous sense codons, and the method then determined the most prominent axes contributing to RSCU variation among the genes. This method has been widely used by many researchers to study the variation of codon usage among genes (Morton, 1999; Musto et al., 2001; Grocock and Sharp, 2002).
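To make the projection step more tangible, the following sketch computes the first COA axis of a small genes-by-codons table. It is a bare-bones textbook formulation and may differ in detail from CodonW's implementation: standardized residuals, then the leading singular vector found by power iteration. The function name `coa_first_axis` and the fixed iteration count are assumptions made for illustration.

```python
from math import sqrt


def coa_first_axis(table, iterations=500):
    """Return (row coordinates on axis 1, share of total inertia on axis 1).

    table: list of rows (genes) of non-negative values such as RSCU, one
    column per codon. Rows or columns summing to zero are not handled.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table))
    row_mass = [sum(row) / grand for row in table]
    col_mass = [sum(table[i][j] for i in range(n_rows)) / grand
                for j in range(n_cols)]

    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j).
    resid = [[(table[i][j] / grand - row_mass[i] * col_mass[j])
              / sqrt(row_mass[i] * col_mass[j])
              for j in range(n_cols)] for i in range(n_rows)]
    total_inertia = sum(x * x for row in resid for x in row)

    # Power iteration on resid^T resid converges to the leading right singular vector.
    v = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary starting vector
    for _ in range(iterations):
        u = [sum(r[j] * v[j] for j in range(n_cols)) for r in resid]
        w = [sum(resid[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # resid * v equals the singular value times the left singular vector.
    u = [sum(r[j] * v[j] for j in range(n_cols)) for r in resid]
    coords = [u[i] / sqrt(row_mass[i]) for i in range(n_rows)]
    return coords, sum(x * x for x in u) / total_inertia
```

The sign of an axis is arbitrary, so only the relative positions of genes along it are meaningful. A full analysis would extract further axes with a singular value decomposition rather than a single power iteration.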

Analysis tools

GC3s, Nc, RSCU, and CAI were calculated using the program CodonW 1.4.2 (Peden, 1999). COA was also performed using CodonW to examine the major trends in codon usage variation among genes. Correlation analysis was performed using Spearman's rank correlation method in SPSS version 12.0.
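For readers without access to SPSS, Spearman's rank correlation can be reproduced with a few lines of code. The sketch below computes the coefficient as the Pearson correlation of average ranks (ties share the mean of their ranks). It does not compute a p-value, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)
```

Because only ranks enter the calculation, the coefficient is insensitive to monotonic transformations of either variable, which suits indices such as CAI and GC3s that are bounded between 0 and 1.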

Results and Discussion

Analysis of Alexandrium tamarense ESTs

A total of 10865 A. tamarense ESTs, the subject of this research, were downloaded from GenBank. The average sequence length of this EST library after trimming poly(A) sequences was 558 bp. Most of the ESTs (88.9%, 9678 ESTs) were longer than 400 bp. The initial ESTs from the A. tamarense EST libraries were grouped into 6538 unigenes, of which 2037 contained two or more ESTs and 4501 were singletons. The distributions of cluster size and frequency in the EST libraries are shown in Fig. 1. Among the 6538 A. tamarense unigenes, 1756 (26.9%) had significant similarity to a protein in the nr database using blastx with a cutoff of E-value < 1 × 10^-5, as already discussed elsewhere (Hackett et al., 2005). The remaining 73.1% of unigenes, which had low similarity scores, may correspond to novel proteins or to non-coding sequences.

Reconstruction of partial coding sequences in ESTs

The coding frame and putative coding sequence of the Alexandrium tamarense unigenes were then determined with FrameDP (Gouzy et al., 2009). FrameDP is based on FrameD (Schiex et al., 2003), which identifies open reading frames using extended interpolated Markov models (IMMs) and can correct frameshifts. Among the 6538 A. tamarense unigenes, FrameDP predicted at least one CDS in 4704 (71.9%). Only 2115 (44.9%) of these FrameDP-predicted unigenes had a hit in the nr database using blastx with a cutoff of E-value < 1 × 10^-5. This result suggests that FrameDP can efficiently extract CDSs from ESTs. To increase the quality of the CDS set, sequences containing more than one CDS and CDSs shorter than 150 bp were excluded from the analysis. Finally, a total of 1735 genes met these criteria and were selected for the following analyses.

Codon usage patterns

The overall codon usage of A. tamarense is displayed in Table 2. Since A. tamarense has a high G+C content (62.0%), C- or G-ending codons are expected to predominate in its coding regions. For all 18 degenerately encoded amino acids in Table 2, the preferentially used codons were C- or G-ending, and none was A- or U-ending. Therefore, compositional mutational bias is an important factor in shaping the codon usage of the A. tamarense genome. However, the overall RSCU values hide some heterogeneity in codon usage among genes. For example, among the Gly codons in Table 2, GGC is about 3.1 times as frequent as GGG.

To further analyze the degree of heterogeneity in codon usage in the A. tamarense genome, the GC3s and Nc values of all the genes were calculated to determine whether codon usage varies among A. tamarense genes (Fig. 1). This method (the Nc plot, a plot of Nc versus GC3s) has been used effectively by many researchers to explore codon usage variation among genes in many different organisms (Wright, 1990). If the synonymous codon bias of a gene is subject only to compositional constraints, the gene should fall on or just below the expected curve in the Nc plot. However, if its codon bias is subject to translational selection, the gene should fall considerably below the expected curve.
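The expected curve in the Nc plot follows from Wright's (1990) approximation, Nc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3s value of the gene. The sketch below evaluates this curve and the vertical distance of a gene from it. The function names are hypothetical, and no threshold for "considerably below" is implied.

```python
def expected_nc(gc3s):
    """Expected Nc when codon choice depends only on GC3s (Wright, 1990)."""
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def distance_below_curve(nc, gc3s):
    """Positive when the observed Nc lies below the expected curve."""
    return expected_nc(gc3s) - nc
```

For instance, at GC3s = 0.5 the expected value is 60.5, so a gene with an observed Nc of 40 at that composition would lie 20.5 units below the curve.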

Correspondence analysis

In order to understand the cause of this variation, we conducted a COA of the RSCU values of all the genes of each genome. This statistical approach has been used extensively to characterize the major trends in codon usage among the genes of several species (Grocock and Sharp, 2002; Zavala et al., 2002; Singer and Hickey, 2003). The first two dimensions together explain 16.82%, 20.37%, and 17.20% of the total RSCU variation in WIN, HAM, and NSP, respectively. Therefore, a two-dimensional COA solution on RSCU appears satisfactory for the Nitrobacter genomes. Fig. 2 displays, for each genome, a plot of CDSs, predicted highly expressed genes, predicted lowly expressed genes, ribosomal protein genes, and nitrite-oxidizing genes on the first and second major axes produced by COA on RSCU. Following the analysis of Wu et al. (2005), the predicted highly and lowly expressed genes are defined as the genes with the top 10% and bottom 10% of CAI values, respectively. Consequently, the high and low CAI cutoff values were 0.709 and 0.496, 0.697 and 0.426, and 0.704 and 0.496 for WIN, HAM, and NSP, respectively. In the three Nitrobacter genomes, most genes clustered near the origin of the axes, forming an inverted-trapezoid-shaped cloud that ranged from -0.5 to +1.0 on the first axis and from -0.5 to 0.3 on the second axis. However, putatively highly expressed genes clustered on the left side of the first major axis, and putatively lowly expressed genes clustered on the right side. These results suggest that gene expression level plays an important role in synonymous codon usage in the Nitrobacter genomes. Most of the ribosomal protein genes of all Nitrobacter genomes clustered on the left side of the first axis, indicating high expression of the ribosomal protein genes as a result of selection for translational efficiency.

However, the nitrite-oxidizing genes clustered in two groups on the left and right sides of the first axis, indicating two possible expression levels among these genes. The first major axis is strongly negatively correlated with CAI (r = -0.948, -0.969, and -0.949, all p < 0.001, for WIN, HAM, and NSP, respectively) and with GC3s (r = -0.901, -0.938, and -0.911, all p < 0.001, for the same order of strains). Accordingly, the correlation between CAI and GC3s is positive and highly significant (r = 0.885, 0.827, and 0.829, all p < 0.001, for WIN, HAM, and NSP, respectively). Consequently, translational selection is the major factor shaping codon usage in the Nitrobacter genomes, although mutational bias is also important. In addition, the first major axis was significantly correlated with hydrophobicity (r = -0.134, -0.142, and -0.069, all p < 0.001, for WIN, HAM, and NSP, respectively) and with gene length (r = -0.260, -0.240, and -0.241, all p < 0.001, for the same order of strains). The correlation coefficients for hydrophobicity and gene length were far smaller than those associated with translational selection (CAI) and mutational bias (GC3s), suggesting that hydrophobicity and gene length play only minor roles in shaping codon usage in these genomes.
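The grouping of genes into predicted highly and lowly expressed sets by the top and bottom 10% of CAI values can be sketched as below. The function name `split_by_cai` and the rounding rule (truncate, but keep at least one gene per group) are assumptions made for illustration; the source does not say how ties or fractional group sizes were handled.

```python
def split_by_cai(cai_by_gene, fraction=0.10):
    """Return (high, low) lists of gene ids by CAI rank.

    cai_by_gene: dict mapping gene id to its CAI value.
    """
    ranked = sorted(cai_by_gene, key=cai_by_gene.get, reverse=True)
    n = max(1, int(len(ranked) * fraction))  # group size, at least one gene
    return ranked[:n], ranked[-n:]
```

The CAI cutoffs quoted above then correspond to the lowest CAI in the high group and the highest CAI in the low group.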


  • Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
  • Bulmer, M., 1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897-907.
  • Duret, L., Mouchiroud, D., 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. U.S.A. 96, 4482-4487.
  • Gouzy, J., Carrere, S., Schiex, T., 2009. FrameDP: sensitive peptide detection on noisy matured sequences. Bioinformatics 25, 670-671.
  • Greenacre, M.J., 1984. Theory and applications of correspondence analysis. Academic Press, London.
  • Grocock, R.J., Sharp, P.M., 2002. Synonymous codon usage in Pseudomonas aeruginosa PA01. Gene 289, 131-139.
  • Hackett, J.D., Scheetz, T.E., Yoon, H.S., Soares, M.B., Bonaldo, M.F., Casavant, T.L., Bhattacharya, D., 2005. Insights into a dinoflagellate genome through expressed sequence tag analysis. BMC Genomics 6, 80-93.
  • Hegaret, H., Wikfors, G.H., Soudant, P., Lambert, C., Shumway, S.E., Berard, J.B., Lassus, P., 2007. Toxic dinoflagellates (Alexandrium fundyense and A. catenella) have minimal apparent effect on oyster hemocytes. Mar. Biol. 152, 441-447.
  • Ikemura, T., 1981. Correlation between the abundance of Escherichia coli transfer-RNAs and the occurrence of the respective codons in its protein genes- a proposal for a synonymous codon choice that is optimal for the Escherichia coli translational system. J. Mol. Biol. 151, 389-409.
  • Karlin, S., Mrazek, J., 1996. What drives codon choices in human genes? J. Mol. Biol. 262, 459-472.
  • Kuo, J., Chen, M.C., Lin, C.H., Fang, L.S., 2004. Comparative gene expression in the symbiotic and aposymbiotic Aiptasia pulchella by expressed sequence tag analysis. Biochem. Biophys. Res. Commun. 318, 176-186.
  • Laatsch, T., Zauner, S., Stoebe-Maier, B., Kowallik, K.V., Maier, U.G., 2004. Plastid-derived single gene minicircles of the dinoflagellate Ceratium horridum are localized in the nucleus. Mol. Biol. Evol. 21, 1318-1322.
  • Lidie, K.B., Ryan, J.C., Barbier, M., Van Dolah, F.M., 2005. Gene expression in Florida red tide dinoflagellate Karenia brevis: analysis of an expressed sequence tag library and development of DNA microarray. Mar. Biotechnol. 7, 481-493.
  • Morton, B.R., 1999. Strand asymmetry and codon usage bias in the chloroplast genome of Euglena gracilis. Proc. Natl. Acad. Sci. U.S.A. 96, 5123-5128.
  • Musto, H., Cruveiller, S., D'Onofrio, G., Romero, H., Bernardi, G., 2001. Translational selection on codon usage in Xenopus laevis. Mol. Biol. Evol. 18, 1703-1707.
  • Peden, J.F., 1999. Analysis of Codon Usage, Ph.D. Thesis, University of Nottingham.
  • Rispe, C., Legeai, F., Gauthier, J.P., Tagu, D., 2007. Strong heterogeneity in nucleotidic composition and codon bias in the pea aphid (Acyrthosiphon pisum) shown by EST-based coding genome reconstruction. J. Mol. Evol. 65, 413-424.
  • Rizzo, P.J., 2003. Those amazing dinoflagellate chromosomes. Cell Res. 13, 215-217.
  • Russell, J.G., Sharp, P.M., 2001. Synonymous codon usage in Cryptosporidium parvum: identification of two distinct trends among genes. Int. J. Parasitol. 31, 402-412.
  • Schiex, T., Gouzy, J., Moisan, A., de Oliveira, Y., 2003. FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res. 31, 3738-3741.
  • Sharp, P.M., Li, W.H., 1986. An evolutionary perspective on synonymous codon usage in unicellular organisms. J. Mol. Evol. 24, 28-38.
  • Sharp, P.M., Li, W.H., 1987. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281-1295.
  • Tanikawa, N., Akimoto, H., Ogoh, K., Wu, C., Ohmiya, Y., 2004. Expressed sequence tag analysis of the dinoflagellate Lingulodinium polyedrum during dark phase. Photochem. Photobiol. 80, 31-35.
  • Trivedi, N., Bischof, J., Davis, S., Pedretti, K., Scheetz, T.E., Braun, T.A., Roberts, C.A., Robinson, N.L., Sheffield, V.C., Soares, M.B., Casavant, T.L., 2002. Parallel creation of non-redundant gene indices from partial mRNA transcripts. Future Gener. Comp. Syst. 18, 863-870.
  • Uribe, P., Fuentes, D., Valdes, J., Shmaryahu, A., Zuniga, A., Holmes, D., Valenzuela, P.D.T., 2008. Preparation and analysis of an expressed sequence tag library from the toxic dinoflagellate Alexandrium catenella. Mar. Biotechnol. 10, 692-700.
  • Wright, F., 1990. The 'effective number of codons' used in a gene. Gene 87, 23-29.