The entire human genome was sequenced by the Human Genome Project, an international effort that began in 1990 and was completed in 2003. Although the project successfully sequenced the whole human genome and identified 20,000-25,000 genes, the function of a large fraction of the genome has not yet been identified. These regions are referred to as non-coding DNA sequences. Few studies have been conducted on the non-coding DNA sequences of the human genome because they were formerly considered "junk DNA". This is the motivation for carrying out research on them. Genomic comparisons have identified many non-coding sequences shared across species. It can thus be conjectured that there is some reason why these non-coding regions have been maintained over long evolutionary periods, and it has been suggested that conserved non-coding sequences may be functional. This is the main motivation for selecting the conserved non-coding sequences of the human genome as the subject of this research.
A large portion of eukaryotic genomes consists of non-coding DNA sequences, in contrast to prokaryotic genomes. For this reason the non-coding DNA region has recently attracted a lot of attention. A significant factor is that large amounts of non-coding sequence are found mainly in eukaryotes, with their prevalence increasing significantly in vertebrates. There is no general agreement about why most eukaryotic cells contain more DNA than is required to encode their proteins. These non-coding DNA sequences do not contain instructions to make proteins, and their functionality is not yet known.
Besides this, a very interesting finding about the human genome is that coding sequences make up only about 2% of it, while the remaining 98% is non-coding DNA. Two different explanations for the variability in the amount of non-coding DNA have been proposed.
One explanation is that non-coding DNA has no function and exists due to mutation pressure.
A second explanation, from the functional theories, is that non-coding DNA has a sequence-independent function.
The amount of non-coding DNA varies greatly between genomes, even between closely related species. In prokaryotes the total length of non-coding DNA increases linearly with the total length of protein-coding DNA. Why so much non-coding DNA exists and why its amount per genome varies so much are central biological problems. This is why further studies in this area need to be carried out.
This dissertation is an attempt to look at the conserved non-coding DNA sequences in the human genome, which have not been analyzed much so far. The objective of the research is to analyze the distribution of conserved non-coding DNA sequences in the human genome. There must be some reason why these sequences have been conserved over such long evolutionary periods. DNA sequence comparison between different organisms increases the probability of identifying common signatures that may have functional implications. Mammals and fish, being the most evolutionarily distant extant vertebrates for which whole-genome information is available, provide highly valuable information for gathering conserved non-coding sequences. It is important to use orthologous sequences when performing cross-species DNA comparisons to recognize functional elements that are evolutionarily conserved. Conserved non-coding sequences found through a whole-genome comparison between human and zebrafish (Danio rerio) are used in this research. Zebrafish is an important model organism for the analysis of vertebrate development; the gene order and content of its mitochondrial DNA are similar to the common vertebrate form. There is extensive similarity between the zebrafish and human genomes, so many human development and disease genes have counterparts in the zebrafish. The findings of the research are expected to be significant since they will help us to understand more about the human genome. They will also be valuable for biological experiments and further research conducted on the human genome.
The study could contribute to medical treatment, because drug designers need to explore the human genome to know how the body reacts to diseases; this is key to designing a successful drug. When a disease attacks the body, the coding sequences may not be the only portion that works against it. There may be some interaction between the cell's immune system and non-coding sequences. Conserved blocks of DNA likely result from the sequence-specific constraints of DNA-protein interactions, although the relation between conserved sequences and functionally characterized binding sites is not exact. This is notable since some proteins produced by our coding sequences cause certain diseases, such as cancer. To learn how the body works we need to understand the whole genome. Although we know the functionality of many coding DNA sequences, we know very little about the functionality of non-coding DNA sequences. Since they make up the majority of our genome, our knowledge of the human genome is poor. That is why studying the non-coding DNA sequences of the human genome is as important as analyzing its coding DNA sequences.
Aims and Objectives
The main objectives of the research are to:
Analyze the distribution of conserved non-coding sequences (CNSs) over the human genome to identify distribution patterns of the blocks, if any exist.
Identify whether there is any correlation between the conserved non-coding blocks and the genes located close to them.
A persistent execution environment is needed so that the algorithms can be tested without blocking access.
The Structure of the Report
The dissertation is organized as follows. Chapter 2 discusses the literature on the research area: the background needed to understand the problem and the details of the research that has been done so far in the area.
Human Genome, Chromosome and Genes
The qualities that differentiate humans from other primates originated in specific DNA sequence changes in the human genome. Therefore the entire human genome has to be sequenced and analyzed to understand the characteristics of specific DNA sequences. The Human Genome Project, which sequenced the entire human genome for the first time, was completed in 2003 after a 13-year effort. According to the research findings, more than 50% of the human genome shows sequence similarity to genes in other organisms. The human genome contains around 20,000-25,000 genes, which are located on DNA strands and distributed among the 23 pairs of chromosomes in a cell.
Humans are eukaryotes, that is, organisms whose cells contain complex structures enclosed within membranes. Eukaryotes have more complexity in gene and genome structure than prokaryotes. The chromosomes of each eukaryotic cell are made up of
DNA (Deoxyribonucleic acid)
Small amount of RNA (ribonucleic acid) .
Chromosomes are made up of two DNA strands, and DNA sequences are made up of four nucleotides: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). The two strands are held together by hydrogen bonds between Adenine and Thymine and between Guanine and Cytosine.
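The pairing rule described above (A with T, G with C) can be illustrated with a minimal sketch that derives the complementary strand of a sequence. The function name `reverse_complement` is an illustrative choice and is not taken from the study.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction.

    `reverse_complement` is a hypothetical helper name used for illustration.
    """
    return "".join(PAIRS[base] for base in reversed(seq.upper()))
```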
A gene can be defined as a part of DNA, some of which is transcribed, that includes a promoter, a coding sequence and a signal for the RNA polymerase to stop. There need not be a unique promoter for each gene, because one promoter may act on several genes. Our body is able to produce different tissues because each cell expresses only a subset of its genes. These genes have different functionalities according to the functions of the organism. So the same gene may appear on different chromosomes, but the function it provides may vary. Most genes are capable of making more than one protein, so a genome can give rise to more proteins than it has genes.
Figure 1: Organization of the Human Genome 
Coding DNA Sequences
Many studies have been conducted on the human genome, but the main focus of most genomic studies has been protein-coding genes or RNAs. DNA sequences that code for a protein or RNA are considered coding DNA sequences. Even though coding DNA sequences are considered the main part of the genome, they account for only a small proportion of the human genome, about 2%. ATG in DNA and AUG in RNA represent the start codon, or initiation codon. TAG, TAA or TGA denotes the stop codon of a coding DNA sequence; in RNA these are UAG, UAA and UGA.
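As a rough illustration of how start and stop codons delimit a coding sequence, the sketch below scans a DNA string for the first ATG and reads codon by codon until a stop codon is reached. The helper name `first_orf` is hypothetical, and the stop set includes TGA, the third standard stop codon, alongside TAG and TAA. Real gene prediction is considerably more involved (both strands, introns, minimum length), so this is only a toy example.

```python
# Standard DNA stop codons; TGA is included as the third standard stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(dna: str):
    """Return the first ATG...stop open reading frame found, or None.

    `first_orf` is a hypothetical helper; it only scans the given strand.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk in-frame, three bases at a time, from the codon after ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop for this ATG: try the next ATG downstream.
        start = dna.find("ATG", start + 1)
    return None
```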
Gene expression means that the information in a gene is used in the synthesis of a functional gene product. These products can be proteins, but for non-protein-coding genes such as rRNA or tRNA genes the product is a functional RNA. In genetics, gene expression is the most fundamental level at which genotype gives rise to phenotype.
There are two types of genes that code for polypeptides.
Structural genes - Code for functional proteins such as enzymes, hormones, antibodies, storage proteins and fibres.
Regulatory genes - Control the activities of other genes.
Non-coding DNA Sequences
Previously, protein-coding sequences were considered the only important fraction of the genome. But as studies on genomic sequences have been carried out, it has been discovered that coding sequences make up only a small fraction of the whole genome. A large portion of the genome consists of non-coding sequences, which were previously considered "junk DNA". Because non-coding sequences make up such a large fraction of the genome, researchers have become interested in them as well as in protein-coding sequences.
In the human genome, coding sequences cover only about 2% of the genome and the other 98% consists of non-coding DNA sequences. As Figure 3 describes, this is the largest fraction of non-coding DNA among eukaryotes.
Figure 3: Proportions of Non-protein coding DNA of different genomes
Non-coding DNA can be separated into two parts, intronic regions (introns) and intergenic regions.
Intronic regions (Introns)
In 1977 biologists discovered that the DNA of a eukaryotic gene is longer than its corresponding mRNA. Because mRNA was thought to be a direct copy of the DNA sequence, the two had been expected to be the same length. It was then revealed that there are introns, which separate a gene into several pieces called exons. The exons are the coding parts and, as Figure 4 describes, introns appear between exons. The size and arrangement of introns is uneven and can be considered characteristic of a gene. The same transcript may have different introns removed in different cells. The gene can therefore be spliced in alternative ways and can code for different proteins, which increases its potential use.
Example: Calcitonin gene
Two different forms of mRNA are produced by this gene, depending on which introns are removed on each occasion. One produces the protein calcitonin and the other produces CGRP (calcitonin gene-related peptide), which is similar to calcitonin.
Intergenic regions are the regions located between genes. Their length is not fixed and varies between chromosome locations.
Two concepts have been proposed about the origin of introns.
The introns-early concept holds that (nearly) all introns in eukaryotic genes were inherited from the common ancestor of prokaryotes and eukaryotes. Differences in gene arrangement among homologous eukaryotic genes arise from different intron losses, as described by Figure 5.
Figure 2: Introns-early concept: Introns have been inherited by ancestor - Ancestor of prokaryotes and eukaryotes has had a lot of introns and the eukaryotes today have different structure of introns due to the loss of those introns
As Figure 6 describes, the introns-late concept argues instead that introns were a eukaryotic novelty and that new introns have been emerging continuously throughout eukaryotic evolution.
Figure 3: Introns were a eukaryotic novelty - Ancestor of prokaryotes and eukaryotes had no introns and through the eukaryotic evolution introns have been emerging continuously
Intron insertion may accompany major evolutionary transitions, whereas intron loss dominates over short evolutionary distances. It has been proposed that the fraction of intron positions shared between species decreases as evolutionary distance increases. Accordingly, intron conservation could be useful as a phylogenetic marker, and intron positions have been used successfully as phylogenetic markers for shorter evolutionary distances.
A recent finding is that old introns are significantly over-represented in the 5'-portions of genes, whereas new introns are distributed much more uniformly. Moreover, in most intron-rich genomes, introns are over-represented in the 3'-portions of the genes.
Since introns appear between exons, subsequent analysis has examined where they fall relative to codons. Introns located between two codons lie in more highly conserved portions of genes than introns located after the first or the second position in a codon.
The above findings are for eukaryotes. Unlike in eukaryotes, non-coding DNA sequences make up only a small fraction of prokaryotic genomes.
Formerly it was thought that genome complexity could be described by the size of the genome, but there is no association between genome size and complexity. As Figure 7 describes, the genome sizes of many organisms turn out to be not much different from that of the human. This shows that genome size does not correlate with the complexity of an organism.
Figure 4 : Genome size comparison
In eukaryotes, however, it is significant that the non-protein-coding proportion increases with genome complexity (Figure 8). Most functional theories emphasize that cell size is adaptively important and that the relationship between genome size and cell volume is the key to explaining the continued presence of non-coding DNA.
Conserved DNA Sequences
Conservation of a DNA sequence indicates that the sequence is functionally constrained. Non-functional parts of the genome drift apart as species diverge from each other, because they accumulate mutations. Functional parts, however, remain recognizably similar between species over long periods of time. These parts may code for proteins, code for RNA, or act as regulatory regions such as enhancers, promoters and repressors. Functional DNA sequences are therefore likely to remain the same or very similar over millions of years of mutation pressure. This is the idea behind comparative genomic analysis: comparing DNA sequences from several species provides a way to identify common signatures that may have functional significance.
Conserved Non-coding DNA Sequences
The availability of full genome sequences has led to many comparative genomic studies, which have examined the non-coding parts of genomes by genomic comparison. These studies have discovered that vertebrate genomes contain a large amount of non-coding DNA sequence with strong conservation across the phylogeny. Interpreting the functions of these non-coding sequences is a challenging but highly important problem facing genomic studies today.
Analysis of orthologous genes from animals and plants has discovered many shared non-coding DNA elements. Many intron positions are conserved in distant eukaryotes even though intron densities differ widely, yet the locations of introns in orthologous genes are not always the same in closely related species. About 50% of the DNA sequences that are conserved between human and mouse lie outside the coding sequences. Furthermore, from genomic comparisons between human, mouse and dog, Frazer found that one half of the human/mouse conserved sequence was also conserved in the dog. The average percent pairwise sequence identity between human CNSs and the Chimp, Mouse, Rat, Chicken, Frog, Fugu, Zebrafish and Tetraodon genomes is illustrated in Figure 09. This is very interesting because it points to evolutionary constraints on non-coding DNA, which is therefore likely to be functional. Conserved non-coding regions are less likely to undergo mutations than random DNA sequences, which suggests that they are functionally significant. Conserved blocks of non-coding DNA sequences likely reflect the sequence-specific constraints of DNA-protein interactions, even though the association between functionally characterized binding sites and conserved non-coding DNA sequences has not yet been established. Researchers suggest that about 20%-30% of the nucleotide sites of eukaryotic genomes are expected to be conserved in functionally constrained non-coding regions.
Figure 5: Average percent pairwise sequence identity of human CNSs in comparisons with the Chimp, Mouse, Rat, Chicken, Frog, Fugu, Zebrafish and Tetraodon genomes
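To make the notion of percent pairwise sequence identity concrete, here is a minimal sketch that computes it for two already-aligned sequences. The function name `percent_identity` and the convention used (matching columns divided by all columns that are not gaps in both sequences) are assumptions; published comparisons may define identity slightly differently.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base.

    Assumption: '-' marks a gap, and columns that are gaps in both
    sequences are ignored. This is one common convention, not the only one.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no information
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        return 0.0
    return 100.0 * matches / columns
```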
Regarding the distribution of CNSs, it is significant that the length of CNSs is distributed differently from the length of exons. CNSs are unevenly distributed throughout the genome and across chromosomes. This is a main characteristic of CNSs: observations show that their distribution is not uniform but highly clustered. Genomic regions that are rich in genes may be poor in CNSs, and gene-poor areas may be rich in CNSs. Chromosome 21 is a good example: even though it is poor in genes, it is very rich in CNSs. A recent finding is that the amounts of CNSs and coding sequences are roughly equal in the chromosome 21 region. Relative to chromosome length, chromosome 10 has the highest density of CNSs and chromosome X has the lowest. Loots discovered that about one-half of the conserved elements are found in intergenic regions more than 10 kb from the closest known gene.
CNEs are unlikely to be translated because, unlike aligned coding sequences, aligned CNEs show no periodicity every three bases. It is therefore unlikely that these CNEs are new protein-coding genes, but it is possible that they are novel RNA genes that are transcribed and not translated.
Since interest in non-coding DNA sequences has risen recently, a lot of research has been carried out to find the functionality and characteristics of these regions. As a result it has been discovered that conserved non-coding regions are strongly associated with genes that have critical roles during development [9, 10, 11, 12, 16, 18]. It is now generally accepted that conserved non-coding regions correspond to functional regulatory elements. Researchers believe CNSs are likely to function as negative or positive regulatory elements, either spatially or temporally. Indeed, a CNS may act as either an enhancer or a repressor of transcription depending on the factors bound to it. However, the tissue or timing specificity of CNE enhancers is not known a priori.
Research has shown that elements conserved in a limited number of species are as likely to be functional as elements conserved in many species. There is a limited class of CNSs in the human genome that can be found only in a subset of mammals, and these are shorter than CNSs that are common to other species. This limited class is a collection of rapidly evolving functional non-coding sequences, which suggests that they may account for gene expression variation between species.
Prabhakar suggests that changes in non-coding sequences may have contributed to the modifications in brain development and function that are responsible for uniquely human cognitive qualities. Not only the syntenic association of CNSs but also the distance between these elements might be important for controlling the expression of certain genes. A comparison between human and fugu (Takifugu rubripes) has recognized approximately 1400 CNSs, which appear to be development regulators in vertebrates.
Loots says that large-scale analysis of the expression patterns of mammalian genes will classify sets of co-regulated genes that may share CNSs. The human genome may contain more than 1000 transcripts. Analyzing CNSs is therefore very important in identifying the regulatory circuits involved in human development and differentiation. Although a small fraction of the CNSs can be associated with transcriptional regulation, there remain a large number of CNSs with unexplained function.
An important demonstration of the function of conserved non-coding regions is a multi-species sequence comparison of the Interleukin gene cluster. The selected region was 1 Mb, and deletion of a conserved non-coding element of 401 bp was shown to change Interleukin expression in the T cells of transgenic mice. VIB (the Flanders Institute for Biotechnology) researchers linked to K.U.Leuven and Harvard University say that non-coding DNA helps tune gene activity and enables organisms to adapt quickly to environmental changes. Dr. Craig Pikaard at Washington University and his research group have discovered that non-coding DNA acts as part of the cellular immune system by enhancing the ability of the cell to fight viruses and transposons.
The minimum amount of non-coding DNA that eukaryotes require increases quadratically with the amount of DNA located in exons. For the human genome the minimum non-coding amount is 5.4% of the whole genome length. Very short, specific CNSs have been discovered that give rise to ncRNAs (non-coding RNAs) such as microRNA and siRNA (small interfering RNA). These RNAs are involved in regulatory functions and have also been linked to genetic diseases in humans, as has already been shown for the Sonic Hedgehog (SHH) gene [14, 11]. The SHH gene controls a range of differentiation processes during vertebrate development, as well as patterning the limb.
In the human genome the highest numbers of duplicated CNSs are located on chromosomes 18/19, 8/10 and 5/16. The frequency with which these CNSs appear across the genome is not associated with the genomic distribution of paralogons. "Paralogs" are genes that are homologous to other genes in the same species, and are therefore likely to have originated from a common ancestral gene. Even though the highest numbers of human paralogs are on the chromosome pairs 1q/9q, 7q/17q, 2q/12q, 15q/18q, 1q/6q and 5q/15q, the CNS duplicates are normally clustered around 2p/14q, 8p/10q and 18p/19p/20p. McEwen reported 124 families of duplicated CNSs in the human genome, about 98% of which were assigned to single or multiple copies of paralogous genes. Their main role is related to transcription or development.
Since most CNSs are located in and around genes that act as development regulators, a number of studies have been carried out in this area. Functional data have been presented for highly conserved CNSs associated with four unrelated development regulators: SOX21, PAX6, HLXB9 and SHH. Trans-dev genes are mostly located in regions of low gene density, because this genome architecture gives an additional level of transcriptional control. It has therefore been suggested that CNSs might play a role in structuring the genomic architecture, mostly around trans-dev genes. There are findings where a CNS cluster is found close to more than one trans-dev gene. Such occurrences illustrate the value of associating the endogenous expression patterns of genes with CNE enhancer activity [11, 15, 16, 18].
Human and Zebrafish
Humans and fish are both eukaryotes. According to the tree of life, which describes the evolutionary distance between these two species (Figure 10), the last common ancestor of humans and fish lived roughly 440 million years ago. They are therefore not closely related species, and genomic sequences shared by the two are likely to be evolutionarily constrained functional regions.
Figure 6 : tree of life describes the evolutionary distance between the zebrafish and human
Zebrafish (Danio rerio), a type of fish, is a useful model organism for studies of vertebrate development and gene function. The pioneering work of George Streisinger at the University of Oregon established the zebrafish as a model organism. Because the gene order and content of zebrafish mitochondrial DNA match the common vertebrate form [ZFIN - db about zebrafish], it is an ideal model vertebrate for development and disease research. When performing genome comparisons to find regions conserved over evolution, it is important to choose sequences that are orthologous. The evolutionary tree in Figure 10 describes the relation between the zebrafish and the human. As the figure shows, the evolutionary time elapsed since the two species separated from their last common ancestor is likely the same for both. A human-zebrafish genome comparison has found 4799 CNEs that are more than 70% identical and more than 80 bp long. Another genomic comparison has identified 73187 strand-specific CNEs, each longer than 50 bp and with more than 50% identity between zebrafish and human [cneViewer]. cneViewer is the database built on that data, and it can be accessed as a web tool.
Figure 7: Part of evolutionary tree. Positions of the Human and Zebrafish (Danio rerio) are highlighted on the figure.
Design of the System
This chapter describes the design approach for analyzing conserved non-coding DNA sequences. Figure 8 depicts the data flow and linkage of the system.
Organize Conserved Non-coding Element data: set up a local database; remove unnecessary duplicated data; filter data by sequence similarity > 60% and sequence length > 100 bp. The result is the filtered CNE data.
Formulate CNE families: make a FASTA format data file of the sequences with the CNE ID as the identifier; perform multiple sequence alignment; build a similarity tree using the multiple sequence alignment output; identify CNE families from the similarity tree (make families for 4 cases: similarity > 60%, 70%, 80% and 90%).
Search for relationship between genes and CNEs: find the genes near each CNE of the families; cluster the genes according to the CNE families and build a Kohonen map; perform a gene ontology search on the genes belonging to the same cluster to find any relationship between the genes.
Figure 8 : The data flow of the system
Organize Conserved Non-coding Element data
The study concerns the conserved non-coding sequences in the human genome, so these sequences first need to be identified. "cneViewer" is a database and web tool that systematizes information on conserved non-coding sequences from a genomic comparison between human and zebrafish (Danio rerio). The database contains 73187 strand-specific conserved non-coding elements; each data element is longer than 50 bp and has a sequence identity higher than 50%. Each data element has the following attributes:
Conserved Non-coding Element ID (CNE ID)
Distance to the nearby gene
Conserved Sequence Identity
Sequence in the human genome
Sequence in the Danio rerio genome
Set up a local database
To use the data in the research it needs to be downloaded, because no API is provided with the tool. The data can be downloaded into an HTML file, and a local MySQL database is then set up with the downloaded data.
There can be several entries for the same data element in the database, because cneViewer is organized by anatomy. If the same gene at the same chromosome position appears under more than one anatomical term, the same data element is inserted into the database more than once. Such unnecessary duplicates need to be removed.
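The duplicate removal described above can be sketched as follows. This assumes each downloaded record is a dictionary and that the CNE ID alone identifies a data element; the key name `cne_id` and the function name `drop_duplicates` are hypothetical and do not come from cneViewer.

```python
def drop_duplicates(records):
    """Keep the first record seen for each CNE ID, preserving input order.

    Assumption: `records` is a list of dicts with a hypothetical "cne_id" key.
    """
    seen = set()
    unique = []
    for rec in records:
        if rec["cne_id"] not in seen:
            seen.add(rec["cne_id"])
            unique.append(rec)
    return unique
```

In a relational database the same effect could be obtained with a uniqueness constraint on the identifier column; the in-memory version is shown only because it is easy to follow.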
Following previous studies, the selection criteria for conserved non-coding elements in this study are a sequence identity above 60% and a sequence length above 100 bp.
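A minimal sketch of this selection step is shown below. The record keys `identity` and `human_seq`, and the function name `filter_cnes`, are hypothetical. The thresholds are applied as strict "greater than" comparisons, following the "> 60%" and "> 100bp" wording of the data flow figure; whether the boundary values themselves should pass is an assumption.

```python
MIN_IDENTITY_PCT = 60.0  # selection threshold stated in the text
MIN_LENGTH_BP = 100      # selection threshold stated in the text


def filter_cnes(records, min_identity=MIN_IDENTITY_PCT, min_length=MIN_LENGTH_BP):
    """Return records whose identity and human sequence length exceed the thresholds.

    Assumption: each record is a dict with hypothetical keys "identity"
    (percent, as a float) and "human_seq" (the human CNE sequence).
    """
    return [
        rec for rec in records
        if rec["identity"] > min_identity and len(rec["human_seq"]) > min_length
    ]
```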
Formulate CNE families
Filter the data for the study according to the selection criteria mentioned previously. Similar sequences with different CNE IDs need to be identified before continuing the process. For that, a multiple sequence alignment needs to be performed on all the filtered sequences.
Figure 9: Sample dendrogram - Branches represent the sequence clusters
As Figure 9 depicts, a dendrogram is a graphical representation of the outcome of hierarchical cluster analysis. From the multiple sequence alignment results, a dendrogram is to be built to help interpret the classification. The branches of the dendrogram represent the clusters obtained at each step, so sequences that are alike are clustered together. CNE clusters are to be formed according to the branches of the dendrogram. Families are to be identified for four cases, in which sequences are classified into clusters using 60%, 70%, 80% and 90% as cut-off similarity percentages.
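The idea of forming families at several similarity cut-offs can be sketched with a simple single-linkage grouping over pairwise identities. This is a simplified stand-in, not the tree-building method of the alignment tool: it assumes a hypothetical dictionary of pairwise percent identities and merges any two sequences whose identity reaches the cut-off. The function name `families_at_cutoff` is illustrative.

```python
def families_at_cutoff(ids, identity, cutoff):
    """Group ids so that any pair with identity >= cutoff shares a family.

    Assumptions: `identity` maps (id_a, id_b) pairs to percent identity;
    single linkage is used as a simple substitute for cutting a dendrogram.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in groups.values())
```

Calling the function once for each of the four cut-offs (60, 70, 80 and 90) gives the four family sets; higher cut-offs can only split families, never merge them.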
Formulate gene clusters according to the corresponding CNE families
Genes located within 500 kb of CNEs that belong to the same CNE cluster belong to the same gene cluster. In other words, genes located near CNEs of the same cluster are grouped into the same gene cluster. A Self Organizing Map is to be constructed on the gene clusters, using the nearby CNE as a feature.
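The 500 kb rule above can be sketched as follows. The record layout (a CNE with `chrom` and `pos`, a gene with `name`, `chrom` and `start`) and the function names are hypothetical, and distance is measured here simply from the CNE position to the gene start, which is an assumption about how "nearby" is defined.

```python
WINDOW_BP = 500_000  # the 500 kb window stated in the text


def genes_near_cne(cne, genes, window=WINDOW_BP):
    """Names of genes on the CNE's chromosome whose start is within `window` bp.

    Assumption: distance is |gene start - CNE position|; field names are hypothetical.
    """
    return {
        gene["name"] for gene in genes
        if gene["chrom"] == cne["chrom"] and abs(gene["start"] - cne["pos"]) <= window
    }


def gene_clusters(cne_families, cnes, genes, window=WINDOW_BP):
    """Map each CNE family label to the set of genes near any of its member CNEs."""
    clusters = {}
    for family, members in cne_families.items():
        near = set()
        for cne_id in members:
            near.update(genes_near_cne(cnes[cne_id], genes, window))
        clusters[family] = near
    return clusters
```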
Search for relationship between genes and CNEs
The genes which belong to the same cluster on the Self Organizing Map should be analyzed further. To find the association between genes that belong to the same cluster, a gene classification method is needed. The genes are therefore to be clustered by gene ontology.
Cluster the genes based on the gene ontology
Gene ontology describes how gene products act in a cellular context. It is organized by three principles:
Cellular component - A component of a cell that is part of some larger object, which may be an anatomical structure or a gene product group.
Biological process - A series of actions completed by one or more ordered assemblies of molecular functions.
Molecular function - Activities that occur at the molecular level, such as catalytic or binding activities. These generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled gene complexes.
All the genes selected in the previous stage are to be clustered by gene ontology. Since the relationship between the non-coding sequences and the genes is not known, the data needs to be analyzed in various ways. Hence three self organizing maps are to be constructed considering the three organizing principles of gene ontology individually, and another map considering all three principles together. The four feature maps will therefore be constructed considering:
Cellular component
Biological process
Molecular function
All three principles mentioned above
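For readers unfamiliar with self organizing (Kohonen) maps, the sketch below trains a very small one in plain Python. It is a toy illustration under stated assumptions: each gene is represented by a numeric feature vector with values between 0 and 1 (for example, hypothetical 0/1 flags for gene ontology terms), the grid is rectangular, and the learning rate and neighbourhood radius decay linearly. The function names and all parameter values are illustrative and are not those used in the study.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid node whose weight vector is closest (squared Euclidean) to x."""
    return min(
        weights,
        key=lambda node: sum((wk - xk) ** 2 for wk, xk in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols self organizing map on a list of feature vectors.

    Assumptions: linear decay of learning rate and neighbourhood radius,
    Gaussian neighbourhood, random initial weights in [0, 1).
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        sigma = max(sigma0 * (1 - frac), 0.5)
        # Present the samples in a fresh random order each epoch.
        for x in rng.sample(data, len(data)):
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                # Pull the node's weights towards the sample.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights
```

After training, calling `best_matching_unit` for each gene's feature vector assigns the gene to a map node; genes that land on the same node form a cluster. Building the four maps then amounts to training with four different feature sets.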
The genes that belong to the same cluster, considering nearby CNE as a feature, need to be analyzed to find any relationship between them. The four self organizing maps constructed on gene ontology are used for cluster comparison. In this phase the cluster map constructed with CNEs as a feature is compared with each of the other maps, one at a time. Cluster sets that contain similar genes are evaluated to measure how closely related they are.
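One simple way to quantify how similar two gene clusters are is the Jaccard index, the size of the intersection divided by the size of the union. The text does not name a specific measure, so using Jaccard here is an assumption, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two gene collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matches(cne_clusters, go_clusters):
    """For each CNE-based cluster, the GO-based cluster with the highest overlap.

    Both arguments map a cluster label to a collection of gene names.
    Returns {cne_label: (go_label, jaccard_score)}.
    """
    result = {}
    for name, genes in cne_clusters.items():
        best = max(go_clusters, key=lambda other: jaccard(genes, go_clusters[other]))
        result[name] = (best, jaccard(genes, go_clusters[best]))
    return result
```

Running `best_matches` once against each of the four gene ontology maps corresponds to the one-at-a-time comparison described above.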
Set up the local database
The "cneViewer" database only allows the data to be downloaded into an HTML file without a proper format. Therefore the data was copied into a text file and a PHP script was used to load it into the database.
Formulate CNE families
After the database had been set up, the data was filtered using 60% sequence identity. The CNEs that passed the filter were written into a FASTA format file, with the CNE ID as the identifier and the corresponding human CNE sequence as the nucleotide sequence.
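Writing the filtered CNEs in FASTA format can be sketched as below. The record keys `cne_id` and `human_seq` and the function name `to_fasta` are hypothetical; the 60-character line width is a common convention, not a requirement stated in the text.

```python
def to_fasta(records, width=60):
    """Render records as FASTA text: a '>' header line, then wrapped sequence lines.

    Assumption: each record is a dict with hypothetical keys "cne_id"
    and "human_seq".
    """
    lines = []
    for rec in records:
        lines.append(f">{rec['cne_id']}")
        seq = rec["human_seq"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file and passed to a multiple sequence alignment tool as input.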
The FASTA file was used for a multiple sequence alignment with the "CLUSTALW" tool. The result of the multiple sequence alignment was interpreted graphically using the "TreeView" tool.