The present thesis project was carried out within the Research
Program for Molecular Neurology. The group holds
as its mission "to understand the molecular background of mitochondrial disorders,
and use that knowledge to develop diagnosis and therapy."
The project was motivated by the group having started to use exome sequencing,
performed in cooperation with FIMM (Institute for Molecular Medicine Finland),
for molecular diagnosis of patients, mostly children, with suspected mitochondrial
disease. The aim was then to develop an approach for the analysis of genetic variation
data resulting from exome sequencing, in order to identify the mutations and
genes linked to the patients’ disorders. In particular, the approach was to be applied
in studies comprising a single patient exome, without associated sequence data from
family members or from other patients affected by the same disorder.
An exome variant data analysis workflow was developed which is customised for
the characteristics of infantile mitochondrial disorders, combining computational
resources built in-house and external databases and tools. The development involved
users of the workflow — several members of the group — who took part
in iterative rounds of proposed improvements and practical application in patient studies.
This thesis discusses all the elements and the setup of the workflow. Patient study
examples illustrate how the workflow elements are put together. The results obtained
in the analysis of exome variant data of a cohort of 49 paediatric patients are
also presented. The workflow was effective in identifying single nucleotide variants
(SNVs) in nuclear genes causing mitochondrial disease, as validated by functional
studies, for 10 of the patients.
The thesis is structured as follows. Chapter 2 proceeds with some background on
mitochondrial disorders and exome sequencing. The core of the thesis is Chapter
3 where the workflow itself is discussed, including application examples. Chapter 4
presents the outcome of applying the workflow to a cohort of infantile-onset patients.
In Chapter 5 we briefly address current and future work that stemmed from this
project. Lastly, concluding considerations are drawn in Chapter 6.
2 Mitochondrial disease and exome sequencing
Mitochondria are organelles present in almost all our cells, with mature red blood
cells as the only exception. Several cellular processes that are essential for life take
place in mitochondria, most prominently the production of energy, which is why
the organelles are known as the cell’s power plants. Energy is produced via cellular
respiration, whereby the biochemical energy stored in nutrients from food
is converted into ATP (adenosine triphosphate) molecules while oxygen is
reduced to water.
Mitochondria have their own genome: a small and circular DNA molecule containing
16 569 base pairs, believed to have been originally acquired by endosymbiosis
between our distant single-cell ancestor and a bacterium a few billion years
ago. Unlike that of nuclear DNA (nDNA), the inheritance of mitochondrial DNA
(mtDNA) is strictly maternal. The existence of mtDNA does not make the organelle
genetically self-sufficient, however. Most of the hundreds of proteins involved in the
energy production pathway are encoded in the nucleus, synthesised in the cytoplasm
and then imported into mitochondria. Notably, energy production in mitochondria
is the only process in the mammalian cell known to involve two genomes — mtDNA
and nDNA — operating in fine coordination.
Mutations in mtDNA or nDNA that affect proteins involved in energy metabolism
in mitochondria are the underlying cause of mitochondrial disease, although environmental
factors can also play a part (Ylikallio and Suomalainen, 2012). These are
usually severe and progressive disorders with an estimated minimum prevalence of
one in every 5 000 births, on the basis of combined data from studies undertaken
in Australia, for infantile-onset disorders, and England, for adult-onset disorders
(Thorburn, 2004). Treatment remains mostly palliative with no cure available at
this time (Koene and Smeitink, 2011).
Heterogeneity is the hallmark of mitochondrial disorders, from genetic to biochemical
to clinical features, making the disorders complex and difficult to diagnose:
"Oxidative phosphorylation, i.e., ATP synthesis by the oxygen-consuming
respiratory chain (RC) [in mitochondria], supplies most organs and tissues
with energy... Consequently, RC deficiency can theoretically give
rise to any symptom, in any organ or tissue, at any age, with any mode of
inheritance, due to the twofold genetic origin of RC components (nuclear
DNA and mitochondrial DNA)." (Munnich and Rustin, 2001)
To date, more than 100 causative mtDNA and nDNA genes have been linked to mitochondrial
disease (Tucker et al., 2010). Phenotypes usually manifest in multiple
organ systems, have a wide spectrum of time of onset, from perinatal to adulthood,
and vary in presentation and severity throughout an individual’s life span and between
individuals (Suomalainen, 2011). Children tend to be the most severely affected
with the poorest prognoses. Diagnosis in children is also the most challenging.
Clinical presentation is markedly variable in them and histological findings are often
less specific compared to adult patients (Thorburn and Smeitink, 2001; Wolf and
Smeitink, 2002). Amid so much diversity, energy deficiency in the cells, and
consequently in the patient, is the only unifying feature of mitochondrial disorders.
Within the currently known genetic underpinnings and genotype-phenotype correlations
of mitochondrial disorders, at best about half of patients studied by a particular
diagnostic centre have a causative mutation found in one of the known disease genes
(Kirby and Thorburn, 2008; Calvo et al., 2010). Mutant genes in mtDNA and their
associated disorders have been mapped out, while many nuclear genes remain to
be uncovered. It has been estimated that mutations in nuclear genes cause roughly
one-third of adult-onset and three-quarters of infantile-onset mitochondrial disease
(DiMauro and Schon, 2003). When current knowledge of mitochondrial disease
genes is exhausted to no avail, the more exploratory approach of exome sequencing
can provide some answers.
2.1 Exome sequencing in identifying disease-causing mutations
Proteins are encoded by genes in nuclear and mitochondrial DNA. Each gene has
coding sections, called exons, and non-coding sections, called introns. The exome
consists of all exons of all genes in a genome. The human exome is estimated to
correspond to only about 1% of the total genome, amounting to approximately
30 Mb. This relatively small part of the genome, however, holds most of the mutations
currently known to be associated with human genetic diseases. What is
more, the exome is as yet better understood than non-coding and regulatory regions
of the genome. Whole-exome sequencing is seen as a middle-ground approach for
identifying Mendelian disease genes: it is more comprehensive and less biased than
sequencing a pre-determined gene panel while at the same time potentially more
cost-effective than sequencing and studying the entire genome.
Exome sequencing arose from the development of methods that couple together
targeted capture and massively parallel DNA sequencing (also referred to as ‘next-generation’
sequencing (NGS)). The technique of capture of targeted genomic loci,
today widely used for exome capture, was proposed by Gnirke et al. (2009). In the
article, a fishing analogy illustrates the technique: exon baits are thrown in excess
in a pond of total human DNA fragments for a catch of enriched segments of exonic
DNA. The baits are long single-stranded oligonucleotide probes, each consisting of a
target exome segment, long enough to hybridise in solution with protein-coding exons
(they are on average 169 bp long) and flanked on both sides by primer sequences
for amplification. Since many target exons are shorter than the designed probes,
the captured sequence as a whole extends beyond the 30 Mb exome. Magnetic
beads are used to amass the catch of exome segments, which are amplified and can
then be sequenced on the chosen sequencing platform. Next, the obtained exome
sequence reads are aligned to the human reference genome. Identified differences
are genotyped to make up the set of genetic variants contained in the exome of
the sequenced individual. Today, the most widely used software for the alignment
step is the Burrows-Wheeler Alignment tool (BWA) (Li and Durbin, 2009), and
for variant calling, the Genome Analysis Toolkit (GATK) (DePristo et al., 2011)
developed at the Broad Institute, and SAMtools (Li et al., 2009) developed at the
The types of genetic variation that can be ascertained by current exome variant calling
methods and accompanying tools are SNVs (point mutations, single nucleotide
substitutions), indels (small insertions and deletions) and CNVs (copy number variants).
SNVs are the simplest and most common type of variation, as well as the
most prevalent in association with disease. They make up approximately 55% of the
pathogenic mutations in the Human Gene Mutation Database (HGMD) (Stenson
et al., 2009). So far, the methods for identifying variants in NGS data are more
accurate in calling SNVs than other types of variation (McKenna et al.,
2010; DePristo et al., 2011). Also, for the subsequent stage of data analysis, SNVs
are the best served with a variety of computational resources, such as large-scale
population databases and tools for prediction of their functional effect on proteins.
In the studies of molecularly undiagnosed patients using exome sequencing data, the
Wartiovaara group concentrated first on SNVs. For this reason, SNVs are the type
of genetic variation this thesis focuses on. Recessively inherited, loss-of-function
mutations leading to altered, reduced or absent gene products required for normal
mitochondrial function are usually implicated in infantile-onset disorders. Of
interest, therefore, are the non-synonymous SNVs (nsSNVs), which change the corresponding
codon so that it either codes for a different amino acid (a missense
mutation) or becomes a stop codon (a nonsense mutation).
Discovery of Mendelian disease genes through exome sequencing has been growing at
an impressive rate since it was first demonstrated a few years ago (Ng et al., 2009;
2010). As originally proposed, most studies concern well-characterised disorders
with clear phenotypes affecting a small number of families or unrelated individuals
(Bamshad et al., 2011). By comparing exomes grouped by disorder, the search can
be narrowed down to the variants shared by all (or most) of the patients and not
found in unaffected family members and controls. This approach does not apply
to our cohort of children with suspected mitochondrial disease, who lack a firm
clinical or, in some cases, biochemical diagnosis, and in whom the same mutation
can cause different phenotypes and the same phenotype can be caused by different
mutations. Indeed, only a molecular diagnosis can establish whether patients in
the cohort have particular mitochondrial disorders in common. The exome variant
data analysis workflow discussed here has been applied in studies starting with a
single index patient (n=1 studies) suspected of having a mitochondrial disorder with
variably defined genotype-phenotype relationships.
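For contrast with the n=1 setting, the multi-exome comparison strategy mentioned above (keep the variants shared by all or most patients and absent from unaffected relatives and controls) can be sketched with plain set operations. This is only an illustration under assumptions: variants are represented as hypothetical `(chromosome, position, ref, alt)` tuples, and the function name and the `min_patients` parameter are invented for the example.

```python
from collections import Counter


def shared_candidate_variants(patient_variants, control_variants, min_patients=None):
    """Return variants carried by at least `min_patients` patients and by no control.

    patient_variants: dict mapping a patient id to a set of variant keys
    control_variants: list of sets of variant keys (unaffected relatives, controls)
    min_patients:     minimum number of carriers; defaults to all patients
    """
    if not patient_variants:
        return set()
    if min_patients is None:
        min_patients = len(patient_variants)

    # Count in how many patients each variant is observed.
    carrier_counts = Counter(
        variant
        for variants in patient_variants.values()
        for variant in set(variants)
    )
    # Pool every variant seen in any unaffected individual.
    seen_in_controls = set().union(*control_variants)

    return {
        variant
        for variant, n_carriers in carrier_counts.items()
        if n_carriers >= min_patients and variant not in seen_in_controls
    }


if __name__ == "__main__":
    # Made-up variant keys, for illustration only.
    a = ("chr1", 1000, "G", "A")
    b = ("chr2", 2000, "C", "T")
    c = ("chr3", 3000, "T", "G")
    patients = {"P1": {a, b, c}, "P2": {a, b}, "P3": {a, c}}
    controls = [{b}]
    print(shared_candidate_variants(patients, controls))
    print(shared_candidate_variants(patients, controls, min_patients=2))
```

With a single index patient there is no intersection to take, which is why the workflow discussed in this thesis has to rely on other sources of evidence to narrow down the candidate variants.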
It is clear that the challenge of identifying all genes linked to mitochondrial disease,
a remarkably heterogeneous and complex group of disorders, has greatly and quickly
benefited from exome sequencing (Tucker et al., 2011; Tyynismaa et al., 2012; Elo
et al., 2012; Haack et al., 2012; Kornblum et al., 2013; Carroll et al., 2013; to
cite some of the most recent data). Another indication of the impact of exome
sequencing is the growing interest in extending its use as a diagnostic tool from
research to clinical settings (Bamshad et al., 2011; Haack et al., 2012; McCormick
et al., 2012).
2.1.1 Technological limitations
In spite of the many successes, there are technological limitations to exome sequencing
that should be noted. To start with, causal variants located outside of coding
regions (in introns, untranslated regions (UTRs) and regulatory regions) are missed. Moreover,
a consensus map of the coding regions of the human genome is still being laid
out (Pruitt et al., 2009). An exome is defined in practice by the specifications of the
particular capture method (kit) employed.
Types of genetic variation that involve a genomic context broader than single exons
are not well detected. These include structural rearrangements and CNVs such as
repeats and larger deletions. Nonetheless, methods have been emerging for detection
of CNVs from exome data (Krumm et al., 2012; Fromer et al., 2012).
Errors can occur during exome capture due to defects in probes or deletions in exons,
for example. Also, hybridisation is inherently not fine enough to differentiate between
exons belonging to genes with very similar sequences such as close paralogues,
pseudogenes, and gene family members (Gnirke et al., 2009).
There are many parts to the process of variant calling through alignment of the
captured short reads, amplified to the millions, to a reference sequence. It is acknowledged
that the reference human genome is still not free of errors. Major quality
improvements have recently been made by the GENCODE project (Harrow et al.,
2012). Insufficient coverage depth at some positions of the target sequence is a common
problem which hinders reliable variant interpretation. A coverage depth of 20× for
80% of the sequence has been an early de facto standard. Newer technologies now
aim at an average depth of coverage in the 60× to 180× range.
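The two coverage figures quoted above, the fraction of target positions reaching a minimum depth and the average depth, can be computed from per-position read depths. The sketch below assumes the depths are already available as a list of integers, one per targeted position. The function names and the example values are hypothetical.

```python
def fraction_covered(depths, min_depth=20):
    """Fraction of target positions with read depth of at least `min_depth`."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions given")
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


def mean_depth(depths):
    """Average read depth over all target positions."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions given")
    return sum(depths) / len(depths)


def meets_coverage_standard(depths, min_depth=20, min_fraction=0.8):
    """True if at least `min_fraction` of positions reach `min_depth`."""
    return fraction_covered(depths, min_depth) >= min_fraction


if __name__ == "__main__":
    # Made-up depths for ten target positions.
    example = [25, 30, 18, 40, 22, 5, 60, 21, 35, 20]
    print(fraction_covered(example))         # share of positions at 20x or more
    print(mean_depth(example))               # average depth
    print(meets_coverage_standard(example))  # 20x over 80% criterion
```

A summary of this kind says nothing about which positions are poorly covered, so a candidate gene may still need to be inspected position by position.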
In sum, characteristics of the chemistry used in the sequencing platform, of the reference
sequence and of the alignment and variant calling algorithms can all influence
the extent to which all true, and only true, variants are identified. Complete and
correct identification of all types of human genetic variation is an ongoing challenge
for exome sequencing and NGS technologies at large.