Exome Sequencing for Diagnosis of Suspected Mitochondrial Disease



The present thesis project was carried out within the Research

Program for Molecular Neurology. The group holds

as its mission "to understand the molecular background of mitochondrial disorders,

and use that knowledge to develop diagnosis and therapy."

The project was motivated by the group having started to use exome sequencing,

performed in cooperation with FIMM (Institute for Molecular Medicine Finland),

for molecular diagnosis of patients, mostly children, with suspected mitochondrial

disease. The aim was then to develop an approach for the analysis of genetic variation

data resulting from exome sequencing, in order to identify the mutations and

genes linked to the patients’ disorders. In particular, the approach was to be applied

in studies comprising a single patient exome, without associated sequence data from

family members or from other patients affected by the same disorder.

An exome variant data analysis workflow was developed which is customised for

the characteristics of infantile mitochondrial disorders, combining computational

resources built in-house and external databases and tools. The development involved

users of the workflow — several members of the group — who took part

in iterative rounds of proposed improvements and practical application in patient studies.


This thesis discusses all the elements and the setup of the workflow. Patient study

examples illustrate how the workflow elements are put together. The results obtained

in the analysis of exome variant data of a cohort of 49 paediatric patients are

also presented. The workflow was effective in identifying single nucleotide variants

(SNVs) in nuclear genes causing mitochondrial disease, as validated by functional

studies, for 10 of the patients.

The thesis is structured as follows. Chapter 2 proceeds with some background on

mitochondrial disorders and exome sequencing. The core of the thesis is Chapter

3 where the workflow itself is discussed, including application examples. Chapter 4

presents the outcome of applying the workflow to a cohort of infantile-onset patients.

In Chapter 5 we briefly address current and future work that stemmed from this

project. Lastly, concluding considerations are drawn in Chapter 6.


2 Mitochondrial disease and exome sequencing

Mitochondria are organelles present in almost all our cells, with mature red blood

cells as the only exception. Several cellular processes that are essential for life take

place in mitochondria, most prominently the production of energy which makes

the organelles known as the cell’s power plants. Energy is produced via cellular

respiration, whereby biochemical energy from nutrients in food molecules is

converted into ATP (adenosine triphosphate) molecules while oxygen is

reduced to water.

Mitochondria have their own genome: a small and circular DNA molecule containing

16 569 base pairs, believed to have been originally acquired by endosymbiosis

between our distant single-celled ancestor and a bacterium a few billion years

ago. Unlike that of nuclear DNA (nDNA), the inheritance of mitochondrial DNA

(mtDNA) is strictly maternal. The existence of mtDNA does not make the organelle

genetically self-sufficient, however. Most of the hundreds of proteins involved in the

energy production pathway are encoded in the nucleus, synthesised in the cytoplasm

and then imported into mitochondria. Notably, energy production in mitochondria

is the only process in the mammalian cell known to involve two genomes — mtDNA

and nDNA — operating in fine coordination.

Mutations in mtDNA or nDNA that affect proteins involved in energy metabolism

in mitochondria are the underlying cause of mitochondrial disease, although environmental

factors can also play a part (Ylikallio and Suomalainen, 2012). These are

usually severe and progressive disorders with an estimated minimum prevalence of

one in every 5 000 births, on the basis of combined data from studies undertaken

in Australia, for infantile-onset disorders, and England, for adult-onset disorders

(Thorburn, 2004). Treatment remains mostly palliative with no cure available at

this time (Koene and Smeitink, 2011).

Heterogeneity is the hallmark of mitochondrial disorders, from genetic to biochemical

to clinical features, making the disorders complex and difficult to diagnose:

"Oxidative phosphorylation, i.e., ATP synthesis by the oxygen-consuming

respiratory chain (RC) [in mitochondria], supplies most organs and tissues

with energy... Consequently, RC deficiency can theoretically give

rise to any symptom, in any organ or tissue, at any age, with any mode of

inheritance, due to the twofold genetic origin of RC components (nuclear

DNA and mitochondrial DNA)." (Munnich and Rustin, 2001)


To date, more than 100 causative mtDNA and nDNA genes have been linked to mitochondrial

disease (Tucker et al., 2010). Phenotypes usually manifest in multiple

organ systems, have a wide spectrum of age of onset, from the perinatal period to adulthood,

and vary in presentation and severity throughout an individual’s life span and between

individuals (Suomalainen, 2011). Children tend to be the most severely affected

with the poorest prognoses. Diagnosis in children is also the most challenging.

Their clinical presentation is markedly variable and histological findings are often

less specific compared to adult patients (Thorburn and Smeitink, 2001; Wolf and

Smeitink, 2002). Amongst so much diversity, energy deficiency in the cells, and in

the patient as a consequence, is the only unifying feature of mitochondrial disorders.

Within the currently known genetic underpinnings and genotype-phenotype correlations

of mitochondrial disorders, at best about half of patients studied by a particular

diagnostic centre have a causative mutation found in one of the known disease genes

(Kirby and Thorburn, 2008; Calvo et al., 2010). Mutant genes in mtDNA and their

associated disorders have been mapped out, while many nuclear genes remain to

be uncovered. It has been estimated that mutations in nuclear genes cause roughly

one-third of adult-onset and three-quarters of infantile-onset mitochondrial disease

(DiMauro and Schon, 2003). When current knowledge of mitochondrial disease

genes is exhausted to no avail, the more exploratory approach of exome sequencing

can provide some answers.

2.1 Exome sequencing in identifying disease-causing mutations

Proteins are encoded by genes in nuclear and mitochondrial DNA. Each gene has

coding sections, called exons, and non-coding sections, called introns. The exome

consists of all exons of all genes in a genome. The human exome is estimated to

correspond to only about 1% of the total genome, amounting to approximately

30 Mb. This relatively small part of the genome, however, holds most of the mutations

currently known to be associated with human genetic diseases. What is

more, the exome is as yet better understood than non-coding and regulatory regions

of the genome. Whole-exome sequencing is seen as a middle-ground approach for

identifying Mendelian disease genes: it is more comprehensive and less biased than

sequencing a pre-determined gene panel while at the same time potentially more

cost-effective than sequencing and studying the entire genome.
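
The proportion quoted above can be checked with simple arithmetic. The following is a minimal sketch: the 30 Mb exome size comes from the text, while the genome size of roughly 3,100 Mb is an assumed round figure for the haploid human genome, and the function name is illustrative only.

```python
# Approximate sizes in megabases (Mb).
# EXOME_MB is taken from the text; GENOME_MB is an assumed round figure.
EXOME_MB = 30.0
GENOME_MB = 3100.0


def exome_fraction(exome_mb: float = EXOME_MB, genome_mb: float = GENOME_MB) -> float:
    """Return the share of the genome covered by the exome."""
    return exome_mb / genome_mb


print(f"Exome share of the genome: {exome_fraction():.1%}")
```

With these inputs the share comes out at about 1%, in line with the estimate given in the paragraph.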


Exome sequencing arose from the development of methods that couple together

targeted capture and massively parallel DNA sequencing (also referred to as ‘next-generation’

sequencing (NGS)). The technique of capture of targeted genomic loci,

today widely used for exome capture, was proposed by Gnirke et al. (2009). In the

article, a fishing analogy illustrates the technique: exon baits are thrown in excess

in a pond of total human DNA fragments for a catch of enriched segments of exonic

DNA. The baits are long single-stranded oligonucleotide probes, each consisting of a

target exome segment, long enough to hybridise in solution with protein-coding exons

(they are on average 169 bp long) and flanked on both sides by primer sequences

for amplification. Since many target exons are shorter than the designed probes,

the captured sequence as a whole extends beyond the 30 Mb exome. Magnetic

beads are used to amass the catch of exome segments, which are amplified and can

then be sequenced in the chosen sequencing platform. Next, the obtained exome

sequence reads are aligned to the human reference genome. Identified differences

are genotyped to make up the set of genetic variants contained in the exome of

the sequenced individual. Today, the most widely used software for the alignment

step is the Burrows-Wheeler Alignment tool (BWA) (Li and Durbin, 2009), and

for variant calling, the Genome Analysis Toolkit (GATK) (DePristo et al., 2011)

developed at the Broad Institute, and SAMtools (Li et al., 2009) developed at the

Sanger Institute.
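
The idea of comparing aligned reads with the reference and genotyping the differences can be illustrated with a toy example. The sketch below is not how BWA, GATK or SAMtools work internally; it assumes reads that are already aligned without gaps, and the function names and thresholds (`min_depth`, `min_alt_fraction`) are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Count observed bases per reference position.

    `reads` is a list of (start, sequence) pairs, where `start` is the
    0-based reference position of the first base (ungapped alignment assumed).
    """
    counts = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts


def call_snvs(reference, reads, min_depth=3, min_alt_fraction=0.3):
    """Return [(position, ref_base, alt_base, depth)] for candidate SNVs."""
    calls = []
    for pos, bases in sorted(pileup(reads).items()):
        depth = sum(bases.values())
        if depth < min_depth:
            continue  # too few reads to say anything at this position
        ref_base = reference[pos]
        alts = {b: n for b, n in bases.items() if b != ref_base}
        if not alts:
            continue  # every read agrees with the reference
        alt_base, alt_count = max(alts.items(), key=lambda kv: kv[1])
        if alt_count / depth >= min_alt_fraction:
            calls.append((pos, ref_base, alt_base, depth))
    return calls


# Made-up reference and reads, for illustration only.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "TCGTAC")]
print(call_snvs(reference, reads))
```

In this made-up data, three of the four reads covering position 4 carry a T where the reference has an A, so that position is reported as a candidate SNV. Production variant callers additionally model base and mapping quality, genotype likelihoods and indels, none of which is attempted here.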

The types of genetic variation that can be ascertained by current exome variant calling

methods and accompanying tools are SNVs (point mutations, single nucleotide

substitutions), indels (small insertions and deletions) and CNVs (copy number variants).

SNVs are the simplest and most common type of variation, as well as the

most prevalent in association with disease. They make up approximately 55% of the

pathogenic mutations in the Human Gene Mutation Database (HGMD) (Stenson

et al., 2009). So far, the methods for identifying variants in NGS data are more

accurate in calling SNVs, as compared to other types of variation (McKenna et al.,

2010; DePristo et al., 2011). Also, for the subsequent stage of data analysis, SNVs

are the best served by a variety of computational resources, such as large-scale

population databases and tools for prediction of their functional effect on proteins.

In the studies of molecularly undiagnosed patients using exome sequencing data, the

Wartiovaara group concentrated first on SNVs. For this reason, SNVs are the type

of genetic variation this thesis focuses on. Recessively inherited, loss-of-function

mutations leading to altered, reduced or absent gene products required for normal

mitochondrial function are usually implicated in infantile-onset disorders. Of


interest, therefore, are the non-synonymous SNVs (nsSNVs), which change the corresponding

codon so that it either codes for a different amino acid (a missense

mutation) or becomes a stop codon (a nonsense mutation).
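
The distinction between synonymous, missense and nonsense SNVs can be made concrete by translating a codon before and after the substitution. The sketch below assumes the standard nuclear genetic code and coding-strand DNA; it would not be correct for genes encoded in mtDNA, which use a different code. The function name and the extra "stop-loss" label (for a stop codon changed into an amino acid, a case not discussed in the text) are illustrative additions.

```python
from itertools import product

# Standard (nuclear) genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(codon, position, alt_base):
    """Classify a single-base substitution inside one codon.

    `codon` is three DNA bases on the coding strand, `position` is 0, 1 or 2,
    and `alt_base` is the substituted base.
    """
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


print(classify_snv("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_snv("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print(classify_snv("GAA", 0, "T"))  # GAA (Glu) -> TAA (stop)
```

The three calls illustrate one synonymous, one missense and one nonsense change at the same glutamate codon.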

Discovery of Mendelian disease genes through exome sequencing has been growing at

an impressive rate since it was first demonstrated a few years ago (Ng et al., 2009;

2010). As originally proposed, most studies concern well-characterised disorders

with clear phenotypes affecting a small number of families or unrelated individuals

(Bamshad et al., 2011). By comparing exomes grouped by disorder, the search can

be narrowed down to the variants shared by all (or most) of the patients and not

found in unaffected family members and controls. This approach does not apply

to our cohort of children with suspected mitochondrial disease, who lack a firm

clinical or, in some cases, biochemical diagnosis, and in whom the same mutation

can cause different phenotypes and the same phenotype can be caused by different

mutations. Indeed, only a molecular diagnosis can make certain whether patients in

the cohort have particular mitochondrial disorders in common. The exome variant

data analysis workflow discussed here has been applied in studies starting with a

single index patient (n=1 studies) suspected of having a mitochondrial disorder with

variably defined genotype-phenotype relationships.
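
For contrast with the n=1 setting, the cohort-based narrowing described at the start of this paragraph amounts to a set operation: keep the variants carried by all (or most) patients and discard any seen in unaffected relatives or controls. The sketch below is a minimal illustration under assumed conventions; variants are represented as hypothetical (chromosome, position, reference, alternative) tuples, and the function name and data are invented.

```python
from collections import Counter


def shared_candidates(patients, controls, min_patients=None):
    """Return variants carried by at least `min_patients` patients
    and absent from every control.

    `patients` and `controls` are lists of sets of variant tuples.
    By default a variant must be present in all patients.
    """
    if min_patients is None:
        min_patients = len(patients)
    # Count in how many patients each variant occurs.
    carriers = Counter(v for variants in patients for v in set(variants))
    seen_in_controls = set().union(*controls)
    return {
        v for v, n in carriers.items()
        if n >= min_patients and v not in seen_in_controls
    }


# Invented example data: three patients and one control.
patients = [
    {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("3", 300, "G", "A")},
    {("1", 100, "A", "G"), ("2", 200, "C", "T")},
    {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("4", 400, "T", "C")},
]
controls = [{("2", 200, "C", "T")}]
print(shared_candidates(patients, controls))
```

Only the variant on chromosome 1 survives, since the chromosome 2 variant is also present in the control. With a single index patient there is nothing to intersect, which is why a different analysis strategy is needed for the cohort discussed here.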

It is clear that the challenge of identifying all genes linked to mitochondrial disease,

a remarkably heterogeneous and complex group of disorders, has greatly and quickly

benefited from exome sequencing (Tucker et al., 2011; Tyynismaa et al., 2012; Elo

et al., 2012; Haack et al., 2012; Kornblum et al., 2013; Carroll et al., 2013; to

cite some of the most recent data). Another indication of the impact of exome

sequencing is the growing interest in extending its use as a diagnostic tool from

research to clinical settings (Bamshad et al., 2011; Haack et al., 2012; McCormick

et al., 2012).

2.1.1 Technological limitations

In spite of the many successes, there are technological limitations to exome sequencing

that should be noted. To start with, causal variants located outside of coding

regions (in introns, untranslated (UTRs) and regulatory regions) are missed. Moreover,

a consensus map of the coding regions of the human genome is still being laid

out (Pruitt et al., 2009). An exome is defined in practice by the specifications of the

particular capture method (kit) employed.


Types of genetic variation that involve a genomic context broader than single exons

are not well detected. These include structural rearrangements and CNVs such as

repeats and larger deletions. Nonetheless, methods have been emerging for detection

of CNVs from exome data (Krumm et al., 2012; Fromer et al., 2012).

Errors can occur during exome capture due to defects in probes or deletions in exons,

for example. Also, hybridisation is inherently not fine enough to differentiate between

exons belonging to genes with very similar sequences such as close paralogues,

pseudogenes, and gene family members (Gnirke et al., 2009).

There are many parts to the process of variant calling through alignment of the

captured short reads, amplified to the millions, to a reference sequence. It is acknowledged

that the reference human genome is still not free of errors. Major quality

improvements have recently been made by the GENCODE project (Harrow et al.,

2012). Insufficient coverage depth of the target sequence at some positions is a common

problem which hinders reliable variant interpretation. A coverage depth of 20× for

80% of the sequence has been an early de facto standard. Newer technologies now

aim at an average depth of coverage in the 60× to 180× range.
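
The coverage criterion mentioned above (a depth of at least 20× over 80% of the target) can be evaluated from per-position depth values. The following is a small sketch with invented numbers; the function names are hypothetical, and in practice the depths would come from the output of an alignment tool rather than a hand-written list.

```python
def breadth_at_depth(depths, min_depth=20):
    """Fraction of target positions covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("no positions given")
    return sum(d >= min_depth for d in depths) / len(depths)


def meets_standard(depths, min_depth=20, min_breadth=0.8):
    """Check the '20x over 80% of the target' style of criterion."""
    return breadth_at_depth(depths, min_depth) >= min_breadth


# Invented per-position read depths for a ten-base target.
depths = [5, 18, 20, 22, 25, 30, 35, 40, 60, 12]
print(breadth_at_depth(depths))
print(sum(depths) / len(depths))  # average depth
print(meets_standard(depths))
```

In this example the average depth is above 20×, yet only seven of the ten positions reach that depth, so the criterion is not met. This shows why average depth alone does not guarantee that individual positions are covered well enough for reliable variant calls.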

In sum, characteristics of the chemistry used in the sequencing platform, of the reference

sequence and of the alignment and variant calling algorithms can all influence

the extent to which all true, and only true, variants are identified. Complete and

correct identification of all types of human genetic variation is a running challenge

for exome sequencing and NGS technologies at large.