Over the past 12 months (June 2016-July 2017), I have undertaken a professional placement in the R&D cancer team at Oxford Gene Technology (OGT). Here I have worked on protocol optimisation and SureSeq cancer panel creation using next generation sequencing (NGS) techniques. In this report, I will give an overview of the process of next generation sequencing, and its current applications as a precision medicine tool.
Oxford Gene Technology (OGT) is a molecular genetics company founded by Professor Sir Edwin Southern in 1995, following the patenting of DNA microarrays in 1991. OGT developed a high-throughput microarray service department, Genefficiency Genomics, which later expanded to offer NGS-based DNA analysis. The company also developed panels of biomarkers for the detection of a range of cancers. Over the years, the company has shifted its focus to providing high-quality genomic products to meet the needs of the genetics market. OGT has three major areas of focus: SureSeq, CytoSure and CytoCell, each with its own products that help researchers detect and interpret genetic abnormalities accurately.
During my year at OGT, I worked as a member of the solid cancer panel team within the SureSeq R&D team. My main role was to develop solid cancer (refer to appendix, page 46, for more information) panels through the empirical testing of baits, and to help develop the library preparation protocol used together with SureSeq products. The OGT protocol is a vital component of SureSeq and needs constant optimisation to ensure it works efficiently with the products. I started my placement with very limited knowledge of genomics and the wide range of techniques involved. Over the placement, I have developed my skills in NGS and in molecular biology.
The sequencing panels target genes that are implicated in the disease in question and can be either disease-specific or a mixture of numerous genes for pan-cancer panels. The panels work on the basis of target enrichment by hybrid capture, whereby a PCR-amplified library is hybridised with a pool of selected baits, either on a solid surface or in solution. The selected baits (refer to appendix for more details, page 46) are complementary to specific genes known to be mutated in key genetic diseases. At OGT, the in-solution hybridisation technique is used as it has previously been shown to provide better results (Samorodnitsky. E., 2015). OGT has evolved from providing only two fixed panels to allowing customers to customise their panels by selecting a variety of genes specific to their needs. The myeloid panel, for example, is essentially a pool of baits complementary to key genes involved in the development of myeloid cancer.
I was given the task of developing the lung cancer panel and the glioma panel, as well as refining and developing the protocol to allow copy number variation (CNV) to be identified in samples. This role involved a lot of background reading about glioma and lung cancer, as well as about CNVs and the detrimental effects they can have. I also worked closely with my supervisor and the computational biology team to ensure that CNVs were called correctly, and to provide as much raw data as possible for the team to use.
Following the great success of the discovery of the three-dimensional structure of DNA by Watson and Crick in 1953, the next challenge was to accurately determine the sequence of DNA. The first DNA genome to be sequenced was that of the bacteriophage PhiX (Sanger.F, 1977), using the “plus and minus” method designed by Fred Sanger and Alan Coulson, which depends on the enzyme DNA polymerase to synthesise the strand using radiolabelled nucleotides before performing second polymerisation reactions.
In 1977, Sanger refined his method, which eventually led to a major breakthrough in DNA sequencing technology. The method is now commonly referred to as Sanger sequencing. The technique exploits radiolabelled chemical analogues of deoxyribonucleotides: dideoxyribonucleotides (ddNTPs). This technique was used to sequence the first eukaryotic genome, that of Saccharomyces cerevisiae, in 1996 (Engel. S.R., 2014). The technique was also used heavily in the Human Genome Project, in which various laboratories collaborated to sequence the human genome; this took 12 years and $2.7 billion (National Human Genome Research Institute, 2003).
Sanger sequencing has been the most popular DNA sequencing technique for the last 25 years due to its accuracy and ease of use (Schuster. S.C., 2008). Over time, a few improvements were made to Sanger sequencing to enable high-throughput sequencing machines to be built. The biggest improvement was the replacement of radiolabelling with fluorometric-based detection, which allows the reaction to occur in one vessel instead of four. The first generation of sequencing machines was introduced in 1987 and produced reads that were slightly less than one kilobase (kb) in length (Heather. J.M., 2015).
Next generation sequencing (NGS) is often described as the “new generation of non-Sanger based DNA sequencing techniques” (Schuster. S.C., 2008). It is said to have begun in 2005 with 454 Life Sciences, later acquired by Roche, which developed the sequencing by synthesis technique (Guzvic, 2013). NGS techniques allow a large amount of DNA to be sequenced in a short amount of time, by means of parallel sequencing, whilst maintaining the sensitivity and specificity of Sanger sequencing. The cost per raw megabase of DNA and the cost per genome fell dramatically from July 2001 to July 2013 (Figure 1) due to advances in sequencing techniques (Barba, 2014). It is expected that, as sequencing techniques advance into third generation sequencing, the cost will fall further as sequencing capacity increases.
Figure 1: Graph showing the cost (in US Dollars) per raw megabase of DNA sequence (top) and cost per genome (bottom) from July 2001 to July 2013 estimated by the National Human Genome Research Institute, U.S. (National Human Genome Research Institute, 2016).
Now this benchmark has been met, this perhaps marks the peak of, and end of, NGS technologies, leading into the third generation, aiming to reduce time and cost further by decreasing the number of reagents used (Schadt et al., 2010).
Application of NGS
NGS has very wide applications in various fields of science. It is most widely used in genomics to provide a better understanding of the genomes of various organisms. There are three ways in which an organism’s DNA can be sequenced: whole genome sequencing (WGS), whole exome sequencing (WES) and targeted sequencing. WGS refers to determining the complete DNA sequence of an organism’s genome, whereas WES refers to sequencing all the protein-coding genes in a genome; each has its own advantages, as shown in Table 1. Targeted sequencing uses baits or probes that are designed against genes known to have a role in the development of specific cancer(s) (Hagemann. I.S., 2013).
In recent studies, both WES and WGS have been used as diagnostic tests for copy number variation (CNV) (Hehir-Kwa, 2015). The most common methods for detecting CNV are microarray-based comparative genomic hybridisation (aCGH) and multiplex ligation-dependent probe amplification (MLPA). CNV is a type of structural variation that results in an abnormal number of copies of one or more genes, arising from duplications or deletions of large regions of the genome. Given the advantages of NGS, researchers are attempting to use NGS techniques to identify CNVs because of the clinical implications this has for diagnostic laboratories (Hehir-Kwa, 2015). The difficulty with using NGS data is developing a reliable analytical method that uses the raw data to determine whether any CNVs are present.
With the advancement of NGS, many individuals can have a comprehensive genomic profile, which has many advantages, especially for a cancer patient. This profile allows specific data to be used to select targeted therapies or biomarkers specific to the patient. On top of that, NGS has improved precision medicine tremendously in oncology. Using the various methods of sequencing, various drugs have been designed based on this information, and much has been learned about the developmental pathways involved in certain cancers (Nawav. D.H., 2015).
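As a rough illustration of the read-depth idea that NGS-based CNV detection commonly relies on, the sketch below compares median-normalised depth per target between a sample and a control and reports a log2 ratio. It is a minimal sketch and not OGT's analysis method; the function names, the target names, the depths and the gain/loss thresholds are all assumptions chosen for illustration.

```python
import math
from statistics import median


def log2_ratios(sample_depth, control_depth):
    """Per-target log2 ratio of median-normalised depth (sample vs control)."""
    sample_median = median(sample_depth.values())
    control_median = median(control_depth.values())
    ratios = {}
    for target, depth in sample_depth.items():
        s = depth / sample_median
        c = control_depth[target] / control_median
        # A target with no reads in the sample has no finite log ratio.
        ratios[target] = float("-inf") if s == 0 else math.log2(s / c)
    return ratios


def call_cnv(ratio, gain=0.4, loss=-0.6):
    """Classify a log2 ratio; the thresholds are illustrative assumptions."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "neutral"


# Hypothetical mean depths per target region
control = {"T1": 100, "T2": 100, "T3": 100, "T4": 100}
sample = {"T1": 100, "T2": 100, "T3": 100, "T4": 200}

for target, ratio in log2_ratios(sample, control).items():
    print(target, round(ratio, 2), call_cnv(ratio))
```

In a diploid genome a single-copy gain corresponds to a ratio of 3/2 and a single-copy loss to 1/2, which is why the two thresholds are not symmetric around zero.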
Figure 2: The library preparation workflow followed when preparing DNA for use with the OGT SureSeq cancer panels. Adapted from (Oxford Gene Technology, 2016)
A universal step in next generation sequencing (NGS) is the preparation of libraries in which DNA or RNA molecules are converted into a sequencing library to be compatible with a sequencing platform. In the following section, each step involved in the library preparation (figure 2) will be explained.
- Quantification of genomic DNA
Genomic DNA (gDNA) is chromosomal DNA that encodes the genome of the organism.
Prior to creating libraries, the quality and quantity of the gDNA have to be determined, both to calculate the volume required for a specific starting amount and to assess the quality of the DNA. At OGT, the human gDNA used comes from two sources: it is extracted either from blood or from formalin-fixed paraffin-embedded (FFPE) tissue. Several quality control steps are done here:
- DNA integrity
DNA integrity refers to the condition of the DNA and reflects how intact it is after extraction, which involves various processes that could damage the DNA by mechanical disruption. This can be assessed using the Agilent 2200 TapeStation System, which uses an algorithm to provide a numerical assessment of the integrity of the gDNA by comparing it to ~7000 different gDNA samples, including FFPE (Gassmann.M, 2015). The number assigned is called the DNA Integrity Number (DIN). The Genomic DNA ScreenTape assay can analyse genomic DNA ranging from 200bp to >60,000bp; an example of the electrophoresis gel image is shown in Figure 3. The fragment size acts as an indicator of the quality of the sample, and a length of <1kb is considered too degraded for sequencing.
Figure 3: The electrophoresis image produced by a Genomic DNA run on the Agilent 2200 TapeStation. This shows the peak of the lower marker and the peak of the DNA. This DNA has a fragment size >60,000bp and so is not significantly degraded.
- DNA concentration
At OGT, the main method of determining the DNA concentration is by fluorescence, and sample concentrations are expressed in ng/ul. The Qubit Fluorometer assay kits are used to detect the concentration of double-stranded DNA in the sample. To cover a wide range of concentrations, the Qubit assay comes in two formats, high sensitivity (HS) and broad range (BR), which detect 10pg/ul – 100ng/ul and 100pg/ul – 1ug/ul respectively (Nakayama. Y., 2016). The assay uses fluorescent dyes, and the fluorescence signal is directly proportional to the concentration of the DNA.
After quantification, the DNA concentration can be used to calculate the volume required for library preparation. At OGT, the protocol suggests starting with 1-2ug of DNA; however, over the past year the protocol has been optimised to ensure it works with a starting amount of 500ng and below.
- Shearing of genomic DNA
The genomic DNA must be broken into smaller fragments compatible with the sequencing platform, generally 100 – 300 bp. This is usually done in one of two ways: mechanical shearing (e.g. with the Covaris) or enzymatic digestion (e.g. with Fragmentase).
The Covaris S-series fragments DNA through sonication, applying bursts of high-frequency acoustic energy to shear long DNA fragments at 4-8°C (Covaris). The high frequency allows the acoustic energy to be focused into a central zone, which makes the process highly controllable and flexible as the settings can be independently programmed. Different fragment sizes can be achieved by changing the settings. The volume required for Covaris shearing is 130ul to obtain a target size of 150-200 bp. The volume of DNA calculated from the concentration determined previously is made up to 130ul with low TE buffer. Low TE buffer is used as it offers greater stability during fragmentation in comparison to water.
As an example:
Concentration of sample: 294 ng/ul. For 1 μg (1000 ng) of DNA:
Volume of DNA needed for 1 μg of DNA = 1000 ng ÷ 294 ng/ul ≈ 3.4 ul
Thus, 3.4 ul of sample and 126.6 ul of low TE buffer are needed to make up 130 ul.
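The same arithmetic can be written as a small helper. This is a minimal sketch of the calculation above; the function name and the rounding to one decimal place are my own choices rather than part of the OGT protocol.

```python
def dilution_volumes(conc_ng_per_ul, target_ng, final_volume_ul):
    """Return (sample_ul, diluent_ul) that place target_ng of DNA in final_volume_ul."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    sample_ul = target_ng / conc_ng_per_ul
    if sample_ul > final_volume_ul:
        raise ValueError("sample is too dilute for the requested final volume")
    return round(sample_ul, 1), round(final_volume_ul - sample_ul, 1)


# Covaris shearing: 1 ug of a 294 ng/ul sample made up to 130 ul with low TE
print(dilution_volumes(294, 1000, 130))  # (3.4, 126.6)
```

Changing `final_volume_ul` to 16 gives the water volume for the enzymatic fragmentation route with the same sample.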
The DNA is loaded into Covaris microTUBEs, which are glass tubes specially designed to focus the acoustic energy on the DNA sample. Mechanical shearing is a very accurate method of DNA shearing, and shearing with the Covaris is the method recommended by Illumina and Agilent for their library preparation protocols. The only downside is the cost of the machine, which limits access to a Covaris in many laboratories; a less expensive alternative is shearing by sonication with the Picoruptor sonication device by Diagenode.
An alternative method is enzymatic fragmentation, which uses a cocktail of enzymes that digest DNA at 37°C. As with mechanical shearing, the DNA length can be controlled, in this case by adjusting the incubation time of the DNA with the enzymes. The longer the incubation time, the shorter the fragments will be. For enzymatic fragmentation, the DNA needs to be made up to 16ul beforehand, and this is done with water instead of low TE buffer.
Using the same DNA sample as in the example above, 12.6ul of water is added to 3.4ul of sample to make it up to 16ul.
The reaction is stopped by pipette mixing the sample with 5ul of 0.5M EDTA. EDTA stops the enzyme activity by chelating ions required by the enzymes. This achieves a target fragment size of 225-250 bp. Enzymes are a cheaper and higher-throughput alternative, but they are prone to degradation and do not produce fragment sizes as consistent as those from mechanical shearing.
- Post-shear DNA purification
There are several reasons for DNA purification: removal of buffer to prevent carry-over between the various stages of library preparation, changes in volume, removal of enzymes, and size selection through the removal of small (<100 bp) DNA fragments. After shearing, the DNA is purified to concentrate the sample in a smaller volume for subsequent steps in the library preparation, and to remove excess enzymes and the buffer the DNA was originally suspended in. DNA purification can be done using columns or using solid-phase reversible immobilization (SPRI) beads, which are carboxyl-coated magnetic particles that bind DNA reversibly. Size selection of DNA depends on the ratio of polyethylene glycol (PEG) to salt concentration in the bead suspension, so the volumetric ratio of beads to DNA is vital. When the bead volume is smaller relative to the DNA, only larger fragments are captured, as the PEG acts as a “crowding agent” that causes negatively charged DNA to bind to the carboxyl groups on the bead. This ensures the beads only attract fragments of >100 bp, and those between 50-100 bp are reduced significantly.
Several precautions must be taken before the beads can be used: the beads must be at room temperature and have to be mixed by vortex prior to use. The beads and DNA are incubated at room temperature for five minutes to allow binding and then placed on a magnet, which collects the beads and the bound DNA, leaving the smaller fragments of DNA in the supernatant. The bead pellet is washed using 200 ul of 70% ethanol to remove any contaminants; a total of two ethanol washes are done. The bead pellet is then dried at 37°C to evaporate any residual ethanol. Either water or TE buffer is used as the elution buffer, as it causes the beads to release the DNA, re-suspending it in the eluate.
Figure 4: Simplified process of a DNA clean-up with SPRI beads, specifically Agencourt AMPure XP beads. The OGT protocol differs slightly, with two ethanol washes and a drying phase between the washes and elution. The DNA is eluted in water before being transferred to fresh tubes (Beckman Coulter, 2013).
The purifications involved in subsequent steps of the library preparation have slight differences. Post-ligation and post-PCR purifications focus on size selection and the removal of smaller, unwanted DNA fragments and excess reagents. Thus, the volume of beads used is calculated to maintain a ratio of 1.8X to the volume of DNA. The bead volume determines the binding capacity of the beads for the DNA.
Figure 5: Showing binding capacity of the DNA to the AMPure XP beads. This shows that when 18 µl is used for 10 µl of DNA input, a higher DNA concentration is recovered per µl (Beckman Coulter, 2013).
After each purification step, quality control (QC) steps are done, and the Agilent TapeStation is used to verify that the samples are of the expected size. In the software, a peak indicates the DNA size. If the starting amount of DNA was >500ng, a D1000 ScreenTape is used, with 3ul of buffer added to 1ul of DNA. If ≤500ng of DNA was used, a high sensitivity (HS) D1000 ScreenTape is used, with 2ul of buffer added to 2ul of DNA.
For FFPE DNA, the DNA must be repaired after it is sheared and before it is purified. Failure to repair the damage reduces the amount of amplifiable template DNA, which can cause problems at sequencing. Several types of damage have to be fixed, such as oxidised bases, DNA-protein crosslinks, nicks and gaps, blocked 3’ ends, deamination of cytosine to uracil, and DNA fragmentation. The repair is done using a cocktail of enzymes that addresses each type of damage.
- End Repair and A-base addition
Sheared DNA is a mixture of blunt-ended fragments and fragments with ‘sticky ends’, i.e. single-stranded overhangs at either or both ends. In addition, a single A-base needs to be added to the 3’ end of each strand so that the adaptors, which have a corresponding T-base overhang, can bind properly. Thus, the first step is to “fill in” the ends so that all fragments are blunt ended. This is done using a mixture of DNA modifying enzymes, whose respective functions are listed in Figure 6.
|Klenow fragment of DNA Polymerase I||Fills in 5’ overhangs|
|T4 DNA Polymerase||Removes 3’ overhangs (3’ to 5’ exonuclease activity) and fills in 5’ overhangs|
|T4 Polynucleotide Kinase||Phosphorylates 5’ ends to enable ligation|
|TOP DNA Polymerase||Inactivates 5’ to 3’ exonuclease activity to improve base incorporation|
Figure 6: List of DNA modifying enzymes involved in the end repair process with respective functions of each enzyme
A single dATP is then added to the 3’ end of both strands. All nucleotides are present in the master mix, but dATP is present in a higher proportion to increase the chance of A-tailing. The end-repair and A-base addition processes are run in a single tube: the samples are incubated with enzymes, buffers and dNTPs at 20°C for 30 minutes, to give the enzymes enough time to modify the DNA, and then at 72°C for 30 minutes, to allow addition of the single A-base.
Figure 7: The changes that occur to the DNA fragments during the early steps of library preparation. It shows how end repair removes sticky ends and the position where the A-base is added to the ends (O’Geen. H., 2014).
- Adaptor ligation
Double-stranded adaptors are ligated to the DNA by DNA ligases. These have a single 3’ T-base overhang that is complementary to the 3’ A-tail added in the previous step. During this process, two different adaptors are added: adaptor A and adaptor B. Ideally, one adaptor of each type ligates to each fragment, one at either end. However, this does not always happen, as some fragments will have the same adaptor at both ends. During PCR, only fragments with both types of adaptor, i.e. a different adaptor at each end, will be amplified, due to the primer sequences. After ligation, the DNA must be purified to remove any excess adaptors, buffer and enzymes.
As before, the concentration and size of the purified DNA are checked post-ligation, using the Agilent TapeStation and Qubit 2.0. If the starting amount was 1-2 ug of DNA, the size is checked with a D1000 ScreenTape and the concentration with the dsDNA BR assay. If starting with ≤500 ng, the size is checked with a HS D1000 ScreenTape and the concentration with the dsDNA HS assay. The QC steps check that the adaptors have ligated successfully, with a target DNA fragment size of between 220 and 245 bp. The concentration is checked to determine the number of PCR1 cycles to be carried out.
|Post-ligation DNA concentration (ng/ul)||Cycle numbers||Average expected duplication following sequencing*|
|DNA from FFPE tissue||12-14||>50%|
|<1 ng||10 or more||>50%|
Figure 8: Table showing how many PCR 1 cycles to carry out based on the post-ligation concentration. A lower number of cycles is preferable as the duplication will be lower. Starting with 1µg of DNA is likely to give a concentration between 8-12ng/µl, and 500ng 4-8ng/µl. However, as the table shows, the concentrations can vary widely (OGT, 2016).
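The QC choices described above (which ScreenTape, buffer-to-DNA volumes and Qubit assay to use for a given starting amount) amount to a simple decision rule, sketched below. The names `QcPlan` and `qc_plan` are hypothetical, and using the BR assay for every input above 500 ng is an assumption: the text only states it for 1-2 ug inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QcPlan:
    screentape: str
    buffer_ul: int
    dna_ul: int
    qubit_assay: str


def qc_plan(starting_ng):
    """Pick QC consumables from the starting amount of DNA in ng."""
    if starting_ng > 500:
        # Assumption: the BR assay applies to all inputs above 500 ng.
        return QcPlan("D1000", buffer_ul=3, dna_ul=1, qubit_assay="dsDNA BR")
    return QcPlan("HS D1000", buffer_ul=2, dna_ul=2, qubit_assay="dsDNA HS")


print(qc_plan(1000))
print(qc_plan(500))
```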
- Pre-capture PCR
The purified ligated DNA needs to be amplified through polymerase chain reaction (PCR) to ensure there is enough DNA for hybridisation. Theoretically, each cycle will double the total sample concentration. Consequently, a sample with a lower concentration will require more cycles in order to produce the same DNA concentration as a sample with a higher starting amount.
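To make the doubling argument concrete, the sketch below counts how many cycles an idealised PCR needs to reach a target amount. It assumes perfect or fixed per-cycle efficiency, which real reactions do not achieve, and the input amounts are hypothetical; in practice the cycle number is taken from the table in Figure 8, not from this calculation.

```python
def cycles_needed(start_ng, target_ng, efficiency=1.0):
    """Smallest number of cycles for start_ng to reach target_ng.

    efficiency=1.0 means the amount doubles every cycle (the theoretical case).
    """
    if start_ng <= 0 or target_ng <= 0:
        raise ValueError("amounts must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    cycles = 0
    amount = start_ng
    while amount < target_ng:
        amount *= 1 + efficiency
        cycles += 1
    return cycles


# Hypothetical inputs aiming for the 500 ng needed for hybridisation
print(cycles_needed(120, 500))  # 3
print(cycles_needed(15, 500))   # 6
```

An eight-fold lower input needs three extra cycles, which is why low-input and FFPE samples tend to end up with more cycles and higher duplication.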
Only half (15ul) of the ligation product is used in the PCR, with the rest of the volume retained for a repeat, if required. The standard hybridisation requires an input of 500ng of DNA. The concentration is checked using the Qubit and the dsDNA BR assay. If the ligation and PCR have been successful, most samples will have enough DNA to perform repeats of the hybridisation. At this point, if a bulk library is being made, all the PCR1 samples are pooled together and checked for quality. This helps eliminate any variation across samples from the experiments.
- Target capture and hybridisation
There are two widely used methods for targeted sequencing: PCR and hybrid capture. The main difference between the two is the stage at which the target enrichment occurs. For hybrid capture, the DNA of interest is isolated during hybridisation, which occurs after library preparation, using ‘baits’, or short oligos, that are complementary to the target region. Each ‘bait’ is designed for a specific target, normally a single exon.
The PCR method uses a different library preparation method, whereby the PCR is performed on unsheared genomic DNA using specific primers designed to capture the region of interest; the adaptors required for sequencing are also added at this stage. The PCR method is more straightforward and achieves very high enrichment and few off-target reads from relatively low amounts of material (Mertes. F., 2011). Both methods have their strengths and weaknesses, as shown in Figure 9. At OGT, in-solution target capture is preferred as it offers higher sensitivity in detecting mutations.
Figure 9: Table showing the advantages and disadvantages of the two most common target enrichment techniques. Both produce results that are useful for cancer diagnosis and identifying mutations in genes.
There are several key components involved in the hybridisation stage:
- Biotinylated baits (panels)
- Blocking oligos
- Human Cot-1 DNA
- Streptavidin-coated beads
After PCR, 500 ng of the PCR product with Human Cot-1 DNA and blocking oligos is dried down per sample. Once dried down, the DNA is resuspended in 2.5 ul of water. This ensures that all the samples have the same volume going into the hybridisation. The addition of the Human Cot-1 DNA and blocking oligos prevents non-specific hybridisation, such as daisy-chaining of baits, to ensure clearer results are obtained.
The DNA sample is mixed with panel-specific baits, manufactured by IDT, along with hybridisation buffer. The SureSeq baits developed by OGT are DNA oligonucleotides; however, RNA oligonucleotides can also be used. Next, the sample is incubated at 95°C for 5 minutes to denature the DNA, then left to incubate at 65°C for 4 hours. It is during these 4 hours that the baits hybridise to complementary sequences in the sample.
The baits are manufactured to be biotinylated, meaning that they have a biotin molecule bound to the oligonucleotide at one end. Biotin has a very high affinity for streptavidin, due to the complementarity of the binding site of the streptavidin and the biotin molecule.
Figure 10: Diagram to show how the biotin on the baits will bind to the streptavidin beads. The DNA hybridised to the bait ‘hitchhikes’ with the bait and so is also isolated from the non-target DNA (Perkel, 2009).
After incubation, 100ul of streptavidin-coated beads are added and left to incubate at 65°C for a further 45 minutes. Every 15 minutes, the samples are mixed by vortex for 3 seconds to ensure the streptavidin beads remain well distributed. During the 45-minute incubation, DNA bound to the biotinylated baits binds to the streptavidin beads, separating it from the non-target DNA. Next, a total of 6 washes are done to remove any non-hybridised DNA, followed by elution, as follows:
- Pipette mix with wash buffer 1, remove immediately
- Incubate with stringent wash buffer for 5 minutes at 65°C
- Repeat the stringent wash (5 minutes at 65°C)
- Mix with wash buffer 1 for 2 minutes at 35/40°C
- Mix with wash buffer 2 for 1 minute at 35/40°C
- Mix with wash buffer 3 for 30 seconds at 35/40°C
- Elute in 30ul of water
- Post hybridisation
A post-capture PCR is performed to amplify the DNA so that there is enough to be sequenced. This PCR step amplifies the ssDNA bound to the beads; the oligonucleotide baits remain on the beads, which are removed during purification. During the post-capture PCR, a 3’ 8-base index is added, with a different index for each sample acting as a unique ‘barcode’. The barcode allows multiple samples to be run in the same sequencing lane.
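The role of the index can be illustrated with a toy demultiplexing step, in which reads from a shared lane are assigned back to samples by their 8-base index. This is only a sketch of the idea: the index sequences, sample names and reads below are made up, and sequencer software performs this step with error tolerance that is not modelled here.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by sample; unknown indexes are kept aside."""
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical 8-base indexes and reads
index_to_sample = {"ATCACGTT": "sample_1", "CGATGTAA": "sample_2"}
reads = [
    ("ATCACGTT", "GGCTA"),
    ("CGATGTAA", "TTAGC"),
    ("ATCACGTT", "CCATG"),
    ("NNNNNNNN", "ACGTA"),
]
print(demultiplex(reads, index_to_sample))
```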
The post-PCR DNA is cleaned up as before and eluted in 32ul. At this stage, a dsDNA HS Qubit assay and a D1000 HS ScreenTape are used, independent of the starting input, as the DNA concentrations are too low for the standard assays. These readings are used to calculate the concentrations based on the equation below, to ensure an equimolar amount of each sample can be added into the sequencing pool. A spreadsheet can be used to help; an example of the layout is shown in Figure 11.
This spreadsheet has columns for the DNA size (bp) and Qubit concentration (ng/ul). Parameters that adjust the volume of each sample can be changed, such as the final volume of the pool (ul) and the final concentration of the sequencing pool (nM). These values are used to calculate the volumes needed to make a 4 nM pool, as shown in Figure 12. The DNA needs to be denatured prior to sequencing and this is done by using 0.5nM of NaOH. 4nM is too concentrated to load onto the MiSeq and so a less concentrated pool must be made, using the buffer provided by Illumina. Typically, an 8-12pM pool is created through a series of dilutions and used for sequencing.
Figure 11: Layout of the Excel spreadsheet used to determine how much of each sample to add to the pool to be sequenced on the MiSeq. This ensures each sample receives an equal share of the lane and so is sequenced equally. This spreadsheet shows 24 samples will be added to the pool. There are columns for the index number used (and its corresponding 8-base sequence), DNA size (bp) and Qubit concentration (ng/µl). This spreadsheet has been set to a pool volume of 250µl, with a concentration of 4nM.
Figure 12: This is the next page of the same spreadsheet shown in Figure 11, showing how much of each sample to add to the pool. Those with a lower molar concentration will have a higher volume. Water is added to make the pool up to 250µl.
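The pooling calculation can be sketched as follows. This is not the OGT spreadsheet itself: it assumes the commonly used conversion of about 660 g/mol per base pair of double-stranded DNA, and that every sample contributes the same number of moles to the pool. The function names and the two example libraries are hypothetical.

```python
def molarity_nM(conc_ng_per_ul, mean_size_bp):
    """Convert a dsDNA concentration to nM, assuming ~660 g/mol per base pair."""
    return conc_ng_per_ul * 1e6 / (660 * mean_size_bp)


def pooling_volumes(samples, pool_nM=4.0, pool_ul=250.0):
    """samples maps name -> (Qubit ng/ul, TapeStation size in bp).

    Returns ({name: ul of library to add}, ul of water) for an equimolar pool.
    """
    fmol_each = pool_nM * pool_ul / len(samples)  # nM x ul = fmol
    volumes = {
        name: fmol_each / molarity_nM(conc, size)
        for name, (conc, size) in samples.items()
    }
    water_ul = pool_ul - sum(volumes.values())
    if water_ul < 0:
        raise ValueError("libraries are too dilute for the requested pool")
    return volumes, water_ul


# Two hypothetical libraries of the same size but different concentration
volumes, water = pooling_volumes({"S1": (2.0, 350), "S2": (1.0, 350)})
for name, ul in volumes.items():
    print(name, round(ul, 2), "ul")
print("water", round(water, 2), "ul")
```

The library with half the concentration needs twice the volume, matching the behaviour described for Figure 12. Diluting the resulting 4 nM pool to a 10 pM loading concentration is then a 400-fold dilution.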
At OGT, the SureSeq range has been developed to run on the Illumina MiSeq platform, which is widely used for targeted gene sequencing. Illumina technology requires the library to be loaded onto a solid phase, the flow cell, where each fragment is amplified into distinct clusters through bridge amplification. Subsequently, sequencing by synthesis (SBS) is used to incorporate fluorescently labelled deoxyribonucleotide triphosphates (dNTPs) into a DNA template strand during sequential cycles of DNA synthesis. At each cycle, the bases incorporated are fluorescently tagged and the nucleotides are identified by means of fluorophore excitation. The adaptor sequences that were added to the DNA fragment during the ligation step act as priming sites from which the polymerase can begin synthesis. During the second round of PCR after hybridisation, two different primers are added: P5 and P7. This enables the library to bind to the flow cell in the sequencer.
Figure 13: The process of sequencing by synthesis technology used by Illumina platforms. This shows how the adaptors work as primers for sequencing and how each fragment is amplified via bridge amplification to produce clusters. Each type of nucleotide is fluorescently tagged and will fluoresce when incorporated into the strand; this is imaged by the machine (Illumina, 2015).
My Work: Protocol Development
- to develop the lung cancer and glioma panels to be used together with the OGT cancer panel protocol
- to create a set of baits that is not gene or panel specific, but is able to work with any combination of baits, whilst consistently and accurately detecting CNVs in any given sample, via read depth.
Although there were two general aims, the wet lab methods were kept the same in both projects; the only difference was the baits/panel used during the target capture stage.
Figure 14: General workflow of the projects done during my placement.
Three experiments were involved in the development of the lung cancer and glioma panels. The first 2 experiments involved making bulk libraries of control samples using good quality genomic DNA, whereas the final experiment was carried out using good quality HapMap DNA and FFPE DNA. All the post-PCR products were pooled together into a large library to eliminate any variation in library preparation. The FFPE DNA was included to ensure the panel is compatible with the protocol on any quality of DNA, as the two diseases involved are solid cancers. For each experiment, the baits tested were always pooled with baits for BRCA1 and BRCA2.
Five experiments were involved in the development of OGT’s protocol to detect CNVs in samples. Different types of DNA were used ranging from controls, samples with no known CNVs, and samples with known CNVs. For the first two experiments, each used an independent library to allow direct comparison within an experiment. Each experiment involved different input of baits during the hybridisation step, with the first 2 experiments involving the myeloid panel as a background as well.
After each hybridisation for each experiment, the samples were sequenced and the analysis report produced was used to compare and observe the performance of each bait.
Results and conclusions
For these two projects, different metrics were examined after sequencing. In the case of developing the lung and glioma cancer panels, two key metrics were used: uniformity of coverage and minimum coverage of targets. These two metrics are produced by the SureSeq Interpret report and are for internal use. To be able to confidently sign a gene off, both of the metrics outlined below must be met.
As the CNV project was an exploratory task there were no defined metrics; different facets of the results were looked at after each experiment to determine the objective and design of the next experiment.
Uniformity: Refers to the evenness of the coverage depth across the target region. Coverage depth is the number of times each base has been sequenced. The uniformity is determined by first calculating the mean coverage depth from the raw coverage depth at each position. Then, a lower boundary for the coverage depth, relative to the mean depth, is set across the target region. The percentage of the target region covered to a depth equal to or greater than the lower boundary is calculated in increments of 10% of the mean target coverage. At OGT, the lower boundary is set at 20% of the mean target coverage for coding bases. In order to pass the uniformity metric, 100% of all coding bases have to have at least 20% of the mean target coverage.
Figure 16: Example of a table showing the uniformity for each sample and the percentage of bases that reach each fraction of the mean target coverage. The sample is said to pass the uniformity metric when it has >99% of its bases with >20% of the mean target coverage.
Coverage per base: Refers to the coverage of each bait respective to the target region. At OGT, the requirement is that at least 99.9% of all target regions ± 20 bases must have at least 500X coverage per base. In the case of FFPE DNA, at least 500X coverage is required for at least 98.5% of all target regions ± 20 bases.
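The two sign-off metrics described above can be sketched in a few lines of Python. This is only an illustration of the calculation as described in the text, not the SureSeq Interpret implementation; the function names and the toy depth values are assumptions made for the example, and the thresholds are the ones quoted above (20% of the mean depth, 500X, 99.9% of bases).

```python
from typing import Dict, List


def uniformity_table(depths: List[int]) -> Dict[int, float]:
    """Percentage of bases covered to at least 10%, 20%, ... 90% of the mean depth."""
    mean_depth = sum(depths) / len(depths)
    table = {}
    for pct in range(10, 100, 10):
        boundary = mean_depth * pct / 100  # lower boundary relative to the mean
        covered = sum(1 for d in depths if d >= boundary)
        table[pct] = 100 * covered / len(depths)
    return table


def percent_at_min_depth(depths: List[int], min_depth: int = 500) -> float:
    """Percentage of bases sequenced to at least min_depth."""
    covered = sum(1 for d in depths if d >= min_depth)
    return 100 * covered / len(depths)


def passes_both_metrics(depths: List[int],
                        required_uniformity: float = 100.0,
                        required_min_depth: float = 99.9) -> bool:
    """Apply both thresholds quoted in the text (hypothetical helper)."""
    uniform_ok = uniformity_table(depths)[20] >= required_uniformity
    depth_ok = percent_at_min_depth(depths) >= required_min_depth
    return uniform_ok and depth_ok


# Toy per-base depths, invented for illustration: nine well-covered bases, one poor one.
toy_depths = [1000] * 9 + [100]
print(uniformity_table(toy_depths)[20])   # share of bases at >= 20% of the mean depth
print(percent_at_min_depth(toy_depths))   # share of bases at >= 500X
print(passes_both_metrics(toy_depths))
```

For FFPE DNA, the same helper could be called with `required_min_depth=98.5`, matching the relaxed requirement stated above.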
Lung and glioma panel testing
For this experiment, all the genes were tested with the myeloid panel as a background. The genes involved were a mixture of genes implicated in lung and glioma cancer. For each gene, each alternative type of bait (refer to appendix, page 24, for details on the types of baits) was pooled with the core baits and tested on genomic DNA. The mean target coverage for each combination of baits for each gene is shown in Figure 17. All the samples have over 1000X coverage, with the core baits having the highest coverage except in MET. The standard deviation for each bait type is small, except for the core baits, which demonstrates the reproducibility of the process.
Figure 17: The mean target coverage for each gene with respect to each type of bait. Each sample was hybridised with the myeloid panel background. The raw values for each type of bait are stated in the table. Standard deviation for each bait has been included.
To determine which bait combination provided the best coverage across the whole target, the Integrative Genomics Viewer (IGV) software was used to provide a visual representation of the coverage over the target. This can be seen as the mountain-like shape shown in Figure 18. In IGV, the coverage is shown as coverage tracks, one for each combination of baits the sample was tested with, displayed under the bait track for that sample. Due to the large number of targets involved in each gene for both panels, a general example is shown in Figure 18; the same process was applied to all the targets in other genes. Figure 19 shows a screenshot of the coverage track of a target in MET in IGV. At first glance at the coverage profile, both pools with low_RM and alternative baits have better coverage. To properly determine which baits performed better, the difference in coverage from the centre of the target to the ends of the target was calculated and expressed as a percentage. The table in Figure 19 shows the percentage difference for each bait type for the target under study. It shows that the pool with low_RM and core baits has the lowest drop in coverage, expressed as a percentage difference, and is therefore the preferred combination of baits. From this experiment, the preferred combination of baits with core baits was selected for each gene to create a bait set for the second round of testing.
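The centre-to-end comparison can be illustrated with a short Python sketch. The report does not give the exact formula, so the version here is an assumption: the drop is taken as the difference between the depth at the centre of the target and the lower of the two end depths, expressed as a percentage of the centre depth. The pool names and coverage profiles are invented toy values, not measured data.

```python
from typing import Dict, List


def edge_drop_percent(profile: List[float]) -> float:
    """Percentage drop in depth from the centre of a target to its weaker end.

    Assumed definition: 100 * (centre - min(first, last)) / centre.
    """
    centre = profile[len(profile) // 2]
    weakest_end = min(profile[0], profile[-1])
    return 100 * (centre - weakest_end) / centre


def preferred_pool(profiles: Dict[str, List[float]]) -> str:
    """Return the bait pool with the smallest centre-to-end drop."""
    return min(profiles, key=lambda name: edge_drop_percent(profiles[name]))


# Hypothetical per-position depths across one target for two bait pools.
toy_profiles = {
    "core only": [400, 800, 1000, 900, 500],
    "core + low_RM": [700, 900, 1000, 950, 750],
}
for name, profile in toy_profiles.items():
    print(name, edge_drop_percent(profile))
print(preferred_pool(toy_profiles))
```

A smaller percentage means a flatter "mountain", which is the property used above to choose between bait combinations.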
Figure 18: The coverage of each target as an overall coverage of baits specific to the target. It also shows the coverage per type of bait as indicated by the different colours. The table below shows the percentage difference of coverage from the centre of the target and the ends. This is shown for each type of bait.
|MET:ENSE00000717730, MET:ENSE00000717732||Stock||Alt Stock||Low_RM||Iso_TM|
Figure 19: The coverage profile over two exons in MET. Panel 1 indicates the target region, panel 2 the core baits, panels 3 &4 (green) the coverage obtained when using core baits only [complete for rest of figure] The table below shows the percentage difference of coverage from the centre of the target and the ends. This is shown for each type of bait.
|FileID||10.00%||20.00%||30.00%||40.00%||50.00%||60.00%||70.00%||80.00%||90.00%|
|TERT promoter sample 1||100.00%||100.00%||99.89%||99.51%||98.98%||97.65%||94.69%||88.01%||77.59%|
|TERT promoter sample 2||100.00%||100.00%||99.96%||99.66%||99.05%||98.02%||95.08%||88.55%||76.79%|
|RPS15 Sample 1||100.00%||100.00%||100.00%||100.00%||100.00%||100.00%||94.42%||81.16%||49.81%|
|RPS15 Sample 2||100.00%||100.00%||100.00%||100.00%||100.00%||100.00%||92.94%||74.47%||57.13%|
In the case of both RPS15 and the TERT promoter, the core baits tested passed the metrics needed to successfully sign off the gene. As shown in tables x and x, all target bases in all the samples tested achieved more than 20% of the mean target coverage. Thus, the bait sets for the TERT promoter and RPS15 achieved the required metrics and produced reproducible results. In addition, all the target regions, inclusive of flanking regions of ±20 bases, involved in each gene have a coverage of at least 500X.
Figure 20: Uniformity table of each sample tested with baits for TERT-promoter gene. It shows that all the bases have at least 50% of the mean target coverage
Experiment 2: Testing of combination of baits per gene
In this experiment, only two pools of baits were tested for each gene: the core baits for that gene only, and a mixture of core baits with the best performing type of bait as decided from the results of experiment 1. As in experiment 1, these baits were pooled with the myeloid panel to act as a background, as the coverage would be far too low if the baits were tested on their own. This was done for each gene, and each pool was tested separately with normal DNA. Due to the small number of baits involved in both RET and MET, these were pooled and tested together. The coverage for each pool of baits was over 1000X for each gene tested, as shown in figure 22. This highlights the reproducibility of the performance of the baits for each gene, as a similar trend of coverage was seen in experiment 1.
Figure 21: Shows the mean target coverage for samples with either core baits or core baits with baits selected based on results from the previous experiment for each gene. Data shown is an average of 4 replicates. The average mean target coverage is stated in the table. Standard deviation has been included.
For each gene under development, a coverage plot of the average of the replicates involved was generated to visualise the coverage for each base (Figure 23). In addition, the uniformity of coverage and the minimum depth of coverage were also calculated. The coverage plot illustrates the targets where an extra bait has improved the coverage and helps to highlight any problematic targets that might be difficult to cover comfortably. Using ROS1 as an example, the coverage plot in Figure 23 shows the coverage per bait, bearing in mind that several baits can be designed to cover a target, with the core baits having a higher coverage. Figure 23 also highlights how the coverage was improved when using the pool of core baits and the best baits from the previous experiment compared to using core baits alone, as circled in red. The uniformity table for ROS1 is shown in Figure 24; it indicates that over 99% of target bases of ROS1 have over 20% of the mean target coverage. In fact, it shows that over 99% of the target bases have over 50% of the mean target coverage, as highlighted in green. Table 2b also shows that over 99% of the target bases have at least 500X coverage per base. Thus, both of the metrics were achieved for ROS1.
The coverage plots and uniformity tables for MET and RET can be found in the appendix. The two metrics are not shown for ATRX as there was a pipetting error that caused contamination in the bait pools. Thus, the results from this experiment for ATRX could not be used to sign off the gene.
This experiment shows that the pool of the core and combination baits for each gene from experiment 1 worked well and passed both metrics. The next step is therefore to test the reproducibility of these baits when tested with different types of DNA. The exact same baits are pooled again, with all variables kept the same except for the type of DNA and the amount of DNA used at the start of the library preparation. For HapMap DNA libraries, 1000 ng of DNA is used and for FFPE DNA libraries, 500 ng of DNA is used.
|FileID||Average Coverage||10.00%||20.00%||30.00%||40.00%||50.00%||60.00%||70.00%||80.00%||90.00%||Bases with >500X|
|ROS1 Sample 1||944.61||100.00%||100.00%||100.00%||99.92%||99.49%||96.88%||89.65%||79.19%||64.54%||99.40%|
|ROS1 Sample 2||1179.1||100.00%||100.00%||100.00%||99.92%||99.45%||96.54%||89.70%||78.43%||65.51%||99.75%|
|ROS1 Sample 3||1324.2||100.00%||100.00%||99.97%||99.83%||99.30%||95.61%||90.06%||79.91%||66.81%||99.81%|
Figure 22: The mean target coverage in ROS1 for samples with either core baits only or core baits with selected baits from experiment 1 for ROS1. The ROS1 baits were tested on a background of myeloid (coverage data not shown).
Figure 23: The uniformity of each sample tested with baits for ROS1 gene. It shows that all the bases have at least 50% of the mean target coverage
Lung and glioma panel testing
Experiment 3: Testing finalised pool of baits on different types of DNA
In experiment 3, the pool containing the best combination of baits from the previous experiment was tested on libraries made using HapMap DNA and FFPE DNA. This helps to highlight any areas of the targets that are problematic before the final signing off of the baits for each gene. This time, all baits for all the genes were pooled together without any myeloid panel as a background. Once again, coverage plots and uniformity tables were made for each gene tested. Using ROS1 as an example again, the coverage plot in Figure 25 represents the average coverage per bait for the targets in ROS1 for all replicates for each type of DNA. In the FFPE DNA samples, the coverage fluctuates more from bait to bait than in the HapMap DNA samples, as clearly seen in Figure 25. This is because the FFPE DNA fragments after library preparation are smaller than those of other genomic DNA, and thus the coverage is not as even across a target. However, the trend in the coverage profile is the same for both types of DNA, reflecting the reproducibility of the baits. The uniformity of coverage and the minimum depth of coverage have been met for each sample, as shown clearly in Figure 26. All samples have at least 20% of the mean target coverage and at least 500X coverage.
The coverage plots and uniformity tables for MET, RET, ATRX, ROS1, TERT-promoter and RPS15 can be found in the appendix. In the case of ATRX, there was no bait contamination this time and the gene successfully passed both metrics. For all the genes, the thresholds for both the uniformity of coverage and the minimum depth of coverage were achieved, and thus all the genes can be successfully signed off.
Figure 24: The mean target coverage in ROS1 for samples with core and selected baits from experiment 2 for ROS1 tested on FFPE and HapMap DNA. No myeloid background panel was used.
Figure 25: The uniformity of each of the FFPE and HapMap samples tested with baits for the ROS1, TERT-promoter, MET, RET and RPS15 genes. It shows that all the bases have at least 50% of the mean target coverage
|FileID||Average Coverage||10.00%||20.00%||30.00%||40.00%||50.00%||60.00%||70.00%||80.00%||90.00%||Bases with >500X|
|FFPE sample 1||6154.132582||100.00%||99.66%||97.97%||95.82%||92.76%||87.42%||78.73%||69.78%||60.68%||100.00%|
|FFPE sample 2||6726.206897||100.00%||99.60%||97.99%||95.70%||92.58%||87.56%||78.66%||69.39%||59.84%||100.00%|
|FFPE sample 3||6507.275748||100.00%||99.62%||98.19%||95.80%||92.80%||87.85%||78.35%||69.49%||60.21%||100.00%|
|HapMap sample 1||4346.053374||100.00%||99.94%||99.80%||99.59%||98.48%||95.48%||88.60%||77.27%||64.75%||100.00%|
|HapMap sample 2||3036.407306||100.00%||99.97%||99.83%||99.64%||98.43%||95.17%||87.55%||76.72%||64.75%||100.00%|
|HapMap sample 3||3813.71253||100.00%||99.97%||99.83%||99.66%||98.57%||95.10%||87.80%||77.40%||65.07%||100.00%|
|HapMap sample 4||3201.79151||99.98%||99.83%||99.64%||99.21%||98.05%||94.83%||88.43%||77.97%||65.47%||99.86%|
Copy Number Variation detection
Experiment 1: Initial testing of each normalising bait
All 54 of the normalising baits (details about the baits can be found on page 21) were pooled together and combined with two separate panels: the BRCA1 and BRCA2 panel and the chronic lymphocytic leukaemia (CLL) panel. This was to ensure that the performance of the baits is not panel specific. The mean target coverage and the percentage of on-target reads for each sample tested with either the BRCA1 and BRCA2 panel alone or the panel with normalising baits is shown in Figure 27. It highlights that the normalising baits do not have any negative effect on the coverage, as shown by the coverage and on-target percentage of the samples with and without the normalising baits.
- BRCA1 and BRCA2 with normalising baits
- BRCA1 and BRCA2 only
Figure 26: Shows the mean target coverage in ROS1 for samples with either core baits only or core baits with selected baits from experiment 1 for ROS1. The ROS1 baits were tested on a background of myeloid (coverage data not shown). The data for CLL was not shown.
A coverage plot of the mean target coverage per bait is used to provide a visual presentation of the trend in coverage, averaged over the samples with all the normalising baits with either of the two panels, shown in Figure 28. It shows that the trend in coverage is very similar regardless of the panel the sample is tested with, providing good evidence of the reproducibility of the baits. Figure 4b also shows that the average coverage of all the samples for both panels is more than 1000X, which supports the evidence that the normalising baits do not have an effect on the coverage.
Figure 27: The coverage plot for each of the 54 normalising baits tested with BRCA1 and BRCA2 or with CLL baits.
These results only provide a very general overview of the performance of the baits. Thus, the coverage of each of the 54 baits needed to be examined closely. The Integrative Genomics Viewer (IGV) was used to look at the spread of the coverage, as shown by the ‘mountain-like’ coverage spread that was seen in previous experiments. Whilst observing the spread of the coverage, several factors were considered: the evenness of the spread of coverage, any intronic deletion at the ends of the targets, and any difference in spread between samples sheared with Covaris and samples fragmented by enzymes. Using these guidelines, the best performing baits across both panels were chosen.
Figure 28: Screenshot of the coverage profile of the samples tested with bait Chr6_2 with BRCA1 and BRCA2 or with CLL baits. Both sample 1 and 2 are DNA that was Covaris-sheared and sample 3 and 4 have been fragmentased. Circled in red is an example of an intronic deletion.
A total of 36 baits were selected from the original 54 baits and taken forward for further testing. To reduce the number of baits used in each experiment, the 36 baits were split into three different sets, each with 20 baits. Each set contained as many different baits of the 36 as possible, to keep the sets as different from each other as possible. Each set was pooled together with baits for BRCA1 and BRCA2.
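One way to build three sets of 20 baits from 36 with as little overlap as possible is sketched below. The report does not describe how the sets were actually composed, so this is purely a hypothetical construction: the baits are treated as a circular list and each set takes 20 consecutive baits, starting 12 positions apart. The bait names are placeholders.

```python
from itertools import combinations
from typing import List


def split_into_sets(baits: List[str], n_sets: int = 3, set_size: int = 20) -> List[List[str]]:
    """Take set_size consecutive baits from a circular list, with evenly spaced starts."""
    n = len(baits)
    step = n // n_sets  # 12 positions between set starts for 36 baits and 3 sets
    return [
        [baits[(k * step + i) % n] for i in range(set_size)]
        for k in range(n_sets)
    ]


# Placeholder names for the 36 selected normalising baits.
baits = [f"bait_{i + 1}" for i in range(36)]
sets = split_into_sets(baits)

for a, b in combinations(range(3), 2):
    shared = len(set(sets[a]) & set(sets[b]))
    print(f"set {a + 1} and set {b + 1} share {shared} baits")
```

With 60 places to fill from 36 baits, some sharing between sets is unavoidable; this layout spreads it evenly across the three pairs of sets while still using every bait at least once.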
Copy Number Variation detection
Experiment 2: Testing of each set of 20 baits
The setup of this experiment was the same as before, and thus all the variables for each set tested were the same. The main objective was to select which of the three sets produces reproducible results. To do this, coverage plots of the sets were used to compare and view the trend in coverage. Figures 30-32 show the coverage plot for each respective set, with the coverage value for each bait in that set. Looking closely at figure 30, the trend in coverage for the baits is very similar across both samples, and the coverage of each bait is very similar. On the other hand, the coverage for each bait in set 2 fluctuates more across both samples, as shown clearly in Figure 31. Set 3 (Figure 32) showed very good evenness in coverage across all baits in both samples, and the level of coverage of the same bait in both samples is similar and comparable.
Figure 29: Coverage plot of the 20 baits in set 1. The mean target coverage achieved was above 1000X.
Figure 30: Coverage plot of the 20 baits in set 2. The mean target coverage was within a tighter range, but the coverage were comparable between both samples.
Figure 31: Coverage plot of the 20 baits in set 3. The mean target coverage achieved was below 1000X, but the range was much tighter.
To conclude, set 1 was chosen as the best set, as it had less variability in coverage and also provided the highest coverage. This is ideal, as off-target sequences can also lead to low coverage; thus, the higher the coverage, the higher the probability of the sample containing only sequenced fragments of interest. Set 3 was chosen as well, as it appeared to be a good competitor. These two sets were used in the next experiment and tested on samples with known CNVs in BRCA1 and BRCA2. In addition, an alternate set was created which contained a mixture of the baits in set 1 and set 3, with the aim of designing the best performing set.
Copy Number Variation detection
Experiment 3: Testing of set of baits on samples with known CNVs
Due to a pipetting error, set 3 and the alternate set could not be used to detect CNVs in any of the samples, because some of the baits had close to zero coverage, indicating the absence of those baits. Consequently, only set 1 generated sufficient data to accurately call CNVs.
To accurately call CNVs, the raw data had to be normalised to the coverage of the normalising bait. The mean target coverage had to be manipulated to provide more intricate data compared to the earlier experiments. To do this, the following series of calculations were carried out:
1. The average of the raw coverage of each bait was tabulated and the average coverage for each sample calculated. The standard deviation for each sample was also calculated.
2. The ratio of the sample’s average coverage to the average coverage of all the baits across all samples was calculated for each individual sample.
3. The deduplicated target reads across different exons of BRCA1 and BRCA2 were split into 60 bins (i.e. 60 bps) to increase sensitivity and to allow easier calculation.
4. These target reads were normalised by dividing the raw reads by the ratio calculated in step 2.
5. A parameter of ±1 standard deviation for each sample was defined. This highlights which of the 60 bins are ‘normal’, as these fall within the parameter; those above or below it were classified as not normal.
6. The normalised counts from step 4 were then divided by the average of all reads of the 60 bins that were classified as normal in step 5 to obtain the final ratio of reads.
7. The thresholds for a deletion and a duplication had previously been set to Log2(1.5/2) for a deletion and Log2(2.5/2) for a duplication.
8. Thus, a deletion is shown when the final ratio is below the deletion threshold set in step 7, and a duplication when the ratio is higher than the duplication threshold.
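The series of calculations above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the pipeline that was actually used: the inputs are assumed to be already binned, deduplicated read counts for one sample; the "normal" bins are taken to be those within ±1 (population) standard deviation of that sample's mean normalised count; and the final ratio is compared with the thresholds on a log2 scale. The function name and all numbers are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List

DELETION_LOG2 = math.log2(1.5 / 2)     # about -0.415
DUPLICATION_LOG2 = math.log2(2.5 / 2)  # about  0.322


def call_cnvs(bin_reads: List[float],
              norm_bait_coverage: List[float],
              cohort_norm_mean: float) -> List[str]:
    """Label each bin of one sample as 'deletion', 'duplication' or 'normal'."""
    # Ratio of this sample's normalising-bait coverage to the all-sample average.
    ratio = mean(norm_bait_coverage) / cohort_norm_mean
    # Normalise the raw binned reads by that ratio.
    normalised = [reads / ratio for reads in bin_reads]
    # Bins within +/- 1 SD of the sample mean are treated as 'normal' (assumption).
    centre, spread = mean(normalised), pstdev(normalised)
    normal_bins = [x for x in normalised if abs(x - centre) <= spread]
    # Final ratio: each bin relative to the average of the normal bins.
    baseline = mean(normal_bins)
    calls = []
    for x in normalised:
        log_ratio = math.log2(x / baseline) if x > 0 else float("-inf")
        if log_ratio < DELETION_LOG2:
            calls.append("deletion")
        elif log_ratio > DUPLICATION_LOG2:
            calls.append("duplication")
        else:
            calls.append("normal")
    return calls


# Toy sample: ten bins, two of which have half the expected reads.
toy_reads = [200, 200, 200, 100, 100, 200, 200, 200, 200, 200]
print(call_cnvs(toy_reads, norm_bait_coverage=[2000, 2200, 1800], cohort_norm_mean=1000))
```

In this toy case the two half-depth bins are the only ones that fall below the deletion threshold, which is the behaviour expected of a heterozygous deletion (one copy instead of two).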
Figure 33 shows the visual representation of the ratios of each of the 60 bins of each sample across all the exons of BRCA1 and BRCA2. Figures 33 and 34 highlight any sample that falls above or below the threshold, making it possible to identify any duplication(s) or deletion(s). Figure 34 shows the same data as Figure 33, but in terms of exons instead of bins, which provides a lower resolution.
Figure 32: Shows the normalised coverage of each bin across both BRCA1 and BRCA2. The thresholds for duplication and deletion are shown, and bins that exceed or fall below these thresholds are flagged as CNVs.
Figure 33: Shows the normalised coverage of each exon across both BRCA1 and BRCA2. The thresholds for duplication and deletion are shown, and areas on the exons that exceed or fall below these thresholds are flagged as CNVs.
Figure 34 summarises which samples have an exon deletion or duplication, as determined by using Figure 32 and Figure 34 together with the normalised reads. Figure 33 shows a lower resolution of the normalised coverage relative to the threshold. In conclusion, set 1 was able to identify the right CNVs in the samples tested and thus can be used for further experiments.
|Sample||CNV detection using set 1||Known CNVs from customer sample|
|1||Deletion exon 1-3 BRCA1||BRCA 1 Del exon 1-3|
|6||Deletion Exon 2-24 BRCA1||BRCA 1 Del Exon 2-24|
|9||Duplication Exon 20 BRCA1||BRCA1 Duplication Exon 20|
|14||Deletion Exon 14-16 BRCA2||BRCA2 Del exons 14-16|
|16||Deletion 21-23 BRCA1||BRCA1 Del exons 21-23|
|Control||no aberrations found|
For the lung and glioma project, the designs of the baits for ROS1, MET, RET, RPS15, ATRX and TERT-promoter are finalised, and this will be the same product profile given to customers who request to buy baits for these genes. These experiments show that the bait compositions I analysed and chose were sufficient to pass both metrics, and that there were no coding exons that were difficult to target.
With respect to the CNV project, the results generated from all the experiments collectively suggest that NGS, with respect to the OGT SureSeq protocol, can be used to detect CNVs in samples. However, my experiments only tested BRCA1 and BRCA2, and more experiments should be carried out on other genes of interest. This is to test and monitor the reproducibility of the normalising baits when they are tested with other genes. Referring back to the second experiment (refer to page 40), in which three different bait sets were tested, my method of selecting the best set was very subjective and was based on coverage and reproducibility alone. Further refining of the bait set composition should be done to ensure that a more reproducible and robust bait set is created. The experiments were also done on a very small scale, which was sufficient for the initial investigation that I have done, but personally
My time spent at OGT has been beneficial not only in improving my laboratory and scientific skills, but also in allowing me to experience working full time for a year and to develop important life skills. The skills I learnt simply from applying to placements and attending interviews are skills that are vital in my future working life. During the placement, I presented my results in various team meetings and also at Journal Club, where I led discussions on various articles on topics that interest me. These have increased my scientific knowledge and also my confidence in my presenting skills. These skills will be useful during my final year, where I will have to present to the rest of the class.
I have learnt a variety of general and molecular biology skills that are transferable across different laboratories, such as using various machines, understanding how techniques work and performing quality checks on DNA samples. I have also learnt about and assisted in RNA extractions and the processing of microarrays. The time that I have spent in the laboratory this year has greatly increased my confidence in solo scientific laboratory work. I now feel comfortable with this line of work and feel that, if asked to, I could easily follow a new protocol. Furthermore, I have gained a vast amount of knowledge specific to genomics and next-generation sequencing, such as how to use an Illumina MiSeq and how to carry out data analysis using the raw sequence data. I would not have learnt any of this at university due to the modules I have selected.
Before this year, I struggled with time management and maintaining laboratory records. Having been given a laboratory notebook, I have disciplined myself to keep a record of all my work and to write it up in detail to allow others to understand and build upon my experiments. This has improved my record keeping skills, which will be very helpful during my final year project. I believe that the laboratory and transferable skills gained from this placement will be vital and useful as I intend to pursue a PhD in Neuroscience. I also believe that the skills gained in genomics will allow me to integrate the knowledge and skills that I will learn in the future.
I would like to thank everyone at OGT for their help and patience over the past year, especially the Molecular Biology team, who have blessed me with so much experience. Special thanks to Jackie Chan for helping me with everything and trusting me on my own in the laboratory.
Word count: 8’976 (excluding citations, figure legends, tables and references)