0115 966 7955 Today's Opening Times 10:00 - 20:00 (BST)

Report on RNA-seq Transcriptomics

Published: Last Edited:

Disclaimer: This essay has been submitted by a student. This is not an example of the work written by our professional essay writers. You can view samples of our professional work here.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.

  • Giannakis Nikolaos


  • Transcriptomics

Transcriptomics is the field of biology that studies the RNA transcripts in a large scale manner. Transcriptome is the whole set of RNAs transcribed by the genome from a specific tissue or cell type at a developmental stage and/or under a certain physiological condition (Willingham and Ginger and Gingeras, 2006; Jacquier, 2009). RNAs are either coding or non-coding, which means that some RNAs are protein-coding, while other types of RNAs are not. More specifically, mRNA is translated into proteins, while non-coding RNAs can be categorized either as housekeeping (tRNAs, rRNAs, snRNAs, snoRNAs and gRNAs) or as regulatory. Housekeeping RNAs act as catalytic and structural elements. Regulatory RNAs can be short in length (miRNAs, piRNAs, siRNAs) or long (lncRNAs) and act as regulatory elements during the gene expression.

There are many ways to discover the transcriptional response of the genome in different tissues and physiological or environmental conditions. Expressed sequence tag based method (EST, SAGE), hybridization based gene microarray or chip technology, and NGS based RNA-sequencing (RNA-seq) technology were developed to scan the transcriptome quickly and obtain differentially expressed genes (Nagalakshmi et al., 2008; Wilhelm et al., 2008; Mortazavi et al., 2008; Schena et al., 1995; Velculescu et al., 1995; Lashkari et al., 1997).

  • RNA-seq

The most effective method in analysing RNA both qualitatively and quantitatively is next generation sequencing. To be more specific, in Next Generation Sequencing c DNA gets sequenced in a massive, parallel way. The prototype method, MPSS, applies four rounds of restriction enzyme digestion and ligation reactions to determine the nucleotide sequence of c DNA ends, generating a 17-20 bp sequence as the fingerprint of a corresponding RNA (Brenner et al., 2000). It is possible using MPSS to produce 100000 or more sequences.

After MPSS, other companies such as Illumina, Roche and Lifescientific created improved methods to sequence c DNA clusters as far the reading length and sequencing accuracy concerned (Mardis, 2008).

With all these technologies RNA-seq became the most convenient and cheapest method to analyse the transcriptome (Figure 1). More specifically, the whole RNA transcripts or parts of them are purified and then reverse transcribed into c DNAs. After that, these c DNAs undergo massive parallel sequencing, As a result billions sequence tags are analysed in a qualitative and quantitative manner. A new type of RNA-seq was proposed by Tang et al. (2009). Tang’s team analysed the transcriptome of a single cell using reverse transcription to create c DNA , PCR to amplify it with the primers annealing to the anchoring sequences introduced during the generation of the first and the second c DNA strand. Combined with SMART-seq, STRT (Tang’s method) used extra cytocines to amplify the c DNAs from a single cell (Ramskoll et al., 2012). Moreover, CEL-seq (cell expression by linear amplification) put the T7 promoter to the 5’ end of the first strand of the c DNA and amplified it in a less biased way.

In order to characterize small non-coding RNA, small RNA-SEQ or miRNA microarrays are used. 3P-seq, PAS-seq and 3’RACE are used to detect alternative polyadenylation. Argonaute CLIP-seq detects the Argonaute associated RNA to predict miRNA targets, while BRIC-seq is used to measure half-life of RNA transcripts.

Figure 1: RNA-SEQ EXPERIMENT : Conversion of long RNAs to cDNA and fragmentation. Addition of adaptors to cDNA fragments. Alignment and classification of reads to reference genome or transcriptome. Generation of expression profile (Wang et al., 2009)

Data Normalisation

  • Why normalise data?

If we imagine that we want to compare data from two different samples, of which the first sample has two times as many sequences as the second sample, everything will appear to be upregulated, if unnormalized.

There are several normalisation techniques that can be applied to datasets. Firstly, it has to be said that a “flow cell” consists of eight different lanes, where libraries are placed in order to be sequenced. Fragmented c DNA molecules are ligated to adaptors and continue to the sequencing procedure in order for the expression levels to be found. The reads that are mapped express the “library size”. Furthermore, because “library size” is analogous to the variation between the lanes, the most common normalisation technique used, is based on scaling the raw read counts in each lane by a single lane-specific parameter that reflects the size (Dillies et al., 2012).

Based on the scaling factors, normalisation methods can be described as follows:

  • Total Count (TC), that divides gene counts to the number of reads, which were mapped, multiplied by the mean total count in the whole data.
  • Upper Quartile (UQ), that divides gene counts to the number of mapped reads and multiplied by the mean of the upper quartile of counts
  • Median (Med), that divides gene counts to the total number of mapped reads and multiplied by the the median counts different from 0.
  • DESeq, works based on the differential expression in genes and fits a generalized linear model (GLM) of the negative binomial (NB) family. Most of the genes are not differentially expressed and for its lane a scaling factor could be calculated based on the median of each gene’s read count to the geometric mean across the lanes.
  • Trimmed Mean of M-values (TMM), where a scaling factor is calculated for a reference lane and the other lanes act as test samples. In each test sample, the TMM scaling factor is the weighted mean of log ratios of the test sample to the reference lane sample, ecluding the most up-regulated in terms of expression genes and the genes with the biggest log ratio.

Normalized read counts can be calculated if raw read counts are divided by these normalisation scaling factors.

Moreover, based on the distribution adaptation of read counts, the quantile normalisation method matches the distribution of gene counts across lanes (Bolstad et al., 2003).

Finally, the RPKM method (Reads per Kilobase per Million Mapped Reads) implements between- and within- samples normalisation by re-calculating in ascending or descending order the gene counts to regulate differencies in “library size” or gene length (Mortazavi et al., 2008).


In this report, analysis of the differential expression between two groups of samples corresponding to two Drosophila melanogaster tissues was implemented. In order to do the analysis, the Tuxedo Suit software was used. HPC calculations were performed in two steps. Firstly, the index of Drosophila melanogaster genome and trascriptome was prepared. More specifically, using bowtie2, the index for the reference genome was generated and tophat2 was used to generate the transcriptome index. Secondly, tophat was run for both tubule and wholefly samples followed by running Cuffdiff for the analysis of differential expression between these two types of tissue. To be more specific, tophat2 was used for each of the raw data files (3 for tubule and 4 for wholefly) and cuffdiff was used for the comparison between these two samples, of which the tubule had three replicates and wholefly had four replicates.

Results - Discussion

  • Qualitative analysis of reads

In order to make a quality control of our sequence data we used the fastQC software, developed by the Babraham Institute. We used the raw data in form of fastq files, found in the following directory: /export/projects/polyomics/biostuds/data . After uploading the seven different files to the fastQC software, we delivered results shown as follows:

For example for tb1:

In the first image, simple composition statistics are shown such as the filename (tb1.s.13.fq), the total amount of sequences processed (10694228), the sequence length (76) and the percentage of CG content(49%).

In the second image, an overview of the range of quality values across all bases at each position in the file is shown. For each position in read a BoxWhisker plot is shown, where the central red lines correspond to the median value and the yellow box represents the inter-quartile range. In the y-axis of the graph the quality scores are shown. It is important to say that the higher the score, the better the base call. A green background is associated with very good quality scores, while orange and red backgrounds are related to reasonable quality and poor quality respectively.

In our example, the quality of the base calling seems to be quite high due to high quality score and green background.

In the third image, quality scores per sequence are shown. To be more specific, the per sequence quality score shows if a subset of the sequences has low quality values. Here, the the average quality per read is around 39(Phred Score), which is quite high.

Similarly, we operated the same procedure for all the other sequences. The results were of same quality apart from the wf3 file which contained data generated with solexa pipeline 1.8, which creates a different way of coding per-base quality. In this dataset the quality scores across all bases were poor and the average quality per read was around 29 in terms of Phred score.

  • Alignment statistics for all the samples

To interpret the statistics for the alignment of the samples, the align.txt files generated from the tophat software package were used. In the following table, the results are shown:

  • Differentially Expressed Genes Statistics

After the completion of the second job on HPC, the output created, contained two files that are related to the differential gene expression between the tubule and the wholefly samples. Once the gene_exp.diff file opened with Microsoft Excel, it was possible to filter and sort the data in order to identify the expression levels of tubule compared to the wholefly. Firstly, we filtered the genes based on the significance. No significant genes were excluded from further analysis. Furthermore, the value(1) column that corresponds to the ratio of tubule expressed genes to wholefly expressed genes, was used to exclude the tubule genes with 0 value. Secondly, based on the log2 fold change, we used an arbitrary cut-off threshold of 1.5. That means that we excluded the genes with log2 fold change between -1.5 and 1.5 (log<-1.5 & log2>1.5). After the completion of this step we looked at the q-values (corrected p-values) and all were down the threshold of 0.05. As a result 1322 tubule genes were up-regulated compared to the wholefly genes, while 1557 were down-regulated. We created two xls files, one containing the up-regulated genes and one the down-regulated.

  • Functional Analysis of Differentialy Expressed Genes

We used the two .xls files in order to submit a list of the significant genes to DAVID v6.7 to use functional annotation tools in order to understand the biological meaning behind these lists of genes.

By using the functional annotation clustering tool for the up-regulated genes 203 clusters of genes were created. Based on the enrichment score, that shows how important is a gene group in our gene list, was shown that for the cluster with the highest enrichment score (16.89), the following categories of genes were affected:

  • CHK - Subfamily of choline kinases
  • CHK kinase-like - Zinc finger C4 and HLH domain containing kinases domain subfamily of choline kinases
  • PIRSF014263 - Drosophila hypothetical protein EG_34F3.5
  • DUF227 – Protein of unknown function

By using the Functional Annotation tool, were shown the pathways, in which up-regulated genes are expressed, as shown in the following image:

By using the functional annotation clustering tool for the down-regulated genes 238 clusters of genes were created. Based on the enrichment score, that shows how important is a gene group in our gene list, was shown that for the cluster with the highest enrichment score (8.54), the following categories of genes were affected:

  • DNA metabolic process – metabolic process involving deoxyribonucleic acid
  • Response to DNA damage stimulus – indicates damage to DNA from environmental insults or errors during metabolism
  • DNA repair – process of restoring DNA after damage that include several pathways such as direct reversal, base excision repair, nucleotide excision repair, photoreactivation and double-strand break repair pathway
  • Cellular response to stress – process that states change in terms of movement, secretion and enzyme production)

By using the Functional Annotation tool, were shown the pathways, in which down-regulated genes are expressed, as shown in the following image:

Visual Alignment of tb1 sample

In order to visualise the alignment of the tb1 sample to the reference genome, we had to import import the fasta formatted reference genome and tophat2 generated alignment .bam file. Moreover, by using samtools we took the index file for the .bam file using the script described in appendix III.

By using the Integrative Genome Viewer (IGV) and the accepted_hits.bam and accepted_hits.sorted.bam.bai files for tb1 sample, became possible the visualisation as shown below:

Appendix I

Job1 Script

#! /bin/bash

#PBS -l cput=15:00:00

#PBS -N greco

#PBS -l nodes=1:ppn=4

#PBS -M [email protected]

#PBS -d /export/home/biostuds/2161696g/tophat/

#PBS -m abe

bowtie2-build -f /export/projects/polyomics/Genome/Dmelanogaster/dm3/genome/dm3.fa dm3

tophat2 -G /export/projects/polyomics/Genome/Dmelanogaster/dm3/dmel5.57/

dmel-all-no-analysis-r5.57clean2.gff --transcriptome-index=transcriptome_index/dm3_trans dm3

Appendix II

Job2 Script


#PBS -l cput=15:00:00

#PBS -N greco

#PBS -l nodes=1:ppn=4

#PBS -M [email protected]

#PBS -d /export/home/biostuds/2161696g/tophat/

#PBS -m abe

#PBS -l walltime=15:00:00

cd /export/home/biostuds/2161696g/tophat/

tophat2 -p 4 -o tb1 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

--solexa1.3-quals -T --transcriptome-index=transcriptome_index/dm3_trans dm3


tophat2 -p 4 -o tb2 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

--solexa1.3-quals -T --transcriptome-index=transcriptome_index/dm3_trans dm3


tophat2 -p 4 -o tb3 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

--solexa1.3-quals -T --transcriptome-index=transcriptome_index/dm3_trans dm3


tophat2 -p 4 -o wf1 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

--solexa1.3-quals -T --transcriptome-index=transcriptome_index/dm3_trans dm3


tophat2 -p 4 -o wf2 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

--solexa1.3-quals -T --transcriptome-index=transcriptome_index/dm3_trans dm3


tophat2 -p 4 -o wf4 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

--solexa1.3-quals -T --transcriptome-index=transcriptome_index/dm3_trans dm3


tophat2 -p 4 -o wf3 -i 20 -I 10000 --min-segment-intron 20 --max-segment-intron 10000

-T --transcriptome-index=transcriptome_index/dm3_trans dm3


cuffdiff -o cuffdiff_real_out -library-norm-method geometric -dispersion-method per-condition -L Tubule,Wholefly



/export/home/biostuds/2161696g/tophat/tb3/accepted_hits.bam /export/home/biostuds/2161696g/tophat/wf1/accepted_hits.bam,



Appendix III

Samtools Script

#! /bin/bash

#PBS -l cput=15:00:00

#PBS -N greco

#PBS -l nodes=1:ppn=4

#PBS -M [email protected]

#PBS -d /export/home/biostuds/2161696g/tophat/tb1

#PBS -m abe

samtools sort accepted_hits.bam accepted_hits.sorted

samtools index accepted_hits.sorted.bam

To export a reference to this article please select a referencing stye below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Request Removal

If you are the original writer of this essay and no longer wish to have the essay published on the UK Essays website then please click on the link below to request removal:

More from UK Essays

We can help with your essay
Find out more
Build Time: 0.0025 Seconds