In the post-genomic era, where large amounts of genetic data and information are available in various databases and publications, the gap between bioinformaticians who handle biological data and clinicians who need it must be bridged. Bridging this gap will benefit humankind and create more opportunities for research.
Stroke is a multifactorial, complex disorder involving a huge number of mediating proteins that interact with each other in a complicated manner during its development. Even though a vast collection of data and information covering most areas is freely available, integrating and annotating the data related to the development of stroke is a challenge, because the various data repositories are maintained for different purposes.
Predicting genes related to stroke, using knowledge extracted from biological publications together with bioinformatics, is beneficial to clinicians. Further, determining the genes related to a disease requires experiments in well-equipped laboratories, which are very costly. Therefore, predicting good candidate genes before experimental analysis, through projects of this type in which a graph-based topological network model represents the global picture of interactions among biological entities, is a timely requirement for bioinformatics research, particularly for the development of new drugs.
Following a thorough literature review of data repositories, the following were used as resources for the project: the Phenopedia component of HuGE Navigator; the STRING protein interaction database, which is highly informative and regularly updated; the Human Genome Nomenclature consortium; the disease list of the HuGE Navigator knowledgebase; and PubMed.
This project aimed to visualize the genetic involvement in the development of stroke in a simple way through a network diagram. Identifying the genes associated with stroke and the interactions among them is the preliminary task of a project of this type. It requires thorough knowledge of the development of stroke and its genetic involvement, and of the existing facilities that help to obtain that knowledge.
2.1 Genetic disorders
Diseases caused by alterations in chromosomes or genes are called genetic disorders. There are two broad categories: monogenic and polygenic disorders. Monogenic disorders result from mutations in a single gene, whereas polygenic disorders involve many genes in the causation of disease. There are also multifactorial, or complex, diseases in which both environmental and genetic factors are involved. (1)
Stroke, the impairment or loss of function of part of the brain resulting from an interruption in its blood supply, is considered a multifactorial disorder. The severity of stroke can range from mild, with full recovery, to massive, with serious disabilities. It can even be fatal, and it is a leading cause of permanent disability worldwide. Various primary diseases that cause stroke have been identified. (Table-1)
Scientists have identified many important risk factors for many forms of stroke, including high cholesterol levels, high blood pressure, diabetes, atrial fibrillation, smoking, disorders of clotting factors and a prior stroke. However, understanding of why stroke occurs remains limited. This limited understanding prevents us from making an accurate diagnosis early enough to manage stroke effectively, and from developing more effective treatments. Medical practitioners have turned their attention to the genetic involvement in disease to overcome this limitation.
Some epidemiological studies of stroke have shown genetic involvement. (2) There are approximately 30,000 genes in each human being. These genes can differ slightly from person to person, and it is strongly believed that the degree of vulnerability to a disease depends on those differences. It is also probable that multiple loci interact to increase stroke risk in such a way that the combined effect of the genes is greater or less than that expected by multiplying their individual main effects. (3)
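The comparison with the product of individual main effects can be illustrated with a small calculation. The following is a minimal sketch only: the function name and the effect sizes are hypothetical, and the effects are assumed to be expressed as relative risks on a multiplicative scale.

```python
def multiplicative_interaction_ratio(effect_a, effect_b, joint_effect):
    """Ratio of the observed joint effect to the product of the two main effects.

    A ratio of 1.0 means no interaction on the multiplicative scale;
    above 1.0 the combined effect is greater than expected, below 1.0 it is less.
    """
    expected = effect_a * effect_b  # effect expected if the loci act independently
    return joint_effect / expected


# Hypothetical relative risks, for illustration only (not data from any study).
ratio = multiplicative_interaction_ratio(effect_a=1.5, effect_b=2.0, joint_effect=4.5)
print(ratio)
```

With these made-up numbers the expected joint effect is 3.0, so an observed joint effect of 4.5 would indicate a combined effect greater than the product of the main effects.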
The development of stroke is a complex process. Biological processes affecting the formation of blood clots, mechanisms by which the integrity of vessels is affected, apoptosis (programmed cell death), and inflammatory processes induced by the reduction of blood supply, or by the cause that is itself responsible for that reduction, all play a role in the development of stroke through various biological pathways. These processes and their regulatory processes (either stimulating or inhibiting) are interconnected.
2.3 Analysis of integrated data
With recent advances in bioinformatics, the biological data resources generated by the scientific community have grown explosively worldwide over the past two decades. These resources are mainly web based and freely available in various public and private databases, made possible by new database technologies and the Internet. The field of biomedical informatics has drawn increasing popularity and attention, which in turn has led to technical advances in molecular, genomic and biomedical areas and produced applications such as genome sequencing, protein identification, medical imaging and electronic patient medical records.
Analyzing these widely available data, retrieved from all available databases, in a way that clinicians can use effectively is a timely requirement in this post-genomic era. A network diagram that shows the global picture of interactions among bio-entities is a concise way of representing the complex processes involved in the causation of diseases. This is an intensive field of research. (4) However, the statistical power to detect such genetic interactions is a serious limitation for most studies, particularly where the number of possible interactions is extremely large. Different statistical methods, such as logistic regression, classification and regression trees, and multifactor dimensionality reduction (MDR), have been applied on a large scale in some studies of complex diseases to detect gene-gene interactions, with some promising results. (4)
Epiphenomena: Sickle cell crisis (thrombotic)
Thrombotic thrombocytopenic purpura
Intracranial arteriovenous malformations
Mendelian inherited conditions: Sickle cell anemia
Autosomal dominant conditions:
Familial hemiplegic telangiectasia
Hutchinson-Gilford progeria syndrome
Polycystic kidney disease, adult [autosomal dominant]
Autosomal recessive conditions: CARASIL
X-linked inherited conditions: Haemophilia type A
Cardiac and vascular conditions:
Carotid artery dissection
Carotid artery stenosis
Dissecting aortic aneurysm
Internal carotid artery aneurysm
Mitral valve prolapse
Pulmonary arteriovenous malformations
Vertebral artery dissection
Autoimmune conditions: Systemic lupus erythematosus
Posterior inferior cerebral artery syndrome
Trauma, mechanical and physical conditions: Decompression sickness
Iatrogenic conditions: Coronary angiography
Table-1 Features or causes of cerebral vascular accident [sorted by category] according to the Diseases Database
2.4 Interaction networks
A graph-based model representing the global picture of interactions between biological entities is called an interaction network. In this model, bio-entities are shown as nodes in the graph, while functional relations are represented as edges connecting the corresponding bio-entities. The properties of each bio-entity and of each relationship are stored as attributes (5). In this type of model, topology helps to understand the network architecture.
The following are some of the important parameters.
a. Degree: The number of edges connected to a single node. In directed pathways, the numbers of links towards the node and away from the node are called the in-degree and out-degree respectively.
b. Distance/Shortest path length: The number of edges on the shortest path between two given nodes.
c. Diameter/longest path length: The maximum distance, that is, the longest of the shortest paths, between any two nodes.
d. Clustering coefficient: The degree to which nodes in the graph tend to cluster together. For a particular node it depends on the number of neighbouring nodes and on the number of connected pairs among those neighbours. The average of the clustering coefficients over all nodes is referred to as the network clustering coefficient.
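The parameters listed above can be computed directly from an edge list. The following is a minimal sketch using only the Python standard library; the node names and the four-edge network are hypothetical placeholders, not results from this project, and dedicated tools would normally be used for real networks.

```python
from collections import deque

# Hypothetical interaction list; node names are placeholders, not real genes.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_adjacency(edge_list):
    """Undirected adjacency map: node -> set of neighbouring nodes."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Number of edges connected to the node."""
    return len(adj[node])


def in_out_degree(directed_edges, node):
    """In-degree and out-degree when each pair is read as source -> target."""
    in_deg = sum(1 for _, target in directed_edges if target == node)
    out_deg = sum(1 for source, _ in directed_edges if source == node)
    return in_deg, out_deg


def shortest_path_lengths(adj, source):
    """Breadth-first search: distance in edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Largest shortest-path length over all reachable pairs of nodes."""
    return max(max(shortest_path_lengths(adj, n).values()) for n in adj)


def clustering_coefficient(adj, node):
    """Connected pairs among the node's neighbours divided by all possible pairs."""
    neighbours = list(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in adj[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


def network_clustering_coefficient(adj):
    """Average of the clustering coefficients of all nodes."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


adj = build_adjacency(edges)
print(degree(adj, "C"), diameter(adj), network_clustering_coefficient(adj))
```

Treating nodes with fewer than two neighbours as having a clustering coefficient of zero is a convention chosen for this sketch; other tools may exclude such nodes from the average.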
Using these networks, the end user can conveniently access and assess some of the main features of bio-entities. Independent visualization of genes and their products, as well as representation of the different bio-entity interactions of the same gene, would provide better resolution. However, the presence of multiple relationships in a biological system may make graph representations more complex. Visualizing nodes in a way that distinguishes the properties of the bio-entities from the properties of the interactions is a challenge for bioinformaticians.
Several tools, such as Cytoscape, Biological Networks, VisANT and Osprey, have been developed for the construction and visualization of networks of bio-entities; they create networks from information stored in databases.
Chapter 3: Objectives
3.1 General objectives
To identify genes associated with stroke and to identify stroke associated gene interactions using bioinformatics and biological databases.
3.2 Specific objectives
To identify genes associated with development of stroke
To identify disease associated gene interactions specific to stroke
To model the disease associated gene interaction network specific to stroke
This literature review was done with the objective of reviewing the network representation of relationships among bio-entities. To understand the functioning of organisms and the pathology of diseases, gene expression must be understood in detail; it is a complex process involving regulation at various levels. The regulation of gene expression occurs through genetic regulatory systems comprising networks of interactions between deoxyribonucleic acid (DNA), ribonucleic acid (RNA), proteins and other small molecules. There are complex positive and negative feedback loops that are yet to be understood, and the dynamics of regulatory mechanisms are hard to study. It is therefore necessary to model and simulate genetic regulatory mechanisms in order to understand the involvement of genes in complex diseases more easily.
After the completion of the Human Genome Project, several studies analyzing various aspects of biological data have been published. The knowledge derived from them may provide valuable information on various aspects of humankind, especially its evolutionary history, and may lead to the identification of genetic variants responsible for the occurrence of various diseases. However, integrated information on multifactorial diseases like stroke is still limited.
4.1 Pathophysiology of Strokes (Cerebral Vascular Accidents)
Figure 1: Causes of stroke
The two main types of stroke are ischemic and hemorrhagic, which account for approximately 85% and 15% of cases respectively. In the ischemic type the blood supply of the brain is interrupted by a blockage of the brain vessels, while in the hemorrhagic type it is interrupted by the rupture of a blood vessel. Approximately 45% of ischemic strokes are caused by an arterial thrombus, of which 20% are embolic in origin. (6) Ischemic strokes caused by vasospasm and some forms of arteritis stand out among the more infrequent causes of stroke.
Vascular malformations of the brain (VMBs) are the main cause of hemorrhagic stroke, and they cause serious neurological disability or death in a significant proportion of the people who have them. The most common VMBs are arteriovenous malformations and cerebral cavernous malformations. (7)
The most common pathological process of vascular obstruction resulting in thrombotic stroke is atherosclerosis. Disruption of the vessel endothelium following the formation of an atherosclerotic plaque initiates a complicated process that activates many destructive vasoactive biochemicals. Platelets and leukocytes play a major role: platelet adherence and aggregation on the vascular wall induce the formation of small clots with fibrin, while the leukocytes present at these sites mediate an inflammatory response. (8)
Pathological conditions other than atherosclerosis that cause thrombotic occlusion of a vessel include clot formation due to various clinical conditions such as hypercoagulable states, fibro-muscular dysplasia, Giant cell arteritis, Takayasu arteritis, and dissection of a vessel wall. (8)
Embolic stroke can result from embolization of an artery in the central circulation from a variety of sources, such as atherosclerotic plaques, fibrin clots, fat, air, tumor or metastasis, bacterial clumps, and foreign bodies. The neurological outcome of this type of stroke depends not only on the blockage but also on the ability of the thrombus to act as a vascular irritant and stimulate vascular spasm. (8)
The critical time period during which the tissues of the brain are at risk is referred to as the window of opportunity. During this period, the neurological deficits caused by ischemia can be partly or completely reversed by reperfusion, because still-viable brain tissue exists. (8)
4.1.1 Pathophysiology of Strokes at molecular level
Overreaction of some biochemicals, triggered by the depletion of cellular energy stores, initiates the development of hypoxic or ischemic neuronal injury and causes a variety of physiological and pharmacological effects. Endothelial cells are among the first cell types to respond to a reduction in oxygen supply (8).
Activation of several molecules promotes leukocyte adherence to the endothelial wall, initiating an inflammatory process that involves many inflammatory mediators such as tumor necrosis factor (TNF). Adhered leukocytes activate other vasoactive substances, causing pathological consequences such as dilatation of vessels, constriction of vessels and altered permeability of the vessel wall (8).
Coagulation necrosis and apoptosis (programmed cell death) are the two pathological processes by which neurons die. In coagulation necrosis (CN), living neighbouring cells are destroyed without eliciting an inflammatory response. The cell initially swells and then shrinks, undergoing nuclear destruction with marked nuclear chromatin condensation.
During apoptosis, the plasma membrane and the mitochondrial membrane are maintained until the later stages of the process, while damage to the nucleus occurs first. Ischemia activates latent suicide proteins in the nuclei, which starts autolysis. Apoptosis is controlled by various cell signals, which may originate extracellularly from extrinsic inducers or intracellularly from intrinsic inducers. Both the intrinsic and the extrinsic pathways proceed through a cascade of activation of caspases, a family of proteolytic enzymes. Activated caspases lead to cell death by degrading DNA and digesting structural proteins in the cytoplasm. A third way of activating apoptosis is triggered by Apoptosis Inducing Factor (AIF) without activating caspases. (9)
In the intrinsic pathway, different types of regulatory proteins have been identified. For example, Bcl-2, which is found on the outer membrane of mitochondria, inhibits apoptosis, while mediators such as Bax, cytochrome c and Apaf-1 (apoptotic protease activating factor-1) mediate the aggregation of apoptosomes, which stimulate apoptosis (9).
In the extrinsic pathway, the integral membrane proteins Fas and TNF, with their receptor domains, initiate apoptosis, involving many more biochemicals that regulate the caspase cascade.
4.2 Genetic involvement in Stroke
About two thirds of strokes show variations in known risk factors, identified as either environmental influences or genetic factors. (1)
A number of proteins that regulate the process, either by stimulating or by inhibiting it, have been identified as associated with stroke. The number of known regulatory proteins has risen exponentially following recent advances in understanding the series of pathophysiological processes, which may occur over many years. Any mutation in the genes responsible for such proteins can affect the development of stroke. This in turn lengthens the list of polymorphisms associated with stroke in humans. For example, phosphodiesterase 4D (PDE4D) is a regulator of cyclic AMP levels that affects the control of smooth muscle proliferation and immune function in vessels. Mutations in PDE4D affect the regulation of atherosclerosis, which results in stroke development. (10) (11)
Although the available information increases day by day, current understanding of the involvement of genetic factors in stroke is still poor. (12) A minority of cases have been identified as having monogenic causes, such as sickle cell disease, cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), cerebral autosomal recessive arteriopathy with subcortical infarcts and leukoencephalopathy (CARASIL), cerebroretinal vasculopathy (CRV), hereditary endotheliopathy with retinopathy, nephropathy and stroke (HERNS), and moyamoya disease. Genetic factors appear to be very important in the remainder as well. The involvement of multiple genes, each exerting a small influence on the phenotype, might be further influenced by environmental factors. The outcome of stroke varies with different combinations of genetic and environmental influences. (1)
4.3 Gene expression
Gene expression is a complex process that is regulated at several stages in the synthesis of proteins. Apart from the regulation of DNA transcription, the expression of a gene is controlled during RNA processing and translation, as well as through post-translational modifications. The degradation of proteins and of intermediate RNA products is also regulated. The proteins involved in regulating gene expression may themselves be the end product of another gene that has a different regulatory system. This gives rise to genetic regulatory systems consisting of various networks of regulatory interactions. (13)
4.4 Repositories of biological data
A biological database is a collection of interrelated biological data. Managing such a database is a challenging task because biological data are complex, incomplete and error-prone. Data retrieval, data annotation, data integration and data editing are the main areas of focus in database management.
Biological databases include both public repositories of gene data, such as GenBank or the Protein Data Bank (PDB), and private databases, such as those used by research groups involved in gene mapping projects and by biotech companies. Currently, the quality and effectiveness of bioinformatics work depend on the technology used in the database. The biological sequence format, that is, the way in which data such as nucleotide sequences, amino acid sequences and protein structures are stored, is specific to each database. Some databases, for example GenBank, DDBJ and EMBL, share the same format. (14) Most databases provide tools, an Application Programming Interface (API) or instructions for retrieving data, whatever the technology used.
There are two types of databases: primary databases, in which raw data are stored, and secondary databases, which consist of annotated or curated data. Curation may be done manually or automatically by computers. Data repositories such as GenBank, the Single Nucleotide Polymorphism (SNP) database and the GEO database maintain raw data. GEO DataSets is a human-curated database, while UniGene and HomoloGene are examples of computer-derived curated databases. Databases such as RefSeq and Genome Assembly combine human curation with computer derivation.
There are various types of data repositories established for different purposes. For example, the reference human genome sequence consists of 24 finished chromosomes comprising 2.9 billion bases and covers about 99 percent of the genes. Its sequence accuracy is maintained at an average of one erroneous base pair per 10,000 bases. It provides a foundation for the study of human genetics (15). (16) A systematic investigation of human variation needs to be based on complete knowledge of DNA sequence variation across the reference genome, as understanding the relationship between genotype and phenotype is one of the main goals of biology and medicine.
The most popular databases are GenBank from the National Center for Biotechnology Information (NCBI), SwissProt from the Swiss Institute of Bioinformatics and the Protein Sequence Database (PSD) from the Protein Information Resource (PIR). Data repositories such as NCBI's Entrez Gene and Ensembl maintain annotations on whole genomes. (17) (18) Information on sequences, gene locations, transcripts and classifications, and links to several external databases, can be retrieved from these databases. GenBank is one of the fastest growing repositories of known genetic sequences. The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and submitted directly by researchers and sequencing groups. Data collection is done in collaboration with GenBank and the DNA Database of Japan (DDBJ).
Popular protein databases include SwissProt, UniProt, PSD and the Translated EMBL nucleotide database (TrEMBL). SwissProt is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy. UniProt is kept up to date, as it frequently imports information from PIR-PSD, SwissProt and TrEMBL. According to their content, protein sequence databases are classified as primary, secondary or composite. PSD and SwissProt are examples of primary databases, which store protein sequences as raw data. Secondary databases such as PROSITE contain information derived from protein sequences. Composite databases provide both raw data and data derived from raw data, with redundant data filtered out.
4.4.1 Protein interaction databases
Major repositories of protein-protein interactions from multiple organisms include IntAct (19), the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) (20), the Database of Interacting Proteins (DIP) (21), the Biomolecular Interaction Network Database (BIND) (22), the Human Protein Reference Database (HPRD) (23), BioGRID (24), and the Molecular Interactions Database (MINT) (25). Data repositories such as PID (26), Reactome (27), and BioCyc (28) provide information on both metabolic and signaling pathways. These databases can be considered repositories of biological entities and their functional relations.
STRING, one of the largest databases, contains information from various resources, including experimental data, published literature and computational prediction methods. The database is freely accessible and is updated weekly. The latest version, 8.3, contains information on about 2.5 million proteins from 630 species. STRING imports protein association information from databases of physical interactions and databases of curated biological pathway knowledge, such as DIP, MINT, IntAct, BIND, BioGRID, Reactome, the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO). It calculates a confidence score for each protein interaction using a weighting process.
Although a confidence score is calculated for each interaction, no distinction is made among the types of interaction. The scores are derived by benchmarking the performance of the predictions against a common reference set of trusted, true associations. The confidence scores in STRING correspond to the likelihood of finding the linked proteins within the same KEGG pathway. If a predicted association between a protein and its interactor places both in the same KEGG pathway, it is counted as a true positive. The functional grouping of proteins maintained in KEGG pathways is used as the reference mainly because the pathways are based on manual curation, are available for a number of organisms, and cover several functional areas.
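The benchmarking idea described above, counting a predicted association as a true positive when both proteins fall in the same reference pathway, can be sketched as follows. This is an illustration under stated assumptions, not STRING's actual scoring procedure: the pathway memberships, protein identifiers, raw scores and function names are all hypothetical.

```python
# Hypothetical reference pathways, standing in for KEGG pathway memberships.
reference_pathways = {
    "pathway_1": {"P1", "P2", "P3"},
    "pathway_2": {"P3", "P4"},
}

# Hypothetical predicted associations: (protein, interactor, raw score).
predictions = [
    ("P1", "P2", 0.9),
    ("P1", "P4", 0.8),
    ("P3", "P4", 0.7),
    ("P2", "P5", 0.2),
]


def share_pathway(a, b, pathways):
    """True if both proteins are members of at least one common pathway."""
    return any(a in members and b in members for members in pathways.values())


def true_positive_fraction(preds, pathways, min_score):
    """Fraction of predictions scoring at least min_score that share a pathway.

    Returns None when no prediction reaches the threshold.
    """
    selected = [(a, b) for a, b, score in preds if score >= min_score]
    if not selected:
        return None
    hits = sum(share_pathway(a, b, pathways) for a, b in selected)
    return hits / len(selected)


for threshold in (0.0, 0.5, 0.85):
    print(threshold, true_positive_fraction(predictions, reference_pathways, threshold))
```

Mapping each raw score to the true-positive fraction observed at that score level is one way a raw score could be turned into a confidence value that reflects the likelihood of a shared pathway.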
4.4.2 Disease associated databases
Databases such as the Human Genome Database (HGDB), Online Mendelian Inheritance in Man (OMIM) (29), the Genetic Association Database (GAD) (30), GeneCards and HuGE Navigator provide a good account of genes in relation to diseases. There are a few other private disease databases created for various purposes (31) that provide a search portal covering areas including medical disorders, symptoms and signs, drugs and medications, and common haematology and biochemistry investigation abnormalities, with a medical textbook-like index.
HGDB provides different types of information on genes, other DNA markers, genetic diseases and locus information, map locations, and bibliographic information, while the well-annotated OMIM database contains information on Mendelian disorders in over 12,000 genes.
GAD is a gene-centered database created to provide information on human genetic association studies of complex diseases and disorders to genetic and medical professionals and students for further study. It includes summary data from papers published in peer-reviewed journals on candidate gene studies and genome-wide association studies (GWAS). These data are recorded in the context of official human gene nomenclature, with additional molecular reference numbers and links. A separate record is kept for each gene in polygenic disorders. In the August 2010 GAD update, the number of database records roughly doubled, from 40,000 to over 84,000.
GeneCards is an integrated database of human genes that covers various aspects of genetic information, including genomic, proteomic and transcriptomic information, as well as disease relationships, SNPs, gene expression, gene function, and service links for ordering assays and related antibodies. The search engine sorts the results and provides the user with only the relevant, detailed information. The information presented in GeneCards is extracted from several databases.
Phenopedia, a genetic disease encyclopedia, is a component of the HuGE Navigator knowledgebase. It is updated weekly, both manually and automatically. It is based on the findings of the HuGE Literature Finder and provides a disease-centered view of genetic association studies. It returns a table of disease-associated genes ordered by the number of publications on each gene. The total number of publications and the numbers of meta-analyses, GWAS and gene-environment interaction studies are also displayed for each gene in tabular format. An additional feature is a link to the publications on each specific gene-disease association, hyperlinked to a relevant detailed information page. (32)
4.5 Linkage analysis
Linkage analysis is a genetic analysis that aims to identify loci that are inherited together with the disease phenotype. Genes located close together on a chromosome tend to stay together when passed from parent cells to daughter cells. When recombination is considered, the further apart two genes on the same chromosome are, the greater the chance that a recombination occurs between them. Two genes are called linked if the recombination fraction between them is considerably low (less than 50%).
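The recombination fraction can be estimated from counts of recombinant offspring, as in the minimal sketch below. The counts and function names are hypothetical. The LOD score included here is a standard linkage statistic that is not described in the text above; it is shown only as an assumed extension, for phase-known meioses, that compares a candidate recombination fraction with the no-linkage value of 0.5.

```python
import math


def recombination_fraction(recombinants, total_offspring):
    """Estimated recombination fraction: recombinant offspring / all offspring."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return recombinants / total_offspring


def is_linked(theta, threshold=0.5):
    """Loci are treated as linked when the recombination fraction is below 0.5."""
    return theta < threshold


def lod_score(recombinants, total_offspring, theta):
    """log10 likelihood ratio of linkage at theta versus no linkage (theta = 0.5).

    Assumes phase-known meioses and 0 < theta < 1.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    non_recombinants = total_offspring - recombinants
    return (
        recombinants * math.log10(theta)
        + non_recombinants * math.log10(1.0 - theta)
        - total_offspring * math.log10(0.5)
    )


# Hypothetical counts, for illustration only.
theta_hat = recombination_fraction(recombinants=12, total_offspring=100)
print(theta_hat, is_linked(theta_hat), lod_score(12, 100, theta_hat))
```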
Although linkage analysis has been successful in identifying genes for many Mendelian diseases, it has been less successful for complex diseases. It has been shown that linkage analysis is effective for localizing disease genes with moderate interaction effects at most realistic sample sizes, but genes that exhibit small interaction effects could not be detected. Linkage analysis is useful for examining a single disease locus at a time. (33)
4.6 Integration and analysis of integrated data
As described above, there is an enormous number of publicly available databases containing various kinds of biological data. However, much more has to be done to make these data understandable. Some of the issues identified with data integration are diversity in data formats, ambiguity in terminology and the variety of data versioning technologies. (34) Currently, data management is largely conducted by specialist databases that contain the detailed information required for in-depth analysis of the relevant data. This level of detail is usually lacking in primary databases such as SwissProt or GenBank. Specialist databases generally offer more detailed annotation, data integrated from multiple sources, and integrated search and analysis tools.
4.7 Tools for network building, visualization and analysis
Networks are considered useful computational tools for representing many types of biological data, such as biomolecular interactions, cellular pathways and functional modules (35). Interaction networks are very complex, as interactions take place at various levels, such as the genomic, proteomic and metabolomic levels, as well as between these levels (36) (37).
A number of analysis and visualization tools have been developed by different groups. Examples are Cytoscape (38), Osprey (39), PathwayAssist (40), Pathways Database System, GeneGO, and VisANT. These tools use a number of specialized and publicly accessible databases. Osprey builds data-rich graphical representations, using color-coded graphs for gene function and experimental interaction data. PathwayAssist is a software application developed for the navigation and analysis of biological pathways, gene regulation networks and protein interaction maps (40), while VisANT is a web-based tool for visualizing and analyzing many types of networks of biological interactions.
Cytoscape is nonprofit, open source software used as a network visualization and analysis tool, with additional features available as plugins. It is available as a platform-independent Java application, released under the terms of the Lesser General Public License (LGPL). The system requirements for Cytoscape depend on the size of the networks to be handled. The listed requirements are a minimum 1 GHz processor, memory given as 2GB+ 512MB, a high-end on-board graphics card or video graphics card, and a wide or dual monitor. (38) As an efficient drawing tool, Cytoscape offers powerful visual styles with zooming functions for browsing. The network analysis plugin performs the analysis using complex algorithms and parameters such as the number of nodes, the number of edges, the path length between nodes, the network diameter (the largest distance), the number of neighbours, the number of connected pairs among all neighbours, and the direction of edges.
The BisoGenet plugin of Cytoscape constructs an interaction network for a queried list of genes, giving a good account of various types of interactions without distinguishing the species, using its own database called SysBiomics. The SysBiomics database imports data from various databases such as NCBI, UniProt, KEGG and DIP. Cytoscape supports the import of networks from delimited text files and Excel workbooks. The interactive GUI provides parsing options for the specified file, together with a preview that shows how the file will be parsed with the current configuration. To specify how the file will be parsed, the user must indicate the columns that represent the source nodes, the target nodes and, optionally, the edge interaction type, and whether the network is directed or undirected. A network is referred to as undirected if the direction of the edges is ignored (41).
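A delimited text file of the kind described above, with one column each for the source node, the interaction type and the target node, could be prepared as in the following sketch. The column names, the interaction label "pp" and the gene names are hypothetical choices made for illustration; the mapping of columns to source, target and interaction type is made by the user in the import dialog.

```python
import csv
import io

# Hypothetical interactions: (source node, interaction type, target node).
interactions = [
    ("GENE_A", "pp", "GENE_B"),
    ("GENE_B", "pp", "GENE_C"),
]


def write_edge_table(rows, handle, delimiter="\t"):
    """Write a header line and one delimited line per interaction."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["source", "interaction", "target"])
    for source, interaction, target in rows:
        writer.writerow([source, interaction, target])


def read_edge_table(handle, delimiter="\t"):
    """Read the table back as (source, interaction, target) tuples."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [(row["source"], row["interaction"], row["target"]) for row in reader]


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_edge_table(interactions, buffer)
buffer.seek(0)
parsed = read_edge_table(buffer)
print(parsed)
```

Reading the table back before importing it is a simple check that the delimiter and column order are what the import dialog will be told to expect.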
There are some studies focusing on analysis of networks especially how the properties of bio-entities vary with the topological parameters (42) .A relationship between clustering coefficient and the functionality of proteins were detected in a study whereas another study could not reveal any relationship with any parameter of network topology (43).
Chapter 5 Tools and Methodology
5.1 Initial list of genes [core genes]:
To derive the gene-disease association, an initial list of genes known to be related to stroke was prepared, mainly using the Phenopedia component of HuGE Navigator, which is a continuously updated knowledge base in human genome epidemiology. The OMIM database was also searched in order to obtain genes of Mendelian inheritance associated with stroke. Genes associated with stroke from GeneCards, a freely available and well annotated database, were also taken into consideration.
Three independent queries were conducted against the Phenopedia component of the HuGE Navigator knowledgebase, the OMIM database and the GeneCards database, as follows.
5.1.1 Query with Phenopedia knowledgebase
Phenopedia was searched with the query cerebrovascular accidents, which returned 521 gene names. The captured screen in figure 3 shows the returned values.
Figure 3: captured screen showing results from phenopedia
5.1.2 Query with OMIM database
The OMIM database was searched in the default basic search mode, without applying the limits available on the OMIM site that restrict the search to a specified field. The query used was cerebral vascular accidents, in order to retrieve genes related to cerebral vascular accidents (figure 4).
Figure 4: Captured screen showing results from OMIM
5.1.3 Query with GeneCards database
GeneCards was searched with the query cerebral vascular accident, which returned the gene names in a tabular format that includes GeneCards Inferred Functionality Scores (GIFtS) and is ordered by score (figure 5).
The results from GeneCards were downloaded using GeneALaCart, a batch querying application based on the GeneCards database. It allows retrieval of multiple genes in a single file, either in Excel format or in text format.
Figure 5: captured screen showing results from GeneCards
5.2 Ranking of initial list of genes [core genes]:
5.2.1 Weighing Process
5.2.1.1 First stage of sorting (based on presence of gene in databases)
In the first stage of ordering, a confidence factor of 1 was assigned for each database in which the gene was listed. The formula applied is as follows:
Value = n1*PH + n2*OM + n3*GC
PH = Phenopedia; OM = OMIM; GC = GeneCards
n1 = 1 if the gene is present in the Phenopedia list
n1 = 0 if the gene is not present in the Phenopedia list
n2 and n3 are defined in the same way for the OMIM and GeneCards lists respectively.
5.2.2 Second level of ordering (for equally weighted genes)
After the first stage of sorting, many genes had equal weightage. Among these, the genes with n1=1 (i.e. present in Phenopedia) were further sorted in descending order of the number of publications in the Phenopedia database.
The equally weighted genes with n1=0 were reordered alphabetically.
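The two-stage ordering described above can be sketched in Python as follows. This is only an illustration and not the project's own implementation. It assumes that PH, OM and GC each carry a weight of 1, that equally weighted Phenopedia genes are placed ahead of equally weighted non-Phenopedia genes, and that Phenopedia genes with the same publication count fall back to alphabetical order. None of these three points is stated in the text. The function name, gene symbols and publication counts are hypothetical.

```python
from typing import Dict, List, Set


def rank_core_genes(
    phenopedia: Dict[str, int],  # gene symbol -> publication count (hypothetical data)
    omim: Set[str],
    genecards: Set[str],
) -> List[str]:
    """Order genes by database presence, then by the tie-breaking rules."""
    all_genes = set(phenopedia) | omim | genecards

    def sort_key(gene: str):
        n1 = 1 if gene in phenopedia else 0
        n2 = 1 if gene in omim else 0
        n3 = 1 if gene in genecards else 0
        value = n1 + n2 + n3  # assumes PH = OM = GC = 1
        pubs = phenopedia.get(gene, 0)  # 0 for genes outside Phenopedia
        # Highest value first; within equal value, Phenopedia genes first
        # (assumption), then most publications, then alphabetical order.
        return (-value, -n1, -pubs, gene)

    return sorted(all_genes, key=sort_key)


ranked = rank_core_genes(
    phenopedia={"GENE_A": 10, "GENE_B": 25, "GENE_C": 5},
    omim={"GENE_C", "GENE_X"},
    genecards={"GENE_C", "GENE_M"},
)
```

In this toy input GENE_C is present in all three lists and is ranked first. The two remaining Phenopedia genes follow in order of publication count, and the two genes outside Phenopedia come last in alphabetical order.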
5.2 Standardization of gene names
Gene abbreviations and gene symbols used in different biological repositories might not be unique for a particular gene. To overcome this ambiguity, standardization of gene names has been introduced by the Human Genome Organization (HUGO) Gene Nomenclature Committee, which assigns a unique gene symbol to each gene. In this project each core gene as well as each novel gene was mapped according to the HUGO Gene Nomenclature Committee (HGNC).
Figure 6: data flow
5.3 Selection of genes for searching of interactions
The ranked core gene set was divided into separate sample spaces of 15 genes each in order to draw a separate network diagram for each sample space. On completion of the network for each gene set, the resulting interaction network was to be merged with the previous network to produce the final gene-disease association interaction network. There is a linear relationship between the number of genes in the sample space and the time it takes to derive the network diagram (figure 7), so the processing time is expected to increase with the number of genes in a sample. Hence, considering the complexity of the design, this linear relationship and the research timeline imposed, a sample space of 15 was selected.
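The splitting and merging steps can be sketched as below. This is a minimal sketch that assumes each per-sample network is held as a set of (source, target) pairs. The function names are hypothetical and do not come from the project.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def split_into_samples(ranked_genes: List[str], size: int = 15) -> List[List[str]]:
    """Cut the ranked list into consecutive sample spaces of `size` genes."""
    return [ranked_genes[i:i + size] for i in range(0, len(ranked_genes), size)]


def merge_networks(networks: Iterable[Set[Edge]]) -> Set[Edge]:
    """Union the edge sets of the per-sample networks into one network."""
    merged: Set[Edge] = set()
    for edges in networks:
        merged |= edges
    return merged


# Placeholder symbols standing in for a ranked list of 596 genes.
samples = split_into_samples([f"G{i}" for i in range(596)])
```

With a list of 596 genes and a sample size of 15, this gives 40 sample spaces, the last of which holds 11 genes.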
5.4 Selection of novel genes interacted with core genes
Novel genes that interact with core genes were searched for each core gene using the STRING interaction database. It was convenient to use a protein interaction database because the gene name and the protein name of the same gene are similar. The database was searched with the name of the protein of interest from the core gene set, for which functional associations are to be predicted. High confidence interactions, having a score of more than 0.9, were selected.
Figure 7: relationships between sample space and time
The STRING API (figure 9), which enables retrieval of data without using the graphical user interface of the STRING web site, was used with the following Uniform Resource Identifiers (URIs). This avoided downloading the entire dataset, which is a complex process. The API was used in three steps to access the names of the interacting proteins, which were identified as novel genes.
To call the API, the URI was constructed in the following form:
The flow of information retrieved from STRING is as follows. As the STRING database recognizes its own STRING ids, the STRING id of the relevant core gene was obtained first. Novel genes were then obtained with their STRING ids, following a query submitted with the STRING id of each core gene recovered in the first step. The gene names for the STRING ids of the retrieved novel genes were also identified through the API (please refer to figure 8).
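The first two steps of this flow can be illustrated with the sketch below, which only builds request URLs and does not send them. It follows the STRING API as publicly documented at present (methods get_string_ids and interaction_partners, with the parameters identifiers, species and required_score) and may differ from the URI form actually used in the project, which is not reproduced here. The gene symbol APOE is used purely as an example, the STRING id is a placeholder, and the helper name build_string_url is hypothetical.

```python
from urllib.parse import urlencode

STRING_API_BASE = "https://string-db.org/api"  # current public endpoint (assumption)
HUMAN_TAXON_ID = 9606  # NCBI taxonomy id for Homo sapiens


def build_string_url(method: str, identifiers, output_format: str = "tsv", **params) -> str:
    """Compose a STRING API request URL without sending it."""
    # Multiple identifiers are separated by a carriage return, which
    # urlencode percent-encodes as %0D.
    query = {"identifiers": "\r".join(identifiers), "species": HUMAN_TAXON_ID}
    query.update(params)
    return f"{STRING_API_BASE}/{output_format}/{method}?{urlencode(query)}"


# Step 1: resolve a core gene symbol to its STRING id.
id_lookup_url = build_string_url("get_string_ids", ["APOE"])

# Step 2: request interaction partners above the 0.9 confidence cut-off.
# The API expresses the score on a 0-1000 scale, so 0.9 becomes 900.
partners_url = build_string_url(
    "interaction_partners",
    ["9606.ENSP00000000000"],  # placeholder STRING id, not a real protein
    required_score=900,
)
```

In the current API the interaction_partners output already carries preferred names alongside the STRING ids, so the third step of mapping ids back to gene names may not need a separate request today. That observation is about the present API only.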
Figure 8: Information retrieval from STRING
Figure 9: STRING API
5.4 Filtering of disease specific novel genes
The resulting interactors (the novel genes) were filtered to obtain the interacting genes associated with stroke by mapping them against the HuGE Navigator disease list. The terms used while mapping against the HuGE disease list were stroke, brain ischemia, cerebral vascular accidents, cerebral ischemia and cerebral infarction. Disease variants were identified by mapping the filtered genes against the GWAS catalogue. Each novel gene that passed the filter was marked as positive for association with stroke. The interactor genes that had not been annotated as associated with stroke in either of the above two data repositories were searched manually in PubMed.
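The mapping against the disease list can be sketched as follows. This is a minimal sketch that assumes the disease list is available as a mapping from gene symbol to a set of disease names and that terms are matched exactly, ignoring case. The function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Search terms named in the text, lower-cased for case-insensitive matching.
STROKE_TERMS = {
    "stroke",
    "brain ischemia",
    "cerebral vascular accidents",
    "cerebral ischemia",
    "cerebral infarction",
}


def filter_stroke_genes(
    novel_genes: Iterable[str],
    disease_list: Dict[str, Set[str]],
) -> List[str]:
    """Keep novel genes annotated with at least one stroke-related term."""
    positives = []
    for gene in novel_genes:
        diseases = {name.lower() for name in disease_list.get(gene, set())}
        if diseases & STROKE_TERMS:
            positives.append(gene)
    return positives


# Toy disease list; gene symbols and annotations are invented.
toy_disease_list = {
    "GENE_A": {"Stroke", "Hypertension"},
    "GENE_B": {"Asthma"},
    "GENE_C": {"Brain Ischemia"},
}
positive_genes = filter_stroke_genes(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"], toy_disease_list
)
```

Genes that are absent from the disease list, such as GENE_D here, are simply not marked as positive. They would be the candidates for the manual PubMed search.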
5.5 Graphical presentation of gene disease association network
Cytoscape was used as the network drawing and analysis tool. A tab-delimited text file compatible with Cytoscape was created with three columns, having core genes as source nodes, filtered novel genes as target nodes and pp (protein-protein interaction) as the interaction type (please see Annex C). This file was fed to Cytoscape. The network analysis plugin of Cytoscape was also used to obtain the analysis report.
NB: Some common interaction types that can be used in Cytoscape are as follows:
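A file of this kind could be produced as in the sketch below. The column order (source, interaction, target) and the header names are assumptions, since the layout of Annex C is not shown here. Cytoscape's import dialog lets the user assign the columns in any case.

```python
import csv
import io
from typing import Iterable, Tuple


def edges_to_tsv(edges: Iterable[Tuple[str, str]], interaction: str = "pp") -> str:
    """Render (core gene, novel gene) pairs as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "interaction", "target"])  # header names are hypothetical
    for core_gene, novel_gene in edges:
        writer.writerow([core_gene, interaction, novel_gene])
    return buffer.getvalue()


# Placeholder gene symbols; write the returned text to a .txt file for import.
tsv_text = edges_to_tsv([("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")])
```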
5.6 Data analysis:
To facilitate data analysis, the information was first captured in a database and processing was performed on the stored data. For this purpose, open source technologies were selected because of their free availability and the wealth of information and support provided. Hypertext Preprocessor (PHP) was chosen as the scripting language for processing and MySQL as the backend database. The whole application was run on the open source LAMP (Linux, Apache, MySQL and PHP) stack. JavaScript and Cascading Style Sheets (CSS), which are client-side technologies, were also used to enhance user interaction. Database administration was carried out using the phpMyAdmin tool. JPG, PNG and GIF formats were used for images.
A web-based Graphical User Interface (GUI) was provided for the end user to interact with the system. Data visualization by way of network diagrams and a search capability were provided through the GUI. However, editing of master data can be done only by a user with administrative privileges. Thus, a separate account with its own username and password was provided for the administrator (figure 10).
Within the database, separate tables were created to capture the data for core genes, interactor genes, the HuGE disease association list, PubMed data and HGNC data (figure 11). The ranking of the core genes and the filtering were automated.
Figure 10-Screen shot 1: The user has the option to view the stroke associated gene interactions
Figure 10-Screen shot 2: The user can search stroke associated genes of the given chromosome
Figure 10-Screen shot 3: The user can search stroke associated genes of the given chromosome
Figure 10-Screen shot4; Administrator has the privilege to update
Figure: 10- GUI planned
Figure 11- structure of database
From the Phenopedia component of HuGE Navigator, 521 genes associated with stroke were retrieved. From GeneCards, 99 genes were identified from 76 results, while nine items were retrieved from the OMIM database (Annex A). In the OMIM database the relevant chromosomal locus was returned instead of a particular gene name. There were 11 genes retrieved from OMIM that were not mentioned in Phenopedia, while 64 genes in GeneCards were not present in Phenopedia. The final list of core genes therefore consists of 596 genes (521 + 11 + 64) (figure 12).
Figure 12: Venn diagram showing core genes distribution among selected data repositories
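The counts behind such a Venn diagram can be derived with set operations, as in the sketch below. The function name and the toy gene sets are hypothetical. Note that the union equals the Phenopedia count plus the two "not in Phenopedia" counts only when the OMIM-only and GeneCards-only genes do not overlap each other, as in this toy example.

```python
from typing import Dict, Set


def summarize_sources(
    phenopedia: Set[str], omim: Set[str], genecards: Set[str]
) -> Dict[str, int]:
    """Count genes per source and in the combined core gene list."""
    return {
        "phenopedia": len(phenopedia),
        "omim_not_in_phenopedia": len(omim - phenopedia),
        "genecards_not_in_phenopedia": len(genecards - phenopedia),
        "core_genes": len(phenopedia | omim | genecards),
    }


# Toy sets of invented symbols standing in for the three query results.
summary = summarize_sources(
    phenopedia={"A", "B", "C"},
    omim={"C", "D"},
    genecards={"B", "E", "F"},
)
```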
The sample of the first 15 genes, the novel genes retrieved for each core gene and the novel genes found to be associated with stroke after the filtering process are shown in table 2. On average there were 9.47 interactor genes with a confidence score of more than 0.9 per core gene. However, after mapping against the HuGE Navigator disease list, an average of 4.39 interactors per core gene were associated with stroke (Annex B). Further, 33 gene interactions were identified from the PubMed search, which is only 50% of the searched items.
The resulting disease-associated gene interaction network returned as output from Cytoscape is shown in figure 13. There were 68 nodes interconnected by 65 edges.
Figure 13: The network resulted from first 15 core genes
The simple parameters calculated by the Cytoscape analysis are as follows:
a. Clustering co-efficient 0.033
b. Number of nodes 68
c. Isolated nodes 0
d. Connected components 6
e. Network radius 1
f. Network centralization 0.109
g. Characteristic path length 3.272
h. Average number of neighbors 1.88
i. Network density 0.028
j. Number of self loops 0
According to the network analyzer, the network is disconnected, forming six separate connected components (figure 14). The smallest connected component has 2 nodes, while the largest has 27 nodes. Topological analysis of each connected component is helpful.
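For readers unfamiliar with these parameters, the sketch below shows how connected components, network density and the average number of neighbours can be computed for a simple undirected network. It uses the standard textbook definitions on an invented toy edge list. It is not the Cytoscape implementation, which may treat duplicate edges or self loops differently, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map; duplicate edges collapse into one."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def connected_components(adjacency: Dict[str, Set[str]]) -> List[Set[str]]:
    """Find connected components with a breadth-first search."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def density_and_mean_neighbours(adjacency: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Return (density, average number of neighbours) for n >= 2 nodes."""
    n = len(adjacency)
    m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    return 2 * m / (n * (n - 1)), 2 * m / n


# Toy network: one chain of three nodes and one separate pair.
toy = build_adjacency([("A", "B"), ("B", "C"), ("D", "E")])
components = connected_components(toy)
density, mean_neighbours = density_and_mean_neighbours(toy)
```

The toy network has two connected components, of 3 and 2 nodes. An edge list cannot represent isolated nodes, which matches the zero isolated nodes reported for the network above.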
Figure 14: connected components of the network displaying each connected component separately
Figure 15: connected components in a hierarchical manner
Figure 16: Distribution of shortest path length in the network
Table 2: Top 15 core genes with their HGNC id.s
As stroke is not a specific term, either cerebrovascular accident or cerebral vascular accident was used throughout the study, selecting the appropriate term depending on the output of the query. As stroke is a multifactorial disease, identifying the genes associated with it is a challenge. Even though many regulatory proteins are already known to either stimulate or inhibit the pathways responsible for the development of different categories of stroke, many more are yet to be identified. Some of the reliable, well annotated data repositories, like OMIM, mainly include information on the relevant chromosomal locus instead of separate genes. Of the available data repositories, GeneCards and the Phenopedia component of HuGE Navigator provide a good description of the disease associations of genes, applying fair weightage to each gene. For these reasons all three databases were selected, and priority was given to Phenopedia, which is more reliable because it lists a gene only if published articles are available.
The number of genes returned by Phenopedia, as well as by the PubMed search, varied when the same query was run on different occasions, fluctuating each time by 10-20 genes. This could be because these databases are updated frequently. However, this temporal deviation could not be calculated because of time constraints.
Having identified the core genes, they were ranked before searching for interactions so that the first gene samples would include the more reliable genes, as the first few samples would also be used as pilot projects to measure the functionality of the pipeline adopted.
The coverage of the STRING protein interaction database is satisfactory, as it also imports information from other interaction databases, as indicated in the literature review section. The unique graphical visualization of the STRING interaction database and the availability of a user-friendly API led to its selection for searching for interactors.
In the filtering process each novel gene identified was mapped against the disease list of HuGE Navigator, which contains a good amount of information on diseases. The drawback was that this disease list has not been standardized. Therefore the whole list was manually scanned and the following terms were identified as terms related to stroke:
a. Stroke
b. Cerebral ischemia
c. Cerebral infarction
d. Brain ischemia
e. Brain infarction
f. Cerebral vascular accident
g. Brain hemorrhage
h. Cerebral hemorrhage
The Cytoscape tool provides a detailed analysis including clustering coefficient, shortest path length distribution, neighborhood connectivity distribution, node degree distribution, average clustering coefficient distribution, topological coefficient, and betweenness and closeness centrality. Analysis of a subset of nodes or batch analysis is also possible. Interpretation of each parameter mentioned above requires further study. This study focused only on the identification of connected components and the analysis of shortest and longest pathways, as this would provide an insight into the effectiveness of new drug invention related to stroke.
Limitations of study
(a) Only three databases of the available data repositories were used for the selection of core genes.
(b) Out of various interactions among bio-entities only protein interactions were considered in identification of interactors.
(c) In the protein interaction search with STRING, only interactors with a confidence score of more than 0.9 were selected because of time constraints. However, all interactors should be considered without applying limits, as even a slight possibility is important in clinical practice.
(d) Limits were applied in manual literature mining because of time constraints.
(e) Disease association of the interactors was checked against the HuGE Navigator disease list, which contains details on only 10058 genes, representing only one third of the total number of human genes.
(f) Negative result following filtering may not be truly negative, if the particular gene has not been studied for stroke.
(g) There is no standardization on the names of the diseases mentioned in the disease list of the HuGE navigator knowledgebase.
A disease associated gene interaction network was built with 15 core genes. This system would provide a foundation for a network based analysis platform for stroke, once finalized with all core genes identified.
There were 65 gene interactions associated with stroke in the first set of 15 core genes, showing an average of 4.4 disease-associated interactors per core gene, with 6 connected components. A conclusion can be derived only on completion of the study with the other gene samples, as adequate data is not yet available. This opens an area for researchers to carry out further studies on gene interactions associated with stroke.
After analysis of the data, the following recommendations are made:
(a) The recommended methodology for this type of study is the use of both manual and automated literature mining methods.
(b) Several disease-associated gene interactions obtained theoretically have not been studied in genome-wide association studies. Researchers should pay attention to these in order to detect the prevalence of the above interactions in humans.