# Comparison Of Methods For Large Network Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The post genomic era is particularly shifting from annotating individual genes and proteins to understand the DNA-protein interaction, protein-protein interaction occur inside the complex cell to elucidate regulatory signalling and metabolic pathways[1]. The "reductionist" approach for modelling molecular biology has been pursued in the past two decades to characterise and identify regulatory pathways leading to components, starting with gene or protein of interest and trying to elucidate genes or proteins involved in the same pathway. However, there has been many difficulties even though the available data was used to assemble in pathways. For instance, the modelling approaches does not decipher influences of multiple pathways and their cross talks[1].

Systems biology aims to understand on a whole than in parts of the physiology of living system. The high throughput technologies has enabled researchers to consider system wide approaches to understand and analyse biological entities like DNA, RNA, protein and metabolite on a global level. In order to understand the complex behaviour of the biological system , it is imperative to define the biological entities and their interactions in the model[2].

Elucidating the dynamics and deciphering the gene regulatory network and their cross talks is paramount in systems biology context. The "central dogma" the initial step is transcription which is mediated by the transcription factor (TF) activity and co-factor involved. The latter translation is mediated by post translation modification of protein to form the final gene product. The aim of gene regulatory network (GRN) to connect biological entities called nodes (genes, proteins and metabolites) with edges (DNA-protein, protein-protein interactions). However, GRN is mainly use RNA transcript level data for the construction of network where the nodes represent genes and edges represent the direct or indirect relationship between genes as shown in the schematics (Fig 1).

Fig 1. The above schematics represent the central dogma and generic gene regulatory network. Gene A depicts auto-crine effect as it regulates its own gene transcription and indirectly it influences the transcription of gene B and gene C. where gene B has effect on transcription of gene C [2].

However, integrating the transcriptomics data with other biological omics data like proteomics and metabolomics can improve quality of predictions drastically. The reconstruction of gene regulatory network based on gene expression data is known as network inferences or reverse engineering. However, it is difficult task to reverse engineer as it is complicated to determine the correct combinatorial components of regulators and data might not be accurate. This in effect it promotes the integration of biological measurements for gene expression, protein expression and metabolites and prior biological knowledge from the published journals in the modelling techniques. However, this is quite challenging to amalgamate this heterogeneous data data to infer GRN and to obtain detail insight of mechanism of action.

There are different modelling approaches have previously been used previously to infer gene regulatory network [3]. The predominant ones includes the ordinary differential equations(ODE) based modelling approaches like the traditional coexpression based correlation network inference[4], the the strategies for reverse engineering includes 1) models describing physical interaction between DNA-TF and 2) models describing no physical effects but influence based models[5], CORE-NET [6], Hidden variable dynamic modelling (HVDM) to predict target based subnetworks [7]. The probabilistic based stochastic modelling technique like Bayesian networks also widely considered for network inference[4], mutual information inference methods [8] , Regression based methods like 1) sparse linear regression and 2) data re-sampling approaches .

Fig.2. Illustrates the flow process for gene regulatory network using reverse engineering approach [5].

Dialogue on Reverse Engineering Assessment and Methods (DREAM) project employs different network inferences algorithms to reconstruct community network (http://www.the-dream-project.org/). This approach has been successful to decipher many biological interactions [3]. The keynote of the DREAM project is that collective knowledge from multiple inference methods is greater than the individual method. On applying same expression data on different methods generate different predictions. Considering the highlights and lowlights of each inference methods is critical to address its application on microarray data of different species.

In this project we aim to focus on different perceptive of deterministic and stochastic modelling approaches for inferencing biological networks in order to gain and understand principle component view of gene regulatory network and employ time course gene expression data to elucidate complex phenomena of biological systems for the reconstruction of dynamic network models.

## Methods

## Biological data

The sequence of genome data is supportive for the reconstruction of GRN as the transcription process as it is primary control mechanism. The presence of Transcription factor binding sites(TFBS) in a promoter regions of gene sequence determines the possibility of a particular gene could be expressed or not. The very numerous insilico methods have been developed to identify sequence motifs and TFBS[9, 10]. The database such as JASPER and TRANSFAC provide great details for TF and TFBS.

The time at which the measure of transcript, protein and metabolite levels reflects the current state or perhaps the activity of the biological system. The gene expression microarray data has been widely used for GRN construction. Essentially, two types of microarray data are used 1) single channel 2) two channel microarrays. Gene Expression Omnibus and ArrayExpress database contains huge amount of publicly available DNA microarray data.

Likewise, the proteomics data provide information whether gene has been expressed in a cell. Proteins are the end product of mRNA. Interaction of genes to proteins suggest that gene has been expressed to protein level. Furthermore, proteins(enzymes) catalyse the metabolic reaction synthesising metabolites ie small molecules. Therefore, connecting genes, proteins and metabolome remains a challenging task for the future which forms the interactome.

There are text mining tools developed to extract information on the interconnection between genes and proteins from the published literature (for instance PathwayStudio).

Gene functional annotations also provides imperative platform to study genes and their interactions. Gene ontology(GO) provides attributes and description for genes and genes product. In order to understand biological biological process of any subnetwork or cluster of GRN, GO enrichment analysis provides a powerful approach to unravel genes functions which have not been annotated.

## Data requirement and preprocessing

Considering gene expression data of environmental perturbations (eg. chemical stress, heat shock) and sampled biological entities over time provides an understanding of dynamics of biological system. Likewise, genetic manipulation expression data (eg. gene knockout, knockdown or overexpression) provides good understanding for the down stream effects on the pathway of particular gene in a static system. Knockout experiments have been efficient to derive data using network inference based approaches [11].

The amount of data required differs from different modelling algorithms in order to predict target nodes. Therefore it becomes difficult to specify how many experiments or perhaps replicates are required to reconstruct the GRN. However some inferences algorithms do mention the requirement of data. For instance, HVDM requires time course raw expression data(MAS5 summarised) with at least five time points and prior knowledge of targets genes of a particular TF to train the model [12].

Furthermore, CORE-NET data requirement should satisfy this equation m*h>m+n(where m- number of time series, n - number of genes and h- data points of each time series)[6].

Pre-processing of gene expression data is critical to reconstruct GRN as it influences the performances and results of inferences algorithms. In large scale data, the variability such as systemic noise and bias effects the predictions drastically. Therefore, normalization of data becomes essential before applying any of the inferences algorithms.

## Network inferences method

Choosing appropriate network model architecture is essential for inferring the GRN. The architecture applies to the mathematical functions used in the model to describe the activity of the regulatory components. After defining the model architecture, the structure of network (eg. components interaction) and parameters of the model (eg. interaction strength/type) should be learnt leant from the gene expression data provided and including prior biological knowledge. The architecture of the model can by distinguished by 1) Activity level representation of network component, 2) The model type (for instance, deterministic, stochastic, dynamic or static) and 3) The relationship types between variables (directed or undirected, linear or non linear function). We focus on directed networks on this project.

## Information theory models.

One of the simplest and traditional modelling approach is correlation based network[13]. Where the edges are weighted and represented by correlation coefficients in an undirected way. Resulting, the two predicted genes are connected only if the correlation coefficient is above the defined threshold. The threshold set determines the network being sparse or dense.

(equation 1)

Equation 1 represents pearson centred correlation coefficient [14].where dfg represents the gene expression distance between the gene f and g. egc and efc represents expression of gene g and f under condition c respectively.

There is R package for weighted correlation network analysis called WGCNA[15] which can used to find out cluster modules for highly correlated genes. Also, graphical user interface java program to construct correlation network like BioLayout Express (3D)[16] which can also employed for clustering of correlated network using Markov chain algorithm to determine cluster modules.

## .

In addition to correlation coefficient, the mutual information based on the information theoretic scores and euclidean distance were applied to the predict the regulatory networks[8]. Like correlation coefficient, mutual information determines the degree of statistical interconnection between two random variables. The mutual information derived statistic scores by weight are applied by network inference algorithms like ARACNE (Algorithm for the Reverse engineering of Accurate Cellular Networks)[17], RELNET (RELevance NETworks)[18] and context likelihood relatedness algorithm[19].

The mutual information(MI) is calculated between each pair of genes based on the entropy of a single gene expression A and gene expression B and difference of collective(A and B) entropy in equation 2.

(equation 2)

(equation 3)

H(A) is the entropy calculation of gene expression A which is measure of information content in the pattern. The entropy is calculated into discrete probability and for that histogram technique is applied. Range of each gene value calculated later is divided to n sub-ranges. p(xi) in equation 3 represents the measurement proportion in sub-range xi. The probability density function is calculated in the histogram as n tends to infinity.

## Differential equation models

Applying gene expression data in differential equations models describes the changes in the pattern of expression with respect to time taking into account environmental conditions and expression of other genes. Therefore, modelling GRN using this approach determines the dynamics behaviour of network in a quantitative manner.

The generic ordinary differential equation (ODEs) model used in the study of dynamics of expression data is represented in equation 4.

(equation 4)

x(t) = (x1(t),x2(t),x3(t),x4(t)....xn(t)) represents the gene expression values for genes 1,2,3...n and at time t, f represents the function of rate of change of state variable xi correspondence to parameter set p, and u represents the external signal or environmental perturbations. Determining the parameters p for the function f from measured signals u, x and t using some optimisation algorthms leads to inferring the network. Typically, the gene regulatory process is characterised by non-linear dynamics. However, most of the inference modelling approaches prefer linear ODE models.

ODE models are deterministic unlike information theory models and Bayesian network models which applies conditional probabilistic approaches. ODE based approaches results in directed networks and used to study both time series expression data and steady sate.

Recently number of ODE based methods have been developed to infer networks using gene expression data.

## ODE methods based on Linear Systems

## Hidden variable dynamic modelling(HVDM)

HVDM is a mathematical modelling technique called Hidden Variable Dynamic Modelling (HVDM) [7], which uses time series microarray data from which it derives unknown transcription factor activity and constructs a list of ranked putative gene targets.

(equation 5)

In equation 5, Bj is the constant basal transcription rate of gene xj; Sj f(t) is the transcription induced by given transcription factor, composed of a constant Sj, which is the sensitivity of gene j to Transcription factor, and f(t), which is the activity of Transcription factor at time t; and Djxj (t) is a degradation term, with Dj being a constant degradation rate.

(equation 6)

Xj in equation 6 represents discrete measurement of gene J at time t0 ...tn. Correspondingly f(t) for these respective times is below is equation 7.

(equation 7)

The differential operator on the left hand side of equation 5 is replaced by A, which is square matrix where number of time points n signifies it dimensions.

In the matrix the entries are calculated using polynomial interpolation. Equation 8 represents an approximate version of equation 5. The formulation solution of Xj is calculated using LU substitution represented in equation 9.

(equation 8)

(equation 9)

I represent identity matrix and 1 is a vector whose each single entry is 1.

The model expression values ie Xj are closest to observed expression values ie Xj . The model score M(p) in equation 10 represents closeness of fit for the set of parameters p and Ïƒi,j determines standard deviation of the measurement error.

(equation 10)

In order to use rHVDM package in R environment, the gene expression time course data must have following requirements which are imperative to execute the algorithm;

The gene expression data should consist of at least five time points.

At least three genes should be known to be targets of that transcription factor.

Measurement error is taken into account.

This package uses Bioconductor[20] built-in structures for expression data and depends on two other R packages: R2HTML for the output interface and minpack.lm uses optimization algorithm Levenberg-Marquardt (LM) gradient for model fitting. The useful product of LM algorithm is Hessian matrix H that equal to the partial second derivative for model score function for the corresponding elements of parameters P. Inverting the hessian matrix H can determines the covariance matrix for the parameters which are fitted.

S.cerevisiae expression data was tested with rHVDM package in R environment to get the formatting of arguments passed to algorithm and determine the activity parameter, the hidden variable can then be used to identify other targets of the same factor ranked in order of likelihood.

In recent years, other linear ODE models have been developed which address GRN issues that uses steady-state expression data by microarray network identification(MNI) [21]â and Network identification by multiple regression(NIR). Time series network identification (TSNI) uses time series expression data [22]â .

## ODE based on Least Square Optimisation Method

## Microarray network identification(MNI)

MNI algorithm uses steady state data to compute for the edges aij based on equation 11 [21]â .

(equation 11)

Where, i=1..N, N represents number of genes, u denotes external perturbations. Xj gene transcript concentration , aij signifies the influence of gene j on gene I and biu signifies the direct perturbation of genes in each experimental perturbations.

MNI is different from other inferences algorithm as it aim is to determine direct targets and pathways involved with compound used in the experiment. Initially, the parameters aij is obtained from the expression data and later the measurement of gene expression after the treatment of compound. The computation to determine the values of biu for each gene using equation 11. List of ranked genes is the output of the model. The lower ranks are putative targets of compound or external perturbations.

## Time series network identification(TSI)

TSI determines the direct targets of the external perturbations bi and also gene network aij.[22]â . The principle behind working of TSI is equation 12 and temporal gene expression data is applied. The assumption made in the model is that it assumes single perturbation is performed (experiment with a genetic perturbation or compound).After the perturbations M time points are measured. In particular for large networks TSI has been useful to predict direct targets of perturbations.

## ODE based on Convex Optimisation Method

## Regularised Least Square (RLS) and Instrumental variables(IV)

The problems and detrimental effects of noise in the measurement data give rise to significant research questions while inferring large scale networks. RLS-IV is a new inferencing algorithm which is being developed in house which is originally built on the previous published network inference algorithm called CORE-NET based on convex optimisation method [6]â . This algorithms aims to address the noise in the measurement data issue by combining Regularised Least Square (RLS) method with Instrumental Variable (IV) technique. The aim of RLS technique is to reduce the complexity of model through restriction of degrees of freedom. The aim of the IV method is to avoid introducing non-consistency and noise due to measurement noise during parameter estimation during the formalisation of standard Least Squares(LS). One of the process of validating this model is by generating in-silico networks by GeneNetWeaver [23]â and data sets which is tested on RLS-IV.

This approach uses linear ODE model in the continuous time scale represented in equation 12.

(equation 12)

where x(t)âˆˆRn xi signifies state variable for biological entities, here it represents mRNA concentrations of a gene at time t, m denotes number of time points and n signifies number of genes/nodes. AâˆˆRn*n is the state transition matrix ( the partial derivative of f(x). BâˆˆRn*m is matrix signifies direct targets of perturbations u(t) = (u1(t),u2(t),u3(t),u4(t)....un(t)) âˆˆRm.

The model equation for RLS-IV inference method is represented by equation 13 . The connectivity matrix A and external perturbations matrix B is similar to equation 11. Assumption is made for experimental observations (h+1), where x(j)(k)âˆˆR n , k = 0,1,2,3...h determines each exogenous perturbations u(j)(k), j = 0,1,2,3..m. Solution to above problem for discrete time domain can be represented as below where B*,jâˆˆR n.

(equation 13)

CORE-NET and RLS-IV employs temporal expression data to infer networks. However, steady state data produced by genetic perturbations is also used in convex optimisation approach to infer stable GRN [24]â .

## Bayesian network

The Baye's rule is applied to understand the stochastic nature of a gene regulation. The network constructed is Bayesian network (BN). In this approach gene expression values is considered as random variables. Association of each random variable through vertices or linked to other variables(genes/nodes) through directed acyclic graph (DAG) which reflects the structure dependency. The local probability distribution is described for each node and for all other nodes is by joint probability distribution p(x) in equation 14.

(equation 14)

xpa(v) denotes parent state vector which is activity measure of regulatory gene and x(v) is current state vector. Î¸v represents local distribution parameters. vâˆˆV denotes each component(genes) as a random variable xv . This approach reconstructs regulatory network by computing as product of conditional probabilities by applying Baye's theorem P(A,B)=P(A|B)*P(B)=P(B|A)*P(A). The advantage of this inference method is that different datasets can be combined and also its take into account prior knowledge for reconstruction of GRN.Other advantage include avoid over-fitting of the training data to the model and incorporate hidden variables (for eg, TF activity) and noisy measurement data as well.

Generic step wise process of reconstruction of BN:

Selection of model: Defining of DAG as entities for the relationship of graphs.

Fitting of parameters: Provided gene expression data and a graph, to determining for each node, the best conditional probability (CP).

Rating of fitness: Scores for each model candidates. Better the network model when the score is high that fits the data. Model with highest score signifies the results of GRN inference.

The critical step is "selection of model". Therefore heuristics are required to learn BN efficiently. The learning of BN can be done though discrete(ie Boolean) and continuous gene expression values. Therefore, the probabilistic model can be gaussian or multinomial distribution. The static and time series expression data can be employed to reconstruct static and dynamic BN (DBN) respectively. The DBN per se cannot infer the feedback loops. However, there is separation of output nodes from input nodes leads makes pavement to infer feedback loops as represented in the figure 3.

Fig 3. Illustrates static and dynamic bayesian network. Feedback loop in static BN is not allowed but can be represented in Dynamic BN [2]â .

For genetic network inferences BN is widely used. Furthermore, BANJO (BAysian Network with Java Objects) is an BN algorithm to infer static and dynamic inferences for GRN[25]â . BANJO is a heuristic approach to used to search network space to construct graphs. For steady-state data BANJO is not suitable for GRN as its ability to infer feed forwards or feeb back loops is lacking.

The linear modelling technique called Variational Bayesian State-Space Models (VBSSM) which implements Bayesian approach to study state-space models to determine the hidden state though automated methods[26]â .

## Clustering Algorithms

Clustering algorithms are also widely used to GRN. The assumption is that highly co-expressed genes are regulated by common TF. Also employed in model techniques to reduce the model variables, computational load and biological entities by determining the differential expressed genes or clustering the co-expressed genes. Clustering per se is not a network inference algorithm but is used to visualise and cluster genes behaving similar across different perturbations or time points. The distance metric is used to measure the similarity of genes for instance, correlation coefficient for pair of genes. The idea behind clustering is that the genes in the cluster (ie genes in the same cluster) have high probability of being co-regulated by same TF or belong to the same pathway or having functional or biological process [27]â . Enrichment for GO ontologies and promoter enrichment provides a powerful approach to identify biological processes and hidden regulatory variables of a unannotated genes for individual clusters.

## Hierarchial clustering

This is the most common clustering algorithm employed to determine similarity in expression profiles. The relationship between genes is characterised by a tree where the degree of similarity reflected by its branch length [14]â . The metric is calculated between genes is using centroid linkage uncentered correlation distance.

## Markov chain clustering

This clustering algorithm is a powerful approach to divide networks into sub-networks non-subjectively based on sub-networks sharing similarity in their expression profiles. Commonly used to cluster large graphs due to its robust nature[16]â .

## Comparative analysis of method for target based approach

Inferring networks using different algorithms provides a principle component view to elucidate target based predictions for a particular gene expression data.

Fig 4. Illustrates an overview of applying different network inferences algorithms to a particular gene expression datasets. The objective is to infer interactions between three genes in expression data with different experimental conditions. The expression data is pro-processed before its usage and prior biological information is incorporated in the model [2]â .

Comparing different networks inference methods to pull out common gene targets provides a powerful approach to understand the regulatory mechanism of genes and interactions between them as shown in fig 4. Recently published work used integration of 35 different network inferences algorithms in order to predict networks to construct community networks as DREAM challenge [3]â . This provides robust and high performance across diverse datasets.

## Network validation

Network validation effectively involves confirmation of predicted regulatory interaction from different inferences algorithms with available published experimental literatures or perhaps confirming through experimental procedures. It becomes imperative to employ scoring methodology to evaluate the inference model with internal validation and external validation.

Internal Validation

To evaluate the robustness of the model, the internal validation is employed using statistical based approaches such as perturbations, subsampling and bootstrapping. Cross validation is a classic example of subsampling and bootstrapping is essentially works by splitting the data into test datasets and training data. Essentially, for time series data sets bootstrapping and subsampling are not suited as there is splitting of datasets [2]â .

Generating insilico networks from synthetic datasets using GeneNetWeaver [23]â is widely employed across different network inferences algorithm to validate the performance of the model.

External Validation

The external validation checks for the model prediction with the available knowledge from the published literatures, databases or any gold standard network.

Insilico validation using promoter enrichment analysis can be performed for target based approaches to confirm whether a particular transcription factor (TF) binds to the upstream promoter sequence of a target gene. HMMER [9]â is a hidden markov model to evaluate TF binding sequence provided DNA sequence of the target genes and binding motif is known.

Literature databases based validation such as BioGRID [28]â , SGD [29]â , YEASTRACT [30]â and Pubmed provides plethora of information on gene interactions based on experimental data.

Gene ontology analysis through Amigo [31]â . Experimental validation can be performed in laboration with other labs members.

## Discussion and Conclusion

The major challenge in the field of systems biology is to discover the dynamics and structure gene regulatory network. There are many inferences methods available, network types and variational expression data to evaluate the metrics for network inference. Although difference modelling method incorporate different mathematical formalism, ultimately the interpretation of networks is through nodes (biological entities) which are interacting by edges (corresponds to regulatory interaction).

The purpose of employing mathematical models to use as "mechanistic" network and "influence" network. Mechanistic network objective is to determine true molecular interaction which includes interaction of DNA-protein for instance, TF binding to its target genes, also including signalling pathway through protein-protein interaction and protein-ligand interaction. Although the biological entities inside the cell is vast, the network inference algorithm do not address the all possible molecular connections but rather address the interaction at specified experimental conditions or exogenous perturbations. Influence network essentially addresses the global properties of a systemic behaviour. Influence network aims to address the interaction of expression of a gene to group of genes (modules) or vice versa. Most of the reverse-engineering algorithms work on principle of influence network which can incorporate prior biological information.

The experiments by perturbing the environmental conditions such starvation or heat shock alters the cellular behaviour and experiments by genetic perturbations by gene knock-out or gene knock-down provides highly informative data for inferencing networks. Incorporating prior biological knowledge and heterogeneous data are in particular use for linear ODE models and Bayesian network models[2]â .

Integration of prior biological information and heterogeneous biological data (transcriptomics, proteomics and metabolomics) to understand the global cellular behaviour is quite often challenging and will future focus on network inference research. Furthermore, interpretation and validation of network inference model results is proportional to biological knowledge available. Advancing the current network inference methods will be the future in bioinformatics and systems biology to unravel the hidden molecular properties of a cell.