# The Random Forests Gene Expression Data Generation Biology Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Ecoli global interaction network used as a template for extracting realistic gene network structures. Ecoli transcriptional regulatory network contain 1502 nodes and there are 3587 interaction between those nodes. There are two steps in the gene expression data generation.

Extract realistic gene network structures from the Global interaction networks

Generate realistic gene expression data from these extracted sub networks

## 1.1 Sub network extraction

Scale-free random graph models and manual design of small benchmark networks are the most dominant ways of generating synthetic gene network. But these methods are not able to capture important properties such as modularity or the occurrence of network motifs [1]. Module extraction is used by the GeneNetWeaver (GNW) to extract sub networks from the large interaction networks. Module extraction starts from a seed node that is selected randomly among the nodes of the source network. Then nodes are adding to the subnetwork based on modularity score. Modularity score is defined as below,

(1)

Where m denotes the total number of edges in the network, s is the index vector defining the subnetwork ( = 1 if node i is part of the subnetwork, = âˆ’1 if not), and B denotes modularity matrix with elements = âˆ’. is the actual number of edges between node i and j and = /2m is the expected number of edges in a randomized network ( being the degree of node i). Subnetworks extracted by module extraction have a biologically plausible connectivity because they preserve functional and structural properties of gene networks such as degree distribution and network motifs [1]. Figure 1 shows a GRN obtained using the modulo extraction method. Detailed module extraction method was given below,

Module Extraction Method

Select the seed node (randomly or user defined)

while (current subnetwork size < desired size)

Select all neighbors of the current subnetwork

Compute for each neighbor the modularity Q of the subnetwork after adding only this neighbor to the subnetwork

If (Neighbor selection == Greedy)

Select the neighbor that provide highest modularity Q and add it to the current subnetwork

## end if

If (Neighbor selection == Random among top (p%))

One of the top p% neighbors with the highest modularity is chosen randomly

## end if

## end while

Ecoli_20G.jpg

Figure 1. GRN which having 20 nodes obtained using the modulo extraction

## 1.2 Generation of gene expression time-series data

Let vector represent expression of a gene network having M number of genes at time t. According to multivariate vector autoregressive (MVAR) model [9], gene expression values of the all genes at time t can obtain by,

(2)

Where denotes the connectivity matrix of the gene network after multiply by the random weights and denotes the added Gaussian random noise to the gene expression at time t. Random weights were obtained by drawing samples from a uniform distribution on the interval [0.5, 1] . The initial gene expression vector was constructed by using the sample drawn by the uniform distribution on the interval [0, 1]. Algorithm runs fixed number of iteration (10,000) to gain the stability. Then required number of time points was extracted. Detailed algorithm was given below,

## Algorithm 1: Sampling gene expression data

Input: Network structure - S, number of time points - T

M = size(S)

weights = 0.5 + 0.5*rand(M)

for t = 2 to

## end for

Return: dataset

## Data discretization

All gene expression values were represented as continuous data in generated data set. Therefore these data were converted to the discrete values before using them in the learning algorithm. We assume that genes can have three states {-1, 0, +1} representing down-, no- and up-regulation of genes.

## Equal width interval binning

Range of the Continuous gene expression data is dividing in to a predetermined number of intervals in the Equal width interval binning. This method is very popular because its simplicity. Data range was divided in to three intervals because gene can have three states. Discretization steps are given below.

Sort the continues expression values and find the maximum value () and minimum value ()

compute the width of the interval () as follows,

Then state (down, no and up regulation) of the gene was determined by as follows,

State =

## Initialization of the structure using mutual information

The starting structure is initialized by using mutual information among the genes because it provides a measure about dependence between two genes [2, 3]. According to the information theory, entropy is a measure of the uncertainty present in the data [1]. Entropy of the ith gene can compute from the time series gene expression data as below,

(3)

Here T is the total number of time points and the k is the number of states that gene can have. Mutual information (MI) between gene i and j is given by,

(4)

Here first two terms are individual entropies of gene i and j. last term is the joint entropy. Joint entropy can obtain using following formula.

(5)

After calculate mutual information for all gene pairs, connectivity matrix (C) of the structure can obtain as follows,

## Experiments Results of DBN

Gene expression values for each GRN sub network were obtained by changing the number of time points (10, 50, 100 and 200 time points) using above algorithm 1. Each data set consists with 20 time series. Structure learning was done using MCMC simulation. Performance measure parameters such as precision, recall, accuracy and F-measure were calculated by using true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Table 1 show the performance of MCMC inference procedure. The heatmap of figure 2 shows the accuracy (F-measure) of the inferring network structure using DBN. Following equations were used to calculate the performance measures.

Precision = TP/ (TP+FP)

Recall = TP/ (TP+FN)

Accuracy = (TP+TN)/ (TP+TN+FP+FN)

F-measure = 2*

Table 1 - Performance measures for DBN

#Genes

#Samples

Precision

Recall

Accuracy

F-measure

10

10

50

100

200

0.190.11

0.380.11

0.710.12

0.830.10

0.170.09

0.450.09

0.530.09

0.620.06

0.790.03

0.830.03

0.910.02

0.940.01

0.170.09

0.410.08

0.600.08

0.710.06

20

10

50

100

200

0.140.10

0.250.05

0.460.10

0.680.08

0.080.06

0.370.06

0.480.05

0.550.05

0.910.01

0.880.02

0.920.01

0.950.01

0.100.08

0.300.06

0.470.06

0.600.05

30

10

50

100

200

0.120.08

0.190.04

0.370.04

0.580.06

0.050.03

0.320.05

0.480.05

0.540.04

0.940.01

0.910.01

0.940.01

0.960.00

0.070.05

0.240.04

0.410.04

0.560.04

40

10

50

100

200

0.120.08

0.180.04

0.310.04

0.500.05

0.040.02

0.280.04

0.410.03

0.470.03

0.950.00

0.920.01

0.940.01

0.960.00

0.060.04

0.220.03

0.350.03

0.480.03

50

10

50

100

200

0.070.06

0.160.03

0.270.02

0.450.03

0.020.02

0.220.04

0.300.02

0.350.02

0.950.00

0.920.00

0.940.00

0.960.00

0.030.02

0.190.03

0.280.02

0.390.02

100

10

50

100

200

0.040.03

0.070.01

0.110.01

0.200.01

0.010.01

0.100.01

0.150.01

0.190.01

0.970.00

0.950.00

0.960.00

0.970.00

0.010.01

0.080.01

0.130.01

0.200.02

DBN_map.jpg

Figure 2 . Illustration of the effect of number of time points on accuracy (F-measure) of DBN method of inferring GRN structure of varying number of genes

## Tree based methods

Tree based methods and algorithms are very popular in data mining and machine learning. Decision tree is a tree like graph structure in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Main goal of the decision tree learning is to develop a model that predicts the value of a target variable based on several predictor variables. Because of following reasons trees are used in large range of problems.

Trees can capture complex interactions between target and predictor variables

Trees able to handle both continuous and categorical predictor variables and the missing values in the predictor variables

Trees are robust to outliers in the predictor variables

## 5.1 Classification and Regression Trees

Tree-based methods for solve classification and regression problems were introduced by Breiman in 1984. If target variable is continuous regression trees are used, while classification trees are used for a categorical target variable. Classification and regression trees are develop tree structured models to solve the supervised learning problems [4].

Root node is the starting position of the tree building process. It contains all the observations. Then observations in this node are sent to one of two descendant nodes, one left and one right. This is done using a split on single input variable. If input variables are continuous, observations for which the input is smaller than the split-point go to the left, the rest go to the right. For a categorical input variable, a split sends a subset of categories to the left and the rest to the right. Relevant split is selected by considering every possible split on every input variable. It helps to reduce the uncertainty about the output variable in the resulting subsets of samples.

After completion of tree building process, outcome tree is the best classifier on the training data but possibly not on new and unseen data. This happen because of a fully grown tree typically overfits the training data. Therefore the tree may not generalize very well. Overfitting (overtraining) can be avoided by performing a pruning of the tree. Goal of the pruning process is prevent overfitting to noise in the data. There are two strategies for "pruning" the decision tree.

Postpruning - take a fully-grown decision tree and discard unreliable parts

Prepruning - stop growing a branch when information becomes unreliable

More accurate results can be obtained by combining a variety of suitably chosen trees, leading to methods known as tree-based ensembles. Also ensemble methods have high resistance to overfitting and it generalizes well to new data. These methods are described in the next section.

## Tree-based Ensembles

Tree-based ensembles combine the predictions of many different trees to give an aggregated prediction. Mainly two aggregated predictions are used to solve the above mentioned problems by the tree based ensembles. Regression problem solve by taking average of the predictions from the individual trees and classification problem solve by taking majority vote of the trees. Prediction accuracy of single trees are improved by ensemble methods. Different trees are used to build ensemble and they are obtained by injecting randomness to the different level of the tree building process. The tree-based ensemble methods are differing from each other based on randomization techniques that used to generate diversity among the different trees. These methods are Bagged Trees (5), Extra-Trees (6) and Random Forests (7).

## 6.1 Bagged Trees

Bagging stands for bootstrap aggregation. In this algorithm each tree in the ensemble is build using bootstrap samples from the original data. In bagging trees bootstrapping is performed by obtaining samples (same amount as original data) randomly with replacement from original data. Because of replacement some of the original inputs will not appear in the bootstrap sample and some will appear more than one time. Finally Predictions are combined by averaging for regression and voting for classification.

## 6.2 Random Forests

In random forests ensemble first m numbers of predictors are randomly selected for each node from a bootstrap sample of the original learning sample. These m predictors are selected independently without replacement. Then the best split for the selected predictors is used to split the node. This process is continuing for build all trees in the ensemble. Random forests algorithm introduce an extra level of randomization compared to the Bagged Trees.

The value m is a tuning parameter of this random forests algorithm. This parameter thus determines the level of randomization of the trees. A higher value of m results in less randomized trees. The optimal value of m is problem-dependent. Another important parameter is the number of trees in an ensemble. Generalization error is reduced when there is higher number of trees in the ensemble [7].

## 6.3 Extra-Trees

Extra-trees stand for extremely randomized trees. In this algorithm each tree is built from whole learning sample rather than using bootstrap sample. At each node, the best split is determined among N random splits, each determined by randomly selecting one input variable (without replacement) and a cut-point.

## Inferring Regulatory Networks Using Tree-based Ensemble methods

GENIE3 algorithm decomposes the prediction of a regulatory network between q genes into q different sub problems and each of these sub problems is a supervised regression problem [8]. Then these different regression problems are solved by using tree-based ensemble methods.

Initial GENIE3 algorithm only supports for steady state gene expression data. Therefore initial GENIE3 algorithm is extended to support for time series data as follows. Suppose we draw time series samples of gene expression from a given GRN topology. Let , is a vector containing the expression values of q genes at the tth time point.

T (6)

T is the total number of time points at which the expressions of genes are gathered. So we can define learning sample using gene expression values measured at T time points as follows,

(7)

We assume that the expression of each gene of the network at a time point t + 1 is a function of the expression of the other genes of the network at time point t (previous time point). Let is a vector containing the expression values at time point t of all the genes except gene j. so we can write,

(8)

Where is a random noise. Also we assume that only exploits the expression in of the genes that directly regulate gene j in the underlying network. As we discuss earlier we can apply tree based ensemble algorithm to the following leaning sample in order to learn the function

(9)

In this learning sample, the vector of expression values of the target gene j is thus shifted by one time point with respect to the vectors of the input genes. Each sub problem that is defined by a learning sample is a supervised regression problem. Therefore to solve the problem we have to find function that minimize the square error loss given below.

(10)

Tree based ensemble methods such as regression trees solve this problem by building tree structures. Network inference algorithm which is based on random forests exploits above learning sample to obtain the weights to regulatory links from any genes i to any genes j.

## Experiments Results with Random Forests (GENIE3 algorithm) method

GRN inference performs with Random forests and each ensemble consists with 1000 trees. Most important parameter of this method is number of predictors selected randomly to find the best split in each node. This parameter was set to , where is the number of inputs. GENIE3 algorithm provides a ranking of the regulatory links from the most confident to the less confident. So there are two methods for compare the performance of two GRN inference algorithms (DBN and Random forests).

Plot the Receiver Operator Characteristic (ROC) curves and compare two algorithm using area under the ROC curve (AUROC).

Since inferring the network structure is a problem of deciding whether each edge is present or not. Receiver Operator Characteristic (ROC) curve can be used to compare the performance of RF inference procedure and the DBN since final results are independent from the specific threshold. ROC curve plots the true positive rate (sensitivity) versus false positive rate (1-specificity) for different threshold values. Table 2 shows the means values of area under the ROC curve (AUROC) for DBN and RF inference.

Table 2 - Performance comparison between DBN and RF using AUROC values

#Genes

#Samples

AUROC-DBN

AUROC - RF

10

10

50

100

200

0.540.08

0.740.05

0.870.05

0.950.04

0.740.07

0.970.02

0.980.01

0.990.01

20

10

50

100

200

0.550.06

0.760.05

0.900.02

0.960.02

0.740.06

0.960.01

0.970.05

0.980.01

30

10

50

100

200

0.570.05

0.770.03

0.900.02

0.960.01

0.780.04

0.950.01

0.960.01

0.980.03

40

10

50

100

200

0.590.04

0.770.04

0.880.02

0.940.02

0.770.02

0.970.01

0.980.05

0.990.07

50

10

50

100

200

0.550.05

0.710.03

0.830.03

0.910.02

0.720.03

0.950.01

0.970.06

0.990.09

100

10

50

100

200

0.540.04

0.620.03

0.690.03

0.720.03

0.650.01

0.900.02

0.920.02

0.940.05

Plot the Receiver Operator Characteristic (ROC) curves and find a threshold using optimal operating point of the ROC curves

ROC curve plots the true positive rate (sensitivity) versus false positive rate (1-specificity) for different threshold values. Out of all points on the curve point which is close to north-west corner (0,1) of the ROC plot is the optimal operating point. Optimal operating point is calculated for all networks separately and then average of the all of these points taken as the threshold. Since we have a threshold we can obtain network structure for random forests method by compare weights of the regulatory links with the threshold. Then these networks compare with gold network to derive the performance parameters like precision, recall, F-measureâ€¦etc. Table 3 shows the performance of Random forests inference procedure.

Table 3 - Performance measures for Random-forests (GENIE3)

#Genes

#Samples

Precision

Recall

Accuracy

F-measure

10

10

50

100

200

0.190.02

0.180.01

0.170.01

0.180.01

0.870.10

1.000.00

1.000.02

1.000.00

0.490.04

0.400.05

0.370.04

0.390.04

0.310.03

0.300.02

0.290.01

0.300.01

20

10

50

100

200

0.170.02

0.390.03

0.510.05

0.660.05

0.590.08

0.970.03

0.980.02

1.000.01

0.770.02

0.880.02

0.930.01

0.970.01

0.260.04

0.300.06

0.670.04

0.800.03

30

10

50

100

200

0.190.04

0.540.05

0.710.06

0.770.08

0.490.08

0.900.04

0.960.02

0.980.02

0.880.01

0.960.01

0.980.01

0.990.01

0.280.05

0.670.04

0.810.04

0.860.05

40

10

50

100

200

0.200.03

0.640.06

0.750.06

0.770.06

0.360.06

0.830.03

0.920.02

0.950.02

0.920.01

0.970.01

0.980.00

0.990.00

0.260.04

0.720.04

0.830.04

0.850.04

50

10

50

100

200

0.200.05

0.750.05

0.860.04

0.890.04

0.210.05

0.660.04

0.800.03

0.870.03

0.940.01

0.980.00

0.990.00

0.990.00

0.200.05

0.700.04

0.830.03

0.880.02

100

10

50

100

200

0.160.05

0.780.08

0.820.05

0.810.06

0.040.02

0.430.04

0.600.04

0.810.03

0.980.00

0.990.00

0.990.00

0.990.00

0.060.02

0.560.05

0.690.04

0.810.04

RF_fvalues.jpg

Figure 3 . Illustration of the effect of number of time points on accuracy (F-measure) of random forests method of inferring GRN structure of varying number of genes

## Stability of structure

Distance based stability criterion is used for evaluating GRN building algorithms. First GRN is built by DBN and tree based ensemble (random forests) using time-series samples of gene expressions from a given network topology. Then hamming distance is used to measure the distance over all the regulatory connections. Let be a set of M samples of gene expression data sets and be the GRN build from m-th dataset . The network structure is represented by a connectivity matrix where represents the presence of an edge between genes i and j, and otherwise 0. Consider two GRNs with connectivity matrices and and the similarity of the two networks is obtained by the average Hamming distances over all regulatory connections:

(11)

Where denotes the Hamming distance between the two network structures and denotes the total number of connections in the both network. The stability is in the range of [0, 1] where a lower value indicates a higher stability of GRN inference algorithm. The Hamming distance takes into account both the presence and the absence of regulatory connections between two networks. Then overall stability of GRN structure is obtained by averaging over M number of structures using the stability given by (11). (12) shows the formula for overall stability. Table 4 shows the stability of the network structure of the both inference algorithm.

(12)

Table 4 - Stability of structures

#Genes

#Samples

Stability of structure

DBN

Random forests

10

10

50

100

200

0.17

0.27

0.50

0.67

0.68

0.85

0.87

0.90

20

10

50

100

200

0.06

0.19

0.30

0.48

0.29

0.46

0.56

0.70

30

10

50

100

200

0.04

0.14

0.25

0.41

0.19

0.54

0.73

0.82

40

10

50

100

200

0.02

0.13

0.23

0.41

0.13

0.58

0.75

0.81

50

10

50

100

200

0.02

0.10

0.16

0.27

0.08

0.56

0.73

0.81

100

10

50

100

200

0.01

0.05

0.08

0.14

0.01

0.49

0.69

0.87

DBN_stability.jpgRF_stability.jpg

(b)

Figure 4 stability of networks. Illustration of effects of number of time points on the stability of estimating the structure with varying number of genes. Figures (a) and (b) denote stability of DBN and RF respectively

## Stability of Edges

Other than the stability of the GRN structure, stability of an independent edge also important. Therefore stability of an independent connection is defined as the average of bootstrap samples:

(13)

edge_20G50T.jpgedge_20G10T.jpg

(b)

edge_20G200T.jpg

edge_20G100T.jpg

(c) (d)

Figure 5 Distribution of stability of edges with increase in time points on 20 gene network. Figure (a),(b),(c) and (d) show distribution with 10,50,100 and 200 time points respectively.