# The Degree Of A Vertex

Closeness centrality is based on the average distance between a given node and the other nodes in the network. Because it is the inverse of that average distance, a larger average distance corresponds to a lower level of centrality. It can be thought of as a measure of the rate at which information spreads from a given node to the other reachable nodes in the network \cite{Newman2010}. This measure has been used by Ortutay and Vihinen \cite{Ortutay2009} to identify primary immunodeficiency-related genes. Closeness can be defined as

\begin{equation}
C_c(n) = \frac{1}{\mathrm{avg}_{m}\, L(n,m)}
\end{equation}

where \emph{L(n,m)} indicates the length of the shortest path between nodes \emph{n} and \emph{m}, and the average is taken over all nodes \emph{m} reachable from \emph{n}. This metric yields a value between 0 and 1.
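The definition above, the reciprocal of the average shortest-path length, can be illustrated with a minimal Python sketch. It assumes an unweighted, undirected network stored as an adjacency dictionary; the function names, the example graph, and the convention of returning 0 for an isolated node are illustrative assumptions rather than part of the original method.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj, n):
    """Reciprocal of the average shortest-path length from n to reachable nodes."""
    dist = shortest_path_lengths(adj, n)
    others = [d for m, d in dist.items() if m != n]
    if not others:
        return 0.0  # isolated node (assumed convention)
    return len(others) / sum(others)


# Hypothetical path graph: a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(closeness(adj, "a"), closeness(adj, "b"))
```

In this toy graph the interior node \emph{b} receives a higher closeness than the end node \emph{a}, because its average distance to the other nodes is smaller.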

Betweenness centrality reflects the fraction of shortest paths that pass through a vertex. Nodes with high betweenness centrality (often called bottlenecks) have been shown to correspond to essential genes in directed networks \cite{Barabasi2011}. The betweenness centrality of a node can be written as

\begin{equation}
C_b(n) = \sum_{s \neq n \neq t} \frac{\sigma_{st}(n)}{\sigma_{st}}
\end{equation}

where \emph{s} and \emph{t} are vertices other than \emph{n}, \emph{$\sigma_{st}$} represents the total number of shortest paths from \emph{s} to \emph{t}, and \emph{$\sigma_{st}(n)$} is the number of those shortest paths that pass through \emph{n}.

The clustering coefficient (CC) of a vertex is the ratio of the number of edges that exist among the neighbors of that node to the maximum number of edges that could exist among them. This is a measure of edge density for the node's neighborhood \cite{Ideker2008,Feldman2008,Li2009}. For an undirected network it can be written as

\begin{equation}
C_n = \frac{2 e_n}{k_n (k_n - 1)}
\end{equation}

where \emph{$k_n$} is the total number of neighbors of node \emph{n} and \emph{$e_n$} is the count of linked pairs of nodes among the neighbors of \emph{n} \cite{Watts1998,Barabasi2004}.
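As a rough illustration, the clustering coefficient of a node in an undirected network can be computed by counting linked pairs among its neighbors. The sketch below assumes the standard undirected form $2e_n / (k_n(k_n-1))$, an adjacency dictionary of sets, and a value of 0 for nodes with fewer than two neighbors; the function name and example graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, n):
    """Fraction of neighbor pairs of n that are themselves linked (undirected)."""
    neighbors = adj[n]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs exist (assumed convention)
    # e_n: number of linked pairs among the neighbors of n
    e = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * e / (k * (k - 1))


# Hypothetical graph: triangle a-b-c plus a pendant node d attached to a
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_coefficient(adj, "a"), clustering_coefficient(adj, "b"))
```

Here node \emph{a} has three neighbors but only one linked pair among them, so its neighborhood is sparser than that of \emph{b}, whose two neighbors are linked.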

The stress centrality \cite{Brandes2001,Shimbel1953} for a node \emph{n} corresponds to the total number of shortest paths passing through it. A node traversed by many shortest paths therefore has a high stress. This metric is described in the following way:

\begin{equation}
C_s(n) = \sum_{s \neq n \neq t} \sigma_{st}(n)
\end{equation}

where \emph{s} and \emph{t} are nodes in the network other than \emph{n}, \emph{$\sigma_{st}$} indicates the number of shortest paths from \emph{s} to \emph{t}, and \emph{$\sigma_{st}(n)$} is the count of shortest paths from \emph{s} to \emph{t} that pass through \emph{n}.
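Betweenness and stress both depend on the quantities $\sigma_{st}$ and $\sigma_{st}(n)$, so they can be sketched together. The following Python example is a brute-force illustration for small unweighted, undirected networks, not an efficient implementation; it assumes that each unordered pair (\emph{s}, \emph{t}) is counted once, and all names and the example graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    return dist, sigma


def betweenness_and_stress(adj, n):
    """Sum sigma_st(n)/sigma_st and sigma_st(n) over unordered pairs s, t != n."""
    info = {v: bfs_counts(adj, v) for v in adj}
    dist_n, sigma_n = info[n]
    betweenness, stress = 0.0, 0
    for s, t in combinations([v for v in adj if v != n], 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s or n not in dist_s or t not in dist_n:
            continue  # s, t, and n are not mutually reachable
        if dist_s[n] + dist_n[t] == dist_s[t]:
            through_n = sigma_s[n] * sigma_n[t]  # shortest s-t paths via n
            stress += through_n
            betweenness += through_n / sigma_s[t]
    return betweenness, stress


# Hypothetical diamond graph: s connects to a and b, which both connect to t
adj = {"s": ["a", "b"], "a": ["s", "t"], "b": ["s", "t"], "t": ["a", "b"]}
print(betweenness_and_stress(adj, "a"))
```

In the diamond graph, one of the two shortest paths between \emph{s} and \emph{t} passes through \emph{a}, so \emph{a} carries half of that pair's shortest paths.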

The neighborhood connectivity of a node \emph{n} is equal to the average connectivity of all of its neighbors, where the connectivity of a node is its number of neighbors \cite{Maslov2002}.

The topological coefficient describes the extent to which a given node shares neighbors with other nodes. In a social context this would measure the number of mutual friends two people share. The topological coefficient \cite{Stelzl2005} can be represented as

\begin{equation}
T_n = \frac{\mathrm{avg}_{m}\, J(n,m)}{k_n}
\end{equation}

where \emph{$k_n$} indicates the number of neighbors of node \emph{n} and \emph{J(n,m)} is the count of neighbors that \emph{n} and \emph{m} share, plus one if there is an edge between \emph{n} and \emph{m}. \emph{J(n,m)} is defined only for nodes \emph{m} that have at least one neighbor in common with \emph{n}.
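A minimal sketch of this calculation is given below. It assumes the topological coefficient is the average of \emph{J(n,m)} over all eligible nodes \emph{m}, divided by the number of neighbors of \emph{n}, and it assigns 0 to nodes with fewer than two neighbors; both the boundary convention and the names used are assumptions made for illustration.

```python
def topological_coefficient(adj, n):
    """Average J(n, m) over nodes m sharing a neighbor with n, divided by k_n."""
    k = len(adj[n])
    if k < 2:
        return 0.0  # assumed convention for nodes with 0 or 1 neighbors
    j_values = []
    for m in adj:
        if m == n:
            continue
        shared = len(adj[n] & adj[m])  # neighbors common to n and m
        if shared == 0:
            continue  # J(n, m) is defined only when a neighbor is shared
        j_values.append(shared + (1 if m in adj[n] else 0))
    if not j_values:
        return 0.0
    return (sum(j_values) / len(j_values)) / k


# Hypothetical graph: triangle a-b-c plus a pendant node d attached to a
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(topological_coefficient(adj, "a"), topological_coefficient(adj, "b"))
```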

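Eccentricity is the length of the longest shortest path between a node \emph{n} and any other node, that is, the distance from \emph{n} to the node farthest from it. The value for eccentricity is 0 for isolated nodes. The maximum value of eccentricity over all nodes is the diameter of the network.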

Radiality is a measure of centrality \cite{Valente1998,Brandes2001}. It is computed by subtracting the average shortest path length of a vertex \emph{n} from the diameter of the connected component to which \emph{n} belongs plus 1, and then dividing by that diameter. The resulting value is a number between 0 and 1. A high radiality value indicates that a vertex can easily reach other vertices \cite{Koschutzki2008}.
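Eccentricity and radiality can both be derived from breadth-first distances. The sketch below assumes an unweighted, undirected network and assumes that radiality is normalized as (diameter + 1 - average shortest path length) / diameter, which keeps the value between 0 and 1; the function names and the choice of 0 for a single-node component are illustrative assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every node in its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(adj, n):
    """Distance from n to the farthest reachable node (0 if n is isolated)."""
    return max(bfs_distances(adj, n).values())


def radiality(adj, n):
    """(diameter + 1 - average shortest path length of n) / diameter."""
    dist = bfs_distances(adj, n)
    if len(dist) == 1:
        return 0.0  # single-node component (assumed convention)
    # Diameter of the component containing n: largest eccentricity within it
    diameter = max(eccentricity(adj, m) for m in dist)
    avg_length = sum(dist.values()) / (len(dist) - 1)
    return (diameter + 1 - avg_length) / diameter


# Hypothetical path graph: a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(eccentricity(adj, "a"), radiality(adj, "a"), radiality(adj, "b"))
```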

Our prediction strategy starts with either a protein structure or protein sequence. We then gather features for either the entire protein or each residue depending on what type of prediction we are making. Next, we follow one of two paths: 1) calculate structure-based features, which are attributes of 3-dimensional structures acquired using X-ray crystallography or NMR, or 2) calculate sequence-based features. These features are collected and possibly normalized or converted in some way depending on the algorithm. We then train models using this attribute information and subsequently test our models with new data (for which we calculate the same features) in order to predict either binding or non-binding.

Our data set for DNA-binding prediction included two classes of proteins: those that bind DNA (positive class) and those not known to bind DNA (negative class). It consisted of 75 DNA-binding proteins and 214 others not known to bind DNA, including membrane-binding proteins, chaperones, and enzymes. This negative set is a subset of one used by Stawiski, Gregoret, and Mandel-Gutfreund \cite{Stawiski2003}. These sets were culled using the PISCES web server \cite{Wang2003}, and only structures with a sequence identity of $\leq$20\% and a resolution of $\leq$3{\AA} were included in our experiments.

For protein-level binding prediction, we used a set of 42 features related to protein structure and sequence. The sequence-based features included the amino acid composition (20 features) and the net charge calculated using the CHARMM \cite{Brooks1983} force field (1 feature). Structure-based features included surface amino acid composition derived from DSSP \cite{Kabsch1983} (20 features) and the size of the largest positively charged patch \cite{Bhardwaj2005,Bhardwaj2006} (1 feature).

We performed DNA-binding protein classification using sequence- and structure-based features with several algorithms (Table~\ref{tab:DNA-BP_structure_pred}). We varied the size of the training sets via 2- and 5-fold cross-validation. These results demonstrate our ability to distinguish between DNA-binding and non-DNA-binding proteins. AdaTree performed the best with 88.5\% accuracy, 66.7\% sensitivity, 96.3\% specificity, and an AUC of 88.7\%. These results can be interpreted in the following way. Given a random data set of proteins with the same class distribution, AdaTree should assign $\approx$88\% of these proteins to the correct class. If we know that a protein binds DNA, our classifier will correctly categorize it $\approx$67\% of the time. Similarly, if we know that a protein does not bind DNA, we can correctly predict this $\approx$96\% of the time. Accuracy in particular depends on the class distribution, whereas the AUC does not, and because our data set is imbalanced we place more confidence in the AUC. One interesting finding is that the results for SVM (a very slow algorithm to train) and for AdaStump (the second-fastest algorithm) are fairly close. This tells us that we can use a fast tree algorithm such as AdaStump without sacrificing much accuracy, which is useful because the SVM algorithm ran $\approx$25 times slower than AdaStump.
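To make the interpretation of these metrics concrete, the following sketch computes accuracy, sensitivity, and specificity from the four cells of a confusion matrix. The counts are hypothetical: they use the 75 positive and 214 negative proteins of the data set and were chosen only so that the resulting rates resemble those reported above. They are not the actual confusion matrix from the experiments.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,   # fraction of all proteins classified correctly
        "sensitivity": tp / (tp + fn),   # fraction of binding proteins recovered
        "specificity": tn / (tn + fp),   # fraction of non-binding proteins recovered
    }


# Hypothetical counts (illustration only): 75 positives, 214 negatives
metrics = binary_metrics(tp=50, fn=25, tn=206, fp=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because negatives outnumber positives almost three to one in this example, accuracy is dominated by the specificity term, which illustrates why a metric that is insensitive to class balance is useful for imbalanced data.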

The proteins comprising our data sets were extracted from the Protein Data Bank (PDB, \url{http://www.rcsb.org}) and culled using the PISCES web server \cite{Wang2003} with a sequence identity of $\leq$25\%. All structures were determined by X-ray diffraction and had a resolution of $\leq$3.0{\AA}. Our sequence-based DNA-binding residue data set consisted of 54 proteins and 14,780 residues, 2,083 of which were identified as DNA-binding and 12,697 considered non-binding based on distance from the DNA molecule in the bound structure (class ratio of $\approx$1/6). The RNA-binding residue data set used for sequence-based prediction contained 84 proteins and 60,016 residues, 5,934 classified as RNA-binding and 54,082 as non-binding (class ratio of $\approx$1/9).

Because we have formulated residue prediction in this case as a binary classification problem, each residue in the data set must be defined as DNA-binding or non-DNA-binding. As with previous studies \cite{Ahmad2004,Ahmad2005,Kuznetsov2006}, we based this class distinction on a residue's distance from the DNA molecule in the complex. A residue was defined as binding if any heavy atom (carbon, nitrogen, oxygen, or sulfur) belonging to the residue fell within a distance of 4.5{\AA} of any atom in the DNA molecule. In agreement with Kuznetsov, Gou, Li, and Hwang \cite{Kuznetsov2006}, we found that this distance provided the best accuracy for predictions. Any residues without atomic coordinates in the PDB file were not included in the data set.
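The distance-based class definition can be sketched as follows. The example assumes that atoms have already been parsed from the PDB file into element symbols and Cartesian coordinates, and that "within 4.5{\AA}" means a distance less than or equal to the cutoff; the data layout and function name are hypothetical.

```python
import math

CUTOFF = 4.5  # angstroms, the cutoff stated in the text
HEAVY_ATOMS = {"C", "N", "O", "S"}


def is_binding(residue_atoms, dna_atoms, cutoff=CUTOFF):
    """True if any heavy atom of the residue lies within cutoff of any DNA atom.

    residue_atoms: list of (element, (x, y, z)) tuples for one residue
    dna_atoms: list of (x, y, z) tuples for the DNA molecule
    """
    for element, xyz in residue_atoms:
        if element not in HEAVY_ATOMS:
            continue  # hydrogens and other atoms are ignored
        for dna_xyz in dna_atoms:
            if math.dist(xyz, dna_xyz) <= cutoff:
                return True
    return False


# Hypothetical coordinates (illustration only)
near = [("N", (0.0, 0.0, 0.0)), ("H", (3.0, 0.0, 0.0))]
far = [("C", (0.0, 0.0, 0.0)), ("H", (5.0, 0.0, 0.0))]
print(is_binding(near, [(4.0, 0.0, 0.0)]), is_binding(far, [(6.0, 0.0, 0.0)]))
```

In the second case the hydrogen lies only 1{\AA} from the DNA atom, but it is not a heavy atom, so the residue is labeled non-binding.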

A twenty-dimensional feature vector representing the 20 common amino acids is used to identify each residue, where a single non-zero entry indicates the current residue.

Because DNA molecules are negatively charged, basic (positively charged) amino acid residues can play an important role in nucleic acid binding. Accordingly, we include a charge attribute for each residue. Arginine and lysine residues are assigned a charge of +1, histidines +0.5, and all others 0.
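The residue identity and charge attributes described above can be encoded in a few lines. This is a minimal sketch: the ordering of the 20 amino acids within the vector and the one-letter codes used as input are assumptions, since the original ordering is not stated.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 common residues
CHARGE = {"R": 1.0, "K": 1.0, "H": 0.5}  # all other residues are assigned 0


def identity_vector(residue):
    """Twenty-element vector with a single non-zero entry marking the residue."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def residue_charge(residue):
    """Charge attribute: +1 for Arg and Lys, +0.5 for His, 0 otherwise."""
    return CHARGE.get(residue, 0.0)


# Identity and charge features for a lysine residue (21 values in total)
features = identity_vector("K") + [residue_charge("K")]
print(features)
```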

In order to consider the level of evolutionary conservation of each residue and its sequence neighbors, we create a position-specific scoring matrix (PSSM) with one row for each residue in the test protein. PSI-BLAST \cite{Altschul1997} is run against the NCBI-NR90 database \cite{Ahmad2005}, in which no two proteins share more than 90\% sequence identity, to create a matrix representing the distribution of all 20 amino acids at each position in the protein sequence. A 7-residue sliding window, which represents the distribution of amino acid residues at the central residue and at the positions occupied by three sequence neighbors on either side of it, is subsequently created. With 20 values per position, this results in a 140-element feature vector for each residue. A similar 7-residue window is created using the BLOSUM62 matrix \cite{Henikoff1992} in order to capture non-position-specific evolutionary conservation information for the sequence neighborhood of each residue, resulting in another 140-element feature vector.
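The sliding-window construction can be sketched as below. The example assumes the per-position profile (PSSM rows, or BLOSUM62 rows looked up by residue) is already available as a list of 20-element lists, and it pads positions beyond the sequence termini with zeros. The padding scheme is an assumption, since the text does not state how terminal residues are handled.

```python
def window_features(profile, center, half_width=3):
    """Concatenate profile rows for the residue at `center` and its neighbors.

    profile: list of per-position rows (20 values each)
    Returns (2 * half_width + 1) * 20 values; 140 for a 7-residue window.
    """
    width = len(profile[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0] * width)  # assumed zero padding past the termini
    return features


# Toy profile for a 10-residue sequence: row i holds the value i + 1 twenty times
profile = [[i + 1] * 20 for i in range(10)]
print(len(window_features(profile, 0)), len(window_features(profile, 5)))
```

Applying the same function to the PSSM and to the BLOSUM62 rows gives the two 140-element vectors described above.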

We evaluated the performance of our models against five other classification algorithms (SVM, Alternating Decision Tree, WillowBoost, C4.5 with Adaptive Boosting, and C4.5 with bootstrap aggregation). We built two models for each using sequence-based features: one for DNA-binding proteins and one for RNA-binding proteins. Figure~\ref{fig:ROC_DNA_RNA} describes the results for this comparison and shows the performance of each algorithm in terms of accuracy, sensitivity, specificity, precision, Matthews correlation coefficient (MCC), and the area under the Receiver Operating Characteristic curve (AUC). The AUC provides a measure of a model's ability to separate positive and negative examples and is generated from a plot of the true positive rate versus the false positive rate as the decision threshold is swept across the scores of the examples in the data set (Figure~\ref{fig:ROC_DNA_RNA}). A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.
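For reference, precision, the MCC, and the AUC can be computed as in the following sketch. The AUC is calculated here through its rank interpretation (the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve. The function names and toy scores are illustrative, not taken from the experiments.

```python
import math


def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0 when any marginal total is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (illustration only): one positive is ranked below one negative
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```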

In order to demonstrate the stability of our classifiers, we built models using previously compiled data sets for both DNA- and RNA-binding residue predictions. Figure~\ref{fig:DNA-bp_previous_data_sets} shows the comparisons between the original classifiers and ours using two previously compiled DNA-binding protein data sets and one RNA-binding protein data set used in seven publications \cite{Ahmad2005,Kumar2008,Kuznetsov2006,Ofran2007,Terribilini2006,Wang2006,Wang2008}. The classifiers were created using 10-fold cross-validation for both selection and validation. For the costing algorithm, the weight assigned to each class was equal to the class distribution and 200 costing iterations were run. Net accuracy was used to find the best model. The prediction metrics shown for previous works are either those reported as the best results in the publications or, if the authors' intended best result is unclear, the results with the best accuracy or MCC.

Overall we found that, based on the metrics reported in these previous publications, we were able to improve on those results over each of the three previously compiled data sets. First, we built our own classifier on the PDNA-62 data set, which was originally compiled by Selvaraj, Kono, and Sarai \cite{Selvaraj2002} and used for binding residue prediction in three subsequent publications \cite{Ahmad2005,Kuznetsov2006,Wang2006}. Our model (C4.5 with bagging and costing) achieved $\approx$78\% accuracy, $\approx$80\% sensitivity, $\approx$77\% specificity, $\approx$86\% AUC, and an MCC of 0.57, which is an improvement of +0.12 in the MCC over the best previous result \cite{Kuznetsov2006}. The second data set we tested was compiled and used by Ofran, Mysore, and Rost \cite{Ofran2007} and consisted of 274 proteins. Our classifier reached $\approx$86\% accuracy, $\approx$85\% sensitivity, $\approx$88\% specificity, $\approx$93\% AUC, and an MCC of 0.725. The only directly comparable metric reported in this previous work is accuracy. While our accuracy is slightly lower than that reported by Ofran, Mysore, and Rost \cite{Ofran2007}, we believe that our model actually offers a more reliable result. In their work, they used sequence to derive evolutionary profiles, sequence neighborhood, and predicted structural features. Their SVM classifier gave its best performance at 89\% accuracy. However, their `positive accuracy' (precision) and `positive coverage' (sensitivity) were imbalanced. For example, at a sensitivity of $\approx$80\% (the fraction of actual positive examples that are correctly classified), the precision is quite low ($\approx$55\%), which means that only about half of the predicted positive examples are actually positive. Finally, we tested 109 RNA-binding protein chains originally collected by Terribilini et al. \cite{Terribilini2006} and used in three works \cite{Kumar2008,Terribilini2006,Wang2008}. Our model achieved $\approx$76\% accuracy, $\approx$75\% sensitivity, $\approx$77\% specificity, $\approx$83\% AUC, and an MCC of 0.523 over this set, which is an improvement of +0.07 in the MCC over the best previous result \cite{Wang2008}.

The sequence-based feature sets used in the previous publications varied between works, as did the type of classifier used for prediction and the type of validation performed. While comparisons of this type are not ideal, they do demonstrate that, toward the goal of distinguishing binding from non-binding residues, each of the classifiers we have built using C4.5 with bagging and costing provides consistent results in terms of overall accuracy when trained over various data sets, thus increasing our confidence in this ensemble method.

In an attempt to validate our method, we predicted the binding residues of the gene-16 protein (GP16), a DNA-packing motor protein in Bacillus phage phi29, for one of our collaborators. This protein contains an ABC transporter nucleotide-binding domain and is known to bind ds-DNA. However, the DNA-binding residues for GP16 are unknown, and there are no highly related crystallized protein structures available. Our collaborator had some prior evidence that pointed toward two particular regions of interest in the protein. Using our methods, we were able to focus the costly experimental validation of nucleic acid binding residues on a few key locations in the sequence. We predicted the binding residues of this protein using our sequence-based DNA-binding classifier based on a Platt-calibrated version of the cost-sensitive method described above, which was built using residue charge, identity, and sequence homology information. Figure~\ref{fig:GP16} shows that our predicted binding residues overlap significantly with the collaborator's residues of interest.

The NAPS web server (\url{http://proteomics.bioengr.uic.edu/NAPS}) takes a DNA- or RNA-binding protein sequence as input and returns a list of residues, the predicted class (binding or non-binding), and a score indicating the classifier's confidence in the decision (Figure~\ref{fig:NAPSfig}). The classifier assigns a confidence score between 0 and 1 to each residue in the test protein. This score reflects the level of certainty in the assigned class, with 0.5 as the threshold. Residues with a confidence score between 0 and 0.5 are classified as non-binding residues; those with a score between 0.5 and 1 are classified as binding residues (Figure~\ref{fig:NAPSfig}). A table of calculated statistics, including the total number of residues binned by confidence score, the number of binding and non-binding residues in the protein, the percentage of each class, and the mean confidence value, is also returned. The server calculates a total of 301 sequence-based attributes for each residue in the test protein. We consider a `sequence-based attribute' to be any residue feature that can be calculated without the use of a crystal structure (i.e., from the protein sequence alone). These attributes are described in more detail below.
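The thresholding and summary statistics described above can be sketched as follows. This is not the NAPS server code: the function name, the output field names, and the decision to label a score of exactly 0.5 as binding are assumptions made for illustration, since the text does not specify how the boundary value is handled.

```python
def summarize_predictions(scores, threshold=0.5):
    """Assign a class to each residue score and compute summary statistics."""
    # A score equal to the threshold is labeled binding (assumed convention)
    labels = ["binding" if s >= threshold else "non-binding" for s in scores]
    n_binding = labels.count("binding")
    return {
        "labels": labels,
        "n_binding": n_binding,
        "n_non_binding": len(labels) - n_binding,
        "percent_binding": 100.0 * n_binding / len(labels),
        "mean_confidence": sum(scores) / len(scores),
    }


# Hypothetical confidence scores for a four-residue sequence
summary = summarize_predictions([0.9, 0.2, 0.6, 0.3])
print(summary["labels"], summary["percent_binding"])
```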