Importance Of QSAR Study II Computer Science Essay



Drug design is an iterative process that begins when a chemist identifies a compound with an interesting biological profile and ends when both the activity profile of the molecule and its chemical synthesis have been optimized. The traditional approach to drug discovery relies on stepwise synthesis and screening of a large number of compounds to optimize activity profiles. In 'rational' design, it is essential to identify a molecular target specific to a disease process or an infectious pathogen. An important prerequisite for drug design is the determination of the molecular structure of the target.

Quantitative Structure Activity Relationship (QSAR) is a methodology used to correlate a biological property of a molecule with molecular descriptors derived from its chemical structure.

Importance of QSAR study

The number of compounds that must be synthesized in order to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000 (10^4). An alternative approach to compound optimization is to develop a theory that quantitatively relates variation in biological activity to changes in molecular descriptors that can be easily obtained for each compound. If a valid QSAR has been determined, it is possible to predict the biological activity of related drug candidates before they are put through expensive and time-consuming biological testing, i.e., activity can be predicted without synthesizing the new molecule. Sometimes only the computed values need to be known to make an assessment. QSAR can also be used extensively for the prediction of physicochemical properties in the chemical, environmental, and pharmaceutical areas.

A QSAR attempts to find a consistent relationship between the variations in molecular properties and the biological activity for a series of compounds, expressed as equations that can then be used to evaluate new chemical entities.

A QSAR generally takes the form of a linear equation

Biological Activity = Const + (C1 × P1) + (C2 × P2) + (C3 × P3) + ... + (Cn × Pn)

Here the parameters P1 through Pn are computed for each molecule in the series, and the coefficients C1 through Cn are calculated by fitting variations in the parameters to the biological activity.
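As a minimal sketch of how the coefficients C1 through Cn might be obtained, the following Python example fits the linear form above by ordinary least squares with NumPy. The function names and the five-compound data set are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def fit_linear_qsar(P, activity):
    """Return (Const, [C1..Cn]) minimizing the squared error of the fit."""
    P = np.asarray(P, dtype=float)
    y = np.asarray(activity, dtype=float)
    # Prepend a column of ones so the first fitted value is the constant term.
    X = np.column_stack([np.ones(len(P)), P])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0], beta[1:]

def predict(const, coeffs, P):
    """Evaluate Const + C1*P1 + ... + Cn*Pn for each row of P."""
    return const + np.asarray(P, dtype=float) @ coeffs

# Hypothetical series: 5 compounds, 2 parameters each.
# Activities are generated from activity = 1 + 2*P1 - 0.5*P2 for illustration.
P = [[0.5, 1.0], [1.0, 0.2], [1.5, 2.0], [2.0, 0.7], [2.5, 1.5]]
activity = [1 + 2 * p1 - 0.5 * p2 for p1, p2 in P]

const, coeffs = fit_linear_qsar(P, activity)
print(const, coeffs)
```

Because the illustrative activities are noise-free, the fit recovers the generating constant and coefficients; real biological data would leave residual error.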

Applications of QSAR

At present, QSAR science, founded on the systematic use of mathematical models and on the multivariate point of view, is one of the basic tools of drug design.

QSAR has been applied successfully and extensively to find predictive models for activity of bioactive agents.

It has also been applied to the following areas related to the discovery and subsequent development of bioactive agents:

discrimination of drug-like from non-drug-like molecules

drug resistance

toxicity prediction

physicochemical properties prediction

gastrointestinal absorption

activity of peptides

data mining

drug metabolism

prediction of pharmacokinetic and ADME properties.

Prerequisites for carrying out QSAR studies

Multiple readings for a given observation should be reproducible and have relatively small errors.

Compounds selected to describe the "chemical space" of experiments (the training set) should be diverse.

For a QSAR study the data must be expressed in terms of the free energy changes that occur during the biological response.

The set of parameters must be easily obtainable and should be related to receptor affinity.

There should be a method for detecting a relationship between the parameters and the binding data.

There must be a validation method for the developed QSAR model.

Methodology of QSAR study

There are three groups of chemoinformatic methods for building a QSAR model: extracting descriptors from the molecular structure, choosing those that are informative in the context of the analyzed activity, and finally using the values of the descriptors as independent variables to define a mapping that correlates them with the activity in question.

Generation of Molecular Descriptors from Structure

Molecular descriptors have to be generated because the structure cannot be used directly to create a structure-activity mapping, for reasons stemming from both chemistry and computer science. First, chemical structures do not usually contain the information that relates to activity in an explicit form. Second, the chemical structures of compounds are diverse in size and nature and as such do not fit directly into a mapping that expects a well-defined set of inputs. To circumvent this obstacle, molecular descriptors convert the structure into well-defined sets of numerical values.

Selection of Molecular Descriptors

Molecular descriptors should be significantly correlated with the activity. Some statistical methods require more compounds than descriptors, so a large descriptor set requires a large data set. To tackle this problem, QSAR analysis uses a wide range of methods that automatically narrow the set of descriptors to the most informative ones.

Mapping the Descriptors to Activity

Once the relevant molecular descriptors are computed and selected, the final task is to create a function that maps their values to the analyzed activity. The most accurate mapping function is usually sought by fitting it to the information available in the training set.

Molecular descriptors

Molecular descriptors are numerical values that characterize the properties of molecules and encode their structural features. They map the structure of a compound into a set of numerical or binary values representing the molecular properties that are deemed important for explaining the activity.

2D QSAR descriptors

The 2D QSAR descriptors are independent of the 3D orientation of the compound. These descriptors range from simple measures of the entities constituting the molecule, through its topological and geometrical properties, to computed electrostatic and quantum-chemical descriptors or advanced fragment-counting methods.

Constitutional Descriptors

Constitutional descriptors capture the properties of a molecule that are related to the elements constituting its structure. These descriptors are fast and easy to compute. They include the molecular weight, the total number of atoms in the molecule and the numbers of atoms of different identity. They also include the total numbers of single, double, triple or aromatic bonds, as well as the number of aromatic rings.
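The formula-based part of these descriptors can be sketched in a few lines of Python. The example below assumes a simple molecular formula without brackets or charges and a small, hand-entered table of standard atomic masses; bond and ring counts need the full structure and are not covered. The function name is hypothetical.

```python
import re

# Standard atomic masses for a few common elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def constitutional_descriptors(formula: str) -> dict:
    """Compute molecular weight and atom counts from a formula such as 'C6H6O'."""
    counts = {}
    # Each match is an element symbol followed by an optional count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    weight = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {
        "molecular_weight": weight,
        "n_atoms": sum(counts.values()),
        "atom_counts": counts,
    }

# Phenol, C6H6O
print(constitutional_descriptors("C6H6O"))
```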

Electrostatic and Quantum-Chemical Descriptors

Electrostatic descriptors give information on the electronic nature of the molecule. They include descriptors containing information on atomic net and partial charges. Solvent-accessible atomic surface areas are informative electrostatic descriptors for modeling intermolecular hydrogen bonding. The energies of the highest occupied and lowest unoccupied molecular orbitals form the quantum-chemical descriptors.

Topological Descriptors

Topological descriptors treat the structure of the compound as a graph, with atoms as vertices and covalent bonds as edges. On this basis many indices quantifying molecular connectivity have been defined, starting with the Wiener index, which counts the total number of bonds in the shortest paths between all pairs of non-hydrogen atoms. Other topological descriptors include the Randić indices χ, defined as the sum of geometric averages of edge degrees of atoms within paths of given lengths, Balaban's J index and the Schultz index. Kier and Hall indices χ^v and Gálvez topological charge indices capture information about valence electrons. The former use geometric averages of valence connectivities along paths. The latter measure the topological valences of atoms and the net charge transfer between pairs of atoms separated by a given number of bonds.
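The Wiener index described above can be illustrated with a short Python sketch that runs a breadth-first search from every non-hydrogen atom of a hydrogen-suppressed graph. The adjacency-dictionary representation and the function name are assumptions made for this example.

```python
from collections import deque

def wiener_index(adjacency: dict) -> int:
    """Sum of shortest-path bond counts over all pairs of atoms in the graph."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    # Every pair was counted once from each end.
    return total // 2

# Hydrogen-suppressed carbon skeletons of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))
```

The more branched isomer gives the smaller value, which is the kind of structural difference such indices are meant to encode.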

Geometrical Descriptors

Geometrical descriptors describe the spatial arrangement of the atoms constituting the molecule. They capture information on the molecular surface, which is obtained from atomic van der Waals areas and their overlap. The molecular volume may be obtained from atomic van der Waals volumes. Geometrical descriptors also include principal moments of inertia and gravitational indices, which provide information on the spatial arrangement of the atoms in the molecule. Shadow areas, obtained by projecting the molecule onto the planes defined by its principal axes, are also used.

3D QSAR descriptors

The 3D-QSAR methodology is computationally more complex than the 2D-QSAR approach. In 3D QSAR, several steps are needed to obtain numerical descriptors of the compound structure. First, the conformation of the compound has to be determined, either from experimental data or by molecular mechanics, and then refined by minimizing the energy. Next, the conformers in the dataset have to be uniformly aligned in space. Finally, the space in which the conformer is immersed is probed computationally for various descriptors. Some methods that are independent of the compound alignment have also been developed.

Comparative Molecular Field Analysis (CoMFA)

CoMFA uses electrostatic (Coulombic) and steric (van der Waals) energy fields defined by the inspected compound. The aligned molecule is placed in a 3D grid, a probe atom with unit charge is placed at each point of the grid lattice, and the potentials (Coulomb and Lennard-Jones) of the energy fields are computed. These values then serve as descriptors in further analysis, typically partial least squares regression. This analysis allows the identification of structural regions that are positively or negatively related to the activity in question.
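A rough sketch of the field calculation is given below. It assumes a probe with unit positive charge, a single illustrative well depth (`epsilon`) and optimum distance (`r_min`) for every atom, and a symmetric energy cut-off; none of these values is taken from a particular CoMFA implementation, and the function name is hypothetical.

```python
import numpy as np

def comfa_fields(coords, charges, grid_points,
                 epsilon=0.1, r_min=3.4, cutoff=30.0):
    """Return (electrostatic, steric) probe energies at each grid point.

    Energies are in kcal/mol for distances in angstroms and charges in
    electron units; epsilon, r_min and cutoff are illustrative assumptions.
    """
    coords = np.asarray(coords, dtype=float)
    charges = np.asarray(charges, dtype=float)
    grid = np.asarray(grid_points, dtype=float)
    # Distance from every grid point (rows) to every atom (columns).
    r = np.linalg.norm(grid[:, None, :] - coords[None, :, :], axis=2)
    r = np.maximum(r, 1e-6)  # avoid division by zero on top of an atom
    # Coulomb energy of a +1 probe; 332 converts e^2/angstrom to kcal/mol.
    coulomb = 332.0 * (charges[None, :] / r).sum(axis=1)
    # Lennard-Jones 12-6 energy with a minimum of -epsilon at r = r_min.
    ratio = r_min / r
    steric = (epsilon * (ratio ** 12 - 2.0 * ratio ** 6)).sum(axis=1)
    # Truncate the very large values that occur close to or inside atoms.
    return np.clip(coulomb, -cutoff, cutoff), np.clip(steric, -cutoff, cutoff)

# One atom with partial charge +0.1 at the origin, probed at two grid points.
coulomb, steric = comfa_fields([[0, 0, 0]], [0.1], [[3.4, 0, 0], [0, 0, 0]])
print(coulomb, steric)
```

The second grid point lies on the atom itself, so both energies are truncated at the cut-off value.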

Comparative Molecular Similarity Indices Analysis (CoMSIA)

CoMSIA is similar to CoMFA in that a probe atom is moved through the regular grid lattice in which the molecules are immersed. The similarity between the probe atom and the analyzed molecule is calculated at each point. Compared to CoMFA, CoMSIA uses a different potential function, namely a Gaussian-type function. Steric, electrostatic, and hydrophobic properties are calculated, so the probe atom has unit hydrophobicity as an additional property. The Gaussian-type potential function yields meaningful values even at grid points located within the molecule. In CoMFA, unacceptably large values are obtained at these points because of the nature of the potential functions, and arbitrary cut-offs have to be applied.
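The Gaussian-type similarity index can be sketched as follows. The attenuation factor `alpha` of 0.3 and the function name are assumptions made for illustration; the point of the example is that the value stays finite even when the grid point coincides with an atom.

```python
import numpy as np

def comsia_index(coords, atom_property, grid_points,
                 probe_property=1.0, alpha=0.3):
    """Gaussian-weighted similarity between a probe and the molecule.

    For each grid point q the index is
        -sum_i probe_property * w_i * exp(-alpha * r_iq**2)
    where w_i is the property value of atom i (steric, electrostatic or
    hydrophobic) and r_iq is the atom-to-grid-point distance.
    """
    coords = np.asarray(coords, dtype=float)
    w = np.asarray(atom_property, dtype=float)
    grid = np.asarray(grid_points, dtype=float)
    # Squared distance from every grid point (rows) to every atom (columns).
    r2 = ((grid[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    return -(probe_property * w[None, :] * np.exp(-alpha * r2)).sum(axis=1)

# One atom with property value 0.5, probed on the atom and 1 angstrom away.
values = comsia_index([[0, 0, 0]], [0.5], [[0, 0, 0], [1, 0, 0]])
print(values)
```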

Automatic selection of relevant molecular descriptors

There are automatic methods for selecting the best descriptors, or features, to be used in the construction of a QSAR model. They fall into filtering methods, wrapper methods, and hybrids of the two.

Filtering Methods

Filtering methods are applied independently of the mapping method used. They are executed prior to the mapping to reduce the number of descriptors according to some objective criterion, such as inter-descriptor correlation. Filtering methods include correlation-based methods, methods based on information theory, and methods based on statistical criteria.
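A correlation-based filter of this kind might look like the sketch below. The variance and correlation thresholds, the greedy keep-first strategy and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def filter_descriptors(X, min_variance=1e-8, max_abs_corr=0.9):
    """Return the column indices that survive a simple two-stage filter."""
    X = np.asarray(X, dtype=float)
    # Stage 1: drop descriptors that are (nearly) constant across compounds.
    varying = [j for j in range(X.shape[1]) if X[:, j].var() > min_variance]
    # Stage 2: keep a descriptor only if it is not highly correlated
    # with any descriptor that has already been kept.
    selected = []
    for j in varying:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= max_abs_corr
               for k in selected):
            selected.append(j)
    return selected

# Hypothetical matrix: column 1 duplicates column 0 (scaled), column 2 is constant.
X = [[1, 2, 7, 1],
     [2, 4, 7, -1],
     [3, 6, 7, 2],
     [4, 8, 7, -2],
     [5, 10, 7, 0]]
print(filter_descriptors(X))
```

Note that the filter never looks at the activity values, which is what makes it independent of the mapping method.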

Wrapper methods

Wrapper techniques operate in conjunction with a mapping algorithm. The choice of the best subset of descriptors is guided by the error of the mapping algorithm measured for each candidate subset, e.g., by cross-validation. Wrapper methods include genetic algorithms, simulated annealing, sequential forward feature selection and sequential backward feature elimination.
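Sequential forward selection, one of the wrapper methods listed above, is sketched below with ordinary least squares as the mapping algorithm and leave-one-out cross-validation as the error measure. Both choices, the stopping rule and the data are assumptions made for this example.

```python
import numpy as np

def loo_error(X, y):
    """Leave-one-out mean squared error of an ordinary least-squares fit."""
    n = len(y)
    errors = []
    for i in range(n):
        train = np.arange(n) != i
        A = np.column_stack([np.ones(train.sum()), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        prediction = beta[0] + X[i] @ beta[1:]
        errors.append((y[i] - prediction) ** 2)
    return float(np.mean(errors))

def forward_selection(X, y, max_features):
    """Greedily add the descriptor that lowers the cross-validated error most."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected, best_error = [], np.inf
    while len(selected) < max_features:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        scored = [(loo_error(X[:, selected + [j]], y), j) for j in candidates]
        error, j = min(scored)
        if error >= best_error:  # stop when no candidate improves the error
            break
        selected.append(j)
        best_error = error
    return selected, best_error

# Hypothetical data in which the activity depends only on descriptor 1.
X = np.array([[1, 0.5, 3], [2, 1.5, 1], [3, 0.2, 4],
              [4, 2.2, 1], [5, 1.1, 5], [6, 3.0, 9]], dtype=float)
y = 2.0 * X[:, 1]
selected, error = forward_selection(X, y, max_features=2)
print(selected, error)
```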

Hybrid methods

Hybrid methods combine the two approaches above. A rapid objective method is used as a preliminary filter to narrow the feature set. Next, one or more accurate but slower subjective methods are employed.

Mapping the molecular structure to activity

After the selection of relevant descriptors, the final step in building a QSAR model is to derive the mapping between the activity and the values of the features. Linear models provide a simple mapping, while non-linear methods extend the approach to more complex relations.

Linear models

Linear models predict the activity as linear function of molecular descriptors. For small data sets of similar compounds, linear models are easily interpretable and sufficiently accurate.

Multiple Linear Regression (MLR)

In MLR models, the activity to be predicted is assumed to be a linear function of all the descriptors. The coefficients of the function are estimated from the training set, and these free parameters are chosen to minimize the sum of squared errors between the predicted and the actual activity. The main drawback of MLR is that a large descriptors-to-compounds ratio, or multicollinear descriptors in general, makes the results unstable. On the other hand, MLR has been reported to exhibit lower cross-validation error than partial least squares when both use 4D-QSAR fingerprints.

Recently developed methodologies based on MLR include Best Multiple Linear Regression (BMLR), the Heuristic Method (HM), Genetic Algorithm based Multiple Linear Regression (GA-MLR), Stepwise MLR and Factor Analysis MLR.

BMLR is reported to work well even when the number of compounds does not exceed the number of molecular descriptors by at least a factor of five. As the number of descriptors increases, however, the modeling process becomes time consuming. To speed up the calculation, descriptors with insignificant variance within the data set should be rejected.

HM is an advanced algorithm based on MLR. The selection of descriptors proceeds as follows. First, all descriptors are checked to ensure that a value of each descriptor is available for each structure. Descriptors whose values are not available for every structure are discarded. Descriptors whose values are constant across the data set are also discarded. Then all possible one-parameter regression models are tested and the insignificant descriptors are rejected. In the next step, the pair correlation matrices of the descriptors are calculated, and the descriptor pool is reduced further by eliminating highly correlated descriptors. Finally, the model is validated and the goodness of the correlation is tested by the squared correlation coefficient (R2), the cross-validated squared correlation coefficient (q2), the F-test (F), and the standard deviation (S).
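Two of the goodness-of-fit statistics named above, R2 and q2, can be sketched as follows. The example assumes an ordinary least-squares model, leave-one-out cross-validation for q2, and the mean of the whole data set as the reference in both statistics; other conventions exist. The helper names and the data are hypothetical.

```python
import numpy as np

def _ols_predict(X_train, y_train, X_new):
    """Fit least squares with an intercept on the training rows, then predict."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return beta[0] + X_new @ beta[1:]

def r_squared(X, y):
    """R2 = 1 - SS_residual / SS_total for the model fitted to all compounds."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    residuals = y - _ols_predict(X, y, X)
    centred = y - y.mean()
    return 1.0 - (residuals @ residuals) / (centred @ centred)

def q_squared_loo(X, y):
    """q2 = 1 - PRESS / SS_total, each compound predicted by a model built without it."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        prediction = _ols_predict(X[keep], y[keep], X[i])
        press += (y[i] - prediction) ** 2
    centred = y - y.mean()
    return 1.0 - press / (centred @ centred)

# Hypothetical one-descriptor data set with a little scatter.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
r2, q2 = r_squared(X, y), q_squared_loo(X, y)
print(r2, q2)
```

For a least-squares model q2 computed this way cannot exceed R2, so a large gap between the two is a warning sign of overfitting.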

Partial Least Squares (PLS)

PLS is a suitable method for overcoming the problems that multicollinear or over-abundant descriptors cause in MLR. PLS tries to obtain knowledge of the latent variables indirectly, through the scores and the loadings. The scores are orthogonal and capture the descriptor information in a way that allows good prediction of the activity. The score vectors are estimated iteratively. The first one can be derived using the first eigenvector of the combined activity-descriptor variance-covariance matrix. Next, the descriptor matrix is deflated by subtracting the information explained by the first score vector. The resulting matrix is used in the derivation of the second score vector, which, followed by a further deflation, closes the iteration loop. In each iteration step, the coefficient relating the score vector to the activity is also determined.
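The iteration described above can be sketched for a single activity column (often called PLS1). This is a minimal illustration using NumPy rather than a reference implementation; for one activity column the leading eigenvector mentioned above reduces to the normalized product of the transposed descriptor matrix and the activity vector, which is what the code uses. All names and data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Return (coefficients, intercept, scores) of a PLS1 model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centred descriptors and activity
    weights, loadings, y_coefs, scores = [], [], [], []
    for _ in range(n_components):
        w = E.T @ f                        # direction of maximal covariance
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = E @ w                          # score vector
        tt = t @ t
        p = E.T @ t / tt                   # descriptor loading
        c = f @ t / tt                     # coefficient relating score to activity
        E = E - np.outer(t, p)             # deflate the descriptor matrix
        f = f - c * t                      # deflate the activity
        weights.append(w); loadings.append(p); y_coefs.append(c); scores.append(t)
    W, P, q = np.array(weights).T, np.array(loadings).T, np.array(y_coefs)
    coef = W @ np.linalg.solve(P.T @ W, q) # coefficients in descriptor space
    intercept = y_mean - x_mean @ coef
    return coef, intercept, np.array(scores).T

# Hypothetical data generated from activity = 1 + 2*P1 - 0.5*P2.
X = np.array([[0.5, 1.0], [1.0, 0.2], [1.5, 2.0], [2.0, 0.7], [2.5, 1.5]])
y = 1 + 2 * X[:, 0] - 0.5 * X[:, 1]
coef, intercept, T = pls1_fit(X, y, n_components=2)
print(coef, intercept)
```

With as many components as descriptors the model coincides with the least-squares solution; in practice fewer components are kept, which is where the protection against multicollinearity comes from.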

Recently developed PLS variants include Genetic Partial Least Squares (G/PLS), Factor Analysis Partial Least Squares (FA-PLS) and Orthogonal Signal Correction Partial Least Squares (OSC-PLS).

Linear Discriminant Analysis (LDA)

LDA is a classification method that creates a linear transformation of the original feature space into a space that maximizes the between-class separability and minimizes the within-class variance. The procedure involves solving a generalized eigenvalue problem based on the between-class and within-class covariance matrices. To avoid ill-conditioning of the eigenvalue problem, the number of features has to be significantly smaller than the number of observations. If it is not, principal component analysis can be applied first to reduce the dimension of the input data. LDA has been used to create QSAR models, e.g., for the prediction of model validity for new compounds, where it fared better than PLS but worse than a non-linear neural network.
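For two classes the generalized eigenvalue problem has a closed-form solution, which the following sketch uses. The midpoint decision threshold, the 0/1 class coding, the function names and the data are assumptions made for this example.

```python
import numpy as np

def fit_lda(X, labels):
    """Two-class Fisher discriminant: return (direction w, decision threshold)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    X0, X1 = X[labels == 0], X[labels == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix, pooled over both classes.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Direction that maximizes between-class over within-class scatter.
    w = np.linalg.solve(Sw, m1 - m0)
    # Place the threshold halfway between the projected class means.
    threshold = w @ (m0 + m1) / 2.0
    return w, threshold

def predict_lda(w, threshold, X):
    """Assign class 1 when the projection exceeds the threshold, else class 0."""
    return (np.asarray(X, dtype=float) @ w > threshold).astype(int)

# Hypothetical two-descriptor data: class 0 = inactive, class 1 = active.
X = [[1, 2], [2, 1], [2, 3], [3, 2], [6, 5], [7, 8], [8, 6], [9, 7]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
w, threshold = fit_lda(X, labels)
print(predict_lda(w, threshold, [[0, 0], [10, 10]]))
```

The call to `np.linalg.solve` is where ill-conditioning would show up if there were more descriptors than compounds.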

Non-Linear Models

Non-linear models extend structure-activity relationships to non-linear functions of the input descriptors. These models can be more accurate, especially for large and diverse datasets. However, they are usually harder to interpret. Complex non-linear models are also prone to overfitting, i.e., poor generalization to compounds unseen during training.

Artificial Neural Networks (ANN)

Artificial neural networks are biologically inspired prediction methods based on the architecture of a network of neurons. In feed-forward networks, during prediction the information flows in one direction only, from the input descriptors, through a set of layers, to the output of the network. A disadvantage is the tendency to overfit the data, together with considerable difficulty in ascertaining which descriptors are most significant in the resulting model. Frequently used neural networks include the Radial Basis Function Neural Network (RBFNN) and the General Regression Neural Network (GRNN).
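The one-directional flow of information during prediction can be sketched as a forward pass through a small network. The tanh hidden activation, the linear output layer, the layer sizes and the weight values are assumptions for illustration; training is not shown.

```python
import numpy as np

def forward(descriptors, layers):
    """Propagate descriptor values through a list of (weights, biases) layers."""
    activation = np.asarray(descriptors, dtype=float)
    for index, (W, b) in enumerate(layers):
        z = activation @ W + b
        # Hidden layers use tanh; the last layer is left linear for regression.
        activation = z if index == len(layers) - 1 else np.tanh(z)
    return activation

# Hypothetical network: 3 descriptors -> 2 hidden neurons -> 1 predicted activity.
layers = [
    (np.array([[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]), np.array([0.0, 0.1])),
    (np.array([[1.0], [-2.0]]), np.array([0.5])),
]
print(forward([1.0, 0.5, -1.0], layers))
```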

Support Vector Machines (SVM)

SVM stems from the structural risk minimization principle, with the linear support vector classifier as its most basic member. The classifier aims at creating a decision hyperplane that maximizes the margin, i.e., the distance from the hyperplane to the nearest examples from each of the classes. Importantly, the objective function is unimodal and can therefore be optimized effectively to a global optimum. Put simply, when compounds from different classes can be separated by a linear hyperplane, that hyperplane is defined solely by its nearest compounds from the training set. Such compounds are referred to as support vectors, which give the method its name.

These methods have been extended into Support Vector Regression (SVR) to handle regression problems. SVM methods have been shown to exhibit low prediction error in QSAR.
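The notions of margin and support vectors can be illustrated without training a model. The sketch below takes a hyperplane as given, computes the geometric margin of each compound, and reports the compounds closest to the hyperplane. The class coding of -1 and +1, the function names and the data are assumptions made for this example.

```python
import numpy as np

def geometric_margins(w, b, X, labels):
    """Signed distance of each compound to the hyperplane w.x + b = 0.

    Positive values mean the compound lies on the side of its own class.
    """
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return labels * (X @ w + b) / np.linalg.norm(w)

def nearest_compounds(w, b, X, labels, tol=1e-9):
    """Indices of the compounds that sit exactly on the margin."""
    margins = geometric_margins(w, b, X, labels)
    return np.flatnonzero(margins <= margins.min() + tol)

# Hypothetical two-descriptor data and the hyperplane x1 = 1.5.
X = [[0, 0], [0, 1], [3, 0], [4, 1]]
labels = [-1, -1, 1, 1]
w, b = [1.0, 0.0], -1.5
print(geometric_margins(w, b, X, labels))
print(nearest_compounds(w, b, X, labels))
```

Moving the compound at index 3 further from the hyperplane would leave the margin unchanged, which is the sense in which only the nearest compounds define the solution.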

Gene Expression Programming (GEP)

GEP was invented by Ferreira in 1999 and was developed from genetic algorithms and genetic programming (GP). GEP is very simple compared with cellular gene expression. It has two main components: the chromosomes and the expression trees (ETs). The translation of information in the gene code is very simple, namely a one-to-one relationship between the symbols of the chromosome and the functions or terminals they represent. The rules of GEP determine the spatial organization of the functions and terminals in the ETs and the type of interaction between sub-ETs. Hence the language of the genes and the language of the ETs together constitute the language of GEP.
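The one-to-one translation from chromosome symbols to an expression tree can be sketched for a single gene. The example assumes the gene is read level by level, left to right, uses only three binary arithmetic functions, and treats every other symbol as a terminal; the function set and the names are hypothetical.

```python
import operator

# Function symbols and their arities; anything else is treated as a terminal.
FUNCTIONS = {"+": (2, operator.add), "-": (2, operator.sub), "*": (2, operator.mul)}

def evaluate_gene(gene: str, terminals: dict):
    """Decode a gene into an expression tree and evaluate it."""
    children = []      # children[i] lists the gene positions below node i
    next_child = 1     # next unassigned position in the gene
    for position, symbol in enumerate(gene):
        if position >= next_child:
            break      # the tree is complete; the rest of the gene is unused
        arity = FUNCTIONS[symbol][0] if symbol in FUNCTIONS else 0
        children.append(list(range(next_child, next_child + arity)))
        next_child += arity

    def value(position):
        symbol = gene[position]
        if symbol in FUNCTIONS:
            arguments = [value(child) for child in children[position]]
            return FUNCTIONS[symbol][1](*arguments)
        return terminals[symbol]

    return value(0)

terminals = {"a": 1, "b": 2, "c": 5, "d": 3}
print(evaluate_gene("*+-abcd", terminals))   # (a + b) * (c - d)
print(evaluate_gene("+a*bcab", terminals))   # a + (b * c); trailing "ab" is unused
```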


QSAR modeling involves three main steps, each of which has its own group of pitfalls:

(1) Input data preparation and preprocessing

(2) Model generation and validation

(3) Analysis of results

Pitfalls Concerned with Input Data Preparation and Preprocessing

Incompatible Concepts and Contraindications for QSAR

Multiconditionality: In silico studies are limited not only by hardware and software but also by the fact that drug action is based on a sequence of complicated physicochemical events that are either still unknown or not fully understood at the molecular level. QSAR and QSPR describe ADMET processes quantitatively and can only fragmentarily reproduce real observations.

Common Action Mechanism and Multiple Binding Modes: The occurrence of multiple binding modes (MBM) of the very same ligand to its target molecule complicates the model. QSAR is therefore conducted under the silent assumption that no multiple binding modes are present when molecular similarities are compared.

Multiple Targets and Multipotency: Conventional QSAR, which works with cell-free data, is not affected by drug binding to multiple targets. Such binding arises because, at lower doses, a molecule binds to the biomolecule for which it has higher affinity, whereas at higher doses the same ligand may also bind to other targets with lower affinity.

Pitfalls Concerned with Model Generation and Validation

Selection of Predictor Variables

Meaningless Descriptor Selection: Not all descriptors are useful for describing electronic and hydrophobic effects, so the inclusion of a large number of descriptors in QSAR studies should be discouraged.


The number of independent variables in a final QSAR model should be as low as possible, because the reintroduction of collinear variables may greatly improve R2 or Q2 in leave-one-out (LOO) examination while deteriorating future model applicability.

Errors of Descriptor Calculations: Poor correlation results may ultimately be due to experimental or computational errors.

Robust Statistical Procedures and "Black Boxes"

QSAR studies are often done with user-friendly software, so models can be developed without a detailed understanding of the underlying theory and statistics.

Pitfalls Concerned with Model Interpretation

Unrelatedness, also known as the "Correlation Problem"

A good correlation is often mistakenly interpreted as a proof of causality. In addition to this, achieving significance through MLR or PLS means only "significance on a statistical level" and nothing more. So a significant variable or model can be completely irrelevant on pharmacological grounds.

Chance Correlation

Having some independent variable correlating with activity does not necessarily mean that the corresponding feature is directly involved in explaining SAR.
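How easily a descriptor can correlate with activity purely by chance can be illustrated with a small simulation. The sketch below draws random activities and random descriptors that are unrelated by construction and reports the best squared correlation found. The sample sizes, the seed and the function name are assumptions made for this example.

```python
import numpy as np

def best_chance_r2(n_compounds, n_descriptors, seed=0):
    """Largest squared correlation between random descriptors and random activity."""
    rng = np.random.default_rng(seed)
    activity = rng.normal(size=n_compounds)
    descriptors = rng.normal(size=(n_compounds, n_descriptors))
    # Pearson correlation of every descriptor column with the activity.
    a = activity - activity.mean()
    D = descriptors - descriptors.mean(axis=0)
    r = (D.T @ a) / (np.linalg.norm(D, axis=0) * np.linalg.norm(a))
    return float((r ** 2).max())

# Ten compounds screened against a pool of 1000 meaningless descriptors.
print(best_chance_r2(n_compounds=10, n_descriptors=1000))
```

With few compounds and a large descriptor pool, some descriptor will almost always appear to explain a large share of the variance even though none is related to the activity.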

Multiple Solutions

MLR leads to a model that does not describe all levels of explanatory complexity in nature. The computed model simplifies the problem, approximates reality, and predicts new data points along the regression line. Various solutions of different dimensionality therefore exist for the same problem. If a large number of independent variables is at hand, some may depend on others (overlap, redundancy) despite their reassuring name. Multiple solutions are always possible and are not necessarily indicators of wrong models.

Extrapolation and Interpolation

Theoretically, activities can be predicted either by interpolation between the observed data points or by extrapolation to areas outside the observed variable levels. The activity of new test molecules is predicted by the established equations, but if their structures behave differently and are poorly described by the chemical properties in the equations, the prediction becomes unreliable. Inappropriate extrapolation or interpolation thus makes prediction a risky operation.
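A very crude way to tell interpolation from extrapolation is to check whether every descriptor of a new molecule lies within the range seen in the training set. The sketch below does only that; it is an assumption-laden illustration, and a molecule inside this bounding box may still be poorly described by the model.

```python
import numpy as np

def is_interpolation(X_train, x_new):
    """True if every descriptor of x_new lies within the training-set range."""
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_new, dtype=float)
    return bool(np.all((x >= X.min(axis=0)) & (x <= X.max(axis=0))))

# Hypothetical training descriptors for three compounds.
X_train = [[0.5, 1.0], [1.0, 0.2], [2.5, 1.5]]
print(is_interpolation(X_train, [1.2, 0.8]))   # inside the observed ranges
print(is_interpolation(X_train, [3.0, 0.8]))   # first descriptor out of range
```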

Advantages of QSAR

QSAR quantifies the relationship between structure and activity and provides an understanding of the effect of structure on activity.

QSAR makes predictions leading to the synthesis of novel analogues.

The results can help in understanding the interactions between the functional groups of the most active molecules and those of their target.

Disadvantages of QSAR

False correlations may arise because biological data are subject to considerable experimental error (noisy data).

If the training data set is not large enough, the data collected may not reflect the complete property space. Consequently, many QSAR results cannot be used to confidently predict the most likely compounds with the best activity.

QSAR is not always reliable. A particular concern is that the 3D structures of ligands bound to the receptor may not be available. A common approach is to use an energy-minimized structure, but that may not represent reality well.

In the present work, QSAR models were developed using TSAR 3.3 software. The results of the antimicrobial studies on the synthesized compounds were converted to an inverse log scale. The data for the synthesized compounds were divided into two sets, namely a training set and a test set for external validation. From the QSAR equation generated, the activity of a new set of compounds could be predicted.
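The data preparation described above can be sketched in generic Python. The example assumes that the inverse log scale means the negative base-10 logarithm of a molar concentration and that every k-th compound is held out for external validation; both are assumptions, the concentration values are hypothetical, and the sketch does not reproduce the procedure of TSAR 3.3.

```python
import math

def to_negative_log(concentrations):
    """Convert molar concentrations to -log10 values (larger = more active)."""
    return [-math.log10(c) for c in concentrations]

def split_every_kth(items, k):
    """Hold out every k-th item as the test set; the rest form the training set."""
    train = [x for i, x in enumerate(items) if i % k != k - 1]
    test = [x for i, x in enumerate(items) if i % k == k - 1]
    return train, test

# Hypothetical inhibitory concentrations in mol/L for ten compounds.
concentrations = [1e-5, 2e-5, 5e-6, 1e-4, 3e-5, 8e-6, 1e-6, 4e-5, 6e-5, 2e-6]
activities = to_negative_log(concentrations)
train, test = split_every_kth(list(range(len(activities))), k=5)
print(train, test)
```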