Rational Drug Design And Molecular Modeling Biology Essay


Drug discovery is a multidisciplinary endeavour in which drugs are designed and/or discovered. The R&D expenditure incurred in bringing a new chemical entity (NCE) to the end of phase III clinical trials is currently estimated at around $1.3 billion in the US. The chief contributors to this cost increase are the decline in the efficiency of converting leads into preclinical candidates from 75% to 50%, and the fall in the rate of progression of compounds from phase 2 to phase 3 clinical trials from 50% to 30%.79 Despite advances in technology and in the understanding of biological systems, drug discovery remains a long process (~15 years) with a low rate of new discoveries. This scenario demands alternative procedures and techniques that cut down both the cost and the time involved while simultaneously increasing the success rate.

A glance at the history of drugs shows that many early discoveries in the pharmaceutical industry were serendipitous; they fail to explain why a compound is active or inactive or how it may be improved. New knowledge of physiological mechanisms has made it possible to take a mechanistic approach and start from a rationally argued hypothesis to design new chemical entities (NCEs). The concept of RDD can be traced to the findings of Paul Ehrlich (chemoreceptor) and Emil Fischer (lock and key model) in 1872 and 1894 respectively.80 Advances in molecular biology, protein crystallography and computational chemistry since the 1980s have greatly aided RDD paradigms. The advent and development of combinatorial chemistry and high throughput screening (HTS) in the 1990s led to a paradigm shift in drug research. Combinatorial chemistry uncorked the chemical bottleneck in drug research, shifting the question in lead optimization from "what can we make" to "which should we make".81 In parallel, HTS makes it possible to screen huge libraries of molecules within a short time span.82 Nevertheless, the initial euphoria that designated these techniques as universal lead generators subsided as a result of the considerable costs involved and disappointingly low hit rates.83 The lessons learnt from these strategies call for a complete shift of drug research paradigms from an empirical science to the structure-based analysis of macromolecule-ligand interactions. Figure 2.1 shows a flow chart of the different approaches through which RDD evolves NCEs with greater biological activity.

Figure 2.1: Different Approaches of Rational Drug Design

Role of Computer Aided Molecular Design in Drug Discovery

It was recognized in the 1960s that computer-based methods can help in the discovery of new leads and can potentially eliminate the chemical synthesis and screening of many irrelevant compounds. An ideal computational method for lead discovery should generate structurally diverse leads rapidly and should give estimates of binding affinities that correlate with experimental values. Generation of chemical diversity in silico is easily achieved using existing computational resources and algorithms: putative ligands can either be extracted from large databases of compounds, or they can be "grown" computationally by joining molecular fragments. On the other hand, accurate prediction of binding affinities has proved a more difficult task.84 Because of the multitude of energetic and entropic factors involved, the thermodynamics of binding cannot be modeled analytically without first simplifying the problem.85 Computational methods that attempt to design leads vary in the nature and degree of the simplifying assumptions they use.

Approaches to RDD

The state of the art in RDD, or Computer Aided Molecular Design (CAMD), can be divided into two broad categories, analog-based studies and structure-based studies, depending on the availability of the three-dimensional structure of the target.

Analog Based Studies

In a broad sense, analog-based studies gather information from already existing drugs/ligands that are active against the target biological molecule (protein or DNA/RNA) of interest. Based on this information, a set of rules is framed to either design a new ligand or modify an existing ligand in order to enhance its biological activity.

Quantitative Structure Activity Relationship (QSAR)

QSAR is one of the most widely used analog-based methods. It aims at correlating the structural features of a series of known compounds with their biological activities. From these correlations, empirical equations are derived and subsequently used to guide the design of new leads. Early QSAR methods related biological activity to the presence (or absence) of functional groups in a series of structurally related compounds (Free-Wilson model), or to the physicochemical properties (lipophilicity, electronic properties) of the compounds in the training set (Hansch analysis). More recently, three-dimensional QSAR methods have been developed. Although QSAR models reproduce the binding affinities of ligands more accurately than other methods, they have three major shortcomings:86 (i) a sufficient number of ligands active against the target of interest must be available to develop the structure-activity relationships; (ii) the equations parameterized for one target do not apply to another; and (iii) they are of limited use in understanding the nature of protein-ligand interactions and the thermodynamics of binding.
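A Hansch-type analysis reduces to a multiple linear regression of activity against physicochemical descriptors. The sketch below fits such a model to a small hypothetical training set; the descriptor values and activities are invented purely for illustration:

```python
import numpy as np

# Hypothetical Hansch-type training set: lipophilicity (logP), an
# electronic descriptor (sigma) and measured activity (pIC50) for six
# congeneric compounds. Values are invented for illustration.
logP  = np.array([1.2, 1.8, 2.4, 2.9, 3.5, 4.1])
sigma = np.array([0.10, 0.25, 0.05, 0.40, 0.20, 0.35])
pIC50 = np.array([5.1, 5.6, 6.0, 6.5, 6.9, 7.4])

# Fit pIC50 = a*logP + b*sigma + c by least squares.
X = np.column_stack([logP, sigma, np.ones_like(logP)])
coeffs, *_ = np.linalg.lstsq(X, pIC50, rcond=None)
a, b, c = coeffs

predicted = X @ coeffs
r2 = 1 - np.sum((pIC50 - predicted) ** 2) / np.sum((pIC50 - pIC50.mean()) ** 2)
print(f"pIC50 = {a:.2f}*logP + {b:.2f}*sigma + {c:.2f}  (r2 = {r2:.3f})")
```

The fitted equation can then be used to estimate the activity of an untested analog from its descriptors, which is exactly how such empirical models guide lead design.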

Pharmacophore Modeling

Pharmacophore modeling is another analog-based method. The word "pharmacophore" was coined by Paul Ehrlich in the early 1900s, referring to a molecular framework that carries (phoros) the essential features responsible for a drug's (pharmacon) biological activity. Later, in 1977, Peter Gund redefined it as "a set of structural features in a molecule that is recognized at a receptor site and is responsible for that molecule's biological activity". Pharmacophore models are constructed from molecules of known biological activity and are refined as more data are acquired in an iterative process. Alternatively, a pharmacophore can also be generated from the receptor structure. Going one step further, the dynamic pharmacophore model, based on molecular dynamics trajectories, takes care of binding site dynamics.87 These models can be used for optimizing known ligands or for screening databases to find potential novel leads suitable for further development.88

Pharmacophore features

Hydrogen bond acceptor

Hydrogen bond donor


Hydrophobic aliphatic

Hydrophobic aromatic

Positive ionizable

Negative ionizable

Ring aromatic

Manual Pharmacophore Generation: Visual Pattern Recognition

Steps involved in the development of Pharmacophore model

Visual identification of common structural and chemical features among the active molecules and those features missing in the inactive ones

Measurement of the 3D relationships of the common features with respect to each other

Development of a draft Pharmacophore and validation of the model so that the Pharmacophore fits the active compounds and fails to fit the inactive ones

Refinement of the Pharmacophore model by applying it to a database of compounds with known activity, until the desired result is obtained.

A particular compound can be inactive because:

i. It does not contain groups in the geometry required for recognition by the biomolecular target; that is, it does not match the required Pharmacophore.

ii. Although it contains the Pharmacophore, it also contains groups that interfere with recognition and that can be detected by a subsequent QSAR.

iii. Its bioactive conformation is energetically less favourable than its preferred conformation.

iv. It contains groups that sterically prevent interaction with the target biomolecule, another potency-decreasing property that can be detected by QSAR.

In the absence of a crystallographic structure of a protein for which the active site for receptor binding is clearly identified, the medicinal chemist must rely on the structure activity for a given set of ligands. If these ligands are known to bind to the same receptor, then one can attempt to define the commonality between them.

The Pharmacophore model has been used in the following applications:

Testing ideas (find new candidate molecule)

Prioritizing leads

Designing compound libraries

Predicting activity

Using the resulting alignments for additional studies

Search queries for database mining

Accelrys' Catalyst can generate two types of chemical feature based models, or hypotheses, depending on the availability of biological activities.89

If activity data are included, then Catalyst - HypoGen is used. Feature based models derived by HypoGen have been successfully used to suggest new directions in lead generation and discovery, and to search databases for new structural classes of potential lead candidates.

When no activity is considered during hypothesis building, and only common chemical features are required, then Catalyst - HipHop can be used for this purpose.

HipHop: Common feature based alignments

A Pharmacophore model, or hypothesis, consists of a three dimensional configuration of chemical functions surrounded by tolerance spheres. A tolerance sphere defines the region of space that should be occupied by a specific type of chemical functionality. Each chemical function is assigned a weight, which describes its relative importance within the hypothesis. A larger weight indicates that the feature is more important in conferring activity than the other constituent parts of the hypothesis.

When the ligand set is small (fewer than 15 compounds) and sufficient biological activity data are absent, one can build a model based on common feature alignments using Catalyst/HipHop.89, 90 HipHop aligns a set of molecules based on their common chemical features.

HipHop identifies configurations, or three-dimensional spatial arrangements of chemical features, that are common to the molecules in the training set. The configurations are identified by a pruned exhaustive search, starting with small sets of features and extending them until no larger common configuration is found. The user defines the number of molecules that must map completely or partially to a hypothesis; this option allows broader and more diverse hypotheses to be generated. A hypothesis that maps the active compounds well is given a higher rank, and one that maps them poorly is ranked lower.
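The core idea of the common-feature search can be illustrated with plain sets (the 3D arrangement and tolerances that HipHop also checks are ignored here for brevity; molecules and feature codes are hypothetical):

```python
from itertools import combinations

# Toy training set: each molecule is reduced to the set of pharmacophore
# feature types it can present (HBA = H-bond acceptor, HBD = H-bond donor,
# HYD = hydrophobe, RA = ring aromatic, PI = positive ionizable).
molecules = {
    "mol1": {"HBA", "HBD", "HYD", "RA"},
    "mol2": {"HBA", "HBD", "RA", "PI"},
    "mol3": {"HBA", "HBD", "HYD", "RA"},
}

common = set.intersection(*molecules.values())

# Enumerate candidate configurations from the common features, largest
# first, mimicking a pruned exhaustive search.
candidates = [combo
              for size in range(len(common), 0, -1)
              for combo in combinations(sorted(common), size)]

print("common features:", sorted(common))
print("largest candidate configuration:", candidates[0])
```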


Principal

specifies the reference molecule(s); reference configuration models are potential centers for hypotheses

0 - do not consider this molecule

1 - consider configurations of this molecule

2 - use this compound as a reference molecule

used only for HipHop hypothesis generation

Max Omit Features

specifies how many features of each compound may be omitted

0 - all features must map to generated hypotheses

1 - all but one feature must map to generated hypotheses

2 - features need not map to generated hypotheses

used only for HipHop hypothesis generation

HypoGen: Quantitative Pharmacophore Models

HypoGen creates SAR hypothesis models from a set of molecules for which activity values are known. It selects pharmacophores that are common among the active compounds but not among the inactive compounds, and then optimizes them using simulated annealing. The top pharmacophores can be used to predict the activity of unknown compounds or to search for possible new leads in 3D chemical databases.91, 92

HypoGen generates hypotheses that are set of features in 3D space, each containing a certain tolerance and weight that fit to the features of the training set, and that correlate to the activity data.

The hypotheses are created in three phases: a constructive, a subtractive and an optimization phase.

The constructive phase identifies hypotheses that are common among the active compounds, the subtractive phase removes hypotheses that are common among the inactive compounds, and the optimization phase attempts to improve the initial hypotheses. The resulting hypothesis models consist of a set of generalized chemical features in three-dimensional space together with regression information. The models can therefore be used as search queries to mine for potential leads (Figure 2.2) from a three-dimensional database, or in the form of an equation to predict the activity of a potential lead.
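The first two phases amount to keeping feature combinations shared by the actives and discarding those that also fit the inactives. A toy sketch, treating each hypothesis as a plain set of feature types (real hypotheses are 3D configurations; all data are hypothetical):

```python
from itertools import combinations

# Hypothetical training set, reduced to feature-type sets.
actives   = [{"HBA", "HBD", "RA"}, {"HBA", "HBD", "RA", "HYD"}]
inactives = [{"HBA", "RA"}, {"HBA", "HYD"}, {"RA"}]

def subsets(s, max_size=5):
    """All non-empty feature combinations of s, up to max_size features."""
    for k in range(1, min(len(s), max_size) + 1):
        for c in combinations(sorted(s), k):
            yield frozenset(c)

# Constructive phase: hypotheses found in the most active compound that
# also fit every other active compound are kept.
hypotheses = {h for h in subsets(actives[0]) if all(h <= a for a in actives)}

# Subtractive phase: hypotheses common to most inactives are removed.
def fraction_inactives_matched(h):
    return sum(h <= i for i in inactives) / len(inactives)

surviving = {h for h in hypotheses if fraction_inactives_matched(h) < 0.5}
print(f"{len(hypotheses)} constructed, {len(surviving)} survive subtraction")
```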

Figure 2.2: Lead optimization using Pharmacophores

Running HypoGen

Ideal Training set

Should contain at least 16 compounds to assure statistical power

Activities should span 4 orders of magnitude

Each order of magnitude should contain 3-4 compounds

No redundant information & No excluded volume problems

HypoGen proceeds in three phases: a constructive, a subtractive and an optimization phase (Figure 2.3).

HypoGen calculates the cost of two theoretical hypotheses, one in which the cost is minimal (Fixed cost), and one where the cost is high (Null cost). Each optimized hypothesis cost should have a value between these two values and should be closer to the Fixed than the Null cost.

Randomized studies have found that if a returned hypothesis has a cost that differs from the Null hypothesis by 40-60 bits, it has a 75-90% chance of representing a true correlation in the data.

Another useful number is the Entropy of hypothesis space. If this is less than 17, a thorough analysis of all the models will be carried out.

Constructive phase

The constructive phase is very similar to the HipHop algorithm. It is done in several steps:

All active compounds are identified

All hypotheses (maximum 5 features) among the two most active compounds are identified and stored

Those that fit the remaining active compounds are kept

Subtractive phase

In this phase, the program removes hypotheses from the data structure that are not likely to be useful. The hypotheses that were created in the constructive phase are inspected and if they are common to most of the inactive compounds then they are removed from consideration.

Figure 2.3: Hypotheses generation in Catalyst

Optimization phase

The optimization is done using the well-known simulated annealing algorithm. The algorithm applies small perturbations to the hypotheses created in the constructive and subtractive phases in an attempt to improve the score.
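A minimal sketch of simulated annealing on a single hypothesis parameter, with a toy cost landscape standing in for the Catalyst cost function (all numbers are illustrative):

```python
import math, random

random.seed(0)

# cost() is a hypothetical stand-in for the Catalyst cost function over a
# single hypothesis parameter x (e.g. one feature coordinate); lower is
# better, and the landscape is deliberately rugged.
def cost(x):
    return (x - 2.0) ** 2 + 0.3 * math.sin(8.0 * x)

x = best = 0.0
temperature = 1.0
while temperature > 1e-3:
    candidate = x + random.uniform(-0.2, 0.2)     # small perturbation
    delta = cost(candidate) - cost(x)
    # Metropolis criterion: accept improvements always, and worse moves
    # with a temperature-dependent probability (to escape local minima).
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
        if cost(x) < cost(best):
            best = x
    temperature *= 0.995                          # cooling schedule

print(f"best x = {best:.2f}, cost = {cost(best):.3f}")
```

At high temperature the walk explores broadly; as the temperature falls, only improving perturbations tend to be accepted and the hypothesis settles into a low-cost configuration.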


The HypoRefine algorithm is an extension of the Catalyst HypoGen algorithm for generating SAR-based pharmacophore models that can be used to estimate the activities of new compounds. HypoRefine improves the predictive models generated from a dataset by better correlating the hypothesis with the steric properties that contribute to biological activity. In addition, HypoRefine can help overcome the over-prediction of inactive compounds that share pharmacophore features with active compounds in the dataset but whose inactivity is due to steric clashes with the target.

Interpreting the cost parameters in the output files

During an automated hypothesis generation run, Catalyst considers and discards many thousands of models. It distinguishes between alternatives by applying a cost analysis. The overall assumption is based on Occam's razor: between equivalent alternatives, the simplest model is best. In general, if the difference between a hypothesis cost and the null cost is greater than 60 bits, there is an excellent chance that the model represents a true correlation. Since most returned hypotheses are higher in cost than the fixed cost model, a difference between fixed cost and null cost of 70 bits or more is necessary to achieve the 60 bit difference.

Fixed cost

Cost of the simplest possible hypothesis (initial)

Null cost

The cost of a hypothesis with no features, in which the activity of each molecule is estimated as the mean activity of the training set

Weight cost

A value that increases in a Gaussian form as the feature weight in a model deviates from an idealized value of 2.0. This cost factor favors hypotheses in which the feature weights are close to 2. The standard deviation of this parameter is given by the weight variation parameter.

Error cost

A value that increases as the rms difference between estimated and measured activities for the training set molecules increases. This cost factor is designed to favor models for which the correlation between estimated and measured activities is better. The standard deviation of this parameter is given by the uncertainty parameter.

Configuration cost

A fixed cost that depends on the complexity of the hypothesis space being optimized. It is equal to the entropy of the hypothesis space and is constant among all the hypotheses.
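Putting the cost components together is simple arithmetic. All values below are in bits and are hypothetical, chosen only to illustrate the 40-60 bit rule of thumb:

```python
# Hypothetical cost values (in bits) illustrating how the components are
# combined and compared against the fixed and null costs.
fixed_cost = 85.0           # simplest possible hypothesis
null_cost  = 160.0          # no-feature, mean-activity hypothesis

weight_cost        = 2.1    # deviation of feature weights from 2.0
error_cost         = 78.4   # RMS error between estimated and measured
configuration_cost = 15.2   # entropy of the hypothesis space (constant)

total_cost = weight_cost + error_cost + configuration_cost
difference_from_null = null_cost - total_cost

# More than 60 bits below the null cost: an excellent chance that the
# hypothesis represents a true correlation.
print(f"total = {total_cost:.1f} bits, "
      f"null - total = {difference_from_null:.1f} bits")
```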

The main assumption made by HypoGen is that an active molecule should map more features than an inactive molecule. In other words, a molecule is inactive because (a) it misses an important feature or (b) the feature is present but cannot be oriented correctly in space. Based on this assumption, the most active molecule in the dataset should map to all the features of the generated hypotheses.

Metric for Analyzing Hit Lists and Pharmacophores

Validity of the pharmacophore model is determined by its ability to retrieve known active molecules from the various known databases (Figure 2.4).

Figure 2.4: Database searching using pharmacophore models

D = Total number of compounds in database,

A = Number of active compounds in database,

Ht = Number of compounds in search hit list and

Ha = Number of active compounds in hit list

Pharmacophore Validation

Percent yield of actives:

% Y = Ha / Ht x 100

Percent ratio of the activities in the hit list:

% A = Ha / A x 100

Enrichment (enhancement):

E = (Ha / Ht) / (A / D) = (Ha × D) / (Ht × A)

False negatives: A - Ha

False positives: Ht - Ha

The best hit list is obtained when there is perfect overlap of the hit list to the known active compounds in the database. This occurs when both conditions Ha = Ht and Ha = A, hence Ha = Ht= A, are satisfied, which is a nearly impossible case to achieve in a real-life situation.

In reality, there may be many compounds in the database that may be active but either have not been listed as active, or have not been tested for specific activity. In either case, these compounds end up in the "False positives" list. Hence we consider the list of false positives as opportunities for potential leads. The objective is to improve the hit list in such a manner that the false positives can contain a large number of potential leads.

The "False negatives" list contains the active molecules that the search fails to retrieve from the database.

The best hit list is the one that retrieves all the actives and nothing else (i.e., Ht = Ha= A); False negatives = 0, false positives = 0.

The worst list is the one that retrieves everything else but the known actives in the database (i.e., Ha = 0, Ht = D-A) False negatives = A, false positives = D-A.

The GH (goodness of hit) score gives a good indication of how good the hit list is with respect to a compromise between maximum yield and maximum percentage of actives retrieved. Table 2.1 provides a sorting of hit lists, from best to worst, via the GH score. The Goodness of Hit formula is a convenient way to quantify hit lists obtained from searches with various queries.
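The metrics defined above are straightforward to compute for a given search. The counts below are hypothetical, and the GH formula shown is the commonly used Güner-Henry form, assumed here since the text quotes only its interpretation:

```python
# Hypothetical search result used to illustrate the metrics.
D  = 1000   # total compounds in the database
A  = 20     # known actives in the database
Ht = 50     # compounds in the search hit list
Ha = 15     # actives in the hit list

percent_yield   = 100.0 * Ha / Ht        # % Y
percent_actives = 100.0 * Ha / A         # % A
enrichment      = (Ha / Ht) / (A / D)    # E = (Ha x D) / (Ht x A)
false_negatives = A - Ha
false_positives = Ht - Ha

# Guner-Henry (GH) score in its commonly used form; it balances yield
# against the fraction of actives retrieved.
GH = (Ha * (3 * A + Ht)) / (4 * Ht * A) * (1 - (Ht - Ha) / (D - A))

print(f"%Y={percent_yield:.0f}  %A={percent_actives:.0f}  "
      f"E={enrichment:.0f}  GH={GH:.3f}")
```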


[Table residue: the original table reported % Y, % A, false negatives, false positives and GH scores for four illustrative hit lists - Typical Good, Extreme Y, Extreme A and Typical Bad; the numeric entries were lost in extraction.]
Table 2.1: Goodness of Hit score values

Structure Based Studies

Structure-based approaches, which build on the three-dimensional structure of the target, overcome many of the limitations of analog-based studies. These methods help to develop a general theoretical description of protein-ligand interactions that enables the a priori design of new leads for a particular biological target.93 The first success story in structure-based design is the antihypertensive drug Captopril, an inhibitor of Angiotensin Converting Enzyme (ACE).94 Table 2.2 lists other examples of drugs derived from structure-based approaches. The different approaches used for structure-based design are as follows:




[Table residue: the original table listed drugs with their trade names against the targets Carbonic Anhydrase II, 5-Hydroxytryptamine 1B, Angiotensin II, EGFR kinase, Bcr-Abl kinase and HIV reverse transcriptase, including candidates in Phase II* and Phase III* (* in clinical trials); the drug and trade names were lost in extraction.]

Table 2.2: List of drugs resulting from structure based studies


The number of proteins with a known three-dimensional structure is increasing rapidly, and structures produced by structural genomics initiatives are beginning to become publicly available.95, 96 The increase in the number of structural targets is in part due to improvements in techniques for structure determination, such as high throughput X-ray crystallography.97 With large-scale structure-determination projects driven by genomics consortia, many target proteins have been selected for their therapeutic potential.98, 99

Docking, in the true sense, is the formation of non-covalent protein-ligand complexes in silico. Given the structures of a protein and a ligand, the task is to predict the structure of the complex. Conceptually, docking is an energy optimization process concerned with the search for the lowest free energy binding mode of a ligand within a protein binding site. Docking comprises two components: pose searching and scoring. Inclusion of protein flexibility is computationally expensive; therefore most existing docking programs treat the protein either as rigid or allow flexibility only in the side chain functional groups. Ligand handling, on the other hand, can be broadly classified into the whole-molecule approach (variants 1 and 2 in Figure 2.5) and the fragment-based approach (variant 3). A good docking method estimates the forces involved in protein-ligand recognition, viz. electrostatic, van der Waals and hydrogen bonding interactions, and places the ligand appropriately in the active site.100 Table 2.3 lists the existing docking methodologies and the strategies they use.

Figure 2.5: Strategies for flexible ligand docking

Searching Algorithm: Description (example programs)

Monte Carlo: Stochastic method of generating conformations; selection is based on the Metropolis criterion (LigandFit)

Simulated Annealing: Random thermal motions are induced, through high temperatures, to explore the local search space; the system is then driven to a minimum energy conformation by decreasing the temperature. SA is usually combined with MC

Genetic Algorithm: Based on Darwinian principles of evolution. A 'chromosome' encoding the model parameters (such as torsion angles) is varied stochastically; populations are generated through genetic operations (crossover, mutation, migration) and the fittest survive

Tabu search: Stochastic method of generating conformations that keeps a record of previously visited conformations (the tabu list); a generated conformation is retained if it is not tabu or if it scores better than those in the tabu list

Incremental construction: Systematic method in which the ligand is broken into rigid fragments at rotatable bonds; the fragments are docked in all possible ways and assembled piecewise to regenerate the ligand (FlexX, DOCK, HOOK, LUDI, Hammerhead)

Matching methods: Based on the clique detection technique from graph theory; ligand atoms are matched to complementary atoms in the receptor

Simulation methods: Molecular dynamics simulations are used to generate conformations
Table 2.3: Different search strategies for docking


In principle, a scoring function used in docking is a mathematical function whose values are proportional to the binding affinities of the leads. A good scoring function should give reliable estimates of the binding affinities of structurally diverse leads for different protein targets while accounting for the thermodynamic aspects of binding.101 Scoring functions are of three types:100 (1) empirical, (2) force field based and (3) knowledge based functions. Empirical scoring functions are regression-based functions derived from a large sample of crystal structures with known affinities for the bound ligands. These functions reflect a best fit with respect to the training set used but rarely achieve generality. Force field based methods are first-principles methods that use force field parameters to score the van der Waals and electrostatic interactions between protein and ligand. The score includes the receptor-ligand interaction energy and the internal ligand energy (such as steric strain induced by binding). These methods do not require calibration or training with experimental binding data. Knowledge based methods evaluate the frequencies of particular types of interaction, i.e. the mutual distances between particular types of atoms across the interface, in databases of protein-ligand complexes. The sample distribution describes the probability of occurrence of an interaction and is compared with a reference mean. Any deviation from the mean is translated into statistical preferences and related to energies in a Boltzmann-like fashion.
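The Boltzmann-like inversion used by knowledge-based functions can be sketched in a few lines. The pair counts and reference value below are hypothetical, chosen only to show how an enriched distance becomes a favorable energy:

```python
import math

# Observed counts of a hypothetical atom-pair type per distance bin (in
# angstroms), against a uniform reference count per bin.
kT = 0.593  # kcal/mol at ~298 K

observed_counts = {2.8: 120, 3.0: 150, 3.2: 90, 3.6: 40, 4.0: 30}
reference_count = 86.0  # mean count per bin (the reference state)

# Boltzmann-like inversion: frequency ratios become energies.
potential = {r: -kT * math.log(n / reference_count)
             for r, n in observed_counts.items()}

# Distances enriched over the reference get favorable (negative) energies;
# depleted distances get unfavorable (positive) ones.
best_r = min(potential, key=potential.get)
print(f"most favorable distance: {best_r} A ({potential[best_r]:.2f} kcal/mol)")
```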

Docking algorithms used in the study



A. GOLD

B. GLIDE

C. LigandFit


GOLD102 (Genetic Optimization for Ligand Docking) is an automated ligand docking program that uses a genetic algorithm to explore the full range of ligand conformational flexibility with partial flexibility of the protein, and satisfies the fundamental requirement that the ligand must displace loosely bound water on binding.

GOLD scoring functions

1. A scoring function to rank different binding modes; the Goldscore function is a molecular mechanics-like function with four terms:

GOLD Fitness = Shb_ext + Svdw_ext + Shb_int + Svdw_int, (1)

where Shb_ext is the protein-ligand hydrogen-bond score and

Svdw_ext is the protein-ligand van der Waals score.

Shb_int is the contribution to the Fitness due to intramolecular hydrogen bonds in the ligand; this term is switched off in all calculations presented in this work (this is the GOLD default, and generally gives the best results);

Svdw_int is the contribution due to intramolecular strain in the ligand.

2. A mechanism for placing the ligand in the binding site: GOLD uses a unique method based on fitting points; it adds fitting points to hydrogen bonding groups on the protein and ligand, and maps acceptor points on the ligand onto donor points in the protein and vice versa. Additionally, GOLD generates hydrophobic fitting points in the protein cavity onto which ligand CH groups are mapped.

3. A search algorithm to explore possible binding modes; GOLD uses a genetic algorithm (GA) in which the following parameters are modified/optimized: (a) dihedrals of ligand rotatable bonds; (b) ligand ring geometries (by flipping ring corners); (c) dihedrals of protein OH groups and NH3 groups; and (d) the mappings of the fitting points (i.e., the position of the ligand in the binding site). Of course, at the start of a docking run, all these variables are randomized.
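The GA machinery described above can be sketched in miniature. Here a chromosome encodes only three ligand torsions and the fitness function is a toy stand-in with a known optimum; none of this reproduces GOLD's actual operators or parameters:

```python
import random

random.seed(1)

# Hypothetical "docking score": a chromosome of three dihedrals, scored by
# closeness to an assumed optimal geometry of (60, 180, -60) degrees.
TARGET = (60.0, 180.0, -60.0)

def angular_error(a, t):
    d = abs(a - t) % 360.0
    return min(d, 360.0 - d)          # shortest angular distance

def fitness(chrom):
    return -sum(angular_error(a, t) ** 2 for a, t in zip(chrom, TARGET))

def random_chrom():
    return [random.uniform(-180.0, 180.0) for _ in range(3)]

def mutate(chrom):
    c = chrom[:]
    i = random.randrange(3)           # perturb one torsion, wrap to range
    c[i] = ((c[i] + random.gauss(0.0, 15.0) + 180.0) % 360.0) - 180.0
    return c

def crossover(a, b):
    cut = random.randrange(1, 3)      # single-point crossover
    return a[:cut] + b[cut:]

population = [random_chrom() for _ in range(40)]
initial_best = max(fitness(c) for c in population)

for _ in range(150):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]         # elitism: the fittest survive
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(30)]
    population = parents + children

final_best = max(fitness(c) for c in population)
```

Because the fittest chromosomes are carried over unchanged each generation, the best score never degrades as the search proceeds.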

ChemScore was derived empirically from a set of 82 protein-ligand complexes for which measured binding affinities were available.103, 104

Unlike GoldScore, the ChemScore function was trained by regression against measured affinity data, although there is no clear indication that it is superior to GoldScore in predicting affinities.

ChemScore estimates the total free energy change that occurs on ligand binding as:

ΔGbinding = ΔG0 + ΔGhbond + ΔGmetal + ΔGlipo + ΔGrot

Each component of this equation is the product of a term dependent on the magnitude of a particular physical contribution to free energy (e.g. hydrogen bonding) and a scale factor determined by regression, i.e.

ΔGi = ni × Pi

Here, the n terms are the regression coefficients and the P terms represent the various types of physical contributions to binding.

The final ChemScore value is obtained by adding in a clash penalty and internal torsion terms, which militate against close contacts in docking and poor internal conformations. Covalent and constraint scores may also be included.
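In code, an empirical score of this form is just a regression-weighted sum of physical terms plus penalties. The coefficients and term values below are hypothetical placeholders, not the published ChemScore parameters:

```python
# Hypothetical regression coefficients n_i and physical terms P_i for one
# docked pose; illustrative numbers only.
n = {"hbond": -3.4, "metal": -6.0, "lipo": -0.17, "rot": 2.6}
P = {"hbond": 2.0,   # two near-ideal hydrogen bonds
     "metal": 0.0,   # no metal coordination
     "lipo": 45.0,   # lipophilic contact term
     "rot": 3.0}     # frozen rotatable bonds (a penalty term)

delta_g = sum(n[k] * P[k] for k in n)   # dG = sum_i n_i * P_i

clash_penalty = 0.8       # militates against close contacts
internal_torsion = 0.4    # militates against poor internal conformations
score = delta_g + clash_penalty + internal_torsion

print(f"dG estimate = {delta_g:.2f}, final score = {score:.2f}")
```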



GLIDE (Grid-based ligand docking with energetics) calculations are performed with Impact version 3.5 (Schrodinger, LLC). The grid generation step requires Maestro input files of both the ligand and the active site, including hydrogen atoms.

The protein charged groups that were neither located in the ligand-binding pocket nor involved in salt bridges were neutralized using the Schrodinger pprep script. The center of the grid enclosing box was defined by the center of the bound ligand as described in the original PDB entry. The enclosing box dimensions, which are automatically deduced from the ligand size, fit the entire active site. For the docking step, the size of bounding box for placing the ligand center was set to 12 Å. A scaling factor of 0.9 was applied to van der Waals radii of ligand atoms.

Scoring Functions of GLIDE 105, 106

GLIDE 3.5 employs two forms of GlideScore:

GLIDE Score 3.5 SP, used by Standard-Precision Glide;

GLIDE Score 3.5 XP, used by Extra-Precision Glide.

These functions use similar terms but are formulated with different objectives in mind. Specifically, GLIDE score 3.5 SP is a "softer", more forgiving function that is adept at identifying ligands that have a reasonable propensity to bind, even in cases in which the GLIDE pose has significant imperfections. This version seeks to minimize false negatives and is appropriate for many database screening applications. In contrast, GLIDE score 3.5 XP is a harder function that exacts severe penalties for poses that violate established physical chemistry principles such as that charged and strongly polar groups be adequately exposed to solvent. This version of GLIDE score is more adept at minimizing false positives and can be especially useful in lead optimization.

GLIDE score 3.5 modifies and extends the ChemScore function as follows:

The lipophilic-lipophilic term is defined as in ChemScore. The hydrogen-bonding term also uses the ChemScore form but is separated into differently weighted components that depend on whether the donor and acceptor are both neutral, one is neutral and the other is charged, or both are charged. In the optimized scoring function, the first of these contributions is found to be the most stabilizing and the last, the charged-charged term, the least important. The metal-ligand interaction term (the fifth term in eq 2) uses the same functional form as in ChemScore but differs in three principal ways. First, this term considers only interactions with anionic acceptor atoms (such as either of the two oxygens of a carboxylate group). This modification allows GLIDE to recognize the strong preference for the coordination of anionic ligand functionality to metal centers in metalloproteases. The seventh term, from Schrodinger's active site mapping facility, rewards instances in which a polar but non-hydrogen-bonding atom (as classified by ChemScore) is found in a hydrophobic region.

The second major component is the incorporation of contributions from the Coulomb and vdW interaction energies between the ligand and the receptor. The third major component is the introduction of a solvation model. To include solvation effects, GLIDE 3.5 docks explicit waters into the binding site for each energetically competitive ligand pose and employs empirical scoring terms that measure the exposure of various groups to the explicit waters.


Cerius2 LigandFit is designed to dock a ligand or a series of ligand molecules into a protein binding site. During docking, the protein is kept rigid while the ligand remains flexible, allowing different conformations to be searched and docked within the binding site.

The 3 key steps in LigandFit are

Site search

Conformational search

Ligand Fitting

Site search

The aim of the site search is to define the binding site of the protein, whose position and shape will be used in the docking process. The protein is first mapped to a grid. Grid points within a certain user-definable distance from protein atoms are marked as occupied by the protein. The cavity is then defined from the set of all unoccupied grid points. The site can be defined in two ways:

Protein Shape: Sites are defined based on the shape of the protein. An "eraser" algorithm is used to clear all grid points outside the protein.

Docked Ligand: Sites are defined based on a docked ligand. If a docked ligand is present, the unoccupied grid points within a certain user-definable distance of the ligand atoms are collected to form the site.
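The grid-occupancy step described above can be sketched as follows; the grid spacing, bounding-box padding, and occupancy cutoff are illustrative assumptions rather than LigandFit's actual defaults:

```python
import numpy as np

def find_cavity_points(protein_coords, grid_spacing=1.0, occupied_cutoff=2.5):
    """Map the protein onto a grid and return the unoccupied points.
    Grid points within `occupied_cutoff` of any protein atom are marked
    as occupied; the remaining points form the candidate cavity."""
    lo = protein_coords.min(axis=0) - 5.0   # pad the bounding box
    hi = protein_coords.max(axis=0) + 5.0
    axes = [np.arange(a, b, grid_spacing) for a, b in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    # distance from every grid point to its nearest protein atom
    d = np.linalg.norm(grid[:, None, :] - protein_coords[None, :, :], axis=-1)
    return grid[d.min(axis=1) > occupied_cutoff]
```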

Conformational Search

The Monte Carlo method is employed in the conformational search of the ligand. During the search, bond lengths and bond angles are left untouched; only torsion angles (except those in a ring) are randomized. The ligand molecule(s) should therefore be energy-minimized to ensure correct bond lengths and bond angles before use in LigandFit.
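A minimal sketch of such a torsional Monte Carlo search, with the docking score replaced by an arbitrary user-supplied `score_fn` (an assumption of this sketch):

```python
import random

def monte_carlo_torsions(torsions, n_steps, score_fn, max_change=60.0):
    """Randomize one torsion angle per step while bond lengths and
    angles stay fixed; keep the best-scoring torsion set seen.
    Ring torsions are assumed to be excluded from `torsions`."""
    best = list(torsions)
    best_score = score_fn(best)
    current = list(best)
    for _ in range(n_steps):
        trial = list(current)
        i = random.randrange(len(trial))
        trial[i] = (trial[i] + random.uniform(-max_change, max_change)) % 360.0
        s = score_fn(trial)
        if s < best_score:
            best, best_score = list(trial), s
        current = trial   # random walk through torsion space
    return best, best_score
```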

Ligand Fitting

After a new conformation is generated, the fitting is carried out in two steps.

The non-mass-weighted principal moments of inertia (PMI) of the binding site are compared with the non-mass-weighted PMI of the ligand according to equations (a) and (b).

If the value (Fit PMI) is above the threshold, or is not better than previously saved fitting results, no further docking is performed for this conformation; another ligand or another conformation of the same ligand is examined.

If the Fit PMI is better than previously saved results, the ligand is positioned into the binding site according to its PMI.

Rigid body minimization is applied to the saved conformations of the ligand to optimize their positions and docking scores.
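Since equations (a) and (b) are not reproduced here, the sketch below shows only the general idea: compute non-mass-weighted principal moments of inertia for site and ligand and compare their normalized values. The comparison metric is a stand-in, not LigandFit's actual Fit PMI formula:

```python
import numpy as np

def principal_moments(coords):
    """Non-mass-weighted principal moments of inertia: eigenvalues of
    the inertia tensor built with unit masses about the centroid."""
    x = coords - coords.mean(axis=0)
    r2 = (x ** 2).sum(axis=1)
    inertia = np.eye(3) * r2.sum() - x.T @ x
    return np.sort(np.linalg.eigvalsh(inertia))

def fit_pmi(site_coords, ligand_coords):
    """Compare normalized PMIs of site and ligand; 0 means identical
    shape ratios (a stand-in for LigandFit's equations (a) and (b))."""
    ps = principal_moments(site_coords)
    pl = principal_moments(ligand_coords)
    return float(np.abs(ps / ps.sum() - pl / pl.sum()).sum())
```

Because the centroid is subtracted, the measure is invariant to translation, which is why only shape (not position) enters the initial comparison.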

De Novo Ligand Design

De novo design uses structural information to "grow" a molecule in the active site by sequentially adding or joining molecular fragments, instead of drawing on libraries of existing compounds.107 Structure sampling is carried out by different methods such as linking, growing, lattice-based sampling, random structure mutation, transitions driven by molecular dynamics simulations, and graph-based sampling. Figure 2.6 gives a schematic description of a few of these strategies. Apart from these, a ligand can also be built by recombination of bioactive conformations of known ligands for a particular target. Recombination is carried out by overlaying the known ligands and swapping fragments between different ligands. This procedure is applied recursively, so that compounds emerging from recombination are added to the pool of known actives and participate in subsequent cycles of recombination. The greatest advantage of de novo design is its ability to develop novel scaffolds utilizing the whole of chemical space.108 However, the method also suffers from limitations: (1) synthetic feasibility is not considered while constructing structures and (2) the prediction of binding affinities for the designed structures is not very accurate.

Figure 2.6: Strategies of de novo ligand design
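The "growing" strategy can be caricatured as a greedy loop: starting from a seed fragment, attach whichever fragment most improves the score. Fragments here are opaque labels and `score_fn` is a hypothetical scoring function; a real de novo tool must also handle attachment chemistry and 3-D geometry:

```python
def grow_ligand(seed, fragments, score_fn, max_fragments=4):
    """Greedy fragment growing: extend the molecule while some fragment
    improves the (lower-is-better) score, up to `max_fragments` pieces."""
    molecule = [seed]
    score = score_fn(molecule)
    while len(molecule) < max_fragments:
        trials = [(score_fn(molecule + [f]), f) for f in fragments]
        best_score, best_frag = min(trials)
        if best_score >= score:   # no fragment improves the score: stop
            break
        molecule.append(best_frag)
        score = best_score
    return molecule, score
```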

Virtual Screening (VS)

VS is a knowledge-driven process that uses computational chemistry techniques to analyze large chemical databases in order to identify possible new leads. VS is used as an initial screen for large databases to prune the number of compounds that must be screened experimentally.109 This process of finding 'needles in a haystack' produces leads that might otherwise not have surfaced and therefore adds immense value to the early drug discovery stages. VS protocols include ligand-based screens such as 1D filters (e.g. molecular weight), 2D filters (similarity, substructure fingerprints), and 3D filters (3D pharmacophore, 3D shape matching), and structure-based screens such as docking.110 The potential sources of error contributing to the identification of false positives and false negatives in VS include: (1) approximations in the scoring functions employed; (2) improper solvation terms; (3) neglect of protein flexibility; and (4) poor assessment of the protonation states of active-site residues or ligands.111 Significant improvements in VS have been made by consensus scoring over multiple scoring functions and by clustering docking poses from multiple docking tools before scoring.99, 112
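A toy two-stage ligand-based screen combining a 1D molecular-weight filter with a 2D fingerprint-similarity filter. Fingerprints are represented here as sets of 'on' bit positions, and the thresholds are illustrative:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as
    sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def prefilter(library, query_fp, mw_max=500.0, sim_min=0.7):
    """Keep compounds that pass the molecular-weight (1D) filter and
    are fingerprint-similar (2D) to the query.  `library` holds
    (name, molecular_weight, fingerprint) tuples."""
    return [name for name, mw, fp in library
            if mw <= mw_max and tanimoto(fp, query_fp) >= sim_min]
```

Survivors of such cheap filters would then proceed to the more expensive 3D and docking screens.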

Molecular Dynamics

Molecular dynamics (MD) simulations are widely used to obtain information on the time evolution of conformations of biological molecules, together with the associated kinetic and thermodynamic properties. The basic feature of molecular dynamics is the calculation of a trajectory of the molecule, i.e. a series of structures at regular time steps in which the system moves under the influence of the forces acting on the atoms.113 These forces are calculated as the negative first derivative of the potential function with respect to the atomic positions. By applying Newton's equations of motion, the forces can then be used to calculate how the atomic positions change with time, resulting in a dynamic trajectory. MD can be used to quantify the properties of a system and is therefore a valuable tool in understanding the complete profile of a model system.
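The integration loop described above can be sketched with the velocity Verlet scheme, a standard MD integrator; `force_fn` returns the negative gradient of the potential and is left as a user-supplied callable:

```python
import numpy as np

def velocity_verlet(pos, vel, force_fn, mass, dt, n_steps):
    """Advance positions and velocities under Newton's equations using
    velocity Verlet; returns the trajectory of positions."""
    traj = [pos.copy()]
    f = force_fn(pos)
    for _ in range(n_steps):
        pos = pos + vel * dt + 0.5 * (f / mass) * dt ** 2
        f_new = force_fn(pos)
        vel = vel + 0.5 * ((f + f_new) / mass) * dt
        f = f_new
        traj.append(pos.copy())
    return np.array(traj)
```

For a one-dimensional harmonic oscillator (force = -kx) the computed trajectory closely tracks the analytic cosine solution, a standard sanity check for an integrator.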

Breakthroughs in MD led to its first application, on the protein bovine pancreatic trypsin inhibitor, for 9.2 ps in vacuo in 1977.114 By 1998, cumulative advances in MD simulations made it possible to carry out a microsecond MD of a much larger protein in solution.115 Table 2.4 lists different MD methods; the utility of each varies with the properties of interest.

MD Methods

Brownian MD: Accounts for the Brownian motion of molecules, which is particularly evident in solvents of high viscosity.

Langevin MD: Uses the Langevin equations of motion, in which additional frictional and random forces are added to the Newtonian equations.

Activated MD: Simulates activated processes, allowing the system to cross barriers between the initial and final states.

Accelerated MD: Time is treated as a statistical quantity. Conformational sampling is enhanced through (i) a modified potential energy surface, (ii) non-Boltzmann sampling, or (iii) slow degrees of freedom sampled at the expense of faster degrees of freedom.

Steered MD: Based on the atomic force microscopy (AFM) principle; introduces a time- and position-dependent force to steer the system along particular degrees of freedom.

Targeted MD: A force is used to drive the simulation towards a target conformation.

Swarm MD: Multiple simulations are run, each subjected to a force that introduces cooperative behaviour and drives each trajectory towards the mean trajectory of the entire swarm.

Replica Exchange MD: A series of simultaneous non-interacting simulations (replicas) carried out over a range of temperatures and swapped at particular intervals.

Self-Guided MD: Introduces an additional guiding force that is a continuously updated time average of the force in the current simulation.

Leap Dynamics: A combination of MD and essential dynamics; leaps are applied to the system to force it over energy barriers.

Multiple Body O(N) Dynamics [MBO(N)D]: Combines rigid-body dynamics with multiple time steps, in which the highest-frequency harmonic motions are removed while the low-frequency anharmonic motions are retained.

λ Dynamics: A multiple-copy method with a reduced interaction potential for the ligand, lowering the barriers to conformational transitions. The fraction of interaction potential, λi2 (an additional degree of freedom), allows the ligand to explore conformational space over the reduced barriers.

Table 2.4: List of various MD techniques

ADMET Prediction

Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) profiles of chemical compounds have become a bottleneck and a major challenge in drug research. Table 2.5 lists a few drugs withdrawn from the market due to insufficient ADMET profiles. To overcome the high rate of attrition of active compounds plagued by hidden pharmacokinetic problems, ADMET property evaluation is now incorporated into drug design strategies. In silico prediction of physicochemical parameters such as a compound's ionizability (pKa) and lipophilicity (log P or log D) provides an indication of its likely absorption in the gut. Various computational techniques using in silico models are emerging that attempt to predict the ADMET profile of a given compound.116, 117
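A common first-pass in silico absorption filter of this kind is Lipinski's rule of five; a minimal check, assuming the four descriptors have already been computed:

```python
def lipinski_pass(mw, logp, h_donors, h_acceptors):
    """Return True if the compound violates at most one of Lipinski's
    rule-of-five criteria for likely oral absorption."""
    violations = sum([
        mw > 500.0,        # molecular weight
        logp > 5.0,        # lipophilicity (log P)
        h_donors > 5,      # hydrogen-bond donors
        h_acceptors > 10,  # hydrogen-bond acceptors
    ])
    return violations <= 1
```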


Therapeutic utility | Withdrawn due to

Morning sickness in pregnancy | Teratogenicity leading to deformities in fetal development

- | Drug-drug interactions with MAO inhibitors

Antihypertensive | Interferes with metabolism of other drugs used for hypertension

Antidiabetic | -

Antihyperlipidemic | -

Anti-inflammatory | Myocardial infarction

Decongestant in cough (phenylpropanolamine) | Hemorrhagic stroke
Table 2.5: Drugs withdrawn from market due to ADMET failure

Advances in Computational Power

All the methods described above are computer-intensive and therefore demand an immense increase in available computational power. Consequently, Linux farms allowing many hundreds of parallel calculations in acceptable timescales now substitute for much more expensive supercomputers. GRID computing is another response to these high computational needs. An elegant example of such grid computation is the screen-saver project SETI@home (Search for Extra-Terrestrial Intelligence) over PC networks.118 In this project about 1.2 million household PCs across 215 countries were used to constitute a 65-teraflop-equivalent machine with more than 100,000 years of CPU time. Further, this network was used to screen 3.2 billion virtual structures against 13 protein active sites in just 24 days, searching for novel anticancer agents and anthrax inhibitors. Folding@home is a similar project used in the prediction of protein folding.119

Simulations One Step AheadFuture Directions

Advances in computation in structural biology have made it possible to carry out virtual cell simulations that mimic the cell environment and the cellular events therein.120 A number of programs have been developed in this area. For example, NEURON and GENESIS simulate the electrophysiological behavior of single neurons and neuronal networks. Going a step further, E-CELL constructs a model of a hypothetical self-sustaining whole cell with 127 genes, sufficient for transcription, translation and energy production.121 Alternatively, MCell provides a modeling tool for realistic simulation of cellular signaling in the complex subcellular microenvironment in and around living cells, using Monte Carlo algorithms to track the stochastic behavior of discrete molecules in space and time.122 Virtual Cell is another program that models cell-biological processes. Such virtual cell simulations may eventually overcome the limitations of simulating isolated proteins, which never exist in isolation in real biological systems, and allow interrelated ADMET properties to be modeled.