Journal of Chemical Information and Modeling, volume 47, issue 2, pages 488-508

Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem

Jean-Francois Truchon 1
Christopher I. Bayly 1
1 Department of Medicinal Chemistry, Merck Frosst Centre for Therapeutic Research, 16711 TransCanada Highway, Kirkland, Québec, Canada H9H 3L1
Publication type: Journal Article
Publication date: 2007-02-09
Scimago: Q1
SJR: 1.396
CiteScore: 9.8
Impact factor: 5.6
ISSN: 1549-9596, 1549-960X
PubMed ID: 17288412
General Chemistry
Computer Science Applications
General Chemical Engineering
Library and Information Sciences
Abstract
Many metrics are currently used to evaluate the performance of ranking methods in virtual screening (VS), for instance, the area under the receiver operating characteristic curve (ROC), the area under the accumulation curve (AUAC), the average rank of actives, the enrichment factor (EF), and the robust initial enhancement (RIE) proposed by Sheridan et al. In this work, we show that the ROC, the AUAC, and the average rank metrics have the same inappropriate behaviors that make them poor metrics for comparing VS methods whose purpose is to rank actives early in an ordered list (the "early recognition problem"). In doing so, we derive mathematical formulas that relate those metrics together. Moreover, we show that the EF metric is not sensitive to ranking performance before and after the cutoff. Instead, we formally generalize the ROC metric to the early recognition problem which leads us to propose a novel metric called the Boltzmann-enhanced discrimination of receiver operating characteristic that turns out to contain the discrimination power of the RIE metric but incorporates the statistical significance from ROC and its well-behaved boundaries. Finally, two major sources of errors, namely, the statistical error and the "saturation effects", are examined. This leads to practical recommendations for the number of actives, the number of inactives, and the "early recognition" importance parameter that one should use when comparing ranking methods. Although this work is applied specifically to VS, it is general and can be used to analyze any method that needs to segregate actives toward the front of a rank-ordered list.
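The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 1-based rank convention, and the default alpha = 20 are assumptions made for illustration, and the formulas are the commonly stated definitions of EF, RIE, and BEDROC (with BEDROC written as RIE rescaled between its worst-case and best-case values) rather than anything printed on this page.

```python
import math


def enrichment_factor(active_ranks, n_total, fraction):
    """EF: hit rate in the top `fraction` of the list over the overall hit rate."""
    n_actives = len(active_ranks)
    # round() avoids floating-point surprises such as 0.29 * 100 == 28.999...
    cutoff = max(1, round(fraction * n_total))
    hits = sum(1 for r in active_ranks if r <= cutoff)
    return (hits / cutoff) / (n_actives / n_total)


def rie(active_ranks, n_total, alpha):
    """RIE: mean exponential weight of the actives over its random expectation."""
    n_actives = len(active_ranks)
    observed = sum(math.exp(-alpha * r / n_total) for r in active_ranks) / n_actives
    # Expected weight of a single active placed uniformly at random in ranks 1..N.
    expected = (1.0 / n_total) * (1.0 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1.0)
    return observed / expected


def bedroc(active_ranks, n_total, alpha=20.0):
    """BEDROC: RIE rescaled so the worst ranking gives 0 and the best gives 1."""
    n_actives = len(active_ranks)
    best = rie(range(1, n_actives + 1), n_total, alpha)
    worst = rie(range(n_total - n_actives + 1, n_total + 1), n_total, alpha)
    return (rie(active_ranks, n_total, alpha) - worst) / (best - worst)


# Hypothetical example: 5 actives among 100 compounds, two different rankings.
ranks_early = [1, 2, 3, 50, 90]   # three actives at the very top
ranks_late = [8, 9, 10, 50, 90]   # same three actives, just inside the top 10%

print(enrichment_factor(ranks_early, 100, 0.1), enrichment_factor(ranks_late, 100, 0.1))
print(bedroc(ranks_early, 100), bedroc(ranks_late, 100))
```

In this toy case both rankings place three actives inside the top 10%, so EF at that cutoff cannot tell them apart, whereas BEDROC rewards the ranking that puts those actives first. Larger alpha values concentrate the weight nearer the top of the list; the value used here is only an illustrative choice.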
Hanley J.A., McNeil B.J.
Radiology scimago Q1 wos Q1
2014-07-08 citations by CoLab: 15568 Abstract  
A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented. It is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject. Moreover, this probability of a correct ranking is the same quantity that is estimated by the already well-studied nonparametric Wilcoxon statistic. These two relationships are exploited to (a) provide rapid closed-form expressions for the approximate magnitude of the sampling variability, i.e., standard error that one uses to accompany the area under a smoothed ROC curve, (b) guide in determining the size of the sample required to provide a sufficiently reliable estimate of this area, and (c) determine how large sample sizes should be to ensure that one can statistically detect differences in the accuracy of diagnostic techniques.
Drummond C., Holte R.C.
Machine Learning scimago Q1 wos Q2
2006-05-08 citations by CoLab: 233 Abstract  
This paper introduces cost curves, a graphical technique for visualizing the performance (error rate or expected cost) of 2-class classifiers over the full range of possible class distributions and misclassification costs. Cost curves are shown to be superior to ROC curves for visualizing classifier performance for most purposes. This is because they visually support several crucial types of performance assessment that cannot be done easily with ROC curves, such as showing confidence intervals on a classifier's performance, and visualizing the statistical significance of the difference in performance of two classifiers. A software tool supporting all the cost curve analysis described in this paper is available from the authors.
Cleves A.E., Jain A.N.
Journal of Medicinal Chemistry scimago Q1 wos Q1
2006-04-22 citations by CoLab: 86 Abstract  
Systematic annotation of the primary targets of roughly 1000 known therapeutics reveals that over 700 of these modulate approximately 85 biological targets. We report the results of three analyses. In the first analysis, drug/drug similarities and target/target similarities were computed on the basis of three-dimensional ligand structures. Drug pairs sharing a target had significantly higher similarity than drug pairs sharing no target. Also, target pairs with no overlap in annotated drug specificity shared lower similarity than target pairs with increasing overlap. Two-way agglomerative clusterings of drugs and targets were consistent with known pharmacology and suggestive that side effects and drug-drug interactions might be revealed by modeling many targets. In the second analysis, we constructed and tested ligand-based models of 22 diverse targets in virtual screens using a background of screening molecules. Greater than 100-fold enrichment of cognate versus random molecules was observed in 20/22 cases. In the third analysis, selectivity of the models was tested using a background of drug molecules, with selectivity of greater than 80-fold observed in 17/22 cases. Predicted activities derived from crossing drugs against modeled targets identified a number of known side effects, drug specificities, and drug-drug interactions that have a rational basis in molecular structure.
Kairys V., Fernandes M.X., Gilson M.K.
2005-12-13 citations by CoLab: 69 Abstract  
In the absence of an experimentally solved structure, a homology model of a protein target can be used instead for virtual screening of drug candidates by docking and scoring. This approach poses a number of questions regarding the choice of the template to use in constructing the model, the accuracy of the screening results, and the importance of allowing for protein flexibility. The present study addresses such questions with compound screening calculations for multiple homology models of five drug targets. A central result is that docking to homology models frequently yields enrichments of known ligands as good as that obtained by docking to a crystal structure of the actual target protein. Interestingly, however, standard measures of the similarity of the template used to build the homology model to the targeted protein show little correlation with the effectiveness of the screening calculations, and docking to the template itself often is as successful as docking to the corresponding homology model. Treating key side chains as mobile produces a modest improvement in the results. The reasons for these sometimes unexpected results, and their implications for future methodologic development, are discussed.
Triballeau N., Acher F., Brabet I., Pin J., Bertrand H.
Journal of Medicinal Chemistry scimago Q1 wos Q1
2005-03-08 citations by CoLab: 519 Abstract  
The “receiver operating characteristic” (ROC) curve method is a well-recognized metric used as an objective way to evaluate the ability of a given test to discriminate between two populations. This facilitates decision-making in a plethora of fields in which a wrong judgment may have serious consequences including clinical diagnosis, public safety, travel security, and economic strategies. When virtual screening is used to speed-up the drug discovery process in pharmaceutical research, taking the right decision upon selecting or discarding a molecule prior to in vitro evaluation is of paramount importance. Characterizing both the ability of a virtual screening workflow to select active molecules and the ability to discard inactive ones, the ROC curve approach is well suited for this critical decision gate. As a case study, the first virtual screening workflow focused on metabotropic glutamate receptor subtype 4 (mGlu4R) agonists is reported here. Six compounds out of 38 selected and tested in vitro were shown to have agonist activity on this target of therapeutic interest.
Killeen P.R., Taylor T.J.
2004-12-11 citations by CoLab: 9 Abstract  
For receiver operating characteristic curves to be symmetric the signal distribution must be an orientation-reversing involution of the noise distribution on the strength axis.
Ferrara P., Gohlke H., Price D.J., Klebe G., Brooks C.L.
Journal of Medicinal Chemistry scimago Q1 wos Q1
2004-05-04 citations by CoLab: 407 Abstract  
An assessment of nine scoring functions commonly applied in docking using a set of 189 protein−ligand complexes is presented. The scoring functions include the CHARMm potential, the scoring function DrugScore, the scoring function used in AutoDock, the three scoring functions implemented in DOCK, as well as three scoring functions implemented in the CScore module in SYBYL (PMF, Gold, ChemScore). We evaluated the abilities of these scoring functions to recognize near-native configurations among a set of decoys and to rank binding affinities. Binding site decoys were generated by molecular dynamics with restraints. To investigate whether the scoring functions can also be applied for binding site detection, decoys on the protein surface were generated. The influence of the assignment of protonation states was probed by either assigning “standard” protonation states to binding site residues or adjusting protonation states according to experimental evidence. The role of solvation models in conjunction with CHARMm was explored in detail. These include a distance-dependent dielectric function, a generalized Born model, and the Poisson equation. We evaluated the effect of using a rigid receptor on the outcome of docking by generating all-pairs decoys (“cross-decoys”) for six trypsin and seven HIV-1 protease complexes. The scoring functions perform well to discriminate near-native from misdocked conformations, with CHARMm, DOCK-energy, DrugScore, ChemScore, and AutoDock yielding recognition rates of around 80%. Significant degradation in performance is observed in going from decoy to cross-decoy recognition for CHARMm in the case of HIV-1 protease, whereas DrugScore and ChemScore, as well as CHARMm in the case of trypsin, show only small deterioration. In contrast, the prediction of binding affinities remains problematic for all of the scoring functions. 
ChemScore gives the highest correlation value with R2 = 0.51 for the set of 189 complexes and R2 = 0.43 for the set of 116 complexes that does not contain any of the complexes used to calibrate this scoring function. Neither a more accurate treatment of solvation nor a more sophisticated charge model for zinc improves the quality of the results. Improved modeling of the protonation states, however, leads to a better prediction of binding affinities in the case of the generalized Born and the Poisson continuum models used in conjunction with the CHARMm force field.
Perola E., Walters W.P., Charifson P.S.
2004-04-28 citations by CoLab: 366 Abstract  
A thorough evaluation of some of the most advanced docking and scoring methods currently available is described, and guidelines for the choice of an appropriate protocol for docking and virtual screening are defined. The generation of a large and highly curated test set of pharmaceutically relevant protein-ligand complexes with known binding affinities is described, and three highly regarded docking programs (Glide, GOLD, and ICM) are evaluated on the same set with respect to their ability to reproduce crystallographic binding orientations. Glide correctly identified the crystallographic pose within 2.0 Å in 61% of the cases, versus 48% for GOLD and 45% for ICM. In general Glide appears to perform most consistently with respect to diversity of binding sites and ligand flexibility, while the performance of ICM and GOLD is more binding site-dependent and it is significantly poorer when binding is predominantly driven by hydrophobic interactions. The results also show that energy minimization and reranking of the top N poses can be an effective means to overcome some of the limitations of a given docking function. The same docking programs are evaluated in conjunction with three different scoring functions for their ability to discriminate actives from inactives in virtual screening. The evaluation, performed on three different systems (HIV-1 protease, IMPDH, and p38 MAP kinase), confirms that the relative performance of different docking and scoring methods is to some extent binding site-dependent. GlideScore appears to be an effective scoring function for database screening, with consistent performance across several types of binding sites, while ChemScore appears to be most useful in sterically demanding sites since it is more forgiving of repulsive interactions. Energy minimization of docked poses can significantly improve the enrichments in systems with sterically demanding binding sites.
Overall Glide appears to be a safe general choice for docking, while the choice of the best scoring tool remains to a larger extent system-dependent and should be evaluated on a case-by-case basis.
Muegge I., Enyedy I.J.
Current Medicinal Chemistry scimago Q1 wos Q2
2004-03-01 citations by CoLab: 54 Abstract  
Kinases have become a major area of drug discovery and structure-based design. Hundreds of 3D structures for more than thirty different kinases are available to the public. High structural and sequence homology within the kinase gene family makes the remaining kinases ideal targets for homology modeling and virtual screening. Somewhat surprisingly, however, the number of publications about virtual screening of kinases is very low. Therefore, rather than reviewing the field of virtual screening for kinases, we attempt here a hybrid approach of presenting what is known and common practice together with new studies on CDK2 and SRC kinase. To illustrate the challenges and pitfalls of virtual screening for kinase targets we focus on the question of how ranking is influenced by the database screened, the docking scheme, the scoring function, the activity of the compounds used for testing, and small changes in the binding pocket. In addition, a case study of finding irreversible inhibitors of ErbB2 through in silico screening is presented.
Halgren T.A., Murphy R.B., Friesner R.A., Beard H.S., Frye L.L., Pollard W.T., Banks J.L.
Journal of Medicinal Chemistry scimago Q1 wos Q1
2004-02-27 citations by CoLab: 4108 Abstract  
Glide's ability to identify active compounds in a database screen is characterized by applying Glide to a diverse set of nine protein receptors. In many cases, two, or even three, protein sites are employed to probe the sensitivity of the results to the site geometry. To make the database screens as realistic as possible, the screens use sets of "druglike" decoy ligands that have been selected to be representative of what we believe is likely to be found in the compound collection of a pharmaceutical or biotechnology company. Results are presented for releases 1.8, 2.0, and 2.5 of Glide. The comparisons show that average measures for both "early" and "global" enrichment for Glide 2.5 are 3 times higher than for Glide 1.8 and more than 2 times higher than for Glide 2.0 because of better results for the least well-handled screens. This improvement in enrichment stems largely from the better balance of the more widely parametrized GlideScore 2.5 function and the inclusion of terms that penalize ligand-protein interactions that violate established principles of physical chemistry, particularly as it concerns the exposure to solvent of charged protein and ligand groups. Comparisons to results for the thymidine kinase and estrogen receptors published by Rognan and co-workers (J. Med. Chem. 2000, 43, 4759-4767) show that Glide 2.5 performs better than GOLD 1.1, FlexX 1.8, or DOCK 4.01.
Schulz-Gasch T., Stahl M.
Journal of Molecular Modeling scimago Q3 wos Q3
2003-01-14 citations by CoLab: 166 Abstract  
Two new docking programs FRED (OpenEye Scientific Software) and Glide (Schrödinger, Inc.) in combination with various scoring functions implemented in these programs have been evaluated against a variety of seven protein targets (cyclooxygenase-2, estrogen receptor, p38 MAP kinase, gyrase B, thrombin, gelatinase A, neuraminidase) in order to assess their accuracy in virtual screening. Sets of known inhibitors were added to and ranked relative to a random library of drug-like compounds. Performance was compared in terms of enrichment factors and CPU time consumption. Results and specific features of the two new tools are discussed and compared to previously published results using FlexX (Tripos, Inc.) as a docking engine. In addition, general criteria for the selection of docking algorithms and scoring functions based on binding-site characteristics of specific protein targets are proposed.
Figure: Enrichment factors obtained with FlexX, Glide and FRED docking engines in combination with different scoring functions for seven selected targets with highly variable binding sites.
Sheridan R.P., Singh S.B., Fluder E.M., Kearsley S.K.
2001-08-01 citations by CoLab: 106 Abstract  
Similarity searches based on chemical descriptors have proven extremely useful in aiding large-scale drug screening. Typically an investigator starts with a "probe", a drug-like molecule with an interesting biological activity, and searches a database to find similar compounds. In some projects, however, the only known actives are peptides, and the investigator needs to identify drug-like actives. 3D similarity methods are able to help in this endeavor but suffer from the necessity of having to specify the active conformation of the probe, something that is not always possible at the beginning of a project. Also, 3D methods are slow and are complicated by the need to generate low-energy conformations. In contrast, topological methods are relatively rapid and do not depend on conformation. However, unmodified topological similarity methods, given a peptide probe, will preferentially select other peptides from a database. In this paper we show some simple protocols that, if used with a standard topological similarity search method, are sufficient to select nonpeptide actives given a peptide probe. We demonstrate these protocols by using 10 peptide-like probes to select appropriate nonpeptide actives from the MDDR database.
Edgar S.J., Holliday J.D., Willett P.
2000-01-01 citations by CoLab: 56 Abstract  
This article reviews measures for evaluating the effectiveness of similarity searches in chemical databases, drawing principally upon the many measures that have been described previously for evaluating the performance of text search engines. The use of the various measures is exemplified by fragment-based 2D similarity searches on several databases for which both structural and bioactivity data are available. It is concluded that the cumulative recall and G-H score measures are the most useful of those tested.
Swets J.A.
Science scimago Q1 wos Q1 Open Access
1988-06-03 citations by CoLab: 7292 PDF Abstract  
Diagnostic systems of several kinds are used to distinguish between two classes of events, essentially "signals" and "noise". For them, analysis in terms of the "relative operating characteristic" of signal detection theory provides a precise and valid measure of diagnostic accuracy. It is the only measure available that is uninfluenced by decision biases and prior probabilities, and it places the performances of diverse systems on a common, easily interpreted scale. Representative values of this measure are reported here for systems in medical imaging, materials testing, weather forecasting, information retrieval, polygraph lie detection, and aptitude testing. Though the measure itself is sound, the values obtained from tests of diagnostic systems often require qualification because the test data on which they are based are of unsure quality. A common set of problems in testing is faced in all fields. How well these problems are handled, or can be handled in a given field, determines the degree of confidence that can be placed in a measured value of accuracy. Some fields fare much better than others.
Fallico M.J., Alberca L.N., Enrique N., Orsi F., Prada Gori D.N., Martín P., Gavernet L., Talevi A.
Brain Research scimago Q2 wos Q3
2025-06-01 citations by CoLab: 0
Wang L., Wu Y., Luo H., Liang M., Zhou Y., Chen C., Liu C., Zhang J., Zhang Y.
2025-03-19 citations by CoLab: 0 Abstract  
Deep-learning techniques have significantly advanced small-molecule drug discovery. However, a critical gap remains between representation learning and small molecule generations, limiting their effectiveness in developing new drugs. We introduce Ouroboros, a unified framework that integrates molecular representation learning with generative modeling, enabling efficient chemical space exploration using pre-trained molecular encodings. By reframing molecular generation as a process of encoding space compression and decompression, Ouroboros resolves the challenges associated with iterative molecular optimization and facilitates directed chemical evolution within the encoding space. Comprehensive experimental tests demonstrate that Ouroboros significantly outperforms conventional approaches across multiple drug discovery tasks, including ligand-based virtual screening, chemical property prediction, and multi-target inhibitor design and optimization. Unlike task-specific models in traditional approaches, Ouroboros leverages a unified framework to achieve superior performance across diverse applications. Ouroboros offers a novel and highly scalable protocol for rapid chemical space exploration, fostering a potential paradigm shift in AI-driven drug discovery.
Hansel‐Harris A.T., Tillack A.F., Santos‐Martins D., Holcomb M., Forli S.
Protein Science scimago Q1 wos Q1
2025-02-25 citations by CoLab: 0 Abstract  
Recent advances in structural biology have led to the publication of a wealth of high‐resolution x‐ray crystallography (XRC) and cryo‐EM macromolecule structures, including many complexes with small molecules of interest for drug design. While it is common to incorporate information from the atomic coordinates of these complexes into docking (e.g., pharmacophore models or scaffold hopping), there are limited methods to directly leverage the underlying density information. This is desirable because it does not rely on the determination of relevant coordinates, which may require expert intervention, but instead interprets all density as indicative of regions to which a ligand may be bound. To do so, we have developed CryoXKit, a tool to incorporate experimental densities from either cryo‐EM or XRC as a biasing potential on heavy atoms during docking. Using this structural density guidance with AutoDock‐GPU, we found significant improvements in re‐docking and cross‐docking, important pose prediction tasks, compared with the unmodified AutoDock4 force field. Failures in cross‐docking tasks are additionally reflective of changes in the positioning of pharmacophores in the site, suggesting it is a fundamental limitation of transferring information between complexes. We additionally found, against a set of targets selected from the LIT‐PCBA dataset, that rescoring of these improved poses leads to better discriminatory power in virtual screenings for selected targets. Overall, CryoXKit provides a user‐friendly method for improving docking performance with experimental data while requiring no a priori pharmacophore definition and at virtually no computational expense. Map‐modification code available at: https://github.com/forlilab/CryoXKit.
Jia Z., Li Y., Shi W., Qian J., Xu Y., Fan H., Hu X., Wang H.
Medicinal Chemistry Research scimago Q2 wos Q3
2025-01-30 citations by CoLab: 0 Abstract  
Bromodomain-containing protein 4 (BRD4), as the reader of epigenetics, could regulate gene transcription by recognizing acetylated lysine of histones. In recent years, researchers have found that dysregulation of BRD4 leads to the occurrence and development of various cancers, making BRD4 a promising target for cancer therapy. To identify novel BRD4 inhibitors from natural products, a hierarchical virtual screening method including pharmacophore modeling, molecular docking, and molecular dynamic simulation was performed to obtain five hit compounds with potential BRD4 inhibitory activity. Subsequently, structural optimization of the hit compound (ZINC2648030) with chromone structure was conducted to afford a series of derivatives (8a–13e), and their BRD4 inhibitory activities were evaluated. Among them, 13b showed remarkable BRD4 inhibitory activity (IC50 = 0.60 μM). Moreover, 13b displayed a potent inhibitory effect on A549 cells with an IC50 value of 0.79 μM, and further investigations demonstrated that it has the potential to induce apoptosis, inhibit colony formation, and suppress cell invasion. These findings indicated that 13b might be a candidate for cancer treatment.
