DOI: 10.1002/minf.201000061

Best Practices for QSAR Model Development, Validation, and Exploitation

Alexander Tropsha*[a]

Mol. Inf. 2010, 29, 476 – 488. © 2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim. Review.

1 Introduction: Basic Principles and Workflow of Predictive QSAR Modeling

Rapid development of information and communication technologies during the last few decades has dramatically changed our capabilities of collecting, analyzing, storing and disseminating all types of data. This process has had a profound influence on scientific research in many disciplines, including the development of new generations of effective and selective medicines. Large databases containing millions of chemical compounds tested in various biological assays, such as PubChem,[1] are increasingly available as online collections (recently reviewed by Oprea and Tropsha;[2] see also the recent commentary by Williams et al.[3]). In order to find new drug leads, there is a need for efficient and robust procedures that can be used to screen chemical databases and virtual libraries against molecules with known activities or properties. To this end, Quantitative Structure-Activity Relationship (QSAR) modeling provides an effective means for both exploring and exploiting the relationship between chemical structure and its biological action towards the development of novel drug candidates. The QSAR approach can be generally described as an application of data analysis methods and statistics to developing models that can accurately predict biological activities or properties of compounds based on their structures.
Any QSAR method can be generally defined as an application of mathematical and statistical methods to the problem of finding empirical relationships (QSAR models) of the form $P_i = k'(D_1, D_2, \ldots, D_n)$, where $P_i$ are biological activities (or other properties of interest) of molecules, $D_1, D_2, \ldots, D_n$ are calculated (or, sometimes, experimentally measured) structural properties (molecular descriptors) of compounds, and $k'$ is some empirically established mathematical transformation that should be applied to descriptors to calculate the property values for all molecules (Figure 1). The goal of QSAR modeling is to establish a trend in the descriptor values that parallels the trend in biological activity. In essence, all QSAR approaches imply, directly or indirectly, a simple similarity principle, which for a long time has provided a foundation for experimental medicinal chemistry: compounds with similar structures are expected to have similar biological activities. A detailed description of the major tenets of QSAR modeling is beyond the scope of this paper; overviews of many popular QSAR modeling techniques, including statistical and data mining techniques as well as approaches to descriptor calculation, can be found in many reviews and monographs (e.g.,[4,5]). Here, we comment on the most critical general aspects of model development and, most importantly, validation that are especially important in the context of using QSAR models for virtual screening. Most of our discussion captures trends that the author has either observed or contributed to in the last 20 years of active research in the field. Additional important information concerning both common errors and established practices in the QSAR modeling field can be found in other critical essays on the subject, e.g., by Stouch et al.[6] and Dearden et al.[7] Our experience in QSAR model development and validation has led us to establish a complex strategy[8] that is summarized in Figure 2.
It describes the predictive QSAR modeling workflow focused on delivering validated models and, ultimately, computational hits that should be confirmed by experimental validation. We start by carefully curating chemical structures and, if possible, associated biological activities to prepare the dataset for subsequent calculations. The issue of assessing and addressing data accuracy has not been properly addressed in the literature, and we discuss some aspects of this critical component of the workflow below. Then, a fraction of compounds (typically, 10–20%) is selected randomly as an external evaluation set (a more rigorous n-fold external validation protocol can be employed when the dataset is randomly divided into n nearly equal parts and then n−1 parts are systematically used for model development and the remaining fraction of compounds is used for model evaluation).

Abstract: After nearly five decades “in the making”, QSAR modeling has established itself as one of the major computational molecular modeling methodologies. Like any mature research discipline, QSAR modeling can be characterized by a collection of well-defined protocols and procedures that enable the expert application of the method for exploring and exploiting ever-growing collections of biologically active chemical compounds. This review examines the most critical QSAR modeling routines that we regard as best practices in the field. We discuss these procedures in the context of an integrative predictive QSAR modeling workflow that is focused on achieving models of the highest statistical rigor and external predictive power. Specific elements of the workflow consist of data preparation including chemical structure (and when possible, associated biological data) curation, outlier detection, dataset balancing, and model validation. We especially emphasize procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries. Finally, we present several examples of successful applications of QSAR models for virtual screening to identify experimentally confirmed hits.

Keywords: QSAR modeling · Model validation · Virtual screening · Drug discovery

[a] A. Tropsha, Laboratory for Molecular Modeling and Carolina Center for Exploratory Cheminformatics Research, CB # 7568, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA. *e-mail: alex_tropsha@unc.edu

The Sphere Exclusion protocol implemented in our laboratory[9,10] is then used to rationally divide the remaining subset of compounds (the modeling set) into multiple training and test sets that are used for model development and validation, respectively (alternative rational approaches for dividing the modeling set into diverse and representative training and test sets could be devised as well). We employ multiple QSAR techniques based on the combinatorial exploration of all possible pairs of descriptor sets and various supervised data analysis techniques (combi-QSAR) and select models characterized by high accuracy in predicting both training and test set data. The model acceptability thresholds are typically characterized by the lowest acceptable value of the leave-one-out cross-validated $R^2$ ($q^2$) for the training set and by the conventional $R^2$ for the test set; our default values are 0.6 for both $q^2$ and $R^2$. All validated models are finally tested in an ensemble using the external evaluation set.
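The n-fold external validation protocol mentioned above can be sketched in a few lines. This is only a minimal illustration, assuming compounds are referred to by arbitrary identifiers; the function name `n_fold_external_splits` and its parameters are hypothetical and do not come from the workflow software used by the author.

```python
import random


def n_fold_external_splits(compound_ids, n_folds=5, seed=0):
    """Yield (modeling_set, external_set) pairs for n-fold external validation.

    The dataset is shuffled once and cut into n nearly equal parts; each part
    serves once as the external evaluation set while the remaining n-1 parts
    form the modeling set.
    """
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)
    # Dealing compounds round-robin keeps fold sizes within one of each other.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, external in enumerate(folds):
        modeling = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield modeling, external
```

Each modeling set returned here would then be divided further into training and test sets (for example, by a rational selection method such as Sphere Exclusion), which this sketch does not attempt.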
The critical step of external validation is the use of applicability domains (AD), which are defined uniquely for each model used in consensus (ensemble) prediction of the external set. If external validation demonstrates significant predictive power of the models, we employ them for virtual screening of available chemical databases (e.g., ZINC[11]) to identify putative active compounds and work with collaborators who can validate such hits experimentally. Thus, models resulting from the predictive QSAR modeling workflow (Figure 2) can be used to prioritize the selection of chemicals for experimental validation. In fact, it is increasingly critical to include experimental validation as the ultimate assertion of the model-based prediction. We note that the focus on experimental validation shifts the emphasis from ensuring good (best) statistics for the model that fits known experimental data towards generating testable hypotheses about purported bioactive compounds. Thus, the output of the modeling has exactly the same format as the input, i.e., chemical structures and (predicted) activities, making model interpretation and utilization completely seamless for medicinal chemists. Some of our application studies demonstrating the ability of models to identify computational hits that were subsequently validated experimentally are described below. We now discuss specific procedures (best practices) that should be followed within each individual component of the workflow.

Alexander Tropsha is K. H. Lee Distinguished Professor and Chair of the Division of Medicinal Chemistry and Natural Products in the Eshelman School of Pharmacy, UNC-Chapel Hill. He received his PhD in Chemical Enzymology in 1986 from Moscow State University, Russia. He immigrated to the United States in 1989 and has been affiliated with UNC since then. His research interests are in the areas of Computer-Assisted Drug Design, Computational Toxicology, Cheminformatics, and Structural Bioinformatics.
His research is supported by multiple grants from the NIH, NSF, EPA, and private companies. He is a member of several editorial boards of scientific journals, a permanent member of the BDMA Study Section at the NIH, and was an elected member of the Board and vice-chair of the international Cheminformatics and QSAR Society in 2005–2010.

Figure 1. The process of QSAR model development.

2 Best Practices for Key Elements of QSAR Modeling Workflow

In this section we discuss specific protocols and procedures that in our experience should be followed to enable the development of reliable and predictive QSAR models. The discussion follows the path of the workflow summarized in Figure 2, from data preparation to model development and validation to application of models for external prediction and virtual screening.

2.1 The Importance of Chemical Data Curation in QSAR Modeling

Molecular modelers typically analyze data generated by other (experimental) researchers. Consequently, when it comes to experimental data quality they are always at the mercy of the data providers. Practically any cheminformatics study entails the calculation of chemical descriptors that are expected to accurately reflect intricate details of the underlying chemical structures. Obviously, any error in a structure translates either into an inability to calculate descriptors for the erroneous chemical record or into erroneous descriptors; as a result, models developed with such incomplete or inaccurate descriptors are either restricted to only a fraction of the formally available data or, what is even worse, inaccurate. A recent study[12] showed that on average there are two structural errors per medicinal chemistry publication, with an overall error rate for compounds indexed in the WOMBAT database[13] as high as 8%.
In another recent study,[14] the authors investigated several public and commercial databases to calculate their error rates, which ranged from 0.1 to 3.4% depending on the database. As both data and data models, as well as the body of scholarly publications in cheminformatics, continue to grow, it becomes increasingly important to address the issue of data quality, which inherently affects the quality of models. How significant is the problem of accurate structure representation as it concerns the adequacy and accuracy of cheminformatics models? There appear to be no systematic studies on the subject in the published literature. However, even a few recent reports indicate that this problem should be given very serious attention. For instance, recent benchmarking studies by a large group of collaborators from six laboratories[15,16] have clearly demonstrated that the type of chemical descriptors has a much greater influence on the prediction performance of QSAR models than the nature of the model optimization techniques. Furthermore, in another recent seminal publication[14] the authors clearly pointed out the importance of chemical data curation in the context of QSAR modeling. They discussed several illustrative examples of incorrect structures generated from either correct or incorrect SMILES using commercial software. They also tried to determine the error rate in several known databases and evaluate the consequences of both random and systematic errors for the prediction performance of QSAR models. Their main conclusion was that small structural errors within a dataset could lead to significant losses in the predictive ability of QSAR models.
At the same time they further demonstrated that manual curation of structural data leads to a substantial increase in model predictivity.[14] Although there are obvious compelling reasons to believe that chemical data curation should be given a lot of attention, it is also obvious that for the most part the basic steps to curate a dataset of compounds have been either considered trivial or ignored even by experts. For instance, in an effort to improve the quality of publications in the QSAR modeling field, the Journal of Chemical Information and Modeling published a special editorial highlighting the requirements for QSAR papers that authors should follow if they consider publishing their results in the journal;[17] however, no special attention was given to data curation.

Figure 2. Predictive QSAR modeling workflow.

There have been several recent publications addressing common mistakes and criticizing faulty practices in the QSAR modeling field;[7,18,19] however, these papers have not explicitly described and discussed the importance of chemical record curation for developing robust QSAR models. Generally speaking, since models of chemical data can only be as good as the data itself, there is a pressing need to develop and systematically employ standard chemical record curation protocols that should be helpful in the pre-processing of any chemical dataset. Recently, we have integrated several protocols into a standardized chemical data curation strategy[20] that, in our opinion, should be followed at the onset of any molecular modeling investigation.
The simple, but important, steps for cleaning chemical records in a database include the removal of the fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques (e.g., inorganic and organometallic compounds, counterions, salts and mixtures); structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. It is also critical to visualize and manually inspect at least a fraction of the chemical data that go into model development to detect structures that for some reason escaped the automatic curation steps described above. It is important to realize that most of these structure curation steps do not depend on the level of chemical structure representation, i.e., 2D or 3D, with the possible exception of instances when a dataset includes chiral compounds. Obviously, if standard descriptors are calculated from a 2D representation of chemical structure, e.g., from chemical graphs, as is the case for most molecular connectivity indices,[21] then any pair of enantiomers or diastereoisomers will be formally recognized as duplicates. If specific chiralities for such pairs of compounds are known along with the compounds’ activities, descriptors taking chirality into account should be used, and all isomers should be retained in the dataset. If, however, chirality information is unavailable, only one compound, usually with the highest (or mean) activity, should be retained, and chirality-sensitive descriptors should not be used. There are different tools available for dataset curation. For example, Molecular Operating Environment (MOE) from CCG[22] includes the Database Wash tool. It allows changing molecules’ names, adding or removing hydrogen atoms, removing salts and heavy atoms, even if they are covalently connected to the rest of the molecule, and changing or generating tautomers and protomers (cf. the MOE manual for more details).
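The duplicate-handling rule described above for datasets without chirality information (retain a single record per 2D structure, carrying the highest or the mean activity) can be sketched as follows. This is a minimal illustration under stated assumptions: each record is assumed to arrive with a stereo-free structure key (for example, a canonical identifier produced by an external toolkit), and the function name `merge_duplicates` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def merge_duplicates(records, how="mean"):
    """Collapse records that share a 2D structure key into one record each.

    records: iterable of (structure_key, activity) pairs; structure_key is an
             assumed stereo-free canonical identifier computed elsewhere.
    how:     "mean" averages the activities of duplicates,
             "max" keeps the highest activity.
    """
    if how not in ("mean", "max"):
        raise ValueError("how must be 'mean' or 'max'")
    grouped = defaultdict(list)
    for key, activity in records:
        grouped[key].append(activity)
    reduce = mean if how == "mean" else max
    return {key: reduce(values) for key, values in grouped.items()}
```

When chiralities are known and chirality-aware descriptors are used, this merging step would be skipped and all isomers retained, as stated in the text.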
Various database curation tools are included in ChemAxon[23] as well. If commercial software tools such as MOE are unavailable (notably, the ChemAxon software is free to academic investigators), one can use standard UNIX/LINUX tools to perform some of the dataset cleaning tasks. It is important to have some freely available molecular format converters such as OpenBabel[24] or MolConverter from ChemAxon.[23] Figure 3 illustrates major elements of the data curation workflow discussed in more detail in our recent paper.[20] This protocol is enabled by accessible software tools; most of them are publicly available and free for academic use (from ChemAxon,[23] OpenEye,[25] OpenBabel,[24] ISIDA,[26] HiT QSAR,[27] Hyleos[28]), but some are commercial (from Molecular Networks,[29] CCG,[22] CambridgeSoft[30]).

Figure 3. Workflow for chemical data curation.

It is more difficult to spot errors in biological data since there are no obvious technical approaches similar to chemical record curation that can be used in this case. However, rigorously derived QSAR models can indeed be used to identify compounds for which predictions consistently disagree with experimental observations and that are likely to be annotated with erroneous biological testing results. Our recent studies provide specific examples demonstrating that the use of cheminformatics approaches helped spot gaps or errors in biological annotations of toxic compounds.[20,31]

2.2 Dataset Size and Balancing

The number of compounds in the dataset for QSAR studies should not be too small, or, for practical reasons, too large. The upper limit is often defined by the computer and time resources available for building QSAR models using the selected methodologies.
For example, for the k-nearest neighbors (kNN) QSAR approach frequently practiced in our laboratory, the number of compounds in the training set (i.e., the compounds used to build QSAR models) should not exceed ca. 2000 because the approach becomes inefficient when processing large datasets. When a dataset includes more compounds, several approaches can be implemented: (i) select a diverse subset of compounds; (ii) cluster the dataset and build models separately for each cluster; (iii) sometimes, in the case of classification or category QSAR, when compounds belong to a small number of activity classes or categories (e.g., active and inactive), it is possible to exclude many compounds from model development. The difference between classes and categories is that, in contrast to classes, categories can be ordered. Examples of classes are ligands of different receptors; examples of categories are sets of compounds that are described as very active, moderately active, and inactive. The lower limit of the number of compounds in the dataset is also defined by several factors. For example, in most cases, as part of model validation schemes, we divide the dataset into three subsets: training, test and external evaluation sets (see additional discussion below). Training sets are used in model development, and if they are too small, chance correlation and overfitting become major problems that prevent one from building truly predictive models. While it is impossible to give an exact minimum number of compounds in a dataset for which building reliable QSAR models is feasible, some simple ideas described here may help. We suggest that in the case of a continuous response variable (activity) the number of compounds in the training set should be at least 20, and about 10 compounds should be in each of the test and external evaluation sets, so the total minimum number of compounds should be no less than 40.
In the case of a classification or category response variable, the training set should contain at least about 10 compounds of each class, and the test and external evaluation sets should contain no less than five compounds of each class. So, there should be at least 20 compounds of each class. The best situation is when the number of compounds in the dataset is between these two extremes: about 150–300 compounds in total, and in the case of classification or category QSAR an approximately equal number of compounds of each class or category. There are also requirements for activity values. In the case of a continuous response variable, the total range of activities should be at least five times higher than the experimental error. No large gaps (exceeding 10%–15% of the entire range of activities) are allowed between two consecutive activity values ordered by value. In the case of classification or category QSAR, there should be at least 20 compounds of each class or category; preferably, the number of compounds in all classes or categories should be approximately the same. However, many existing datasets are imbalanced or biased (i.e., the sizes of different classes or categories are different). In these cases, special approaches should be used to equalize the number of compounds in different classes or categories. Indeed, in many datasets, the counts of compounds that belong to different classes or categories are significantly different (by a factor of several or even by orders of magnitude). Usually, active compounds constitute the smaller class and inactive compounds the larger class (which is practically always the case for datasets resulting from large-scale HTS studies). Active compounds (typically binding to a certain biological target) belong to a relatively small number of structural classes. On the other hand, compounds included in the larger class (i.e.
inactive compounds) can be very diverse: some of them can belong to the same structural classes as active compounds, while other compounds (often, the majority of them) have very different structures highly dissimilar from those included in the smaller class. So they cover a large area in the descriptor space relative to the active compounds, which are much more similar to each other. In these cases, direct development of predictive QSAR models using entire datasets is difficult, if not impossible. Indeed, training and test sets reflect the composition of the entire dataset, in which almost all compounds are inactive, so the modeling and validation will be biased toward correct prediction of the larger class. Thus, reducing the number of compounds included in the larger class is necessary. This can be achieved easily by calculating the distance (or similarity) matrix between compounds belonging to different classes, followed by excluding compounds of the larger class that are dissimilar beyond a certain threshold from those of the smaller class. Ideally, after excluding dissimilar compounds of the larger class, the number of remaining compounds of this class should be more or less equal to the number of compounds of the smaller class. Classification QSAR models are then developed only for compounds that remain in the balanced dataset. In other words, the modeling subset will not include compounds of the (initially) larger class that were excluded by the procedure as more dissimilar to the smaller class than the remaining molecules of the (initially) bigger class. This approach makes it more challenging to achieve a successful QSAR model that discriminates, say, active compounds from the most chemically similar inactive compounds; therefore we consider it inherently more robust than alternative approaches reported in the literature in which random samples of the bigger class are used to create balanced datasets for classification modeling. Alternative approaches that could help balance datasets include undersampling of the bigger class[32,33] or oversampling of the smaller class.[34] An extended discussion of these approaches is beyond the scope of this review.

2.3 Detection and Removal of Outliers Prior to QSAR Studies

The success of QSAR modeling depends on the appropriate selection of a dataset for QSAR studies. In a recent editorial of the Journal of Chemical Information and Modeling, Maggiora[18] noted that one of the main deficiencies of many chemical datasets is that they do not fully satisfy the main hypothesis underlying all QSAR studies: similar compounds have similar biological activities or properties. Maggiora defines “cliffs” in the descriptor space as regions where the properties change so rapidly that adding or deleting one small chemical group can lead to a dramatic change in the compound’s property. In other words, small changes of descriptor values can lead to large changes in molecular properties. Generally, in this case there could be not just one outlier, but a subset of compounds whose properties are different from those on the other “side” of the cliff. In other words, cliffs are areas where the main QSAR hypothesis (similar compounds have similar properties) does not hold. So cliff detection is a major QSAR problem. In the QSAR area, many researchers have been aware of these and other problems related to outlier detection, but have not yet paid sufficient attention to addressing them in automated QSAR procedures.
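As a rough illustration of the cliff notion described above (small changes in descriptor values accompanied by large changes in property), one could flag pairs of compounds that are close in descriptor space yet differ strongly in activity. The sketch below is not a published algorithm: the Euclidean metric, the function name `find_activity_cliffs` and both thresholds are assumptions chosen only for demonstration.

```python
from itertools import combinations
from math import dist


def find_activity_cliffs(descriptors, activities, max_distance, min_activity_gap):
    """Return index pairs that are near in descriptor space but far in activity.

    descriptors:      list of equal-length numeric tuples, one per compound
    activities:       list of activity values in the same order
    max_distance:     assumed Euclidean cutoff below which two compounds
                      count as "similar"
    min_activity_gap: assumed activity difference above which a similar
                      pair is flagged as a cliff
    """
    cliffs = []
    for i, j in combinations(range(len(descriptors)), 2):
        close = dist(descriptors[i], descriptors[j]) <= max_distance
        gap = abs(activities[i] - activities[j])
        if close and gap >= min_activity_gap:
            cliffs.append((i, j))
    return cliffs
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for datasets of a few hundred compounds but would need a neighbor search structure for much larger collections.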
There are two types of outliers we must be aware of: leverage (or structural) outliers and activity outliers. Structural outliers can be defined as singletons in a dataset clustered using any of the available techniques described in the standard statistical literature, and activity outliers are essentially defined as activity “cliffs” (see above). One should keep in mind that both types of outliers can be real or due to errors in structure representation or biological activity annotation, but in any case, preserving outliers in a modeling dataset will likely lead to model instability; the latter can be manifested in significant differences in the external predictive power of models built with the n-fold external validation strategy. Thus, outliers should be removed before proceeding with model development and analyzed separately for possible errors; however, no current QSAR modeling technique provides a reliable approach to building models that take outliers into account. Finding such approaches is one of the challenges facing the field.

2.4 Critical Importance of Model Validation

In our important paper titled “Beware of $q^2$!”,[35] we demonstrated the insufficiency of training set statistics for developing externally predictive QSAR models and formulated the main principles of model validation. Despite earlier observations and warnings by several authors[36–38] that a high cross-validated correlation coefficient $R^2$ ($q^2$) is a necessary but insufficient condition for a model to have high predictive power, many studies continue to consider $q^2$ as the only parameter characterizing the predictive power of QSAR models. In this respect, we have shown[35] that the predictive power of QSAR models can be claimed only if the model was successfully applied to predicting the external test set compounds, which were not used in model development.
Indeed, it is important to emphasize that the true predictive power of a QSAR model can be established only through a model validation procedure that consists of predicting the activities of compounds that were not included in model building, i.e., compounds in the test set. In contrast to the test set, compounds used for model building constitute the training set. In many QSAR studies multiple models are built, and the “best” models are then selected based on the prediction statistics for the test set. Thus, the test set is actually used to select models. Using the test set for model selection means that such a routine can hardly be regarded as adequate external model validation. In fact, it does not guarantee at all that models selected in this way will make accurate predictions if used for chemical database mining (i.e., predicting activities of compounds in a truly external database). In our workflow, to simulate the use of QSAR models for database mining, a so-called external evaluation set is employed. It should consist of compounds with known activities that are not included in either the training or the test sets. The external evaluation set can be selected randomly from the entire initial dataset. In general, the size of the external evaluation set should be about 15%–20% of the entire dataset. The remaining part of the dataset is called the modeling set, which can be divided into training and test sets. Algorithms for dividing a modeling set into diverse and representative training and test sets were developed in our group previously and are reported and discussed in detail elsewhere.[39] We have demonstrated earlier[35] that the majority of models with high $q^2$ values have poor predictive power when applied to the prediction of compounds in the external test set. In the subsequent publication[40] the importance of rigorous validation was again emphasized as a crucial, integral component of model development.
Several examples of published QSPR models with high fitted accuracy for the training sets, which failed rigorous validation tests, have been considered. We presented a set of simple guidelines for developing validated and predictive QSPR models and discussed several validation strategies such as the randomization of the response variable (Y-randomization), external validation using rational division of a dataset into training and test sets, and others. We highlighted the need to establish the domain of model applicability in the chemical space to flag molecules for which predictions may be unreliable, and discussed some algorithms that can be used for this purpose. We advocated the broad use of these guidelines in the development of predictive QSPR models.[40–42] The importance of model validation can now be regarded as collective wisdom within the community of molecular modelers. At the 37th Joint Meeting of the Chemicals Committee and Working Party on Chemicals, Pesticides & Biotechnology, held in Paris on 17–19 November 2004, the OECD (Organization for Economic Co-operation and Development) member countries adopted the following five principles that valid (Q)SAR models should follow to allow their use in regulatory assessment of chemical safety: (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined domain of applicability; (iv) appropriate measures of goodness-of-fit, robustness and predictivity; (v) a mechanistic interpretation, if possible. Since then, most European authors publishing in the QSAR area include a statement that their models fully comply with the OECD principles (e.g., see References[43–46]). Validation of QSAR models is one of the most critical problems of QSAR.
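The Y-randomization test mentioned above can be illustrated with a deliberately small example. The sketch assumes a single descriptor and an ordinary least-squares line as the model, which is far simpler than the QSAR techniques discussed in this review; the function names `fit_line`, `loo_q2` and `y_randomization` are hypothetical. The point is only the logic: the leave-one-out $q^2$ obtained with real activities should clearly exceed the $q^2$ values obtained after shuffling the activities.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def loo_q2(x, y):
    """Leave-one-out cross-validated R^2 (q^2) for the one-descriptor model."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without compound i, then predict compound i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    my = mean(y)
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)


def y_randomization(x, y, n_trials=100, seed=0):
    """Return (q2 for real data, list of q2 values for shuffled activities).

    x and y are expected to be Python lists of floats.
    """
    rng = random.Random(seed)
    shuffled_q2 = []
    for _ in range(n_trials):
        y_perm = y[:]
        rng.shuffle(y_perm)
        shuffled_q2.append(loo_q2(x, y_perm))
    return loo_q2(x, y), shuffled_q2
```

In practice one would compare the real $q^2$ against the distribution of shuffled values (for instance, its maximum or a high percentile) and repeat the full model-building procedure, including any variable selection, for every shuffled response.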
Recently, we have extended our requirements for the validation of multiple QSAR models selected by acceptable statistical criteria of prediction for the test set.[47] Additional studies on this critical component of QSAR modeling should establish reliable and commonly accepted applicability domain criteria, which should make models increasingly useful for virtual screening.

2.5 Applicability Domains and Model Acceptability Criteria

One of the most important problems in QSAR analysis is establishing the domain of applicability for each model. In the absence of an applicability domain restriction, each model can formally predict the activity of any compound, even one with a completely different structure from those included in the training set. Thus, the absence of the model applicability domain as a mandatory component of any QSAR model would lead to unjustified extrapolation of the model in chemistry space and, as a result, a high likelihood of inaccurate predictions. In our research we have always paid particular attention to this issue.[40,48–55] A good overview of commonly used applicability domain definitions can be found elsewhere.[56,57] In our earlier publications[35,40] we have recommended a set of statistical criteria that must be satisfied by a predictive model. For continuous QSAR, the criteria that we follow in developing activity/property predictors are based on: (i) the correlation coefficient $R$ between the predicted and observed activities; (ii) the coefficients of determination[58] for regressions through the origin (predicted versus observed activities, $R^2_0$, and observed versus predicted activities, $R'^2_0$); (iii) the slopes $k$ and $k'$ of the regression lines through the origin.
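The statistics listed above can be computed along the following lines. This is a hedged sketch rather than the author's implementation: it assumes that "predicted versus observed" means regressing predicted on observed activities through the origin (giving $k$ and $R^2_0$), with the roles swapped for $k'$ and $R'^2_0$, and it uses the common definition of the coefficient of determination as one minus the ratio of residual to total sum of squares. The function names are hypothetical.

```python
from statistics import mean


def r_squared(a, b):
    """Squared Pearson correlation coefficient between two sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov * cov / (var_a * var_b)


def through_origin(dep, indep):
    """Fit dep = slope * indep with no intercept; return (slope, R0^2)."""
    slope = sum(d * x for d, x in zip(dep, indep)) / sum(x * x for x in indep)
    m = mean(dep)
    ss_res = sum((d - slope * x) ** 2 for d, x in zip(dep, indep))
    ss_tot = sum((d - m) ** 2 for d in dep)
    return slope, 1.0 - ss_res / ss_tot


def external_statistics(observed, predicted):
    """Collect R^2, R0^2, R'0^2, k and k' for a set of predictions."""
    # Assumed reading: k and R0^2 come from predicted regressed on observed.
    k, r0_sq = through_origin(predicted, observed)
    # k' and R'0^2 come from observed regressed on predicted.
    k_prime, r0_sq_prime = through_origin(observed, predicted)
    return {
        "R2": r_squared(observed, predicted),
        "R0_2": r0_sq,
        "R0_2_prime": r0_sq_prime,
        "k": k,
        "k_prime": k_prime,
    }
```

The returned values can then be compared against the acceptability thresholds given in the text; the training-set $q^2$ has to be supplied separately because it is computed by cross-validation rather than from test-set predictions.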
We consider a QSAR model predictive if the following conditions are satisfied: (i) $q^2 > 0.5$; (ii) $R^2 > 0.6$; (iii) $(R^2 - R_0^2)/R^2 < 0.1$ and $0.85 \le k \le 1.15$, or $(R^2 - R'^2_0)/R^2 < 0.1$ and $0.85 \le k' \le 1.15$; (iv) $|R_0^2 - R'^2_0| < 0.3$. Here $q^2$ is the cross-validated correlation coefficient calculated for the training set, whereas all other criteria are calculated for the test set (for additional discussion, see[8]).

3 Predictive QSAR Models as Virtual Screening Tools

In our recent studies we were fortunate to recruit experimental collaborators who have validated computational hits identified by virtual screening of commercially available compound libraries using rigorously validated QSAR models. Examples include anticonvulsants,[53] HIV-1 reverse transcriptase inhibitors,[59] D1 antagonists,[60] antitumor compounds,[61] beta-lactamase inhibitors,[62] human histone deacetylase (HDAC) inhibitors,[63] and geranylgeranyltransferase-I inhibitors.[64] Thus, models resulting from the predictive QSAR workflow could be used to prioritize the selection of chemicals for experimental validation. We note that such studies can only be done if sufficient data are available for a series of tested compounds such that robust validated models can be developed. To illustrate the power of validated QSAR models as virtual screening tools, the following examples describe studies in which models developed with the predictive QSAR modeling and validation workflow (Figure 2) were used for virtual screening of commercial libraries and yielded experimentally confirmed hits.
3.1 Discovery of Novel Anticancer Agents

A combined approach of validated QSAR modeling and virtual screening was successfully applied to the discovery of novel tylophorine derivatives as anticancer agents.[61] QSAR models were initially developed for 52 chemically diverse phenanthrene-based tylophorine derivatives (PBTs) with known experimental EC50 values using chemical topological descriptors (calculated with the MolConnZ program) and the variable selection k nearest neighbor (kNN) method. Several validation protocols were applied to achieve robust QSAR models. The original dataset was divided into multiple training and test sets, and the models were considered acceptable only if the leave-one-out cross-validated $R^2$ ($q^2$) values were greater than 0.5 for the training sets and the correlation coefficient $R^2$ values were greater than 0.6 for the test sets. Furthermore, the $q^2$ values for the actual dataset were shown to be significantly higher than those obtained for the same dataset with randomized target properties (Y-randomization test), indicating that the models were statistically significant. The ten best models were then employed to mine the commercially available ChemDiv database (ca. 500 K compounds), resulting in 34 consensus hits with moderate to high predicted activities. Ten structurally diverse hits were experimentally tested and eight were confirmed active, with the highest experimental EC50 of 1.8 μM, implying an exceptionally high hit rate (80%). The same ten models were further applied to predict EC50 for four new PBTs, and the correlation coefficient ($R^2$) between the experimental and predicted EC50 for these compounds plus the eight active consensus hits was shown to be as high as 0.57.
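The Y-randomization test mentioned above can be illustrated with a toy example. This is only a sketch under stated assumptions: a single descriptor, a 1-nearest-neighbor regressor standing in for the variable selection kNN method, synthetic data, and hypothetical function names (`loo_q2`, `y_randomization`). It shows the logic of the test, not the published protocol.

```python
import random


def loo_q2(xs, ys):
    """Leave-one-out q2 for a 1-nearest-neighbour regressor on one descriptor."""
    mean_y = sum(ys) / len(ys)
    ss_res = 0.0
    for i, x_i in enumerate(xs):
        # Predict compound i from its nearest neighbour among all other compounds.
        others = [j for j in range(len(xs)) if j != i]
        nearest = min(others, key=lambda j: abs(xs[j] - x_i))
        ss_res += (ys[i] - ys[nearest]) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def y_randomization(xs, ys, n_trials=20, seed=0):
    """Return q2 for the real activities and q2 values for shuffled activities."""
    rng = random.Random(seed)
    shuffled_scores = []
    for _ in range(n_trials):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # break the structure-activity relationship
        shuffled_scores.append(loo_q2(xs, shuffled))
    return loo_q2(xs, ys), shuffled_scores


# Synthetic data: activity depends smoothly on a single descriptor.
xs = [float(i) for i in range(20)]
ys = [2.0 * x for x in xs]
q2_real, q2_random = y_randomization(xs, ys)
```

A model is regarded as statistically significant in this sense only when `q2_real` clearly exceeds every value in `q2_random`; comparable values would suggest a chance correlation.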
3.2 Discovery of Novel Histone Deacetylase (HDAC) Inhibitors

Histone deacetylases (HDACs) play a critical role in transcription regulation, and small-molecule HDAC inhibitors have become an emerging class of agents for the treatment of cancer and other cell proliferation diseases. We generated QSAR models for 59 chemically diverse compounds with inhibitory activity against class I HDAC. MOE[22] and MolConnZ[65] 2D descriptors were combined independently with the variable selection k nearest neighbor (kNN) and support vector machine (SVM) approaches to improve the predictive power of the models. Rigorous model validation approaches were employed, including randomization of the target activity (Y-randomization test) and assessment of model predictivity by consensus prediction on two external datasets. Highly predictive QSAR models were generated, with leave-one-out cross-validated $R^2$ ($q^2$) values for the training set and $R^2$ values for the test set as high as 0.81 and 0.80, respectively, with the MolConnZ/kNN approach, and 0.94 and 0.81, respectively, with the MolConnZ/SVM approach. Validated QSAR models were then used to mine four chemical databases (the National Cancer Institute (NCI), Maybridge, ChemDiv, and ZINC databases) comprising a total of over 3 million compounds. The searches resulted in 48 consensus hits, including two reported HDAC inhibitors that were not included in the original data set. Four database hits with novel structural features were purchased and tested using the same biological assay that was employed to assess the inhibitory activity of the training set compounds. Three of these four compounds were confirmed active, with the best inhibitory activity (IC50) of 1 μM.
3.3 Discovery of Novel Geranylgeranyltransferase Type I (GGTase-I) Inhibitors

In another recent study,[64] we employed our standard QSAR modeling workflow (Figure 2) to discover novel geranylgeranyltransferase type I (GGTase-I) inhibitors. Geranylgeranylation is critical to the function of several proteins including Rho, Rap1, Rac, Cdc42, and G-protein gamma subunits. GGTase-I inhibitors (GGTIs) have therapeutic potential to treat inflammation, multiple sclerosis, atherosclerosis, and many other diseases. We developed and rigorously validated models for 48 GGTIs using the variable selection k nearest neighbor,[66] automated lazy learning,[54] and genetic algorithm-partial least squares[67] QSAR methods. The QSAR models were employed for virtual screening of 9.5 million commercially available chemicals, yielding 47 diverse computational hits. Seven of these compounds with novel scaffolds and high predicted GGTase-I inhibitory activities were tested in vitro, and all were found to be bona fide and selective micromolar inhibitors. Figure 4 shows the structures of representative training set compounds as well as confirmed computational hits. We should emphasize that QSAR models have been traditionally viewed as lead optimization tools capable of predicting compounds with chemical structures similar to those of the molecules in the training set. However, this study clearly indicates (Figure 4) that with enough attention given to the model development process, and using chemical descriptors characterizing whole molecules (as opposed to, e.g., chemical fragments), it is indeed possible to discover compounds with novel chemical scaffolds. Furthermore, in our study we have additionally demonstrated that these novel hits could not be identified using a traditional chemical similarity search,[64] which highlights the power of robust QSAR models as a drug discovery tool.
Figure 4. The use of QSAR modeling, virtual screening of commercial libraries, and experimental validation of computational hits afforded the discovery of geranylgeranyltransferase-I inhibitors with novel scaffolds.

In summary, the examples above demonstrate that QSAR models could be used successfully as virtual screening tools to discover compounds with the desired biological activity in chemical databases or virtual libraries.[53,60,61,68,69] It should be stressed that the total number of compounds selected by virtual screening based on QSAR model predictions is typically relatively small, only a few dozen. Obviously, the total number of computational hits is controlled by the stringency of the applicability domain. In most published cases, because we were limited in both time and resources, we chose a very conservative applicability domain, leading to the selection of a small library of computational hits with the expectation that a large fraction of these would be confirmed as active compounds. In industrial-size projects it may be more reasonable to loosen the applicability domain requirement and increase the size of the virtual hit library. One may expect that the increase in the library size will result in lower relative accuracy of prediction, but the absolute number of confirmed hits may actually increase. Thus, scientists using QSAR models that incorporate the applicability domain should always be aware of the interplay between the size of the domain, the coverage of the virtual screening library, and the prediction accuracy, and should use the applicability domain as a tunable parameter to control this interplay. The discovery of novel bioactive chemical entities is the primary goal of computational drug discovery, and the development of validated and predictive QSAR models is critical to achieving this goal.
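The trade-off between the size of the applicability domain and library coverage can be made concrete with a small sketch. The domain definition used here is an assumption that this section does not specify: a screened compound is accepted if its Euclidean distance to the nearest training compound does not exceed the mean nearest-neighbor distance within the training set plus `z` standard deviations. The names `ad_threshold`, `coverage`, and `z`, and the two-descriptor toy data, are all hypothetical.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from a point to its closest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def ad_threshold(training, z):
    """Distance cutoff: mean + z * std of nearest-neighbour distances in training."""
    nn = [
        nearest_distance(p, training[:i] + training[i + 1:])
        for i, p in enumerate(training)
    ]
    mean = sum(nn) / len(nn)
    std = math.sqrt(sum((d - mean) ** 2 for d in nn) / len(nn))
    return mean + z * std


def coverage(training, library, z):
    """Fraction of library compounds falling inside the applicability domain."""
    cutoff = ad_threshold(training, z)
    inside = [c for c in library if nearest_distance(c, training) <= cutoff]
    return len(inside) / len(library)


# Toy two-descriptor data (made up for illustration).
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 3.0)]
library = [(0.5, 0.5), (2.0, 2.0), (5.0, 5.0), (1.5, 0.2)]

# Coverage for a strict, a moderate, and a loose domain.
coverages = [coverage(training, library, z) for z in (0.0, 1.0, 3.0)]
```

Because the cutoff grows with `z`, coverage can only stay the same or increase as the domain is loosened; whether prediction accuracy holds up for the additional compounds has to be checked experimentally or on an external set.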
4 Best Practices for Contests in QSAR Modeling: Competitive Collaboration and Consensus Modeling

The title of this section may appear contradictory and perhaps controversial because competition is one of the major (and for the most part, healthy) attributes of scientific research. Nevertheless, we believe that QSAR modeling may provide a unique environment to advance the field by a mechanism that we may regard as “competitive collaboration”. The following example may help illustrate our point.

4.1 Study Design

In a recent study,[15] the combinatorial QSAR modeling approach was applied to a diverse series of organic compounds tested for aquatic toxicity in Tetrahymena pyriformis in the same laboratory of Prof. T. Schultz over nearly a decade.[70–76] The unique aspect of this research was that it was conducted in collaboration between six academic groups specializing in cheminformatics and computational toxicology. The common goal of our virtual collaboratory was to explore the relative strengths of various QSAR approaches in their ability to develop robust and externally predictive models of this particular toxicity end point. The members of our collaboratory included scientists from the University of North Carolina at Chapel Hill in the United States (UNC); University of Louis Pasteur (ULP) in France; University of Insubria (UI) in Italy; University of Kalmar (UK) in Sweden; Virtual Computational Chemistry Laboratory (VCCLAB) in Germany; and the University of British Columbia (UBC) in Canada. Each group relied on its own QSAR modeling approaches to develop toxicity models using the same modeling set, and we agreed to evaluate the realistic model performance using the same external validation set(s). The T. pyriformis toxicity dataset used in this study was compiled from several publications of the Schultz group as well as from data available at the Tetratox database website (http://www.vet.utk.edu/TETRATOX/).
After deleting duplicates as well as several compounds with conflicting test results and correcting several chemical structures in the original data sources, our final dataset included 983 unique compounds. The dataset was randomly divided into two parts: 1) the modeling set of 644 compounds; 2) the validation set including 339 compounds. The former set was used for model development by each participating group and the latter set was used to estimate the external prediction power of each model as a universal metric of model performance. In addition, when this project was already well underway, a new dataset had become available from the most recent publication by the Schultz group.[77] It provided us with an additional external set to evaluate the predictive power and reliability of all QSAR models. Among compounds reported in[77] 110 were unique, i.e., not present among the original set of 983 compounds; thus, these 110 compounds formed the second independent validation set for our study. Naturally, different groups employed different techniques and (sometimes) different statistical parameters to evaluate the performance of models developed independently for the modeling set. To harmonize the results of this study the same standard parameters were chosen to describe each model’s performance as applied to the modeling and external test set predictions. 
Thus, we have employed $Q^2_{abs}$ (squared leave-one-out cross-validation correlation coefficient) for the modeling set, $R^2_{abs}$ (frequently described as the coefficient of determination) for the external validation sets, and MAE (mean absolute error) for the linear correlation between predicted ($Y_{pred}$) and experimental ($Y_{exp}$) data (here, $Y = pIGC_{50}$); these parameters are defined as follows:

$$Q^2_{abs} = 1 - \frac{\sum_Y (Y_{exp} - Y_{LOO})^2}{\sum_Y (Y_{exp} - \langle Y \rangle_{exp})^2} \qquad (1)$$

$$R^2_{abs} = 1 - \frac{\sum_Y (Y_{exp} - Y_{pred})^2}{\sum_Y (Y_{exp} - \langle Y \rangle_{exp})^2} \qquad (2)$$

$$MAE = \frac{\sum_Y |Y - Y_{pred}|}{n} \qquad (3)$$

Many other statistical characteristics can be used to evaluate model performance; however, we restricted ourselves to these three parameters, which provide minimal but sufficient information concerning any model’s ability to reproduce both the trends in experimental data for the test sets and the mean accuracy of predicting all experimental values. The models were considered acceptable if $R^2_{abs}$ exceeded 0.5.

4.2 QSAR Models of Aquatic Toxicity; Comparison Between Methods and Models

The objective of this study from a methodological perspective was to explore the suitability of different QSAR modeling tools for the analysis of a dataset with an important toxicological endpoint. Typically, such datasets are analyzed with one (or several) modeling techniques, with great emphasis on the (high value of) statistical parameters of the training set models. In this study, we went well beyond the modeling studies reported in the original publications from the Schultz group in several respects. First, we have compiled all reported data on chemical toxicity against T. pyriformis in a single large dataset and attempted to develop global QSAR models for the entire set. Second, we have employed multiple QSAR modeling techniques thanks to the engagement of six collaborating groups.
Third, we have focused on defining model performance criteria not only using training set data but most importantly using external validation sets that were not used in model development in any way (unlike any common cross-validation procedure).[78] This focus afforded us the opportunity to evaluate and compare all models using simple and objective universal criteria of external predictive accuracy, which in our opinion is the most important single figure of merit for a QSAR model that is of practical significance for experimental toxicologists. Fourth, we have explored the significance of applicability domains and the power of consensus modeling in maximizing the accuracy of external predictivity of our models. The results of this exercise demonstrated that all models performed quite well for the training set, with even the lowest $Q^2_{abs}$ among them as high as 0.72. However, there was much greater variation between these models when looking at their (universal and objective) performance criteria as applied to the external validation sets. Of the 15 QSAR approaches used in this study, nine implemented method-specific applicability domains (ADs). Models that did not define the AD showed a reduced predictive accuracy for validation set II even though they yielded reasonable results for validation set I.

4.3 The Power of Consensus

For the most part, all models succeeded in achieving reasonable accuracy of external prediction, especially when using the AD. It then appeared natural to bring all models together to explore the power of consensus prediction. Thus, the consensus model was constructed by averaging all available predicted values, taking into account the applicability domain of each individual model. In this case we could use only the nine of 15 models that had the AD defined.
Since each model had its unique way of defining the AD, each external compound could be found within the AD of anywhere between one and nine models, so for averaging we only used models covering the compound. The advantage of this data treatment is that the overall coverage of the prediction is still high because it was rare to have an external compound outside of the ADs of all available models. The results showed that the prediction accuracy for both the modeling set (MAE=0.22) and the validation sets I and II (0.27 and 0.34, respectively) was the best compared to any individual model. The same observation could be made for the correlation coefficient $R^2_{abs}$. The coverage of this consensus model was actually 100% for all three data sets. This observation suggests that consensus models afford both high space coverage and high accuracy of prediction.

In summary, this study presents an example of a fruitful international collaboration between researchers that use different techniques and approaches but share general principles of QSAR model development and validation. Significantly, we did not make any assumptions about the purported mechanisms of aquatic toxicity yet were able to develop statistically significant models for all experimentally tested compounds. However, the most significant single result of our studies is the demonstrated superior performance of the consensus modeling approach when all models are used concurrently and predictions from individual models are averaged. We have shown that both the predictive accuracy and coverage of the final consensus QSAR models were superior as compared to these parameters for individual models. The consensus models appeared robust in terms of being insensitive to both incorporating individual models with low prediction accuracy and the inclusion or exclusion of the AD.
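The consensus scheme described in this section (averaging only over the models whose AD covers a compound) and the external accuracy measures of Eqs. (2) and (3) can be sketched as follows. The data layout, the function names, and the three illustrative "models" are assumptions made for this example; the numbers are invented and do not reproduce any result of the study.

```python
def consensus_predict(model_outputs):
    """Average predictions from models whose applicability domain covers the compound.

    model_outputs: list of (prediction, in_domain) pairs for one compound.
    Returns None when no model covers the compound.
    """
    covered = [pred for pred, in_domain in model_outputs if in_domain]
    if not covered:
        return None
    return sum(covered) / len(covered)


def mae(y_exp, y_pred):
    """Mean absolute error, as in Eq. (3)."""
    return sum(abs(e - p) for e, p in zip(y_exp, y_pred)) / len(y_exp)


def r2_abs(y_exp, y_pred):
    """Coefficient of determination, as in Eq. (2)."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


# Invented external set: experimental values and per-compound outputs of
# three hypothetical models as (prediction, in_domain) pairs.
y_exp = [0.5, 1.2, 2.0, 2.9]
per_compound = [
    [(0.7, True), (0.4, True), (1.5, False)],
    [(1.0, True), (1.3, False), (1.5, True)],
    [(2.4, False), (1.9, True), (2.2, True)],
    [(3.5, True), (2.6, True), (2.8, True)],
]

consensus = [consensus_predict(outputs) for outputs in per_compound]
covered_pairs = [(e, c) for e, c in zip(y_exp, consensus) if c is not None]
coverage = len(covered_pairs) / len(y_exp)
consensus_mae = mae([e for e, _ in covered_pairs], [c for _, c in covered_pairs])
consensus_r2 = r2_abs([e for e, _ in covered_pairs], [c for _, c in covered_pairs])
```

Evaluating MAE and $R^2_{abs}$ only over covered compounds, and reporting coverage alongside them, keeps accuracy and coverage separate so that a model cannot look accurate simply by declining to predict.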
Another important result of this study is the power of addressing complex problems in QSAR modeling by forming a virtual collaboratory of independent research groups, leading to the formulation and empirical testing of best modeling practices. This study confirms the power of the “competitive collaboration” principle that we proposed in the beginning of this section.

5 Summary and Conclusions

As is true perhaps for any computational field, QSAR modeling has been both blessed and, sometimes, cursed in the literature. Our group was among the first to emphasize the importance of statistical validation of QSAR models.[35] As we pointed out and demonstrated with examples in this review (cf. Section 4.2), the high accuracy of the training set model characterized with the leave-one-out cross-validated $R^2$ ($q^2$), i.e., model fitness, is not indicative of high external predictive power of the model. Thus, the exclusive reliance on training set modeling without any external validation is one of the reasons why many models cannot be considered reliable. Another important paper examined the reasons behind the failure of in silico ADME/Tox models,[6] linking the frequent failures to the inappropriate use of models, false expectations, or the procedures used to develop models. In a brief but very important editorial note, G. Maggiora[18] outlined limitations and some reasons for failures of QSAR modeling that relate to the so-called “activity cliffs”, which are known cases when a small change in chemical structure leads to dramatic changes in the target activity. Such cases are indeed difficult to foresee and hard to capture and explain using QSAR models, since the models work best in reflecting relatively smooth trends in structure-activity correlations.
Addressing the activity cliffs problem is indeed a hard problem in QSAR modeling and in some cases it is a source of poor predictions. A summary of various reasons leading to erroneous QSAR models was given in a recent critical overview of the field.[79] Another recent important paper listed as many as 21 possible sources of error when developing QSAR models[7] and provided some recipes as to how to avoid at least some common errors in QSAR model development. In most cases the authors concerned with the quality and practical utility of QSAR models looked deeply into possible sources of errors or offered approaches to improve the robustness of models. On the other hand, the author of the negative opinion letter published in early 2008[19] made an unfortunate attempt to equate the fraction of papers not paying enough attention to the statistical quality of models with the entire field. As we discuss in this review, it is critically important to avoid the oversimplification of the QSAR modeling process and employ statistically robust approaches for both model development and validation. Authors ignoring the complexity of the problem or those paying insufficient attention to model validation do end up developing and in some cases, publishing models that could not be regarded as reliable. Conversely, the criticism of the field should be balanced and based on the thorough analysis of possible sources of error rather than equating the entire field to one large error as the aforementioned opinion letter[19] did. Thus, in this review we have made a fair attempt to outline the objective challenges facing (but not dooming!) the field (such as activity cliffs) and emphasize the importance of developing and practicing rigorous approaches to both model development and validation. In conclusion, we have discussed current best practices for developing robust and externally predictive QSAR models. 
As with any computational molecular modeling approach, it is imperative that QSAR method is used expertly. Therefore, this review has focused on the discussion of critical components of QSAR modeling procedures that should be studied and executed rigorously to enable their successful application. We have shown that with enough attention paid to critical issues of model validation and applicability domain definition, the models could be indeed used successfully to mine external virtual libraries, especially of commercially available chemicals, to generate reliable computational hits. The methods and applications discussed in this review should be of help to both computational and synthetic chemists as well as experimental biologists working in the areas of biological screening of chemical libraries. Acknowledgements The author would like to acknowledge the support from National Cancer Institute NIH (Grant R01GM066940). The author is also thankful to several colleagues in his laboratory, especially Profs. Golbraikh, Fourches, Zhu and Wang, who have conducted many studies cited in this review as well as were engaged in many scientific discussions that helped the author formulate the best current practices for QSAR modeling discussed herein. References [1] PubChem, http://pubchem.ncbi.nlm.nih.gov/, 2008. [2] T. Oprea, A. Tropsha, Drug Discov. Today 2006, 3, 357–365. [3] A. Williams, V. Tkachenko, C. Lipinski, A. Tropsha, S. Ekins, Drug Discovery World 2010, 10, 33–39. [4] A. Tropsha, in Comprehensive Medicinal Chemistry II, Vol. 4 (Ed: Y. C. Martin), Elsevier, Amsterdam 2006, pp. 149–165. [5] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, Germany, 2000. [6] T. R. Stouch, J. R. Kenyon, S. R. Johnson, X. Q. Chen, A. Doweyko, Y. Li, J. Comput. Aided Mol. Des. 2003, 17, 83–92. [7] J. C. Dearden, M. T. Cronin, K. L. Kaiser, SAR QSAR. Environ. Res. 2009, 20, 241–266. [8] A. Tropsha, A. Golbraikh, Curr. Pharm. Des. 2007, 13, 3494– 3504. 
[9] A. Golbraikh, A. Tropsha, Mol. Divers. 2002, 5, 231–243. [10] A. Golbraikh, M. Shen, Z. Xiao, Y. D. Xiao, K. H. Lee, A. Tropsha, J. Comput. Aided Mol. Des. 2003, 17, 241–253. [11] J. J. Irwin, B. K. Shoichet, J. Chem. Inf. Model. 2005, 45, 177–182. [12] M. Olah, M. Mracec, L. Ostopovici, R. Rad, A. Bora, N. Hadaruga, I. Olah, M. Banda, Z. Simon, M. Mracec, T. I. Oprea, in Chemoinformatics in Drug Discovery (Ed: T. I. Oprea), Wiley-VCH, New York, 2005, pp. 223–239. [13] M. Olah, R. Rad, L. Ostopovici, A. Bora, N. Hadaruga, D. Hadaruga, R. Moldovan, A. Fulias, M. Mracec, T. I. Oprea, in Chemical Biology: From Small Molecules to Systems Biology and Drug Design (Eds: S. L. Schreiber, T. M. Kapoor, G. Weiss), Wiley-VCH, Weinheim, 2007, pp. 760–786. [14] D. Young, T. Martin, R. Venkatapathy, P. Harten, QSAR Comb. Sci. 2008, 27, 1337–1345. [15] H. Zhu, A. Tropsha, D. Fourches, A. Varnek, E. Papa, P. Gramatica, T. Oberg, P. Dao, A. Cherkasov, I. V. Tetko, J. Chem. Inf. Model. 2008, 48, 766–784. [16] I. V. Tetko, I. Sushko, A. K. Pandey, H. Zhu, A. Tropsha, E. Papa, T. Oberg, R. Todeschini, D. Fourches, A. Varnek, J. Chem. Inf. Model. 2008, 48, 1733–1746. [17] W. L. Jorgensen, J. Chem. Inf. Model. 2006, 46, 937. [18] G. M. Maggiora, J. Chem. Inf. Model. 2006, 46, 1535. [19] S. R. Johnson, J. Chem. Inf. Model. 2008, 48, 25–26. [20] D. Fourches, E. Muratov, A. Tropsha, J. Chem. Inf. Model. 2010, DOI: 10.1021/ci100176x in press. [21] L. B. Kier, L. H. Hall, Molecular Connectivity in Chemistry and Research, Academic Press, New York, 1976. [22] MOE, Chemical Computing Group. http://www.chemcomp.com/index.htm, 2010. [23] ChemAxon, ChemAxon JChem (http://www.chemaxon.com), 2010. [24] OpenBabel, the OpenSource Chemistry Toolbox, Openbabel.org, 2010, 2–1–2010.
[25] OpenEye Scientific Software, http://www.eyesopen.com/ products/applications/filter.html, 2010. [26] ISIDA software, Laboratoire d’Infochimie, Louis Pasteur University, Strasbourg, France (infochim.u-strasbg.fr), 2010. [27] V. E. Kuz’min, A. G. Artemenko, E. N. Muratov, J. Comp. Aid. Mol. Des. 2008, 22, 403–421. [28] Hyleos, http://www.hyleos.net/, 2010. [29] Molecular Networks GmbH, (http://www.molecular-networks. com/products), 2010. [30] CambridgeSoft, http://www.cambridgesoft.com/, 2009. [31] D. Fourches, J. C. Barnes, N. C. Day, P. Bradley, J. Z. Reed, A. Tropsha, Chem. Res. Toxicol. 2010, 23, 171–183. [32] S.-J. Yen, Y.-S. Lee, Lecture Notes in Control and Information Sciences 2006, 344, 731–740. [33] M. Kubat, S. Matwin, Proc. 14th Conf. on Machine Learning, 1977, pp. 179–186. [34] N. Japkowicz, Proc. Learning from Imbalanced Data Sets, Papers from the AAAI Workshop, Technical Report WS-00-05 (Ed: N. Japkowicz), pp. 10–15. [35] A. Golbraikh, A. Tropsha, J. Mol. Graph. Model. 2002, 20, 269– 276. [36] E. Novellino, C. Fattorusso, G. Greco, Pharm. Acta Helv. 1995, 70, 149–154. [37] U. Norinder, J. Chemomet. 1996, 10, 95–105. [38] A. Tropsha, S. J. Cho, in 3D QSAR in Drug Design (Eds: H. Kubinyi, G. Folkers, Y. C. Martin), Kluwer Academic, Dordrecht, The Netherlands, 1998, pp. 57–69. [39] A. Golbraikh, M. Shen, Z. Xiao, Y. D. Xiao, K. H. Lee, A. Tropsha, J. Comput. Aided Mol. Des. 2003, 17, 241–253. [40] A. Tropsha, P. Gramatica, V. K. Gombar, QSAR Comb. Sci. 2003, 22, 69–77. [41] A. Golbraikh, A. Tropsha, J. Comput. Aided Mol. Des. 2002, 16, 357–369. [42] A. Golbraikh, M. Shen, Z. Xiao, Y. D. Xiao, K. H. Lee, A. Tropsha, J. Comput. Aided Mol. Des. 2003, 17, 241–253. [43] M. Pavan, T. I. Netzeva, A. P. Worth, SAR QSAR Environ. Res. 2006, 17, 147–171. [44] M. Vracko, V. Bandelj, P. Barbieri, E. Benfenati, Q. Chaudhry, M. Cronin, J. Devillers, A. Gallegos, G. Gini, P. Gramatica, C. Helma, P. Mazzatorta, D. Neagu, T. Netzeva, M. Pavan, G. Patlewicz, M. 
Randic, I. Tsakovska, A. Worth, SAR QSAR Environ. Res. 2006, 17, 265–284. [45] A. G. Saliner, T. I. Netzeva, A. P. Worth, SAR QSAR Environ. Res. 2006, 17, 195–223. [46] D. W. Roberts, A. O. Aptula, G. Patlewicz, Chem. Res. Toxicol. 2006, 19, 1228–1233. [47] S. Zhang, A. Golbraikh, A. Tropsha, J. Med. Chem. 2006, 49, 2713–2724. [48] A. Golbraikh, D. Bonchev, A. Tropsha, J. Chem. Inf. Comput. Sci. 2001, 41, 147–158. [49] A. Kovatcheva, G. Buchbauer, A. Golbraikh, P. Wolschann, J. Chem. Inf. Comput. Sci. 2003, 43, 259–266. [50] A. Kovatcheva, A. Golbraikh, S. Oloff, Y. D. Xiao, W. Zheng, P. Wolschann, G. Buchbauer, A. Tropsha, J. Chem. Inf. Comput. Sci. 2004, 44, 582–595. [51] M. Shen, Y. Xiao, A. Golbraikh, V. K. Gombar, A. Tropsha, J. Med. Chem. 2003, 46, 3013–3020. [52] M. Shen, A. LeTiran, Y. Xiao, A. Golbraikh, H. Kohn, A. Tropsha, J. Med. Chem. 2002, 45, 2811–2823. [53] M. Shen, C. Beguin, A. Golbraikh, J. P. Stables, H. Kohn, A. Tropsha, J. Med. Chem. 2004, 47, 2356–2364. [54] S. Zhang, A. Golbraikh, S. Oloff, H. Kohn, A. Tropsha, J. Chem. Inf. Model. 2006, 46, 1984–1995. [55] A. Golbraikh, M. Shen, Z. Xiao, Y. D. Xiao, K. H. Lee, A. Tropsha, J. Comput. Aided Mol. Des. 2003, 17, 241–253. [56] L. Eriksson, J. Jaworska, A. P. Worth, M. T. Cronin, R. M. McDowell, P. Gramatica, Environ. Health Perspect. 2003, 111, 1361– 1375. [57] T. I. Netzeva, S. A. Gallegos, A. P. Worth, Environ. Toxicol. Chem. 2006, 25, 1223–1230. [58] L. Sachs, Handbook of Statistics, Springer, Heidelberg, 1984. [59] J. L. Medina-Franco, A. Golbraikh, S. Oloff, R. Castillo, A. Tropsha, J. Comput. Aided Mol. Des. 2005, 19, 229–242. [60] S. Oloff, R. B. Mailman, A. Tropsha, J. Med. Chem. 2005, 48, 7322–7332. [61] S. Zhang, L. Wei, K. Bastow, W. Zheng, A. Brossi, K. H. Lee, A. Tropsha, J. Comput. Aided Mol. Des. 2007, 21, 97–112. [62] J. H. Hsieh, X. S. Wang, D. Teotico, A. Golbraikh, A. Tropsha, J. Comput. Aided Mol. Des. 2008, 22, 593–609. [63] H. Tang, X. S. Wang, X. P. Huang, B. L. Roth, K. 
V. Butler, A. P. Kozikowski, M. Jung, A. Tropsha, J. Chem. Inf. Model. 2009, 49, 461–476. [64] Y. K. Peterson, X. S. Wang, P. J. Casey, A. Tropsha, J. Med. Chem. 2009, 52, 4210–4220. [65] MolconnZ. http://www.edusoft-lc.com/molconn/, 2010. [66] W. Zheng, A. Tropsha, J. Chem. Inf. Comput. Sci. 2000, 40, 185–194. [67] S. J. Cho, W. Zheng, A. Tropsha, J. Chem. Inf. Comput. Sci. 1998, 38, 259–268. [68] A. Tropsha, W. Zheng, Curr. Pharm. Des. 2001, 7, 599–612. [69] A. Tropsha, in Cheminformatics in Drug Discovery (Ed: T. Oprea), Wiley-VCH, Weinheim, 2005, pp. 437–455. [70] A. O. Aptula, D. W. Roberts, M. T. D. Cronin, T. W. Schultz, Chem. Res. Toxicol. 2005, 18, 844–854. [71] T. W. Schultz, G. D. Sinks, L. A. Miller, Environ. Toxicol. 2001, 16, 543–549. [72] T. W. Schultz, M. T. Cronin, T. I. Netzeva, A. O. Aptula, Chem. Res. Toxicol. 2002, 15, 1602–1609. [73] T. W. Schultz, T. I. Netzeva, M. T. Cronin, SAR QSAR Environ. Res. 2003, 14, 59–81. [74] T. W. Schultz, T. I. Netzeva, M. T. Cronin, SAR QSAR Environ. Res. 2004, 15, 385–397. [75] T. W. Schultz, T. I. Netzeva, D. W. Roberts, M. T. Cronin, Chem. Res. Toxicol. 2005, 18, 330–341. [76] T. W. Schultz, Chem. Res. Toxicol. 1999, 12, 1262–1267. [77] T. W. Schultz, M. Hewitt, T. I. Netzeva, M. T. D. Cronin, QSAR Comb. Sci. 2007, 26, 238–254. [78] P. Gramatica, QSAR Comb. Sci. 2007, 26, 694–701. [79] A. M. Doweyko, J. Comput. Aided Mol. Des. 2008, 22, 81–89. Received: May 28, 2010 Accepted: June 8, 2010 Published online: July 6, 2010