F1 QSAR Principles and Methods

Quantitative Structure Activity Relationship (QSAR) analysis builds mathematical models that correlate molecular structures with molecular properties. In this section we introduce the notion of molecular descriptors and present the QSAR model and its validation.

Author(s): Hanoch Senderowitz (Predix Pharmaceutical), Claude Cohen (Synergix)
Prerequisites: None
Number of Pages: 191 (194 Screens)
Last updated: April 2004
Voice: available

F1.1 Introduction to QSAR

The topic Introduction to QSAR contains 20 pages, including:
■ Molecular Structure and Molecular Properties
■ Structure-Property Relationships: Example 1
■ Structure-Property Relationships: Example 2
■ Structure-Property Relationships: Example 3
■ What is QSAR?
■ What is QSPR?
■ Focus on a Single Property at a Time
■ Molecular Descriptors
■ Examples of Molecular Descriptors
■ The QSAR Equations
■ Types of Molecular Descriptors
■ Molecular Descriptors: 1D
■ Molecular Descriptors: 2D

F1.1.1 Molecular Structure and Molecular Properties

One of the most pervasive postulates in the life sciences is that all molecular properties are encoded by, and consequently result from, molecular structure. Some examples of structure-property relationships are illustrated on the following pages.

[Figure: molecular structure determines biological, chemical, physical and electronic properties, etc.]

F1.1.2 Structure-Property Relationships: Example 1

Paracetamol selectively inhibits the cyclooxygenase enzyme COX-3 found in the brain and spinal cord and consequently relieves pain and reduces fever.

[Figure: Structure (paracetamol) and Property]

F1.1.3 Structure-Property Relationships: Example 2

Cyanide exerts its toxicity by inhibiting cytochrome-c oxidase, the terminal enzyme of the respiratory chain, leading to insufficient utilization of oxygen and suffocation.
Inhibition occurs through binding to the ferric ion of the cytochrome.

[Figure: Structure (cyanide) and Property]

F1.1.4 Structure-Property Relationships: Example 3

Saccharin (usually sold as sodium saccharin) binds to the sweet-taste receptor T1R3, located in the plasma membrane of the sweet-taste sensory cells in the taste buds. Binding of saccharin to T1R3 initiates a cascade of events in the taste-sensory cell that eventually releases a signaling molecule to an adjoining sensory neuron, causing the neuron to send impulses to the brain. In the brain, these signals cause the actual sensation of sweetness.

[Figure: Structure (saccharin) and Property]

F1.1.5 What is QSAR?

Molecules exert their biological effect by binding to their respective receptors, a phenomenon that in turn is governed by their molecular structures (and the molecular structure of the receptor). QSAR (Quantitative Structure Activity Relationship) attempts to formulate the relationship between structure and activity as a mathematical model.

Biological effect = f(Molecular Structure)

F1.1.6 What is QSPR?

The biological effect is just one example of a molecular property. QSPR (Quantitative Structure Property Relationship) is an extension of QSAR designed to formulate the relationship between structure and any molecular property as a mathematical model. Other properties include, for example, solubility, oral bioavailability, metabolic stability and cell permeability.

Property X = f(Molecular Structure)

F1.1.7 Focus on a Single Property at a Time

No single QSPR model can capture the direct connection between all the properties of a compound and its molecular structure; only a single property is handled at a time.

[Figure: one molecular structure linked separately to Property 1, Property 2, Property 3 and Property 4]

F1.1.8 Molecular Descriptors

Even for a single property, deriving a direct relationship with the molecular structure is extremely challenging.
However, structural factors that influence the molecular property, known as molecular descriptors, can be identified. For this reason, the QSAR model correlates the property with molecular descriptors.

F1.1.9 Examples of Molecular Descriptors

Examples of molecular properties with their associated descriptors are listed in the following table. The nature and meaning of some QSAR descriptors are presented later in this chapter.

Molecular Property       Descriptors
Lipophilicity            π, log P, Rm, f
Steric properties        Es, MR, MV, parachor
Electronic properties    …

F1.1.10 The QSAR Equations

All QSAR equations express a molecular property as a function of specific descriptors. They differ in the property they attempt to correlate, the descriptors they use and the mathematical form of the model.

Oral bioavailability = f1(descriptors set 1)
Cell permeability = f2(descriptors set 2)
Toxicity = f3(descriptors set 3)
Metabolic stability = f4(descriptors set 4)
Receptor binding = f5(descriptors set 5)

F1.1.11 Types of Molecular Descriptors

Molecular descriptors can be classified according to the dimensionality of the molecular structure from which they are derived. 1D descriptors are derived from the chemical formula, 2D descriptors are derived from a 2D (ChemDraw-like) structure and 3D descriptors are derived from the 3-dimensional structure.

F1.1.12 Molecular Descriptors: 1D

The chemical formula constitutes a 1-dimensional representation of the molecular structure from which 1D descriptors can be derived. Such descriptors are based exclusively on the types of atoms that make up the molecule.

[Figure: C7H5NO3S. Molecular weight (g/mol): 183.2]

F1.1.13 Molecular Descriptors: 2D

A ChemDraw-like structure constitutes a 2-dimensional representation of the molecular structure from which 2D descriptors can be calculated. In addition to the types of atoms, 2D descriptors also incorporate the bonding pattern of the molecule.
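As a small illustration of a 1D descriptor, the molecular weight quoted in F1.1.12 can be recomputed from the chemical formula alone. The following is a minimal sketch, not a general formula parser: the function name `molecular_weight` and the atomic-weight table are assumptions made for this example, and only simple formulas without brackets are handled.

```python
import re

# Rounded standard atomic weights (g/mol); only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C7H5NO3S' (no brackets)."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


mw = molecular_weight("C7H5NO3S")
print(round(mw, 1))
```

The descriptor depends only on which atoms are present and how many, which is exactly what makes it 1D: two isomers with the same formula would receive the same value.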
[Figure: 2D structure with example 2D descriptors. Rotatable bonds: 0; H-bond donors: 1]

F1.1.14 Molecular Descriptors: 3D

3D descriptors, derived from a 3D molecular structure, take the spatial arrangement of the atoms in the molecule into account.

[Figure: 3D structure with an example 3D descriptor. Dipole moment: 2.2 Debye]

F1.1.15 A Multitude of Molecular Descriptors

The number of descriptors that can be derived from a molecular structure is virtually unlimited. Currently available software packages can calculate thousands of descriptors. For example, the DRAGON program calculates 1612 descriptors distributed into 20 categories:

■ constitutional descriptors
■ topological descriptors
■ walk and path counts
■ connectivity indices
■ information indices
■ 2D autocorrelations
■ edge adjacency indices
■ BCUT descriptors
■ topological charge indices
■ eigenvalue-based indices
■ Randic molecular profiles
■ geometrical descriptors
■ RDF descriptors
■ 3D-MoRSE descriptors
■ WHIM descriptors
■ GETAWAY descriptors
■ functional group counts
■ atom-centred fragments
■ charge descriptors
■ molecular properties

F1.1.16 Biologically Relevant Descriptors

When constructing a QSAR model, the key is to use descriptors that are relevant to the specific property of interest. These "biologically relevant descriptors" help generate a model that differentiates between molecules that possess the property of interest and those that do not.

[Figure: Non-Relevant Descriptor vs. Relevant Descriptor]

F1.1.18 Understanding Structure-Activity Relationships

A good model can reveal information about the receptor's binding site. For example, a correlation with electronic descriptors may indicate that the biological activities are due to the chemical reactivity of the compounds, whereas a correlation with hydrophobic descriptors may reveal the existence of a hydrophobic pocket in the receptor.
F1.1.19 Designing Compounds with Improved Activities

Once a QSAR model has been obtained and reproduces the known data satisfactorily, it can be used to predict the biological activity of analogs that have not yet been synthesized. This is of paramount importance in lead optimization and represents one of the most popular uses of the QSAR approach.

[Figure: Compounds not yet synthesized → QSAR model → Prediction of the biological activities]

F1.1.20 Reducing a Virtual Library to a Practical Size

The recent explosion in combinatorial chemistry has added a new dimension to the QSAR approach: predicted activities can be used to reduce a huge virtual library to a manageable size for combinatorial synthesis and high-throughput screening.

[Figure: Virtual Library Generator → Biological Activity Prediction → To Chemical Synthesis]

F1.2 The Foundations of QSAR

The topic The Foundations of QSAR contains 28 pages, including:
■ Birth of QSAR
■ The Foundations of QSAR
■ The Hammett Contribution
■ Dissociation Constants of Substituted Benzoic Acids
■ Dissociation of Substituted Phenylacetic Acids
■ Linear Free Energy Relationship
■ The Hammett Equation
■ The Meaning of ρ
■ The Meaning of σ
■ Examples of σ Constants
■ Predicting the pKa of Benzoic Acid Compounds
■ Hansch Contribution
■ The Importance of Lipophilicity

F1.2.1 Birth of QSAR

QSAR dates back to the 19th century with the work of Cros (1863), who first observed an inverse correlation between the toxicity of alcohols and their water solubility. Other important milestones include work by Crum-Brown and Fraser, who related physiological action to chemical constitution (1868). A few years later Horst, Overton and Richet independently observed that the toxicity of organic compounds depended on their lipophilicity/solubility. This discovery was followed by research by Meyer and Overton, who showed that anesthetic potency correlated well with partition coefficients (1899).
• 1863 Cros: inverse correlation between toxicity and water solubility of alcohols
• 1868 Crum-Brown & Fraser: "physiological action" is a function of "chemical constitution"
• 1890s Horst & Overton: toxicity of organic compounds depends on their lipophilicity
• 1893 Richet: "the more they are soluble, the less they are toxic"
• 1899 Meyer-Overton: partition coefficients correlate with anesthetic potency

F1.2.2 The Foundations of QSAR

During the first half of the 20th century, Louis Hammett laid the foundation for modern QSAR by correlating electronic properties of organic acids and bases with their equilibrium constants and reactivity. An important landmark in the development of QSAR took place in 1964 with the introduction of the Free-Wilson method and Hansch analysis. This section covers these three seminal contributions to QSAR in some detail.

• Louis Hammett
• Free-Wilson
• Corwin Hansch

F1.2.3 The Hammett Contribution

The dissociation of an organic acid HA is a process by which a proton (H+) is removed from the neutral compound, leaving behind a negatively charged species (A−). The extent of the reaction is measured by the dissociation constant K. Louis Hammett observed that the dissociation constants of aromatic acids are influenced by the electronic properties of the substituents on the phenyl ring.

$$\mathrm{HA} \rightleftharpoons \mathrm{H^+} + \mathrm{A^-}, \qquad K = \frac{[\mathrm{H^+}]\,[\mathrm{A^-}]}{[\mathrm{HA}]}$$

F1.2.4 Dissociation Constants of Substituted Benzoic Acids

The dissociation constants of substituted benzoic acids indicate that electron-withdrawing groups increase dissociation while electron-donating groups decrease it.
[Figure: dissociation of benzoic acid and of its p-Et and p-NO2 derivatives]

F1.2.5 Dissociation of Substituted Phenylacetic Acids

A similar effect exists for other equilibria, such as the dissociation of substituted phenylacetic acids.

[Figure: dissociation of phenylacetic acid and of its p-Et and p-NO2 derivatives; electron-withdrawing groups increase dissociation. Dissociation constant (×10^-5): 5.2]

F1.2.6 Linear Free Energy Relationship

Hammett plotted the quantity log(K/K0) for benzoic acids on the X axis, where K and K0 refer to the substituted and unsubstituted compounds, respectively, against the corresponding values measured for the same set of substituents in phenylacetic acids on the Y axis, and obtained a straight line. Because of the association between dissociation constants and free energies ($\Delta G = -RT \ln K$), this phenomenon is known as the linear free energy relationship.

[Figure: table of K and log(K/K0) values for the substituents NO2, Et and H in benzoic acids and phenylacetic acids, with a plot of log(K/K0) for phenylacetic acids (Y axis) against log(K/K0) for benzoic acids (X axis)]

F1.2.7 The Hammett Equation

The straight line described on the previous page can be written as a linear equation, the Hammett equation:

$$\log\frac{K}{K_0} = \rho\,\sigma$$

Note that ρ is related to a given scaffold (e.g. phenylacetic acids) and pertains to a given equilibrium as compared to the benzoic acid equilibrium, whereas σ is a descriptor of a substituent and describes its influence on the dissociation constant. σ is positive for electron-withdrawing substituents and negative for electron-donating substituents.

F1.2.8 The Meaning of ρ

ρ describes the magnitude of the effect a substituent can exert on the dissociation reaction of a given scaffold. As the distance between the substituent and the dissociating proton increases, its influence on the dissociation reaction decreases and so does the value of ρ.

[Figure: Benzoic Acid, Phenylacetic Acid, Phenylpropionic Acid]

F1.2.9 The Meaning of σ

σ describes the effect of substituents on the dissociation reaction.
Substituents on the phenyl ring can increase or decrease the equilibrium constant by stabilizing or destabilizing the anionic form via the formation of a positive or negative partial charge at C1.

[Figure: CH3-substituted benzoic acid with a partial negative charge (δ−) at C1: destabilizes the anionic form, decreases dissociation]

F1.2.10 Examples of σ Constants

Electron-donating substituents have negative σ values, whereas positive σ values correspond to electron-withdrawing groups. Note that σ values differ depending on whether the substituent is meta or para.

F1.2.11 Predicting the pKa of Benzoic Acid Compounds

The Hammett equation is an example of a QSPR equation. It correlates a molecular property, the dissociation constant, with a set of molecular descriptors (σ and ρ). It can be used to predict the pKa of benzoic acid analogs. When a molecule has multiple substituents, the σ values are summed to yield the total value for the compound, as shown in the following example.

$$\mathrm{p}K_a^{\mathrm{calc}} = 4.2 - 1.00\,(0.71 - 0.13 + 0.71) = 2.91$$

F1.2.27 Predictability of the Model

The experimental and calculated values for the antiadrenergic molecules of the training set are indicated below and show that the Free-Wilson model reproduces the biological activities well. Moreover, the equation can be used to predict the biological activities of new analogs that have not yet been synthesized.

F1.3 Design of a QSAR Model

The topic Design of a QSAR Model contains the following 3 pages:
■ Embarking on the Design of a QSAR Model
■ The Four Steps
■ An Iterative Process

F1.3.1 Embarking on the Design of a QSAR Model

The planning of a QSAR model must be carefully managed. In this section we explore the methodology for designing a QSAR model in some detail and present the ideas and statistical concepts behind the QSAR model, the rules that need to be followed and the errors that should be avoided.

[Figure: questions raised when designing a QSAR model: descriptors? parabolic? equations? linear? how many molecules? training set? correlation coefficient? what are the requirements? predictive? trend? normalize descriptors? molecule to synthesize?]

F1.3.2 The Four Steps

To construct a QSAR model the following steps should be followed: (1) assemble a sufficiently large and diverse set of compounds along with their biological activities; (2) select a set of descriptors that is likely to be related to the biological activity of interest; (3) formulate a mathematical equation that reflects the relationship between the biological activity and the chosen descriptors; and finally (4) validate the QSAR model.

• 1. Compounds Selection
• 2. Descriptors Selection
• 3. Building the QSAR model
• 4. Methods for Validating the model

F1.3.3 An Iterative Process

Constructing a QSAR model is an iterative process. First, the QSAR equation is derived from an initial set of descriptors. Attempts are then made to improve this model by adding or removing descriptors and refining the mathematical equation, in an iterative fashion.

[Figure: Compounds selection, Descriptors selection]

F1.4 Compounds Selection: Step 1

The topic Compounds Selection: Step 1 contains the following 5 pages:
■ Compounds Selection
■ Predictions by Interpolation
■ Example of Extrapolative Model
■ Identification of Outliers
■ Biological Activities in Terms of Log 1/C

F1.4.1 Compounds Selection

The selection of the compounds is the first step in building a QSAR model and consists of assembling a sufficiently large and diverse set of compounds with known biological activities. The molecules should be selected with great care in order to define a set of compounds that is homogeneous and represents the system well.

[Figure: Compounds selection → Descriptors selection → Building the QSAR model → Validating the model]

F1.4.2 Predictions by Interpolation

The compounds selected for a QSAR analysis should cover a large range of values for those descriptors believed to be relevant to biological activity.
This increases the probability that future compounds will have descriptors within this range and allows predictions to be interpolative rather than extrapolative. As a rule, interpolative predictions are more accurate than extrapolative predictions.

[Figure: poor compound selection vs. better compound selection in the space of two descriptors, showing the selected compounds, the interpolative zone and the extrapolative zone]

F1.4.3 Example of Extrapolative Model

Extrapolating a model to values outside the range of the training set may lead to incorrect predictions. In the following example the experimental points lie on a straight line; at higher values, however, the behavior is more complex and no longer linear.

F1.4.4 Identification of Outliers

QSAR modeling is based on the assumption of homogeneity and an absence of influential outliers in the training set. An outlier can be a molecule acting through a different mechanism of action, an inconsistent biological activity value reported by another laboratory, or simply an incorrect value (experimental or typographic error). Repeating the measurements of biological activities and using the greatest possible number of molecules help reduce the distortions introduced by outliers.

[Figure: regression with outlier vs. without outlier]

F1.5 Descriptors Selection: Step 2

The topic Descriptors Selection: Step 2 contains 14 pages, including:
■ Descriptors Selection
■ Methods for Selecting Relevant Descriptors
■ Manual Selection of Descriptors
■ Automated Selection of Descriptors
■ Systematic Combination of Descriptors
■ Methods for Selecting a Subset of Descriptors
■ Forward Selection
■ Backward Elimination
■ Stepwise Regression
■ Scaling Descriptors
■ Correlation Between Descriptors
■ Example of Correlated Descriptors
■ Solution to the Problem of Correlated Descriptors

F1.5.1 Descriptors Selection

As mentioned earlier in this chapter, the number of available descriptors for QSAR analyses is very large.
A good model is based on a small number of well-chosen descriptors. When many descriptors are screened, a fortuitous correlation may occur. In the following pages important rules for the selection of relevant descriptors are presented.

[Figure: Compounds selection → Descriptors selection → Building the QSAR model → Validating the model]

F1.5.2 Methods for Selecting Relevant Descriptors

Relevant descriptors can be selected either manually or by using automated approaches. For each method, computer programs are available that help in the selection of relevant descriptors.

F1.5.3 Manual Selection of Descriptors

The manual method is based on a thorough understanding of the SAR and on exploiting the intuitions generated by the analyses. For example, if preliminary analyses indicate that steric or hydrophobic substituents may increase activity, descriptors such as the molar refractivity (MR) and the hydrophobic substituent constant π should be selected first.

F1.5.4 Automated Selection of Descriptors

The second method selects descriptors in an automated manner, using programs that score and rank them. Automated and manual methods can also be combined, to select descriptors that are both relevant and easy to interpret. Modern methods use genetic algorithms based on the principles of natural evolution (Darwin).

F1.5.5 Systematic Combination of Descriptors

In principle the best descriptors can be identified by systematically evaluating all their combinations. For each combination, a QSAR equation can be derived and then ranked. The highest-ranked equation reveals the best subset of descriptors. However, this systematic approach is not always feasible: for n descriptors (current software can process 2000), there are $2^n - 1$ different combinations (subsets). In the following pages we present automated methods that circumvent this difficulty.
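The growth of the number of subsets can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names and the four-descriptor pool are assumptions chosen for the example, not part of the course material.

```python
from itertools import combinations


def count_subsets(n: int) -> int:
    """Number of non-empty subsets of n descriptors: 2**n - 1."""
    return 2**n - 1


def enumerate_subsets(descriptors):
    """Yield every non-empty combination of the given descriptors."""
    for size in range(1, len(descriptors) + 1):
        yield from combinations(descriptors, size)


# A hypothetical pool of four descriptors can still be enumerated exhaustively.
small_pool = ["logP", "MR", "MW", "HOMO"]
all_subsets = list(enumerate_subsets(small_pool))
print(len(all_subsets), count_subsets(len(small_pool)))  # 15 15

# For 2000 descriptors the count is a number with several hundred digits.
n_digits = len(str(count_subsets(2000)))
print(n_digits)
```

Exhaustive enumeration is fine for a handful of descriptors, but the digit count for a pool of 2000 shows why the selection heuristics introduced next are needed.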
F1.5.6 Methods for Selecting a Subset of Descriptors

"Forward selection", "backward elimination" and "stepwise regression" are methods for selecting a subset of descriptors from a large descriptor pool. The process starts with an initial subset of descriptors; successive small alterations of this subset are then made and assessed. If an alteration improves the model, the change is accepted; otherwise it is rejected. The procedure terminates when the model cannot be improved further.

F1.5.7 Forward Selection

The "forward selection" method starts with the single descriptor that best correlates with the dependent parameter. At each subsequent step the method adds the next most contributing descriptor. The process stops when the addition of a descriptor does not improve the model's performance as assessed by appropriate statistical indices.

F1.5.8 Backward Elimination

The "backward elimination" method starts with a model that includes all the descriptors. At each step the method removes a descriptor whose removal does not degrade the model's performance. The process stops when performance starts to decline as assessed by relevant statistical indices.

F1.5.9 Stepwise Regression

The "stepwise regression" method starts (as in forward selection) with the single descriptor that best correlates with the dependent parameter. At each subsequent step the method adds the next most contributing descriptor and can also remove non-contributing descriptors. The process stops when additional descriptors do not improve the model or when removing descriptors causes the model's performance to decline, as assessed by appropriate statistical indices.

F1.5.10 Scaling Descriptors

Descriptors represent a broad range of physico-chemical properties. They need to be calibrated so that their respective influences are well balanced when they are combined.
Scaling consists of a mathematical operation called "normalization", which sets boundaries for the variation of each descriptor.

[Figure: descriptor scaling]

F1.5.11 Correlation Between Descriptors

When two descriptors convey essentially the same information about a series of molecules they are said to be correlated. The use of correlated descriptors in the same equation must be avoided, because the information they carry is over-represented when both are present. A "correlation matrix" provides useful information on the degree of correlation between different pairs of descriptors.

F1.5.12 Example of Correlated Descriptors

Consider, for example, the molecular weight and the number of carbon atoms as two descriptors characterizing a series of alkanes. These two descriptors are highly correlated, which can be shown graphically.

[Figure: plot of MW against # Carbons]

F1.5.13 Solution to the Problem of Correlated Descriptors

When two descriptors are highly correlated, the solution is to remove one of them. The descriptor that carries strong structural information is preferred and the less intuitive one is removed. An alternative solution consists of removing the descriptor that has the highest correlation with the other descriptors.

Before removal (structure images not reproduced):

MW      # Carbons
44.1    3
58.1    4
72.2    5
86.2    6

After removal:

MW
44.1
58.1
72.2
86.2

F1.5.14 The Holy Grail in QSAR

There is a general consensus that in a meaningful QSAR equation, the number of molecules in the training set should exceed the number of descriptors by a factor of 3 to 5.
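Two of the checks discussed on the preceding pages lend themselves to a quick numerical sketch: the correlation between the two alkane descriptors tabulated in F1.5.13, and the molecules-to-descriptors ratio. The code below uses NumPy; the 0.9 correlation cutoff and the helper name `enough_molecules` are assumptions made for illustration, not values prescribed by the course.

```python
import numpy as np

# Alkane descriptors taken from the table in F1.5.13.
mw = np.array([44.1, 58.1, 72.2, 86.2])
n_carbons = np.array([3, 4, 5, 6])

# Correlation matrix: rows are descriptors, columns are molecules.
corr_matrix = np.corrcoef(np.vstack([mw, n_carbons]))
r = corr_matrix[0, 1]

# Hypothetical cutoff: treat |r| above 0.9 as "highly correlated".
highly_correlated = abs(r) > 0.9
print(round(r, 4), highly_correlated)


def enough_molecules(n_molecules: int, n_descriptors: int, factor: int = 3) -> bool:
    """Rule of thumb: at least `factor` molecules per descriptor (factor 3 to 5)."""
    return n_molecules >= factor * n_descriptors


print(enough_molecules(20, 5), enough_molecules(20, 10))
```

Here the correlation coefficient is essentially 1, so keeping both MW and the carbon count would add no information, and 20 molecules would support 5 descriptors but not 10 under the factor-of-3 reading of the rule.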
[Figure: study table with n descriptors (d1, d2, d3, ... dn) and more than 3n molecules with their activities]

F1.6 Deriving the Equation: Step 3

The topic Deriving the Equation: Step 3 contains 24 pages, including:
■ Deriving The QSAR Equation
■ The Starting Point: The Study Table
■ Graphical Analysis of the Data
■ Choice of the Mathematical Equation
■ Complexity Levels and Data Overfitting
■ Mathematics are Very (too) Powerful
■ Illustration with an Example
■ A Simple Model
■ A Complex Model
■ Comparing the Two Models
■ Predictive Power of the Simple Model
■ Predictive Power of the Complex Model
■ Complexity Dictated by Predictability of the Model

F1.6.1 Deriving The QSAR Equation

Step 3 consists of deriving the QSAR equation corresponding to the set of descriptors that were selected in the previous step.

[Figure: Compounds selection → Descriptors selection → Building the QSAR model → Validating the model]

F1.6.2 The Starting Point: The Study Table

The starting point for deriving a QSAR equation is the study table. It consists of a spreadsheet with one molecule per row and one molecular characteristic (biological activity, descriptors) per column. Typically, the first column identifies the molecule (e.g. compound number or name, 2D structure), the second column gives its activity, and subsequent columns give the values of the corresponding descriptors.

            Property of interest    Descriptors
Compound    Activity                LogP     MR       MW       HOMO     Density
1           98                      -4.03    87.10    332.2    -12.0    1.47
2           24                      -3.68    76.53    324.4    -11.5    1.43
3           28                      -4.34    91.23    290.3    -11.2    1.37
4           64                      -5.19    100.2    310.1    -9.2     1.36
5           18                      -5.59    91.32    291.5    -10.2    1.41
n           52                      -4.83    72.12    340.3    -11.3    1.36

F1.6.3 Graphical Analysis of the Data

The study table should lead to graphical analyses. This step is of paramount importance and leaves room for "hunches" and preliminary interpretations. This is where the key questions are asked: is there an order?
Are the points distributed according to known patterns? Can the recognized trends be translated into physico-chemical expressions?

[Figure: plots of activity against descriptor x, descriptor y, descriptor z and descriptor k]

F1.6.4 Choice of the Mathematical Equation

After trends in the system have been identified, the correlation process can begin. The initial analyses help guide the choice of the right mathematical equation. This equation should not be treated as a black box; rather, it should contain information that reflects the behavior of the system and allows it to be interpreted in structural terms. Sound structural information content in a QSAR equation is of utmost importance in step 3.

F1.6.5 Complexity Levels and Data Overfitting

The next hurdle is the mathematical equation. At this stage the complexity of the model depends on both the form of the mathematical equation and the number of descriptors considered.

single linear regression: Activity = a(descriptor1) + b
parabolic model: Activity = a(descriptor1)^2 + b
multiple linear regression: Activity = a(descriptor1) + b(descriptor2) + c(descriptor3) + d ...
other models: parabolic, bilinear, probability, equilibrium, etc.

F1.6.6 Mathematics are Very (too) Powerful

QSAR models can be skewed unintentionally by overly powerful mathematical choices. Fitting the data of a training set exactly can yield an equation that is perfect mathematically but meaningless for molecules other than those in the training set. For example, if the training set consists of 20 molecules, it is always possible to select a set of 20 randomly chosen descriptors and solve the mathematical system of 20 equations and 20 unknowns. This error is known as data overfitting.

20 equations and 20 unknowns (activities: known biological activities; $a_{i,j}$: descriptor values; $d_j$: unknowns):

$$\begin{aligned}
\text{activity of 1} &= a_{1,1}\,d_1 + a_{1,2}\,d_2 + a_{1,3}\,d_3 + \dots + a_{1,20}\,d_{20} \\
\text{activity of 2} &= a_{2,1}\,d_1 + a_{2,2}\,d_2 + a_{2,3}\,d_3 + \dots + a_{2,20}\,d_{20} \\
\text{activity of 3} &= a_{3,1}\,d_1 + a_{3,2}\,d_2 + a_{3,3}\,d_3 + \dots + a_{3,20}\,d_{20} \\
\text{activity of 4} &= a_{4,1}\,d_1 + a_{4,2}\,d_2 + a_{4,3}\,d_3 + \dots + a_{4,20}\,d_{20} \\
&\;\;\vdots
\end{aligned}$$

F1.6.7 Illustration with an Example

To illustrate the data-overfitting problem, let's take a series of compounds for which the permeability through the blood-brain barrier (BBB) has been found to be correlated with their logP and polar surface area. In the following graph we have plotted a hypothetical series of compounds in this space and color-coded them according to their BBB permeability. Compounds colored green are permeable whereas compounds colored red are not.

[Figure: compounds plotted in logP / polar surface area space, colored by BBB permeability]

F1.6.8 A Simple Model

A linear model for differentiating between BBB-permeable and BBB-impermeable compounds can be formulated by drawing a straight line through the logP / polar surface area space. Most of the compounds on the left side of the line are BBB permeable whereas most of the compounds on its right are BBB impermeable. As the model correctly classifies 45 out of the 50 compounds, it has a success rate of 90%.

[Figure: straight-line separation in logP / polar surface area space]

F1.6.9 A Complex Model

A model with an improved success rate can be generated by drawing a curved line across the logP / polar surface area space. This model completely separates the BBB-permeable compounds from the BBB-impermeable compounds and thus has a success rate of 100%.

[Figure: curved-line separation in logP / polar surface area space]

F1.6.10 Comparing the Two Models

Which of the two models better distinguishes BBB-permeable from BBB-impermeable compounds? Clearly the complex model has a higher success rate. However, it achieves this by distorting its shape to classify the outliers correctly, thereby reproducing the full scatter of the training data; it is therefore an overfitted model. The simple model, on the other hand, mislabels the outliers on the assumption that they are indeed outliers.
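The 20-equations-and-20-unknowns situation of F1.6.6 can be reproduced numerically. The sketch below is an illustration under stated assumptions: all data are random numbers generated with NumPy (no real compounds), and the variable names are hypothetical. It shows that random descriptors reproduce the training activities exactly while saying nothing about new molecules.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # 20 molecules and 20 randomly chosen descriptors

# Hypothetical training data: descriptors and activities are unrelated noise.
X_train = rng.normal(size=(n, n))
y_train = rng.normal(size=n)

# 20 equations, 20 unknowns: the system can be solved exactly.
coeffs = np.linalg.solve(X_train, y_train)
train_error = np.max(np.abs(X_train @ coeffs - y_train))

# New hypothetical molecules drawn the same way: the "model" has no real content.
X_new = rng.normal(size=(n, n))
y_new = rng.normal(size=n)
test_rmse = np.sqrt(np.mean((X_new @ coeffs - y_new) ** 2))

print(f"largest training error: {train_error:.2e}")
print(f"RMSE on new molecules:  {test_rmse:.2f}")
```

The training error is at the level of floating-point round-off, which is the "perfect mathematically" fit described above, whereas the error on new molecules is many orders of magnitude larger.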
[Figure: simple and complex models in logP / polar surface area space, with the outliers marked]

F1.6.11 Predictive Power of the Simple Model

The simple model predicts that all test compounds lying to the left of the line are BBB permeable and all those lying to the right of the line are BBB impermeable. Assuming that the test compounds are similar to the training compounds, the predictive power of this model is expected to be high.

[Figure: simple model in logP / polar surface area space. One test compound is predicted to be impermeable and is probably so; another is predicted to be permeable and is probably so]

F1.6.12 Predictive Power of the Complex Model

The complex model also predicts that all test compounds lying to the left of its boundary are BBB permeable and all those lying to the right are BBB impermeable. However, under the same assumption of similarity between test compounds and training compounds, many of its predictions are expected to be erroneous.

[Figure: complex model in logP / polar surface area space. One test compound is predicted to be permeable but is probably impermeable; another is predicted to be impermeable but is probably permeable]

F1.6.13 Complexity Dictated by Predictability of the Model

In the QSAR approach, tailoring an equation to the peculiarities of a training set is not difficult. However, forcing the mathematics to fit the data too closely may lead to models that are meaningless in terms of predictability (tools for assessing the predictability of a QSAR model will be presented in Step 4). The real issue is to stop the refinements early enough that the predictive capabilities of the model are not lost.

[Figure: predictability as a function of complexity level]

F1.6.14 Single Linear Equation: Mathematical Outline

The simplest form of a QSAR equation is a linear model with one descriptor. This yields the equation of a straight line of the form $y = b_0 + b_1 x$, where $b_0$ is the intercept of the line with the y axis and $b_1$ is the slope of the line. $b_0$ and $b_1$ are calculated as described on the next page.
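As an illustration, the standard least-squares expressions for the slope and intercept of such a line can be sketched in plain Python. The function name `fit_line` and the four data points are assumptions made for this example; the points were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical descriptor values and activities lying on y = 1 + 2x.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(descriptor, activity)
print(b0, b1)  # 1.0 2.0
```

With real, noisy activities the same two expressions give the line that minimizes the sum of squared vertical deviations.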
F1.6.15 Calculating b0 and b1

With $\bar{x}$ and $\bar{y}$ the mean values of the descriptor and of the activity over the $n$ compounds:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1\,\bar{x}$$

F1.6.16 Multiple Linear Regression: Mathematical Outline

It is not always possible to correlate biological activities with a single descriptor (linear model with one descriptor). Given that biological action results from the combined influence of many factors, one can extend the QSAR model to multiple descriptors. Indeed, the observation that several parameters used simultaneously can lead to good models prompted the development of a method referred to as "multiple linear regression" (MLR). In this model linearity is maintained for each of the individual descriptors.

[Figure: MLR equation with its coefficients and descriptors labeled]

F1.6.17 Example: MLR vs. Single Linear Models

The example of anticonvulsant compounds shown below demonstrates that none of the descriptors Es, σ and logP alone gives a good correlation with the biological activities (r less than 0.40). However, using logP and σ simultaneously gives a significant improvement (r = 0.80). The addition of Es improves the model even more (r = 0.95). This indicates that the biological properties result from the combined action of lipophilicity, steric and electronic effects.

Model                                                      r
log 1/C = 0.009 Es + 3.411                                 0.03   (bad model)
log 1/C = -0.626 σ + 3.314                                 0.27   (bad model)
log 1/C = -0.078 logP + 3.432                              0.38   (bad model)
log 1/C = -0.210 logP - 2.214 σ + 3.154                    0.80
log 1/C = 0.21 Es - 0.238 logP - 3.81 σ + 3.046            0.95   (good model)

F1.6.18 The Mathematics of MLR: a Single Sample

In MLR we try to express activity as a linear combination of descriptors. We recognize that in most cases our fit to the experimental data will not be perfect and that some error is usually unavoidable. In the equations listed below, $y$ (the activity) is a scalar; $x_j$ is the value of descriptor $j$ and $b_j$ its associated coefficient; $e$ is the error. In the matrix notation, $\mathbf{x}^T$ is a row vector of the descriptors and $\mathbf{b}$ a column vector of their associated coefficients.
$$y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_m x_m + e$$

$$y = \sum_{j=1}^{m} b_j x_j + e$$

Matrix notation:

$$y = x^T b + e$$

F1.6.19 The Mathematics of MLR: Many Molecules

For the case of multiple compounds, the activity values are assembled into a vector y of length n, where n is the number of compounds. The descriptors are collected into an n by m matrix X, where n again is the number of compounds and m is the number of descriptors. The coefficients are collected into a vector b of length m and the errors into another vector e of length n.

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

F1.6.20 The Solution of MLR

In the MLR formalism we search for the (unknown) set of coefficients b which, when multiplied by the (known) descriptors, best approximates the (known) activity data (equation 1). A solution to this problem can be obtained through a matrix inversion procedure (equation 2).

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \qquad (1)$$

$$\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad (2)$$

where
b is the unknown vector of coefficients,
X is the original descriptors matrix,
X^T is the transpose of the original descriptors matrix (a transposed matrix replaces columns with rows and vice versa),
the superscript -1 indicates matrix inversion,
y is the known vector of activities.

F1.6.21 Analysis of the MLR Equation

One of the purposes of QSAR analyses is to understand the forces governing the activity of a particular class of compounds and to assist drug design. In the example shown below, QSAR analyses reveal that the relative importance of the descriptors varies in the following order: logP > σ > Es > MR. The biological activities are therefore governed in the first place by hydrophobicity (logP) and polarity (σ) and to a lesser extent by steric effects (Es and MR).

[Figure: relative importance of the descriptors logP, σ, Es and MR]

F1.6.22 Non-Linear Equations

A non-linear equation is an extension of multiple linear regression. In some systems a linear model is not sufficient to achieve a good correlation. Hansch was the first to introduce a parabolic term, and complex biological processes can often be modeled satisfactorily by non-linear equations.
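A parabolic term keeps the model linear in its coefficients, so it can be fitted with the MLR solution of equation 2. The sketch below illustrates this with the descriptor columns (logP)², logP and a constant. It is only an illustration: the data are hypothetical (generated from a known parabola), the helper names are invented for this example, and the normal equations are solved by Gaussian elimination rather than by forming the inverse explicitly, which yields the same b.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_mlr(X, y):
    """Return b satisfying (X^T X) b = X^T y, i.e. b = (X^T X)^-1 X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical data generated from log 1/C = -0.5 (logP)^2 + 2.0 logP + 1.0
log_p = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
activity = [-0.5 * p * p + 2.0 * p + 1.0 for p in log_p]

# One row per compound; columns are (logP)^2, logP and the constant term
X = [[p * p, p, 1.0] for p in log_p]
a, b, c = fit_mlr(X, activity)

# A negative coefficient a implies an optimum logP at the vertex of the parabola
optimum_log_p = -b / (2 * a)
print(round(a, 3), round(b, 3), round(c, 3), round(optimum_log_p, 3))
```

Because the data were generated without noise, the fit recovers the generating coefficients up to rounding; with real activities the residual vector e would be non-zero.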
F1.6.23 Example of Non-Linear Model

In the example below, the anticonvulsant activities of a set of molecules were initially found to be linearly correlated with logP. However, it is implausible to assume that the biological activity can increase indefinitely with increasing lipophilicity of the molecules. It is known that highly lipophilic compounds cannot reach their site of action, because they are trapped in lipophilic environments. It is therefore more realistic to improve the initial model using a non-linear equation. The modified equation proved to be correct and revealed the existence of an optimum logP value, information that could not be derived from molecules with a small range of logP values.

[Figure: linear and non-linear models plotted against logP for compounds with substituents R1 and R2]

Linear model:

$$\log(1/C) = 0.73\,\log P + 2.5$$

F1.6.24 Typical Non-Linear Equations

There are many reasons why the use of non-linear models is justified, including the kinetics of drug transport, the equilibrium control of drug distribution, allosteric effects, different pharmacokinetics, metabolism and solubility. The following are examples of non-linear models that have proved to be valid at least for special and complex biological systems.
Parabolic Model (Hansch): $\log 1/C = a(\log P)^2 + b \log P + c$
Probability Model (McFarland): $\log 1/C = a \log P - 2a \log(P + 1) + c$
Equilibrium Model (Hyde): $\log 1/C = a \log P - \log(aP + 1) + c$
Bilinear Model (Kubinyi): $\log 1/C = a \log P - b \log(\beta P + 1) + c$

F1.7 Validating the Model: Step 4

The topic Validating the Model: Step 4 contains the following 19 pages:
■ Tools for Assessing the Quality of a Model
■ Predictive and non-Predictive Models
■ The Standard Deviation
■ Correlation Index r²
■ The Mathematics of r²
■ TSS, the Total Variance
■ RSS, the Explained Variance
■ t-test for Single Descriptors and Significance of r²
■ Shape of t-distribution and Number of Molecules
■ Student's t-test Procedure
■ F-test for Assessing the Significance of r²
■ Performing the F-test
■ F-test Procedure
For the entire list, see the navigation panel.

F1.7.1 Tools for Assessing the Quality of a Model

Efficient tools are necessary for assessing the validity of a QSAR model. Numerical analyses or statistical methods provide a variety of indices that serve to evaluate the quality of the model and its limitations. In the following pages we present some of these tools and explain how to use them.

[Figure: Validating the model]

F1.7.2 Predictive and non-Predictive Models

Broadly speaking there are two groups of indices: (1) those that indicate how well the QSAR equation can "reproduce" the experimental data and (2) those that tell how far the model can be extrapolated to new molecules.

F1.7.3 The Standard Deviation

The easiest way to "validate" a QSAR model is to calculate the standard error or standard deviation (SD or s) of the residuals, that is, of the differences between the measured activities and the activities calculated by the model. It is obtained as the square root of the sum of squared residuals divided by the number of degrees of freedom. This index reflects how far the data deviate from the model: the smaller the SD, the better the quality of the model.
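As a minimal sketch of this calculation, the snippet below computes s from a list of residuals. The divisor N - k - 1 (N molecules, k descriptors) is an assumption, since the course equation itself is not reproduced here; it does reproduce the value s = 0.28 quoted for the Capsaicin example (F1.8.10) when applied to the residuals listed in F1.8.7. The function name is invented for this illustration.

```python
import math


def standard_error(residuals, n_descriptors):
    """Return s = sqrt(RSS / (N - k - 1)) for N residuals and k descriptors."""
    rss = sum(r * r for r in residuals)
    # Assumed degrees of freedom: one per molecule, minus k slopes and one intercept
    degrees_of_freedom = len(residuals) - n_descriptors - 1
    return math.sqrt(rss / degrees_of_freedom)


# Residuals (measured minus calculated log EC50) as listed in F1.8.7
capsaicin_residuals = [0.28, -0.12, -0.36, 0.16, 0.19, -0.01, -0.34]

s = standard_error(capsaicin_residuals, n_descriptors=1)
print(round(s, 2))  # 0.28
```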
F1.7.4 Correlation Index r²

The most frequently used index for evaluating the performance of a QSAR model is r² (the squared correlation coefficient). r² measures the degree of correlation between the activity values calculated by the model and those measured experimentally. The value of r² ranges from 0 (no correlation) to 1 (perfect correlation).

[Figure: measured vs. calculated activity for r² = 1 (perfect correlation), r² = 0.5 and r² = 0; x-axis: Calculated Activity]

F1.7.5 The Mathematics of r²

Mathematically, r² is calculated by dividing the variance explained by the model (the "explained sum of squares", ESS) by the original variance (the "total sum of squares", TSS). ESS is equal to the total variance (TSS) minus that portion of the variance which was not explained by the model (the residual sum of squares, RSS).

$$r^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

F1.7.6 TSS, the Total Variance

TSS is the sum of the squared deviations of the measured activities from their mean:

$$TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2$$

F1.7.7 RSS, the Explained Variance

In order to obtain ESS, the variance explained by the QSAR model, we start from the fact that the total variance (TSS) is the sum of the explained (ESS) and unexplained (RSS) variances. Thus, the explained variance is the difference between the total variance and the unexplained variance. The portion of the variance which is left unexplained by the QSAR model (RSS) is obtained from the differences between the measured activities and the predicted activities (as given by the regression line):

$$RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2$$

F1.7.8 t-test for Single Descriptors and Significance of r²

r² alone is not sufficient to determine whether the relationship has occurred by chance; its significance can be calculated using the t-statistic for single descriptors as follows. We repeat the process of deriving a QSAR equation and calculating the resulting r² value many times, each time using a different descriptor. If the number of molecules is large (> 30), the sampling distribution of the resulting r² values will have a normal (i.e., Gaussian) shape.
If the number of molecules is small, it will have a shape known as a t-distribution.

[Figure: normal (Gaussian) distribution and t-distribution. Values on the x-axis represent standard deviations from the mean, located at x = 0.]

F1.7.9 Shape of t-distribution and Number of Molecules

A value of r² = 1 will always be obtained for a set of two molecules, irrespective of the descriptor used for the QSAR analysis. However, as the number of molecules increases, the probability of obtaining large r² values with irrelevant descriptors decreases. This probability corresponds to the area under the t-distribution curve (see below), away from the center (where r² = 0). The shape of the t-distribution therefore depends on the number of molecules used in the analysis.

[Figure: t-distribution curve]

F1.7.10 Student's t-test Procedure

The Student t-test employs the t-distribution to test whether the correlation coefficient obtained from the QSAR analysis is significantly different from 0. The larger the t-value, the larger the probability that r² significantly differs from 0; that is, the larger the probability that the descriptor used for the analysis is relevant to the activity. Technically, the steps involved in the Student t-test are as follows.

1. Calculate t according to the equation

$$t = r\sqrt{\frac{N-2}{1-r^2}}$$

2. Select a significance level (e.g., 0.05).
3. Look up the t value from a t-distribution derived for the correct number of data points (N) at the selected significance level.
4. If the calculated t-value is larger than the listed t-value, then the regression equation is significant at this significance level.

F1.7.11 F-test for Assessing the Significance of r²

The F-test is an extension of the t-test to the case of many descriptors. Like the t-test, it tests (and hopefully rejects) the assumption that the model did not explain any of the original variance in the data set (i.e., ESS = 0).
The F-test uses an F-distribution which, like the t-distribution, depends on the number of compounds and, in addition, on the number of descriptors.

[Figure: F-distributions for 10 molecules with 4 descriptors and for 100 molecules with 10 descriptors]

F1.7.12 Performing the F-test

The F-test employs the F-distribution to test whether the correlation coefficient obtained from the MLR analysis significantly differs from 0. The larger the F-value, the larger the probability that r² significantly differs from 0, i.e. the greater the probability that the descriptors used for the analysis are relevant to the activity. Technically, the steps involved in the F-test are as follows.

$$r^2 = \frac{ESS}{TSS} \quad \text{and} \quad r^2 = 1 - \frac{RSS}{TSS}$$

$$\frac{ESS}{RSS} = \frac{ESS/TSS}{RSS/TSS} = \frac{r^2}{1 - r^2}$$

Calculate F according to this equation:

$$F = \frac{ESS}{RSS} \cdot \frac{N-k-1}{k} = \frac{r^2 (N-k-1)}{k (1 - r^2)}$$

N = number of molecules, k = number of descriptors.

F1.7.13 F-test Procedure

The application of the steps involved in evaluating the significance of r² for the Capsaicin analogs using the F-test proceeds as follows:

1. Calculate F with r² = 0.92, N = 8, k = 3:

$$F = \frac{r^2 (N-k-1)}{k (1 - r^2)} = \frac{0.92\,(8-3-1)}{3\,(1-0.92)} = 15.33$$

2. Select a significance level (p): p = 0.01.
3. Look up the F value from an F-distribution with N = 8, k = 3, p = 0.01: F(tab) = 7.59.
4. The calculated F value (15.33) is larger than the tabulated F value (7.59). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.

F1.7.14 Assessing the Predictive Power of a Model

r², t and F are indices that can be generated to evaluate QSAR results. However, these parameters basically only tell us about the ability of the QSAR model to reproduce the data from which it was derived, not about its aptitude to predict the activities of new compounds. Two methods are presented in the following pages to estimate the predictive power of a QSAR model.
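The t and F indices mentioned above both follow directly from r², N and k. The short sketch below evaluates them with the numbers quoted in this course: the F-test procedure values (r² = 0.92, N = 8, k = 3) and the single-descriptor Capsaicin values (r² = 0.89, N = 7). The function names are invented for this illustration, and the t function returns the positive root only, since the sign of r is not carried by r².

```python
import math


def t_statistic(r2, n_molecules):
    """t = r * sqrt((N - 2) / (1 - r^2)) for a single-descriptor model."""
    return math.sqrt(r2) * math.sqrt((n_molecules - 2) / (1.0 - r2))


def f_statistic(r2, n_molecules, n_descriptors):
    """F = r^2 (N - k - 1) / (k (1 - r^2)) for a model with k descriptors."""
    return r2 * (n_molecules - n_descriptors - 1) / (n_descriptors * (1.0 - r2))


print(round(f_statistic(0.92, 8, 3), 2))  # 15.33
print(round(t_statistic(0.89, 7), 2))     # 6.36
print(round(f_statistic(0.89, 7, 1), 2))  # 40.45
```

For a single descriptor (k = 1) the two statistics carry the same information, since F equals t squared.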
F1.7.15 The Test Set Method

The first method is known as the "test set method" and consists of partitioning the initial data into two sets, a preferred strategy when a large set of compounds is available. The initial data set is randomly divided into two parts; the first is used to build a QSAR model and the second to validate this model.

[Figure: the data set split into a training set and a test set]

F1.7.16 The Cross Validation Method

The second method is known as the "cross validation method"; it is preferred when the data set is too small to set aside a separate test set. In this method the data are randomly divided into N equal parts. N-1 parts are used to build the model, which is then used to predict the activities of the molecules in the remaining Nth part. The procedure is repeated until the activities of all compounds have been predicted independently.

[Figure: five stages; in each stage a different fifth of the data serves as the test set and the remaining four fifths as the training set]

F1.7.17 Limits of the Cross Validation Method

With the cross validation method, the QSAR model that is ultimately used to predict the activities of new compounds is derived from all N data points and is therefore different from the N partial QSAR models (i.e. those derived from N-1 data points). Therefore cross validation does not provide us with the predictive power of a specific QSAR equation but rather with an estimate of our ability to make predictions for compounds similar to those used in our QSAR analysis.
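The cross validation procedure can be sketched in a few lines for a one-descriptor linear model. Everything in this example is illustrative: the descriptor and activity values are hypothetical, the function names are invented, and the random split uses a fixed seed only to make the run repeatable. The predictions are summarised with PRESS and Q² as defined in F1.7.18 (Q² = 1 - PRESS/TSS).

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 of y = b0 + b1 x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def cross_validate(xs, ys, n_parts, seed=0):
    """Predict every compound from a model built without its own part."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::n_parts] for i in range(n_parts)]
    predictions = [None] * len(xs)
    for held_out in parts:
        train = [i for i in indices if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            predictions[i] = b0 + b1 * xs[i]
    return predictions


def q_squared(ys, predictions):
    """Q^2 = 1 - PRESS / TSS, using the cross-validated predictions."""
    y_mean = sum(ys) / len(ys)
    press = sum((p - y) ** 2 for p, y in zip(predictions, ys))
    tss = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - press / tss


# Hypothetical descriptor values and activities (roughly linear, with small scatter)
descriptor = [0.0, 0.3, 0.5, 0.8, 1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
activity = [1.04, 0.73, 0.65, 0.34, 0.21, -0.01, -0.17, -0.37, -0.58, -0.80]

predicted = cross_validate(descriptor, activity, n_parts=5)
q2 = q_squared(activity, predicted)
print(round(q2, 3))
```

Setting n_parts equal to the number of compounds gives the leave-one-out variant, in which each partial model is built from all compounds but one.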
[Figure for F1.7.17: compounds used in the original QSAR analysis plotted in descriptor space (axis: Descriptor 1), with the region where prediction estimates made by cross validation apply and the region where they do not apply]

F1.7.18 The Predictive Index Q²

The predictive power of the model, termed Q², is computed by analogy with r², the difference being the use of PRESS (predictive residual sum of squares) rather than RSS (residual sum of squares) in the numerator. PRESS is calculated as the sum of the squared differences between the measured activity and the predicted activity for the test set compounds.

$$r^2 = 1 - \frac{RSS}{TSS} \qquad Q^2 = 1 - \frac{PRESS}{TSS}$$

$$RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2 \qquad PRESS = \sum_{i=1}^{N} (y_{pred,i} - y_i)^2$$

F1.7.19 Summary

When discussing the mathematical tools available for assessing the quality of a QSAR model we saw that (1) the standard deviation is an isolated "absolute" index of local meaning; (2) with r² it is possible to compare different models, but this index is only mathematical, not statistical; (3) t and F have a statistical content that can be used for single and multiple linear regression respectively; however, they only measure the ability of the QSAR model to reproduce the data from which it was constructed.

[Figure: annotated example equation]

log(…) = 1.14 log P + 0.16
n = 25; r² = 0.91; s = 0.155; F = 66.4; Q² = 0.875

n: number of molecules
r²: correlation coefficient for assessing the quality of the model
s: standard deviation
F: F-value for assessing the statistical significance
Q²: regression coefficient for measuring the predictability

F1.8 Example of Simple Linear Regression

The topic Example of Simple Linear Regression contains the following 11 pages:
■ Example of Capsaicin Analogs
■ Relevant Descriptors of Capsaicin Analogs
■ The Capsaicin Study Table
■ Graphical Analysis of Capsaicin Analogs
■ Deriving a QSAR Linear Equation
■ Experimental vs. Calculated Values
■ Calculating r² for the Capsaicin analogs
■ t-test for the Capsaicin Analogs
■ F-test for a Series of the Capsaicin Analogs
■ The QSAR Equation for the Capsaicin Analogs
■ Predicting the Activities of Unknown Compounds

F1.8.1 Example of Capsaicin Analogs

Capsaicin analogs were studied for their analgesic properties, and we will use this study to illustrate the derivation of a simple QSAR model. The biological activities (EC50) were measured for some analogs as indicated below. The question is whether, on the basis of these data, it is possible to develop a QSAR model and predict the biological activities of new compounds.

F1.8.2 Relevant Descriptors of Capsaicin Analogs

The selection of descriptors that correlate with the target biological activity is mandatory for the derivation of a meaningful QSAR model. For Capsaicin analogs, biological activity appears to be influenced by the lipophilicity of the substituent R. Following this assumption, the descriptors deemed most suitable are the molar refractivity (MR) and the hydrophobic substituent constant π.

[Figure: lipophilicity descriptors. π: encodes the lipophilic behavior; MR: contains information on the volume]

F1.8.3 The Capsaicin Study Table

The following table summarizes the MR and π values which were calculated for the seven Capsaicin analogs. As discussed above, activities (EC50) are expressed as their log values.

Compound   log EC50      π       MR
1            1.07       0       1.03
2            0.09       0.71    6.03
3            0.66      -0.28    7.36
4            1.42      -0.57    6.33
5           -0.62       1.96   25.36
6            0.64       0.18   15.55
7           -0.46       1.12   13.94

F1.8.4 Graphical Analysis of Capsaicin Analogs

For Capsaicin analogs, if we plot the activity values from the study table against MR and π, respectively, there seems to be only a weak correlation between the biological activity and the molar refractivity (MR). However, the hydrophobic substituent constant π shows a possible linear correlation.
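The graphical impression can be checked numerically by computing the squared correlation coefficient of log EC50 with each descriptor. The sketch below uses the seven rows of the study table in F1.8.3; the function and variable names are invented for this illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation coefficient between two value lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Values from the Capsaicin study table (F1.8.3), compounds 1 to 7
log_ec50 = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]
pi_values = [0.0, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12]
mr_values = [1.03, 6.03, 7.36, 6.33, 25.36, 15.55, 13.94]

r2_pi = r_squared(pi_values, log_ec50)
r2_mr = r_squared(mr_values, log_ec50)
print(f"r2 with pi: {r2_pi:.2f}")
print(f"r2 with MR: {r2_mr:.2f}")
```

The value obtained for π rounds to the r² = 0.89 reported later in the example, while MR gives a clearly lower value, in line with the weak correlation seen in the plot.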
F1.8.5 Deriving a QSAR Linear Equation

The correlation between π and the biological activities is represented by the equation $y = b_0 + b_1 x$, where $b_0$ is the intercept of the line with the y axis and $b_1$ the slope of the line. We show below how to calculate their numerical values.

$$b_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

F1.8.6 Experimental vs. Calculated Values

There is a difference between the experimental and the calculated values, as shown below.

F1.8.7 Calculating r² for the Capsaicin analogs

For Capsaicin analogs, r² is calculated as follows.

$$\bar{y} = \frac{1.07 + 0.09 + 0.66 + 1.42 + (-0.62) + 0.64 + (-0.46)}{7} = 0.4$$

$$TSS = (1.07-0.4)^2 + (0.09-0.4)^2 + (0.66-0.4)^2 + (1.42-0.4)^2 + (-0.62-0.4)^2 + (0.64-0.4)^2 + (-0.46-0.4)^2 = 3.49$$

$$RSS = (0.28)^2 + (-0.12)^2 + (-0.36)^2 + (0.16)^2 + (0.19)^2 + (-0.01)^2 + (-0.34)^2 = 0.40$$

$$r^2 = \frac{3.49 - 0.40}{3.49} = \frac{3.09}{3.49} = 0.89$$

F1.8.8 t-test for the Capsaicin Analogs

The steps involved in evaluating the significance of r² are as follows.

1. Calculate t with r² = 0.89, N = 7:

$$t = r\sqrt{\frac{N-2}{1-r^2}} = \sqrt{0.89}\,\sqrt{\frac{7-2}{1-0.89}} = 6.3604$$

2. Select a significance level (p): p = 0.01.
3. Look up the t value from a t-distribution with N = 7, p = 0.01: t(tab) = 2.998.
4. The calculated t value (6.3604) is larger than the tabulated t value (2.998). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.

F1.8.9 F-test for a Series of the Capsaicin Analogs

The steps involved in evaluating the significance of r² using the F-test proceed as indicated below. The F-test indicates that a significant correlation is obtained and that the probability of a chance correlation is less than 1%.

1. Calculate F with r² = 0.89, N = 7, k = 1:

$$F = \frac{r^2 (N-k-1)}{k (1 - r^2)} = \frac{0.89\,(7-1-1)}{1\,(1-0.89)} = 40.45$$

2. Select a significance level (p): p = 0.01.
3. Look up the F value from an F-distribution with N = 7, k = 1, p = 0.01: F(tab) = 12.25.
4. The calculated F value (40.45) is larger than the tabulated F value (12.25). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.

F1.8.10 The QSAR Equation for the Capsaicin Analogs

QSAR studies reveal the importance of lipophilicity in the analgesic properties of a series of Capsaicin analogs, as indicated by the good correlation found with the π descriptor. The correlation coefficient r² is 0.89, and analyses of the significance of the equation (t-test and F-test) show that there is less than a 1% probability that the relationship is due to chance. This validates the use of π as a descriptor for the structure-activity relationships.

r² = 0.89; s = 0.28; t = 6.36; F = 40.45

F1.8.11 Predicting the Activities of Unknown Compounds

The derived QSAR model can be used to predict the biological activities of novel Capsaicin analogs by introducing their corresponding π values in the QSAR equation. For example, the biological activity of the amide analog indicated below is predicted with an EC50 of 0.98 µM.
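The prediction step can be sketched as follows. The coefficients are obtained here by applying the least-squares formulas of F1.8.5 to the study table of F1.8.3, so they may differ in rounding from the equation shown in the course. The π value used for the new compound (0.5) is purely hypothetical and is not the value of the amide analog; the micromolar unit is assumed from the EC50 quoted above.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 of y = b0 + b1 x (F1.8.5)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def predict_ec50(pi_value, b0, b1):
    """Return EC50 predicted from log EC50 = b0 + b1 * pi."""
    return 10.0 ** (b0 + b1 * pi_value)


# Values from the Capsaicin study table (F1.8.3), compounds 1 to 7
pi_values = [0.0, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12]
log_ec50 = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]

b0, b1 = fit_line(pi_values, log_ec50)
print(f"log EC50 = {b0:.3f} + ({b1:.3f}) * pi")

# Hypothetical substituent with pi = 0.5 (illustration only)
hypothetical_pi = 0.5
print(f"predicted EC50: {predict_ec50(hypothetical_pi, b0, b1):.2f} (assumed unit: micromolar)")
```

The negative slope reflects the trend in the table: more lipophilic substituents (larger π) give lower EC50 values, which applies only to analogs similar to those used to derive the model.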