Protein solubility: sequence based prediction and experimental verification Pawel Smialowski1 , Antonio J. Martin-Galiano1 , Aleksandra Mikolajka2 , Tobias Girschick1 , Tad A. Holak2 and Dmitrij Frishman1, * 1 Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany 2 Max Planck Institute for Biochemistry, 82152 Martinsried, Germany ABSTRACT Motivation: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics pro- jects. Results: We present a machine-learning approach called PROSO to assess the chance of a protein to be soluble upon heterologous expression in E. coli based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database as well as previously published datasets were used as the source of training data. In comparison with previously published methods our classification algorithm possesses improved discriminatory capacity characterized by the Matthews Correlation Coefficient of 0.434 between predicted and known solubility states and the overall prediction accuracy of 72% (75% and 68% for positive and negative class respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins. Availability: A Web server for protein solubility prediction is available at http://webclu.bio.wzw.tum.de:8080/proso. Contact: d.frishman@wzw.tum.de Supplementary information: Supplementary data are available at Bioinformatics online 1 INTRODUCTION Protein solubility is an important pre-requisite for structural and biophysical studies. Obtaining soluble proteins in sufficiently high concentrations remains a major experimental challenge. Many heterologously expressed proteins are insoluble and often solubilization is a trial and error process with relatively low success rate (Chow et al., 2006; Singh and Panda, 2005). Although highly efficient overexpression into inclusion bodies is sometimes desirable as it results in relatively clean protein, refolding procedures pose a significant technical challenge (Armstrong et al., 1999; Chow et * To whom correspondence should be addressed. al., 2006) and frequently lead to the loss of biological activity (Singh and Panda, 2005). Common strategies to improve protein solubility during heterologous expression include weak promoters, low-temperature (Makrides, 1996), modified growth media (Georgiou and Valax, 1996; Makrides, 1996), co-expression with molecular chaperons (Tresaugues et al., 2004), and fusion with solubility enhancing tags (Davis et al., 1999; Kapust and Waugh, 1999). If some knowledge of three-dimensional structure is at hand, it is possible to enhance solubility by structure-guided mutagenesis (Dale et al., 1994). When a priori knowledge of structure is not available protein solubility can be increased by directed evolution method (Pedelacq et al., 2002; Waldo, 2003) that relies on high-throughput screening of large protein diversity libraries generated by random mutagenesis for more soluble variants. An alternative way to increase the overall success rate of biophysical studies relying on protein solubility is to avoid potentially difficult targets altogether and focus experimental work on those proteins that offer better chances to be soluble. This approach is being frequently practiced by large-scale structural proteomics consortia that initially target more proteins than they can address and then select the most promising candidates (“low hanging fruits”) according to certain criteria (Edwards et al., 2000). Despite the fact that the experimental determination of solubility is relatively accessible, the ability to predict from protein sequence its potential for solubility when overexpressed in standard host cells, e.g Escherichia coli, would be extremely beneficial for rational target selection in structural proteomics as well as for a variety of biophysical studies. Under a given set of experimental conditions, including the expression host, protein solubility is an individual trait ultimately determined by its primary structure. A number of previous reports addressed the interconnection between protein solubility and various sequence-derived features. A simple method for calculating protein solubility from sequence was first proposed by Wilkinson et al. (1991) and then improved by Davis et al. (1999). Their solubility model is based on two parameters: average charge, determined by the relative numbers of Asp, Glu, Lys, and Arg residues, and the content of turn-forming residues (Asn, Gly, Pro, and Ser). Christendat and coworkers (2000) examined experimental success and failure data accumulated in a high-throughput structural genomics project on 424 nonmembrane proteins from Methanobacterium thermoautotrophicum. They demonstrated that insoluble proteins tended to have more hydro© The Author (2006). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org Associate Editor: Thomas Lengauer Bioinformatics Advance Access published December 6, 2006 P.Smialowski et al. 2 phobic stretches (longer then 20 amino acids), lower glutamine content (Q<4%), fewer negatively charged residues (DE<17%), and higher percentage of aromatic amino acids (FYW >7.5%) than soluble ones. A set of simple rules was derived based on these observations allowing prediction of protein solubility with 65% accuracy. A year later Bertone et al. (2001) reanalyzed 562 proteins from the same organism and confirmed that high content of negative residues (DE>18%) and absence of hydrophobic patches are associated with improved solubility. Additionally they found that low percentage of aspartic acid, glutamic acid, asparagines and glutamine residues (DENQ<16%) increases the probability of a protein to be insoluble. Goh and coworkers (2004) applied random forest and decision tree analysis to various attributes of more then 27000 proteins from multiple organisms and found that protein solubility is influenced by (in decreasing order of importance): percentage of serine (S<6.4%), fraction of negatively charged residues (DE<10.8%), percentage of S,C,T, and M amino acids, and length (<516 aa). However, in a study describing high-throughput overexpression in E. coli of 10167 proteins of Caenorhabditis elegans, no statistically significant correlation between protein’s pI, molecular weight, presence of rare codons, overall sequence hydrophobicity and protein solubility was observed (Luan et al., 2004). The authors indicate that proteins homologous to those with known structures have higher chances to be soluble (Luan et al., 2004). Finally, in a recent publication Idicula-Thomas et al. (2006) presented a solubility prediction method based on a Support Vector Machine trained on unbalanced datasets of 62 soluble and 130 insoluble proteins. According to this publication the method was able to predict correctly the increase/decrease in solubility upon mutations. This method is discussed and evaluated in detail below. Here we present a novel prediction technique dubbed PROSO (PROtein SOlubility predictor) to assess the chance of a protein to be soluble upon heterologous expression in E. coli. The method employs support vector machine and Naive Bayes classifiers trained on experimental progress data stored in the TargetDB (Chen et al., 2004) and PDB (Berman et al., 2000) databases as well as on an additional dataset extracted from literature. In comparison with previously published methods our classification algorithm possesses improved discriminatory capacity characterized by the Matthews Correlation Coefficient of 0.434 between predicted and known solubility states and the overall prediction accuracy of 72% (75% and 68% for positive and negative class respectively). We analyze the importance of diverse solubility determinants identified in our work as well as those previously reported. Finally, we provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different pro- teins. 2 METHODS 2.1 Datasets of soluble and insoluble proteins The TargetDB database (Chen et al., 2004) http://targetdb.pdb.org/) stores amino acid sequences and experimental progress information of structural targets pursued by structural genomics consortia. For each protein, TargetDB lists its current experimental status, such as Selected, Cloned, Expressed, Purified, Soluble, Crystallized, and so forth. Thus, all proteins that achieved the status Soluble or any subsequent status may be confidently considered soluble. Comparing the experimental status of sequences stored in TargetDB at two sufficiently distant time points - April 2005 and November 2005 - we divided all proteins into TargetDB-Soluble (annotated with the Soluble or any more advanced descriptor) and TargetDB-Insoluble (annotated as Expressed but not as Soluble in April 2005 and still remaining in that state seven months later). We did not consider work-stopped targets (those which were expressed but then either explicitly stated as “Work-stopped” or those that disappeared from TargetDB) since the reason for aborting work on a given target can be unrelated to its experimental behavior. Furthermore, to remove targets dropped as a result of competitors having structure submitted to PDB all proteins with 100% identity to any PDB protein were removed from the insoluble dataset. Considering all proteins with known three-dimensional structure soluble by definition, another dataset of soluble proteins (PDB-Soluble) was built from the PDB database (release 1.XII.2005). We selected PDB entries annotated with the descriptors “EXPRESSION_SYSTEM: ESCHERICHIA COLI” and “EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID”. We also used a further dataset of soluble (Lit-Soluble) and insoluble (LitInsoluble) proteins from E. coli described by Idicula-Thomas and coworkers (Idicula-Thomas and Balaji, 2005; Idicula-Thomas et al., 2006). Each of the described datasets was refined by removing proteins with one or more transmembrane segments as predicted using TMHMM (Krogh et al., 2001), and those sequences containing more than one contiguous “X” characters. The number of instances and other details on each dataset used in this work can be found in Suppl. Table 1. Finally we merged separately data for soluble and insoluble proteins and removed redundancy by homology clustering at the 50% sequence identity level using the CD-HIT program (Li et al., 2001; Li et al., 2002). 2.2 Multiple/mono domain split and length distribution adjustment Large multidomain proteins generally represent more challenging structural targets then smaller single-domain proteins. In particular, the relationship between amino acid sequence and solubility may be significantly different between single- and multidomain proteins since the latter involve sizeable sequence portions corresponding to inter-domain contacts. Multi-domain proteins are believed to fold via a hierarchical multi-step organization of individual co-translationally folded domains (Frydman, 2001; Seckler and Jaenicke, 1992). By contrast, it is common for monodomain proteins to fold in a fast two-state process (Onuchic and Wolynes, 2004). In order to take into account these differences in the nature of folding/misfolding we split our datasets into the subsets of long multiple domain and short monodomain proteins. To find a reasonable length threshold for this split we analyzed size distribution of mono and multiple-domain proteins using domain sequences from the CATH database (Pearl et al., 2005) clustered at 50% identity level (Suppl. Figure 1). Obviously, no perfect decision rule exists, but we found that threshold length in the range between 250 and 300 residues provides a good separation for dividing our data into two sections - one enriched in monodomain, and the other enriched in multidomain proteins; the threshold of 250 amino acid residues was finally adopted. Solubility classification was performed independently for each dataset. Since sequence length distributions are somewhat different for insoluble and soluble datasets (data not shown), the composition of sequence datasets was adjusted to account for this effect using a simple approach. Proteins were divided into bins (5 bins for mono and 9 for multiple domain proteins) according to their length. The width of the bins was 50 and 200 for proteins with length below and above 250, respectively. For each bin, the number of sequences from the least populated dataset was used as a limit of sequence number for both datasets, thus assuring equal population of sequence length ranges and removing any length-related bias. Characterization of the final datasets used for classification can be found in Suppl. Table 2 In summary, as a result of restrictive data selection from TargetDB and PDB databases we built a sufficiently large (more then 14000 proteins) yet reasonably reliable input dataset. However, it is important to realize that the majority of sequences in our final dataset stem from TargetDB which con- Protein solubility: sequence based prediction and experimental verification 3 tains only brief descriptions of experimental results and does not impose standard requirements for annotation. Consequently, in contrast to the PDB-derived data, TargetDB entries lack detailed information about the expression system used and the setup of expression experiment. Nevertheless E.coli is predominantly being used as expression host by structural genomics consortia: out of 3171 PDB entries which originate from TargetDB (as of September 2006) 2381, or 75.1%, had annotation “EXPRESSION_SYSTEM: ESCHERICHIA COLI”. 2.3 Protein sequence features Amino acid sequences were represented by the frequencies of mono-, di-, and tri-peptides. However, for di- and tri-peptides in the original 20-letter amino acid alphabet the total number of features would be very large – 400 and 8000, respectively. In order to reduce the dimensionality of the feature space and improve signal to noise ratio we clustered amino acids into groups with similar physico-chemical or structural properties as described in our previous work (Smialowski et al., 2006). For the original amino acid alphabet we calculated the frequencies of words of length one and two while for amino acid groups, words of length one, two, and three were used. To derive amino acid clustering pertinent to protein solubility a set of eight numeric scales related to protein solubility was obtained from the Amino Acid Index Database (Kawashima and Kanehisa, 2000): CHAM820102 (free energy of solution in water, kcal/mole) (Charton and Charton, 1982), PONP800101 (surrounding hydrophobicity in folded form) (Ponnuswamy et al., 1980), NOZY710101 (transfer energy, organic solvent/water) (Nozaki and Tanford, 1971), KRIW790102 (fraction of site occupied by water) (Krigbaum and Komoriya, 1979a; Krigbaum and Komoriya, 1979b), ZHOH040103 (buriability) (Zhou and Zhou, 2004), BIOV880101 (information value for accessibility; average fraction 35%) (Biou et al., 1988), ROSM880102 (side chain hydropathy, corrected for solvation) (Roseman, 1988), and JANJ780101 (average accessible surface area) (Janin and Wodak, 1978). Values from all scales were normalized and used together to cluster amino acids by Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Meila and Heckerman, 2001), into two alternative groupings, one wih 14 and another one with 17 sets. These cluster numbers have been found optimal in our previous work on protein crystallizability (Smialowski et al., 2005). Amino acids clustering rules are listed in the Suppl. Table 3. 2.4 Classification and feature selection We classified data using the two level procedure described in detail in our previous publication (Smialowski et al., 2006). Briefly, each set of input data for a fixed word size and a given amino acid grouping scheme was first classified using a primary classifier – a support vector machine (SVM) with the Gaussian kernel (Keerthi et al., 2001; Platt, 1999) as described by Fan et al. (2005). The output of the primary classifier for each protein was obtained by 10-fold cross-validation and served as input for a secondary Naive Bayes classifier (Domingos and Pazzani, 1996). Ten-fold stratified cross-validation over input data was performed to obtain class assignment for each protein and to estimate the accuracy of the second level classifier. For feature selection we used the wrapper method (Kohavi and John, 1997) with the Naive Bayes method (Domingos and Pazzani, 1996) as a classification procedure and the ‘Best first’ approach (Kohavi and John, 1997) as a search algorithm. The detailed procedure can be found in Smialowski et al. (2006). Additionally feature ranking was performed by measuring symmetrical uncertainty of attributes with respect to a given class (Hall and Holmes, 2003). While selecting features the grouping schema which performed best for a given word size was utilized. 2.5 Classifier evaluation In order to quantify the performance of the classifiers the following measures were calculated: • accuracy: the number of correctly classified instances divided by the total number of instances • true positive rate (TP rate) (also called sensitivity or recall of positive class): the number of correctly classified instances from the positive class divided by the number of all instances from the positive class (TP+FN) • true negative rate (TN rate) (also called sensitivity or recall of negative class): the number of correctly classified instances from the negative class divided by the number of all instances from the negative class (TN+FP) • precision (selectivity): the ratio of the number of correctly classified positive (TP) or negative (TN) instances to the number of all instances classified as positive (TP+FP) or negative (TN+FN), for positive and negative class respectively • gain: proportion of given class precision (selectivity) to the ratio of the given class in full dataset • Matthews Correlation Coefficient (MCC), calculated as: ( ) ( ) ( )( )( )( )FN+TNFP+TNFP+TPFN+TP FNFPTNTP =MCC ∗−∗ where TP, FP, and FN are the numbers of true positive, false positive and false negative instances, respectively. In particular, gain is an important performance measure that quantifies how much better is the decision when guided by the classifier in comparison to random drawing of instances. MCC indicates correlation between the classifier assignments and the actual class in the two class case. It is a good measure of classifier performance even when classes are unbalanced. Additionally, the statistical relevance of the classification was compared separately for positive and negative classes using the chi square test against the null hypothesis which states that the classifier has no discriminatory capacity (Motulsky, 1995). 2.6 Evaluation of additional global protein features and previously reported solubility prediction methods We evaluated all previously published methods which score protein solubility known to us. For every protein in our dataset we calculated solubility according to Wilkinson-Harrison (Davis et al., 1999; Harrison, 1999; Wilkinson and Harrison, 1991) using their web-server (http://www.biotech.ou.edu/) and according to Idicula-Thomas (IdiculaThomas and Balaji, 2005) using our own implementation of their method. Unfortunately the newer method published in 2006 (Idicula-Thomas et al., 2006) is not available publicly and was not provided to us upon repeated email requests to the authors. Discriminatory power of both methods was evaluated directly according to the rules specified by the authors. Additionally, the scores produced by these methods were used as input to train and evaluate a Naive Bayes classifier. We also calculated the following global sequence features: sequence length, isoelectric point (pI), grand average of hydropathicity index (GRAVY) (Gasteiger et al., 2003; Kyte and Doolittle, 1982), aliphatic index (AI) (Ikai, 1980), and fold index (FI) (Prilusky et al., 2005). Naive Bayes classifier was trained and evaluated with these features used as input to check whether any of them could result in reasonably good classification performance. Additionally, the combination of AI, FI, GRAVY and pI was also used. 2.7 Protein cloning, expression, purification and solubility measurements For direct experimental verification of our predictions we utilized two proteins - FOP (FGFR1 oncogene partner, SWISSPROT accession number O95684) and CAP350 (centrosome associated protein 350, SWISSPROT code Q5VT06) – as well as their multiple sequence variants (constructs) P.Smialowski et al. 4 generated by directed mutagenesis. All studied protein constructs were cloned into the pET46 Ek/LIC vector (Novagen) according to the manufacturer’s manual. Mutations were introduced by the Quick-Change Site Directed Mutagenesis Kit (Stratagene). All used proteins excluding ten FOP constructs spanning residues 54- 134, had the N-terminal His-tag. 19 of them had a 15 amino acid His-tag from the pET46 Ek/Lic vector and two constructs named FC2SHisMut and FC2LisH_2 had shorter tags spanning 10 and 9 residues respectively (Suppl. Table 4). To increase the efficiency of protein production, ten FOP constructs were expressed as high yielding 1-134 constructs with a factor Xa cleavage site at position 54. N-terminal 1-53 oligopeptides were subsequently removed by factor Xa protease cleavage. Additionally, to facilitate protein concentration measurements we added tryptophan at the position 135 to all FOP constructs spanning amino acids 1-134 and 54-134. Sequences of all resulting constructs are listed in Suppl. Table 4. All constructs were expressed in the E. coli strain BL21 STAR (DE3) (Invitrogen) grown in LB medium supplemented with 100 mg/ml of ampicillin by overnight induction with 0.5 mM IPTG at 25 ºC for CAP350 constructs and 1 mM IPTG at 37ºC for FOP constructs. After expression cells were lysed in a lysis buffer (50 mM NaH2PO4, pH 8, 300 mM NaCl, 10 mM βME, 10 mM imidazole), sonicated and centrifuged. Supernatants were passed through NiNTA column (Qiagen) in the first step of purification. The samples were loaded onto the column and washed with lysis buffer supplemented with 20 mM imidazole. Elution was performed with the lysis buffer supplemented with 250 mM imidazole. Xa protease cleavage was performed in 20 mM Tris-HCl at pH 7.5 supplemented with 100 mM NaCl and 2 mM CaCl2. Identities of all the constructs were checked by DNA sequencing. All proteins were checked by Western-blot, mass spectrometry and/or Nterminal Edman-sequencing. Purity of all proteins was estimated to be 85- 98% from SDS-PAGE and mass spectrometry. It was not possible to purify two tested constructs, namely FOP_FL (full length) and FC3. Protein solubility was first estimated by measuring relative amount of protein in the cytoplasmatic fraction and in inclusion bodies by SDSPAGE, dot-blot and/or Western-blot with appropriate antibodies. Concentration of purified proteins was calculated by measuring absorbance at 280 nm (Layne, 1957; Stoscheck, 1990). Extinction coefficients for all constructs were calculated with the ExPASy ProtParam tool (Gasteiger et al., 2005; Gill and Hippel, 1989). For some of them maximal concentration achievable in the given buffer without causing precipitation was measured after purification as described by Golovanov and coworkers (2004). All concentrations were performed using 10 ml stirring concentrator (Amicon Inc). Oligomerization state of all proteins was determined using the ÄKTA explorer 10 chromatography system (Amersham-Pharmacia) with the following gel filtration chromatography columns: Superdex 75 HR 10/30 or HiLoad 26/60 Superdex 75 prep grade (both Amersham-Pharmacia). A protein was considered insoluble when it predominantly accumulated in inclusion bodies and resisted standard attempts to refold from denaturizing solution (Singh and Panda, 2005; Tsumoto et al., 2004) using the dilution-refolding method. Stability in solution was assumed when the protein did not precipitate nor degrade (as checked by SDS-PAGE) over a one week period at 4°C. For all proteins we calculated solubility scores using our classifier. When the protein was classified as a positive (soluble) the class probability was used as a score, in the case of negative it was 1-class probability. This resulted in score values in the range from 0 to 1, with insoluble proteins having values less then 0.5 and soluble greater or equal 0.5. 3 RESULTS AND DISCUSSION 3.1 Performance of the primary classifier We classified independently data for mono and multiple domain proteins. Based on Matthews Correlation Coefficient (MCC) values we chose the best classifiers for each amino acid word size and clustering schema. For each word length, the best classifier corresponding to clustering schemes was selected. In general, improvement of accuracy and MCC by increasing word size from one to two is limited. Performance of classifiers based on word size three is lower than the results for shorter word sizes. This effect can be associated with lower number of counts for word size three leading to less efficient problem generalization by SVM. A large difference was observed between classification results for mono and multidomain proteins. All primary classifiers for multidomain proteins had MCC higher than 0.4 and accuracy over 70% while monodomain protein classifiers had MCC not exceeding 0.35 and accuracies below 68 %. Detailed results for mono and multiple domain proteins are listed in Suppl. Tables 5 and 6, respectively. All statistical descriptions of the classifiers and classification of all instances were obtained using 10-fold stratified cross-validation over the dataset. In other words, we always tested our classifier on the data which were never part of the training data. It also means that the test sequences did not share sequence similarity greater than 49% with the training data. 3.2 Feature selection for primary classification The most advantageous feature subsets were selected using the wrapper (Kohavi and John, 1997) method as described in Materials and Methods. Primary features selected for the best clustering schema for word size of one, two and three residues, are reported in Suppl. Table 7 separately for mono and multiple domain proteins. It is important to remember that wrapper is designed to establish subsets of features optimal for classification (Kohavi and John, 1997). In doing so, however, it does not provide directly the relations between feature values and solubility. It is interesting that out of eight frequencies of amino acids (R, D, C, E, G, L, M, S) selected by wrapper for both mono and multiple domain proteins four – G, R, D, and E - are common with those used as input for the Wilkinson-Harrison method (Davis et al., 1999). Relative content of negatively charged residues (DE) seems to be the strongest determinant of protein solubility as it was selected as an important attribute by all researchers so far (Bertone et al., 2001; Christendat et al., 2000; Davis et al., 1999; Goh et al., 2004; Wilkinson and Harrison, 1991). Frequencies of cysteine and methionine were also found to be important by Goh and coworkers (Goh et al., 2004). Amongst di-peptide frequencies five of them are common for mono and multiple domain proteins: RE, QA, EG, HM, and KG. Notably, the first position in three out of five peptides is occupied by charged (in 2 cases basic, in one acidic) and in one by polar residues while in the second position in four cases there is a residue with non-polar side chain and only in one case with charged (acidic) side chain. Thus, frequencies of di-peptides with the first residue charged and the second non-polar stand out as important determinants of protein solubility. Interestingly, frequencies of all such peptides were shown to be neutral or even favorable for protein stability (Guruprasad et al., 1990). Evaluation of symmetrical uncertainty confirmed the importance of these five peptides only partially. EG, HM, and KG had scores at least five times larger then the average for each mono and multiple domain dataset, while RE and QA were not found to correlate significantly with protein solubility by this approach (data not shown). Overall, the frequencies of these five di-peptides when considered alone for Naive Bayes classification yielded 56.5% and 62% accuracy for mono and multiple domain datasets, respec- Protein solubility: sequence based prediction and experimental verification 5 tively, 10% less then the classification based on frequencies of all 400 peptides. 3.3 Performance of the meta-classifier Naive Bayes was used as a second-level classifier to aggregate the information from primary SVM classifiers. Input data for the metaclassifier was formed by the classification results for each instance acquired upon 10-fold cross validation of the three best performing primary-classifiers (one for each of the three different word lengths). Combining inputs from different primary classifiers improved the overall separation power of the method and made classifiers less prone to overfitting. In total our two-level method had an overall accuracy of 72%, with 75% for positive and 68.5% for negative class. The statistical relevance of the results for both classes is very high with p-value <2.2E-16. The MCC value achieved in this work is 0.434, the highest amongst all tested or previously reported classifiers. For mono and multiple domain proteins the accuracy is 68.0% and 74.6% respectively. A more detailed characterization of the metaclassifier is given in the last column of Table 1, Suppl. Table 8 and by the receiver operating characteristic (ROC) curves shown in Suppl. Figure 3. The area under ROC curve is equal to 0.781. This last value is a good measurement of classifier performance and is usually interpreted as the probability that the classifier will assign a higher score to the positive example than to the negative, when both are randomly picked. The shape of the curve reveals the detailed relation between the classifier sensitivity (Sensitivity(P)=TPrate) and indirectly specificity (FPrate=1-Specificity(P)) in the entire range of these variables. The diagonal line on the graph symbolizes random classifier. The PROSO ROC curves are slightly asymmetrical. For the insoluble class the curve goes up steeper in the first half of the chart. This type of ROC is typical for classifiers that can discriminate relatively modest fractions of examples, but do so with high degree of certainty. On the other hand the curve for soluble class grows slower and reaches higher values on the right side of the chart, which is characteristic for classifiers able to retrieve high percentage of instances of the given class. As revealed by the ROC curve, PROSO classified ~70% of positive or negative instances correctly while having only ~25% instances misclassified into the opposite class. 3.4 A comparative study of solubility predicting methods We compared the performance of our predictor with other previously reported algorithms: the widely used Wilkinson-Harrison method (Wilkinson and Harrison, 1991), and the recently reported Idicula-Thomas method (Idicula-Thomas and Balaji, 2005). We assessed the results of the Wilkinson-Harrison and Idicula-Thomas approaches using classification thresholds as described by the authors (Harrison, 1999; Idicula-Thomas and Balaji, 2005). Additionally, using the output of these methods and a number of global protein features such as pI, length, etc. we trained and evaluated (by 10-fold stratified cross-validation) a set of Naive Bayes classifiers as described in Materials and Methods. All results obtained on a dataset including 14200 proteins are summarized in Table 1 and Suppl. Table 8. In this comparison our method achieved the highest value of MCC 0.434 and the accuracy of 72%. The Wilkinson-Harrison approach resulted in much lower accuracy of 56.2% and the MCC of 0.127. Although the accuracy value reported by Idicula-Thomas (Idicula-Thomas et al., 2006) is 72% and is thus equal to ours, the MCC of this classifier is 18% lower (0.358). Furthermore, the latter classifier was originally tested only on a small set of 64 instances. A chi square test revealed that the results reported by Idicula-Thomas et al. (2006) are only slightly statistically significant for the negative class (p-value 0.0065) and not significant for the positive class (p-value 0.988) (Suppl. Table 8). The performance reported by Idicula-Thomas and coworkers (accuracy 72%) is due to unbalance in the class representation. Their method is strongly biased toward the negative class with TNrate=0.8 and TPrate=0.55. On the same data a dummy classifier labeling all proteins as insoluble would have the accuracy of 67%, meaning that the reported classifier outperforms the dummy approach only by 5 percentage points. PROSO surpasses the dummy classifier by over 20 percent points. Amongst all classifiers based on single global features the one exploiting isoelectrical point (pI) was the most efficient: it reached MCC of 0.2 and accuracy of 59.5%. The combination of all global descriptors was slightly more successful, with MCC and accuracy values of 0.222 and 60.7%, respectively. Despite their relatively high accuracy, both pI based and combined classifiers suffer from high unbalance of classification (TPrate ~0.7 and TNrate ~0.4) and relatively limited statistical significance of results for the negative class (p-value > 0.001). It is also worth noting that Naive Bayes based on the combination of all global descriptors was the only method except PROSO to reach both positive and negative gain values of more then 1.15. Amongst dedicated solubility prediction methods used in combination with Naive Bayes, WilkinsonHarrison clearly outperformed Idicula-Thomas approach on our dataset reaching the accuracy of 57.6% (MCC=0.153) as contrasted with 54.5% (MCC=0.091) of the later method. Our solubility prediction method reaches gain values of 1.41 for soluble and 1.46 for insoluble class, which means that it improves the likelihood of correctly distinguishing between soluble and insoluble proteins by more then 40% relative to random selection (gain=1). It is important to realize that the use of any classifier which shows the gain value 1 or lower is pointless as the equivalent can be achieved by randomly drawing instances from input data. From this comparison we also conclude that the performance of our classifier can not be explained by merely detecting features indirectly linked to pI (accuracy = 60.7%) nor the presence of unfolded regions (accuracy = 55.5%). 3.5 Experimental verification of solubility predictions We tested our method against experimental data on solubility measured for 31 different constructs of two proteins: FOP (FGFR1 oncogene partner) and CAP350 (centrosome associated protein 350). For two FOP constructs FOP_FL and FC3 we were unable to obtain proteins pure enough for reliable estimation of concentration. For these two constructs protein solubility was quantified as the percentage of soluble fraction in relation to inclusion bodies after expression in E. coli and was measured by SDS-PAGE and/or Western-blot (Suppl. Table 9). The rest of the proteins were divided into insoluble, medium soluble and highly soluble based on their estimated maximum concentration, oligomerization state, and stability in solution as described in Materials and Methods. Insoluble proteins were those that could not stay in solution without denaturizing agents like urea or gauanidinium hydrochloride. Highly soluble proteins were defined as those that stayed in solution in P.Smialowski et al. 6 high concentrations, did not oligomerize strongly (percentage of oligomers lower then 50%) and were stable (did not precipitate or aggregate). As an example, Suppl. Figure 2 demonstrates experimental evaluation of solubility for one of the constructs used in this work (FC2SHisMut). Proteins were labeled medium soluble when more than half of the protein in solution was in high oligomeric stage. In all cases our method categorized instances correctly to soluble/insoluble class. Additionally there was a correlation between PROSO score (see Methods) and membership in medium or highly soluble class. By attributing the proteins with PROSO score higher or equal 0.745 to the highly soluble category we were correct in 10 out of 14 cases (the number of highly soluble proteins) achieving TPrate(highly soluble) = 0.71. For medium soluble proteins we were able to classify correctly all instances (PROSO score higher or equal 0.5 and lower then 0.745). Detailed results are provided in Suppl. Table 9 and 10. 3.6 Conclusions: advances, limitations, and future directions We report a novel sequence-based approach to classify proteins into “soluble” and “insoluble”. It is able to categorize sequences with low or no sequence homology to training data. Based on the MCC value of 0.434 and accuracy values of 74.9% and 68.5% for positive and negative class, respectively, we believe that our classifier outperforms any previously reported solubility predictor and is thus expected to be helpful in selecting soluble proteins for biophysical studies as well as in detecting particularly hard cases. Predictive abilities of our method were furthermore confirmed by high correlation of assigned scores with experimentally determined solubility. We also identified the subset of features which have the strongest impact on protein solubility. An obvious limitation of our method is that it is only applicable to non-membrane proteins. It is also unable to take into account factors unrelated to protein sequence such as buffer composition etc. Instead of the currently used simplistic approach to dividing proteins into single- and multi-domain a more sophisticated prediction technique would be desirable, such as described, for example, by Jones et al. (2005) or Liu & Rost (2004). Finally, a significant potential for further improvement of our method exists as the annotation of TargetDB proteins gets more accurate and detailed. ACKNOWLEDGEMENTS This work was funded by BMBF BIO/0312992A REFERENCES Armstrong, N., de Lencastre, A. and Gouaux, E. (1999) A new protein folding screen: application to the ligand binding domains of a glutamate and kainate receptor and to lysozyme and carbonic anhydrase, Protein Sci, 8, 1475-1483. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank, Nucleic Acids Res, 28, 235-242. Bertone, P., Kluger, Y., Lan, N., Zheng, D., Christendat, D., Yee, A., Edwards, A.M., Arrowsmith, C.H., Montelione, G.T. and Gerstein, M. (2001) SPINE: an integrated tracking database and data mining approach for identifying feasible targets in highthroughput structural proteomics, Nucleic Acids Res, 29, 2884-2898. Biou, V., Gibrat, J.F., Levin, J.M., Robson, B. and Garnier, J. (1988) Secondary structure prediction: combination of three different methods, Protein Eng, 2, 185-191. Charton, M. and Charton, B.I. (1982) The structural dependence of amino acid hydrophobicity parameters, J Theor Biol, 99, 629-644. Chen, L., Oughtred, R., Berman, H.M. and Westbrook, J. (2004) TargetDB: a target registration database for structural genomics projects, Bioinformatics, 20, 2860-2862. Chow, M.K., Amin, A.A., Fulton, K.F., Fernando, T., Kamau, L., Batty, C., Louca, M., Ho, S., Whisstock, J.C., Bottomley, S.P. and Buckle, A.M. (2006) The REFOLD database: a tool for the optimization of protein expression and refolding, Nucleic Acids Res, 34, D207-212. Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A., Cort, J.R., Booth, V., Mackereth, C.D., Saridakis, V., Ekiel, I., Kozlov, G., Maxwell, K.L., Wu, N., McIntosh, L.P., Gehring, K., Kennedy, M.A., Davidson, A.R., Pai, E.F., Gerstein, M., Table 1 Comparative analysis of solubility prediction methods.a Naive Bayes Method Idicula-Thomas, (2006) Dummy classifier (all insoluble) Idicula-Thomas (2005) WH Idicula-Thomas (2005) WH Length AI GRAVY pI AI+FI+G RAVY+pI PROSO Instances 64 64 14200 14200 14200 14200 14200 14200 14200 14200 14200 14200 Accuracy 72 67.2 53.1 56.2 54.5 57.6 50.5 58.4 56.0 59.5 60.7 71.7 MCC 0.358 - 0.078 0.127 0.091 0.153 0.011 0.171 0.134 0.201 0.222 0.434 TPrate 0.55 0 0.233 0.447 0.451 0.623 0.589 0.678 0.781 0.753 0.737 0.749 TNrate 0.803 1 0.829 0.676 0.639 0.530 0.422 0.490 0.339 0.438 0.477 0.685 Gain(P) 1.677 - 1.155 1.160 1.111 1.140 1.009 1.141 1.083 1.145 1.170 1.408 Gain(N) 1.195 1 1.039 1.100 1.076 1.169 1.013 1.207 1.215 1.278 1.290 1.463 AUROC - - 0.573 0.601 0.565 0.598 0.499 0.611 0.576 0.628 0.650 0.781 a Performance of different methods for solubility prediction. Additionally in the first column we present results reported by Idicula-Thomas and coworkers (2006) on the small test dataset inclusing 64 instances. WH: Wilkinson Harrison method. We also evaluated how strongly protein solubility is correlated with simple sequence features: aliphatic index (AI), fold index (FI), GRAVY index (GRAVY), and pI (pI). To optimize the performance a Naive Bayes classifier was trained with output values of a given method or a sequence feature. All results presented were obtained using stratified ten fold cross validation as described in the text. MCC: Matthews Correlation Coefficient, AUROC: Area under receiver operating characteristic curve. The letters P and N in parentheses refer to positive (soluble) and negative (insoluble) class, respectively. PROSO: our method presented in this work. Protein solubility: sequence based prediction and experimental verification 7 Edwards, A.M. and Arrowsmith, C.H. (2000) Structural proteomics of an archaeon, Nat Struct Biol, 7, 903-909. Dale, G.E., Broger, C., Langen, H., D'Arcy, A. and Stuber, D. (1994) Improving protein solubility through rationally designed amino acid replacements: solubilization of the trimethoprim-resistant type S1 dihydrofolate reductase, Protein Eng, 7, 933- 939. Davis, G.D., Elisee, C., Newham, D.M. and Harrison, R.G. (1999) New fusion protein systems designed to give soluble expression in Escherichia coli, Biotechnol Bioeng, 65, 382-388. Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likehood from incomplete data via the EM algorithm, J Roy Statist Soc, 39, 1-38. Domingos, P. and Pazzani, M. (1996) Beyond independence: conditions for the optimality of the simple bayesian classifier. In Saitta, L. (ed), International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 105-112. Edwards, A.M., Arrowsmith, C.H., Christendat, D., Dharamsi, A., Friesen, J.D., Greenblatt, J.F. and Vedadi, M. (2000) Protein production: feeding the crystallographers and NMR spectroscopists, Nat Struct Biol, 7 Suppl, 970-972. Fan, R.E., Chen, P.H. and Lin, C.J. (2005) Working set selection using second order information for training SVM, Journal of Machine Learning Research 6, 1889-1918. Frydman, J. (2001) Folding of newly translated proteins in vivo: the role of molecular chaperones, Annu Rev Biochem, 70, 603-647. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D. and Bairoch, A. (2003) ExPASy: The proteomics server for in-depth protein knowledge and analysis, Nucleic Acids Res, 31, 3784-3788. Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S., Wilkins, M.R., Appel, R.D. and Bairoch, A. (2005) Protein Identification and Analysis Tools on the ExPASy Server. In Walker, J.M. (ed), The Proteomics Protocols Handbook. Humana Press, Totowa, New Jersey. Georgiou, G. and Valax, P. (1996) Expression of correctly folded proteins in Escherichia coli, Curr Opin Biotechnol, 7, 190-197. Gill, S. and Hippel, P. (1989) Calculation of protein extinction coefficients from amino acid sequence data. , Anal. Biochem., 182, 319-326. Goh, C.S., Lan, N., Douglas, S.M., Wu, B., Echols, N., Smith, A., Milburn, D., Montelione, G.T., Zhao, H. and Gerstein, M. (2004) Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis, J Mol Biol, 336, 115-130. Golovanov, A.P., Hautbergue, G.M., Wilson, S.A. and Lian, L.Y. (2004) A simple method for improving protein solubility and long-term stability, J Am Chem Soc, 126, 8933-8939. Guruprasad, K., Reddy, B.V. and Pandit, M.W. (1990) Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence, Protein Eng, 4, 155-161. Hall, M.A. and Holmes, G. (2003) Benchmarking Attribute Selection Techniques for Data Mining, IEEE Transactions on Knowledge and Data Engineering, 15, 1437- 1447. Harrison, R.G. (1999) Recombinant Protein Solubility Prediction, http://www.biotech.ou.edu/. University of Oklahoma. Idicula-Thomas, S. and Balaji, P.V. (2005) Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli, Protein Sci, 14, 582-592. Idicula-Thomas, S., Kulkarni, A.J., Kulkarni, B.D., Jayaraman, V.K. and Balaji, P.V. (2006) A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli, Bioinformatics, 22, 278-284. Ikai, A. (1980) Thermostability and aliphatic index of globular proteins, J Biochem (Tokyo), 88, 1895-1898. Janin, J. and Wodak, S. (1978) Conformation of amino acid side-chains in proteins, J Mol Biol, 125, 357-386. Jones, D.T., Bryson, K., Coleman, A., McGuffin, L.J., Sadowski, M.I., Sodhi, J.S. and Ward, J.J. (2005) Prediction of novel and analogous folds using fragment assembly and fold recognition, Proteins, 61 Suppl 7, 143-151. Kapust, R.B. and Waugh, D.S. (1999) Escherichia coli maltose-binding protein is uncommonly effective at promoting the solubility of polypeptides to which it is fused, Protein Sci, 8, 1668-1674. Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database, Nucleic Acids Res, 28, 374. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C. and Murthy, K.R.K. (2001) Improvements to Platt's SMO Algorithm for SVM Classifier Design, Neural Computation, 13, 637-649. Kohavi, R. and John, G. (1997) Wrappers for feature subset selection., Artificial Intelligence journal, 97, 273-324. Krigbaum, W.R. and Komoriya, A. (1979a) Local interactions as a structure determinant for protein molecules: II, Biochim Biophys Acta, 576, 204-248. Krigbaum, W.R. and Komoriya, A. (1979b) Local interactions as a structure determinant for protein molecules: III, Biochim Biophys Acta, 576, 229-246. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, J Mol Biol, 305, 567-580. Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein, J Mol Biol, 157, 105-132. Layne, E. (1957) Spectrophotometric and Turbidimetric Methods for Measuring Proteins., Methods in Enzymology, 3, 447-455. Li, W., Jaroszewski, L. and Godzik, A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases, Bioinformatics, 17, 282-283. Li, W., Jaroszewski, L. and Godzik, A. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases, Bioinformatics, 18, 77-82. Liu, J. and Rost, B. (2004) Sequence-based prediction of protein domains, Nucleic Acids Res, 32, 3522-3530. Luan, C.H., Qiu, S., Finley, J.B., Carson, M., Gray, R.J., Huang, W., Johnson, D., Tsao, J., Reboul, J., Vaglio, P., Hill, D.E., Vidal, M., Delucas, L.J. and Luo, M. (2004) High-throughput expression of C. elegans proteins, Genome Res, 14, 2102-2110. Makrides, S.C. (1996) Strategies for achieving high-level expression of genes in Escherichia coli, Microbiol Rev, 60, 512-538. Meila, M. and Heckerman, D. (2001) An Experimental Comparison of Model-Based Clustering Methods., Machine Learning, 42, 9-29. Motulsky, H. (1995) Intuitive Biostatistics. Oxford University Press, New York. Nozaki, Y. and Tanford, C. (1971) The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions. Establishment of a hydrophobicity scale, J Biol Chem, 246, 2211-2217. Onuchic, J.N. and Wolynes, P.G. (2004) Theory of protein folding, Curr Opin Struct Biol, 14, 70-75. Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., Bennett, C., Marsden, R., Grant, A., Lee, D., Akpor, A., Maibaum, M., Harrison, A., Dallman, T., Reeves, G., Diboun, I., Addou, S., Lise, S., Johnston, C., Sillero, A., Thornton, J. and Orengo, C. (2005) The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis, Nucleic Acids Res, 33, D247-251. P.Smialowski et al. 8 Pedelacq, J.D., Piltch, E., Liong, E.C., Berendzen, J., Kim, C.Y., Rho, B.S., Park, M.S., Terwilliger, T.C. and Waldo, G.S. (2002) Engineering soluble proteins for structural genomics, Nat Biotechnol, 20, 927-932. Platt, J. (1999) Fast training of support vector machines using sequential minimal optimization. In Scholkopf, B., Burges, C.J.C. and Smola, A.J. (eds), Advances in Kernel Methods: Support Vector Learning MIT Press, Cambridge, MA, 182-208. Ponnuswamy, P.K., Prabhakaran, M. and Manavalan, P. (1980) Hydrophobic packing and spatial arrangement of amino acid residues in globular proteins, Biochim Biophys Acta, 623, 301-316. Prilusky, J., Felder, C.E., Zeev-Ben-Mordehai, T., Rydberg, E.H., Man, O., Beckmann, J.S., Silman, I. and Sussman, J.L. (2005) FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded, Bioinformatics, 21, 3435- 3438. Roseman, M.A. (1988) Hydrophilicity of polar amino acid side-chains is markedly reduced by flanking peptide bonds, J Mol Biol, 200, 513-522. Seckler, R. and Jaenicke, R. (1992) Protein folding and protein refolding, Faseb J, 6, 2545-2552. Singh, S.M. and Panda, A.K. (2005) Solubilization and refolding of bacterial inclusion body proteins, J Biosci Bioeng, 99, 303-310. Smialowski, P., Schmidt, T., Cox, J., Kirschner, A. and Frishman, D. (2006) Will my protein crystallize? A sequence-based predictor, Proteins, 62, 343-355. Stoscheck, C. (1990) Quantitation of Protein, Methods in Enzymology, 182, 50-69. Tresaugues, L., Collinet, B., Minard, P., Henckes, G., Aufrere, R., Blondeau, K., Liger, D., Zhou, C.Z., Janin, J., Van Tilbeurgh, H. and Quevillon-Cheruel, S. (2004) Refolding strategies from inclusion bodies in a structural genomics project, J Struct Funct Genomics, 5, 195-204. Tsumoto, K., Umetsu, M., Kumagai, I., Ejima, D., Philo, J.S. and Arakawa, T. (2004) Role of arginine in protein refolding, solubilization, and purification, Biotechnol Prog, 20, 1301-1308. Waldo, G.S. (2003) Genetic screens and directed evolution for protein solubility, Curr Opin Chem Biol, 7, 33-38. Wilkinson, D.L. and Harrison, R.G. (1991) Predicting the solubility of recombinant proteins in Escherichia coli, Biotechnology (N Y), 9, 443-448. Zhou, H. and Zhou, Y. (2004) Quantifying the effect of burial of amino acid residues on protein stability, Proteins, 54, 315-322.