Protein Engineering vol.4 no.2 pp.155-161, 1990

Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence

Kunchur Guruprasad, B.V.Bhasker Reddy and Madhusudan W.Pandit¹

Centre for Cellular and Molecular Biology, Hyderabad - 500 007, India

¹To whom correspondence should be addressed

Statistical analysis of 12 unstable and 32 stable proteins revealed that there are certain dipeptides, the occurrence of which is significantly different in the unstable proteins compared with those in the stable ones. Based on the impact of these dipeptides on the unstable proteins over the stable ones, a weight value of instability is assigned to each of the dipeptides. For a given protein, the summation of these weight values normalized to the length of its sequence helps to distinguish between unstable and stable proteins. Results suggest that the in vivo instability of proteins is possibly determined by the order of certain amino acids in their sequences. An attempt is made to correlate the metabolic stability of proteins with features of their primary sequence; the weight values of instability for a protein of known sequence could thus be used as an index for predicting its stability characteristics.

Key words: instability index/primary structure/protein instability/stability prediction/turnover rate

Introduction

Proteins are known to degrade rapidly when their conformations are altered by mutations, incorporation of amino acid analogs, denaturation or premature chain terminations, etc. (Goldberg and St John, 1976). All such modifications are likely to prevent proper folding or to disrupt tertiary structure, which can make the resulting aberrant polypeptide prone to degradation. However, normal cellular proteins also display a wide range of half-lives; turnover rates of individual proteins can differ as much as 1000-fold (Rechsteiner et al., 1987).
These observations have led to many speculations about the features of proteins that elicit proteolysis. Several hypotheses have been proposed to explain the intracellular stability (or instability) of proteins; these hypotheses have been reviewed recently (Rechsteiner et al., 1987). Current understanding suggests that, among several factors, sequence-specific properties (Bachmair et al., 1986; Rogers et al., 1986), global features and the location of a protein in the cell are important in deciding the intracellular stability of a protein.

In this communication, we report an observation that has emerged from the analysis of primary sequences of a set of proteins that are known to degrade rapidly (Rogers et al., 1986), and the comparison of these data with those obtained from a set of stable proteins (i.e. those with comparatively high in vivo half-lives). The proteins which are known to degrade very rapidly are found to contain certain dipeptides in relatively large proportions; these might be directly or indirectly involved in the rapid degradation of proteins. Our analysis also points out that there is yet another set of dipeptides, the existence of which could be equally important for the stable proteins. The statistical procedure leading to the dipeptide instability weight value reported here indicates the possibility of its use in predicting whether a particular protein would be unstable (or stable) in vivo.

Materials and methods

The strategy and the development of the instability index

As classified by Rogers et al. (1986), a set of 32 proteins with an in vivo half-life of >16 h was taken as a class of stable proteins and a set of 12 proteins with an in vivo half-life of <5 h was taken as a class of unstable proteins (Table I). The sequences of these two sets of proteins were analysed separately.
The frequency of occurrence of each of the 20 amino acids was calculated in the unstable as well as stable class of proteins and was compared with the frequency of occurrence of various amino acids in the total Protein Sequence Database of the PIR (Release 12.0). These data (Figure 1) explicitly bring out certain notable differences between the frequency of occurrence of various amino acids in unstable and stable proteins. It can be seen from Figure 1, that in the case of unstable proteins, amino acids Met(M), Gln(Q), Pro(P), Glu(E) and Ser(S) are found to occur with a relatively high frequency. The PEST hypothesis proposed earlier by Rogers et al. (1986) reports the presence of regions consisting of amino acids (Pro)P, (Glu)E, (Ser)S and (Thr)T in the unstable proteins. Our observation indicates that, unlike the PEST hypothesis, Thr(T) does not occur more frequently in unstable proteins compared with stable proteins. In fact, the presence of amino acids such as Met(M) is more frequent in unstable proteins. Similarly it was noted that amino acids Asn(N), Lys(K) and Gly(G) occur with relatively higher frequencies in stable proteins when compared with the unstable ones. Therefore, it appears that the PEST hypothesis could partly be a reflection of the consequence of differential occurrence of certain amino acids. However, the PEST hypothesis talks about regions of negatively charged amino acids that tend to be clustered and generally flanked by positively charged amino acids. Therefore, it appears that instability or stability characteristics are probably governed by an arrangement of certain amino acids in a specific order. Hence the smallest unit that defines order in a sequence being a dipeptide, the occurrence of an amino acid in juxtaposition to the other might be a significant factor in determining the stability of the protein. Therefore we chose to search for the occurrence of all 400 possible dipeptides in the two classes of proteins. 
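The two counting steps described above (amino acid frequency per 1000 residues, and occurrences of the 400 ordered dipeptides) can be sketched as follows. This is only an illustrative sketch: the function names and the toy sequences are assumptions and are not part of the original analysis; the residue order is the one used on the amino acid axis of Figure 1.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "WCMHYFQNIRDPTKEVSGAL"  # order used on the amino acid axis of Figure 1


def residue_frequency_per_1000(sequences):
    """Occurrences of each amino acid per 1000 residues, pooled over one class."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: 1000.0 * counts[aa] / total for aa in AMINO_ACIDS}


def dipeptide_counts(sequences):
    """Observed occurrences of all 400 ordered dipeptides, pooled over one class."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair in counts:  # skip pairs containing non-standard symbols
                counts[pair] += 1
    return counts


# Hypothetical toy sequences, used only to exercise the functions.
toy_class = ["MPESTPEST", "MKKGN"]
freq = residue_frequency_per_1000(toy_class)
pairs = dipeptide_counts(toy_class)
```

Pooling the counts over all sequences of a class, rather than averaging per protein, is a choice made here for simplicity.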
The expected (probable) occurrences of dipeptides were calculated, assuming the constituents of dipeptides to be independent events, by the equation:

$$N^{e}(xy) = \left[\frac{N^{o}(x)}{T}\right]\left[\frac{N^{o}(y)}{T}\right]\sum_{x=1}^{20}\sum_{y=1}^{20} N^{o}(xy) \qquad (1)$$

where $N^{e}(xy)$ and $N^{o}(xy)$ are the expected and the observed occurrences, respectively, of dipeptide $xy$; $N^{o}(x)$ and $N^{o}(y)$ are the observed occurrences of amino acids $x$ and $y$, respectively; and $T$ is the total number of amino acids in a particular class.

© Oxford University Press

Table I. Proteins studied, their half-life and II

Serial no.  Protein(a)   Half-life(b) (h)   PIR/EMBL* code   II
(A) Stable proteins
 1          ADK          133                KIHUA             29.4
 2          ADH          139                DEHUAB            17.4
 3          AAT          66                 XNPGDC            36.3
 4          CAI          72                 CRHU1             25.1
 5          CPA          137                CPRTA             21.5
 6          CAT          80                 CSBO              27.5
 7          CHY          50                 KYBOA             15.3
 8          CIS          128                YKPG              20.8
 9          CCC          26                 CCHU              11.4
10          AAD          118                OXPGDA            34.4
11          DPA          80                 ADECD             14.2
12          DHF          97                 RDBOD             27.4
13          ELA          46                 ELPG              23.0
14          FER          61                 FRHUL             23.8
15          GPD          84                 DEHUGL            16.1
16          HEM          209                HAHU               7.0
17          HEM          209                HBHU               6.1
18          ABP          65                 JGECA             23.1
19          LDH          171                DEPGLH            25.6
20          LYS          16                 LZCH              19.9
21          MYO          127                MYHO               9.1
22          PAR          41                 PVRYC             22.8
23          PGK          207                KIBYG             22.2
24          PLA          78                 PSKF2U            15.1
25          PYK          181                KICHPM            21.2
26          RNA          61                 NRGPA             52.2
27          SOD          41-200             DSBOCZ             7.0
28          STI          40                 TISY              27.4
29          SUB          84                 SUBSC             12.0
30          THI          155                TXEC               5.6
31          TPI          122                ISBYT             19.7
32          TRY          44                 TRBOTR            20.6
(B) Unstable proteins
33          E1A          0.5                AD5001*          100.3
34          α-Casein     2-5                KABOSB            57.7
35          β-Casein     2-5                KBBOA2            96.5
36          c-fos        0.5                HSCFOS*           78.8
37          c-myc        0.5                HSMYCI*           92.2
38          v-myb        0.5                GGCMYB1*          74.5
39          HSP 70       1-2                HHFF72            44.1
40          HMG-CoA      1.5-3              RDHYE             54.5
41          ODC          0.5                DCMSO             52.3
42          P730         1.0                ASAP3R*           50.4
43          TAT          0.5                RNTATR*           53.9
44          p53          0.5                MMP53*            70.0

(a) Abbreviations: ADK, adenylate kinase 1; ADH, alcohol dehydrogenase; AAT, aspartate amino transferase; CAI, carbonic anhydrase I; CPA, carboxypeptidase A; CAT, catalase; CHY, chymotrypsinogen; CIS, citrate synthase; CCC, cytochrome C; AAD, D-amino acid oxidase; DPA, deoxyribose phosphate aldolase; DHF, dihydrofolate reductase; ELA, elastase; FER, ferritin; GPD, glyceraldehyde 3-phosphate dehydrogenase; HEM, haemoglobin; ABP, L-arabinose binding protein; LDH, lactate dehydrogenase; LYS, lysozyme c; MYO, myoglobin; PAR, parvalbumin; PGK, phosphoglycerate kinase; PLA, phospholipase A2; PYK, pyruvate kinase; RNA, ribonuclease A; SOD, superoxide dismutase; STI, soybean trypsin inhibitor; SUB, subtilisin; THI, thioredoxin; TPI, triosephosphate isomerase; TRY, trypsinogen; E1A, adenovirus early protein; HSP70, heatshock protein 70; HMG-CoA, hydroxymethyl glutaryl-CoA; ODC, ornithine decarboxylase; P730, phytochrome; TAT, tyrosine amino transferase.
(b) References for half-life are taken from Rogers et al. (1986).

The chi-square values, $\chi^{2}(xy)$, between observed and expected occurrences of dipeptides ($xy$) for each class of proteins were calculated by the equation:

$$\chi^{2}(xy) = \frac{[N^{o}(xy) - N^{e}(xy)]^{2}}{N^{e}(xy)} \qquad (2)$$

An average value of $\chi^{2}$ for each class of protein was calculated using the equation:

$$\bar{\chi}^{2} = \frac{1}{400}\sum_{xy=1}^{400}\chi^{2}(xy) \qquad (3)$$

$\bar{\chi}^{2}$ was then used as the confidence limit to select significant dipeptides for each class of protein. The conditions used for selecting significant dipeptides for the unstable and stable classes of proteins, respectively, are:

$$\chi^{2}_{us}(xy) \ge \bar{\chi}^{2}_{us} \quad \text{and} \quad \chi^{2}_{s}(xy) \ge \bar{\chi}^{2}_{s}$$

$\chi^{2}_{us-s}(xy)$ was also calculated from the values of observed occurrences of dipeptides in the unstable and stable classes of proteins, using the following equation:

$$\chi^{2}_{us-s}(xy) = \frac{[N^{o}_{us}(xy) - N^{o}_{s}(xy)]^{2}}{N^{o}_{s}(xy)} \qquad (4)$$

where $N^{o}_{us}(xy)$ and $N^{o}_{s}(xy)$ are the observed occurrences of dipeptides in the unstable and stable classes of proteins, respectively. As we were interested in picking up factors which contribute to the instability, in calculating $\chi^{2}_{us-s}(xy)$ each squared difference in the occurrence of dipeptides in the unstable and stable classes was weighted inversely by the occurrence of dipeptides in the stable class.
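The expected occurrence of a dipeptide, its chi-square value and the selection of significant dipeptides against the class average can be sketched as below. The sketch assumes the standard chi-square form, (observed - expected)² / expected, for each dipeptide; the function names, the handling of dipeptides with zero expectation and the toy sequence are assumptions made for illustration only.

```python
from itertools import product

AMINO_ACIDS = "WCMHYFQNIRDPTKEVSGAL"


def expected_and_chi_square(sequences):
    """Return (observed, expected, chi2) dicts keyed by dipeptide for one class."""
    residues = {aa: 0 for aa in AMINO_ACIDS}
    observed = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for seq in sequences:
        for ch in seq:
            if ch in residues:
                residues[ch] += 1
        for i in range(len(seq) - 1):
            if seq[i:i + 2] in observed:
                observed[seq[i:i + 2]] += 1
    total_residues = sum(residues.values())  # T
    total_pairs = sum(observed.values())     # double sum of observed dipeptides
    expected, chi2 = {}, {}
    for pair in observed:
        x, y = pair
        expected[pair] = (residues[x] / total_residues) * (residues[y] / total_residues) * total_pairs
        # Assumption: a dipeptide with zero expectation cannot be scored, so it gets 0.
        if expected[pair] > 0:
            chi2[pair] = (observed[pair] - expected[pair]) ** 2 / expected[pair]
        else:
            chi2[pair] = 0.0
    return observed, expected, chi2


def significant_dipeptides(chi2):
    """Dipeptides whose chi-square is at least the average over all 400 dipeptides."""
    mean_chi2 = sum(chi2.values()) / 400
    return {pair for pair, value in chi2.items() if value >= mean_chi2}


# Hypothetical one-sequence class, used only to exercise the functions.
obs, exp, chi2 = expected_and_chi_square(["AAAL"])
selected = significant_dipeptides(chi2)
```

In a real analysis the function would be run once per class (unstable and stable), and the two selected sets compared.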
A third set of dipeptides was selected from these $\chi^{2}_{us-s}(xy)$ values satisfying the condition:

$$\chi^{2}_{us-s}(xy) \ge \bar{\chi}^{2}_{us-s}$$

The potential occurrence, $P(xy)$, for each of the dipeptides in all three sets was calculated by the equation:

$$P(xy) = N^{o}(xy) / N^{e}(xy) \qquad (5)$$

Each of the above three sets of dipeptides was further classified into two subsets depending upon $P(xy)$ being significantly different from unity (>1.5 or <0.64). The corresponding potential values for dipeptides from all three sets were used to formulate the conditions given in Table II. We classified the dipeptides that satisfy each of the conditions given in Table II based on the chi-square value and the $P(xy)$ of every dipeptide for each class separately. The weight value of instability corresponding to each of the conditions was obtained by summing up $N^{o}(xy)$ for all dipeptides ($xy$) satisfying that condition. The impact factor for the $i$th condition ($\mathrm{IF}_i$) was estimated by the equation:

$$\mathrm{IF}_i = (V_{us,i} - V_{s,i}) / V_{s,i} \qquad (6)$$

where $V_{us,i}$ and $V_{s,i}$ are the normalized values of occurrence of dipeptides satisfying the $i$th condition in the unstable and stable classes of proteins, respectively. The impact factor for the $i$th condition was operated on, to bring it into a positive range, and was termed the instability weight value ($\mathrm{IWV}_i$), given by the equation:

$$\mathrm{IWV}_i = 2 + (\mathrm{IF}_i / |\mathrm{LIF}|) \qquad (7)$$

where LIF was the lowest impact factor observed.

Fig. 1. Relative frequency of occurrence of individual amino acids per 1000 amino acids in various sets of proteins (total PIR database, unstable proteins, stable proteins); amino acids are plotted in the order W C M H Y F Q N I R D P T K E V S G A L. PIR, Protein Sequence Database, is taken from Release 12.
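As a rough illustration of equations (5) to (7) as they read in the text, the following sketch computes a potential, classifies it against the two thresholds, and converts per-condition impact factors into instability weight values. All function names and the numerical inputs are hypothetical; the sketch is not a reproduction of the authors' calculation.

```python
def potential(n_observed, n_expected):
    """Equation (5): ratio of observed to expected occurrence of a dipeptide."""
    return n_observed / n_expected


def classify_potential(p, high=1.5, low=0.64):
    """Label a potential as clearly above unity, clearly below unity, or neither."""
    if p > high:
        return "over"
    if p < low:
        return "under"
    return "near-unity"


def impact_factors(v_unstable, v_stable):
    """Equation (6), applied to each condition in turn."""
    return [(vu - vs) / vs for vu, vs in zip(v_unstable, v_stable)]


def instability_weight_values(factors):
    """Equation (7) as printed: IWV_i = 2 + IF_i / |LIF|, with LIF the lowest IF."""
    lif = min(factors)  # assumed non-zero for this illustration
    return [2 + f / abs(lif) for f in factors]


# Hypothetical normalized occurrences for three conditions.
ifs = impact_factors([0.4, 0.1, 0.25], [0.2, 0.2, 0.2])
iwvs = instability_weight_values(ifs)
label = classify_potential(potential(30, 10))
```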
The contribution of each of the dipeptides towards instability was obtained by summing the instability weight values corresponding to the condition(s) satisfied by the dipeptide, and was termed the dipeptide instability weight value (DIWV). The DIWVs for all 400 combinations are presented as a matrix (GRP matrix) in Table III. The instability index (II) for a protein was then computed using the DIWV by the equation:

$$\mathrm{II} = \frac{10}{L}\sum_{i=1}^{L-1}\mathrm{DIWV}(x_{i}y_{i+1}) \qquad (8)$$

where $x_{i}y_{i+1}$ is a dipeptide, $L$ is the length of the sequence and 10 is a scaling factor. Instability indices for various proteins in the stable as well as in the unstable class are given in column 5 of Table I.

Table II. Rule-based weightages

Condition   If the dipeptide satisfies:      Instability weight value
1.          Pus > 1.50 and Ps < 1.50         +13.34
2.          Pus < 1.50 and Ps > 1.50          -1.88
3.          Pus < 0.64 and Ps > 0.64          -6.54
4.          Pus > 0.64 and Ps < 0.64         +24.68
5.          Pus-s > 1.50                     +20.26
6.          Pus-s < 0.64                      -7.49
7.          None of the above conditions      +1.0

Results and discussion

Dipeptides involved in instability of proteins

The observed and the expected frequency of occurrence of dipeptides in the sets of both stable and unstable proteins helped to identify the dipeptides which are predominant in either of these sets. The GRP matrix (Table III) indicates that, out of the 400 possible dipeptides, 81 (i.e. 20%) with DIWV > 1 may be involved in the instability of the protein, whereas 72 (i.e. 18%) with DIWV < 1 may contribute to the stability. We do not know at present how the presence (or the absence) of dipeptides such as these renders a protein stable or susceptible to degradation. However, the net effect of their combined presence in a protein appears to be reflected in its instability index, which serves as a useful indicator in deciding the stability (or instability) characteristics of the protein.
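The two steps described in this section, summing the rule-based weights into a DIWV and summing DIWVs along a sequence into the II (10/L times the sum over the L - 1 consecutive dipeptides), can be sketched as follows. The function names are assumptions; the weights are those of Table II, and the three DIWV entries are copied from Table III. A real calculation needs the complete 400-entry matrix, so the toy sequence below was chosen to use only those three dipeptides.

```python
# Weights for conditions 1-6 of Table II; condition 7 ("none of the above") is +1.0.
TABLE_II_WEIGHTS = {1: 13.34, 2: -1.88, 3: -6.54, 4: 24.68, 5: 20.26, 6: -7.49}


def diwv_from_conditions(satisfied):
    """Sum the Table II weights of the satisfied conditions; +1.0 if none apply."""
    if not satisfied:
        return 1.0
    return round(sum(TABLE_II_WEIGHTS[c] for c in satisfied), 2)


def instability_index(sequence, diwv):
    """Equation (8): (10 / L) times the sum of DIWV over consecutive dipeptides."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least one dipeptide")
    total = sum(diwv[sequence[i:i + 2]] for i in range(len(sequence) - 1))
    return 10.0 / len(sequence) * total


# Three entries copied from Table III (first residue, then second residue).
partial_diwv = {"CW": 24.68, "WR": 1.0, "RW": 58.28}

# Hypothetical four-residue sequence: dipeptides CW, WR and RW.
toy_ii = instability_index("CWRW", partial_diwv)
```

Looking up dipeptides without a default value is deliberate: a missing entry raises `KeyError` instead of silently contributing a wrong weight.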
In vivo half-life of a protein versus its instability index

We have presented the various proteins used in our analysis, along with the values of their half-lives, in Table I. It can be seen from the results that all the unstable proteins have an II > 40, whereas all the stable proteins, with the only exception of RNase A, have an II < 40. The segregation of the two classes of proteins based on the II is illustrated in Figure 2. This figure clearly brings out differences between the two classes of proteins. The relatively high value of the II (= 52.2) for RNase A puts this protein into the class of unstable proteins; however, it is known that RNase A in its native form is held together by four disulphide bridges between Cys groups. Such disulphide bridges are known to impart significant stability to the protein, making it resistant to degradation (Creighton, 1988). Therefore it appears that cross-linking between Cys groups may override the impact of dipeptides. In our approach we have not taken into account such higher-order elements, which could modify the intrinsic stability characteristics of the protein. It is also interesting to note that none of the unstable proteins which we have studied has such a high degree of cross-linking (disulphide bridges per 100 amino acids) between Cys groups.

Thus there is a direct correlation between the half-lives of various proteins and the II arrived at by the method described. The II of a protein, therefore, could be used directly to predict whether a given protein is stable or unstable based on the II being either <40 or >40 respectively, provided the sequence of the protein is known.

Table III. GRP matrix of condition-based instability values for 400 possible dipeptides (rows: first amino acid of dipeptide; columns: second amino acid of dipeptide)

       W      C      M      H      Y      F      Q      N      I      R      D      P      T      K      E      V      S      G      A      L
W    1.0    1.0  24.68  24.68    1.0    1.0    1.0  13.34    1.0    1.0    1.0    1.0 -14.03    1.0    1.0  -7.49    1.0  -9.37 -14.03  13.34
C  24.68    1.0   33.6   33.6    1.0    1.0  -6.54    1.0    1.0    1.0  20.26  20.26   33.6    1.0    1.0  -6.54    1.0    1.0    1.0  20.26
M    1.0    1.0  -1.88  58.28  24.68    1.0  -6.54    1.0    1.0  -6.54    1.0  44.94  -1.88    1.0    1.0    1.0  44.94    1.0  13.34    1.0
H  -1.88    1.0    1.0    1.0  44.94  -9.37    1.0  24.68  44.94    1.0    1.0  -1.88  -6.54  24.68    1.0    1.0    1.0  -9.37    1.0    1.0
Y  -9.37    1.0  44.94  13.34  13.34    1.0    1.0    1.0    1.0 -15.91  24.68  13.34  -7.49    1.0  -6.54    1.0    1.0  -7.49  24.68    1.0
F    1.0    1.0    1.0    1.0   33.6    1.0    1.0    1.0    1.0    1.0  13.34  20.26    1.0 -14.03    1.0    1.0    1.0    1.0    1.0    1.0
Q    1.0  -6.54    1.0    1.0  -6.54  -6.54  20.26    1.0    1.0    1.0  20.26  20.26    1.0    1.0  20.26  -6.54  44.94    1.0    1.0    1.0
N  -9.37  -1.88    1.0    1.0    1.0 -14.03  -6.54    1.0  44.94    1.0    1.0  -1.88  -7.49  24.68    1.0    1.0    1.0 -14.03    1.0    1.0
I    1.0    1.0    1.0  13.34    1.0    1.0    1.0    1.0    1.0    1.0    1.0  -1.88    1.0  -7.49  44.94  -7.49    1.0    1.0    1.0  20.26
R  58.28    1.0    1.0  20.26  -6.54    1.0  20.26  13.34    1.0  58.28    1.0  20.26    1.0    1.0    1.0    1.0  44.94  -7.49    1.0    1.0
D    1.0    1.0    1.0    1.0    1.0  -6.54    1.0    1.0    1.0  -6.54    1.0    1.0 -14.03  -7.49    1.0    1.0  20.26    1.0    1.0    1.0
P  -1.88  -6.54  -6.54    1.0    1.0  20.26  20.26    1.0    1.0  -6.54  -6.54  20.26    1.0    1.0  18.38  20.26  20.26    1.0  20.26    1.0
T -14.03    1.0    1.0    1.0    1.0  13.34  -6.54 -14.03    1.0    1.0    1.0    1.0    1.0    1.0  20.26    1.0    1.0  -7.49    1.0    1.0
K    1.0    1.0   33.6    1.0    1.0    1.0  24.68    1.0  -7.49   33.6    1.0  -6.54    1.0    1.0    1.0  -7.49    1.0  -7.49    1.0  -7.49
E -14.03  44.94    1.0  -6.54    1.0    1.0  20.26    1.0  20.26    1.0  20.26  20.26    1.0    1.0   33.6    1.0  20.26    1.0    1.0    1.0
V    1.0    1.0    1.0    1.0  -6.54    1.0    1.0    1.0    1.0    1.0 -14.03  20.26  -7.49  -1.88    1.0    1.0    1.0  -7.49    1.0    1.0
S    1.0   33.6    1.0    1.0    1.0    1.0  20.26    1.0    1.0  20.26    1.0  44.94    1.0    1.0  20.26    1.0  20.26    1.0    1.0    1.0
G  13.34    1.0    1.0    1.0  -7.49    1.0    1.0  -7.49  -7.49    1.0    1.0    1.0  -7.49  -7.49  -6.54    1.0    1.0  13.34  -7.49    1.0
A    1.0  44.94    1.0  -7.49    1.0    1.0    1.0    1.0    1.0    1.0  -7.49  20.26    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
L  24.68    1.0    1.0    1.0    1.0    1.0   33.6    1.0    1.0  20.26    1.0  20.26    1.0  -7.49    1.0    1.0    1.0    1.0    1.0    1.0
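The decision rule stated above (II < 40 predicts a stable protein, II > 40 an unstable one) can be checked against the II values listed in Table I with a short sketch such as the following. The variable and function names are assumptions; the II values are copied from Table I.

```python
THRESHOLD = 40.0  # cut-off on the instability index used in the text

# (abbreviation, II) pairs from Table I, serial nos. 1-32 (stable class).
STABLE = [
    ("ADK", 29.4), ("ADH", 17.4), ("AAT", 36.3), ("CAI", 25.1), ("CPA", 21.5),
    ("CAT", 27.5), ("CHY", 15.3), ("CIS", 20.8), ("CCC", 11.4), ("AAD", 34.4),
    ("DPA", 14.2), ("DHF", 27.4), ("ELA", 23.0), ("FER", 23.8), ("GPD", 16.1),
    ("HEM", 7.0), ("HEM", 6.1), ("ABP", 23.1), ("LDH", 25.6), ("LYS", 19.9),
    ("MYO", 9.1), ("PAR", 22.8), ("PGK", 22.2), ("PLA", 15.1), ("PYK", 21.2),
    ("RNA", 52.2), ("SOD", 7.0), ("STI", 27.4), ("SUB", 12.0), ("THI", 5.6),
    ("TPI", 19.7), ("TRY", 20.6),
]

# (name, II) pairs from Table I, serial nos. 33-44 (unstable class).
UNSTABLE = [
    ("E1A", 100.3), ("alpha-casein", 57.7), ("beta-casein", 96.5), ("c-fos", 78.8),
    ("c-myc", 92.2), ("v-myb", 74.5), ("HSP70", 44.1), ("HMG-CoA", 54.5),
    ("ODC", 52.3), ("P730", 50.4), ("TAT", 53.9), ("p53", 70.0),
]


def predict(ii):
    """Predict the stability class from the instability index."""
    return "unstable" if ii > THRESHOLD else "stable"


# Collect every protein whose predicted class disagrees with its known class.
misclassified = [name for name, ii in STABLE if predict(ii) != "stable"]
misclassified += [name for name, ii in UNSTABLE if predict(ii) != "unstable"]
```

With these values the only disagreement is RNase A (RNA), the exception discussed in the text.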
Validity of the II of a protein in determining its stability

In order to test the merits of our method in predicting the stability characteristics of proteins, we used another set of proteins which was not a part of the database used in developing the II but for which data on in vivo half-life became available. These proteins are listed in Table IV along with their IIs. It can be seen from the last two columns of Table IV that, in every case, the prediction based on the II of the protein is in complete agreement with the conclusion based on the in vivo half-life of the protein. This shows clearly that it is possible to predict the stability characteristics of a protein with reasonable success by using its II.

Fig. 2. IIs of various proteins used in the analysis. The proteins can be identified by their serial (S) nos. given in Table I. ○, proteins with S. nos. 1-32 (stable); ●, proteins with S. nos. 33-44 (unstable).

Comparison with earlier hypotheses

The results of our analysis indicate that, although there is a significant number of proteins which show overlap between the predictions made by the II described here and those governed by the PEST hypothesis, there exists a considerable number of proteins in which there is no correlation between the PEST hypothesis and the experimental observations. For example, protein P730 (phytochrome), which contains PEST regions with a negative PEST-FIND (PF) score between -2.21 and -30.47, is expected to be stable according to the PEST hypothesis. However, this protein has been shown to be unstable (Pratt et al., 1974). The II calculated by our method for this protein is 50.4, and categorizes it as an unstable protein in agreement with experimental observations.
There are a few other proteins, e.g. AAT (aspartate amino transferase), DHF (dihydrofolate reductase) and PYK (pyruvate kinase), which are reported to contain at least one PEST region with a positive PF score and therefore are expected to be unstable according to the proponents of the PEST hypothesis. Experimental results, however, are not in agreement with these expectations, as these proteins have been reported to be stable in vivo (Rogers et al., 1986). It is interesting to note that the IIs for these proteins predict them to be stable (Table I), in keeping with the experimental results.

Table IV. Prediction of the stability of a few proteins whose half-life is known

                                                                           Whether stable (S) or unstable (US)
Protein                              PIR code   Half-life (h)   II         From II   From half-life
Metallothionein (gold)               SMRT2      <2 (a)          66.82      US        US
Protein kinase C                     KJRTC2     2-8 (a)         40.29      US        US
Phosphoenol pyruvate carboxylase     QYEC       5 (b)           46.8 (c)   US        US
Glucose-6-phosphate dehydrogenase    DESHGC     15 (b)          27.18 (c)  S         S
Ornithine amino transferase          XNRTO      19 (b)          34.21      S         S
Fructose-1,6-diphosphatase           PAPGF      36 (b)          29.16 (c)  S         S
Cytochrome P450                      O4RTP2     48 (b)          35.89      S         S
Cytochrome b5                        CBRT5      55 (b)          32.86      S         S
Fatty acid synthetase                FZRTI      71 (b)          12.37      S         S
Malate dehydrogenase                 DEPGMM     96 (b)          27.8 (c)   S         S
Arginase                             WZBYR      96 (b)          28.78 (c)  S         S
Cytochrome b                         CBRT       130 (b)         32.86      S         S
Ubiquitin                            UQBY       9-40 (a)        30.33 (c)  S         S
Catalase                             CSRT       9-40 (a)        33.82      S         S
Na+/K+-ATPase                        PWSHNA     9-40 (a)        31.67 (c)  S         S
Transferrin receptor                 JXHU       9-40 (a)        22.06 (c)  S         S
Cytochrome c oxidase                 OBMS2      41-200 (a)      36.25      S         S
β-Galactosidase                      GBECE      41-200 (a)      37.62 (c)  S         S
Actin                                ATHU       41-200 (a)      36.19      S         S
c-AMP protein kinase                 OKBO1R     2-8 (a)         49.3 (c)   US        US
c-GMP protein kinase                 OKBOG      2-8 (a)         41.4 (c)   US        US
Calmodulin                           MCHU       9-40 (a)        27.53      S         S

(a) Rechsteiner et al. (1987).
(b) Dice and Goldberg (1975).
(c) The II was calculated from the available sequence, as the source of the original protein was not mentioned in the cited reference.
These examples indicate that the instability of a protein is governed by factors other than those suggested by the PEST hypothesis. When we examined PF scores for the PEST regions for the set of proteins given in Table IV, it was observed that predictions based on the PEST hypothesis and the II did not tally in two cases. Sodium potassium ATPase (PWSHNA) has three PEST regions with positive PF scores of 6.93, 5.78 and 5.05 and therefore should be unstable. However, the prediction based on the II suggests that the protein is stable, in accordance with the experimental observation (Rechsteiner et al., 1987). Metallothionein (SMRT2) completely lacks PEST regions with a positive PF score. But this protein has an II which predicts that the protein is unstable; the conclusion is again in agreement with experimental observation (Rechsteiner et al., 1987). The II therefore appears to pick up some features in the sequence which are overlooked by the PEST hypothesis.

It may be pointed out that, out of the 81 dipeptides significant for the unstable class (DIWV > 1 in Table III), only 39 contain either P, E, S or T. More than half of the dipeptides are exclusive of any of these amino acids. These facts establish the significance of dipeptides devoid of P, E, S or T. In addition, it is interesting to note that there are 25 dipeptides containing either P, E, S or T which have DIWV < 0, indicating that they in fact contribute significantly to the stability of the protein, contrary to what the proponents of the PEST hypothesis suggest.

It is possible to carry out rigorous experimental checks by selective clipping of residues which contribute significantly to the high PF score, and following the fate of the residual protein in terms of stability. Rogers et al. (1986) have reported that β-casein at its N-terminal end (residues 1-25) has only one PEST region with a positive PF score.
The clipping of this region should lead to a residual protein which should be stable; however, the II suggests that the residual protein will remain unstable. Similarly, these workers have reported another protein, v-myb, where residues 1-16 at the N-terminal end contain a PEST region with a positive PF score (1.85). As proposed by them, removal of this region should make the protein stable. The II for the residual v-myb, however, suggests that it would be unstable. Experimental checks such as these will help in arriving at discrete features which play a crucial role in conferring stability on the protein.

Another hypothesis, proposed by Bachmair et al. (1986), based on studies on β-galactosidase and termed the N-end rule, suggests that a specific amino acid at the N-terminus can serve as one of the determinants of the in vivo half-life of a protein. As the N-end rule is derived from studies on a single protein, one cannot be certain about its applicability to proteins in general. In addition, it has already been suggested that the N-end rule may be primarily cotranslational (Rechsteiner et al., 1987), and therefore has limited use in determining the stability characteristics of a protein in general. We have, therefore, not exhaustively compared our data with those given by the N-end rule. However, preliminary examination of such data indicates that predictions based on the N-end rule are not in any way better than those based on the PEST hypothesis.

Analysis was carried out for all the proteins whose sequences are known from the SWISS-PROT Protein Sequence Database (Release 13) in order to pick up proteins that have very low II values (<10). Similarly, proteins that have very high II values (>90) were also picked up. For the reasons discussed earlier, proteins that have a cysteine content higher than RNase A were eliminated. Out of these, unique proteins showing extremely high as well as low II values are listed in Table V.
These proteins would prove to be good candidates for experimentally validating the predictions of their instability based on the II.

The results allow us to make certain plausible conjectures about point mutations altering the stability of a protein. The GRP matrix consists of certain dipeptides with high positive as well as negative values. Point mutations in an unstable protein may lead to a change from one dipeptide with a high positive value to another with a negative value, consequently making a protein stable. A single residue change in a protein can affect the contribution of instability weight values of two dipeptides, which arise due to residues on either side of it. Therefore all possible tripeptides were first generated where the dipeptides comprising them had DIWV values either >1 or <0 (refer to Table III). The difference in the II value ($\Delta\mathrm{II}_{tri}$) for a pair of tripeptides 'axb' and 'ayb', obtained by changing the central residue and keeping the neighbouring residues the same, was estimated by the relation:

$$\Delta\mathrm{II}_{tri} = [\mathrm{DIWV}(ax) + \mathrm{DIWV}(xb)] - [\mathrm{DIWV}(ay) + \mathrm{DIWV}(yb)] \qquad (9)$$

We have analysed the data on substitutions in the central position of a tripeptide along with their associated $\Delta\mathrm{II}_{tri}$ values, and listed only those amino acid replacements (Table VI) where the $\Delta\mathrm{II}_{tri}$ value changes significantly (>75). The formula that can be used for estimating the change in the II of a protein would be:

$$\Delta\mathrm{II}_{protein} = (10/L) \times \Delta\mathrm{II}_{tri} \qquad (10)$$

where $\Delta\mathrm{II}_{protein}$ is the difference in the II of a protein of length $L$ before and after the replacement of residue x by residue y.

Therefore, depending upon the length of the protein and the $\Delta\mathrm{II}_{protein}$ value, one can choose effective substitutions from the $\Delta\mathrm{II}_{tri}$ values given in Table VI, so that the II of the protein assumes a value <40 and the protein hence becomes stable. As $\Delta\mathrm{II}_{protein}$ is inversely proportional to the length of the protein, a single substitution may not be sufficient to change the II to the expected level; hence, more than one substitution may be necessary. It should also be kept in mind that some of the substitutions which suggest improvement in the stability may not be desirable in terms of the functional requirements of the protein. Only those substitutions which would enhance the stability of a protein without affecting its function would be useful.

Table V. Unique proteins with extreme values of IIs

Protein name                                              SWISS-PROT code
(A) Proteins with II > 90 and length > 50
Amelogenin                                                A JVI CXJJ D U V UN
Calspermin                                                CALSSRAT
Beta casein precursor                                     CASBSBOVIN
Core antigen                                              CORASHPBV4
Contiguous repeat polypeptide (CRP) precursor             CRPPSRAT
Early E1A 11 kd protein                                   E111SADEM1
Early E1A 32 kd and 26 kd proteins                        E1ASADEN2
Early E1A 6 kd protein                                    EIA6SADEN5
EBNA-2 nuclear protein                                    EBN2SEBV
Extensin precursor                                        EXTNSDAUCA
Alpha/beta-gliadin precursor (prolamin)                   GDAOSWHEAT
Gamma/gliadin precursor                                   GDB2SWHEAT
Low mol. wt glutenin precursor                            GLTASWHEAT
B1-Hordein                                                HOR1SH0RVU
Sperm histone (protamine)                                 HSPSCOTJA
Involucrin                                                INVOSLEMCA
Myc proto-oncogene protein                                MYCSFELCA
Sperm-specific protein PHI-0                              PHI0SHOLTU
Salivary proline-rich protein precursor                   PRPlSHUMAN
Rev protein anti-repression transactivator                REVSHIV13
Ribosomal protein S18                                     RS18SMARP0
40S Ribosomal protein S26                                 RS264DROME
Spermatid-specific protein                                SSS1SSCYCA
Transition protein 2                                      STP2SMOUSE
Female-specific transformer protein                       TRSF4DROME
Tegument phosphoprotein                                   USO9SHSV11
11 kd protein                                             V11KSPRV
Probable E3 protein                                       VE3SBPV2
(B) Proteins with II < 10 and length > 50
Acylphosphatase, erythrocyte                              ACYESHUMAN
Antifreeze protein IIA7 precursor                         ANPXSSEAM
ATP synthase lipid-binding protein                        ATPHSSOYBN
Azurin iso-1                                              AZU1SMETJ
Cytochrome C556                                           C556 SAGRTA
Cytochrome C                                              CYCSANAPL
DNA-binding protein (7 kd)                                DN7ASSULAC
Fatty-acid-binding protein                                FABHSRAT
Type-1 fimbrial protein                                   FM1ASECOLI
Nitrogen regulatory protein PII                           GLNBSECOLI
Haemoglobin alpha-A chain                                 HBASAEGHO
Hisactophilin                                             HIAPSDICDI
Small histidine-alanine-rich protein precursor            HRPSPLAFF
Histidine-rich-protein precursor                          HRP1SPLAFA
Acrosin inhibitor II                                      IAC2SBOVIN
Insulin                                                   INSSBALBO
Light-harvesting protein, alpha chain (LH-2)              LHA2SRHOCA
Mammary-derived growth inhibitor                          MDGISBOVIN
Retinoic-acid-induced differentiation factor              MK1SMOUSE
Major outer membrane lipoprotein precursor                MULISERWAM
Myoglobin                                                 MYGSHORSE
Outer membrane protein precursor (porin)                  OMPCSSALTY
Outer mitochondrial membrane protein (porin)              PORISNUECR
Profilin                                                  PROFSACACA
Retinol-binding protein II, cellular (CRBP-II)            RET2SRAT
50S Ribosomal protein L1                                  RL1SBACST
30S Ribosomal protein S14                                 RS14SMYCCA
S-100 protein, alpha chain                                S10ASBOVIN
S-Antigen protein precursor                               SANTSPLAFW
Probable early protein GP5                                VG5SBPPH2
Gene 5 protein                                            VG5SSPV4
Early protein GP5B                                        VG5BSBPPZA
Gene 15 protein                                           VI5SVACCV
Lysis protein                                             VLYSSBPT7

Table VI. Substitutions of central amino acid in a tripeptide (axb) likely to alter ΔII by >75, as a result of the replacement of x by y (values of ΔII are given in brackets)

The table is printed in two halves, each with columns a, x-y and b. The row alignment between the columns is not preserved here, so the entries of each column are listed in the order in which they appear.

First half, column a:
Trp Cys Met His Tyr Gln

First half, column x-y:
Met-Ala(105) His-Val(84) His-Gly(87) His-Thr(77) His-Gly(87) Asn-Gly(75) His-Gln(92) His-Val(92) His-Met(80) His-Gln(116) His-Arg(116) His-Arg(75) His-Thr(99) Pro-Gln(78) Pro-Gln(78) Ser-Gln(92) Ser-Gln(75) Ser-Arg(75) Tyr-Pro(98) Tyr-Gly(75) Tyr-Pro(78) Tyr-Trp(86) Tyr-Gly(87) Asn-Gly(87) Ile-Thr(75) Ile-Gly(106) Met-Trp(87) Met-Tyr(75) Met-Arg(99) Met-Glu(116) Met-Ala(86) Met-Arg(92) Met-Gly(84) Met-His(78) Met-Arg(86) Met-Glu(75) Met-Glu(75) Met-Trp(81) His-Arg(80) Ser-Cys(78) Ser-Tyr(87) Ser-Cys(75) Ser-Tyr(83) Ser-Phe(75) Ser-Val(75) Ser-Tyr(78)

First half, column b, followed by second half, column a (run together):
His Tyr Asn Tyr Asn Ile Ile Ile Tyr Tyr Tyr Arg Tyr Tyr Asn Asn Phe Val Cys Pro Pro Met Try Asp Ala Ala Ile Glu Glu Asp His His Thr His His Lys His Tyr Tyr Pro Pro Pro Ser Glu Ser Tyr Gln Ala Arg Pro Pro Pro Pro Glu

Second half, column x-y:
Ile-Gln(75) Ile-Thr(77) Ile-Gly(111) Glu-Pro(98) Glu-Lys(80) Glu-Val(87) Glu-Lys(78) Trp-Try(75) Trp-Gly(87) His-Gly(80) His-Gly(80) Arg-His(99) Arg-Tyr(133) Arg-Asn(113) Arg-Pro(99) Arg-Gly(111) Arg-Gly(87) Arg-Tyr(139) Arg-Pro(103) Ser-Tyr(87) Ser-Tyr(83) Ser-Asn(78) Ser-Tyr(78) Ser-Gly(78) Arg-Thr(80) Ser-Lys(78) Glu-Gln(78) Glu-Asn(81) Met-Ile(86) Met-Ile(87) Arg-Pro(99) Arg-Gly(86) Arg-Leu(75) Arg-Pro(105) Arg-Leu(78) Cys-His(78) Cys-Trp(107) Cys-His(92) Cys-His(78) Cys-His(93) Cys-Asp(101)

Second half, column b:
Glu Glu Glu Cys Ile Asp Pro His Asn Tyr Ile Trp Trp Trp Trp Trp Asn Arg Arg Arg Pro Pro Glu Glu Trp Pro Cys Cys His Pro Trp Trp Trp Arg Arg Trp Thr Thr Trp Thr Thr
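The effect of a single central substitution in a tripeptide, and its scaled effect on a whole protein, can be sketched as below. The sketch takes the tripeptide difference to be the DIWV sum for 'axb' minus the DIWV sum for 'ayb', scaled by 10/L for a protein of length L. The function names and the protein length of 200 are assumptions; the four DIWV values are read from Table III for the tripeptides Trp-Met-His and Trp-Ala-His, chosen here only as an illustration.

```python
def delta_ii_tri(diwv, a, x, y, b):
    """Change in summed DIWV when the central residue x of 'axb' is replaced by y."""
    before = diwv[a + x] + diwv[x + b]
    after = diwv[a + y] + diwv[y + b]
    return before - after


def delta_ii_protein(delta_tri, length):
    """Scale the tripeptide difference by 10 / L to get the change in a protein's II."""
    return 10.0 / length * delta_tri


# DIWV entries from Table III needed for W-M-H versus W-A-H.
diwv = {"WM": 24.68, "MH": 58.28, "WA": -14.03, "AH": -7.49}

d_tri = delta_ii_tri(diwv, "W", "M", "A", "H")
d_protein = delta_ii_protein(d_tri, length=200)  # hypothetical 200-residue protein
```

For this example the tripeptide difference is about 104.5, which appears consistent with the Met-Ala(105) entry of Table VI; for a 200-residue protein this corresponds to a change of about 5.2 in the II, which illustrates why several substitutions may be needed to move a long protein below 40.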
Conclusions

Our analysis and the method developed for predicting the in vivo stability of a protein suggest that the primary determinants of the stability of a protein probably reside in its primary structure and are an intrinsic property of the protein. There appears to be a correlation between the sensitivity of a protein to in vivo degradation and the presence of certain dipeptides in it. The overall influence of such dipeptides appears to contribute to the instability or stability characteristics of proteins. Apart from these sequence-dependent characteristics, several factors, e.g. structure-dependent features, the presence of disulphide bridges, ligand binding, protease recognition mechanisms, etc., are known to determine in vivo protein stability. Therefore the stability of a protein as manifested in vivo could be a net effect of the contributions made by several such factors. Our work indicates that sequence-specific elements are one of the significant factors which play an important role in determining the stability of a protein. The observations presented here might help open new vistas, where knowledge about dipeptides having specific characteristics could be used in the modification of existing proteins or in the design of novel protein molecules of a desired stability.

References

Bachmair,A., Finley,D. and Varshavsky,A. (1986) Science, 234, 179-186.
Creighton,T.E. (1988) BioEssays, 8, 57-63.
Dice,J.F. and Goldberg,A.L. (1975) Arch. Biochem. Biophys., 170, 213-219.
Goldberg,A.L. and St John,A.C. (1976) Annu. Rev. Biochem., 45, 747-803.
Pratt,L.H., Kidd,G. and Coleman,R. (1974) Biochim. Biophys. Acta, 365, 93-107.
Rechsteiner,M., Rogers,S. and Rote,K. (1987) Trends Biochem. Sci., 12, 390-394.
Rogers,S., Wells,R. and Rechsteiner,M. (1986) Science, 234, 364-368.

Received on May 10, 1990; accepted on August 30, 1990