Journal of Proteome Research: research articles

Prediction of Lipoprotein Signal Peptides in Gram-Positive Bacteria with a Hidden Markov Model

Pantelis G. Bagos,*,†,‡ Konstantinos D. Tsirigos,† Theodore D. Liakopoulos,† and Stavros J. Hamodrakas†

Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, Athens 15701, Greece, and Department of Informatics with Applications in Biomedicine, School of Applied Sciences, Lamia 35100, Greece

Received March 1, 2008

We present a Hidden Markov Model method for the prediction of lipoprotein signal peptides of Gram-positive bacteria, trained on a set of 67 experimentally verified lipoproteins. The method outperforms LipoP and the methods based on regular expression patterns, in various data sets containing experimentally characterized lipoproteins, secretory proteins, proteins with an N-terminal TM segment and cytoplasmic proteins. The method is also very sensitive and specific in the detection of secretory signal peptides and in terms of overall accuracy outperforms even SignalP, which is the top-scoring method for the prediction of signal peptides. PRED-LIPO is freely available at http://bioinformatics.biol.uoa.gr/PRED-LIPO/, and we anticipate that it will be a valuable tool for the experimentalists studying secreted proteins and lipoproteins from Gram-positive bacteria.

Keywords: lipoproteins • signal peptide • hidden markov model • prediction • bacteria

Introduction

Signal peptides1 in Bacteria are mainly divided into the secretory signal peptides that are cleaved by Signal Peptidase I (SPase I),2,3 and those cleaved by Signal Peptidase II (SPase II or Lsp),4 which characterize the membrane-bound lipoproteins.
The secretory signal peptides have been extensively studied for years, revealing a structure comprising a short, positively charged N-region, a hydrophobic H-region that spans the membrane, a C-region of mostly small and uncharged residues, and a cleavage site (known as the A-X-A motif, in which A stands for alanine and X for any amino acid) that is recognized by the peptidase that cleaves the peptide and releases the mature protein.5-7 The signal peptide of bacterial lipoproteins possesses a similar structure,8 the main differences being its comparatively shorter length and the unique pattern in the C-region (commonly denoted by [LVI]-[AST]-[GA]-C and termed the "lipobox") that is recognized for cleavage by SPase II.4 The cysteine in the last position of this pattern is indispensable in both Gram-positive and Gram-negative bacteria, and is necessary for membrane anchoring. The post-translational lipid modification involves three enzymes that act sequentially: the prolipoprotein diacylglyceryl transferase (Lgt), which transfers a diacylglyceride to the cysteine sulfhydryl group; the signal peptidase II (SPase II or Lsp), which cleaves the signal peptide at the residue before the cysteine, forming an apolipoprotein; and the apolipoprotein N-acyltransferase (Lnt), which acylates the α-amino group of the apolipoprotein N-terminal cysteine, forming the mature lipoprotein.9,10

The proteins carrying a secretory signal peptide can be directed to the membrane through the action of the Sec translocase,11,12 although another major pathway has been discovered, utilizing the Twin-Arginine (TAT) translocase, which recognizes (generally longer) signal peptides that carry a distinctive pattern of two consecutive arginines (R-R) in the N-region.13-15 Translocation of lipoproteins through the TAT pathway has been postulated based on sequence analysis,16 but has only recently been proven for Bacteria (Desulfovibrio vulgaris17) and Archaea (Haloferax volcanii18).

The discovery of globomycin, a specific inhibitor of SPase II, represented a major breakthrough in the biochemical studies of lipoprotein maturation.19,20 Bacteria treated with globomycin, as well as SPase II deficient strains, show accumulation of lipid-modified prolipoproteins.21 Nevertheless, extensive studies in SPase II deficient strains showed that the absence of SPase II results in rather pleiotropic effects on the composition of the extracellular proteome, since some prolipoproteins were released into the medium, whereas the synthesis of others was strongly reduced.22,23 Conversely, only in the case of Lgt deficient strains are significantly more lipoproteins observed in the growth medium.22,24 The strongest proof that a protein is a lipoprotein, however, would be labeling with [3H] or [14C] palmitate in the presence and absence of globomycin (or in wild-type and SPase II or Lgt deficient strains), combined with immunoblotting, immunoprecipitation, protein fractionation and protease accessibility assays to investigate its extracellular localization.25,26

* To whom correspondence should be addressed. E-mail: pbagos@ucg.gr, pbagos@biol.uoa.gr.
† University of Athens.
‡ University of Central Greece.

Journal of Proteome Research 2008, 7, 5082-5093. Published on Web 11/01/2008. 10.1021/pr800162c CCC: $40.75 © 2008 American Chemical Society

Figure 1. The topology of the full model with the four branches (submodels) corresponding to the secreted signal peptides, lipoprotein signal peptides, N-terminal TM segments and the cytoplasmic domain.

Computational prediction of secretory signal peptides was performed initially using weight matrices.27 However, the Neural Networks introduced by the SignalP method,28,29 as well as Hidden Markov Models (HMM),30 have been proven to be the most successful methods currently available.31 Recently, the SignalP method was upgraded, mainly due to better annotation and selection of the training set, yielding even better accuracy,32 whereas TatP has been presented as offering the most accurate classification of TAT signal peptides.33 A different approach has been followed in the Phobius method,34,35 where a HMM was used to predict simultaneously the presence of a secretory signal peptide and the TM topology of a given protein. Following this approach, the authors showed that they can minimize the number of signal peptides predicted as transmembrane (TM) segments and vice versa. Concerning lipoproteins, for years, regular expression patterns based on the von Heijne rule8 were used, with various modifications.26,36-38 Recently, a method called LipoP was developed, which is based on HMMs and was trained exclusively on lipoproteins from Gram-negative bacteria.39 LipoP performs not only lipoprotein signal peptide prediction but also discrimination from secretory signal peptides, N-terminal TM helices and cytoplasmic proteins. LipoP has been reported to accurately classify ~97% of Gram-negative bacteria lipoproteins with an error rate (on nonlipoproteins) of approximately 0.3%.
When used, however, on lipoproteins from Gram-positive bacteria, the sensitivity of the prediction dropped to 90-92%.39

In this work, we present a HMM-based method for performing the same task that is trained exclusively on experimentally verified lipoproteins from Gram-positive bacteria. We performed an extensive literature search in order to overcome the problem of limited experimental verification and annotation of Gram-positive bacteria lipoproteins in public databases, and in this way, we collected a data set of 67 such lipoproteins, the largest such set compiled so far. By analyzing these sequences, we show that they possess slightly different characteristics compared to their Gram-negative bacteria counterparts, which justifies our approach of constructing a different predictor. The method also discriminates very accurately secretory signal peptides (and predicts their cleavage site) as well as N-terminal TM anchored proteins. We show that the method developed here (PRED-LIPO) outperforms LipoP when applied to lipoproteins from Gram-positive bacteria, and we validate it on a number of different data sets. We also show that the module that predicts secretory signal peptides is very accurate and compares favorably even to the currently top-scoring method, SignalP. Thus, the method developed here (http://bioinformatics.biol.uoa.gr/PRED-LIPO/) can be used also as a general predictor for signal peptides of Gram-positive bacteria, and we anticipate that it will be useful in proteomics applications and in large-scale genome analyses.

Figure 2. The model corresponding to the lipoprotein signal peptides. States in the n- and h-region that share the same emission probabilities are depicted with the same color. The cleavage site is presented with a dashed vertical line between G and C. Allowed transitions are depicted with arrows.

Materials and Methods

The Hidden Markov Model.
The Hidden Markov Model (HMM) that we used is quite similar to the one proposed by LipoP. It consists of four different submodels (Figure 1): the Lipoprotein submodel, corresponding to the signal peptides cleaved by SPase II; the Signal Peptide submodel, corresponding to the secretory signal peptides cleaved by SPase I; the N-terminal TM submodel, corresponding to the N-terminal TM domain; and a globular submodel used to model the globular N-terminal domains of cytoplasmic or membrane proteins. The Lipoprotein submodel (Figure 2) was especially designed to capture the sequence features of Gram-positive bacteria lipoproteins. It contains states modeling the N-terminal n-region, the hydrophobic h-region and the lipobox (lipoprotein c-region). We used the same emission probabilities for the states in each region (with the exception of the lipobox) in order to avoid overfitting, and the allowed transitions were set in order to model as closely as possible the sequence features of the known lipoproteins. The secretory signal peptide model (Figure 3) is very similar to the lipoprotein model, except that it makes provision for longer n-regions, c-regions of variable length and different patterns of amino acids at the cleavage site. The TM submodel is identical to the one used by the HMM-TM predictor for α-helical membrane proteins,40 whereas the globular submodel consists simply of a self-transitioning state.

Figure 3. The model corresponding to the secretory signal peptides. States in the n- and h-region that share the same emission probabilities are depicted with the same color. The cleavage site is presented with a dashed vertical line between A and 1 (first amino acid of the mature protein). Allowed transitions are depicted with arrows.
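The sharing of emission probabilities within a region can be illustrated with a small sketch. The state names and region sizes below are hypothetical and do not reproduce the published architecture; the only points taken from the text are that all states of the n-region (or h-region) use one emission distribution while each lipobox position keeps its own. Each distribution over the 20 amino acids has 19 free parameters, because its probabilities must sum to one.

```python
# Toy illustration of emission tying; the region sizes are invented for the example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_tying(n_states: int, h_states: int, lipobox_len: int) -> dict:
    """Map each state name to the name of the emission distribution it uses."""
    tying = {}
    for i in range(n_states):
        tying[f"n{i}"] = "n-region"      # every n-region state shares one distribution
    for i in range(h_states):
        tying[f"h{i}"] = "h-region"      # every h-region state shares one distribution
    for i in range(lipobox_len):
        tying[f"box{i}"] = f"box{i}"     # lipobox positions stay position-specific
    return tying


def free_emission_parameters(tying: dict) -> int:
    """Count free emission parameters: one (20 - 1)-vector per distinct distribution."""
    distinct = set(tying.values())
    return len(distinct) * (len(AMINO_ACIDS) - 1)


tying = build_tying(n_states=6, h_states=12, lipobox_len=4)
untied = len(tying) * (len(AMINO_ACIDS) - 1)
print(len(tying), "states:", free_emission_parameters(tying), "tied vs", untied, "untied parameters")
```

In this toy layout, 22 states need 6 × 19 = 114 free emission parameters with tying instead of 22 × 19 = 418 without it, which is the kind of reduction that limits overfitting on a small training set.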
The total number of the model's states is 134 (including start and end states), with 227 freely estimated transitions. The total number of freely estimated emission probabilities is 589 (31 × 19), yielding a total number of freely estimated parameters equal to 816. The model was trained using the Baum-Welch algorithm for labeled sequences41 and the decoding was performed using the standard Viterbi algorithm,42 although more advanced techniques such as the Posterior-Viterbi decoding43 and the Optimal Accuracy Posterior Decoder44 yield nearly identical results. In this respect, the model introduced here differs from LipoP, which uses the Forward decoding algorithm for choosing between the various submodels. In addition to the Viterbi decoding, which produces the optimal path of states through the model and hence predicts simultaneously the type of the sequence as well as the cleavage site (if any), we also report the SI reliability index,45 which takes values in the range 0 to 1 and is a measure of the reliability of the prediction, useful in many situations. The reported results correspond to a 33-fold cross-validation procedure, where each set consists of 11 proteins with an equally balanced number of SPase I cleaved signal peptides (3 or 4), SPase II cleaved signal peptides (2 or 3), TM (1 or 2) and globular proteins (3 or 4). The training procedure consists of removing 1 of the 33 subsets from the training set, training the model with the remaining proteins and performing the test on the proteins of the set that was removed. This process is repeated in turn for all subsets in the training set, and the final prediction accuracy summarizes the outcome of all independent tests.
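The composition of the cross-validation subsets described above can be reproduced with a simple dealing scheme. The sketch below is an assumption about how such balanced subsets could be built, not the procedure actually used: proteins are listed class by class and dealt to the 33 subsets in turn. The class sizes are those of the training set (127 SPase I, 67 SPase II, 58 TM and 111 globular proteins); the protein identifiers are placeholders.

```python
from collections import Counter

N_FOLDS = 33
# Class sizes of the training set as reported in the paper.
CLASS_SIZES = {"SPaseI": 127, "SPaseII": 67, "TM": 58, "globular": 111}


def make_folds(class_sizes: dict, n_folds: int) -> list:
    """Deal proteins to folds round-robin, continuing the deal across classes."""
    folds = [[] for _ in range(n_folds)]
    i = 0
    for label, size in class_sizes.items():
        for k in range(size):
            # Placeholder identifier; a real run would carry the sequence here.
            folds[i % n_folds].append((label, f"{label}_{k}"))
            i += 1
    return folds


def train_test_splits(folds: list):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for held_out, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != held_out for p in fold]
        yield train, test


folds = make_folds(CLASS_SIZES, N_FOLDS)
for fold in folds[:3]:
    print(len(fold), dict(Counter(label for label, _ in fold)))
```

Because the deal continues from one class to the next, every subset receives exactly 11 proteins, and each class contributes either the floor or the ceiling of its size divided by 33, which matches the counts quoted in the text.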
For measures of accuracy in each binary classification problem (lipoproteins vs nonlipoproteins, signal peptides vs nonsignal peptides), we used the percentage of correctly classified positive examples (sensitivity), the percentage of correctly classified negative examples (specificity) and the Matthews Correlation Coefficient, which summarizes in a single measure the True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN).46 For estimating the rate of correct predictions in the genome analysis, the Positive Predictive Value (PPV) and the Negative Predictive Value (NPV) are also of particular importance; they are defined, respectively, as the percentage of true positives among the positive predictions, TP/(TP + FP), and the percentage of true negatives among the negative predictions, TN/(TN + FN). The method is available online at http://bioinformatics.biol.uoa.gr/PRED-LIPO/.

For comparison purposes, we also created two profile Hidden Markov Models (pHMMs) using the HMMER 2.3.2 package.47 The pHMM is a special case of Hidden Markov Model and can also be seen as an extension of sequence profiles. It uses a HMM to model in a statistical manner a multiple alignment of related sequences. The pHMM, in contrast to the simple HMM described above, uses position-specific parameters (emission and transition probabilities) and in general has a larger number of freely estimable parameters. It is well-suited for modeling protein families, but we used it here for comparison, since it has been shown that, under certain circumstances, it can also be used to model the sequence features of signal peptides.48,49 We created the multiple alignments as advised previously,48,49 built the pHMMs using the hmmbuild command of the HMMER package, and performed searches using the hmmpfam command of the same package.

Data Sets.
The data set that we used for training contained 67 experimentally verified lipoproteins from Gram-positive bacteria, 127 secreted proteins from Gram-positive bacteria containing a signal peptide cleaved by SPase I, 111 cytoplasmic proteins from Gram-positive bacteria and 58 Gram-positive bacterial sequences with an N-terminal TM segment that have their N-terminus located on the cytoplasmic side of the membrane. The 67 experimentally verified lipoproteins (Table 1) include the 33 verified lipoproteins previously reported26 that were already used for the construction of the G+LPP regular expression pattern. One of these sequences (MBL of Streptococcus equi) could not be retrieved from UniProt and we identified it using a BLAST50 search against the genome sequence at http://www.sanger.ac.uk/Projects/S_equi/. In addition to these, and given the low quality of the annotation in public databases regarding the experimental verification of Gram-positive bacteria lipoproteins, we performed an extensive literature search to identify additional such proteins. In total, we identified 34 additional proteins from various species of Gram-positive bacteria (Mycoplasma, Mycobacterium, Corynebacterium, Spiroplasma, Streptomyces, Streptococcus, Staphylococcus, Bacillus), which are listed in Table 1 along with the original references. Interestingly, in one of these proteins (CseA), the start codon reported in UniProt from previous publications was misassigned,51 and the error was reported to the UniProt database.52 The identified papers provided results of varying degrees of reliability.
The majority of the identified papers used chemical labeling of cysteine coupled with verification of the extracellular localization by subcellular fractionation and/or immunoblotting.53-59 Others used site-directed mutagenesis in the lipobox region,51,60 others relied only on the results obtained by treatment with globomycin coupled with subcellular localization techniques,61,62 and one proteomic study used Lgt deficient strains.24 Finally, several studies were included based only on indirect evidence63-66 in order to obtain as large a training set as possible.

Table 1. The training set of 67 experimentally verified lipoproteins used in this study. We list the UniProt AC,52 the organism and the reference.

The 34 newly identified lipoproteins

UniProt AC      Organism                        Ref
Q8KVR9          Mycoplasma mycoides             54
O05121          Mycoplasma gallisepticum        58
Q50327          Mycoplasma pneumoniae           51
P55801          Mycoplasma mycoides             66
P29230          Mycoplasma hyorhinis            59
Q9X775          Mycoplasma agalactiae           64
P0A671          Mycobacterium bovis             60
P21625          Spiroplasma melliferum          55,56
Q46023          Corynebacterium diphtheriae     61
O05471          Streptococcus equisimilis       62
Q70UQ6          Streptococcus uberis            65
Q9ZEP5          Streptomyces coelicolor         51
Q99VY4          Staphylococcus aureus           90
Q8VQS9          Staphylococcus aureus           90
Q5HET4          Staphylococcus aureus           53
Q2FI86          Staphylococcus aureus           53
Q7A603          Staphylococcus aureus           53
Q99U04          Staphylococcus aureus           53
Q600S6          Mycoplasma hyopneumoniae        63
Q5ZZQ6          Mycoplasma hyopneumoniae        63
P40409          Bacillus subtilis               24
P37580          Bacillus subtilis               24
O34385          Bacillus subtilis               24
O34335          Bacillus subtilis               24
P24141          Bacillus subtilis               24
P36949          Bacillus subtilis               24
O34966          Bacillus subtilis               24
O05497          Bacillus subtilis               24
O31567          Bacillus subtilis               24
O34348          Bacillus subtilis               24
P54535          Bacillus subtilis               24
O05410          Bacillus subtilis               24
O32167          Bacillus subtilis               24
P54941          Bacillus subtilis               24

Original set of 33 lipoproteins from ref 26

UniProt AC      Organism
MBL (SEQ1660)   Streptococcus equi
Q9RHZ6          Alicyclobacillus acidocaldarius
P06548          Bacillus cereus
P00808          Bacillus licheniformis
Q56247          Bacillus PS3
P34957          Bacillus subtilis
P24327          Bacillus subtilis
P46922          Bacillus subtilis
Q08429          Bacillus subtilis
P24011          Bacillus subtilis
P46338          Bacillus subtilis
Q93HZ4          Corynebacterium glutamicum
O69087          Heliobacterium gestii
P14308          Lactococcus lactis
Q03490          Mycobacterium intracellulare
Q10790          Mycobacterium tuberculosis
P11572          Mycobacterium tuberculosis
P15712          Mycobacterium tuberculosis
P96278          Mycobacterium tuberculosis
P00807          Staphylococcus aureus
Q9ZIN7          Staphylococcus carnosus
Q7CCL6          Staphylococcus epidermidis (strain ATCC 12228)
Q9Z692          Streptococcus equi
O05471          Streptococcus equisimilis
P31306          Streptococcus gordonii Challis
Q00749          Streptococcus mutans
P18791          Streptococcus pneumoniae
P97008          Streptococcus pneumoniae
Q51933          Streptococcus pneumoniae
Q99Y38          Streptococcus pyogenes
Q53919          Streptomyces chrysomallus
Q9X9R7          Streptomyces reticuli
O68456          Thermoanaerobacter ethanolicus

The 127 secreted proteins containing a SPase I cleaved signal peptide were retrieved from the set used for training the SignalP method,30 taking into consideration the corrections made later concerning wrongly annotated cleavage sites, initial methionines and false annotations.32 We did not try to eliminate proteins translocated through the Twin-Arginine Translocation (TAT) machinery,13,14 in either the secretory or the lipoprotein set. The 111 proteins not containing either a signal peptide or an α-helical TM segment within the first 70 amino acids were collected from the well-curated data set of Menne et al.,31 which was used to check the accuracy of signal peptide predictors. Putative TM proteins in this set were removed using TMHMM.67 Finally, in order to model the N-terminal transmembrane (TM) domains, we scrutinized various well-annotated data sets68-71 in order to compile a nonredundant set of transmembrane proteins from Gram-positive bacteria with experimentally verified topology.
The final set consisted of 22 such transmembrane proteins, and from these, we extracted the TM segments with orientation from the cytoplasm to the extracellular space (In → Out), in a procedure similar to the one followed in the development of LipoP39 and CW-PRED.72 Thus, if a particular TM segment was localized in a 60-residue long window not overlapping with another TM segment, it was included in the set. In the case of closely packed TM segments from multispanning TM proteins, we included only the upstream and downstream regions corresponding to half of the proximal loop (extracellular or cytoplasmic).

To have an independent test set to further evaluate the method and compare it against the other available ones, we searched the recent literature once again. Experimentally verified lipoproteins from Gram-positive bacteria are, as we discussed earlier, very difficult to find. We have found, however, several proteomic analyses22,23,73,74 in which dozens of proteins were identified as potential lipoproteins. In this way, and also by searching the 278 experimentally verified bacterial lipoproteins from DOLOP,36,37 a total of 117 lipoproteins were collected. From UniProt, following Menne and coworkers,31 we collected proteins from Gram-positive bacteria having an experimentally verified signal peptide, and after removing proteins that are already present in the set of SignalP (that we used for training), we were left with 89 proteins. The proteins carrying a secretory or a lipoprotein signal peptide were submitted to redundancy reduction following the procedures used in the SignalP papers28,29 (18 identical residues in the first 40 residues of the sequence). This reduction was extended further to the proteins of the training set in order to have a truly objective evaluation of the accuracy of the method. Finally, 110 sequences remained in the set of lipoproteins and 80 sequences remained in the set of secretory signal peptides.
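The redundancy criterion quoted above (18 identical residues in the first 40 residues) can be sketched as a greedy filter. This is a simplified illustration under stated assumptions, not the procedure of the cited SignalP papers: identities are counted position by position without any alignment, reaching the threshold of 18 is treated as redundant, and sequences are processed in input order. The function names and the toy sequences are hypothetical.

```python
WINDOW = 40          # number of N-terminal residues compared (from the text)
MAX_IDENTICAL = 18   # threshold from the text; treating >= 18 as redundant is an assumption


def n_identical(a: str, b: str, window: int = WINDOW) -> int:
    """Count identical residues at the same positions in the first `window` residues."""
    return sum(x == y for x, y in zip(a[:window], b[:window]))


def reduce_redundancy(candidates, reference=()):
    """Greedily keep candidates that are not redundant with the reference set
    (for example the training set) or with any candidate already kept."""
    kept = []
    for seq in candidates:
        others = list(reference) + kept
        if all(n_identical(seq, other) < MAX_IDENTICAL for other in others):
            kept.append(seq)
    return kept


# Toy sequences chosen so the identity counts are easy to read off.
a = "A" * 40
b = "A" * 18 + "C" * 22   # 18 identities with a: redundant
c = "A" * 17 + "C" * 23   # 17 identities with a: kept
print(len(reduce_redundancy([a, b, c])))
```

Passing the training sequences as `reference` mirrors the step in which the reduction was extended to the proteins of the training set.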
We also retrieved cytoplasmic proteins from UniProt by searching the "Subcellular Localization" field and excluding entries marked as "Potential", "Putative" and "By Similarity". Given that the number of sequences was large, these were submitted to redundancy reduction using full sequences (30% identities in an alignment of at least 80 residues), and once again, we removed proteins present (or having a homologue) in the training set, leaving us with 198 proteins. Finally, to test our method on TM proteins, we used the 106 experimentally verified cytoplasmic membrane proteins from Gram-positive bacteria used for the development of the PSORTB method.75 From this set, we removed proteins present in the training set and proteins with a putative signal peptide (based on the annotation), and we performed redundancy reduction at 30% identical residues in an alignment of at least 80 residues, finally leaving 66 TM proteins. We also validated the method on 109 secreted proteins of Bacillus subtilis whose signal peptides are sufficient for protein secretion in a novel expression system,76 and on 713 cytoplasmic proteins from the same organism identified by proteomic analyses.73 From the work of Brockmeier et al.,76 we also retrieved 25 proteins with a predicted signal peptide that were not expressed in the heterologous expression system for various reasons, and 35 proteins that, although expressed, were not detected in the medium by either of the two reporters used (Cutinase or Esterase). These data sets were used to further assess the sensitivity and the specificity of the methods developed here, since these proteins were predicted to possess a signal peptide by SignalP.
Once again, the proteins in the independent sets were compared against the sequences used for training in order to avoid redundancy, and cross-checked between the sets to avoid mistakes arising from the low-resolution proteomics experiments. Thus, from the 713 initially identified cytoplasmic proteins, 11 were also found in the other two subsets (lipoproteins and secreted), and they were therefore removed. All the protein sequences were retrieved from SubtiList.77 Finally, we used the completely sequenced genomes of Gram-positive bacteria from the NCBI repository in order to perform predictions and compare the results.

Comparison to Other Prediction Methods. For comparison, we used mainly the LipoP39 method, which is based on HMMs and was trained on Gram-negative bacteria lipoproteins, since it is the only available machine learning method for the same task. LipoP39 possesses a model architecture similar to the one we used (discrimination between lipoprotein signal peptides, secretory signal peptides, N-terminal TM helices and cytoplasmic proteins). Besides the major difference that the method was trained on Gram-negative bacteria lipoproteins, another difference is that LipoP is based on Forward decoding, whereas the method proposed here uses the Viterbi decoding algorithm.

Traditionally, the identification of bacterial lipoproteins was based on regular expression patterns. The first such pattern was the one proposed by von Heijne in 1989,8 which is [LVI]-[ASTG]-[GA]-C, requiring only one match in the first two positions (for all the patterns here we use the notation of Prosite78). The most widely used pattern is the PS00013 pattern of the Prosite database,79 {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C, with the additional rule that the cysteine (C) must be between positions 15 and 35, and at least one lysine (K) or arginine (R) must be in one of the first seven positions of the signal peptide.
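Patterns written in Prosite notation map directly onto regular expressions. The sketch below is a minimal, assumed translation: the helper names `prosite_to_regex` and `ps00013_cysteine` are hypothetical, only the syntax elements that occur in the patterns quoted in this section are handled, and the position limits are read as inclusive. The example sequences are invented and are not real proteins.

```python
import re

# One pattern element: optional N-terminal anchor "<", then a residue class,
# an excluded class, or a single letter, then an optional repeat "(n)" or "(n,m)".
_ELEMENT = re.compile(r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element}")
        anchor, core, lo, hi = m.groups()
        if core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {DERK} means any residue except D, E, R, K
        elif core in ("x", "X"):
            core = "."                       # x means any residue
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi is not None else "") + "}"
        parts.append(("^" if anchor else "") + core + repeat)
    return "".join(parts)


PS00013 = re.compile(prosite_to_regex("{DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C"))


def ps00013_cysteine(seq: str):
    """Return the 1-based position of the lipobox cysteine, or None if the rules fail."""
    if not re.search("[KR]", seq[:7]):       # K or R required in the first seven positions
        return None
    for start in range(len(seq)):            # try every start so overlapping hits are seen
        m = PS00013.match(seq, start)
        if m and 15 <= m.end() <= 35:        # m.end() equals the 1-based cysteine position
            return m.end()
    return None


print(prosite_to_regex("[LVI]-[ASTG]-[GA]-C"))
print(ps00013_cysteine("MKKLLAAVLLSALALTAGCSSNEK"))
```

Scanning every start position, rather than taking only the first regular expression hit, matters because an early match outside the allowed position range would otherwise hide a later valid one.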
Recently, this pattern has been replaced by a Position Specific Scoring Matrix (PSSM) with accession number PS51257; this PSSM was also used in the analysis using ScanProsite,80 but we chose to keep PS00013 in the analysis for historical reasons. Later, a pattern especially designed for Gram-positive bacteria lipoproteins (G+LPP) was developed based on observations on 33 experimentally characterized lipoproteins.26 This pattern is: <[MV]-X(0,13)-[RK]-{DERKQ}(6,20)-[LIVMFESTAG]-[LVIAM]-[IVMSTAFG]-[AG]-C. Lastly, another pattern was developed by the creators of the DOLOP database,36,37 which is a slightly modified version of the von Heijne pattern: [LVI]-[ASTVI]-[ASG]-C, with the additional rules that the cysteine (C) must be placed within the first 40 amino acids, that at least one lysine (K) or arginine (R) must be in one of the first seven positions of the signal peptide, and that this positively charged residue should be 7 to 22 amino acids away from the cysteine. The regular expression patterns used here were implemented locally using PERL scripts.

Figure 4. Smoothed histogram of the length distribution of lipoprotein and secretory signal peptides of Gram-positive bacteria. The latter are significantly longer, with a mean length of approximately 31 amino acids (compared to 22). Notice the second mode in both distributions, accounting for signal peptides approximately 40 amino acids long. These could be instances of false annotations concerning the initial methionine, or TAT signal peptides (see Results and Discussion).

Two other advanced methods have been proposed for the prediction of bacterial lipoproteins.
The first one (SPEPlip) uses a Neural Network predictor to detect the presence of the signal peptide, and afterward filters out nonlipoproteins by using the PS00013 pattern.81 This eliminates many of the false positive predictions made by the regular expression pattern alone. However, the method was not trained on experimentally verified lipoproteins but rather on putative ones, and moreover, the web-server does not allow massive submissions; thus, we did not use it in our evaluation. Furthermore, since the method uses the PS00013 pattern, we expect that its sensitivity would be equal to that of the pattern, although it may be more specific. The second method is based on the concept of probabilistic alignments of sequences with patterns (motifs),82 and the authors used the case of Gram-positive bacteria lipoproteins as an illustrative example for the development of the method. Later, they also applied the same method to Escherichia coli lipoproteins.83 However, the original method was trained on all B. subtilis putative lipoproteins, and furthermore, no prediction tool is available to run the tests. Finally, for analyses concerning accuracy in predicting secretory signal peptides, we also used SignalPv2,28 SignalPv3,32 Phobius34 and PrediSi84 in order to predict the putative signal sequences of the proteins tested. We used both the Neural Network (NN) and the HMM modules of SignalP, trained on Gram-positive bacteria, using the default parameters, with the submitted sequences truncated to their first 70 residues.

Results and Discussion

In Figure 4, one can see the different length distributions of the signal peptides cleaved by SPase I (secreted proteins) and