World J Microbiol Biotechnol (2008) 24:2377-2382
DOI 10.1007/s11274-008-9795-2

REVIEW

Methods for the bioinformatic identification of bacterial lipoproteins encoded in the genomes of Gram-positive bacteria

Obaidur Rahman • Stephen P. Cummings • Dean J. Harrington • Iain C. Sutcliffe

Received: 30 April 2008 / Accepted: 15 June 2008 / Published online: 27 June 2008
© Springer Science+Business Media B.V. 2008

Abstract Bacterial lipoproteins are a diverse and functionally important group of proteins that are amenable to bioinformatic analyses because of their unique signal peptide features. Here we have used a dataset of sequences of experimentally verified lipoproteins of Gram-positive bacteria to refine our previously described lipoprotein recognition pattern (G+LPP). Sequenced bacterial genomes can be screened for putative lipoproteins using the G+LPP pattern. The sequences identified can then be validated using online tools for lipoprotein sequence identification. We have used our protein sequence datasets to evaluate six online tools for efficacy of lipoprotein sequence identification. Our analyses demonstrate that LipoP (http://www.cbs.dtu.dk/services/LipoP/) performs best individually but that a consensus approach, incorporating outputs from predictors of general signal peptide properties, is most informative.

Keywords Lipoproteins • Signal peptides • Bioinformatics • Genomics • Firmicutes • Actinobacteria

Electronic supplementary material The online version of this article (doi:10.1007/s11274-008-9795-2) contains supplementary material, which is available to authorized users.

O. Rahman • S. P. Cummings • I. C. Sutcliffe
Northumbria University, Newcastle upon Tyne NE1 8ST, UK

D. J. Harrington
University of Bradford, West Yorkshire BD7 1DP, UK

I. C.
Sutcliffe (✉)
Biomolecular and Biomedical Research Centre, School of Applied Science, Northumbria University, Newcastle upon Tyne NE1 8ST, UK
e-mail: iain.sutcliffe@unn.ac.uk

Introduction

Bacterial lipoproteins (Lpp) are a functionally diverse class of membrane anchored proteins that typically represent ca. 2% of the bacterial proteome (Sutcliffe and Harrington 2002; Sutcliffe and Harrington 2004; Babu et al. 2006; Sutcliffe and Hutchings 2007), although in some taxa the proportion is even higher (Bendtsen et al. 2005; Setubal et al. 2006). Lpp are of particular significance in Gram-positive bacteria as, in the absence of an outer membrane, various proteins must be tethered to the plasma membrane in order to be retained within the cell envelope. Thus many Lpp of Gram-positive bacteria have functions directly comparable to those of periplasmic or surface proteins in Gram-negative bacteria. For example, the substrate binding proteins which deliver substrates to the integral membrane components of ABC importer systems are typically Lpp in Gram-positive bacteria and periplasmic proteins in Gram-negative bacteria (Sutcliffe and Russell 1995). Consequently, many of the known or predicted functions of Gram-positive bacterial Lpp reflect their predicted localisation at the interface between the cell membrane and the extracytoplasmic compartment. Thus, in addition to the well defined category of substrate binding Lpp, a brief selection of Lpp functions includes roles as enzymes; in sensing environmental cues; in membrane-associated redox processes; and in correct protein export and localisation (Sutcliffe and Russell 1995; Sutcliffe and Harrington 2004; Sutcliffe and Hutchings 2007). This functional versatility means that it is extremely useful to be able to identify putative Lpp in order to gain further insights into the biology of biotechnologically and medically significant organisms.
Moreover, the accurate prediction of protein localisation by sequence analysis is clearly an important aspect of genome annotation and, eventually, understanding of protein function (Gardy and Brinkman 2006).

Bacterial Lpp are anchored to cellular membrane(s) as the result of their post-translational modification with, as a minimum, a diacylglyceride group which is added to an essential cysteine. This cysteine is located in the C-terminal region of a signal peptide that directs precursor-Lpp translocation across the plasma membrane prior to lipid modification (Braun and Wu 1994; Sutcliffe and Harrington 2002). The stretch of amino acids preceding the cysteine is relatively well conserved (the 'lipobox') and this means that, in combination with the recognition of other conserved signal peptide features (Fig. 1), Lpp are highly amenable to identification by bioinformatic analyses. However, there is evidence for subtle taxon-specific differences in the signal peptide features of Lpp from different bacterial taxa (Setubal et al. 2006; Sutcliffe and Harrington 2002). In order to refine the methods for the bioinformatic analysis of Lpp from Gram-positive bacteria, we have curated a true positive (TP) dataset of 90 experimentally proven Lpp and a true negative (TN) dataset of sequences not considered to be Lpp. These datasets have been used to test the performance of several online applications in accurately identifying Gram-positive bacterial Lpp.

Screening Gram-positive bacterial genomes for putative Lpp

The conserved signal peptide and lipobox features of bacterial Lpp can be expressed in regular sequence patterns. Following the work of Klein et al. (1988) and von Heijne (1989), the Prosite profile PS51257 (formerly Prosite pattern PS00013) was defined to allow bacterial Lpp sequences to be recognised.
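To illustrate how signal peptide and lipobox features can be written as a sequence pattern, the following Python sketch encodes a deliberately simplified pattern as a regular expression and applies it to the MsmE signal peptide shown in Fig. 1. The pattern, the constant `ILLUSTRATIVE_LPP_PATTERN` and the function `find_lipobox_cysteine` are assumptions made for this example only; they are not the published G+LPP pattern or the Prosite profile, and the residue classes are chosen purely for illustration.

```python
import re
from typing import Optional

# Hypothetical, simplified pattern for illustration only. It is NOT the
# published G+LPP pattern and NOT the Prosite profile PS51257.
ILLUSTRATIVE_LPP_PATTERN = re.compile(
    r"^[MV]"                              # start residue of the precursor
    r".{0,13}"                            # variable N-terminal stretch
    r"[RK]"                               # positively charged N-region residue
    r"[^DERK]{6,20}"                      # uncharged, hydrophobic H-region
    r"[LIVMFSTAG][LIVMA][IVMSTAFG][AG]"   # four lipobox residues before Cys
    r"C"                                  # cysteine that receives the lipid
)


def find_lipobox_cysteine(sequence: str) -> Optional[int]:
    """Return the 1-based position of the putative lipid-modified cysteine,
    or None if the illustrative pattern does not match the N-terminus."""
    match = ILLUSTRATIVE_LPP_PATTERN.match(sequence.upper())
    if match is None:
        return None
    # The pattern ends on the cysteine, so the exclusive 0-based end index
    # equals the 1-based position of that cysteine.
    return match.end()


if __name__ == "__main__":
    msme_signal_peptide = "MKWYKKIGLLGIVGLTSVLLAAC"  # sequence from Fig. 1
    print(find_lipobox_cysteine(msme_signal_peptide))  # 23
```

A screen of a whole genome would apply such a function to every predicted protein sequence and pass the hits on for validation with dedicated prediction tools, as described in the text.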
Subsequently, we refined the pattern search approach and defined a pattern, denoted G+LPP, with greater accuracy (higher specificity) for the recognition of Lpp from Gram-positive bacteria (Sutcliffe and Harrington 2002; Table 1).

Fig. 1 Signal peptide features of a typical bacterial Lpp. (a) The sequence shown is that of the substrate binding protein MsmE of Streptococcus mutans, an experimentally verified Lpp (Sutcliffe et al. 1993). The positively charged N-region amino acids are shown in bold followed by the hydrophobic H-region. The lipoprotein specific lipobox, culminating in the crucial cysteine, is underlined. The arrow represents the mature protein sequence. (b) Output from SignalP-HMM for MsmE, demonstrating how this tool typically predicts the signal peptide H-region of bacterial Lpp to end in close proximity to the lipobox cysteine
[Fig. 1 panel (a): MKWYKKIGLLGIVGLTSVLLAAC; panel (b): SignalP-HMM prediction plot]

Both the Prosite profile