Protein identification from two-dimensional gel electrophoresis analysis of Klebsiella pneumoniae by combined use of mass spectrometry data and raw genome sequences

Separation of proteins by two-dimensional gel electrophoresis (2-DE) coupled with identification of proteins through peptide mass fingerprinting (PMF) by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) is a widely used technique for proteomic analysis. This approach relies, however, on the presence of the proteins studied in publicly accessible protein databases or the availability of annotated genome sequences of an organism. In this work, we investigated the reliability of using raw genome sequences for identifying proteins by PMF without the need for additional information such as amino acid sequences. The method is demonstrated for proteomic analysis of Klebsiella pneumoniae grown anaerobically on glycerol. Of 197 spots excised from 2-DE gels and submitted for mass spectrometric analysis, 164 spots were clearly identified as 122 individual proteins. 95% of the 164 spots could be identified merely by using peptide mass fingerprints and a strain-specific protein database (ProtKpn) constructed from the raw genome sequences of K. pneumoniae. Cross-species protein searching in the public databases identified only 57% of the 66 highly expressed protein spots, compared with 97% using the ProtKpn database. Ten dha regulon related proteins that are essential for the initial enzymatic steps of anaerobic glycerol metabolism were successfully identified using the ProtKpn database, whereas none of them could be identified by cross-species searching. In conclusion, the use of a strain-specific protein database constructed from raw genome sequences makes it possible to reliably identify most of the proteins from 2-DE analysis simply through peptide mass fingerprinting.


Background
The identification of proteins and protein expression patterns under given physiological conditions by proteomic analysis has gained fundamental importance for the functional study of cellular processes in recent years. Mass spectrometry (MS) has become a central element of proteomic analysis [1][2][3][4][5][6]. It is used in combination with various protein separation methods and bioinformatic tools for large-scale protein identification and characterization in various organisms [reviewed in [7][8][9][10][11][12][13][14][15][16]]. Among the different MS techniques, peptide mass fingerprinting (PMF) remains the simplest and most powerful technique for high-throughput protein identification: peptide mass fingerprints acquired by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) are compared with theoretical peptide mass fingerprints calculated for all the proteins in a given protein sequence database. For bacterial strains, PMF has been found to be even more reliable for species-specific protein identification than N-terminal sequencing [17]. This approach can be successfully applied to organisms whose genomes are fully sequenced and annotated, or to proteins whose sequences are sufficiently well conserved for cross-species identification. Otherwise, additional information such as amino acid sequences is often needed for an unambiguous identification [1,3,10,17,18,19]. Therefore, systematic identification of the proteins expressed by an organism can be greatly enhanced with genome sequence data.
Presently, the genomes of more than 100 organisms have been completely sequenced (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/bact.html, NCBI, 6 Sept. 2003). More than 100 genome sequencing projects are in progress. However, the annotation (function assignment) of these genome sequences is a time- and resource-consuming process. So far, the genome sequences of about 90 organisms have been extensively annotated. In fact, there is a relatively long delay between the completion of genome sequencing and the full annotation of the sequences. This means a delay in the full use of genome sequences for protein identification.
In this work, we propose to directly use raw genome sequences for identifying proteins separated by 2-DE and analyzed by mass spectrometry. The method is demonstrated for proteomic analysis of Klebsiella pneumoniae, an organism of biomedical importance (e.g. respiratory tract infection, urinary tract infection and biofilm formation) with many potential biotechnological applications (e.g. nitrogen fixation, production of enzymes and biochemicals). To our knowledge, a large-scale proteomic analysis of K. pneumoniae has not been reported in the literature so far. One of our objectives is to identify as many proteins as possible that are involved in the anaerobic bioconversion of glycerol to 1,3-propanediol by K. pneumoniae. The microbial production of 1,3-propanediol has recently attracted a great deal of industrial attention as an emerging new biochemical, and K. pneumoniae is a major model organism for understanding the metabolic and regulatory pathways of this bioconversion process [20,21]. Large-scale profiling of the proteins expressed under given physiological conditions, especially those directly involved in the enzymatic conversion of glycerol into intermediates and the final product, is desirable [22]. The proteomic method should also be useful for studying the medical aspects of this organism.

Results and Discussion
Protein database preparation and genome sequence annotation
Because the web version of GeneMarkS does not accept a multiple-sequence FASTA file as input and does not consider partial ORFs lying at the ends of a contig, the unfinished genome sequences had to be modified before submission for the prediction of ORFs. A short artificial linker DNA sequence (CATTCATTCATAAATAAATAAATGAATGAATGTTATTTATTTA) that includes the start codons (underlined) and stop codons (italic) in all six possible reading frames (three in each direction) was used to link all the 341 contigs of K. pneumoniae into one. All possible partial ORFs were then predicted by GeneMarkS. A total of 5616 ORFs were found and translated into proteins numbered Kpn1, Kpn2...Kpn5616, respectively. These proteins were further compared to the genomic sequences to remove additional amino acids introduced by the linker sequence. The functions of these proteins were assigned based on similarity comparison to proteins in SWISS-PROT and TrEMBL (see the protein sequence database at http://genome.gbf.de/bioinformatics/index.html).
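The linker design described above can be checked with a few lines of code. The following is a minimal sketch, not the authors' procedure: it assumes the hyphen in the printed linker is a line-break artifact, takes ATG as the start codon and TAA, TAG and TGA as stop codons, and uses hypothetical helper names (`frames_with`, `join_contigs`). It tests whether the linker carries at least one start and one stop codon in each of the three reading frames on both strands, and shows how contigs could be concatenated with it.

```python
# Sketch only: codon sets and helper names are assumptions for illustration.
LINKER = "CATTCATTCATAAATAAATAAATGAATGAATGTTATTTATTTA"
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codons_in_frame(seq, frame):
    """List the complete codons of seq read in the given frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def frames_with(seq, codon_set):
    """Return the set of frames in which at least one codon of codon_set occurs."""
    return {
        frame
        for frame in range(3)
        if any(codon in codon_set for codon in codons_in_frame(seq, frame))
    }


def join_contigs(contigs, linker=LINKER):
    """Concatenate contigs into one sequence, separated by the linker."""
    return linker.join(contigs)


if __name__ == "__main__":
    for name, strand in (("forward", LINKER), ("reverse", reverse_complement(LINKER))):
        print(name, "stop frames:", sorted(frames_with(strand, STOP_CODONS)),
              "start frames:", sorted(frames_with(strand, START_CODONS)))
```

Because a stop codon and a start codon are present in every frame of both strands, an ORF running off the end of one contig is terminated inside the linker, and a partial ORF can begin again before the next contig.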
The number of predicted ORFs for K. pneumoniae is much higher than the number of ORFs in Salmonella typhimurium LT2 (4451 ORFs) and E. coli K12 MG1655 (4289 ORFs), two close relatives of K. pneumoniae. This indicates the existence of gaps between contigs and/or sequencing errors inside contigs of the genome sequences of K. pneumoniae, even with the 8-fold coverage of the genome data. As a consequence, a real and biologically functional ORF could be predicted in silico as several partial ORFs. In some cases, the sequencing errors (gaps or extensions) can lead to frame shifts and thus to wrong predictions of protein sequences. In fact, nearly 20% of these proteins are shorter than half the length of their most similar proteins in SWISS-PROT and TrEMBL. 759 proteins can be clearly classified as partial proteins belonging to 343 intact proteins. If an ORF is predicted as several partial ORFs or its translation is wrong, the chance of identifying the corresponding protein merely by searching with PMFs in this strain-specific database will be reduced. In such a situation, PMF can be performed by searching in public databases for cross-species identification, or additional information such as partial amino acid sequences from ESI-QqTOF-MS analysis should be considered. An alternative solution is to identify the partial ORFs and sequencing errors (mainly gaps or extensions) and to correct them in silico. A program is now under development to identify partial proteins through integrity comparison between the predicted protein and its most similar proteins found in the public protein databases (SWISS-PROT and TrEMBL). This program will also identify sequencing errors, such as base pair gaps or extensions causing frame shifts and base pair substitutions causing premature termination or read-through, and then correct them. The curated protein database should enhance the identifiability, especially when many sequence errors exist, e.g. in the case of low coverage of the genome sequences.
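The length comparison mentioned above can be illustrated as follows. This is a hedged sketch rather than the program under development: the half-length threshold comes from the text, but grouping fragments by a shared best database hit, the function names and the example records are assumptions made for illustration.

```python
from collections import defaultdict


def is_probably_partial(predicted_length, best_hit_length, min_fraction=0.5):
    """Flag a predicted protein shorter than min_fraction of its best hit."""
    return predicted_length < min_fraction * best_hit_length


def group_partials(records, min_fraction=0.5):
    """Group flagged ORFs by their best hit (assumed grouping criterion).

    records: iterable of (orf_id, predicted_length, best_hit_id, best_hit_length).
    Returns {best_hit_id: [orf_id, ...]} for flagged ORFs only.
    """
    groups = defaultdict(list)
    for orf_id, predicted_length, hit_id, hit_length in records:
        if is_probably_partial(predicted_length, hit_length, min_fraction):
            groups[hit_id].append(orf_id)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical records, not taken from the ProtKpn database.
    example = [
        ("KpnA", 120, "hitX", 400),
        ("KpnB", 150, "hitX", 400),
        ("KpnC", 390, "hitY", 400),
    ]
    print(group_partials(example))
```

Several fragments mapping to the same best hit are candidates for one intact protein that was split by a contig gap or a frame shift; any such grouping would still need to be confirmed against the genome sequence.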

Identification of proteins separated by 2-DE based on PMF and a strain-specific protein database
The proteomic analysis of K. pneumoniae was intended to elucidate some unusual dynamic behaviour of this organism when grown anaerobically on glycerol [23]. To this end, samples were taken for 2-DE analysis to identify and quantify protein expression at different stages of the fermentation process. Fig. 1 shows a typical protein expression pattern of K. pneumoniae obtained from the 2-DE analysis. 197 different protein spots were excised from 2-DE gels and analyzed by peptide mass fingerprinting using MALDI-TOF MS. In some cases ESI-QqTOF MS/MS analysis was also applied to verify the PMF results. As shown in Fig. 1, after searching in the specific ProtKpn database for K. pneumoniae and, in some cases, in the two public protein databases NCBInr and SWISS-PROT/TrEMBL, 163 spots, corresponding to 83% of the spots submitted for MS analysis, were identified as 122 individual proteins. This means that some proteins appeared as several spots on the 2-DE gels because of co- and posttranslational modifications [24]. Table 1 [see Additional file 1] summarizes information on the identified proteins using Mascot as the search program [25]. All the proteins listed in Table 1 [see Additional file 1] are the first protein candidate in the search result lists; their scores are generally much higher than the significance level defined by Mascot (50 when using the ProtKpn database) and are normally significantly higher than the scores of the second candidates. In most cases, these proteins are also the only candidates with a significant score, leading to their unambiguous identification.
Using the specific ProtKpn database, 156 protein spots from 2-DE gels, corresponding to 95.7% of the 163 protein spots identified, could be identified simply with their peptide mass fingerprinting data obtained from the MALDI-TOF MS analysis. These results were also confirmed by sequence similarity searching with partial amino acid sequences obtained from ESI-QqTOF MS/MS analysis, as shown in Table 1 [see Additional file 1] for some of the proteins. This clearly demonstrates that protein identification of K. pneumoniae using only the peptide mass fingerprints and the ProtKpn database is reliable and sufficient when the score is significant. The reason for the failure to identify these two proteins with the ProtKpn database is that they appear as partial proteins in the ProtKpn database, as mentioned above in the discussion of the genome sequence annotation. As a consequence, the scores were not significant enough for an unambiguous identification.
However, some proteins predicted as partial proteins in the ProtKpn database can still be identified. 15 Spots listed in Table 1 (Figure 1). As a consequence, one isoform of each of the two proteins appeared coincidently at the same positions on the 2-DE gels.
34 protein spots were not identified, probably because of their low concentrations or the low quality of their MS spectra. Most of them were obviously proteins with low expression levels on the 2-DE gels. In addition, they might correspond to partial or wrongly predicted proteins in the ProtKpn database. All of them were measured only once by MALDI-TOF MS. With additional measurements they might be identified as well.

Comparison of protein identifications with public database and with strain-specific database
Peptide mass fingerprinting is the method of choice for straightforward high-throughput protein identification when the amino acid sequence of a protein exists in an annotated protein database. For organisms whose genomes are not yet fully sequenced or annotated, cross-species identification of protein spots from 2-DE gels is desirable [26]. However, a theoretical study by Wilkins and Williams showed that peptide masses are not well conserved across species boundaries, with few or no peptides being conserved when the sequence identity between two proteins is below 75% [27]. This means that cross-species protein identification by PMF is not reliable.
In order to judge how advantageous the specific ProtKpn database is in comparison with public databases, cross-species protein identification by PMF was carried out for 66 spots belonging to the highly expressed proteins. Only 57% could be identified by searching in the public databases, whereas 97% were identified using the ProtKpn database (data not shown). Using the three database search programs Mascot, PeptIdent and Knexus to interrogate the public databases NCBInr and SWISS-PROT/TrEMBL resulted in similar identifications.
Many housekeeping proteins of K. pneumoniae were identified with high scores simply by peptide mass fingerprinting, both through cross-species homologue protein searching and through searching in the ProtKpn database. These are proteins mainly involved in the glycolysis and gluconeogenesis pathways, energy metabolism, amino acid metabolism, protein biosynthesis and anti-stress processes. Nearly all these identified proteins of K. pneumoniae show high sequence similarities to the sequences of homologue proteins from S. typhimurium, S. typhi and E. coli. The genomes of these closely related microorganisms have been sequenced and to a large extent annotated and are therefore present in the public databases. As a result, cross-species identification of some of these housekeeping proteins is possible owing to their high sequence conservation between K. pneumoniae and these microorganisms.
Using the smaller strain-specific database significantly decreased the noise and uncertainty caused by the large number of sequences in the public databases. In contrast, searching with PMFs in public databases often yields many probabilistic protein candidates, and a clear identification of the target protein is not always possible. In such a case, further fragmentation of selected peptides of a protein is unavoidable in order to obtain partial amino acid sequences for an unambiguous protein identity. There are different techniques for peptide fragmentation, such as electrospray tandem quadrupole time-of-flight mass spectrometry (ESI-QqTOF MS/MS) [28,29], MALDI-quadrupole time-of-flight mass spectrometry (MALDI-QqTOF MS) [30] and tandem time-of-flight mass spectrometry (MALDI-TOF/TOF MS) [31,32]. Besides requiring additional resources and work, these techniques are more complex and less scalable than MALDI-MS peptide mass fingerprinting. ESI-QqTOF MS/MS, which was used in this study, is much more susceptible than MALDI-TOF MS to small-molecule contaminants in digested peptide mixtures, and desalting with ZipTip is needed. In our experience it is often difficult to obtain MS/MS spectra of sufficiently high quality for successful sequencing.
As expected, when the proteins or their homologues are not present in the public databases, the search in public databases did not lead to their identification. This is especially the case for the proteins of the dha regulon, which encodes enzymes for the initial assimilation of glycerol and for the formation of 1,3-propanediol and which were of particular interest to us [33]. Most proteins of the dha regulon in K. pneumoniae were identified with significant scores by searching the ProtKpn database but could not be identified using the public databases (Table 1). Apart from K. pneumoniae, the dha regulon is known to exist only in a few organisms such as Citrobacter freundii, Clostridium perfringens, Clostridium pasteurianum and Clostridium butyricum [35]. Except for C. perfringens, the genomes of these organisms have not yet been sequenced. Using PMF, only 1,3-propanediol oxidoreductase (PDOR) was identified in both the SwissProt/TrEMBL database and the NCBInr database, and two subunits of glycerol dehydratase (GDHt) (beta and small subunits) were found in the NCBInr database. However, these identifications did not result from cross-species identification but from the presence of these two K. pneumoniae proteins in the public databases. For the identification of dihydroxyacetone kinase (DHAK I), DHAK I from C. freundii and the hypothetical oxidoreductase yqhD (HOR) from E. coli were found as possible candidates by cross-species searching with PeptIdent in the SwissProt/TrEMBL database. However, DHAK I of C. freundii ranked third in the protein candidate list and HOR of E. coli sixteenth, so that a definite identification of this enzyme was not possible with this approach.

Function assignment of identified proteins and their biological interpretations
The functions of the identified proteins were assigned by comparing their sequences to the public protein database SWISS-PROT/TrEMBL through a local NCBI-BLAST server. Homologue proteins with the highest sequence similarities are listed under the corresponding annotation for each identified protein in Table 1 [see Additional file 1], and wherever possible, well-studied homologue proteins were preferred for inclusion.
The 122 identified proteins can be classified into 9 categories and 38 subcategories, as shown in Table 1 [see Additional file 1], based on KEGG (http://www.genome.ad.jp/kegg/). The categories range from energy metabolism, catabolism of small molecules and anabolism of building blocks to genetic and environmental information processing such as transcription, translation, transport and stress response. The first category, carbohydrate metabolism, is the largest and contains about 25% of all identified proteins or peptides. It includes all the enzymes of the dha regulon as well as enzymes of the near-complete glycolysis and gluconeogenesis pathways, enzymes of the pentose phosphate pathway and part of the TCA cycle. Carbohydrate metabolism plays an essential role not only by delivering metabolic precursors but also by supplying energy. In the anaerobic glycerol bioconversion by K. pneumoniae, substrate-level phosphorylation is the only way to generate energy. Many proteins in this category were found to be highly expressed under the defined fermentation conditions. It is interesting to mention that the key enzyme of the Entner-Doudoroff pathway, KHG/KDPG aldolase, was unexpectedly also identified. Its function in anaerobic glycerol metabolism is unknown and has not been studied so far.
The dha regulon includes 15 ORFs which encode 5 metabolic enzymes, namely dihydroxyacetone kinase I and II (DHAK I and II), glycerol dehydrogenase (GDH), glycerol dehydratase (GDHt) and 1,3-propanediol dehydrogenase (PDOR), 1 regulatory protein, 1 activator for GDHt, 1 transport facilitator and 2 proteins of unknown function [26]. As shown in Table 1 [see Additional file 1], we have identified all 5 metabolic enzymes or their subunits with the help of the strain-specific database ProtKpn. Of particular interest is the identification of both DHAK I and II of the dha regulon. The expression level of DHAK II was much higher than that of DHAK I. DHAK II was recently found by us as a second dha kinase, and its expression explains some peculiar observations of the fermentation process well [23,33]. The expression of OrfY, a common component of unknown function in the dha regulons of different organisms, was identified as well. We found that the amino acid sequence of this protein (OrfY) is slightly different from the one in the public database, which belongs to another K. pneumoniae strain, ATCC 25655.
a Refers to the proteins labelled in Figure 1.
b Protein access numbers in the ProtKpn database.
c Knexus uses ProFound as the search program. ProFound calculates the probability that a candidate in a database search is the protein being analysed. A Z score is estimated as an indicator of the quality of the search result, when the search result is compared against an estimated random match population. The Z score is the distance to the population mean in units of standard deviation. It also corresponds to the percentile of the search in the random match population.
d With PeptIdent, the score is the number of peptides that match the theoretical peptides from a database entry divided by the total number of peptide masses specified for the search. With Mascot, the score is -10*Log(P), where P is the probability that the observed match is a random event. If there is also a superscript number beside the score, it represents the position of this protein in the protein candidate list. Otherwise, it is the top one.
e SC: sequence coverage, defined as the ratio of the portion of the protein sequence covered by matched peptides to the whole length of the protein sequence.
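The score definitions in the footnotes above translate directly into small formulas. The sketch below is only an illustration of those definitions, with hypothetical function names; it assumes a base-10 logarithm for the Mascot score and 1-based, inclusive residue ranges for the matched peptides.

```python
import math


def mascot_score(p_random):
    """Score as defined in footnote d: -10 * log10(P)."""
    return -10.0 * math.log10(p_random)


def probability_from_score(score):
    """Invert the definition: P = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


def peptident_score(matched_peptides, submitted_masses):
    """Fraction of submitted peptide masses that match a database entry."""
    return matched_peptides / submitted_masses


def sequence_coverage(protein_length, matched_ranges):
    """Fraction of residues covered by matched peptides.

    matched_ranges: (start, end) residue positions, 1-based and inclusive.
    Overlapping peptides are counted only once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(matched_ranges):
        start = max(start, last_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / protein_length


if __name__ == "__main__":
    print(probability_from_score(50))
    print(sequence_coverage(100, [(1, 10), (5, 20), (50, 59)]))
```

Under this definition, the significance level of 50 quoted for the ProtKpn database corresponds to a random-match probability of 10^-5, and every further 10 score units lower that probability by another factor of ten.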
controls through covalent modifications and will be further studied.
By applying the method presented in this work, several new enzymatic and regulatory proteins were identified that have a large impact on understanding and optimizing the microbial production of 1,3-propanediol. The expression patterns of some of these proteins have been discussed elsewhere in terms of metabolic pathway analysis of this emerging industrial bioprocess [22,23,33]. The identified protein spots are being used for comparison of protein expression profiles of K. pneumoniae to elucidate the metabolic pathway regulation associated with gene overexpression or knockout experiments aimed at developing a more efficient bioprocess for 1,3-propanediol production. The protein database and the method for protein identification can also be used to study other important biological processes and phenomena such as biofilm formation, nitrogen fixation and antibiotic resistance in K. pneumoniae.

Conclusion
The combined use of high-resolution 2-DE separation, high-throughput MS analysis and raw genome sequences for an extensive and reliable identification of proteins has been shown in this work to significantly accelerate the proteomic and functional-genomic studies of K. pneumoniae grown anaerobically on glycerol. In particular, the establishment of a strain-specific protein database from unannotated genome sequences simplifies and improves protein identification to a large extent. With this approach, a large portion of the expressed protein spots from 2-DE analysis can be identified for this organism with high confidence simply by peptide mass fingerprinting using MALDI-TOF MS data.

Organism and cultivation
Klebsiella pneumoniae DSM 2026, obtained from the German Collection of Microorganisms (DSMZ), was used in this study. The cultivation medium and conditions in a fed-batch bioreactor were reported in detail by Wang et al. [23].

MALDI/TOF-MS analysis of tryptic peptides
Matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) was employed to obtain the peptide mass fingerprint of a given protein. 0.5-1 µl of each concentrated peptide solution was mixed with the same volume of a saturated matrix solution of α-cyano-4-hydroxycinnamic acid (Bruker Daltonics) in 0.5% HCOOH-65% MeOH, spotted onto a 384 MTP target and dried at room temperature. In the case of very small protein spots (low expression), peptides were eluted directly onto the target with 1-2 µl matrix solution after treatment with ZipTip. The molecular masses of the tryptic peptides were determined in the positive-ion mode on a Bruker Ultraflex time-of-flight mass spectrometer (Bruker Daltonics GmbH, Germany) using an acceleration voltage of 20 kV.

ESI-MS/MS sequencing of selected peptides
Electrospray ionization quadrupole time-of-flight tandem mass spectrometry (ESI-QqTOF MS/MS) was performed to acquire partial amino acid sequences of a protein. 3 µl of a concentrated peptide solution was filled into gold-coated nanospray glass capillaries and placed orthogonally in front of the entrance hole of a Q-TOF 2 mass spectrometer (Micromass, Manchester, England) equipped with a nanospray ion source. A voltage of approximately 700-1000 V was applied to the capillary. For collision-induced dissociation, parent ions were selectively transmitted from the quadrupole mass analyzer into the collision cell. Argon was used as the collision gas and the kinetic energy was set at around 20-40 V for optimal fragmentation. The daughter ions acquired were then separated by the orthogonal time-of-flight mass analyzer.

Protein identification using public protein sequence databases
Peptide mass fingerprints (PMFs) obtained from MALDI-TOF MS analysis were used for cross-species protein identification in public protein primary sequence databases. Mascot (Matrix Science Ltd., UK, http://www.matrixscience.com), PeptIdent (Swiss Institute of Bioinformatics, http://www.expasy.ch/tools/peptident.html) and Knexus™ (Proteometrics Inc., http://www.proteometrics.com) were employed for analysis of the MALDI data using the public databases NCBInr and SWISS-PROT/TrEMBL. Trypsin was given as the digestion enzyme, 2 missed cleavage sites were allowed, cysteine was modified by iodoacetamide and methionine was assumed to be partially oxidized. All peptide mass values are monoisotopic and the mass tolerance was set at 200 ppm, although the observed mass accuracy was usually better than 50 ppm for identified peptides. With PeptIdent, the Mr and pI values observed from the 2-D electrophoresis were also used as search parameters, with the pI range set at 0.5 and the Mr range at 20%.
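The search parameters above can be made concrete with a small in-silico digest. The following sketch is not how Mascot, PeptIdent or Knexus work internally; it is a simplified illustration under stated assumptions: trypsin cleaves after K or R unless the next residue is P, cysteine carries a fixed carbamidomethyl group (+57.02146 Da), methionine oxidation (+15.99491 Da) is variable, masses are singly protonated monoisotopic values, and all function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
CARBAMIDOMETHYL = 57.02146  # fixed modification on Cys (iodoacetamide)
OXIDATION = 15.99491        # variable modification on Met


def cleave(protein):
    """Split after K or R, except when the next residue is P."""
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def tryptic_peptides(protein, missed=2):
    """All peptides with up to `missed` missed cleavage sites."""
    pieces = cleave(protein)
    peptides = []
    for i in range(len(pieces)):
        for j in range(i, min(i + missed + 1, len(pieces))):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def mh_masses(peptide):
    """[M+H]+ masses for 0..n oxidized Met, with all Cys carbamidomethylated."""
    base = sum(MONO[residue] for residue in peptide) + WATER + PROTON
    base += CARBAMIDOMETHYL * peptide.count("C")
    return [base + k * OXIDATION for k in range(peptide.count("M") + 1)]


def ppm_error(observed, theoretical):
    return (observed - theoretical) / theoretical * 1e6


def match_fingerprint(observed_masses, protein, missed=2, tolerance_ppm=200.0):
    """Return (observed mass, peptide) pairs that agree within the tolerance."""
    matches = []
    for observed in observed_masses:
        for peptide in tryptic_peptides(protein, missed):
            if any(abs(ppm_error(observed, m)) <= tolerance_ppm
                   for m in mh_masses(peptide)):
                matches.append((observed, peptide))
    return matches
```

A real search engine additionally scores how unlikely the set of matches is given the database size, which is where the probability-based Mascot score comes in; this sketch only performs the mass comparison.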
Selected peptides of a protein were fragmented by ESI-QqTOF MS/MS. MS/MS spectra were enhanced using the Max Ent 3 software (Micromass), followed by automatic or manual sequencing using the PepSeq program of the software package Masslynx™ Version 3.5 (Micromass). The partial amino acid sequences obtained were used for similarity searching against the SWALL Non-Redundant Protein Sequence Database using FASTA3 (http://www.ebi.ac.uk/fasta33/).

Protein identification using genome sequences of K. pneumoniae strain MGH 78578
The genome of the strain used in this study, Klebsiella pneumoniae DSM 2026, has not yet been sequenced. However, another Klebsiella strain (K. pneumoniae MGH 78578) that is very similar to K. pneumoniae DSM 2026 was sequenced by the Genome Sequencing Center at the Washington University School of Medicine (http://genome.wustl.edu/projects/bacterial/). A whole-genome shotgun approach was used to generate 7.9-fold coverage of the genome, given as 341 contigs (as of January 2002). To date, no annotation is publicly available for this organism. The contigs of K. pneumoniae strain MGH 78578 were downloaded as a local database. Open reading frames (ORFs) were predicted from the contigs and translated into protein sequences using the web version of the program GeneMarkS [34]. The functions of these proteins were assigned by comparing their sequences to the public protein databases SWISS-PROT and TrEMBL. The isoelectric point (pI) and molecular weight (Mr) of the proteins were calculated using Vector NTI Suite 7.1 (InforMax, USA). Both the genome sequences and the protein sequences were formatted as local databases for BLAST (Basic Local Alignment Search Tool) [35].
After the strain-specific protein sequence database (ProtKpn) for K. pneumoniae had been developed, it was formatted and installed on our local Mascot server (http://genome.gbf.de/bioinformatics/index.html). PMFs from MALDI-TOF MS analysis were compared to the predicted peptide masses in this specific protein database using Mascot as the search program. Additionally, partial amino acid sequences from ESI-QqTOF MS/MS analysis were searched in the same database using the local NCBI BLAST function of the program BioEdit (downloaded from http://www.mbio.ncsu.edu/BioEdit/bioedit.html).