Peptide charge state determination of tandem mass spectra from low-resolution collision induced dissociation
© Shi and Wu; licensee BioMed Central Ltd. 2011
Published: 14 October 2011
Volume 9 Supplement 1
Charge states of tandem mass spectra from low-resolution collision induced dissociation cannot be determined by mass spectrometry. As a result, such spectra with multiple charges are usually searched multiple times, once for each possible charge state. This strategy not only increases the overall database search time but also yields more false positives. Hence, it is advantageous to determine the charge states of such spectra before database search.
We propose a new approach capable of determining the charge states of low-resolution tandem mass spectra. Four novel and discriminant features are introduced to describe tandem mass spectra and are used in a Gaussian mixture model to distinguish doubly and triply charged peptides. Tests on three independent datasets with known validity show that this method can assign charge states to low-resolution tandem mass spectra more accurately than existing methods.
The proposed method can be used to improve the speed and reliability of peptide identification.
Mass spectrometry has been widely used for high-throughput analysis of protein samples. Proteins are first cleaved into peptides by enzymatic or chemical cleavage. The peptides are then separated from mixture solutions by high pressure liquid chromatography (HPLC) and sent to ionization sources, where they are ionized. Two ionization techniques, electrospray ionization (ESI) and matrix assisted laser desorption/ionization (MALDI), are often used in proteomics laboratories. MALDI is mainly used in peptide mass fingerprinting because it predominantly yields singly charged ions. Unlike MALDI, ESI typically produces multiply charged ions. After being ionized, peptides are introduced into analyzers such as an ion trap or a triple quadrupole to produce mass spectra (MS). To obtain tandem mass spectra (MS/MS), the peptide ions with the highest intensities in MS are isolated and fragmented by collision induced dissociation (CID). The resulting MS/MS provide information about the structural composition of the peptides.
The commonly used database search programs for peptide identification include Sequest and Mascot. These programs compare experimental spectra with theoretical spectra from a database and use scoring functions to measure the similarity between them. Typically, the peptide with the highest score is reported as the identification. However, the growing number of protein sequences in expanding databases is a challenge for database search software because the search space increases sharply. Moreover, multiply charged peptides from ESI-CID add further complexity, because they generate much more complex tandem mass spectra. Although high-resolution mass spectrometers can resolve the isotopic spacing of fragment ions and thereby derive charge states, the most commonly used ion trap and triple quadrupole analyzers lack the resolution to do so. In such a case, one spectrum is usually searched multiple times, assuming each possible charge state of its precursor peptide ion in turn. This strategy increases the overall database search time and yields more false positives, because true positives must be distinguished from many more peptide candidates. The need to determine peptide charge states is not limited to database search; it also arises in de novo sequencing methods.
This paper focuses on the charge state determination of low-resolution tandem mass spectra. Several methods for determining the charge states of such spectra have been reported [3, 5–7]. One method proposed thirty-four features to describe MS/MS and the link between MS and MS/MS, and then used a support vector machine (SVM) to classify MS/MS into three groups: +2, +3 and +2/+3. One problem with this method is that the third group still leaves the charge state ambiguous. More recently, twenty-eight features of MS/MS were proposed to train an SVM to discriminate doubly and triply charged peptides. The common problem with [5, 7] is that an SVM needs to be trained with labeled data. This inherent drawback of supervised methods limits their generality in determining the charges of arbitrary experimental MS/MS. Last but not least, it is computationally expensive to first train an SVM and then apply it to test data.
In this paper, we present an unsupervised learning method based on a Gaussian mixture model (GMM) to determine the charge states of low-resolution tandem mass spectra. Four novel and discriminant features are proposed to describe MS/MS. Tests on three low-resolution MS/MS datasets with verified charge states show that the proposed method can accurately assign charge states to such tandem mass spectra.
In database search, tandem mass spectra are usually considered to carry 1, 2 or 3 charges. Previous research shows that singly charged MS/MS can be reliably determined. Therefore, charge state determination can be reduced to the classification of doubly and triply charged MS/MS. To solve this problem, this study uses the unsupervised GMM with features designed to reflect the properties of MS/MS. Since the features are extracted from MS/MS, we first introduce several properties of peptide CID tandem mass spectra.
Since one peptide with different charges can produce different MS/MS, we can infer the charge state of a peptide according to the features of its MS/MS. As we will see, these features will be calculated based on the above relationships between the singly and doubly charged fragment ions.
where m_1 and m_2 are the m/z values of any two peaks from the given peptide tandem mass spectrum and m_2 > m_1.
where |·| denotes the cardinality of a set. The feature δ_cp is the difference between the number of complementary pairs (+1, +1) and the number of complementary pairs (+1, +2) in MS/MS. This feature accounts for the fact that +2 peptides tend to generate two +1 ions at the same bond, while +3 peptides are prone to yield one +1 and one +2 ion [3, 6]. From the definition, this feature is expected to be larger for doubly charged peptides than for triply charged ones.
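As an illustration, the counting behind δ_cp can be sketched as follows. This is not the authors' implementation. It assumes that m_p denotes the precursor m/z, that a (+1, +1) pair satisfies m_1 + m_2 ≈ 2m_p, and that a (+1, +2) pair satisfies m_1 + 2m_2 ≈ 3m_p, where m_1 is the +1 fragment and m_2 the +2 fragment. The function name and the tolerance `tol` are hypothetical choices.

```python
def delta_cp(mz_values, precursor_mz, tol=0.5):
    """Sketch: (# of (+1,+1) complementary pairs) - (# of (+1,+2) pairs).

    Assumed relations (not taken verbatim from the paper):
      (+1,+1): m_a + m_b     ~= 2 * precursor_mz
      (+1,+2): m_a + 2 * m_b ~= 3 * precursor_mz  (m_b is the +2 fragment)
    """
    peaks = sorted(mz_values)
    n = len(peaks)

    # Unordered pairs: both fragments are singly charged.
    n_11 = 0
    for i in range(n):
        for j in range(i + 1, n):
            if abs(peaks[i] + peaks[j] - 2.0 * precursor_mz) <= tol:
                n_11 += 1

    # Ordered pairs: peak i is the +1 fragment, peak j the +2 fragment.
    n_12 = 0
    for i in range(n):
        for j in range(n):
            if i != j and abs(peaks[i] + 2.0 * peaks[j] - 3.0 * precursor_mz) <= tol:
                n_12 += 1

    return n_11 - n_12
```

For a precursor at m/z 500, the peaks 300/700 and 450/550 each sum to 1000 and so count as (+1, +1) pairs, which gives a positive value as expected for a doubly charged peptide.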
where I(·) represents the intensity of peaks. The feature δ_Rcp is the difference between two ratios: the intensity of +1 peaks over the intensity of their complementary +1 peaks, and the intensity of +2 peaks over the intensity of their complementary +1 peaks. The term 0.5 is added because the intensity of y ions in higher mass regions is larger than that of b ions in lower mass regions. This feature accounts for the fact that the intensity of +1 peaks and the intensity of their complementary +1 peaks should be comparable when they are produced from doubly charged peptides, while the intensity of +1 peaks from triply charged peptides should be comparable to the intensity of their complementary +2 peaks. Thus, the difference between these two ratios should be greater than 0 for doubly charged peptides and less than 0 for triply charged ones. This newly proposed feature is expected to be more significant than the first feature proposed in earlier work, because it integrates intensity information into the feature definition rather than just counting the number of complementary pairs.
The feature I_dc is the intensity of +2 peaks in the mass region [m_p, 1.5m_p]. In theory, the m/z values of +2 peaks from +2 peptides should not exceed m_p, while they should not exceed 1.5m_p when they are from +3 peptides. Hence, I_dc, which accounts for the +2 peak intensity in the region [m_p, 1.5m_p], should be very discriminant for doubly and triply charged peptides. This feature is expected to be smaller for doubly charged peptides than for triply charged ones.
where N_t is the theoretical repeat number of basic residues in a mass spectrum. More discussion about n_bs is given later.
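A rough sketch of how I_dc might be accumulated is given below. The text does not state how a peak is recognised as a +2 peak, so the sketch makes an explicit assumption: a peak in [m_p, 1.5m_p] is treated as a +2 fragment only if the spectrum also contains a complementary +1 peak m_1 with m_1 + 2m_2 ≈ 3m_p. The function name, the (m/z, intensity) tuple layout and the tolerance `tol` are likewise hypothetical.

```python
def doubly_charged_intensity(peaks, precursor_mz, tol=0.5):
    """Sketch of I_dc: summed intensity of presumed +2 peaks in [m_p, 1.5*m_p].

    peaks: list of (mz, intensity) tuples.
    Assumption: a peak counts as +2 only if a complementary +1 peak exists,
    i.e. some other peak m_a with m_a + 2*m_b ~= 3*precursor_mz.
    """
    low, high = precursor_mz, 1.5 * precursor_mz
    total = 0.0
    for j, (mz_b, intensity_b) in enumerate(peaks):
        if not (low <= mz_b <= high):
            continue  # outside the region that only +3 peptides can populate
        has_partner = any(
            i != j and abs(mz_a + 2.0 * mz_b - 3.0 * precursor_mz) <= tol
            for i, (mz_a, _) in enumerate(peaks)
        )
        if has_partner:
            total += intensity_b
    return total
```

In practice the result would probably be normalised, for example by the total ion intensity, so that spectra of different overall intensity are comparable; the text does not say whether this is done.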
When computing the values of all features, we also consider peaks produced by the loss of water, ammonia, CO or an NH group, as proposed in earlier work.
These equations are intimately coupled with one another, because the term p(k|n) in turn depends on all terms on the left-hand sides through (21) and (22). Thus, it is hard to solve these equations directly. However, the EM algorithm can provide a solution. We start with a guess for the parameters p_k, µ_k, σ_k, and then iteratively cycle through (21) and (22) (E-step), and then (25), (26) and (27) (M-step), repeating until convergence.
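A minimal sketch of these E- and M-steps is given below for a one-dimensional mixture. It is an illustration under simplifying assumptions (a single feature, initial means supplied by the caller, a fixed number of iterations, and a hypothetical floor `min_sigma` on σ_k), not the MATLAB implementation used in this study.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(xs, init_mus, n_iter=100, min_sigma=1e-3):
    """EM for a 1-D Gaussian mixture; returns (weights, mus, sigmas, resp)."""
    k, n = len(init_mus), len(xs)
    weights = [1.0 / k] * k          # initial guess for p_k
    mus = list(init_mus)             # initial guess for mu_k
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1.0
    sigmas = [sd] * k                # initial guess for sigma_k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior p(k|n) of each component for each data point.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([p / total for p in joint])
        # M-step: re-estimate p_k, mu_k, sigma_k from the posteriors.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            if n_j <= 0.0:
                continue  # empty component: keep its previous parameters
            weights[j] = n_j / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / n_j
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / n_j
            sigmas[j] = max(math.sqrt(var), min_sigma)
    return weights, mus, sigmas, resp
```

Each spectrum can then be assigned to the component with the larger posterior. Which component corresponds to +2 and which to +3 has to be decided from the expected relationships of the feature means.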
Three datasets are used to investigate the performance of the proposed method in predicting charge states of peptide CID tandem mass spectra.
ISB dataset. The ISB dataset was acquired on an LC-ESI ion trap (ThermoFinnigan) and was provided by the Institute for Systems Biology (ISB, Seattle, USA). It contains 37,044 peptide MS/MS from a control mixture of 18 standard proteins. The charge states were assigned to 1656 doubly charged and 984 triply charged peptides with Sequest.
TOV dataset. The TOV dataset includes 22,577 peptide MS/MS acquired on an LCQ DECA XP ion trap (Thermo Electron Corp.). The samples analyzed were generated by the tryptic digestion of a whole-cell lysate from 36 fractions of TOV-112D. These spectra were searched using Sequest, and the assignments of 1898 doubly charged and 261 triply charged spectra were verified to be correct by Scaffold (http://www.proteomesoftware.com) with a minimum probability of 0.95.
BALF dataset. The BALF dataset was obtained from an LCQ DECA ion trap mass spectrometer (ThermoFinnigan) and is available in the PeptideAtlas (http://www.peptideatlas.org/repository) data repository. MS/MS were searched with Sequest against the IPI human protein database. The assignments of 2492 doubly charged and 3686 triply charged spectra were validated using PeptideProphet with a minimum probability of 0.90.
The GMM is fitted with a MATLAB implementation of the EM algorithm described previously. All features are transformed to have variance 1. The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are employed to measure classifier performance. ROC curves of actual classifiers lie between the ideal case (the point (0,1)) and the random-guess case (the diagonal line), with AUC ∈ (0.5, 1). The larger the AUC, the more powerful the classifier.
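The two evaluation steps can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the MATLAB code used in this study. It assumes that scaling means dividing each feature by its population standard deviation, and it computes the AUC through its rank interpretation: the probability that a randomly chosen positive spectrum receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
import math


def scale_to_unit_variance(values):
    """Divide a feature by its (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v / sd for v in values] if sd > 0 else list(values)


def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5.

    labels: 1 for the positive class, 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of spectra, which is acceptable for a sketch; a rank-based formulation would be preferable for large datasets.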
Estimates of means of all features and their expected relationships

Feature    Expected feature values
δ_cp       +2 > +3
δ_Rcp      +2 > +3
I_dc       +2 < +3
n_bs       +2 < +3
AUC of classifiers built with each feature
The results on the discriminant power of each feature show that the number of basic sites is not powerful in discriminating peptides with different charges. The reason is that this feature cannot yet be computed precisely. Computing the number of basic sites is complicated by the following factors: (1) The mass differences between many pairs of peaks may correspond to the same basic site, because six kinds of ions can be generated in CID, although they are not equally likely. In addition, those ions can produce variants by losing water, ammonia, CO or an NH group. (2) When computing the number of basic sites, we do not want to rely heavily on their positions in the sequence; otherwise, the task would turn into another complex problem, peptide de novo sequencing. However, when a peptide contains multiple basic sites, especially identical ones such as two K or two R, we need a way to differentiate them. (3) Tryptic peptides that end with two adjacent basic sites (KK, RR, KR, RK, HK, HR) or start with a basic site also complicate the computation. Previous research shows that when two basic sites are adjacent, it is more likely that only one of them attaches a proton, because of the strong Coulombic repulsion between adjacent protons. In addition, when a peptide starts with a basic residue, the N-terminal amine group is less likely to attract a proton, because the side chains of basic residues have much higher proton affinities than the amine group.
According to the definition of n_bs, we can approach its computation in two possible ways: (1) Compute the pseudo-number of basic sites by counting all cases corresponding to a basic site and ignoring duplicate cases. This is reasonable because the pseudo-number for triply charged peptides should generally be larger than that for doubly charged ones. (2) Determine the theoretical repeat number of basic sites from statistics on the ions generated in mass spectrometry. Some research has been conducted to quantify the percentage of each kind of ion produced in CID, including statistics reported for the yeast proteome. However, data of a more general nature are needed. With statistics on the ions produced in CID, we can compute a theoretical repeat number for each basic residue. This can then be combined with the pseudo-number to derive the real number of basic sites in a mass spectrum. In this study, the feature n_bs was computed as the pseudo-number and transformed to have variance 1. This feature is cogent in theory for discriminating doubly and triply charged MS/MS, but how to compute it precisely is still an open problem.
A new approach for assigning charge states to low-resolution CID MS/MS is proposed, based on the unsupervised GMM with four novel and discriminant features extracted from MS/MS. ROC and AUC demonstrate that the GMM with the proposed features is very promising in classifying doubly and triply charged MS/MS. In future work, we will further examine the computation of the number of basic sites, which theoretically should be the most significant feature in discriminating peptides with different charges.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to thank Dr. Andrew Keller from the Institute for Systems Biology for generously providing spectral data and protein databases for the ISB dataset and Dr. Guy G. Poirier from Laval University for providing the TOV dataset.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.