A feedback framework for protein inference with peptides identified from tandem mass spectra
© Shi and Wu; licensee BioMed Central Ltd. 2012
Received: 9 July 2012
Accepted: 2 November 2012
Published: 19 November 2012
Protein inference is an important computational step in proteomics. There is a natural nested relationship between protein inference and peptide identification, but existing methods usually perform these two steps separately. We believe that both peptide identification and protein inference can be improved by exploiting this nested relationship.
In this study, we propose a feedback framework for processing peptide identification reports from search engines, and implement an iterative method that exemplifies how Sequest peptide identification reports can be processed according to the framework. The iterative method is verified on two datasets with known validity of proteins and peptides, and compared with ProteinProphet and PeptideProphet. The results show that the iterative method not only infers more true positive and fewer false positive proteins than ProteinProphet, but also identifies more true positive and fewer false positive peptides than PeptideProphet.
The proposed iterative method implemented according to the feedback framework can unify and improve the results of peptide identification and protein inference.
Protein inference by assembling peptides identified from tandem mass spectra (MS/MS) is an important computational step in proteomics, on which further analyses, such as the inference of protein structure and function, can be based. Comprehensive discussions of this problem can be found in [1–3]. Existing MS/MS-based methods for this problem fall into two groups. The first group performs protein inference and peptide identification separately [4–8]: peptides are first identified from tandem mass spectra by de novo sequencing [9–11] or database search [12–14], and proteins are then inferred by assembling the identified peptides. The second group combines protein inference with peptide identification, identifying peptides and proteins simultaneously [15–17]. In this group, the Barista model formulates protein inference as an optimization problem. A tripartite graph represents the problem, with layers corresponding to spectra, peptides and proteins. The input to Barista is the tripartite graph together with a set of features describing each peptide-spectrum match (PSM). The score of a PSM is computed with a nonlinear function of the feature set, the score of a peptide is the maximum PSM score over all spectra mapped to that peptide, and the score of a protein is the normalized sum of its constituent peptide scores. An advantage of this model is that it uses the spectrum information in every step of protein inference, rather than discarding the spectra between peptide identification and protein inference. The parameters of the model are estimated by training on reference data, and the trained model is then used to infer proteins. Its application is limited by the need for reference data to retrain the model each time a different dataset is analyzed.
Since many well-developed search engines for peptide identification are available, methods for processing the peptide identification reports from these engines have been proposed. As an example, Li et al. used a nested mixture model to estimate peptide and protein probabilities simultaneously from the identified peptides and their search engine scores. This model allows evidence to feed back between proteins and their constituent peptides. The model is built on several reasonable assumptions, but it completely ignores the problem of shared peptides.
This paper proposes a unified framework for processing peptide identification results from database search engines. The goal is to output a list of proteins and a list of corresponding peptides at the same time. This is achieved by iteratively updating the two lists, with feedback from the inferred proteins to the selection of correct peptides. Specifically, the inferred protein sequences are used to search among the low-confidence peptides reported by the search engine, and the probabilities of these peptides are recomputed. Different protein inference methods can be designed according to this framework. Here, an iterative method for processing Sequest peptide identification reports is presented as an example. In addition, to address the challenge of assigning shared peptides, we propose an MS/MS intensity-based strategy that computes the probabilities of a shared peptide from the closeness between its intensity and the intensity of its siblings in each parent protein. We evaluate the iterative method on two datasets with known validity. The results show that it not only infers more true positive and fewer false positive proteins than ProteinProphet, but also identifies more true positive and fewer false positive peptides than PeptideProphet.
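The score aggregation described for Barista (peptide score as the maximum PSM score, protein score as a normalized sum of peptide scores) can be illustrated with a minimal sketch. This is not Barista itself: the function name `aggregate_scores` and the input layout are hypothetical, the PSM scores are taken as given rather than computed by a trained nonlinear function, and the normalization (division by the number of constituent peptides) is an assumption, since the normalization is not specified here.

```python
from collections import defaultdict


def aggregate_scores(psm_scores, peptide_to_proteins):
    """Aggregate PSM scores to peptide and protein scores.

    psm_scores: iterable of (spectrum_id, peptide, score) tuples.
    peptide_to_proteins: dict mapping a peptide to its parent proteins.
    """
    # Peptide score: maximum PSM score over all spectra mapped to the peptide.
    peptide_score = {}
    for _spectrum, peptide, score in psm_scores:
        best = peptide_score.get(peptide, float("-inf"))
        peptide_score[peptide] = max(best, score)

    # Protein score: sum of constituent peptide scores, normalized here by
    # the number of constituent peptides (an assumption for illustration).
    totals = defaultdict(float)
    counts = defaultdict(int)
    for peptide, score in peptide_score.items():
        for protein in peptide_to_proteins.get(peptide, ()):
            totals[protein] += score
            counts[protein] += 1
    protein_score = {k: totals[k] / counts[k] for k in totals}
    return peptide_score, protein_score
```

Note that a shared peptide contributes its full score to every parent protein in this sketch, which is exactly the kind of ambiguity that shared peptides introduce.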
In the following sections, an iterative method is implemented to process Sequest peptide identification reports according to the unified framework. A list of peptides and a list of proteins will be output simultaneously. The computation steps in the iteration process are introduced.
In the protein inference model, the superscripts $(n)$ and $(n-1)$ denote the iteration index, with $n \geq 1$. For simplicity, the iteration index is omitted below wherever it is not needed. $q_k$ is the probability that protein $Q_k$ is present in the sample; $n_k$ and $N_k$ are the numbers of experimental and theoretical peptides of protein $Q_k$; $x_i$ is the probability that peptide $P_i$ is correctly identified; and the remaining term is the probability that peptide $P_i$ comes from protein $Q_k$, the computation of which is introduced in the next section.
Here, it is assumed that the event “peptide $P_i$ is correctly identified, i.e., $P_i = 1$”, the event “peptide $P_i$ comes from protein $Q_k$, i.e., $P_i \in Q_k$”, and the event “protein $Q_k$ exists in the sample, i.e., $Q_k = 1$” are independent of each other. The reason is that whether peptide $P_i$ is identified does not depend on whether protein $Q_k$ is present in the sample: peptide $P_i$ could be generated by other proteins. In addition, the number of theoretical peptides $N_k$ is included so that the model accounts for the length of a protein. It is computed according to the following criteria: (1) cleavage by trypsin; (2) up to two missed cleavages are allowed; and (3) only peptides with masses within the range $[M_{min}, M_{max}]$ are counted. The minimum peptide mass $M_{min}$ and the maximum peptide mass $M_{max}$ are determined from the peptide identification reports. An alternative is to consider only peptides of a certain length.
It is difficult to compute the probabilities that a shared peptide belongs to its different parent proteins, because the connection between peptides and proteins is lost in proteomics experiments. Here we propose an MS/MS intensity-based strategy to assign shared peptides to truly present proteins. The idea is that, for a given peptide shared by proteins Q1 and Q2, if the peptide came from Q1, then its intensity will be closer to the intensity of its siblings in Q1 than to that of its siblings in Q2. Two peptides are siblings when they come from the same parent protein. The intensity of a peptide is computed from the signal peak intensities in its matched tandem mass spectra.
Here, $I_p$ is the peptide intensity and $N_s$ is the number of tandem mass spectra matched to the peptide. The per-spectrum term is the preliminary score in the Sequest output for the $i$-th tandem mass spectrum, which is the sum of the intensities of all signal peaks in the spectrum. It is weighted by the ratio between the experimental peaks and the theoretical peaks that can be derived from the peptide. This factor removes the unfair advantage of longer peptides over shorter ones. In addition, we normalize the score by the maximum value in each data set.
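The three criteria for counting theoretical peptides can be sketched as a small in-silico digest. This is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, cleavage is taken to occur after K or R except when the next residue is P (a common trypsin rule that is not stated above), peptide masses are monoisotopic and unmodified, and distinct peptide sequences are counted once.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def cleave(protein):
    """Split a protein after K or R, unless the next residue is P (assumed rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def theoretical_peptides(protein, m_min, m_max, max_missed=2):
    """Distinct tryptic peptides with at most max_missed missed cleavages
    and a mass within [m_min, m_max]."""
    pieces = cleave(protein)
    found = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            candidate = "".join(pieces[i:j + 1])
            if m_min <= peptide_mass(candidate) <= m_max:
                found.add(candidate)
    return found
```

Under these assumptions, $N_k$ for a protein would be `len(theoretical_peptides(sequence, m_min, m_max))`, with `m_min` and `m_max` read from the peptide identification reports.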
Here, $I_b$ is the average intensity of the siblings of a given shared peptide, $N_b$ is the number of those siblings, and the summand is the intensity of its $i$-th sibling peptide.
During the iteration, after the protein probabilities have been computed, high-confidence proteins are selected to replenish the list of peptides, and the new group of proteins and peptides is used to update the probabilities that peptides come from their parent proteins, as well as $n_k$.
It is worth pointing out that, although the probabilities of a shared peptide also sum to 1 as in ProteinProphet, our method does not require that a shared peptide come from only one truly present protein in the sample. In ProteinProphet, the weights of a shared peptide eventually converge so that one of them is equal or close to 1 and the others are equal or close to 0, because it assumes that a shared peptide can only come from one truly present protein. Under this assumption, a peptide is shared only because it could also be generated by some other proteins in the chosen database. This does not hold in practical experiments and misinterprets the real meaning of shared peptides. By removing this assumption, our probability allows a shared peptide to be assigned to multiple proteins in the sample, as long as these proteins have enough evidence to support their presence.
At this point, we have introduced all the computational steps of the iteration process. The initial protein probabilities are all set to 1, on the assumption that every protein has the same chance of being present in the sample as long as at least one of its constituent peptides has been identified. The initial value of the peptide probability $x_i$ is the probability output by PeptideProphet, and the initial values of the probabilities that peptides come from their parent proteins and of $n_k$ are computed from the Sequest reports.
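The intensity-based strategy can be sketched as follows. The sketch rests on assumptions that go beyond the text: the peptide intensity is taken as the mean of the normalized per-spectrum preliminary scores, and closeness between a shared peptide and its siblings is converted to weights by an inverse absolute difference normalized to sum to 1. The closeness function, the parameter `eps`, and all function names are hypothetical and are not the formula used in this work.

```python
def peptide_intensity(prelim_scores, dataset_max):
    """Mean of the per-spectrum preliminary scores of one peptide, each
    normalized by the maximum value in the data set (averaging is assumed)."""
    if not prelim_scores or dataset_max <= 0:
        raise ValueError("need at least one score and a positive maximum")
    return sum(s / dataset_max for s in prelim_scores) / len(prelim_scores)


def sibling_average(sibling_intensities):
    """Average intensity I_b over the N_b siblings of a shared peptide."""
    return sum(sibling_intensities) / len(sibling_intensities)


def shared_peptide_weights(shared_intensity, siblings_by_protein, eps=1e-6):
    """Weights of a shared peptide over its parent proteins, summing to 1.

    The closer the shared peptide's intensity is to the sibling average in
    a protein, the larger that protein's weight (assumed closeness measure).
    """
    closeness = {}
    for protein, siblings in siblings_by_protein.items():
        if siblings:
            gap = abs(shared_intensity - sibling_average(siblings))
            closeness[protein] = 1.0 / (gap + eps)
        else:
            # No siblings: no intensity evidence for this parent protein.
            closeness[protein] = 0.0
    total = sum(closeness.values())
    if total == 0.0:
        # No evidence anywhere: fall back to uniform weights.
        return {p: 1.0 / len(closeness) for p in closeness}
    return {p: c / total for p, c in closeness.items()}
```

For example, a shared peptide with intensity 0.5 whose siblings in Q1 have intensities 0.4 and 0.6, and whose siblings in Q2 both have intensity 0.1, receives a much larger weight for Q1 than for Q2.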
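The overall feedback loop can be outlined in a short sketch. It shows only the control flow: the protein inference model is passed in as a placeholder callable (`protein_model`), `toy_model` is a made-up stand-in used purely to exercise the loop, peptides are matched to proteins by plain substring search, and the values of `threshold`, `tol` and `max_iter` are hypothetical.

```python
def find_parents(peptides, protein_seqs):
    """Map each peptide to the proteins whose sequence contains it."""
    return {p: {k for k, seq in protein_seqs.items() if p in seq}
            for p in peptides}


def run_feedback(high_conf, low_conf, protein_seqs, protein_model,
                 threshold=0.9, tol=1e-6, max_iter=100):
    """Iterate protein inference with feedback to peptide selection."""
    peptides = set(high_conf)
    parents = find_parents(peptides, protein_seqs)
    # Every protein with at least one identified peptide starts at 1.
    q = {k: 1.0 for ks in parents.values() for k in ks}
    for _ in range(max_iter):
        new_q = protein_model(q, parents)
        positives = {k for k, v in new_q.items() if v >= threshold}
        # Feedback: recover low-confidence peptides found in the sequences
        # of high-confidence proteins.
        rescued = {p for p in low_conf
                   if any(p in protein_seqs[k] for k in positives)}
        new_peptides = set(high_conf) | rescued
        converged = (new_peptides == peptides
                     and new_q.keys() == q.keys()
                     and all(abs(new_q[k] - q[k]) < tol for k in new_q))
        peptides = new_peptides
        parents = find_parents(peptides, protein_seqs)
        q = {k: new_q.get(k, 1.0) for ks in parents.values() for k in ks}
        if converged:
            break
    return q, peptides


def toy_model(prev_q, parents):
    """Stand-in model (not the model of this paper): a protein reaches
    probability 1 once two peptides map to it."""
    counts = {}
    for ks in parents.values():
        for k in ks:
            counts[k] = counts.get(k, 0) + 1
    return {k: min(1.0, n / 2) for k, n in counts.items()}
```

With two proteins, one supported by two high-confidence peptides and one by a single peptide, the loop recovers a low-confidence peptide of the first protein and leaves the low-confidence peptide of the second protein out of the final list.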
Statistics of ISB standard protein mix datasets
The proposed method is compared with PeptideProphet for peptide identification and with ProteinProphet for protein inference. Specifically, we compare the numbers of true positive and false positive peptides and proteins produced by these methods.
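Because the validity of proteins and peptides is known for these datasets, the comparison reduces to set operations. The following sketch is illustrative only; the function name and the identifiers in the usage example are hypothetical.

```python
def count_tp_fp(reported, truly_present):
    """Count true and false positives among reported identifications."""
    reported = set(reported)
    truly_present = set(truly_present)
    true_positives = len(reported & truly_present)
    false_positives = len(reported - truly_present)
    return true_positives, false_positives
```

The same function applies to proteins and to peptides, since both are compared against a reference set of known identities.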
Protein SW:K2C1_HUMAN and its constituent peptides
We do not attempt a mathematical proof of the convergence of the iterative method here. Instead, an explanation is provided. The stopping criterion is the convergence of the probabilities of the putative proteins. According to the flowchart in Figure 1, the stopping criterion cannot be met while the putative protein list is still changing. In addition, the protein inference model ensures that the probabilities of proteins with high-confidence identified peptides converge to 1. Therefore, the convergence of the protein probabilities reduces to the protein list reaching a steady state. Since the protein list is produced by using a group of peptides to find their parent proteins, the stopping criterion further reduces to the peptide list reaching a steady state.
The steady state of the peptide list is guaranteed. Peptides can be classified into unique peptides and shared peptides. First, we show that shared peptides remain steady in the list. There are three kinds of shared peptides in the iteration process: those shared by both negative and positive proteins, those shared only by negative proteins, and those shared only by positive proteins. (Given the threshold for high-confidence proteins, proteins with a probability equal to or greater than the threshold are classified as positive; all others are negative.) If a peptide is shared only by negative proteins, it will not be selected into the high-confidence peptide list; if a peptide is shared only by positive proteins, it will be selected and will remain steady in the peptide list. If a peptide is shared by both negative and positive proteins, it will be included and will eventually remain steady in the peptide list.
This last case is ensured by an elimination rule for negative proteins that share peptides with positive proteins. During the iteration, some shared peptides are selected into the process by positive proteins. These peptides are then used to search for proteins, which introduces negative proteins into the iteration process. After several iterations, these negative proteins are removed because the protein inference model assigns them low probabilities. However, they are then re-selected into the cycle because of the peptides they share with positive proteins. Therefore, if these proteins were allowed to enter the iteration process, they would keep moving “in and out” of the putative protein list. Proteins that show this pattern are eliminated from the iteration process. The elimination rule removes negative proteins, which are usually also absent proteins, and thus increases the chance of assigning shared peptides to truly present proteins.
After the removal of those false proteins, shared peptides will only be considered from positive proteins, and thus shared peptides can remain steady in the peptide list.
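The elimination rule can be sketched as a small bookkeeping class. How many "in and out" cycles are tolerated before a protein is eliminated is not specified in the text, so `max_reentries` is a hypothetical parameter, and the class name is likewise illustrative.

```python
from collections import defaultdict


class InOutTracker:
    """Eliminate proteins that keep leaving and re-entering the putative list."""

    def __init__(self, max_reentries=1):
        self.max_reentries = max_reentries  # assumed tolerance, not from the paper
        self.present = set()                # current putative protein list
        self.removed_before = set()         # proteins that were dropped at least once
        self.reentries = defaultdict(int)   # number of re-entries per protein
        self.eliminated = set()             # proteins permanently ruled out

    def update(self, putative):
        """Register the putative list of one iteration; return the filtered list."""
        putative = set(putative) - self.eliminated
        for protein in putative - self.present:
            if protein in self.removed_before:
                self.reentries[protein] += 1
                if self.reentries[protein] >= self.max_reentries:
                    self.eliminated.add(protein)
        for protein in self.present - putative:
            self.removed_before.add(protein)
        self.present = putative - self.eliminated
        return set(self.present)
```

Calling `update` once per iteration with the proteins found by the current peptide list keeps oscillating proteins out of later iterations, so only proteins that stay in the list continue to contribute shared peptides.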
Similarly, there are three kinds of unique peptides: those unique to positive proteins, those unique to negative proteins that share no peptides with positive proteins, and those unique to negative proteins that share peptides with positive proteins. Of these, only unique peptides from positive proteins stay steady in the high-confidence peptide list. Unique peptides from negative proteins that share no peptides with positive proteins are never selected, while unique peptides from negative proteins that share peptides with positive proteins are eventually eliminated when those proteins are removed by the elimination rule. Therefore, unique peptides also remain steady in the peptide list, which completes the explanation of the convergence of the iterative method.
In addition, we briefly discuss the scalability of our method. Before the iterative method starts, two hash tables are constructed to organize the information about each peptide and each protein from the Sequest reports and the PeptideProphet probabilities; their construction complexities are $O(MN)$ and $O(N)$, respectively. When the iterative method is running, the cost of each iteration is $O(M \times \max_i M_i) + O(N \times \max_k n_k)$, with $i = 1, \dots, M$ and $k = 1, \dots, N$, where $M$ and $N$ are the total numbers of peptides and proteins involved in the iteration process, respectively, $M_i$ is the number of parent proteins of peptide $P_i$, and $n_k$ is the number of peptides mapped to protein $Q_k$.
This paper proposes a unified feedback framework for protein inference based on peptides identified from tandem mass spectra, and implements an iterative method that processes Sequest peptide identification reports according to this framework. The method outputs a list of peptides and a list of the corresponding proteins simultaneously. On the two datasets of standard proteins, the iterative method outperforms the popular programs PeptideProphet and ProteinProphet in identifying peptides and proteins. However, the current implementation of the iterative method is not yet ready for practical use in the way that PeptideProphet and ProteinProphet are. First, it has not yet been tested on complex datasets, so its rigor needs further examination. Second, it was developed mainly to test the framework, not for direct use like those programs. The results obtained show a clear advantage of this method, and we leave the development of a practical implementation as future work.
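The two lookup tables can be sketched with Python dictionaries, which are hash tables. The input layout (a list of peptide and protein pairs) and the function name are assumptions made for illustration; the actual tables also hold scores and probabilities from the reports.

```python
from collections import defaultdict


def build_indexes(matches):
    """Build peptide-to-proteins and protein-to-peptides hash tables.

    matches: iterable of (peptide, protein) pairs taken from the reports.
    """
    peptide_to_proteins = defaultdict(set)
    protein_to_peptides = defaultdict(set)
    for peptide, protein in matches:
        peptide_to_proteins[peptide].add(protein)
        protein_to_peptides[protein].add(peptide)
    return dict(peptide_to_proteins), dict(protein_to_peptides)


def per_iteration_bounds(peptide_to_proteins, protein_to_peptides):
    """Largest M_i and n_k, the factors in the per-iteration cost."""
    max_parents = max((len(v) for v in peptide_to_proteins.values()), default=0)
    max_peptides = max((len(v) for v in protein_to_peptides.values()), default=0)
    return max_parents, max_peptides
```

With these tables, one pass over all peptides touches at most $\max_i M_i$ parent proteins per peptide, and one pass over all proteins touches at most $\max_k n_k$ peptides per protein, which matches the two terms of the per-iteration cost.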
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The comments and suggestions given by the anonymous reviewers greatly improved the article.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.