Volume 11 Supplement 1
Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science
Peptide identification based on fuzzy classification and clustering
 Xijun Liang^{1},
 Zhonghang Xia^{2},
 Xinnan Niu^{3},
 Andrew J Link^{3},
 Liping Pang^{1},
 FangXiang Wu^{4} and
 Hongwei Zhang^{1}
DOI: 10.1186/1477-5956-11-S1-S10
© Liang et al; licensee BioMed Central Ltd. 2013
Published: 7 November 2013
Abstract
Background
Sequence database searching has been the dominant method for peptide identification: a large number of peptide spectra generated from LC/MS/MS experiments are searched, using a search engine, against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge.
Results
A novel scoring method named FCRanker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. In particular, the scores of PSMs are updated iteratively using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops.
Conclusions
Our experimental studies show that FCRanker outperforms other post-database search algorithms over a variety of datasets, and that it can be extended to solve general classification problems with uncertain labels.
Keywords
Peptide identification; Peptide spectrum matches (PSMs); Fuzzy support vector machine (SVM); Fuzzy silhouette
Background
In protein identification, observed peptide spectra are searched against theoretical fragmentation spectra derived from target databases. Peptide spectrum matches (PSMs) are scored by database search tools, and the high-scoring PSMs are selected as target PSMs. In fact, more than half of the selected PSMs are not correct [1]. Although many filters [2, 3] have been proposed to refine the outputs further, they do not generalize across different datasets.
To tackle this problem, PeptideProphet [4] used unsupervised learning to automatically select PSMs output by database search tools. Assuming that the PSM samples are drawn from a mixture distribution with one component for "correct" PSMs and one for "incorrect" PSMs, PeptideProphet applies the expectation maximization (EM) method to calculate the probability of each PSM being "correct". As PeptideProphet searches only the set of high-scoring PSMs for "correct" ones, some good low-ranked PSMs may be lost. Adaptive PeptideProphet was proposed in [5] to improve the performance of PeptideProphet by iteratively training a discriminant function from a set of top-ranked PSM samples. In [7–9], decoy databases were used to validate the performance of post-database search algorithms. In [6], PeptideProphet was extended to estimate a more accurate probability by incorporating decoy PSMs into a unified semi-supervised expectation maximization framework.
Support vector machines (SVMs) have also been studied for the peptide assignment problem in [10, 11]. Percolator [12] employed an SVM to iteratively adjust models so that target PSMs receive higher scores than decoy PSMs. As a semi-supervised learning model, Percolator did not make full use of the labels and samples of target PSMs. More recently, a fully supervised SVM learning model was proposed in [11] to improve the performance of Percolator by utilizing target PSM data: the "incorrect" target PSMs are viewed as noise, and a special loss function is employed to reduce the negative impact of this noise on the learning model. Although the classification model separates most good target PSMs from noise and decoy PSMs, all selected PSMs are treated in the same way.
In this paper, a new scoring method, FCRanker, is developed not only to identify reliable target PSMs, but also to evaluate the confidence of each target PSM. As good target PSMs are close to each other, FCRanker integrates sample clustering into the classification procedure to compute the possibility of each target PSM being correct. Compared with the standard SVM model, the proposed fuzzy classification model assigns a weight to each target PSM indicating its likelihood of being correct. The score of each PSM sample is computed by combining its discriminant function value and its fuzzy silhouette value. The algorithm repeatedly updates the values of the discriminant function and the fuzzy silhouette index for each PSM sample, and recomputes the weights of the targets until it stops. In experimental studies, FCRanker shows a large overlap with PeptideProphet and Percolator in the identified target PSMs, while identifying more target PSMs on all datasets.
Table 1 Statistics of datasets
Dataset  Total  Target set                  Decoy set
                Total  Full  Half  None     Total  Full  Half  None
Yeast    14891  6702   1453  1210  4039     8189   106   1465  6618
UPS1     17335  8974   645   2013  6316     8361   118   1707  6536
Tal08    18653  9907   1081  2133  6693     8746   164   1923  6659
Results and discussion
The FCRanker algorithm is compared with PeptideProphet [4] and Percolator [12] to validate its effectiveness. All experiments were run on a PC with an Intel(R) CPU (1.80 GHz × 2) and 2.0 GB of RAM.
Experimental Setup
Dataset
FCRanker was examined on three datasets: S. cerevisiae Gcn4 (Yeast), Universal Proteomics Standard (UPS1) and Tal08 [14]. Trypsin digestion of the protein samples generates three types of tryptic peptides: full-digested (both ends of a peptide satisfy the enzyme specificity rule), half-digested (only one end satisfies the rule) and none-digested (neither end satisfies the rule). The database of Yeast protein sequences was obtained from the Saccharomyces Genome Database (SGD) [15], and the Sigma48 protein sequence database from NCBI GenBank [16]. The attributes of each PSM sample include xcorrelation, deltacn, ions, sprank and calcneutralpepmass.
The UPS1 sample contains 48 purified human proteins, and the SEQUEST search results on it comprise 17,335 PSMs: 8974 target PSMs and 8361 decoy PSMs. The Yeast dataset contains 6652 proteins and 14,891 PSMs: 6702 target PSMs and 8189 decoy PSMs. The Tal08 dataset contains 18,653 PSMs in total: 9907 target PSMs and 8746 decoy PSMs.
Statistics of the three datasets are listed in Table 1.
Preprocess
In addition to the attributes output by SEQUEST, such as xcorrelation, deltacn, ions, sprank and calcneutralpepmass, another attribute, "digested type", was added to the representation, with the scalars "2", "1" and "0" for the full-digested, half-digested and none-digested types, respectively. The values of each attribute were first transformed linearly so that they have zero mean and unit variance (the normalization process). After normalization, we multiplied the values of the xcorrelation and deltacn attributes by a weight of 2.0, because these two attributes carry more importance in the data representation. As experience shows that the attribute "digested type" also plays an important role, a weight of 2.0 was likewise applied to the normalized values of this attribute.
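The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `preprocess`, the dictionary layout, the key `digested_type`, the use of the population standard deviation, and the toy values are all assumptions made for the example.

```python
from statistics import mean, pstdev

# Attribute weights following the text: 2.0 for xcorrelation, deltacn and
# digested type; every other attribute keeps weight 1.0.
WEIGHTS = {"xcorrelation": 2.0, "deltacn": 2.0, "digested_type": 2.0}

def preprocess(rows):
    """Z-score each attribute, then multiply it by its weight.

    rows: list of dicts mapping attribute name -> numeric value.
    Returns a new list of dicts with the transformed values.
    """
    names = list(rows[0].keys())
    out = [dict() for _ in rows]
    for name in names:
        col = [r[name] for r in rows]
        mu = mean(col)
        sd = pstdev(col)  # population standard deviation (an assumption)
        w = WEIGHTS.get(name, 1.0)
        for o, v in zip(out, col):
            # A constant column carries no information; map it to 0.
            o[name] = w * (v - mu) / sd if sd > 0 else 0.0
    return out

# Toy values, invented purely for illustration.
psms = [
    {"xcorrelation": 1.0, "deltacn": 0.10, "ions": 10, "digested_type": 2},
    {"xcorrelation": 2.0, "deltacn": 0.20, "ions": 12, "digested_type": 1},
    {"xcorrelation": 3.0, "deltacn": 0.30, "ions": 14, "digested_type": 0},
]
scaled = preprocess(psms)
```

After this step the weighted attributes have standard deviation 2.0 and the unweighted ones 1.0, so xcorrelation, deltacn and digested type contribute more to any distance or kernel computed on the transformed vectors.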
Parameter setting
The Gaussian kernel was chosen, with parameter σ = 2.0.
In the iterations of the FCRanker algorithm, we set n = 70 in Eq. (10), and $\widehat{p} = 0.03\,|\Omega_{+}|$ and $\hat{sep} = 0.25$ in Eq. (15). The strategy for solving large-scale problems was employed as described in the subsection "FCRanker for the large-scale problem", where the parameter ρ was chosen as 0.2.
Validation of sep throughout iterations
Comparison of target PSMs
Target PSMs output by PeptideProphet, Percolator and FCRanker
Dataset  Method          TP+FP  TP                         FP
                                Total  Full  Half  None    Total
Yeast    PeptideProphet  1481   1443   1374  68    1       38
         Percolator      1429   1393   1342  51    1       36
         FCRanker        1513   1475   1376  83    16      38
UPS1     PeptideProphet  582    566    403   147   16      16
         Percolator      450    438    278   144   16      12
         FCRanker        698    681    444   198   39      17
Tal08    PeptideProphet  982    957    881   76    0       25
         Percolator      978    953    895   58    0       25
         FCRanker        1119   1092   865   173   54      27
On the UPS1 dataset (Figure 2B), the three algorithms have 383 target PSMs in common. The overlap covers 67.7% of the total target PSMs identified by PeptideProphet, 87.4% of those by Percolator and 56.2% of those by FCRanker. In particular, PeptideProphet and FCRanker share 520 target PSMs, covering 91.9% of the total target PSMs by PeptideProphet and 76.4% by FCRanker; Percolator and FCRanker share 406 target PSMs, covering 92.7% of the total target PSMs by Percolator and 59.6% by FCRanker. Moreover, FCRanker identified 137 PSMs (24.2%) selected by PeptideProphet but not covered by Percolator, and found 23 PSMs (5.3%) selected by Percolator but not covered by PeptideProphet.
On the Tal08 dataset (Figure 2C), the three algorithms have 829 PSMs in common. The overlap covers 86.6% of the total target PSMs identified by PeptideProphet, 87.0% of those by Percolator and 75.9% of those by FCRanker. In particular, PeptideProphet and FCRanker share 862 target PSMs, covering 90.1% of the total target PSMs by PeptideProphet and 78.9% by FCRanker; Percolator and FCRanker share 847 target PSMs, covering 88.9% of the total target PSMs by Percolator and 77.6% by FCRanker. Moreover, FCRanker identified 33 PSMs (3.4%) selected by PeptideProphet but not covered by Percolator, and found 18 PSMs (1.9%) selected by Percolator but not covered by PeptideProphet.
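The coverage percentages quoted above are simple ratios of a shared count to a method's true-positive total. The short check below recomputes the UPS1 figures for the three-way overlap; the helper name `coverage` is introduced here for illustration, and the totals are read from the TP column of the comparison table.

```python
def coverage(common, total):
    """Percentage of `total` covered by `common`, rounded to one decimal."""
    return round(100.0 * common / total, 1)

# True-positive totals on UPS1, from the TP column of the comparison table.
tp = {"PeptideProphet": 566, "Percolator": 438, "FCRanker": 681}
common_all_three = 383

for method, total in tp.items():
    print(method, coverage(common_all_three, total))
# PeptideProphet 67.7
# Percolator 87.4
# FCRanker 56.2
```

The same helper reproduces the pairwise figures, for example `coverage(520, 566)` for the share of PeptideProphet's target PSMs also found by FCRanker.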
ROC curve
Methods
Classification and clustering methods for peptide identification
Fuzzy clustering
Clustering analysis is an unsupervised learning method that groups similar data samples together. The silhouette index was introduced in [17, 18] to measure how well a sample belongs to its cluster.
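For reference, the classical (non-fuzzy) silhouette of a sample compares its mean distance a to the other members of its own cluster with the smallest mean distance b to any other cluster, giving s = (b − a) / max(a, b). The sketch below implements this standard definition with Euclidean distance; it is not the fuzzy variant developed in this paper, and the function name and toy points are assumptions for illustration.

```python
import math

def silhouette(i, points, labels):
    """Classical silhouette value of sample i, using Euclidean distance."""
    own = labels[i]
    # Distances to the other members of the sample's own cluster.
    same = [math.dist(points[i], p) for j, p in enumerate(points)
            if labels[j] == own and j != i]
    if not same:
        return 0.0  # usual convention for singleton clusters
    a = sum(same) / len(same)
    # Smallest mean distance to the members of any other cluster.
    b = min(
        sum(math.dist(points[i], p) for j, p in enumerate(points) if labels[j] == c)
        / labels.count(c)
        for c in set(labels) if c != own
    )
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0

# Two well-separated toy clusters: the values should be close to 1.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [1, 1, -1, -1]
values = [silhouette(i, points, labels) for i in range(len(points))]
```

Values near 1 indicate a sample that sits well inside its cluster, values near 0 a sample on the boundary, and negative values a sample that is closer to another cluster.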
Classification
Our task is to identify the correct PSMs from a set of PSMs generated by database search tools. Decoy PSMs are usually employed to validate target PSMs, so the PSM samples can be categorized into a "good" class, with label "+1", and a "bad" class, with label "−1". In the classification setting, we use a vector of attributes such as xcorrelation, deltacn, ions, sprank and calcneutralpepmass to represent a PSM data sample. Let {x_i} ⊆ R^q, i = 1, ..., l, be the PSM data samples, where q is the number of attributes. We aim to find a discriminant function f : R^q → R that classifies the PSM data samples according to their labels.
One of the greatest challenges in peptide identification is the lack of data samples with deterministic +1 labels. In a standard classification setting, the discriminant function is obtained by training a model on two balanced classes of data samples with deterministic labels. In the peptide identification problem, however, a great number of the PSMs generated by database search engines are incorrect, so the data samples with +1 labels are quite unreliable. The large number of data samples with incorrect +1 labels would therefore severely distort the trained discriminant function if they were used directly in a standard classification model.
where b ∈ R and k(·,·) is a chosen kernel function. The label of a data sample x is predicted as +1 if f(x) > 0; otherwise it is predicted as −1. A quadratic program is usually solved to obtain the coefficients α and b, which incurs a heavy computational overhead, especially for large-scale problems. To overcome this problem, a class of linear programming SVMs was introduced in [19].
where c > 0 is a given constant, and the discriminant function is $f(\cdot) = \sum_{j=1}^{l} \alpha_j y_j k(x_j, \cdot) + b$.
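As a small illustration of how such a kernel discriminant function is evaluated once α and b are known, the sketch below computes f(x) = Σ_j α_j y_j k(x_j, x) + b and applies the sign rule for prediction. The Gaussian kernel form exp(−‖u − v‖² / (2σ²)), the function names, and the toy coefficients are assumptions for the example; the coefficients are not the result of solving any SVM model.

```python
import math

def gaussian_kernel(u, v, sigma=2.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)); this form is assumed."""
    return math.exp(-math.dist(u, v) ** 2 / (2.0 * sigma ** 2))

def discriminant(x, train_x, train_y, alpha, b, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_j alpha_j * y_j * k(x_j, x) + b."""
    return sum(a * y * kernel(xj, x)
               for a, y, xj in zip(alpha, train_y, train_x)) + b

def predict(x, train_x, train_y, alpha, b):
    """Predict +1 if f(x) > 0, otherwise -1."""
    return 1 if discriminant(x, train_x, train_y, alpha, b) > 0 else -1

# Toy model with hand-picked coefficients (not trained).
train_x = [(0.0, 0.0), (4.0, 0.0)]
train_y = [+1, -1]
alpha = [1.0, 1.0]
b = 0.0

print(predict((0.5, 0.0), train_x, train_y, alpha, b))  # 1
print(predict((3.5, 0.0), train_x, train_y, alpha, b))  # -1
```

Each evaluation touches every training sample with a nonzero α_j, which is one reason sparse solutions and reduced training sets matter for large problems.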
The basic FCRanker algorithm
In this section, the FCRanker algorithm is presented to calculate the score of each PSM data sample. The score values reflect the possibility of the PSM data samples being correct, and the PSMs with high scores are finally reported to the user.
On the other hand, a data sample may not be a good target PSM either if it lies comparatively close to other target PSMs but has a small discriminant function value. The data sample represented by "⊕" in Figure 5 should also be excluded from the set Ω_1. These observations suggest that a good target PSM data sample should satisfy two rules: 1) it has a large discriminant function value; 2) it is close to other target PSMs.
Fuzzy SVM classification
A weight θ_i ∈ [0, 1] is introduced for each target sample x_i indexed by Ω_+ to indicate its possibility of being correct, since its label is not trustworthy. A large weight usually indicates that the PSM is more likely to be correct. Since the decoy PSMs are certainly incorrect, we fix the weights θ_i at 1 for the samples indexed by Ω_−. Let loss(f(x_i), y_i) denote the empirical error of sample x_i; then the total empirical error can be formulated as $\sum_{i \in \Omega} \mathrm{loss}(f(x_i), y_i)$ in traditional classification problems with deterministic labels. Assigning a weight to each data sample, we reformulate the total empirical error as $\sum_{i \in \Omega} \theta_i \, \mathrm{loss}(f(x_i), y_i)$.
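The effect of the weights on the total empirical error can be seen in a few lines. The sketch below uses the hinge loss max(0, 1 − y f(x)) purely as an illustrative choice of loss; the function names and the toy numbers are assumptions and are not taken from the paper.

```python
def hinge_loss(f_value, y):
    """Hinge loss max(0, 1 - y * f(x)); chosen here only for illustration."""
    return max(0.0, 1.0 - y * f_value)

def weighted_empirical_error(f_values, labels, theta, loss=hinge_loss):
    """Sum over samples of theta_i * loss(f(x_i), y_i)."""
    return sum(t * loss(f, y) for f, y, t in zip(f_values, labels, theta))

f_values = [2.0, -0.5, -1.5]    # toy discriminant values
labels = [+1, +1, -1]           # two targets followed by one decoy
theta_hard = [1.0, 1.0, 1.0]    # deterministic labels: every sample counts fully
theta_fuzzy = [1.0, 0.2, 1.0]   # the doubtful second target is down-weighted

print(weighted_empirical_error(f_values, labels, theta_hard))   # 1.5
print(weighted_empirical_error(f_values, labels, theta_fuzzy))
```

With unit weights the misclassified target contributes its full loss of 1.5; with θ = 0.2 the same sample contributes only a fifth of that, so a likely-incorrect target PSM pulls the trained discriminant function much less.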
where α ∈ R^l, b ∈ R^1, r ∈ R^1 and ξ = [ξ_1, ..., ξ_l] ∈ R^l. Model (2) is referred to as the fuzzy linear programming SVM model.
where θ = [θ_1, ..., θ_l]^T, Λ(y) = Diag(y), 0_l ∈ R^l is the zero vector, 1_l ∈ R^l is the vector with every element equal to 1, I_l is the l × l identity matrix, and K = (k(x_i, x_j))_{1≤i≤l, 1≤j≤l}. The model can be solved with existing optimization software, such as Mosek.
Fuzzy silhouette
To adapt to situations with uncertain labels, we generalize the silhouette concept from the deterministic setting to a fuzzy silhouette index.
It measures the degree to which a PSM sample lies far from the decoys and close to the good target samples. Hence, a PSM data sample is more likely to be correct if it has a large fuzzy silhouette value.
as a metric to indicate the separation degree of decoy PSM samples and good PSMs.
Score of the samples
where f_max and s_max are the largest values of {|f(x_i) − f_0|} and {|s_i − s_0|} for i ∈ Ω_+, respectively, f_0 is the threshold on the values of the discriminant function, and s_0 is the threshold on the fuzzy silhouette. The power of $\frac{1}{4}$ on |f(x_i) − f_0| is introduced to smooth the weight contributions.
The FCRanker algorithm
The FCRanker algorithm iteratively adjusts the index set Ω_1 of good PSMs by calculating the scores and weights of the data samples until a stop criterion is met. Initially, the algorithm sets $\Omega_1^0 = \Omega_+$ and $\Omega_0^0 = \emptyset$, i.e. all target PSM samples are viewed as good ones at iteration 0. At iteration k, the algorithm solves the fuzzy linear programming SVM model (3), calculates the fuzzy silhouette values of the samples according to Eq. (5), and updates the index sets Ω_1 and Ω_0 such that the indices of target PSMs in Ω_1 with small scores are moved to Ω_0, while the indices of target PSMs in Ω_0 with large scores are moved to Ω_1.
Then the indices of the samples indexed by $\Omega_0^{k+1/2}$ are moved to $\Omega_1^{k+2/3}$ if the samples have large score values,
where $\overline{f}_1^{k+2/3}$ is the average of $\{ f(x_i) \mid i \in \Omega_1^{k+2/3} \}$.
The FCRanker algorithm is summarized in Algorithm 1.
Algorithm 1 The FCRanker Algorithm
Input: {x_i, y_i}, i ∈ Ω;
Output: Scores of samples indexed by Ω;
1: Initialization: k = −1, $\Omega_1^0 = \Omega_+$, $\Omega_0^0 := \emptyset$, $\theta_i^0 = 1$, i ∈ Ω.
2: while stop criterion (15) is not satisfied do
3:   k := k + 1.
4:   SVM classification.
5:     Solve the fuzzy SVM classification model Eq. (3);
6:     Calculate $\Omega_1^{k+1/3}$ via Eq. (10).
7:   Clustering analysis.
8:     Calculate fuzzy silhouettes s_i, i ∈ Ω, via Eq. (5);
9:     Calculate $\Omega_1^{k+2/3}$, $\Omega_0^{k+1/2}$ via Eqs. (11), (12).
10:  Update weights.
11:    Calculate score(i)^{k+1}, θ^{k+1} via Eqs. (7), (13);
12:    Calculate $\Omega_1^{k+1}$, $\Omega_0^{k+1}$, sep^{k+1} via Eqs. (14), (6).
13: end while
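The bookkeeping at the heart of each iteration, moving low-scoring indices out of Ω_1 and high-scoring indices back in, can be sketched with plain sets. This is only an illustration of the idea: the function name `update_partition` and the two fixed thresholds `low` and `high` are hypothetical stand-ins, since the actual selection rules are given by Eqs. (10)-(14), which are not reproduced here.

```python
def update_partition(scores, omega1, omega0, low, high):
    """Move low-scoring indices from omega1 to omega0 and high-scoring ones back.

    scores: dict mapping target index -> current score.
    omega1, omega0: disjoint sets of target indices (good / doubtful).
    low, high: hypothetical thresholds standing in for the paper's rules.
    """
    demote = {i for i in omega1 if scores[i] < low}
    promote = {i for i in omega0 if scores[i] > high}
    new_omega1 = (omega1 - demote) | promote
    new_omega0 = (omega0 - promote) | demote
    return new_omega1, new_omega0

scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.3}   # toy scores
omega1, omega0 = {0, 1}, {2, 3}
omega1, omega0 = update_partition(scores, omega1, omega0, low=0.2, high=0.7)
print(sorted(omega1), sorted(omega0))  # [0, 2] [1, 3]
```

Because the two sets are updated from the same snapshot of scores, an index can move at most once per call, and the union of the two sets always remains the full set of target indices.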
FCRanker for the large-scale problem
The number of PSMs output by a database search engine is usually extremely large. In this section, some implementation practices are discussed that make the algorithm capable of solving large-scale problems.
Fuzzy SVM classification for the large-scale problem
If the data matrix is sparse, interior-point algorithms are competent at solving large-scale linear programming problems. Unfortunately, the kernel matrix K in Problem (3) is not sparse in general. In fact, K is usually quite dense, with most of its elements nonzero. Storing a large dense matrix K is not a trivial task. Take a matrix K with a Gaussian kernel and l = 400,000 as an example: if each element occupies four bytes, then K has l^2 = 1.6 × 10^{11} elements and takes up 640 GB of storage in all.
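The storage estimate above is straightforward arithmetic, reproduced here as a quick check. The helper name `dense_kernel_bytes` is introduced for illustration, and gigabytes are taken as decimal (10^9 bytes).

```python
def dense_kernel_bytes(l, bytes_per_element=4):
    """Storage in bytes needed for a dense l x l kernel matrix."""
    return l * l * bytes_per_element

l = 400_000
print(l * l)                        # 160000000000 elements, i.e. 1.6e11
print(dense_kernel_bytes(l) / 1e9)  # 640.0 gigabytes
```

Since the requirement grows quadratically in l, halving the number of samples used to build K cuts the storage by a factor of four, which motivates working with a reduced subset of samples.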
where $\alpha \in R^{l'}$, b ∈ R^1, r ∈ R^1, ξ ∈ R^l, and Λ(y′) = Diag(y′).
Fuzzy silhouette for the largescale problem
To update the fuzzy silhouette value s_i of sample i, the major work is to compute $\beta_i^{1}$ and $\beta_i^{-1}$ in Eq. (4), which requires l distance calculations. In all, each iteration computes |Ω| · |Ω| = l^2 distances over all samples. Let ρ ∈ (0, 1) be a given sample rate. We sample ρ|Ω_1| indices of targets from Ω_1 and ρ|Ω_{−1}| indices of decoys from Ω_{−1}, denoted by $\Omega_1'$ and $\Omega_{-1}'$ respectively, to substitute for Ω_1 and Ω_{−1} in Eq. (4). Then at most ρl(|Ω_{−1}| + |Ω_1|) ≤ ρl^2 distances need to be calculated at each iteration.
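The saving from subsampling can be illustrated as follows. The sketch draws a fraction ρ of the targets and of the decoys and computes, for one sample x, its mean Euclidean distance to each subsample. The plain means are only a stand-in for the quantities in Eq. (4), which is not reproduced here; the function name, the seeded generator, and the toy points are assumptions for the example.

```python
import math
import random

def sampled_mean_distances(x, targets, decoys, rho, rng):
    """Mean distance from x to a random fraction rho of targets and of decoys.

    Returns (mean to sampled targets, mean to sampled decoys,
    number of distances evaluated for x).
    """
    n_t = max(1, int(rho * len(targets)))
    n_d = max(1, int(rho * len(decoys)))
    t_sub = rng.sample(targets, n_t)   # sampling without replacement
    d_sub = rng.sample(decoys, n_d)
    mean_t = sum(math.dist(x, p) for p in t_sub) / n_t
    mean_d = sum(math.dist(x, p) for p in d_sub) / n_d
    return mean_t, mean_d, n_t + n_d

# Toy data: targets on the line y = 0, decoys on the line y = 10.
targets = [(float(i), 0.0) for i in range(50)]
decoys = [(float(i), 10.0) for i in range(50)]
rng = random.Random(0)  # fixed seed so the run is repeatable

mean_t, mean_d, n_dist = sampled_mean_distances((0.0, 0.0), targets, decoys, 0.2, rng)
print(n_dist)  # 20 distances instead of 100
```

With ρ = 0.2, as used in the experiments, each sample needs roughly a fifth of the distance evaluations, at the price of some sampling noise in the resulting averages.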
Conclusion
A new scoring method has been developed based on the iterative FCRanker algorithm, which combines a fuzzy silhouette index with a fuzzy SVM classification model to cope with the large number of incorrect labels among target PSM samples. In the fuzzy classification model, each PSM is assigned a calculated weight that indicates the possibility of the PSM sample being correct. The performance of the FCRanker algorithm has been compared with PeptideProphet and Percolator on the Yeast, UPS1 and Tal08 datasets, showing that FCRanker surpasses PeptideProphet and Percolator in terms of ROC and the number of identified target PSM samples under the same FDR level. Moreover, FCRanker outputs more target PSMs than PeptideProphet and Percolator do, while the three methods share a large number of PSMs in common.
Abbreviations
PSMs: peptide spectrum matches
SVM: support vector machine
Declarations
Acknowledgements
XN and AJL were supported by NIH grant GM64779. LP was supported by NSF of China under grant 11171049.
Declarations
The publication costs for this article were funded by Xijun Liang.
This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.
Authors’ Affiliations
References
1. Elias J, Gygi S: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods 2007, 4(3):207–214. 10.1038/nmeth1019
2. Perkins D, Pappin D, Creasy D, Cottrell J: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
3. Ramakrishnan S, Mao R, Nakorchevskiy A, Prince J, Willard W, Xu W, Marcotte E, Miranker D: A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics 2006, 22(12):1524–1531. 10.1093/bioinformatics/btl118
4. Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry 2002, 74(20):5383–5392. 10.1021/ac025747h
5. Ding Y, Choi H, Nesvizhskii A: Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. Journal of proteome research 2008, 7(11):4878–4889. 10.1021/pr800484x
6. Choi H, Nesvizhskii A: Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. Journal of proteome research 2007, 7:254–265.
7. Richard E, Knierman M, Freeman A, Gelbert L, Patil S, Hale J: Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. Journal of proteome research 2007, 6(5):1758–1767. 10.1021/pr0605320
8. Olsen J, Mann M: Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(37):13417–22. 10.1073/pnas.0405549101
9. Bianco L, Mead J, Bessant C: Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG 2006 standard MS/MS data sets. Journal of proteome research 2009, 8(4):1782–1791. 10.1021/pr800792z
10. Anderson D, Li W, Payan D, Noble W: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. Journal of proteome research 2003, 2(2):137–146. 10.1021/pr0255654
11. Spivak M, Weston J, Bottou L, Käll L, Noble W: Improvements to the percolator algorithm for Peptide identification from shotgun proteomics data sets. Journal of proteome research 2009, 8(7):3737–3745. 10.1021/pr801109k
12. Käll L, Canterbury J, Weston J, Noble W, MacCoss M: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4(11):923–925. 10.1038/nmeth1113
13. Liang X, Xia Z, Niu X, Link AJ, Pang L, Wu F, Zhang H: A fuzzy cluster-based algorithm for peptide identification. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on. IEEE; 2012:602–609.
14. Sanders S, Jennings J, Canutescu A, Link A, Weil P: Proteomics of the eukaryotic transcription machinery: identification of proteins associated with components of yeast TFIID by multidimensional mass spectrometry. Molecular and cellular biology 2002, 22(13):4723–4738. 10.1128/MCB.22.13.4723-4738.2002
15. SGD: Saccharomyces Genome Database. 2012. [http://www.yeastgenome.org]
16. GenBank: NCBI GenBank. 2012. [http://www.ncbi.nlm.nih.gov/genbank]
17. Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 1987, 20:53–65.
18. Petrovic S: A comparison between the silhouette index and the Davies-Bouldin index in labelling IDS clusters. Proceedings of the 11th Nordic Workshop of Secure IT Systems 2006, 53–64.
19. Zhou W, Zhang L, Jiao L: Linear programming support vector machines. Pattern recognition 2002, 35(12):2927–2936. 10.1016/S0031-3203(01)00210-2
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.