Peptide identification based on fuzzy classification and clustering
Proteome Science volume 11, Article number: S10 (2013)
Abstract
Background
Sequence database searching has been the dominant method for peptide identification: a large number of peptide spectra generated from LC/MS/MS experiments are searched, using a search engine, against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge.
Results
A novel scoring method named FCRanker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. In particular, the scores of PSMs are updated iteratively by using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops.
Conclusions
Our experimental studies show that FCRanker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.
Background
In protein identification, observed peptide spectra are searched against theoretical fragmentation spectra derived from target databases. Peptide spectrum matches (PSMs) are scored by database search tools, and the high-scored PSMs are selected as target PSMs. In fact, more than half of the selected PSMs are not correct [1]. Although many filters [2, 3] have been proposed to refine the outputs further, they are not universal for different datasets.
To tackle this problem, PeptideProphet [4] used unsupervised learning to automatically select PSMs output by database search tools. Based on the assumption that the PSM samples are drawn from a mixture distribution which represents the chance of a "correct" PSM and an "incorrect" PSM, PeptideProphet applies the expectation maximization (EM) method to calculate the possibility of each PSM being "correct". As only the set of high-scored PSMs is searched for "correct" ones by PeptideProphet, some good low-ranked PSMs may be lost. Adaptive PeptideProphet was proposed in [5] to improve the performance of PeptideProphet by iteratively training a discriminant function from a set of top-ranked PSM samples, while [6] attempted to extend PeptideProphet by exploiting decoy PSMs in semi-supervised learning. In [7–9], decoy databases were used to validate the performance of post-database search algorithms. In [6], decoy PSMs are combined into a unified semi-supervised expectation maximization framework to estimate a more accurate probability.
Support vector machines (SVMs) have also been studied for the peptide assignment problem in [10, 11]. Percolator [12] employed the SVM to iteratively adjust models fitting target PSMs with higher scores than decoy PSMs. As a semi-supervised learning model, Percolator did not make full use of the labels and samples of target PSMs. More recently, a fully supervised SVM learning model was proposed in [11] to improve the performance of Percolator by utilizing target PSM data, where the "incorrect" target PSMs are viewed as noise, and a special loss function is employed to reduce the negative impact of this noise on the learning model. Although most good target PSMs are separated from noise and decoy PSMs by the classification learning model, all selected PSMs are treated in the same way.
In this paper, a new scoring method, FCRanker, is developed not only to identify reliable target PSMs, but also to evaluate the confidence of each target PSM. As good target PSMs are close to each other, FCRanker integrates sample clustering into the classification procedure to compute the possibility of each target PSM being correct. Compared with the standard SVM model, the proposed fuzzy classification model assigns a weight to each target PSM indicating its likelihood of being correct. The score of each PSM sample is computed by combining the discriminant function value and the fuzzy silhouette value. The algorithm repeatedly updates the values of the discriminant function and the fuzzy silhouette index for each PSM sample, and recomputes the weights of the targets until the algorithm stops. In experimental studies, while the target PSMs identified by FCRanker overlap largely with those of PeptideProphet and Percolator, FCRanker identified more target PSMs in all datasets.
The first stage of the work was published in [13]. In this work, we compared the FCRanker algorithm with another benchmark method, Percolator, in the experimental studies. As Percolator is based on an SVM learning model, it provides a better reference for performance comparison. Furthermore, we added a new dataset, Tal08, whose characteristics (refer to Table 1) differ from those of the Yeast and UPS1 datasets. The performance of the proposed FCRanker algorithm has been evaluated on all three datasets in terms of the number of target PSMs, overlaps and ROC curves, and compared with PeptideProphet and Percolator. The new data analysis and results reinforce the efficiency of the proposed FCRanker method.
Results and discussion
The FCRanker algorithm is compared with PeptideProphet [4] and Percolator [12] to validate its effectiveness. All experiments were run on a PC with an Intel(R) CPU (1.80 GHz × 2) and 2.0 GB of RAM.
Experimental Setup
Dataset
FCRanker was examined over three datasets: S. cerevisiae Gcn4 (Yeast), Universal Proteomics Standard (UPS1) and Tal08 [14]. Trypsin digestion of the protein samples generates three types of tryptic peptides: full-digested (both ends of a peptide satisfy the enzyme specificity rule), half-digested (only one end satisfies the rule) and none-digested (neither end satisfies the rule). The database of Yeast protein sequences was obtained from the Saccharomyces Genome Database (SGD) [15] and the Sigma48 protein sequence database from the NCBI gene bank [16]. The attributes of each PSM sample include xcorrelation, deltacn, ions, sprank and calcneutralpepmass.
The SEQUEST search results on UPS1 cover 48 purified human proteins and contain 17,335 PSMs, consisting of 8974 target PSMs and 8361 decoy PSMs. The Yeast dataset covers 6652 proteins and contains 14,891 PSMs, consisting of 6702 target PSMs and 8189 decoy PSMs. The Tal08 dataset contains 18,653 PSMs in total, consisting of 9907 target PSMs and 8746 decoy PSMs.
Statistics of the three datasets are listed in Table 1.
Preprocess
In addition to the attributes output by SEQUEST (xcorrelation, deltacn, ions, sprank and calcneutralpepmass), another attribute, "digested type", was added to the representation, with scalars "2", "1" and "0" for the full-digested, half-digested and none-digested types, respectively. The values of each attribute were transformed linearly beforehand so that they have zero mean and unit variance (normalization). After normalization, the values of the xcorrelation and deltacn attributes are multiplied by a weight of 2.0, since these two attributes are more important in the data representation. As experimental experience shows that the attribute "digested type" also plays an important role, a weight of 2.0 is likewise applied to its normalized values.
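The preprocessing steps above can be illustrated with a small sketch. It is not the authors' code: the function name `preprocess`, the column order, the toy values and the use of the population standard deviation are assumptions made for illustration; the attribute names and the weight 2.0 come from the text.

```python
import math

# Attribute order is an assumption made for this sketch.
ATTRIBUTES = ["xcorrelation", "deltacn", "ions", "sprank",
              "calcneutralpepmass", "digested_type"]
# Weight 2.0 for the three attributes named in the text, 1.0 otherwise.
WEIGHTS = {"xcorrelation": 2.0, "deltacn": 2.0, "digested_type": 2.0}


def preprocess(rows):
    """Standardize each column to zero mean and unit variance, then weight it."""
    n = len(rows)
    out = [[0.0] * len(ATTRIBUTES) for _ in range(n)]
    for j, name in enumerate(ATTRIBUTES):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        # A constant column carries no information and is mapped to zeros.
        scale = WEIGHTS.get(name, 1.0) / std if std > 0 else 0.0
        for i in range(n):
            out[i][j] = (col[i] - mean) * scale
    return out


# Toy PSMs (made-up values); the last column encodes the digestion type 2/1/0.
toy = [
    [3.1, 0.40, 12.0, 1.0, 1500.2, 2.0],
    [1.2, 0.05, 5.0, 40.0, 980.7, 0.0],
    [2.0, 0.20, 8.0, 3.0, 1210.4, 1.0],
]
scaled = preprocess(toy)
```

After this transformation a weighted column has variance 4 (weight 2.0 squared) and an unweighted column has variance 1, so the weighted attributes contribute more to any distance or kernel value computed later.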
Parameter setting
In all of the experiments, the parameter c is set to 1.0 in the proposed fuzzy linear programming SVM model, where the Gaussian (RBF) kernel was chosen, with parameter σ = 2.0.
In the iterations of the FCRanker algorithm, we set n = 70 in Eq. (10), and $\widehat{p}=0.03\left|\Omega_{+}\right|$ and $\widehat{sep}=0.25$ in Eq. (15). The strategy for solving large-scale problems was employed as described in the subsection "FCRanker for the large-scale problem", where the parameter ρ was chosen as 0.2.
Validation of sep throughout iterations
Figure 1 depicts the variation of the values of sep over the iterations of the FCRanker algorithm on the Yeast and UPS1 datasets. On both datasets, the value of $\overline{s}_{1}$ is almost equal to $\overline{s}_{-1}$ initially; then $\overline{s}_{1}$ increases as the iterations proceed while $\overline{s}_{-1}$ decreases throughout the procedure. Hence, an increasing curve of sep, which is defined as $\left(\overline{s}_{1}-\overline{s}_{-1}\right)/2$, is observed in the figure. At iteration 4 in Figure 1A (Yeast dataset) the value of sep exceeds the given threshold 0.25, meeting the termination criterion of the algorithm. The increasing values of sep illustrate that the identified good target PSMs indexed by Ω_{1} become closer to each other and better separated from decoy PSMs as the iterations proceed, showing the effectiveness of the fuzzy silhouette index.
Comparison of target PSMs
We compared the target PSMs output by PeptideProphet, Percolator and FCRanker under FDR level 0.05 in Table 2. On the Yeast dataset, FCRanker identified 1475 target PSMs, while PeptideProphet output 1443 and Percolator output 1393. FCRanker thus found 32 more target PSMs than PeptideProphet and 82 more than Percolator. On the UPS1 dataset, FCRanker found 681 target PSMs, which is 243 PSMs (55.5%) more than Percolator and 115 PSMs (20.3%) more than PeptideProphet. On the Tal08 dataset, FCRanker output 1092 target PSMs, which is 135 PSMs (14.1%) more than PeptideProphet and 139 PSMs (14.6%) more than Percolator. Similar results for the PSMs output by the three methods on particular digested types are also shown in Table 2.
We analyzed the target PSMs output by the three methods, and their overlaps are summarized in Figure 2. There are large overlaps among the output PSMs of the three approaches on all of the Yeast, UPS1 and Tal08 datasets. Specifically, FCRanker, PeptideProphet and Percolator identified 1248 common target PSMs on the Yeast dataset (Figure 2A), which covers 86.5% of the total target PSMs of PeptideProphet, 89.6% of the output of Percolator and 84.6% of the output targets of FCRanker. In addition, FCRanker identified 129 PSMs (8.9%) selected by PeptideProphet but not covered by Percolator, and found 14 PSMs (1.0%) selected by Percolator but not covered by PeptideProphet.
On the UPS1 dataset (Figure 2B), the three algorithms have 383 target PSMs in common. The overlap covers 67.7% of the total target PSMs of PeptideProphet, 87.4% of Percolator and 56.2% of FCRanker. There are 520 target PSMs identified by both PeptideProphet and FCRanker, covering 91.9% of the total target PSMs of PeptideProphet and 76.4% of FCRanker; there are 406 target PSMs identified by both Percolator and FCRanker, covering 92.7% of the total target PSMs of Percolator and 59.6% of FCRanker. In addition, FCRanker identified 137 PSMs (24.2%) selected by PeptideProphet but not covered by Percolator, and found 23 PSMs (5.3%) selected by Percolator but not covered by PeptideProphet.
On the Tal08 dataset (Figure 2C), the three algorithms have 829 PSMs in common. The overlap covers 86.6% of the total target PSMs of PeptideProphet, 87.0% of Percolator and 75.9% of FCRanker. There are 862 target PSMs identified by both PeptideProphet and FCRanker, covering 90.1% of the total target PSMs of PeptideProphet and 78.9% of FCRanker; there are 847 target PSMs identified by both Percolator and FCRanker, covering 88.9% of the total target PSMs of Percolator and 77.6% of FCRanker. In addition, FCRanker identified 33 PSMs (3.4%) selected by PeptideProphet but not covered by Percolator, and found 18 PSMs (1.9%) selected by Percolator but not covered by PeptideProphet.
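The coverage percentages quoted for the three-way overlaps can be recomputed from the counts given in the text. The short check below is only an arithmetic illustration; the PeptideProphet and Percolator totals for UPS1 and Tal08 are derived here from the reported differences (for example 681 − 115 = 566) rather than read from Table 2, and the dictionary layout is an assumption.

```python
# Totals at FDR 0.05. Yeast totals are stated directly in the text; the
# PeptideProphet and Percolator totals for UPS1 and Tal08 are derived from the
# reported differences (for example 681 - 115 = 566), not read from Table 2.
totals = {
    "Yeast": {"PeptideProphet": 1443, "Percolator": 1393, "FCRanker": 1475},
    "UPS1": {"PeptideProphet": 681 - 115, "Percolator": 681 - 243, "FCRanker": 681},
    "Tal08": {"PeptideProphet": 1092 - 135, "Percolator": 1092 - 139, "FCRanker": 1092},
}
# PSMs identified by all three methods (Figure 2).
common = {"Yeast": 1248, "UPS1": 383, "Tal08": 829}


def coverage(dataset):
    """Percentage of each method's output that lies in the three-way overlap."""
    return {method: round(100.0 * common[dataset] / n, 1)
            for method, n in totals[dataset].items()}


for name in totals:
    print(name, coverage(name))
```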
ROC curve
Figure 3 shows ROC curves of the three methods on the Yeast, UPS1 and Tal08 datasets. On the Yeast dataset (Figure 3A), FCRanker has the same TPR as PeptideProphet when the FPR is near zero, and reaches higher TPRs than PeptideProphet and Percolator at other FPR levels. On both the UPS1 dataset (Figure 3B) and the Tal08 dataset (Figure 3C), FCRanker reaches higher TPRs than the other two methods at all FPR levels. In particular, on the Tal08 dataset, FCRanker reaches evidently high TPR levels even at comparatively high FPR levels.
Figure 4 depicts the relation between the number of TP and FDR, where we observed patterns similar to those of the corresponding ROC curves.
Methods
Classification and clustering methods for peptide identification
Fuzzy clustering
Clustering analysis is an unsupervised learning method that groups similar data samples together. The silhouette index was introduced in [17, 18] to measure how well a sample belongs to a cluster.
Suppose that there are l data samples {x _{1}, . . ., x _{ l }}, which are grouped into K clusters, denoted as C ={C _{1} , . . ., C _{ K }}. Denote by d(x _{ i } , x _{ j }) the distance between two samples x _{ i } and x _{ j }, and by ${C}_{k}=\left\{{x}_{1}^{k},\dots ,{x}_{{m}_{k}}^{k}\right\}$ the samples of the k th cluster, where m _{ k } = | C _{ k } | and k = 1, . . ., K. The average distance, denoted by ${a}_{i}^{k}$, between the i th data sample in cluster C _{ k } and the other samples in the same cluster is formulated as
and the minimum average distance between the i th data sample in cluster C _{ k }and all other data samples in clusters C _{ v } , v = 1, . . ., K, v ≠ k is defined as
Then, we define the silhouette value of the i th data sample in C _{ k } as follows
Clearly, the silhouette values lie in the interval [− 1, 1]. The silhouette value of the cluster C _{ k } is defined as
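As an illustration, the silhouette quantities described above can be computed with the short sketch below. It follows the usual definition of Rousseeuw [17]: the within-cluster average is taken over the other members of the cluster, the silhouette of a sample is (b − a) / max(a, b), and the silhouette of a cluster is the mean over its samples. The symbol `b` for the minimum average distance to another cluster, the Euclidean distance, and the convention of assigning 0 to singleton clusters are assumptions of this sketch; at least two clusters are required.

```python
import math


def dist(p, q):
    """Euclidean distance, assumed here for d(x_i, x_j)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(clusters):
    """Return, for each cluster, the list of silhouette values of its samples."""
    result = []
    for k, members in enumerate(clusters):
        values = []
        for i, x in enumerate(members):
            others = [y for j, y in enumerate(members) if j != i]
            if not others:           # singleton cluster: silhouette set to 0
                values.append(0.0)
                continue
            # a: average distance to the other samples of the same cluster
            a = sum(dist(x, y) for y in others) / len(others)
            # b: smallest average distance to the samples of another cluster
            b = min(sum(dist(x, y) for y in c) / len(c)
                    for v, c in enumerate(clusters) if v != k)
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        result.append(values)
    return result


def cluster_silhouette(values):
    """Silhouette of a cluster: the mean over its samples."""
    return sum(values) / len(values)


clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
per_sample = silhouette_values(clusters)
per_cluster = [cluster_silhouette(v) for v in per_sample]
```

For two tight, well-separated clusters such as the toy data here, every sample has a silhouette value close to 1; values near −1 would indicate samples that sit closer to another cluster than to their own.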
Classification
Our task is to identify the correct PSMs from a set of PSMs generated by database searching tools in peptide identification. Usually decoy PSMs are employed to validate target PSMs, so the PSM samples can be categorized into a "good" class, with labels "+1", and a "bad" class, with labels "−1". In the classification setting, we use a vector of attributes such as xcorrelation, deltacn, ions, sprank, calcneutralpepmass, etc., to represent a PSM data sample. Let {x _{ i }} ⊆ R^{q} , i = 1, . . ., l be the PSM data samples, with q the number of attributes. We aim at finding a discriminant function f : R^{q} → R to classify the PSM data samples according to their labels.
One of the greatest challenges in peptide identification is the lack of data samples with deterministic +1 labels. In a standard classification setting, the discriminant function is obtained by training the model on two balanced types of data samples with deterministic labels. In the peptide identification problem, however, a great number of the PSMs generated by database search engines are incorrect, and the data samples with +1 labels are quite unreliable. Thus, the large number of data samples with incorrect +1 labels would severely distort the trained discriminant function if they were employed directly in standard classification models.
Here, we consider the kernel-based SVM classifier as follows:
where b ∈ R and k(·,·) is a chosen kernel function. The label of a data sample x is predicted as +1 if f (x) > 0; otherwise it is predicted as −1. A quadratic program is usually solved to obtain the coefficients α and b, which requires a huge computational overhead, especially for large-scale problems. To overcome this problem, a class of linear programming SVMs is introduced in [19].
For the l data samples {(x _{ i } , y _{ i })}, i = 1, . . ., l, with x _{ i } ∈ R^{q} , y _{ i } ∈ {1, −1}, the linear programming SVM model is formulated as
where c > 0 is a given constant, and the discriminant function $f\left(\cdot \right)={\sum}_{j=1}^{l}{\alpha}_{j}{y}_{j}k\left({x}_{j},\cdot \right)+b$.
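To make the role of the discriminant function concrete, the sketch below evaluates $f(x)={\sum}_{j}{\alpha}_{j}{y}_{j}k({x}_{j},x)+b$ and applies the sign rule for prediction. The kernel scaling exp(−‖x − z‖² / (2σ²)) is an assumption, since the exact parametrization of the Gaussian kernel is not reproduced in this text, and the coefficients are made-up values rather than the solution of the linear program.

```python
import math


def gaussian_kernel(x, z, sigma=2.0):
    """Gaussian (RBF) kernel; the 2*sigma^2 scaling is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


def discriminant(x, samples, labels, alpha, b, sigma=2.0):
    """f(x) = sum_j alpha_j * y_j * k(x_j, x) + b."""
    return sum(a * y * gaussian_kernel(xj, x, sigma)
               for a, y, xj in zip(alpha, labels, samples)) + b


def predict(x, samples, labels, alpha, b, sigma=2.0):
    """Label +1 if f(x) > 0, otherwise -1, as stated in the text."""
    return 1 if discriminant(x, samples, labels, alpha, b, sigma) > 0 else -1


# Illustrative values only; alpha and b would come from solving the LP model.
samples = [(2.0, 2.0), (-2.0, -2.0)]
labels = [1, -1]
alpha = [1.0, 1.0]
b = 0.0
```

With these toy coefficients a point near (2, 2) is predicted as +1 and a point near (−2, −2) as −1, while a point equidistant from both training samples has f(x) = 0 and therefore falls on the −1 side of the rule.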
The basic FCRanker algorithm
In this section, the FCRanker algorithm is presented to calculate the score of each PSM data sample. The score values reflect the possibility of the PSM data samples being correct, and the PSMs with high scores are finally selected for users.
Denote by Ω = {1, . . ., l} the set of indices of l PSM data samples, by Ω_{+} the set of indices of target PSMs, by Ω_{ − } the set of indices of decoy PSMs, by Ω_{1} the set of indices of good target PSMs, and by Ω_{0} = Ω_{+} \ Ω_{1} the set of indices of bad target PSMs. The FCRanker algorithm aims to select the set Ω_{1} from Ω_{+} utilizing the data samples indexed by Ω_{ − }. To separate good target PSMs from the others, a discriminant function f is constructed such that the function value f (x _{ i }) is positive if sample x _{ i } belongs to Ω_{1}, and negative otherwise. A large discriminant function value of a target PSM sample x _{ i } indicates that the sample lies far away from the decision boundary, and hence has a large possibility of being a good PSM. However, a large discriminant function value f (x _{ i }) alone is not sufficient to ensure that the PSM sample x _{ i } is good. Take the sample represented by "□" in Figure 5 as an example: it has a large distance from the decision boundary and thus a large function value f (□). This sample, however, tends to be a bad PSM since it lies too far away from the other PSM data samples indexed by Ω_{+}.
On the other hand, a data sample may not be a good target PSM either if it lies comparatively close to other target PSMs but has a small discriminant function value. The data sample represented by "⊕" in Figure 5 should therefore also be excluded from the set Ω_{1}. These observations suggest that a good target PSM data sample should satisfy two rules: 1) it has a large discriminant function value; 2) it is close to other target PSMs.
Fuzzy SVM classification
A weight θ _{ i } ∈ [0, 1] is introduced for each target sample x _{ i } indexed by Ω_{+} to indicate its possibility of being correct, since its label is not trustworthy. A large weight usually indicates that the PSM is more likely to be correct. Since the decoy PSMs are certainly incorrect, we fix the weights θ _{ i } to 1 for i ∈ Ω_{ − }. Denote by loss(f (x _{ i }), y _{ i }) the empirical error of sample x _{ i }; then the total empirical error can be formulated as $\sum_{i\in\Omega}\mathrm{loss}\left(f(x_{i}),y_{i}\right)$ in traditional classification problems with deterministic labels. Assigning a weight to each data sample, we reformulate the total empirical error as $\sum_{i\in\Omega}\theta_{i}\,\mathrm{loss}\left(f(x_{i}),y_{i}\right)$.
Thus, the linear programming SVM model (1) is transformed as follows
where α ∈ R^{l} , b ∈ R^{1}, r ∈ R^{1} and ξ = [ξ _{1} , . . ., ξ _{ l }] ∈ R^{l}. Model (2) is referred to as the fuzzy linear programming SVM model.
The model (2) can be rewritten as
where θ = [θ _{1}, . . ., θ _{ l }]^{T} , Λ(y) = Diag(y), 0_{ l } ∈ R^{l} is a vector with zero elements, 1_{ l } ∈ R^{l} is a vector with each element equal to 1, I _{ l } is the l × l identity matrix, and K = (k(x _{ i } , x _{ j }))_{1≤i≤l,1≤j≤l }. The model can be solved by existing optimization software, such as Mosek.
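The effect of the weights on the empirical error term can be seen in a small numeric sketch. The hinge loss is used here only as an assumed stand-in for loss(f(x_i), y_i), and the discriminant values and weights are made up; the sketch illustrates the weighted sum of losses, not the full linear program.

```python
def hinge_loss(f_value, y):
    """Hinge loss, an assumed stand-in for loss(f(x_i), y_i)."""
    return max(0.0, 1.0 - y * f_value)


def weighted_empirical_error(f_values, labels, theta):
    """Sum over samples of theta_i * loss(f(x_i), y_i)."""
    return sum(t * hinge_loss(f, y)
               for f, y, t in zip(f_values, labels, theta))


# Two target PSMs (label +1) and one decoy (label -1); values are made up.
f_values = [1.5, -0.5, -2.0]
labels = [1, 1, -1]

# Deterministic labels: every weight is 1.
plain = weighted_empirical_error(f_values, labels, [1.0, 1.0, 1.0])
# Fuzzy labels: the doubtful target gets weight 0.2, the decoy keeps weight 1.
fuzzy = weighted_empirical_error(f_values, labels, [1.0, 0.2, 1.0])
```

In this toy case the misclassified target contributes 1.5 to the unweighted error but only 0.3 once its weight is lowered to 0.2, which is how a target PSM with a doubtful +1 label is prevented from pulling the decision boundary towards itself.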
Fuzzy silhouette
To handle situations with uncertain labels, we generalize the silhouette concept from the deterministic setting to a fuzzy silhouette index.
For k = − 1, 1, i ∈ Ω_{ k }, the average distance of sample x _{ i } to the other data samples in Ω_{ k } is formulated as
where θ _{ i } ∈ [0, 1]. Then, we define the fuzzy silhouette of sample x _{ i } as
It measures the degree to which a PSM sample is far away from the decoys and close to the good target samples. Hence, a PSM data sample is more likely to be correct if it has a large fuzzy silhouette value.
For the sets Ω_{1}, Ω_{ − 1} and Ω_{0} we define their average fuzzy silhouettes as
where | Ω_{ k } | is the cardinality of Ω_{ k } , k = − 1, 1, 0. We also define
$sep=\left({\overline{s}}_{1}-{\overline{s}}_{-1}\right)/2 \qquad (6)$
as a metric to indicate the separation degree of decoy PSM samples and good PSMs.
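A possible reading of these quantities is sketched below. Since Eqs. (4) and (5) are not reproduced in this text, the θ-weighted average distance and the silhouette form (far − near) / max(near, far) are assumptions modelled on the classical silhouette; only the final formula sep = (s̄_1 − s̄_{−1}) / 2 is taken from the text. All function names and the toy data are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance (an assumption for this sketch)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def weighted_avg_distance(i, group, points, theta):
    """Assumed form of beta_i^k: theta-weighted mean distance from x_i to group k."""
    idx = [j for j in group if j != i]
    total = sum(theta[j] for j in idx)
    if total == 0:
        return 0.0
    return sum(theta[j] * dist(points[i], points[j]) for j in idx) / total


def fuzzy_silhouette(i, good, decoys, points, theta):
    """Assumed form of s_i: close to good targets and far from decoys gives s_i near 1."""
    near = weighted_avg_distance(i, good, points, theta)
    far = weighted_avg_distance(i, decoys, points, theta)
    denom = max(near, far)
    return (far - near) / denom if denom > 0 else 0.0


def separation(good, decoys, points, theta):
    """sep = (mean s_i over good targets - mean s_i over decoys) / 2."""
    s = {i: fuzzy_silhouette(i, good, decoys, points, theta)
         for i in good + decoys}
    s_good = sum(s[i] for i in good) / len(good)
    s_decoy = sum(s[i] for i in decoys) / len(decoys)
    return (s_good - s_decoy) / 2.0


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
theta = [1.0, 0.8, 1.0, 1.0]   # decoy weights are fixed at 1
good, decoys = [0, 1], [2, 3]
sep = separation(good, decoys, points, theta)
```

Under these assumptions sep lies in [−1, 1] and grows as the good targets become tighter and move away from the decoys, which matches the increasing curves described for Figure 1.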
Score of the samples
Based on the fuzzy SVM model and fuzzy silhouette metric we design a scoring scheme, which defines the score of sample x _{ i } as
where φ(·) and ψ(·) are functions for scaling the values of f (x _{ i }) and s _{ i }, respectively. Here, function φ(·) : R → [− 1, 1] is constructed as an increasing function, and ψ(·) as an increasing function mapping from [− 1, 1] to [− 1, 1]. Particularly, we choose function φ(f (x _{ i })) and ψ(s _{ i }) as
where f _{max} and s _{max} are the largest values of {f (x _{ i }) − f _{0} } and {s _{ i } − s _{0} } for i ∈ Ω_{+}, respectively, f _{0} is the threshold on the values of the discriminant function, and s _{0} is the threshold on the fuzzy silhouette. The power of $\frac{1}{4}$ on | f (x _{ i }) − f _{0} | is introduced to smooth the weight contributions.
The FCRanker algorithm
The FCRanker algorithm iteratively adjusts the index set Ω_{1} of good PSMs by calculating the scores and weights of the data samples until a stop criterion is met. Initially, the algorithm sets $\Omega_{1}^{0}=\Omega_{+}$ and $\Omega_{0}^{0}=\emptyset$, i.e. all target PSM samples are viewed as good ones at iteration 0. At iteration k, the algorithm solves the fuzzy linear programming SVM model (3), calculates the fuzzy silhouette values of the samples according to Eq. (5) and updates the index sets Ω_{1} and Ω_{0} such that the indices of target PSMs in Ω_{1} with small scores are moved to Ω_{0}, while the indices of target PSMs in Ω_{0} with large scores are moved to Ω_{1}.
At the k th iteration, PSM samples indexed by Ω_{+} are ranked according to their scores, and the top n% of them in Ω_{1} are retained. Then $\Omega_{1}^{k}$ is updated by the discriminant function values as
where 0 < n < 100 is a constant percentage. Based on the calculated fuzzy silhouettes, $\Omega_{1}^{k+1/3}$ is then updated by
and $\Omega_{0}^{k}$ is updated by
Finally, for i ∈ Ω, new scores score(i)^{k+1} are computed according to Eq. (7) and the weights $\theta_{i}^{k+1}$ are calculated by the following equation
Then indices of the samples indexed by $\Omega_{0}^{k+1/2}$ are moved to $\Omega_{1}^{k+2/3}$ if the samples have large score values,
i.e.,
where $\overline{f}_{1}^{k+2/3}$ is the average of $\left\{f\left(x_{i}\right)\mid i\in \Omega_{1}^{k+2/3}\right\}$.
The algorithm terminates when the number of identified good PSM samples reaches a given threshold $\widehat{p}$, or the separation degree sep^{k+1} defined by Eq. (6) reaches a threshold $\widehat{sep}$, i.e.,
The FCRanker algorithm is summarized in Algorithm 1.
Algorithm 1 The FCRanker Algorithm
Input: {x _{ i } , y _{ i }}, i ∈ Ω;
Output: Scores of samples indexed by Ω;
1: Initialization: k = − 1, $\Omega_{1}^{0}=\Omega_{+}$, $\Omega_{0}^{0}:=\emptyset$, $\theta_{i}^{0}=1$, i ∈ Ω.
2: while Stop criterion (15) is not satisfied do
3: k := k + 1.
4: SVM classification.
5: Solve fuzzy SVM classification model Eq. (3);
6: Calculate $\Omega_{1}^{k+1/3}$ via Eq. (10).
7: Clustering analysis.
8: Calculate fuzzy silhouettes s _{ i } , i ∈ Ω via (5);
9: Calculate $\Omega_{1}^{k+2/3}$, $\Omega_{0}^{k+1/2}$ via Eq. (11), (12).
10: Update weights.
11: Calculate score(i)^{k+1}, θ^{k+1} via Eq. (7), (13);
12: Calculate $\Omega_{1}^{k+1}$, $\Omega_{0}^{k+1}$, sep^{k+1} via Eq. (14), (6).
13: end while
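Two pieces of the loop in Algorithm 1, the retention of the top n% of target PSMs and the stop test, can be sketched as follows. Eqs. (10) and (15) are not reproduced in this text, so the exact forms below are assumptions: retention is read as "keep the members of Ω_1 that rank in the top n% of all targets by score", and the stop rule as "at most p̂ good PSMs remain, or sep has reached the threshold". The function names and the toy scores are hypothetical; the defaults n = 70, p̂ = 0.03|Ω_+| and a sep threshold of 0.25 are the parameter values reported in the text.

```python
def keep_top_percent(scores, omega_1, n=70):
    """Keep the members of omega_1 whose scores rank in the top n% of all targets."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = int(len(ranked) * n / 100)
    top = set(ranked[:cutoff])
    return [i for i in omega_1 if i in top]


def should_stop(omega_1, n_targets, sep, p_ratio=0.03, sep_hat=0.25):
    """Assumed reading of the stop rule: few good PSMs left, or sep large enough."""
    return len(omega_1) <= p_ratio * n_targets or sep >= sep_hat


# Made-up scores for ten target PSMs, indexed 0..9.
scores = {i: s for i, s in enumerate(
    [0.9, 0.1, 0.8, -0.3, 0.7, 0.2, 0.6, -0.5, 0.5, 0.4])}
omega_1 = list(scores)            # iteration 0: every target is in omega_1
omega_1 = keep_top_percent(scores, omega_1, n=70)
```

In a full implementation these two helpers would sit inside the while loop, after the fuzzy SVM has been solved and the fuzzy silhouettes and scores have been recomputed for the current weights.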
FCRanker for the large-scale problem
The number of PSMs output by a database search engine is usually extremely large. In this section, some implementation practices are discussed so that the algorithm is capable of solving large-scale problems.
Fuzzy SVM classification for the large-scale problem
If the data matrix is sparse, interior-point algorithms are competent in solving large-scale linear programming problems. The kernel matrix K in Problem (3) is, unfortunately, not sparse in general. In fact, K is usually quite dense and most of its elements are nonzero. Storing a large dense matrix K is not a trivial task. Take a matrix K with the Gaussian kernel and l = 400,000 as an example: if each element occupies four bytes, then K has l^{2} = 1.6 × 10^{11} elements and takes up 640 GB of storage in all.
Interestingly, our experimental experience indicates that the kernel matrix usually has quite low rank in the peptide identification problem. Hence, a submatrix K' consisting of l' columns of K (l' << l) is selected to substitute for K in Problem (3). These l' columns are selected randomly from the columns of K. This operation can be implemented by sampling l' data samples randomly and then calculating the submatrix K' according to the kernel function, which reduces the storage greatly. Denote by Ω' ⊂ Ω the index set consisting of the indices of the l' columns. Then the matrix (K')_{ ij } = k(x _{ i } , x _{ j } ), i ∈ Ω, j ∈ Ω', of size l × l', can be calculated. Let y' = (y _{ j })_{ j ∈ Ω' }; then Problem (3) is reduced to
where $\alpha \in {R}^{{l}^{\prime}}$, b ∈ R^{1}, r ∈ R^{1}, ξ ∈ R^{l}, and Λ(y′) = Diag(y′).
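The column-sampling step can be sketched as follows, together with the storage arithmetic from the 400,000-sample example. This is a minimal illustration under stated assumptions: the kernel scaling exp(−‖x − z‖² / (2σ²)), the helper names, the toy data and the choice l' = 2000 are all hypothetical and not taken from the paper.

```python
import math
import random


def gaussian_kernel(x, z, sigma=2.0):
    """Gaussian kernel; the 2*sigma^2 scaling is an assumption of this sketch."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))


def sampled_kernel_columns(points, l_prime, seed=0):
    """Return (omega_prime, K') where K' has one column per sampled index."""
    rng = random.Random(seed)
    omega_prime = sorted(rng.sample(range(len(points)), l_prime))
    k_sub = [[gaussian_kernel(x, points[j]) for j in omega_prime]
             for x in points]
    return omega_prime, k_sub


def dense_storage_gb(rows, cols, bytes_per_element=4):
    """Storage of a dense rows x cols matrix in decimal gigabytes."""
    return rows * cols * bytes_per_element / 1e9


# Toy data: 50 random 6-dimensional samples, 5 sampled columns.
rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(50)]
omega_prime, k_sub = sampled_kernel_columns(points, l_prime=5)

full_gb = dense_storage_gb(400_000, 400_000)   # the 640 GB example in the text
reduced_gb = dense_storage_gb(400_000, 2_000)  # l' = 2000 is a hypothetical choice
```

Only the l × l' block is ever formed, so the full kernel matrix never has to be held in memory; with the hypothetical l' = 2000 the storage drops from 640 GB to about 3.2 GB.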
Fuzzy silhouette for the large-scale problem
To update the fuzzy silhouette value s _{ i } of sample i, the major work is to compute $\beta_{i}^{1}$ and $\beta_{i}^{-1}$ in Eq. (4), which requires calculating l distances. In all, each iteration computes | Ω | · | Ω | = l^{2} distances over all samples. Denote a given sample rate by ρ with ρ ∈ (0, 1). We sample ρ | Ω_{1} | indices of targets from Ω_{1} and ρ | Ω_{ − 1} | indices of decoys from Ω_{ − 1}, denoted by $\Omega_{1}^{\prime}$ and $\Omega_{-1}^{\prime}$, to substitute for Ω_{1} and Ω_{−1} in Eq. (4), respectively. Then at most ρl(| Ω_{ − 1} | + | Ω_{1} |) ≤ ρl^{2} distances need to be calculated at each iteration.
Conclusion
A new scoring method has been developed based on the iterations of the FCRanker algorithm, which is equipped with a fuzzy silhouette index and a fuzzy SVM classification model to cope with the large number of incorrect labels of target PSM samples. In the fuzzy classification model, each PSM is assigned a calculated weight which indicates the possibility of the PSM sample being correct. The performance of the FCRanker algorithm has been compared with PeptideProphet and Percolator on the Yeast, UPS1 and Tal08 datasets, showing that FCRanker surpassed PeptideProphet and Percolator in terms of ROC and the quantity of identified target PSM samples under the same FDR level. Moreover, FCRanker outputs more target PSMs than PeptideProphet and Percolator do, while the three methods share a large number of PSMs in common.
Abbreviations
PSMs: peptide spectrum matches
SVM: support vector machine
References
1. Elias J, Gygi S: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 2007, 4(3):207–214. 10.1038/nmeth1019
2. Perkins D, Pappin D, Creasy D, Cottrell J: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
3. Ramakrishnan S, Mao R, Nakorchevskiy A, Prince J, Willard W, Xu W, Marcotte E, Miranker D: A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics 2006, 22(12):1524–1531. 10.1093/bioinformatics/btl118
4. Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 2002, 74(20):5383–5392. 10.1021/ac025747h
5. Ding Y, Choi H, Nesvizhskii A: Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. Journal of Proteome Research 2008, 7(11):4878–4889. 10.1021/pr800484x
6. Choi H, Nesvizhskii A: Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. Journal of Proteome Research 2007, 7:254–265.
7. Richard E, Knierman M, Freeman A, Gelbert L, Patil S, Hale J: Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. Journal of Proteome Research 2007, 6(5):1758–1767. 10.1021/pr0605320
8. Olsen J, Mann M: Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(37):13417–22. 10.1073/pnas.0405549101
9. Bianco L, Mead J, Bessant C: Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG 2006 standard MS/MS data sets. Journal of Proteome Research 2009, 8(4):1782–1791. 10.1021/pr800792z
10. Anderson D, Li W, Payan D, Noble W: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. Journal of Proteome Research 2003, 2(2):137–146. 10.1021/pr0255654
11. Spivak M, Weston J, Bottou L, Käll L, Noble W: Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research 2009, 8(7):3737–3745. 10.1021/pr801109k
12. Käll L, Canterbury J, Weston J, Noble W, MacCoss M: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4(11):923–925. 10.1038/nmeth1113
13. Liang X, Xia Z, Niu X, Link AJ, Pang L, Wu F, Zhang H: A fuzzy cluster-based algorithm for peptide identification. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on. IEEE; 2012:602–609.
14. Sanders S, Jennings J, Canutescu A, Link A, Weil P: Proteomics of the eukaryotic transcription machinery: identification of proteins associated with components of yeast TFIID by multidimensional mass spectrometry. Molecular and Cellular Biology 2002, 22(13):4723–4738. 10.1128/MCB.22.13.4723-4738.2002
15. SGD: Saccharomyces Genome Database. 2012. [http://www.yeastgenome.org]
16. GenBank: NCBI gene bank. 2012. [http://www.ncbi.nlm.nih.gov/genbank]
17. Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987, 20:53–65.
18. Petrovic S: A comparison between the silhouette index and the Davies-Bouldin index in labelling IDS clusters. Proceedings of the 11th Nordic Workshop of Secure IT Systems 2006, 53–64.
19. Zhou W, Zhang L, Jiao L: Linear programming support vector machines. Pattern Recognition 2002, 35(12):2927–2936. 10.1016/S0031-3203(01)00210-2
Acknowledgements
XN and AJL were supported by NIH grant GM64779. LP was supported by NSF of China under grant 11171049.
Declarations
The publication costs for this article were funded by Xijun Liang.
This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XL and ZX designed the basic FCRanker algorithm and wrote the manuscript. XN, AL and FW designed the version of FCRanker algorithm for the largescale problem and corresponding experiments. XL, LP and HZ designed and operated experiments. All authors read and approved the final manuscript.
About this article
Cite this article
Liang, X., Xia, Z., Niu, X. et al. Peptide identification based on fuzzy classification and clustering. Proteome Sci 11, S10 (2013). https://doi.org/10.1186/1477-5956-11-S1-S10
Keywords
 Peptide identification
 Peptide spectrum matches (PSMs)
 Fuzzy support vector machine (SVM)
 Fuzzy silhouette