
Peptide identification based on fuzzy classification and clustering

Abstract

Background

Sequence database searching has been the dominant method for peptide identification: a large number of peptide spectra generated from LC/MS/MS experiments are searched, using a search engine, against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge.

Results

A novel scoring method named FC-Ranker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. In particular, the scores of PSMs are updated iteratively using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops.

Conclusions

Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.

Background

In protein identification, observed peptide spectra are searched against theoretical fragmentation spectra derived from target databases. Peptide spectrum matches (PSMs) are scored by database search tools, and the high-scoring PSMs are selected as target PSMs. In fact, more than half of the selected PSMs are not correct [1]. Although many filters [2, 3] have been proposed to refine the outputs further, they do not generalize across different datasets.

To tackle this problem, PeptideProphet [4] used unsupervised learning to automatically select PSMs output by database search tools. Assuming that the PSM samples are drawn from a mixture distribution with components for "correct" and "incorrect" PSMs, PeptideProphet applies the expectation-maximization (EM) method to calculate the possibility of each PSM being "correct". Because PeptideProphet searches only the set of high-scored PSMs for "correct" ones, some good low-ranked PSMs may be lost. Adaptive PeptideProphet was proposed in [5] to improve the performance of PeptideProphet by iteratively training a discriminant function from a set of top-ranked PSM samples. In [7–9], decoy databases were used to validate the performance of post-database search algorithms. The work in [6] extended PeptideProphet by incorporating decoy PSMs into a unified semi-supervised expectation-maximization framework in order to estimate more accurate probabilities.

Support vector machines (SVMs) have also been studied for the peptide assignment problem in [10, 11]. Percolator [12] employed an SVM to iteratively adjust models that fit target PSMs with higher scores than decoy PSMs. As a semi-supervised learning model, Percolator did not make full use of the labels and samples of target PSMs. More recently, a fully supervised SVM learning model was proposed in [11] to improve the performance of Percolator by utilizing target PSM data: the "incorrect" target PSMs are viewed as noise, and a special loss function is employed to reduce the negative impact of that noise on the learning model. Although the classification model separates most good target PSMs from noise and decoy PSMs, all selected PSMs are treated in the same way.

In this paper, a new scoring method, FC-Ranker, is developed not only to identify reliable target PSMs, but also to evaluate the confidence of each target PSM. Because good target PSMs are close to each other, FC-Ranker integrates sample clustering into the classification procedure to compute the possibility of each target PSM being correct. Compared with the standard SVM model, the proposed fuzzy classification model assigns a weight to each target PSM indicating its likelihood of being correct. The score of each PSM sample is computed by combining its discriminant function value and its fuzzy silhouette value. The algorithm repeatedly updates the values of the discriminant function and the fuzzy silhouette index for each PSM sample, and recomputes the weights of the targets until it stops. In the experimental studies, FC-Ranker shows a large overlap of identified target PSMs with PeptideProphet and Percolator, while identifying more target PSMs on all datasets.

The first stage of this work was published in [13]. In the present work, we additionally compare the FC-Ranker algorithm with another benchmark method, Percolator, in the experimental studies. Because Percolator is built on an SVM-based learning model, it provides a better reference for performance comparison. Furthermore, we added a new dataset, Tal08, whose characteristics differ from those of the Yeast and UPS1 datasets (see Table 1). The performance of the proposed FC-Ranker algorithm has been evaluated on all three datasets in terms of the number of target PSMs, overlaps and ROC curves, and compared with that of PeptideProphet and Percolator. The new data analysis and results reinforce the efficiency of the proposed FC-Ranker method.

Table 1 Statistics of datasets

Results and discussion

The FC-Ranker algorithm is compared with PeptideProphet [4] and Percolator [12] to validate its effectiveness. All experiments were run on a PC with an Intel(R) 1.80 GHz × 2 CPU and 2.0 GB of RAM.

Experimental Setup

Dataset

FC-Ranker was examined on three datasets: S. cerevisiae Gcn4 (Yeast), Universal Proteomics Standard (UPS1) and Tal08 [14]. Trypsin digestion of the protein samples generates three types of tryptic peptides: full-digested (both ends of a peptide satisfy the enzyme specificity rule), half-digested (only one end satisfies the rule) and none-digested (neither end satisfies the rule). The database of Yeast protein sequences was obtained from the Saccharomyces Genome Database (SGD) [15], and the Sigma48 protein sequence database from the NCBI gene bank [16]. The attributes of each PSM sample include x-correlation, delta-cn, ions, sprank and calc-neutral-pep-mass.

The SEQUEST search results on UPS1 contain 48 purified human proteins and 17,335 PSMs, consisting of 8974 target PSMs and 8361 decoy PSMs. The results on the Yeast dataset contain 6652 proteins and 14,891 PSMs, consisting of 6702 target PSMs and 8189 decoy PSMs. The results on the Tal08 dataset contain 18,653 PSMs in total, consisting of 9907 target PSMs and 8746 decoy PSMs.

Statistics of the three datasets are listed in Table 1.

Preprocess

In addition to the attributes output by SEQUEST, such as x-correlation, delta-cn, ions, sprank and calc-neutral-pep-mass, another attribute, "digested type", was added to the representation, with values "2", "1" and "0" for the full-digested, half-digested and none-digested types, respectively. The values of each attribute were first transformed linearly so that they have zero mean and unit variance (a normalization process). After normalization, the values of the x-correlation and delta-cn attributes were multiplied by a weight of 2.0, since these two attributes are more important in the data representation. Because our experimental experience indicates that the attribute "digested type" also plays an important role, a weight of 2.0 was likewise applied to its values after normalization.
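The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `ATTRIBUTE_WEIGHTS` table and the choice of the population standard deviation are assumptions, and the toy values are invented.

```python
import statistics

# Hypothetical weight table: 2.0 for the three attributes the text singles out,
# 1.0 (no extra weight) for every other attribute.
ATTRIBUTE_WEIGHTS = {"x-correlation": 2.0, "delta-cn": 2.0, "digested type": 2.0}


def standardize(values):
    """Linearly rescale a list of numbers to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0.0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def preprocess(columns):
    """Standardize each attribute column, then multiply it by its weight."""
    return {
        name: [ATTRIBUTE_WEIGHTS.get(name, 1.0) * z for z in standardize(vals)]
        for name, vals in columns.items()
    }


# Toy values, not taken from any of the datasets.
toy = {"x-correlation": [1.0, 2.0, 3.0], "ions": [10.0, 20.0, 30.0]}
scaled = preprocess(toy)
```

After this step, a weighted attribute contributes twice as much to any Euclidean distance between samples as an unweighted one, which is how the weights influence both the kernel and the silhouette computations.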

Parameter setting

In all of the experiments, the parameter $c$ is set to 1.0 in the proposed fuzzy linear programming SVM model, and the Gaussian (RBF) kernel

$$k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$$

was chosen, with parameter $\sigma = 2.0$.
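For concreteness, the kernel can be written as a small Python function. This is only a sketch; the function name `rbf_kernel` and the representation of samples as plain sequences of floats are assumptions.

```python
import math


def rbf_kernel(x1, x2, sigma=2.0):
    """Gaussian (RBF) kernel exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Identical samples give 1.0; the value decays as the samples move apart.
same = rbf_kernel([0.0, 0.0], [0.0, 0.0])
apart = rbf_kernel([0.0, 0.0], [2.0, 0.0])  # squared distance 4, so exp(-4 / 8)
```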

In the iterations of the FC-Ranker algorithm, we set $n = 70$ in Eq. (10), and $\hat{p} = 0.03\,|\Omega_+|$ and $\widehat{sep} = 0.25$ in Eq. (15). The strategy for solving large-scale problems was employed as described in the subsection "FC-Ranker for the large-scale problem", where the parameter $\rho$ was chosen as 0.2.

Validation of sep throughout iterations

Figure 1 depicts the variation of the value of $sep$ over the iterations of the FC-Ranker algorithm on the Yeast and UPS1 datasets. On both datasets, the value of $\bar{s}_1$ is almost equal to $\bar{s}_{-1}$ initially; $\bar{s}_1$ then increases as the iterations proceed, while $\bar{s}_{-1}$ decreases throughout the procedure. Hence, an increasing curve of $sep$, which is defined as $(\bar{s}_1 - \bar{s}_{-1})/2$, is observed in the figure. At iteration 4 in Figure 1A (Yeast dataset), the value of $sep$ exceeds the given threshold 0.25, meeting the termination criterion of the algorithm. The increasing values of $sep$ illustrate that the identified good target PSMs indexed by $\Omega_1$ move closer to each other and become better separated from the decoy PSMs as the iterations proceed, showing the effectiveness of the fuzzy silhouette index.

Figure 1

Variations of sep throughout the iterations. A: On the Yeast dataset; B: On the UPS1 dataset. The curve of sep increases throughout the iterations on both the Yeast and UPS1 datasets. A similar curve of sep is also observed on the Tal08 dataset; it is not shown here to keep the layout simple.

Comparison of target PSMs

Table 2 compares the target PSMs output by PeptideProphet, Percolator and FC-Ranker at an FDR level of 0.05. On the Yeast dataset, FC-Ranker identified 1475 target PSMs, while PeptideProphet output 1443 and Percolator output 1393. FC-Ranker thus found 32 more target PSMs than PeptideProphet and 82 more than Percolator. On the UPS1 dataset, FC-Ranker found 681 target PSMs, which is 243 PSMs (55.5%) more than Percolator and 115 PSMs (20.3%) more than PeptideProphet. On the Tal08 dataset, FC-Ranker output 1092 target PSMs, which is 135 PSMs (14.1%) more than PeptideProphet and 139 PSMs (14.6%) more than Percolator. Similar results for the PSMs output by the three methods on particular digested types are also shown in Table 2.
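The percentages quoted above are relative to the baseline method's count, which can be checked with simple arithmetic. The short Python sketch below uses only the counts and gains stated in the text; the baseline counts for UPS1 and Tal08 are derived by subtraction rather than read from Table 2, and the names are illustrative.

```python
# Counts and gains quoted in the text for FDR level 0.05.
FC_RANKER = {"Yeast": 1475, "UPS1": 681, "Tal08": 1092}
GAIN_OVER_PROPHET = {"Yeast": 32, "UPS1": 115, "Tal08": 135}
GAIN_OVER_PERCOLATOR = {"Yeast": 82, "UPS1": 243, "Tal08": 139}


def baseline_and_gain(fc_count, gain):
    """Return the baseline count and the gain as a percentage of that baseline."""
    baseline = fc_count - gain
    return baseline, round(100.0 * gain / baseline, 1)


summary = {
    ds: {
        "PeptideProphet": baseline_and_gain(FC_RANKER[ds], GAIN_OVER_PROPHET[ds]),
        "Percolator": baseline_and_gain(FC_RANKER[ds], GAIN_OVER_PERCOLATOR[ds]),
    }
    for ds in FC_RANKER
}
```

For example, `summary["UPS1"]["Percolator"]` evaluates to `(438, 55.5)`, matching the 55.5% stated for UPS1.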

Table 2 Target PSMs output by PeptideProphet, Percolator and FC-Ranker

We analyzed the target PSMs output by the three methods; their overlaps are summarized in Figure 2. There are large overlaps among the output PSMs of the three approaches on all of the Yeast, UPS1 and Tal08 datasets. Specifically, FC-Ranker, PeptideProphet and Percolator identified 1248 common target PSMs on the Yeast dataset (Figure 2A), which covers 86.5% of the target PSMs output by PeptideProphet, 89.6% of those output by Percolator and 84.6% of those output by FC-Ranker. In addition, FC-Ranker identified 129 PSMs (8.9%) that were selected by PeptideProphet but not by Percolator, and 14 PSMs (1.0%) that were selected by Percolator but not by PeptideProphet.

Figure 2

Overlap of the identified PSMs by FC-Ranker, PeptideProphet and Percolator. A: On Yeast dataset; B: On UPS1 dataset; C: On Tal08 dataset. "Prophet" indicates the results of PeptideProphet.

On the UPS1 dataset (Figure 2B), the three algorithms have 383 target PSMs in common. The overlap covers 67.7% of the target PSMs output by PeptideProphet, 87.4% of those output by Percolator and 56.2% of those output by FC-Ranker. PeptideProphet and FC-Ranker share 520 target PSMs, covering 91.9% of the PeptideProphet output and 76.4% of the FC-Ranker output; Percolator and FC-Ranker share 406 target PSMs, covering 92.7% of the Percolator output and 59.6% of the FC-Ranker output. In addition, FC-Ranker identified 137 PSMs (24.2%) that were selected by PeptideProphet but not by Percolator, and 23 PSMs (5.3%) that were selected by Percolator but not by PeptideProphet.

On the Tal08 dataset (Figure 2C), the three algorithms have 829 PSMs in common. The overlap covers 86.6% of the target PSMs output by PeptideProphet, 87.0% of those output by Percolator and 75.9% of those output by FC-Ranker. PeptideProphet and FC-Ranker share 862 target PSMs, covering 90.1% of the PeptideProphet output and 78.9% of the FC-Ranker output; Percolator and FC-Ranker share 847 target PSMs, covering 88.9% of the Percolator output and 77.6% of the FC-Ranker output. In addition, FC-Ranker identified 33 PSMs (3.4%) that were selected by PeptideProphet but not by Percolator, and 18 PSMs (1.9%) that were selected by Percolator but not by PeptideProphet.

ROC curve

Figure 3 shows the ROC curves of the three methods on the Yeast, UPS1 and Tal08 datasets. On the Yeast dataset (Figure 3A), FC-Ranker has the same TPR as PeptideProphet when the FPR is near zero, and reaches higher TPRs than both PeptideProphet and Percolator at other FPR levels. On both the UPS1 dataset (Figure 3B) and the Tal08 dataset (Figure 3C), FC-Ranker reaches higher TPRs than the other two methods at all FPR levels. In particular, on the Tal08 dataset, FC-Ranker reaches evidently high TPR levels even at comparatively high FPR levels.

Figure 3

ROC curves of FC-Ranker, PeptideProphet and Percolator. A: On Yeast dataset; B: On UPS1 dataset; C: On Tal08 dataset. True Positive Rate (TPR): TPR = TP/(TP + FN), False Positive Rate (FPR): FPR = FP/(FP + TN), with TP : number of true positives, FP : number of false positives, FN : number of false negatives, TN : number of true negatives.

Figure 4 depicts the relation between the number of TPs and the FDR, where we observed patterns similar to those of the corresponding ROC curves.

Figure 4

Performance comparison of FC-Ranker, PeptideProphet and Percolator in terms of the number of true positives (TPs). A: On Yeast dataset; B: On UPS1 dataset; C: On Tal08 dataset. False Discovery Rate (FDR): FDR = 2 · FP/(FP + TP), with TP : number of true positives, FP : number of false positives.
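The rates defined in the captions of Figures 3 and 4 translate directly into code. The sketch below is illustrative: the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the paper specifies.

```python
def true_positive_rate(tp, fn):
    """TPR = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def false_discovery_rate(tp, fp):
    """FDR = 2 * FP / (FP + TP), the definition given in the Figure 4 caption."""
    return 2.0 * fp / (fp + tp) if (fp + tp) else 0.0


# Toy counts: 95 true positives and 5 false positives give an FDR of 0.1.
example_fdr = false_discovery_rate(tp=95, fp=5)
```

Note the factor of 2 in the FDR definition used in this paper, so the value is twice the plain ratio FP / (FP + TP).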

Methods

Classification and clustering methods for peptide identification

Fuzzy clustering

Clustering analysis is an unsupervised learning method that groups similar data samples together. The silhouette index was introduced in [17, 18] to measure how well a sample belongs to a cluster.

Suppose that there are $l$ data samples $\{x_1, \ldots, x_l\}$, which are grouped into $K$ clusters, denoted as $C = \{C_1, \ldots, C_K\}$. Denote by $d(x_i, x_j)$ the distance between two samples $x_i$ and $x_j$, and by $C_k = \{x_1^k, \ldots, x_{m_k}^k\}$ the samples of the $k$th cluster, where $m_k = |C_k|$ and $k = 1, \ldots, K$. The average distance, denoted by $a_i^k$, between the $i$th data sample in cluster $C_k$ and the other samples in the same cluster is formulated as

$$a_i^k = \frac{1}{m_k - 1} \sum_{j = 1, \ldots, m_k,\; j \neq i} d(x_i^k, x_j^k), \quad i = 1, \ldots, m_k,$$

and the minimum average distance between the $i$th data sample in cluster $C_k$ and all data samples in the other clusters $C_v$, $v = 1, \ldots, K$, $v \neq k$, is defined as

$$b_i^k = \min_{v = 1, \ldots, K,\; v \neq k} \frac{1}{m_v} \sum_{j = 1}^{m_v} d(x_i^k, x_j^v), \quad i = 1, \ldots, m_k.$$

Then, we define the silhouette value of the $i$th data sample in $C_k$ as follows

$$s_i^k = \frac{b_i^k - a_i^k}{\max\{a_i^k, b_i^k\}}.$$

Clearly, the silhouette values lie in the interval $[-1, 1]$. The silhouette value of the cluster $C_k$ is defined as

$$s_k = \frac{1}{m_k} \sum_{i = 1}^{m_k} s_i^k, \quad k = 1, \ldots, K.$$
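The silhouette definitions above can be illustrated with a short Python sketch. It assumes Euclidean distance for $d$, at least two non-empty clusters, and the convention $a_i^k = 0$ for single-element clusters; none of these choices is stated in the text, and the names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance, used here as the distance d(., .) (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_values(clusters, dist=euclidean):
    """Return s_i^k for every sample, as one list of values per cluster."""
    result = []
    for k, cluster in enumerate(clusters):
        values = []
        for i, x in enumerate(cluster):
            if len(cluster) > 1:
                # a_i^k: mean distance to the other samples of the same cluster
                a = sum(dist(x, y) for j, y in enumerate(cluster) if j != i)
                a /= len(cluster) - 1
            else:
                a = 0.0  # convention for singleton clusters (an assumption)
            # b_i^k: smallest mean distance to the samples of any other cluster
            b = min(
                sum(dist(x, y) for y in other) / len(other)
                for v, other in enumerate(clusters)
                if v != k
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        result.append(values)
    return result


def cluster_silhouette(values):
    """s_k: mean silhouette value of one cluster."""
    return sum(values) / len(values)


# Two well-separated toy clusters: every silhouette value is close to 1.
toy_clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
toy_silhouettes = silhouette_values(toy_clusters)
```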

Classification

Our task is to identify the correct PSMs from a set of PSMs generated by database search tools in peptide identification. Decoy PSMs are usually employed to validate target PSMs, so the PSM samples can be categorized into a "good" class, with labels "+1", and a "bad" class, with labels "−1". In the classification setting, we use a vector of attributes such as x-correlation, delta-cn, ions, sprank, calc-neutral-pep-mass, etc., to represent a PSM data sample. Let $\{x_i\} \subset \mathbb{R}^q$, $i = 1, \ldots, l$, be the PSM data samples, with $q$ the number of attributes. We aim to find a discriminant function $f : \mathbb{R}^q \to \mathbb{R}$ that classifies the PSM data samples according to their labels.

One of the greatest challenges in peptide identification is the lack of data samples with deterministic +1 labels. In a standard classification setting, the discriminant function is obtained by training the model on two balanced types of data samples with deterministic labels. In the peptide identification problem, however, a great number of the PSMs generated by database search engines are incorrect, so the +1 labels are quite unreliable. The large number of data samples with incorrect +1 labels would therefore severely distort the trained discriminant function if they were used directly in standard classification models.

Here, we consider the kernel-based SVM classifier as follows:

$$f(x) = \sum_{j = 1}^{l} \alpha_j k(x_j, x) + b,$$

where $b \in \mathbb{R}$ and $k(\cdot, \cdot)$ is a chosen kernel function. The label of a data sample $x$ is predicted as +1 if $f(x) > 0$; otherwise it is predicted as −1. A quadratic program is usually solved to obtain the coefficients $\alpha$ and $b$, which requires a large computational overhead, especially for large-scale problems. To overcome this problem, a class of linear programming SVMs was introduced in [19].

For the $l$ data samples $\{(x_i, y_i)\}$, $i = 1, \ldots, l$, with $x_i \in \mathbb{R}^q$ and $y_i \in \{1, -1\}$, the linear programming SVM model is formulated as

$$\begin{aligned} \min_{\alpha, r, \xi, b} \quad & -r + c \sum_{i = 1}^{l} \xi_i \\ \text{s.t.} \quad & y_i f(x_i) = y_i \Big( \sum_{j = 1}^{l} \alpha_j y_j k(x_j, x_i) + b \Big) \ge r - \xi_i, \\ & -1 \le \alpha_i \le 1, \quad \xi_i \ge 0, \quad i = 1, \ldots, l, \end{aligned} \tag{1}$$

where $c > 0$ is a given constant, and the discriminant function is $f(\cdot) = \sum_{j = 1}^{l} \alpha_j y_j k(x_j, \cdot) + b$.
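Once the coefficients are known, evaluating the discriminant function and predicting a label is straightforward. The Python sketch below shows only this evaluation step; the coefficients `alpha` and `b` are hand-picked toy values rather than the solution of model (1), and all names are illustrative.

```python
import math


def rbf_kernel(x1, x2, sigma=2.0):
    """Gaussian (RBF) kernel, one possible choice for k(., .)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def discriminant(x, samples, labels, alpha, b, kernel=rbf_kernel):
    """f(x) = sum_j alpha_j * y_j * k(x_j, x) + b."""
    return sum(a * y * kernel(xj, x) for a, y, xj in zip(alpha, labels, samples)) + b


def predict(x, samples, labels, alpha, b, kernel=rbf_kernel):
    """Predict +1 if f(x) > 0, otherwise -1."""
    return 1 if discriminant(x, samples, labels, alpha, b, kernel) > 0 else -1


# Toy training set with hand-picked (not optimized) coefficients.
toy_samples = [(0.0, 0.0), (4.0, 0.0)]
toy_labels = [1, -1]
toy_alpha = [1.0, 1.0]
toy_b = 0.0
near_positive = predict((0.0, 0.0), toy_samples, toy_labels, toy_alpha, toy_b)
near_negative = predict((4.0, 0.0), toy_samples, toy_labels, toy_alpha, toy_b)
```

In practice $\alpha$, $b$, $r$ and $\xi$ would come from a linear programming solver applied to model (1).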

The basic FC-Ranker algorithm

In this section, the FC-Ranker algorithm is presented to calculate the score of each PSM data sample. The score values reflect the possibility of the PSM data samples being correct, and the PSMs with high scores are finally selected for users.

Denote by $\Omega = \{1, \ldots, l\}$ the set of indices of the $l$ PSM data samples, by $\Omega_+$ the set of indices of target PSMs, by

$$\Omega_{-1} = \{ i \in \Omega \mid y_i = -1 \}$$

the set of indices of decoy PSMs, by $\Omega_1$ the set of indices of good target PSMs, and by $\Omega_0 = \Omega_+ \setminus \Omega_1$ the set of indices of bad target PSMs. The FC-Ranker algorithm aims to select the set $\Omega_1$ from $\Omega_+$ using the data samples indexed by $\Omega$. To separate good target PSMs from the others, a discriminant function $f$ is constructed such that the function value $f(x_i)$ is positive if sample $x_i$ belongs to $\Omega_1$, and negative otherwise. A large discriminant function value for a target PSM sample $x_i$ indicates that the sample lies far from the decision boundary, and hence has a high possibility of being a good PSM. However, a large value of $f(x_i)$ alone is not sufficient to ensure that the PSM sample $x_i$ is good. Take the sample represented by "□" in Figure 5 as an example: it lies at a large distance from the decision boundary and thus has a large function value $f(□)$. This sample, however, tends to be a bad PSM, since it lies too far from the other PSM data samples indexed by $\Omega_+$.

Figure 5

Classification and clustering. "−" represents a decoy PSM, while "+" represents a target PSM. The data sample represented by "□" lies far from the decision boundary. However, the possibility of it being a correct PSM is remote, since it lies too far from the other target PSMs. The other marked data sample also has a small possibility of being a correct PSM, since it lies near the decision boundary.

On the other hand, a data sample may not be a good target PSM either if it lies comparatively close to other target PSMs but has a small discriminant function value. The second marked data sample in Figure 5, which lies near the decision boundary, should therefore also be excluded from the set $\Omega_1$. These observations suggest that a good target PSM data sample should satisfy two rules: 1) it has a large discriminant function value; 2) it is close to other target PSMs.

Fuzzy SVM classification

A weight $\theta_i \in [0, 1]$ is introduced for each target sample $x_i$ indexed by $\Omega_+$ to indicate its possibility of being correct, since its label is not trustworthy. A large weight usually indicates that the PSM is more likely to be correct. Since the decoy PSMs are known to be incorrect, we fix the weights $\theta_i$ to 1 for $i \in \Omega_{-1}$. Denote by $\mathrm{loss}(f(x_i), y_i)$ the empirical error of sample $x_i$; then the total empirical error can be formulated as $\sum_{i \in \Omega} \mathrm{loss}(f(x_i), y_i)$ in traditional classification problems with deterministic labels. Assigning a weight to each data sample, we reformulate the total empirical error as $\sum_{i \in \Omega} \theta_i\, \mathrm{loss}(f(x_i), y_i)$.

Thus, the linear programming SVM model (1) is transformed as follows

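$$\begin{aligned} \min_{\alpha, r, \xi, b} \quad & -r + c \sum_{i \in \Omega} \theta_i \xi_i \\ \text{s.t.} \quad & y_i \Big( \sum_{j = 1}^{l} \alpha_j y_j k(x_j, x_i) + b \Big) \ge r - \xi_i, \quad i \in \Omega, \\ & -1 \le \alpha_i \le 1, \quad \xi_i \ge 0, \quad i \in \Omega, \end{aligned} \tag{2}$$

where $\alpha \in \mathbb{R}^l$, $b \in \mathbb{R}$, $r \in \mathbb{R}$ and $\xi = [\xi_1, \ldots, \xi_l]^T \in \mathbb{R}^l$. Model (2) is referred to as the fuzzy linear programming SVM model.

Model (2) can be rewritten as

$$\begin{aligned} \min_{\alpha, r, \xi, b} \quad & \begin{bmatrix} 0_l^T & 0 & c\,\theta^T & -1 \end{bmatrix} \begin{bmatrix} \alpha \\ b \\ \xi \\ r \end{bmatrix} \\ \text{s.t.} \quad & \begin{bmatrix} \Lambda(y) K \Lambda(y) & y & I_l & -1_l \end{bmatrix} \begin{bmatrix} \alpha \\ b \\ \xi \\ r \end{bmatrix} \ge 0, \\ & r \ge 0, \quad -1 \le \alpha_i \le 1, \quad \xi_i \ge 0, \quad i \in \Omega, \end{aligned} \tag{3}$$

where $\theta = [\theta_1, \ldots, \theta_l]^T$, $\Lambda(y) = \mathrm{Diag}(y)$, $0_l \in \mathbb{R}^l$ is a vector of zeros, $1_l \in \mathbb{R}^l$ is a vector with each element equal to 1, $I_l$ is the $l \times l$ identity matrix, and $K = (k(x_i, x_j))_{1 \le i \le l,\, 1 \le j \le l}$. The model can be solved by existing optimization software, such as Mosek.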
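As a reading aid, the block form of model (3) can be checked against model (2) row by row. The expansion below assumes a symmetric kernel, so that $K_{ij} = k(x_i, x_j) = k(x_j, x_i)$; it is a verification sketch and adds nothing to the model itself.

```latex
\begin{aligned}
\big[\Lambda(y) K \Lambda(y)\,\alpha + b\,y + \xi - r\,1_l\big]_i
  &= y_i \sum_{j=1}^{l} K_{ij}\, y_j\, \alpha_j + y_i b + \xi_i - r \\
  &= y_i \Big( \sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \Big) - (r - \xi_i) \;\ge\; 0, \\
\begin{bmatrix} 0_l^T & 0 & c\,\theta^T & -1 \end{bmatrix}
\begin{bmatrix} \alpha \\ b \\ \xi \\ r \end{bmatrix}
  &= -r + c \sum_{i \in \Omega} \theta_i \xi_i .
\end{aligned}
```

The first two lines recover the margin constraint of model (2) for sample $i$, and the last line recovers its objective.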

Fuzzy silhouette

To handle uncertain labels, we generalize the silhouette concept from the deterministic setting to a fuzzy silhouette index.

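For $k \in \{-1, 1\}$ and $i \in \Omega$, the average distance of sample $x_i$ to the other data samples in $\Omega_k$ is formulated as

$$\beta_i^k = \frac{\sum_{j \in \Omega_k,\; j \neq i} \theta_j\, d(x_i, x_j)}{\sum_{j \in \Omega_k,\; j \neq i} \theta_j}, \tag{4}$$

where $\theta_j \in [0, 1]$. Then, we define the fuzzy silhouette of sample $x_i$ as

$$s_i = \frac{\beta_i^{-1} - \beta_i^{1}}{\max\{\beta_i^{-1}, \beta_i^{1}\}}, \quad i \in \Omega. \tag{5}$$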

It measures the degree to which a PSM sample lies far from the decoys and close to the good target samples. Hence, a PSM data sample is more likely to be correct if it has a large fuzzy silhouette value.
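A minimal Python sketch of Eqs. (4) and (5) is given below. It assumes Euclidean distance, samples stored in a list indexed by integer, and a fallback value of 0.0 when all weights in a group are zero; these choices and all names are illustrative rather than taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance, used here as d(., .) (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_mean_distance(i, members, points, theta, dist=euclidean):
    """beta_i^k of Eq. (4): theta-weighted mean distance from sample i to `members`."""
    num = 0.0
    den = 0.0
    for j in members:
        if j == i:
            continue  # a sample is not compared with itself
        num += theta[j] * dist(points[i], points[j])
        den += theta[j]
    return num / den if den > 0 else 0.0  # fallback when all weights vanish


def fuzzy_silhouette(i, good, decoys, points, theta, dist=euclidean):
    """s_i of Eq. (5): positive when sample i is closer to the good targets."""
    beta_good = weighted_mean_distance(i, good, points, theta, dist)
    beta_decoy = weighted_mean_distance(i, decoys, points, theta, dist)
    denom = max(beta_good, beta_decoy)
    return (beta_decoy - beta_good) / denom if denom > 0 else 0.0


# Toy data: indices 0 and 1 play the role of good targets, 2 and 3 of decoys.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_theta = [1.0, 1.0, 1.0, 1.0]
s_target = fuzzy_silhouette(0, [0, 1], [2, 3], toy_points, toy_theta)
s_decoy = fuzzy_silhouette(2, [0, 1], [2, 3], toy_points, toy_theta)
```

In this toy setting the target sample gets a value near +1 and the decoy a value near −1, which is the behaviour the text describes.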

For the sets $\Omega_{-1}$, $\Omega_1$ and $\Omega_0$, we define their average fuzzy silhouettes as

$$\bar{s}_k = \frac{\sum_{i \in \Omega_k} s_i}{|\Omega_k|},$$

where $|\Omega_k|$ is the cardinality of $\Omega_k$, $k = -1, 1, 0$. We also define

$$sep = \frac{\bar{s}_1 - \bar{s}_{-1}}{2} \tag{6}$$

as a metric to indicate the degree of separation between decoy PSM samples and good PSMs.
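The separation metric of Eq. (6) reduces to two averages. The following Python sketch uses invented silhouette values and hypothetical names, and assumes both index sets are non-empty.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def separation(s, good, decoys):
    """sep = (mean silhouette of good targets - mean silhouette of decoys) / 2."""
    s_bar_good = mean([s[i] for i in good])
    s_bar_decoy = mean([s[i] for i in decoys])
    return (s_bar_good - s_bar_decoy) / 2.0


# Toy fuzzy silhouette values, indexed by sample.
toy_s = [0.6, 0.4, -0.3, -0.1]
toy_sep = separation(toy_s, good=[0, 1], decoys=[2, 3])  # (0.5 - (-0.2)) / 2
```

Because each $s_i$ lies in $[-1, 1]$, $sep$ also lies in $[-1, 1]$, and it grows as the good targets and the decoys become better separated.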

Score of the samples

Based on the fuzzy SVM model and the fuzzy silhouette metric, we design a scoring scheme that defines the score of sample $x_i$ as

$$score(i) = (1 - sep)\, \varphi(f(x_i)) + sep\, \psi(s_i), \tag{7}$$

where $\varphi(\cdot)$ and $\psi(\cdot)$ are functions for scaling the values of $f(x_i)$ and $s_i$, respectively. Here, $\varphi(\cdot) : \mathbb{R} \to [-1, 1]$ is constructed as an increasing function, and $\psi(\cdot)$ as an increasing function mapping from $[-1, 1]$ to $[-1, 1]$. In particular, we choose $\varphi(f(x_i))$ and $\psi(s_i)$ as

$$\varphi(f(x_i)) = \frac{2}{\pi}\, \mathrm{sign}(f(x_i) - f_0)\, \mathrm{atan}\left( \left( \frac{|f(x_i) - f_0|}{f_{\max}} \right)^{1/4} \right), \tag{8}$$

$$\psi(s_i) = (s_i - s_0) / s_{\max}, \tag{9}$$

where $f_{\max}$ and $s_{\max}$ are the largest values of $\{|f(x_i) - f_0|\}$ and $\{|s_i - s_0|\}$ for $i \in \Omega_+$, respectively, $f_0$ is the threshold on the values of the discriminant function, and $s_0$ is the threshold on the fuzzy silhouette. The power of $1/4$ on $|f(x_i) - f_0|$ is introduced to smooth the weight contributions.
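Eqs. (7) to (9) can be sketched in Python as follows. The function and parameter names are hypothetical, and the sketch assumes `f_max` and `s_max` are strictly positive.

```python
import math


def phi(f_value, f0, f_max):
    """Eq. (8): scaled, sign-preserving transform of the discriminant value."""
    ratio = abs(f_value - f0) / f_max
    sign = (f_value > f0) - (f_value < f0)  # -1, 0 or +1
    return (2.0 / math.pi) * sign * math.atan(ratio ** 0.25)


def psi(s_value, s0, s_max):
    """Eq. (9): linear rescaling of the fuzzy silhouette value."""
    return (s_value - s0) / s_max


def score(f_value, s_value, sep, f0, f_max, s0, s_max):
    """Eq. (7): convex combination of the two scaled terms, weighted by sep."""
    return (1.0 - sep) * phi(f_value, f0, f_max) + sep * psi(s_value, s0, s_max)


# Toy case: the sample attains both maxima, so phi = (2/pi) * atan(1) = 0.5
# and psi = 1, giving 0.75 * 0.5 + 0.25 * 1.0 = 0.625 for sep = 0.25.
toy_score = score(f_value=3.0, s_value=0.8, sep=0.25,
                  f0=1.0, f_max=2.0, s0=0.3, s_max=0.5)
```

As $sep$ grows over the iterations, the silhouette term gains influence relative to the discriminant term.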

The FC-Ranker algorithm

The FC-Ranker algorithm iteratively adjusts the index set $\Omega_1$ of good PSMs by calculating the scores and weights of the data samples until a stop criterion is met. Initially, the algorithm sets $\Omega_1^0 = \Omega_+$ and $\Omega_0^0 = \emptyset$, i.e. all target PSM samples are viewed as good ones at iteration 0. At iteration $k$, the algorithm solves the fuzzy linear programming SVM model (3), calculates the fuzzy silhouette values of the samples according to Eq. (5), and updates the index sets $\Omega_1$ and $\Omega_0$ such that the indices of target PSMs in $\Omega_1$ with small scores are moved to $\Omega_0$, while the indices of target PSMs in $\Omega_0$ with large scores are moved to $\Omega_1$.

At the $k$th iteration, the PSM samples indexed by $\Omega_+$ are ranked according to their scores, and the top $n\%$ of them in $\Omega_1$ are retained. First, $\Omega_1^k$ is updated using the discriminant function values as

$$\Omega_1^{k + 1/3} = \{ i \in \Omega_1^k \mid f(x_i) \text{ is ranked in the top } n\% \text{ of } \{f(x_j)\}_{j \in \Omega_1^k} \}, \tag{10}$$

where $0 < n < 100$ is a constant percentage. Based on the calculated fuzzy silhouettes, $\Omega_1^{k + 1/3}$ is then updated by

$$\Omega_1^{k + 2/3} = \{ i \in \Omega_1^{k + 1/3} \mid s_i \text{ is ranked in the top } n\% \text{ of } \{s_j\}_{j \in \Omega_1^{k + 1/3}} \}, \tag{11}$$

and $\Omega_0^k$ is updated by

$$\Omega_0^{k + 1/2} = \Omega_+ \setminus \Omega_1^{k + 2/3}. \tag{12}$$
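The two-stage filtering of Eqs. (10) to (12) can be sketched in Python as below. Rounding the retained count up with `math.ceil` and the function names are assumptions; the paper does not say how a fractional count is rounded.

```python
import math


def top_percent(indices, value_of, n_percent):
    """Keep the indices whose values rank in the top n percent."""
    keep = math.ceil(len(indices) * n_percent / 100.0)  # rounding is an assumption
    ranked = sorted(indices, key=value_of, reverse=True)
    return set(ranked[:keep])


def filter_good_set(omega_1, omega_plus, f, s, n_percent=70):
    """Apply Eq. (10), then Eq. (11), then Eq. (12)."""
    by_discriminant = top_percent(omega_1, lambda i: f[i], n_percent)       # Eq. (10)
    by_silhouette = top_percent(by_discriminant, lambda i: s[i], n_percent)  # Eq. (11)
    omega_0 = set(omega_plus) - by_silhouette                                # Eq. (12)
    return by_silhouette, omega_0


# Toy example with ten targets: f favours large indices, s favours small ones.
toy_f = {i: float(i) for i in range(10)}
toy_s = {i: -float(i) for i in range(10)}
toy_good, toy_bad = filter_good_set(set(range(10)), set(range(10)), toy_f, toy_s)
```

With $n = 70$, the first stage keeps 7 of the 10 toy targets and the second keeps 5 of those 7, so only samples that rank well on both criteria remain in the good set.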

Finally, for $i \in \Omega$, new scores $score(i)^{k+1}$ are computed according to Eq. (7), and the weights $\theta_i^{k+1}$ are calculated by the following equation:

$$\theta_i^{k+1} = \begin{cases} \max\{ score(i)^{k+1}, 0 \}, & i \in \Omega_+; \\ 1, & i \in \Omega_{-1}. \end{cases} \tag{13}$$
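The weight update of Eq. (13) clips negative target scores to zero and keeps decoy weights at one. A minimal Python sketch, with hypothetical names and scores stored in a dictionary keyed by sample index:

```python
def update_weights(scores, omega_plus, omega_decoy):
    """Eq. (13): theta_i = max(score_i, 0) for targets, 1 for decoys."""
    theta = {}
    for i in omega_plus:
        theta[i] = max(scores[i], 0.0)
    for i in omega_decoy:
        theta[i] = 1.0  # decoys are known to be incorrect, so their label is certain
    return theta


# Toy scores: samples 0 and 1 are targets, sample 2 is a decoy.
toy_theta = update_weights({0: 0.7, 1: -0.2, 2: 0.1}, omega_plus=[0, 1], omega_decoy=[2])
```

A target with a non-positive score thus receives weight 0 and no longer contributes to the weighted empirical error in the next iteration.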

Then the indices of the samples indexed by $\Omega_0^{k + 1/2}$ are moved to $\Omega_1^{k + 2/3}$ if the samples have large discriminant function values, i.e.,

$$\Omega_1^{k+1} = \Omega_1^{k + 2/3} \cup \{ i \in \Omega_0^{k + 1/2} \mid f(x_i) \ge \bar{f}_1^{k + 2/3} \}, \qquad \Omega_0^{k+1} = \Omega_+ \setminus \Omega_1^{k+1}, \tag{14}$$

where $\bar{f}_1^{k + 2/3}$ is the average of $\{ f(x_i) \mid i \in \Omega_1^{k + 2/3} \}$.

The algorithm terminates when the number of identified good PSM samples reaches a given threshold $\hat{p}$, or the separation degree $sep^{k+1}$ defined by Eq. (6) reaches a threshold $\widehat{sep}$, i.e.,

$$|\Omega_1^{k+1}| \le \hat{p}, \quad \text{or} \quad sep^{k+1} \ge \widehat{sep}. \tag{15}$$
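The stop test can be written as a one-line predicate. In the sketch below the directions of the two inequalities are inferred from the surrounding text (the good set shrinks from $\Omega_+$, and $sep$ grows until it exceeds 0.25), so they should be read as assumptions; the names are hypothetical, and the thresholds follow the values given under "Parameter setting".

```python
def should_stop(num_good, sep, num_targets, p_ratio=0.03, sep_hat=0.25):
    """Stop when the good set is small enough or the separation is large enough."""
    p_hat = p_ratio * num_targets  # p_hat = 0.03 * |Omega_+|
    return num_good <= p_hat or sep >= sep_hat


# Toy check using the Yeast target count (6702 target PSMs) from the text.
stop_on_sep = should_stop(num_good=1475, sep=0.26, num_targets=6702)
keep_going = should_stop(num_good=1475, sep=0.10, num_targets=6702)
```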

The FC-Ranker algorithm is summarized in Algorithm 1.

Algorithm 1 The FC-Ranker Algorithm

Input: $\{x_i, y_i\}$, $i \in \Omega$;

Output: Scores of samples indexed by $\Omega$;

1: Initialization: $k = -1$, $\Omega_1^0 = \Omega_+$, $\Omega_0^0 := \emptyset$, $\theta_i^0 = 1$, $i \in \Omega$.

2: while Stop criterion (15) is not satisfied do

3:    $k := k + 1$.

4:    SVM classification.

5:        Solve fuzzy SVM classification model Eq. (3);

6:        Calculate $\Omega_1^{k + 1/3}$ via Eq. (10).

7:    Clustering analysis.

8:        Calculate fuzzy silhouettes $s_i$, $i \in \Omega$, via Eq. (5);

9:        Calculate $\Omega_1^{k + 2/3}$, $\Omega_0^{k + 1/2}$ via Eq. (11), (12).

10:    Update weights.

11:        Calculate $score(i)^{k+1}$, $\theta^{k+1}$ via Eq. (7), (13);

12:        Calculate $\Omega_1^{k+1}$, $\Omega_0^{k+1}$, $sep^{k+1}$ via Eq. (14), (6).

13: end while

FC-Ranker for the large-scale problem

The number of PSMs output by a database search engine is usually extremely large. This section discusses implementation practices that make the algorithm capable of solving large-scale problems.

Fuzzy SVM classification for the large-scale problem

If the data matrix is sparse, interior-point algorithms are well suited to solving large-scale linear programming problems. Unfortunately, the kernel matrix $K$ in Problem (3) is not sparse in general; it is usually quite dense, with most of its elements nonzero. Storing a large dense matrix $K$ is not a trivial task. Take a matrix $K$ with the Gaussian kernel and $l = 400{,}000$ as an example: if each element occupies four bytes, then $K$ has $l^2 = 1.6 \times 10^{11}$ elements and takes up 640 GB of storage in all.
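The storage estimate follows from a one-line calculation, shown here in Python with decimal gigabytes ($10^9$ bytes); the function name is illustrative.

```python
def dense_kernel_bytes(num_samples, bytes_per_element=4):
    """Bytes needed to store a dense num_samples x num_samples kernel matrix."""
    return num_samples * num_samples * bytes_per_element


l = 400_000
num_elements = l * l                           # 1.6e11 elements
gigabytes = dense_kernel_bytes(l) / 1e9        # 640.0 (decimal GB)
```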

Interestingly, our experimental experience indicates that the kernel matrix usually has quite low rank in the peptide identification problem. Hence, a sub-matrix $K'$ consisting of $l'$ columns of $K$ ($l' \ll l$) is selected to substitute for $K$ in Problem (3). These $l'$ columns are selected randomly from all columns of $K$. This operation can be implemented by sampling $l'$ data samples randomly and then calculating the sub-matrix $K'$ according to the kernel function, which reduces the storage greatly. Denote by $\Omega' \subset \Omega$ the index set consisting of the indices of the $l'$ columns. Then the matrix $(K')_{ij} = k(x_i, x_j)$, $i \in \Omega$, $j \in \Omega'$, of size $l \times l'$ can be calculated. Let $y' = (y_j)_{j \in \Omega'}$; then Problem (3) is reduced to

$$\begin{aligned} \min_{\alpha, r, \xi, b} \quad & \begin{bmatrix} 0_{l'}^T & 0 & c\,\theta^T & -1 \end{bmatrix} \begin{bmatrix} \alpha \\ b \\ \xi \\ r \end{bmatrix} \\ \text{s.t.} \quad & \begin{bmatrix} \Lambda(y) K' \Lambda(y') & y & I_l & -1_l \end{bmatrix} \begin{bmatrix} \alpha \\ b \\ \xi \\ r \end{bmatrix} \ge 0, \\ & r \ge 0, \quad \xi_i \ge 0, \; i \in \Omega, \quad -1 \le \alpha_j \le 1, \; j \in \Omega', \end{aligned} \tag{16}$$

where $\alpha \in \mathbb{R}^{l'}$, $b \in \mathbb{R}$, $r \in \mathbb{R}$, $\xi \in \mathbb{R}^l$, and $\Lambda(y') = \mathrm{Diag}(y')$.
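The column-sampling step can be sketched in Python as follows. This is only an illustration of building $K'$ from $l'$ randomly chosen samples; the function names, the fixed random seed and the toy points are assumptions, and no linear program is solved here.

```python
import math
import random


def rbf_kernel(x1, x2, sigma=2.0):
    """Gaussian (RBF) kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sampled_kernel_columns(points, num_columns, sigma=2.0, seed=0):
    """Pick num_columns sample indices at random and build the l x l' matrix K'."""
    rng = random.Random(seed)
    columns = sorted(rng.sample(range(len(points)), num_columns))
    k_sub = [[rbf_kernel(x, points[j], sigma) for j in columns] for x in points]
    return columns, k_sub


# Toy data: 5 samples, 2 sampled columns, so K' has shape 5 x 2 instead of 5 x 5.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
toy_columns, toy_k_sub = sampled_kernel_columns(toy_points, num_columns=2)
```

Storage then scales with $l \times l'$ rather than $l^2$, which is the saving the text describes.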

Fuzzy silhouette for the large-scale problem

To update the fuzzy silhouette value $s_i$ of sample $i$, the major work is to compute $\beta_i^1$ and $\beta_i^{-1}$ in Eq. (4), which requires calculating up to $l$ distances. In all, each iteration computes $|\Omega| \cdot |\Omega| = l^2$ distances over all samples. Let $\rho \in (0, 1)$ be a given sampling rate. We sample $\rho\,|\Omega_1|$ indices of targets from $\Omega_1$ and $\rho\,|\Omega_{-1}|$ indices of decoys from $\Omega_{-1}$, and use the sampled index sets in place of $\Omega_1$ and $\Omega_{-1}$ in Eq. (4), respectively. Then at most $\rho\, l\, (|\Omega_{-1}| + |\Omega_1|) \le \rho\, l^2$ distances need to be calculated at each iteration.

Conclusion

A new scoring method has been developed based on the iterations of the FC-Ranker algorithm, which combines a fuzzy silhouette index with a fuzzy SVM classification model to cope with the large number of incorrect labels among target PSM samples. In the fuzzy classification model, each PSM is assigned a calculated weight that indicates the possibility of the PSM sample being correct. The performance of the FC-Ranker algorithm has been compared with that of PeptideProphet and Percolator on the Yeast, UPS1 and Tal08 datasets, showing that FC-Ranker surpasses PeptideProphet and Percolator in terms of ROC curves and the number of identified target PSM samples at the same FDR level. Moreover, FC-Ranker outputs more target PSMs than PeptideProphet and Percolator do, while the three methods share a large number of PSMs in common.

Abbreviations

PSMs: peptide spectrum matches

SVM: support vector machine

References

  1. Elias J, Gygi S: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 2007, 4(3):207–214. doi:10.1038/nmeth1019

  2. Perkins D, Pappin D, Creasy D, Cottrell J: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–3567. doi:10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2

  3. Ramakrishnan S, Mao R, Nakorchevskiy A, Prince J, Willard W, Xu W, Marcotte E, Miranker D: A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics 2006, 22(12):1524–1531. doi:10.1093/bioinformatics/btl118

  4. Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 2002, 74(20):5383–5392. doi:10.1021/ac025747h

  5. Ding Y, Choi H, Nesvizhskii A: Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. Journal of Proteome Research 2008, 7(11):4878–4889. doi:10.1021/pr800484x

  6. Choi H, Nesvizhskii A: Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. Journal of Proteome Research 2007, 7:254–265.

  7. Richard E, Knierman M, Freeman A, Gelbert L, Patil S, Hale J: Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. Journal of Proteome Research 2007, 6(5):1758–1767. doi:10.1021/pr0605320

  8. Olsen J, Mann M: Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(37):13417–13422. doi:10.1073/pnas.0405549101

  9. Bianco L, Mead J, Bessant C: Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG 2006 standard MS/MS data sets. Journal of Proteome Research 2009, 8(4):1782–1791. doi:10.1021/pr800792z

  10. Anderson D, Li W, Payan D, Noble W: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. Journal of Proteome Research 2003, 2(2):137–146. doi:10.1021/pr0255654

  11. Spivak M, Weston J, Bottou L, Käll L, Noble W: Improvements to the percolator algorithm for Peptide identification from shotgun proteomics data sets. Journal of Proteome Research 2009, 8(7):3737–3745. doi:10.1021/pr801109k

  12. Käll L, Canterbury J, Weston J, Noble W, MacCoss M: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4(11):923–925. doi:10.1038/nmeth1113

  13. Liang X, Xia Z, Niu X, Link AJ, Pang L, Wu F, Zhang H: A fuzzy cluster-based algorithm for peptide identification. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on. IEEE; 2012:602–609.

  14. Sanders S, Jennings J, Canutescu A, Link A, Weil P: Proteomics of the eukaryotic transcription machinery: identification of proteins associated with components of yeast TFIID by multidimensional mass spectrometry. Molecular and Cellular Biology 2002, 22(13):4723–4738. doi:10.1128/MCB.22.13.4723-4738.2002

  15. SGD: Saccharomyces Genome Database. 2012. [http://www.yeastgenome.org]

  16. GenBank: NCBI gene bank. 2012. [http://www.ncbi.nlm.nih.gov/genbank]

  17. Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987, 20:53–65.

  18. Petrovic S: A comparison between the silhouette index and the davies-bouldin index in labelling ids clusters. Proceedings of the 11th Nordic Workshop of Secure IT Systems 2006, 53–64.

  19. Zhou W, Zhang L, Jiao L: Linear programming support vector machines. Pattern Recognition 2002, 35(12):2927–2936. doi:10.1016/S0031-3203(01)00210-2

Acknowledgements

XN and AJL were supported by NIH grant GM64779. LP was supported by NSF of China under grant 11171049.

Declarations

The publication costs for this article were funded by Xijun Liang.

This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.

Author information

Corresponding author

Correspondence to Zhonghang Xia.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XL and ZX designed the basic FC-Ranker algorithm and wrote the manuscript. XN, AL and FW designed the version of FC-Ranker algorithm for the large-scale problem and corresponding experiments. XL, LP and HZ designed and operated experiments. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Liang, X., Xia, Z., Niu, X. et al. Peptide identification based on fuzzy classification and clustering. Proteome Sci 11 (Suppl 1), S10 (2013). https://doi.org/10.1186/1477-5956-11-S1-S10

  • DOI: https://doi.org/10.1186/1477-5956-11-S1-S10
