Peptide identification based on fuzzy classification and clustering

Background Sequence database searching has been the dominant method for peptide identification, in which a large number of peptide spectra generated from LC/MS/MS experiments are searched using a search engine against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge. Results A novel scoring method named FC-Ranker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. In particular, the scores of PSMs are updated iteratively by using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops. Conclusions Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.


Background
In protein identification, observed peptide spectra are searched against theoretical fragmentation spectra derived from target databases. Peptide spectrum matches (PSMs) are scored by database search tools, and the high-scoring PSMs are selected as target PSMs. In fact, more than half of the selected PSMs are incorrect [1]. Although many filters [2,3] have been proposed to refine the outputs further, they do not generalize across different datasets.
To tackle this problem, PeptideProphet [4] used unsupervised learning to automatically select the PSMs output by database search tools. Based on the assumption that the PSM samples are drawn from a mixture distribution that represents the chance of a "correct" PSM and an "incorrect" PSM, PeptideProphet applies the expectation maximization (EM) method to calculate the probability of each PSM being "correct". As PeptideProphet searches only the set of high-scoring PSMs for "correct" ones, some good low-ranked PSMs may be lost. Adaptive PeptideProphet was proposed in [5] to improve the performance of PeptideProphet by iteratively training a discriminant function from a set of top-ranked PSM samples, while [6] attempted to extend PeptideProphet by exploiting decoy PSMs in semi-supervised learning, estimating a more accurate probability by combining decoy PSMs into a unified semi-supervised expectation-maximization framework. In [7][8][9], decoy databases were used to validate the performance of post-database search algorithms. Support vector machines (SVMs) have also been studied for the peptide assignment problem in [10,11]. Percolator [12] employed an SVM to iteratively fit models in which target PSMs score higher than decoy PSMs. As a semi-supervised learning model, Percolator did not make full use of the labels and samples of target PSMs. More recently, a fully supervised SVM learning model was proposed in [11] to improve the performance of Percolator by utilizing target PSM data, where the "incorrect" target PSMs are viewed as noise and a special loss function is employed to reduce the negative impact of that noise on the learning model. Although the classification model separates most good target PSMs from noise and decoy PSMs, all selected PSMs are treated in the same way.
In this paper, a new scoring method, FC-Ranker, is developed not only to identify reliable target PSMs, but also to evaluate the confidence of each target PSM. As good target PSMs are close to each other, FC-Ranker integrates sample clustering into the classification procedure to compute the possibility of each target PSM being correct. Compared with the standard SVM model, the proposed fuzzy classification model assigns a weight to each target PSM indicating its likelihood of being correct. The score of each PSM sample is computed by combining the discriminant function value and the fuzzy silhouette value. The algorithm repeatedly updates the values of the discriminant function and the fuzzy silhouette index for each PSM sample, and recomputes the weights of the targets until it stops. In experimental studies, FC-Ranker shows a large overlap of identified target PSMs with PeptideProphet and Percolator, while identifying more target PSMs in all datasets.
The first stage of the work was published in [13]. In this work, we compared the FC-Ranker algorithm with another benchmark method, Percolator, in the experimental studies. As Percolator is based on an SVM learning model, it provides a better reference for performance comparison. Furthermore, we added a new dataset, Tal08, which has different characteristics (refer to Table 1) from the Yeast and UPS1 datasets. The performance of the proposed FC-Ranker algorithm has been evaluated on all three datasets in terms of the number of target PSMs, overlaps and ROC curves, and compared with PeptideProphet and Percolator. The new data analysis and results reinforce the effectiveness of the proposed FC-Ranker method.

Results and discussion
The FC-Ranker algorithm is compared with PeptideProphet [4] and Percolator [12] to validate its effectiveness. We used a PC with an Intel(R) CPU (1.80 GHz × 2) and 2.0 GB of RAM for all experiments.

Experimental setup

Dataset
FC-Ranker was examined over three datasets: S. cerevisiae Gcn4 (Yeast), Universal Proteomics Standard (UPS1) and Tal08 [14]. Trypsin digestion of the protein samples generates three types of tryptic peptides: full-digested (both ends of a peptide satisfy the enzyme specificity rule), half-digested (only one end satisfies the enzyme specificity rule) and none-digested (neither end satisfies the rule). The database of Yeast protein sequences was obtained from the Saccharomyces Genome Database (SGD) [15] and the Sigma48 protein sequence database from NCBI gene bank [16]. The attributes of each PSM sample include x-correlation, delta-cn, ions, sprank and calc-neutral-pep-mass.
The statistics of the three datasets are listed in Table 1.

Preprocess
In addition to the attributes output by SEQUEST, such as x-correlation, delta-cn, ions, sprank and calc-neutral-pep-mass, another attribute, "digested type", was added to the representation, with scalars "2", "1" and "0" for the full-digested, half-digested and none-digested types, respectively. The values of each attribute were transformed linearly beforehand so that they have zero mean and unit variance (this is called a normalization process). After normalization, we multiply the values of the x-correlation and delta-cn attributes by a weight of 2.0, inasmuch as these two attributes are more important in the data representation. As experience shows that the attribute "digested type" also plays an important role, a weight of 2.0 was similarly applied to the values of this attribute after the normalization process.
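The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute names, the toy PSM values and the use of the population variance are ours, not details given in the text.

```python
import math

# Hypothetical attribute order; names mirror the SEQUEST attributes in the text.
ATTRIBUTES = ["xcorr", "delta_cn", "ions", "sp_rank",
              "calc_neutral_pep_mass", "digested_type"]
# Weights applied after normalization (2.0 for three attributes, 1.0 otherwise).
WEIGHTS = {"xcorr": 2.0, "delta_cn": 2.0, "digested_type": 2.0}


def standardize(column):
    """Shift and scale one attribute column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n  # population variance (assumption)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant columns
    return [(v - mean) / std for v in column]


def preprocess(rows):
    """rows: list of PSM feature lists ordered as in ATTRIBUTES."""
    columns = [standardize(list(col)) for col in zip(*rows)]
    scaled = [[v * WEIGHTS.get(name, 1.0) for v in col]
              for name, col in zip(ATTRIBUTES, columns)]
    return [list(row) for row in zip(*scaled)]


# Toy PSMs with made-up values, for illustration only.
psms = [
    [3.1, 0.40, 0.70, 1, 1500.2, 2],
    [1.2, 0.05, 0.30, 8, 980.7, 0],
    [2.4, 0.22, 0.55, 2, 2210.9, 1],
]
features = preprocess(psms)
```

After this step, a weighted attribute has variance 4 (weight 2.0 squared), so it contributes more to any distance or kernel computed on the features.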

Parameter setting
In all of the experiments, the parameter c is set to 1.0 in the proposed fuzzy linear programming SVM model, and the Gaussian (RBF) kernel is chosen with parameter σ = 2.0.

Validation of sep throughout iterations
On both of the datasets shown in Figure 1, the value of $\bar{s}_1$ is almost equal to $\bar{s}_{-1}$ initially; $\bar{s}_1$ then increases as the iterations proceed, while $\bar{s}_{-1}$ decreases throughout the procedure. Hence, an increasing curve of sep, which is defined as $(\bar{s}_1 - \bar{s}_{-1})/2$, is observed in the figure. At iteration 4 in Figure 1A (Yeast dataset), the value of sep exceeds the given threshold of 0.25, meeting the termination criterion of the algorithm. The increasing values of sep illustrate that the identified good target PSMs indexed by $\Omega_1$ move closer to each other and become better separated from the decoy PSMs as the iterations proceed, showing the effectiveness of the fuzzy silhouette index.

Comparison of target PSMs
We compared the target PSMs output by PeptideProphet, Percolator and FC-Ranker at an FDR level of 0.05 in Table 2. On the Yeast dataset, FC-Ranker identified 1475 target PSMs, while PeptideProphet output 1443 and Percolator output 1393. FC-Ranker thus found 32 more target PSMs than PeptideProphet and 82 more than Percolator. On the UPS1 dataset, FC-Ranker found 681 target PSMs, which is 243 PSMs (55.5%) more than Percolator and 115 PSMs (20.3%) more than PeptideProphet. On the Tal08 dataset, FC-Ranker output 1092 target PSMs, which is 135 PSMs (14.1%) more than PeptideProphet and 139 PSMs (14.6%) more than Percolator. Similar results for the PSMs output by the three methods on particular digested types are also shown in Table 2.
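The relative gains quoted above can be re-derived with a short arithmetic check. In the sketch below, the PeptideProphet and Percolator counts for UPS1 and Tal08 are not stated in the text; they are back-computed from the reported differences, which is our assumption.

```python
# Counts at FDR 0.05 as quoted in the text. Baseline counts for UPS1 and Tal08
# are back-computed from the stated differences (assumption).
counts = {
    "Yeast": {"FC-Ranker": 1475, "PeptideProphet": 1443, "Percolator": 1393},
    "UPS1": {"FC-Ranker": 681, "PeptideProphet": 681 - 115, "Percolator": 681 - 243},
    "Tal08": {"FC-Ranker": 1092, "PeptideProphet": 1092 - 135, "Percolator": 1092 - 139},
}


def relative_gain(dataset, baseline):
    """Extra PSMs found by FC-Ranker: (count, percentage of the baseline)."""
    fc = counts[dataset]["FC-Ranker"]
    base = counts[dataset][baseline]
    return fc - base, round(100.0 * (fc - base) / base, 1)


for dataset in counts:
    for baseline in ("PeptideProphet", "Percolator"):
        extra, pct = relative_gain(dataset, baseline)
        print(f"{dataset}: +{extra} PSMs ({pct}%) over {baseline}")
```

The percentages are taken relative to the baseline method's count, which reproduces the figures given for UPS1 and Tal08.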
We analyzed the target PSMs output by the three methods; their overlaps are summarized in Figure 2. There are large overlaps among the PSMs output by the three approaches on all of the Yeast, UPS1 and Tal08 datasets. Specifically, FC-Ranker, PeptideProphet and Percolator identified 1248 common target PSMs in the Yeast dataset (Figure 2A), which covers 86.5% of the target PSMs output by PeptideProphet, 89.6% of those output by Percolator and 84.6% of those output by FC-Ranker. In particular, FC-Ranker identified 129 PSMs (8.9%) selected by PeptideProphet but not covered by Percolator, and found 14 PSMs (1.0%) selected by Percolator but not covered by PeptideProphet.
Figure 3 shows the ROC curves of the three methods on the Yeast, UPS1 and Tal08 datasets. On the Yeast dataset (Figure 3A), FC-Ranker has the same TPR as PeptideProphet when the FPR is near zero, and reaches higher TPRs than both PeptideProphet and Percolator at other FPR levels. On both the UPS1 dataset (Figure 3B) and the Tal08 dataset (Figure 3C), FC-Ranker reaches higher TPRs than the other two methods at all FPR levels. In particular, on the Tal08 dataset, FC-Ranker reaches notably high TPR levels even at comparatively high FPR levels. Figure 4 depicts the relation between the number of TPs and the FDR, where we observed patterns similar to those of the corresponding ROC curves.

Classification and clustering methods for peptide identification

Fuzzy clustering
Clustering analysis is an unsupervised learning method that groups similar data samples together. The silhouette index was introduced in [17,18] to measure how well a sample belongs to a cluster.
Suppose that there are $l$ data samples $\{x_1, \ldots, x_l\}$, which are grouped into $K$ clusters, denoted as $C = \{C_1, \ldots, C_K\}$. Denote by $d(x_i, x_j)$ the distance between two samples $x_i$ and $x_j$, and by $C_k = \{x_1^k, \ldots, x_{m_k}^k\}$ the samples of the $k$th cluster, where $m_k = |C_k|$ and $k = 1, \ldots, K$. The average distance, denoted by $a_i^k$, between the $i$th data sample in cluster $C_k$ and the other samples in the same cluster is formulated as

$$a_i^k = \frac{1}{m_k - 1} \sum_{j \neq i} d(x_i^k, x_j^k),$$

and the minimum average distance between the $i$th data sample in cluster $C_k$ and the data samples in the other clusters as

$$b_i^k = \min_{h \neq k} \frac{1}{m_h} \sum_{j=1}^{m_h} d(x_i^k, x_j^h).$$

Then, we define the silhouette value of the $i$th data sample in $C_k$ as follows:

$$s_i^k = \frac{b_i^k - a_i^k}{\max\{a_i^k, b_i^k\}}.$$

Clearly, the silhouette values lie in the interval $[-1, 1]$. The silhouette value of the cluster $C_k$ is defined as

$$S_k = \frac{1}{m_k} \sum_{i=1}^{m_k} s_i^k.$$
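As an illustration, the standard silhouette index of [17] can be computed as in the following Python sketch. The Euclidean distance, the function names, the conventions for singleton clusters and zero denominators, and the toy clusters are our assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(clusters, dist=euclidean):
    """Return one list of silhouette values per cluster.

    clusters: list of at least two non-empty clusters, each a list of points.
    """
    all_values = []
    for k, cluster in enumerate(clusters):
        values = []
        for i, x in enumerate(cluster):
            same = [y for j, y in enumerate(cluster) if j != i]
            if not same:
                values.append(0.0)  # convention for singleton clusters (assumption)
                continue
            # Average distance to the other samples of the same cluster.
            a = sum(dist(x, y) for y in same) / len(same)
            # Smallest average distance to the samples of any other cluster.
            b = min(
                sum(dist(x, y) for y in other) / len(other)
                for h, other in enumerate(clusters)
                if h != k
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        all_values.append(values)
    return all_values


def cluster_silhouettes(clusters, dist=euclidean):
    """Average silhouette value of each cluster."""
    return [sum(v) / len(v) for v in silhouette_values(clusters, dist)]


# Two well-separated toy clusters (made-up points).
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
per_sample = silhouette_values(clusters)
per_cluster = cluster_silhouettes(clusters)
```

For well-separated clusters such as these, every value is close to 1; values near 0 or below indicate samples that sit between clusters or in the wrong one.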

Classification
Our task is to identify the correct PSMs from a set of PSMs generated by database searching tools in peptide identification. Decoy PSMs are usually employed to validate target PSMs, so the PSM samples can be categorized into a "good" class, with labels "+1", and a "bad" class, with labels "−1". In the classification setting, we use a vector of attributes such as x-correlation, delta-cn, ions, sprank, calc-neutral-pep-mass, etc., to represent a PSM data sample. Let $\{x_i\} \subseteq \mathbb{R}^q$, $i = 1, \ldots, l$, be the PSM data samples, with $q$ the number of attributes. We aim to find a discriminant function $f: \mathbb{R}^q \to \mathbb{R}$ that classifies the PSM data samples according to their labels.
One of the greatest challenges in peptide identification is the lack of data samples with deterministic +1 labels. In a standard classification setting, the discriminant function is obtained by training the model on two balanced sets of data samples with deterministic labels. In the peptide identification problem, however, a great number of the PSMs generated by database search engines are incorrect, and the data samples with +1 labels are quite unreliable. Thus, the large number of data samples with incorrect +1 labels would severely distort the trained discriminant function if they were used directly in standard classification models.

Liang et al. Proteome Science 2013, 11(Suppl 1):S10 http://www.proteomesci.com/content/11/S1/S10

Here, we consider the kernel-based SVM classifier as follows:

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i k(x_i, x) + b,$$

where $b \in \mathbb{R}$ and $k(\cdot, \cdot)$ is a chosen kernel function. The label of a data sample $x$ is predicted as +1 if $f(x) > 0$; otherwise it is predicted as −1. A quadratic program is usually solved to obtain the coefficients $\alpha$ and $b$, which incurs a heavy computational overhead, especially for large-scale problems. To overcome this problem, a class of linear programming SVMs is introduced in [19].
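The decision rule can be illustrated with a small Python sketch. It assumes a kernel expansion of the form $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$ and one common scaling of the Gaussian kernel; both are our assumptions, and the coefficients below are made up rather than trained.

```python
import math


def gaussian_kernel(x, z, sigma=2.0):
    """One common RBF form; the exact scaling used in the paper is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def discriminant(x, samples, labels, alpha, b, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * k(x_i, x) + b (assumed expansion)."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alpha, labels, samples)) + b


def predict(x, samples, labels, alpha, b):
    """Predict +1 if f(x) > 0, otherwise -1, as stated in the text."""
    return 1 if discriminant(x, samples, labels, alpha, b) > 0 else -1


# Made-up training samples and coefficients, for illustration only.
samples = [(0.0, 0.0), (4.0, 4.0)]
labels = [1, -1]
alpha = [1.0, 1.0]
b = 0.0

near_positive = predict((0.5, 0.5), samples, labels, alpha, b)
near_negative = predict((3.5, 3.5), samples, labels, alpha, b)
```

A point equidistant from both samples gives $f(x) = 0$ here and is therefore predicted as −1 under the stated rule.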
For the $l$ data samples $\{(x_i, y_i)\}$, $i = 1, \ldots, l$, with $x_i \in \mathbb{R}^q$ and $y_i \in \{1, -1\}$, the linear programming SVM model is formulated as model (1), where $c > 0$ is a given constant, and the discriminant

The basic FC-Ranker algorithm
In this section, the FC-Ranker algorithm is presented to calculate the score of each PSM data sample. The score values reflect the possibility of the PSM data samples being correct, and the PSMs with high scores are finally selected for users.
Denote by $\Omega = \{1, \ldots, l\}$ the set of indices of the $l$ PSM data samples, by $\Omega_+$ the set of indices of target PSMs, by $\Omega_-$ the set of indices of decoy PSMs, by $\Omega_1$ the set of indices of good target PSMs, and by $\Omega_0 = \Omega_+ \setminus \Omega_1$ the set of indices of bad target PSMs. The FC-Ranker algorithm aims to select the set $\Omega_1$ from $\Omega_+$ using the data samples indexed by $\Omega_-$. To separate good target PSMs from the others, a discriminant function $f$ is constructed such that the function value $f(x_i)$ is positive if sample $x_i$ belongs to $\Omega_1$, and negative otherwise. A large discriminant function value of a target PSM sample $x_i$ indicates that the sample lies far away from the decision boundary, and hence has a high possibility of being a good PSM. However, a large discriminant function value $f(x_i)$ alone is not sufficient to ensure that the PSM sample $x_i$ is good. Take the sample represented by "☐" in Figure 5 as an example: it lies far from the decision boundary and thus has a large function value $f$(☐). This sample, however, tends to be a bad PSM, since it lies too far away from the other PSM data samples indexed by $\Omega_+$.
On the other hand, a data sample may not be a good target PSM either if it lies comparatively close to other target PSMs but has a small discriminant function value. The data sample represented by "⊕" in Figure 5 should therefore also be excluded from the set $\Omega_1$. These observations suggest that a good target PSM data sample should satisfy two rules: 1) it has a large discriminant function value; 2) it is close to other target PSMs.

Fuzzy SVM classification
A weight $\theta_i \in [0, 1]$ is introduced for each target sample $x_i$ indexed by $\Omega_+$ to indicate its possibility of being correct, since its label is not trustworthy. A large weight usually indicates that the PSM is more likely to be correct. Since the decoy PSMs are known to be incorrect, we fix the weights $\theta_i$ to 1 for $i \in \Omega_-$. Denote by $\mathrm{loss}(f(x_i), y_i)$ the empirical error of sample $x_i$; in traditional classification problems with deterministic labels, the total empirical error can then be formulated as $\sum_{i \in \Omega} \mathrm{loss}(f(x_i), y_i)$. Assigning a weight to each data sample, we reformulate the total empirical error as $\sum_{i \in \Omega} \theta_i \cdot \mathrm{loss}(f(x_i), y_i)$.
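The effect of the weights on the total empirical error can be seen in a small Python sketch. The hinge loss used here is our assumption, since the text leaves the loss function generic, and the numbers are made up.

```python
def hinge_loss(fx, y):
    """Hinge loss; an assumed choice, the text does not fix loss(., .)."""
    return max(0.0, 1.0 - y * fx)


def weighted_empirical_error(f_values, labels, theta, loss=hinge_loss):
    """Total error sum_i theta_i * loss(f(x_i), y_i)."""
    return sum(t * loss(fx, y) for fx, y, t in zip(f_values, labels, theta))


# Three target PSMs (label +1, uncertain) and one decoy PSM (label -1, weight 1).
f_values = [1.5, 0.2, -0.4, -2.0]
labels = [1, 1, 1, -1]
theta = [0.9, 0.5, 0.1, 1.0]

unweighted = weighted_empirical_error(f_values, labels, [1.0] * 4)
weighted = weighted_empirical_error(f_values, labels, theta)
```

The third target sample is misclassified, but because its weight is only 0.1 it contributes little to the weighted error, which is how doubtful +1 labels are kept from distorting the classifier.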
Thus, the linear programming SVM model (1) is transformed into the fuzzy model (2), in which the loss of each sample is weighted by $\theta_i$.
Figure 5 Classification and clustering. "−" represents a decoy PSM, while "+" represents a target PSM. The data sample represented by "☐" lies far away from the decision boundary. However, the possibility of it being a correct PSM is remote, since it lies too far away from the other target PSMs. The data sample represented by "⊕" also has a small possibility of being a correct PSM, since it lies near the decision boundary.
Model (2) can be rewritten in matrix form as Problem (3), where $\theta = [\theta_1, \ldots, \theta_l]^T$, $\Lambda(y) = \mathrm{Diag}(y)$, $0_l \in \mathbb{R}^l$ is a vector of zeros, $1_l \in \mathbb{R}^l$ is a vector with each element equal to 1, $I_l$ is the $l \times l$ identity matrix, and $K = (k(x_i, x_j))_{1 \le i \le l,\, 1 \le j \le l}$. The model can be solved by existing optimization software, such as MOSEK.

Fuzzy silhouette
To handle situations with uncertain labels, we generalize the silhouette concept from the deterministic setting to a fuzzy silhouette index.
For $k = -1, 1$ and $i \in \Omega_k$, the average distance $\beta_i^k$ of sample $x_i$ to the other data samples in $\Omega_k$ is formulated as in Eq. (4), where $\theta_i \in [0, 1]$. The fuzzy silhouette $s_i$ of sample $x_i$ is then defined from these average distances. It measures the degree to which a PSM sample lies far from the decoys and close to the good target samples. Hence, a PSM data sample is more likely to be a correct one if it has a large fuzzy silhouette value.
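For the sets $\Omega_{-1}$, $\Omega_1$ and $\Omega_0$, we define their average fuzzy silhouettes as

$$\bar{s}_k = \frac{1}{|\Omega_k|} \sum_{i \in \Omega_k} s_i, \quad k = -1, 1, 0,$$

where $|\Omega_k|$ is the cardinality of $\Omega_k$. We also define

$$\mathrm{sep} = \frac{\bar{s}_1 - \bar{s}_{-1}}{2} \tag{6}$$

as a metric to indicate the degree of separation between the decoy PSM samples and the good PSMs.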
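Given per-sample fuzzy silhouette values, the separation degree and the associated stopping test can be sketched as follows. The silhouette values and index sets are made up, and the per-sample values are taken as given rather than computed; only the definition sep = (average over good targets − average over decoys)/2 and the threshold 0.25 come from the text.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def separation(silhouettes, good_idx, decoy_idx):
    """sep = (mean silhouette of good targets - mean silhouette of decoys) / 2."""
    s_good = mean([silhouettes[i] for i in good_idx])
    s_decoy = mean([silhouettes[i] for i in decoy_idx])
    return (s_good - s_decoy) / 2.0


SEP_THRESHOLD = 0.25  # threshold quoted for the experiments

# Made-up fuzzy silhouette values for six samples.
silhouettes = [0.6, 0.5, 0.4, -0.1, -0.2, -0.3]
good_idx = [0, 1, 2]    # stands in for the good-target index set
decoy_idx = [3, 4, 5]   # stands in for the decoy index set

sep = separation(silhouettes, good_idx, decoy_idx)
should_stop = sep >= SEP_THRESHOLD
```

Since each average lies in [−1, 1], sep also lies in [−1, 1], and it grows as good targets and decoys move apart.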

Score of the samples
Based on the fuzzy SVM model and the fuzzy silhouette metric, we design a scoring scheme that defines the score of sample $x_i$ by Eq. (7), which combines $\phi(f(x_i))$ and $\psi(s_i)$, where $\phi(\cdot)$ and $\psi(\cdot)$ are functions for scaling the values of $f(x_i)$ and $s_i$, respectively. Here, $\phi(\cdot): \mathbb{R} \to [-1, 1]$ is constructed as an increasing function, and $\psi(\cdot)$ as an increasing function mapping from $[-1, 1]$ to $[-1, 1]$. Particular choices of $\phi$ and $\psi$ are used in our implementation.
At the $k$th iteration, the PSM samples indexed by $\Omega_+$ are ranked according to their scores, and the top $n\%$ of them in $\Omega_1$ are reserved, where $0 < n < 100$ is a constant percentage. $\Omega_1^k$ is then updated by the discriminant function values. Based on the calculated fuzzy silhouettes, $\Omega_1^{k+1/3}$ is updated next, followed by $\Omega_0^k$. Finally, for $i \in \Omega$, new scores $\mathrm{score}(i)^{k+1}$ are computed according to Eq. (7), and the weights $\theta_i^{k+1}$ are recalculated.
The algorithm terminates when the number of identified good PSM samples reaches a given threshold $p$, or when the separation degree $\mathrm{sep}^{k+1}$ defined by Eq. (6) reaches a given threshold. The FC-Ranker algorithm is summarized in Algorithm

FC-Ranker for the large-scale problem
The number of PSMs output by a database search engine is usually extremely large. In this section, some implementation practices are discussed that make the algorithm capable of solving large-scale problems.

Fuzzy SVM classification for the large-scale problem
If the data matrix is sparse, interior-point algorithms are well suited to solving large-scale linear programming problems. Unfortunately, the kernel matrix $K$ in Problem (3) is not sparse in general. In fact, $K$ is usually quite dense, and most of its elements are nonzero. Storing a large dense matrix $K$ is not a trivial task. Take a matrix $K$ with a Gaussian kernel and $l = 400{,}000$ as an example: if each element occupies four bytes, then $K$ has $l^2 = 1.6 \times 10^{11}$ elements and takes up 640 GB of storage in all.
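The storage estimate can be checked with a few lines of Python. The reduced column count used for comparison is a hypothetical value chosen for illustration; it is not taken from the text.

```python
l = 400_000             # number of PSM samples in the example
bytes_per_element = 4   # four bytes per matrix element

elements = l * l
total_bytes = elements * bytes_per_element
total_gb = total_bytes / 1e9   # decimal gigabytes

# Hypothetical reduced column count, for comparison only.
l_sub = 2_000
sub_gb = l * l_sub * bytes_per_element / 1e9

print(elements)   # 160000000000
print(total_gb)   # 640.0
print(sub_gb)     # 3.2
```

Keeping only a small subset of columns scales the storage linearly with the number of retained columns, which is the motivation for working with a sub-matrix of the kernel matrix.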
Interestingly, our experiments indicate that the kernel matrix usually has quite low rank in the peptide identification problem. Hence, a sub-matrix $K'$ consisting of $l'$ columns of $K$ ($l' \ll l$) is selected to substitute for $K$ in Problem (3). These $l'$ columns are selected randomly from the columns of $K$. This operation can be implemented by sampling $l'$ data samples randomly and then calculating the sub-matrix $K'$ according to the kernel function, which reduces the storage requirement greatly. Denote by $\Omega' \subset \Omega$ the index set consisting of the indices of the $l'$ columns. Then the matrix $(K')_{ij} = k(x_i, x_j)$, $i \in \Omega$, $j \in \Omega'$, of size $l \times l'$ can be calculated. Let $y' = (y_j)_{j \in \Omega'}$; then Problem (3) is reformulated with $K'$ and $\Lambda(y')$ in place of $K$ and $\Lambda(y)$, where $\alpha \in \mathbb{R}^{l'}$, $b \in \mathbb{R}^1$, $r \in \mathbb{R}^1$, $r \in \mathbb{R}^l$, and $\Lambda(y') = \mathrm{Diag}(y')$.
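A minimal Python sketch of the random column selection is given below. The scaling of the Gaussian kernel, the function names, the fixed seed and the toy data are our assumptions; only the idea of building an $l \times l'$ sub-matrix from $l'$ randomly sampled data samples comes from the text.

```python
import math
import random


def gaussian_kernel(x, z, sigma=2.0):
    """One common RBF form; the exact scaling used in the paper is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_submatrix(samples, n_columns, seed=0, kernel=gaussian_kernel):
    """Build the l x l' matrix whose columns correspond to randomly chosen samples."""
    rng = random.Random(seed)
    column_idx = sorted(rng.sample(range(len(samples)), n_columns))
    k_sub = [[kernel(x, samples[j]) for j in column_idx] for x in samples]
    return k_sub, column_idx


# Toy data: 50 random two-dimensional samples, 5 retained columns.
data_rng = random.Random(1)
samples = [(data_rng.random(), data_rng.random()) for _ in range(50)]
k_sub, column_idx = kernel_submatrix(samples, n_columns=5)
```

Only the $l \times l'$ block is ever formed, so the full $l \times l$ kernel matrix never has to be stored.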

Fuzzy silhouette for the large-scale problem
To update the fuzzy silhouette value $s_i$ of sample $i$, the major work is to compute $\beta_i^1$ and $\beta_i^{-1}$ in Eq. (4), which requires calculating $l$ distances. In all, each iteration computes $|\Omega| \cdot |\Omega| = l^2$ distances over all samples. Denote a given sampling rate by $r$, with $r \in (0, 1)$. We sample $r|\Omega_1|$ indices of targets from $\Omega_1$ and $r|\Omega_{-1}|$ indices of decoys from $\Omega_{-1}$, denoted by $\tilde{\Omega}_1$ and $\tilde{\Omega}_{-1}$, to substitute for $\Omega_1$ and $\Omega_{-1}$ in Eq. (4), respectively. Then at most $r\,l\,(|\Omega_{-1}| + |\Omega_1|) \le r\,l^2$ distances need to be calculated at each iteration.
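The saving from sampling can be illustrated with a short Python sketch that counts distance evaluations. The set sizes, the sampling rate and the helper name are made up for illustration.

```python
import random


def subsample(indices, rate, rng):
    """Draw roughly rate * len(indices) distinct indices (at least one)."""
    size = max(1, int(rate * len(indices)))
    return rng.sample(list(indices), size)


rng = random.Random(0)                 # fixed seed so the sketch is reproducible
good_targets = list(range(0, 300))     # stands in for the good-target index set
bad_targets = list(range(300, 400))    # stands in for the bad-target index set
decoys = list(range(400, 1000))        # stands in for the decoy index set
l = len(good_targets) + len(bad_targets) + len(decoys)
r = 0.1                                # sampling rate in (0, 1)

good_sample = subsample(good_targets, r, rng)
decoy_sample = subsample(decoys, r, rng)

# Each of the l samples is compared only with the sampled targets and decoys.
full_cost = l * l
sampled_cost = l * (len(good_sample) + len(decoy_sample))
```

With these made-up sizes the number of distance evaluations per iteration drops from 1,000,000 to 90,000, which stays within the bound $r\,l^2 = 100{,}000$.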

Conclusion
A new scoring method has been developed based on the iterative FC-Ranker algorithm, which combines a fuzzy silhouette index with a fuzzy SVM classification model to cope with the large number of incorrect labels among target PSM samples. In the fuzzy classification model, each PSM is assigned a calculated weight that indicates the possibility of the PSM sample being correct. The performance of the FC-Ranker algorithm has been compared with that of PeptideProphet and Percolator on the Yeast, UPS1 and Tal08 datasets, showing that FC-Ranker surpasses PeptideProphet and Percolator in terms of ROC curves and the number of identified target PSM samples at the same FDR level. Moreover, FC-Ranker outputs more target PSMs than PeptideProphet and Percolator do, while the three methods share a large number of PSMs in common.

Abbreviations
PSMs: peptide spectrum matches; SVM: support vector machine