Peptide identification based on fuzzy classification and clustering

Liang, Xijun; Xia, Zhonghang; Niu, Xinnan; Link, Andrew J; Pang, Liping; Wu, Fang-Xiang; Zhang, Hongwei

doi:10.1186/1477-5956-11-S1-S10

Volume 11 Supplement 1

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science

Research
Open access
Published: 07 November 2013

Proteome Science volume 11, Article number: S10 (2013)

Abstract

Background

Sequence database searching has been the dominant method for peptide identification: a large number of peptide spectra generated from LC/MS/MS experiments are searched, using a search engine, against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge.

Results

A novel scoring method named FC-Ranker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. In particular, the scores of PSMs are updated iteratively using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops.

Conclusions

Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.

Background

In protein identification, observed peptide spectra are searched against theoretical fragmentation spectra derived from target databases. Peptide spectrum matches (PSMs) are scored by database search tools, and the high-scoring PSMs are selected as target PSMs. In fact, more than half of the selected PSMs are not correct [1]. Although many filters [2, 3] have been proposed to refine the outputs further, no single filter works well across different datasets.

To tackle this problem, PeptideProphet [4] used unsupervised learning to automatically select PSMs output by database search tools. Based on the assumption that the PSM samples are drawn from a mixture distribution with components for "correct" and "incorrect" PSMs, PeptideProphet applies the expectation maximization (EM) method to calculate the possibility of each PSM being "correct". As PeptideProphet searches only the set of high-scoring PSMs for "correct" ones, some good low-ranked PSMs may be lost. Adaptive PeptideProphet was proposed in [5] to improve the performance of PeptideProphet by iteratively training a discriminant function from a set of top-ranked PSM samples, while [6] attempted to extend PeptideProphet by exploiting decoy PSMs in semi-supervised learning. In [7–9], decoy databases were used to validate the performance of post-database search algorithms. It is proposed in [6] to estimate a more accurate probability by incorporating decoy PSMs into a unified semi-supervised expectation-maximization framework.

Support vector machines (SVMs) have also been studied for the peptide assignment problem in [10, 11]. Percolator [12] employed an SVM to iteratively adjust models so that target PSMs receive higher scores than decoy PSMs. Percolator, as a semi-supervised learning model, did not make full use of the labels and samples of target PSMs. More recently, a fully supervised SVM learning model was proposed in [11] to improve the performance of Percolator by utilizing target PSM data, where the "incorrect" target PSMs are viewed as noise and a special loss function is employed to reduce the negative impact of this noise on the learning model. Although the classification model separates most good target PSMs from noise and decoy PSMs, all selected PSMs are treated in the same way.

In this paper, a new scoring method, FC-Ranker, is developed not only to identify reliable target PSMs, but also to evaluate the confidence of each target PSM. As good target PSMs are close to each other, FC-Ranker integrates sample clustering into the classification procedure to compute the possibility of each target PSM being correct. Compared with the standard SVM model, the proposed fuzzy classification model assigns a weight to each target PSM indicating its likelihood of being correct. The score of each PSM sample is computed by combining the discriminant function value and the fuzzy silhouette value. The algorithm repeatedly updates the values of the discriminant function and the fuzzy silhouette index for each PSM sample, and recomputes the weights of the targets until the stop criterion is met. In the experimental studies, FC-Ranker shows a large overlap of identified target PSMs with PeptideProphet and Percolator, and it identified more target PSMs on all datasets.

The first stage of this work was published in [13]. In this work, we additionally compared the FC-Ranker algorithm with another benchmark method, Percolator, in the experimental studies. As Percolator is built on an SVM-based learning model, it provides a better reference for performance comparison. Furthermore, we added a new dataset, Tal08, whose characteristics differ from those of the Yeast and UPS1 datasets (see Table 1). The performance of the proposed FC-Ranker algorithm was evaluated on all three datasets in terms of the number of target PSMs, overlaps and ROC curves, and compared with that of PeptideProphet and Percolator. The new data analysis and results reinforce the efficiency of the proposed FC-Ranker method.

Table 1 Statistics of datasets

Results and discussion

The FC-Ranker algorithm is compared with PeptideProphet [4] and Percolator [12] to validate its effectiveness. We used a PC with an Intel(R) CPU (1.80 GHz × 2) and 2.0 GB of RAM for all experiments.

Experimental Setup

Dataset

FC-Ranker was examined over three datasets: S. cerevisiae Gcn4 (Yeast), Universal Proteomics Standard (UPS1) and Tal08 [14]. Trypsin digestion of the protein samples generates three types of tryptic peptides: full-digested (both ends of a peptide satisfy the enzyme specificity rule), half-digested (only one end satisfies the rule) and none-digested (neither end satisfies the rule). The database of Yeast protein sequences was obtained from the Saccharomyces Genome Database (SGD) [15] and the Sigma48 protein sequence database from the NCBI gene bank [16]. The attributes of each PSM sample include x-correlation, delta-cn, ions, sprank and calc-neutral-pep-mass.

The UPS1 sample contains 48 purified human proteins, and the SEQUEST search results on it comprise 17,335 PSMs, consisting of 8974 target PSMs and 8361 decoy PSMs. For the Yeast dataset, the results cover 6652 proteins and 14,891 PSMs, consisting of 6702 target PSMs and 8189 decoy PSMs. For the Tal08 dataset, the results contain 9907 target PSMs and 8746 decoy PSMs, 18,653 PSMs in total.

Statistics of the three datasets are listed in Table 1.

Preprocess

In addition to the attributes output by SEQUEST, such as x-correlation, delta-cn, ions, sprank and calc-neutral-pep-mass, another attribute, "digested type", was added to the representation, with scalar values "2", "1" and "0" for the full-digested, half-digested and none-digested types, respectively. The values of each attribute were transformed linearly beforehand so that they have zero mean and unit variance (the normalization process). After normalization, we multiply the values of the x-correlation and delta-cn attributes by a weight of 2.0, since these two attributes are more important in the data representation. As experimental experience shows that the attribute "digested type" also plays an important role, a weight of 2.0 was likewise applied to the values of this attribute after normalization.
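The preprocessing described above can be sketched in a few lines of NumPy. This is only an illustration: the column order in `ATTRIBUTES` and the guard for constant columns are assumptions made for this sketch and are not taken from the original implementation.

```python
import numpy as np

# Column order is an assumption made for this sketch.
ATTRIBUTES = ["x-correlation", "delta-cn", "ions", "sprank",
              "calc-neutral-pep-mass", "digested-type"]

# Weights applied after normalization: 2.0 for x-correlation, delta-cn
# and digested type (as stated in the text), 1.0 for the rest.
WEIGHTS = np.array([2.0, 2.0, 1.0, 1.0, 1.0, 2.0])


def preprocess(features):
    """Scale every column to zero mean and unit variance, then reweight."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0.0] = 1.0  # assumption: leave constant columns unscaled
    return (features - mean) / std * WEIGHTS
```

After this step each column has zero mean, and its standard deviation equals the weight assigned to it.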

Parameter setting

In all of the experiments, the parameter c is set to 1.0 in the proposed fuzzy linear programming SVM model where the Gaussian (RBF) kernel

k (x_{1}, x_{2}) = \exp \left(- \frac{\| x_{1} - x_{2} \|^{2}}{2 σ^{2}}\right),

was chosen, with parameter σ = 2.0.
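As a small illustration, the Gaussian kernel above can be evaluated for two sets of samples as follows. The function name `gaussian_kernel_matrix` is hypothetical; only the formula and the value σ = 2.0 come from the text.

```python
import numpy as np


def gaussian_kernel_matrix(a, b, sigma=2.0):
    """Return the matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-sq / (2.0 * sigma ** 2))
```

Two samples at Euclidean distance 2 then have kernel value exp(-4 / 8) = exp(-0.5) for σ = 2.0.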

In the iterations of the FC-Ranker algorithm, we set n = 70 in Eq. (10), and $\hat{p} = 0.03 | Ω_{+} |$ and $\hat{sep} = 0.25$ in Eq. (15). The strategy for solving large-scale problems was employed as described in the subsection "FC-Ranker for the large-scale problem", where the parameter ρ was chosen as 0.2.

Validation of sep throughout iterations

Figure 1 depicts the variation of the values of sep over the iterations of the FC-Ranker algorithm on the Yeast and UPS1 datasets. On both datasets, the value of ${\bar{s}}_{1}$ is almost equal to ${\bar{s}}_{- 1}$ initially; ${\bar{s}}_{1}$ then increases as the iterations proceed, while ${\bar{s}}_{- 1}$ decreases throughout the procedure. Hence, an increasing curve of sep, which is defined as $({\bar{s}}_{1} - {\bar{s}}_{- 1}) / 2$, is observed in the figure. At iteration 4 in Figure 1A (Yeast dataset), the value of sep exceeds the given threshold 0.25, meeting the termination criterion of the algorithm. The increasing values of sep illustrate that the identified good target PSMs indexed by Ω₁ become closer to each other and more separated from the decoy PSMs as the iterations proceed, showing the effectiveness of the fuzzy silhouette index.

Comparison of target PSMs

We compared the target PSMs output by PeptideProphet, Percolator and FC-Ranker at FDR level 0.05 in Table 2. On the Yeast dataset, FC-Ranker identified 1475 target PSMs, while PeptideProphet output 1443 target PSMs and Percolator output 1393 target PSMs. FC-Ranker thus found 32 more target PSMs than PeptideProphet and 82 more than Percolator. On the UPS1 dataset, FC-Ranker found 681 target PSMs, which is 243 PSMs (55.5%) more than Percolator and 115 PSMs (20.3%) more than PeptideProphet. On the Tal08 dataset, FC-Ranker output 1092 target PSMs, which is 135 PSMs (14.1%) more than PeptideProphet and 139 PSMs (14.6%) more than Percolator. Similar results for the PSMs output by the three methods on particular digested types are also shown in Table 2.

Table 2 Target PSMs output by PeptideProphet, Percolator and FC-Ranker

We analyzed the target PSMs output by the three methods; their overlaps are summarized in Figure 2. There are large overlaps among the output PSMs of the three approaches on all three datasets (Yeast, UPS1 and Tal08). Specifically, FC-Ranker, PeptideProphet and Percolator identified 1248 common target PSMs on the Yeast dataset (Figure 2A), which covers 86.5% of the target PSMs output by PeptideProphet, 89.6% of the output of Percolator and 84.6% of the output of FC-Ranker. In addition, FC-Ranker identified 129 PSMs (8.9%) selected by PeptideProphet but not covered by Percolator, and found 14 PSMs (1.0%) selected by Percolator but not covered by PeptideProphet.
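The coverage percentages quoted for the Yeast dataset follow directly from the counts given in the text. The short check below reproduces them; the helper name `coverage` is hypothetical and the counts are the ones reported above.

```python
def coverage(common, total):
    """Percentage of `total` covered by `common`, rounded to one decimal."""
    return round(100.0 * common / total, 1)


# Target PSM counts on the Yeast dataset, as reported in the text.
yeast_totals = {"PeptideProphet": 1443, "Percolator": 1393, "FC-Ranker": 1475}
yeast_common = 1248  # PSMs identified by all three methods

yeast_coverage = {name: coverage(yeast_common, total)
                  for name, total in yeast_totals.items()}
# yeast_coverage == {"PeptideProphet": 86.5, "Percolator": 89.6, "FC-Ranker": 84.6}
```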

On the UPS1 dataset (Figure 2B), the three algorithms have 383 target PSMs in common. The overlap covers 67.7% of the target PSMs output by PeptideProphet, 87.4% by Percolator and 56.2% by FC-Ranker. PeptideProphet and FC-Ranker share 520 target PSMs, covering 91.9% of the target PSMs output by PeptideProphet and 76.4% by FC-Ranker; Percolator and FC-Ranker share 406 target PSMs, covering 92.7% of the target PSMs output by Percolator and 59.6% by FC-Ranker. In addition, FC-Ranker identified 137 PSMs (24.2%) selected by PeptideProphet but not covered by Percolator, and found 23 PSMs (5.3%) selected by Percolator but not covered by PeptideProphet.

On the Tal08 dataset (Figure 2C), the three algorithms have 829 PSMs in common. The overlap covers 86.6% of the target PSMs output by PeptideProphet, 87.0% by Percolator and 75.9% by FC-Ranker. PeptideProphet and FC-Ranker share 862 target PSMs, covering 90.1% of the target PSMs output by PeptideProphet and 78.9% by FC-Ranker; Percolator and FC-Ranker share 847 target PSMs, covering 88.9% of the target PSMs output by Percolator and 77.6% by FC-Ranker. In addition, FC-Ranker identified 33 PSMs (3.4%) selected by PeptideProphet but not covered by Percolator, and found 18 PSMs (1.9%) selected by Percolator but not covered by PeptideProphet.

ROC curve

Figure 3 shows ROC curves of the three methods on the Yeast, UPS1 and Tal08 datasets. On the Yeast dataset (Figure 3A), FC-Ranker has the same TPR as PeptideProphet when the FPR is near zero, and reaches higher TPRs than PeptideProphet and Percolator at other FPR levels. On both the UPS1 dataset (Figure 3B) and the Tal08 dataset (Figure 3C), FC-Ranker reaches higher TPRs than the other two methods at all FPR levels. On the Tal08 dataset in particular, FC-Ranker reaches markedly higher TPRs even at comparatively high FPR levels.

Figure 4 depicts the relation between the number of TPs and the FDR, where we observed patterns similar to those of the corresponding ROC curves.

Methods

Classification and clustering methods for peptide identification

Fuzzy clustering

Clustering analysis is an unsupervised learning method to group similar data samples together. Silhouette index was introduced in [17, 18] to measure how well a sample belongs to a cluster.

Suppose that there are l data samples {x ₁, . . ., x _l}, which are grouped into K clusters, denoted as C = {C ₁, . . ., C _K}. Denote by d(x _i , x _j) the distance between two samples x _i and x _j, and by $C_{k} = \{x_{1}^{k}, \dots, x_{m_{k}}^{k}\}$ the samples of the k th cluster, where m _k = |C _k| and k = 1, . . ., K. The average distance, denoted by $a_{i}^{k}$, between the i th data sample in cluster C _k and the other samples in the same cluster is formulated as

a_{i}^{k} = \frac{1}{m_{k} - 1} \sum_{j = 1, \dots, m_{k}, j \neq i} d (x_{i}^{k}, x_{j}^{k}), i = 1, \dots, m_{k},

and the minimum average distance between the i th data sample in cluster C _k and all data samples in each other cluster C _v , v = 1, . . ., K, v ≠ k is defined as

b_{i}^{k} = \min_{v = 1, \dots, K, v \neq k} \left\{\frac{1}{m_{v}} \sum_{j = 1}^{m_{v}} d (x_{i}^{k}, x_{j}^{v})\right\}, i = 1, \dots, m_{k} .

Then, we define the silhouette value of the i th data sample in C _k as follows

s_{i}^{k} = \frac{b_{i}^{k} - a_{i}^{k}}{\max \{a_{i}^{k}, b_{i}^{k}\}} .

Clearly, the silhouette values lie in the interval [− 1, 1]. The silhouette value of the cluster C _k is defined as

s_{k} = \frac{1}{m_{k}} \sum_{i = 1}^{m_{k}} s_{i}^{k}, k = 1, \dots, K .
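The definitions above translate into a short NumPy routine. This is a minimal sketch: it assumes Euclidean distance for d, at least two clusters, and a silhouette value of 0 for a cluster with a single member, none of which is specified in the text.

```python
import numpy as np


def silhouette_values(x, labels):
    """Silhouette value s_i of every sample, following the formulas above."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances (assumed choice of d).
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(x))
    for i in range(len(x)):
        own = labels == labels[i]
        m_k = own.sum()
        if m_k == 1:
            continue  # assumption: singleton clusters get silhouette 0
        # d(x_i, x_i) = 0, so summing over the whole cluster equals j != i.
        a = dist[i, own].sum() / (m_k - 1)
        b = min(dist[i, labels == c].mean()
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


def cluster_silhouette(s, labels, k):
    """Average silhouette value of cluster k."""
    return s[np.asarray(labels) == k].mean()
```

For four points on a line at 0, 1, 10 and 11 with clusters {0, 1} and {10, 11}, the first point has a = 1 and b = 10.5, giving a silhouette value of 9.5 / 10.5.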

Classification

Our task is to identify the correct PSMs from a set of PSMs generated by database searching tools in peptide identification. Decoy PSMs are usually employed to validate target PSMs, so the PSM samples can be categorized into a "good" class, with labels "+1", and a "bad" class, with labels "− 1". In the classification setting, we use a vector of attributes such as x-correlation, delta-cn, ions, sprank, calc-neutral-pep-mass, etc., to represent a PSM data sample. Let {x _i} ⊆ R^q , i = 1, . . ., l be the PSM data samples, with q the number of attributes. We aim to find a discriminant function f : R^q → R that classifies the PSM data samples according to their labels.

One of the greatest challenges in peptide identification is the lack of data samples with deterministic +1 labels. In a standard classification setting, the discriminant function is obtained by training the model on two balanced types of data samples with deterministic labels. In the peptide identification problem, however, a great number of the PSMs generated by database searching engines are incorrect, and the data samples with +1 labels are quite unreliable. Thus, the large number of data samples with incorrect +1 labels would severely distort the trained discriminant function if they were used directly in standard classification models.

Here, we consider the kernel-based SVM classifier as follows:

f (x) = \sum_{j = 1}^{l} α_{j} k (x_{j}, x) + b

where b ∈ R and k(·,·) is a chosen kernel function. The label of a data sample x is predicted as +1 if f (x) > 0; otherwise it is predicted as −1. A quadratic program is usually solved to obtain the coefficients α and b, which incurs a huge computational overhead, especially for large-scale problems. To overcome this problem, a class of linear programming SVMs was introduced in [19].

For the l data samples {(x _i , y _i)}, i = 1, . . ., l, with x _i ∈ R^q , y _i ∈ {1, −1}, the linear programming SVM model is formulated as

\begin{matrix} \min_{α, r, ξ, b} & - r + c \sum_{i = 1}^{l} ξ_{i} \\ \text{s.t.} & y_{i} f (x_{i}) = y_{i} (\sum_{j = 1}^{l} α_{j} y_{j} k (x_{j}, x_{i}) + b) \geq r - ξ_{i}, \\ & - 1 \leq α_{i} \leq 1, \; ξ_{i} \geq 0, \; i = 1, \dots, l \end{matrix}

(1)

where c > 0 is a given constant, and the discriminant function $f (\cdot) = \sum_{j = 1}^{l} α_{j} y_{j} k (x_{j}, \cdot) + b$ .

The basic FC-Ranker algorithm

In this section, the FC-Ranker algorithm is presented to calculate the score of each PSM data sample. The score values reflect the possibility of the PSM data samples being correct, and the PSMs with high scores are finally selected for the user.

Denote by Ω = {1, . . ., l} the set of indices of l PSM data samples, by Ω₊ the set of indices of target PSMs, by

Ω_{- 1} = {i \in Ω | y_{i} = - 1},

the set of indices of decoy PSMs, by Ω₁ the set of indices of good target PSMs, and by Ω₀ = Ω₊ \ Ω₁ the set of indices of bad target PSMs. The FC-Ranker algorithm aims to select the set Ω₁ from Ω₊ utilizing the data samples indexed by Ω₋. To separate good target PSMs from the others, a discriminant function f is constructed such that the function value f (x _i) is positive if sample x _i belongs to Ω₁, and negative otherwise. A large discriminant function value of a target PSM sample x _i indicates that the sample lies far from the decision boundary, and hence has a high possibility of being a good PSM. However, a large discriminant function value f (x _i) alone is not sufficient to ensure that the PSM sample x _i is good. Take the sample represented by "□" in Figure 5 as an example: it has a large distance from the decision boundary and thus a large function value f (□). This sample, however, tends to be a bad PSM since it lies too far from the other PSM data samples indexed by Ω₊.

On the other hand, a data sample may not be a good target PSM either if it lies comparatively close to other target PSMs but has a small discriminant function value. The data sample represented by "⊕" in Figure 5 should also be excluded from the set Ω₁. These observations suggest that a good target PSM data sample should satisfy two rules: 1) it has a large discriminant function value; 2) it is close to other target PSMs.

Fuzzy SVM classification

A weight θ _i ∈ [0, 1] is introduced for each target sample x _i indexed by Ω₊ to indicate its possibility of being correct, since its label is not trustworthy. A large weight usually indicates that the PSM is more likely to be correct. Since the decoy PSMs are known to be incorrect, we always set the weights θ _i to 1 for i ∈ Ω₋. Denote by loss(f (x _i), y _i) the empirical error of sample x _i; the total empirical error can then be formulated as $\sum_{i \in Ω} loss (f (x_{i}), y_{i})$ in traditional classification problems with deterministic labels. Assigning a weight to each data sample, we reformulate the total empirical error as $\sum_{i \in Ω} θ_{i} loss (f (x_{i}), y_{i})$ .

Thus, the linear programming SVM model (1) is transformed as follows

\begin{matrix} \min_{α, r, ξ, b} & - r + c \sum_{i \in Ω} θ_{i} ξ_{i} \\ \text{s.t.} & y_{i} (\sum_{j = 1}^{l} α_{j} y_{j} k (x_{j}, x_{i}) + b) \geq r - ξ_{i}, \; i \in Ω, \\ & - 1 \leq α_{i} \leq 1, \; ξ_{i} \geq 0, \; i \in Ω, \end{matrix}

(2)

where α ∈ R^l , b ∈ R¹, r ∈ R¹ and ξ = [ξ ₁ , . . ., ξ _l] ∈ R^l. Model (2) is referred to as the fuzzy linear programming SVM model.

The model (2) can be rewritten as

\begin{matrix} \min_{α, r, ξ, b} & ⟨[0_{l}^{T} \;\; 0 \;\; c θ^{T} \;\; - 1], [α^{T} \;\; b \;\; ξ^{T} \;\; r]⟩ \\ \text{s.t.} & [Λ (y) K Λ (y) \;\; y \;\; I_{l} \;\; - 1_{l}] \begin{bmatrix} α \\ b \\ ξ \\ r \end{bmatrix} \geq 0, \\ & r \geq 0, \\ & - 1 \leq α_{i} \leq 1, \; ξ_{i} \geq 0, \; i \in Ω, \end{matrix}

(3)

where θ = [θ ₁, . . ., θ _l]^T , Λ(y) = Diag(y), 0_l ∈ R^l is a vector of zeros, 1_l ∈ R^l is a vector with each element equal to 1, I _l is the l × l identity matrix, and K = (k(x _i , x _j))_{1≤i≤l,1≤j≤l}. The model can be solved by existing optimization software, such as Mosek.
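To make the structure of Problem (3) concrete, the sketch below assembles the cost vector and constraint matrix for the stacked variable [α, b, ξ, r] and solves the linear program with SciPy's `linprog` (HiGHS backend). The paper mentions Mosek; the use of SciPy, the function names and the toy data in the usage note are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import linprog


def gaussian_kernel(a, b, sigma=2.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))


def fit_fuzzy_lp_svm(x, y, theta, c=1.0, sigma=2.0):
    """Solve Problem (3) for z = [alpha (l), b, xi (l), r]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    l = len(y)
    K = gaussian_kernel(x, x, sigma)
    # Objective <[0_l, 0, c*theta, -1], [alpha, b, xi, r]>
    cost = np.concatenate([np.zeros(l), [0.0], c * theta, [-1.0]])
    # Constraint [L(y) K L(y), y, I_l, -1_l] z >= 0, passed as -G z <= 0
    G = np.hstack([(y[:, None] * K) * y[None, :],
                   y[:, None], np.eye(l), -np.ones((l, 1))])
    bounds = ([(-1.0, 1.0)] * l      # -1 <= alpha_i <= 1
              + [(None, None)]       # b is free
              + [(0.0, None)] * l    # xi_i >= 0
              + [(0.0, None)])       # r >= 0
    res = linprog(cost, A_ub=-G, b_ub=np.zeros(l), bounds=bounds,
                  method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    return res.x[:l], res.x[l]


def decision_function(x_train, y_train, alpha, b, x_new, sigma=2.0):
    """f(x) = sum_j alpha_j y_j k(x_j, x) + b."""
    K = gaussian_kernel(np.asarray(x_new, dtype=float),
                        np.asarray(x_train, dtype=float), sigma)
    return K @ (alpha * np.asarray(y_train, dtype=float)) + b
```

Setting all weights θ_i to 1 recovers Model (1). On two well-separated toy clusters with labels +1 and −1, the sign of `decision_function` on the training points matches the labels.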

Fuzzy silhouette

To handle situations with uncertain labels, we generalize the silhouette concept from the deterministic setting to a fuzzy silhouette index.

For k = − 1, 1, i ∈ Ω_k, the average distance of sample x _i to the other data samples in Ω_k is formulated as

β_{i}^{k} = \frac{\sum_{j \in Ω_{k}, j \neq i} θ_{j} d (x_{i}, x_{j})}{\sum_{j \in Ω_{k}, j \neq i} θ_{j}}

(4)

where θ _i ∈ [0, 1]. Then, we define the fuzzy silhouette of sample x _i as

s_{i} = \frac{β_{i}^{- 1} - β_{i}^{1}}{\max \{β_{i}^{- 1}, β_{i}^{1}\}}, i \in Ω .

(5)

It measures the degree to which a PSM sample lies far from the decoys and close to the good target samples. Hence, a PSM data sample is more likely to be a correct one if it has a large fuzzy silhouette value.

For the sets Ω₋₁, Ω₁ and Ω₀ we define their average fuzzy silhouettes as

{\bar{s}}_{k} = \frac{\sum_{i \in Ω_{k}}^{} s_{i}}{|Ω_{k}|}

where | Ω_k | is the cardinality of Ω_k , k = − 1, 1, 0. We also define

sep = ({\bar{s}}_{1} - {\bar{s}}_{- 1}) / 2

(6)

as a metric to indicate the separation degree of decoy PSM samples and good PSMs.
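Equations (4) to (6) can be sketched with NumPy as below. The Euclidean distance for d and the function names are assumptions made for this illustration; the sketch also assumes that each of the two index sets keeps a positive total weight after any single sample is left out.

```python
import numpy as np


def fuzzy_silhouette(x, theta, good_idx, decoy_idx):
    """Fuzzy silhouette s_i of every sample, Eq. (4) and Eq. (5)."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

    def beta(members):
        w = np.zeros(len(x))
        w[members] = theta[members]
        num = dist @ w        # d(x_i, x_i) = 0, so j = i adds nothing
        den = w.sum() - w     # drop theta_i when i itself is in the set
        return num / den

    beta_good = beta(good_idx)    # beta_i^{1}
    beta_decoy = beta(decoy_idx)  # beta_i^{-1}
    return (beta_decoy - beta_good) / np.maximum(beta_decoy, beta_good)


def separation(s, good_idx, decoy_idx):
    """sep = (mean silhouette of good targets - mean of decoys) / 2, Eq. (6)."""
    return (s[good_idx].mean() - s[decoy_idx].mean()) / 2.0
```

With unit weights, good targets at 0 and 1 and decoys at 10 and 11 on a line, the two groups receive silhouettes near +0.9 and −0.9, so sep is close to 0.9 and well above the threshold 0.25 used in the experiments.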

Score of the samples

Based on the fuzzy SVM model and fuzzy silhouette metric we design a scoring scheme, which defines the score of sample x _i as

\mathrm{score} (i) = (1 - sep) \cdot φ (f (x_{i})) + sep \cdot ψ (s_{i}),

(7)

where φ(·) and ψ(·) are functions for scaling the values of f (x _i) and s _i, respectively. Here, function φ(·) : R → [− 1, 1] is constructed as an increasing function, and ψ(·) as an increasing function mapping from [− 1, 1] to [− 1, 1]. Particularly, we choose function φ(f (x _i)) and ψ(s _i) as

φ (f (x_{i})) = \frac{2}{π} \, \mathrm{sign} (f (x_{i}) - f_{0}) \, \mathrm{atan} \left({\left(\frac{| f (x_{i}) - f_{0} |}{f_{\max}}\right)}^{1 / 4}\right),

(8)

ψ (s_{i}) = (s_{i} - s_{0}) / s_{max},

(9)

where f _max and s _max are the largest values of {|f (x _i) − f ₀ |} and {|s _i − s ₀ |} for i ∈ Ω₊, respectively, and f ₀ is the threshold of the values of discriminant function, s ₀ the threshold of fuzzy silhouette. The power of $\frac{1}{4}$ on |f (x _i) − f ₀ | is introduced to smooth the weight contributions.
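The scoring scheme of Eq. (7) to (9) can be sketched as follows. One reading is assumed here and should be checked against the original formula: the argument of atan in Eq. (8) is taken to be |f(x_i) − f_0| divided by f_max, raised to the power 1/4, by analogy with the normalization by s_max in Eq. (9). Function names are hypothetical.

```python
import math


def phi(f_value, f0, f_max):
    """Scaled discriminant value, Eq. (8), assuming |f - f0| / f_max inside atan."""
    diff = f_value - f0
    sign = (diff > 0) - (diff < 0)
    return (2.0 / math.pi) * sign * math.atan((abs(diff) / f_max) ** 0.25)


def psi(s_value, s0, s_max):
    """Scaled fuzzy silhouette, Eq. (9)."""
    return (s_value - s0) / s_max


def score(f_value, s_value, sep, f0, f_max, s0, s_max):
    """Convex combination of the two scaled values, Eq. (7)."""
    return ((1.0 - sep) * phi(f_value, f0, f_max)
            + sep * psi(s_value, s0, s_max))
```

Under this reading, a sample with f(x_i) − f_0 = f_max gets φ = (2/π)·atan(1) = 0.5, and the weight on the silhouette term grows as sep increases over the iterations.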

The FC-Ranker algorithm

The FC-Ranker algorithm iteratively adjusts the index set Ω₁ of good PSMs by calculating the scores and weights of the data samples until a stop criterion is met. Initially, the algorithm sets $Ω_{1}^{0} = Ω_{+}$ and $Ω_{0}^{0} = ∅$ , i.e. all target PSM samples are viewed as good ones at iteration 0. At iteration k, the algorithm solves the fuzzy linear programming SVM model (3), calculates the fuzzy silhouette values of the samples according to Eq. (5) and updates the index sets Ω₁ and Ω₀ such that the indices of target PSMs in Ω₁ with small scores are moved to Ω₀, while the indices of target PSMs in Ω₀ with large scores are moved to Ω₁.

At the k th iteration, PSM samples indexed by Ω₊ are ranked according to their scores, and the top n% of them in Ω₁ are retained. Then $Ω_{1}^{k}$ is updated by the discriminant function values as

Ω_{1}^{k + 1 / 3} = \{i \in Ω_{1}^{k} \mid f (x_{i}) \text{ is ranked in the top } n \% \text{ of } \{f (x_{j})\}_{j \in Ω_{1}^{k}}\},

(10)

where 0 < n < 100 is a constant percentage. Based on the calculated fuzzy silhouettes, $Ω_{1}^{k + 1 / 3}$ is then updated by

Ω_{1}^{k + 2 / 3} = \{i \in Ω_{1}^{k + 1 / 3} \mid s_{i} \text{ is ranked in the top } n \% \text{ of } \{s_{j}\}_{j \in Ω_{1}^{k + 1 / 3}}\}

(11)

and $Ω_{0}^{k}$ is updated by

Ω_{0}^{k + 1 / 2} = Ω_{+} \setminus Ω_{1}^{k + 2 / 3} .

(12)

Finally, for i ∈ Ω, new scores $\mathrm{score}(i)^{k + 1}$ are computed according to Eq. (7) and the weights $θ_{i}^{k + 1}$ are calculated by the following equation

θ_{i}^{k + 1} = \begin{cases} \max \{\mathrm{score}(i)^{k + 1}, 0\}, & i \in Ω_{+}; \\ 1, & i \in Ω_{-}. \end{cases}

(13)

Then indices of the samples indexed by $Ω_{0}^{k + 1 / 2}$ are moved to $Ω_{1}^{k + 2 / 3}$ if the samples have large score values, i.e.,

\begin{gathered} Ω_{1}^{k + 1} = Ω_{1}^{k + 2 / 3} \cup \{i \in Ω_{0}^{k + 1 / 2} \mid f (x_{i}) \geq {\bar{f}}_{1}^{k + 2 / 3}\}, \\ Ω_{0}^{k + 1} = Ω_{+} \setminus Ω_{1}^{k + 1}, \end{gathered}

(14)

where ${\bar{f}}_{1}^{k + 2 / 3}$ is the average of $\{f (x_{i}) \mid i \in Ω_{1}^{k + 2 / 3}\}$ .

The algorithm terminates when the number of identified good PSM samples falls to a given threshold $\hat{p}$ or below, or when the separation degree $sep^{k + 1}$ defined by Eq. (6) reaches a threshold $\hat{sep}$ , i.e.,

|Ω_{1}^{k + 1}| \leq \hat{p}, \quad \text{or} \quad sep^{k + 1} \geq \hat{sep} .

(15)

The FC-Ranker algorithm is summarized in Algorithm 1.

Algorithm 1 The FC-Ranker Algorithm

Input: {x _i , y _i}, i ∈ Ω;

Output: Scores of samples indexed by Ω;

1: Initialization: k = − 1, $Ω_{1}^{0} = Ω_{+}$ , $Ω_{0}^{0} = ∅$ , $θ_{i}^{0} = 1$ , i ∈ Ω.

2: while Stop criterion (15) is not satisfied do

3: k := k + 1.

4: SVM classification.

5: Solve fuzzy SVM classification model Eq. (3);

6: Calculate $Ω_{1}^{k + 1 / 3}$ via Eq. (10).

7: Clustering analysis.

8: Calculate fuzzy silhouettes s _i , i ∈ Ω via (5);

9: Calculate $Ω_{1}^{k + 2 / 3}$ , $Ω_{0}^{k + 1 / 2}$ via Eq. (11), (12).

10: Update weights.

11: Calculate $\mathrm{score}(i)^{k + 1}$ , $θ^{k + 1}$ via Eq. (7), (13);

12: Calculate $Ω_{1}^{k + 1}$ , $Ω_{0}^{k + 1}$ , $sep^{k + 1}$ via Eq. (14), (6).

13: end while
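The index-set bookkeeping in steps 6, 9 and 12 of Algorithm 1, together with the stop criterion, can be sketched in plain Python as follows. The sketch takes Ω₀ after the silhouette step to be Ω₊ minus the retained good set, as used in Eq. (14). Rounding the "top n%" up to a whole number of samples and breaking ties by index order are assumptions of this sketch, as are the function names; the discriminant values `f` and silhouettes `s` are taken as given.

```python
import math


def top_fraction(indices, values, n_percent):
    """Indices whose value ranks in the top n percent (count rounded up)."""
    keep = math.ceil(len(indices) * n_percent / 100.0)
    ranked = sorted(indices, key=lambda i: values[i], reverse=True)
    return set(ranked[:keep])


def update_good_set(good, targets, f, s, n_percent=70):
    """One pass of Eq. (10), (11), (12) and (14); returns (good, bad) sets."""
    step1 = top_fraction(sorted(good), f, n_percent)    # Eq. (10)
    step2 = top_fraction(sorted(step1), s, n_percent)   # Eq. (11)
    bad = set(targets) - step2                          # Eq. (12)
    f_bar = sum(f[i] for i in step2) / len(step2)
    promoted = {i for i in bad if f[i] >= f_bar}        # Eq. (14)
    new_good = step2 | promoted
    return new_good, set(targets) - new_good


def should_stop(good, sep, p_hat, sep_hat=0.25):
    """Stop criterion of Eq. (15)."""
    return len(good) <= p_hat or sep >= sep_hat
```

Note that, as Eq. (14) is written, a sample removed in the silhouette step can re-enter the good set in the same iteration if its discriminant value is at least the average over the retained set.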

FC-Ranker for the large-scale problem

The number of PSMs output by a database search engine is usually extremely large. In this section, some implementation practices are discussed that make the algorithm capable of solving large-scale problems.

Fuzzy SVM classification for the large-scale problem

If the data matrix is sparse, interior-point algorithms are well suited to solving large-scale linear programming problems. The kernel matrix K in Problem (3) is, unfortunately, not sparse in general. In fact, the kernel matrix K is usually quite dense and most of its elements are nonzero. Storing a large dense matrix K is not a trivial task. Take a matrix K with the Gaussian kernel and l = 400,000 as an example: if each element occupies four bytes, then the matrix K has l² = 1.6 × 10¹¹ elements and takes up 640 GB of storage in all.

Interestingly, our experimental experience indicates that the kernel matrix usually has quite low rank in the peptide identification problem. Hence, a sub-matrix K' consisting of l' columns of K (l' << l) is selected to substitute for K in Problem (3). These l' columns are selected randomly from the columns of matrix K. This operation can be implemented by sampling l' data samples randomly and then calculating the sub-matrix K' according to the kernel function, which reduces the storage greatly. Denote by Ω' ⊂ Ω the index set of the l' selected columns. Then the matrix (K')_ij = k(x _i , x _j ), i ∈ Ω, j ∈ Ω', of size l × l', can be calculated. Let y' = (y _j)_{j ∈ Ω'}; then Problem (3) is reduced to

\begin{matrix} \min_{α, r, ξ, b} & ⟨[0_{l^{'}}^{T} \;\; 0 \;\; c θ^{T} \;\; - 1], [α^{T} \;\; b \;\; ξ^{T} \;\; r]⟩ \\ \text{s.t.} & [Λ (y) K^{'} Λ (y^{'}) \;\; y \;\; I_{l} \;\; - 1_{l}] \begin{bmatrix} α \\ b \\ ξ \\ r \end{bmatrix} \geq 0, \\ & r \geq 0, \; ξ_{i} \geq 0, \; i \in Ω, \\ & - 1 \leq α_{j} \leq 1, \; j \in Ω^{'} . \end{matrix}

(16)

where $α \in R^{l^{'}}$ , b ∈ R¹, r ∈ R¹, ξ ∈ R^l, and Λ(y′) = Diag(y′).
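The column sub-sampling described above can be sketched as follows, together with the storage arithmetic from the text. The function names and the fixed random seed are assumptions made for this illustration.

```python
import numpy as np


def sampled_kernel(x, n_columns, sigma=2.0, seed=0):
    """Return K' (l x l') for l' randomly chosen samples, plus their indices."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    cols = np.sort(rng.choice(len(x), size=n_columns, replace=False))
    # Only the l x l' block is ever formed; the full l x l matrix is not.
    sq = ((x[:, None, :] - x[cols][None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2)), cols


def kernel_bytes(n_rows, n_cols, bytes_per_element=4):
    """Storage needed for a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_element
```

For l = 400,000 the full matrix needs `kernel_bytes(400_000, 400_000)` = 6.4 × 10¹¹ bytes (640 GB), whereas a hypothetical choice of l' = 1,000 columns would need 1.6 × 10⁹ bytes.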

Fuzzy silhouette for the large-scale problem

To update the fuzzy silhouette value s _i of sample i, the major work is to compute $β_{i}^{1}$ and $β_{i}^{- 1}$ in Eq. (4), which requires up to l distances. In all, each iteration computes | Ω| * | Ω| = l² distances over all samples. Denote a given sample rate by ρ with ρ ∈ (0, 1). We sample ρ * | Ω₁ | indices of targets from Ω₁ and ρ * | Ω_{− 1} | indices of decoys from Ω_{− 1}, denoted by $Ω_{1}^{'}$ and $Ω_{- 1}^{'}$ , to substitute for Ω₁ and Ω₋₁ in Eq. (4), respectively. Then at most ρl(| Ω_{− 1} | + | Ω₁ |) ≤ ρl² distances need to be calculated at each iteration.
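The sampling step and the resulting distance count can be illustrated with a short sketch. The helper names, the rounding of ρ·|Ω| to the nearest integer and the example set sizes are assumptions made here, not values from the paper (apart from ρ = 0.2).

```python
import random


def sample_indices(indices, rho, seed=0):
    """Draw round(rho * |indices|) distinct indices at random."""
    pool = sorted(indices)
    size = max(1, int(round(rho * len(pool))))
    return random.Random(seed).sample(pool, size)


def distance_count(n_samples, n_good, n_decoy, rho):
    """Distances per iteration when each sample is compared with the subsets."""
    return n_samples * (int(round(rho * n_good)) + int(round(rho * n_decoy)))
```

With hypothetical sizes l = 1000, |Ω₁| = 300 and |Ω₋₁| = 500, a rate of ρ = 0.2 gives 1000 × (60 + 100) = 160,000 distances per iteration, below the bound ρl² = 200,000 and far below l² = 1,000,000.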

Conclusion

A new scoring method, FC-Ranker, has been developed. Its iterations combine a fuzzy silhouette index with a fuzzy SVM classification model to cope with the large number of incorrect labels among target PSM samples. In the fuzzy classification model, each PSM is assigned a calculated weight that indicates the possibility of the PSM sample being correct. The performance of the FC-Ranker algorithm has been compared with that of PeptideProphet and Percolator on the Yeast, UPS1 and Tal08 datasets, showing that FC-Ranker surpassed PeptideProphet and Percolator in terms of ROC and the number of identified target PSM samples at the same FDR level. Moreover, FC-Ranker outputs more target PSMs than PeptideProphet and Percolator do, while the three methods share a large number of PSMs.

Abbreviations

PSMs: peptide spectrum matches
SVM: support vector machine

References

Elias J, Gygi S: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods 2007,4(3):207–214. 10.1038/nmeth1019
Article CAS PubMed Google Scholar
Perkins D, Pappin D, Creasy D, Cottrell J: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999,20(18):3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
Article CAS PubMed Google Scholar
Ramakrishnan S, Mao R, Nakorchevskiy A, Prince J, Willard W, Xu W, Marcotte E, Miranker D: A fast coarse filtering method for peptide identification by mass spectrometry. Bioinformatics 2006,22(12):1524–1531. 10.1093/bioinformatics/btl118
Article CAS PubMed Google Scholar
Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry 2002,74(20):5383–5392. 10.1021/ac025747h
Article CAS PubMed Google Scholar
Ding Y, Choi H, Nesvizhskii A: Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. Journal of proteome research 2008,7(11):4878–4889. 10.1021/pr800484x
Article CAS PubMed Central PubMed Google Scholar
Choi H, Nesvizhskii A: Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. Journal of proteome research 2007, 7: 254–265.
Article PubMed Google Scholar
Richard E, Knierman M, Freeman A, Gelbert L, Patil S, Hale J: Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. Journal of proteome research 2007,6(5):1758–1767. 10.1021/pr0605320
Article Google Scholar
Olsen J, Mann M: Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proceedings of the National Academy of Sciences of the United States of America 2004,101(37):13417–22. 10.1073/pnas.0405549101
Article CAS PubMed Central PubMed Google Scholar
Bianco L, Mead J, Bessant C: Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG 2006 standard MS/MS data sets. Journal of proteome research 2009,8(4):1782–1791. 10.1021/pr800792z
Article CAS Google Scholar
Anderson D, Li W, Payan D, Noble W: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. Journal of proteome research 2003,2(2):137–146. 10.1021/pr0255654
Spivak M, Weston J, Bottou L, Käll L, Noble W: Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of proteome research 2009,8(7):3737–3745. 10.1021/pr801109k
Käll L, Canterbury J, Weston J, Noble W, MacCoss M: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007,4(11):923–925. 10.1038/nmeth1113
Liang X, Xia Z, Niu X, Link AJ, Pang L, Wu F, Zhang H: A fuzzy cluster-based algorithm for peptide identification. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on. IEEE; 2012:602–609.
Sanders S, Jennings J, Canutescu A, Link A, Weil P: Proteomics of the eukaryotic transcription machinery: identification of proteins associated with components of yeast TFIID by multidimensional mass spectrometry. Molecular and cellular biology 2002,22(13):4723–4738. 10.1128/MCB.22.13.4723-4738.2002
SGD: Saccharomyces Genome Database. 2012. [http://www.yeastgenome.org]
GenBank: NCBI gene bank. 2012. [http://www.ncbi.nlm.nih.gov/genbank]
Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 1987, 20: 53–65.
Petrovic S: A comparison between the silhouette index and the Davies-Bouldin index in labelling IDS clusters. Proceedings of the 11th Nordic Workshop of Secure IT Systems 2006, 53–64.
Zhou W, Zhang L, Jiao L: Linear programming support vector machines. Pattern recognition 2002,35(12):2927–2936. 10.1016/S0031-3203(01)00210-2

Acknowledgements

XN and AJL were supported by NIH grant GM64779. LP was supported by NSF of China under grant 11171049.

Declarations

The publication costs for this article were funded by Xijun Liang.

This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.

Author information

Authors and Affiliations

School of Mathematical Sciences, Dalian University of Technology, Dalian, 116024, China
Xijun Liang, Liping Pang & Hongwei Zhang
Dept. of Computer Science, Western Kentucky University, Bowling Green, KY, 42101, USA
Zhonghang Xia
Dept. of Pathology, Microbiology and Immunology, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
Xinnan Niu & Andrew J Link
Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Dr, Saskatoon, SK, S7N 5A9, Canada
Fang-Xiang Wu

Corresponding author

Correspondence to Zhonghang Xia.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XL and ZX designed the basic FC-Ranker algorithm and wrote the manuscript. XN, AL and FW designed the large-scale version of the FC-Ranker algorithm and the corresponding experiments. XL, LP and HZ designed and carried out the experiments. All authors read and approved the final manuscript.

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Liang, X., Xia, Z., Niu, X. et al. Peptide identification based on fuzzy classification and clustering. Proteome Sci 11 (Suppl 1), S10 (2013). https://doi.org/10.1186/1477-5956-11-S1-S10

Published: 07 November 2013
DOI: https://doi.org/10.1186/1477-5956-11-S1-S10