An unsupervised machine learning method for assessing quality of tandem mass spectra
© Lin et al; licensee BioMed Central Ltd. 2012
Published: 21 June 2012
Volume 10 Supplement 1
In a single proteomic project, tandem mass spectrometers can produce hundreds of millions of tandem mass spectra. However, the majority of tandem mass spectra are of poor quality, and searching them for peptides wastes time. Therefore, quality assessment before database search is very useful in the pipeline of protein identification via tandem mass spectra, especially for reducing searching time and decreasing false identifications. Most existing methods for quality assessment are supervised machine learning methods based on a number of features that describe the quality of tandem mass spectra. These methods need training datasets in which the quality of every spectrum is known, and such datasets are usually unavailable for new datasets.
This study proposes an unsupervised machine learning method for quality assessment of tandem mass spectra that requires no training dataset. The proposed method estimates the conditional probabilities of spectra being of high quality from the quality assessments based on individual features. The probabilities are estimated through a constraint optimization problem. An efficient algorithm is developed to solve the constraint optimization problem and is proved to be convergent. Experimental results on two datasets illustrate that if we search only the tandem spectra judged to be of high quality by the proposed method, we can save about 56% and 62% of database searching time while losing only a small fraction of high-quality spectra.
Results indicate that the proposed method performs well in the quality assessment of tandem mass spectra and that the way we estimate the conditional probabilities is effective.
Proteomics is the systematic study of proteins in order to understand their structures and functional relations. One area in proteomics is the identification of proteins in biological complexes via peptides identified from tandem mass spectra. Commonly used methods for identifying peptides from tandem mass spectra can be divided into two categories: database searching methods such as Mascot and SEQUEST, and de novo sequencing methods such as PEAKS and PepNovo. Unfortunately, a large number of poor-quality spectra, which contain too little, irrelevant, or ambiguous information, are commonly observed in tandem mass spectral datasets. The existence of poor-quality spectra not only slows down the identification process, but also increases the number of false positives and false negatives. In the experiments of Keller et al, a mixture of 29 proteins produced 37,071 tandem mass spectra, of which only 2,784 spectra originated from those 29 proteins, while the remaining spectra could be removed from the analysis without losing any relevant protein information. Hence, it is worthwhile to develop an automatic quality assessment algorithm to discriminate high-quality from poor-quality spectra before further interpretation.
Spectral quality assessment methods select high-quality spectra for further processing, but do not change the selected spectra themselves. Several spectral quality assessment methods have been developed in recent years. Existing methods generally define a number of features to describe the quality of spectra [10–15]. Based on the defined features, these methods assess the quality of tandem mass spectra with supervised machine learning, which requires labelled training datasets to train a classifier. The trained classifier is then used to classify spectra as high quality or poor quality. Ideally, the training set should be validated by peptide identification algorithms or manual checking, i.e., the set should be correctly labelled with no or very few falsely labelled spectra. However, this information is hard to obtain prior to peptide identification for a new dataset. Even worse, tandem mass spectrometers may produce different spectra for the same peptide under different experimental conditions, so a classifier trained on one dataset may not be effective on another. Therefore, unsupervised machine learning methods are appealing for assessing the quality of tandem mass spectra. In previous work, we applied weighted k-means to cluster tandem mass spectra into a high-quality cluster and a poor-quality cluster, based on previously defined features.
In the literature, hundreds of features have been defined to describe the quality of tandem mass spectra, some of which are closely relevant while others are not. In previous work, Ding et al used a two-stage recursive feature elimination method based on support vector machines (SVM-RFE) to select the most relevant features from those collected in the existing literature. To verify the relevance of the selected features, classifiers were trained with different sets of selected features and their performances were analyzed. The results demonstrate that sets with a small number of features outperform the full set of features, which indicates that the selected features together can better describe the quality of tandem mass spectra and hence improve the performance of tandem mass spectral quality assessment.
In this paper, we propose an unsupervised machine learning method that uses the set of 10 most relevant features from the previous work to assess the quality of tandem spectra. These 10 features have clear physical meanings: the higher the value of an individual feature for a spectrum, the more likely the spectrum is of high quality. Therefore, each individual feature can be used to assign a spectrum to the high-quality or poor-quality class with a user-specified threshold. However, the precision of the assessment from each individual feature is too low. The method proposed in this paper integrates the assessments from all 10 individual features into a consensus assessment with better precision, based on a constraint optimization model. The remainder of the paper is organized as follows. The "Method" section introduces the 10 features, describes the constraint optimization model and then presents an iterative algorithm to solve it. The "Experimental results and discussion" section investigates the performance of the proposed quality assessment method on two low-resolution tandem mass spectral datasets, and the results are presented and discussed. The "Conclusions and future work" section concludes this study and points out some directions for future work.
In this section, the 10 features used for quality assessment of tandem mass spectra are introduced in subsection A. In subsection B, we describe a graph-based consensus optimization method to integrate individual assessments into a consensus assessment and propose an algorithm to solve this optimization problem. The convergence of the algorithm is also proved.
A tandem mass spectrum usually contains tens to hundreds of m/z values with their corresponding signal intensities. In the literature, hundreds of features have been proposed to describe the quality of tandem mass spectra, for example [19–21]. In the previous study, after removing the noisy peaks with the morphological reconstruction method [22, 23], the 10 most relevant spectral features were selected using support vector machine methods [14, 17]. They are introduced as follows:
Feature 1 is proposed by Bern et al and defined as the total normalized intensity of pairs of peaks whose m/z values sum to the mass of the precursor ion. This feature is based on the reasonable assumption that peaks with lower intensity are noise and that complementary peaks are more likely to be signal.
Feature 2 is proposed by Flikka et al and defined as the mass of the uncharged precursor ion. This feature is based on the observation that most poor-quality spectra have precursor ions of small mass, as they may have come from peptides that are not long enough or from noisy chemical molecules.
Feature 3 is proposed by Wu et al and defined as the number of peaks whose mass difference equals the mass of one of the 20 amino acids (all peaks are considered singly charged). The comparison uses a tolerance, which is set to 0.5 Da. This feature reflects that in the theoretical tandem mass spectrum of a peptide, consecutive ions of the same type (for example, b-ions) differ from their neighbors by the mass of one amino acid.
Feature 4 is proposed by Flikka et al and defined as the average delta mass, that is, the average of all mass differences between any two neighboring peaks in a spectrum. This feature reflects that spectra that are too dense are of poor quality [15, 20, 24].
where $M(x)$ is the m/z value of peak $x$ and $M_1, M_2, \ldots, M_{20}$ represent the masses of the 20 amino acids (not all of which are unique). The comparison implied by ≈ uses a tolerance, which was set to 0.5 Da. Similar to Feature 3, it measures how likely two peaks are to differ by the mass of an amino acid.
Feature 6 is proposed by Wu et al and defined as the number of pairs of complementary peaks. A pair of peaks is complementary if the sum of their m/z values is equal to the mass of the precursor ion (all peaks are considered singly charged). This feature measures how likely it is that an N-terminus ion and a C-terminus ion in the tandem mass spectrum are produced as peptide fragments from the same peptide bond.
Feature 7 is proposed by Wu et al and defined as the number of pairs of peaks whose m/z difference is equal to the mass of a water molecule or an ammonia molecule (all peaks are considered singly charged). This feature measures how likely it is that one ion in a peptide tandem mass spectrum is produced by the loss of a water or ammonia molecule from another ion.
Feature 8 is proposed by Wong et al and defined as the ratio of the number of peaks that have a relative intensity greater than 1% of the total intensity to the total number of peaks in a spectrum. The reasoning for this feature is similar to that for Feature 1.
Feature 9 is proposed by Flikka et al and defined as the standard deviation of the delta mass values (all mass differences between any two neighboring peaks) in a spectrum. The reasoning for this feature is similar to that for Feature 4.
Feature 10 is proposed by Wu et al and defined as the number of pairs of peaks whose m/z difference is equal to the mass of a CO group or an NH group (all peaks are considered singly charged). This feature measures how likely it is that one ion in a peptide CID mass spectrum is a supportive ion. Two kinds of supportive ions (a-ions and z-ions) were considered.
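Some of the feature definitions above translate almost directly into code. The following is a minimal sketch of Features 4, 6, 8 and 9 for a single spectrum given as arrays of peak m/z values and intensities. The function names, the use of NumPy, the population form of the standard deviation, and the 0.5 Da tolerance for Feature 6 (the text states this tolerance only for the amino acid comparisons) are assumptions made for illustration. Noise removal by morphological reconstruction is not included.

```python
import numpy as np

def delta_mass_stats(mz):
    """Features 4 and 9: mean and standard deviation of the m/z gaps
    between neighboring peaks (population standard deviation assumed)."""
    gaps = np.diff(np.sort(np.asarray(mz, dtype=float)))
    return float(gaps.mean()), float(gaps.std())

def count_complementary_pairs(mz, precursor_mass, tol=0.5):
    """Feature 6: number of peak pairs whose m/z values sum to the
    precursor mass, within an assumed tolerance in Da."""
    mz = np.sort(np.asarray(mz, dtype=float))
    count = 0
    for i in range(len(mz)):
        for j in range(i + 1, len(mz)):
            if abs(mz[i] + mz[j] - precursor_mass) <= tol:
                count += 1
    return count

def intense_peak_ratio(intensity, fraction=0.01):
    """Feature 8: share of peaks whose intensity exceeds 1% of the
    total intensity of the spectrum."""
    intensity = np.asarray(intensity, dtype=float)
    return float(np.mean(intensity > fraction * intensity.sum()))

# Toy spectrum with four peaks (values are illustrative only).
mz = [100.0, 200.0, 350.0, 400.0]
intensity = [1.0, 50.0, 30.0, 19.0]
mean_gap, std_gap = delta_mass_stats(mz)
n_pairs = count_complementary_pairs(mz, precursor_mass=500.0)
ratio = intense_peak_ratio(intensity)
```

The pairwise loop is quadratic in the number of peaks, which is acceptable for spectra with tens to hundreds of peaks.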
From the definitions and physical meanings of these features, the larger the feature values, the more likely the spectra are of high quality. Therefore, each of these features can be used on its own to assess the quality of tandem mass spectra and to divide them into two categories according to the feature values: one of high quality and another of poor quality. However, such individual assessments are not as good as the assessment from the combination of all 10 features [14, 17].
An object pool classified into several groups
It is obvious that the value of the cost function is zero if the assessments based on the m individual features are in perfect agreement. Nevertheless, this does not happen in practice. Therefore, the desired resultant matrix is obtained when the cost function in the constraint optimization problem (2) reaches its minimal value. Finally, every spectrum is assigned a probability of belonging to class z directly according to the values in this matrix.
From constraint optimization problem (2), we can see that for the given matrix U the objective function is quadratic in elements of matrix Q and that for the given matrix Q the objective function is quadratic in elements of matrix U. We therefore propose the following iterative algorithm to solve this optimization problem.
Step 1: Initialize Q by Y, that is, set $Q^{0} = Y$ and $t = 0$.
Step 2: t=t+1,
Step 3: Stop if $\|U^{t} - U^{t-1}\| \le \varepsilon$ and output U, where ε is a user-specified small positive number; otherwise, return to Step 2.
The solutions of the above algorithm at every iteration t satisfy all constraints in optimization problem (2). We can use mathematical induction to prove that
for t = 1, 2, ...
Therefore, for any positive integer t, (7a) and (7b) are true.
From the inequality above, $J(U^{t}, Q^{t})$ is non-increasing as the iteration number t increases. On the other hand, $J(U^{t}, Q^{t})$ is bounded below. Therefore, $\lim_{t \to \infty} J(U^{t}, Q^{t})$ exists, that is, our algorithm converges.
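One way to spell out the monotonicity step is sketched here, under the assumption that each half-step of the algorithm exactly minimizes J over one matrix while the other is held fixed (U first, then Q), which is consistent with J being quadratic in each matrix separately:

```latex
\begin{align*}
J(U^{t+1}, Q^{t+1}) \;\le\; J(U^{t+1}, Q^{t}) \;\le\; J(U^{t}, Q^{t}),
\qquad t = 1, 2, \ldots
\end{align*}
```

The right-hand inequality holds because $U^{t+1}$ minimizes $J(\cdot, Q^{t})$, and the left-hand one because $Q^{t+1}$ minimizes $J(U^{t+1}, \cdot)$. A non-increasing sequence that is bounded below has a limit, which is the convergence statement used in the text.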
The algorithm reflects that at each iteration the probability estimate of a group node in Q receives information from its neighboring spectral nodes while not deviating too much from its initial value in Y. In return, the updated probability estimates of the group nodes propagate the information back to their neighboring spectral nodes. The propagation stops when the process converges. The process converges to a stationary point.
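The propagation described above can be sketched in a few lines. The update formulas of Step 2 are not reproduced in this text, so the sketch assumes the closed-form updates commonly used in graph-based consensus maximization: each spectrum estimate is the average of the estimates of the groups it belongs to, and each group estimate is a weighted average of its member spectra and its initial label in Y, with weight α on Y. The function name, the membership matrix `A`, the default tolerance, and the iteration cap are hypothetical.

```python
import numpy as np

def consensus_probabilities(A, Y, alpha=90.0, eps=1e-6, max_iter=1000):
    """Alternate between spectrum estimates U and group estimates Q.

    A: (n_spectra, n_groups) 0/1 membership matrix; every spectrum must
       belong to at least one group.
    Y: (n_groups, n_classes) initial class indicator of each group.
    The update rules are an assumption, not taken from the source text.
    """
    A = np.asarray(A, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Q = Y.copy()                                      # Step 1: Q starts at Y
    spectrum_degree = A.sum(axis=1, keepdims=True)    # groups per spectrum
    group_degree = A.sum(axis=0)[:, None]             # spectra per group
    U = np.full((A.shape[0], Y.shape[1]), 1.0 / Y.shape[1])
    for _ in range(max_iter):                         # Step 2
        # Spectra collect information from their neighboring groups.
        U_new = (A @ Q) / spectrum_degree
        # Groups collect information from their spectra, pulled toward Y.
        Q = (A.T @ U_new + alpha * Y) / (group_degree + alpha)
        converged = np.linalg.norm(U_new - U) <= eps  # Step 3
        U = U_new
        if converged:
            break
    return U, Q

# Toy case: 4 spectra, 2 features, hence 4 groups ordered as
# (feature 1 high, feature 1 poor, feature 2 high, feature 2 poor).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]])
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
U, Q = consensus_probabilities(A, Y)
```

With these updates every row of U and Q remains a probability vector, because each row is a convex combination of probability vectors. Column 0 of U then plays the role of the estimated probability that a spectrum is of high quality.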
To evaluate our proposed method, experiments are conducted on two low resolution tandem mass spectral datasets: TOV and ISB.
The tandem mass spectra in this dataset were acquired from an LCQ DECA XP ion trap spectrometer (ThermoElectron Corp.), as described previously. The number of spectra in this dataset is 22,576, and these spectra were searched using SEQUEST against ipi.HUMAN.v3.42.fasta, which contains 72,340 protein sequences and 5 contaminant sequences.
The spectra in this dataset were acquired from a mixture of 18 control proteins, which was analyzed by μLC-MS on an ESI-ITMS (ThermoFinnigan, San Jose, CA) using a standard top-down data-dependent ion selection approach. This dataset consists of 37,044 tandem mass spectra. These spectra were searched with the SEQUEST search program against a human protein database appended with the sequences of the 18 standard proteins and other common contaminants (5,395 protein sequences in total in the final database).
The distribution of multiply charged spectra in the ISB and TOV datasets
In the experiment, we applied the proposed method to both datasets, starting from assessments based on individual features. For each feature, the spectra with the top 50% of feature values are assigned to the high-quality class. The parameter α in the model was set to 90.
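The top-50% rule can be sketched as follows. For a feature matrix with one row per spectrum and one column per feature, each feature produces two groups (high and poor), and each spectrum is placed in one of them. The function name, the column ordering of the groups, the use of `numpy.quantile` for the cutoff, and the treatment of ties at the cutoff (assigned to the poor group) are assumptions made for illustration.

```python
import numpy as np

def build_group_membership(features, top_fraction=0.5):
    """Turn per-feature thresholds into group memberships.

    features: (n_spectra, m) array, larger values meaning better quality.
    Returns A with shape (n_spectra, 2m) and Y with shape (2m, 2), where
    columns 2k and 2k+1 of A are the high and poor groups of feature k.
    """
    features = np.asarray(features, dtype=float)
    n, m = features.shape
    # Per-feature cutoff such that roughly top_fraction of spectra lie above it.
    cutoffs = np.quantile(features, 1.0 - top_fraction, axis=0)
    high = features > cutoffs            # ties at the cutoff count as poor
    A = np.zeros((n, 2 * m))
    A[:, 0::2] = high
    A[:, 1::2] = ~high
    # Initial labels: high groups start as [1, 0], poor groups as [0, 1].
    Y = np.tile(np.array([[1.0, 0.0], [0.0, 1.0]]), (m, 1))
    return A, Y

# Four spectra, two features (values are illustrative only).
features = [[1.0, 40.0],
            [2.0, 30.0],
            [3.0, 20.0],
            [4.0, 10.0]]
A, Y = build_group_membership(features)
```

Each row of `A` contains exactly m ones, one per feature, so every spectrum is connected to m group nodes.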
Furthermore, our method achieved a better result on the TOV dataset than on the ISB dataset. This may be because there are more poor-quality spectra in the ISB dataset (35997/37044 = 97%) than in the TOV dataset (21440/22576 = 95%). A high percentage of poor-quality spectra makes quality assessment more challenging. Another possible reason is that there are more triply charged spectra in the ISB dataset (18044) than in the TOV dataset (9732). Triply charged spectra contain more doubly charged peaks than doubly and singly charged spectra do. The quality of triply charged spectra is not well described by the 10 features used in this paper; in particular, Features 3, 6, 7 and 10 are designed only for singly charged peaks, while triply charged spectra produce many doubly charged peaks [25, 26].
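The class proportions quoted above, and the true positive and true negative rates used to summarize performance, follow from simple counts. A minimal sketch, with hypothetical function names and a made-up toy labelling for the rate computation:

```python
def poor_quality_share(n_poor, n_total):
    """Fraction of spectra in a dataset that are of poor quality."""
    return n_poor / n_total

def tpr_tnr(truth, predicted):
    """True positive and true negative rates, where True means high quality."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

isb_share = poor_quality_share(35997, 37044)   # about 0.97
tov_share = poor_quality_share(21440, 22576)   # about 0.95

# Toy labelling (not data from the paper): 2 high-quality, 4 poor-quality.
truth = [True, True, False, False, False, False]
predicted = [True, False, False, False, False, True]
tpr, tnr = tpr_tnr(truth, predicted)
```

The true positive rate is the fraction of high-quality spectra that are kept, and the true negative rate is the fraction of poor-quality spectra that are filtered out before the database search.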
This paper has presented an unsupervised machine learning method that integrates the assessments based on individual features (which are easy to obtain but have low precision) into a consensus assessment with higher precision. The method first estimates the conditional probability of a spectrum being of high quality from the assessments based on individual features. The probabilities are estimated by solving a constraint optimization problem. Experimental results illustrate that if we search only the spectra assessed as high quality in TOV and ISB, we can save about 56% and 62% of searching time while losing only 9% and 10% of high-quality spectra, respectively. This result indicates that the proposed method is useful in saving database searching time. Besides, at a true positive rate of 90%, our new method reaches true negative rates of 74% and 63%, respectively. This indicates that the new method performs well in the quality assessment of tandem mass spectra. This result also shows that the way we estimate the conditional probability is effective.
However, the proposed method could be improved in several ways in future work. For example, four of the ten features we adopted are calculated for singly charged peaks only. This makes the classification method less effective on triply or more highly charged spectra. In the future, we may adopt different features for spectra of different charges. In this study, the value of α and the percentage cut-off for individual features were chosen by repeated trial and error. In the future, a more objective method should be developed for specifying these values. In addition, the proposed constraint optimization model can be applied to other unsupervised classification problems in bioinformatics and proteomics.
This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to thank Dr. Andrew Keller from the Institute for Systems Biology for generously providing spectral data and protein databases for the ISB dataset, and Dr. Guy G. Poirier from Laval University for providing the TOV dataset and search results. We also thank Mr. Jiarui Ding for providing the program for computing the features.
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.