Methods for peptide identification by spectral comparison

Background Tandem mass spectrometry followed by database search is currently the predominant technology for peptide sequencing in shotgun proteomics experiments. Most methods compare experimentally observed spectra to the theoretical spectra predicted from the sequences in protein databases. There is a growing interest, however, in comparing unknown experimental spectra to a library of previously identified spectra. This approach has the advantage of taking into account instrument-dependent factors and peptide-specific differences in fragmentation probabilities. It is also computationally more efficient for high-throughput proteomics studies. Results This paper investigates computational issues related to this spectral comparison approach. Different methods have been empirically evaluated over several large sets of spectra. First, we illustrate that the peak intensities follow a Poisson distribution. This implies that applying a square root transform will optimally stabilize the peak intensity variance. Our results show that the square root did indeed outperform other transforms, resulting in improved accuracy of spectral matching. Second, different measures of spectral similarity were compared, and the results illustrated that the correlation coefficient was most robust. Finally, we examine how to assemble multiple spectra associated with the same peptide to generate a synthetic reference spectrum. Ensemble averaging is shown to provide the best combination of accuracy and efficiency. Conclusion Our results demonstrate that when combined, these methods can boost the sensitivity and specificity of spectral comparison. Therefore they are capable of enhancing and complementing existing tools for consistent and accurate peptide identification.


Background
One key issue in proteomics is to identify proteins and characterize their expressions in cells. Tandem mass spectrometry paired with advanced liquid chromatography has emerged as the standard technique for high throughput protein identification [1,2]. This shotgun technology does not require the initial separation of individual proteins and therefore can be applied to complex mixtures.
Typically, a tissue sample is first fractionated, the resulting mixture of proteins is digested into peptides by an enzyme such as trypsin. The peptide mixture is then separated by High Performance Liquid Chromatography (HPLC), ionized and sent to a mass spectrometer to measure the mass/ charge ratio of each peptide. Peptides of interest are selected for further fragmentation in a collision cell to produce tandem (MS/MS) mass spectra. A MS/MS spectrum consists of a sequence of peaks, each characterizing the mass/charge ratio and intensity of an ion. Computer software is then used to identify the peptide sequence associated with each MS/MS spectrum. Finally, the identified peptides are grouped together to determine the underlying proteins.
Historically, methods for identifying peptides from MS/ MS spectra can be categorized into two general classes. In the first group, De Novo sequencing methods such as PEAKS [3] reconstruct the peptide sequence from the spectrum based on knowledge of the peptide fragmentation pattern. This class of algorithms requires high quality spectra with nearly complete ladders of b/y ions. The second method is database search. This approach compares the experimental spectrum against theoretical spectra determined by the in silico digestion and fragmentation of known proteins in a sequence database. This approach is currently preferred due to its reliability in practice. MAS-COT [4] and SEQUEST [5] are examples of this approach which employ sophisticated statistical models to determine the similarity of experimental and theoretical spectra. Some recent studies [6,7] further proposed the combination of the two approaches by using sequence tags, determined by De Novo to reduce the number of peptides during the database search.
Peptide fragmentation is a complex process, so spectra generated from mass spectrometers are often significantly different from their theoretical counterparts. As a result, sophisticated algorithms, such as dynamic programming [3] and Hidden Markov Models [8], have been used in the kernel of database search methods to recognize spectra in such circumstances. These algorithms require very intense computations so that the search time becomes a bottleneck, especially when high-throughput experiments produce hundreds of thousands of spectra in hours. This problem is exacerbated when the possibility of unknown post-translational modifications (PTMs) must be considered [9]. It not only drastically increases computations required for a blind PTM search, but also severely jeopardizes the discovery of true positives. To cope with these challenges, peptide identification by direct comparison of experimental spectra has drawn much attention. This approach has the advantage of taking into account instrument-dependent or peptide-specific contributing factors for producing tandem spectra. Moreover, duplicate spec-tra are ubiquitous in large scale proteomic data [10] due to the following facts: 1) many proteins may share the same peptides; 2) the same peptides may be fragmented multiple times or repeated in different runs. It would also reduce the search time substantially by recognizing the duplicates. Therefore, spectral comparison provides another means to support high throughput peptide sequencing with current techniques [11][12][13][14][15][16]. The software package NoDupe [10] was designed to detect duplicate peptide spectra. Other tools, such as Pep-Miner [17] and MS2grouper [18], cluster spectra by their similarity and derive a representative spectrum. Tools of this type allow comparison of the protein/peptide contents of different sample mixtures without the actual identification of peptides. Recently, software tools X! Hunter [19] and Bib-lioSpec [20] have been available in the public domain for searching libraries and compiling spectra. In this study, we examined some computational issues related to the identification of peptides by spectral comparison. Our goal was to explore the methods to identify peptides by comparing experimental spectra to a library of reference spectra of known origin. We first surveyed related studies and conducted a comparative study on fundamental issues of spectral comparison. These include how to normalize peak intensities, vectorize spectra, and measure their pairwise similarity. We then investigated how to construct a robust representative spectrum from a group of spectra produced from the same peptide.

Results and discussion
In this section, we present the experimental results for peptide identification by evaluating the spectral similarity. Detailed specifications of some terms and experimental spectra are presented in the Methods section.

Experimental results
As described in the Methods section, it is assumed that peak intensities follow Poisson distribution. We first verified this hypothesis as the basis for further optimization. About 1,000 spectra were generated by the direct infusion of the peptide GluFib (peptide sequence EGVNDNEEGFF-SAR) during the calibration of the mass spectrometer. Fig.  1 shows the distribution of ion intensities recorded at three different m/z values, fitted by the theoretical Poisson distributions. Although the peak intensities for different ions have different mean values and variability, a Poisson distribution provides a close approximation for each of these ions. This enables us to stabilize the peak intensity by square root transform.
In this study, three datasets were used to examine the effectiveness of our proposed methods. Details on the spectra can be found in the Datasets section. For these datasets, the mass errors of different spectra assigned to the same peptides were generally very small, and over 90% of them were less than 0.1 Da. During profile generation, the bin size and error window were set at 1 and 0.1 Da, respectively. Since spectra from the same peptide with different charge states may be very different [21], spectra were compared only when their precursor ion masses were within 2 Da error tolerance and with the same charge states.
All the experimental spectra were previously matched to peptides using MASCOT. These peptide identifications were used to cluster the spectra into groups matching the same peptide. Ideally the pairwise similarity score should be high for two spectra from the same peptide but low for spectra from different peptides. To verify this assumption, we examined the distributions of similarity scores for spectra assigned to the same and different peptides for various configurations. These two score populations were denoted as P ss and P sd , respectively. Experiments were conducted to examine the effectiveness of different similarity measures of pairwise spectral comparison. Fig. 2 shows the distributions of three similarity scores generated from dataset 1: ratio of shared peaks, cosine correlation value, and correlation coefficient. It is evident that there is a greater separation between P ss and P sd for the cosine and correlation measures than for the shared peak ratio. This suggests that they would provide a better basis for comparing spectra. Correlation coefficients appeared to be a little more robust than the cosine values since the P ss values tended to be smaller. Therefore, the correlation coefficient was used as the similarity measure in the remainder of this work.
The selection of a similarity measure and the way reference spectra are modeled constitute a clustering configu- The peak intensities in GluFib MS/MS spectra can be approximated by a Poisson distribution Figure 1 The peak intensities in GluFib MS/MS spectra can be approximated by a Poisson distribution. Histograms of peak intensities observed in sequential ms/ms runs are shown. The superimposed fitting curves are those predicted for the Poisson distributions. Each panel illustrates the result for peaks at a different m/z value. To assess the effects of the different factors in clustering spectra, we tested eight typical clustering configurations over dataset 1. Table 1 specifies the details of each configuration. Spectra matched to the same peptide were clustered together. A total of 585 clusters were constructed, varying in size from 2 to 19 spectra. A reference spectrum was constructed for each clustering configuration, except for the closest neighbor scheme where all spectra served as representatives in turn during the matching process. Each experimental spectrum was then compared to all references within error tolerance of precursor mass. The spectrum was assigned to the peptide associated with the reference with the highest similarity scores. Finally, we examined whether the assignments were consistent with those of the MASCOT identifications. Assuming that the MASCOT results were correct, Receiver Operating Characteristic (ROC) curves were then computed for each configuration. Their performances can be evaluated by the areas under the ROC curves, where a larger area usually indicates better performance. The eight configurations in Table 1 were designed to unveil the impact of various intensity transforms and reference modeling on spectral comparison. Fig. 3 shows the effectiveness of the different intensity transforms. The best performance was obtained with the two configurations in which the spectra were transformed by the square root prior to comparison. The overall performance of these configurations was substantially better than those obtained with no transform, the logarithm transform, or by weighting the intensity with the square of the m/z ratio (SMZ). Thus, the variance stabilization property of the square root transform improved the performance. It is apparent that profiling-based comparison was only slightly better than that of direct binning. We attribute this to the fact that the spectra were produced by high resolution QTOF. The B/Y peaks in the spectra of the same peptides were well aligned; their m/z errors were generally smaller than 0.1 Da. Consequently, only 1% of the peaks Distributions of scores for different similarity measures for spectra (dataset 1) from the same and different peptides: (a) shared peak ratio, (b) cosine value and (c) correlation coefficient were allocated to neighboring bins in spectral profiles. For low accuracy instruments with larger m/z errors, we anticipate that the profiling technique will perform more robustly. Fig. 4 further illustrates how the choice of the reference spectra influenced sensitivity and specificity. The best performance was obtained using the average spectra as the references. The closest neighbors configuration performed almost as well as the average spectra. However, it is much more computationally efficient to use the average spectra since it requires only a single pairwise comparison for each candidate peptide. Both of them performed somewhat better than using the individual experimental spectra as the reference. The ROC curve for theoretical spectra was much worse than the other three. This observation suggests that peptide and instrument-specific factors play a significant role in the direct spectral comparison.

Generalizing the performance of reference spectra
As stated earlier, the error tolerance of the precursor mass was set to 2 Da for pairwise comparison during the experiments. In our dataset there were typically only a few spectra from different peptides within this tolerance range. However, for large scale proteomics studies, an experimental spectrum may have to be compared with many reference spectra. To assess the performance expected under such circumstances, we ran a Monte Carlo simulation for different numbers of candidate peptides. The sensitivity and specificity for a cutoff similarity score s c was calculated as follows. We assume that there are k reference spectra within the mass error tolerance window for a given spectrum. Assign the spectrum to the reference spectrum with the highest spectral similarity measure above the threshold s c . There are two possibilities: either the given spectrum is generated from the peptide associated with the reference spectrum or not. The sensitivity and specificity can be determined from either possibility. In the first case, one score s was randomly drawn from P ss , and the remaining k -1 scores {d 1 ,..., d k-1 } were drawn from P sd . Repeating this procedure for a sufficient number of times determines the sensitivity as the frequency of s ≥ s c and s ≥ d i , where i = 1,... k -1. Similarly, in the second case, k scores {d 1 ,..., d k } from P sd were drawn at each repeat. Specificity is then approximated by the frequency of s c ≥ d i , where i = 1,..., k. Table 2 shows the ROC areas using average and individual spectra as references when the number of candidate spectra increases. It is clear that the average spectra as the reference consistently outperformed individual spectra. Its advantage becomes more pronounced when the number of candidate peptides is large.
According to the traditional academic point system, a classifier is considered to be excellent or good when the area under its ROC curve is in the range of [0.9,1] or [0.8,0.9]. We noticed that when the number of candidate peptides reaches 1,000, the spectral comparison does not provide a good balance of sensitivity and specificity. This is a common drawback for statistical measures of spectral similarity. For instance, to ensure satisfactory sensitivity, SEQUEST [5] only compares the cross correlation of the top 500 peptides produced from the preliminary selection stage. In practice, however, the database search is usually limited to a specific taxonomy. Since the number of candidate peptides is relatively small after the search scope is narrowed down on the basis of parent ion mass, this method will provide adequate sensitivity for spectral comparison.

Further validation
To determine how well the above results are reproducible, we cross-examined spectra from datasets 1 and 2. Within each dataset, spectra assigned to the same peptides were clustered. There were 537 clusters of size 2, 255 clusters of size 3, and 388 clusters with more than 3 spectra. Their average spectra were also constructed. As described in the Datasets section, these two datasets shared many peptides. For each pairwise comparison, an individual and a reference spectra were drawn from different datasets. Score distributions similar to those in Fig. 2 were generated. This reproducibility confirmed the robustness of our approach for spectral comparison.
One intriguing question is how the cluster sizes influence the performance of average spectra. To answer this question, the similarity scores are further partitioned based on the cluster sizes. Fig. 5 depicts the cumulative distributions of similarity scores. As the cluster size increased, the similarity between references and spectra from different peptides tended to decrease, while their similarity to those of the same peptides increased. Therefore, increasing the cluster sizes helps to enhance the performance asymptotically while the noisy peaks diminish gradually after averaging.
Dataset 3 was used to further examine the quality of ensemble average spectra. The original spectra were anno-tated by searching against customized protein databases for specific taxonomies. Based on their initial annotations, we derived 35 clusters, whose average spectra were also constructed. The individual and the average spectra were searched against the NCBInr database using the MASCOT online server. Fig. 6 shows the MASCOT scores for individual and average spectra. Note that the initial peptide identifications were based upon a search against smaller customized protein databases, therefore some spectra were not recognized when the search space increased to the whole NCBInr database. Fig. 6 illustrates that all the average spectra were correctly recognized. Moreover, for two clusters, the average spectra were identified even though original spectra failed MASCOT. In other words, the average spectra generally retained the quality to be identifiable since the peak amplitude of noise after averaging was decreased.
ROC curves obtained for spectral comparison (dataset 1) using different signal transforms

Conclusion
Database search is currently the prevailing approach to sequence peptides from MS/MS spectra. However, this approach is often compromised by a large number of unassigned spectra because the fragmentation process is both peptide-specific and instrument-dependent. On the other hand, it is very computationally expensive to analyze spectra when the peptides have undergone unknown modifications and mutations or the protein database is very large. Current tandem mass spectrometry has standardized the conditions used for acquiring data, therefore similar experimental spectra are highly reproducible for the same peptides. This enables the direct comparison of experimental spectra as an alternative method. Some recent studies X! Hunter [19] and BiblioSpec [20] have demonstrated that it is practically feasible to construct representatives from spectra, even they were generated by different brands and types of instruments. A pairwise comparison of spectra can be finished within tens of milliseconds; it is much faster than a traditional database search that can take up to a few minutes on a modern computer. Therefore, it provides an efficient alternative to identify peptides via tandem mass spectra. This paper systematically examined some methodological and computational issues related to this approach. Our experimental results confirmed that transforming intensity signals by the square root stabilizes variance and therefore facilitates spectral comparison. Also, statistical measures such as correlation coefficients can provide a good indication of their pairwise similarity. In contrast to other studies focusing on optimizing peak selection to construct representative spectra, we discovered that ensemble averaging spectra is an effective and efficient way to generate reference spectra. This method outperformed the use of individual experimental spectra as a ref- Figure 4 Effects of different models of reference spectra (dataset 1) on the ROC curves. Ensemble average and closest neighbor perform better than other methods.  erence, and the advantage increased as the candidate peptide set became large.

Effects of different models of reference spectra (dataset 1) on the ROC curves
In summary, our methods and results provide a sound basis to further develop algorithms to optimize peptide identification by spectral comparison.

Methods
In this section, we first describe the spectra used in our experiments. We then address the computational issues related to spectral comparison at different levels. These subjects include how to adjust peak intensity from raw spectra and vectorize spectral data, the measurement of spectral similarity, and the methods to build up the reference spectra.

Datasets
In our experiments, three datasets of tandem mass spectra were used to validate our methods. Two of them were obtained from highly enriched rat liver organelle (endoplasmic reticulum) fractions provided by McGill University and the University of Montreal. Organelle collection, purification, and characterization were carried out as previously described [22][23][24]. Organelle proteins were sepa-rated using 1D-SDS PAGE, in-gel trypsinized, excised, and analyzed by an LC QTOF mass spectrometer, using standard methods [25,26]. Raw tandem MS data were processed using MASCOT Distiller for peak detection, and processed spectra were searched against the NCBInr database with taxonomy restricted to rattus (30,949 protein sequences) using MASCOT Cluster v 1.9.03 software [27]. The spectra along with the MASCOT database search results were then stored in an in-house Cellmap database. More details on the experimental protocols and spectra can be found in [28].
The first dataset was obtained from the analysis of the entire fraction of the endoplasmic reticulum rough membrane. Initially the dataset had 8,323 spectra, but after removing the spectra with negative MASCOT scores (the difference between the identity score and peptide score for a given spectrum), only 5,454 spectra were left, leading to the identification of 3,364 unique peptides. The second set of spectra was used for further cross validation. These data were acquired with the same procedures as the first dataset, except that the sample was the detergent fraction of the endoplasmic reticulum rough membrane. Initially this dataset was composed of 13,966 spectra. Only 4,051 Cross validation between datasets 1 and 2: cumulative distributions of similarity scores for the same and different peptides in spectrum clusters of different sizes Figure 5 Cross validation between datasets 1 and 2: cumulative distributions of similarity scores for the same and different peptides in spectrum clusters of different sizes. Increasing the sizes of clusters leads to better separation of spectral similarity scores between the same and different peptides. spectra were left after removing the spectra with negative MASCOT scores. 3,825 unique peptides were identified from these spectra. Since these two datasets were produced from the same organelle, there were 1,173 peptides common for both benchmarks. The third dataset was the benchmark data in our previous work [29], which was acquired from other research institutes. It consisted of 367 spectra generated by LCQ/QTOF instruments, and the peptides were identified by the MASCOT search engine against proprietary databases and further validated by experts. Many of these spectra mapped to the same peptides, finally we constructed 35 clusters.

Peak intensity transforms
It is well known that the magnitude of peak intensities in MS/MS spectra can vary by several orders of magnitude. In order to neutralize this influence on spectral comparison, it should be advantageous to transform the peak intensity in certain ways prior to comparison. Previous studies have suggested using square root (such as NoDupe [10] and MASCOT Scores for the individual and average spectra from the 35 peptides in dataset 3, sorted by the MASCOT score of the ensemble average spectrum Peptides [30]), the logarithm (PEAKS [3]) for the transform, or weighing the intensity by square of m/z (SMZ) as in Bib-lioSpec [20].
In particular, QTOF instruments employ time-to-digital converters (TDCs) to record the abundance of ions. TDCs divide time into bins and register the bin when ion detection events occur during scan cycles. As a result, peak intensities generated can be expected to follow a Poisson distribution [31]. If this is the case, the standard deviation of the peak intensities will vary with the mean value. In this situation, the square root transform may be the optimal transform since the variance of the transformed variable will be constant. To understand why, assume the original intensity I has mean λ and variance λ. If I* = , then by using a first-order Taylor expansion, we have I* ≈ + (I -λ) * 0.5 * λ -0.5 . Therefore, In other words, after applying the square root transform, the variance of the peak intensities is stabilized at approximately 0.25. Therefore, this transform is applied to the experimental spectra as a preprocessing step in our experiments, unless otherwise stated.

Profiling spectra
Spectral comparison can be performed in a number of ways. Some approaches match spectra based on the similarity of individual peaks [5,17,32]. Another approach is to vectorize the whole spectrum, and then calculate the distance between two vectors. Here, the peak list of a spectrum is evenly divided into a consecutive sequence of bins on the m/z axis, and a vector for the spectrum is derived by summing up the intensities of peaks in each bin. This method has been used in many studies [10,18], and we refer to it as direct binning. However, as pointed out in [19,32], it is not straightforward to establish the correspondence between peaks and bins. The measured m/z value of a peak is subject to measurement errors; in other words, its theoretical counterpart can be either larger or smaller.
To avoid the above pitfall, we used an enhanced profiling technique that reduces the problem of irregular sampling of mass spectra. For simplicity, it is assumed that m/z values following a uniform distribution in an error window. During the profiling step, the intensity of each peak is distributed into neighboring bins. Formally, given the bin width w and a m/z error window e, and assume that w ≥ e and a peak with value m for m/z ratio is located inside bth bin [l,r], then its intensity i is proportioned into three consecutive bins as follows: When e = 0, this model regresses to the model of direct binning used by NoDupe [10], except that it only used the most significant peaks for binning. While it is characterized by the same computation complexity and memory storage requirement as the direct binning, the profiling alleviates the problem in the event that the peaks are located in the neighboring bins due to measurement errors.

Similarity measures
The resulting spectrum profiles are represented as vectors of intensity in each bin. The problem of comparing spectra is reduced to determining the similarity of the corresponding profiles. The intensities of peaks in two spectra may still be on different scales even when their deviations are stabilized. Therefore, it is not appropriate to use a direct measure of the geometric distance, such as Euclidean distance (L 2 -norm), between a pair of spectra. In this study, we considered three options.
1. The correlation coefficient in Eqn 5 is a well-founded statistical method to evaluate the similarity between two vectors, as used in SEQUEST [5].
where μ X and μ Y are the means of X and Y, σ X and σ Y are their standard deviations, respectively. If the precursor ions of two spectra do not have exactly the same mass, the lengths of corresponding profiles X and Y are made the same by appending a few zeros to the shorter one. Correlation coefficients have the advantage of yielding values restricted to the range of [-1, 1]. Intuitively, spectral profiles of the same peptides are anticipated to return high and positive correlation coefficients.
correlation coefficients, have been used to characterize the pairwise similarity in many other studies [10,17,19,20].
As with the correlation coefficient, the cosine similarity measure is theoretically restricted to [-1,1]. However, all the elements of the profile vector are non-negative, so the cosine values are effectively limited to the range [0, 1].
3. Another straightforward measure is to count the number of shared peaks in two spectra. Given two spectra with m and n peaks respectively, and assume that they share k peaks with same m/z values, the ratio of share peaks is calculated as Since the intensity information is completely ignored, we hypothesize that this measure will be less discriminating than the above measures in distinguishing spectra.

Models for reference spectra
There are various ways to generate a reference spectrum for each peptide. The most direct approach is to use an individual experimental spectrum. In our study, we chose the spectrum with the highest MASCOT score as the reference since it presumably reflects the real fragmentation pattern of a peptide better than low quality spectra. Alternatively, the theoretical spectra could also be used as references. Theoretical spectra are based on the simplified assumptions of the peptide fragmentation pattern. However, such a model does not accommodate instrumentand peptide-specific contributing factors during the fragmentation. Consequently, the experimental spectra may display limited resemblance to the corresponding theoretical spectra. Therefore, we expected that the experimental spectra for a peptide would be more similar to sibling spectra from the same peptide than to its theoretical counterpart.