A guide through the computational analysis of isotope-labeled mass spectrometry-based quantitative proteomics data: an application study

Background Mass spectrometry-based proteomics has reached a stage where it is possible to comprehensively analyze the whole proteome of a cell in one experiment. Here, the employment of stable isotopes has become a standard technique to obtain relative abundance values of proteins. Increasingly, experiments are conducted that do not merely depict a static image of the up- or down-regulated proteins at a distinct time point, but instead compare developmental stages of an organism or varying experimental conditions. Results Although the scientific questions behind these experiments are manifold, two questions commonly arise: 1) which proteins are differentially regulated under the selected experimental conditions, and 2) are there groups of proteins that show similar abundance ratios, indicating a similar turnover? We give advice on how these two questions can be answered and comprehensively compare a variety of commonly applied computational methods and their outcomes. Conclusions This work provides guidance through the jungle of computational methods for analyzing mass spectrometry-based isotope-labeled datasets and recommends an effective and easy-to-use evaluation strategy. We demonstrate our approach with three recently published datasets on Bacillus subtilis [1,2] and Corynebacterium glutamicum [3]. Special focus is placed on the application and validation of cluster analysis methods. All applied methods were implemented within the rich internet application QuPE [4]. Results can be found at http://qupe.cebitec.uni-bielefeld.de.

With a slight modification (the subtraction of each protein's mean abundance value is omitted), Pearson's uncentered correlation coefficient provides another possibility to measure the similarity between two classification objects:

$$r_u(x, y) = \frac{\sum_{j=1}^{P} x_j \, y_j}{\sqrt{\sum_{j=1}^{P} x_j^2} \; \sqrt{\sum_{j=1}^{P} y_j^2}}$$
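Geometrically, the uncentered coefficient equals the cosine of the angle between the two abundance vectors. A minimal sketch in Python (the function name is ours, not taken from QuPE):

```python
import numpy as np

def uncentered_correlation(x, y):
    """Pearson's uncentered correlation: the mean-subtraction step of the
    ordinary coefficient is omitted, leaving the cosine of the angle
    between the two abundance vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Two proportional abundance profiles thus score 1 even if their means differ, which is the intended behavior for ratio data.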

Formal definition of cluster analysis
Formally, a cluster analysis can be described as the partitioning of a number N of classification objects, or, in the sense of proteomics, a number of patterns of finite dimension P, into K groups or clusters {C_k, k = 1, . . . , K}. Given N objects X = {x_i, i = 1, . . . , N}, where x_{i,j} denotes the j-th element of x_i, the grouping of all objects with index i = 1, . . . , N into clusters k = 1, . . . , K can be defined by the matrix W(X) = [w_{k,i}]_{K×N} with

$$w_{k,i} = \begin{cases} 1 & \text{if } x_i \in C_k \\ 0 & \text{otherwise.} \end{cases}$$

Two conditions apply to the matrix W(X) to ensure that the association of each object to a cluster is unique (please note that this only applies to hierarchical (a) and partitioning (b) cluster analysis; in probabilistic (c) approaches, a pattern may belong to more than one cluster with a certain probability):

$$w_{k,i} \in \{0, 1\} \qquad \text{and} \qquad \sum_{k=1}^{K} w_{k,i} = 1, \quad i = 1, \ldots, N.$$

Furthermore, let the following definition denominate the number of objects belonging to a cluster C_k:

$$N_k = \sum_{i=1}^{N} w_{k,i}.$$

Cluster indexes for cluster validation
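The assignment matrix W(X) and the cluster sizes N_k defined above underlie all of the indexes that follow. A minimal numpy sketch (a hypothetical helper of our own, assuming hard integer cluster labels 0..K-1):

```python
import numpy as np

def assignment_matrix(labels, K):
    """Build the K x N matrix W with w[k, i] = 1 iff object i belongs to
    cluster k (hard assignments, i.e. the hierarchical/partitioning case)."""
    labels = np.asarray(labels)
    N = labels.size
    W = np.zeros((K, N), dtype=int)
    W[labels, np.arange(N)] = 1
    return W

W = assignment_matrix(np.array([0, 1, 0, 2]), K=3)
# Uniqueness condition: every column of W sums to 1.
# Cluster sizes N_k are simply the row sums of W.
N_k = W.sum(axis=1)
```

The row-sum identity makes N_k trivial to compute once W is available, which is why the indexes below are usually implemented directly on the label vector.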

Calinski-Harabasz
The cluster index of Calinski and Harabasz [2] is calculated using the following equation:

$$CH(K) = \frac{\operatorname{trace} B \, / \, (K - 1)}{\operatorname{trace} W \, / \, (N - K)},$$

where B denotes the error sum of squares between different clusters (inter-cluster) and W the squared differences of all objects in a cluster from their respective cluster center (intra-cluster). Calculated for each possible cluster solution, the maximal achieved index value indicates the best clustering of the data. An important characteristic of the index is that, on the one hand, trace W will start at a comparably large value. With an increasing number of clusters K, approaching the optimal clustering solution in K* groups, the value should decrease significantly due to the increasing compactness of each cluster. As soon as the optimal solution is exceeded, an increase in compactness and thereby a decrease in value might still occur; this decrease, however, should be notably smaller. On the other hand, trace B should behave in the opposite direction, growing as the number of clusters K increases, but its rise should also soften once K exceeds K*.
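A minimal numpy sketch of the index (our own helper, not the QuPE implementation):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH(K) = (trace B / (K - 1)) / (trace W / (N - K)) for a hard
    clustering given as a vector of integer labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N = X.shape[0]
    ks = np.unique(labels)
    K = ks.size
    overall_mean = X.mean(axis=0)
    trace_B = 0.0  # inter-cluster error sum of squares
    trace_W = 0.0  # intra-cluster error sum of squares
    for k in ks:
        members = X[labels == k]
        center = members.mean(axis=0)
        trace_B += members.shape[0] * np.sum((center - overall_mean) ** 2)
        trace_W += np.sum((members - center) ** 2)
    return (trace_B / (K - 1)) / (trace_W / (N - K))
```

The same quantity is available in scikit-learn as sklearn.metrics.calinski_harabasz_score, which can serve as a cross-check.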

Index-I
Maulik and Bandyopadhyay [3] proposed a cluster index that is, in principle, composed of three individual factors:

$$I(K) = \left( \frac{1}{K} \cdot \frac{E_1}{E_K} \cdot D_K \right)^{p}$$

While the first factor simply normalizes each index value by the overall number of clusters K, the second sets the overall error sum of squares of the complete dataset in relation to the intra-cluster error of a given clustering:

$$E_K = \sum_{k=1}^{K} \sum_{i=1}^{N} w_{k,i} \, \lVert x_i - c_k \rVert,$$

where c_k denotes the center of cluster C_k and E_1 is the error obtained when the whole dataset is grouped into a single cluster. The third factor takes into account the maximal observed distance between two of the K cluster centers:

$$D_K = \max_{j, k = 1, \ldots, K} \lVert c_j - c_k \rVert.$$

The index computation includes a variable parameter p ∈ N that may be "used to control the contrast between the different cluster configurations" [3, p.1651]. The authors recommend a value of p = 2. The optimal cluster solution is indicated by the maximal value of I(K).
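A compact numpy sketch of Index-I under these definitions (a helper of our own; Euclidean distances assumed throughout):

```python
import numpy as np

def index_I(X, labels, p=2):
    """I(K) = ((1/K) * (E_1 / E_K) * D_K)^p for a hard clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    K = ks.size
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    # E_1: error of the whole dataset grouped into a single cluster.
    E1 = np.sum(np.linalg.norm(X - X.mean(axis=0), axis=1))
    # E_K: intra-cluster error of the given clustering.
    EK = sum(np.sum(np.linalg.norm(X[labels == k] - centers[j], axis=1))
             for j, k in enumerate(ks))
    # D_K: maximal distance between two cluster centers.
    DK = max(np.linalg.norm(centers[a] - centers[b])
             for a in range(K) for b in range(a + 1, K))
    return ((1.0 / K) * (E1 / EK) * DK) ** p
```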

Davies-Bouldin
Instead of simply proposing a cluster index, Davies and Bouldin [4] formulated a general framework for the evaluation of the outcomes of cluster algorithms. In analogy to Halkidi et al. [5], an instance of their index DB(K) may be defined as follows:

$$DB(K) = \frac{1}{K} \sum_{k=1}^{K} R_k, \qquad R_k = \max_{j = 1, \ldots, K,\ j \neq k} \frac{s_k + s_j}{d_{k,j}},$$

with the average distance of all objects of cluster C_k to its center c_k,

$$s_k = \frac{1}{N_k} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert,$$

as well as the distance between two cluster centers,

$$d_{k,j} = \lVert c_k - c_j \rVert.$$

For each cluster C_k, a maximally similar cluster, with respect to the intra-cluster error sum of squares, is sought, leading to R_k. The index is then defined as the average over these values. In contrast to the aforementioned cluster indexes, here the minimal observed index value indicates the best cluster solution.
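A minimal numpy sketch of this instance of the index (our own helper):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB(K): average of the worst-case ratios R_k = max_{j != k}
    (s_k + s_j) / d_{k,j}; lower values indicate a better clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    K = ks.size
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    # s_k: average distance of the members of cluster k to its center.
    s = np.array([np.mean(np.linalg.norm(X[labels == k] - centers[j], axis=1))
                  for j, k in enumerate(ks)])
    R = [max((s[k] + s[j]) / np.linalg.norm(centers[k] - centers[j])
             for j in range(K) if j != k)
         for k in range(K)]
    return float(np.mean(R))
```

scikit-learn offers the same formulation as sklearn.metrics.davies_bouldin_score for cross-checking.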

Krzanowski-Lai
Krzanowski and Lai [6] developed a cluster index that, similar to the index of Calinski and Harabasz [2], is based on the squared differences of all objects in a cluster from their respective cluster center, trace W. The authors define DIFF(K) as the difference between a clustering of the data into K and a clustering into K − 1 clusters. Let J be the number of variables measured on each x_i ∈ X and trace W_K the sum-of-squares function that corresponds to the clustering into K clusters; their measure DIFF(K) is then defined as follows:

$$DIFF(K) = (K - 1)^{2/J} \operatorname{trace} W_{K-1} \; - \; K^{2/J} \operatorname{trace} W_K.$$

Here, the introduction of the normalizing factor K^{2/J} is derived from the observation that, given independently uniformly distributed measurements on each variable j ∈ [1, . . . , J], the optimal clustering of the data will reduce the sum of squares by exactly this factor [6, p.25].
The authors claim that if there exists an optimal clustering solution in K* groups, the value of DIFF(K*) should be comparably large and positive (see the index of Calinski and Harabasz for further explanation). In contrast, all values of DIFF(K) for K > K* will have rather small values (maybe even negative), while values for K < K* will be rather large and positive. Bringing these observations together, the index KL(K) is defined as follows:

$$KL(K) = \left| \frac{DIFF(K)}{DIFF(K + 1)} \right|$$

The optimal cluster solution is then indicated by the highest value of KL(K).
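A minimal numpy sketch of the index (hypothetical helpers of our own; KL(K) needs clusterings for K − 1, K, and K + 1 groups):

```python
import numpy as np

def trace_W(X, labels):
    """Intra-cluster sum of squared deviations from the cluster centers."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in np.unique(labels))

def krzanowski_lai(X, labels_by_K, K):
    """KL(K) = |DIFF(K)| / |DIFF(K + 1)|, where labels_by_K maps each
    number of clusters to the corresponding hard clustering."""
    J = np.asarray(X).shape[1]  # number of measured variables
    def diff(k):
        return ((k - 1) ** (2.0 / J) * trace_W(X, labels_by_K[k - 1])
                - k ** (2.0 / J) * trace_W(X, labels_by_K[k]))
    return abs(diff(K)) / abs(diff(K + 1))
```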

Figure of Merit
Coming from a gene expression background, the Figure of Merit [7] is based on the assumption that the validity of a cluster result increases if, in a second experiment, the same genes group together and reveal a similar pattern of expression. Following a bootstrapping or jackknife approach, a cluster algorithm is applied successively to a set of genes, whereby in each iteration one experimental condition (in exact terms, one feature of each classification object, or one column of the data matrix) is left out. If a cluster algorithm had assigned each object to a cluster merely by chance, omitting a condition would likely lead to different results. Conversely, two cluster results are likely to reveal a similar structure if the dependence on the left-out feature is small. Let in the following X = {x_i, i = 1, . . . , N} denote a set of N classification objects, each having the dimension P ∈ N, such that x_{i,j} is the j-th feature of x_i, j ∈ 1, . . . , P; furthermore, let there be a number of clusters K ∈ N, whereby W(X) = [w_{k,i}]_{K×N} describes the clustering of the data. Assuming that a clustering has been performed on a data matrix from which the j-th feature has been omitted, the Figure of Merit is defined as follows:

$$FOM(j, K) = \sqrt{ \frac{1}{N} \sum_{k=1}^{K} \sum_{x_i \in C_k} \left( x_{i,j} - \mu_{C_k, j} \right)^2 },$$

where μ_{C_k, j} denotes the arithmetic mean in feature j of all objects of cluster C_k.
To avoid a bias towards the overall number of clusters, the so-called "adjusted Figure of Merit" takes this number K into account:

$$FOM_{adj}(j, K) = \frac{FOM(j, K)}{\sqrt{\frac{N - K}{N}}}$$

If the calculation is iterated over all P features of the classification objects, the "aggregate Figure of Merit" summarizes the overall predictive power of a clustering:

$$FOM(K) = \sum_{j=1}^{P} FOM(j, K)$$
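A minimal numpy sketch of the feature-wise and adjusted Figure of Merit (hypothetical helpers of our own; the clustering passed in is assumed to have been computed with feature j left out):

```python
import numpy as np

def figure_of_merit(X, labels, j):
    """FOM(j, K): root mean squared deviation of the left-out feature j
    from its cluster-wise means."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N = X.shape[0]
    sq = sum(np.sum((X[labels == k, j] - X[labels == k, j].mean()) ** 2)
             for k in np.unique(labels))
    return float(np.sqrt(sq / N))

def adjusted_fom(X, labels, j):
    """Adjusted FOM: divide by sqrt((N - K) / N) to reduce the bias
    towards large numbers of clusters."""
    N = np.asarray(X).shape[0]
    K = np.unique(labels).size
    return figure_of_merit(X, labels, j) / np.sqrt((N - K) / N)
```

Summing figure_of_merit over all P features then yields the aggregate value for a given K.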