Analysis of quantitative changes in a specific proteome (i.e., the complement of proteins expressed in a particular tissue or cell at a given time) is commonly carried out using two-dimensional gel electrophoresis (2-DE). With this procedure, proteins are separated in the first dimension based on isoelectric point, followed by separation based on molecular mass in the second dimension. Subsequently, protein spots are visualized, and the scanned gel images are analyzed using image analysis programs (e.g., ImageMaster, PDQuest). Once the relevant protein spots have been determined, these specific proteins are identified using mass spectrometry. Because quantitative protein changes can be analyzed on a large scale, 2-DE is frequently used as an initial screening procedure whereby the results generate new hypotheses and determine the direction of subsequent studies. 2-DE analyses, however, are expensive and can be time-consuming, which often limits the sample size. Furthermore, in some cases (e.g., aging studies, chronic drug treatment, screening for biomarkers) replication of the study may be prohibitive. These factors make it critically important not only to analyze the 2-DE results correctly, but also to maximize the information obtained.
The statistical analysis of 2-DE gels can be divided into two classes: analysis via spot finding, and analysis using image modeling and decomposition such as described in . For our purposes, we focus on the former, which employs spot detection and spot matching across gels. A common problem in this analysis is the presence of missing values, which arise when a protein spot is not found on all gels. Missing spot values can be caused by technical issues such as variations in spot migration and staining, background noise or distortions in gel images, and limitations in the ability of the image analysis software to detect and match the protein spots across gels. Values may also be missing, however, due to biological variation; here, the protein amount in some samples may fall below the detection limit, or post-translational modifications may alter the migration of the protein on the gel. It has been reported that 30% of data points may be missing in 2-DE analyses [2–4].
Missing values not only cause an obvious loss of information but also hamper data analysis. Clustering techniques (e.g., k-means, hierarchical) and various statistical approaches (such as principal component analysis (PCA) and significance analysis of microarrays (SAM)) require complete datasets [3, 5]. The prevalence of missing values in 2-DE, together with the uncertainty as to their cause, presents a dilemma about how to handle them. Some image analysis programs, including ImageMaster™ 2D Platinum, substitute missing values with zeroes, which could lead to an erroneous interpretation of the results if the values were missing for technical rather than biological reasons . Omitting protein spots that contain missing values would result in a dramatic loss of information, since a significant number of the protein spots will have missing values [2–4]. Replicating the study may likewise be impractical and would provide only a marginal advantage, given the prevalence of missing values. Running multiple gels for each sample and then using a composite gel in subsequent statistical analyses will reduce variability due to technical issues and might also reduce the number of missing values that arise for non-biological reasons (e.g., image analysis software). Running replicate gels, however, leads to a proportional increase in the total number of gels to be run, and the logistics of running these additional gels will likely strain resources; as a result, fewer samples may be analyzed. Because technical replication is less beneficial than biological replication in reducing variability, the former should not be pursued at the expense of the latter .
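The effect of zero substitution on a single spot can be illustrated with a small numerical sketch. The spot volumes below are hypothetical values chosen only for illustration, and the variable names are ours; the point is that filling technically missing values with zeroes pulls the spot's mean well below the mean of the values that were actually observed.

```python
from statistics import mean

# Hypothetical normalized spot volumes for one spot across six gels.
# None marks a gel on which the spot was not detected or matched.
volumes = [1.20, 1.00, None, 1.10, None, 0.90]

# Option 1: summarize only the values that were observed.
observed = [v for v in volumes if v is not None]

# Option 2: substitute zero for every missing value, which implicitly
# assumes the protein was absent rather than undetected.
zero_filled = [0.0 if v is None else v for v in volumes]

mean_observed = mean(observed)
mean_zero_filled = mean(zero_filled)

print(f"mean of observed values: {mean_observed:.3f}")
print(f"mean after zero fill:    {mean_zero_filled:.3f}")
```

If the two spots were missed for technical reasons, the zero-filled mean understates the protein amount and could suggest a group difference that is not real.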
A solution to the problem of missing values is to "impute" these data, i.e., to replace the missing spot values with estimates derived from the spot values that are present. Various imputation methods have been applied to microarray data, thereby improving detection of differentially expressed genes (e.g., [8–16]). Several works have likewise compared these methods extensively on microarray data [17–19]. In contrast, data imputation has found less extensive use in proteomic studies, and little work has compared such approaches for proteomic data [2, 4, 20, 21].
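As a minimal sketch of what imputation means in practice, the following example fills each missing value with the average of the observed values for the same spot (row averaging, one of the simplest strategies). The function name `impute_row_average`, the use of `None` to mark missing values, and the toy matrix are assumptions made for illustration; rows are taken to be spots and columns to be gels.

```python
from statistics import mean

def impute_row_average(matrix):
    """Replace None entries with the mean of the observed values in the same row.

    Rows are spots and columns are gels. A row with no observed values
    carries no information for this method and is returned unchanged.
    """
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            imputed.append(list(row))
            continue
        fill = mean(observed)
        imputed.append([fill if v is None else v for v in row])
    return imputed

# Toy data: three spots measured on three gels (illustrative values only).
spots = [
    [1.0, None, 3.0],     # one missing value
    [2.0, 2.0, 2.0],      # complete row
    [None, None, None],   # spot never detected
]
completed = impute_row_average(spots)
print(completed)
```

Row averaging ignores any relationship between spots; more elaborate methods borrow information from spots with similar profiles across gels.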
This study compares various imputation methods in 2-DE studies and examines their impact on commonly used high-level statistical methods. We examine two datasets for this study. The first is an unpublished dataset from Dr. Rabin's laboratory (Rabin dataset), comparing a control condition against phorbol 12-myristate 13-acetate (PMA; see Methods). The second dataset (Coling dataset) was developed to analyze cisplatin-induced cochlear damage, see . We assume that the image processing, including the spot matching across gels, has been suitably performed. Our starting point for analysis is the data matrix with rows corresponding to spots and columns corresponding to gels. The (i, j)th entry in the matrix represents the normalized spot volume for the ith spot from the jth gel. Note the similarity between this "proteomic matrix" and the "gene expression matrix" that is a common starting point in microarray analysis. For our analysis, we focus on two main areas: the influence of different imputation methods, and the influence of different statistical tests in determining which protein spots are present in different amounts between two conditions. We examine four different imputation methods and six different statistical tests on both real and simulated datasets. The imputation methods considered are the row average (RA) method, the k nearest neighbors (KNN) method, the least squares method (LSM), and the nonlinear iterative partial least squares (NIPALS) method. The statistical tests under consideration are the parametric t test, the permutation t test, the "Chebby Checker" test, and three different types of bootstrap tests. All of the imputation methods and statistical tests are further detailed in the Data Analysis section. To compare the methods, we randomly remove data points from the datasets and compare the results between the complete dataset and the dataset(s) with simulated missing spots.
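The comparison strategy in the last sentence can be sketched as follows: start from a complete spots-by-gels matrix, remove entries at random, impute them, and measure how far the imputed values fall from the values that were removed. This is only an illustration under stated assumptions, not the procedure used in this study. The synthetic data, the uniform random removal, the 30% missing rate, the use of row averaging as the imputation step, the root-mean-square error (RMSE) as the comparison metric, and all function names are our choices for the example.

```python
import math
import random
from statistics import mean

def mask_at_random(matrix, fraction, rng):
    """Copy the matrix and set about `fraction` of its entries to None.

    Returns the masked copy and the list of (row, column) positions removed.
    """
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    removed = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(row) for row in matrix]
    for i, j in removed:
        masked[i][j] = None
    return masked, removed

def impute_row_average(matrix):
    """Fill None entries with the row mean (grand mean if the row is empty)."""
    grand_mean = mean(v for row in matrix for v in row if v is not None)
    completed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = mean(observed) if observed else grand_mean
        completed.append([fill if v is None else v for v in row])
    return completed

def rmse(complete, imputed, positions):
    """Root-mean-square error over the positions that were removed."""
    return math.sqrt(mean((complete[i][j] - imputed[i][j]) ** 2
                          for i, j in positions))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic complete matrix: 50 spots (rows) by 8 gels (columns).
complete = []
for _ in range(50):
    spot_mean = rng.uniform(0.5, 2.0)
    complete.append([rng.gauss(spot_mean, 0.1) for _ in range(8)])

masked, removed = mask_at_random(complete, 0.30, rng)
imputed = impute_row_average(masked)
error = rmse(complete, imputed, removed)
print(f"removed {len(removed)} entries, RMSE of imputed values: {error:.3f}")
```

Swapping a different imputation function into the same harness gives a like-for-like comparison on identical removed positions. The same masked matrices could equally be passed on to a statistical test, so that test results on the complete and imputed data can be compared.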