A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data
© Miecznikowski et al; licensee BioMed Central Ltd. 2010
Received: 17 June 2010
Accepted: 15 December 2010
Published: 15 December 2010
Numerous gel-based software packages exist to detect protein changes potentially associated with disease. The data, however, are rife with technical and structural complexities, making statistical analysis a difficult task. A particularly important topic is how the various software packages handle missing data. To date, no one has extensively studied the impact that imputing missing data has on the subsequent analysis of protein spots.
This work highlights the existing algorithms for handling missing data in two-dimensional gel analysis and performs a thorough comparison of the various algorithms and statistical tests on simulated and real datasets. For imputation methods, the best results in terms of root mean squared error are obtained using the least squares method of imputation along with the expectation maximization (EM) algorithm approach to estimate missing values with an array covariance structure. The bootstrapped versions of the statistical tests offer the most liberal option for determining protein spot significance while the generalized family wise error rate (gFWER) should be considered for controlling the multiple testing error.
In summary, we advocate for a three-step statistical analysis of two-dimensional gel electrophoresis (2-DE) data with a data imputation step, choice of statistical test, and lastly an error control method in light of multiple testing. When determining the choice of statistical test, it is worth considering whether the protein spots will be subjected to mass spectrometry. If this is the case, a more liberal test such as the percentile-based bootstrap t can be employed. For error control in electrophoresis experiments, we advocate that gFWER be controlled for multiple testing rather than the false discovery rate.
Analysis of quantitative changes in a specific proteome (i.e., complement of proteins expressed in a particular tissue or cell at a given time) is commonly carried out using two-dimensional gel electrophoresis (2-DE). With this procedure, proteins are separated in the first dimension based on isoelectric point, followed by separation based on molecular mass in the second dimension. Subsequently, protein spots are visualized, and the scanned gel images are analyzed using image analysis programs (e.g. ImageMaster, PDQuest). Once the relevant protein spots have been determined, these specific proteins are identified using mass spectrometry. Because quantitative protein changes can be analyzed on a large scale, 2-DE is frequently used as an initial screening procedure whereby the results obtained generate new hypotheses and determine the direction of subsequent studies. 2-DE analyses, however, are expensive and can be time-consuming; these issues often limit the sample size. Furthermore, in some cases (e.g., aging studies, chronic drug treatment, screening for biomarkers) replication of the study may be prohibitive. The above factors make it critically important not only to analyze the 2-DE results correctly, but also to maximize the information obtained.
The statistical analysis of 2-DE gels can be divided into two classes: analysis via spot finding, and analysis using image modeling and decomposition such as described in . For our purposes, we will focus on the former 2-DE analysis, employing spot detection and spot matching across gels. In this analysis, a common problem is the presence of missing values. This generally occurs when a protein spot is not found on all gels. Missing spot values can be caused by technical issues such as variations in spot migration and staining, background noise or distortions in gel images, and the ability of the image analysis software to detect and match the protein spots across the gels. Values also may be missing, however, due to biological variation; here, the protein amount in some samples may fall below the detection limit, or post-translational modifications may alter the migration of the protein on the gel. It has been reported that 30% of data points may be missing in 2-DE analyses [2–4].
Besides the obvious loss of information, missing values also hamper data analysis. Clustering techniques (e.g., k-means, hierarchical) and various statistical approaches (such as principal component analysis (PCA) and significance analysis of microarrays (SAM)) require complete datasets [3, 5]. The prevalence of missing values in 2-DE and the associated uncertainty as to their cause present a dilemma as to how missing values should be handled. Some image analysis programs, including ImageMaster™ 2D Platinum, substitute missing values with zeroes, which could lead to an erroneous interpretation of the results if the values were missing for technical rather than biological reasons . Omitting protein spots that contain missing values would result in a dramatic loss of information since a substantial number of the protein spots will have missing values [2–4]. Replicating the study may likewise be impractical and would provide only a marginal advantage, given the prevalence of missing values. Running multiple gels for each sample and then using a composite gel in subsequent statistical analyses will reduce variability due to technical issues and also might reduce the number of missing values caused by non-biological reasons (e.g., image analysis software). Running replicate samples, however, will lead to a proportional increase in the total number of gels to be run, and the logistics of running these additional gels will likely strain resources; this can cause fewer samples to be analyzed. Because technical replication is less beneficial than biological replication in reducing variability, the former should not be pursued at the expense of the latter .
A solution to the problem of missing values is to "impute" these data, i.e. replace the missing spot values with values that use information from the protein spots that are present. Various imputation methods have been applied to microarray data, thereby improving detection of differentially expressed genes (e.g., [8–16]). Several works have, likewise, extensively compared these methods on microarray data [17–19]. In contrast, however, data imputation has found less extensive use in proteomic studies with little work comparing such approaches for proteomic data [2, 4, 20, 21].
This study compares various imputation methods (and studies their impact on typically-used high-level statistical methods) in 2-DE studies. We examine two datasets for this study. The first is an unpublished dataset from Dr. Rabin's laboratory (Rabin dataset), comparing a control condition against phorbol 12-myristate 13-acetate (PMA; see Methods). The second dataset (Coling dataset) was developed to analyze cisplatin-induced cochlear damage, see . We assume that the image processing has been suitably performed, including the spot matching across gels. Our starting point for analysis is the data matrix with rows corresponding to spots and columns corresponding to gels. The (i, j)th entry in the matrix represents the normalized spot volume for the i th spot from the j th gel. Note the similarity between this "proteomic matrix" and the "gene expression matrix" which is a common starting point in microarray analysis. For our analysis, we focus on two main areas: the influence of different imputation methods, and the influence of different statistical tests in determining which protein spots are present in different amounts between two conditions. We examine four different imputation methods and six different statistical tests on both real and simulated datasets. The imputation methods considered are the row average (RA) method, the k nearest neighbors (KNN) method, the least squares method (LSM), and the nonlinear iterative partial least squares (NIPALS) method. The statistical tests under consideration are the parametric t test, the permutation t test, the "Chebby Checker" test, and three different types of bootstrap tests. All of the imputation methods and statistical tests are further detailed in the Data Analysis section. To compare the methods, we randomly remove data points from the datasets and compare the results between the complete dataset and the dataset(s) with simulated missing spots.
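The random removal of data points described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `delete_at_random`, the use of `None` to mark a missing spot, and the toy spot volumes are all assumptions made for the example.

```python
import random

def delete_at_random(matrix, fraction, seed=0):
    """Return a copy of `matrix` with `fraction` of its entries set to None."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(matrix)) for j in range(len(matrix[0]))]
    n_delete = round(fraction * len(cells))
    masked = [row[:] for row in matrix]  # leave the complete matrix untouched
    for i, j in rng.sample(cells, n_delete):
        masked[i][j] = None
    return masked

# Toy "proteomic matrix": 4 spots (rows) by 5 gels (columns); values are invented.
complete = [
    [0.12, 0.15, 0.11, 0.14, 0.13],
    [0.40, 0.38, 0.45, 0.41, 0.39],
    [0.05, 0.07, 0.06, 0.05, 0.08],
    [0.22, 0.25, 0.21, 0.24, 0.23],
]
masked = delete_at_random(complete, fraction=0.10)
n_missing = sum(value is None for row in masked for value in row)
print(n_missing)  # 10% of 20 entries
```

Note that this simple scheme does not prevent every value of a spot from being deleted; a fuller simulation would likely guard against that case.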
Results & Discussion
The PMA-treated gel that was used as the reference gel in the Rabin dataset is shown in the Additional Materials (Additional file 1, Figure S1). The Deleted Residuals method in HDBStat! was used to test for non-homogeneity among the gels and to identify gels that are outliers in comparison with other gels in the group . No gels were determined to be outliers via this method (results not shown).
Studying the KNN methods in Figure 2, we can see the sensitivity associated with the choice of k in the KNN imputation method. In general, the mean RMSE increases as k increases. This is plausible since large values of k average over increasingly dissimilar protein spots and therefore over-smooth the estimates, leading to a large RMSE when applied to the missing data; meanwhile, small values of k draw only on the most similar spots and tend to track the missing data better, giving a smaller RMSE. We further see that small values of k (k < 6) showed similar performance in terms of RMSE.
From Figure 2, we see that the NIPALS method performs favorably compared to KNN, and that RMSE appears to depend only minimally on the number of principal components employed in the NIPALS procedure.
Statistical Test Results
The effects of imputation on the subsequent statistical analysis using the parametric t test (unequal variances), permutation t test, bootstrap t test(s), and Chebyshev's inequality test ("Chebby Checker") were investigated. For these studies, the statistical analysis of each complete dataset (Rabin or Coling) was compared to the analysis of the same dataset after 10% of the values were removed at random and then imputed using the RA, KNN (with k = 1 and 5), and LSM methods.
In short, our conclusions regarding the statistical tests likewise hold for the Coling dataset. The percentile bootstrap test yields the largest number of mean discoveries (Additional file 6, Figure S6), while the most conservative tests are the normal bootstrap test and pivotal bootstrap test. These differences between the bootstrap tests are due to the nature of the bootstrap assumptions regarding the distribution of the test statistics . In both datasets, the parametric t, permutation t, and "Chebby Checker" tests yield an intermediate number of discoveries compared to the bootstrap testing methods.
In the analysis of the Rabin dataset, approximately 60% of protein spots had at least one missing data point (i.e., a spot not observed in at least one gel). This resulted in a proteomic "expression" matrix that contained 21% missing values. Similar studies using 2-DE analysis have reported at least 30% missingness in their data [2–4]. In this study, the presence of a missing value was not related to the PMA treatment. Further, missing values were not dependent on the location of the spot on the gel; this indicates the absence of bias based on isoelectric point or molecular weight. More missing values were found, however, in the bottom quartile of spot intensities, while the top quartile had the greatest number of complete protein spots. This indicates an association between missing values and the overall abundance of a protein (results not shown). Similarly,  reported that, as spot volume increased, the number of matched spots also increased. It is unclear whether the greater number of missing values in the bottom quartile of spot intensities was related to biological variation or to the technical difficulty the software has in detecting, aligning, and matching spots of low intensity. The ImageMaster™ 2D Platinum software (Version 5.0) clearly does better with the high intensity protein spots, although it may be worthwhile to consider normalization methods (e.g., ) in conjunction with image analysis software.
The performance of all the imputation methods, however, depended on the fraction of values that were missing. For the RA, the increase in RMSE was directly related to the percent of missing values. Lower percentages showed similar average variances between KNN, RA, and LSM imputation schemes. As the percentage of imputed values increased, however, the average variance of KNN and LSM imputation methods decreased slightly, whereas the variance with RA had a more profound decrease. A decrease in variance after RA imputation would increase the number of protein spots that would be identified as significantly altered thereby potentially inflating the occurrence of false positives. Based upon our findings, one should restrict the maximal fraction of values that are imputed in 2-DE studies. To accomplish this, we suggest only analyzing protein spots that are present in a majority of the gels. The criterion for a majority of spots to be present will reduce the loss of information due to missing values, while limiting the amount of required imputation. In addition, greater confidence in imputation accuracy may be obtained as the imputed values can be checked against the actual spot values observed.
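The suggestion to analyze only protein spots present in a majority of the gels can be sketched as a simple filter. The function name `keep_majority_spots`, the `None` marker for a missing spot, and the strict "more than half" threshold are assumptions made for illustration; the text above does not fix an exact cutoff.

```python
def keep_majority_spots(matrix, min_fraction=0.5):
    """Keep rows (spots) observed in more than `min_fraction` of the gels."""
    kept = []
    for row in matrix:
        n_present = sum(value is not None for value in row)
        if n_present > min_fraction * len(row):
            kept.append(row)
    return kept

# Toy matrix: 3 spots by 6 gels; None marks a spot that was not detected.
spots = [
    [0.31, 0.29, None, 0.33, 0.30, 0.28],  # present in 5 of 6 gels: kept
    [0.10, None, None, 0.12, None, 0.11],  # present in 3 of 6 gels: dropped
    [None, None, 0.07, None, 0.06, None],  # present in 2 of 6 gels: dropped
]
filtered = keep_majority_spots(spots)
print(len(filtered))
```

With this rule, any spot that survives the filter needs at most half of its values imputed, which bounds the amount of imputation per spot.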
The ability to detect differentially expressed proteins with 2-DE depends not only on the method and level of imputation, but also on the statistical analysis. This study applied six univariate tests to evaluate differences in protein amounts. While powerful, parametric methods require a number of assumptions, including that the data represent random samples from a Gaussian distribution, which may be difficult to assess due to the small sample sizes typically used in proteomic studies. An inflated Type I error can occur, for example, if the data do not fit a normal distribution . Permutation tests, which generate their own distribution and do not make any assumptions about the underlying distribution of the test statistic [28–31], have been suggested to be more powerful than parametric tests and should be preferred for small sample sizes [32, 33]. The "Chebby Checker" variation of Chebyshev's inequality test is robust against departures from normality and inequality of variance in small datasets . The bootstrap t test is useful with data that do not conform to known statistical distributions. The bootstrap methods, however, cannot completely alleviate the difficulties caused by a small sample size. With only six gels per treatment group, we restricted our bootstrap simulations and only used 25 bootstrap resampled datasets.
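To make concrete how a permutation test "generates its own distribution", the following is a minimal sketch of an exact two-sided permutation t test for a single protein spot. It is not the Deducer implementation used in this study; the function names and toy values are hypothetical, and the unequal-variance t statistic is assumed as the test statistic. With six gels per group there are only 924 possible relabelings, so full enumeration is feasible.

```python
import itertools
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic with unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def permutation_p_value(a, b):
    """Exact two-sided p value over all relabelings of the gels into two groups."""
    pooled = list(a) + list(b)
    observed = abs(welch_t(a, b))
    n_extreme = n_total = 0
    for chosen in itertools.combinations(range(len(pooled)), len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        n_total += 1
        # Small tolerance so the observed labeling always counts as extreme.
        if abs(welch_t(group_a, group_b)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total

# Toy spot volumes for one protein spot (invented), three gels per group.
control = [1.0, 2.0, 3.0]
treated = [10.0, 11.0, 12.0]
p = permutation_p_value(control, treated)
print(p)
```

The sketch assumes no relabeling produces a group with zero variance; tied values would need an explicit guard. It also shows why very small groups limit the test: with three gels per group the smallest attainable two-sided p value is 2/20.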
Of the statistical methods used, the percentile-based bootstrap t test was the most liberal in detecting differentially expressed proteins, while the normal-based bootstrap t test appeared to be the most conservative and thus potentially the least sensitive. The parametric t, permutation t, and Chebby Checker tests yielded comparable results and displayed an intermediate amount of discovered proteins. While these relationships between statistical tests were observed irrespective of the imputation method used, the imputation method slightly impacted the selection process (statistical tests) identifying "changed proteins".
The salient question is how best to analyze the results of 2-DE analysis. This issue is complicated by the fact that the statistical analysis involves the testing of a large number of hypotheses and is performed without knowledge of the identity of the proteins involved. In a typical quantitative proteomic study using 2-DE, statistical analyses are used to determine which protein spots are differentially expressed and subsequently will be subjected to mass spectrometry for protein identification. Thus, unlike analysis of microarray data, statistical analysis of 2-DE data occurs without the possible benefit of relevant biological information (e.g., cellular function of the protein, or how it is regulated) that may help either to substantiate the statistical analysis or to identify possible false positives. As the intent of the statistical analysis is to provide an objective means of identifying which proteins are changed and thus allowing the proteins to be prioritized (i.e. "triaged") for subsequent study, the statistical analysis should identify as many true effects as possible while incurring few or at least a low proportion of false positives. Specifically, the statistical methods used to analyze 2-DE data should be guided primarily by the study objective and whether making a Type I or a Type II error is more egregious. For example, if 2-DE analysis is being used in an initial screening procedure to identify candidate proteins as possible biomarkers, greater concern at first might be with omitting a true effect as false positives would be weeded out in subsequent studies. In summary, we advocate for a three-step statistical analysis of 2-DE data with a data imputation step, choice of statistical test, and lastly an error control method in light of multiple testing. For imputation methods, the best results in terms of RMSE are obtained using the LSM imputation method with the EM algorithm approach to estimate missing values with an array covariance structure. 
When determining the choice of statistical test, it is worth considering whether the protein spots will be subjected to mass spectrometry. If this is the case, a more liberal test such as the percentile-based bootstrap t should be employed. Otherwise, outside of the bootstrap-based t tests, there are only relatively small differences between the different statistical tests. Specifically, the normal bootstrap test and the pivotal bootstrap test yield the smallest number of discoveries on both datasets, while the parametric and permutation t tests are in the middle in terms of number of discoveries. Lastly, for error control in testing protein spots in electrophoresis (usually < 1000 tests), from our work in , we advocate that gFWER be controlled rather than the false discovery rate.
For the Rabin dataset, PC12 cells were cultured as previously described . The cells were harvested and treated for 10 minutes at 34°C in PBS containing 1 μM phorbol 12-myristate 13-acetate (PMA). Cells were collected by centrifugation at 14,000 g for 1 minute at 4°C, and the resulting cell pellet was resuspended in 50 mM Tris buffer (pH 7.4) containing complete EDTA-free protease inhibitors (Roche, Indianapolis, IN) and a cocktail of phosphatase inhibitors (1 mM Na3VO4, 2.5 mM sodium pyrophosphate, 1 mM β-glycerolphosphate). Samples were sonicated (three 5 s bursts separated by one minute incubation on ice between each burst) and centrifuged at 40,000 g for 15 minutes at 4°C. The resulting supernatants were extracted with chloroform:methanol:water (1:4:3), and the proteins subsequently were precipitated with cold methanol. The resulting protein pellets were air-dried and then resuspended in 2X sample buffer (7 M urea, 2 M thiourea, 4% (w/v) CHAPS, 2% (v/v) IPG buffer 4-7, and 2% (w/v) dithiothreitol (DTT)), and then diluted with an equal volume of 2X rehydration buffer (5 M urea, 2 M thiourea, 4% (w/v) CHAPS, 0.002% (w/v) bromophenol blue, 20% (v/v) isopropanol, 10% (v/v) glycerol, 1% IPG buffer 4-7, and 2.8 mg/ml DTT) to yield a protein concentration of 1 μg/μl. The details on the cell cultures for the Coling dataset can be found in .
For the Rabin dataset, isoelectric focusing (IEF) was carried out using an Ettan IPGphor (GE Healthcare) and 24 cm linear, pH 4-7 Immobiline DryStrips (GE Healthcare). The conditions used for IEF were: rehydration loading of the IPG strips at 30 V for 12 hours, 500 V for 1 hour, 1000 V for 1 hour and 8000 V for 8.20 hours. Subsequently, the IPG strips were incubated successively (15 minutes each) at room temperature in an equilibration buffer containing 50 mM Tris-HCl, 6 M urea, 30% (v/v) glycerol, 2% (w/v) SDS, 0.002% (w/v) bromophenol blue and 2% (w/v) DTT (pH 8.8) followed by an incubation in the above buffer, but with 4.5% (w/v) iodoacetamide in place of DTT. Electrophoresis was carried out using an Ettan DALTsix Electrophoresis System (GE Healthcare) and 1 mm thick 10% SDS-PAGE (25.5 cm × 20.5 cm) gels. Six controls and six PMA-treated samples were separated. To ensure that gels remained attached to the plates during scanning and spot picking, the plates were pre-coated with Bind-Silane (GE Healthcare). In addition, self-adhesive markers were placed on the plates coated with Bind-Silane to facilitate spot localization by an Ettan DALT spot picker (GE Healthcare). The conditions for electrophoresis were 5 W/gel for 30 min followed by 10 W/gel until the bromophenol blue dye front was approximately 1 mm from the bottom of the plate. The gels were fixed and stained with ProQ Diamond® (Molecular Probes) according to the manufacturer's instructions. The details on the 2-DE separation methods for the Coling dataset can be found in .
For the Rabin dataset, fluorescent images of gels stained with ProQ Diamond® (Molecular Probes) were acquired using a Typhoon 9400 variable mode imager (GE Healthcare). The scanned fluorescent images of ProQ Diamond® were then analyzed using the ImageMaster™ 2D Platinum software (Version 5.0). To reduce variations due to manual cropping, the gel images were first cropped with Picture Manager (Microsoft) . The following spot detection parameters were used for image analysis: Smooth 3, Minimum Area 5, and Saliency 6. An automatic spot detection algorithm was used, and manual editing of spots was avoided in the analysis to minimize quantitation errors. One of the PMA-treated gels was used as the reference gel for spot matching and alignment. Spot volumes were normalized using the mean-normalization method (i.e. spot volume for a specific protein spot was divided by the spot volumes for all the spots in the gel). Spot normalization reduces experimental variations between gels caused by conditions such as differences in protein loading or staining. The details on the 2-DE imaging and normalization methods for the Coling dataset can be found in . For the analysis of the Coling dataset, we used the Cy2 channel to normalize the Cy3 and Cy5 channels.
The design for the Rabin dataset consisted of six gels for each of two conditions, a control condition and the PMA condition. After image analysis, a data matrix consisting of normalized spot volumes was produced where the rows corresponded to spots and the columns corresponded to gels.
The Rabin dataset includes the 70 protein spots that were matched or found in all of the gels. In the Additional Materials, we included the analysis of the Coling dataset, which contains 343 protein spots that were found in all gels in the analysis. For each of the datasets, the three main steps in the analysis pipeline were 1) imputation of missing data, 2) statistical testing, and 3) error control in light of multiple testing.
For either the Rabin or Coling dataset (see Additional Materials), four different imputation procedures were performed: the row average (RA), the k nearest neighbor (KNN), nonlinear iterative partial least squares (NIPALS), and the least squares method (LSM) . The RA and KNN imputations were done using the R computing language with the impute package  while LSM was implemented using Java code . In the RA method, the average of the values that are present for a particular protein spot is used to replace the missing data points for that spot. The KNN method imputes a protein spot from its closest ("nearest") protein spots. In this algorithm we find the k nearest neighbors using a suitable distance metric, and then we impute the missing elements by averaging the (non-missing) elements of those neighbors. In the KNN method, there are different types of distance metrics (Pearson correlation, Euclidean, Mahalanobis, and Chebyshev distance) that can be employed. We chose the Euclidean distance metric as it has been reported to be more accurate . Although the LSM method was designed for microarray data, we have applied it to our proteomic datasets.
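The RA and KNN procedures described above can be sketched as follows. This is a simplified illustration and not the implementation in the R impute package: the function names are hypothetical, missing values are marked with `None`, the Euclidean distance is averaged over the gels observed in both spots, and each spot is assumed to have at least one observed value.

```python
import math

def row_average_impute(matrix):
    """RA: replace each missing entry with the mean of the observed values in its row."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        imputed.append([mean if v is None else v for v in row])
    return imputed

def _distance(a, b):
    """Root mean squared difference over the gels observed in both spots."""
    shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

def knn_impute(matrix, k):
    """KNN: average the k nearest spots' observed values at each missing position."""
    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        if all(v is not None for v in row):
            continue
        others = [r for j, r in enumerate(matrix) if j != i]
        neighbors = sorted(others, key=lambda r: _distance(row, r))[:k]
        observed = [v for v in row if v is not None]
        for col, value in enumerate(row):
            if value is None:
                donors = [r[col] for r in neighbors if r[col] is not None]
                # Fall back to the row average when no neighbor observed this gel.
                pool = donors if donors else observed
                imputed[i][col] = sum(pool) / len(pool)
    return imputed

# Toy matrix: 3 spots by 4 gels (values invented).
spots = [
    [1.0, 2.0, None, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [5.0, 6.0, 7.0, 8.0],
]
ra_value = row_average_impute(spots)[0][2]   # mean of 1.0, 2.0, 4.0
knn1_value = knn_impute(spots, k=1)[0][2]    # taken from the closest spot
knn2_value = knn_impute(spots, k=2)[0][2]    # average over both other spots
```

In this toy case the nearest spot follows the same profile as the incomplete one, so k = 1 gives a value close to the trend of the row, while k = 2 pulls the estimate toward a dissimilar spot.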
The NIPALS method is summarized in  and is implemented using the R package "pcaMethods" . Similar to KNN, in order to implement the NIPALS algorithm, it is necessary for the user to specify the number of principal components.
The LSM method estimates missing values utilizing correlations between protein spots and gels. There are several variants of the LSM described fully in , where each variation is related to different methods of estimating the correlation within the dataset. The LSM method was implemented using the LSimpute.jar Java program available at http://www.ii.uib.no/~trondb/imputation/. To evaluate the different methods of imputation, spot values were randomly deleted across groups from the complete dataset, and the normalized root mean square error (RMSE) was calculated to compare the imputation methods.
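As a rough sketch of the evaluation criterion, the normalized RMSE compares the imputed values with the true values that were deleted. The normalization used in this study is not stated here, so the sketch assumes one common convention: dividing the RMSE by the (population) standard deviation of the true values. The function name and toy numbers are hypothetical.

```python
import math
import statistics

def normalized_rmse(true_values, imputed_values):
    """RMSE between true and imputed values, scaled by the spread of the true values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse / statistics.pstdev(true_values)

# Toy example: four deleted spot volumes and two candidate imputations.
deleted_truth = [1.0, 2.0, 3.0, 4.0]
perfect = normalized_rmse(deleted_truth, [1.0, 2.0, 3.0, 4.0])
constant = normalized_rmse(deleted_truth, [2.5, 2.5, 2.5, 2.5])
```

Under this assumed normalization, replacing every deleted value with the overall mean of the true values scores exactly 1, so values below 1 indicate that an imputation method is doing better than that naive baseline.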
The second step in the pipeline is to employ a statistical test on each protein spot to assess whether the spot is present in different amounts between the conditions. For this analysis, six different statistical tests were examined, specifically, the standard t test, Chebyshev's inequality test, the permutation t test and three different variants of the bootstrap t test (normal approximation, percentile, and pivotal). The permutation t test was performed using the Deducer software package in R . The standard t test (unequal variances) and Chebyshev's inequality test were carried out using standard R functions. The version of Chebyshev's inequality test (or "Chebby Checker") is described in Equation (7) in . There are three types of bootstrap tests that can be performed: tests derived from the normal approximation, percentile confidence intervals, and pivotal confidence intervals . For the class of bootstrap tests, the confidence intervals were inverted in order to obtain the p values for each protein spot. The output using all three bootstrapping methods is summarized in the Results.
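The three bootstrap interval types named above can be sketched for a single protein spot as follows. This is an illustration under stated assumptions rather than the code used in the study: the statistic is taken to be the difference in group means (the study's bootstrap t statistic may differ), each group is resampled separately with replacement, and the function names, the default of 1000 resamples, and the toy values are hypothetical. A spot would be called significant at level alpha when the corresponding interval excludes zero, which is the decision obtained by inverting the interval.

```python
import math
import random
import statistics

def mean_diff(a, b):
    return statistics.fmean(a) - statistics.fmean(b)

def empirical_quantile(sorted_values, q):
    """Simple order-statistic quantile of an already sorted list."""
    index = min(len(sorted_values) - 1, max(0, math.ceil(q * len(sorted_values)) - 1))
    return sorted_values[index]

def bootstrap_intervals(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Normal, percentile, and pivotal bootstrap intervals for mean(a) - mean(b)."""
    rng = random.Random(seed)
    theta = mean_diff(a, b)
    boots = sorted(
        mean_diff(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    se = statistics.stdev(boots)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    lo_q = empirical_quantile(boots, alpha / 2)
    hi_q = empirical_quantile(boots, 1 - alpha / 2)
    return {
        "normal": (theta - z * se, theta + z * se),
        "percentile": (lo_q, hi_q),
        # The pivotal interval reflects the percentile interval around theta.
        "pivotal": (2 * theta - hi_q, 2 * theta - lo_q),
    }

# Toy normalized spot volumes (invented), six gels per group.
control = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3]
treated = [2.0, 2.3, 1.9, 2.2, 2.1, 2.4]
intervals = bootstrap_intervals(treated, control)
significant = {name: not (lo <= 0.0 <= hi) for name, (lo, hi) in intervals.items()}
```

Because the three intervals are built from the same resamples but placed differently around the estimate, they can disagree for borderline spots, which is one way the tests can yield different numbers of discoveries.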
After employing a statistical test for each protein spot, the third step in the pipeline is to determine protein spot significance with consideration of error control in light of multiple testing. To compare the number of significant spots across different simulated imputation procedures from the complete dataset, the per comparison error rate was controlled at 0.05. To examine imputation methods, 20 different Monte-Carlo simulations were performed, where each simulation consisted of randomly deleting 10% of the data from the complete dataset and imputing the data using either KNN, RA, or LSM imputation. We summarized the imputation methods using spots where the p-value for significance was less than 0.05 in at least half of the simulated datasets. We recognize that controlling the per comparison error rate is likely to inflate the number of false positives; nevertheless, this method is acceptable for comparing imputation procedures since we are not making claims that the discovered spots are truly differentially expressed.
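The summary rule described above (a spot is retained when its p-value is below 0.05 in at least half of the simulated datasets) can be sketched as a simple tally. The function name and the toy p-values are hypothetical; in the study there were 20 simulations per imputation method rather than the four shown here.

```python
def consistently_significant(p_values_by_simulation, alpha=0.05):
    """Indices of spots with p < alpha in at least half of the simulations."""
    n_sims = len(p_values_by_simulation)
    n_spots = len(p_values_by_simulation[0])
    selected = []
    for spot in range(n_spots):
        hits = sum(p_values[spot] < alpha for p_values in p_values_by_simulation)
        if hits >= n_sims / 2:
            selected.append(spot)
    return selected

# Toy p-values (invented): 4 simulated datasets (rows) by 3 spots (columns).
p_values = [
    [0.01, 0.20, 0.04],
    [0.02, 0.03, 0.06],
    [0.03, 0.50, 0.04],
    [0.01, 0.60, 0.30],
]
selected = consistently_significant(p_values)
print(selected)
```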
In practice, when determining protein significance, from our work in , we advocate controlling the generalized family wise error rate (gFWER). An overview of methods to control gFWER is available in  with their implementation in the R software provided in the package multtest .
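As an illustration of what controlling gFWER means, the sketch below uses one simple procedure: a single-step generalized Bonferroni rule, which rejects a spot when its p-value is at most (k + 1)α/m for m tests. This bounds the probability of making more than k false discoveries by α. It is not necessarily the procedure used in our work or the default in multtest, which also offers other approaches; the function name and toy p-values are hypothetical.

```python
def gfwer_reject(p_values, k=0, alpha=0.05):
    """Single-step generalized Bonferroni: controls P(more than k false positives) <= alpha."""
    threshold = (k + 1) * alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Toy p-values for five protein spots (invented).
p_values = [0.001, 0.012, 0.019, 0.3, 0.8]
strict = gfwer_reject(p_values, k=0)   # ordinary Bonferroni, threshold 0.01
relaxed = gfwer_reject(p_values, k=1)  # tolerates one false positive, threshold 0.02
```

Setting k = 0 recovers the usual family wise error rate; allowing a small k raises the threshold and so increases the number of discoveries while still bounding the count of false positives.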
List of Abbreviations
2-DE: two-dimensional gel electrophoresis
FDR: false discovery rate
gFWER: generalized family wise error rate
KNN: k nearest neighbor
LSM: least squares method
NIPALS: nonlinear iterative partial least squares
PCA: principal component analysis
PMA: phorbol 12-myristate 13-acetate
RA: row average method
RMSE: root mean square error
SAM: significance analysis of microarrays
- Morris J, Baladandayuthapani V, Herrick R, Sanna P, Gutstein H: Automated Analysis of Quantitative Image Data Using Isomorphic Functional Mixed Models with Application to Proteomics Data. UT MD Anderson Cancer Center Department of Biostatistics Working Paper Series 2010.
- Wood J, White I, Cutler P: A likelihood-based approach to defining statistical significance in proteomic analysis where missing data cannot be disregarded. Signal Processing 2004, 84(10):1777–1788. doi:10.1016/j.sigpro.2004.06.019
- Jung K, Gannoun A, Sitek B, Apostolov O, Schramm A, Meyer H, Stuhler K, Urfer W: Statistical evaluation of methods for the analysis of dynamic protein expression data from a tumor study. RevStat-Statistical Journal 2006, 4:67–80.
- Pedreschi R, Hertog M, Carpentier S, Lammertyn J, Robben J, Noben J, Panis B, Swennen R, Nicolai B: Treatment of missing values for multivariate statistical analysis of gel-based proteomics data. Proteomics 2008, 8(7):1371–1383. doi:10.1002/pmic.200700975
- Jung K, Gannoun A, Sitek B, Meyer H, Stuhler K, Urfer W: Analysis of dynamic protein expression data. RevStat-Statistical Journal 2005, 3:99–111.
- Meleth S, Deshane J, Kim H: The case for well-conducted experiments to validate statistical protocols for 2D gels: different pre-processing = different lists of significant proteins. BMC Biotechnology 2005, 5(7).
- Horgan G: Sample size and replication in 2D gel electrophoresis studies. J Proteome Res 2007, 6(7):2884–2887. doi:10.1021/pr070114a
- Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520. doi:10.1093/bioinformatics/17.6.520
- Kim H, Golub G, Park H: Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005, 21(2):187. doi:10.1093/bioinformatics/bth499
- Scheel I, Aldrin M, Glad I, Sorum R, Lyng H, Frigessi A: The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 2005, 21(23):4272–4279. doi:10.1093/bioinformatics/bti708
- Sehgal M, Gondal I, Dooley L: Collateral Missing Value Estimation: Robust missing value estimation for consequent microarray data processing. AI 2005: Advances in Artificial Intelligence 2005, 274–283.
- Gan X, Liew A, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Research 2006, 34(5):1608. doi:10.1093/nar/gkl047
- Tuikkala J, Elo L, Nevalainen O, Aittokallio T: Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2006, 22(5):566–572. doi:10.1093/bioinformatics/btk019
- Wang X, Li A, Jiang Z, Feng H: Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics 2006, 7:32. doi:10.1186/1471-2105-7-32
- Jörnsten R, Ouyang M, Wang H: A meta-data based method for DNA microarray imputation. BMC Bioinformatics 2007, 8:109.
- Zhang X, Song X, Wang H, Zhang H: Sequential local least squares imputation estimating missing value of microarray data. Computers in Biology and Medicine 2008, 38(10):1112–1120. doi:10.1016/j.compbiomed.2008.08.006
- Nguyen D, Wang N, Carroll R: Evaluation of missing value estimation for microarray data. Journal of Data Science 2004, 2(4):347–370.
- Brock G, Shaffer J, Blakesley R, Lotz M, Tseng G: Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics 2008, 9:12. doi:10.1186/1471-2105-9-12
- Celton M, Malpertuy A, Lelandais G, De Brevern A: Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 2010, 11:15. doi:10.1186/1471-2164-11-15
- Ahmad N, Zhang J, Brown P, James D, Birch J, Racher A, Smales C: On the statistical analysis of the GS-NS0 cell proteome: Imputation, clustering and variability testing. BBA-Proteins and Proteomics 2006, 1764(7):1179–1187. doi:10.1016/j.bbapap.2006.05.002
- Chang J, Van Remmen H, Ward W, Regnier F, Richardson A, Cornells J: Processing of data generated by 2-dimensional gel electrophoresis for statistical analysis: missing data, normalization, and statistics. J Proteome Res 2004, 3(6):1210–1218. doi:10.1021/pr049886m
- Coling D, Ding D, Young R, Lis M, Stofko E, Blumenthal K, Salvi R: Proteomic analysis of cisplatin-induced cochlear damage: methods and early changes in protein expression. Hearing Research 2007, 226(1–2):140–156. doi:10.1016/j.heares.2006.12.017
- Trivedi P, Edwards J, Wang J, Gadbury G, Srinivasasainagendra V, Zakharkin S, Kim K, Mehta T, Brand J, Patki A, Page G, Allison D: HDBStat!: a platform-independent software suite for statistical analysis of high dimensional biology data. BMC Bioinformatics 2005, 6:86. doi:10.1186/1471-2105-6-86
- Efron B, Tibshirani R: Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1986, 1:54–75. doi:10.1214/ss/1177013815
- Karp N, McCormick P, Russell M, Lilley K: Experimental and statistical considerations to avoid false conclusions in proteomics studies using differential in-gel electrophoresis. Molecular & Cellular Proteomics 2007, 6(8):1354.
- Sellers K, Miecznikowski J, Viswanathan S, Minden J, Eddy W: Lights, Camera, Action! Systematic variation in 2-D difference gel electrophoresis images. Electrophoresis 2007, 28(18):3324–3332. doi:10.1002/elps.200600793
- Hayes A: Permutation Test Is Not Distribution-Free: Testing H0: ρ = 0. Psychological Methods 1996, 1:184–198. doi:10.1037/1082-989X.1.2.184
- Adams D, Anthony C: Using randomization techniques to analyse behavioural data. Animal Behaviour 1996,51(4):733–738. 10.1006/anbe.1996.0077View ArticleGoogle Scholar
- Edgington E: Randomization tests. CRC Press; 1995.Google Scholar
- Manly B: Randomization, bootstrap and Monte Carlo methods in biology. Chapman & Hall/CRC; 2006.Google Scholar
- Pitt D, Kreutzweiser D: Applications of computer-intensive statistical methods to environmental research. Ecotoxicology and environmental safety 1998,39(2):78–97. 10.1006/eesa.1997.1619PubMedView ArticleGoogle Scholar
- Ludbrook J, Dudley H: Why permutation tests are superior to t and F tests in biomedical research. The American Statistician 1998.,52(2): 10.2307/2685470Google Scholar
- Tsai C, Chen Y, Chen J: Testing for differentially expressed genes with microarray data. Nucleic acids research 2003,31(9):e52. 10.1093/nar/gng052PubMed CentralPubMedView ArticleGoogle Scholar
- Beasley T, Page G, Brand J, Gadbury G, Mountz J, Allison D: Chebyshev's inequality for nonparametric testing with small N and α in microarray research. Journal of the Royal Statistical Society. Series C (Applied Statistics) 2004, 53: 95–108. 10.1111/j.1467-9876.2004.00428.xView ArticleGoogle Scholar
- Gold D, Miecznikowski J, Liu S: Error control variability in pathway-based microarray analysis. Bioinformatics 2009,25(17):2216. 10.1093/bioinformatics/btp385PubMed CentralPubMedView ArticleGoogle Scholar
- Damodaran S, Rabin R: Minimizing Variability in Two-dimensional Electrophoresis Gel Image Analysis. OMICS: A Journal of Integrative Biology 2007,11(2):225–230. 10.1089/omi.2007.0018PubMedView ArticleGoogle Scholar
- Bo T, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research 2004,32(3):e34. 10.1093/nar/gnh026PubMed CentralPubMedView ArticleGoogle Scholar
- Hastie T, Tibshirani R, Narasimhan B, Chu G: impute: impute: Imputation for microarray data. 1999. [R package version 1.10.0]Google Scholar
- Bo T, Dysvik B, Jonassen I: LSimpute: Accurate estimation of missing values in microarray data with least squares methods. 2005. [http://www.ii.uib.no/~trondb/imputation/]Google Scholar
- Wold H: Path models with latent variables: the NIPALS approach. Quantitative sociology: International perspectives on mathematical and statistical modeling 1975, 307–357.View ArticleGoogle Scholar
- Stacklies W, Redestig H, to Kevin Wright for improvements to nipalsPca T: pcaMethods: A collection of PCA methods. 2007. [R package version 1.18.0]Google Scholar
- Fellows I: Deducer: Deducer. 2009. [R package version 0.2–2] [http://CRAN.R-project.org/package=Deducer]Google Scholar
- Wasserman L: All of statistics: a concise course in statistical inference. Springer Verlag; 2004.View ArticleGoogle Scholar
- Guo W, Romano J: A generalized Sidak-Holm procedure and control of generalized error rates under independence. Statistical Applications in Genetics and Molecular Biology 2007., 6: 10.2202/1544-6115.1247Google Scholar
- Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S: multtest: Resampling-based multiple hypothesis testing. 2009. [R package version 2.0.0]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.