Informed baseline subtraction of proteomic mass spectrometry data aided by a novel sliding window algorithm
 Tyman E. Stanford^{1},
 Christopher J. Bagley^{1} and
 Patty J. Solomon^{1}
DOI: 10.1186/s12953-016-0107-8
© The Author(s) 2016
Received: 24 April 2016
Accepted: 1 November 2016
Published: 7 December 2016
Abstract
Background
Proteomic matrix-assisted laser desorption/ionisation (MALDI) linear time-of-flight (TOF) mass spectrometry (MS) may be used to produce protein profiles from biological samples with the aim of discovering biomarkers for disease. However, the raw protein profiles suffer from several sources of bias or systematic variation which need to be removed via preprocessing before meaningful downstream analysis of the data can be undertaken. Baseline subtraction, an early preprocessing step that removes the non-peptide signal from the spectra, is complicated by the following: (i) each spectrum has, on average, wider peaks for peptides with higher mass-to-charge ratios (m/z), and (ii) the time-consuming and error-prone trial-and-error process for optimising the baseline subtraction input arguments. With reference to the aforementioned complications, we present an automated pipeline that includes (i) a novel ‘continuous’ line segment algorithm that efficiently operates over data with a transformed m/z-axis to remove the relationship between peptide mass and peak width, and (ii) an input-free algorithm to estimate peak widths on the transformed m/z scale.
Results
The automated baseline subtraction method was deployed on six publicly available proteomic MS datasets using six different m/z-axis transformations. Optimality of the automated baseline subtraction pipeline was assessed quantitatively using the mean absolute scaled error (MASE) when compared to a gold-standard baseline subtracted signal. Several of the transformations investigated were able to reduce, if not entirely remove, the peak width and peak location relationship, resulting in near-optimal baseline subtraction using the automated pipeline. The proposed novel ‘continuous’ line segment algorithm is shown to far outperform naive sliding window algorithms with regard to the computational time required. The improvement in computational time was at least fourfold on real MALDI TOF-MS data and at least an order of magnitude on many simulated datasets.
Conclusions
The advantages of the proposed pipeline include informed and data-specific input arguments for baseline subtraction methods, the avoidance of time-intensive and subjective piecewise baseline subtraction, and the ability to automate baseline subtraction completely. Moreover, individual steps can be adopted as stand-alone routines.
Keywords
Mathematical morphology; Top-hat operator; Line segment algorithm; Mass spectrometry; Baseline subtraction; Preprocessing; Matrix-assisted laser desorption/ionization; Time-of-flight; Unevenly spaced data
Background
Discovery of protein biomarkers by mass spectrometry
Protein biomarkers are proteins or protein fragments that serve as markers of a disease or condition [1, 2] by virtue of their altered relative abundance in the disease state versus the healthy condition. Matrix-assisted laser desorption/ionisation (MALDI) linear time-of-flight (TOF) mass spectrometry (MS) is a widely used technology for biomarker discovery as it can create a representative profile of polypeptide expression from biological samples. These profiles are displayed as points of polypeptide abundance (intensity; the y-axis) for a range of mass-to-charge values (m/z; the x-axis). Each spectrum is an array of positive intensity values for discretely measured m/z values, but the profile is typically displayed on a continuous scale. MALDI TOF-MS spectra are typically limited to polypeptides of less than 30 kilodaltons, although there is no theoretical upper limit [3]. Numerous biomarkers have been identified using MALDI TOF-MS to date [3, 4].
Signal smoothing is the first step in preprocessing the data and aims to remove instrument-derived noise in the data and stochastic variation in the spectrum signal. Baseline subtraction then follows, which is the removal of the estimated ‘bed’ on which the spectral profile sits, composed of non-biological signal, e.g. chemical noise from ionised matrix. Normalisation is the third step in preprocessing. This has the aim of making the observed signals proportionate over the experiment, correcting for instrument variability and sample ionisation efficiency that will influence the number of peptide ions reaching the detector. Peak detection is the fourth step, which is the detection of peak signal as peptide mass and intensity pairs. Finally, in the fifth step, the peaks are subject to peak alignment, which adjusts for small drifts in m/z location that result from the calibration required for the TOF-MS system. This ensures that peptides common across spectra are recognised and compared at the same m/z value. Once the data have been preprocessed, analysis to detect potential biomarkers can be performed.
There are numerous freely available MS preprocessing packages. For example, in the R statistical software environment, MALDIquant, PROcess and XCMS are available [5–9]. Although we have set out the usual sequence of five data preprocessing steps, an optimal approach to preprocessing is not yet established and there is scope to improve current preprocessing methods and the order in which they are applied, to allow more reliable biomarker identification [10]. The present paper focuses on optimising methods for the baseline subtraction step of preprocessing of the raw spectra.
Baseline subtraction
The baseline subtraction method discussed in the present paper utilises the top-hat operator, which is an operator defined in mathematical morphology. Mathematical morphology was originally proposed for two-dimensional image analysis, then further developed for image processing of microarray data images [13, 14]. It has since been applied to MS data [7, 15–19], and we describe the theory that is largely ignored when it is applied naively. The mathematical morphology definitions of an erosion, dilation, opening and top-hat allow us to extend the current use of mathematical morphology in MS baseline subtraction.
The top-hat operator has some properties, i.e. it is a non-parametric and non-linear filter, which make it desirable for baseline subtraction. In particular, this suits the non-biological signal in MS spectra, which may not follow a known functional form. Furthermore, the top-hat operator is computationally inexpensive compared with standard functional filters that require estimates of model parameters.
Other algorithmic methods of baseline subtraction such as the sensitive non-linear iterative peak (SNIP) algorithm [20, 21] provide an alternative to the top-hat operator. However, it will be shown in the Methods section that the top-hat operator can importantly be extended, using the mathematical theory underpinning it, for unevenly spaced data.
Standard methods of baseline subtraction estimate local minima (troughs) and either fit a local regression (LOESS, Savitzky-Golay) or interpolate (splines) through these points [22]. These methods require careful selection of the window size for detecting troughs, the polynomial order and the span of points for fitting the model, where applicable. Even with optimised input arguments, these methods cannot guarantee a non-negative resultant signal unless constraints are applied to enforce non-negativity. Without such constraints, padded or removed signal in places of high curvature in the spectra may be produced. This can easily be envisioned by considering two local minima and a point adjacent to one of them that lies between both. There is no property that stops the adjacent point lying below an interpolation of the two minima, especially where there is a large difference between the values of the local minima.
The top-hat operator
The top-hat operator is a function described within mathematical morphology theory, an area that is heavily applied in image processing and analysis [23, 24]. For the interested reader, we direct you to ‘Additional file 1’ for a mathematical description.
The top-hat operator is the end result of applying a rolling (moving) minimum calculation, called an erosion, to a spectrum; then applying a rolling maximum calculation, called a dilation, to the erosion; and finally removing this estimate of the spectrum’s baseline, called an opening, from the initial spectrum.
The rolling minimum and maximum calculations require an appropriately sized line segment, or window, to be defined, which provides the local domain for the minimum and maximum calculations to be made. In mathematical morphology parlance this window is called a structuring element (SE).
The opening has the desirable property that it is restricted to values equal to or less than the spectral values to which it was applied. In turn, the top-hat operator therefore provides a background estimate and removal without risk of creating negative signal, which is a physical impossibility in the system.
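To make the three operations concrete, the following minimal Python sketch applies an erosion, a dilation and the resulting top-hat to a toy, evenly spaced spectrum. The function names, the `half_width` parameter (the SE covers `2 * half_width + 1` points, truncated at the ends of the spectrum) and the toy intensities are assumptions made for this illustration. A naive window scan is used here for clarity rather than the more efficient algorithms discussed later.

```python
def erosion(values, half_width):
    """Rolling minimum over a window of 2 * half_width + 1 points (truncated at the ends)."""
    n = len(values)
    return [min(values[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def dilation(values, half_width):
    """Rolling maximum over the same window."""
    n = len(values)
    return [max(values[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def tophat(values, half_width):
    """Return (opening, top-hat): the baseline estimate and the baseline-subtracted signal."""
    opening = dilation(erosion(values, half_width), half_width)
    return opening, [v - b for v, b in zip(values, opening)]


# Toy spectrum (hypothetical values): a decaying baseline with one narrow peak on top.
spectrum = [10, 9, 8, 7, 12, 18, 11, 5, 4, 3, 2]
baseline, subtracted = tophat(spectrum, half_width=2)
print(baseline)
print(subtracted)
```

Because the window is symmetric, every eroded value inside the window of a point is no larger than the intensity at that point, so the opening never exceeds the spectrum and the subtracted signal cannot be negative.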
Current application of the top-hat operator to linear TOF-MS
A naive algorithmic application of an erosion to spectral intensities simply requires a traversal of each point, where the minimum value within a window over that point is the resulting erosion. The process is performed similarly for a dilation. However, erosions and dilations can be calculated more efficiently with the line segment algorithm (LSA) [25, 26]. Application of the LSA is mainly seen in medical imaging and analysis [27, 28]. The R package MALDIquant and OpenMS use this algorithm in their implementation of the top-hat operator.

1. If a SE is too large, then it will be too conservative and leave false signal.
2. If a SE is too small, it will result in undercut peaks and remove valid signal.
3. The mean peak width increases further along the m/z-axis [29]. The baseline subtraction needs to be performed in a piecewise manner, otherwise the above issues 1 and 2 will occur.
Despite the simplicity of the top-hat operator compared to functional alternatives, piecewise baseline subtraction is still required. In fact, piecewise baseline subtraction should be applied for any method that implicitly assumes peak width remains constant, such as local regression, interpolating splines or the SNIP algorithm.
The SE size used for the top-hat operator needs to be of equivalent window size to each spectrum’s peak widths, or greater, to ensure the top-hat operator does not ‘undercut’ peak intensities. The piecewise baseline subtraction involves determining subsections of the m/z-axis, where fixed SE widths (in the number of m/z points) in each section are appropriate, or the equivalent input arguments for other baseline methods. Smaller SEs will be chosen corresponding to lower m/z values and larger SEs will be used corresponding to larger m/z values.
Improving baseline subtraction
Prior to preprocessing MALDI TOF-MS data, a log or square root transformation of the intensity axis is usually performed as a variance stabilisation measure, but no such transformation is made to the m/z-axis. If an appropriate m/z transformation could be made, however, piecewise preprocessing of the spectra for the baseline subtraction step (and potentially for other preprocessing steps) could be avoided. Additionally, the default arguments, such as window size, in software that performs baseline subtraction are statically defined. Uninformed default arguments such as these are highly likely to need modification for successful baseline subtraction, as spectra attributes vary from one experiment to another. Dynamic default arguments that are informed by the data would be an advantage, both saving user time and minimising user error.
Methods
A pipeline to achieve automated baseline subtraction

Firstly, the pipeline automates the baseline subtraction step, which is otherwise conducted in a piecewise manner. This eliminates the need for user input and time-consuming calibration by observation. Automation of the baseline subtraction step also minimises the potential for user error and the time required to assess the input arguments for optimality.

Secondly, the novel algorithm presented here for performing the top-hat operation on unevenly spaced data is computationally less expensive than a naive minimum or maximum sliding window algorithm, which further reduces the computational time burden of baseline subtraction.
Fields of application outside of bioinformatics that encounter unevenly spaced data are also likely to find this algorithm useful in practice. Other names for unevenly spaced data include unevenly sampled, non-equispaced, non-uniform, inhomogeneous, irregularly sampled or non-synchronous data. Such data occur in various fields including, but not limited to, financial time series, geologic time series, astrophysics and medical imaging [30–36]. Analysis and processing of unevenly spaced data is an ongoing field of research, as most methods for analysis assume equally spaced data.
Data used
Six proteomic MS datasets from previously published studies were used to validate the methods presented over a broad range of dataset attributes such as different numbers of peaks, different peak widths, differently spaced m/z values, different numbers of m/z values, different numbers of spectra, samples from different organisms and samples from different biological origins.
Fiedler data: Urine samples were taken from 10 healthy women and 10 healthy men and peptides were separated using magnetic beads (fractionation). The fractionated samples were then subject to MALDI TOF-MS [37]. A subset of the MALDI TOF-MS data is freely available in the R package MALDIquant [7] and is the dataset used here. The spectra are observed over the range 1,000–10,000 m/z.
Yildiz data: As described in [38], sera were collected from 142 lung cancer patients and 146 healthy controls to find relevant biomarkers. The serum samples were subject to MALDI TOF-MS without magnetic bead separation. The spectra are observed over the range 3,000–20,000 m/z.
Wu data: MALDI TOF-MS data were generated from sera, as described in [39, 40], with the aim of differentiating between 47 ovarian cancer and 42 control subjects. The spectra are observed over the range 800–3,500 m/z using reflectron mode, which resolves peptide peaks into their isotopomers.
Adam data: Surface-enhanced laser desorption/ionization (SELDI) TOF-MS data from 326 serum samples from subjects classified as prostate cancer, benign hyperplasia or control [41]. While SELDI has been found to be less sensitive than MALDI, samples do not require fractionation before applying MS. The data analysed here are limited to the range 2,000–15,000 m/z as peptide signals beyond this range are sparse.
Taguchi data: The dataset was first described in [42] but is available as a supplement to [43]. The data are 210 serum-derived MALDI TOF mass spectra from 70 subjects with non-small-cell lung cancer, collected with the aim of predicting response to treatment. The data cover the range 2,000–70,000 m/z.
Mantini data: The data in this study were produced using MALDI TOF-MS from purified samples containing equine myoglobin and cytochrome C [44]. A total of 30 spectra are available in the range 5,000–22,000 m/z.
Transformation of the m/z-axis
The proposed pipeline for baseline subtraction requires a suitable transformation of the m/z-axis as the first step. In this section we investigate potential transformations, which will be assessed quantitatively for their suitability.
Table 1 The transforms, \(t_{i}\), of the m/z-axis trialled to produce a roughly uniform distribution of peak widths across the \(t_{i}(m/z)\)-axis
Label  Transform
\(t_{0}(x)\)  \(x\)
\(t_{1}(x)\)  \(-1000x^{-1}\)
\(t_{2}(x)\)  \(x^{1/4}\)
\(t_{3}(x)\)  \(\ln x\)
\(t_{4}(x)\)  \(-1000(\ln x)^{-1}\)
\(t_{5}(x)\)  \(-1000x^{-1/4}\)
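The six transforms can be written down directly, as in the Python sketch below. The dictionary `TRANSFORMS` and the helper `transformed_width` are names introduced here for illustration only. The usage example rests on a hypothetical assumption, made purely to show the effect of a transform, that a peak's width is a fixed 1% of its m/z location.

```python
import math

# The six candidate transforms, keyed by label (x is an m/z value greater than 1).
TRANSFORMS = {
    "t0": lambda x: x,
    "t1": lambda x: -1000.0 / x,
    "t2": lambda x: x ** 0.25,
    "t3": lambda x: math.log(x),
    "t4": lambda x: -1000.0 / math.log(x),
    "t5": lambda x: -1000.0 * x ** -0.25,
}


def transformed_width(transform, centre, width):
    """Width of the interval [centre - width/2, centre + width/2] on the transformed axis."""
    return transform(centre + width / 2) - transform(centre - width / 2)


# Hypothetical peaks whose width is 1% of their m/z location.
for label, t in TRANSFORMS.items():
    low = transformed_width(t, 2000.0, 20.0)
    high = transformed_width(t, 20000.0, 200.0)
    print(label, low, high)
```

All six transforms are increasing for m/z values above 1, so the ordering of the m/z values is preserved. Under the proportional-width assumption used in the example, the log transform maps both peaks to the same width, whereas the identity transform leaves the second peak ten times wider than the first.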
Obtaining approximate peak widths prior to baseline subtraction
Peak widths can be obtained at the peak detection step (step four of preprocessing) but such information is not generally known prior to the second preprocessing step of baseline subtraction. To determine the constant SE size to be passed over the transformed m/zaxis, peak widths need to be estimated. An algorithm to estimate peak widths from the data was created here for this purpose.

1. For each spectrum, the lower convex hull of the two-dimensional set of spectrum points is used to determine an approximate baseline for each spectrum.
2. The longest segment of the lower convex hull is then halved, with the two sets of points created by this split subject to a new lower convex hull calculation.
3. The newly calculated lower convex hull points for the two sets of points are then added to the original set of lower convex hull points to improve the approximate baseline calculation.
4. Steps 2 and 3 are repeated r−1 more times to produce an approximate estimated baseline.
5. The approximate baseline is then removed and the median intensity is calculated for the resulting spectrum.
6. Intensities above the median value are treated as points along a peak.
7. Each run of consecutive points above the median value gives an estimated peak width.
The above algorithm is crude and could not be used for reliable baseline subtraction. However, estimated peak widths are easily extracted using this method and can be used within the proposed automated baseline subtraction pipeline.
The algorithm presented above attempts to automatically find peak widths without user input. We outline this automated procedure in the Methods section as it is not the focus of this paper, and it may be substituted with any peak-region-finding method that requires no user input; such is the modularity of the pipeline shown in Fig. 4. There exist other methods to estimate peak widths (regions), such as that found in [20], but they require previous knowledge of likely peak widths and therefore cannot be used within an automated baseline subtraction method.
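The peak-width estimation steps listed above can be sketched in Python as follows. This is one possible reading of those steps and not the authors' code: the names `lower_hull_indices` and `estimate_peak_widths`, the default `r=3`, measuring the 'longest segment' along the x-axis, halving it at its middle index, and reporting widths in x-axis units are all assumptions made for this illustration. The x values are assumed to be strictly increasing.

```python
from statistics import median


def lower_hull_indices(x, y, start, stop):
    """Indices of the lower convex hull of points start..stop (monotone chain)."""
    hull = []
    for i in range(start, stop + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (x[b] - x[a]) * (y[i] - y[a]) - (y[b] - y[a]) * (x[i] - x[a])
            if cross <= 0:   # b lies on or above the chord a -> i: not on the lower hull
                hull.pop()
            else:
                break
        hull.append(i)
    return hull


def estimate_peak_widths(x, y, r=3):
    """Crude peak widths: runs of points above the median after removing a hull baseline."""
    nodes = set(lower_hull_indices(x, y, 0, len(x) - 1))
    for _ in range(r):
        ordered = sorted(nodes)
        # Longest baseline segment (measured along x), halved at its middle index.
        a, b = max(zip(ordered, ordered[1:]), key=lambda s: x[s[1]] - x[s[0]])
        mid = (a + b) // 2
        if mid == a:         # adjacent points: nothing left to split
            break
        nodes.update(lower_hull_indices(x, y, a, mid))
        nodes.update(lower_hull_indices(x, y, mid, b))

    # Approximate baseline: linear interpolation between consecutive hull nodes.
    ordered = sorted(nodes)
    baseline = [0.0] * len(x)
    for a, b in zip(ordered, ordered[1:]):
        for i in range(a, b + 1):
            w = (x[i] - x[a]) / (x[b] - x[a])
            baseline[i] = y[a] + w * (y[b] - y[a])

    residual = [yi - bi for yi, bi in zip(y, baseline)]
    cutoff = median(residual)

    # Each run of consecutive points above the median is one estimated peak.
    widths, run_start = [], None
    for i, value in enumerate(residual):
        if value > cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            widths.append(x[i - 1] - x[run_start])
            run_start = None
    if run_start is not None:
        widths.append(x[-1] - x[run_start])
    return widths
```

Applied to a transformed axis, `x` would hold the transformed m/z values, so the returned widths are directly comparable with a constant SE width on that axis.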
Selecting a SE size and applying the top-hat operator in the transformed space
Point three of Fig. 4 requires a choice of SE size. This can be chosen from the estimated peak widths found using the algorithm presented in the previous section. The aim is to select a SE of sufficient size to not undercut peaks; such a SE size roughly translates to the maximum of the peak widths. However, there is likely to be a SE size smaller than the maximum estimated peak width, but much greater than the minimum estimated peak width, that performs optimally. Given a set of estimated peak widths for all spectra in an experiment and a SE size, we define the proportion of peak widths that are estimated to be the SE size or smaller as the estimated peak coverage proportion (EPCP). Please refer to ‘Additional file 3’ for an illustrative plot of the estimated peak widths for the 16 spectra in the Fiedler dataset on the \(t_{2}(m/z)\) scale.
We trial different SE sizes corresponding to different EPCP values in the hope an optimal EPCP value for each of the six datasets we utilise can be found. A SE size that fully covers 95% of detected estimated peak widths (EPCP of 0.95) for example, could yield optimised baseline subtraction.
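The relationship between an EPCP value and a SE size can be illustrated with a short Python sketch. The function names `se_size_for_epcp` and `epcp_of`, and the rule of returning the smallest observed width that reaches the requested coverage, are assumptions made here; the example widths are made up.

```python
import math


def se_size_for_epcp(peak_widths, epcp):
    """Smallest observed width w such that at least a proportion epcp of widths are <= w."""
    ordered = sorted(peak_widths)
    rank = max(1, math.ceil(epcp * len(ordered)))   # number of widths that must be covered
    return ordered[rank - 1]


def epcp_of(peak_widths, se_size):
    """Proportion of estimated peak widths that are the SE size or smaller."""
    return sum(w <= se_size for w in peak_widths) / len(peak_widths)


# Hypothetical estimated peak widths pooled over all spectra in an experiment.
widths = [3, 1, 4, 1, 5, 9, 2, 6]
se = se_size_for_epcp(widths, 0.75)
print(se, epcp_of(widths, se))
```

An EPCP of 1 returns the maximum estimated peak width, which corresponds to the most conservative choice of SE described above.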
Both the EPCP and m/z-axis transformation are variables that can be tuned to find an empirically optimal combination, assessed by calculating the minimum value of an error metric relative to a gold-standard baseline subtraction. The metric used to compare the automated baseline subtraction to the gold-standard is outlined in the next section and the modified algorithm to perform top-hat baseline subtraction on the unevenly spaced and transformed m/z-axis is provided in the section after that.
Comparison of proposed methods to the gold-standard
Piecewise, top-hat baseline subtracted spectra were used as the gold-standard baseline subtracted spectra. The SE sizes for each piecewise segment along the m/z-axis were selected using trial-and-error to produce the best baseline subtraction as determined by visual inspection. These baseline subtracted, gold-standard spectra were produced prior to the automated baseline subtraction methods being applied.
Mean absolute scaled error (MASE [48]) was selected as the error metric for the automatically baselined spectra, for a given transformation and EPCP, when compared to the gold-standard baseline subtracted spectra. Because the MALDI TOF mass spectra intensities are on arbitrary scales prior to normalisation, it is important to use a metric that is scale-free in order to be able to compare results between spectra from different experiments. MASE also avoids the degeneracy issues that other relative error metrics have with zero denominators, which matters here because baseline subtracted spectra will have many zero values where no signal is present. Other metrics such as mean squared error (MSE) were considered, and did not change the selection of the optimal transform and EPCP. However, MSE values cannot be compared between datasets, and some form of normalisation or weighting of spectra would be required to ensure that the MSE, say, of selected spectra does not dominate the result.
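For concreteness, a minimal Python sketch of a MASE calculation is given below. The exact scaling used in this study is not reproduced in the text above, so the sketch assumes the common form of the metric in which the mean absolute error is divided by the mean absolute difference between consecutive values of the reference signal, here taken to be the gold-standard spectrum. The function name `mase` and the example intensities are illustrative.

```python
def mase(gold, estimate):
    """Mean absolute error scaled by the mean absolute first difference of the gold standard."""
    n = len(gold)
    mae = sum(abs(g - e) for g, e in zip(gold, estimate)) / n
    scale = sum(abs(gold[i] - gold[i - 1]) for i in range(1, n)) / (n - 1)
    return mae / scale


# Hypothetical baseline-subtracted intensities with many zeros, as described above.
gold = [0.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0]
auto = [0.0, 0.0, 1.0, 4.0, 2.0, 0.0, 0.0]
print(mase(gold, auto))
```

Multiplying both spectra by the same constant leaves the value unchanged, which is the scale-free property referred to above, and zero intensities in the gold-standard cause no division by zero as long as the gold-standard spectrum is not constant.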
The ‘continuous’ line segment algorithm
A novel algorithm is proposed here that can be applied to the unevenly spaced values of the transformed m/zaxis using a constant SE width. This algorithm, which we name the ‘continuous’ line segment algorithm (CLSA), requires fewer computations per element than current rolling maximum and minimum algorithms on unevenly spaced data [49].
When the algorithm considers each point \(x_{i}\) for the minimum of f in the window spanning k/2 either side, it checks whether the most extreme x-values in this window are either in the current block or one block away (these values cannot be further than one block away as block sizes are of length k) to decide which combination of g and h is required. Note that the algorithm is unaffected by arbitrarily spaced \(x_{i}\) as long as they are in ascending order. For example, empty blocks (\(\theta_{i}\neq j\) for any i=2,3,…,n−1; j=2,3,…,m−1) or blocks containing only one \(x_{i}\) (\(x_{b_{j-1}+1}=x_{b_{j}}\) for any j=2,3,…,m) do not affect the validity of the proposed algorithm.
This algorithm can be seen as a generalised version of the LSA [25, 26] as it works on evenly and unevenly spaced data. An implementation of this novel CLSA is available as an R package using compiled C code at https://github.com/tystan/clsa.
A demonstration of why creating blocks the size of the SE and accessing cumulative values half an SE length away allows the calculation of rolling minima is shown in [25]. Examples to demonstrate the mechanics of the CLSA are presented in ‘Additional file 4’.
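The block-based idea can be sketched in Python as follows. This is an illustrative reading of the description above and not the authors' implementation, which is the compiled C code in the clsa package. The names `rolling_min_blocks`, `rolling_max_blocks`, `theta`, `lo` and `hi` are chosen here for the example, blocks are assumed to start at `x[0]`, and the window is taken to be the closed interval from `x[i] - k/2` to `x[i] + k/2`.

```python
def rolling_min_blocks(x, f, k):
    """Rolling minimum of f over [x[i] - k/2, x[i] + k/2] for ascending, unevenly spaced x."""
    n = len(x)
    # Block label of each point: consecutive blocks of width k starting at x[0].
    theta = [int((xi - x[0]) // k) for xi in x]

    # g: running minimum from the start of each block, sweeping forward.
    g = list(f)
    for i in range(1, n):
        if theta[i] == theta[i - 1]:
            g[i] = min(g[i - 1], f[i])

    # h: running minimum from the end of each block, sweeping backward.
    h = list(f)
    for i in range(n - 2, -1, -1):
        if theta[i] == theta[i + 1]:
            h[i] = min(h[i + 1], f[i])

    out = []
    lo = hi = 0
    for xi in x:
        # Window edges only ever move forward, so this is linear time overall.
        while x[lo] < xi - k / 2:
            lo += 1
        while hi + 1 < n and x[hi + 1] <= xi + k / 2:
            hi += 1
        if theta[lo] != theta[hi]:
            out.append(min(h[lo], g[hi]))   # window straddles two adjacent blocks
        elif lo == 0 or theta[lo - 1] != theta[lo]:
            out.append(g[hi])               # window begins at the first point of its block
        else:
            out.append(h[lo])               # window runs to the last point of its block
    return out


def rolling_max_blocks(x, f, k):
    """Rolling maximum, obtained from the rolling minimum of the negated values."""
    return [-v for v in rolling_min_blocks(x, [-v for v in f], k)]
```

Applying `rolling_min_blocks` followed by `rolling_max_blocks` to the result gives the opening, and subtracting the opening from the original intensities gives the top-hat. Each output value costs at most one comparison of a `g` value with an `h` value, whatever the value of `k`.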
Results and discussion
Presented in this paper is a pipeline to automate the baseline subtraction step in proteomic TOF-MS preprocessing. The pipeline consists of transforming the m/z-axis, finding an appropriate SE size via an automated peak width estimation algorithm on the transformed scale, applying a novel algorithm to perform the top-hat baseline subtraction and, finally, returning the baseline subtracted spectra by back-transforming the data to the m/z scale.
There remain two elements of the pipeline to be assessed. Firstly, for the pipeline to be fully automated, an optimal combination of EPCP value and transformation needs to be found. In the next section we perform a grid search over EPCP values of 0.8, 0.85, 0.9, 0.95, 0.98, 0.99 and 1, and transformations \(t_{0}, t_{1}, t_{2}, t_{3}, t_{4}, t_{5}\), to find which combination provides the closest baseline subtracted signal to the gold-standard. Provided sufficient similarity to the gold-standard is achieved, it is hoped that a consensus on the optimal combination of EPCP value and transformation can be found over all datasets, despite their varying attributes. If a consensus is indeed found, the pipeline is likely to be applicable to other proteomic TOF-MS datasets.
A theoretical and empirical assessment of the efficiency of the CLSA in comparison to naive rolling window algorithms then follows. The theoretical efficiency is discussed with respect to the number of operations required over all the elements input into the CLSA. By performing the top-hat operation on the six proteomic TOF-MS datasets and simulated datasets of varying sizes, the computational time required for the CLSA versus the naive algorithm provides an empirical assessment of their relative efficiencies.
Comparison of piecewise and transformed axis baseline subtraction
No single transformation or EPCP was optimal; however, an EPCP between 0.95 and 0.99 provided the optimal AMASE value for all datasets, suggesting the peak width estimation process is relatively stable. On the Fiedler, Yildiz, Taguchi and Mantini datasets, the null transformation, which implicitly assumes a constant peak width across the m/z-axis, was not valid as AMASE values were notably higher than for the remaining transformations. The transformations \(t_{2}\), \(t_{3}\), \(t_{4}\) and \(t_{5}\) produced the best results. It should be noted that the transformations \(t_{3}\), \(t_{4}\) and \(t_{5}\) produced very similar AMASE values. With the exception of the Yildiz dataset, using these transformations with an EPCP of 0.95 produced sensible results. Please see ‘Additional file 3’ for more details relating to the transformation and EPCP optimisation.
Because the gold-standard baseline estimate is subject to expert input and opinion, the differences seen between the gold-standard and the optimal AMASE baseline estimate are not of concern as both look sensible. The non-optimal baseline estimate produces a reasonable automated baseline subtraction; however, it can be seen that this estimate does undercut the peaks, especially at high m/z values.
Efficiency of the CLSA compared to the naive rolling window
The naive rolling minimum algorithm consists of the linear-time process of finding the indexes of points at the upper and lower edges of the sliding window for each element, by incrementing the edge indexes from the previous element when required. Using \(a_{k}\) as the average number of data points in the sliding window of size k, the computational cost of finding the minimum value in the window requires approximately \(a_{k}-1\) comparisons per element. This is because each element requires, on average, a minimum or maximum comparison of all the data points in the window except one: the first data point does not require a comparison. The resulting computational complexity is \(\mathcal {O}\left (a_{k}n\right)\) for the naive algorithm, which is dependent on the size of the sliding window and the number of elements in X.
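The comparison count described above can be checked with a small Python sketch of a naive rolling minimum. The function name `naive_rolling_min` and the returned comparison counter are introduced here purely for illustration, and the evenly spaced example data are made up.

```python
def naive_rolling_min(x, f, k):
    """Naive rolling minimum over [x[i] - k/2, x[i] + k/2]; also counts the comparisons made."""
    n = len(x)
    out, comparisons = [], 0
    lo = hi = 0
    for i in range(n):
        # Advance the window edges from their previous positions (linear time overall).
        while x[lo] < x[i] - k / 2:
            lo += 1
        while hi + 1 < n and x[hi + 1] <= x[i] + k / 2:
            hi += 1
        # Scan the whole window: one comparison per point after the first.
        current = f[lo]
        for j in range(lo + 1, hi + 1):
            comparisons += 1
            if f[j] < current:
                current = f[j]
        out.append(current)
    return out, comparisons


x = list(range(100))
f = [(37 * i) % 101 for i in x]
minima, comparisons = naive_rolling_min(x, f, k=10)
print(comparisons / len(x))   # average number of comparisons per element
```

With 100 evenly spaced points and a window of width 10, interior windows hold 11 points and therefore cost 10 comparisons each. Doubling the window width roughly doubles the total, which is the dependence on \(a_{k}\) in the complexity stated above.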
Like the LSA, the CLSA is a linear-time algorithm irrespective of the window size, k. For the CLSA, a linear-time progression through the n elements is required to assign integers of the Θ-vector, as each element is an integer equal to or greater than that which precedes it. The linear-time process of finding the \(W^{\triangledown}\) and \(W^{\vartriangle}\) indexes at the lower and upper edges of the sliding window, respectively, for each element is similar to that required in the naive algorithm. One linear-time sweep forward and one linear-time sweep back on the data is required to create g and h. A final sweep of the created vectors \(W^{\triangledown}\), \(W^{\vartriangle}\), Θ, g and h is required to compute the \(r_{\min}\) values. Each \(r_{\min}(f(x_{i}))\) calculation requires only the tests \(\theta_{w^{\triangledown}_{i}} = \theta_{w^{\vartriangle}_{i}+1}\) and \(\theta_{w^{\triangledown}_{i}-1} = \theta_{w^{\vartriangle}_{i}}\), or the calculation \(\min\{g(x_{i}),h(x_{i})\}\). It can therefore be deduced that the CLSA is of \(\mathcal {O}(n)\) complexity, requiring a series of linear-time operations that are, importantly, independent of the length of the sliding window, k.
Given the MS application, the \(a_{k}-1\) operations per element in the naive algorithm would be much larger than the constant number of operations required per element for the CLSA, and efficiency strongly favours the CLSA. It should be pointed out that the CLSA requires extra memory beyond that of the iterative algorithm for the creation of the vectors \(W^{\triangledown}\), \(W^{\vartriangle}\), Θ, g and h. Another computational advantage of the CLSA is that by using the minimum of the two temporary vectors g and h, as opposed to the minimum of a non-constant number of data points for each \(x_{i}\in X\), vectorised programming can be utilised instead of loops. This is of significant advantage in interpreted programming languages such as R.
Using the clsa package, the CLSA and naive sliding window algorithms were compared for the computational time required to calculate the top-hat on real and simulated data. The computations were performed on a 21.5” iMac (late 2013 model, 2.7 GHz Intel Core i5, 8 GB 1600 MHz DDR3 memory, OS X 10.10.2). To optimise speed, the calculations requiring iterative looping were performed using compiled C code for both the CLSA and naive algorithms. The code to run the test of computational running time on the simulated data is provided in ‘Additional file 5’.
Table 2 Computational time to perform top-hat baseline subtraction in the transformed space using the naive and CLSA algorithms on the six datasets under study
Data  Number of spectra  Number of m/z values  Naive algorithm time (sec)  CLSA time (sec)
Fiedler  16  42388  7.7  0.2
Yildiz  264  75958  312.6  5.5
Wu  89  91378  34.1  1.7
Adam  326  8461  3.0  0.7
Taguchi  210  19234  18.0  0.9
Mantini  30  32967  6.7  0.2
Table 3 Computational time in seconds to perform top-hat baseline subtraction in the transformed space using the naive and CLSA algorithms on synthetic data for varying data assumptions and SE sizes. Column headings give the algorithm and the window size as a percentage of the x-axis
Number of points n (×10^{4})  Naive 0.5%  Naive 1%  Naive 2%  Naive 5%  Naive 10%  Naive 20%  CLSA 0.5%  CLSA 1%  CLSA 2%  CLSA 5%  CLSA 10%  CLSA 20%
1  0.1  0.1  0.3  0.6  1.2  2.3  0.0  0.0  0.2  0.0  0.0  0.2
2  0.3  0.5  1.0  2.5  4.9  9.2  0.1  0.1  0.1  0.1  0.1  0.1
3  0.6  1.2  2.3  5.7  11.0  20.6  0.3  0.1  0.1  0.2  0.1  0.1
4  1.1  2.1  4.2  10.2  19.6  36.5  0.1  0.2  0.1  0.1  0.3  0.1
5  1.7  3.3  6.5  15.8  30.5  56.8  0.2  0.3  0.2  0.2  0.3  0.2
6  2.4  4.7  9.3  22.7  44.0  82.0  0.2  0.2  0.3  0.2  0.2  0.3
7  3.2  6.4  12.6  30.9  59.7  111.7  0.2  0.2  0.4  0.2  0.2  0.4
8  4.2  8.4  16.6  40.3  78.3  146.4  0.4  0.3  0.3  0.4  0.3  0.3
9  5.4  10.6  21.0  51.1  98.8  185.3  0.4  0.3  0.3  0.4  0.3  0.3
10  6.6  13.0  25.9  63.1  121.8  228.7  0.3  0.4  0.3  0.3  0.5  0.3
The CLSA was faster than the naive algorithm in every scenario, as shown in Table 3. As expected, the computational time was approximately constant for the CLSA irrespective of the window size for a fixed number of points (number of transformed m/z values). The difference in computational time between the two algorithms was reasonably small for small datasets and small SE sizes. However, for a typical number of m/z points seen in practice, say 50,000, and a moderate window size that on average encapsulates 500 points (1% of the x-axis), the CLSA provides an order of magnitude increase in speed.
Conclusion
The current gold standard in baseline subtraction is a piecewise approach that is performed manually, that is, by inspection. Piecewise baseline subtraction is typically performed because, as we have consistently observed with the datasets analysed in this paper, the properties of the spectra do not remain constant over their domain. In particular, a spectrum’s peak width increases with increasing m/z-values. We have proposed that a new baseline subtraction pipeline be adopted for the correction of proteomic mass spectra, one which avoids both the manual user input and the piecewise-subtraction aspect of existing methods. Our new pipeline is based on the premise that a suitable transformation of the m/z-axis can be found which removes the relationship between peak width and peak location.
As part of the new pipeline, we propose a method to create data-based, and therefore data-specific, peak-width estimates from smoothed spectra. Even if this step is not used to automate baseline subtraction, it provides a sensible initial SE size that adapts to each individual dataset. Our generalised version of the LSA, which we call CLSA, is also presented in the paper. CLSA can be applied to unevenly or evenly spaced data and is not limited in its application to proteomic MS data. Should a transformation be known to create peak widths independent of m/z-location in proteomic MS data, an efficient and effective baseline subtraction can be performed using the top-hat operator with a CLSA implementation. A major contribution is our demonstration that CLSA far outperforms the naive rolling minimum algorithm in required computational time, by an order of magnitude or more on numerous datasets of real-world complexity.
The transformed and constant-sized window approach may suffer from a slight but largely unnoticeable reduction in sensitivity in comparison with the piecewise approach. The trade-off between the exactness of the piecewise approach and the speed of the automated transformation and continuous approach may be a consideration, especially if a known relationship exists between the peak width and peak location.
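As a rough illustration of the constant-window idea summarised in this section, the sketch below applies a top-hat operator (the signal minus its morphological opening) on a transformed m/z-axis. It is a minimal sketch under stated assumptions rather than the pipeline's implementation: the log transformation, the half-width `k` and all function names are hypothetical choices made for the example, and the rolling minimum and maximum are computed naively instead of with the CLSA.

```python
import math


def rolling_extreme(t, y, k, pick):
    """Apply pick (min or max) to y over all points with |t_j - t_i| <= k.

    t must be sorted in ascending order; spacing may be uneven.
    """
    n = len(t)
    out = []
    lo = hi = 0
    for i in range(n):
        while t[lo] < t[i] - k:              # advance the left edge of the window
            lo += 1
        while hi < n and t[hi] <= t[i] + k:  # advance the right edge of the window
            hi += 1
        out.append(pick(y[lo:hi]))
    return out


def tophat_baseline_subtract(mz, intensity, k, transform=math.log):
    """Top-hat baseline subtraction with a constant window on a transformed axis.

    transform is assumed to be monotonically increasing so that the
    transformed axis stays sorted; math.log is only an example choice.
    Returns (baseline-subtracted intensities, estimated baseline).
    """
    t = [transform(m) for m in mz]
    eroded = rolling_extreme(t, intensity, k, min)  # erosion: rolling minimum
    baseline = rolling_extreme(t, eroded, k, max)   # dilation of the erosion = opening
    corrected = [s - b for s, b in zip(intensity, baseline)]
    return corrected, baseline
```

Because `k` is fixed on the transformed scale, the window covers a wider range of raw m/z-values at higher masses, which is how a single window size can accommodate peaks that widen with m/z.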
Availability of supporting data

A subset of the MALDI TOF-MS data generated by the study [37] is distributed with the publicly available R package MALDIquant [7].

Available at http://www.vicc.org/biostatistics/serum/JTO2007.htm.

Previously available at http://bioinformatics.med.yale.edu/MSDATA.

Data were obtained on request from the authors of [41]. However, some Eastern Virginia Medical School data are available at http://edrn.nci.nih.gov/sciencedata.

Available at http://www.vicc.org/biostatistics/download/WSData.zip.

Available at http://www.biomedcentral.com/content/supplementary/147121058101S2.zip.
Abbreviations
 AMASE:

Average mean absolute scaled error
 CLSA:

Continuous line segment algorithm
 Da:

Dalton
 EPCP:

Estimated peak coverage proportion
 LOESS:

Locally weighted scatterplot smoothing
 LSA:

Line segment algorithm
 MALDI:

Matrix-assisted laser desorption/ionization
 MASE:

Mean absolute scaled error
 MS:

Mass spectrometry
 MSE:

Mean squared error
 SNIP:

Sensitive nonlinear iterative peak
 TOF:

Time of flight
Declarations
Acknowledgements
Thank you to the creators and custodians of the publicly available data used in this manuscript. We would also like to thank the anonymous reviewer for their time and constructive comments that have improved this manuscript.
Funding
Portions of the work were undertaken as part of TS’s PhD, which was financially supported by an Australian Postgraduate Award scholarship.
Authors’ contributions
TS and PS developed the statistical and analytical methods. CB and PS provided guidance on the analysis of proteomic data. TS developed the code and implementation. All authors contributed to the writing of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
The research was exempt from formal University of Adelaide Human Research Ethics Committee approval according to the Australian National Statement on Ethical Conduct in Human Research, 2007.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Albrethsen J. Reproducibility in protein profiling by MALDI-TOF mass spectrometry. Clin Chem. 2007; 53(5):852–8.
2. Kulasingam V, Diamandis EP. Strategies for discovering novel cancer biomarkers through utilization of emerging technologies. Nat Clin Pract Oncol. 2008; 5(10):588–99.
3. Hortin GL. The MALDI-TOF mass spectrometric view of the plasma proteome and peptidome. Clin Chem. 2006; 52(7):1223–37.
4. Croxatto A, Prod’hom G, Greub G. Applications of MALDI-TOF mass spectrometry in clinical diagnostic microbiology. FEMS Microbiol Rev. 2012; 36(2):380–407.
5. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2014. http://www.Rproject.org/.
6. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004; 5:80.
7. Gibb S, Strimmer K. MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics. 2012; 28(17):2270–1.
8. Li X. PROcess: Ciphergen SELDI-TOF Processing. 2005. R package version 1.42.0. http://bioconductor.org/packages/release/bioc/html/PROcess.html. Accessed July 2015.
9. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G. XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching and identification. Anal Chem. 2006; 78:779–87.
10. Stanford TE. Statistical analysis of proteomic mass spectrometry data for the identification of biomarkers and disease diagnosis. PhD thesis, School of Mathematical Sciences, The University of Adelaide; 2015.
11. Glish GL, Vachet RW. The basics of mass spectrometry in the twenty-first century. Nat Rev Drug Discov. 2003; 2(2):140–50.
12. Savitzky A, Golay MJE. Smoothing and differentiation of data by simplified least squares procedures. Anal Chem. 1964; 36(8):1627–39.
13. Yang YH, Buckley MJ, Dudoit S, Speed TP. Comparison of methods for image analysis on cDNA microarray data. J Comput Graph Stat. 2002; 11:108–36.
14. Mayer CD, Glasbey CA. Statistical methods in microarray gene expression data analysis. In: Husmeier D, Dybowski R, Roberts S, editors. Probabilistic Modeling in Bioinformatics and Medical Informatics. Advanced Information and Knowledge Processing. London: Springer; 2005. p. 211–38.
15. Sauve AC, Speed TP. Normalization, baseline correction and alignment of high-throughput mass spectrometry data. In: Proceedings of the Genomic Signal Processing and Statistics workshop. Johns Hopkins University, Baltimore, MD, May 26–27; 2004.
16. Kohlbacher O, Reinert K, Gröpl C, Lange E, Pfeifer N, Schulz-Trieglaff O, Sturm M. TOPP - the OpenMS proteomics pipeline. Bioinformatics. 2007; 23(2):191–7.
17. Lange E, Gröpl C, Schulz-Trieglaff O, Leinenbach A, Huber C, Reinert K. A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics. 2007; 23(13):273–81.
18. Sturm M, Bertsch A, Gröpl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, Schulz-Trieglaff O, Zerck A, Reinert K, Kohlbacher O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics. 2008; 9(1):163.
19. Bauer C, Kleinjung F, Smith C, Towers M, Tiss A, Chadt A, Dreja T, Beule D, Al-Hasani H, Reinert K, Schuchhardt J, Cramer R. Biomarker discovery and redundancy reduction towards classification using a multi-factorial MALDI-TOF MS T2DM mouse model dataset. BMC Bioinformatics. 2011; 12(1):140.
20. Morháč M. An algorithm for determination of peak regions and baseline elimination in spectroscopic data. Nuclear Instruments Methods Phys Res Sect A Accelerators Spectrometers Detectors Assoc Equip. 2009; 600(2):478–87.
21. Ryan CG, Clayton E, Griffin WL, Sie SH, Cousens DR. SNIP, a statistics-sensitive background treatment for the quantitative analysis of PIXE spectra in geoscience applications. Nuclear Instruments Methods Phys Res Sect B Beam Interact Mater Atoms. 1988; 34(3):396–402.
22. Yang C, He Z, Yu W. Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics. 2009; 10(1):4. doi:http://dx.doi.org/10.1186/14712105104.
23. Dougherty E. Mathematical Morphology in Image Processing. New York: Marcel Dekker; 1992.
24. Soille P. Morphological Image Analysis: Principles and Applications. Secaucus: Springer; 1999.
25. van Herk M. A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels. Pattern Recogn Lett. 1992; 13(7):517–21.
26. Gil J, Werman M. Computing 2-D min, median, and max filters. IEEE Trans Pattern Anal Mach Intell. 1993; 15:504–7.
27. van Herk M, de Munck JC, Lebesque JV, Muller S, Rasch C, Touw A. Automatic registration of pelvic computed tomography data and magnetic resonance scans including a full circle method for quantitative accuracy evaluation. Med Phys. 1998; 25:2054.
28. Heneghan C, Flynn J, O’Keefe M, Cahill M. Characterization of changes in blood vessel width and tortuosity in retinopathy of prematurity using image analysis. Med Image Anal. 2002; 6(4):407–29.
29. Zhang G, Ueberheide BM, Waldemarson S, Myung S, Molloy K, Eriksson J, Chait BT, Neubert TA, Fenyö D. Protein quantitation using mass spectrometry. In: Fenyö D, editor. Computational Biology. Methods in Molecular Biology. Vol. 673. New York: Humana Press; 2010. p. 211–22.
30. Greengard L, Lee JY. Accelerating the nonuniform fast Fourier transform. SIAM Rev. 2004; 46(3):443–54.
31. Lo AW, MacKinlay AC. An econometric analysis of nonsynchronous trading. J Econ. 1990; 45(1–2):181–211.
32. Aris A, Shneiderman B, Plaisant C, Shmueli G, Jank W. Representing unevenly-spaced time series data for visualization and interactive exploration. In: Human-Computer Interaction - INTERACT 2005. Heidelberg: Springer; 2005. p. 835–46.
33. Schulz M, Mudelsee M. REDFIT: estimating red-noise spectra directly from unevenly spaced paleoclimatic time series. Comput Geosci. 2002; 28(3):421–6.
34. Deeming TJ. Fourier analysis with unequally-spaced data. Astrophys Space Sci. 1975; 36(1):137–58.
35. Scargle JD. Studies in astronomical time series analysis. II - Statistical aspects of spectral analysis of unevenly spaced data. Astrophys J. 1982; 263:835–53.
36. Bourgeois M, Wajer F, van Ormondt D, Graveron-Demilly D. Modern Sampling Theory. Applied and Numerical Harmonic Analysis. In: Benedetto JJ, Ferreira PJSG, editors. Boston: Birkhäuser; 2001. p. 343–63.
37. Fiedler GM, Baumann S, Leichtle A, Oltmann A, Kase J, Thiery J, Ceglarek U. Standardized peptidome profiling of human urine by magnetic bead separation and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Clin Chem. 2007; 53(3):421–8.
38. Yildiz PB, Shyr Y, Rahman JSM, Wardwell NR, Zimmerman LJ, Shakhtour B, Gray WH, Chen S, Li M, Roder H, Liebler DC, Bigbee WL, Siegfried JM, Weissfeld JL, Gonzalez AL, Ninan M, Johnson DH, Carbone DP, Caprioli RM, Massion PP. Diagnostic accuracy of MALDI mass spectrometric analysis of unfractionated serum in lung cancer. J Thoracic Oncol Off Publ Intl Assoc Study Lung Cancer. 2007; 2(10):893.
39. Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics. 2003; 19(13):1636–43.
40. Yu W, Li X, Liu J, Wu B, Williams KR, Zhao H. Multiple peak alignment in sequential data analysis: a scale-space-based approach. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2006; 3(3):208–19.
41. Adam BL, Qu Y, Davis JW, Ward MD, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z, Jr GLW, Clements MA, Cazares LH. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res. 2002; 62(13):3609–14.
42. Taguchi F, Solomon B, Gregorc V, Roder H, Gray R, Kasahara K, Nishio M, Brahmer J, Spreafico A, Ludovini V, Massion PP, Dziadziuszko R, Schiller J, Grigorieva J, Tsypin M, Hunsucker SW, Caprioli R, Duncan MW, Hirsch FR, Bunn PA, Carbone DP. Mass spectrometry to classify non-small-cell lung cancer patients for clinical outcome after treatment with epidermal growth factor receptor tyrosine kinase inhibitors: a multicohort cross-institutional study. J Natl Cancer Inst. 2007; 99(11):838–46.
43. Li M, Chen S, Zhang J, Chen H, Shyr Y. Wavespec: a preprocessing package for mass spectrometry data. Bioinformatics. 2011; 27(5):739–40.
44. Mantini D, Petrucci F, Pieragostino D, Del Boccio P, Di Nicola M, Di Ilio C, Federici G, Sacchetta P, Comani S, Urbani A. LIMPIC: a computational method for the separation of protein MALDI-TOF-MS signals from noise. BMC Bioinformatics. 2007; 8(1):101.
45. Siuzdak G. The expanding role of mass spectrometry in biotechnology. San Diego: MCC Press; 2006.
46. House LL, Clyde MA, Wolpert RL. Bayesian nonparametric models for peak identification in MALDI-TOF mass spectroscopy. Ann Appl Stat. 2011; 5(2B):1488–511.
47. Box GEP, Cox DR. An analysis of transformations. J R Stat Soc Series B (Methodological). 1964; 26(2):211–52.
48. Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. Intl J Forecasting. 2006; 22(4):679–88.
49. Eckner A. Algorithms for unevenly-spaced time series: Moving averages and other rolling operators. Technical report, Working Paper. 2013. http://www.eckner.com/papers/ts_alg.pdf. Accessed June 2015.