Data mining of plasma peptide chromatograms for biomarkers of air contaminant exposures

Background: Interrogation of chromatographic data for biomarker discovery becomes a tedious task due to stochastic variability in retention times arising from solvent and column performance. The difficulty is further compounded when the effects of exposure (e.g. to environmental contaminants) and biological variability result in varying numbers and intensities of peaks among chromatograms.

Results: We developed a software tool to correct the stochastic time shifts in chromatographic data through iterative selection of landmark peaks and isometric interpolation to improve alignment of all chromatographic peaks. To illustrate application of the tool, plasma peptides from Fischer rats exposed for 4 h to clean air or Ottawa urban particles (EHC-93) were separated by HPLC with autofluorescence detection, and the retention time shifts between chromatograms were corrected (dewarped). Both dewarped and non-dewarped datasets were then mined for models containing peptide peaks that best discriminate among the treatment groups using ClinproTools™. In general, models generated by dewarped datasets were able to better classify test sample chromatograms into either clean air or EHC-93 exposure groups, and 0 or 24 h post-recovery time groups. Peak areas of peptides in a model that produced the best discrimination of treatment groups were analyzed by two-way ANOVA with exposure (clean air, EHC-93) and recovery time (0 h, 24 h) as factors. Statistically significant (p < 0.05) time-dependent and exposure-dependent increases and decreases were noted, establishing these as biomarker candidates for further validation.

Conclusion: Our software tool provides a simple and portable approach for alignment of chromatograms with complex, bi-directional retention time shifts prior to data mining. Reliable biomarker discovery can be achieved through chromatographic dewarping using our software followed by pattern recognition by commercial data mining applications.


Chromatographic Data Preparation:
i) Export chromatographic data in the ASCII text format. When opened in a text editor (e.g. Notepad), the chromatographic data file should have elution time as the first column and either fluorescence intensity or absorbance (depending on the detector used) as the second column (Fig. 1).

Fig. 1. Raw chromatographic data in the text (.txt) format
ii) Open the data file using a spreadsheet application (e.g. Microsoft Excel). Generate a third column and fill the cells that correspond to existing data rows with asterisks. Identify the data point that corresponds to the end of the solvent front (first sample peak in the data series) and mark it by replacing the asterisk with a lower-case 'x'. Save the data as a new file, retaining the text format (.txt file) and editing the filename to add an 'x' at the beginning and a '~' at the end (e.g. 060920Stds1.txt saved as x060920Stds1~.txt). An example of a file with the solvent front marked is shown in Fig. 2.

Fig. 2. Chromatographic data after marking the end of the solvent front. The file is now ready for processing within DewarpTool.
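The marking step can also be scripted instead of done by hand in a spreadsheet. The following is a minimal sketch, not part of DewarpTool: the function names are hypothetical, and it assumes the exported file is tab-delimited and that the row index of the end of the solvent front has already been chosen by inspection.

```python
from pathlib import Path


def marked_name(filename: str) -> str:
    """Build the marked-file name: 'x' at the start, '~' before the extension."""
    p = Path(filename)
    return f"x{p.stem}~{p.suffix}"


def mark_solvent_front(lines, front_index):
    """Append a third column: 'x' on the solvent-front row, '*' on all others.

    Assumes tab-delimited rows (an assumption; adjust for other exports).
    """
    marked = []
    for i, line in enumerate(lines):
        flag = "x" if i == front_index else "*"
        marked.append(f"{line.rstrip()}\t{flag}")
    return marked


if __name__ == "__main__":
    rows = ["0.00\t1.2", "0.01\t1.3", "0.02\t9.8"]
    print(marked_name("060920Stds1.txt"))
    print("\n".join(mark_solvent_front(rows, front_index=2)))
```

The output lines can be written to the renamed file with `Path(...).write_text`, keeping the .txt format that the tool expects.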

Baseline Correction and Smoothening
i) Launch DewarpTool by selecting Start, Programs, DewarpTool. To import solvent-front marked data files into the application, click on the "Load Solvent Front Marked Files for Baseline Correction and Smoothening" button on the interface (circled blue, Fig. 3). This opens a file selection window.

ii) Using the file selection window, select all solvent-front marked files (having names starting with an 'x') and click on 'Pre-Process and Load Files' to start baseline correction and smoothening. After baseline correction and smoothening, DewarpTool automatically truncates the solvent fronts (the chromatographic section leading up to the first sample peak) and standardizes the lengths of all chromatograms so that they contain the same number of data points. As the processing of each file is completed, it is automatically saved in the same directory as the source file with the prefix 'Z' added to the filename (e.g. x060920Stds1~.txt is saved as Zx060920Stds1~.txt) and loaded into the program.
The first two of the loaded files are displayed for peak matching and identification of marker peaks. The other files are not displayed, but they remain available for display and peak selection.
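The pre-processing sequence described above (baseline correction, smoothening, solvent-front truncation, length standardization) can be illustrated with simple stand-ins. The algorithms DewarpTool actually uses are not specified in this manual, so everything below is an assumption: a constant-offset baseline, a centred moving average, and trimming to the shortest trace. All function names are hypothetical.

```python
def subtract_baseline(values):
    """Assumed baseline model: shift the trace so its minimum sits at zero."""
    floor = min(values)
    return [v - floor for v in values]


def moving_average(values, window=5):
    """Centred moving average; the window shrinks at the edges of the trace."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def truncate_front(values, front_index):
    """Drop everything before the row marked 'x' (end of the solvent front)."""
    return values[front_index:]


def standardize_length(traces):
    """Assumed rule: trim every trace to the length of the shortest one."""
    n = min(len(t) for t in traces)
    return [t[:n] for t in traces]


if __name__ == "__main__":
    raw = [[5, 5, 9, 6, 5, 5, 7], [4, 4, 4, 8, 5, 4]]
    fronts = [1, 2]  # hypothetical solvent-front row indices
    processed = [
        truncate_front(moving_average(subtract_baseline(t), 3), f)
        for t, f in zip(raw, fronts)
    ]
    print(standardize_length(processed))
```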

iii) When peak matching and landmark peak identification on baseline corrected and smoothened data are to be continued at a later time, the files can be loaded directly into the program by clicking on the "Load Processed Data for Marker Selection" button (circled purple, Fig. 3) on the main program interface.

Selection and Matching of Landmark Peaks ("Markers")
Landmark peaks are peaks common to two or more loaded chromatograms. As they are identified, they are uniquely and sequentially numbered within the program. It is therefore useful to analyze all the chromatograms from an experiment as a set, so that the landmark peaks remain unique and are correctly identified within the experiment.

i) Selection and matching of landmark peaks: Mark a peak as a landmark peak by bringing the slider under a chromatogram (Fig. 4, Region D) to the centre of the peak, either by moving the slider or by using the coarse and fine, forward (>, >>) and backward (<, <<) slider adjustment buttons. Once the slider is centred on a peak, as verified in the magnification view (Fig. 4, Region E), mark it as a new peak by pressing the "Select as New Peak" button under the magnification view. When a matching peak is detected in the second chromatogram, mark it as a matching peak by pressing the 'This Peak Matches Current Peak' button. Peak selection and matching are also facilitated by the ability to display small chromatographic sections lying between two selected landmark peaks (Fig. 4, Region B) and to rescale the X and Y axes (Fig. 4, Region E).
ii) Removal of a landmark peak: If a peak has been erroneously chosen as a landmark peak, it can be deleted by selecting the marker and then clicking on the 'Delete Marker' button beside the peak list (Fig. 4, Region C).

iii) Selection of a new file for landmark peak selection: The same landmark peak is identified and marked in multiple chromatograms from an experiment by retaining a chromatogram containing the landmark peak as the reference chromatogram and displaying the chromatogram in which the marker is to be identified as the second chromatogram (from the file list) (Fig. 4, Region A). After selecting markers in a chromatogram, click on the 'Save Marker to Data File' button to save the selected landmark peaks to the underlying chromatographic data file. The landmark-peak selected files are saved in the same directory as the source files with the prefix 'M' added to their file names.
Both landmark peak selected files (filenames starting with M) and previously dewarped files (filenames starting with DW, see below) can be loaded along with baseline corrected/smoothened chromatograms (filenames starting with Z) using the "Load Processed Data for Marker Selection" button on the DewarpTool main interface, to facilitate landmark peak selection. During iterative dewarping, existing landmark peaks in dewarped chromatograms enable finding additional peaks in the inter-landmark peak regions.
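Because each processing stage adds a prefix to the filename, the stage of a file can be read from its name. The sketch below is a hypothetical helper, not a DewarpTool feature. It assumes the prefixes accumulate in the order described in this manual ('x', then 'Z', then 'M', then 'DW'), and it will misclassify a raw file whose original name happens to start with one of these letters.

```python
# Prefixes are checked from the latest stage to the earliest, because later
# prefixes are added in front of earlier ones (e.g. 'Zx060920Stds1~.txt').
STAGES = (
    ("DW", "dewarped"),
    ("M", "landmark peaks selected"),
    ("Z", "baseline corrected and smoothened"),
    ("x", "solvent front marked"),
)


def processing_stage(filename: str) -> str:
    """Return the latest processing stage implied by the filename prefix."""
    for prefix, stage in STAGES:
        if filename.startswith(prefix):
            return stage
    return "raw export"


if __name__ == "__main__":
    for name in ("060920Stds1.txt", "x060920Stds1~.txt", "Zx060920Stds1~.txt"):
        print(name, "->", processing_stage(name))
```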

Fig. 5. DewarpTool Main Interface: Option to dewarp currently loaded marker-identified files.
i) Load all landmark peak selected chromatograms from an experiment by clicking on the 'Load Processed Data for Marker Selection' button on the main interface, unless the file lists confirm that they are already loaded into the program.

ii) Click on "Dewarp Marker Identified Chromatograms" (circled blue, Fig. 5) to start compilation of landmark peaks and dewarping. As each chromatogram is dewarped, a new file is generated, named with the prefix 'DW', and saved in the same folder as the original marked data files.
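The idea behind landmark-based dewarping can be sketched as follows. This is an illustration only, not DewarpTool's implementation: it uses piecewise-linear interpolation between landmark peaks as a stand-in for the tool's interpolation scheme, the function names are hypothetical, and it assumes the landmark times are strictly increasing and lie strictly inside the time range of the chromatogram.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_right(xs, x)
    x0, x1 = xs[j - 1], xs[j]
    y0, y1 = ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def dewarp(times, intensities, sample_marks, reference_marks):
    """Resample a chromatogram so its landmarks fall at the reference times.

    sample_marks[i] and reference_marks[i] are the retention times of the
    same landmark peak in the sample and in the reference chromatogram.
    """
    # Pin the first and last time points so only the interior is stretched.
    ref = [times[0]] + list(reference_marks) + [times[-1]]
    smp = [times[0]] + list(sample_marks) + [times[-1]]
    corrected = []
    for t in times:
        source_t = interp(t, ref, smp)  # sample time that maps onto t
        corrected.append(interp(source_t, times, intensities))
    return corrected


if __name__ == "__main__":
    times = [0, 1, 2, 3, 4]
    trace = [0, 0, 10, 0, 0]          # landmark peak eluted at t = 2
    print(dewarp(times, trace, sample_marks=[2], reference_marks=[3]))
```

Because each inter-landmark segment is stretched or compressed independently, shifts in both directions within one chromatogram can be corrected, and adding landmarks in later iterations refines the alignment between existing ones.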

Differential Analysis:
The differential analysis component of the program allows generation of an average chromatogram of all dewarped chromatograms from a treatment group, and visualization of two average chromatograms along with the difference trace to determine changes that may be of significance.

i) Group dewarped chromatograms from different treatment groups into separate folders (e.g. separate folders for control, treatment1, treatment2 chromatograms etc.). Select a set of dewarped chromatograms by clicking on the "Average Dewarped Chromatograms" button (circled blue, Fig. 6) on the main program interface, and selecting the chromatograms using the file selection window.

ii) Click on the 'Average Dewarped Spectra' button (Fig. 6) to generate an average chromatogram for the treatment group. An average chromatogram file called 'AvgChromatogram.txt' is generated and saved in the same folder as the original dewarped chromatograms. Generate averages for other groups of chromatograms as desired.
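The group average amounts to a point-by-point mean over the dewarped chromatograms of one treatment group. A minimal sketch, assuming each chromatogram is already loaded as a list of intensities of equal length (the function name is hypothetical):

```python
def average_chromatogram(traces):
    """Point-by-point mean of equally long intensity traces."""
    n = len(traces[0])
    if any(len(t) != n for t in traces):
        raise ValueError("all chromatograms must have the same number of points")
    return [sum(column) / len(traces) for column in zip(*traces)]


if __name__ == "__main__":
    control = [[1, 2, 3], [3, 4, 5]]
    print(average_chromatogram(control))
```

Averaging is only meaningful after dewarping, since unaligned peaks would otherwise be smeared across neighbouring time points.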

iii) To generate a difference chromatogram between two average chromatograms, click on 'Generate Difference Chromatogram' (Fig. 7) and select the average chromatograms for differential analysis using the file selection window. Click on 'Generate Difference Display' on the file selection window.

iv) A new window opens with a differential display (Fig. 8) showing the two average chromatograms and the difference trace.

Fig. 8. Differential Display Interface
A magnification view provides a high-resolution visualization of a single peak or a small segment of the time axis, as defined by using the sliders on the full chromatogram display.
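The difference trace shown in the differential display can be reproduced outside the tool by subtracting one average chromatogram from the other. The sketch below uses hypothetical function names and assumes a "treatment minus control" sign convention, which this manual does not specify.

```python
def difference_trace(control_avg, treatment_avg):
    """Point-by-point difference, treatment minus control (assumed convention)."""
    if len(control_avg) != len(treatment_avg):
        raise ValueError("average chromatograms must have the same length")
    return [t - c for c, t in zip(control_avg, treatment_avg)]


def largest_changes(diff, top=3):
    """Indices of the data points with the largest absolute difference."""
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return order[:top]


if __name__ == "__main__":
    diff = difference_trace([1, 2, 3], [1, 5, 2])
    print(diff, largest_changes(diff, top=1))
```

Ranking points by absolute difference gives a quick list of regions worth inspecting in the magnification view; it is not a statistical test.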
The dewarped chromatograms (filenames starting with 'DW') can also be used within other applications for mining of discriminatory biomarker patterns after adjustments to the file format as may be required by external applications.