PrestOMIC, an open source application for dissemination of proteomic datasets by individual laboratories
© Howes and Foster; licensee BioMed Central Ltd. 2007
- Received: 10 February 2007
- Accepted: 06 June 2007
- Published: 06 June 2007
Technological advances in mass spectrometry and other detection methods are leading to larger and larger proteomics datasets. However, when papers describing such information are published the enormous volume of data can typically only be provided as supplementary data in a tabular form through the journal website. Several journals in the proteomics field, together with the Human Proteome Organization's (HUPO) Proteomics Standards Initiative and institutions such as the Institute for Systems Biology are working towards standardizing the reporting of proteomics data, but just defining standards is only a means towards an end for sharing data. Data repositories such as ProteomeCommons.org and the Open Proteomics Database allow for public access to proteomics data but provide little, if any, interpretation.
Results & conclusion
Here we describe PrestOMIC, an open source application for storing mass spectrometry-based proteomic data in a relational database and for providing a user-friendly, searchable and customizable browser interface to share one's data with the scientific community. The underlying database and all associated applications are built on other existing open source tools, allowing PrestOMIC to be modified as the data standards evolve. We then use PrestOMIC to present a recently published dataset from our group through our website.
- Venn Diagram
- Tandem Mass Spectrum
- Main Page
- Schema File
- Open Source Application
Mass spectrometry (MS)-based proteomic methods have developed to the point where many laboratories can routinely identify hundreds or thousands of proteins from complex samples such as plasma, biochemically enriched organelles, isolated protein complexes and whole cell lysates [1–3]. The results from early, small-scale 'proteomic' studies could be and were presented entirely within a standard journal article, even including spectral assignments. However, the standard journal article format is not a useful medium to present the raw data of the much larger datasets being generated today or, indeed, enormous lists of data of any kind. As an example in a related field, when a sequenced genome is 'published' it does not imply that the journal actually prints the hundreds of millions or billions of base pairs comprising the genetic blueprint for the organism. In the case of such studies the journal article focuses on the interpretation of the sequence data while the actual sequence is deposited into public databases maintained by the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) or other institutions. Similar public databases are available for microarray data [4–6] and databases for proteomics data are starting to emerge [7–11]. However, these databases are intended more as large public repositories of raw and/or interpreted data rather than value-added presentations of specific datasets; in such large repositories the message and individuality of each of the research projects is lost.
The incompatibility between the standard journal article format and large-scale proteomics datasets means that the true value of such work is often not realized by the scientific community because much of the data is packed away in long, mundane tables of accession numbers. Such data could be better presented through a specifically designed web portal, and while no journal yet requires such a presentation, a few larger laboratories have started to develop websites for specific projects [12–14]. Building such a site is beyond the capabilities of many groups, however, so here we describe PrestOMIC (presto is Latin for 'display'): a simple suite of tools for creating a customizable database for sharing liquid chromatography/tandem mass spectrometry (LC/MS/MS)-based proteomic data.
The intent was to create a forum for a group to share data from different projects within their laboratory, so the top-level unit here is the 'Study', a table which holds some general information about each project, including the title, authors and PubMed identification number (when referring to the name of a table in the PrestOMIC schema we have capitalized the first letter and italicized the word). Each Study entry could have several different experiments within it (e.g., biological replicates, technical replicates, similar analyses on different sources, similar analyses on different mass spectrometers, etc.) so the Experiment table contains all the relevant information about the sample and analysis of that sample. Specifics about the mass spectrometry configuration, analytical column, mobile phases and search engine are stored in separate tables within the schema to avoid unnecessary replication and referred to by their respective identification numbers in the Experiment table. At the time of writing the standardized vocabulary for describing many of these parameters was still under development at HUPO PSI, so the data is stored as uncontrolled strings at this point.
In a typical proteomics study the central dataset is the list of proteins the authors are claiming to have identified. Layered on top of this list might be relative or absolute quantitative data, post-translational modifications, or other information, but the central pieces of data are still the proteins, their accession numbers and their amino acid sequences. As such, we elected to create a separate table called Protein to hold all the protein sequences identified in a Study entry, again to avoid replication, since a protein could be identified in several Experiments within a Study. The Protein_Quantification table, as the name suggests, contains the quantitative information measured for a protein, if any exists. Since the quantitative measure of a particular protein sequence could be different in different Experiments, one-to-many relationships link Protein and Experiment to Protein_Quantification (Figure 1). In the case where an Experiment is only qualitative, each protein identified has an entry in the Protein_Quantification table but the fields pertaining to quantitation are left blank. Finally, the Spectra_Peptide table holds all the data about the individual peptides used to identify and quantify these proteins, including a field for the tandem mass spectra used in the identification.
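The table relationships described above can be sketched as follows. This is only an illustration: it uses Python's built-in SQLite module as a stand-in for PostgreSQL, all column names are hypothetical, and the link from Spectra_Peptide to Protein_Quantification is an assumption, since the actual layout is defined by Figure 1 and the PrestOMIC schema file.

```python
import sqlite3

# Hypothetical, simplified columns; the real PrestOMIC schema targets
# PostgreSQL and defines more tables and fields than shown here.
SCHEMA = """
CREATE TABLE study (
    study_id  INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    authors   TEXT,
    pubmed_id INTEGER
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    study_id      INTEGER NOT NULL REFERENCES study(study_id),
    name          TEXT NOT NULL
);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    study_id   INTEGER NOT NULL REFERENCES study(study_id),
    accession  TEXT NOT NULL,
    sequence   TEXT NOT NULL
);
CREATE TABLE protein_quantification (
    quant_id      INTEGER PRIMARY KEY,
    protein_id    INTEGER NOT NULL REFERENCES protein(protein_id),
    experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
    ratio         REAL  -- left NULL when the experiment is qualitative only
);
CREATE TABLE spectra_peptide (
    peptide_id INTEGER PRIMARY KEY,
    quant_id   INTEGER NOT NULL REFERENCES protein_quantification(quant_id),  -- assumed link
    sequence   TEXT NOT NULL,
    spectrum   TEXT  -- peak list of the identifying tandem mass spectrum
);
"""


def build_demo_db():
    """Create an in-memory database holding one protein seen in two experiments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO study VALUES (1, 'Demo study', 'A. Author', NULL)")
    conn.executemany(
        "INSERT INTO experiment VALUES (?, 1, ?)", [(1, "worker"), (2, "drone")]
    )
    conn.execute("INSERT INTO protein VALUES (1, 1, 'ACC0001', 'MKTAYIAK')")
    # One quantification row per (protein, experiment) pair; the second
    # experiment is qualitative, so its quantitative field stays empty.
    conn.executemany(
        "INSERT INTO protein_quantification VALUES (?, 1, ?, ?)",
        [(1, 1, 1.8), (2, 2, None)],
    )
    conn.commit()
    return conn


def quantification_rows(conn, accession):
    """Return (experiment name, ratio) pairs for one protein accession."""
    return conn.execute(
        """SELECT e.name, q.ratio
           FROM protein_quantification q
           JOIN protein p ON p.protein_id = q.protein_id
           JOIN experiment e ON e.experiment_id = q.experiment_id
           WHERE p.accession = ?
           ORDER BY e.name""",
        (accession,),
    ).fetchall()
```

Storing the sequence once in the protein table and one quantification row per experiment is what lets the same protein carry different measurements, or none, in different experiments without duplicating the sequence.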
Proteins are the primary output of an MS-based proteomics experiment but their identification is dependent on the acquisition and analysis of mass spectra: typically tandem mass spectra, but sometimes peptide mass fingerprints. Some journals now encourage authors to make the underlying spectra used for protein identification available, but there are as yet no effective tools for this task. Optimally such a tool would use a universally recognized standard for describing a mass spectrum but the HUPO-PSI has not yet decided on this format. Since most protein identification search engines accept files in the so-called Mascot Generic Format (MGF) we decided to base our spectrum display on this format for the time being and then to adapt PrestOMIC once a common format is agreed upon.
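To make the MGF-based approach concrete, the following is a minimal sketch of reading an MGF peak list into a structure that a spectrum display could draw from. It is not PrestOMIC's own Perl code; the function name `parse_mgf` and the returned dictionary layout are illustrative assumptions, and only the basic `BEGIN IONS` / `END IONS` block structure with `KEY=value` headers and "m/z intensity" peak lines is handled.

```python
def parse_mgf(text):
    """Parse MGF-style text into a list of {'params': ..., 'peaks': ...} dicts."""
    spectra = []
    current = None  # spectrum being read, or None when outside a block
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by an optional intensity
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((mz, intensity))
        # Lines outside any BEGIN IONS / END IONS block are ignored here.
    return spectra
```

Each parsed spectrum keeps its header parameters separate from its peaks, so a display layer can label the plot from the title and precursor mass and draw the (m/z, intensity) pairs directly.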
Installation and data entry
PrestOMIC requires two major underlying applications, a web server and a relational database system. Because of their stability, popularity and open source nature, we have chosen to use the Apache web server (v2.2.2) and PostgreSQL (v8.1.4) installed on a single-processor computer running the Fedora Core 5 version of Linux (see Methods for detailed configuration information). The database schema can be entered manually based on Figure 1 or more efficiently by importing the schema file available as Supplementary Material or from the PrestOMIC Subversion system on Google. Since one installation of PrestOMIC can handle several different projects, a user should create a top-level hypertext markup language (HTML) page to link to the individual projects and then organize all the files for each project within its own directory. Several aspects of the main page for each project can and should be customized: 1) The title banner, 2) Text describing the project, 3) Any Venn diagrams or other project-specific graphics, 4) Conditional statements for differential searches (see description of browser interface below).
Once the system is configured, data for the project can be entered into the database via two mechanisms: manually through a program called phpPgAdmin (v4.1), or in batch mode as a CSV-formatted text file using the import functions in PrestOMIC. For the high-level tables we find it easier to enter the data manually, but for protein and peptide information it is more efficient to use the import functions in PrestOMIC. These may be accessed through the command line, and as long as the comma-separated table is formatted correctly (see Methods for the required format) then the project will be loaded correctly.
The browser interface for PrestOMIC is the portal through which interested outside users may view or query the data being presented. Our aim in designing this interface was not to build in all possible bells and whistles an investigator might want to use to make a flashy website, but rather to provide simple, searchable access to the data. As such, the main page for each project is intended to contain basic information about the project, including: title, authors, link to journal website once published, a description of the project such as the manuscript abstract (if the publishing journal's policies allow it), and perhaps a graphical representation of the study. In our first practical use of PrestOMIC we have also added links so a reader can download the supplementary data files (see below).
Application of PrestOMIC to a real-world dataset
The need for tools such as PrestOMIC stems from the incompatibility between a standard journal article and the very large datasets being generated in proteomics. It is neither economically viable for journal publishers to print pages and pages of tables, nor is it interesting for their readers, so typically such information is made available electronically through journal websites. This is a useful format for investigators who wish to reanalyze the data but it is cumbersome and not very accessible for someone who, for instance, works on a specific protein and wants to know if the authors found that protein in the particular conditions. We have also found from experience that it can be very challenging to effectively review such material when it is submitted to a journal. Therefore, we have created PrestOMIC primarily as a presentation aid for such datasets.
The scalability of a PostgreSQL database ensures that PrestOMIC should be able to handle any conceivable dataset generated in proteomics since there are a finite number of possible gene products. Even so, we do foresee at least two upgrades to PrestOMIC that will be required in the near future but that cannot be implemented at this time. As mentioned, there are still no agreed-upon standards for proteomic data, but once HUPO-PSI publishes standard vocabularies and spectrum descriptions they will be implemented in PrestOMIC. However, given how long it took MIAME standards to develop, and that even now they are not universal, it seems prudent to make PrestOMIC available in a perhaps immature form now rather than wait several years for a 'perfect' version. We will incorporate these upgrades ourselves, but the PrestOMIC code is available on Google's Subversion system, so we also encourage community contributions to PrestOMIC's development.
While PrestOMIC will have some catching up to do in the future, many data pipelines in proteomics will also need to be upgraded to fully utilize PrestOMIC. In the larger picture, the pipelines will have to be upgraded even to satisfy the publishing requirements that are inevitably coming. For instance, most data pipelines break the link between raw fragment spectra and peptide early on so that at the time of publication it is virtually impossible to go back and gather all those spectra that gave rise to the identifications. We are currently re-engineering our own pipeline to address this.
PrestOMIC is not the first system created for presenting proteomic data [12–14] but to the best of our knowledge it is the first where the structure itself is available to the wider scientific community. While publishing standards in proteomics have been slow to emerge it is clear that some form of public presentation of the data will likely be required. PrestOMIC, by providing customizable tools for a compact and interactive presentation of specific datasets, will allow investigators to increase the exposure and impact of their data, benefiting themselves and the publishing journal alike.
Installation and configuration of PrestOMIC
Perl, the Apache web server and PostgreSQL are installed as part of Fedora Core 5; if a different operating system is used then Apache and PostgreSQL must be installed from their respective websites. Additional Perl libraries need to be installed, primarily BioPerl, CGI::FormBuilder and Template Toolkit. The schema file available from Supplementary material, the PrestOMIC project website or our own website can then be imported into PostgreSQL to configure the database. Finally, the webpage files are copied into an Apache-accessible directory and customized (see below) as needed.
Customization of the PrestOMIC main page
The title banner – the current dbtitle.jpg file in /images/ can be replaced with a graphic symbolic of the study being presented.
Text describing the project – text specific to the project is scattered throughout index.html and should be changed to suit the need. These include: page title, meta content and information about the journal article.
Project-specific graphics – in the sample dataset used we have incorporated Venn diagrams to demonstrate overlap between different hemolymph samples and images of each of the different castes and life-stages. Again, depending on the data being presented, different image-mapped, hyperlinked graphics should be used here. In order to increase security and to prevent outside users from passing SQL statements directly to the server (an SQL injection attack), the hyperlinks to retrieve the union or intersection of datasets pass an argument containing the Experiment name plus a boolean value of 1 or 0 indicating whether the data from that Experiment is to be included in the query. For example, the Proteins Found Only In Drone and Worker link (http://foster.nce.ubc.ca/bee/db/study-1/only-queen0+worker1+drone1/) is translated by PrestOMIC into a series of SQL statements that determine the requested subset of proteins. This example looks in our four-class database (queen, worker, drone, larva) for proteins that are present in both workers and drones, and are absent in queens. The list of proteins present in larvae has no effect on this query, because 'larva' did not appear in the query. This code is generalized, and will not need to be modified if more classes are added; the words in the query are from the 'class' field of the experiments of interest. The '1' or '0' appended to each word indicates whether a protein must be present in the class or absent from the class, respectively, before it can be included in the subset. A search limited to two classes (say 'queen' and 'worker') can be represented by a two-set Venn diagram (queen and worker, not queen and not worker, queen and not worker, not queen and worker) and can co-exist in the same database with all other possible Venn diagram searches.
Any number of classes can be represented by a Venn diagram for the corresponding number of sets, but while two-dimensional Venn diagrams with more than four sets can be generated with some straightforward rules, they are difficult to draw, more difficult to turn into an image map, and even more troublesome to actually interpret.
Conditional search statements – to change the subjects of the conditional searches the values in the 'option' tags of the table containing the 'regulation search' need to be changed to match the values in the Experiment table of the database for the particular project.
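The presence/absence encoding used by the Venn diagram hyperlinks (for example `queen0+worker1+drone1`) can be illustrated with a short sketch. This is not PrestOMIC's Perl implementation, which translates the argument into SQL statements; here the same logic is shown with in-memory sets, and the function names and the strict lowercase token pattern are assumptions made for illustration.

```python
import re

# A token is a class name followed by a single 1 (must be present) or
# 0 (must be absent). Anything else is rejected outright, so no free-form
# text from the URL is ever interpreted as part of a query.
TOKEN = re.compile(r"([a-z_]+)([01])")


def parse_class_flags(argument):
    """Turn 'queen0+worker1+drone1' into {'queen': False, 'worker': True, ...}."""
    flags = {}
    for token in argument.split("+"):
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"rejected token: {token!r}")
        flags[match.group(1)] = match.group(2) == "1"
    return flags


def select_proteins(presence, flags):
    """Select proteins present in every class flagged 1 and absent from every class flagged 0.

    `presence` maps a class name to the set of protein accessions found in it.
    Classes that do not appear in `flags` place no constraint on the result.
    """
    result = set().union(*presence.values())  # start from every known protein
    for name, wanted in flags.items():
        members = presence.get(name, set())
        result = result & members if wanted else result - members
    return result
```

With four classes loaded, `select_proteins(presence, parse_class_flags("queen0+worker1+drone1"))` returns proteins shared by workers and drones but missing from queens, regardless of whether they occur in larvae, and adding a fifth class requires no change to either function.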
Open source code
The Perl code for PrestOMIC is maintained in the Google Subversion system and can also be downloaded from our website or from ProteomeCommons. The HTML template for the main page and the database schema in SQL format (PostgreSQL flavour) are also available at each of these sites.
As mentioned above, all data can be entered manually if desired, but for protein and peptide data in particular it is far more efficient to simply upload a file containing all the data. Additional File 3 in the supplementary data contains the expected format for such data. To enter the data into the database, transfer the file onto the server and run the command 'addstudy < studyfile.csv'. To back up the data from the database, run the command 'dumpstudy 123 > study123.csv', where the number is the study number. To delete the study from the database, run the command 'delstudy 123', where the number is the study number. If PrestOMIC is moved to a different SQL database, 'delstudy' will need to be expanded, as it is heavily dependent on PostgreSQL's 'cascade delete' feature to erase all records pertaining to a study.
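The dependence of 'delstudy' on cascading deletes can be illustrated with a small sketch. It uses Python's built-in SQLite module rather than PostgreSQL, and the table and column names are hypothetical; the point is only that when child tables declare `ON DELETE CASCADE`, removing the single study row is enough to erase every dependent record.

```python
import sqlite3


def make_db():
    """Build an in-memory database with two studies and their child rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and therefore cascades) when this
    # pragma is switched on for the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE study (study_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE experiment (
            experiment_id INTEGER PRIMARY KEY,
            study_id INTEGER NOT NULL
                REFERENCES study(study_id) ON DELETE CASCADE,
            name TEXT
        );
        CREATE TABLE protein (
            protein_id INTEGER PRIMARY KEY,
            study_id INTEGER NOT NULL
                REFERENCES study(study_id) ON DELETE CASCADE,
            accession TEXT
        );
    """)
    conn.executemany("INSERT INTO study VALUES (?, ?)",
                     [(123, "Study to delete"), (124, "Study to keep")])
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)",
                     [(1, 123, "worker"), (2, 123, "drone"), (3, 124, "queen")])
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                     [(1, 123, "ACC0001"), (2, 124, "ACC0002")])
    conn.commit()
    return conn


def delete_study(conn, study_id):
    """Delete one study row; the cascade removes its experiments and proteins."""
    conn.execute("DELETE FROM study WHERE study_id = ?", (study_id,))
    conn.commit()


def count_rows(conn, table, study_id):
    """Count rows belonging to a study in one of the known tables."""
    if table not in ("study", "experiment", "protein"):
        raise ValueError(f"unknown table: {table}")
    query = f"SELECT COUNT(*) FROM {table} WHERE study_id = ?"
    return conn.execute(query, (study_id,)).fetchone()[0]
```

On a database without cascading deletes, the equivalent of `delete_study` would have to remove rows from each dependent table explicitly, children first, which is the expansion of 'delstudy' referred to above.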
The authors wish to thank Queenie Chan and Nikolay Stoynov for constructive criticism of PrestOMIC, as well as all members of the Cell Biology Proteomics lab for helpful discussions. LJF is a Michael Smith Foundation Scholar and the Canada Research Chair in Organelle Proteomics. This work was funded by a Canadian Institutes for Health Research (CIHR) operating grant to LJF (#MOP-77688).
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. doi:10.1038/nature01511
- de Hoog CL, Mann M: Proteomics. Annu Rev Genomics Hum Genet 2004, 5:267–293. doi:10.1146/annurev.genom.4.070802.110305
- Yates JR 3rd, Gilchrist A, Howell KE, Bergeron JJM: Proteomics of organelles and large cellular structures. Nat Rev Mol Cell Biol 2005, 6:702–714. doi:10.1038/nrm1711
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30(1):207–210. doi:10.1093/nar/30.1.207
- Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA: ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2003, 31(1):68–71. doi:10.1093/nar/gkg091
- Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y: CIBEX: center for information biology gene expression database. C R Biol 2003, 326(10–11):1079–1082. doi:10.1016/j.crvi.2003.09.034
- Craig R, Cortens JP, Beavis RC: Open source system for analyzing, validating, and storing protein identification data. J Proteome Res 2004, 3(6):1234–1242. doi:10.1021/pr049882h
- Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R: The PeptideAtlas project. Nucleic Acids Res 2006, 34(Database issue):D655–8. doi:10.1093/nar/gkj040
- Jones P, Cote RG, Martens L, Quinn AF, Taylor CF, Derache W, Hermjakob H, Apweiler R: PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res 2006, 34(Database issue):D659–63. doi:10.1093/nar/gkj138
- Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM: The need for a public proteomics repository. Nat Biotechnol 2004, 22(4):471–472. doi:10.1038/nbt0404-471
- ProteomeCommons [http://www.proteomecommons.org]
- Foster LJ, de Hoog CL, Zhang Y, Zhang Y, Xie X, Mootha VK, Mann M: A mammalian organelle map by protein correlation profiling. Cell 2006, 125(1):187–199. doi:10.1016/j.cell.2006.03.022
- Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott MS, Gramolini AO, Morris Q, Hallett MT, Rossant J, Hughes TR, Frey B, Emili A: Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell 2006, 125(1):173–186. doi:10.1016/j.cell.2006.01.044
- Zhang Y, Zhang Y, Adachi J, Olsen JV, Shi R, de Souza G, Pasini E, Foster LJ, Macek B, Zougman A, Kumar C, Wisniewski JR, Jun W, Mann M: MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes. Nucleic Acids Res 2007, 35(Database issue):D771–9. doi:10.1093/nar/gkl784
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001, 29(4):365–371. doi:10.1038/ng1201-365
- HUPO Proteomics Standards Initiative [http://www.psidev.info/]
- PostgreSQL: The world's most advanced open source database [http://www.postgresql.org]
- Carr S, Aebersold R, Baldwin M, Burlingame A, Clauser K, Nesvizhskii A: The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Mol Cell Proteomics 2004, 3(6):531–533. doi:10.1074/mcp.T400006-MCP200
- prestomic - Google Code [http://code.google.com/p/prestomic/]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
- Honeybee Hemolymph Protein Database [http://foster.nce.ubc.ca/bee/]
- Chan QW, Howes CG, Foster LJ: Quantitative comparison of caste differences in honeybee hemolymph. Mol Cell Proteomics 2006, 5(12):2252–2262. doi:10.1074/mcp.M600197-MCP200
- Hancock WS, Wu SL, Stanley RR, Gombocz EA: Publishing large proteome datasets: scientific policy meets emerging technologies. Trends Biotechnol 2002, 20(12 Suppl):S39–44. doi:10.1016/S1471-1931(02)00205-7
- The Apache HTTP Server Project [http://httpd.apache.org/]
- Mainpage - BioPerl [http://www.bioperl.org/]
- FormBuilder - Perl CGI Form Builder CPAN module [http://www.formbuilder.org/]
- Cell Biology Proteomics at UBC [http://www.proteomics.ubc.ca/foster/software.php]
- Proteins Found Only In Drone and Worker [http://foster.nce.ubc.ca/bee/db/study-1/only-queen0+worker1+drone1/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.