In silico proteome analysis to facilitate proteomics experiments using mass spectrometry
© Cagney et al; licensee BioMed Central Ltd. 2003
Received: 23 April 2003
Accepted: 13 August 2003
Published: 13 August 2003
Proteomics experiments typically involve protein or peptide separation steps coupled to the identification of many hundreds to thousands of peptides by mass spectrometry. Development of methodology and instrumentation in this field is proceeding rapidly, and effective software is needed to link the different stages of proteomic analysis. We have developed an application, proteogest, written in Perl, that generates descriptive and statistical analyses of the biophysical properties of large sets (e.g. thousands) of protein sequences submitted by the user, for instance protein sequences inferred from the complete genome sequence of a model organism. The application also carries out in silico proteolytic digestion of the submitted proteomes, or subsets thereof, and presents the distribution of biophysical properties of the resulting peptides. proteogest is customizable: the user can select many options, for instance the cleavage pattern of the digestion treatment or the presence of modifications to specific amino acid residues. We show how proteogest can be used to compare the proteomes and digested proteome products of model organisms, to examine the added complexity generated by modification of residues, and to facilitate the design of proteomics experiments for optimal representation of component proteins.
Proteomics involves the large-scale or global analysis of the protein complement of an organism [1–3]. The convergence of several factors has led to the rapid emergence of proteomics as a distinct and promising scientific field, notably the completion of genome sequencing projects and advances in sensitive high-throughput protein analysis methods such as mass spectrometry (MS). Proteomics studies can generate massive amounts of experimental data. A single bacterial cell may produce 4000 proteins whose abundances and activities may vary throughout an experiment, while the number of proteins expressed in higher eukaryotes is likely to be at least 10-fold greater. Attempts to catalogue, visualize, and analyze proteomics experiments have therefore become a major challenge. In fact, the development of practical software applications suitable for theoretical and experimental analysis of the proteome lags far behind that for the analysis of genomes and DNA.
A fundamental operation of proteomics is to identify proteins. For most high-throughput applications, proteins are cleaved with site-specific reagents, for example cyanogen bromide (CNBr) or proteases (usually trypsin), to generate smaller peptides better suited to analysis by MS. In shotgun proteomics studies, entire mixtures of proteins are digested. Most proteomics experiments involve four steps: a) protein isolation from a biological sample (e.g. a cell extract) following some experimental treatment; b) fractionation of the resulting proteins (or peptides, the products of proteome digestion) by methods such as two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) or liquid chromatography (LC); c) protein or peptide detection by MS; d) protein identification through manual interpretation or database correlation of mass spectra. Integration of these steps is essential for a successful proteome experiment yet relies on accurate knowledge of the parameters influencing each step. Tools that effectively link the predicted proteome or digested proteome to the data obtained in proteomics experiments are therefore necessary for several reasons. First, they provide a statistical framework that can facilitate interpretation of the output of protein identification algorithms such as MS-Fit and MS-Tag or SEQUEST. Second, an expected protein may go unobserved in a proteomics experiment for several reasons, including limits of MS detection technology, poor expression or recovery of the protein, or inefficiency of the protein identification algorithm. While recent studies have confirmed the presence of many proteins previously only predicted from their cognate DNA sequences, a large number of predicted protein species have never been observed, and software tools can highlight experimental factors that might contribute to these discrepancies.
Third, while the genomic DNA sequence is believed to contain all the information needed to describe the protein products of the cell, our knowledge of non-canonical proteins (i.e. proteins other than those that are defined by uninterrupted start and stop codons and whose component residues are unmodified) is very incomplete. The set of canonical proteins for a given organism will probably be expanded several-fold by the phenomena of post-transcriptional splicing and post-translational modification. Software that can analyze data on a whole proteome scale is required for examining such expanded proteomes.
Programs that can analyze several aspects of protein biochemistry and structure are available at websites such as the Swiss Institute of Bioinformatics http://www.isb-sib.ch and the European Bioinformatics Institute http://www.ebi.ac.uk/proteome/. These programs are generally not suited to processing of whole proteomes, nor are they designed to analyze the peptide digestion products of entire proteomes. A software application that can mimic proteome digestion and analyze the resulting peptides on a whole proteome scale would be of great value to the mass spectrometry researcher. We therefore developed proteogest, a program that generates basic descriptive statistics for both the intact and proteolytically processed proteome.
An analytical tool for proteomics
proteogest is written in Perl and runs in command line mode with several options. A detailed description of how to install and run proteogest is available for download at http://www.utoronto.ca/emililab/program/proteogest.htm
Protein sequences to be analyzed are saved as a text file in FASTA format in the same directory as the proteogest program. Text files can be edited to suit the user, for instance to contain all the proteins predicted for a particular organism, or a similar list with predicted transmembrane proteins removed. The user specifies the cleavage criteria by inserting an 'X' character into the cleavage sequence at the point of cleavage, e.g. "SXS" would cleave between two successive S residues. Where alternative residues may be cleaved, the alternatives are separated by commas, e.g. "PX,QX,RX". The 'Z' character can be used as a wild card, for instance "QZZYZQXS" would mimic the tobacco etch virus protease recognition site, where cleavage occurs after the second glutamine (Q) and several different residues are tolerated at positions 2, 3, and 5. In the laboratory, digestion by proteolytic enzymes and chemical reagents may be incomplete, leaving some cleavage sites unprocessed within the digestion products. To simulate this, an option to specify the maximum number of missed cleavages per digestion product is included. When this option is chosen, the output describes all possible complete and incomplete cleavage products. For instance, by choosing "2", all peptides containing 0, 1 or 2 missed cleavages are described (not just those containing exactly 2 missed cleavage sites).
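The cleavage notation and the missed-cleavage option described above can be illustrated with a short sketch. This is not the proteogest source (which is written in Perl); it is a minimal Python reading of the rule syntax, and the function names `parse_rules`, `cleavage_sites` and `digest` are hypothetical. It assumes that 'X' marks the cut point, that 'Z' matches any residue, and that every comma-separated rule contains exactly one 'X'.

```python
def parse_rules(spec):
    """Split a rule string such as 'KX,RX' into (left, right) context pairs.

    'X' marks the cut point; text before it must precede the cut and
    text after it must follow the cut. (Assumed reading of the syntax.)
    """
    rules = []
    for rule in spec.split(","):
        left, _, right = rule.strip().partition("X")
        rules.append((left, right))
    return rules


def _matches(pattern, fragment):
    """True if fragment fits pattern, with 'Z' acting as a wild card."""
    return len(pattern) == len(fragment) and all(
        p == "Z" or p == f for p, f in zip(pattern, fragment)
    )


def cleavage_sites(seq, spec):
    """Return the indices i such that the chain is cut between seq[i-1] and seq[i]."""
    rules = parse_rules(spec)
    sites = []
    for i in range(1, len(seq)):
        for left, right in rules:
            if i < len(left):
                continue  # not enough residues before the cut
            if _matches(left, seq[i - len(left):i]) and _matches(
                right, seq[i:i + len(right)]
            ):
                sites.append(i)
                break
    return sites


def digest(seq, spec, max_missed=0):
    """List all peptides containing 0..max_missed unprocessed cleavage sites."""
    bounds = [0] + cleavage_sites(seq, spec) + [len(seq)]
    peptides = []
    for start in range(len(bounds) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(bounds):
                break
            peptides.append(seq[bounds[start]:bounds[end]])
    return peptides


if __name__ == "__main__":
    print(digest("AKRGK", "KX,RX"))                # complete digestion
    print(digest("AKRGK", "KX,RX", max_missed=1))  # 0 or 1 missed cleavages
```

Note that choosing `max_missed=2` returns the peptides with 0, 1 and 2 missed cleavages together, matching the behaviour described in the text. The sketch does not model enzyme-specific exceptions that the rule syntax does not express.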
Several post-translational modification options may be used. A peptide can be modified by phosphorylation (in this case, +80.0 amu can be added to every occurrence of serine, threonine or tyrosine, or iteratively to each individual S, T or Y residue in turn, one at a time) or the user can specify any combination of custom modifications. A number of groups have recently described promising methods for phosphoproteome analysis [7–9] (reviewed in reference 10).
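The two phosphorylation modes can be sketched as follows. This is an illustrative Python example rather than the proteogest implementation: the function names are hypothetical, the residue masses are standard monoisotopic values, and the +80.0 amu shift is the figure quoted in the text. Lower-case letters are used here only as a convention to mark the modified residue.

```python
# Standard monoisotopic residue masses (Da); one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PHOSPHO = 80.0  # mass shift quoted in the text


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def phospho_sites(peptide):
    """Positions of serine, threonine and tyrosine residues."""
    return [i for i, aa in enumerate(peptide) if aa in "STY"]


def fully_phosphorylated_mass(peptide):
    """Mass with +80.0 added to every S, T and Y residue."""
    return peptide_mass(peptide) + PHOSPHO * len(phospho_sites(peptide))


def singly_phosphorylated_variants(peptide):
    """One variant per S/T/Y site, each carrying a single +80.0 shift.

    The modified residue is shown in lower case (illustrative convention).
    """
    base = peptide_mass(peptide)
    return [
        (peptide[:i] + peptide[i].lower() + peptide[i + 1:], base + PHOSPHO)
        for i in phospho_sites(peptide)
    ]


if __name__ == "__main__":
    for variant, mass in singly_phosphorylated_variants("STAY"):
        print(variant, round(mass, 4))
```

The iterative mode multiplies the number of peptide species by the number of candidate sites, which is one way to gauge the added complexity that modifications introduce into a digested proteome.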
Computational proteome analysis of model organisms
proteogest was written to analyze large proteome amino acid sequence datasets and to simulate digestion of the proteome with enzymes or chemical reagents. Here we refer to the theoretical proteome as the entire potential protein complement encoded by the genetic component of a cell or organism, and distinguish it from the observed or experimental proteome, or the complement of proteins that are actually expressed under physiological or experimental conditions. This definition of the proteome includes the primary gene products defined by start and stop codons but does not exclude variants of those gene products arising from mRNA splicing or post-translational proteolysis or modification.
Another use of proteogest is for searching proteome datasets for potential binding sites and consensus sequences. For instance, metal affinity capture using nickel-conjugated resins is often used to recover recombinant proteins from E. coli and S. cerevisiae. Interestingly, the sequence HHHHHH (His6) occurs only once in the predicted E. coli proteome (the His operon attenuator leader peptide) while it is present in 17 predicted S. cerevisiae proteins. Polyglutamine tracts are similarly rare: only two E. coli proteins (multidrug resistance protein B and hypothetical protein yciQ) and four S. cerevisiae proteins (AFG3, SCJ1, YLR338W and hypothetical protein YJE8) contain polyglutamine tracts longer than five residues.
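A motif search of this kind can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not the proteogest code: `parse_fasta` and `proteins_with_motif` are hypothetical names, the whole header line is used as the protein name, and motifs are written as regular expressions (`H{6}` for His6, `Q{6,}` for polyglutamine tracts longer than five residues).

```python
import re


def parse_fasta(lines):
    """Read FASTA-formatted lines into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def proteins_with_motif(proteome, pattern):
    """Return the sorted names of proteins whose sequence contains the motif."""
    regex = re.compile(pattern)
    return sorted(name for name, seq in proteome.items() if regex.search(seq))


if __name__ == "__main__":
    # Toy records for illustration only; not real proteome entries.
    example = [
        ">toy1", "MAHHHH", "HHGG",
        ">toy2", "MKQQQQQQA",
        ">toy3", "MKQQQQQA",
    ]
    proteome = parse_fasta(example)
    print(proteins_with_motif(proteome, r"H{6}"))
    print(proteins_with_motif(proteome, r"Q{6,}"))
```

For a real proteome the lines would come from an open FASTA file, for example `parse_fasta(open(path))`, where `path` is whatever file the user has prepared.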
Analysis of the digested and modified proteome
Table: Characteristics of peptide products following digestion of the yeast proteome with different agents. Digestion conditions compared: trypsin (no missed cleavages); trypsin (zero or one missed cleavages); trypsin (zero, one or two missed cleavages); trypsin and chymotrypsin. Quantities reported for each condition: total number of peptides in proteome; mean peptide mass (isotopic); mean number of peptides per protein.
Currently, two main MS approaches are used to identify proteins in proteomics experiments: a) 2D-PAGE separation combined with matrix-assisted laser desorption ionization MS and b) gel or chromatographic separation combined with electrospray MS. The former approach uses the observed masses of intact peptide ions derived from the same parent protein for identification ("peptide fingerprinting"), while the latter generally relies on uninterpreted product mass spectra derived from a single peptide ion. In both cases, database searching is normally used to match experimentally observed mass spectra with spectra predicted for known protein sequences. The efficiency of both approaches is dependent on many factors, for example the accuracy, sensitivity and resolution of the measuring instrument, and also the size and distribution of peptide and protein properties in the proteome. proteogest permits descriptive statistics to be obtained for whole proteome datasets and for in silico digestion products of the proteomes. Normally, calculating these numbers requires a custom program to be written for each query. Although such programs are relatively simple, they require time and skills not always available in a busy proteomics lab. We therefore wrote proteogest to answer questions about the physical and chemical properties of theoretical proteomes, in order to design practical experiments.
The software tool is timely and valuable for several reasons. First, it permits the testing of hypotheses concerning the entire proteome (or large subsets thereof). For instance, one might ask whether yeast nuclear proteins are enriched in particular (e.g. acidic) amino acids by comparing the fraction of certain residues (e.g. aspartic acid and glutamic acid) found in nuclear localized proteins with that in the overall proteome. To do this, proteogest is first run on a FASTA file of the complete yeast proteome and then on a similar file edited to include only proteins annotated to the nucleus. Second, the distribution of proteins or peptides can be incorporated into probability-based mass spectrum identification algorithms. For instance, the mean number of tryptic peptides per protein is 28 for E. coli but 42 for C. elegans, so the fragmentation patterns expected for a typical protein will be different in the different organisms. Furthermore, automated de novo peptide sequencing (identification of a peptide sequence solely from the spectrum itself, and not by comparison with a spectrum predicted using a DNA database) is currently achievable only using specialized high-resolution mass spectrometers (e.g. Fourier Transform MS) or by chemical modification of the peptides before MS analysis, such as using MCAT. Knowing the relative occurrence of different amino acids (or pairs of successive amino acids) for a given proteome, for instance, can improve the probability estimates attached to de novo sequencing predictions. Third, proteogest can be used for the planning and interpretation of experimental proteomics applications, in particular those involving high-throughput protein identifications using MS. For instance, the ICAT method for protein relative abundance determination relies on the modification of cysteine-containing peptides.
When designing a proteomics experiment using ICAT, it is important to calculate the proportion of all proteins that contain one or more cysteines, yet currently, there is no easy way to carry out this apparently trivial calculation without writing a program. Finally, we show how proteogest can be used to search for patterns in proteomics data, for instance the frequency of particular amino acid residues in observed versus predicted peptides.
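The two calculations mentioned above (the proportion of proteins containing at least one cysteine, and the fraction of acidic residues in a set of proteins) can be sketched as below. This is an illustrative Python example, not the proteogest implementation; the function names are hypothetical and the proteome is assumed to be held as a dictionary mapping protein names to sequences.

```python
from collections import Counter


def fraction_with_residue(proteome, residue="C"):
    """Proportion of proteins that contain the residue at least once."""
    if not proteome:
        return 0.0
    hits = sum(1 for seq in proteome.values() if residue in seq)
    return hits / len(proteome)


def residue_fraction(proteome, residues="DE"):
    """Fraction of all residues in the set that belong to `residues`."""
    counts = Counter()
    for seq in proteome.values():
        counts.update(seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[r] for r in residues) / total


if __name__ == "__main__":
    # Toy sequences for illustration only.
    whole = {"p1": "MCDE", "p2": "MKKK", "p3": "MDDE"}
    nuclear = {"p3": whole["p3"]}  # hypothetical nuclear-annotated subset
    print(fraction_with_residue(whole, "C"))
    print(residue_fraction(whole, "DE"), residue_fraction(nuclear, "DE"))
```

Comparing `residue_fraction` between a subset and the whole proteome mirrors the nuclear-protein example in the text, where the same analysis is run on the complete file and on an edited file.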
Materials and Methods
Files containing the sequences of all proteins predicted from the genomic DNA sequences of Escherichia coli, Methanococcus jannaschii, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens were downloaded from the European Bioinformatics Institute website http://www.ebi.org on 20 August 2002. The experimental S. cerevisiae proteome determined by 2D-MALDI was obtained by combining datasets observed by Futcher and coworkers and Gygi and coworkers. The proteome dataset determined by 1D LCMS was obtained from our laboratory using methods described in Cagney and Emili. The 2D MUDPIT proteome dataset comprised proteins observed by Washburn and coworkers and in our laboratory. The experimentally determined sets are subsets of the complete predicted S. cerevisiae proteome FASTA file, and proteogest is used in exactly the same way except that the input files are edited to include only relevant proteins. Peptides detected by mass spectrometry following trypsin digestion of whole cell extract of S. cerevisiae were obtained in our laboratory using MUDPIT [16, 21] and identified using the SEQUEST algorithm, searched against all predicted fully tryptic peptides in the non-redundant SwissProt and TrEMBL mouse and human protein sequences downloaded from EBI in December 2002. SEQUEST scores demonstrated to yield approximately 98% correct identifications were included in the analysis.
The software is open source and can be requested by email. The program is written in Perl and works on major operating systems (Windows, Unix, Linux). A helpfile can be downloaded from the Emili website and gives instructions on installing and using the program.
We thank Jimmy Eng, Dave Tabb, and John Yates, III, for generous use of Sequest and DTASelect/Contrast software. We also wish to thank Pete St. Onge, Faye Baron, Duy Mai, and Shaun Ghanny for computing assistance and fruitful suggestions. This work was supported in part by a grant to A.E. from the Natural Sciences and Engineering Research Council of Canada and Genome Canada.
1. Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. doi:10.1038/nature01511
2. Phizicky E, Bastiaens PIH, Zhu H, Snyder M, Fields S: Protein analysis on a proteomic scale. Nature 2003, 422: 208–215. doi:10.1038/nature01512
3. Tyers M, Mann M: From genomics to proteomics. Nature 2003, 422: 193–197. doi:10.1038/nature01510
4. Clauser KR, Baker PR, Burlingame AL: Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal Chem 1999, 71: 2871–2882. doi:10.1021/ac9810516
5. Eng JK, McCormack AL, Yates JR 3rd: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5: 976–989. doi:10.1016/1044-0305(94)80016-2
6. Washburn MP, Wolters D, Yates JR 3rd: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnol 2001, 19: 242–247. doi:10.1038/85686
7. Salomon AR, Ficarro SB, Brill LM, Brinker A, Phung QT, Ericson C, Sauer K, Brock A, Horn DM, Schultz PG, Peters EC: Profiling of tyrosine phosphorylation pathways in human cells using mass spectrometry. Proc Natl Acad Sci USA 2003, 100: 443–448. doi:10.1073/pnas.2436191100
8. Ficarro SB, McCleland ML, Stukenberg PT, Burke DJ, Ross MM, Shabanowitz J, Hunt DF, White FM: Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nature Biotechnol 2002, 20: 301–305. doi:10.1038/nbt0302-301
9. MacCoss MJ, McDonald WH, Saraf A, Sadygov R, Clark JM, Tasto JJ, Gould KL, Wolters D, Washburn M, Weiss A, Clark JI, Yates JR 3rd: Shotgun identification of protein modifications from protein complexes and lens tissue. Proc Natl Acad Sci USA 2002, 99: 7900–7905. doi:10.1073/pnas.122231399
10. Mann M, Jensen ON: Proteomic analysis of post-translational modifications. Nature Biotechnol 2003, 21: 255–261. doi:10.1038/nbt0303-255
11. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nature Biotechnol 1999, 17: 994–999. doi:10.1038/13690
12. Zhou H, Ranish JA, Watts JD, Aebersold R: Quantitative proteome analysis by solid-phase isotope tagging and mass spectrometry. Nature Biotechnol 2002, 20: 512–515. doi:10.1038/nbt0502-512
13. Cagney G, Emili A: De novo peptide sequencing and quantitative profiling of complex protein mixtures using mass-coded abundance tagging. Nature Biotechnol 2002, 20: 163–170. doi:10.1038/nbt0202-163
14. Gygi SP, Rochon Y, Franza BR, Aebersold R: Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 1999, 19: 1720–1730.
15. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI: A sampling of the yeast proteome. Mol Cell Biol 1999, 19: 7357–7368.
16. Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvick BM, Yates JR 3rd: Direct analysis of protein complexes using mass spectrometry. Nature Biotechnol 1999, 17: 676–682. doi:10.1038/10890
17. Aebersold R, Goodlett DR: Mass spectrometry in proteomics. Chem Rev 2001, 101: 269–295. doi:10.1021/cr990076h
18. Peng J, Gygi SP: Proteomics: the move to mixtures. J Mass Spectrom 2001, 36: 1083–1091. doi:10.1002/jms.229
19. Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997, 387: 708–713. doi:10.1038/42711
20. Kinter M, Sherman NE: Protein sequencing and identification using tandem mass spectrometry. Wiley-Interscience, New York, 1st edition, 2000.
21. Kislinger T, Rahman K, Radulovic D, Cox B, Rossant J, Emili A: PRISM, a Generic Large Scale Proteomic Investigation Strategy for Mammals. Mol Cell Proteomics 2003, 2: 96–106. doi:10.1074/mcp.M200074-MCP200
22. Mann M, Hendrickson RC, Pandey A: Analysis of proteins and proteomes by mass spectrometry. Annu Rev Biochem 2001, 70: 437–473. doi:10.1146/annurev.biochem.70.1.437
23. Wu CC, MacCoss MJ: Shotgun proteomics: tools for the analysis of complex biological systems. Curr Opin Mol Ther 2002, 4: 242–250.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.