Evidence for existence of thirty hypothetical proteins in rat brain

Background The rapid completion of genome sequences has created an infrastructure of biological information and provided essential information to link genes to gene products, proteins, the building blocks for cellular functions. In addition, genome/cDNA sequences make it possible to predict proteins for which there is no experimental evidence. Clues for function of hypothetical proteins are provided by sequence similarity with proteins of known function in model organisms. Results We constructed a two-dimensional protein map and searched for expression of hypothetical proteins in rat brain. Two-dimensional electrophoresis (2-DE) with subsequent in-gel digestion of spots and matrix-assisted laser desorption/ionization (MALDI) mass spectrometric identification were applied. In total about 3700 spots were analysed, which resulted in the identification of about 1700 polypeptides that were the products of 190 different genes. A number of hypothetical gene products were detected (30 of 190, 15.8%) and are considered brain proteins. Conclusions A major finding of this study is the demonstration that putative proteins, so far only deduced from their nucleic acid structure, do exist. This was shown by a protein chemical method that is independent of antibody availability and specificity and that identifies proteins unambiguously.


Background
The amount of genome sequence data now available is enormous, with only more to come. At this time, sequencing of more than 700 genomes is either finished or in progress, and results may be found in various public databases [1]. With the availability of genome sequences, increasing attention has focused on identifying the complete set of mammalian genes, both protein-coding and non-protein-coding. As various analyses [2,3] have revealed, different annotation criteria lead to different sets of predicted genes, and even the true number of protein-coding genes remains uncertain. Biases in the training sets used for optimizing gene-finding algorithms lead to systematic biases in the types of genes that are predicted computationally, so important genes and gene classes can be missed.
A draft sequence of the mouse genome and comparative analyses with the human sequence have been published [4]. The sequencing of the rat genome is also well advanced (see "Rat Genome Database: Sequences" at http://rgd.mcw.edu/sequences/rgp_info.shtml). Comparative genomics studies have shown that many genes contributing to human disease are conserved among these genomes, underscoring the utility to biomedical research of studies in these species [5,6]. Comparative human and mouse sequence analyses have supported the notion that there are only about 30000 genes in a typical mammalian genome and confirmed that, on average, rodent and human genes are about 85% identical in their coding sequences [4,7]. The mouse and rat model systems have several advantages over higher mammals for the investigation of mammalian biology. In both species, there are numerous genetically well-defined lines that differ from each other in phenotypic characteristics and in their composition of genetic variants. This, coupled with their modest maintenance costs and short generation times, has underpinned an explosion of studies in genetic mapping and the development of genetic manipulation tools over the past decades. Moreover, a very broad range of phenotypic assays can be applied to both model organisms, allowing for the acquisition of precise data on both qualitative and quantitative traits. Taken together, these developments have driven the utility of the mouse and rat for studies of mammalian physiology, biochemistry and development and for the study of genes and genetic pathways involved in genetic disease. Although the mouse is the primary organism for studies of mammalian genetics and development, and is seeing increased use in several other research areas, the rat has been used more frequently for physiological and pharmacological studies [8].
Proteomics methods, relying on the integration of significant advances recently achieved in two-dimensional (2-D) electrophoretic separation of proteins and mass spectrometry (MS), are essential for studying protein expression, activity, regulation and modifications. Genomics coupled with proteomics now represents an important set of tools, and high-throughput methods are in use for analysing gene and protein expression, discovering new gene or protein products, and understanding gene and protein functions, including post-genomic studies. The significance of informatics in proteomics will therefore gradually increase, because high-throughput methods rely on powerful data analysis. Recently, we have applied proteomics in the study of human and rat brain [9]. In rats, we analysed brains of animals serving as models of human diseases such as anxiety and perinatal asphyxia (unpublished data) and of animals treated with the toxic agent kainic acid [10,11].
Although gene prediction programs have become more accurate and sensitive, analysis of proteins provides more reliable evidence for the existence and function of predicted proteins. To determine whether hypothetical proteins are expressed in brain, we constructed a two-dimensional protein map of rat brain by applying 2-DE coupled to matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) and identified 30 hypothetical proteins.

2-DE analysis
Rat brain proteins were solubilised in the IEF-compatible reagents urea, thiourea and CHAPS and analysed by 2-D gels. The 2-DE separation was performed on broad pH range IPG strips and protein spots were visualised following staining with colloidal Coomassie blue. A large series of proteins were successfully identified, including hypothetical proteins. A representative gel presenting hypothetical proteins is shown in Fig. 1.

Protein identification
Proteins were identified by MALDI-MS on the basis of peptide mass matching [12], following in-gel digestion with trypsin. The spots of each gel were selected randomly with the goal of detecting as many new gene products as possible. Each excised spot was analysed individually. The peptide masses were matched with the theoretical peptide masses of all known proteins from all species. In total about 3700 spots were analysed, resulting in the identification of about 1700 polypeptides that were the products of 190 different genes. Most proteins identified in the present study overlapped with previous data obtained by using enriched cellular subfractions and pre-electrophoretic chromatographic separation of brain homogenates [13], and only predicted structures were included here to prevent double documentation. Thirty of 190 gene products were hypothetical or poorly described gene products. Some of them were represented by strong spots, and the present study shows that they are indeed expressed in rat brain. Table 1 (see additional file 1) provides data for hypothetical protein identification and assignment, including peptide matches, the probability of a random identity assignment, and theoretical/observed pI and molecular weight values. Some of the identified hypothetical proteins showed heterogeneity and were represented on the 2-DE gel by more than one spot (Fig. 1). On average, approximately 2-3 spots corresponded to one hypothetical protein. For example, hypothetical 79.7 kDa protein (Q91VD9) was represented by five spots with different pI, probably reflecting post-translational modifications (PTMs) (Fig. 1).
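Peptide mass matching compares measured peptide masses with the peptides expected from a theoretical trypsin digest of each database protein. The following is a minimal sketch of that theoretical digest, assuming the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name, the `missed_cleavages` parameter and the example sequence are illustrative only and are not taken from the search software used in this study.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return theoretical tryptic peptides of a protein sequence.

    Assumed rule: cleave C-terminal to K or R, except when the next
    residue is P. `missed_cleavages` allows peptides that span up to
    that many uncleaved sites.
    """
    # Collect cleavage positions, including both ends of the sequence.
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + extra + 1
            if end < len(sites):
                peptides.append(sequence[sites[start]:sites[end]])
    return peptides


# Hypothetical toy sequence, not a protein from Table 1.
print(tryptic_digest("AKRPGKMR"))  # ['AK', 'RPGK', 'MR']
```

The masses of the resulting peptides would then be compared with the measured spectrum; allowing one missed cleavage is a common choice, but whether it was allowed here is not stated in the text.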

Predicted function of hypothetical proteins
Most nucleic acid sequences of hypothetical proteins were directly submitted to the GenBank/EMBL/DDBJ database. Based on the assumption that sequence-domain similarities reflect functional relationships, it may be predicted how hypothetical proteins play a role in biological mechanisms. The sequences of hypothetical proteins were submitted to BLAST search. Putative conserved domains and identity to known proteins were obtained by sequence similarity (Conserved Domain Database, http://www.ncbi.nlm.nih.gov/BLAST) (Table 1, see additional file 1). A hypothetical protein showing one or more significant structural homologs is predicted to have molecular properties similar to the homologs. Putative cellular localisations of hypothetical proteins were investigated by PSORT II (http://www.psort.org), which uses a k-nearest neighbour (k-NN) algorithm to assess the probability of localisation at each candidate site. Hypothetical proteins may be localised in the cytoplasm (16 of 30, 53.3%), mitochondria (9 of 30, 30%) and nucleus (5 of 30, 16.7%) (data not shown). Hypothetical proteins were divided into several groups by putative function (Table 1, see additional file 1).
Figure 1. 2-DE gel image of rat brain proteins depicting the 30 identified hypothetical proteins. Accession numbers are given. Brain proteins were extracted and separated on an immobilised pH 3-10 non-linear gradient strip followed by separation on a 9-16% gradient polyacrylamide gel. The gel was stained with Coomassie blue and spots were analysed by MALDI-MS.
For five hypothetical proteins ("Unknown" in Table 1), little information on predicted domains and function was available in data banks, including the SWISS-PROT database. Therefore, we performed a BLAST search (NCBI, http://www.ncbi.nlm.nih.gov) that predicted domains for two proteins, Homo sapiens sequence 42 from patent wo0222660 and Homo sapiens sequence 33 from patent wo0218424 (Fig. 2). Homo sapiens sequence 42 from patent wo0222660 (CAD34734) was detected in 2-DE gels as two spots with different pI, probably reflecting PTMs or isoforms (Fig. 1). This protein has 455 amino acid residues according to the patent cDNA sequence (HYSEQ. INC, USA) and belongs to the TufB family (Fig. 2). Homo sapiens sequence 33 from patent wo0218424 (CAD33306) was resolved as four spots (Fig. 1) and consists of 429 amino acid residues. The protein has never been described in the literature and is categorised in the GTP/CDC (guanosine triphosphate/cell division control) family, which includes CDC 3, CDC 10, CDC 11, CDC 12/Septin and some uncharacterised proteins involved in cytokinesis (http://www.sanger.ac.uk/Pfam/) (Fig. 2). Although members of this family are involved in cell division and bind GTP, the biological roles of these proteins are unclear.
To predict the function of 5730568A12Rik protein, 1700082C19Rik protein, and 1700021B03Rik protein, we searched for homologs of these proteins with bioinformatic tools (PredictProtein, http://www.embl-heidelberg.de/predictprotein/) and performed CLUSTALW multiple sequence alignments. Homologs of 5730568A12Rik protein (Q9CXS1) were identified in organisms ranging from C. elegans (Q23344, hypothetical 34.0 kDa protein, putative serine/threonine protein phosphatase) to H. sapiens (Q96ER9, similar to 5730568A12Rik protein), and this finding suggests that 5730568A12Rik protein may play an important role in cell biology (data not shown). However, the putative function of 5730568A12Rik protein is still unknown. According to the BLAST search results, 1700082C19Rik protein (Q9D9F6) shows high homology with mitochondrial inner membrane protein (Fig. 3) and 1700021B03Rik protein (Q9DA45) belongs to the DUF 737 family of uncharacterised proteins.
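The k-nearest neighbour step used for localisation prediction can be illustrated with a toy example. The sketch below is not PSORT II itself: the two-dimensional feature vectors, the training labels and the function name `knn_localisation` are all made up for illustration, whereas the real program derives its features from the amino acid sequence. It only shows how the labels of the k closest training proteins are turned into a probability for each candidate site.

```python
import math
from collections import Counter


def knn_localisation(query, training, k=3):
    """Estimate localisation probabilities from the k nearest neighbours.

    `training` is a list of (feature_vector, label) pairs. Returns a dict
    mapping each label found among the k nearest neighbours to its share
    of the votes.
    """
    # Rank training proteins by Euclidean distance to the query.
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}


# Hypothetical training set; the numbers carry no biological meaning.
training_set = [
    ((0.1, 0.2), "cytoplasm"),
    ((0.2, 0.1), "cytoplasm"),
    ((0.9, 0.8), "mitochondria"),
    ((0.8, 0.9), "mitochondria"),
    ((0.5, 0.95), "nucleus"),
]

print(knn_localisation((0.15, 0.15), training_set, k=3))
```

With this toy data, two of the three nearest neighbours of the query are cytoplasmic, so the cytoplasm receives two thirds of the votes.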

Discussion
Herein we identify hypothetical proteins in rat brain predicted from nucleic acid sequences and show that these proteins exist and are indeed expressed in rat brain. This observation adds structures to the list of proteins of metabolism, cytoskeleton, chaperone system, cell division/differentiation machinery and proteolysis.
The group of hypothetical proteins related to metabolism is the largest protein class (Table 1, see additional file 1) and is strongly represented in the rat brain map. Hypothetical proteins with enzymatic activity identified by proteomics in the rat brain may be useful for the determination of metabolic disorders in the brain, including inborn errors of metabolism, and may also form the basis for physiological or pharmaceutical studies. We are complementing metabolism-related proteins by describing new members of intermediary metabolism, hydrolytic enzymes and oxidoreductases. We furthermore add a dynamin-1-like, an erythrocyte membrane protein-like, an actin related protein 1-like and a tubulin alpha-2-like protein to the cytoskeleton proteins. The product of the KIAA0417 gene, never reported in the human system before, as well as the ATP and nucleotide binding TOB3 were assigned to the chaperone system in humans.
New members of cell division and differentiation systems were represented by the hypothetical protein FLJ38330 and CDCrel-1A.
As to proteins of the category "unknown" function, a search for conserved domains in the NCBI database (http://www.ncbi.nlm.nih.gov) showed that Homo sapiens sequence 42 from patent wo0222660 has a conserved domain: TufB, GTPases-translation elongation factors [translation, ribosomal structure and biogenesis] (Fig. 2). Elongation factor Tu (EF-Tu) belongs to the family of GTP binding proteins, and the function of EF-Tu is to bind aminoacyl-tRNA (aa-tRNA) and GTP to form a stable ternary complex that interacts with the A site of the mRNA-programmed ribosome. In addition, EF-Tu is involved in several mechanisms, such as replication, transcription, RNA processing, DNA repair, regulation of translation, malignant transformation, and regulation of development [14]. EF-1a (the eukaryotic counterpart of EF-Tu) binds to actin filaments and influences assembly of cytoskeletal polymers [15][16][17]. Additionally, EF-1a apparently participates in the degradation of N-terminally blocked proteins by the 26S proteasome [18]. In recent years, it has been shown that EF-Tu has chaperone-like properties in protein folding [19,20] and displays protein disulfide isomerase activity [21]. Molecular chaperones form a class of polypeptide binding proteins that are implicated in protein folding, protein targeting to membranes, protein renaturation or degradation after stress, and the control of protein-protein interactions.
Homo sapiens sequence 33 from patent wo0218424 is classified as GTP_CDC (Fig 2) and presents with a coiled coil region and may be regulated by cyclic AMP. Although members of this family are involved in cell division and bind GTP, biological roles of these proteins are unclear.
1700082C19Rik protein shows high homology with mitochondrial inner membrane protein and may well be assigned to mitochondrial structure in the mammalian system, in particular because homologues of 1700082C19Rik protein, which contain a mitochondrial signal and are found in several organisms at the nucleic acid level, show high levels of similarity to this protein.
1700021B03Rik protein belongs to the DUF 737 family of uncharacterised proteins (Fig. 4). According to alignment analysis by the Pfam program (http://www.sanger.ac.uk), eleven proteins including 1700021B03Rik protein were assigned as DUF 737 family members, and these members are found only in higher mammalian species such as mouse, monkey, and human (data not shown). Although this finding suggests that 1700021B03Rik protein may be related to development of the brain in the mammalian system, the role of this protein is still unclear.
Some of the hypothetical proteins were detected in mouse hippocampus and in human mesothelial, lymphocyte and bronchial cell lines (Table 1, see additional file 1). None of the listed hypothetical proteins were observed in human brain maps [22,23] or in a previously published rat map [24].
A number of hypothetical proteins encoded in the sequenced rat genome have computationally recognised homology to at least one well-characterised domain, but functional interpretation of these proteins is limited. In addition, functional changes over evolutionary time [25,26] and database errors [27] confound reliable computational prediction of the precise functions of newly discovered genes. The absence of experimental evidence makes it difficult to directly ascertain their molecular role.

Conclusions
In the present study, we identified a number of so far hypothetical gene products, which may be considered tentative brain proteins. The major finding of this study is the evidence, obtained from 2-DE of rat brain, for the existence of putative proteins that had so far only been proposed from their nucleic acid structure; the approach also provides a tool for unambiguous analysis of these structures [28,29].

Rat brain samples
The animal studies were conducted according to the guidelines of the American Physiological Society. Rats, male and 8 days old, were sacrificed for the experiments by decapitation and pooled. Whole brains were kept at -80°C until biochemical assays were performed. The freezing chain was never interrupted until use. Experiments were carried out in triplicate.

Matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS)
MALDI-MS analysis was performed as described elsewhere [31,32] with minor modifications. Spots were excised with a spot picker and placed into 96-well microtiter plates. Each spot was destained with 100 µl of 30% acetonitrile in 50 mM ammonium bicarbonate and dried in a speedvac evaporator. Each dried gel piece was rehydrated with 4 µl of 3 mM Tris-HCl, pH 9.0, containing 50 ng trypsin (Promega, Madison, WI, USA). After 16 h at room temperature, 7 µl of distilled water were added to each gel piece and samples were shaken for 10 min. Four µl of 50% acetonitrile, containing 0.3% trifluoroacetic acid and the standard peptides des-Arg-bradykinin (Sigma, 904.4681 Da) and adrenocorticotropic hormone fragment 18-39 (Sigma, 2465.1989 Da), were added to each gel piece and samples were shaken for 10 min. Sample application was performed using a SymBiot I sample processor (PE Biosystems, Framingham, MA, USA). 1.5 µl of the peptide mixture were simultaneously applied with 1 µl of matrix, consisting of a saturated solution of α-cyano-4-hydroxycinnamic acid (Sigma) in 50% acetonitrile, containing 0.1% trifluoroacetic acid. Samples were analysed in a time-of-flight mass spectrometer (Reflex 3, Bruker Analytics, Bremen, Germany). An accelerating voltage of 20 kV was used. Peptide matching and protein searches were performed automatically. Peptide masses were compared with the theoretical peptide masses of all available proteins from all species. Monoisotopic masses were used and a mass tolerance of 0.0025% was allowed. The algorithm used for determining the probability of a false positive match with a given MS-spectrum is described elsewhere [33].
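The use of monoisotopic masses and a 0.0025% (25 ppm) tolerance can be made concrete with a short calculation. The sketch below is illustrative and is not the search software used in the study. It computes the singly protonated monoisotopic mass of a peptide from standard residue masses and tests whether an observed mass lies within the tolerance. The peptide sequences assigned to the two standards are assumptions, since the text gives only their masses, and the function names are hypothetical.

```python
# Standard monoisotopic residue masses in Da (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide for the free termini
PROTON = 1.007276   # MALDI typically yields singly protonated ions


def mh_plus(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def within_tolerance(observed, theoretical, fraction=0.000025):
    """True if the masses agree within the fractional tolerance (0.0025%)."""
    return abs(observed - theoretical) <= fraction * theoretical


# Assumed sequences of the two internal standards (not given in the text).
DES_ARG_BRADYKININ = "PPGFSPFR"
ACTH_18_39 = "RPVKVYPNGAEDESAEAFPLEF"

for name, seq, listed in [
    ("des-Arg-bradykinin", DES_ARG_BRADYKININ, 904.4681),
    ("ACTH 18-39", ACTH_18_39, 2465.1989),
]:
    print(name, round(mh_plus(seq), 4), within_tolerance(listed, mh_plus(seq)))
```

At 25 ppm the allowed window is about 0.02 Da near 900 Da and about 0.06 Da near 2465 Da, which is why internal standards bracketing the peptide mass range are useful for calibration.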

Prediction of hypothetical protein function
Based on MALDI-MS analysis, we identified hypothetical proteins, and their sequences were searched against diverse databases, including SWISS-PROT and NCBI. Amino acid sequences of proteins without available information on putative function were aligned with homologs by the CLUSTALW multiple sequence alignment program obtained from the Web (http://clustalw.genome.ad.jp).
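Once two sequences have been aligned, their similarity is often summarised as percent identity. The sketch below shows one common way to compute it from a pair of already aligned rows, counting only columns where neither sequence has a gap. Several definitions of percent identity exist and the text does not say which one was used, so this is an assumption; the function name and the toy sequences are hypothetical and are not taken from the alignments in this study.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns without gaps.

    Both inputs must be rows of the same alignment, so they must have
    equal length. Columns containing a gap in either row are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned pair: 5 identical residues out of 6 gap-free columns.
print(round(percent_identity("MKT-AYL", "MRTGAYL"), 1))  # 83.3
```

Dividing by the length of the shorter sequence or by the full alignment length instead would give different values, so the denominator should always be reported alongside the figure.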