Intrinsic disorder in putative protein sequences
© Midic and Obradovic; licensee BioMed Central Ltd. 2012
Published: 21 June 2012
Intrinsically disordered proteins (IDPs) and regions (IDRs) perform a variety of crucial biological functions despite lacking stable tertiary structure under physiological conditions in vitro. State-of-the-art sequence-based predictors of intrinsic disorder achieve per-residue accuracies over 80%. In a genome-wide study of intrinsic disorder in the human genome we observed a large difference in predicted disorder content between confirmed and putative human proteins. We investigated the hypothesis that this discrepancy is an artifact: that it arises from incorrectly annotated parts of the putative protein sequences that resemble confirmed IDRs and therefore lead to high predicted disorder content.
To test this hypothesis we trained a predictor to discriminate sequences of real proteins from synthetic sequences that mimic errors of gene finding algorithms. We developed a procedure to create synthetic peptide sequences by translation of non-coding regions of genomic sequences and translation of coding regions with incorrect codon alignment.
Application of the developed predictor to putative human protein sequences showed that they contain a substantial fraction of incorrectly assigned regions. These regions are predicted to have higher levels of disorder content than correctly assigned regions. This partially, albeit not completely, explains the observed discrepancy in predicted disorder content between confirmed and putative human proteins.
Our findings provide the first evidence that current practice of predicting disorder content in putative sequences should be reconsidered, as such estimates may be biased.
Intrinsically disordered proteins (IDPs) are proteins that lack stable tertiary structure under physiological conditions in vitro. They are also known by other names, including natively denatured, natively unfolded, intrinsically unstructured, and natively disordered. IDPs can be wholly disordered or partially disordered, where we can identify intrinsically disordered regions (IDRs) and ordered regions. Although they lack stable tertiary structure, the functional repertoire of IDPs complements the functions of ordered proteins. IDPs are involved in a number of crucial biological functions including regulation, recognition, signaling and control.
Thus, in addition to the well-known "protein folding code", which states that all the information necessary for a given protein to fold is encoded in its amino acid sequence, a "protein non-folding code" has been proposed, according to which the propensity of a protein to stay intrinsically disordered is likewise encoded in its amino acid sequence [10, 11]. This has been utilized to develop numerous predictors of intrinsic disorder (ID), which achieve per-residue accuracy of over 80%.
The predicted sequences were unevenly distributed between disease-related and disease-unrelated proteins. In fact, the majority of the putative sequences were products of the non-disease genes. Therefore, including such sequences into the data set would introduce significant bias for disorder in the non-disease gene part of the data set. Based on these observations, we decided to exclude such sequences from the final datasets.
Gene finding is the problem of predicting the positions of genes, and the positions of exons and introns inside the genes, for a given genomic sequence. Most predictors use Bayesian networks, such as Interpolated Markov Models, Generalized Hidden Markov Models, and Generalized Pair HMMs. These predictors exploit the following findings: 1) many signals involved in gene expression (e.g. promoters, splice junctions) exhibit specific patterns, known as motifs, and can be predicted from sequence, 2) protein-coding DNA has statistical properties (such as amino acid composition, length) that distinguish it from non-coding DNA, 3) signals and statistical properties are often conserved across related sequences (intra- and inter-species). From the domain experts' point of view, these prediction models perform well, as they provide important guidelines for experimental research, where predicted putative sequences are confirmed or refined. However, inclusion of these putative sequences in large-scale analysis of ID is questionable. Even when predicted exons of a predicted protein sequence overlap with true exons, the overlap can be partial and non-coding DNA may be included in the predicted exons. Another possibility is that, in a predicted protein sequence, true exons are translated in the wrong reading frame. Therefore, predicted protein sequences can contain regions that come from non-coding genomic regions or from incorrectly translated coding regions, and that are not present in true protein sequences. In the rest of the text we refer to them as nonsense regions/sequences. Nonsense regions do not exist in real proteins, and the hypothetical structure they would adopt if they were synthesized is uncertain. Therefore, any prediction of structure - including prediction of intrinsic disorder - for nonsense regions and sequences is not valid. Inclusion of such sequences in genome-wide analysis of intrinsic disorder can substantially bias the estimate of ID content in a genome.
In a previous study we decided to exclude XP sequences from the analysis of ID in the human genome. On the other hand, their exclusion from genome-wide analysis can also give an unrealistic estimate of ID content, especially if the proportion of incorrectly annotated unconfirmed sequences is high. If the higher predicted disorder content in XP sequences is realistic, then their exclusion can negatively bias the estimate of ID content in the genome.
In this paper we explore the relationship between nonsense regions in XP sequences - introduced through errors made by gene finding procedures - and intrinsic disorder. In addition to the difference in amino acid composition between NP and XP sequences (further elaborated in the Results section), we assumed that nonsense regions follow a different amino acid composition than true protein sequences. Therefore, instead of testing and improving the gene finding algorithms, we investigate whether nonsense regions can be detected from amino acid sequence, similarly to prediction of intrinsic disorder.
We developed a two-class predictor that aims at distinguishing true protein sequences from nonsense regions in putative sequences. Since no data is easily available about which regions of XP sequences are nonsense, we constructed synthetic nonsense sequences from mRNAs of the true protein sequences that form the other class.
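The idea of a synthetic nonsense sequence can be illustrated with a minimal Python sketch that translates the same coding DNA in the correct and in shifted reading frames. The function name `translate`, the example DNA string, and the choice to keep stop codons as `*` are illustrative assumptions and are not taken from the procedure used in this study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate a DNA string starting at offset `frame` (0, 1 or 2).

    Unknown codons (e.g. containing N) become 'X'; stop codons stay as '*'.
    A trailing incomplete codon is ignored.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    coding = "ATGGCCAAATAA"            # hypothetical coding region
    print(translate(coding, frame=0))  # correct codon alignment
    print(translate(coding, frame=1))  # shifted alignment: a "nonsense" peptide
    print(translate(coding, frame=2))
```

Translating a non-coding stretch (for example an intron) with the same function would give the other kind of nonsense sequence described above.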
The methodology that was used to create the synthetic nonsense sequences, train and evaluate the nonsense predictor, and analyze results of the predictor for XP sequences is described in the Methods section. The Results section presents more details on the comparison of amino acid sequence composition, results of predictor evaluation, comparison of nonsense prediction in different classes of sequences, and the analysis of the relationship between nonsense prediction and disorder prediction. This is followed by brief Discussion and Conclusions sections.
This paper is a substantial extension of the preceding conference paper. The dataset that was used in the initial attempt at performing this analysis was based on the dataset used in an earlier study, which was retrieved from the NCBI database in 2007 and included only the human genome. Since additional information about genes and proteins was required to answer open questions and to address several shortcomings of the setup of the initial study, we downloaded all the necessary information from the NCBI database again in 2011 and repeated the analysis with improved methodology on the updated human dataset and on three new datasets: mouse, fruitfly and zebrafish. This paper presents the methodology and the results of the extended study. However, the old methodology and results are also mentioned where appropriate, since the comparison of the results gives an important insight into the trends in the development of the NCBI databases that are relevant for the topic of this paper.
We created four datasets, one for each of the following species: Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fruitfly), Danio rerio (zebrafish). For each of the organisms, we downloaded from the NCBI database the genomic records, with sequences and annotation, of all genes with RefSeq protein records. These records contain the genes' nucleotide sequences, as well as the positions of all parts of the mRNA sequences: 3' and 5' UTRs (untranslated regions) and coding regions (exons). From this information we could also easily identify intronic regions. For the control/negative class of true proteins we selected either NP protein sequences that are listed as single isoforms of their respective genes (i.e. the genes are not known to be involved in alternative splicing), or representative sequences compiled from multiple NP sequences for genes with multiple isoforms (i.e. alternatively spliced); a representative sequence was compiled by translating all exon regions in a gene's sequence. The only exceptions were the alternatively spliced genes for which at least one of the exons was translated in a different codon alignment in different isoforms; such genes were not used in this study.
Nonsense protein sequences for the positive class were synthesized from coding and noncoding regions of the genomic sequences of genes whose representatives form the negative class. The exact locations of exons in these genomic sequences are known, and the exons can only be translated correctly if they are read in one of the three possible reading frames. For a given annotated genomic sequence and the protein it is translated to (top sequence in Figure 4, where exons are shown in black), the procedure to synthesize nonsense sequences was the following:
Overview of numbers of sequences in datasets for nonsense prediction
Synthetic nonsense sequences
Sequences from both parts of the dataset and the additional set were preprocessed to construct predictive features, similarly to how features are constructed for the PONDR family of ID predictors [7, 12, 23, 24]. For each residue, a window of size 41 was centered at that residue. Amino acids in the window were counted and their frequencies were calculated; this produced 20 features that correspond to amino acid composition. Entropy was calculated from the 20 amino acid frequencies; this feature measures local complexity of the amino acid sequence. Local flexibility was approximated as the scalar product of the 20 amino acid frequencies and 20 flexibility parameters, which were estimated empirically. Net charge and average hydrophobicity were calculated similarly to flexibility, and their ratio was used as an additional feature. Predictions of ID were obtained with the VSL2B predictor; these predictions were mapped to a binary classification by applying the .5 threshold. To summarize the predicted ID in a protein sequence, we used disorder content (DC), which is defined as the fraction of residues that are predicted to be in disordered regions. We labeled amino acids in synthetic nonsense sequences with information about their origin, i.e. whether the central nucleotide of the corresponding codon was a part of a coding region or a non-coding region. For amino acids in all sequences we calculated the distance of the codon from the nearest border between an exon and a non-coding region. Both of these labels were later used in balancing of the training set.
The main difference between the datasets described above and the dataset in the initial study is that in the initial study only the mRNA sequences of the human proteins were used as the source for synthesis of nonsense sequences. The translation of non-coding regions was therefore limited to upstream and downstream untranslated regions (3'UTR and 5'UTR), if they were included in the mRNA sequence at all. In that study we also excluded all genes that were known to be alternatively spliced. That dataset contained 15,124 NP sequences and 45,038 synthetic nonsense sequences, as well as the additional set of 5,243 XP sequences.
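The window-based features described above can be sketched in Python as follows. This is only an illustration: the function names, the truncation of the window at sequence ends, and the use of base-2 logarithms are assumptions, and the property scale passed to `weighted_property` (for example flexibility parameters) has to be supplied by the caller because its values are not given in the text.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_features(seq, center, window=41):
    """Return (20 amino acid frequencies, Shannon entropy) for one residue.

    Assumption: near the sequence ends the window is simply truncated.
    """
    half = window // 2
    segment = seq[max(0, center - half):center + half + 1]
    counts = Counter(segment)
    n = len(segment)
    freqs = [counts.get(a, 0) / n for a in AMINO_ACIDS]
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    return freqs, entropy


def weighted_property(freqs, scale):
    """Scalar product of the frequencies with a per-amino-acid scale."""
    return sum(f * s for f, s in zip(freqs, scale))


def disorder_content(scores, threshold=0.5):
    """Fraction of residues whose disorder score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    freqs, entropy = window_features("MKTAYIAKQRQISFVKSHFSRQ", center=10)
    print(round(entropy, 3), disorder_content([0.2, 0.7, 0.9, 0.4]))
```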
This prediction problem is novel, and therefore we could not utilize any of the existing protein-related prediction tools. Furthermore, we could not compare our results to any previously published results. Our goal was not to develop an optimal predictor, but rather to construct a simple predictor with reasonable accuracy and good balance between sensitivity and specificity. We briefly tested logistic regression and neural networks as the predictive model, with various sets of parameters. Here we present only the parameters that led to the best results that we have obtained. We used neural networks with 20 hidden nodes in a single hidden layer. We always trained ensembles of 10 neural networks, with randomly sampled training and validation sets. The training and validation sets (8% and 2% of the available data respectively) were sampled from the dataset; only 10% of the available data was used per iteration to speed up the training and evaluation process. Because windows used to construct features for neighboring amino-acids were overlapping, the obtained features were similar, and therefore the redundancy allowed for subsampling without significant loss of accuracy.
Both training and validation sets were balanced (i.e. contained equal number of residues from positive and negative class), and samples from both classes were balanced in terms of disorder to include equal number of residues predicted to be ordered and disordered. We further balanced the nonsense class by sampling equal number of residues obtained by translating non-coding regions and residues obtained by translating coding regions. We also balanced both nonsense and true protein class by sampling equal numbers of residues obtained from regions in vicinity of an exon/non-coding region border (50nt or less) and of residues obtained from regions far from such borders (more than 50nt). Targets for residues from two classes were encoded as .1 and .9. In the evaluation phase, the residues were classified by comparing their real-valued predictions with the .5 threshold.
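One way to implement this kind of balancing is stratified sampling, sketched below in Python. The sketch assumes joint stratification (each class is split into strata by predicted disorder, coding/non-coding origin and proximity to a border, and every stratum within a class receives the same quota); the text does not state whether the criteria were balanced jointly or one at a time, and the field names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(residues, n_per_class, seed=0):
    """Sample residues so that both classes, and the strata inside each
    class, are equally represented.

    Each residue is a dict with boolean fields (hypothetical names):
    'nonsense', 'disordered', 'from_coding', 'near_border'.
    """
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for r in residues:
        stratum = (r["disordered"], r["from_coding"], r["near_border"])
        by_class[r["nonsense"]][stratum].append(r)

    sample = []
    for strata in by_class.values():
        quota = n_per_class // len(strata)  # equal share for every stratum
        for members in strata.values():
            sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

True protein residues always originate from coding regions, so that class has four strata while the nonsense class has eight; dividing the per-class quota by the number of strata keeps the two classes the same size.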
In the initial study we balanced the training dataset only with respect to the class and the predicted disorder, but not with respect to the origin of the amino acids.
We performed both per-residue and per-sequence evaluation. In per-residue evaluation residues are observed separately, while in per-sequence evaluation predictions for all residues in a sequence are aggregated into one prediction (mean of per-residue predictions) and compared to a threshold. We used 10-fold cross-validation to evaluate the predictor, and the dataset was partitioned into 10 subsets so that residues from the same sequence were always members of the same subset. This partitioning both enables per-protein prediction and ensures fair testing in per-residue prediction, since neighboring residues in a sequence have similar input features and in most cases equal target values, and should therefore always be in the same subset. We used two indicators of nonsense prediction level in a sequence. We define nonsense content as the fraction of predicted nonsense residues in a sequence; this indicator is analogous to disorder content. Another indicator is the mean of (real-valued) per-residue nonsense predictions in the sequence. Both indicators were used to compare results of prediction for NP and XP sequences.
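The two per-sequence indicators and the sequence-level partitioning can be sketched in Python as below. The function names are illustrative, and the sketch assumes that higher scores mean "nonsense" and that a residue counts as nonsense when its score is strictly above the threshold; neither detail is stated explicitly in the text.

```python
import random


def nonsense_content(scores, threshold=0.5):
    """Fraction of residues in one sequence predicted to be nonsense."""
    return sum(s > threshold for s in scores) / len(scores)


def mean_score(scores):
    """Mean of the real-valued per-residue predictions in one sequence."""
    return sum(scores) / len(scores)


def classify_sequence(scores, threshold=0.5):
    """Per-sequence prediction: compare the mean score to a threshold."""
    return mean_score(scores) > threshold


def assign_folds(sequence_ids, k=10, seed=0):
    """Map every sequence id to one of k folds, so that all residues of a
    sequence end up in the same cross-validation subset."""
    ids = sorted(set(sequence_ids))
    random.Random(seed).shuffle(ids)
    return {sid: i % k for i, sid in enumerate(ids)}


if __name__ == "__main__":
    scores = [0.1, 0.6, 0.9, 0.4]
    print(nonsense_content(scores), mean_score(scores))
    print(assign_folds(["seqA", "seqB", "seqC"], k=2))
```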
To analyze the impact of input features on the prediction of nonsense, we used an approximation of the partial derivatives of the prediction function. The partial derivative of the prediction function pred with respect to the i-th feature f_i at point x was approximated by a finite difference using the perturbation vector ε_i = ε(0, ..., 1, ..., 0), which has value ε at the i-th element and value 0 at all other elements. The mean of such estimates for feature f_i over all points in the dataset was then used to estimate both the impact (absolute value) and the direction (sign) of the contribution of feature f_i to the prediction function.
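A minimal Python sketch of such a sensitivity estimate is given below. It assumes a forward difference, (pred(x + ε_i) - pred(x)) / ε, and an arbitrary step size; the exact difference scheme and the value of ε used in the study are not recoverable from the text, and `pred` stands for any function that maps a feature vector to a real-valued prediction.

```python
def mean_partial_derivatives(pred, points, eps=1e-3):
    """Average forward-difference estimate of d pred / d f_i over all points.

    The sign of each returned value indicates the direction of the feature's
    contribution, and its absolute value indicates the impact.
    """
    n_features = len(points[0])
    totals = [0.0] * n_features
    for x in points:
        base = pred(x)
        for i in range(n_features):
            shifted = list(x)
            shifted[i] += eps  # x + eps_i: perturb only the i-th feature
            totals[i] += (pred(shifted) - base) / eps
    return [t / len(points) for t in totals]


if __name__ == "__main__":
    # Toy linear "predictor" whose true partial derivatives are 2 and -3.
    toy = lambda x: 2.0 * x[0] - 3.0 * x[1]
    print(mean_partial_derivatives(toy, [[0.1, 0.2], [0.5, 0.9]]))
```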
10-fold cross-validation evaluation of per-residue and per-protein nonsense sequence predictors. Reported measures: accuracy, defined as (spec+sens)/2, and area under curve, with rows labeled "Nonsense ~ introns" and "Nonsense ~ exons".
All indicators of the predictor's performance on the Homo sapiens dataset showed a small improvement compared to the results of the initial study. To test the reason behind that improvement we trained a predictor on the Homo sapiens dataset for which the training dataset was balanced only with respect to the true/nonsense and order/disorder criteria, but not with respect to the non-coding/exonic origin and the near/far from non-coding-exon border criteria. The evaluation of these predictors (results not shown) showed a similar improvement compared to the results of the initial study. Therefore we can conclude that the additional balancing did not directly affect the performance of the predictor. Instead, the improvement in performance can be attributed to one of the following: 1) changes in the NCBI dataset that occurred over the last three years (refinement of NP sequences and upgrading of XP sequences to NP status), 2) inclusion of more intronic regions in the synthetic nonsense part of the dataset, 3) inclusion of sequences with alternative splicing.
As a part of the 10-fold cross-validation process, we obtained predictions for all NP and synthetic nonsense sequences. We could then use all 10 predictors as an ensemble for prediction on XP sequences, since these were not used in training; the ensemble predictor is expected to perform at least as well as its component predictors.
Comparison of fractions of NP, XP and synthetic nonsense sequences with nonsense content greater than threshold
Total (per-residue) predicted nonsense content (NC) in NP, XP and synthetic nonsense (nons) sequences, and the margins of nonsense content between NP and XP (NC_XP - NC_NP) and between NP and synthetic nonsense sequences (NC_nons - NC_NP)
Since the computational experiments indicated that human (and to some extent mouse) XP sequences contain a substantial fraction of nonsense regions, the important question is how these regions affect the prediction of disorder content in XP sequences. In human XP sequences, 55.53% of all residues are predicted to be disordered. In regions of human XP sequences that are predicted to be nonsense the fraction of predicted ID residues rises to 64.87%, while in regions predicted not to be nonsense the fraction of predicted ID residues is only 50.58%. It is interesting to note that in the mouse dataset the predicted fraction of ID residues is very similar in predicted nonsense regions of XP sequences (48.79%), in regions of XP sequences that are predicted not to be nonsense (49.09%) and in XP sequences overall (49.01%). Furthermore, in the zebrafish dataset the difference is inverted compared to the human dataset: 46.69% overall, 38.13% in predicted nonsense regions, and 48.67% in the remaining regions.
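The three percentages reported for each species are linked by simple arithmetic: if the overall figure is the residue-weighted average of the two regional figures, the fraction of XP residues that fall in predicted nonsense regions can be backed out. The Python sketch below does this under that assumption; the function name is illustrative, and the derived fractions are not values reported in the study.

```python
def implied_nonsense_fraction(overall, in_nonsense, in_rest):
    """Solve overall = p * in_nonsense + (1 - p) * in_rest for p."""
    return (overall - in_rest) / (in_nonsense - in_rest)


if __name__ == "__main__":
    reported = {
        # species: (overall %, % in predicted nonsense, % in remaining regions)
        "human": (55.53, 64.87, 50.58),
        "mouse": (49.01, 48.79, 49.09),
        "zebrafish": (46.69, 38.13, 48.67),
    }
    for species, values in reported.items():
        print(species, round(implied_nonsense_fraction(*values), 3))
```

For the human figures this gives roughly 35% of XP residues in predicted nonsense regions. The mouse estimate is unreliable, because the two regional percentages differ by only 0.3 percentage points and rounding in the reported values dominates the result.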
Correlation of disorder content (DC) and nonsense content (NC) for NP, XP and synthetic nonsense sequences
In a previous study we observed a large increase in predicted disorder content for human protein sequences from NCBI with XP identifiers, as compared to human protein sequences with NP identifiers (Figure 2). This difference was consistent with the divergence in amino acid composition between NP and XP sequences (Figure 3), since several order-promoting amino acids were highly enriched in NP sequences, and several disorder-promoting amino acids were highly enriched in XP sequences.
Sequences have XP identifiers when they are in early stages of curation, and many of them are just putative sequences submitted by the automated genome annotation procedure that utilizes gene finding algorithms. Since gene finding algorithms are not perfect, they introduce nonsense regions into XP sequences. We suspected that these nonsense regions may be one of the causes for the discrepancy in predicted disorder.
Based on the difference in amino acid composition, we assumed that nonsense regions can be predicted from sequence. Since no data on nonsense regions was available, we developed a simple procedure to construct synthetic nonsense sequences from real protein sequences and their genomic sequences (Figure 4). These sequences have different amino acid composition than their real counterparts, although in human and mouse genome they also differ greatly from XP sequences, as they have higher fractions of some order-promoting amino acids and lower fractions of some disorder-promoting amino acids.
Using a simple prediction model, we have successfully trained predictors that discriminate true NP sequences from synthetic nonsense sequences. All input features were based only on local sequence information, and were constructed using methodology similar to many predictors of intrinsic disorder. The predictors have very good per-residue accuracies (82%-86%) and AUCs (> .9), comparable to predictors of intrinsic disorder (Table 2). More importantly, they are very well balanced (i.e. have similar sensitivity and specificity) and perform equally well on predicted disordered regions and predicted ordered regions, as well as on synthetic nonsense sequence regions originating from coding and non-coding genomic regions. These results confirm the assumption that nonsense regions can be predicted from sequence alone.
We have also used a simple method to aggregate per-residue predictions and obtain per-protein predictions. The performance of per-protein predictors is very close to optimal, with accuracies 96%-98% and AUC ~ .99. However, it is only feasible to use per-protein predictors when a sequence is either a true protein sequence or the whole sequence is nonsense.
We applied both per-residue and per-protein predictors to XP sequences. We used various methods to compare results of nonsense prediction for NP and XP sequences. The per-protein predictor classified ~44% of human XP sequences as fully nonsense sequences, compared to only ~6% of NP sequences. While this estimate is not realistic, it is indicative of how many XP sequences are - in terms of input features - more similar to synthetic nonsense sequences than to real NP sequences. A similarly large discrepancy was observed for Mus musculus (~30% vs 7%), but not for Danio rerio (~9% vs 4%).
The per-residue predictor also gave very different predictions for human NP and XP sequences. The differences in distributions of nonsense content (fraction of residues in a sequence predicted to be in nonsense regions) are substantial for Homo sapiens and Mus musculus, but not for Danio rerio (Figure 7, Table 4).
We analyzed the total nonsense content (total fraction of residues in predicted nonsense regions) for NP, XP and synthetic sequences at various values of threshold. The separation margin between predicted nonsense contents for human NP and synthetic nonsense sequences peaks around the default threshold .5, and the margin between predicted nonsense contents for NP and XP is close to its maximum (~20% in mRNAnons, ~18% in GNMCnons dataset) at that threshold.
Predicted nonsense regions in human XP sequences have higher total disorder content (64.9%) than the remaining regions of human XP sequences (50.6%). More importantly, there is a significant positive linear dependency between predicted nonsense content and predicted disorder content in XP sequences, as indicated by a fairly high Pearson correlation coefficient, as well as by the R² statistic and the low p-value of the corresponding linear regression model. While a similar positive linear dependency (albeit with a lower correlation coefficient) is observed in synthetic nonsense sequences, it is completely absent from NP sequences. However, no such significant correlation can be observed in Mus musculus, while in Danio rerio the correlation is significant and negative. In Danio rerio, predicted nonsense regions in XP sequences have lower total disorder content (38.1%) than the remaining regions of Danio rerio XP sequences (48.7%).
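The linear dependency between per-sequence nonsense content (NC) and disorder content (DC) can be quantified as in the following Python sketch. The function names and the toy values are illustrative only; for a simple linear regression with one explanatory variable, the R² statistic equals the square of the Pearson correlation coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """R^2 of the least-squares line y = a + b*x (equals r**2)."""
    return pearson_r(xs, ys) ** 2


if __name__ == "__main__":
    # Hypothetical per-sequence values, not data from this study.
    nc = [0.05, 0.20, 0.40, 0.65, 0.90]
    dc = [0.30, 0.35, 0.55, 0.60, 0.80]
    print(round(pearson_r(nc, dc), 3), round(r_squared(nc, dc), 3))
```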
The experimental results support the hypothesis that the presence of nonsense regions in human XP sequences - introduced by errors of gene finding procedures - significantly increases the predicted disorder content, and therefore introduces bias to genome-wide estimate of disorder content.
However, the same conclusion cannot be reached for Mus musculus and Danio rerio. Danio rerio has very similar distributions for predicted disorder content in NP and XP sequences, as well as very similar distributions for predicted nonsense content in NP and XP sequences. Furthermore, it has the lowest levels of predicted nonsense in XP sequences of all three compared organisms. Most importantly, the contribution of nonsense regions in XP sequences to predicted disorder content is at most minimal.
We were only able to partially explain the discrepancy in disorder content estimates for human NP and XP sequences. It is still possible that the proteins that are currently covered by XP records in fact have higher average disorder content than NP sequences. However, even if that is the case, we cannot be sure what portion of the difference in predicted disorder content is due to the real difference, and what portion is due to errors in XP sequences that will eventually be corrected. Differences in datasets and results for Homo sapiens between the initial study and the expanded study presented here suggest that more and more XP sequences are being curated and eventually have their status upgraded, which leads to a decrease in the discrepancy between predicted disorder contents, as well as to lower predicted nonsense content.
This work was supported in part by a grant to ZO from the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations, or conclusions.
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.