Unsupervised Integration of Multiple Protein Disorder Predictors: The Method and Evaluation on CASP7, CASP8 and CASP9 Data
Proteome Sciencevolume 9, Article number: S12 (2011)
Abstract
Background
Studies of intrinsically disordered proteins that lack a stable tertiary structure but still have important biological functions critically rely on computational methods that predict this property based on sequence information. Although a number of fairly successful models for prediction of protein disorder have been developed over the last decade, the quality of their predictions is limited by available cases of confirmed disorders.
Results
To more reliably estimate protein disorder from protein sequences, an iterative algorithm is proposed that integrates the predictions of multiple disorder models without relying on any protein sequences with confirmed disorder annotation. The iterative method alternates between the maximum a posteriori (MAP) estimation of the disorder prediction and the maximum-likelihood (ML) estimation of the quality of multiple disorder predictors. Experiments on data used at CASP7, CASP8, and CASP9 have shown the effectiveness of the proposed algorithm.
Conclusions
The proposed algorithm can potentially be used to predict protein disorder and provide helpful suggestions on choosing suitable disorder predictors for unknown protein sequences.
Background
Identification of regions in proteins that do not have unique structures, called intrinsic disorders, is addressed computationally by a number of groups that aim to predict this property from sequence information [1–10]. Contrary to the lock and key paradigm, disordered regions were recently found to be involved in many important functions [11] and in various diseases [12].
Computational characterization of disorder in proteins is appealing due to the difficulties and high cost involved in experimental characterization of disorders. The first predictor of protein disorder was developed by our group in 1997 [13]. Due to the importance of predicting this property, protein disorder prediction was introduced as a category of the CASP contests in 2002 [14], which promoted the development of new methods for prediction of protein disorder. Consequently, the number of prediction methods available through the Internet has increased rapidly. More than 50 predictors of intrinsic protein disorder are described in a recent review by He et al. [15], enabling researchers to use a meta approach that predicts protein disorder by integrating the prediction results of several methods. Recently, four such meta predictors, i.e. metaPrDOS [16], MD [17], PONDR-FIT [18], and MFDp [19], have been developed to improve disorder prediction accuracy. In the reported experiments, they showed significantly better performance than their individual component predictors.
A limitation of these supervised meta predictors is that they are prone to over-optimization in their integration processes, since they rely on disorder/order labeled training datasets that contain very few proteins not already used for development of the component predictors (e.g. sets as small as DisProt [20] or as specialized as missing coordinates from the PDB [21]). Therefore, the predictions of previous meta predictors may not generalize well to proteins whose sequence patterns differ substantially from the cases used for integration. For example, although metaPrDOS achieved higher prediction accuracy than all predictors participating in CASP7, as stated in its paper [16], it failed to rank among the top predictors in CASP8 [22]. Moreover, one of metaPrDOS' component predictors, DISOPRED [2], was more accurate than metaPrDOS in CASP8 [22].
To address the potential over-optimization problem of meta predictor development by learning from small labeled datasets, here we introduce a new disorder meta prediction method. Following the idea of Raykar et al. [23], we derived an iterative MAP and ML estimation (MAPML) based algorithm that constructs a meta predictor in a completely unsupervised process, using protein sequences without confirmed disorder/order annotations. Performance of the new meta method is evaluated using CASP prediction targets as the test sets, which enabled us to compare the prediction results with other methods used in the CASP contests.
Methods
Problem statement
Let us define the dataset as D = {x_i, y_i^1,…,y_i^M}, i = 1,…,N. Here, x_i is an amino acid composition feature vector derived from the subsequence covered by a moving window centered at the ith amino acid within the current protein, and y_i^j ∈ {1,0} (1 represents a disordered state while 0 represents an ordered state) is the prediction label assigned to the instance x_i by the jth predictor. M is the number of predictors. N is the number of amino acids in the protein.
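As an illustration, the windowed composition features can be sketched as follows; the 20-letter alphabet and the truncation of the window at sequence ends are our assumptions for this sketch, not details taken from the paper:

```python
# Hypothetical sketch of amino-acid-composition features for a moving window.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet

def composition_features(sequence, center, window=21):
    """Fraction of each amino acid in a window centered at position `center`.

    The window is truncated at the sequence ends (an assumption; the paper
    does not specify its boundary handling).
    """
    half = window // 2
    sub = sequence[max(0, center - half):center + half + 1]
    return [sub.count(a) / len(sub) for a in AMINO_ACIDS]

# One feature vector x_i per residue position i.
feats = composition_features("MKVLSEGEWQLVLHVWAK", 8)
```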
The first task of our interest is to estimate the sensitivity (i.e., true positive rate) α = [α^{1},…,α^{M}] and the specificity (i.e., true negative rate) β = [β^{1},…,β^{M}] of the M predictors. The second task is to get an estimation of the unknown true labels y _{1},…,y _{ N }.
The proposed MAPML algorithm
To fulfill the two tasks defined above, we propose an iterative algorithm that we call MAPML. Given the dataset D, we use majority voting to initialize the probabilistic labels μ_i (i.e., the probability that the hidden true label is 1). Then, the algorithm alternates between the ML estimation and the MAP estimation, which are described in detail in the following subsections. Given the current estimates of the probabilistic labels, the ML estimation measures the predictors' performance (i.e., their sensitivity α and specificity β) and learns a classifier with parameter w. Given the estimated sensitivity α, specificity β, and the prior probability provided by the learned classifier, the MAP estimation updates the probabilistic labels μ_i using Bayes' rule. After the two estimations converge, the algorithm outputs both the probabilistic labels μ_i and the model parameters θ = {w,α,β}.
The proposed iterative MAPML algorithm is summarized in Algorithm 1, and the estimations are described in the following subsections.
Algorithm 1 (Iterative MAPML Algorithm)
Input: Protein sequences with prediction labels from M predictors.
Output: The estimated sensitivity and specificity of each predictor; the weight parameter of a classifier; the probabilistic labels μ _{ i }; the estimation of the hidden true labels y _{ i }.
Step 1 Convert the protein sequences into amino acid composition feature vectors.
Step 2 Use majority voting to initialize the probabilistic labels μ_i.
Step 3 Iterative optimization.

(a)
ML estimation – Estimate the model parameters θ = {w,α,β} based on current probabilistic labels µ _{ i } using (1) and (3).

(b)
MAP estimation – Given the model parameters θ, update μ _{ i } using (8).
Step 4 If θ and μ_i do not change between two successive iterations, or the maximum number of iterations is reached, go to Step 5; otherwise, go back to Step 3.
Step 5 Estimate the hidden true label y_i by applying a threshold on μ_i, that is, y_i = 1 if μ_i > γ and y_i = 0 otherwise. Here we use γ = 0.5 as the threshold.
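The iterative loop above can be sketched in Python. For brevity, this sketch replaces the logistic-regression prior of Step 3(a) with a constant prior of 0.5, so it estimates only α, β, and μ_i; it is an illustrative variant under that simplification, not the authors' implementation:

```python
import numpy as np

def mapml(Y, n_iter=50, tol=1e-6):
    """Simplified MAPML loop with a constant prior (an assumption of this sketch).

    Y: (N, M) array of binary labels from M component predictors.
    Returns probabilistic labels mu, estimated sensitivities alpha,
    specificities beta, and thresholded label estimates y_hat.
    """
    N, M = Y.shape
    mu = Y.mean(axis=1)   # Step 2: majority-vote initialization
    prior = 0.5           # simplification: constant prior in place of the classifier
    for _ in range(n_iter):
        # ML estimation of each predictor's sensitivity and specificity
        alpha = np.clip((mu @ Y) / mu.sum(), 1e-6, 1 - 1e-6)
        beta = np.clip(((1 - mu) @ (1 - Y)) / (1 - mu).sum(), 1e-6, 1 - 1e-6)
        # MAP update of mu via Bayes' rule
        a = np.prod(alpha ** Y * (1 - alpha) ** (1 - Y), axis=1)
        b = np.prod(beta ** (1 - Y) * (1 - beta) ** Y, axis=1)
        mu_new = a * prior / (a * prior + b * (1 - prior))
        if np.max(np.abs(mu_new - mu)) < tol:   # Step 4: convergence check
            mu = mu_new
            break
        mu = mu_new
    y_hat = (mu > 0.5).astype(int)              # Step 5: threshold at gamma = 0.5
    return mu, alpha, beta, y_hat
```

On synthetic data with reasonably accurate, independent predictors, this loop recovers both the hidden labels and each predictor's accuracy without any supervision.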
ML estimation of the model parameters
Given the dataset D and the current estimates of μ_i, the algorithm estimates the model parameters θ = {w,α,β} by maximizing the conditional likelihood. According to the definitions of sensitivity and specificity, we get

α^j = Σ_i μ_i y_i^j / Σ_i μ_i,   β^j = Σ_i (1 − μ_i)(1 − y_i^j) / Σ_i (1 − μ_i).   (1)
Given probabilistic labels μ_i, we can learn any classifier using ML estimation. For convenience, we explain it with a logistic regression classifier, which models the probability of the positive class as a sigmoid acting on a linear discriminant function, that is,

P(y_i = 1 | x_i, w) = σ(w^T x_i),   (2)

where the logistic sigmoid function is defined as σ(z) = 1/(1 + e^{−z}). To estimate the classifier's parameter w, we use the Newton-Raphson method [24]

w_{new} = w − η H^{−1} g,   (3)
where g is the gradient vector, H is the Hessian matrix, and η is the step length. The gradient vector is given by g = Σ_i (μ_i − σ(w^T x_i)) x_i, and the Hessian matrix is given by H = −Σ_i σ(w^T x_i)(1 − σ(w^T x_i)) x_i x_i^T.
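The Newton-Raphson fit of w against soft targets μ_i can be sketched as follows; the small ridge term added to the Hessian is our numerical-stability assumption, not part of the original method:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_soft(X, mu, n_iter=50, eta=1.0):
    """ML fit of logistic-regression weights w for probabilistic targets mu.

    X: (N, d) feature matrix; mu: (N,) probabilistic labels.
    Implements the Newton-Raphson update w <- w - eta * H^{-1} g.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        g = X.T @ (mu - p)                              # gradient of the log-likelihood
        R = p * (1 - p)
        H = -(X * R[:, None]).T @ X - 1e-6 * np.eye(d)  # Hessian, plus a tiny ridge (assumption)
        w = w - eta * np.linalg.solve(H, g)             # Newton-Raphson step
    return w
```

When the soft targets equal the true class probabilities, the gradient vanishes exactly at the generating weights, so the fit recovers them.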
MAP estimation of the unknown true labels
Given the dataset D and the model parameters θ = {w,α,β}, we define probabilistic labels μ_i = P(y_i = 1 | x_i, y_i^1,…,y_i^M, θ). Using Bayes' rule we have

μ_i = P(y_i^1,…,y_i^M | y_i = 1, θ) P(y_i = 1 | x_i, w) / P(y_i^1,…,y_i^M | x_i, θ),   (4)

which is a MAP estimation problem.
Conditioning on the true label y_i ∈ {1,0}, the denominator of formula (4) is decomposed as

P(y_i^1,…,y_i^M | x_i, θ) = P(y_i^1,…,y_i^M | y_i = 1, θ) P(y_i = 1 | x_i, w) + P(y_i^1,…,y_i^M | y_i = 0, θ) P(y_i = 0 | x_i, w).   (5)
Given the true label y_i, we assume that y_i^1,…,y_i^M are independent, that is, the predictors label the instances independently. Hence,

P(y_i^1,…,y_i^M | y_i = 1, θ) = Π_j (α^j)^{y_i^j} (1 − α^j)^{1 − y_i^j}.   (6)
Similarly, we have

P(y_i^1,…,y_i^M | y_i = 0, θ) = Π_j (β^j)^{1 − y_i^j} (1 − β^j)^{y_i^j}.   (7)
From (2), (4), (5), (6), and (7), the posterior probability μ_i, which is a soft probabilistic estimate of the hidden true label, is computed as

μ_i = a_i p_i / (a_i p_i + b_i (1 − p_i)),   (8)

where p_i = σ(w^T x_i), a_i = Π_j (α^j)^{y_i^j} (1 − α^j)^{1 − y_i^j}, and b_i = Π_j (β^j)^{1 − y_i^j} (1 − β^j)^{y_i^j}.
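A tiny numeric illustration of the MAP update for a single residue; the prior, sensitivities, and specificities below are made-up values for illustration only:

```python
import numpy as np

def map_update(p, y, alpha, beta):
    """Posterior mu for one residue given prior p, labels y from M predictors,
    and their sensitivities alpha and specificities beta (Bayes' rule update)."""
    y, alpha, beta = np.asarray(y), np.asarray(alpha), np.asarray(beta)
    a = np.prod(alpha ** y * (1 - alpha) ** (1 - y))
    b = np.prod(beta ** (1 - y) * (1 - beta) ** y)
    return a * p / (a * p + b * (1 - p))

# Three fairly reliable predictors all vote "disordered"; the posterior
# rises well above the prior of 0.3.
mu = map_update(0.3, [1, 1, 1], [0.8, 0.7, 0.9], [0.9, 0.8, 0.85])
```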
Analysis of the MAP estimation
To explain how the MAP estimation model works, we apply the logit function to the posterior probability μ_i. From (8), the logit of μ_i is written as

logit(μ_i) = ln(μ_i / (1 − μ_i)) = w^T x_i + Σ_j y_i^j [logit(α^j) + logit(β^j)] + c,   (9)

where c = Σ_j ln((1 − α^j)/β^j) is a constant. The first term of (9), w^T x_i, is a linear combination (provided by the learned classifier) of the current amino acid's composition features. The second term of (9) is a weighted linear combination of the prediction labels from all the predictors, where the weight of each predictor is the sum of the logits of its estimated sensitivity and specificity. From (9), we can infer that the estimates of the hidden true labels (in logit form) depend both on protein sequence information and on the prediction labels from all the predictors.
Results
Evaluation criteria
CASP evaluation was based on per-residue predictions over the entire set of targets. The performance of predictors was evaluated by three criteria: the average of sensitivity and specificity (ACC), a weighted score (S_w) that accounts for the rates of ordered and disordered residues in the datasets, and the area under the ROC curve (AUC).
In CASP, predictors were asked to submit, for each residue, a binary label of "O" or "D" (ordered or disordered state) and a probability that the specific position is in a disordered region (a value in the range of 0 to 1). The binary classification of each predictor was assessed by the following scores:

Sensitivity = TP/(TP + FN),   Specificity = TN/(TN + FP),

where TP is the number of true positives (disordered residues that were classified correctly), FP the number of false positives (ordered residues that were classified as disordered), TN the number of true negatives (ordered residues that were classified correctly), and FN the number of false negatives (disordered residues that were classified as ordered). The higher the two scores, the better the predictions; therefore, they were combined into a single score, the average of the two:

ACC = (Sensitivity + Specificity)/2.
Since disordered residues are rare in the targets, the weighted score S_w was introduced at CASP6 [25]:

S_w = (W_disorder·TP + W_order·TN − W_order·FP − W_disorder·FN) / (W_disorder·(TP + FN) + W_order·(TN + FP)),

where W_disorder is the fraction of ordered residues and W_order is the fraction of disordered residues in the targets. Therefore, S_w ranges from −1 to 1, and predicting all residues in the targets to be ordered results in a score of zero. As defined, this measure greatly rewards disordered residues correctly identified as disordered while heavily penalizing any disordered residue that is misclassified.
The ROC curve was used to examine the ability of the predictors to estimate the confidence level of their predictions, and is based on the submitted disorder probability. By varying the threshold applied to that probability, the values of sensitivity and specificity change accordingly. Plotting (1 − specificity) on the x-axis against sensitivity on the y-axis for all thresholds, from the minimal to the maximal value, traces a continuous curve: the ROC curve. The area under this curve (AUC) is a reliable indicator of prediction quality; its value lies between 0 and 1, and the larger the area, the better the predictor.
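For concreteness, the three criteria can be computed as follows; this is a straightforward transcription of the definitions above, not CASP assessment code:

```python
def casp_scores(TP, FP, TN, FN):
    """ACC and S_w from a binary confusion matrix, per the definitions above."""
    sens = TP / (TP + FN)
    spec = TN / (TN + FP)
    acc = (sens + spec) / 2
    total = TP + FP + TN + FN
    w_disorder = (TN + FP) / total   # fraction of ordered residues
    w_order = (TP + FN) / total      # fraction of disordered residues
    s_w = (w_disorder * TP + w_order * TN - w_order * FP - w_disorder * FN) / \
          (w_disorder * (TP + FN) + w_order * (TN + FP))
    return acc, s_w

def auc(scores_pos, scores_neg):
    """AUC via its pairwise-ranking interpretation: the probability that a
    randomly chosen disordered residue outscores a randomly chosen ordered one."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

Note that predicting everything as ordered (TP = FP = 0) yields S_w = 0, and a perfect prediction yields S_w = 1, matching the stated range.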
Performance evaluation using the CASP data
To assess prediction performance, we used CASP9 data consisting of 117 experimentally characterized protein sequences with 23656 ordered and 2427 disordered residues. To reduce noise due to experimental uncertainty, we did not consider disorder segments shorter than four residues in the evaluation. We also obtained the prediction labels and disorder probabilities of all predictors that participated in CASP9 from the contest's official website [14]. We selected 15 predictors developed by groups at different institutions, assuming that their errors are independent. We set the size of the moving window to 21, based on our previous study [26] as well as the ratio of long (>30 residues) disordered segments to short ones in the data.
In the experiment, the input to our iterative MAPML algorithm consisted of the sequences of the 117 protein targets and the prediction labels from the 15 component predictors. After the algorithm converged, we used the estimates of the hidden true labels y_i produced by MAPML as the binary disorder/order predictions, and the probabilistic labels μ_i as the disorder probabilities. As a baseline, we also integrated the component predictors by majority voting, which assumes all predictors are equally good, and compared it with the MAPML algorithm.
The estimated sensitivity α and specificity β of the 15 component predictors, obtained by our MAPML meta predictor without relying on true disorder/order labels, are shown in Figure 1. The estimates are sorted by the average of estimated sensitivity and specificity and were quite consistent with the evaluations reported by the CASP9 committee [27], who used labeled data of confirmed disorder/order residues.
A comparison of the 15 predictors, the majority voting method, and our MAPML meta predictor on CASP9 labeled data with confirmed disorder/order is shown in Figure 2, and the evaluation scores are summarized in Table 1. In this comparison, our iterative MAPML algorithm achieved an ACC score of 0.764, an S_w score of 0.513, and an AUC score of 0.859. These scores were superior to those of the 15 component predictors in the CASP9 contest and also to the majority voting integration. In addition, Figures 1 and 2 can be used to assess how closely the accuracies and rankings of the 15 predictors obtained by the MAPML algorithm without any labeled data match their evaluation on true labels by the CASP9 committee.
Using the same measures and procedures, we assessed the accuracy of the 13 CASP8 disorder predictors on CASP8 data [22] and of the 11 CASP7 disorder predictors on CASP7 data [28], without using the corresponding experimentally determined disorder/order labels. As with CASP9, most of the predictor ranks obtained by the MAPML algorithm were quite consistent with the predictors' true accuracy on the CASP8 and CASP7 data. The scores of our MAPML meta predictor were better than the corresponding scores of the component predictors in the CASP8 and CASP7 contests and of their majority voting integration. The details of the CASP8 experiment are summarized in Figures 3 and 4 and Table 2; the details of the CASP7 experiment are summarized in Figures 5 and 6 and Table 3.
The relationship between the number of component predictors and the prediction performance
Although our MAPML meta predictor outperformed each component predictor at CASP9, CASP8, and CASP7, in general it may not be the case that integration of all available component predictors is the best choice as some predictors may negatively influence the combination results. To analyze effects of possible combination choices on the accuracy of the MAPML algorithm, we studied the relationship between the number of component predictors and the prediction performance of different combinations among CASP9, CASP8, and CASP7 predictors.
For the CASP9 data, any subset of the 15 individual predictors can be combined using our algorithm. By considering all non-empty subsets, we constructed 32767 different meta predictors with the MAPML algorithm. The relationship between the number of component predictors and the prediction performance (S_w) of the MAPML algorithm on CASP9 data is shown in Figure 7. Similarly, for the CASP8 and CASP7 data, we built all 8191 and 2047 meta predictors, respectively, by combining all non-empty subsets of the 13 and 11 component predictors with the MAPML algorithm. The corresponding relationships for the CASP8 and CASP7 data are shown in Figure 8 and Figure 9.
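The subset counts follow from the 2^M − 1 non-empty subsets of M predictors; the enumeration can be sketched as:

```python
from itertools import combinations

def predictor_subsets(predictors):
    """Yield every non-empty subset of the component predictors."""
    for k in range(1, len(predictors) + 1):
        yield from combinations(predictors, k)

# 15 CASP9 predictors give 2**15 - 1 = 32767 candidate meta predictors;
# 13 CASP8 predictors give 8191, and 11 CASP7 predictors give 2047.
n_meta = sum(1 for _ in predictor_subsets(range(15)))
```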
The results of these experiments (Figures 7, 8, and 9) provide evidence that the average and the lowest prediction performance improve as the number of component predictors increases. The difference between the highest and the lowest performance also decreases as the number of component predictors increases. However, the curves representing the highest prediction performance suggest that employing more component predictors does not necessarily yield the best achievable performance. For example, a combination of only five CASP8 predictors (MULTICOM, GSMetaServer2, McGuffin, mariner1, and DISOPRED) had the highest overall prediction performance (S_w = 0.691).
Conclusions
In this study, we proposed an iterative MAPML algorithm to predict protein disorder. The algorithm alternately provides the MAP estimation of disorder prediction and the ML estimation of the quality of multiple component disorder predictors. We evaluated the performance of the MAPML algorithm versus the performance of other predictors using CASP datasets. The results showed that our meta predictor not only outperformed other predictors but also appropriately ranked other predictors without knowing the true labels.
The proposed algorithm assumed that the accuracy of each predictor does not depend on the given protein sequences and that the predictors make their errors independently. Therefore, in our experiments we used component predictors developed by groups at different institutions. We emphasize that in practice the independence assumption might not always hold, which is a limitation of the proposed algorithm. To relax this assumption and make the probabilistic meta model's disorder predictions even more accurate, our research in progress incorporates additional parameters such as disorder flavor and the difficulty of a prediction task.
Abbreviations
CASP: Critical Assessment of Techniques for Protein Structure Prediction
DisProt: Database of Protein Disorder
PDB: Protein Data Bank
ROC: receiver operating characteristic
References
 1.
Obradovic Z, Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK: Predicting intrinsic disorder from amino acid sequence. Proteins 2003,53(Suppl 6):566–572.
 2.
Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004,20(13):2138–2139. 10.1093/bioinformatics/bth195
 3.
Dosztanyi Z, Csizmok V, Tompa P, Simon I: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 2005,21(16):3433–3434. 10.1093/bioinformatics/bti541
 4.
Wang L, Sauer UH: OnD-CRF: predicting order and disorder in proteins using conditional random fields. Bioinformatics 2008,24(11):1401–1402. 10.1093/bioinformatics/btn132
 5.
McGuffin LJ: Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics 2008,24(16):1798–1804. 10.1093/bioinformatics/btn326
 6.
Sethi D, Garg A, Raghava GP: DPROT: prediction of disordered proteins using evolutionary information. Amino Acids 2008,35(3):599–605. 10.1007/s00726-008-0085-y
 7.
Deng X, Eickholt J, Cheng J: PreDisorder: ab initio sequence-based prediction of protein disordered regions. BMC Bioinformatics 2009, 10: 436. 10.1186/1471-2105-10-436
 8.
Hirose S, Shimizu K, Noguchi T: POODLE-I: Disordered region prediction by integrating POODLE series and structural information predictors based on a workflow approach. In Silico Biology 2010, 10: 0015.
 9.
Walsh I, Martin AJ, Domenico TD, Vullo A, Pollastri G, Tosatto SC: CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs. Nucleic Acids Res 2011,39(Web Server issue):W190–W196.
 10.
Mizianty MJ, Zhang T, Xue B, Zhou Y, Dunker AK, Uversky VN, Kurgan L: In-silico prediction of disorder content using hybrid sequence representation. BMC Bioinformatics 2011,12(1):245. 10.1186/1471-2105-12-245
 11.
Xie H, Vucetic S, Iakoucheva LM, Oldfield CJ, Dunker AK, Uversky VN, Obradovic Z: Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. Journal of Proteome Research 2007,6(5):1882–1898. 10.1021/pr060392u
 12.
Midic U, Oldfield CJ, Dunker AK, Obradovic Z, Uversky VN: Protein disorder in the human diseasome: unfoldomics of human genetic diseases. BMC Genomics 2009,10(Suppl 1):S12. 10.1186/1471-2164-10-S1-S12
 13.
Romero P, Obradovic Z, Kissinger C, Villafranca JE, Dunker AK: Identifying disordered regions in proteins from amino acid sequence. In Proceedings of the International Conference on Neural Networks: 9–12 Jun 1997; Houston. Edited by: IEEE Neural Networks Council. IEEE; 1997:90–95.
 14.
CASP Contests Home Page [http://predictioncenter.org]
 15.
He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK: Predicting intrinsic disorder in proteins: an overview. Cell Res 2009,19(8):929–949. 10.1038/cr.2009.87
 16.
Ishida T, Kinoshita K: Prediction of disordered regions in proteins based on the meta approach. Bioinformatics 2008,24(11):1344–1348. 10.1093/bioinformatics/btn195
 17.
Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B: Improved disorder prediction by combination of orthogonal approaches. PLoS ONE 2009,4(2):e4433. 10.1371/journal.pone.0004433
 18.
Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN: PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta 2010,1804(4):996–1010.
 19.
Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L: Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 2010,26(18):i489–i496. 10.1093/bioinformatics/btq373
 20.
Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, Dunker AK: DisProt: the Database of Disordered Proteins. Nucleic Acids Res 2007,35(Database issue):D786–D793.
 21.
Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977,112(3):535–542. 10.1016/S0022-2836(77)80200-3
 22.
NoivirtBrik O, Prilusky J, Sussman JL: Assessment of disorder predictions in CASP8. Proteins 2009,77(Suppl 9):210–216.
 23.
Raykar VC, Yu S, Zhao LH, Jerebko A, Florin C, Valadez GH, Bogoni L, Moy L: Supervised Learning from Multiple Experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009): 14–18 June 2009; Montreal. Edited by: Danyluk AP, Bottou L, Littman ML. ACM; 2009:889–896.
 24.
Bishop C: Pattern recognition and machine learning. New York: Springer; 2006:203–213.
 25.
Jin Y, Dunbrack RL: Assessment of disorder predictions in CASP6. Proteins 2005,61(Suppl 7):167–175.
 26.
Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z: Optimizing long intrinsic disorder predictors with protein evolutionary information. Journal of Bioinformatics and Computational Biology 2005,3(1):35–60. 10.1142/S0219720005000886
 27.
Assessment of disorder predictions in CASP9 [http://predictioncenter.org/casp9/doc/presentations/CASP9_DR.pdf]
 28.
Bordoli L, Kiefer F, Schwede T: Assessment of disorder predictions in CASP7. Proteins 2007,69(Suppl 8):129–136.
Acknowledgements
This project was funded in part under a grant with the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations, or conclusions.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PZ designed the algorithms, implemented programs, carried out the analysis, and drafted the manuscript. ZO inspired the overall work, provided advice, and revised the final manuscript. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Published
DOI
Keywords
 Prediction Performance
 Area Under This Curve
 Probabilistic Label
 True Label
 Component Predictor