Unsupervised Integration of Multiple Protein Disorder Predictors: The Method and Evaluation on CASP7, CASP8 and CASP9 Data

Zhang, Ping; Obradovic, Zoran

doi:10.1186/1477-5956-9-S1-S12

Volume 9 Supplement 1

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010

Proceedings
Open access
Published: 14 October 2011

Proteome Science volume 9, Article number: S12 (2011)

Abstract

Background

Studies of intrinsically disordered proteins that lack a stable tertiary structure but still have important biological functions critically rely on computational methods that predict this property based on sequence information. Although a number of fairly successful models for prediction of protein disorder have been developed over the last decade, the quality of their predictions is limited by available cases of confirmed disorders.

Results

To estimate protein disorder from protein sequences more reliably, an iterative algorithm is proposed that integrates the predictions of multiple disorder models without relying on any protein sequences with confirmed disorder annotation. The iterative method alternately provides the maximum a posteriori (MAP) estimation of disorder prediction and the maximum-likelihood (ML) estimation of the quality of multiple disorder predictors. Experiments on data used at CASP7, CASP8, and CASP9 have shown the effectiveness of the proposed algorithm.

Conclusions

The proposed algorithm can potentially be used to predict protein disorder and provide helpful suggestions on choosing suitable disorder predictors for unknown protein sequences.

Background

Identification of regions in proteins that do not have unique structures, called intrinsic disorders, is addressed computationally by a number of groups that aim to predict this property from sequence information [1–10]. Contrary to the lock and key paradigm, disordered regions were recently found to be involved in many important functions [11] and in various diseases [12].

Computational characterization of disorder in proteins is appealing due to the difficulties and high cost involved in experimental characterization of disorder. The first predictor of protein disorder was developed by our group in 1997 [13]. Due to the importance of predicting this property, protein disorder prediction was introduced in 2002 as a category of the CASP contests [14], which promoted the development of new methods for prediction of protein disorder. Consequently, the number of prediction methods available through the Internet has increased rapidly. More than 50 predictors of intrinsic protein disorder have been described in a recent review by He et al. [15], enabling researchers to use a meta approach that predicts protein disorder by integrating the prediction results of several methods. Recently, four such meta predictors, i.e. metaPrDOS [16], MD [17], PONDR-FIT [18], and MFDp [19], have been developed for the purpose of improving disorder prediction accuracy. In the reported experiments, they performed significantly better than their individual component predictors.

A limitation of these supervised learning based meta predictors is that they are prone to over-optimization in their integration processes. They are developed on disorder/order labeled training datasets that contain very few proteins that have not already been used for development of the component predictors (e.g. sets as small as DisProt [20] or as specialized as missing coordinates from the PDB [21]). Therefore, previous meta predictors may be less accurate for proteins whose sequence patterns differ greatly from the cases used for integration. For example, although metaPrDOS achieved higher prediction accuracy than all predictors participating in CASP7, as stated in its paper [16], it failed to be one of the top predictors in CASP8 [22]. Moreover, one of metaPrDOS' component predictors, DISOPRED [2], was more accurate than metaPrDOS in CASP8 [22].

To address potential over-optimization problems of meta predictor development by learning from small labeled data, here we introduce a new disorder meta prediction method. By following the idea from Raykar et al. [23] we derived an iterative MAP and ML estimation (MAP-ML) based algorithm for the construction of a meta predictor in a completely unsupervised process using protein sequences without confirmed disorder/order annotations. Performance evaluation of the new meta method is presented by using CASP prediction targets as the test sets, which enabled us to compare the prediction results with other methods used in the CASP contests.

Methods

Problem statement

Let us define the dataset as $D = \{x_i, y_i^1, \ldots, y_i^M\}_{i=1}^{N}$. Here, $x_i$ is an amino acid composition feature vector derived from the subsequence covered by a moving window centered at the i-th amino acid of the current protein; $y_i^j \in \{0, 1\}$ (1 represents a disordered state while 0 represents an ordered state) is the prediction label assigned to the instance $x_i$ by the j-th predictor; M is the number of predictors; and N is the number of amino acids in the protein.
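As an illustration of how the composition feature vectors might be computed, the following minimal sketch slides a window over a sequence and counts residue frequencies. The function name `composition_features`, the 20-letter alphabet ordering, and the truncation of the window at the sequence termini are assumptions made for this example; the text does not specify how the termini are handled.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is arbitrary)


def composition_features(sequence: str, window: int = 21) -> List[List[float]]:
    """Return one 20-dimensional amino acid composition vector per residue."""
    half = window // 2
    features = []
    for i in range(len(sequence)):
        # Assumption: near the termini the window is truncated, not padded.
        segment = sequence[max(0, i - half): i + half + 1]
        # Fraction of each standard residue inside the window.
        features.append([segment.count(aa) / len(segment) for aa in AMINO_ACIDS])
    return features
```

Non-standard residue codes (for example `X`) are simply not counted in this sketch, so the entries of a vector can sum to less than 1 for windows that contain them.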

The first task of interest is to estimate the sensitivity (i.e., true positive rate) α = [α¹,…,α^M] and the specificity (i.e., true negative rate) β = [β¹,…,β^M] of the M predictors. The second task is to estimate the unknown true labels y_1,…,y_N.

The proposed MAP-ML algorithm

To fulfill the two tasks defined above, we propose an iterative algorithm that we call MAP-ML. Given dataset D, we use majority voting to initialize the probabilistic labels μ_i (i.e., the probability that the hidden true label is 1). The algorithm then alternates between the ML estimation and the MAP estimation, which are described in detail in the following subsections. Given the current estimates of the probabilistic labels, the ML estimation measures the predictors' performance (i.e., their sensitivity α and specificity β) and learns a classifier with parameter w. Given the estimated sensitivity α, specificity β, and the prior probability provided by the learned classifier, the MAP estimation updates the probabilistic labels μ_i using Bayes' rule. After the two estimations converge, the algorithm outputs both the probabilistic labels μ_i and the model parameters θ = {w,α,β}.

The proposed iterative MAP-ML algorithm is summarized in Algorithm 1, and the estimations are described in the following subsections.

Algorithm 1 (Iterative MAP-ML Algorithm)

Input: Protein sequences with prediction labels from M predictors.

Output: The estimated sensitivity and specificity of each predictor; the weight parameter of a classifier; the probabilistic labels μ_i; the estimation of the hidden true labels y_i.

Step 1 Convert the protein sequences into amino acid composition feature vectors.

Step 2 Use majority voting to initialize $\mu_i = \frac{1}{M}\sum_{j=1}^{M} y_i^j$.

Step 3 Iterative optimization.

(a) ML estimation – Estimate the model parameters θ = {w,α,β} based on the current probabilistic labels μ_i using (1) and (3).
(b) MAP estimation – Given the model parameters θ, update μ_i using (8).

Step 4 If θ and μ_i do not change between two successive iterations or the maximum number of iterations is reached, go to Step 5; otherwise, go back to Step 3.

Step 5 Estimate the hidden true label y_i by applying a threshold to μ_i, that is, y_i = 1 if μ_i > γ and y_i = 0 otherwise. Here we use γ = 0.5 as the threshold.
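The control flow of Algorithm 1 can be sketched as a small driver that leaves the two estimation steps pluggable. This is only an outline under stated assumptions: the names `run_map_ml`, `ml_step`, and `map_step` are hypothetical, the tolerance `tol` and iteration cap are arbitrary, and convergence is checked on the probabilistic labels only (the algorithm as described also monitors θ).

```python
from typing import Any, Callable, List, Sequence, Tuple


def majority_vote_init(votes: Sequence[Sequence[int]]) -> List[float]:
    """Step 2: mu_i is the fraction of predictors that labeled residue i as disordered."""
    return [sum(row) / len(row) for row in votes]


def run_map_ml(
    features: Any,
    votes: Sequence[Sequence[int]],
    ml_step: Callable[[Any, Sequence[Sequence[int]], List[float]], Any],
    map_step: Callable[[Any, Sequence[Sequence[int]], Any], List[float]],
    max_iter: int = 100,
    tol: float = 1e-6,
    gamma: float = 0.5,
) -> Tuple[Any, List[float], List[int]]:
    """Alternate ML and MAP steps until the probabilistic labels stop changing."""
    mu = majority_vote_init(votes)
    theta = None
    for _ in range(max_iter):
        theta = ml_step(features, votes, mu)        # Step 3(a): estimate theta from mu
        new_mu = map_step(features, votes, theta)   # Step 3(b): update mu from theta
        change = max(abs(a - b) for a, b in zip(new_mu, mu))
        mu = new_mu
        if change < tol:                            # Step 4 (simplified: mu only)
            break
    labels = [1 if m > gamma else 0 for m in mu]    # Step 5: threshold at gamma
    return theta, mu, labels
```

With trivial stand-ins for the two steps, the driver reduces to thresholded majority voting, which is a convenient sanity check before the real estimation steps are plugged in.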

ML estimation of the model parameters

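Given the dataset D and the current estimates of μ_i, the algorithm estimates the model parameters θ = {w,α,β} by maximizing the conditional likelihood. According to the definitions of sensitivity and specificity, we get

$$\alpha^j = \frac{\sum_{i=1}^{N} \mu_i\, y_i^j}{\sum_{i=1}^{N} \mu_i}, \qquad \beta^j = \frac{\sum_{i=1}^{N} (1 - \mu_i)(1 - y_i^j)}{\sum_{i=1}^{N} (1 - \mu_i)}. \tag{1}$$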
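The sensitivity and specificity estimates are soft-label versions of the usual true positive and true negative rates: each residue contributes to the "disordered" counts with weight μ_i and to the "ordered" counts with weight 1 − μ_i. A minimal sketch follows; the function name `estimate_sensitivity_specificity` is hypothetical, and the code assumes the probabilistic labels are not all 0 or all 1, so that neither denominator vanishes.

```python
from typing import List, Sequence, Tuple


def estimate_sensitivity_specificity(
    mu: Sequence[float], votes: Sequence[Sequence[int]]
) -> Tuple[List[float], List[float]]:
    """Soft-label sensitivity (alpha) and specificity (beta) for each predictor.

    mu[i]       current probability that residue i is disordered
    votes[i][j] binary label given to residue i by predictor j
    """
    n_predictors = len(votes[0])
    soft_pos = sum(mu)                    # expected number of disordered residues
    soft_neg = sum(1.0 - m for m in mu)   # expected number of ordered residues
    alpha = [
        sum(m * row[j] for m, row in zip(mu, votes)) / soft_pos
        for j in range(n_predictors)
    ]
    beta = [
        sum((1.0 - m) * (1 - row[j]) for m, row in zip(mu, votes)) / soft_neg
        for j in range(n_predictors)
    ]
    return alpha, beta
```

For example, with `mu = [1, 1, 0, 0]` and votes `[[1, 1], [1, 0], [0, 0], [0, 1]]`, the first predictor agrees with the soft labels everywhere and the second agrees on half of each class.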

Given probabilistic labels μ_i, we can learn any classifier using ML estimation. However, for convenience, we explain it with a logistic regression classifier. With that classifier, the probability of the positive class is modeled as a sigmoid acting on the linear discriminant function, that is,

$$p(y_i = 1 \mid x_i, w) = \sigma(w^{T} x_i), \tag{2}$$

where the logistic sigmoid function is defined as $\sigma(z) = 1/(1 + e^{-z})$. To estimate the classifier's parameter w, we use a gradient-based method, namely the Newton-Raphson method [24]

$$w^{t+1} = w^{t} - \eta H^{-1} g, \tag{3}$$

where g is the gradient vector, H is the Hessian matrix, and η is the step length. The gradient vector is given by $g(w) = \sum_{i=1}^{N} [\mu_i - \sigma(w^{T} x_i)]\, x_i$, and the Hessian matrix is given by $H(w) = -\sum_{i=1}^{N} \sigma(w^{T} x_i)[1 - \sigma(w^{T} x_i)]\, x_i x_i^{T}$.
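A single Newton-Raphson update for the logistic regression classifier can be sketched with NumPy as below. The sketch assumes the usual logistic-regression gradient and Hessian with the soft targets μ_i in place of hard labels; the function names and the tiny `ridge` term (added only to keep the Hessian invertible) are assumptions of this example, not part of the described method.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def newton_step(
    w: np.ndarray, X: np.ndarray, mu: np.ndarray, eta: float = 1.0, ridge: float = 1e-8
) -> np.ndarray:
    """One update w <- w - eta * H^{-1} g for logistic regression with soft targets.

    X  : (N, d) feature matrix, one row per residue
    mu : (N,) probabilistic labels used as targets
    """
    p = sigmoid(X @ w)                       # current predicted probabilities
    g = X.T @ (mu - p)                       # gradient of the log-likelihood
    H = -(X.T * (p * (1.0 - p))) @ X         # Hessian (negative semi-definite)
    H -= ridge * np.eye(X.shape[1])          # assumption: small ridge for invertibility
    return w - eta * np.linalg.solve(H, g)   # solve H d = g instead of inverting H
```

Because H is negative definite, subtracting H⁻¹g moves w uphill on the log-likelihood; repeating the step a handful of times is usually enough on small problems.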

MAP estimation of the unknown true labels

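Given the dataset D and the model parameters θ = {w,α,β}, we define probabilistic labels $\mu_i = p(y_i = 1 \mid y_i^1, \ldots, y_i^M, x_i, \theta)$. Using Bayes' rule we have

$$\mu_i = \frac{p(y_i^1, \ldots, y_i^M \mid y_i = 1, \alpha)\; p(y_i = 1 \mid x_i, w)}{p(y_i^1, \ldots, y_i^M \mid x_i, \theta)}, \tag{4}$$

which is a MAP estimation problem.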

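Conditioning on the true label y_i ∈ {1,0}, the denominator of formula (4) is decomposed as

$$p(y_i^1, \ldots, y_i^M \mid x_i, \theta) = p(y_i^1, \ldots, y_i^M \mid y_i = 1, \alpha)\, p(y_i = 1 \mid x_i, w) + p(y_i^1, \ldots, y_i^M \mid y_i = 0, \beta)\, p(y_i = 0 \mid x_i, w). \tag{5}$$

Given the true label y_i, we assume that $y_i^1, \ldots, y_i^M$ are independent, that is, the predictors label the instances independently. Hence,

$$p(y_i^1, \ldots, y_i^M \mid y_i = 1, \alpha) = \prod_{j=1}^{M} p(y_i^j \mid y_i = 1, \alpha^j) = \prod_{j=1}^{M} [\alpha^j]^{y_i^j} [1 - \alpha^j]^{1 - y_i^j}. \tag{6}$$

Similarly, we have

$$p(y_i^1, \ldots, y_i^M \mid y_i = 0, \beta) = \prod_{j=1}^{M} p(y_i^j \mid y_i = 0, \beta^j) = \prod_{j=1}^{M} [\beta^j]^{1 - y_i^j} [1 - \beta^j]^{y_i^j}. \tag{7}$$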

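From (2), (4), (5), (6), and (7), the posterior probability μ_i, which is a soft probabilistic estimate of the hidden true label, is computed as

$$\mu_i = \frac{a_i\, p_i}{a_i\, p_i + b_i\, (1 - p_i)}, \tag{8}$$

where $p_i = \sigma(w^{T} x_i)$, $a_i = \prod_{j=1}^{M} [\alpha^j]^{y_i^j} [1 - \alpha^j]^{1 - y_i^j}$, and $b_i = \prod_{j=1}^{M} [\beta^j]^{1 - y_i^j} [1 - \beta^j]^{y_i^j}$.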
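The MAP update combines the classifier's prior with the likelihood of the observed votes under each hypothesis (residue disordered or ordered), assuming the predictors vote independently given the true label. The sketch below is a direct transcription of that Bayes-rule computation; the function name `map_update` is hypothetical, and the code assumes every sensitivity and specificity lies strictly between 0 and 1.

```python
from typing import List, Sequence


def map_update(
    priors: Sequence[float],
    votes: Sequence[Sequence[int]],
    alpha: Sequence[float],
    beta: Sequence[float],
) -> List[float]:
    """Posterior probability of disorder for every residue.

    priors[i]   classifier output p(y_i = 1 | x_i, w)
    votes[i][j] binary label given to residue i by predictor j
    alpha[j]    estimated sensitivity of predictor j
    beta[j]     estimated specificity of predictor j
    """
    mu = []
    for p, row in zip(priors, votes):
        a = b = 1.0
        for y, al, be in zip(row, alpha, beta):
            a *= al if y == 1 else 1.0 - al   # likelihood of the vote if truly disordered
            b *= 1.0 - be if y == 1 else be   # likelihood of the vote if truly ordered
        mu.append(a * p / (a * p + b * (1.0 - p)))
    return mu
```

As a small worked case, a single predictor with sensitivity 0.8 and specificity 0.9 and a flat prior of 0.5 yields a posterior of about 0.89 when it votes "disordered" and about 0.18 when it votes "ordered".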

Analysis of the MAP estimation

To explain how the MAP estimation model works, we apply the logit function to the posterior probability μ_i. From (8), the logit of μ_i is written as

$$\operatorname{logit}(\mu_i) = \log\frac{\mu_i}{1 - \mu_i} = w^{T} x_i + \sum_{j=1}^{M} y_i^j \left[\operatorname{logit}(\alpha^j) + \operatorname{logit}(\beta^j)\right] + c, \tag{9}$$

where $c = \sum_{j=1}^{M} \log\frac{1 - \alpha^j}{\beta^j}$ is a constant. The first term of (9), w^T x_i, is a linear combination (provided by the learned classifier) of the current amino acid's composition features. The second term of (9) is a weighted linear combination of the prediction labels from all the predictors. The weight of each predictor is the sum of the logits of its estimated sensitivity and specificity. From (9), we can infer that the estimates of the hidden true labels (in logit form) depend both on protein sequence information and on the prediction labels from all the predictors.
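One consequence of weighting each predictor by the sum of the logits of its sensitivity and specificity is easy to check numerically: a predictor whose sensitivity and specificity sum to 1 (no better than chance) receives zero weight, and one that is systematically wrong receives a negative weight, so its votes are effectively inverted. The helper name `predictor_weight` in this small sketch is hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def predictor_weight(alpha: float, beta: float) -> float:
    """Weight of one predictor's vote: logit(sensitivity) + logit(specificity)."""
    return logit(alpha) + logit(beta)


# A good predictor, a chance-level predictor, and a systematically wrong one.
for name, a, b in [("good", 0.8, 0.9), ("chance", 0.7, 0.3), ("inverted", 0.2, 0.3)]:
    print(f"{name:9s} alpha={a:.1f} beta={b:.1f} weight={predictor_weight(a, b):+.3f}")
```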

Results

Evaluation criteria

CASP evaluation was based on per-residue predictions of the entire set of targets. The performance of predictors was evaluated by three criteria: the average of sensitivity and specificity (ACC), a weighted score (S_w) that considers the rates of ordered and disordered residues in the datasets, and the area under the ROC curve (AUC).

In CASP, predictors were asked to submit, for each residue, a binary label of "O" or "D" (order or disorder state) and a probability that the specific position is in a disordered region (a value in the range of 0 to 1). The binary classification of each predictor was assessed by the following scores:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$

where TP is the number of true positives (disordered residues that were classified correctly), FP the number of false positives (ordered residues that were classified as disordered), TN the number of true negatives (ordered residues that were classified correctly), and FN the number of false negatives (disordered residues that were classified as ordered). The higher the two scores, the better the predictions; therefore, they were combined into a single score, which is the average of the two:

$$ACC = \frac{\text{Sensitivity} + \text{Specificity}}{2}.$$

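Since disordered residues are rare in the targets, the weighted score S_w was introduced at CASP6 [25]:

$$S_w = \frac{W_{disorder} \cdot TP - W_{order} \cdot FP + W_{order} \cdot TN - W_{disorder} \cdot FN}{W_{disorder} \cdot N_{disorder} + W_{order} \cdot N_{order}},$$

where W_disorder is the total percentage of ordered residues, W_order is the total percentage of disordered residues, and N_disorder and N_order are the numbers of disordered and ordered residues, respectively. Therefore, S_w ranges from -1 to 1, and predicting all the residues in the targets to be ordered would result in a score of zero. As defined, this measure greatly rewards disordered residues correctly identified as disordered while heavily penalizing any disordered residue that is misclassified.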
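The binary scores can be computed from a confusion matrix as in the sketch below. It assumes the commonly cited CASP6 form of the weighted score, in which correct and incorrect calls on disordered residues are weighted by the fraction of ordered residues and vice versa; the function name `disorder_scores` and the default of deriving the weights from the counts themselves are assumptions of this example.

```python
from typing import Dict, Optional


def disorder_scores(
    tp: int, fp: int, tn: int, fn: int,
    w_disorder: Optional[float] = None,
    w_order: Optional[float] = None,
) -> Dict[str, float]:
    """Sensitivity, specificity, ACC, and weighted score S_w from confusion counts."""
    n_disorder = tp + fn            # truly disordered residues
    n_order = tn + fp               # truly ordered residues
    total = n_disorder + n_order
    if w_disorder is None:
        w_disorder = n_order / total   # weight for disordered residues = fraction ordered
    if w_order is None:
        w_order = n_disorder / total   # weight for ordered residues = fraction disordered
    sensitivity = tp / n_disorder
    specificity = tn / n_order
    s_w = (w_disorder * tp - w_order * fp + w_order * tn - w_disorder * fn) / (
        w_disorder * n_disorder + w_order * n_order
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ACC": (sensitivity + specificity) / 2.0,
        "S_w": s_w,
    }
```

Calling every residue ordered on a set with 23656 ordered and 2427 disordered residues gives ACC = 0.5 and S_w = 0 under these weights, matching the property stated in the text.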

The ROC curve was used to examine the ability of the predictors to estimate the confidence level of their predictions. The ROC curve is based on the disorder probability. Given these probabilities, varying the threshold above which a residue is called disordered changes the values of sensitivity and specificity accordingly. Taking (1 − specificity) as the x-axis and sensitivity as the y-axis, the pairs obtained as the threshold moves from its minimum to its maximum value trace a continuous curve. This is the ROC curve, and the area under it (AUC) is a reliable indicator of prediction quality. The AUC lies between 0 and 1; the larger the area, the better the predictor.
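The threshold sweep behind the ROC curve can be sketched directly: sort residues by decreasing disorder probability, lower the threshold one distinct score at a time, and accumulate the area with the trapezoid rule. The function name `roc_auc` is hypothetical, and the code assumes both classes are present in `labels`.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via a descending threshold sweep (trapezoid rule)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Residues with the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / n_pos   # sensitivity at this threshold
        fpr = fp / n_neg   # 1 - specificity at this threshold
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return auc
```

Grouping tied scores matters: a predictor that assigns the same probability to every residue then scores 0.5, as expected for an uninformative ranking.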

Performance evaluation using the CASP data

To assess prediction performance, we used CASP9 data consisting of 117 experimentally characterized protein sequences with 23656 ordered and 2427 disordered residues. To reduce noise due to experimental uncertainty, we did not consider disordered segments shorter than four residues in the evaluation process. We also obtained the prediction labels and disorder probabilities of all predictors that participated in CASP9 from the contest's official website [14]. We selected 15 predictors developed by groups at different institutions, assuming that their errors are independent. We set the size of the moving window to 21, based on our previous study [26] as well as on the ratio of long (>30 residues) disordered segments to short ones in the data.
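One possible reading of the short-segment rule is that residues belonging to disordered runs of fewer than four residues are masked out before scoring. The sketch below implements that reading only; the function name `evaluation_mask` is hypothetical, and whether such residues were excluded or treated in some other way is an assumption not settled by the text.

```python
from typing import List, Sequence


def evaluation_mask(labels: Sequence[int], min_len: int = 4) -> List[bool]:
    """Mark residues to keep: False for disordered runs shorter than min_len."""
    keep = [True] * len(labels)
    i = 0
    while i < len(labels):
        if labels[i] == 1:
            j = i
            while j < len(labels) and labels[j] == 1:   # extend to the end of the run
                j += 1
            if j - i < min_len:                         # run is too short to trust
                for k in range(i, j):
                    keep[k] = False
            i = j
        else:
            i += 1
    return keep
```

For the label string `0 1 1 0 1 1 1 1 0`, the two-residue run is masked while the four-residue run is kept.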

In the experiment, the input to our iterative MAP-ML algorithm consisted of the sequences of the 117 protein targets and the prediction labels from the 15 component predictors. After the algorithm had converged, we used the estimates of the hidden true labels y_i produced by MAP-ML as the binary disorder/order predictions and the probabilistic labels μ_i from the MAP-ML outputs as the disorder probabilities. We also integrated the component predictors by majority voting, so that we could compare that method with the MAP-ML algorithm and see which one is more effective. The majority voting method assumes that all predictors are equally good.

Estimated sensitivity α and specificity β of 15 component predictors using our MAP-ML meta predictor without relying on true disorder/order labels are shown in Figure 1. The obtained estimates are sorted according to the average of their estimated sensitivity and specificity and were quite consistent with evaluations reported by the CASP9 committee [27] who used labeled data of confirmed disorder/order residues for their evaluations.

A comparison of the 15 predictors, the majority voting method, and our MAP-ML meta predictor on CASP9 labeled data with confirmed disorder/order is shown in Figure 2. The details of the evaluation scores are summarized in Table 1. In this comparison, our iterative MAP-ML algorithm had an ACC score of 0.764, an S_w score of 0.513, and an AUC score of 0.859. These scores were superior to those of the 15 component predictors in the CASP9 contest and also superior to those of the majority voting integration. In addition, Figures 1 and 2 can be used to assess how closely the accuracies and rankings of the 15 predictors obtained by the MAP-ML algorithm without any labeled data match their evaluation on true labels by the CASP9 committee.

Table 1 CASP9 evaluation scores on labeled data.


Using the same measures and procedures, we assessed the accuracy of 13 CASP8 disorder predictors on CASP8 data [22] and of 11 CASP7 disorder predictors on CASP7 data [28], again without using the corresponding experimentally determined disorder/order labels. As with CASP9, most of the predictors' ranks obtained by the MAP-ML algorithm were quite consistent with their true accuracy on the CASP8 and CASP7 data. In both cases, the scores of our MAP-ML meta predictor were better than the corresponding scores of the component predictors in the contest and of their majority voting integration. The details of the CASP8 experiment are summarized in Figure 3, Figure 4, and Table 2. The details of the CASP7 experiment are summarized in Figure 5, Figure 6, and Table 3.

Table 2 CASP8 evaluation scores on labeled data.


Table 3 CASP7 evaluation scores on labeled data.


The relationship between the number of component predictors and the prediction performance

Although our MAP-ML meta predictor outperformed each component predictor at CASP9, CASP8, and CASP7, in general it may not be the case that integration of all available component predictors is the best choice as some predictors may negatively influence the combination results. To analyze effects of possible combination choices on the accuracy of the MAP-ML algorithm, we studied the relationship between the number of component predictors and the prediction performance of different combinations among CASP9, CASP8, and CASP7 predictors.

For CASP9 data, any non-empty subset of the 15 individual predictors can be combined by our algorithm. By considering all such subsets, we constructed 32767 (2^15 − 1) different meta predictors using the MAP-ML algorithm. The relationship between the number of component predictors and the prediction performance (S_w) of the MAP-ML algorithm on CASP9 data is shown in Figure 7. Similarly, for the CASP8 and CASP7 data, we built all 8191 and 2047 meta predictors, respectively, by considering all subsets of the 13 and 11 component predictors and combining them using the MAP-ML algorithm. The relationship between the number of component predictors and the prediction performance (S_w) of the MAP-ML algorithm on CASP8 and CASP7 data is shown in Figure 8 and Figure 9.
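The subset counts follow from the fact that M predictors have 2^M − 1 non-empty subsets. A short sketch of the enumeration, grouped by subset size as in the size-versus-performance analysis, is given below; the helper names are hypothetical, and in a real experiment each yielded subset would be passed to the meta predictor.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def subsets_by_size(predictors: Sequence[str]) -> Iterator[Tuple[int, Tuple[str, ...]]]:
    """Yield (size, subset) for every non-empty subset of the predictors."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield size, subset


def count_meta_predictors(n_predictors: int) -> int:
    """Number of non-empty subsets of n predictors."""
    return 2 ** n_predictors - 1


for n in (15, 13, 11):
    print(n, count_meta_predictors(n))
```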

The results of our experiments (Figure 7, Figure 8, and Figure 9) provide evidence that the average and the lowest prediction performances improve as the number of component predictors increases. Also, the difference between the highest and the lowest performance decreases as the number of component predictors increases. However, the curves representing the highest prediction performances suggest that employing more component predictors does not necessarily improve the best achievable performance. For example, a combination of five CASP8 predictors (MULTICOM, GS-MetaServer2, McGuffin, mariner1, and DISOPRED) had the highest overall prediction performance (S_w=0.691).

Conclusions

In this study, we proposed an iterative MAP-ML algorithm to predict protein disorder. The algorithm alternately provides the MAP estimation of disorder prediction and the ML estimation of the quality of multiple component disorder predictors. We evaluated the performance of the MAP-ML algorithm versus the performance of other predictors using CASP datasets. The results showed that our meta predictor not only outperformed other predictors but also appropriately ranked other predictors without knowing the true labels.

The proposed algorithm assumes that the accuracy of each predictor does not depend on the given protein sequence and that the predictors make their errors independently. Therefore, in our experiments we used component predictors developed by groups at different institutions. We emphasize that in practice the independence assumption might not always hold, which is a limitation of the proposed algorithm. To relax the independence assumption and to make even more accurate disorder predictions with the probabilistic meta model, our research in progress includes additional parameters such as disorder flavor and difficulty of a prediction task.

Abbreviations

CASP: Critical Assessment of Techniques for Protein Structure Prediction
DisProt: Database of Protein Disorder
PDB: Protein Data Bank
ROC: receiver operating characteristic

References

1. Obradovic Z, Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK: Predicting intrinsic disorder from amino acid sequence. Proteins 2003, 53(Suppl 6):566–572.
2. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004, 20(13):2138–2139. 10.1093/bioinformatics/bth195
3. Dosztanyi Z, Csizmok V, Tompa P, Simon I: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 2005, 21(16):3433–3434. 10.1093/bioinformatics/bti541
4. Wang L, Sauer UH: OnD-CRF: predicting order and disorder in proteins using conditional random fields. Bioinformatics 2008, 24(11):1401–1402. 10.1093/bioinformatics/btn132
5. McGuffin LJ: Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics 2008, 24(16):1798–1804. 10.1093/bioinformatics/btn326
6. Sethi D, Garg A, Raghava GP: DPROT: prediction of disordered proteins using evolutionary information. Amino Acids 2008, 35(3):599–605. 10.1007/s00726-008-0085-y
7. Deng X, Eickholt J, Cheng J: PreDisorder: ab initio sequence-based prediction of protein disordered regions. BMC Bioinformatics 2009, 10:436. 10.1186/1471-2105-10-436
8. Hirose S, Shimizu K, Noguchi T: POODLE-I: Disordered region prediction by integrating POODLE series and structural information predictors based on a workflow approach. In Silico Biology 2010, 10:0015.
9. Walsh I, Martin AJ, Domenico TD, Vullo A, Pollastri G, Tosatto SC: CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs. Nucleic Acids Res 2011, 39(Web Server issue):W190–W196.
10. Mizianty MJ, Zhang T, Xue B, Zhou Y, Dunker AK, Uversky VN, Kurgan L: In-silico prediction of disorder content using hybrid sequence representation. BMC Bioinformatics 2011, 12(1):245. 10.1186/1471-2105-12-245
11. Xie H, Vucetic S, Iakoucheva LM, Oldfield CJ, Dunker AK, Uversky VN, Obradovic Z: Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. Journal of Proteome Research 2007, 6(5):1882–1898. 10.1021/pr060392u
12. Midic U, Oldfield CJ, Dunker AK, Obradovic Z, Uversky VN: Protein disorder in the human diseasome: unfoldomics of human genetic diseases. BMC Genomics 2009, 10(Suppl 1):S12. 10.1186/1471-2164-10-S1-S12
13. Romero P, Obradovic Z, Kissinger C, Villafranca JE, Dunker AK: Identifying disordered regions in proteins from amino acid sequence. In Proceedings of the International Conference on Neural Networks: 9–12 Jun 1997; Houston. Edited by: IEEE Neural Networks Council. IEEE; 1997:90–95.
14. CASP Contests Home Page [http://predictioncenter.org]
15. He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK: Predicting intrinsic disorder in proteins: an overview. Cell Res 2009, 19(8):929–949. 10.1038/cr.2009.87
16. Ishida T, Kinoshita K: Prediction of disordered regions in proteins based on the meta approach. Bioinformatics 2008, 24(11):1344–1348. 10.1093/bioinformatics/btn195
17. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B: Improved disorder prediction by combination of orthogonal approaches. PLoS ONE 2009, 4(2):e4433. 10.1371/journal.pone.0004433
18. Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN: PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta 2010, 1804(4):996–1010.
19. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L: Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 2010, 26(18):i489–i496. 10.1093/bioinformatics/btq373
20. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, Dunker AK: DisProt: the Database of Disordered Proteins. Nucleic Acids Res 2007, 35(Database issue):D786–D793.
21. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977, 112(3):535–542. 10.1016/S0022-2836(77)80200-3
22. Noivirt-Brik O, Prilusky J, Sussman JL: Assessment of disorder predictions in CASP8. Proteins 2009, 77(Suppl 9):210–216.
23. Raykar VC, Yu S, Zhao LH, Jerebko A, Florin C, Valadez GH, Bogoni L, Moy L: Supervised Learning from Multiple Experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009): 14–18 June 2009; Montreal. Edited by: Danyluk AP, Bottou L, Littman ML. ACM; 2009:889–896.
24. Bishop C: Pattern recognition and machine learning. New York: Springer; 2006:203–213.
25. Jin Y, Dunbrack RL: Assessment of disorder predictions in CASP6. Proteins 2005, 61(Suppl 7):167–175.
26. Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z: Optimizing long intrinsic disorder predictors with protein evolutionary information. Journal of Bioinformatics and Computational Biology 2005, 3(1):35–60. 10.1142/S0219720005000886
27. Assessment of disorder predictions in CASP9 [http://predictioncenter.org/casp9/doc/presentations/CASP9_DR.pdf]
28. Bordoli L, Kiefer F, Schwede T: Assessment of disorder predictions in CASP7. Proteins 2007, 69(Suppl 8):129–136.

Acknowledgements

This project was funded in part under a grant with the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations, or conclusions.

This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.

Author information

Authors and Affiliations

Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, PA, 19122, USA
Ping Zhang & Zoran Obradovic


Corresponding author

Correspondence to Zoran Obradovic.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PZ designed the algorithms, implemented programs, carried out the analysis, and drafted the manuscript. ZO inspired the overall work, provided advice, and revised the final manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Zhang, P., Obradovic, Z. Unsupervised Integration of Multiple Protein Disorder Predictors: The Method and Evaluation on CASP7, CASP8 and CASP9 Data. Proteome Sci 9 (Suppl 1), S12 (2011). https://doi.org/10.1186/1477-5956-9-S1-S12
