Unsupervised Integration of Multiple Protein Disorder Predictors: The Method and Evaluation on CASP7, CASP8 and CASP9 Data

Background: Studies of intrinsically disordered proteins, which lack a stable tertiary structure but still have important biological functions, critically rely on computational methods that predict this property from sequence information. Although a number of fairly successful models for prediction of protein disorder have been developed over the last decade, the quality of their predictions is limited by the available cases of confirmed disorder.
Results: To more reliably estimate protein disorder from protein sequences, an iterative algorithm is proposed that integrates predictions of multiple disorder models without relying on any protein sequences with confirmed disorder annotation. The iterative method alternately provides the maximum a posteriori (MAP) estimation of disorder prediction and the maximum-likelihood (ML) estimation of the quality of multiple disorder predictors. Experiments on data used at CASP7, CASP8, and CASP9 have shown the effectiveness of the proposed algorithm.
Conclusions: The proposed algorithm can potentially be used to predict protein disorder and provide helpful suggestions on choosing suitable disorder predictors for unknown protein sequences.


Background
Identification of regions in proteins that do not have unique structures, called intrinsic disorders, is addressed computationally by a number of groups that aim to predict this property from sequence information [1][2][3][4][5][6][7][8][9][10]. Contrary to the lock and key paradigm, disordered regions were recently found to be involved in many important functions [11] and in various diseases [12].
Computational characterization of disorder in proteins is appealing due to the difficulties and high cost involved in experimental characterization of disorder. The first predictor of protein disorder was developed by our group in 1997 [13]. Due to the importance of predicting this property, protein disorder prediction was introduced in 2002 as a category of the CASP contests [14], which promoted the development of new methods for prediction of protein disorder. Consequently, the number of prediction methods available through the Internet has increased rapidly. More than 50 predictors of intrinsic protein disorder have been described in a recent review by He et al. [15], enabling researchers to use a meta approach to predict protein disorder by integrating the prediction results of several methods. Recently, four such meta predictors, i.e. metaPrDOS [16], MD [17], PONDR-FIT [18], and MFDp [19], have been developed for the purpose of improving disorder prediction accuracy. In the reported experiments, they showed significantly improved performance compared with their individual component predictors.
A limitation of these supervised learning based meta predictors is that they are prone to over-optimization in their integration processes, since they are developed relying on disorder/order labeled training datasets that contain a very small number of proteins that have not already been used for development of the component predictors (e.g. sets as small as DisProt [20] or as specialized as missing coordinates from the PDB [21]). Therefore, the predictions of previous meta predictors may be less accurate for proteins whose sequence patterns are very different from the cases used for integration. For example, although it achieved higher prediction accuracy than all predictors participating in CASP7 as stated in its paper [16], metaPrDOS failed to be one of the top predictors in CASP8 [22]. Moreover, one of metaPrDOS' component predictors, i.e. DISOPRED [2], was more accurate than metaPrDOS in CASP8 [22].
To address potential over-optimization problems of meta predictor development by learning from small labeled data, here we introduce a new disorder meta prediction method. Following the idea of Raykar et al. [23], we derived an iterative MAP and ML estimation (MAP-ML) based algorithm for the construction of a meta predictor in a completely unsupervised process using protein sequences without confirmed disorder/order annotations. Performance evaluation of the new meta method is presented by using CASP prediction targets as the test sets, which enabled us to compare the prediction results with other methods used in the CASP contests.

Problem and statement
Let us define the dataset as
$$D = \{x_i, y_i^1, \ldots, y_i^M\}_{i=1}^{N}.$$
Here, $x_i$ is an amino acid composition feature vector derived from the subsequence covered by a moving window centered at the i-th amino acid within the current protein. $y_i^j \in \{1, 0\}$ (1 represents a disordered state while 0 represents an ordered state) is the prediction label assigned to the instance $x_i$ by the j-th predictor. M is the number of predictors. N is the number of amino acids in the protein.
The first task of our interest is to estimate the sensitivity (i.e., true positive rate) $a = [a_1, \ldots, a_M]$ and the specificity (i.e., true negative rate) $b = [b_1, \ldots, b_M]$ of the M predictors. The second task is to get an estimation of the unknown true labels $y_1, \ldots, y_N$.
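As an illustration of the feature vectors described above, the following is a minimal sketch of how amino acid composition features could be computed with a moving window. The function name `composition_features`, the 20-letter alphabet, the use of relative frequencies, and the truncation of the window at the sequence ends are all assumptions made for this sketch; the text does not specify these details.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def composition_features(sequence, window=21):
    """Return an (N, 20) array whose row i holds the residue frequencies
    in the window centered at position i (truncated at the sequence ends)."""
    n = len(sequence)
    half = window // 2
    features = np.zeros((n, len(AMINO_ACIDS)))
    for i in range(n):
        segment = sequence[max(0, i - half): i + half + 1]
        for aa in segment:
            k = AA_INDEX.get(aa)
            if k is not None:          # skip non-standard symbols such as 'X'
                features[i, k] += 1
        features[i] /= len(segment)    # normalize counts to frequencies
    return features
```

For a protein of N residues this yields one 20-dimensional vector per residue, which plays the role of $x_i$ in the dataset D.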

The proposed MAP-ML algorithm
To fulfill the two tasks defined above, we propose an iterative algorithm that we call MAP-ML. Given dataset D, we use majority voting to initialize the probabilistic labels $\mu_i$ (i.e., the probability that the hidden true label is 1). Then, the algorithm alternately carries out the ML estimation and the MAP estimation, which are described in detail in the following subsections. Given the current estimates of the probabilistic labels, the ML estimation measures the predictors' performance (i.e., their sensitivity a and specificity b) and learns a classifier with parameter w. Given the estimated sensitivity a, specificity b, and the prior probability provided by the learned classifier, the MAP estimation updates the probabilistic labels $\mu_i$ using Bayes' rule. After the two estimations converge, we get the algorithm outputs, which include both the probabilistic labels $\mu_i$ and the model parameters θ = {w, a, b}.
The proposed iterative MAP-ML algorithm is summarized in Algorithm 1, and the estimations are described in the following subsections.
Algorithm 1 (Iterative MAP-ML Algorithm)
Input: Protein sequences with prediction labels from M predictors.
Output: The estimated sensitivity and specificity of each predictor; the weight parameter of a classifier; the probabilistic labels $\mu_i$; the estimates of the hidden true labels $y_i$.
Step 1 Convert the protein sequences into amino acid composition feature vectors.
Step 2 Use majority voting to initialize the probabilistic labels, $\mu_i = \frac{1}{M}\sum_{j=1}^{M} y_i^j$.
Step 3 Iterative optimization.
(a) ML estimation - Estimate the model parameters θ = {w, a, b} based on the current probabilistic labels $\mu_i$ using (1) and (3).
(b) MAP estimation - Given the model parameters θ, update the probabilistic labels $\mu_i$ using (8).
Step 4 If θ and $\mu_i$ do not change between two successive iterations or the maximum number of iterations is reached, go to Step 5; otherwise, go back to Step 3.
Step 5 Estimate the hidden true label $y_i$ by applying a threshold on $\mu_i$, that is, $y_i = 1$ if $\mu_i > g$ and $y_i = 0$ otherwise. Here we use g = 0.5 as the threshold.
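To make the alternation in Algorithm 1 concrete, here is a simplified sketch of the loop. It is not the authors' implementation: to stay short it replaces the learned classifier with a single prevalence value as the prior (a deliberate simplification), and the function name `map_ml_simple`, the tolerance, and the clipping constant `eps` are assumptions of this sketch. It also assumes the initialization is the fraction of predictors voting for disorder.

```python
import numpy as np

def map_ml_simple(Y, max_iter=100, tol=1e-6, eps=1e-9):
    """Simplified MAP-ML loop on an (N, M) 0/1 label matrix Y.

    Simplification (not in the paper): the prior Pr[y_i = 1] is one
    prevalence value instead of the output of a learned classifier.
    """
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean(axis=1)                        # Step 2: soft majority vote
    for _ in range(max_iter):
        # ML step: sensitivity a_j and specificity b_j from the soft labels.
        a = np.clip((mu @ Y) / (mu.sum() + eps), eps, 1 - eps)
        b = np.clip(((1 - mu) @ (1 - Y)) / ((1 - mu).sum() + eps), eps, 1 - eps)
        p = np.clip(mu.mean(), eps, 1 - eps)   # stand-in for the classifier prior
        # MAP step: Bayes' rule, evaluated in log space for stability.
        log_pos = np.log(p) + Y @ np.log(a) + (1 - Y) @ np.log(1 - a)
        log_neg = np.log(1 - p) + (1 - Y) @ np.log(b) + Y @ np.log(1 - b)
        new_mu = 1.0 / (1.0 + np.exp(log_neg - log_pos))
        converged = np.max(np.abs(new_mu - mu)) < tol   # Step 4
        mu = new_mu
        if converged:
            break
    labels = (mu > 0.5).astype(int)            # Step 5 with threshold 0.5
    return mu, a, b, labels
```

In the toy case where two predictors always agree and a third always predicts disorder, the loop assigns the third predictor a specificity near zero, so its votes stop influencing the final labels.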

ML estimation of the model parameters
Given the dataset D and the current estimates of $\mu_i$, the algorithm estimates the model parameters θ = {w, a, b} by maximizing the conditional likelihood. According to the definitions of sensitivity and specificity, we get
$$a_j = \frac{\sum_{i=1}^{N} \mu_i y_i^j}{\sum_{i=1}^{N} \mu_i}, \qquad b_j = \frac{\sum_{i=1}^{N} (1-\mu_i)(1-y_i^j)}{\sum_{i=1}^{N} (1-\mu_i)}. \tag{1}$$
Given probabilistic labels $\mu_i$, we can learn any classifier using ML estimation. However, for convenience, we will explain it with a logistic regression classifier. By using that classifier, the probability for the positive class is modeled as a sigmoid acting on the linear discriminating function, that is,
$$\Pr[y_i = 1 \mid x_i, w] = s(w^T x_i), \tag{2}$$
where the logistic sigmoid function is defined as $s(z) = 1/(1 + e^{-z})$. To estimate the classifier's parameter w, we use a gradient-based method, namely the Newton-Raphson method [24]
$$w^{t+1} = w^{t} - h H^{-1} g, \tag{3}$$
where g is the gradient vector, H is the Hessian matrix, and h is the step length. The gradient vector is
$$g(w) = \sum_{i=1}^{N} \left[\mu_i - s(w^T x_i)\right] x_i,$$
and the Hessian matrix is given by
$$H(w) = -\sum_{i=1}^{N} s(w^T x_i)\left[1 - s(w^T x_i)\right] x_i x_i^T.$$
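The classifier part of the ML step can be sketched as follows, assuming logistic regression trained on the soft labels $\mu_i$ with Newton-Raphson updates of the form w ← w − h·H⁻¹g. The function names, the fixed number of iterations, and the small `ridge` term that keeps the Hessian invertible are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_soft_logistic(X, mu, step=1.0, n_iter=25, ridge=1e-6):
    """Newton-Raphson fit of logistic regression to soft targets mu.

    ridge is a small stabilizer added for this sketch (an assumption,
    not part of the paper) so that the Hessian stays invertible.
    """
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        g = X.T @ (mu - p)                           # gradient of the log-likelihood
        H = -(X * (p * (1 - p))[:, None]).T @ X      # Hessian (negative semidefinite)
        H -= ridge * np.eye(X.shape[1])
        w = w - step * np.linalg.solve(H, g)         # w <- w - h * H^{-1} g
    return w
```

Because the targets are probabilities rather than hard labels, the fitted model reproduces them exactly whenever they were generated by a logistic model on the same features.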

MAP estimation of the unknown true labels
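Given the dataset D and the model parameters θ = {w, a, b}, we define probabilistic labels
$$\mu_i = \Pr[y_i = 1 \mid y_i^1, \ldots, y_i^M, x_i, \theta], \tag{4}$$
which, by Bayes' rule, satisfy
$$\mu_i \propto \Pr[y_i^1, \ldots, y_i^M \mid y_i = 1, \theta] \, \Pr[y_i = 1 \mid x_i, \theta]. \tag{5}$$
Given the true label $y_i$, we assume that $y_i^1, \ldots, y_i^M$ are independent, so that
$$\Pr[y_i^1, \ldots, y_i^M \mid y_i = 1, a] = \prod_{j=1}^{M} a_j^{y_i^j} (1 - a_j)^{1 - y_i^j}, \tag{6}$$
$$\Pr[y_i^1, \ldots, y_i^M \mid y_i = 0, b] = \prod_{j=1}^{M} b_j^{1 - y_i^j} (1 - b_j)^{y_i^j}. \tag{7}$$
From (2), (4), (5), (6), and (7), the posterior probability $\mu_i$, which is a soft probabilistic estimate of the hidden true label, is computed as
$$\mu_i = \frac{s(w^T x_i) \prod_{j=1}^{M} a_j^{y_i^j} (1 - a_j)^{1 - y_i^j}}{s(w^T x_i) \prod_{j=1}^{M} a_j^{y_i^j} (1 - a_j)^{1 - y_i^j} + \left[1 - s(w^T x_i)\right] \prod_{j=1}^{M} b_j^{1 - y_i^j} (1 - b_j)^{y_i^j}}. \tag{8}$$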
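A small worked example may help to see how the MAP update behaves. The numbers below are hypothetical and chosen only for illustration: two predictors with sensitivities $a_1 = 0.8$, $a_2 = 0.6$ and specificities $b_1 = 0.9$, $b_2 = 0.7$, a classifier prior of 0.1 for disorder, and predictions $y_i^1 = 1$, $y_i^2 = 0$. The computation assumes Bayes' rule with predictors that err independently given the true label.

```latex
\begin{align*}
\Pr[y_i^1 = 1, y_i^2 = 0 \mid y_i = 1] &= a_1 (1 - a_2) = 0.8 \times 0.4 = 0.32, \\
\Pr[y_i^1 = 1, y_i^2 = 0 \mid y_i = 0] &= (1 - b_1)\, b_2 = 0.1 \times 0.7 = 0.07, \\
\mu_i &= \frac{0.1 \times 0.32}{0.1 \times 0.32 + 0.9 \times 0.07}
       = \frac{0.032}{0.095} \approx 0.337 .
\end{align*}
```

A plain majority vote would give 0.5 for this residue. The MAP estimate is lower because disorder is rare under the prior, even though the more reliable predictor voted for disorder.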

Analysis of the MAP estimation
To explain how the MAP estimation model works, we apply the logit function to the posterior probability $\mu_i$. From (8), the logit of $\mu_i$ is written as
$$\mathrm{logit}(\mu_i) = \log\frac{\mu_i}{1 - \mu_i} = w^T x_i + \sum_{j=1}^{M} y_i^j \left[\mathrm{logit}(a_j) + \mathrm{logit}(b_j)\right] + c, \tag{9}$$
where $c = \sum_{j=1}^{M} \log\frac{1 - a_j}{b_j}$ is a constant. The first term of (9), $w^T x_i$, is a linear combination (provided by the learned classifier) of the current amino acid's composition features. The second term of (9) is a weighted linear combination of the prediction labels from all the predictors. The weight of each predictor is the sum of the logits of its estimated sensitivity and specificity. From (9), we can infer that the estimates of the hidden true labels (in logit form) depend both on protein sequence information and on the prediction labels from all the predictors.
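The predictor weights discussed above can be computed directly. The sketch below, with hypothetical function names and made-up sensitivity and specificity values, shows that a predictor whose sensitivity and specificity sum to 1 (that is, one that performs no better than chance) receives a weight of zero, while an accurate predictor receives a large positive weight.

```python
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)

def predictor_weights(a, b):
    """Weight of each predictor: logit(sensitivity) + logit(specificity)."""
    return logit(a) + logit(b)

# Hypothetical values: one accurate predictor and two chance-level predictors.
weights = predictor_weights([0.8, 0.5, 0.3], [0.9, 0.5, 0.7])
```

Here the first weight is about 3.58 and the other two are zero up to rounding, so only the first predictor's labels would shift the logit of $\mu_i$.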

Evaluation criteria
CASP evaluation was based on per-residue predictions of the entire set of targets. The performance of predictors was evaluated by three criteria: the average of sensitivity and specificity (ACC), a weighted score ($S_w$) that considers the rates of ordered and disordered residues in the datasets, and the area under the ROC curve (AUC).
In CASP, predictors were asked to submit, for each residue, a binary label of "O" or "D" (order or disorder state) and a probability that the specific position is in a disordered region (a value in the range of 0 to 1). The binary predictions were evaluated by sensitivity and specificity,
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP},$$
where TP is the number of true positives (disordered residues that were classified correctly), FP the number of false positives (ordered residues that were classified as disordered), TN the number of true negatives (ordered residues that were classified correctly), and FN the number of false negatives (disordered residues that were classified as ordered). The higher the two scores, the better the predictions; therefore, they were combined into a single score, which is the average of the two:
$$ACC = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}.$$
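The ACC criterion described above can be sketched in a few lines. The function name `sens_spec_acc` and the encoding of disorder as 1 and order as 0 are assumptions of this sketch.

```python
import numpy as np

def sens_spec_acc(y_true, y_pred):
    """Return (sensitivity, specificity, ACC) for 0/1 labels, 1 = disorder."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)      # disordered, predicted disordered
    fn = np.sum(y_true & ~y_pred)     # disordered, predicted ordered
    tn = np.sum(~y_true & ~y_pred)    # ordered, predicted ordered
    fp = np.sum(~y_true & y_pred)     # ordered, predicted disordered
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2
```

For example, with true labels `[1, 1, 0, 0, 0, 0]` and predictions `[1, 0, 0, 0, 0, 1]`, the sensitivity is 0.5, the specificity is 0.75, and ACC is 0.625.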
Since the disordered residues are rare in the targets, the weighted score $S_w$ was introduced at CASP6:
$$S_w = \frac{W_{disorder} \cdot TP - W_{order} \cdot FP + W_{order} \cdot TN - W_{disorder} \cdot FN}{W_{disorder} \cdot N_{disorder} + W_{order} \cdot N_{order}},$$
where $W_{disorder}$ is the total percentage of ordered residues, $W_{order}$ is the total percentage of disordered residues, and $N_{disorder}$ and $N_{order}$ are the numbers of disordered and ordered residues. Therefore, $S_w$ ranges from -1 to 1, and predicting all the residues in the targets to be ordered would result in a score of zero. As defined, this measure greatly rewards disordered residues correctly identified as disordered while heavily penalizing any disordered residue that is misclassified.
The ROC curve was used to examine the ability of the predictors to estimate the confidence level of their predictions. The ROC curve is based on the disorder probability parameter. Given the probabilities, setting different threshold values for the disordered status changes the values of sensitivity and specificity accordingly. Taking (1 - specificity) as the x-axis and sensitivity as the y-axis, the pairs obtained by sweeping the threshold from its minimal to its maximal value trace out a curve. This is the ROC curve; the area under this curve (AUC) is a reliable indication of the quality of the prediction. The value of AUC is between 0 and 1; the larger the area, the better the predictor.
Figure 1 CASP9 accuracy estimates without using labeled data. Estimated sensitivity and specificity of 15 disorder predictors are obtained by the MAP-ML algorithm on CASP9 protein sequences without using CASP9 experimentally determined disorder/order labels. The predictors are sorted in descending order of the average of the estimated sensitivity and specificity.
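The weighted score and the AUC described above can be sketched as follows. This is an illustration under stated assumptions, not the CASP assessors' code: the function names are hypothetical, the weights are taken as fractions of ordered and disordered residues (which gives the same ratio as percentages), and the AUC is computed through its rank interpretation (the probability that a disordered residue receives a higher score than an ordered one, with ties counted as one half) rather than by integrating the curve.

```python
import numpy as np

def weighted_score(y_true, y_pred):
    """S_w for 0/1 labels, 1 = disorder; weights are the opposite class fractions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    n_dis, n_ord = y_true.sum(), (~y_true).sum()
    w_dis = n_ord / y_true.size       # W_disorder: fraction of ordered residues
    w_ord = n_dis / y_true.size       # W_order: fraction of disordered residues
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return (w_dis * tp - w_ord * fp + w_ord * tn - w_dis * fn) / (
        w_dis * n_dis + w_ord * n_ord)

def auc(y_true, scores):
    """Rank-based AUC; builds an n_dis x n_ord matrix, fine for small inputs."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
```

Predicting every residue as ordered gives a weighted score of zero, a perfect prediction gives 1, and a completely inverted prediction gives -1, matching the range stated above.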

Performance evaluation using the CASP data
To assess prediction performance, we used CASP9 data consisting of 117 experimentally characterized protein sequences with 23656 ordered and 2427 disordered residues. To reduce noise due to experimental uncertainty, we did not consider disorder segments shorter than four residues in the evaluation process. We also obtained the prediction labels and disorder probabilities of all predictors that participated in CASP9 from the contest's official website [14]. We selected 15 predictors developed by groups at different institutions, assuming that their errors are independent. We set the size of the moving window to 21, based on our previous study [26] as well as on the ratio of long (>30 residues) disordered segments to short ones in the data.
In the experiment, as the input of our iterative MAP-ML algorithm we used the sequences of 117 protein targets and the prediction labels from the 15 component predictors. After the algorithm had converged, we used the estimates of the hidden true labels $y_i$ produced by MAP-ML as the binary disorder/order predictions and the probabilistic labels $\mu_i$ from the MAP-ML outputs as the disorder probabilities. We also used the majority voting method to integrate the component predictors, so that we could compare that method with the MAP-ML algorithm to see which one is more effective. The majority voting method assumes all predictors are equally good. The sensitivity a and specificity b of the 15 component predictors, estimated by our MAP-ML meta predictor without relying on true disorder/order labels, are shown in Figure 1. The obtained estimates are sorted according to the average of the estimated sensitivity and specificity and were quite consistent with the evaluations reported by the CASP9 committee [27], who used labeled data of confirmed disorder/order residues for their evaluations.
Zhang and Obradovic Proteome Science 2011, 9(Suppl 1):S12 http://www.proteomesci.com/content/9/S1/S12
A comparison of the 15 predictors, the majority voting method, and our MAP-ML meta predictor on CASP9 labeled data with confirmed disorder/order is shown in Figure 2. The details of the evaluation scores are summarized in Table 1. In this comparison our iterative MAP-ML algorithm had an ACC score of 0.764, an $S_w$ score of 0.513, and an AUC score of 0.859. These scores were superior to those of the 15 component predictors in the CASP9 contest and also superior to the majority voting integration. In addition, Figures 1 and 2 can be used to assess the similarity between the accuracies and rankings of the 15 predictors obtained by the MAP-ML algorithm without any labeled data and their evaluation on true labels by the CASP9 committee.
Using the same measures and procedures, we assessed the accuracy of 13 CASP8 disorder predictors on CASP8 data [22] and of 11 CASP7 disorder predictors on CASP7 data [28], without using the corresponding experimentally determined disorder/order labels. Similar to CASP9, most of the predictors' ranks obtained by the MAP-ML algorithm were quite consistent with their true accuracy on CASP8/CASP7 data. The scores of our MAP-ML meta predictor were better than the corresponding scores of the component predictors in the CASP8/CASP7 contest and their majority voting integration. The details of the CASP8 experiment are summarized in Figure 3, Figure 4, and Table 2. The details of the CASP7 experiment are summarized in Figure 5, Figure 6, and Table 3.
Figure 5 CASP7 accuracy estimates without using labeled data. Estimated sensitivity and specificity of 11 disorder predictors are obtained by the MAP-ML algorithm on CASP7 protein sequences without using CASP7 experimentally determined disorder/order labels. The predictors are sorted in descending order of the average of the estimated sensitivity and specificity.

The relationship between the number of component predictors and the prediction performance
Although our MAP-ML meta predictor outperformed each component predictor at CASP9, CASP8, and CASP7, in general it may not be the case that integration of all available component predictors is the best choice as some predictors may negatively influence the combination results. To analyze effects of possible combination choices on the accuracy of the MAP-ML algorithm, we studied the relationship between the number of component predictors and the prediction performance of different combinations among CASP9, CASP8, and CASP7 predictors.
For CASP9 data, any number of the 15 individual predictors can be combined by using our algorithm. By considering all non-empty subsets, we constructed 32767 different meta predictors using the MAP-ML algorithm. The relationship between the number of component predictors and the prediction performance ($S_w$) of the MAP-ML algorithm on CASP9 data is shown in Figure 7. Similarly, for CASP8/CASP7 data, we built all 8191/2047 meta predictors by considering all subsets of the 13/11 component predictors and combining these using the MAP-ML algorithm. The relationship between the number of component predictors and the prediction performance ($S_w$) of the MAP-ML algorithm on CASP8 and CASP7 data is shown in Figure 8 and Figure 9.
Figure 6 CASP7 comparison on labeled data. Evaluation scores are shown for the MAP-ML algorithm, the majority voting method, and the 11 component predictors on disorder/order labeled CASP7 protein sequences and the corresponding experimentally determined disorder/order labels. ACC, $S_w$, and AUC scores are sorted in descending order of the AUC score.
Figure 7 The prediction performance of the MAP-ML algorithm vs. the number of component predictors on CASP9 data. The lowest, average, and highest performance for each group with the same number of individual predictors is shown.
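The subset counts quoted above follow from enumerating every non-empty subset of M predictors, which gives 2^M - 1 combinations. The sketch below checks this and shows one way the lowest, average, and highest score per subset size could be collected. The function names and the `score_fn` callback are hypothetical; in the study the score would be the $S_w$ of the MAP-ML meta predictor built from each subset.

```python
from itertools import combinations
from collections import defaultdict

def count_nonempty_subsets(m):
    """Count all non-empty subsets of m predictors by explicit enumeration."""
    return sum(1 for k in range(1, m + 1) for _ in combinations(range(m), k))

def summarize_by_size(names, score_fn):
    """Map subset size -> (lowest, average, highest) score over all subsets."""
    by_size = defaultdict(list)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            by_size[k].append(score_fn(subset))   # score_fn is a placeholder
    return {k: (min(v), sum(v) / len(v), max(v)) for k, v in by_size.items()}
```

Enumeration yields 32767, 8191, and 2047 subsets for 15, 13, and 11 predictors, in agreement with the numbers of meta predictors reported for CASP9, CASP8, and CASP7.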
The results of our experiments (Figure 7, Figure 8, and Figure 9) provide evidence that the average and the lowest prediction performances improve as the number of component predictors increases. Also, the difference between the highest and the lowest performance decreases as the number of component predictors increases. However, the curves representing the highest prediction performances suggest that employing more component predictors does not necessarily improve the highest prediction performance. For example, a combination of five CASP8 predictors (MULTICOM, GS-MetaServer2, McGuffin, mariner1, and DISOPRED) had the highest overall prediction performance ($S_w$ = 0.691).

Conclusions
In this study, we proposed an iterative MAP-ML algorithm to predict protein disorder. The algorithm alternately provides the MAP estimation of disorder prediction and the ML estimation of the quality of multiple component disorder predictors. We evaluated the performance of the MAP-ML algorithm versus the performance of other predictors using CASP datasets. The results showed that our meta predictor not only outperformed other predictors but also appropriately ranked other predictors without knowing the true labels.
The proposed algorithm assumes that the accuracy of each predictor does not depend on the given protein sequences and that the predictors make their errors independently. Therefore, in our experiments we used component predictors developed by groups at different institutions. We emphasize that in practice the independence assumption might not always hold, which is a limitation of the proposed algorithm. To relax the independence assumption and to make even more accurate disorder predictions by the probabilistic meta model, our research in progress includes additional parameters such as disorder flavor and difficulty of a prediction task.