ATPsite: sequence-based prediction of ATP-binding residues

Background ATP is a ubiquitous nucleotide that provides energy for cellular activities, catalyzes chemical reactions, and is involved in cellular signalling. The knowledge of the ATP-protein interactions helps with annotation of protein functions and finds applications in drug design. The sequence to structure annotation gap motivates development of high-throughput sequence-based predictors of the ATP-binding residues. Moreover, our empirical tests show that the only existing predictor, ATPint, is characterized by relatively low predictive quality. Methods We propose a novel, high-throughput machine learning-based predictor, ATPsite, which identifies ATP-binding residues from protein sequences. Our predictor utilizes Support Vector Machine classifier and a comprehensive set of input features that are based on the sequence, evolutionary profiles, and the sequence-predicted structural descriptors including secondary structure, solvent accessibility, and dihedral angles. Results The ATPsite achieves significantly higher Mathews Correlation Coefficient (MCC) and Area Under the ROC Curve (AUC) values when compared with the existing methods including the ATPint, conservation-based rate4site, and alignment-based BLAST predictors. We also assessed the effectiveness of individual input types. The PSSM profile, the conservation scores, and certain features based on amino acid groups are shown to be more effective in predicting the ATP-binding residues than the remaining feature groups. Conclusions Statistical tests show that ATPsite significantly outperforms existing solutions. The consensus of the ATPsite with the sequence-alignment based predictor is shown to give further improvements.


Background
Adenosine-5'-triphosphate (ATP) is a multi-functional nucleotide that plays an important role in energy metabolism, signaling, and replication and transcription of DNA. As of July 2010, 3860 structures in the Protein Data Bank (PDB) [1], which constitute about 6% of known protein structures, are annotated as ATP binding. The ATP binding sites are regarded as valuable drug targets for antibacterial and anti-cancer chemotherapy [2,3]. Therefore, the protein-ATP interactions are of significant interest.
Past two decades observed a substantial effort in identification of conserved characteristics of the ATP-binding sites. Most of these approaches are based on a relatively simple analysis of ATP-binding sequences and structures that led to identification of sequence motifs and structural templates. For instance, the p-loop motif that interacts with ATP and its analogs was found in several protein families [4] and structural templates that interact with either adenosine or phosphates (the two chemical groups of ATP) were proposed [5,6]. However, these motifs/templates are usually confined to one or several protein families and cover only a small subset of the ATP-binding sites.
The large number of protein sequences which lack tertiary structure motivates development of computational tools for high-throughput sequence-based annotation of ATP-binding residues. At the same time, to the best of our knowledge, ATPint [7] is the only sequencebased predictor of the ATP-binding residues. We propose a novel method, named ATPsite, which aims to improve over the predictive quality of ATPint and other popular ways to annotate binging residues, including sequence alignment and conservation scoring. In contrast to the ATPint, which only takes PSSM profile and sequence descriptors as the inputs, the ATPsite uses a comprehensive set of relevant inputs. These inputs, which include PSSM profile, sequence descriptors, conservation scores, and predicted secondary structure, relative solvent accessibility (RSA), and dihedral angles, are encoded into a set of custom-designed features that are shown to improve the quality of the ATP-binding predictions.

Dataset
We extracted all complexes in PDB (as of February 2010) that include ATP. The maximal pairwise sequence identity of the resulting protein chains was reduced to 40% with CD-hit [8]. The remaining 227 chains that interact with ATP constitute the dataset used in this study. Similar to the annotation of DNA-binding residues and residues interacting with small ligands [9,10], a given residue is annotated as ATP-binding if at least one of its non-hydrogen atom is less than 3.9Å away from a non-hydrogen atom of the ATP molecule; our dataset includes 3393 ATP-binding residues and 80409 non-binding residues and is available at http://biomine. ece.ualberta.ca/ATPsite/.

Evaluation criteria and test procedure
We use 5-fold cross validation to assess the predictions. We evaluated predictions at two levels: i) the binary value that defines whether a given residue does or does not bind to ATP; and ii) the real value that quantifies the probability of binding to ATP. The binary predictions were assessed using four measures where TP (true positives) and TN (true negatives) are the counts of correctly predicted binding and non-binding residues, respectively, FP (false positives) are nonbinding residues that were predicted as binding residues, and FN (false negatives) are binding residues that were predicted as non-binding residues. The Matthews correlation coefficient (MCC) ranges between -1 and 1 and it equals zero when all residues are predicted as binding or non-binding. Higher MCC value indicates better predictions.
The receiver operating characteristic (ROC) curve was used to examine the predicted probabilities. For each value of probability p achieved by a given method (between 0 and 1), the residues with probability ≥ p are set as the binding residue, and all other residues are set as the non-binding residue. Next, the TP-rate and the FP-rate are calculated and we use the area under the curve (AUC) to quantify the predictive quality.
We analyze statistical significance of the differences in the MCC and AUC values between predictions generated by ATPsite and the other methods. The MCC values are available for all methods while the AUC value cannot be calculated for an alignment-based predictor. These values are calculated per sequence (using the cross-validated predictions) for each method and we compare them using a paired Wilcoxon rank sum test at 0.01 significance. This non-parametric test is used since the per sequence MCC and AUC values do not follow normal distribution, as tested using Shapiro-Wilk test at the 0.05 significance.

Architecture of the proposed predictor
The architecture of the ATPsite predictor is shown in Figure 1. For a given protein sequence, PSIPRED [11] is used to predict the secondary structure, REAL Spine3 [12] predicts the RSA values and dihedral angles, and PSIBLAST [13] generates the PSSM profile. In addition to these inputs, other features such as conservation scores and amino acid (AA) groups are directly calculated from the sequence. Feature selection was performed to remove irrelevant and redundant features. The selected features are fed into a Support Vector Machine (SVM), which is implemented using LIBSVM [14], to generate probabilities of ATP-binding. Selection of this classifier is motivated by its successful use in the ATPint [7].

Feature-based representation
The ATPsite utilizes both sequence and predicted structure descriptors as inputs. We utilize a sliding window of size 17 centered at the predicted residue to extract the input features. This window size was suggested in [7] to yield favorable predictive performance. The considered features include: -Predicted secondary structure generated by PSIPRED [11]. We use probabilities of the 3 secondary structure states for each residue in the window.
-Predicted relative solvent accessibility generated by Real-SPINE3 [12]. We use the real values, which quantify the fraction of the surface area of a given residue that is accessible to the solvent, for the residues in the window.
-Predicted dihedral angles generated by Real-SPINE3 [12]. We utilize two real values, which represent phi (involving the backbone atoms C'-N-C α -C') and psi (involving the backbone atoms N-C α -C'-N) angles.
-PSSM profile generated by PSIBLAST [13] with default parameters. We normalize these inputs with 1/(1 +2 -x ), where x is the raw value from the PSSM profile; this transformation is commonly used in secondary structure prediction. For a window centered at R i residue at i th position, we calculate 17×20 features f i+k,j where k= −8, −7,…,7,8 is the index of the position in the window and j=1,2,…,20 is the index of the PSSM column. We averaged values on the left and right sides of the central residue g i+z,j =(f i+z,j +f i−z,j )/2 where z=0,1,…,8. As a result, the original 17×20 values are transformed to 9×20 values.
-Terminal indicator is set to 1 for the first and last 3 residues in the sequence and 0 for the other positions.
-Secondary structure segment indicator for helix/ strand/ coil predictions from PSIPRED on both sides of the window is calculated. If 4 (3) consecutive residues on the left/right side of the window are predicted as helix (strand), we set the helix (strand) indicator as 1 for the left/right side. If both helix and strand indicators are 0, then the coil indicator is set as 1.
-Residue conservation scores are calculated based on the Shannon entropy (referred to as conservation A) and two other formulas proposed in [15,16] (named conservation B and C, respectively) which incorporate the background frequency of the amino acids.
-Collocation of AA pairs [17,18] is calculated for the residues in the window, which is motivated by results for membrane proteins where certain AA pairs are overrepresented [19]. Similarly, several sequence motifs occur frequently at the ATP binding sites. To accommodate for mutations in these motifs, we use collocated AA pairs (pairs with gaps) to characterize these motifs. We only consider pairs formed between the central residue in the window and another residue up to 5 positions away. This results in 20×20×10=4000 frequencies (for 20 AA types and 10 positions; 5 on each side). The same as in the work for the membrane proteins [19], p-values that indicate the significance of the association between an AA pair and ATP-binding annotation are calculated. An AA pair with low p-value indicates a low probability that the association between this pair and ATP-binding is a coincidence. When analyzing 4000 randomly distributed variables, we expect to observe by chance one instance of a difference from expected value with significance p < 0.00025 (1/4000). We exclude the AA pairs with p ≥ 10 -6 , since based on the Engelman's study [19] their association with ATP-binding event would be random.
We note that the terminal and secondary structure indicators, collocation of AA pairs, and the predicted secondary structure, relative solvent accessibility, and dihedral angles were never before used to predict the ATP-binding residues.

Feature selection and parameterization
We use 5-fold cross validation to compute feature selection and evaluation. The dataset is randomly divided per-sequence into 5 folds, of which 4 are used for training and the one for testing; each of the 5 folds is used once as the test fold. This procedure assures that annotations of ATP binding from test folds are not used to train the predictive model. Biserial correlation is calculated between each of the features and the binary annotation of ATP-binding residues for each of the 5 training sets. The averaged, over the 5 training sets, correlation values were used to rank the features. We used a best first forward feature selection. Given a feature list F=[f i , i=1,2,…,n], sorted by the average correlation in the descendent order, and an empty list S consisting of selected features, in each round we add the top-ranked feature from F to S and run default SVM with linear kernel and complexity constant C=1 on the feature set S (using 5-fold cross validation). If addition of a given feature improves the average AUC value over the 5 test folds, this feature is retained in S; otherwise it is removed. We repeat that until F is empty.
The SVM is parameterized using the selected feature set S which includes 96 features. The features are summarized in Table 1. All parameterization steps maximize the average cross-validated AUC value. We considered polynomial and RBF kernels. For polynomial kernel, C is initially fixed at 1 and the degree of the polynomial is adjusted between 0.5 and 5. The degree that results in the highest average AUC is selected and next we adjust C with consecutive powers of 2 between 2 -3 and 2 5 . Similarly for the RBF kernel, the gamma parameter is first optimized when C is fixed at 1, and next C is adjusted using the 2 -3 and 2 5 range. The threshold used to binarize the probabilities is set to maximize the MCC value of the predictions on the training folds. A residue is classified as ATP-binding if its predicted probability ≥ 0.182, which is the averaged threshold that maximizes the MCC of our method over the 5 training folds; otherwise it is categorized as non-binding.

Baseline predictors
ATPsite is compared with the existing ATPint as well as three baseline predictors that implement commonly used approaches to find binding residues: -Rate4site [20] predicts functional sites by finding conserved residues. We first run PSI-Blast using the query sequence against the NCBI non-redundant database. For chains with at least 3 significant matches, we created alignments of the best 50 sequences (the default for Consurf [21], which is the web version of rate4site) using ClustalW [22] and we inputted them to rate4site. The rate4site generates conservation score for each residue and residues with lower scores (indicating a higher conservation) have higher probability to be binding residues. We use these scores to compute ROC curves and the threshold that maximizes the MCC value is used to binarize the conservation scores.
-Sequence alignmentusing BLAST identifies similar sequences or segments from a given annotated (with ATP-binding residues) dataset for a query sequence. This approach predicts the binding residues by using the ATP-binding annotations from the best aligned sequence. We execute the BLAST-based alignment between a query sequence and all other sequences (except the query sequence itself) in the benchmark dataset. The sequence with lowest E-value is selected as the template. The residues in the query sequence that were aligned with the binding residues on the template chain are predicted as the ATP-binding residues.
-PSSM profile is widely used in related sequencebased predictors, including the ATPint predictor [7]. To validate the effectiveness of the features proposed in this work, we build a simple predictor that uses SVM (which used the same parameters as the SVM in ATPsite) and takes PSSM profile as the input. This allows estimation of the improvements provided by the new features.

Comparison with existing methods
The ATPsite predictor is compared with ATPint (we used the web server at http://www.imtech.res.in/ raghava/atpint/) and the three baseline predictors based on the alignment, conservation scoring and evolutionary profiles, see Table 2. ATPsite is shown to outperform other methods. It improves the AUC and MCC by 0.03 and 0.07 when compared with the runner-up PSSMbased and BLAST-based predictors, respectively. These improvements are statistically significant with p < 0.01. ATPsite also achieved the highest (tied with BLAST) accuracy and the second best sensitivity and specificity. Since rate4site only considers residue conservation, it likely finds binding residues for other ligand types, catalytic residues, and key residues that are critical for protein folding. This explains the high sensitivity and low specificity, relative to the other methods, achieved by the rate4site. The alignment-based predictions have the best specificity and low sensitivity. This shows that some ATP-binding residues are in conserved sequence segments, but the sequence conservation can accommodate only for a fraction of the ATP-binding residues. The predictive quality of ATPint is lower than it was originally reported [7]. The authors of ATPint balanced the number of binding and non-binding residues in their study which likely lead to overestimation of the predictive quality when applied to full protein chains. Importantly, applications of the new features results in significant improvements as demonstrated by the differences between ATPsite and PSSM+SVM methods. The ROC curves are shown in Figure 2. The BLAST-based method does not provide probabilities and thus it is represented by a point that corresponds to the binary predictions. Figure 2 focuses on the FP rates < 0.05 Table 2 Comparison between ATPsite, ATPint and three baseline predictors that use alignment (BLAST), conservation scoring (rate4site) and evolutionary profiles (PSSM+SVM). The "Significance" column reports statistical significance tests that compare paired persequence AUC and MCC between ATPsite and other methods; + indicates that ATPsite is significantly better at 0.01 level.

Effectiveness of individual input types
We assessed the ability of individual input types to predict the ATP-binding residues. We use the same SVM classifier but we utilize the selected features from one input source at the time, see Table 3. The corresponding ROC curves are shown in Figure 3 (for FP rates < 0.05) and in full in the Supplementary Figure 2 in the online supplement at http://biomine.ece.ualberta.ca/ ATPsite/. Among the 11 feature groups, the PSSM profile achieves the highest AUC, followed by the AA groups and conservation scores. The 3 different types of scoring functions that we tried provide comparable predictive value. The AUC values for features based on predicted secondary structure and dihedral angles are above 0.7. Importantly, predictions obtained with these individual feature sets have substantially lower quality than predictions obtained with the combined set of features, which support the utility of our solution that uses and combines a set of novel inputs.

Consensus-based predictions
The machine-learning based ATPsite, alignment-based BLAST, and conservation-based rate4site utilize different and potentially orthogonal approaches, which motivates building consensus methods that would exploit this potential complementarity. We built 4 simple inclusive disjunction type ensembles of these 3 predictors (including 3 pairs of methods and all 3 methods together). For instance, the consensus of ATPsite and rate4site predicts a residue as ATP-binding if any of these two methods predicts this residue as binding; the corresponding probability equals to the maximal probabilities generated by the component predictors. The conservation scores of rate4site are linearly normalized to the [0, 1] range. We compare the ROC curves, AUC and MCC values, of the consensus and standalone   predictors, see Figure 4 and Table 4 in the main text and the Supplementary Figure 3  The predicted probability implies confidence We investigate whether the probabilities of the ATPbinding generated by ATPsite could be used as confidence scores for the binary predictions. We binned the residues into twenty 0.05 wide intervals based on their predicted probabilities and we computed the corresponding average accuracy and percentage of the residues in each of these bins, see Figure 5. For residues with high, >0.9, or low, <0.1, probabilities, the accuracies are higher than for the remaining residues. The averaged accuracy for 0.6% residues (including 14% of the ATP-binding residues) that are predicted with probability > 0.9 equals 0.85, and for the 94% of residues with probability < 0.1 it equals 0.98.

Case study
We demonstrate predictions using chain A of phos-phofructokinase6 (PDBid 3CQD). The predictive quality of the considered methods for this target is similar to their average quality on the entire dataset. The native ATP-binding residues, the binary predictions, and the probabilities predicted by ATPsite, rate4site, BLAST and PSSM+SVM methods are shown in Figure 6. The conservation scores from rate4site were divided by 10 to fit the Figure. The native ATPbinding residues are clustered into four segments, residue 185 to 189, 224 to 229, 248 to 258 and 280 to 287. ATPsite captured the four segments and correctly predicted 9 out of the 17 binding residues with 4 false positives (FPs). The PSSM profile-based predictor missed the last binding segment and successfully identified 7 out of the 17 binding residues with 23 FPs. The BLAST-based predictor missed two of the binding segments and correctly predicted 8 of the 17 binding residues with 7 FPs. The rate4site found the four segments and successfully identified 12 binding residue, however, it also produced 81 FPs.

Conclusions
We developed a new method, ATPsite, for the sequence-based prediction of ATP-binding residues. Our predictor is empirically shown to outperform the  Figure 5 The distribution of average accuracies (shown using bars) for residues binned into twenty 0.05 wide intervals based on their probability of ATP-binding predicted by ATPsite. The percentage of the residues in each bin is shown above the bars. existing approaches. These improvements are attributed to the usage of a novel and comprehensive set of input features, which include both sequence and predicted structural descriptors. We also found that a simple consensus of ATPsite with BLAST-based method leads to additional improvements. The consensus-based predictor achieves AUC = 0.861 and MCC = 0.46, which demonstrates that these predictions provide useful information for the high-throughput, sequence-based annotation of the ATP-binding residues.