Volume 9 Supplement 1
ATPsite: sequence-based prediction of ATP-binding residues
© Chen et al; licensee BioMed Central Ltd. 2011
Published: 14 October 2011
ATP is a ubiquitous nucleotide that provides energy for cellular activities, catalyzes chemical reactions, and is involved in cellular signalling. The knowledge of the ATP-protein interactions helps with annotation of protein functions and finds applications in drug design. The sequence to structure annotation gap motivates development of high-throughput sequence-based predictors of the ATP-binding residues. Moreover, our empirical tests show that the only existing predictor, ATPint, is characterized by relatively low predictive quality.
We propose a novel, high-throughput machine learning-based predictor, ATPsite, which identifies ATP-binding residues from protein sequences. Our predictor utilizes a Support Vector Machine classifier and a comprehensive set of input features that are based on the sequence, evolutionary profiles, and sequence-predicted structural descriptors including secondary structure, solvent accessibility, and dihedral angles.
ATPsite achieves significantly higher Matthews Correlation Coefficient (MCC) and Area Under the ROC Curve (AUC) values when compared with the existing methods, including the ATPint, conservation-based rate4site, and alignment-based BLAST predictors. We also assessed the effectiveness of individual input types. The PSSM profile, the conservation scores, and certain features based on amino acid groups are shown to be more effective in predicting the ATP-binding residues than the remaining feature groups.
Statistical tests show that ATPsite significantly outperforms existing solutions. The consensus of the ATPsite with the sequence-alignment based predictor is shown to give further improvements.
Adenosine-5'-triphosphate (ATP) is a multi-functional nucleotide that plays an important role in energy metabolism, signaling, and replication and transcription of DNA. As of July 2010, 3860 structures in the Protein Data Bank (PDB), which constitute about 6% of known protein structures, are annotated as ATP binding. The ATP binding sites are regarded as valuable drug targets for antibacterial and anti-cancer chemotherapy [2, 3]. Therefore, the protein-ATP interactions are of significant interest.
The past two decades saw a substantial effort to identify conserved characteristics of the ATP-binding sites. Most of these approaches are based on a relatively simple analysis of ATP-binding sequences and structures that led to the identification of sequence motifs and structural templates. For instance, the p-loop motif that interacts with ATP and its analogs was found in several protein families, and structural templates that interact with either adenosine or phosphates (the two chemical groups of ATP) were proposed [5, 6]. However, these motifs/templates are usually confined to one or several protein families and cover only a small subset of the ATP-binding sites.
The large number of protein sequences that lack a known tertiary structure motivates the development of computational tools for high-throughput sequence-based annotation of ATP-binding residues. At the same time, to the best of our knowledge, ATPint is the only sequence-based predictor of the ATP-binding residues. We propose a novel method, named ATPsite, which aims to improve over the predictive quality of ATPint and other popular ways to annotate binding residues, including sequence alignment and conservation scoring. In contrast to ATPint, which only takes the PSSM profile and sequence descriptors as inputs, ATPsite uses a comprehensive set of relevant inputs. These inputs, which include the PSSM profile, sequence descriptors, conservation scores, and predicted secondary structure, relative solvent accessibility (RSA), and dihedral angles, are encoded into a set of custom-designed features that are shown to improve the quality of the ATP-binding predictions.
We extracted all complexes in PDB (as of February 2010) that include ATP. The maximal pairwise sequence identity of the resulting protein chains was reduced to 40% with CD-hit. The remaining 227 chains that interact with ATP constitute the dataset used in this study. Similar to the annotation of DNA-binding residues and residues interacting with small ligands [9, 10], a given residue is annotated as ATP-binding if at least one of its non-hydrogen atoms is less than 3.9 Å away from a non-hydrogen atom of the ATP molecule. Our dataset includes 3393 ATP-binding residues and 80409 non-binding residues and is available at http://biomine.ece.ualberta.ca/ATPsite/.
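The distance-based annotation rule can be illustrated with a minimal sketch. The function name `is_binding` and the toy coordinates are hypothetical and are not taken from any PDB entry; only the 3.9 Å cutoff and the restriction to non-hydrogen atoms come from the text.

```python
import math

CUTOFF = 3.9  # angstroms, the distance threshold stated in the text


def is_binding(residue_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return True if any residue atom is closer than `cutoff` to any ligand atom.

    Both arguments are lists of (x, y, z) coordinates of non-hydrogen atoms.
    """
    return any(
        math.dist(a, b) < cutoff
        for a in residue_atoms
        for b in ligand_atoms
    )


# Toy coordinates (hypothetical values chosen for illustration only).
atp = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
near_residue = [(4.0, 0.0, 0.0), (9.0, 9.0, 9.0)]  # 2.5 A from the second ATP atom
far_residue = [(6.0, 0.0, 0.0)]                    # 4.5 A from the nearest ATP atom

print(is_binding(near_residue, atp))  # True
print(is_binding(far_residue, atp))   # False
```

In practice the coordinates would be read from the PDB file of each complex; the parsing step is omitted here.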
Evaluation criteria and test procedure
The binary predictions are assessed using four counts: TP (true positives) and TN (true negatives) are the counts of correctly predicted binding and non-binding residues, respectively, FP (false positives) are non-binding residues that were predicted as binding residues, and FN (false negatives) are binding residues that were predicted as non-binding residues. The Matthews correlation coefficient (MCC) ranges between -1 and 1 and equals zero when all residues are predicted as binding or all are predicted as non-binding. A higher MCC value indicates better predictions.
The receiver operating characteristic (ROC) curve was used to examine the predicted probabilities. For each value of probability p generated by a given method (between 0 and 1), the residues with probability ≥ p are labeled as binding and all other residues are labeled as non-binding. Next, the TP-rate and the FP-rate are calculated for each p, and we use the area under the resulting curve (AUC) to quantify the predictive quality.
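For reference, the standard definition of the MCC in terms of these four counts can be sketched as follows. Returning zero when the denominator vanishes is a common convention that we assume here; it matches the statement that the MCC equals zero when every residue receives the same prediction.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined cases (e.g. all residues predicted
        # as non-binding) are reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(mcc(tp=10, tn=90, fp=0, fn=0))   # perfect predictions
print(mcc(tp=0, tn=100, fp=0, fn=5))   # everything predicted as non-binding
```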
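The threshold sweep described above can be sketched in a few lines. This is an illustrative implementation under our own assumptions (trapezoidal integration, one threshold per distinct probability value); the function name `roc_auc` is hypothetical and the authors' exact procedure is not specified in the text.

```python
def roc_auc(probs, labels):
    """Area under the ROC curve for predicted probabilities and 0/1 labels."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both binding and non-binding residues are required")

    points = [(0.0, 0.0)]  # (FP-rate, TP-rate), starting above the highest score
    for p in sorted(set(probs), reverse=True):
        # Residues with probability >= p are labeled as binding.
        tp = sum(1 for s, y in zip(probs, labels) if s >= p and y == 1)
        fp = sum(1 for s, y in zip(probs, labels) if s >= p and y == 0)
        points.append((fp / neg, tp / pos))

    # Trapezoidal rule over consecutive ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc


print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfectly ranked example
```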
We analyze the statistical significance of the differences in the MCC and AUC values between predictions generated by ATPsite and the other methods. The MCC values are available for all methods, while the AUC value cannot be calculated for an alignment-based predictor. These values are calculated per sequence (using the cross-validated predictions) for each method and we compare them using a paired Wilcoxon signed-rank test at the 0.01 significance level. This non-parametric test is used since the per-sequence MCC and AUC values do not follow a normal distribution, as tested using the Shapiro-Wilk test at the 0.05 significance level.
Architecture of the proposed predictor
Predicted secondary structure generated by PSIPRED. We use probabilities of the 3 secondary structure states for each residue in the window.
Predicted relative solvent accessibility generated by Real-SPINE3. We use the real values, which quantify the fraction of the surface area of a given residue that is accessible to the solvent, for the residues in the window.
Predicted dihedral angles generated by Real-SPINE3. We utilize two real values, which represent phi (involving the backbone atoms C'-N-Cα-C') and psi (involving the backbone atoms N-Cα-C'-N) angles.
PSSM profile generated by PSIBLAST with default parameters. We normalize these inputs with $1/(1+2^{-x})$, where $x$ is the raw value from the PSSM profile; this transformation is commonly used in secondary structure prediction. For a window centered at residue $R_i$ at the $i$th position, we calculate 17×20 features $f_{i+k,j}$, where $k = -8, -7, \ldots, 7, 8$ is the index of the position in the window and $j = 1, 2, \ldots, 20$ is the index of the PSSM column. We average the values on the left and right sides of the central residue, $g_{i+z,j} = (f_{i+z,j} + f_{i-z,j})/2$, where $z = 0, 1, \ldots, 8$. As a result, the original 17×20 values are transformed to 9×20 values.
AA groups including hydrophobic residues (Ala, Cys, Ile, Leu, Met and Val), negatively charged (Asp and Glu), positively charged (His, Lys, Arg) and carboxamide-containing AAs (Asn and Gln).
Terminal indicator is set to 1 for the first and last 3 residues in the sequence and 0 for the other positions.
Secondary structure segment indicator for the helix/strand/coil predictions from PSIPRED is calculated on both sides of the window. If 4 (3) consecutive residues on the left/right side of the window are predicted as helix (strand), we set the helix (strand) indicator to 1 for the left/right side. If both the helix and strand indicators are 0, then the coil indicator is set to 1.
Residue conservation scores are calculated based on the Shannon entropy (referred to as conservation A) and two other formulas proposed in [15, 16] (named conservation B and C, respectively) which incorporate the background frequency of the amino acids.
Collocation of AA pairs [17, 18] is calculated for the residues in the window, which is motivated by results for membrane proteins where certain AA pairs are over-represented. Similarly, several sequence motifs occur frequently at the ATP binding sites. To accommodate mutations in these motifs, we use collocated AA pairs (pairs with gaps) to characterize them. We only consider pairs formed between the central residue in the window and another residue up to 5 positions away. This results in 20×20×10 = 4000 frequencies (for 20 AA types and 10 positions; 5 on each side). As in the work on the membrane proteins, we calculate p-values that indicate the significance of the association between an AA pair and the ATP-binding annotation. A low p-value for an AA pair indicates a low probability that the association between this pair and ATP-binding is a coincidence. When analyzing 4000 randomly distributed variables, we expect to observe by chance one instance of a difference from the expected value with significance p < 0.00025 (1/4000). We exclude the AA pairs with $p \geq 10^{-6}$, since based on Engelman's study their association with the ATP-binding event would be random.
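The enumeration of collocated pairs around a central residue can be sketched as below. The function name `collocated_pairs` and the tuple encoding (central AA, partner AA, signed offset) are our own illustrative choices; the sketch does not cover the p-value filtering step.

```python
def collocated_pairs(seq, i, max_gap=5):
    """List (central AA, partner AA, offset) for partners up to `max_gap` positions away.

    `seq` is the amino acid sequence and `i` the 0-based index of the central
    residue. Offsets that fall outside the sequence are skipped.
    """
    pairs = []
    for offset in range(-max_gap, max_gap + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(seq):
            continue
        pairs.append((seq[i], seq[j], offset))
    return pairs


# Toy sequence (hypothetical); the central residue is the serine at index 2.
print(collocated_pairs("GKSGKT", 2))
```

With 20 amino acid types for each member of the pair and 10 non-zero offsets, this encoding yields the 20×20×10 = 4000 possible pair types mentioned above.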
We note that the terminal and secondary structure indicators, collocation of AA pairs, and the predicted secondary structure, relative solvent accessibility, and dihedral angles were never before used to predict the ATP-binding residues.
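The PSSM normalization and the left/right averaging over the 17-residue window can be sketched as follows. The function names are hypothetical, and the handling of window positions that fall outside the sequence (a value of 0.0 here) is our assumption, since the text does not specify it.

```python
def normalize(x):
    """Map a raw PSSM score x to (0, 1) using 1 / (1 + 2^(-x))."""
    return 1.0 / (1.0 + 2.0 ** (-x))


def folded_window(pssm, i, half=8):
    """Return (half + 1) rows of averaged, normalized PSSM values for residue i.

    `pssm` is a list with one row of raw scores per residue. Row z of the
    result holds g[i+z][j] = (f[i+z][j] + f[i-z][j]) / 2 for every column j.
    """
    n_cols = len(pssm[0])

    def f(pos, j):
        # ASSUMPTION: positions outside the sequence contribute 0.0.
        if 0 <= pos < len(pssm):
            return normalize(pssm[pos][j])
        return 0.0

    return [
        [(f(i + z, j) + f(i - z, j)) / 2.0 for j in range(n_cols)]
        for z in range(half + 1)
    ]


# Toy PSSM (hypothetical): 20 residues, 20 columns, all raw scores equal to 0.
toy_pssm = [[0] * 20 for _ in range(20)]
features = folded_window(toy_pssm, 10)
print(len(features), len(features[0]))  # 9 rows of 20 values
```

For z = 0 both terms refer to the central residue, so the first row is simply the normalized profile of that residue.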
Feature selection and parameterization
We use 5-fold cross validation to perform feature selection and evaluation. The dataset is randomly divided per sequence into 5 folds, of which 4 are used for training and the remaining one for testing; each of the 5 folds is used once as the test fold. This procedure assures that annotations of ATP binding from the test folds are not used to train the predictive model. Biserial correlation is calculated between each of the features and the binary annotation of ATP-binding residues for each of the 5 training sets. The correlation values, averaged over the 5 training sets, were used to rank the features. We used a best-first forward feature selection. Given a feature list $F = [f_i,\ i = 1, 2, \ldots, n]$, sorted by the average correlation in descending order, and an initially empty list S of selected features, in each round we move the top-ranked feature from F to S and run the default SVM with a linear kernel and complexity constant C = 1 on the feature set S (using 5-fold cross validation). If the addition of a given feature improves the average AUC value over the 5 test folds, this feature is retained in S; otherwise it is removed from S. We repeat this until F is empty.
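The selection loop can be sketched as below. The SVM training and cross-validation are abstracted into a caller-supplied `evaluate` function, which is hypothetical; in this sketch the first feature is always accepted, which is our assumption because the text does not state the baseline for the first round.

```python
def forward_select(ranked_features, evaluate):
    """Greedy forward selection over features sorted by decreasing correlation.

    `evaluate(feature_list)` is assumed to return the average cross-validated
    AUC obtained with the given features.
    """
    selected = []
    best = float("-inf")  # ASSUMPTION: the first feature is always accepted
    for feat in ranked_features:
        candidate = selected + [feat]
        score = evaluate(candidate)
        if score > best:
            # The feature improves the AUC, so it is retained.
            selected, best = candidate, score
    return selected, best


# Toy scoring function (hypothetical numbers): each feature adds a fixed gain
# and every feature costs a small penalty.
gain = {"pssm": 0.2, "noise": 0.0, "conservation": 0.1}


def toy_auc(features):
    return 0.5 + sum(gain[f] for f in features) - 0.01 * len(features)


selected, auc = forward_select(["pssm", "noise", "conservation"], toy_auc)
print(selected)  # ['pssm', 'conservation']
```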
Summary list of the selected features. Table columns: input type; # of selected features. Rows: predicted secondary structure, predicted relative solvent accessibility, predicted dihedral angles, secondary structure segment indicator, residue conservation scores, collocation of AA pairs.
Rate4site predicts functional sites by finding conserved residues. We first run PSI-BLAST using the query sequence against the NCBI non-redundant database. For chains with at least 3 significant matches, we created alignments of the best 50 sequences (the default for ConSurf, which is the web version of rate4site) using ClustalW and supplied them to rate4site. Rate4site generates a conservation score for each residue, and residues with lower scores (indicating a higher conservation) have a higher probability of being binding residues. We use these scores to compute ROC curves, and the threshold that maximizes the MCC value is used to binarize the conservation scores.
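The binarization step can be illustrated with a minimal sketch that scans candidate thresholds and keeps the one with the highest MCC. The function names and toy scores are hypothetical; residues with a score at or below the threshold are labeled as binding because lower scores indicate higher conservation.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(scores, labels):
    """Return (threshold, MCC) maximizing MCC when score <= threshold means binding."""
    best_t, best_m = None, float("-inf")
    for t in sorted(set(scores)):
        pred = [1 if s <= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
        tn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 0)
        fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
        m = mcc(tp, tn, fp, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m


# Toy conservation scores (hypothetical) with the two most conserved residues binding.
threshold, value = best_threshold([-1.2, -0.8, 0.1, 0.9, 1.5], [1, 1, 0, 0, 0])
print(threshold, value)
```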
Sequence alignment using BLAST identifies, for a query sequence, similar sequences or segments in a dataset annotated with ATP-binding residues. This approach predicts the binding residues by using the ATP-binding annotations from the best aligned sequence. We execute the BLAST-based alignment between a query sequence and all other sequences (except the query sequence itself) in the benchmark dataset. The sequence with the lowest E-value is selected as the template. The residues in the query sequence that are aligned with the binding residues on the template chain are predicted as the ATP-binding residues.
The PSSM profile is widely used in related sequence-based predictors, including the ATPint predictor. To validate the effectiveness of the features proposed in this work, we build a simple predictor that uses an SVM (with the same parameters as the SVM in ATPsite) and takes the PSSM profile as the input. This allows estimation of the improvements provided by the new features.
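The transfer of annotations along a pairwise alignment can be sketched as follows. The function name, the gapped-string representation of the alignment, and the toy sequences are our own assumptions; running BLAST and parsing its output are not shown.

```python
def transfer_annotation(aligned_query, aligned_template, template_binding,
                        q_start=0, t_start=0):
    """Return 0-based query positions aligned to binding residues of the template.

    `aligned_query` and `aligned_template` are equal-length strings with '-'
    for gaps; `template_binding` holds 0-based template positions annotated as
    ATP-binding; `q_start` and `t_start` are the positions where the local
    alignment begins in each sequence.
    """
    predicted = []
    qi, ti = q_start, t_start
    for q, t in zip(aligned_query, aligned_template):
        if q != "-" and t != "-" and ti in template_binding:
            predicted.append(qi)
        # Advance each index only when its sequence has a residue in this column.
        if q != "-":
            qi += 1
        if t != "-":
            ti += 1
    return predicted


# Toy alignment (hypothetical sequences and annotations).
print(transfer_annotation("MK-GSGKT", "MRAGAGKS", {3, 6, 7}))  # [2, 5, 6]
```

A binding residue of the template that is aligned to a gap in the query produces no prediction, since there is no query residue to label.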
Comparison with existing methods
Comparison between ATPsite, ATPint and three baseline predictors that use alignment (BLAST), conservation scoring (rate4site) and evolutionary profiles (PSSM+SVM). The “Significance” column reports statistical significance tests that compare paired per-sequence AUC and MCC between ATPsite and other methods; + indicates that ATPsite is significantly better at 0.01 level.
Effectiveness of individual input types
Predictive quality achieved with individual input types; the inputs are sorted in descending order by their AUC values. Rows: predicted secondary structure, secondary structure segment indicator, predicted dihedral angles, collocation of AA pairs, predicted solvent accessibility.
Comparison between 4 ensemble predictors and their component predictors.
The predicted probability implies confidence
We developed a new method, ATPsite, for the sequence-based prediction of ATP-binding residues. Our predictor is empirically shown to outperform the existing approaches. These improvements are attributed to the usage of a novel and comprehensive set of input features, which include both sequence and predicted structural descriptors. We also found that a simple consensus of ATPsite with BLAST-based method leads to additional improvements. The consensus-based predictor achieves AUC = 0.861 and MCC = 0.46, which demonstrates that these predictions provide useful information for the high-throughput, sequence-based annotation of the ATP-binding residues.
SVM: Support Vector Machine
AUC: Area Under Curve
MCC: Matthews Correlation Coefficient
We thank the authors of PSIPRED and Real-SPINE3 for sharing their programs. This work was supported in part by the NSERC Discovery grant to LK, by the Alberta Ingenuity and iCORE scholarship in ICT to KC, and by the Izaak Walton Killam Memorial scholarship to MJM.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–42. doi:10.1093/nar/28.1.235
2. Maxwell A, Lawson DM: The ATP-binding site of type II topoisomerases as a target for antibacterial drugs. Curr Top Med Chem 2003, 3: 283–303. doi:10.2174/1568026033452500
3. Rock FL, Mao W, Yaremchuk A, Tukalo M, Crépin T, Zhou H, et al.: An antifungal agent inhibits an aminoacyl-tRNA synthetase by trapping tRNA in the editing site. Science 2007, 316: 1759–1761. doi:10.1126/science.1142189
4. Walker JE, Saraste M, Runswick MJ, Gay NJ: Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J 1982, 1: 945–951.
5. Moodie SL, Mitchell JB, Thornton JM: Protein recognition of adenylate: an example of a fuzzy recognition template. J Mol Biol 1996, 263: 486–500. doi:10.1006/jmbi.1996.0591
6. Denessiouk KA, Johnson MS: When fold is not important: a common structural framework for adenine and AMP binding in 12 unrelated protein families. Proteins 2000, 38: 310–26. doi:10.1002/(SICI)1097-0134(20000215)38:3<310::AID-PROT7>3.0.CO;2-T
7. Chauhan JS, Mishra NK, Raghava GP: Identification of ATP binding residues of a protein from its primary sequence. BMC Bioinformatics 2009, 10: 434. doi:10.1186/1471-2105-10-434
8. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. doi:10.1093/bioinformatics/btl158
9. Luscombe NM, Laskowski RA, Thornton JM: Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res 2001, 29: 2860–74. doi:10.1093/nar/29.13.2860
10. Chen K, Kurgan L: Investigation of atomic level patterns in protein-small ligand interactions. PLoS ONE 2009, 4: 4473. doi:10.1371/journal.pone.0004473
11. McGuffin LJ, Bryson K, Jones DT: PSIPRED protein structure prediction server. Bioinformatics 2000, 16: 404–5. doi:10.1093/bioinformatics/16.4.404
12. Faraggi E, Xue B, Zhou Y: Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a 2-layer neural network. Proteins 2009, 74: 847–56. doi:10.1002/prot.22193
13. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–402. doi:10.1093/nar/25.17.3389
14. Fan RE, Chen PH, Lin CJ: Working set selection using second order information for training SVM. J Mach Learn Res 2005, 6: 1889–918.
15. Wang K, Samudrala R: Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics 2006, 7: 385. doi:10.1186/1471-2105-7-385
16. Capra JA, Singh M: Predicting functionally important residues from sequence conservation. Bioinformatics 2007, 23: 1875–82. doi:10.1093/bioinformatics/btm270
17. Chen K, Kurgan L, Ruan J: Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct Biol 2007, 7: 25. doi:10.1186/1472-6807-7-25
18. Chen K, Jiang Y, Du L, Kurgan L: Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs. J Comput Chem 2009, 30: 163–72. doi:10.1002/jcc.21053
19. Senes A, Gerstein M, Engelman DM: Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J Mol Biol 2000, 296: 921–36. doi:10.1006/jmbi.1999.3488
20. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N: Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, Suppl 1: S71–7.
21. Ashkenazy H, Erez E, Martz E, Pupko T, Ben-Tal N: ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Res 2010, 38: W529–33. doi:10.1093/nar/gkq399
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, et al.: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23: 2947–2948. doi:10.1093/bioinformatics/btm404
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.