Predicting DNA-binding locations and orientation on proteins using knowledge-based learning of geometric properties
© Wang and Chen; licensee BioMed Central Ltd. 2011
Published: 14 October 2011
DNA-binding proteins perform their functions through specific or non-specific sequence recognition. Although many sequence- or structure-based approaches have been proposed to identify DNA-binding residues on proteins or protein-binding sites on DNA sequences with satisfactory performance, it remains challenging to unveil the exact mechanism of protein-DNA interactions without crystal structures of the complexes. Without information from complexes, the linkages between DNA-binding proteins and their binding sites on DNA are still missing.
Because co-crystallized structures remain difficult to acquire efficiently, this study proposes a knowledge-based learning method that, given a protein structure, predicts the DNA orientation and base locations around the protein’s DNA-binding sites. First, the functionally important residues of a query protein are predicted by a sequential pattern mining tool. Next, surface residues falling in the predicted functional regions are determined from the given structure. These residues are then clustered by their spatial coordinates, and the resulting clusters are ranked by a proposed DNA-binding propensity function. Clusters with high DNA-binding propensities are treated as DNA-binding units (DBUs), and each DBU is analyzed by principal component analysis (PCA) to predict the potential orientation of DNA grooves. More specifically, the proposed method predicts the direction of the tangent line to the helix curve of the DNA groove where a DBU is going to bind.
This paper proposes a knowledge-based learning procedure to determine the spatial location of the DNA groove with respect to the query protein structure by considering the geometric propensity between protein side chains and DNA bases. The 11 test cases used in this study reveal that the location and orientation of the DNA groove around a selected DBU can be predicted with acceptable errors.
This study presents a method to predict the location and orientation of DNA grooves with respect to the structure of a DNA-binding protein. The test cases shown in this study reveal the possibility of envisioning the protein-DNA binding conformation before a co-crystallized structure can be determined. How the proposed method can be combined with existing protein-DNA docking tools to study protein-DNA interactions deserves further study.
Gene regulation in organisms relies on correct and specific protein-DNA recognition. Recently, many computational methods have been proposed to predict binding sites on both proteins and DNA [1, 2]. Sequence-based approaches employ machine learning and training data from structure databases to predict DNA-binding sites on proteins [3–5]. On the other hand, pattern mining or multiple sequence alignment techniques are usually combined with large-scale molecular binding information, such as chromatin immunoprecipitation (ChIP) experiments, to discover protein-binding sites on DNA sequences [6–8].
In recent years, many experimentally determined protein structure models have been extensively studied to understand and decipher the binding mechanisms of protein-DNA interactions. With protein-DNA complexes, structure-based algorithms [10–12] construct consensus sequences or profiles of binding sites to complement the sequence-based approaches for identifying transcription factor binding sites. Many structure-based methods also predict DNA-binding sites on proteins using both sequence and structure information [13–15]. Although many methods have been proposed to predict protein-DNA interactions, it remains challenging to unveil the exact binding conformation of protein-DNA interactions without crystal structures of the complexes.
In addition to de novo prediction methods, researchers have applied structure alignment of a query protein against existing protein-DNA complexes to predict binding sites and construct potential binding models. Another way to generate protein-DNA complexes for a query sequence is homology modelling, in which sequence alignment is performed between the query protein and its homologous sequences with complex structures. The advantage of this approach is that no structure of the query protein is required in advance. Furthermore, when an unbound protein structure is available, docking programs [18–20] can be employed to predict the binding locations and orientation between proteins and DNA molecules. Protein-DNA docking is capable of generating novel complexes, which is particularly useful when the query protein is not similar to any protein chain in the complex database. However, the prediction accuracy of molecular docking still largely relies on computing resources and on prior knowledge about the DNA sequence and conformation.
A recent study has shown that the directionality of normal vectors on the protein surface is correlated with that of DNA axes. In other words, it may be possible to investigate the DNA-binding location and orientation on protein structures even when protein-DNA complexes are not available. This observation motivates the current study. We first characterize the geometric properties between protein side chains and DNA bases from a set of existing protein-DNA complexes. Then, several learning algorithms are employed to analyze the query structure and predict DNA-binding locations and orientation. More specifically, the proposed method predicts the direction of the tangent line to the helix curve of the DNA groove where the DNA-binding protein is going to bind. The predicted information can be used as the initial guess for docking tools or serve as supplementary information to improve the accuracy of docking results.
Given the structure of a query protein, the proposed method first identifies a subgroup of conserved residues that form a compact cluster in space and have high DNA-binding propensity. The discovered set of residues is considered a basic DNA-binding unit (DBU), which is assumed to protrude into a DNA groove, whether major or minor, to recognize DNA sequences. To predict the DNA-binding orientation of a local region of the protein-DNA binding interface, we apply principal component analysis (PCA) to selected atom coordinates in a DBU in order to determine the direction of the tangent line to the helix curve of the DNA groove bound by the DBU. With the detected DBU, we construct the distribution of each base type around the DBU based on a pre-calculated knowledgebase of 80 geometric models. In the following subsections, we describe each procedure of the proposed method in detail.
Collecting training and testing data
The training data used for constructing the knowledgebase was prepared with reference to previous work. This dataset was collected from the July 2007 release of the Protein Data Bank (PDB) and contains only X-ray structures of protein-DNA complexes with resolution better than 3.0 Å. Protein sequences shorter than 40 amino acids were excluded. The DNA molecule must contain at least six base pairs. The protein chain in the complex must also have at least five DNA-binding residues (distance to DNA atoms < 4.5 Å). Furthermore, redundant members were removed by sequence alignment, resulting in 179 DNA-binding domains belonging to 170 PDB files. We refer to this dataset as PDB170.
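The residue-level criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the 4.5 Å cutoff, the minimum of five DNA-binding residues, and the minimum chain length of 40, but it does not say how the distance is measured, so the sketch assumes a residue is DNA-binding when any of its atoms lies within the cutoff of any DNA atom. The function names and input layout are hypothetical.

```python
import math

CONTACT_CUTOFF = 4.5       # Å, from the text
MIN_BINDING_RESIDUES = 5   # from the text
MIN_CHAIN_LENGTH = 40      # amino acids, from the text


def binding_residues(protein_atoms, dna_coords, cutoff=CONTACT_CUTOFF):
    """Return ids of residues with at least one atom closer than `cutoff` to a DNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (hypothetical layout).
    dna_coords:    list of (x, y, z) tuples for all DNA atoms.
    Assumption: any-atom-to-any-atom distance decides the contact.
    """
    hits = set()
    for res_id, xyz in protein_atoms:
        if res_id in hits:
            continue  # residue already known to be in contact
        if any(math.dist(xyz, d) < cutoff for d in dna_coords):
            hits.add(res_id)
    return hits


def chain_passes(protein_atoms, dna_coords, chain_length):
    """Apply the chain-length and binding-residue-count criteria to one chain."""
    if chain_length < MIN_CHAIN_LENGTH:
        return False
    return len(binding_residues(protein_atoms, dna_coords)) >= MIN_BINDING_RESIDUES
```

The resolution and base-pair criteria are metadata checks on the PDB entry and are left out of the sketch.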
Since the proposed method is a knowledge-based approach, it is important to have an independent test set in which the redundancy between training data and testing data has been carefully eliminated. For this purpose, a set of 11 PDB files of DNA-protein complexes (PDB11) was collected as the testing data by the following procedure. First, 1267 protein-DNA complex structures were collected from PDB (May 2009 release), after removing redundancy by excluding sequences with an identity greater than 90% against a previously selected sequence. All 1267 protein-DNA complex structures were solved by X-ray diffraction with resolution better than 3.0 Å. Second, we ran BLAST for each protein chain of the 1267 structures against the protein chains in PDB170, and excluded any chain with e-value < 0.001 or identity > 25% against the training chains to remove the redundancy between the training data and the testing data. Afterward, the selected chains were clustered by CD-HIT to further remove redundancy within the testing data. Finally, only the PDB files with exactly two twisted DNA strands were selected. PDB files with unwound DNA positions were also excluded.
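The BLAST-based exclusion rule can be written as a small filter. The sketch below assumes the BLAST results have already been parsed into (e-value, percent identity) pairs per test chain; that input layout and the function names are hypothetical, while the two thresholds come from the text.

```python
def is_redundant(hits, max_evalue=0.001, max_identity=25.0):
    """True if any hit against a training chain violates either threshold.

    hits: list of (evalue, percent_identity) pairs for one candidate test chain.
    A chain is excluded when e-value < 0.001 or identity > 25%, as in the text.
    """
    return any(e < max_evalue or ident > max_identity for e, ident in hits)


def filter_test_chains(blast_hits):
    """Keep only the chains with no redundant hit against the training set.

    blast_hits: dict mapping chain id to its list of hits (hypothetical layout).
    """
    return [chain for chain, hits in blast_hits.items() if not is_redundant(hits)]
```

A chain with no reported hits is kept, since nothing links it to the training set.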
Constructing knowledgebase of geometric propensity between side chains and bases
In Eq. (1), which defines the DNA-binding propensity of an amino acid, the symbol # is short for the word ‘number’.
Next, we investigate the DNA-binding propensity of each atom in an amino acid based on a similar idea. We want to know which atoms of an amino acid are most likely to interact with DNA bases. Using PDB170, we count, for each atom of each amino acid type, the number of bases that fall within a distance of 4 Å. The top three atoms of each amino acid are then taken as its reference frame, which is used later to align amino acids of the same type from different structure files when constructing geometric models.
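As a rough illustration of this counting step, the sketch below tallies, for every (amino acid type, atom name) pair, how many distinct bases have at least one atom within 4 Å, and then keeps the three atoms with the highest counts. The input layout, the function names, and the alphabetical tie-breaking are assumptions made for the example; the text only specifies the 4 Å distance and the top-3 selection.

```python
import math
from collections import Counter, defaultdict


def count_base_contacts(protein_atoms, base_atoms, cutoff=4.0):
    """Count nearby bases for each (amino acid type, atom name) pair.

    protein_atoms: iterable of (aa_type, atom_name, (x, y, z)) tuples.
    base_atoms:    list of (base_id, (x, y, z)) tuples, one per DNA base atom.
    A base is counted once per protein atom, however many of its atoms are near.
    """
    counts = Counter()
    for aa, atom_name, xyz in protein_atoms:
        near = {bid for bid, bxyz in base_atoms if math.dist(xyz, bxyz) <= cutoff}
        counts[(aa, atom_name)] += len(near)
    return counts


def top_atoms(counts, k=3):
    """Return the k atom names with the highest counts for each amino acid type."""
    per_aa = defaultdict(list)
    for (aa, atom_name), n in counts.items():
        per_aa[aa].append((n, atom_name))
    return {
        aa: [name for _, name in sorted(items, key=lambda t: (-t[0], t[1]))[:k]]
        for aa, items in per_aa.items()
    }
```

In practice the counts would be accumulated over all complexes in the training set before calling `top_atoms`.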
Discovering basic DNA-binding units
A basic DNA-binding unit (DBU) is defined as a compact cluster of residues that is assumed to protrude into a DNA groove when a protein binds to DNA. The proposed method discovers DBUs by combining information on conservation, solvent accessibility, and DNA-binding propensity. Conserved residues are discovered by a pattern mining utility, MAGIIC-PRO. The solvent accessibility of each residue is calculated by DSSP. Finally, conserved residues near the surface are clustered based on their spatial relationships, and the resulting clusters are ranked by their DNA-binding propensities. The details of the three procedures are given below.
MAGIIC-PRO is a sequential pattern mining utility that is useful for identifying functional regions and residues. Readers can refer to the MAGIIC-PRO paper for details about the parameter settings. After a set of conserved residues was discovered, we calculated the relative solvent accessibility (RSA) score of each residue on the structure of the target protein chain by invoking DSSP. We then kept only residues with RSA scores higher than 0.25 for the subsequent clustering process.
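The surface filter can be sketched as follows. The text gives the 0.25 threshold but not the normalization used to turn an absolute accessibility into an RSA score, so the sketch assumes RSA is the accessible surface area divided by a per-residue-type maximum supplied by the caller. The maximum values, the input layout, and the function name are all assumptions of this example.

```python
def exposed_conserved(residues, max_asa_by_type, threshold=0.25):
    """Return ids of conserved residues whose RSA exceeds `threshold`.

    residues:        list of (res_id, aa_type, asa, is_conserved) tuples, where
                     `asa` is an absolute accessibility such as DSSP reports.
    max_asa_by_type: dict of reference maximum accessibility per residue type
                     (assumption: the normalization scale is not given in the text).
    """
    selected = []
    for res_id, aa, asa, conserved in residues:
        rsa = asa / max_asa_by_type[aa]
        if conserved and rsa > threshold:
            selected.append(res_id)
    return selected
```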
Hierarchical clustering was employed to cluster these functionally important surface residues into DBUs. Clustering was first conducted at the atom level. Euclidean distance was used to measure the dissimilarity between two atoms, and average linkage was adopted to measure the dissimilarity between existing clusters. The clustering process was stopped once every remaining pair of clusters exhibited a dissimilarity larger than 11 Å (covering about three successive bases in DNA grooves). When the atoms of a single residue did not all fall into the same cluster, a majority vote was used to assign the residue to a cluster. Finally, we ranked the clusters by their DNA-binding propensity scores. The score of a cluster is the sum of the DNA-binding propensity scores, defined in Eq. (1), of the residues inside it. Clusters that score higher than expectation (‘number of residues inside the cluster’ × ‘average DNA-binding propensity of the 20 amino acids’) are considered DNA-binding units in the following analyses.
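The clustering and ranking steps can be illustrated with a small, dependency-free sketch. It follows the description above (atom-level average linkage, an 11 Å stopping threshold, majority vote per residue, and the expectation test), but it is not the authors' implementation: the function names and input layout are hypothetical, the propensity values must be supplied by the caller because Eq. (1) is not reproduced here, and the quadratic-time search is only suitable for small inputs.

```python
import math
from collections import Counter


def average_linkage(a, b, coords):
    """Mean pairwise Euclidean distance between two clusters of atom indices."""
    total = sum(math.dist(coords[i], coords[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def cluster_atoms(coords, cutoff=11.0):
    """Merge the closest clusters until the closest pair is farther than `cutoff`."""
    clusters = [[i] for i in range(len(coords))]
    while len(clusters) > 1:
        d, p, q = min(
            (average_linkage(clusters[p], clusters[q], coords), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > cutoff:
            break
        clusters[p] += clusters[q]
        del clusters[q]
    return clusters


def assign_residues(clusters, atom_residue):
    """Majority vote: a residue joins the cluster that holds most of its atoms."""
    votes = {}
    for cid, members in enumerate(clusters):
        for i in members:
            votes.setdefault(atom_residue[i], Counter())[cid] += 1
    return {res: c.most_common(1)[0][0] for res, c in votes.items()}


def select_dbus(residue_cluster, residue_aa, propensity):
    """Rank clusters by summed propensity; keep those above size x mean propensity.

    propensity: dict of per-amino-acid scores (all 20 types in a real run).
    """
    mean_p = sum(propensity.values()) / len(propensity)
    members = {}
    for res, cid in residue_cluster.items():
        members.setdefault(cid, []).append(res)
    scored = [(sum(propensity[residue_aa[r]] for r in rs), rs) for rs in members.values()]
    scored.sort(key=lambda t: -t[0])
    return [rs for score, rs in scored if score > len(rs) * mean_p]
```

Stopping on the closest pair is equivalent to requiring that every remaining pair of clusters is more than 11 Å apart.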
Predicting DNA-binding orientation
The proposed method assumes that the discovered DBUs protrude into DNA grooves, whether major or minor. We therefore selected three atoms in each amino acid to represent the spatial property of the residue, and used PCA to predict the direction of the tangent line to the helix curve of the DNA groove. The first selected atom is the atom with the highest DNA-binding propensity in the amino acid. The second and third atoms are the CA and C atoms on the backbone.
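A minimal sketch of the PCA step is given below, using only the standard library. It assumes that the first principal component of the selected atom coordinates is taken as the predicted tangent direction; the text does not state which component is used, so this choice is an assumption, as are the function names. The covariance eigenvector is found by power iteration started from each coordinate axis, keeping the direction that captures the most variance. The helper `axis_angle_deg` shows one way an orientation error in degrees could be measured between two undirected lines.

```python
import math


def _matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[a][b] * v[b] for b in range(3)) for a in range(3)]


def principal_axis(points, iterations=200):
    """Unit vector along the direction of largest variance of 3-D points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(3)] for a in range(3)]
    best_v, best_var = None, -1.0
    # Start from each axis so that at least one start overlaps the top eigenvector.
    for start in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
        v = start
        for _ in range(iterations):
            w = _matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # start vector lies in the null space of the covariance
            v = [x / norm for x in w]
        var = sum(x * y for x, y in zip(v, _matvec(cov, v)))  # Rayleigh quotient
        if var > best_var:
            best_v, best_var = v, var
    return best_v


def axis_angle_deg(u, v):
    """Angle in degrees (0 to 90) between two lines, ignoring their sign."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(min(1.0, dot / norm)))
```

Here `points` would hold the three representative atoms of every residue in the DBU, and the returned axis is defined only up to sign, which is why the angle helper takes the absolute value of the dot product.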
Predicting locations of DNA bases
In this formulation, x is a 3-dimensional vector representing the coordinates of a point in space, y_i represents the 3-dimensional coordinates of atom i, aa(r) stands for the amino acid type of residue r, and C is the set of residues in the selected DBU. In this study, we empirically used the number of residues belonging to the amino acid type aa(r) in the DBU to normalize the contributed scores before accumulating them.
In this section, we first define how the performance of the proposed method was evaluated. We then demonstrate how the proposed method can be used to predict the DNA-binding conformation for a large DNA-binding interface, using a TATA-box binding protein as an example.
Evaluation of prediction accuracy
Table: Errors of the predicted locations and orientation on the 11 test cases (PDB11). Columns: PDB ID and chain ID; location error (Å); groove type to which the top-1 DBU binds; orientation error (degrees).
Multiple predictions for large protein-DNA interfaces
Prediction using unbound protein structure
Since proteins usually undergo conformational change upon binding DNA, it is of interest to investigate how the proposed method performs when the given protein structure is an unbound model. We used another structure model (1TBP:B) of the same protein (TATA-box binding protein of Saccharomyces cerevisiae) to predict DNA-binding locations and orientation with the proposed method. Figure 4(b) shows that the predictions are generally consistent with those derived from the bound structure shown in Figure 4(a). This reveals the potential of the proposed method in future applications that predict exact binding mechanisms using unbound structures of DNA-binding proteins alone.
Table: Comparison with existing methods for predicting DNA-binding residues. Columns: PDB ID and chain ID; the proposed method.
This study opens an opportunity for computational methods to envision protein-DNA binding conformations whenever protein structures are available. Using MAGIIC-PRO to discover functionally important residues succeeds in 10 of the 11 test cases. The proposed method for discovering basic DNA-binding units succeeds in seven of these 10 cases. Among the seven correctly predicted DBUs, the constructed models identify the correct base locations in all cases, and PCA identifies the tangent direction of the bound groove in five cases. We conclude that the proposed method could help to set the initial conditions of DNA structure models for protein-DNA docking or serve as useful supplementary information in studying protein-DNA interactions.
The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financial support under contracts 98-2221-E-002-137-MY2 and 99-2627-B-002-004.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
- Sarai A, Kono H: Protein-DNA recognition patterns and predictions. Annual Review of Biophysics and Biomolecular Structure 2005, 34:379–398. doi:10.1146/annurev.biophys.34.040204.144537
- Höglund A, Kohlbacher O: From sequence to structure and back again: approaches for predicting protein-DNA binding. Proteome Science 2004, 2:3. doi:10.1186/1477-5956-2-3
- Ofran Y, Mysore V, Rost B: Prediction of DNA-binding residues from sequence. Bioinformatics 2007, 23(13):I347-I353. doi:10.1093/bioinformatics/btm174
- Hwang S, Gou ZK, Kuznetsov IB: DP-Bind: a Web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 2007, 23(5):634–636. doi:10.1093/bioinformatics/btl672
- Lu H, Carson MB, Langlois R: NAPS: a residue-level nucleic acid-binding prediction server. Nucleic Acids Research 2010, 38:W431-W435. doi:10.1093/nar/gkq361
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al.: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104. doi:10.1038/nature02800
- Chen CY, Tsai HK, Hsu CM, Chen MJM, Hung HG, Huang GTW, Li WH: Discovering gapped binding sites of yeast transcription factors. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(7):2527–2532. doi:10.1073/pnas.0712188105
- Ji HK, Jiang H, Ma WX, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature Biotechnology 2008, 26(11):1293–1300. doi:10.1038/nbt.1505
- Luscombe NM, Laskowski RA, Thornton JM: Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Research 2001, 29(13):2860–2874. doi:10.1093/nar/29.13.2860
- Kono H, Sarai A: Structure-based prediction of DNA target sites by regulatory proteins. Proteins-Structure Function and Genetics 1999, 35(1):114–131. doi:10.1002/(SICI)1097-0134(19990401)35:1<114::AID-PROT11>3.0.CO;2-T
- Morozov AV, Havranek JJ, Baker D, Siggia ED: Protein-DNA binding specificity predictions with structural models. Nucleic Acids Research 2005, 33(18):5781–5798. doi:10.1093/nar/gki875
- Morozov AV, Siggia ED: Connecting protein structure with predictions of regulatory sites. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(17):7068–7073. doi:10.1073/pnas.0701356104
- Ahmad S, Gromiha MM, Sarai A: Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 2004, 20(4):477–486. doi:10.1093/bioinformatics/btg432
- Zhou HX, Tjong H: DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Research 2007, 35(5):1465–1477. doi:10.1093/nar/gkm008
- Kuznetsov IB, Gou ZK, Li R, Hwang SW: Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins-Structure Function and Bioinformatics 2006, 64(1):19–27. doi:10.1002/prot.20977
- Gao M, Skolnick J: DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Research 2008, 36(12):3978–3992. doi:10.1093/nar/gkn332
- Contreras-Moreira B, Branger PA, Collado-Vides J: TFmodeller: comparative modelling of protein-DNA complexes. Bioinformatics 2007, 23(13):1694–1696. doi:10.1093/bioinformatics/btm148
- van Dijk M, van Dijk ADJ, Hsu V, Boelens R, Bonvin AMJJ: Information-driven protein-DNA docking using HADDOCK: it is a matter of flexibility. Nucleic Acids Research 2006, 34(11):3317–3325. doi:10.1093/nar/gkl412
- Liu ZJ, Guo JT, Li T, Xu Y: Structure-based prediction of transcription factor binding sites using a protein-DNA docking approach. Proteins-Structure Function and Bioinformatics 2008, 72(4):1114–1124. doi:10.1002/prot.22002
- Roberts VA, Case DA, Tsui V: Predicting interactions of winged-helix transcription factors with DNA. Proteins-Structure Function and Bioinformatics 2004, 57(1):172–187. doi:10.1002/prot.20193
- Yeh CS, Chen FM, Wang JY, Cheng TL, Hwang MJ, Tzou WS: Directional shape complementarity at the protein-DNA interface. Journal of Molecular Recognition 2003, 16(4):213–222. doi:10.1002/jmr.624
- Dutta S, Burkhardt K, Young J, Swaminathan GJ, Matsuura T, Henrick K, Nakamura H, Berman HM: Data Deposition and Annotation at the Worldwide Protein Data Bank. Molecular Biotechnology 2009, 42(1):1–13. doi:10.1007/s12033-008-9127-7
- Li WZ, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658–1659. doi:10.1093/bioinformatics/btl158
- Hsu CM, Chen CY, Liu BJ: MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Research 2006, 34:W356-W361. doi:10.1093/nar/gkl309
- Kabsch W, Sander C: Dictionary of Protein Secondary Structure - Pattern-Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22(12):2577–2637. doi:10.1002/bip.360221211
- Hsu CM, Chen CY, Liu BJ, Huang CC, Laio MH, Lin CC, Wu TL: Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinformatics 2007, 8(Suppl 5):S8. doi:10.1186/1471-2105-8-S5-S8
- Chang DTH, Chen CY, Chung WC, Oyang YJ, Juan HF, Huang HC: ProteMiner-SSM: a web server for efficient analysis of similar protein tertiary substructures. Nucleic Acids Research 2004, 32:W76-W82. doi:10.1093/nar/gkh425
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.