From sequence to structure and back again: approaches for predicting protein-DNA binding

Gene regulation in higher organisms is achieved by a complex network of transcription factors (TFs). Modulating gene expression and exploring gene function are major aims in molecular biology. Furthermore, the identification of putative target genes for a given TF serves as a powerful tool for the specific targeting of rational drugs. Detecting the short and variable transcription factor binding sites (TFBSs) in genomic DNA is an intriguing challenge for computational and structural biologists. Fast and reliable computational methods for predicting TFBSs on a whole-genome scale offer several advantages compared to the current experimental methods, which are rather laborious and slow. Two main approaches are being explored: advanced sequence-based algorithms and structure-based methods. The aim of this review is to outline the computational and experimental methods currently being applied in the field of protein-DNA interactions. With a focus on the former, the current state of the art in modeling these interactions is discussed. Surveying sequence- and structure-based methods for predicting TFBSs, we conclude that in order to achieve a sound and specific method applicable to genomic sequences it is desirable and important to bring these two approaches together.


Introduction
A complex network of gene regulatory signals allows each cell in both single- and multicellular organisms to flexibly respond to environmental factors. In 1967, Ptashne realized that gene expression is regulated by protein switches that bind to target sequences in the DNA [1]. Understanding the mechanisms underlying sequence-specific binding of proteins to DNA and the resulting gene expression holds great promise for targeting numerous diseases through rational drug development [2].
The sequencing of whole genomes, alongside experimental studies of the control of gene expression, has revealed some fundamental mechanisms. Each gene is regulated by at least one, but often multiple, transcription factors (TFs). The TFs bind to specific transcription factor binding sites (TFBSs) within the regulatory regions (promoters) of the genes. The functional arrangement, i.e. the presence, combination, and order of the TFBSs in a regulatory region, forms promoter modules [3] that control the spatial and temporal expression of genes [4].
The analysis of individual TFBSs can provide important clues in deducing regulatory networks in a cell and the functional context of specific genes. Over time, several experimental methods have been developed for studying TFBSs. In vitro analysis is complicated by two facts typical of TFs: TFs usually bind to multiple target sequences with varying affinity and they often regulate multiple genes. In silico analysis is not straightforward either, but presents a necessary extension to current in vitro methods. The main obstacles are that TFBSs are often located in non-coding DNA, degenerate in their sequence, and relatively short (5-12 nucleotides). Searching for such low-information content sites within huge amounts of genomic DNA using computational methods typically yields a large number of randomly occurring false positive sites. Reducing the number of these false positives has been the goal of many efforts. Currently, the most successful sequence-based algorithms are context-sensitive and account for the presence of other TFBSs [5], positioning relative to the transcription start site (TSS) [6], and evolutionary conservation of functional regulatory elements [7]. Seen from a structural point of view, the recognition of a nucleotide sequence by a DNA-binding protein is determined by the interactions between the DNA base pair (bp) edges and the amino acid side chains. Structure-based methods use either statistical information obtained from structural data, or models representing the steric and chemical complementarity, for evaluating the affinity of a protein-DNA complex [8].
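The scale of the false positive problem can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from any of the cited methods; it assumes a random sequence with uniform base frequencies (0.25 per base) and one exact, non-degenerate site, and the function name and the genome length used in the demonstration are illustrative assumptions. A degenerate site matches more than one sequence, so the values should be read as lower bounds.

```python
def expected_random_hits(genome_length, site_length, both_strands=True):
    """Expected number of chance matches of one exact site of the given length.

    Assumes independent positions and uniform base frequencies (0.25 each),
    which is a simplification of real genomic sequence. With both_strands=True
    a palindromic site would be counted twice.
    """
    positions = genome_length - site_length + 1
    if positions <= 0:
        return 0.0
    per_position = 0.25 ** site_length  # probability of a match at one position
    strands = 2 if both_strands else 1
    return positions * per_position * strands


if __name__ == "__main__":
    # Assumed order of magnitude for a mammalian genome; not a value from the text.
    GENOME_LENGTH = 3_000_000_000
    for n in (5, 8, 12):  # site lengths within the 5-12 nucleotide range quoted above
        print(n, round(expected_random_hits(GENOME_LENGTH, n)))
```

Even at the upper end of the quoted length range, a single exact site is expected to occur by chance many times in a genome of this size, which is why the contextual filters discussed above are needed.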
Research during the past decades has focused on understanding the mechanisms underlying protein-DNA interactions and aiming towards expressing these using general sets of rules. First attempts to define such a recognition code arose in 1976 through the work of Seeman and Rosenberg [9], who identified a specific pattern of hydrogen bond (H-bond) acceptors and donors on the DNA bp edges. More detailed studies of protein-DNA structural complexes soon concluded that the interactions could not be explained by a simple one-to-one correspondence [10,11]. However, specific amino acid-base preferences do exist [12,13], which comes as no surprise given their chemical and structural characteristics.
Current sequence-based algorithms and structure-based models will benefit from mutual integration when the primary aim is to develop fast and reliable prediction methods for TFBSs and an understanding of how DNA recognition is facilitated. Experimental techniques for studying protein-DNA interactions and the physical characteristics of such interactions will be explained in the first two sections. In the final section, accurate computational modeling of the binding sites of regulatory proteins will be discussed in the light of experimental and theoretical implications.

Experimental methods
In order to analyze differences and commonalities in how binding takes place, examples of binding sites are required. Experimental methods used in the determination of binding sites for transcription factors are important for creating a sound description of each TFBS.
There are several methods available for producing interaction data. Nitrocellulose-binding assay [14], electrophoretic mobility shift assay (EMSA) [15], enzyme-linked immunosorbent assay (ELISA) [16], DNase I footprinting [17], DNA-protein crosslinking (DPC) [18], and reporter constructs [19] are examples of in vitro techniques that are used for determining DNA binding sites and analyzing the difference in binding specificity for different protein-DNA complexes. They are all currently in use, but suffer from major drawbacks: they are not suited for high-throughput experiments and information on optimal vs. suboptimal protein binding sites is lost.
Chromatin immunoprecipitation (ChIP) is a recent microarray-based assay developed for genome-wide determination of protein binding sites on DNA [20]. Systematic evolution of ligands by exponential enrichment (SELEX) [21] and Phage Display (PD) [22] represent another type of experiment and offer a high-throughput way to select high-affinity binders, DNA and protein targets respectively. Both SELEX and PD suffer from the same drawback: the multitude of sequences obtained from these experiments are all good binders, but it is hard to say anything about their relative affinities. The assumption that the best binders occur more frequently, for purely statistical reasons, is commonly adopted. The differences between individual mutants have to be measured one at a time by other and more laborious methods (discussed above).
In 1999, Bulyk et al. presented dsDNA microarrays for exploring sequence specific protein-DNA binding [23]. The major advantage over the methods discussed above is that it is a high-throughput method resulting in data with associated relative binding affinities, which is of high importance in protein-DNA interaction studies.
Finally, X-ray crystallographic and NMR spectroscopic data provide a basis for studying the structural details of protein-DNA interactions. Protein-DNA complexes have successfully been co-crystallized [24], and the data have been deposited in the Protein Data Bank (PDB) and Nucleic Acid Database (NDB). Each complex is a 3D representation of all intermolecular interactions participating in protein-DNA recognition; however, the experiments are very time-consuming.

Characteristics of protein-DNA interactions
Double-stranded DNA forms the famous double helix [25], where pairs of complementary bases on opposing strands are stabilized by intermolecular H-bonds. The chemical composition of the DNA sugar-phosphate backbone is independent of the bp sequence and thus not involved in the specificity of sequence recognition. Only the edges of the bp are exposed in the grooves of the helical DNA, where they form a pattern of H-bond acceptors and donors [9] that can be recognized by the amino acid side chains (see Figure 1 for an illustration). Specific recognition of DNA has to rely on the interactions with these exposed patches. TFs typically contain a DNA-binding domain and one or multiple interaction domains that bind to other TFs. It is common to group the TFs into families according to the structure of these DNA-binding domains [26], where each family employs a different mechanism for recognizing the DNA sequence of the target site [12].
The energetics and mode of protein-DNA interactions differ from those of protein-protein interactions. The main differences are that the protein-DNA interfaces are much more polar, have many more intermolecular H-bonds, and a higher abundance of buried water molecules [27,28]. The most important biochemical interactions in protein-DNA complexes are van der Waals contacts, H-bonds, and water-mediated contacts [29]. About two-thirds of all contacts are non-specific and made with the sugar-phosphate backbone of the DNA, leaving one-third of all interactions for the specificity [30]. Non-specific interactions (protein-DNA backbone) are extremely important for the overall stability of the complex, and are mainly mediated through van der Waals contacts. About two-thirds of the specific interactions (protein-DNA base edges) involve complex H-bond patterns [29]. The distribution of H-bonds clearly demonstrates particular amino acid-base preferences, but no generalizable code can be deduced [13]. It is important to note that each amino acid can interact with more than one bp simultaneously, and several different amino acids can interact with the same bp. Interdependence between both bases and amino acids is an important feature of the interaction scheme. Very specific contact patterns can be achieved in this way and enable subtle but crucial differences in binding affinities [31].
Water molecules act as contact-mediators and space-fillers at the protein-DNA interface and play a key role in complex formation. As suggested in [32], an atomic description of water molecules at the interface is required for a complete formulation of protein-DNA interactions. Important water bridges can be identified in crystal structures or using molecular modeling [33].
The helical DNA structure is often distorted when bound to a protein [34,35]. Enforced bending of the DNA strand occurs through kinks at the base steps, leading to unstacking and unwinding of the helix. Several types of structural changes have been detected, including shift, slide, twist, rise, roll, and tilt [36]. The stiffness of the DNA helix is determined by the background bp composition [37], i.e. C-G bp are more rigid since they have one additional H-bond compared to A-T bp. The side chains of the protein are flexible and can re-arrange upon complex formation in order to achieve complementarity.

Computational methods
Computational approaches present an attractive solution for modeling and discovering TFBSs on a genomic scale. Several different computational approaches for predicting TFBSs have been explored, which has led to considerable progress during recent years. The main approaches are sequence- and structure-based, where the difference is that sequence-based methods consider only the primary structure of DNA, whereas structure-based methods aim at describing the physical and chemical complementarity between a TF and its binding site. We will now briefly discuss some selected sequence- and structure-based computational methods for predicting TFBSs.

Figure 1. Characteristics of C-G and T-A base pairs. Intermolecular H-bonds (dotted lines) in the C-G and T-A bp stabilize the DNA double helix. The bp edges form a pattern of H-bond acceptors and donors that can be recognized by amino acid side chains of proteins. This pattern is unique for each bp (C-G, G-C, T-A, and A-T) in the major groove (up), whereas it is only possible to distinguish a C-G bp (top) from a T-A bp (bottom) in the minor groove (down) [9]. H-bond acceptors and donors are indicated by outward and inward pointing arrows respectively. The letter M is the methyl group of the base T and H R is a ring hydrogen donor. The chemical composition of the DNA sugar-phosphate backbone (not shown) is constant and independent of the bp sequence.
Experimentally verified binding sites can be used for constructing a consensus sequence motif of the binding site of a TF. A consensus sequence can be obtained from a multiple alignment of known binding sites [38], and can be used for scanning genomic sequences in the search for TFBSs [39,40]. However, methods using scoring matrices for describing the binding sites [41,42] offer great advantages over consensus sequence methods. Position specific scoring matrices (PSSMs) are based on experimentally verified binding sites and represent the relative distribution and conservation of all nucleotides in the binding site. PSSMs exist for almost all types of TFBSs [43] and are widely used for predicting binding sites [41]. For an excellent review on PSSMs, see [44]. Table 1 is an illustration of a consensus sequence and a PSSM for an example TFBS. Sequence logos can be used for graphically describing the PSSMs [45]. The main advantage of PSSMs is that a quantitative measure (a score for each candidate site) can be obtained rather than the yes/no type of answer obtained from consensus models. Accounting for interdependence [46] between bases in the TFBS is not trivial; thus, treating the binding energy contribution of each position in the binding site as independent ("independent binding hypothesis") is a frequently adopted approximation [47]. However, some improvement in performance has been achieved using higher order PSSM models [48,49].
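To make the PSSM idea concrete, the following is a minimal sketch of how a log-odds matrix could be built from aligned sites and used to scan a sequence under the independent binding hypothesis. It is an illustration only and not the implementation of any cited tool: the function names, the pseudocount of 1, the uniform background, and the four example sites are all assumptions made for this example.

```python
import math

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=None):
    """Build a log-odds PSSM (one dict per position) from aligned, equal-length sites."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}  # pseudocounts avoid log(0)
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pssm


def score(pssm, window):
    """Sum of per-position log-odds; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pssm, window))


def scan(pssm, sequence, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous characters
        s = score(pssm, window)
        if s >= threshold:
            hits.append((start, window, s))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned binding sites, invented for illustration only.
    example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
    matrix = build_pssm(example_sites)
    for hit in scan(matrix, "GGGTGACTCAGGG", threshold=5.0):
        print(hit)
```

Unlike a consensus match, every window receives a score, so the threshold can be tuned to trade sensitivity against the number of false positives.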
False positive hits are detected with high frequencies [50] when using consensus sequences or PSSMs for scanning genomes for putative binding sites. Bringing genetic context into the models has improved the specificity of the prediction methods. Limiting the search to predicted promoter regions [6,51], combining a set of functionally related TFs [4], and searching for their co-abundance have increased the specificity significantly [40]. The inclusion of spacing rules between the TFs [52], limitations on the number of each contributing TF [53], and combinatorial aspects of TF positioning [54,55] have further reduced the number of false hits.
Several TFs bind their target sequences as homo- or heterodimers, leading to co-occurring binding sites. The number of nucleotides in the gaps between the two half-sites may vary, even for the same TF binding to two different sites [56]. Accounting for varying half-site spacing in computational search algorithms is not trivial, but is nevertheless essential. Synergy, or cooperative binding, is another reason for co-occurring motifs. By definition, classical cooperative binding is when protein-protein interactions lead to a more efficient control of the promoter. Biological experiments have shown that synergistic activation can also occur when two regulatory proteins have no physical contact [57]. Computer simulations indicate that this might be an effect of the first-bound protein changing the tension in the DNA strand [58]. Several computational methods for predicting TFBSs have been developed that take such putative synergy effects into account. BioProspector [59] and Co-Bind [5] are examples of methods that can be used for discovering co-occurring motifs.
Computational de novo discovery of overrepresented motifs has been used for finding putative and functionally related TFBSs. Detecting short and degenerate binding sites in genomic sequences is a very hard task. Limiting the search to promoters and conserved non-coding regions where TFBSs are enriched [60] has improved the performance. Gibbs Sampling [61], ANN-Spec [62], and LOGOS [63] are examples of algorithms that have proven helpful in detecting TFBSs [64,65]. Further improvement has been made by assuming that co-expressed genes are co-regulated [66], at least to some extent. Inferring co-expression in order to detect overrepresented motifs in regulatory sequences has frequently been adopted [67,68]. Phylogenetic footprinting is a computational method commonly applied as a filter for pointing towards conserved, possibly functional regions of non-coding regulatory sequences [7,69]. Several successful examples have been reported [70,71], and the computational methods have been reviewed in [72,73].
Alongside an increasing number of genomic sequences, the amount of structural information on protein-DNA complexes has been increasing rapidly. Careful structural analyses of protein-DNA complexes obtained from PDB and NDB have identified the characteristics of such interactions [13,27,29]. Examination of the relationship between amino acid sequence conservation and role in DNA sequence recognition in protein-DNA complexes has revealed a strong correlation across all protein structural families [74,75].
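The variable half-site spacing described above can be handled, in the simplest case, by enumerating all allowed gap lengths explicitly. The sketch below is a deliberately naive illustration, not the algorithm used by BioProspector or Co-Bind: it uses exact string matching for both half-sites, and the function name, the half-site sequence, and the gap range are assumptions chosen for the example. In practice each half-site would more plausibly be scored with a PSSM.

```python
def find_spaced_half_sites(sequence, left, right, min_gap, max_gap):
    """Return (start, gap) for each left half-site followed by a right half-site.

    'gap' is the number of nucleotides between the two half-sites and may take
    any value from min_gap to max_gap inclusive. Matching is exact.
    """
    hits = []
    start = sequence.find(left)
    while start != -1:
        left_end = start + len(left)
        for gap in range(min_gap, max_gap + 1):
            # str.startswith with an offset checks for the right half-site in place
            if sequence.startswith(right, left_end + gap):
                hits.append((start, gap))
        start = sequence.find(left, start + 1)  # allow overlapping left half-sites
    return hits


if __name__ == "__main__":
    # Hypothetical promoter fragment and half-site, invented for illustration.
    fragment = "TTAGGTCACGTAGGTCATT"
    print(find_spaced_half_sites(fragment, "AGGTCA", "AGGTCA", 0, 5))
```

Reporting the gap alongside the position keeps the spacing available for later filtering, for instance when only certain spacings are known to be functional for a given TF.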
Structure-based models offer promising extensions to the sequence-based models. These provide a way to qualitatively analyze DNA deformation, cooperativity, and other structural properties of protein-DNA interactions. There are mainly two categories of structure-based approaches.
The first one is based on statistical potentials and the second one on potentials obtained from molecular mechanics simulations. Statistical potentials are derived from systematic analysis of structural protein-DNA complexes. Pairwise potentials are extracted from distributions of Cα atoms around DNA bases of known protein-DNA complexes, which reflect the statistical occurrence of specific interactions. They have proven to be sufficiently sensitive to evaluate the affinities of sequences obtained in a combinatorial fashion by threading them onto the fold of the original complex [8,76]. Computer simulations have been used to derive free-energy interaction maps between pairs of bases and amino acids [77,78], which can be used for prediction of TFBSs in a similar fashion as described above. In order to fully address structural flexibility of both protein and DNA, and interaction redundancy, intensive computation is needed. Observing processes during appropriate simulation periods and accounting for whole-system interactions are the two main limiting factors. Despite the required computing power, free energies have been analyzed in larger biological systems, see [79] for a review. Encoding the structural properties of specific DNA sequences and using these in combination with sequence-based methods can improve the specificity of the predictions [28,80].
The direct interactions between amino acids and DNA bases are mainly specific hydrogen bonds, which are fairly well understood. The non-specific interactions, constituting the majority of all interactions involved, are less well understood; nevertheless, there are indications that these will provide important clues to the complete picture of protein-DNA recognition. Structure-based approaches for modeling protein-DNA interactions are expensive in terms of computing power; however, they provide valuable insights into the physical interactions at an atomic level.
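The principle behind statistical potentials and threading can be sketched in a strongly simplified form. The example below is an assumption-laden toy, not the method of [8,76]: it replaces the distance-dependent distributions of Cα atoms with simple contact counts, uses an inverse-Boltzmann-style log ratio of observed to expected pair frequencies, and scores a candidate DNA sequence against a fixed list of contacts taken from a template complex. All names and all counts are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def contact_potential(contacts, pseudocount=1.0):
    """Derive a pairwise potential from observed (amino_acid, base) contacts.

    potential = -ln(p(a, b) / (p(a) * p(b))), with pseudocounts so that
    unobserved pairs get a finite, unfavourable value. Units are arbitrary.
    """
    contacts = list(contacts)
    pair_counts = Counter(contacts)
    amino_acids = sorted({a for a, _ in contacts})
    denom = len(contacts) + pseudocount * len(amino_acids) * len(BASES)
    joint = {(a, b): (pair_counts[(a, b)] + pseudocount) / denom
             for a in amino_acids for b in BASES}
    p_aa = {a: sum(joint[(a, b)] for b in BASES) for a in amino_acids}
    p_base = {b: sum(joint[(a, b)] for a in amino_acids) for b in BASES}
    return {(a, b): -math.log(joint[(a, b)] / (p_aa[a] * p_base[b]))
            for a in amino_acids for b in BASES}


def threading_energy(potential, interface, dna):
    """Score a candidate DNA sequence threaded onto a fixed template interface.

    interface: list of (amino_acid, dna_position) contacts from the template.
    Lower (more negative) totals indicate statistically favoured sequences.
    """
    return sum(potential[(aa, dna[pos])] for aa, pos in interface)


if __name__ == "__main__":
    # Invented contact counts, for illustration only (R = Arg, N = Asn).
    observed = [("R", "G")] * 8 + [("R", "A")] * 2 + [("N", "A")] * 8 + [("N", "G")] * 2
    pot = contact_potential(observed)
    template = [("R", 0), ("N", 1)]  # hypothetical template contacts
    for candidate in ("GA", "AG"):
        print(candidate, round(threading_energy(pot, template, candidate), 3))
```

Because the template interface is held fixed, this kind of scoring ignores DNA deformation and side-chain rearrangement, which is one reason the simulation-based approaches described above are computationally more demanding.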

Conclusion
Protein-DNA interactions have been under intense research during recent years, which has resulted in numerous valuable findings as well as computational methods for the prediction of TFBSs. While sequence-based methods are amenable to analyses on a whole-genome scale, the computational costs for structure-based methods are currently still prohibitively high. The required computation time ranges up to several days for one single protein-DNA complex, due to the complexity of the interactions. At the same time, structure-based methods provide deep insights into the mechanisms and features of the protein-DNA interaction. These insights allow us to validate, or falsify, some of the assumptions and approximations underlying some of the sequence-based methods.
Sequence-based algorithms also provide a fast and flexible system for analyzing and reducing the search space in genomic sequences, whereas computationally intensive structure-based approaches can then be used in a final step with the specificity needed for a final evaluation of the predicted binding sites.
We hence observe both a need and a recent tendency to use structure-based methods for validation of sequence-based methods. We conclude that advanced sequence-based methods and detailed structure-based methods will make a strong combination in the search for putative binding sites for regulatory proteins in genomic sequences.