High performance transcription factor-DNA docking with GPU computing

Background Protein-DNA docking is a very challenging problem in structural bioinformatics and has important implications in a number of applications, such as structure-based prediction of transcription factor binding sites and rational drug design. Protein-DNA docking is very computational demanding due to the high cost of energy calculation and the statistical nature of conformational sampling algorithms. More importantly, experiments show that the docking quality depends on the coverage of the conformational sampling space. It is therefore desirable to accelerate the computation of the docking algorithm, not only to reduce computing time, but also to improve docking quality. Methods In an attempt to accelerate the sampling process and to improve the docking performance, we developed a graphics processing unit (GPU)-based protein-DNA docking algorithm. The algorithm employs a potential-based energy function to describe the binding affinity of a protein-DNA pair, and integrates Monte-Carlo simulation and a simulated annealing method to search through the conformational space. Algorithmic techniques were developed to improve the computation efficiency and scalability on GPU-based high performance computing systems. Results The effectiveness of our approach is tested on a non-redundant set of 75 TF-DNA complexes and a newly developed TF-DNA docking benchmark. We demonstrated that the GPU-based docking algorithm can significantly accelerate the simulation process and thereby improving the chance of finding near-native TF-DNA complex structures. This study also suggests that further improvement in protein-DNA docking research would require efforts from two integral aspects: improvement in computation efficiency and energy function design. Conclusions We present a high performance computing approach for improving the prediction accuracy of protein-DNA docking. The GPU-based docking algorithm accelerates the search of the conformational space and thus increases the chance of finding more near-native structures. To the best of our knowledge, this is the first ad hoc effort of applying GPU or GPU clusters to the protein-DNA docking problem.


Background
Protein-DNA interactions play crucial roles in many key biological processes. One of these processes is transcriptional regulation, in which transcription factors (TFs) bind to specific DNA binding sequences to either activate or repress the expression of their regulated genes. Transcription factors form a distinct group of DNA binding proteins in terms of function and binding specificity [1]. Owning to their roles in cancer development, transcription factors are potential drug targets for cancer therapy [2][3][4]. Therefore, knowledge of transcription factor-DNA interaction at a structural level not only can help us better understand the protein-DNA recognition and binding specificity, more importantly it can also offer guidance in targeted drug design. Moreover, structure-based transcription factor binding site prediction at genome scale has received much deserved attention recently because structure-based approaches have the advantage to consider both the position interdependence of TFs and the contribution of DNA flanking sequences in assessing TF-DNA binding specificity [5][6][7][8][9]. Despite rapid technological advances in experimental structure determination, only a small percentage of the structures in Protein Data Bank (PDB) are TF-DNA complexes [10]. Computational modeling, on the other hand, provides a cost-efficient alternative to the usually time-consuming experimental methods. Previous studies have demonstrated that molecular docking can obtain accurate complex structures for protein-protein, protein-peptide, and protein-ligand interactions. However, protein-DNA docking still lags behind due to our limited knowledge of protein-DNA interactions and it remains one of the challenging problems in the field of structural bioinformatics.
There are two key issues in general protein-DNA docking. One is a potential function for accurate evaluation of protein-DNA binding affinity. The other is conformational sampling of the complex structures. It should be noted that these two issues are related since the sampling methods generally need the energy function to guide the search. Improvement has been made in the development of knowledge-based potentials for assessing the binding affinity between protein and DNA. These knowledge-based potentials are developed based on the mean-force theory and are more attractive due to their relative simplicity and ease of use. These potentials generally vary in their resolution levels, from residue-based to atom-based potentials and in their distance scales, from distance-independent to distancedependent [9,11,12]. Studies have shown that specific interaction environments around the contacting amino acids and nucleotides contribute significantly to the binding affinity [13,14]. To take structural context into consideration, Liu et. al. have previously developed a knowledge-based, multi-body interaction potential for protein-DNA interaction [11]. This statistical potential can describe the effects of DNA structure deformation and local interaction environments. By using three structurally adjacent nucleotides (termed DNA trinucleotides or triplets) as the basic interaction units, we developed a rigid-body docking algorithm using Monte-Carlo simulations and further extended the rigid-body docking algorithm to a semi-flexible docking approach in case the DNA structure is unknown [8]. In semi-flexible docking, we use a number of representative DNA structures that are generated by clustering DNA structures from solved protein-DNA complexes. Each DNA structure model is docked with the target protein structure using the rigid-body docking algorithm. The best conformation from all the docked protein-DNA combinations is selected as the final complex model.
We have demonstrated the effectiveness of our protein-DNA docking algorithm and its application in structurebased transcription factor binding site prediction [8]. However, the docking algorithm is very computation expensive. For example, our semi-flexible experiments with 45 TF-DNA complexes (200 Monte-Carlo simulation runs for each complex) needed 130, 000 CPU hours when executed on 2.8GHz Intel Xeon processors (3 weeks on a 240-node CPU cluster). More importantly, the performance of the docking algorithm depends substantially on the coverage of the search space. It has been mathematically proved that for any random search algorithm to find a global optimal solution, the probability of repeatedly missing any measurable subsets of the search space must be zero [15]. This effectively requires our Monte-Carlo simulation procedure to search over the entire solution space, which is infeasible and can only be approximated by increasing the number of random samples. The computation cost therefore hinders the implementation and execution of our algorithm, especially for parameter optimization and large-scale testing runs. In this paper, our objective is to improve conformational search for TF-DNA docking algorithm through Graphics Processing Unit (GPU) computing. GPU has recently evolved from a fixed-function graphical device into a highly programmable parallel processor, and has been successfully deployed to accelerate a broad range of scientific applications [16,17]. GPUs support massive thread-level parallelism and can significantly accelerate parallel workloads. But the architectural features of GPUs also pose unique challenges to GPU application design, especially on the patterns of their memory accesses and execution paths. In this paper, we present the GPU implementation of our protein-DNA docking algorithm, which carefully manages memory accesses and execution paths to explore GPU acceleration. We further scale our algorithm from a single GPU card to a cluster of GPUs and achieve significant performance improvement. Although the application of GPU to other docking problems have been previously investigated [18], to the best of our knowledge, this is the first ad hoc effort of applying GPU or GPU clusters to the protein-DNA docking problem.
The effectiveness of our new method is validated through extensive experiments. Our GPU algorithm exhibits a 28x speedup on an individual Nvidia M2070 GPU over a single 2.8GHz Xeon core, and achieves a sustained performance of 10.4 TFLOPS using a cluster of 128 GPUs, which equals the capability of a conventional cluster with 3600 CPU cores. With the benefit of such improved computation capability, we are able to increase the number of random samples for the Monte-Carlo simulation procedure and thus expand the traversal in the search space. Experimental results show that GPU acceleration leads to higher rate of successful prediction of TF-DNA complexes. We also tested our GPU method on the rigid TF-DNA docking benchmark with carefully selected 38 cases [19]. We found that more random sampling can improve the performance of the easy cases but has almost no effect on the hard cases, suggesting that better energy functions and search algorithms are needed for these hard cases.

Datasets
A non-redundant set of 75 TF-DNA complex structures (less than 35% protein sequence identity) was generated from PDB [10]. Each of these TF-DNA complex structures was solved by X-ray crystallography with a resolution of 3.0 Å or better. Since transcription factors are not well annotated in PDB, we developed an in-house program for automatic identification of TF-DNA complexes by combing information from Gene Ontology (GO) terms [20], UniProt keywords [21], and PDB keywords. Another dataset used for testing our GPU algorithm is the rigid TF-DNA docking benchmark [19]. This benchmark contains 38 non-redundant cases that are classified into two groups in terms of expected docking difficulty. Each case in the benchmark is a TF-DNA binding unit, which is defined as an entity of a DNA double-helix and one or more TF-chains that interact with each other with at least three residue-residue contacts based on a heavy-atom distance cutoff of 4.5 Å. The degree of difficulty is assigned based on the strength of TF-DNA interactions in terms of the number of residue-base contacts (NRBCs) [19]. The easy group has 21 cases and the hard group has 17 cases. Figure 1 shows the overall framework of our rigid-body docking algorithm as described previously [8]. Briefly, for a given pair of transcription factor and DNA, our algorithm searches for a docked TF-DNA structure that has the lowest interaction energy using a Monte-Carlo simulated annealing approach. Each iteration of the Monte-Carlo simulation consists of two steps: docking energy calculation and conformational sampling. The energy function includes the protein-DNA binding affinity (E binding ), the atomic van der Waals (VDW) packing energy (E packing ), and the constraint energy (E constraint ) as seen in Equation 1 (W packing and W constraint are weight factors).

TF-DNA docking algorithm
The protein-DNA binding affinity, which dominates the docking energy, is evaluated using a knowledgebased distance-dependent amino acid-nucleotide interaction potential [11]. The binding affinity between a DNA triplet and a protein residue is determined by three factors: 1) the composition of the DNA triplet; 2) the type of the protein residue; and 3) the distance between the triplet and the residue. The coordinate of each DNA triplet is calculated as the geometric center of the three corresponding nucleotides while the coordinate of Cβ atom of each residue represents the residue position (a pseudo Cβ is used for Glycine based on its geometric shape).
The calculation of VDW packing energy uses a harmonic form with soft repulsion and attraction terms as described previously [8]. Since the knowledge-based binding energy is derived from the mean force theory and, in principle, it covers all the energy effects including the VDW packing contribution. The primary role of adding this packing energy to the docking is to guide the docking process while not affecting the final docked structures (i.e. W packing approaches zero as the random walk progresses). The constraint energy is also used to guide the docking process by bringing the DNA molecule in contact with the protein surface. This energy becomes zero for correctly docked structures.
The simulated annealing sampling of the conformational space includes translation at a step size of 0.01Å and rotation at a step size of 2 degrees. The initial temperature is set to a 0.833 acceptance rate and the cooling rate is 0.998. The simulation continues until the system converges with an acceptance rate lower than 1% or when the total number of steps reaches a pre-set maximum (1.5 million in our current work). To improve coverage on the conformational search space, we conduct multiple Monte-Carlo simulations with different random seeds.
To evaluate the docking performance, we compared the docked DNA conformations with the corresponding DNA structures in the native TF-DNA complexes by fixing the protein positions. The root mean square deviation (RMSD) is computed between the predicted and the native complex using DNA backbone heavy atoms. In some TF-DNA complexes, the proteins are homodimers that bind to the same or similar DNA sequences. In these cases, if the two protein chains exchange positions in the TF-DNA complex, the new structure should also be considered as "correct". For example, 1AN2 is a dimeric transcription factor Max that binds to its recognition sequence CACGTG by direct contact (Figure 2A) [22]. If chain A and chain C switch their positions, the newly generated structure is technically the same as the native TF-DNA complex ( Figure 2B). To address this issue in docking automatically, we generated a "flipped" DNA structure by flipping the two protein chains. Then for docking evaluation, we calculate the RMSDs by comparing the docked structure with both the native and the "flipped" DNA structures and record the smaller RMSD for the docked TF-DNA complex.

GPU computing
All the tests and experiments of our GPU algorithm were conducted on the 'Keeneland' GPU-based HPC system cohosted by Georgia Institute of Technology, Oak Ridge National Lab, and the University of Tennessee [23]. The system currently contains 120 nodes, each equipped with 3 Nvidia M2070 GPU cards. Nodes in the system are connected via QDR infiniband switches. The software tools used were GCC 4.1, CUDA 3.2, and MPI 1.4.3-intel.
Our protein-DNA docking approach exhibits three levels of parallelism: (1) multiple protein-DNA pairs can be evaluated in parallel; (2) for each protein-DNA pair, multiple Monte-Carlo simulation runs are independent of each other; and (3) for each simulation run, the DNA movement and energy calculations feature fine-grained parallelism: the calculations are applied to each atom and amino acid/DNA triplet and can be performed in parallel within each simulation step. We design our GPU program to explore all three levels of parallelism. For notational convenience, we call each Monte-Carlo simulation run a task.

A. Task level design and scheduling
Our protein-DNA docking program is designed for high-performance computing (HPC) systems equipped with Nvidia GPU cards. Such platforms feature a 4-level hardware hierarchy: multiple computing nodes; each equipped with multiple Nvidia GPUs; each GPU contains several multiprocessors; and each of which consists of multiple processor cores. Such a hardware hierarchy maps to a 4-level software architecture: multiple MPI processes, each involving a GPU CUDA kernel, which contains multiple CUDA thread blocks. Each block consists of multiple CUDA threads. In this software architecture, the parallel tasks are mapped to different processes, simulation runs within each task are mapped to different CUDA thread blocks, and parallel operations inside each simulation step are mapped to different CUDA threads. In the design of our protein-DNA docking algorithm, the master process is responsible of dispatching tasks. The slave processes iteratively execute six operations: obtaining a task from the master, reading input data, copying data from CPU memory to GPU memory, executing the CUDA kernels, copying outputs from GPU memory to CPU memory, and writing outputs. We also include an additional pre-docking stage where the necessary information of the target protein-DNA pairs are retrieved from text files, assembled into appropriate data structures, and then stored as binary data files. With such pre-docking data ready, the master process can dispatch docking task by sending out a simple task ID, the slave process will load data file directly according to this ID. The MPI processes can thus be light-weighted, and the startup overhead can be significantly reduced.

B. CUDA kernel design
The overall structure of the CUDA kernel for our Monte-Carlo simulated annealing based docking algorithm is illustrated in Figure 3. The memory spaces on GPU are pre-allocated to accommodate data structures for the TF-DNA complexes. Each time before the docking kernel is launched, the host MPI process loads a set of TF-DNA data into the GPU memory and runs a random number generation kernel to prepare the random numbers. Within the kernel, each block conducts an independent simulation run. The number of concurrent simulation runs is set equal to the number of multiprocessors available on the target GPU cards. For example, the number is 14 for Nvidia M2070 GPU cards. A reduction operation is used to compute the total interaction energy, and several small pieces of sequential code are used for setting parameters and making randomized simulated annealing decisions.
The computation of the VDW packing energy dominates the execution time. This interaction only affects atoms within close proximity and approaches zero as the distance increases. Consequently, in the computation, the entire 3D space is partitioned into small cells (6x6x6 Å), and for each DNA atom, we only evaluate its interaction energy with protein atoms in the same cell and in the neighboring 26 cells. Our algorithm uses 27 threads to evaluate one DNA atom with respect of its 27 neighboring cells (including the cell the atom resides). The binding affinity is evaluated between DNA triplets and protein residues. During the evaluations, the space is partitioned into lattices which are coarser-grained than cells for packing energy calculation. Since a molecule has fewer triplets/residues than atoms, the computation of binding affinity is faster than the packing energy.

Results and discussion
Computation cost of TF-DNA docking The rugged docking energy landscape and the statistical nature of our docking algorithm suggest that an accurate conformation can be found only if sufficient Monte-Carlo simulations are conducted assuming the energy function is accurate. We first show the importance of conducting more simulation runs and the need for computational speedups. We conducted 5400 simulation runs for each complex in our dataset of the 75 TF-DNA complexes. Sixty-three out of the 75 complexes have at least one docked conformation with RMSD less than 2 Å. The number of runs needed to produce one near native conformation is summarized in Figure 4. It shows that the difficulty of conformational search varies for different TF-DNA complexes. For example, 12 out of the successfully docked complexes only need 17-32 simulation runs for the random search to 'hit' the near-native conformation. But there also exist 5 difficult complexes that need more than 4097 runs for one 'hit'. Since we do not know the search difficulty for a blind docking prediction, it is therefore critical to increase the number of Monte-Carlo simulation runs to improve the quality of the docked conformation.

GPU-based TF-DNA docking efficiency
We implemented our protein-DNA docking algorithm on GPU and tested the execution speed of our GPU docking program. We used a 2.8GHz Intel Xeon processor as the baseline to measure the speedup. Twenty Monte-Carlo simulation runs were conducted for each TF-DNA complex. The average speedup across all complexes is 28. A breakdown of the speedup shows that the calculation of the packing energy was accelerated by 40, and binding affinity by 35. Not surprisingly, speedups in moving the DNA and accepting new DNA positions are small due to the simplicity of these calculations. We also tested the scalability of our algorithm when executed on the GPU cluster. In the strong scaling experiment, we assigned 4256, 8512, 17024, and Figure 4 The number of simulation runs to find at least one near-native conformation. 34048 tasks to each GPU respectively. Figure 5 shows that our algorithm achieved close-to-linear speedups. When scaled to 128 GPU cards, our algorithm achieved a sustained speed of 10.4 TFLOPs, which is comparable to a traditional CPU cluster with 3600 processor cores.

Improvement of TF-DNA prediction accuracy
To assess the improvement on prediction accuracy with our accelerated GPU algorithm, we ran the GPU docking algorithm with different number of Monte-Carlo simulations, 200, 400, 800, and 1600 using the data set of 75 TF-DNA complexes. The results are shown in Table 1 where a checkmark "x" indicates at least one conformation with RMSD < 1 Å was found during the search, and a checkmark in parenthesis "(x)" indicates a docking result where the conformation with the lowest energy has an RMSD less than 1Å. Our docking algorithm was successful in finding 47 of 75 (63%) near-native complexes when 200 Monte-Carlo runs were performed, and the sampling rate for finding near-native structures increased to 67%, 71%, and 73% when the number of Monte-Carlo runs increased to 400, 800, and 1600 respectively. When predictions were made based on the lowest energy, 36 (48%) complexes have been docked with RMSD < 1 Å with 200 Monte-Carlo runs. The success rate increased to 51%, 53%, and 55%, respectively as we increased the number of Monte-Carlo runs to 400, 800, and 1600. The above results demonstrated that increasing the coverage of the sampling space improves the chance of finding a near-native docking conformation, and our GPU based acceleration method is efficient in increasing the sampling coverage.

Test on the rigid-body TF-DNA docking benchmark
We next tested our GPU accelerated sampling on the rigid TF-DNA docking benchmark [19]. The benchmark contains a carefully selected non-redundant set of 38 test cases, encompassing diverse fold families. The 38 test cases are classified into easy (21 cases) and hard (17 cases) groups with respect to the degrees of difficulty in TF-DNA docking. The benchmark was designed to identify the strengths and weaknesses of potential functions and docking algorithms and to facilitate the development of better approaches.
We ran our GPU docking simulations on this benchmark set with the number of simulation runs at 200, 800, and 1600 respectively. The detailed docking results are shown in Table 2, in which the first 21 entries above the line are the easy cases while the remaining cases are the hard ones. These docking results are summarized in Table 3. The results clearly show two different trends for the easy and hard targets. For the easy targets, increasing sampling runs from 200 to 1600 improved the number of successful predictions based on the lowest energy, from 7 to 9 within 1Å (or 8 to 10 within 3Å) when compared to the native TF-DNA structures. However, for the hard targets, more simulation runs did not improve the number of successful predictions with either 1Å or 3Å as a cutoff.
The docking simulations also produced some nearnative structures ("conformation with the lowest RMSD" in Table 2 and the number in parenthesis in Table 3)    that were missed due to the higher docking energies. More simulation runs resulted more near-native structures for the easy cases while there are almost no changes for the hard cases (Table 2). A larger percentage of the near-native structures in the easy cases also have the lowest energies than those in the hard cases. This is consistent with the rationale of assigning degrees of docking difficulty based on the interaction strength and suggesting that in addition to developing more efficient search algorithm we need better energy functions for discriminating the native or near-native structures from the decoy ones.

Conclusions
Protein-DNA docking is a computation extensive problem. Due to the statistical nature of our Monte Carlo-based protein-DNA docking algorithm and the potentially rugged energy landscape, it is desirable to run multiple Monte-Carlo simulations with different seeds for better prediction performance. In this paper, we present a GPU-based high-performance protein-DNA docking algorithm that was designed for large scale GPU clusters. Modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that often substantially outpaces its CPU counterpart [17]. Rapid improvement in GPU programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding problems to the GPU, including many bioinformatics and biomedical problems [24][25][26][27]. To take advantage of GPU's massive parallel processing capability, we developed this GPUbased docking algorithm to accelerate the random sampling process, and thus improve the performance of the TF-DNA docking algorithm. To this end, we designed special techniques to improve the efficiency of the CUDA kernel and the scalability over the entire cluster.
Experimental results using a non-redundant dataset demonstrated a 28x speedup using a single GPU card and close-to-linear scalability when using GPU clusters. Further testing on a rigid TF-DNA docking benchmark revealed that such improved computing capability improves the chance of finding near-native conformations for the easy cases in the benchmark, but not on the hard cases, suggesting there is a limit for improving prediction accuracy by simply increasing the number of  simulation runs with our current energy function and search algorithm. There is clearly a need for developing more efficient search algorithms and more accurate interaction potentials.
List of abbreviations used CUDA: compute unified device architecture; GPU: graphical processing unit; HPC: high performance computing; MPI: message passing interface; NRBC: number of residue-base contact; PDB: protein data bank; RMSD: root mean square deviation; TF: transcription factor; VDW: van der Waals.