Energetics of the protein-DNA-water interaction

Background: To understand the energetics of the interaction between protein and DNA, we analyzed 39 crystallographically characterized complexes with the HINT (Hydropathic INTeractions) computational model. HINT is an empirical free energy force field based on solvent partitioning of small molecules between water and 1-octanol. Our previous studies on protein-ligand complexes demonstrated that free energy predictions were significantly improved by taking into account the energetic contribution of water molecules that form at least one hydrogen bond with each interacting species.

Results: An initial correlation between the calculated HINT scores and the experimentally determined binding free energies in the protein-DNA system exhibited a relatively poor r² of 0.21 and a standard error of ± 1.71 kcal mol⁻¹. However, the inclusion of 261 waters that bridge protein and DNA improved the HINT score-free energy correlation to an r² of 0.56 and a standard error of ± 1.28 kcal mol⁻¹. Analysis of the roles and energy contributions of these waters indicates that 46% of the bridging waters act as linkers between amino acids and nucleotide bases at the protein-DNA interface, while the remaining 54% are largely involved in screening unfavorable electrostatic contacts.

Conclusion: This study quantifies the key energetic role of bridging waters in protein-DNA associations. In addition, the relevant role of hydrophobic interactions and entropy in driving protein-DNA association is indicated by analyses of interaction character showing that, together, the favorable polar and unfavorable polar/hydrophobic-polar interactions (i.e., desolvation) mostly cancel.

function of biological systems: enzymes that bind substrates and effectors, proteins that mediate signal transduction via networks of alternative or specific protein-protein pairs, and nucleic acids that, via the binding of transcription factors, repressors and co-activators, regulate protein expression. In particular, the site-specific associations between DNA and proteins regulate most biological events [1], with key involvement in transcription, replication and recombination. Matthews [2] stated "the full appreciation of the complexity and individuality of each complex will be discouraging to anyone hoping to find simple answers to the recognition problem". A few years later, Draper [3] was still asking "...how does a protein select a specific DNA site out of the many available, when all potential binding sites share such a high degree of structural similarity? Thermodynamic, as well as structural, approaches must be used to answer this question...". Now, more than a decade later, no simple model for recognition between amino acids and nucleotides has been found [2,4-7]. From the analysis of the first protein-DNA crystal structure it was evident that several distinct contributions lead to formation of the complex [8-10], i.e., hydrogen bonds, electrostatic interactions, direct and indirect contacts between amino acids and phosphate, sugars and bases, water-mediated contacts, hydrophobic effects, ion release, mutual conformation rearrangement, bending and distortion. Amongst the enthalpic contributions, hydrogen bonds are the most easily recognized, and energetically may represent the bulk of interactions between nucleic acids and proteins, comprising protein backbone and side-chains contacting bases at their edges and the polynucleotide backbone [11].
About half of the hydrogen bonds found in known protein-DNA complexes involve phosphodiester oxygens [12], initially mediating indirect recognition between DNA and protein, and favoring a subsequent localization of the protein in a specific site [13]. In direct recognition, representing the foundation of sequence specificity, hydrogen bonds are formed between amino acid side-chains and DNA bases. Even earlier in the binding process, entropy plays a significant role in recognition as non-specific (low affinity) interactions, driven by long-range electrostatic forces, bring the DNA and protein into proximity and cause the release of counter-ions from the free DNA [14]. Thus, water molecules in free and protein-bound DNA complexes have been thoroughly investigated both experimentally and theoretically, and different roles have been proposed for interaction and recognition (see [14-16] and references therein).
While enthalpy is associated with molecular interactions resulting from complex formation, entropy is associated with multiple protein and DNA conformations, variations in the structure of water molecules and counterions, and other factors. This complexity and the interplay between multiple chemical and physical mechanisms necessary to achieve the required level of specificity are extremely difficult to describe quantitatively [17,18]. Recent investigations using osmotic pressure [19] have led to a determination of the differential role and number of water molecules released in specific and nonspecific binding of protein to DNA sequences [20,21]. Some results from these studies do not appear to be supported by x-ray crystallographic data of specific and nonspecific protein-bound DNA complexes [15]. Interestingly, osmotic pressure experiments suggest that in vitro studies in dilute solutions are likely to be less informative on in vivo processes than expected, due to the presence of crowding and confinement effects [22,23]. This, in turn, implies that the biological environment is relatively similar to that experienced by macromolecules within a crystal lattice [15,24,25]. Computational methods, which are heavily dependent on x-ray crystallographic data and are widely and successfully used in the evaluation of the energetics of ligand-protein interactions [26-29], should also be applicable to understanding protein-DNA complex formation. The earliest attempts [3,30] tried to estimate the contribution of each amino acid residue/nucleotide base pair to the total protein-DNA binding affinity. A different approach, proposed by Mandel-Gutfreund and Margalit [5], assumed that a global score reflecting the complementarity between a protein and its DNA target can be calculated by statistical analyses of the frequency of interactions for specific amino acid residue-nucleotide base types, thus implying additivity in binding energetics.
Other attempts to qualitatively, semi-quantitatively and/or quantitatively describe the interaction between protein and DNA [6,7,11,14,31-37] have taken advantage of the available three-dimensional crystallographic structures of proteins that bind to DNA, a field pioneered by Matthews [38].
A wealth of information on the rules that govern biomolecular recognition is derived from structural data, predominantly x-ray crystallography and nuclear magnetic resonance. However, the analysis of the three-dimensional structure of a complex can only provide a geometric framework that ultimately needs quantitative evaluation of the binding energetics to enable assessment of codes, rules, and/or mechanisms. To date, pursuit of this goal has primarily focused on ligand-protein interactions due to the intense interest in designing compounds that bind selectively and with high affinity to therapeutically relevant enzyme or protein targets. Consequently, a variety of computational modeling approaches have been developed to obtain quantitative descriptions of ligand-protein encounters. Usually, the process is simplified by: i) considering only the volume specified by the active site; ii) assuming no or reduced conformational flexibility; and iii) neglecting the energetic contributions of water molecules (both with respect to their contribution to enthalpy and entropy).
To overcome some of these limitations, the HINT (Hydropathic INTeractions) force field was developed. HINT is based on LogP_o/w, the solvent partition coefficient of a species between 1-octanol and water, these solvents being models for the internal apolar and polar protein milieus, respectively [39]. Because LogP_o/w is a free energy parameter, its measurement takes into account both enthalpic and entropic contributions originating from all molecules, including water, that participate in a biomolecular association, and solvent partitioning data are unique experimental measurements of intermolecular and interatomic interactions. The total interaction score (B) for a complex is calculated with the following equation:

$$B = \sum_{i}\sum_{j} b_{ij} = \sum_{i}\sum_{j}\left(a_i S_i\, a_j S_j\, T_{ij} R_{ij} + r_{ij}\right) \qquad (1)$$

where b_ij represents the interaction score between atoms i and j, a the hydrophobic atom constant, S the atomic solvent accessible surface area, T_ij a logic function assuming +1 or -1 values, depending on the polar nature of the interacting atoms, and R_ij and r_ij are functions of the distance between atoms i and j. R_ij is usually a simple exponential function, while r_ij is an adaptation of the Lennard-Jones function. The key parameters a are calculated by a procedure adapted from the CLOG-P method [40]. Because the sum of all a_i values is the LogP_o/w for a molecule, each a_i is a partial logP that can be considered a δg for solute transfer. If the "receptor" is changed from the 1-octanol/water solvent pair to a biomacromolecule with hydrophobic and polar regions, then, in a sense, the a_i values represent atomic free energies of association. Each a_i thus encodes all aspects of free energy, both enthalpic and entropic. In HINT, favorable b_ij interactions (hydrogen bonds, acid-base, hydrophobic-hydrophobic) are positively scored, while unfavorable contacts (acid-acid, base-base, hydrophobic-polar) are negatively scored.
H_TOTAL, the sum of all b_ij terms, describes the total interaction between the two species. In this way, the ligand-protein interaction is not separated into multiple factors by interaction type (e.g., hydrogen bond, hydrophobic, etc.), but is considered a concerted event, as it occurs in nature [41]. Because the HINT analysis is carried out on biomolecular systems with three-dimensional structure, geometric information is embedded in the procedure. We have applied this approach to the energetics of protein-ligand complexes both in the absence and presence of water molecules that bridge protein and ligand, at constant pH and as a function of the ionization state of interacting groups [29,42-44], in protein-protein interfaces [45,46], and in ligand-DNA recognition [47-49]. Results from these diverse studies indicate that HINT is a powerful tool to quantitatively investigate and describe the energetics and specificity of biomolecular processes. It must be noted that HINT analysis evaluates the interactions between pre-formed molecules, and does not include terms for evaluating the internal energies of these molecules. These internal energies are certainly important components of overall binding free energy, but may be relatively invariant within a particular data set, as we have reported [29,42-49].
In the present work HINT analysis was used to evaluate the strength of interaction in protein-DNA complexes, explicitly taking into account the energetic contribution generated by water molecules found at the interface between protein and DNA. This analysis was performed on 39 DNA-protein complexes, determined at a resolution better than or equal to 2.8 Å and for which experimental equilibrium constants are available. Correlation of HINT score with experimental free energy indicates predictive models with a standard error of ± 1.28 kcal mol⁻¹. These results represent a quantitative basis for ultimately dissecting the amino acid residue-nucleotide base interaction to understand the amino acid-base "recognition code", a topic we are currently investigating.
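The pairwise scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of the structure of the score, not the HINT implementation: the `Atom` class, the function names, the atom constants and surface areas in the example are all hypothetical, R_ij is taken as a plain exp(-r), the value of T_ij is supplied by the caller, and the Lennard-Jones-like r_ij term is left as a pluggable function that defaults to zero because its exact form is not given here.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Atom:
    xyz: Tuple[float, float, float]  # coordinates in angstroms
    a: float                         # hydrophobic atom constant (partial logP)
    S: float                         # solvent accessible surface area


def zero_lj(r: float) -> float:
    """Placeholder for the Lennard-Jones-like r_ij term (assumed zero here)."""
    return 0.0


def pair_score(atom_i: Atom, atom_j: Atom, T_ij: int,
               lj: Callable[[float], float] = zero_lj) -> float:
    """b_ij = a_i * S_i * a_j * S_j * T_ij * R_ij + r_ij, with R_ij = exp(-r)."""
    r = math.dist(atom_i.xyz, atom_j.xyz)
    R_ij = math.exp(-r)  # assumption: simplest possible exponential decay
    return atom_i.a * atom_i.S * atom_j.a * atom_j.S * T_ij * R_ij + lj(r)


def total_score(atoms_1: Sequence[Atom], atoms_2: Sequence[Atom],
                T: Callable[[Atom, Atom], int],
                lj: Callable[[float], float] = zero_lj) -> float:
    """Sum b_ij over every atom pair between two molecules."""
    return sum(pair_score(ai, aj, T(ai, aj), lj)
               for ai in atoms_1 for aj in atoms_2)


def always_plus_one(atom_i: Atom, atom_j: Atom) -> int:
    """Toy T_ij; a real one depends on the polar nature of both atoms."""
    return 1


# Illustrative atoms only: two hydrophobic atoms and one polar atom.
methyl_a = Atom((0.0, 0.0, 0.0), a=0.5, S=10.0)
methyl_b = Atom((4.0, 0.0, 0.0), a=0.5, S=10.0)
polar = Atom((4.0, 0.0, 0.0), a=-1.0, S=10.0)

hydrophobic_pair = pair_score(methyl_a, methyl_b, T_ij=1)  # positive score
mismatch = pair_score(methyl_a, polar, T_ij=1)             # negative score
combined = total_score([methyl_a], [methyl_b, polar], always_plus_one)
```

With these toy values a hydrophobic-hydrophobic contact scores positively and a hydrophobic-polar contact negatively, which matches the sign convention stated in the text.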

Results and discussion
Being able to accurately model the energetics of protein-DNA association will help us to more completely understand the machinery of life itself and to uncover a wealth of new opportunities for the therapeutic treatment of many diseases. While direct interactions, i.e., recognition, between the two macromolecules are important for specificity, the water molecules at the protein-DNA interface also contribute to complex formation and potentially play a role in mediating specific interactions (see [14-16] and references therein). In fact, Janin reports that protein-protein and protein-DNA interfaces contain at least as many water-mediated interactions as direct hydrogen bonds or salt bridges [50]. Water molecules mediating biological interactions have been the subject of intense recent study [16,43,44,51-53]. The importance of water in regulating recognition, complex formation and, generally, interactions among biomolecules is widely accepted, but experimental and computational tools for quantifying these effects are still somewhat lacking [54]. Even x-ray crystallography at high resolution likely underestimates the number of solvent molecules, and can misrepresent other ions, precipitant molecules or artifacts as water. One approach we have previously used to validate crystallographic water sets is application of the GRID program [55], which evaluates empty regions of space in terms of water (or another probe) being favorably bound. We found that crystallographic water molecules with high specificity virtually always exhibit favorable GRID energies [43], and thus should be considered well-placed.
Both protein and DNA molecules in solution, when uncomplexed, are surrounded by a variable number of water molecules interacting through hydrogen bonds with exposed polar groups. While the protein solvation pattern is extremely variable, a consequence of the protein's nature and folding [18,56], DNA presents largely the same (conserved) hydration pattern, with minor sequence-dependent local variation. An ordered spine of hydration occupies the minor groove, whereas the major groove is too wide to retain the same water network and is filled with ordered water molecules interacting singly or in pairs with the nucleotide bases [15]. In addition, the phosphate groups are usually surrounded by six hydration sites, with positions differing with the conformation and nucleotide types [57]. Overall, these conserved water patterns contribute to stabilization of the DNA conformation [16].
The process of protein-DNA association is certainly very complex, with substantial conformational changes of the interacting macromolecules and a concomitant significant rearrangement of the bound solvent water molecules and counterions. Matthews, in his pioneering work on protein-DNA interaction, recognized the fundamental role played by water molecules in mediating the formation of the Trp repressor-DNA complex, stating that "the explicit need to consider bound water on the surfaces of both proteins and DNA adds another level of complexity to the recognition problem" [2]. Figure 1a illustrates the three-dimensional structure of the homing endonuclease I-CreI complex (PDB code 1g9y), one of the 39 complexes studied in this work, including display of the bound water molecules. Water molecules hydrating exposed polar groups on the protein and DNA, respectively, are highlighted in Figure 1b and 1c. The goal of this paper is to unravel the energetics of association as they relate to recognition between protein and DNA. We will put particular emphasis on understanding the contribution and role of water in protein-DNA associations.

Protein-DNA interaction energetics
The structures of 39 proteins bound to their target DNA sequences were retrieved from the Protein Data Bank [58] and from the Nucleic Acid Database [59] (see Table 1). In order to obtain reliable calculations and predictions, only structures characterized by resolution better than 2.90 Å were considered: the average resolution is 2.18 Å. Other selection criteria are described in the Methods. The data set is composed of DNA binding proteins with different functions, i.e., 6 transcription factors, 19 transcription regulators, 12 enzymes and 2 DNA binding proteins. The binding affinities for the complexes vary over about four orders of magnitude. The interaction of each protein with its corresponding DNA sequence was evaluated with the HINT force field [39] (Table 2). Correlation (Figure 2, solid line) of the calculated HINT scores for each protein-DNA association with the experimentally determined free energies of binding for that complex (all symbols) leads to:

$$\Delta G^{\circ} = -0.000198\, H_{\mathrm{TOTAL}} - 9.98 \qquad (2)$$

with a relatively poor r² = 0.21 and a standard error of ± 1.71 kcal mol⁻¹. However, several outliers (open symbols) are evident, negatively affecting the correlation. All outlier complexes contain the same protein: homing endonuclease I-CreI, complexed in the native form with either the DNA product (1g9z) or the DNA substrate (1g9y), and enzyme mutants (1t9j and 1u0c). While the data point for the endonuclease I-CreI substrate complex 1t9i is well placed in this correlation, it is considered an outlier in this discussion (vide infra). The exclusion of these five outliers produces a significantly improved correlation, with an r² of 0.51 and a decreased standard error of ± 1.41 kcal mol⁻¹. The count of solvent molecules is extremely variable in the analyzed structures (Table 2), ranging from 2 in 1jkr to 857 in 1g9z, with a mean value of 200.
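As a small worked example of how a linear model of this kind is used, the sketch below evaluates eq. 2 for a given HINT score and converts a dissociation constant into a free energy with ΔG° = RT ln K_d. The function names and the example inputs are illustrative, and the temperature of 298.15 K is an assumption, since the experimental temperature is not stated in this section.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def dg_from_hint(h_total: float,
                 slope: float = -0.000198,
                 intercept: float = -9.98) -> float:
    """Predicted binding free energy (kcal/mol) from eq. 2."""
    return slope * h_total + intercept


def dg_from_kd(kd_molar: float, temp_k: float = 298.15) -> float:
    """Experimental binding free energy (kcal/mol) from a dissociation constant.

    The default temperature is an assumption made for this sketch.
    """
    return R_KCAL * temp_k * math.log(kd_molar)


# Hypothetical inputs chosen only to show the arithmetic.
predicted = dg_from_hint(10000.0)   # -0.000198 * 10000 - 9.98 = -11.96
experimental = dg_from_kd(1e-9)     # about -12.28 for a 1 nM complex

# Four orders of magnitude in affinity correspond to this free energy span.
span = R_KCAL * 298.15 * math.log(1e4)  # about 5.46 kcal/mol
```

Under this temperature assumption, the roughly four orders of magnitude in binding affinity covered by the data set correspond to a free energy window of about 5.5 kcal mol⁻¹, which gives some context for a standard error of ± 1.71 kcal mol⁻¹.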
We have shown previously that water molecules, in particular those that bridge between interacting species, play a significant energetic role in biomolecular associations [43]. Significantly, the average number of crystallographically detected waters in the endonuclease I-CreI-DNA complexes (1g9z, 1g9y, 1t9i, 1t9j, 1u0c) is much higher, 454. Complexes with an overall high number of crystallographic waters would also be expected to have a concomitantly high number of potentially bridging and energetically relevant waters at the protein-DNA interface. A high water count in crystallographic models is usually due to higher accuracy in the x-ray structure, as a larger fraction of bound waters is revealed; however, the crystallographic resolution of the five endonuclease I-CreI-DNA complexes (between 1.6 and 2.5 Å, with an average of 1.99 Å) is only a partial cause of this difference. It is important to note, however, that water molecules may be introduced during crystallographic refinement only to account for electron density of unknown origin, which improves the apparent data analysis statistics.

Water role in protein-DNA interaction energetics
Water can play several fundamental roles at the interface of protein-DNA systems [16]. Water molecules can: i) fill destabilizing holes in the complex; ii) facilitate binding by screening unfavorable electrostatic contacts (Figure 1d); and iii) act as linkers or "bridging waters" at the protein-DNA interface by providing side chain "extensions" that facilitate indirect hydrogen bonding (Figure 1e). We evaluated the contribution of water molecules placed at the protein-DNA interface by first identifying all waters (oxygen atoms) that are ≤ 4 Å from both protein and DNA. This set contains a total of 1244 water molecules (Table 3). When, as described in Materials and Methods, the contribution of these interfacial waters was added to the protein-DNA HINT score, the correlation of H_TOTAL with the experimental free energy of association was recalculated (Figure 3a). Complexes previously identified as outliers (Figure 2) are now coherent with the correlation, supporting the fundamental contribution played by water molecules to the free energy of binding between protein and DNA. Only 1t9i, the endonuclease I-CreI-DNA complex that was not an obvious outlier in Figure 2 (but nevertheless removed), is an obvious outlier in Figure 3a.

Figure 1. Three-dimensional structure representation of the homing endonuclease I-CreI complex (1g9y), using display features of the Lithium software package [75]. a) Overall view of the complex where the protein is displayed in ribbon/tube style and the DNA is represented in color-coded ribbons: red for adenine (A), blue for cytosine (C), green for guanine (G), and yellow for thymine (T). b) Water molecule hydrating a negatively-charged amino acid side-chain. c) Water molecule hydrating a DNA phosphate group. d) Water molecule screening the repulsive interaction between an Asp side-chain and a DNA phosphate. e) Water molecule located at the complex interface mediating specific amino acid-base interactions.
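The geometric selection of interfacial waters (water oxygens within 4 Å of both protein and DNA) can be illustrated with a minimal sketch. The function names and the toy coordinates are hypothetical; a real analysis would read atom positions from the PDB entries and would probably use a spatial index rather than the brute-force distance search shown here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def min_distance(point: Point, atoms: Sequence[Point]) -> float:
    """Shortest distance (angstroms) from a point to any atom in a set."""
    return min(math.dist(point, atom) for atom in atoms)


def interfacial_waters(water_oxygens: Sequence[Point],
                       protein_atoms: Sequence[Point],
                       dna_atoms: Sequence[Point],
                       cutoff: float = 4.0) -> List[Point]:
    """Keep waters whose oxygen lies within the cutoff of BOTH partners."""
    return [w for w in water_oxygens
            if min_distance(w, protein_atoms) <= cutoff
            and min_distance(w, dna_atoms) <= cutoff]


# Toy system: one protein atom, one DNA atom, three candidate waters.
protein = [(0.0, 0.0, 0.0)]
dna = [(6.0, 0.0, 0.0)]
waters = [
    (3.0, 0.0, 0.0),   # 3 A from each partner: interfacial
    (0.0, 3.0, 0.0),   # close to the protein only
    (10.0, 0.0, 0.0),  # 4 A from the DNA atom, 10 A from the protein atom
]

selected = interfacial_waters(waters, protein, dna)
```

Only the first water passes the filter in this toy case, because the other two are within the cutoff of at most one partner.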
Previous analyses of protein-ligand systems indicated that only "bridging" water molecules are relevant for complex formation [43], and these highly constrained waters should be located in crystallographic experiments of even moderate resolution. We used the Rank algorithm [60], which has been validated with a wide set of protein and protein-ligand structures [44], to identify bridging waters and predict the weighted number of hydrogen bonds potentially formed by each with both the protein and the DNA. Using the filter that only waters characterized by Rank greater than 0 with both macromolecules (i.e., forming at least one hydrogen bond with each) are included in H_TOTAL, the number of significant waters placed at the protein-DNA interface is reduced from 1244 to 996 (Table 3) for all complexes. Correlating this H_TOTAL with free energy (not shown) yielded a model with r² of 0.47 and standard error of ± 1.41 kcal mol⁻¹. Visual inspection suggested, however, that some of the members of this water set are not truly bridging, possibly because the Rank algorithm does not distinguish between distance and angular contributions to Rank. The implication is that Rank only mod-

Our previous studies of water molecules in proteins and protein-ligand complexes [44] demonstrated that water molecules with total Rank of at least 4 and non-zero partial Ranks had more impact on the formation of protein-ligand complexes. Waters with Rank ≥ 4 should form at least two hydrogen bonds and have very favorable geometry, and should therefore be more locked and stable at the protein-DNA interface and more detectable by x-ray diffraction analysis. The number of waters that satisfy these criteria is 261. The more "fixed" position of these waters is confirmed by a lower mean B factor (32.7 Å²) than the mean value calculated for all 7394 crystallographic water molecules (42.8 Å²).
Correlation of H_TOTAL calculated with this set of waters with free energy (Table 2 and Figure 3b) yields a correlation with r² = 0.56 and a standard error of ± 1.28 kcal mol⁻¹. The improvement in the correlation is clearly due to the contributions of a smaller, but more significant, set of water molecules.

It is important to emphasize that the results presented here explain only part of the protein-DNA-water interaction and the tools we have used only illuminate the process through examination of the bound endpoint. For example, the energetic contribution of the internal conformations, i.e., conformational entropy, of the interacting biopolymers is not treated explicitly, and is only one of several components of the additive constant portion of our correlations (eqs. 2-5). However, the low standard errors in our models indicate that these contributions are more or less constant across the data series. The magnitude of the additive constant can be rationalized by the fact that these complexes do have many structural and chemical similarities, the most important of which is that they all form crystals analyzable by x-ray diffraction. Note (eqs. 2, 4 and 5) that as we incrementally improved the models by explicitly including more appropriate sets of water molecules, the additive constant decreased in magnitude as the standard error improved, indicating that this particular contribution to ΔG° is now being treated more explicitly.
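The two Rank-based filters described here (non-zero partial Rank with both macromolecules, then total Rank ≥ 4) can be expressed as a short sketch. The `Water` class, the field names and the example values are hypothetical, the partial Ranks are assumed to have been computed elsewhere by the Rank algorithm, and the total Rank is assumed to be the sum of the two partial Ranks.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass(frozen=True)
class Water:
    rank_protein: float  # partial Rank toward the protein
    rank_dna: float      # partial Rank toward the DNA
    b_factor: float      # crystallographic B factor

    @property
    def total_rank(self) -> float:
        # Assumption: total Rank is the sum of the two partial Ranks.
        return self.rank_protein + self.rank_dna


def bridging(waters: Sequence[Water]) -> List[Water]:
    """Waters with non-zero partial Rank toward both macromolecules."""
    return [w for w in waters if w.rank_protein > 0 and w.rank_dna > 0]


def relevant(waters: Sequence[Water], min_total: float = 4.0) -> List[Water]:
    """Bridging waters whose total Rank is at least min_total."""
    return [w for w in bridging(waters) if w.total_rank >= min_total]


# Made-up waters, chosen to exercise each branch of the filters.
waters = [
    Water(2.5, 2.0, 25.0),  # bridging, total Rank 4.5: kept by both filters
    Water(4.5, 0.0, 30.0),  # high total Rank but no contact with the DNA
    Water(1.0, 1.5, 45.0),  # bridging, total Rank 2.5: first filter only
    Water(0.0, 0.0, 60.0),  # no contact with either partner
]

bridging_set = bridging(waters)
relevant_set = relevant(waters)
mean_b_all = mean(w.b_factor for w in waters)
mean_b_relevant = mean(w.b_factor for w in relevant_set)
```

Comparing `mean_b_relevant` with `mean_b_all` mirrors the B factor comparison made in the text for the selected waters versus all crystallographic waters.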

Energy contributions of the DNA base, phosphate and ribose to complex formation
The association of a protein-DNA complex usually involves a two-step process: an initial binding via nonspecific interactions and a subsequent translocation of the protein to the specific binding site [62,63]. The first step is regulated by electrostatic contacts between the protein side-chains and the DNA backbone phosphates, while binding specificity is achieved by interactions with the nucleotide bases themselves. However, the DNA backbone (ribose and phosphates) may play a less dramatic but fundamental role in specificity by holding the protein in a defined orientation, thus decreasing the energetic cost of the complex formation, or because the phosphate orientations are somewhat determined by the base sequence [12]. From a geometric-based analysis, which evaluated two atoms to be in contact if their centers were 1-5 Å apart, Lejeune and co-workers [64] reported that an average of 47% protein-DNA interactions involve the phosphate group, while 24% can be attributed to the base. The total HINT score for each analyzed complex was deconvolved into partial contributions from protein-DNA phosphate , protein-DNA ribose and protein-DNA base interactions ( Figure 4a). Clearly, both the non-specific (DNA backbone) and the specific contacts (base) play a fundamental role in driving the association event and in stabilizing the complex. The protein-DNA phosphate HINT score values are variable, ranging from -7900 to 10800 HINT score units ( Table 2), but the most negative contributions are found in the endonuclease I-CreI-DNA complexes 1g9y, 1g9z, 1t9j and 1u0c, noted for their nonconforming behavior earlier. These high negative-valued protein-DNA phosphate contacts clearly explain the low overall protein-DNA HINT scores calculated for complexes 1g9z, 1t9j and 1u0c (Table 2), whose association seems unfavorable without a compensating water term. 
In these cases, water molecules do not only mediate the recognition between amino acids and bases, but may also significantly screen the unfavorable electrostatic interactions between phosphate groups and negatively charged amino acid side-chains. Thus, the high number of water molecules is needed to stabilize the DNA/I-CreI endonuclease complexes. The ribose groups do not significantly participate in the binding energetics ( Figure 4a) and, in fact, the protein-ribose interactions are equally likely to be favorable or unfavorable with respect to HINT score. In contrast, the DNA bases contact the protein with ubiquitously favorable interactions (mean value of 3300 HINT score units).
It is important to note that recognition is a complex process; for example, there is an almost complete absence of protein-DNA base interactions in the specific complex between TFIIBc and its target DNA sequence (1c9b), so this recognition must be mediated by protein-DNA phosphate or protein-DNA ribose interactions [12].
Both direct (including hydrophobic) and indirect (water-mediated) interactions between the protein and DNA are relevant [12,64]. Figure 4b illustrates the contribution of the 261 interfacial (bridging) water molecules (both water-protein and water-DNA partial Ranks > 0, total Rank ≥ 4, as in eq. 5) to the protein-DNA interaction, where the water-DNA phosphate, water-DNA ribose, water-DNA base and water-protein terms are individually shown. The favorable DNA-water interaction is generally attributable to both water-DNA phosphate and water-DNA base contacts, reinforcing the notion that in most complexes water facilitates binding by screening unfavorable electrostatic contacts and acts as a linker at the protein-DNA interface. The water-DNA phosphate HINT score ranges from near zero to 9800 with a mean value of 2400, while the water-DNA base contribution ranges from near zero to 6700 HINT score units, with a mean value of 1600. Only in a few cases is a positive DNA-water HINT score completely attributable to the water-DNA base interaction; e.g., in the 1azp, 1bl0, 1pue and 1qpz complexes water mediation is necessary to achieve specific recognition between the two macromolecules. The water-DNA ribose interaction always negatively affects the global HINT score because of unfavorable hydrophobic-polar contacts made between water and the hydrophobic moieties of ribose. Finally, the score contributions from protein-water contacts range from -1340 to 2120, with an average of only 140 units. The discrepancies between protein-water and DNA-water HINT scores will be discussed later, but are generally attributable to the different chemical natures of the interacting groups. It is evident in comparing Figure 4a and 4b that in the cases where the overall protein-DNA score is negative (i.e., the DNA/I-CreI endonuclease complexes), the water terms are able to compensate.

Water molecules in protein-DNA interaction specificity identified by role
Coordinating water molecules are found in high numbers around protein-DNA complexes, and they play a variety of roles in stabilizing these complexes. To ascertain these roles, and to determine whether a water's role could be inferred from its Rank, we visually analyzed the larger set of waters; i.e., the 996 water molecules having non-zero Rank with respect to both macromolecules. Of these, somewhat more than half (547) interact with phosphate and ribose groups of the DNA in the ways described above. The main role of these waters is to screen unfavorable electrostatic forces arising between phosphate groups and charged amino acid side-chains. These waters are tightly bound to the DNA, with an average partial HINT score (H_DNA(backbone)-water) of 426 (Table 4), but a weakly negative H_protein-water of -36. On the other hand, the corresponding Rank has the opposite trend: the partial Rank for protein (1.9) is larger than that for the DNA (1.3). Together these data suggest that these waters are predominantly locked by a single very strong interaction with the DNA, and that favorable interactions between the protein and the DNA backbone are not actually mediated by water to any large extent other than by shielding the highly negative phosphate charge and generally making the environment around the phosphates more conducive to protein binding.

Figure caption: Correlation between the experimental binding free energies and the calculated HINT score values, taking into account the score contributions of water molecules.
The remaining 461 waters of the set interact only with the bases of the polynucleotide. Each was then categorized (Table 4) as either non-bridging (when they are positioned such that they cannot link protein and DNA base or when they unnecessarily mediate already favorable interactions between protein and DNA bases) or bridging (mediating specific protein-DNA recognition and association). This analysis identified 212 waters as non-bridging with an average Rank of 2.8. In fact, only 23 of these non-bridging waters (10%) have Rank ≥ 4. Among the 249 nucleotide base-to-protein bridging waters, 218 are found between bases and amino acid side-chains, 20 between bases and protein backbone, and 11 connect the bases to both the side-chain and backbone of the protein. The average Rank of these bridging waters is 3.7, with those linking to both the protein side-chain and backbone having an unsurprisingly larger average Rank of 4.6. One-third (82) of the bridging waters have Rank ≥ 4. HINT scores and Rank statistics for the set of waters interacting with both protein and DNA bases are summarized in Table 4. The mean interaction scores for waters bridging protein side chains to DNA bases are 94 and 360 for H_protein-water and H_DNA-water, respectively, while the partial Ranks are 1.9 and 1.8.
The previous analyses of bridging waters in protein-ligand systems [44] revealed a global Rank of 4.5 less evenly divided between protein and ligand: the mean partial protein-water Rank was 3.0, while the mean partial ligand-water Rank was 1.5. This difference is probably attributable to the different natures of protein-ligand and protein-DNA interfaces. Proteins, with a more extended and heterogeneous surface characterized by clefts and cavities, usually envelop small ligands, but formation of a protein-DNA complex likely involves winding of the objects together, yielding two more or less comparable surface areas. The HINT score values are also differently distributed in protein-ligand systems compared to protein-DNA systems. In the protein-ligand system [44], H_protein-water and H_ligand-water were 307 and 277 HINT score units, respectively, i.e., nearly equal. Here, even in the case of protein side-chain to DNA base, waters interact notably more strongly with the DNA (360) than with the protein (94). This is, at first, somewhat surprising, given that the bases are structurally constrained to be planar, while the protein side-chains possess more flexibility and would presumably adopt the most conducive conformation for binding. However, the aromatic groups, present in both pyrimidine and purine bases, are capable of forming weak hydrogen bonds with water, either by water hydrogen atoms donating to aromatic electron clouds, or by water oxygen atoms accepting from polarized aromatic hydrogens. Thus, nearly all contacts between the nucleic acid bases and the surrounding water molecules are potentially positive. In contrast, hydrophobic protein side-chains would produce numerous unfavorable (negatively scored) hydrophobic-polar contacts with water, regardless of the water geometry. Also, structural differences between the two types of interfaces are relevant.
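The category counts quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the text; the variable names are illustrative, and it assumes that the category shares are taken relative to the sum of the three category counts (backbone-interacting, base non-bridging and base bridging waters).

```python
# Counts of interfacial waters by role, as quoted in the text.
backbone_waters = 547     # interact with DNA phosphate and ribose groups
base_non_bridging = 212   # contact bases but do not link protein and DNA
base_bridging = 249       # mediate protein-to-base contacts

# Breakdown of the base-bridging waters by protein partner.
bridging_detail = {"side_chain": 218, "backbone": 20, "both": 11}
bridging_rank_ge_4 = 82   # base-bridging waters with total Rank >= 4

base_waters = base_non_bridging + base_bridging
# Assumption: the denominator is the sum of the category counts.
categorized_total = backbone_waters + base_waters


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


shares = {
    "backbone": percent(backbone_waters, categorized_total),
    "base_non_bridging": percent(base_non_bridging, categorized_total),
    "base_bridging": percent(base_bridging, categorized_total),
}
side_chain_share = percent(bridging_detail["side_chain"], base_bridging)
high_rank_share = percent(bridging_rank_ge_4, base_bridging)

for role, value in shares.items():
    print(f"{role}: {value:.1f}%")
```

The breakdown by protein partner sums to the 249 bridging waters, and the fraction with Rank ≥ 4 comes out close to the one-third stated above.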
The cavities and shallows that bind waters at interfaces in protein-ligand complexes are usually formed by backbone or, more frequently, by charged and polar groups; however, the surface of a protein interacting with a polynucleotide can also be formed by apolar moieties. Thus, even though the number of hydrogen bonds to waters is more equally distributed between the two macromolecules in the protein-DNA case, these waters cannot be enveloped by either the protein or the DNA.
A most interesting consequence of the above results is that water molecules contributing to protein-DNA recognition specificity have a somewhat different set of criteria than those contributing energetically to the complex stability. Visual evaluation indicates that: 1) 54% of waters with nonzero Rank with respect to both macromolecules were involved in interactions with the DNA backbone and thus play a minor role in specificity but are energetically critical for the association. 2) 46% of the waters interact with the nucleotide base; of these, 21% are actually non-bridging, and the remainder (25%) bridge between the base and various features of the protein. Interestingly, only 2-3% of the nucleotide base-bridging waters interact (only) with the protein backbone, so that the vast majority interacts with the protein side-chains and potentially governs binding specificity. It is likely that only these waters forming hydrogen bonds with amino acid side-chains would be involved in recognition of specific nucleic acid sequences, but that accounts for more than 90% of the waters bridging between protein and bases of DNA. Figure 5 illustrates the protein-DNA interactions for the set of 39 complexes in terms of HINT interaction types; i.e., hydrogen-bond, acid/base, hydrophobic, acid/acid, base/base and hydrophobic/polar. As is usually the case in biomolecular associations, the non-covalent forces in protein-DNA association are system-specific and finely balanced [37]. This is evident from Figure 5 where the favorable polar terms (hydrogen-bond and acid/base) are compensated by the base/base, acid/acid and hydrophobic/polar terms. The unfavorable HINT terms represent energy costs such as desolvation that are paid when the complex forms by association of isolated biomolecules. 
The overall sum of these forces is the binding free energy, generally ranging from -9 to -17 kcal mol-1 [65], and involving electrostatic and van der Waals interactions, hydrogen bonds, ion and water release, complex reorganization due to hydrophobic effects, hydrophobic contacts and other entropy effects [5,36,37,61,65]. While hydrogen bonds (direct and water-mediated) and electrostatic contacts are usually taken into account and considered fundamental in analyses of complex formation and in specific recognition, and are clearly the dominant terms in Figure 5, the other factors related to entropy and hydrophobicity are commonly ignored. We have found that inclusion of hydrophobic terms (favorable hydrophobic-hydrophobic and unfavorable hydrophobic-polar) in scoring models leads to reliable binding free energy predictions in protein-protein [45,46], protein-ligand [29,43] and DNA-ligand [47][48][49] complexes.

Character of protein-DNA interactions
Hydrophobic effects have been proposed to be the major driving forces of protein-DNA association [35,37,66], as this force arises from the burial of non-polar protein surfaces into the DNA binding site. The predominant role of hydrophobicity (i.e., entropy) is supported by calorimetric analyses that reported a negative change in heat capacity upon complex formation [67,68]. On the enthalpic side, the electrostatic term of free energy counteracts binding because favorable charge-charge interactions are often counterbalanced by the highly unfavorable contribution from dehydration of the polar groups [35]. Jayaram's computational analysis of binding [37] also demonstrated that packing and hydrophobic effects favor binding, whereas electrostatic interactions energetically oppose it [41]. However, the negative heat capacity change associated with the formation of specific protein-DNA complexes could not be completely explained by taking into account only hydration effects [14,17,18]. Other contributions, like the conformational changes of both proteins and nucleic acids accounting for 20% of the total ΔCp [69][70][71][72], the modification of the protonation state of the interacting residues [73] and counterion release [74], have been considered. In particular, even if ion release was generally considered to be favorable for complex formation, several studies demonstrated that the negative contribution from ion-molecule electrostatics, rather than the positive entropy given by the ion reorganization, dominates the salt-dependent solvation effects [36,37]. Furthermore, the ionic interaction with water molecules induces an increased ordering of waters, producing a large negative heat capacity change [14,74].
The HINT analysis in this work allowed examination of the character of interactions contributing to an association without actually parsing them energetically, because all atom-atom interactions are evaluated with the same protocol. HINT evaluates not only the electrostatic and van der Waals contributions, but also hydrophobic-related contacts, and should be able to test the observation of Mandel-Gutfreund and Margalit [5] that amino acid-nucleotide base recognition is governed by both hydrogen bonds and hydrophobic interactions. Stabilizing hydrophobic contacts, mainly between sugar methylenes and aliphatic or aromatic amino acid side-chains, were estimated to account for 63% of protein-DNA ribose contacts [64]. Note that the free energy-based analysis illustrated in Figure 5 is over the entire protein-DNA interaction set (not just protein to ribose). Nevertheless, the hydrophobic/hydrophobic interactions (Figure 5) always contribute favorably to the protein-DNA binding, but apparently only to a moderate extent. These contributions are not impacted by unfavorable effects or the presence/absence of bridging waters, and in some cases are the dominant factors in binding after the other terms appear to cancel out. In this sense, hydrophobic contacts and the related hydrophobic effects may represent the main driving force of protein-DNA association, while the electrostatic interactions seem to increase specificity but not affinity. It must be reiterated, however, that this computational analysis tool is probing only the (relatively short range) energetics between pre-formed DNA and protein components of the final, end-state, complex. As such, it does not measure or account for the internal energies of the protein and DNA molecules and the energy involved in conformational changes of these molecules between their unassociated and bound states. The quality of the resulting models, eq. 5, suggests that these and other terms are largely invariant over the data set.

Conclusion
Water contributes to protein-DNA complex formation in two principal ways. Without water, some of the complexes would be scored as energetically unfavorable. There is an apparent, but interesting, disconnect: water molecules that are significant for DNA-protein recognition have a lower Rank threshold than those critical for accurate free energy calculations. Also, the results above demonstrated that including the energetic contribution from waters at the protein-DNA interface significantly improved the quality of our computational free energy predictions, particularly when only "true" bridging waters are included. Our criterion, based on the previous analysis of 15 protein-ligand complexes [44], is that only waters characterized by nonzero partial Ranks with each interacting molecule and total Rank of at least 4 are energetically relevant. In effect, a bridging solvent molecule should form a minimum of two strong, well-located hydrogen bonds, with at least one additional favorable contact. Those waters with lower Rank, especially between 3 and 4, are still significant in mapping the energetic landscape for interaction by altering the shape, polarity and surface charge of the DNA or protein, even if they do not directly contribute to the free energy of binding.
Figure 5. Contributions of the different interaction types participating in the direct protein-DNA association process, as estimated by the HINT force field. The bars are color-coded as indicated in the legend.
This report is the first part in an effort to decode the molecular features leading to protein-DNA recognition. The interaction between these two biomacromolecules is an essential component of the machinery of life. Here we have demonstrated that our modeling experiments, using the empirical HINT free energy force field with a measured incorporation of critical water molecules, give more than acceptable estimates (± 1.28 kcal mol-1) of the free energy of binding. In addition, we have identified a set of traits based on Rank for water molecules that impact binding specificity. The count, orientation and binding strength of this set of water molecules are far more dependent on the chemical nature of the protein amino acid side-chains than on features of the DNA bases. In a forthcoming work, we will explore the specific match-ups of protein amino acid residues and DNA nucleotide bases by their types, with confidence that our computational approach is representative of actual binding free energy, and with these guidelines for the inclusion of relevant water molecules in our models.

Protein-DNA data set
The protein-DNA data set was selected from the available structures in the Protein Data Bank [58]. While there are 123 unique structures in the PDB, many do not have reliable protein-DNA dissociation constants for exactly the same complex, are of poor resolution, and/or have missing residues or bases due to disorder or other experimental factors. The structures of the remaining thirty-nine protein-DNA complexes, solved at a resolution better than 2.90 Å (28 complexes at better than 2.50 Å), were retrieved from the PDB and are listed in Table 1. Twenty-one structures are monomeric proteins interacting with double-stranded DNA, while eighteen structures are homodimeric and heterodimeric proteins complexed with palindromic double-stranded DNA. When only the monomeric, single-stranded structure was available in the PDB because of crystallographic symmetry, the actual biological complex (i.e., dimeric protein and double-stranded DNA) was obtained from the Nucleic Acid Database (http://ndbserver.rutgers.edu). The 1jkr and 1jko structures are protein mutants of the 1hcr DNA-native protein complex. Analogously, 1jk1 and 1jk2 are mutants of the 1aay DNA-native complex, and 1t9j, 1t9i and 1u0c are mutants of 1g9y. Only non-covalent complexes with four or more base pairs in the polynucleotide strand were included in the data set. PDB files characterized by anomalous DNA structure, non-classical bases or anomalous base-base coupling were not considered. Moreover, only complexes for which published experimental dissociation constant (Kd) values are available were retained.
In particular, to avoid misleading correlations between experimental and computational results, a structure of a particular protein-DNA complex was included in the data set only when the DNA sequence used for the experimental assay was completely coincident with the sequence of the crystallized complex, and when, at least, the same protein domain involved in DNA recognition was used in both binding and crystallographic experiments. When small differences between the DNA sequences used in Kd determination and crystallization experiments were observed, those complexes were included in our analysis only if the divergent bases were not directly involved in the protein-DNA recognition and association.
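Because the data set is filtered on the availability of dissociation constants, it may help to recall how such a constant maps to a binding free energy through the standard relation ΔG = RT ln(Kd). The sketch below is a generic illustration, not the authors' procedure: the 1 M standard state, the default temperature of 298.15 K, the function name and the example Kd are all assumptions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    Uses dG = R * T * ln(Kd / 1 M). The 1 M standard state and the default
    temperature are assumptions, not values taken from the paper.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


# Hypothetical example: a complex with a 1 nM dissociation constant.
dg = binding_free_energy(1e-9)
print(f"{dg:.2f} kcal/mol")
```

A nanomolar Kd gives a value of roughly -12 kcal mol-1 under these assumptions, which falls inside the -9 to -17 kcal mol-1 range quoted for protein-DNA binding.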

Model building
All complexes were modeled with Sybyl version 7.0 [75]. The structures were carefully checked and corrected for chemically consistent atom and bond type assignment. Hydrogen atoms, not normally detected with common X-ray diffraction techniques, were computationally added, using the Sybyl Biopolymer and Build/Edit menu tools.
To avoid steric clashes, added hydrogen atoms were then energy minimized using the Powell algorithm, with a convergence gradient of 0.5 kcal mol-1 Å-1 for 1500 cycles, while fixing all heavy atom positions.

Hydropathic analysis
Hydropathic analyses were carried out with the HINT software [75], using a locally modified version 3.09Sβ [76], as previously reported [29,42,43,44]. All partition calculations (where atomic HINT constants are assigned based on LogPo/w) were performed using the dictionary option for both proteins and nucleic acid sequences [77]. In this work the ionization states of protein residues and DNA nucleotides were not modified, i.e., the default protonation models (ca. pH 7) of Sybyl were kept. Because the interactions between proteins and nucleic acids are mainly electrostatic and H-bond based, the 'essential' option, which treats only the polar hydrogen atoms explicitly, was chosen as partition mode. A new HINT option that corrects the Si terms for backbone amide nitrogens and hydrogens [78] by adding 20 Å2 was used in this study. This correction improves the relative energetics of inter- and intra-molecular hydrogen bonds involving backbone amides.

Energetic contribution of water molecules
Water molecules crystallographically placed at the protein-DNA interface in a 4 Å range were automatically optimized and scored, using the "optimize bridging waters" and the "water accounting" options, implemented in the 3.09Sβ HINT version. For all of the succeeding calculations, each water was treated as an individual static molecule, and no statistical mechanical averaging on dynamics simulation trajectories was performed. During HINT optimization, the crystallographically-determined oxygen atom is allowed to translate at most 0.1 Å around its original position. HINT scores involving water are calculated as if each water molecule is a "ligand" interacting with the surrounding biomolecules acting in concert as a "receptor". Next, the "optimize water network" option was applied on crystallographic waters within 4 Å of both atoms of the protein and atoms of DNA using the geometry-based Rank algorithm [44,60]. Rank is able to predict the weighted number of potential hydrogen bonds formed by each water molecule with both the protein and the DNA sequence. During the optimization process the water hydrogen atoms are allowed to adopt all possible positions in order to maximize hydrogen bonds and acid/base interactions, and to minimize unfavorable hydrophobic/polar or acid/acid contacts; i.e., the process is exhaustive. Only waters exhibiting Rank values greater than 0 with both protein and nucleic acid are considered bridging water molecules [44]. Waters forming hydrogen bonds with only the protein, only the DNA, or neither are considered waters of solvation that are not involved in the binding event and are presumed not to be essential to the energetics of complex formation. Therefore, for each analyzed complex, the contribution given by waters characterized by Rank > 0 was calculated and added to the protein-DNA HINT score, i.e., $H_{\text{TOT}} = H_{\text{protein-DNA}} + H_{\text{protein-water}} + H_{\text{DNA-water}}$.
Even though the Rank algorithm allows each water molecule to act as donor with at most two hydrogen bond acceptors and as acceptor with at most two hydrogen bond donors, Rank should be interpreted only loosely as a count of hydrogen bonds. In previous analyses performed on protein-ligand complexes [44], Rank greater than four was associated with very locked and stable water molecules. Thus, in this work, bridging waters with total Rank ≥ 4 were identified for special consideration (see Results and Discussion).
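The selection rule and the score summation described in this section can be sketched as follows. This is a minimal illustration, not HINT code: the `Water` record, its field names and the numerical values are hypothetical, and the total Rank is taken as the sum of the two partial Ranks, which is an inference from the averages reported in this work (e.g., 3.0 + 1.5 = 4.5).

```python
from dataclasses import dataclass


@dataclass
class Water:
    """Hypothetical record for one optimized interface water."""
    rank_protein: float  # partial Rank with respect to the protein
    rank_dna: float      # partial Rank with respect to the DNA
    h_protein: float     # HINT score, protein-water
    h_dna: float         # HINT score, DNA-water

    @property
    def total_rank(self):
        # Assumed additive: total Rank = sum of the partial Ranks.
        return self.rank_protein + self.rank_dna

    @property
    def is_bridging(self):
        # Bridging requires a nonzero partial Rank with both partners.
        return self.rank_protein > 0 and self.rank_dna > 0


def total_hint_score(h_protein_dna, waters, min_total_rank=0.0):
    """Add the water terms of the selected bridging waters to H(protein-DNA)."""
    selected = [
        w for w in waters
        if w.is_bridging and w.total_rank >= min_total_rank
    ]
    return h_protein_dna + sum(w.h_protein + w.h_dna for w in selected)


# Made-up numbers, for illustration only.
waters = [
    Water(2.0, 2.5, 150.0, 300.0),  # bridging, total Rank 4.5
    Water(1.0, 1.5, 50.0, 80.0),    # bridging, total Rank 2.5
    Water(0.0, 3.0, 0.0, 200.0),    # contacts DNA only: water of solvation
]
h_all = total_hint_score(1000.0, waters)
h_rank4 = total_hint_score(1000.0, waters, min_total_rank=4.0)
print(h_all, h_rank4)
```

Passing `min_total_rank=4.0` restricts the sum to the waters singled out for special consideration, while the default keeps every water with Rank > 0 toward both partners.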

Identification of water molecules mediating specific protein-DNA recognition
Some water molecules are specific mediators of recognition between protein and DNA. To isolate specific interactions between protein and base atoms, the phosphate and ribose groups were excluded from the HINT partition. Again, water molecules found in a 4 Å range at the protein-DNA interface with Rank > 0 were optimized, scored and Ranked only with respect to protein residues and DNA bases. These waters, potentially significant for specific recognition and association, were classified as bridging or non-bridging. A further constraint is that bridging waters must mediate interactions between groups that are too far apart to contact each other directly. The bridging waters were divided into three different classes: (I) waters bridging DNA bases and protein amino acid residue side-chains, (II) waters bridging DNA bases and the protein backbone, and (III) waters bridging DNA bases and both protein side-chain and backbone atoms. Specific mean HINT scores and Ranks were determined for each category, paying particular attention to side-chain bridging waters, the only ones that should be able to mediate specific recognition. HINT score and Rank values diagnostic of the three classes were calculated in order to identify essential water molecules in new protein-DNA complexes.
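The three-way classification can be expressed as a small decision function. The sketch below is only illustrative: the boolean inputs are assumed to come from an upstream contact analysis (for example, nonzero partial Rank toward each group), and the names `BridgeClass` and `classify_bridging_water` are hypothetical rather than part of HINT.

```python
from enum import Enum


class BridgeClass(Enum):
    SIDE_CHAIN = "I"   # DNA base <-> protein side-chain
    BACKBONE = "II"    # DNA base <-> protein backbone
    BOTH = "III"       # DNA base <-> side-chain and backbone


def classify_bridging_water(contacts_base, contacts_side_chain,
                            contacts_backbone, partners_in_direct_contact=False):
    """Return the bridging class, or None when the water is non-bridging.

    A water is treated as non-bridging if it does not contact a base, does
    not contact the protein, or only mediates groups that already touch.
    """
    if not contacts_base or partners_in_direct_contact:
        return None
    if contacts_side_chain and contacts_backbone:
        return BridgeClass.BOTH
    if contacts_side_chain:
        return BridgeClass.SIDE_CHAIN
    if contacts_backbone:
        return BridgeClass.BACKBONE
    return None


# Example: a water touching a base and a side-chain only.
example = classify_bridging_water(True, True, False)
print(example.value)
```

Returning `None` for the non-bridging case keeps the three classes reserved for waters that actually link a base to the protein.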