Analysis of the impact of solvent on contacts prediction in proteins

Background The correlated mutations concept is based on the assumption that interacting protein residues coevolve, so that a mutation in one of the interacting counterparts is compensated by a mutation in the other. Approaches based on this concept have been widely used for protein contacts prediction since the 90s. Previously, we have shown that water-mediated interactions play an important role in protein interfaces. We have observed that current "dry" correlated mutations approaches might not properly predict certain interactions in protein interfaces due to the fact that they are water-mediated. Results The goal of this study has been to analyze the impact of including solvent into the concept of correlated mutations. For this purpose we use linear combinations of the predictions obtained by the application of two different similarity matrices: a standard "dry" similarity matrix (DRY) and a "wet" similarity matrix (WET) derived from all water-mediated protein interfacial interactions in the PDB. We analyze two datasets containing 50 domains and 10 domain pairs from PFAM and compare the results obtained by using a combination of both matrices. We find that for both intra- and interdomain contacts predictions the introduction of a combination of a "wet" and a "dry" similarity matrix improves the predictions in comparison to the "dry" one alone. Conclusion Our analysis, despite the complexity of its possible general applicability, opens up that the consideration of water may have an impact on the improvement of the contact predictions obtained by correlated mutations approaches.


Background
The correlated mutations concept was introduced in the 90s [1][2][3][4] and has been widely used for protein contacts prediction [5]. The method is based on the assumption that interacting protein residues co-evolve, so that a mutation in one of the interacting counterparts is compensated by a mutation in the other. Therefore, it is possible to introduce an exchange matrix or other measures of similarity for each sequence position in a multiple sequence alignment and to use covariance (correlation coefficient) between two positions to predict if the residues at these positions may establish physical contact in 3D space, and develop contact maps. Several different similarity measures and algorithms have been implemented in the concept of correlated mutations [5][6][7]. Most exchange matrices are based either on physico-chemical properties of amino acids or on statistical data on the substitutions obtained from multiple sequence alignments [8]. Statistically it is clear that the distribution of distances between the residues at highly correlated positions is shifted towards lower values compared to the distance distribution of all residues. This has been demonstrated in the study of correlated mutations for residues within one protein domain (intradomain), for residues from different domains in multidomain proteins (interdomain intraprotein) [9,10] and in transmembrane proteins [11]. At the same time, attempts to use the concept of correlated mutations to predict thermodynamically coupled residues have suggested that the method is successful only for residues in evolutionary constrained positions [12].
The concept of correlated mutations has been intensively developed recently. The implementation of neural nets into algorithms of contact predictions has allowed to substantially improve the accuracy of the methods in a number of studies [13][14][15][16]. Also the application of filtering procedures such as the similarity of sequences in a dataset and the number of sequences in multiple sequence alignments, introduction of weights for physicochemical properties of the residue pairs and creation of sub-multiple sequence alignments were successfully used to increase a true positive ratio of contact predictions [17]. Nowadays, different correlated mutations based approaches yield predictions accuracies in the range of 0.1-0. 4 [17] but they are still of little use in the ab initio prediction of protein structure [7].
Previously, we have shown that water-mediated interactions play an important role in protein interfaces [18,19]. In particular, we observed that the interfacial residues interacting only through one water molecule (wet spots) are more similar in terms of dynamic and energetic properties to residues in the core of proteins than to residues on the protein surface. Moreover, in our studies interfacial water molecules show significantly longer residence times than water molecules on the protein surface or in bulk solvent, and have been shown to give an indispensable energetic impact on complex formation [19]. In other studies it has been demonstrated that inclusion of solvent term into the Hamiltonian of protein systems has improved folding predictions compared to in vacuo folding models [20]. Also consideration of solvent explicitly in protein docking approaches has recently shown promising results [21]. In addition, we have observed that water molecules in protein interfaces may contribute to the conservation of interactions by allowing more sequence variability in the interacting partners. In particular, we have observed water-mediated interactions in protein complex interfaces that are not predicted by "dry" correlated mutations approaches [19]. Interestingly, in one of the recent studies on correlated mutations, protein contacts prediction has been shown to be more accurate for protein cores than for the whole protein [22]. This could be partly explained by a higher conservation of residue contacts in protein cores, especially the hydrophobic ones [23] and probably also by the fact that the participation of solvent in protein contacts is being ignored.
The goal of this study has been to analyze the impact of including solvent into the concept of correlated mutations. For this purpose, we use a linear combination of predictions obtained by the use of two similarity matrices: a standard and widely used "dry" similarity matrix (DRY) [24] and a "wet" similarity matrix (WET) derived from data on all water-mediated protein-protein interfacial interactions in the PDB [25]. We compare the predictive results obtained with different combinations of these two similarity matrices in terms of number of correctly predicted contacts, accuracy, improvement ratio over random prediction for intradomain contacts and distributions of distances between residues in interdomain pairs.
Our results show that, despite a partial interdependence of both WET and DRY matrices, there is a clear trend pointing that a combination of these two matrices yields improved predictions over the single use of the DRY matrix for both intra-and interdomain contacts. The results obtained in this work underline the importance of water-mediated interactions in the description of proteinprotein interactions, and that implementing combinations of "dry" and "wet" matrices could possibly improve the results obtained by correlated mutations-based approaches.

Residue-solvent relations in proteins
Independently of residue types, we calculated the average ratios between the number of residues found to be in contact with water and all residues in X-ray PDB structures. A negligible difference was found between these ratios for interfaces and the whole protein (0.33 and 0.35, respectively). The ratios by residue type (Figure 1 and see additional file 1) correlate with an adjusted squared correlation coefficient R 2 = 0.90 (p-value~10 -10 ) and there is also a clear trend of residue ratios distribution in interfaces, which relates to their hydrophilic properties. This agrees with observations obtained from other datasets not including the whole PDB [26]. The better correlation between the ratios and the hydrophilicity index for interfaces compared to the whole protein (R 2 = 0.62 p-value~10 -5 and R 2 = 0.44 p-value~10 -3 , respectively) could be explained by the fact that the whole protein includes many residues in the core that are not accessible to water. This further supports the evidence that residue-solvent relations in protein interfaces are different from the ones in the proteins as a whole [18,19].

Relations between the DRY and WET similarity matrices
Both DRY and WET similarity matrices are created in a way that each column or row is a vector, which coordinates correspond to the similarity between certain amino acid residue type and other residue types. It is possible to define whether these vectors are interdependent for both matrices by application of linear regression analysis. The data obtained and averaged for all types of residues are presented in Table 1. High degree of correlation is observed for some vectors, which correspond to hydrophilic residues (excluding Thr and Tyr) and for Ile, Leu, Met, Val, suggesting that these vectors in the matrices are close to be collinear in 20-dimensional space. This can be explained by the properties of these residues. In particular, hydrophilic residues interact by electrostatic forces through their polar atoms, and water mediation in this case can only change the electrostatic forces by introducing water dipoles oriented in a way to weaken the initial electric field. For hydrophilic residues there is a correlation between hydrophilicity indexes and co-linearity of the corresponding vectors in the DRY and WET matrices, which explains also relatively low co-linearity for Tyr and Thr residues in comparison to other hydrophilic residues (additional file 2). Direct and water-mediated interactions formed by main chains of Ile, Leu, Met and Val in interfaces have been previously shown to be especially important, whereas other residues that present no correlation have been shown to predominantly participate in sidechain interactions in interfaces [18]. We conclude that the DRY and WET similarity matrices contain partially interdependent information for some of amino acid residues, and the found similarities can be explained by the physico-chemical properties of these residues.

Intradomain contacts prediction
Our dataset for intradomain contacts prediction consisted of domains of 50 PFAM protein families ( Table 2). The lengths of the reference sequences varied from 30 to 195 residues. Initially we analyzed L, L/2, L/3, L/5 and L/10 best correlated contacts for each family (L is the length of the reference sequence). The number of sequences considered for the multiple sequence alignments was in the range of 20 to 295 sequences. Previous studies have shown that accuracy (ratio between the number of correctly predicted contacts and the number of total predicted contacts) and improvement ratio over random prediction (ratio between accuracy and the probability of predicting a contact by chance) decrease with the increase of the number of analyzed contacts [4][5][6]. Table 3 shows accuracy and improvement ratio over random prediction for  = 0.5 (weight for WET matrix prediction when for DRY is 1), which corresponds to the average best accuracy obtained for different numbers of analyzed predicted contacts. The results obtained for other  values followed the same trend (data not shown). Independent of the number Water contacts of residues in PDB Figure 1 Water contacts of residues in PDB. Fractions of residues found to be in contact with water in protein interfaces (white) and in whole proteins (grey) in the PDB.  We compared the dependences on  of: i) accuracy, ii) improvement ratio over random prediction, iii) number of correctly predicted contacts (C corr ); and, since our dataset is heterogeneous (see high standard deviations in   Figure 2A, B). This means that, for these values of , introduction of the WET similarity matrix improves prediction by 20-30% on average.
Noticeably, the high values of   {10, 20} still make the predictions on average better than by the single use of the DRY matrix. For optimal value  = 0.5, absolute values of accuracy and improvement ratio over random prediction averaged for all 50 families increase by 1.4% and 0.19, respectively, in comparison to the single use of the DRY similarity matrix. For each family in the dataset there is an essentially higher increase of accuracy and improvement ratio over random prediction than on average. In some families, wet prediction ratio is improved more than twice (reference structures 1AF7, 1PDA, 8PAZ, 1DMR, 1AS0) and even 4.5 times (reference structure 1WVH) when  > 0. Our results show a significant improvement (20-30% of increase in wet prediction ratio) in predictions by the introduction of the WET similarity matrix in comparison to the single use of the DRY matrix within a correlated mutations approach. We observe that for sequence separations |i-j| > 6, 12, 24 our results follow the same trend.
The obtained results for  = 0.5 for different number of contacts (L, L/2, L/3, L/5, L/10) are shown in Table 4. We observe that the best predictions correspond to  = 0.2 and 0.5 for most of sequence separation values and L is the length of the reference sequence. The value  = 0.5 has been used.
Dependence on  of relative prediction characteristics for the intradomain dataset

Interdomain contacts prediction
The interdomain dataset used for our studies consisted of 10 different pairs of interacting domains (Table 5). From the analysis of the (L 1 +L 2 )/2 predicted interdomain residue contacts (L 1 and L 2 are the lengths of the sequences in each of the two domains) we observed that in 9 out of 10 cases best predictions in terms of X d were obtained when both the WET and DRY matrices were used. Relative  (Table 5). These results point out that the use of the WET similarity matrix might improve the statistic X d in comparison to the single use of the DRY similarity matrix.
Dependence of relative average Xd on  for interdomain contacts prediction (Figure 3) resembles the one obtained for intradomain prediction ( Figure 2B) but they differ in the optimal  and in the Xd corresponding to the higher  values. While in predictions of intradomain contacts all values of  > 0 lead to the improvement of contact predictions, in the case of interdomain contacts prediction the use of the WET similarity matrix yields higher Xd than the DRY alone when   {0.1,0.2}. This might be due to the differences in distance distributions between the analyzed pairs of residues, which are closer to each other in the case of intradomain contacts. Nevertheless, introduction of the WET similarity matrix improves contact prediction compared to the single use of the DRY similarity matrix for Proportion of residue pairs at distance bins for the interac-tion SH2-SH3  L is the length of the reference sequence. R is improvement over random prediction. The value  = 0.5 has been used.
Predictions for interdomain dataset Figure 3 Predictions for interdomain dataset. Relative harmonic weighted difference statistic (X d ) dependence on .
both intra-and interdomain contacts. Although there are still significant limitations for practical use of the correlated mutations approach for interdomain contacts prediction, also mentioned by other authors [5,9], we believe that consideration of water by the use of "wet" similarity matrices could improve the results obtained by correlated mutations approaches.

Conclusion
This study is the first investigating the impact of inclusion of solvent into the concept of correlated mutations. With this work we further demonstrate our previous observations that relations between solvent and protein residues in protein interfaces differ from those in the whole protein. Recent work on bond preferences in inter-versus intraprotein interactions highlights the different architecture of protein interfaces and their unique bond preferences [28].
Two similarity matrices have been used in this work: the McLachlan matrix as the DRY similarity matrix and a WET similarity matrix derived by statistical analysis of the frequency of water contacts by residue type in protein interfaces in the whole PDB. Analysis of the DRY and WET similarity matrices shows that they are interdependent for some residue types, which could be explained by physicochemical properties of individual amino acid residues. We analyze two datasets containing 50 domains and 10 domain pairs belonging to PFAM families. We sum the predictions obtained by the use of both matrices with different weight coefficients and find optimal combinations for best predictions. Our datasets are heterogeneous to propose one best weight value to be able to apply the optimized method to all domain families; however, the prediction of contacts obtained by the introduction of the WET similarity matrix is improved for most of the families in the datasets (for both intra-and interdomain) as well as on average (by 20-30%). Our analysis of solvent impact on contact prediction in proteins suggests that further development of the correlated mutations concept would benefit from taking into account solvent as an active participant in protein-protein interactions, which is usually overlooked in these studies.

Dataset and multiple sequence alignments
We based the generation of our dataset on previous similar studies [4,9,22]. Our dataset includes 50 domains and 10 domain pairs extracted from the PFAM database [29]. Consecutive increase of the size of our dataset for intradomain contacts did not significantly change our results.
For most of the families, only seed sequences were used, except for the cases when the number of seed sequences was less than 20. Datasets with a smaller number of sequences are not supposed to be useful in correlated mutations analysis [22]. The reference sequence (corresponding to the structure used for predictions evaluation)

Source and analysis of atomic data on protein structures
An in-house relational database of protein structures (XMLRPDB) and the SCOWLP database [25,27] were used to obtain interaction information including solvent from X-ray structures in the PDB.

Contact definition
Residue contacts in a reference structure were defined by following the physico-chemical criteria from SCOWLP [27]. We considered a 3.2 Å donor-acceptor distance for hydrogen bonds, 4 Å for salt bridges, and van der Waals radii for van der Waals interactions.

Similarity matrices
We used the McLachlan similarity matrix (based on structural and genetic similarities of amino acids) as a "dry" matrix (DRY) [24]. To build a "wet" matrix (WET) we extracted information on protein interfacial residues and solvent from all available X-ray PDB structures using the SCOWLP database [25,27]. In this database, three classes of interacting residues are defined based on their interactions: dry (direct interaction), dual (direct and water-mediated interactions), and wet spots (residues interacting only through one water molecule). For each type of amino acid residue the probability of participation in water-mediated interactions (by establishing hydrogen bond by main chain or side chain) in protein interfaces was calculated as: (Figure 1), where i corresponds to any of the 20 amino acids; N i, w is the number of the residues of this type forming wet spots or dual interactions; and N i, total is the total number of residues of this type participating in interfaces in all PDB structures. Each element of the WET similarity matrix was then defined as: where i and j correspond to any of the 20 amino acids.
The fact that for the creation of the wet matrix we take low resolution structures containing either none or few water molecules into account when considering the whole PDB does not bias the WET matrix because it affects each probability proportionally.

Correlation coefficient calculations
For both DRY and WET similarity matrices the corresponding covariance matrices were calculated as previously described (Göbel et al 1994) using the formula: , where N is the number of sequences; i and j are sequence position numbers; S ikl is a value from the similarity matrix (DRY or WET); S i is the mean of S ikl ;  i is the standard deviation of S ikl ; and W kl is a weight matrix defined as: , where L is the sequence length; R ik and R il are the residue types at position i in the sequences k and l, respectively; and  is Kronecker delta [31].
For the interdomain dataset the weight matrix W kl was calculated as an average for the domains and weighted by sequence length. The positions with more than 10% of gaps as well as completely conserved positions were not included in the calculations (zero was assigned to the corresponding correlation coefficient). After calculating covariance matrices based on the DRY and WET similarity matrices, we built their linear combinations: r ij = r ij DRY + ·r ij WET , where  takes values from {0, 0.1, 0.2, 0.5, 1, 2, 4, 10, 20}, so that the weight ratio between the impact of DRY and WET represents the range from completely dry ( = 0) to extremely WET-biased covariance ( = 20).

Evaluation of intradomain predictions
For evaluation of intradomain contacts predictions we used previously described methodology [4]. Sequence separation of 0, 6, 12 and 24 was used. Prediction accuracy was defined as the ratio between the number of correctly predicted contacts (C corr ) and total number of predicted contacts (C tot ). Random accuracy corresponds to the probability of correct prediction of the contact by chance and is equal to the ratio between experimentally observed contacts (C obs ) and maximum number of possible contacts. The ratio between accuracy and random accuracy was introduced as improvement ratio over random prediction. Wet prediction ratio is equal to accuracy normalized by the accuracy obtained by using only the DRY matrix ( = 0). For the reference structures C corr was taken as the number of contacts defined by SCOWLP criteria (see the Contact definition section in Methods).

Distance calculation and harmonic average (X d )
In the analysis of interdomain contacts the accuracy calculated in the same way as for intradomain contacts (typical value C obs~1 0 2 ) is expected to be at least one order of magnitude lower (typical value C obs~1 0 1 ). That is why comparison of accuracy, improvement ratio over random prediction and C corr as functions of  is not appropriate in this case. It has been shown that the distribution of distances between the correlated pairs is shifted to lower values compared to the distribution of distances for all residue pairs in two domains [9]. In our study we use a harmonic weighted difference statistic X d described before [9]: , where n is the number of distance bins; d i is the upper limit for each bin normalized to the maximum value of the distributed distances; P ic is the percentage of the analyzed correlated pairs at the distances between d i and d i-1 ; and P ia is the same percentage for all pairs of residues. The width of bin was 4 Å. The higher the X d value, the more successful a prediction is.
Different definitions for the distance between residues resulted in all cases in the same trends and quantitatively only slightly affected X d values. For interdomain pairs we used distances between the centers of mass of residues in order not to be biased to either main-chain or side-chain contacts.
For X d calculations we took the best L/2 contacts for intradomain and (L 1 +L 2 )/2 contacts for interdomain contact predictions, where L 1 and L 2 are the reference sequences of the two interacting domains.
Although both the wet prediction ratio and X d characterize the predictive power of the method, it is irrelevant to compare the results obtained for these parameters with each other. The same applies to  values corresponding to best predictions.

Statistical analysis
Statistical analysis of data was carried out with the R-package [32].

Authors' contributions
SAS developed and implemented the WET similarity matrix and performed all the analysis. JT obtained the data from SCOWLP used for this work. GA obtained the data from XMLRPDB used for this work. SAS and MTP wrote the manuscript. MTP designed and supervised the project. All authors have read and approved the final manuscript.