- Research article
- Open Access
- Published:

# Analysis of the impact of solvent on contacts prediction in proteins

*BMC Structural Biology*
**volume 9**, Article number: 22 (2009)

## Abstract

### Background

The correlated mutations concept is based on the assumption that interacting protein residues coevolve, so that a mutation in one of the interacting counterparts is compensated by a mutation in the other. Approaches based on this concept have been widely used for protein contacts prediction since the 90s. Previously, we have shown that water-mediated interactions play an important role in protein interfaces. We have observed that current "dry" correlated mutations approaches might not properly predict certain interactions in protein interfaces due to the fact that they are water-mediated.

### Results

The goal of this study has been to analyze the impact of including solvent into the concept of correlated mutations. For this purpose we use linear combinations of the predictions obtained by the application of two different similarity matrices: a standard "dry" similarity matrix (DRY) and a "wet" similarity matrix (WET) derived from all water-mediated protein interfacial interactions in the PDB. We analyze two datasets containing 50 domains and 10 domain pairs from PFAM and compare the results obtained by using a combination of both matrices. We find that for both intra- and interdomain contacts predictions the introduction of a combination of a "wet" and a "dry" similarity matrix improves the predictions in comparison to the "dry" one alone.

### Conclusion

Our analysis, despite the complexity of its possible general applicability, opens up that the consideration of water may have an impact on the improvement of the contact predictions obtained by correlated mutations approaches.

## Background

The correlated mutations concept was introduced in the 90s [1–4] and has been widely used for protein contacts prediction [5]. The method is based on the assumption that interacting protein residues co-evolve, so that a mutation in one of the interacting counterparts is compensated by a mutation in the other. Therefore, it is possible to introduce an exchange matrix or other measures of similarity for each sequence position in a multiple sequence alignment and to use covariance (correlation coefficient) between two positions to predict if the residues at these positions may establish physical contact in 3D space, and develop contact maps. Several different similarity measures and algorithms have been implemented in the concept of correlated mutations [5–7]. Most exchange matrices are based either on physico-chemical properties of amino acids or on statistical data on the substitutions obtained from multiple sequence alignments [8]. Statistically it is clear that the distribution of distances between the residues at highly correlated positions is shifted towards lower values compared to the distance distribution of all residues. This has been demonstrated in the study of correlated mutations for residues within one protein domain (intradomain), for residues from different domains in multidomain proteins (interdomain intraprotein) [9, 10] and in transmembrane proteins [11]. At the same time, attempts to use the concept of correlated mutations to predict thermodynamically coupled residues have suggested that the method is successful only for residues in evolutionary constrained positions [12].

The concept of correlated mutations has been intensively developed recently. The implementation of neural nets into algorithms of contact predictions has allowed to substantially improve the accuracy of the methods in a number of studies [13–16]. Also the application of filtering procedures such as the similarity of sequences in a dataset and the number of sequences in multiple sequence alignments, introduction of weights for physico-chemical properties of the residue pairs and creation of sub-multiple sequence alignments were successfully used to increase a true positive ratio of contact predictions [17]. Nowadays, different correlated mutations based approaches yield predictions accuracies in the range of 0.1–0.4 [17] but they are still of little use in the *ab initio* prediction of protein structure [7].

Previously, we have shown that water-mediated interactions play an important role in protein interfaces [18, 19]. In particular, we observed that the interfacial residues interacting only through one water molecule (wet spots) are more similar in terms of dynamic and energetic properties to residues in the core of proteins than to residues on the protein surface. Moreover, in our studies interfacial water molecules show significantly longer residence times than water molecules on the protein surface or in bulk solvent, and have been shown to give an indispensable energetic impact on complex formation [19]. In other studies it has been demonstrated that inclusion of solvent term into the Hamiltonian of protein systems has improved folding predictions compared to *in vacuo* folding models [20]. Also consideration of solvent explicitly in protein docking approaches has recently shown promising results [21]. In addition, we have observed that water molecules in protein interfaces may contribute to the conservation of interactions by allowing more sequence variability in the interacting partners. In particular, we have observed water-mediated interactions in protein complex interfaces that are not predicted by "dry" correlated mutations approaches [19]. Interestingly, in one of the recent studies on correlated mutations, protein contacts prediction has been shown to be more accurate for protein cores than for the whole protein [22]. This could be partly explained by a higher conservation of residue contacts in protein cores, especially the hydrophobic ones [23] and probably also by the fact that the participation of solvent in protein contacts is being ignored.

The goal of this study has been to analyze the impact of including solvent into the concept of correlated mutations. For this purpose, we use a linear combination of predictions obtained by the use of two similarity matrices: a standard and widely used "dry" similarity matrix (DRY) [24] and a "wet" similarity matrix (WET) derived from data on all water-mediated protein-protein interfacial interactions in the PDB [25]. We compare the predictive results obtained with different combinations of these two similarity matrices in terms of number of correctly predicted contacts, accuracy, improvement ratio over random prediction for intradomain contacts and distributions of distances between residues in interdomain pairs.

Our results show that, despite a partial interdependence of both WET and DRY matrices, there is a clear trend pointing that a combination of these two matrices yields improved predictions over the single use of the DRY matrix for both intra- and interdomain contacts. The results obtained in this work underline the importance of water-mediated interactions in the description of protein-protein interactions, and that implementing combinations of "dry" and "wet" matrices could possibly improve the results obtained by correlated mutations-based approaches.

## Results and discussion

### Residue-solvent relations in proteins

Independently of residue types, we calculated the average ratios between the number of residues found to be in contact with water and all residues in X-ray PDB structures. A negligible difference was found between these ratios for interfaces and the whole protein (0.33 and 0.35, respectively). The ratios by residue type (Figure 1 and see additional file 1) correlate with an adjusted squared correlation coefficient R^{2} = 0.90 (p-value~10^{-10}) and there is also a clear trend of residue ratios distribution in interfaces, which relates to their hydrophilic properties. This agrees with observations obtained from other datasets not including the whole PDB [26]. The better correlation between the ratios and the hydrophilicity index for interfaces compared to the whole protein (R^{2} = 0.62 p-value~10^{-5} and R^{2} = 0.44 p-value~10^{-3}, respectively) could be explained by the fact that the whole protein includes many residues in the core that are not accessible to water. This further supports the evidence that residue-solvent relations in protein interfaces are different from the ones in the proteins as a whole [18, 19].

### Relations between the DRY and WET similarity matrices

Both DRY and WET similarity matrices are created in a way that each column or row is a vector, which coordinates correspond to the similarity between certain amino acid residue type and other residue types. It is possible to define whether these vectors are interdependent for both matrices by application of linear regression analysis. The data obtained and averaged for all types of residues are presented in Table 1. High degree of correlation is observed for some vectors, which correspond to hydrophilic residues (excluding Thr and Tyr) and for Ile, Leu, Met, Val, suggesting that these vectors in the matrices are close to be collinear in 20-dimensional space. This can be explained by the properties of these residues. In particular, hydrophilic residues interact by electrostatic forces through their polar atoms, and water mediation in this case can only change the electrostatic forces by introducing water dipoles oriented in a way to weaken the initial electric field. For hydrophilic residues there is a correlation between hydrophilicity indexes and co-linearity of the corresponding vectors in the DRY and WET matrices, which explains also relatively low co-linearity for Tyr and Thr residues in comparison to other hydrophilic residues (additional file 2). Direct and water-mediated interactions formed by main chains of Ile, Leu, Met and Val in interfaces have been previously shown to be especially important, whereas other residues that present no correlation have been shown to predominantly participate in side-chain interactions in interfaces [18]. We conclude that the DRY and WET similarity matrices contain partially interdependent information for some of amino acid residues, and the found similarities can be explained by the physico-chemical properties of these residues.

### Intradomain contacts prediction

Our dataset for intradomain contacts prediction consisted of domains of 50 PFAM protein families (Table 2). The lengths of the reference sequences varied from 30 to 195 residues. Initially we analyzed L, L/2, L/3, L/5 and L/10 best correlated contacts for each family (L is the length of the reference sequence). The number of sequences considered for the multiple sequence alignments was in the range of 20 to 295 sequences. Previous studies have shown that accuracy (ratio between the number of correctly predicted contacts and the number of total predicted contacts) and improvement ratio over random prediction (ratio between accuracy and the probability of predicting a contact by chance) decrease with the increase of the number of analyzed contacts [4–6]. Table 3 shows accuracy and improvement ratio over random prediction for α = 0.5 (weight for WET matrix prediction when for DRY is 1), which corresponds to the average best accuracy obtained for different numbers of analyzed predicted contacts. The results obtained for other α values followed the same trend (data not shown). Independent of the number of analyzed contacts the best predictions in average did not correspond to α = 0. The obtained values for accuracy and improvement ratio over random prediction are within the ranges obtained by other correlated mutations approaches [17, 22]. However, direct quantitative comparison of these methods is not appropriate because of their substantial differences in their residue contacts definitions. In particular, some of these approaches utilize for contact definition (see contact definition in Methods section) a chosen distance cut-off of 6–8 Å between atoms [4, 16, 17], whereas we use physico-chemical properties of protein residues, which results in a ≤ 4 Å cut-off [27].

We compared the dependences on *α* of: i) accuracy, ii) improvement ratio over random prediction, iii) number of correctly predicted contacts (C_{corr}); and, since our dataset is heterogeneous (see high standard deviations in Table 3), we normalized these parameters by the corresponding values at *α* = 0 (wet prediction ratio). For the purpose of wet prediction ratio comparison at different values of *α* we found L/2 to be the most appropriate number of contacts. This choice is explained by the fact that the changes in prediction results influenced by α variation become hardly detectable if a smaller number of contacts (C_{total}) is considered for analysis since these changes are limited by low values of C_{total} and, consequently, of correctly predicted contacts (C_{corr}). On the other hand, the increase of C_{total} generally leads to decrease of prediction accuracy and to negligible differences in prediction results corresponding to different *α* values. Only in 2 out of the 50 families of our dataset best predictions correspond to *α* = 0 values (Table 2). Maximum values for wet prediction ratio and relative *X*_{
d
}(harmonic weighted difference statistic) averaged for the whole dataset are obtained when *α* = 0.5 and *α* = 1 (1.19 and 1.29, respectively; Figure 2A, B). This means that, for these values of *α*, introduction of the WET similarity matrix improves prediction by 20–30% on average. Noticeably, the high values of *α* ∈ {10, 20} still make the predictions on average better than by the single use of the DRY matrix. For optimal value *α* = 0.5, absolute values of accuracy and improvement ratio over random prediction averaged for all 50 families increase by 1.4% and 0.19, respectively, in comparison to the single use of the DRY similarity matrix. For each family in the dataset there is an essentially higher increase of accuracy and improvement ratio over random prediction than on average. In some families, wet prediction ratio is improved more than twice (reference structures 1AF7, 1PDA, 8PAZ, 1DMR, 1AS0) and even 4.5 times (reference structure 1WVH) when *α* > 0. Our results show a significant improvement (20–30% of increase in wet prediction ratio) in predictions by the introduction of the WET similarity matrix in comparison to the single use of the DRY matrix within a correlated mutations approach. We observe that for sequence separations |i-j| > 6, 12, 24 our results follow the same trend. The obtained results for *α* = 0.5 for different number of contacts (L, L/2, L/3, L/5, L/10) are shown in Table 4. We observe that the best predictions correspond to *α* = 0.2 and 0.5 for most of sequence separation values and number of contacts. Wet prediction ratios for the whole range of analyzed *α* are presented in a figure in supplementary material (additional file 3). In all cases, independently of sequence separation and number of contacts, the best predictions correspond to *α* > 0.

### Interdomain contacts prediction

The interdomain dataset used for our studies consisted of 10 different pairs of interacting domains (Table 5). From the analysis of the (L_{1}+L_{2})/2 predicted interdomain residue contacts (L_{1} and L_{2} are the lengths of the sequences in each of the two domains) we observed that in 9 out of 10 cases best predictions in terms of *X*_{
d
}were obtained when both the WET and DRY matrices were used. Relative *X*_{
d
}averaged for the whole dataset reaches a maximum value of 1.32 at *α* = 0.2 and then decreases with the further increase of *α* (Figure 3). In one of the examples (SH2-SH3 domains interaction) the differences of distance distributions for different α values are dramatic (Figure 4). In this case the *X*_{
d
}value for predicted contacts at *α* = 0 and *α* = 0.2 changes almost twice (Table 5). These results point out that the use of the WET similarity matrix might improve the statistic *X*_{
d
}in comparison to the single use of the DRY similarity matrix.

Dependence of relative average *X*_{
d
}on *α* for interdomain contacts prediction (Figure 3) resembles the one obtained for intradomain prediction (Figure 2B) but they differ in the optimal *α* and in the *X*_{
d
}corresponding to the higher *α* values. While in predictions of intradomain contacts all values of *α* > 0 lead to the improvement of contact predictions, in the case of interdomain contacts prediction the use of the WET similarity matrix yields higher *X*_{
d
}than the DRY alone when *α* ∈ {0.1,0.2}. This might be due to the differences in distance distributions between the analyzed pairs of residues, which are closer to each other in the case of intradomain contacts. Nevertheless, introduction of the WET similarity matrix improves contact prediction compared to the single use of the DRY similarity matrix for both intra- and interdomain contacts. Although there are still significant limitations for practical use of the correlated mutations approach for interdomain contacts prediction, also mentioned by other authors [5, 9], we believe that consideration of water by the use of "wet" similarity matrices could improve the results obtained by correlated mutations approaches.

## Conclusion

This study is the first investigating the impact of inclusion of solvent into the concept of correlated mutations. With this work we further demonstrate our previous observations that relations between solvent and protein residues in protein interfaces differ from those in the whole protein. Recent work on bond preferences in inter- *versus* intraprotein interactions highlights the different architecture of protein interfaces and their unique bond preferences [28].

Two similarity matrices have been used in this work: the McLachlan matrix as the DRY similarity matrix and a WET similarity matrix derived by statistical analysis of the frequency of water contacts by residue type in protein interfaces in the whole PDB. Analysis of the DRY and WET similarity matrices shows that they are interdependent for some residue types, which could be explained by physico-chemical properties of individual amino acid residues. We analyze two datasets containing 50 domains and 10 domain pairs belonging to PFAM families. We sum the predictions obtained by the use of both matrices with different weight coefficients and find optimal combinations for best predictions. Our datasets are heterogeneous to propose one best weight value to be able to apply the optimized method to all domain families; however, the prediction of contacts obtained by the introduction of the WET similarity matrix is improved for most of the families in the datasets (for both intra- and interdomain) as well as on average (by 20–30%). Our analysis of solvent impact on contact prediction in proteins suggests that further development of the correlated mutations concept would benefit from taking into account solvent as an active participant in protein-protein interactions, which is usually overlooked in these studies.

## Methods

### Dataset and multiple sequence alignments

We based the generation of our dataset on previous similar studies [4, 9, 22]. Our dataset includes 50 domains and 10 domain pairs extracted from the PFAM database [29]. Consecutive increase of the size of our dataset for intradomain contacts did not significantly change our results.

For most of the families, only seed sequences were used, except for the cases when the number of seed sequences was less than 20. Datasets with a smaller number of sequences are not supposed to be useful in correlated mutations analysis [22]. The *reference sequence* (corresponding to the structure used for predictions evaluation) was added to the set of sequences, if this did not already contain it, following the same procedure that Eyal and co-workers used for obtaining a substitution matrix for protein structure prediction purposes [22]. Multiple sequence alignments were obtained with CLUSTALW [30]. Sequences with more than 95% of identity were not taken into account.

For the interdomain dataset the sequences from the two domain families were aligned independently. Except for the case of immunoglobulins, where light and heavy chains were used as two interacting domains, all interdomain entries in the dataset contained pairs of two different PFAM domains. Reference structures had resolution ≤ 2.0 Å except for five of them (1BU1 and 1A19 taken from the Eyal et al dataset and 2HB2, 1WMG, 1ZWW taken into account to enrich the dataset with bigger domains and highly represented families).

### Source and analysis of atomic data on protein structures

An in-house relational database of protein structures (XMLRPDB) and the SCOWLP database [25, 27] were used to obtain interaction information including solvent from X-ray structures in the PDB.

### Contact definition

Residue contacts in a reference structure were defined by following the physico-chemical criteria from SCOWLP [27]. We considered a 3.2 Å donor-acceptor distance for hydrogen bonds, 4 Å for salt bridges, and van der Waals radii for van der Waals interactions.

### Similarity matrices

We used the McLachlan similarity matrix (based on structural and genetic similarities of amino acids) as a "dry" matrix (DRY) [24]. To build a "wet" matrix (WET) we extracted information on protein interfacial residues and solvent from all available X-ray PDB structures using the SCOWLP database [25, 27]. In this database, three classes of interacting residues are defined based on their interactions: dry (direct interaction), dual (direct and water-mediated interactions), and wet spots (residues interacting only through one water molecule). For each type of amino acid residue the probability of participation in water-mediated interactions (by establishing hydrogen bond by main chain or side chain) in protein interfaces was calculated as:

*p*_{
i
}= *N*_{i, w}/*N*_{i, total}(Figure 1), where *i* corresponds to any of the 20 amino acids; *N*_{i, w}is the number of the residues of this type forming wet spots or dual interactions; and *N*_{i, total}is the total number of residues of this type participating in interfaces in all PDB structures. Each element of the WET similarity matrix was then defined as:

*WET*_{
ij
}= *1*-|*p*_{
i
}-*p*_{
j
}|, where *i* and *j* correspond to any of the 20 amino acids.

The fact that for the creation of the wet matrix we take low resolution structures containing either none or few water molecules into account when considering the whole PDB does not bias the WET matrix because it affects each probability proportionally.

### Correlation coefficient calculations

For both DRY and WET similarity matrices the corresponding covariance matrices were calculated as previously described (Göbel et al 1994) using the formula:

, where *N* is the number of sequences; *i* and *j* are sequence position numbers; *S*_{
ikl
}is a value from the similarity matrix (DRY or WET); *S*_{
i
}is the mean of *S*_{
ikl
}; σ_{i} is the standard deviation of *S*_{
ikl
}; and *W*_{
kl
}is a weight matrix defined as:

, where *L* is the sequence length; *R*_{
ik
}and *R*_{
il
}are the residue types at position *i* in the sequences *k* and *l*, respectively; and δ is Kronecker delta [31].

For the interdomain dataset the weight matrix *W*_{
kl
}was calculated as an average for the domains and weighted by sequence length. The positions with more than 10% of gaps as well as completely conserved positions were not included in the calculations (zero was assigned to the corresponding correlation coefficient). After calculating covariance matrices based on the DRY and WET similarity matrices, we built their linear combinations:

*r*_{
ij
}= *r*_{
ij DRY
}+ *α*·*r*_{
ij WET
}, where *α* takes values from {0, 0.1, 0.2, 0.5, 1, 2, 4, 10, 20}, so that the weight ratio between the impact of DRY and WET represents the range from completely dry (*α = 0*) to extremely WET-biased covariance (*α = 20*).

### Evaluation of intradomain predictions

For evaluation of intradomain contacts predictions we used previously described methodology [4]. Sequence separation of 0, 6, 12 and 24 was used. Prediction *accuracy* was defined as the ratio between the number of correctly predicted contacts (C_{corr}) and total number of predicted contacts (C_{tot}). *Random accuracy* corresponds to the probability of correct prediction of the contact by chance and is equal to the ratio between experimentally observed contacts (C_{obs}) and maximum number of possible contacts. The ratio between accuracy and random accuracy was introduced as *improvement ratio over random prediction*. *Wet prediction ratio* is equal to accuracy normalized by the accuracy obtained by using only the DRY matrix (α = 0). For the reference structures C_{corr} was taken as the number of contacts defined by SCOWLP criteria (see the Contact definition section in Methods).

### Distance calculation and harmonic average (X_{d})

In the analysis of interdomain contacts the accuracy calculated in the same way as for intradomain contacts (typical value C_{obs}~10^{2}) is expected to be at least one order of magnitude lower (typical value C_{obs}~10^{1}). That is why comparison of accuracy, improvement ratio over random prediction and C_{corr} as functions of *α* is not appropriate in this case. It has been shown that the distribution of distances between the correlated pairs is shifted to lower values compared to the distribution of distances for all residue pairs in two domains [9]. In our study we use a harmonic weighted difference statistic *X*_{
d
}described before [9]:

, where n is the number of distance bins; *d*_{
i
}is the upper limit for each bin normalized to the maximum value of the distributed distances; *P*_{
ic
}is the percentage of the analyzed correlated pairs at the distances between *d*_{
i
}and *d*_{i-1}; and *P*_{
ia
}is the same percentage for all pairs of residues. The width of bin was 4 Å. The higher the *X*_{
d
}value, the more successful a prediction is.

Different definitions for the distance between residues resulted in all cases in the same trends and quantitatively only slightly affected *X*_{
d
}values. For interdomain pairs we used distances between the centers of mass of residues in order not to be biased to either main-chain or side-chain contacts.

For *X*_{
d
}calculations we took the best L/2 contacts for intradomain and (L_{1}+L_{2})/2 contacts for interdomain contact predictions, where L_{1} and L_{2} are the reference sequences of the two interacting domains.

Although both the wet prediction ratio and *X*_{
d
}characterize the predictive power of the method, it is irrelevant to compare the results obtained for these parameters with each other. The same applies to α values corresponding to best predictions.

### Statistical analysis

Statistical analysis of data was carried out with the R-package [32].

## References

Gregoret L, Sauer R:

**Additivity of Mutant Effects Assessed by Binomial Mutagenesis.***PNAS*1993,**90**(9):4246–4250. 10.1073/pnas.90.9.4246Lee C, Levitt M:

**Accurate prediction of the stability and activity effects of site-directed mutagenesis on a protein core.***Nature*1991,**352**(6334):448–451. 10.1038/352448a0Wells JA:

**Additivity of mutational effects in proteins.***Biochemistry*1990,**29**(37):8509–8517. 10.1021/bi00489a001Göbel U, Sander C, Schneider R, Valencia A:

**Correlated mutations and residue contacts in proteins.***Proteins*1994,**18**(4):309–317. 10.1002/prot.340180402Halperin I, Wolfson H, Nussinov R:

**Correlated mutations: Advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families.***Proteins*2006,**60**(2):832–845. 10.1002/prot.20933Fodor AA, Aldrich RW:

**Influence of conservation on calculations of amino acid covariance in multiple sequence alignments.***Proteins*2004,**56**(2):211–221. 10.1002/prot.20098Horner D, Pirovano W, Pesole G:

**Correlated substitution analysis and the prediction of amino acid structural contacts.***Brief Bioinform*2007,**9**(1):46–56. 10.1093/bib/bbm052Pokarowski P, Kloczkowski A, Nowakowski S, Pokarowska M, Jernigan R, Kolinski A:

**Ideal amino acid exchange forms for approximating substitution matrices.***Proteins: Structure, Function, and Bioinformatics*2007,**69**(2):379–393. 10.1002/prot.21509Pazos F, Helmer-Citterich M, Ausiello G, Valencia A:

**Correlated mutations contain information about protein-protein interaction.***J Mol Biol*1997,**271**(4):511–523. 10.1006/jmbi.1997.1198Perez-Jimenez R, Godoy-Ruiz R, Parody-Morreale A, Ibarra-Molero B, Sanchez-Ruiz JM:

**A simple tool to explore the distance distribution of correlated mutations in proteins.***Biophys Chem*2006,**119**(3):240–246. 10.1016/j.bpc.2005.09.010Fuchs A, Martin-Galiano A, Kalman M, Fleishman S, Ben-Tal N, Frishman D:

**Co-evolving residues in membrane proteins.***Bioinformatics*2007,**23**(24):3312–3319. 10.1093/bioinformatics/btm515Fodor AA, Aldrich RW:

**On evolutionary conservation of thermodynamic coupling in proteins.***J Biol Chem*2004,**279**(18):19046–19050. 10.1074/jbc.M402560200Nagl S:

**Can correlated mutations in protein domain families be used for protein design?***Brief Bioinform*2001,**2**(3):279–288. 10.1093/bib/2.3.279Fariselli P, Olmea O, Valencia A, Casadio R:

**Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations.***Proteins: Structure, Function, and Genetics*2001,**45**(S5):157–162. 10.1002/prot.1173Shackelford G, Karplus K:

**Contact prediction using mutual information and neural nets.***Proteins: Structure, Function, and Bioinformatics*2007,**68**(8):159–164. 10.1002/prot.21791Xue B, Faraggi E, Zhou Y:

**Predicting residue-residue contact maps by a two-layer, integrated neural-network method.***Proteins*2008, in press.Kundrotas P, Alexov E:

**Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives.***BMC Bioinformatics*2006,**7:**503. 10.1186/1471-2105-7-503Teyra J, Pisabarro MT:

**Characterization of interfacial solvent in protein complexes and contribution of***wet spots***to the interface description.***Proteins: Structure, Function, and Bioinformatics*2007,**67**(4):1087–1095. 10.1002/prot.21394Samsonov S, Teyra J, Pisabarro T:

**A molecular dynamics approach to study the importance of solvent in protein interactions.***Proteins: Structure, Function, and Bioinformatics*2008,**73**(2):515–525. 10.1002/prot.22076Papoian GA, Ulander J, Eastwood MP, Luthey-Schulten Z, Wolynes PG:

**Water in protein structure prediction.***Proc Natl Acad Sci USA*2004,**101**(10):3352–3357. 10.1073/pnas.0307851100van Dijk ADJ, Bonvin AMJJ:

**Solvated docking: introducing water into the modelling of biomolecular complexes.***Bioinformatics*2006,**22**(9):2340–2347. 10.1093/bioinformatics/btl395Eyal E, Frenkel-Morgenstern M, Sobolev V, Pietrokovski S:

**A pair-to-pair amino acids substitution matrix and its applications for protein structure prediction.***Proteins*2007,**67**(1):142–153. 10.1002/prot.21223Schueler-Furman O, Baker D:

**Conserved residue clustering and protein structure prediction.***Proteins*2003,**52**(2):225–235. 10.1002/prot.10365McLachlan AD:

**Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551.***Journal of Molecular Biology*1971,**61**(2):409–424. 10.1016/0022-2836(71)90390-1Teyra J, Paszkowski-Rogacz M, Anders G, Pisabarro T:

**SCOWLP classification: Structural comparison and analysis of protein binding regions.***BMC Bioinformatics*2008,**9:**9. 10.1186/1471-2105-9-9Yan C, Wu F, Jernigan RL, Dobbs D, Honavar V:

**Characterization of protein-protein interfaces.***The protein journal*2008,**27**(1):59–70. 10.1007/s10930-007-9108-xTeyra J, Doms A, Schroeder M, Pisabarro MT:

**SCOWLP: a web-based database for detailed characterization and visualization of protein interfaces.***BMC Bioinformatics*2006.,**7**(1):Cohen M, Reichmann D, Neuvirth H, Schreiber G:

**Similar chemistry, but different bond preferences in inter versus intra-protein interactions.***Proteins*2008,**72**(2):741–753. 10.1002/prot.21960Finn RD, Mistry J, Schuster-Bäckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R,

*et al*.:**Pfam: clans, web tools and services.***Nucleic Acids Res*2006,**34**(Database issue):D247-D251. 10.1093/nar/gkj149Thompson JD, Higgins DG, Gibson TJ:

**CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.***Nucleic Acids Res*1994,**22**(22):4673–4680. 10.1093/nar/22.22.4673Sander C, Schneider R:

**Database of homology-derived protein structures and the structural meaning of sequence alignment.***Proteins*1991,**9**(1):56–68. 10.1002/prot.340090107R-package Development Core Team:

*R: a language and environment for statistical computing.*Vienna, Austria; 2006.

## Acknowledgements

Our group is funded by the Klaus Tschira Stiftung (KTS).

## Author information

### Affiliations

### Corresponding author

## Additional information

### Authors' contributions

SAS developed and implemented the WET similarity matrix and performed all the analysis. JT obtained the data from SCOWLP used for this work. GA obtained the data from XMLRPDB used for this work. SAS and MTP wrote the manuscript. MTP designed and supervised the project. All authors have read and approved the final manuscript.

## Electronic supplementary material

### 12900_2008_242_MOESM1_ESM.doc

Additional file 1: **Probabilities for residues to be in contact with water in protein interfaces**. Probabilities for residues to be in contact with water in protein interfaces. The probabilities are derived from SCOWLP data for protein interfaces. (DOC 44 KB)

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

**Open Access**
This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/2.0
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Samsonov, S.A., Teyra, J., Anders, G. *et al.* Analysis of the impact of solvent on contacts prediction in proteins.
*BMC Struct Biol* **9, **22 (2009). https://doi.org/10.1186/1472-6807-9-22

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1472-6807-9-22

### Keywords

- Similarity Matrix
- Correlate Mutation
- Residue Type
- Domain Pair
- Interfacial Residue