"Hot cores" in proteins: Comparative analysis of the apolar contact area in structures from hyper/thermophilic and mesophilic organisms

Background A wide variety of stabilizing factors have been invoked so far to elucidate the structural basis of protein thermostability. These include, amongst the others, a higher number of ion-pairs interactions and hydrogen bonds, together with a better packing of hydrophobic residues. It has been frequently observed that packing of hydrophobic side chains is improved in hyperthermophilic proteins, when compared to their mesophilic counterparts. In this work, protein crystal structures from hyper/thermophilic organisms and their mesophilic homologs have been compared, in order to quantify the difference of apolar contact area and to assess the role played by the hydrophobic contacts in the stabilization of the protein core, at high temperatures. Results The construction of two datasets was carried out so as to satisfy several restrictive criteria, such as minimum redundancy, resolution and R-value thresholds and lack of any structural defect in the collected structures. This approach allowed to quantify with relatively high precision the apolar contact area between interacting residues, reducing the uncertainty due to the position of atoms in the crystal structures, the redundancy of data and the size of the dataset. To identify the common core regions of these proteins, the study was focused on segments that conserve a similar main chain conformation in the structures analyzed, excluding the intervening regions whose structure differs markedly. The results indicated that hyperthermophilic proteins underwent a significant increase of the hydrophobic contact area contributed by those residues composing the alpha-helices of the structurally conserved regions. Conclusion This study indicates the decreased flexibility of alpha-helices in proteins core as a major factor contributing to the enhanced termostability of a number of hyperthermophilic proteins. This effect, in turn, may be due to an increased number of buried methyl groups in the protein core and/or a better packing of alpha-helices with the rest of the structure, caused by the presence of hydrophobic beta-branched side chains.

ture certainly exerts a deep selective pressure on cell biochemistry and physiology [1]. Indeed, temperatures approaching 100°C usually denature proteins and nucleic acids, and increase the fluidity of membranes to lethal levels [2]. It is therefore of great interest to study how organisms coped with the molecular adaptations required to thrive in extreme environments, particularly at high temperatures. Such organisms, which are distributed among the three domains of life, are called "thermophiles" or "hyperthermophiles", if they exhibit an optimal growth in either a 45°C -80°C or a 80°C -110°C temperature range, respectively [3].
To date, a number of studies has been carried out to understand how proteins found in hyper/thermophilic organisms are stabilized [1][2][3][4][5][6]. Thanks to the wealth of sequence and structural information available today on hyper/thermophilic proteins, it is becoming clear that there is not a general rule for the stabilization of proteins at high temperatures. Rather, an increased thermal stability seems to be achieved through a combination of different small structural modifications involving, amongst the others, ion-pairs interactions, hydrogen bonds and packing of hydrophobic residues [6].
Regarding the latter, one frequently invoked theory is that the packing of hydrophobic side chains is improved in thermophilic and hyperthermophilic proteins, when compared to their mesophilic counterparts [7]. Many studies on proteins adaptation to high temperatures focused on the differences in compactness between hyper/ thermophilic and mesophilic proteins using accessible surface area [6] or cavity size [8] as judgment criteria. However, as discussed by Robinson-Rechavi and Godzik [9], and by Gromiha [10], these approaches present several drawbacks, e.g., the individual contribution to the enhanced thermostability of different structural environments and inter-residue contacts cannot be assessed. Hence, alternative ways to quantify protein compactness were adopted. For example, Gromiha [10] analyzed the long range and inter-residue contacts in mesophilic and thermophilic proteins of sixteen different protein families, and found that an increase in contacts between hydrogen-bond forming residues increases protein stability. Very recently, the contact order [11] is receiving increasing attention, thanks to the findings obtained by Godzik and his research group [9,12], who found that hyperthermophilic proteins from T. maritima have higher contact order than their mesophilic counterparts. Most importantly, contact order is correlated to the folding rate of proteins that fold with a two-states mechanism [11].
However, a severe limitation of this and other [10,13] studies is that two residues are considered to be in contact if the distance between their C α atoms or between one atom and any other atom is below an arbitrary threshold. For example, Robinson-Rechavi et al. [12] considered two residues to be in contact if any of their atoms are closer than 4.5 Å, while Gromiha [10] made use of a sphere of 8.0 Å centered on C α atoms to define long-range contacts. Furthermore, this approach bears another important drawback: it does not permit to quantify the hydrophobic contact area between two interacting residues. The hydrophobic contact area between buried residues represents in fact an indirect measure of both entropic (entropy change due to the rearrangement of the local water molecules as two hydrophobic residues interact [14]) and enthalpic (van der Waals forces in protein core, due to tight packing of neighboring residues [4]) effects ( Figure 1). Therefore, despite a series of experimental and theoretical studies on the molecular mechanisms of protein folding [15,16] and stability [3,9,17] argued that the hydrophobic contacts play a role of paramount importance in such processes, the difference of apolar contact area between large datasets of proteins from hyper/thermophilic organisms and their mesophilic homologs, to our knowledge, has been never quantified.
Such consideration, along with the wealth of information provided very recently by structural genomics projects, prompted the comparison of a large number of protein crystal structures from hyper/thermophilic organisms and their mesophilic homologs, in order to assess the role played by the hydrophobic contacts in the stabilization of the protein core, at high temperatures.
Computation of the apolar contact area Figure 1 Computation of the apolar contact area. A-B) Initially, for each amino acid pair (in this case two sample residues, Phe and Lys, are considered), the Van der Walls surface is generated. C) Then, the solvent accessible surface is computed. D) The latter is used to compute the hydrophobic contact surface between the two interacting residues.

Analysis of the Apolar Contact Area
Two datasets were obtained from a collection of 1563 hyperthermophilic and thermophilic proteins, retrieved from structural databases using several keywords (see Methods section; Table 1 and 2). In the first case a choice criteria favouring quality over quantity of data yielded a non redundant dataset, which will be referred to as "A", including 38 crystal structures, lacking any structural defect and displaying a maximum resolution of 2.0 Å and a maximum R-value of 0.25. Dataset A represents a subset of a second dataset, which will be referred to as "B". Dataset B is composed of 59 crystal structures lacking any structural defect, displaying a maximum resolution of 3.0 Å and a maximum R-value of 0.30. For each structure composing the two datasets, a mesophilic homologous counterpart was collected, following the same above mentioned choice criteria. The computation of the total apolar contact area (ACA) between the residues of each structure pair composing dataset A and B was then carried out. The statistical significance of the observed differences of ACA between hyper/thermophilic proteins and their mesophilic counterparts was assessed with a paired t-test. The results are reported in Table 3 (see also Additional file  1 for additional information). T-test values are expressed as the associated probability P of acceptance of the null hypothesis, that is, there are no significant differences of ACA between hyper/thermophilic and mesophilic pairs. T-values scoring > 2.0 (P(t) < 0.05) are considered statistically significant. Figure 2 shows the difference of apolar contact area computed over the whole structures of the protein pairs composing the two analysed datasets. The obtained values were normalized by the sequence length of each protein. In dataset A, 22 (13 hyperthermophilic/ mesophilic and 9 thermophilic/mesophilic protein pairs) of the 38 considered protein pairs showed an increase of the ACA (Figure 2A); the corresponding P(t) was ~0.086 (0.079 for hyperthermophiles and 0.690 for thermophiles). In dataset B, 38 (24 hyperthermophilic/mesophilic and 14 thermophilic/mesophilic protein pairs) of the 59 protein pairs showed an increase of the ACA (Figure 2B); the corresponding P(t) was ~0.012 (0.020 for hyperthermophiles and 0.474 for thermophiles). Although the obtained differences were not considered statistically significant, according to the t-test validation analysis, for both datasets (Table 3), nonetheless they indicated a general increase of the apolar contact area in hyperthermophilic proteins, compared to their mesophilic counterparts.
A more detailed analysis on the structurally conserved regions [18] (SCRs; see methods section) of the structures composing dataset A and B indicated that, in both datasets, a number of hyperthermophilic proteins underwent a highly significant (P(t) < 0.001) increase of the hydro-phobic contact area of those residues composing the SCRs ( Figure 3; Table 3). SCRs were defined as regions displaying a similar local conformation, lacking insertions and deletions and composed of at least three consecutive residues. SCRs are therefore protein segments that conserve the same main-chain conformation in each pair of structures analysed, excluding the intervening regions whose structure differs markedly amongst different proteins [19]. Considering the role of great importance played by the hydrophobic contacts in stabilizing and possibly driving the protein folding mechanism, it seemed interesting to analyse how, during evolution, the SCRs coped with the modifications of the hydrophobic contacts necessary to achieve the correct fold at high temperatures. In dataset A ( Figure 3A), 22 (17 hyperthermophilic/mesophilic and 5 thermophilic/mesophilic protein pairs, respectively) of the 38 considered protein pairs showed an increase of the ACA (P(t) ~0.0029). The same trend was also observed for dataset B ( Figure 3B), in which 37 of 59 protein pairs (27 hyperthermophilic/mesophilic and 10 thermophilic/mesophilic) displayed an increased ACA in the direction mesophile → hyper/thermophile (P(t) ~0.0001). The measured mean ΔACA was 0.39 Å2/residue and 0.37 Å2/ residue for datasets A and B, respectively. However, if only the hyperthermophilic/mesophilic pairs were considered, the mean ΔACA was 0.74 Å2/residue and 0.63 Å2/residue for datasets A and B, respectively. The maximum measured difference was 2.92 Å2/residue for the pair 1V7R/ 1K7K (nucleotide triphosphate pyrophosphatase from P. horikoshii/E. coli). Since these quite high differences of ACA can be due to other factors than acquired thermostability (i.e., different overall conformations), the t-test validation analysis was repeated without these extreme pairs, obtaining again not significant results (see "Methods" section and supplementary material).
To get a deeper insight into the statistically significant increase of the hydrophobic contact area of protein cores from hyperthermophilic organisms, the possible occurrence of a larger amount of hydrophobic contact area has been examined in different secondary structure elements. In dataset A ( Figure 4A), 16 out of the 24 hyperthermophilic proteins considered showed an increase of ACA in the α-helices of the protein core, compared to their mesophilic counterparts, while in dataset B ( Figure 4B) the same ratio was 25 out of 37 proteins, with a measured significance P(t) ~0.0524 and P(t) ~0.0113 for datasets A and B, respectively. Although in this latter case significant deviations from normality, as judged by the application of the Shapiro-Wilk normality test, were observed for the distribution of mesophilic values, nonetheless removing three outliers gave a Shapiro-Wilk P(t) ~0.62 and a t-test P(t) ~0.001. These results indicated that α-helices are mainly involved in the increased amount of hydrophobic contact area which was observed comparing hyperther-mophilic/mesophilic proteins. Conversely, no statistically significant trends have been observed in the comparison of the ACA in the β-strands of the SCRs (Table 3). In dataset A, 21 (14 hyperthermophilic/mesophilic protein pairs) of the 38 considered protein pairs showed an increase of the ACA, while in dataset B, 34 (24 hyperthermophilic/mesophilic proteins) of the 59 pairs exhibited an increase of the ACA. The mean value of ΔACA is -0.02 Å 2 /residue and 0.34 Å 2 /residue for dataset A and B. Therefore, at least for the hyperthermophilic/mesophilic protein pairs, it can be concluded that the statistically significant increase of the hydrophobic contact area of protein cores involves mainly the α-helices and not the βstrands.

Differences in the amino acid composition of the residues involved in conserved hydrophobic contacts
The differences of amino acid composition of the residues involved in conserved hydrophobic contacts (CHCs; Table 4) [19] between hyperthermophilic proteins and their mesophilic counterparts is expressed in units of standard deviation from the measured mean value, R aa . R aa values > 0 or < 0 indicate, respectively, a frequency of residue type aa higher or lower than the expected mean.
Differences in the apolar contact area (ΔACA) for each protein pair, composing dataset A and B, computed over the whole protein structure Figure 2 Differences in the apolar contact area (ΔACA) for each protein pair, composing dataset A and B, computed over the whole protein structure. Values for hyperthermophilic/mesophilic protein pairs and thermophilic/mesophilic pairs are expressed in Å 2 /residue and represented as light grey and dark grey bars, respectively. Numbers on X-axis refer to Table 1 (A) and Table 2 (B).  R aa values ≥ 3.0 standard deviations (P ≤ 0.01) from the mean value (that approximates zero) were considered statistically significant. Compositional analysis shows no statistically significant differences between hyperthermophilic and mesophilic proteins, regarding the identity of the residues involved in the formation of hydrophobic contacts, except for isoleucine, that scored at ~3.6 standard deviations from the mean in both datasets A and B. It is important to emphasize that, in evaluating the differences of amino acid composition of the residues involved in conserved hydrophobic contacts, dataset B, containing 13 hyperthermophilic/mesophilic protein pairs more than dataset A, is probably more confident. In any case, since both datasets A and B gave very similar results, the role played by isoleucine is probably independent from the number and type of structures analysed.

Preferred amino acid interactions in conserved hydrophobic contacts
In order to further investigate the statistically significant increase of isoleucine in CHCs of hyperthermophilic pro-teins, compared to their mesophilic counterparts, an analysis was carried out to infer which amino acid pairs are preferred in the formation of hydrophobic contacts. Preferred amino acid pairs forming hydrophobic contacts were identified by computing the number of times a particular pair of residues comprised in SCRs makes a hydrophobic contact, displaying an apolar contact area > 0.0 Å 2 .
The results of this analysis are shown in Tables 5 and 6, where each element ij of the interaction matrix reports, in units of standard deviation from the mean value, the measured frequency of interaction between residue i and residue j.  lysine, scoring at 3.28 standard deviations from the mean. The closeness between the apolar atoms composing Glu and Lys residues might be only a secondary effect in the generation of strong ion-pairs between these two residues.

Preferred amino acid substitutions in conserved hydrophobic contacts
Favoured amino acid substitutions between the hyperthermophilic and mesophilic proteins were calculated from the results obtained by the CHC_FIND tool [19]. The residues exchange analysis was indeed limited to the identified conserved hydrophobic contacts. The obtained substitution matrices are shown in Tables 7 and 8. Values are expressed in units of standard deviation from the mean. Only values scoring at 3.0 standard deviations or more from the mean were considered statistically significant. Again, almost all of the most significant exchanges involve isoleucine in both datasets (dataset A: Val→Ile 6.32, Leu→Ile 6.36; dataset B: Val→Ile 6.39, Leu→Ile 6.84 and Phe→Ile 3.12). These exchanges are reflected in the variation of average amino acid composition of hyperthermophiles (Table 4), where a marked increase of isoleucine content can be detected. The only other exchange observed not involving isoleucine is Ala→Val, scoring at 3.20 standard deviations from the mean.

Discussion
The main goal of this study was to evaluate on a quantitative basis the relationship between hydrophobic contacts and proteins adaptation to high temperatures.
An essential prerequisite to carry out such a study is to assemble a large and minimally redundant set of very high resolution crystal structures. Indeed, despite the observation that each protein family seems to adopt different structural strategies to adapt to high temperatures [5], common trends may be outlined if a large number of structural data is available [8]. At the same time, since computed values of apolar contact area are mostly influenced by the relative position of the interacting residues, their precision is affected by the resolution of the crystal structures analysed. Therefore two datasets were culled from a set of 1563 crystal structures from thermophilic (optimal growth temperature between 50°C and 80°C) and hyperthermophilic (optimal growth temperature above 80°C) organisms, and their mesophilic counterparts. The rationale of this choice was to assure that the obtained results were not biased either by the paucity of data, or by the quality of the collected crystal structures.
As already discussed by Chen et al. [7], the increase of the apolar contact area in hyperthermophilic and thermophilic proteins may be achieved at least by two different mechanisms: an evenly distributed increase over all residues; a local increase over key residues. The latter mechanism, that has been shown to be a major contribute to the enhanced thermostability of proteins from T. maritima [9], seems to involve mainly residues already implied in the formation of hydrophobic contacts. This suggests that a better compactness may originate from an even better connectivity in those protein regions that already have a tendency to compactness and not by simply "tightening the loops" [9]. The results obtained in this work on the difference of apolar contact area (ΔACA) agree with this hypothesis: a significant increase of ACA was measured in both datasets only when the analysis was limited to the SCRs of the hyperthermophilic structures. The SCRs were presumably subject to similar constraints during the divergent evolution of a family of proteins from a common ancestor, and therefore they possibly contain most of the determinants necessary to maintain the fold. Considering the role played by hydrophobic contacts in this sense, it is not surprising that the residues composing the SCRs and engaging hydrophobic contacts were mostly involved in the structural modifications nec-essary to achieve and maintain a proper fold at high temperatures. Moreover, the finding that the measure of the difference of ACA resulted highly significant only when limited to the SCRs, could explain some apparently not significant results previously obtained by measuring accessible surface area [8] or cavity size [6].
The statistically significant increase of ~0.75 Å 2 /residue of apolar contact area was observed only in the SCRs of hyperthermophilic proteins. Therefore, it can be argued that proteins from thermophilic organisms usually adopt different strategies to enhance thermostability. Indeed, it has been demonstrated that moderately and extremely thermostable proteins rely on different mechanisms to Differences in the apolar contact area (ΔACA) for each protein pair, composing dataset A and B, computed over the SCRs Figure 3 Differences in the apolar contact area (ΔACA) for each protein pair, composing dataset A and B, computed over the SCRs. Values for hyperthermophilic/mesophilic protein pairs and thermophilic/mesophilic pairs are expressed in Å 2 / residue and represented as light grey and dark grey bars, respectively. Numbers on X-axis refer to Table 1 (A) and Table 2 (B).
achieve greater stability [8,20]. Ion-pairs interactions represent presumably a predominant force in thermophilic proteins, as well as in many hyperthermophilic proteins [8,21]. On the other hand, comparisons of mesophilic and hyperthermophilic protein structures indicate that the hydrophobic effect has a contribution to stability only at high temperatures, while only moderately thermophilic proteins show an increase in the polarity of their exposed surface [20]. Two factors could be responsible for this difference: the temperature dependence of the thermodynamic forces involved in protein stabilization, and/or the phylogenetic origin of the extremely thermophilic organisms, that belong to the domain Archaea, and are there-fore distinct from moderately thermophilic organisms, which are mostly Bacteria. In any case, the obtained results strongly suggest that packing of hyperthermophilic proteins, in comparison with their mesophilic homologs, has improved significantly, and it is reasonable to deduce that this increased amount of apolar contact area contributes to the stabilization of the native state of the protein.
Our analysis revealed that α-helices were mainly involved in the increased amount of ACA. Surprisingly, no statistically significant trends have been observed in the comparison of the ACA in the β-strands of the SCRs. We cannot provide a clear explanation of this different behaviour Differences in the apolar contact area (ΔACA) for each protein pair, composing dataset A and B, computed over the α-helices of the SCRs Figure 4 Differences in the apolar contact area (ΔACA) for each protein pair, composing dataset A and B, computed over the α-helices of the SCRs. Values for hyperthermophilic/mesophilic protein pairs and thermophilic/mesophilic pairs are expressed in Å 2 /residue and represented as light grey and dark grey bars, respectively. Numbers on X-axis refer to Table 1 (A) and Table 2 (B). between secondary structures. An intriguing possibility is that β-strands are, generally, already almost optimally packed, even in mesophilic proteins, resulting in a small margin of improvement. However, it is also possible that this observation is due to 'sample bias' e.g., the peculiarities of the available protein structures.
Structural stabilization of α-helices in protein cores may therefore represent a component of great importance for the enhanced termostability of hyperthermophilic proteins. A number of studies in the past has stressed the importance of the enhanced stability of α-helices as a general feature of many hyperthermophilic proteins. In order to investigate the role of α-helices in protein thermostabil-  ity, Petukhov et al. [22] compared energy characteristics of α-helices from four families of hyperthermophilic and mesophilic proteins, using statistical mechanical theory for describing helix/coil transitions. They found that the magnitude of the observed decrease in intrinsic free energy on α-helix formation of the thermostable proteins was sufficient to explain the experimentally determined increase of their thermostability. Furthermore, protein engineering studies showed that a well-packed α-helix structure is related to large increase in thermostability [23,24]. It is well known that the flexibility of α-helices is often required to assure protein function, such as conformational transitions in substrate binding or protein-protein interactions [25]. However, an excessive flexibility of this secondary structure element, at high temperatures, could result in an insufficient stability to maintain its native conformation, causing the entire protein to unfold.
According to thermodynamic studies on model peptides in aqueous environments, two main factors appear to play a key role in the structural stability of the α-helices: the presence of amino acids with intrinsic helical propensity, and side chain-side chain interactions [26,27]. Therefore, we further investigated the nature of the increased stabilization of α-helices composing the SCRs of hyperthermostable proteins, determining the differences in amino acid composition of the residues involved in CHCs. The results of this analysis strongly suggest that isoleucine and, to a lesser extent valine, mostly to the detriment of leucine, are involved in the formation of more hydrophobic contacts in hyperthermophilic proteins, compared to their mesophilic counterparts. Likewise, the importance of isoleucine in the formation of CHCs of hyperthermophilic proteins was confirmed by the analysis of the preferred amino acid interactions in CHCs, where almost all types of interactions scoring at > 3.0 standard deviations involved the amino acid isoleucine, and by the favoured amino acid substitutions between the hyperthermophilic and mesophilic proteins in CHCs. A large amount of theoretical and experimental studies demonstrates the importance of isoleucine in the stabilization of protein structures from thermophilic organisms. Malakauskas and Mayo [24] reported the computer-aided engineering of a seven-fold mutant of the β1 domain of the Streptococcal protein G, exhibiting a melting temperature above 100°C and an enhancement in thermodynamic stability of 4.3 kcal mol -1 at 50°C over the wildtype protein. Of seven mutations, five were of type XXX→ Ile, and they improved side-chain packing in the interior of the protein. An increased content of isoleucine in thermophilic and hyperthermophilic proteins, to the detriment of leucine, was also noted by Haney et al. [28] and Kumar et al. [6]. More recently, a structural genomics based study carried out by Chakravarty and Varadarajan [29] reported that leucine is preferentially substituted by the β-branched residues valine and isoleucine, at buried sites.
Several studies have demonstrated in the past that leucine has a slightly higher α-helix propensity than isoleucine and, generally, β-branched residues [27,30]. This assumption, which is apparently in contrast with the results obtained by this work, derives from substitution experiments in short polyalanine α-helices-forming peptides in    ALA  VAL  PHE  ILE  LEU  ASP  GLU  GLY  LYS  SER  THR  TYR  CYS  ASN  GLN  PRO  MET  ARG  HIS  TRP      water [31]. This process is mainly associated with the loss of conformational entropy of residues during the folding of α-helices in an aqueous environment: freezing side chain with fewer internal rotational degrees in the α-helix conformation would be entropically less expensive. However, it must be noted that these experiments, and many derived propensity scales, do not take into account solvent entropy effects. As discussed by Creamer and Rose [30], neglect of solvent entropy appears justified for a peptide side chain because no significant differences in solvation energy are expected in the side chain of a solitary polyalanyl helix during a helix-coil transition. In either case, the side chain is highly solvent-exposed. The same situation would not be appropriate for a protein helix that, upon association with the remainder of the molecule, engages a solvent-shielded interaction surface. In this study, only the α-helices composing the SCRs and therefore mostly found in the protein core were considered for further investigation. Therefore, application of helix propensity scales might be not appropriate in this case. For example, Li and Deber [15] have shown that αhelices propensity scales are not appropriate for non aqueous environments and that β-branched amino acids, as valine and isoleucine, rank among the best helix promoters in an apolar environment, as a lipid bilayer.
On the other side, hydrophobic contacts deriving by side chain interactions could play a role of great importance in the stabilization of the α-helices composing the SCRs of hyperthermostable proteins. At temperatures above 80°C, the hydrophobic effect, that is considered to be a dominant force in protein folding [32,33], is mainly enthalpy driven [34]. In fact, while at high temperatures the entropy contribution to the protein stability tends to zero, the loss or gain of van der Waals interactions acquires increased importance. For example, constructing 15 Barnase mutants in which hydrophobic interactions were deleted, Serrano et al. [35] found a strong correlation between the degree of Barnase destabilization and the number of methyl side chain groups that were lost (r = 0.91). These data agree with the preferred substitutions (R Ala→Val = 3.20; R Val→Ile = 6.31) observed in the CHCs of our datasets.

Conclusion
In conclusion, taken together the obtained results indicate the preference, in the hydrophobic contacts, for isoleucine and valine residues as an important feature contributing to the enhanced thermostability of α-helices in hyperthermophilic proteins, possibly occurring through a decreased flexibility of these elements of secondary structure. This effect, in turn, may be due to an increased number of buried methyl groups in protein core and/or a better packing of α-helices with the rest of the structure, caused by the presence of hydrophobic β-branched side chains.
Despite the advances in the design of hyperthermostable protein variants [17], a potential drawback of these approaches is still constituted by the time consumed by computer algorithms for exploring the whole sequence protein space. Other things being equal, focussing on the apolar contact area of the α-helices of the protein core through substitutions increasing the number of methyl side chain groups and/or resulting in a better packing of the secondary structure elements, will potentially give clues for the thermostabilization of the protein.

Data Collection
Hyperthermophilic and thermophilic protein structures were retrieved from Protein Data Bank (PDB) [36], by initially searching for the words "thermo", "thermophile" and "hyperthermophile". This search yielded about 300 proteins and their corresponding sources. An additional search was then performed using as query the name of such organisms, after having assessed that their optimal growth temperatures were between 50°C and 80°C for thermophiles, and above 80°C for hyperthermophiles [3]. Optimal growth temperatures for each organism were obtained from Entrez [37] and the "Prokaryotic Growth Temperature Database" [38]. As a first refinement step, the entries in which protein structures were determined by nuclear magnetic resonance (NMR) were discarded, yielding about 1563 crystal structures.
As a second refinement step, all the entries were examined by means of the PISCES tool [39], and culled from the original dataset by maximum percentage of identity (90%), maximum resolution (2.0 Å), maximum R-value (0.25) and minimum chain length (50 residues) criteria. Furthermore, a second dataset was collected following less stringent criteria (maximum resolution at 3.0 Å and maximum R-value at 0.30), in order to cull a greater number of structures. This second step yielded 458 and 767 proteins for dataset A and B, respectively. Each dataset was then further reduced by eliminating proteins displaying any structural defect, such as missing side-chains or chain breaks due to missing residues, using the MAXIT tool, available at [46]. At the end of this refinement step, 93 and 144 structures comprised dataset A and B, respectively.
Each structure of the two datasets was then exploited to check for the presence in PDB of a mesophilic counterpart. To this purpose, a search with the blast tool [40,41] was carried out, adopting the following criteria: 30% minimum sequence identity, that is usually accepted as a threshold value to assure a homology relationship between two proteins [42]; 90% maximum sequence identity, in order to avoid any redundancy of data; 40% maximum difference in length between the sequences, to avoid the presence of large indels between the two structures. Furthermore, the retrieved mesophilic proteins had to satisfy the same above described structural criteria to be accepted. In those cases yielding several mesophilic homologous structures available for each hyperthermophic/mesophilic protein, the one displaying the highest percent of sequence identity was collected. At the end of this search, 38 protein pairs for dataset A (14 thermophilic/mesophilic pairs and 24 hyperthermophilic/ mesophilic pairs) and 59 protein pairs for dataset B (22 thermophilic/mesophilic pairs and 37 hyperthermophilic/mesophilic pairs) were collected (Table 1 and Table 2).

Computation of the Apolar Contact Area
Computation of the total apolar contact area between the residues of each structure composing dataset A and B was carried out by means of the pdb_np_cont tool [43], which computes pairwise atom contact areas between non-polar atoms from structural protein data in a standard PDB coordinate file. Briefly, this method is based on the classification of points located on a sphere of interaction radius, surrounding each non-polar atom. The interaction radius is the van der Waals radius of each atom type, plus the radius of a water molecule. The output of this program was utilized to calculate the pairwise residue contact areas for every possible pair of residues belonging to the structures analysed. Heteroatoms were ignored. The total apolar contact area was then normalized by sequence length of each protein structure.
In order to assess the role played by the hydrophobic contacts in the stabilization of the protein core, at high temperatures, each pair of homologous hyperthermophilic/ mesophilic and thermophilic/mesophilic structures was initially superposed by means of the CE-MC tool [44]. The resulting alignment was then utilized to derive manually refined structural alignments. Every pair of structures was visually inspected and, where necessary, modified to optimise the matching of several structural features, including observed secondary elements, functionally conserved residues and hydrophobic regions, in order to give the most accurate structural alignment.
Each structural alignment obtained as described above was utilized to identify the common core and the structurally conserved regions between the pairs of proteins taken into consideration (SCRs). SCRs were defined as regions displaying a similar local conformation, with a mean positional RMSD of the equivalent α-carbon positions of the structures superposed ≤ 3.0 Å [18], lacking indels (insertions and deletions) and composed of at least three consecutive residues. For every structurally equivalent position of the pairwise structural alignment, the RMSD from the center of mass of the structurally equivalent C α atoms was computed. To avoid the presence of SCRs with indels, positions with gaps were not considered. A window of size w = 3 positions was then scrolled through the alignment and used to define seed positions with a mean RMSD ≤ 3.0 Å. Each time a seed position was found, w was increased iteratively by one position until the mean score remained belove 3.0 Å, or until the window reached the end of the alignment. The obtained SCRs were then visually inspected to avoid the possible presence of regions with different conformations. Then, the hydrophobic contacts involving pairs of topologically equivalent residues in both of the structures analysed (Conserved Hydrophobic Contacts, CHCs) were extracted from the identified SCRs. The SCR_FIND and CHC_FIND tools [19] were utilized to this purpose.
The differences observed in the amount of apolar contact area between the SCRs of the hyperthermophilic/mesophilic and thermophilic/mesophilic protein pairs were further investigated through the analysis of such differences in the regular secondary structure elements: α-helices and β-strands. Secondary structures were determined by using the program DSSP [45].
The amount of apolar contact area measured in the SCRs and secondary structure elements of each structure were finally normalized by the number of residues belonging to SCRs, α-helices and β-strands, respectively.

Amino acid Composition of the residues involved in CHCs
Differences in amino acid composition were measured by: where D aa is the difference in amino acid composition for residue aa, n T and n M are the number of residues of type aa in hyperthermophilic/thermophilic (T) and mesophilic (M) structures and n aa is the total number of residues in hyperthermophilic, thermophilic (T) and mesophilic (M) structures.
The D aa values measured for each pair of the structures analysed were then used to calculate the difference in amino acid composition C aa over the k pairs composing dataset A and dataset B: The mean and standard deviation for the C aa elements were determined; the significance R aa of the difference in amino acid composition for residue aa was then calcu- lated by dividing the difference between C aa and the overall mean by the standard deviation σ: R aa values ≥ 3.0 standard deviations (corresponding to a probability P ≤ 0.01 that the observed difference was obtained by chance) from the mean value were considered statistically significant.

Preferred amino acid pairs in CHCs
Preferred amino acid pairs forming hydrophobic contacts were identified by computing the number of times a particular pair of residues comprised in SCRs makes a hydrophobic contact. The obtained counts were then normalized by the number of pairs of interacting residues present in the SCRs of the structure taken into consideration. An interaction matrix reporting the differences in the number of apolar contacts for each possible pair of residues, between hyperthermophilic/mesophilic and thermophilic/mesophilic structures, was derived: where k represents the number of elements of dataset A or B, C XY is the element of the matrix reporting the differences in the number of apolar contacts for the pair XY of interacting residues, C T and C M are the normalized counts for the hyperthermophilic/thermophilic and the mesophilic proteins, respectively.
The mean and standard deviation for the non-zero elements of the overall interaction matrix were determined; the significance R XY of the interaction XY was then calculated by dividing the difference between C XY and the overall matrix mean by the standard deviation σ: R XY values ≥ 3.0 standard deviations (corresponding to a probability P ≤ 0.01 that the observed difference was obtained by chance) from the mean value were considered statistically significant.

Preferred amino acid substitutions in CHCs
Amino acid substitutions of residues involved in the formation of conserved hydrophobic contacts between hyperthermophilic and mesophilic proteins were determined by analysing the alignment of the SCRs of each pair. For each residue X, belonging to a mesophilic protein and involved in making CHCs, aa X→Y was defined as the number of times X is substituted by the residue Y of the hyperthermophilic sequence. Likewise, aa Y→X is defined. Therefore, a substitution matrix can be obtained by computing the difference between aa X→Y and aa Y→X over the whole dataset of protein pairs k, according to: where C S is the element of the substitution matrix.
The mean and standard deviation for the non-zero elements of the overall exchange matrix were determined; the significance R XY of the exchange X → Y was then calculated by dividing the difference between C S , and the overall matrix mean by the standard deviation σ: R XY values ≥ 3.0 standard deviations (corresponding to a probability P ≤ 0.01 that the observed difference was obtained by chance) from the mean value were considered statistically significant.

Statistical significance
The statistical significance of the observed differences of ACA between hyper/thermophilic proteins and their mesophilic counterparts was assessed with a paired t-test (applied to every pair of structures composing dataset A and dataset B, respectively), to judge the rejection of the null hypothesis (t > 2.0; P(t) < 5%). The null hypothesis to be rejected with the paired t-test analysis is that there is not a significant difference between the measured values of ACA in the hyper/thermophilic and mesophilic proteins.
In order to ensure that the measured P(t) was not biased by the extreme values of the distributions, the t-test validation analyses were repeated, removing the highest and lowest values from the datasets.
The Shapiro-Wilk normality test was applied to judge the distribution of the obtained values for the two datasets. The null hypothesis of this test is that the analysed samples of data are taken from a Gaussian distribution; therefore, the returned P(t) of this test represents a criteria of acceptance or rejection of the null hypothesis. A P(t) < 0.05 was considered statistically significant to reject the supposition of normality.