  • Research article
  • Open access

Protein structure prediction by all-atom free-energy refinement

Abstract

Background

The reliable prediction of protein tertiary structure from the amino acid sequence remains challenging even for small proteins. We have developed an all-atom free-energy protein forcefield (PFF01) that we could use to fold several small proteins from completely extended conformations. Because the computational cost of de-novo folding studies rises steeply with system size, this approach is unsuitable for structure prediction purposes. We therefore investigate here a low-cost free-energy relaxation protocol for protein structure prediction that combines heuristic methods for model generation with all-atom free-energy relaxation in PFF01.

Results

We use PFF01 to rank and cluster the conformations for 32 proteins generated by ROSETTA. For the 22 high-quality and 10 low-quality decoy sets we select near-native conformations with an average Cα root mean square deviation of 3.03 Å and 6.04 Å, respectively. The protocol incorporates an inherent reliability indicator that succeeds for 78% of the decoy sets. In over 90% of these cases near-native conformations are selected from the decoy set. This success rate is rationalized by the quality of the decoys and the selectivity of the PFF01 forcefield, which ranks near-native conformations on average 3.06 standard deviations below the mean energy of the relaxed decoys (Z-score).

Conclusion

All-atom free-energy relaxation with PFF01 emerges as a powerful low-cost approach toward generic de-novo protein structure prediction. The approach can be applied to large all-atom decoy sets of any origin and requires no preexisting structural information to identify the native conformation. The study provides evidence that a large class of proteins may be foldable by PFF01.

Background

The development of reliable methods for de-novo protein structure prediction remains a challenge [1–4] even for small proteins. Heuristic methods, which dominate protein structure prediction contests [3], can generate accurate models [5], but often lack the ability to reliably identify near-native conformations [6]. Folding simulations using accurate biophysical models demonstrate agreement with experimental investigations [7–11], but remain limited to small proteins by their large associated computational cost. We have developed an all-atom free-energy forcefield [12] to describe the protein folding process. In our folding studies we exploit the thermodynamic hypothesis [13], which stipulates that many proteins in their native configuration are in thermodynamic equilibrium with their environment. Based on this paradigm the native conformation of a protein can be predicted as the global optimum of its free-energy surface [14]. Since the free-energy landscape of naturally occurring proteins is thought to have a funnel-like shape [15, 16], stochastic search methods are guided by the overall gradient towards the global optimum of this landscape. Using a variety of different stochastic optimization methods we were able to demonstrate the reproducible and predictive folding of several proteins, including the trp-cage protein (1L2Y) [17], the villin headpiece [18], the HIV accessory protein (1F4I) [19], and the bacterial ribosomal protein L20 [20, 21] with 20, 36, 40 and 60 amino acids, respectively.

While these studies demonstrate the feasibility of all-atom protein structure prediction from random initial conformations, the numerical effort for a predictive simulation still increases steeply with system size: from about 20 CPU days (on standard off-the-shelf hardware) for a protein with 20 amino acids to about 8000 CPU days for 60 amino acids [21, 22]. In an alternative, widely pursued approach [23–27], protein structures are assembled de-novo according to heuristic principles, such as local sequence homology [28], and then ranked with either knowledge-based or forcefield-based scoring functions [29–37]. Heuristic decoy generation eliminates the need to sample the entire conformational space of the protein or to reconstruct the folding pathway. Because large decoy sets of protein-like conformations can be generated much faster than by sampling the free-energy landscape, the decoy selection approach makes it possible to predict the native conformation of proteins that are too large to be folded from completely random initial conformations. Of particular interest in this regard are decoy sets that are generated with a philosophy completely orthogonal to folding, e.g. methods that assemble the protein from fragments obtained from local homology or other sources [28, 38, 39].
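To give a feeling for how steep this growth is, the two cost figures quoted above can be turned into an effective scaling exponent. This is only a rough estimate: it assumes, purely for illustration, that the cost follows a power law in the number of amino acids N between the two data points, which the source does not claim.

\[
\text{cost} \propto N^{k}, \qquad k \approx \frac{\ln(8000/20)}{\ln(60/20)} = \frac{\ln 400}{\ln 3} \approx \frac{5.99}{1.10} \approx 5.5
\]

Under this assumption, tripling the chain length multiplies the effort by a factor of 400, which illustrates why folding from extended conformations quickly becomes impractical for larger proteins.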

The goals of this investigation are therefore twofold. First, we test the accuracy of the free-energy forcefield PFF01 for proteins that are too large and too complex to fold from random initial conformations. If we find that near-native decoys are lower in free energy than all other conformations, the forcefield is accurate enough to fold the protein. Since it is impossible to generate completely exhaustive decoy sets, we use 32 proteins of the latest ROSETTA all-atom decoy library as a reference [6]. These decoy sets were generated specifically for the purpose of forcefield assessment and help us to obtain an unbiased assessment of the "universality" of the forcefield.

Secondly, we develop and validate a protocol, free-energy relaxation, to select the native protein structure from large libraries of protein conformations generated by a heuristic method. Free-energy relaxation could be used for protein structure prediction either as a stand-alone method, or as a post-scoring approach for existing techniques. Because no fundamentally new conformations are generated in the relaxation protocol, a prerequisite for success is the existence of some near-native conformations in the decoy set. This investigation deals only with the validation of a suitable selection protocol, not with the generation of exhaustive decoy sets. Our approach will therefore fail for proteins where the decoy set contains an insufficient number of near-native conformations. An overall assessment of the viability of the free-energy relaxation approach for protein structure prediction would additionally require an independent assessment of the likelihood of the decoy generator to propose near-native decoys.

Results

We investigated 32 small proteins (30–85 amino acids) without any stabilizing ligands [6] (see Methods). The proteins have all-alpha (20), alpha-beta (8) or only beta (4) secondary structure and cover many distinct structural families. Previous investigations ranked the decoys and found significant enrichment by several independent descriptors (Lennard-Jones, Coulomb, Hydrogen Bonding, etc) with Z-score ranging from -1 to -2 [6, 36]. The Z-scores using the original ROSETTA energies were reported to be poor, indicating that the development of a scoring function to select near-native decoys from this set poses a significant challenge [40–43].

The lowest Z-scores were reported for side-chain Lennard Jones interactions, favoring compact structures and side-chain hydrogen bonding. Neither main-chain hydrogen bonding, nor Coulomb interactions, nor a wide range of implicit solvent models resulted independently in very low Z-scores [6]. The PFF01 forcefield [12] integrates exactly such components (Lennard Jones, electrostatic model, hydrogen bonding, SASA implicit solvent model) but balances them in a fashion that was demonstrated to be highly selective for at least some small proteins. Here we investigate the question whether this unique combination is transferable to a larger protein test set and able to select near-native conformations of these independently generated decoy sets. Use of all applicable decoys of a protein library generated by an alternate approach ensures that the investigation is not biased towards proteins particularly amenable to relaxation with PFF01.

Decoy ranking

Free-energy relaxation scores the decoys in the set according to their energy in the forcefield PFF01 without major structural changes. Since this approach can only succeed for decoy sets containing near-native conformations, we have subdivided the protein targets into two families: 22 high-quality decoy sets containing at least 10% near-native conformations and 10 low-quality decoy sets which contain few or no near-native decoys. Throughout this study we define near-native conformations as those with a Cα root mean square deviation (Cα RMSD) of less than 4.5 Å to the native conformation, commensurate with the characteristic resolution of the decoys [6] (less than 1% of the decoys of the low-quality decoy sets have a Cα RMSD of less than 4 Å). This measure is commensurate with the quality of the near-native conformations that we find in our folding studies [21, 44], which typically converge to about 3–4 Å Cα RMSD, owing to the use of an implicit solvent model. Implicit solvent models, which are required to estimate the solvent contribution to the free energy of a protein conformation, tend to degrade the accuracy of the simulated native ensemble in comparison to explicit water simulations. As a result, we cannot expect the resolution of the forcefield to be better than 3–4 Å Cα RMSD in the present relaxation protocol.

Conformations generated from all-atom models are not trivially transferable from one theoretical model to another. In order to obtain a meaningful energy estimate, each of the decoys must be relaxed in the new forcefield to a nearby local minimum. We pursued a low-cost approach (see Methods), which places the emphasis on the quality of the initial decoy, rather than on the generation of long trajectories that independently sample the conformational space. In such a rapid energy relaxation decoys will not move far from the starting configuration, but will significantly change their energy (Figure 1). The relaxation process leads to a reordering of the decoys and a substantial enrichment of the low-energy subset with near-native decoys. Due to the stochastic nature of the annealing process, the final energy of each of the decoys samples a probabilistic distribution in energy and Cα RMSD, so the lowest-energy decoy need not be a near-native one (see inset).
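The two definitions used above (near-native means Cα RMSD below 4.5 Å, and a high-quality set contains at least 10% near-native decoys) can be written down in a few lines. The following is a minimal sketch, not the authors' code; the function names are hypothetical, and the Cα RMSD values are assumed to have been computed beforehand after superposition onto the native structure.

```python
NEAR_NATIVE_CUTOFF = 4.5       # Å, Cα RMSD threshold quoted in the text
HIGH_QUALITY_FRACTION = 0.10   # at least 10% near-native decoys


def near_native_fraction(rmsds, cutoff=NEAR_NATIVE_CUTOFF):
    """Fraction of decoys whose Cα RMSD to the native is below the cutoff."""
    if not rmsds:
        raise ValueError("empty decoy set")
    return sum(1 for r in rmsds if r < cutoff) / len(rmsds)


def classify_decoy_set(rmsds):
    """Label a decoy set 'high-quality' or 'low-quality' (hypothetical helper)."""
    if near_native_fraction(rmsds) >= HIGH_QUALITY_FRACTION:
        return "high-quality"
    return "low-quality"
```

For example, a set in which 2 of 10 decoys fall below 4.5 Å would be labelled high-quality, whereas a set with 1 near-native decoy out of 100 would be labelled low-quality.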

Figure 1

Energy relaxation. Each conformation in a decoy set (1r69) is relaxed from its starting conformation (top set) to the final conformation (bottom set); dashed lines connect the starting and final conformations for a subset of the relaxation runs. The native conformation (red triangle) has the lowest energy, and clustering of the fifty lowest-energy decoys (blue triangles) predicts the native conformation. Inset: Energy relaxation leads to an enrichment of the near-native decoys in the entire set of final conformations (blue set = 50 lowest-energy decoys).

For 18 of the 22 high quality decoy sets near-native conformations rank among the 10 best conformations (out of approximately 1900 for each protein); for ten proteins the native conformation is selected solely on the basis of its energy (Table 1). Even for the proteins with non-native lowest energy decoys the low-energy ensemble is significantly enriched with near-native conformations (Table 2). A failure to find a native decoy as the lowest energy conformation can have two sources: either it is a failure of the free-energy model/scoring function to identify the correct structure as the global optimum or the near native ensemble was not properly probed in the decoy generation.

Table 1 Results for high quality data sets
Table 2 Results for low quality data sets

Forcefield selectivity

In order to discriminate between these two possibilities, we independently generated near-native conformations starting from the experimental conformation. Their Z-scores (see Methods) average to -2.98/-3.25 for the high-quality/low quality decoy set (Figure 2).

Figure 2

Z-scores. Z-scores for independently generated near-native conformations of the proteins investigated. The Z-score is computed as the energy difference between the near-native decoy and the mean of the decoy set, divided by the standard deviation of the latter. The relaxed energies were used for mean and standard deviation.

These values are significantly lower than any reported in previous investigations on the same set of decoys using a variety of different scoring methods [6]. The very good values for the decoys indicate that sampling or ranking problems, rather than forcefield accuracy limit the selectivity of free-energy relaxation. 5pti is the only protein for which a positive Z-score was computed, which is explained by the existence of a large unstructured region in the native conformation that is stabilized by three disulphide bridges. Since disulphide bridges are not accounted for in the present version of the forcefield, it is not surprising that the relaxation protocol generates a large number of decoys with better secondary structure and hydrophobic packing. The Z-score is a function both of the quality of the forcefield and of the decoy set. The Z-scores for low-quality decoy sets are lower than those for high quality decoy sets.
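The Z-score used throughout this section is simple to compute once the relaxed energies are available. The sketch below is illustrative only: the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, pstdev


def z_score(energy, decoy_energies):
    """(E - mean of decoy energies) / standard deviation of decoy energies.

    Negative values mean the conformation lies below the bulk of the decoys.
    Assumption: population standard deviation (pstdev).
    """
    mu = mean(decoy_energies)
    sigma = pstdev(decoy_energies)
    if sigma == 0:
        raise ValueError("decoy energies have zero spread")
    return (energy - mu) / sigma
```

A near-native conformation with a Z-score of about -3 therefore lies three standard deviations below the mean relaxed decoy energy.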

Decoy clustering

We have used a low-cost computational protocol in order to develop a method that can be applied to very large decoy sets. The best energy found in the short relaxation simulations thus depends stochastically on the moves chosen in the course of the simulation. In order to reduce such fluctuations one may either sample longer or generate several independent trajectories. While both methods may be successful they would significantly increase the overall computational cost to produce a statistically reliable reduction of the energy fluctuations.

Alternatively we can exploit the fact that many decoys are available: we reduce the statistical fluctuations associated with a single short relaxation trajectory by clustering the 50 lowest-energy conformations for each of the decoy sets using a hierarchical algorithm [5, 6, 45, 46]. When a unique cluster emerges during this operation (the number of decoys in the largest cluster exceeds that of the next-largest cluster by at least 20%), we accept the prediction as "decisive"; otherwise we rate the simulation as "indecisive". Just as the available experimental methods routinely fail for many proteins due to insufficient crystal or signal quality, computational procedures that can 'solve' only a limited number of proteins would be very helpful, provided that they contain an inherent measure of the likelihood of success. Here we use the existence of a 'largest cluster' as a predictor for the decisive simulations.
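The "decisive" criterion can be sketched as a check on cluster sizes. The snippet below assumes that the 50 lowest-energy decoys have already been assigned cluster labels by some hierarchical algorithm (not shown), and it interprets "exceeds by at least 20%" as relative to the size of the next-largest cluster; both the function name and this interpretation are assumptions.

```python
from collections import Counter


def is_decisive(cluster_labels, margin=0.20):
    """True if the largest cluster is at least `margin` larger than the runner-up.

    cluster_labels: one label per decoy among the lowest-energy conformations.
    Assumption: the 20% margin is measured relative to the next-largest cluster.
    """
    sizes = sorted(Counter(cluster_labels).values(), reverse=True)
    if not sizes:
        raise ValueError("no decoys supplied")
    if len(sizes) == 1:
        return True  # a single cluster is trivially unique
    return sizes[0] >= (1.0 + margin) * sizes[1]
```

With 50 decoys split 30/20 between two clusters the prediction would count as decisive, whereas an even 25/25 split would be rated indecisive.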

Applying this criterion to the high-quality decoy sets, we find decisive predictions for 19 out of 22 proteins. For the decisive simulations we predict near-native conformations (left panel of Figure 3) for all but one protein. The single prediction failure (1ctf, Cα RMSD: 5.2 Å) occurred for a dimeric protein. On average the predicted conformation deviates by just 2.4 Å Cα RMSD from the experimental conformation. In some cases we approach experimental resolution. Three of the four prediction failures are correctly marked as indecisive, indicating that the prediction protocol is able to differentiate between prediction success and failure based on an inherent criterion. Following this protocol, correct predictions are achieved in 90% of the decisive simulations (78% of the proteins). Figure 4 (a)-(c) demonstrates the close agreement between the predicted and the experimental conformation for three nontrivial proteins, including the presence of the correct secondary structure, stabilizing tertiary contacts and hydrophobic cores.

Figure 3

C α RMSD of predicted structures. The C α RMS deviation between the best cluster and the experimental conformation for the high and low-quality decoy sets (ordered by protein size) shows no decrease of the prediction probability with protein size for the high quality decoy sets. Green bars indicate correct predictions, red bars prediction failures. Shaded bars indicate indecisive simulations (see text), where there was no 'largest cluster' in the top 50 decoys. We note that there is only one genuine prediction failure for the high-quality decoy sets. The horizontal dashed line indicates the acceptance threshold for correct predictions. The purple bars designate the deviations of the best decoys when the same clustering technique is applied with the original ROSETTA scoring function.

Figure 4

Top Row: Overlay of some predicted structures. The overlay of the predicted (red) and experimental conformation (green) documents the close agreement of conformations for 1orc[66] and 1ail[67]; only the backbone of the proteins is shown for clarity (generated with PYMOL [62]). Bottom row: Cβ-Cβ distance difference matrices: A pixel in row i and column j of the color coded distance map indicates the difference in the Cβ-Cβ distances of the native and the folded structure. Black (gray) squares indicate that the Cβ-Cβ distances of the native and the other structure differ by less than 1.5 (2.25)Å respectively. White squares indicate larger deviations. The data indicates clearly the presence of all short-range and long-range native contacts for the high-quality decoys and good agreement even for the predictions from low-quality decoy sets.

We have applied the same computational procedure to the 10 low-quality decoy sets, which contain few or no near-native conformations (Table 2). Not surprisingly, only non-native conformations have the lowest energy for all of these decoy sets, but near-native decoys still rank high in the decoy set. Applying the same clustering technique as above, we obtain correct predictions in three of the six decisive cases. The prediction failure for 1hyp is explained by the presence of four disulphide bridges not accounted for in the model, and 1csp is an all-beta protein that is problematic to treat with PFF01. In addition we find one accurate prediction (1utg) that is labeled as indecisive. The quality of the models for representative difficult cases is illustrated in Figure 4 (d)-(f), which demonstrates a still significant similarity of the tertiary structure of the models and the experimental conformations. For most of the low-quality decoy sets, differences between the model and the biologically active unit are responsible for the prediction failure. The existence of many of the correct long-range native contacts in the predicted structures is demonstrated in the Cβ-Cβ distance difference matrices shown in Figure 5. Tertiary contacts are characterized by comparing the difference in distance between pairs of amino acids of two conformations, which correspond to the NOE signals in NMR experiments. In the figure we show the Cβ-Cβ distance comparison between the model and the experimental conformation for the proteins shown in Figure 4 (same order).
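A Cβ-Cβ distance difference matrix of the kind described here can be sketched as follows. The thresholds of 1.5 Å (black) and 2.25 Å (gray) are taken from the caption of Figure 4; the function names and the representation of coordinates as lists of (x, y, z) tuples in Å are assumptions made for this illustration. Because only internal distances are compared, no superposition of the two structures is needed.

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances between the given 3D points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def distance_difference_classes(native_cb, model_cb, tight=1.5, loose=2.25):
    """Classify each residue pair by |d_native - d_model| of the Cβ-Cβ distance.

    Returns a matrix of 'black' (< tight), 'gray' (< loose) or 'white' (larger).
    """
    if len(native_cb) != len(model_cb):
        raise ValueError("structures must have the same number of residues")
    d_nat = distance_matrix(native_cb)
    d_mod = distance_matrix(model_cb)
    n = len(native_cb)
    classes = []
    for i in range(n):
        row = []
        for j in range(n):
            diff = abs(d_nat[i][j] - d_mod[i][j])
            row.append("black" if diff < tight else "gray" if diff < loose else "white")
        classes.append(row)
    return classes
```

A mostly black matrix, including entries far from the diagonal, indicates that long-range native contacts are reproduced by the model.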

Figure 5

Enrichment. Enrichment of native conformations by PFF01: Fraction of near-native conformations in the top 50 decoys by energy (black) and by CαRMSD (red). The latter bar indicates the enrichment attainable by an 'ideal' scoring function.

Discussion

In order to put these results into perspective we have investigated the enrichment of near native decoys in each decoy set. We computed the fraction of near native (as defined above) conformations in the top 50 decoys that were used for clustering using the free-energy criterion and the C α RMSD to the native conformation as ordering criteria (Figure 6). The latter fraction is a measure of the quality of the decoy set: it approaches one when a sufficient number of near native decoys is present. The first fraction is a measure of the selectivity of the free-energy relaxation protocol. Correct predictions are obviously rendered in those cases where all lowest energy decoys are near-native (1res for example) and in those cases, where near-native decoys dominate the top 50 configurations (for example: 1afi, 1gab). The clustering scheme gives acceptable predictions even when only 30% of the low-energy decoys are near-native (for example: 2pdd), but routinely fails when the selectivity of the forcefield is insufficient (1am3). There is clearly room for improvement in the clustering protocol, because some decoys with a relatively large number of top-scoring near-natives nevertheless fail to generate near-native predictions (for example: 1pgx). For the low-quality decoy sets there is a strong correlation between the fraction of good decoys and the success of the approach. Even when the fraction of near-native decoys in the decoy-set drops below 10%, a relatively small number of selected near-natives is sufficient to obtain a near-native prediction. This observation indicates that a search for improved relaxation protocols may help to reduce the required fraction of near-native decoys for a successful prediction below 10% of the overall database.
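The two fractions compared in this paragraph can be computed as in the following sketch. It assumes each decoy is represented as an (energy, Cα RMSD) pair, which is a convention chosen here for illustration; the function name is hypothetical.

```python
def enrichment(decoys, top_n=50, cutoff=4.5):
    """Fractions of near-native decoys in the top_n by energy and by Cα RMSD.

    decoys: list of (energy, rmsd) tuples (assumed format).
    Returns (fraction_by_energy, fraction_by_rmsd); the second value is the
    ceiling an 'ideal' scoring function could reach for this decoy set.
    """
    if not decoys:
        raise ValueError("empty decoy set")

    def near_native_share(subset):
        return sum(1 for _, rmsd in subset if rmsd < cutoff) / len(subset)

    by_energy = sorted(decoys, key=lambda d: d[0])[:top_n]
    by_rmsd = sorted(decoys, key=lambda d: d[1])[:top_n]
    return near_native_share(by_energy), near_native_share(by_rmsd)
```

Comparing the two numbers separates the quality of the decoy set (second value) from the selectivity of the scoring protocol (first value).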

Figure 6

Enrichment. Enrichment of the best decoys by Cα RMSD, total energy in PFF01 and its components (Lennard-Jones, Solvation Energy, Hydrogen Bonding, Sidechain Electrostatics), Radius of Gyration, and the clustering technique described in the text. The vertical axis counts the number of proteins in the database that yield a best decoy according to the chosen criterion.

This observation is also supported when we analyze the top decoy of each decoy set by a variety of measures. Following the analysis of Tsai et al. [6], we show the number of proteins for which the top conformation is selected by various scoring methods (Figure 7). Ranked by Cα RMSD we find that about half of the decoy sets contain at least one decoy with a Cα RMSD of less than 2.5 Å. None of the scoring functions is capable of finding this single 'needle in the haystack'. If we look at the error range of 3–4 Å Cα RMSD, which is commensurate with our folding simulations [12, 18, 19], the relaxation/clustering technique is far superior to any of the other indicators investigated here. The next best scoring function is the Lennard-Jones energy, as was reported previously [6]. In comparison, many other indicators that are believed to correlate highly with 'nativeness', such as the existence of secondary structure (as measured by hydrogen bonding energy), solvation terms or sidechain electrostatics, are much less selective. This selectivity of the Lennard-Jones interaction is presently not understood, because the Lennard-Jones energy gives only a small contribution to the overall 'folding' energy in our folding simulations. We hypothesized that Lennard-Jones interactions simply measure 'compactness' and ranked the decoys by their radius of gyration as a similar measure. However, the radius of gyration emerged as a much less sensitive measure. The Lennard-Jones energy also does not remove many clashing conformations, because the decoy sets are of very high quality with regard to steric hindrance. The data compilation with the ROSETTA scoring function leads to a highly preselected set of conformations, which might bias the results. Our observation might be explained by the fact that many decoys are already near-optimal with respect to the other energy terms, so that the Lennard-Jones term emerges with higher selectivity than with a set of random conformations. We have also repeated the analysis using the best Cα RMSD of the top five decoys in each category to reduce possible scatter and find qualitatively the same results (data not shown). The fact that total energy and clustering technique are by far the most selective indicates that it is the combination of terms in the forcefield which results in the high overall selectivity of the method. This is also confirmed by comparing our results with those of ROSETTA. We have scored all the proteins with the original ROSETTA [28, 47–49] scoring function and applied the same clustering techniques as described above (purple bars in Figure 3). We find that the present method leads to a significant improvement of the Cα RMSD for all but one decoy set with decisive predictions. It is therefore the combination of a powerful method for decoy generation with the additional selectivity provided by the all-atom forcefield that generates the high selectivity of the relaxation approach.

Figure 7

Comparison with DFIRE: RMSD of best energy (top row) and best cluster (bottom row) for the decoy sets by PFF01 (green bar) with DFIRE (red bar). The dashed lines indicate the averages over all decoy sets.

We have also computed the logPB1 and logPB10 values that were used to characterize the selectivity of the density scoring/self-RAPDF function in a recent investigation using the same decoy set [36] (Table 3). We find average values of -0.48 for logPB1 and -1.43 for logPB10, which compare with -0.92/-1.46 for the density scoring function and -1.0/-1.6 for self-RAPDF for the same subset of decoys. The ranking of native conformations per se is not important for structure prediction, since it may not be an indicator of how well a function can select near-native decoys. In other words, it is relatively easy to design functions that discriminate the native conformation from a set of decoys, but hard to design functions that can discriminate near-native decoys from other decoys. This scenario applies exactly here: as Table 3 demonstrates, self-RAPDF works very well to select the best decoys from decoy sets that contain a significant fraction (> 30%) of near-native decoys by our definition, while the present protocol may rank the energetically best decoy comparatively badly in a set of only good decoys. Due to the inherent limitation of its resolution, PFF01 is not a good forcefield for the second purpose, and we therefore obtain comparatively bad logPB1/10 values for very good decoy sets. If instead we focus on the selection of near-native decoys from decoy sets with large non-native fractions, the present protocol outperforms self-RAPDF. In Table 3 we have marked the decoy sets where a given method selects at least one near-native decoy in the top ten. Using this selection criterion, we find that PFF01 succeeds in 78% of the cases, in comparison to 50% for self-RAPDF. The approach pursued here is useful to select, with high probability, low-resolution decoys from complex decoy sets that contain many non-native competitors [50].
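The text does not spell out how logPB1 and logPB10 are defined. The sketch below assumes the definition commonly used in the decoy-discrimination literature: the base-10 logarithm of the best Cα RMSD rank found among the n lowest-scoring decoys, divided by the total number of decoys. This definition, the (energy, rmsd) tuple format and the function name are all assumptions and should be checked against reference [36].

```python
import math


def log_pb(decoys, n_top):
    """Assumed definition: log10(best RMSD rank among the n_top best scores / N).

    decoys: list of (energy, rmsd) tuples; lower energy = better score.
    A value of 0 means the selection is no better than the worst decoy;
    more negative values mean better selection.
    """
    total = len(decoys)
    if total == 0 or n_top < 1:
        raise ValueError("need at least one decoy and n_top >= 1")
    # RMSD rank of every decoy (1 = closest to the native structure)
    order = sorted(range(total), key=lambda i: decoys[i][1])
    rmsd_rank = {idx: rank for rank, idx in enumerate(order, start=1)}
    # indices of the n_top lowest-energy decoys
    best_scored = sorted(range(total), key=lambda i: decoys[i][0])[:n_top]
    best_rank = min(rmsd_rank[i] for i in best_scored)
    return math.log10(best_rank / total)
```

Under this assumed definition, logPB1 corresponds to `log_pb(decoys, 1)` and logPB10 to `log_pb(decoys, 10)`.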

Table 3 Ranking and Selectivity

Conclusion

We have investigated a straightforward all-atom energy-relaxation protocol for protein structure prediction. We scored all conformations in a given decoy set using our all-atom forcefield PFF01 after a rapid relaxation procedure, followed by a hierarchical clustering of the 50 top-scoring decoys. We label a relaxation as "decisive" if the largest cluster is at least 20% larger than any other.

With this approach we have succeeded in assigning all-atom tertiary structure to 78% of the proteins (marked as decisive) investigated in this study with an average Cα RMSD of 3.12 Å. Exploiting the inherent success criterion of our approach, a near-native conformation was predicted in 90% of the decisive relaxation simulations. This high degree of success is rationalized by the high selectivity of the forcefield. We find an average Z-score of -3.03 for independently generated near-native conformations with respect to the decoy sets. PFF01 stabilizes the native conformation of all but one protein against the decoys in the data set. The protocol investigated here has a success threshold of about 10% of native decoys, but appears to succeed at least occasionally for lower native content of the decoy set. Further improvements in the forcefield and the relaxation protocol may be able to push the required fraction of native conformations to even lower values.

The accuracy of the predicted structures appears to be limited to 3–4 Å by the resolution of the present free-energy forcefield. This resolution is comparable to that of our folding investigations and commensurate with other folding studies using implicit solvent forcefields [51]. In order to improve the accuracy further, one can either design all-atom explicit water protocols that start from the predicted structures or rank families of near-native conformations by knowledge-based scoring functions such as self-RAPDF [36] that are more selective in the near-native conformational space.

Energy relaxation emerges as a powerful low-cost approach (20–50 CPU days in parallel per decoy set) toward generic de-novo protein structure prediction. It can be applied to large all-atom decoy sets of any origin and requires no preexisting structural information to identify the native conformation. We have confined the present investigation to the ROSETTA decoy sets, because the computation of selectivity indicators (such as the Z-score) or the success rate for prediction obviously depends on the methods by which the decoy sets were generated. The ROSETTA decoy set was explicitly generated for forcefield validation with one coherent protocol and thus gives comparable results for a wide range of structurally distinct small proteins. In addition, the protocol investigated here is based on a generic, publicly available technique and can thus be used as the basis of a protocol for protein structure prediction in the CASP competition. Other decoy sets (such as decoys-are-US), which also contain larger proteins, will be investigated in future studies. We stress that only one of two important ingredients of protein structure prediction, the ability of the relaxation protocol to select near-native conformations for diverse decoy sets, was investigated here.

Methods

Forcefield

The all-atom (with the exception of apolar CH_n groups) free-energy forcefield PFF01 [12] parametrizes the internal free energy of a protein macrostate (excluding backbone entropy) in a minimal thermodynamic approach [12, 19, 52]. It contains the following non-bonded interactions:

\[
V(\{\vec{r}_i\}) = \sum_{ij} V_{ij}\left[\left(\frac{R_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R_{ij}}{r_{ij}}\right)^{6}\right] + \sum_{ij} \frac{q_i q_j}{\varepsilon_{g(i)g(j)}\, r_{ij}} + \sum_{i} \sigma_i A_i + \sum_{\mathrm{hbonds}} V_{hb}. \qquad (1)
\]

Here r_ij denotes the distance between atoms i and j and g(i) the type of the amino acid i. The Lennard-Jones parameters (V_ij, R_ij for potential depths and equilibrium distance) depend on the type of the atom pair and were adjusted to satisfy constraints derived from a set of 138 proteins of the PDB database [52–54]. The non-trivial electrostatic interactions in proteins are represented via group- and position-dependent dielectric constants (ε_g(i)g(j), depending on the amino acids to which the atoms i and j belong). The partial charges q_i and the dielectric constants were derived in a potential-of-mean-force approach [55] [see Additional file 1]. Interactions with the solvent were first fit in a minimal solvent accessible surface model [56] parameterized by free energies per unit area σ_i to reproduce the enthalpies of solvation of the Gly-X-Gly family of peptides [57]. A_i corresponds to the area of atom i that is in contact with a fictitious solvent. Hydrogen bonds are described via dipole-dipole interactions included in the electrostatic terms and an additional short-range term for backbone-backbone hydrogen bonding (CO to NH), which depends on the OH distance, the angle between N, H and O along the bond and the angle between the CO and NH axes [12, 58].
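The first term of Equation (1) is a 12-6 Lennard-Jones potential written in the form whose minimum lies at the equilibrium distance with depth equal to the well-depth parameter. A minimal sketch of the pair term follows; the function and argument names are hypothetical, and the numerical values in the usage note are illustrative rather than PFF01 parameters.

```python
def lj_pair_energy(r, depth, r_min):
    """Lennard-Jones pair term V * [(R/r)^12 - 2 (R/r)^6].

    r     : distance between the two atoms (same unit as r_min)
    depth : potential depth V_ij (energy unit of choice)
    r_min : equilibrium distance R_ij, where the energy equals -depth
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    x = (r_min / r) ** 6
    return depth * (x * x - 2.0 * x)
```

For instance, `lj_pair_energy(3.5, 0.2, 3.5)` gives -0.2, the well depth at the equilibrium distance, while the energy becomes positive (repulsive) for distances below R / 2^(1/6) and tends to zero for large separations.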

In the folding process under physiological conditions, the degrees of freedom of a peptide are confined to rotations about single bonds. In our simulations we therefore consider only moves of the sidechain and backbone dihedral angles, which are attempted with thirty and seventy percent probability, respectively. The moves for the sidechain angles are drawn uniformly from an interval with a maximal change of 5 degrees. Half of the backbone moves are generated in the same fashion; the remainder are drawn from a move library that was designed to reflect the natural amino-acid-dependent bias towards the formation of α-helices or β-sheets. The probability distribution of the move library was fitted to experimental probabilities observed in the PDB [59]. While driving the simulation towards the formation of secondary structure, the move library introduces no bias towards helical or sheet structures beyond that encountered in nature.
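The move selection described above can be sketched in a few lines of Python. This is only an illustration: the function name, the tuple it returns, and the `library` callable are hypothetical, the 5 degree limit is assumed to be symmetric (a change between -5 and +5 degrees), and the library is assumed to return an absolute dihedral angle, which the text does not specify.

```python
import random

MAX_DELTA_DEG = 5.0  # maximal change of a small random move (from the text)

def propose_move(n_sidechain, n_backbone, library, rng=random):
    """Return (kind, index, value) for one attempted dihedral move.

    library: hypothetical callable(index, rng) -> dihedral angle in degrees,
    standing in for the amino-acid dependent move library.
    """
    if rng.random() < 0.30:
        # Sidechain move: attempted with thirty percent probability.
        idx = rng.randrange(n_sidechain)
        return ("sidechain_delta", idx, rng.uniform(-MAX_DELTA_DEG, MAX_DELTA_DEG))
    # Backbone move: attempted with seventy percent probability.
    idx = rng.randrange(n_backbone)
    if rng.random() < 0.5:
        # Half of the backbone moves are small random changes.
        return ("backbone_delta", idx, rng.uniform(-MAX_DELTA_DEG, MAX_DELTA_DEG))
    # The other half are drawn from the move library.
    return ("backbone_library", idx, library(idx, rng))
```

With these probabilities, roughly 30% of the attempted moves change a sidechain angle, 35% are small backbone changes, and 35% come from the library.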

Decoy sets and relaxation

The decoy sets were provided electronically by J. Tsai [6]. We excluded decoy sets that contained only fragments of the experimental structure (2ptl, 1tuc, 1vcc) and those that contain iron clusters, stabilizing ions or heavy metals not parametrized in our forcefield (1bq9, 1cc5, 1ptq, 1tif, 5icb). We also excluded 1msi, an antifreeze protein [60] that coordinates a shell of crystal water, which cannot be described with an implicit solvent model. Each decoy was relaxed in a single simulated annealing run (50,000 steps, T_start = 200 K, T_final = 3 K). The decoys were clustered with a hierarchical algorithm [45]. Near-native conformations were generated in 50 independent basin hopping simulations [61] starting from the native conformation, each comprising 50 simulated annealing cycles with the same protocol as above and a threshold acceptance criterion of 1 kcal/mol.
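The relaxation and basin hopping protocol can be illustrated with a short Python sketch on a generic state. Several details are assumptions rather than statements about the authors' code: the geometric cooling schedule (the text gives only the step count and the two temperatures), the Metropolis acceptance rule inside each annealing run, and the reading of the threshold criterion as "accept the annealed conformation if its energy is less than 1 kcal/mol above the current one". The names `anneal`, `basin_hopping`, `energy` and `perturb` are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def anneal(x, energy, perturb, steps=50_000, t_start=200.0, t_final=3.0, rng=random):
    """One simulated annealing run; returns the final state and its energy."""
    # Assumed geometric cooling from t_start to t_final over the run.
    cooling = (t_final / t_start) ** (1.0 / max(steps - 1, 1))
    t = t_start
    e = energy(x)
    for _ in range(steps):
        x_new = perturb(x, rng)
        e_new = energy(x_new)
        # Metropolis criterion (assumed): always accept downhill moves,
        # accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / (K_B * t)):
            x, e = x_new, e_new
        t *= cooling
    return x, e

def basin_hopping(x, energy, perturb, cycles=50, threshold=1.0, rng=random, **anneal_kw):
    """Repeated annealing cycles with a threshold acceptance criterion."""
    e = energy(x)
    for _ in range(cycles):
        x_new, e_new = anneal(x, energy, perturb, rng=rng, **anneal_kw)
        # Accept the new minimum unless it lies more than `threshold`
        # (kcal/mol) above the current one.
        if e_new < e + threshold:
            x, e = x_new, e_new
    return x, e
```

In this picture, relaxing one decoy corresponds to a single call to `anneal` with the defaults, and generating near-native conformations corresponds to calling `basin_hopping` 50 times from the native conformation with different random seeds.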

References

  1. Baker D, Sali A: Protein Structure Prediction and Structural Genomics. Science 2001, 294: 93–96. 10.1126/science.1065659

  2. Hardin C, Pogorelov T, Luthey-Schulten Z: Ab initio protein structure prediction. Curr Opin Struct Biol 2002, 12: 176–181. 10.1016/S0959-440X(02)00306-8

  3. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction (CASP) – Round 6. Proteins: Structure, Function, and Bioinformatics 2005, 61: 3–7. 10.1002/prot.20716

  4. Ginalski K, Grishin N, Godzik A, Rychlewski L: Practical lessons from protein structure prediction. Nucl Acids Res 2005, 33: 1874–1891. 10.1093/nar/gki327

  5. Bradley P, Misura KMS, Baker D: Toward High-Resolution de-novo Structure Prediction for small proteins. Science 2005, 309: 1868–1871. 10.1126/science.1113801

  6. Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D: An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 2003, 53: 76–87. 10.1002/prot.10454

  7. Duan Y, Kollman PA: Pathways to a Protein Folding Intermediate Observed in a 1-Microsecond Simulation in Aqueous Solution. Science 1998, 282: 740–744. 10.1126/science.282.5389.740

  8. Pande VS, Rokhsar DS: Molecular dynamics simulations of unfolding and refolding of a beta-hairpin fragment of protein G. Proc Natl Acad Sci 1999, 96: 9062–9067. 10.1073/pnas.96.16.9062

  9. Snow CD, Nguyen H, Pande VS, Gruebele M: Absolute Comparison of simulated and experimental protein folding dynamics. Nature 2002, 420: 102–106. 10.1038/nature01160

  10. Garcia AE, Onuchic JN: Folding a protein in a computer: An atomic description of the folding/unfolding of protein A. Proc Natl Acad Sci 2003, 100: 13898–13903. 10.1073/pnas.2335541100

  11. Zagrovic B, Snow CD, Shirts MR, Pande VS: Simulation of Folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. Journal of Molecular Biology 2002, 323: 927–937. 10.1016/S0022-2836(02)00997-X

  12. Herges T, Wenzel W: An All-Atom Force Field for Tertiary Structure Prediction of Helical Proteins. Biophys J 2004, 87(5):3100–3109. 10.1529/biophysj.104.040071

  13. Anfinsen CB: Principles that govern the Folding of Protein Chains. Science 1973, 181: 223–230. 10.1126/science.181.4096.223

  14. Gibson K, Scheraga H: A Rapid and Efficient Algorithm for Packing Polypeptide Chains by Energy Minimizations. J Comp Chem 1994, 15: 1403–1413. 10.1002/jcc.540151210

  15. Dill K, Chan H: From Levinthal to Pathways to Funnels: The "New View" of Protein Folding Kinetics. Nature Structural Biology 1997, 4: 10–19. 10.1038/nsb0197-10

  16. Onuchic JN, Luthey-Schulten Z, Wolynes PG: Theory of Protein Folding: The Energy Landscape Perspective. Annu Rev Phys Chem 1997, 48: 545–600. 10.1146/annurev.physchem.48.1.545

  17. Schug A, Herges T, Wenzel W: Reproducible Protein Folding with the Stochastic Tunneling Method. Phys Rev Lett 2003, 91: 158102. 10.1103/PhysRevLett.91.158102

  18. Herges T, Wenzel W: Free Energy Landscape of the Villin Headpiece in an All-Atom Forcefield. Structure 2005, 13: 661. 10.1016/j.str.2005.01.018

  19. Herges T, Wenzel W: Reproducible in-silico folding of a three-helix protein and characterization of its free energy landscape in a transferable all-atom forcefield. Phys Rev Lett 2005, 94: 018101. 10.1103/PhysRevLett.94.018101

  20. Schug A, Wenzel W: Predictive in-silico all-atom folding of a four helix protein with a free-energy model. J Am Chem Soc 2004, 126: 16736–16737. 10.1021/ja0453681

  21. Schug A, Wenzel W: Evolutionary Strategies for All-Atom folding of the sixty amino acid bacterial ribosomal protein L20. Biophys J 2006, 90: 4273–4280. 10.1529/biophysj.105.070409

  22. Wenzel W: Reproducible folding of the trp-zipper. 2006, in press.

  23. Go N, Scheraga HA: On the use of classical statistical mechanics in the treatment of polymer chain conformation. Macromolecules 1976, 9: 535–542. 10.1021/ma60052a001

  24. Vasquez M, Nemethy G, Scheraga H: Conformational Energy Calculations on Polypeptides and Proteins. Chem Rev 1994, 94: 2138–2239. 10.1021/cr00032a002

  25. Park B, Levitt M: Energy Functions that discriminate X-ray and near-native folds from well-constructed decoys. J Mol Biol 1996, 258: 367–392. 10.1006/jmbi.1996.0256

  26. Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Kazmierkiewicz R, Oldziej S, Wedemeyer WJ, Gibson KD, Arnautova YA, Saunders J, Ye YJ, Scheraga HA: Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proc Natl Acad Sci 2001, 98: 2329–2333. 10.1073/pnas.041609598

  27. Karplus K, Karchin R, Draper J, Casper J, Mandel-Gutfreund Y, Diekhans M, Hughey R: Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins: Structure, Function, and Genetics 2003, 53: 491–496. 10.1002/prot.10540

  28. Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268: 209–225. 10.1006/jmbi.1997.0959

  29. Holm L, Sander C: Evaluation of protein models by atomic solvation preference. J Mol Biol 1992, 225: 93–105. 10.1016/0022-2836(92)91028-N

  30. Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 1998, 275: 895–916. 10.1006/jmbi.1997.1479

  31. Lu H, Skolnick J: A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 2001, 44: 223–232. 10.1002/prot.1087

  32. McConkey B, Sobolev V, Edelman M: Discrimination of native protein structures using atom-atom contact scoring. Proc Natl Acad Sci 2003, 100: 3215–3220. 10.1073/pnas.0535768100

  33. Wang Y, Zhang H, Li W, Scott R: Discriminating compact nonnative structures from the native structure of globular proteins. Proc Natl Acad Sci 1995, 92: 709–713. 10.1073/pnas.92.3.709

  34. Park B, Levitt M: Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J Mol Biol 1996, 258: 367–392. 10.1006/jmbi.1996.0256

  35. Felts A, Gallicchio E, Wallqvist A, Levy R: Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model. Proteins 2002, 48: 404–422. 10.1002/prot.10171

  36. Wang K, Fain B, Levitt M, Samudrala R: Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct Biol 2004, 4: 8. 10.1186/1472-6807-4-8

  37. Hubner I, Deeds E, Shakhnovich E: High-resolution protein folding with a transferable potential. Proc Natl Acad Sci 2005, 102: 18914–18919. 10.1073/pnas.0502181102

  38. Jones D, McGuffin L: Assembling novel protein folds from super-secondary structural fragments. Proteins 2003, 53(Suppl 6):480–485. 10.1002/prot.10542

  39. Kim DE, Chivian D, Baker D: Protein structure prediction and analysis using the Robetta server. Nucl Acids Res 2004, 32: 526–531. 10.1093/nar/gkh468

  40. Thomas P, Dill K: Statistical potentials extracted from protein structures: how accurate are they? J Mol Biol 1996, 257: 457–469. 10.1006/jmbi.1996.0175

  41. Ben-Naim A: Statistical potentials extracted from protein structures: Are these meaningful potentials. J Chem Phys 1997, 107: 3698–3706. 10.1063/1.474725

  42. Huang E, Samudrala R, Ponder J: Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J Mol Biol 1999, 290: 267–281. 10.1006/jmbi.1999.2861

  43. Park B, Huang E, Levitt M: Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J Mol Biol 1997, 266: 831–846. 10.1006/jmbi.1996.0809

  44. Schug A, Verma A, Herges T, Lee KH, Wenzel W: Comparison of Stochastic Optimization Methods for all-atom folding of the trp-cage protein. ChemPhysChem 2005, 6: 2640–2646. 10.1002/cphc.200500213

  45. Feig M, Karanicolas J, Brooks CL III: MMTSB Tool Set (2001), MMTSB NIH Research Resource. Tech. rep., The Scripps Research Institute; 2001.

  46. Shortle D, Simons K, Baker D: Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci 1998, 95: 11158–11162. 10.1073/pnas.95.19.11158

  47. Simons KT, Ruczinski I, Kooperberg C, Fox B, Bystroff C, Baker D: Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 1999, 34: 82–95.

  48. Bonneau R, Tsai J, Ruczinski I, Chivian D, Rohl C, Strauss CE, Baker D: Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 2001, (Suppl 5):119–126. 10.1002/prot.1170

  49. Bonneau R, Strauss C, Rohl C, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D: De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002, 322: 65–78. 10.1016/S0022-2836(02)00698-8

  50. Eyrich V, Standley D, Friesner R: Prediction of protein tertiary structure to low resolution: performance for a large and structurally diverse test set. J Mol Biol 1999, 288: 725–742. 10.1006/jmbi.1999.2702

  51. Chen J, Im W, Brooks CL III: Balancing Solvation and Intramolecular Interactions: Toward a Consistent Generalized Born Force Field. J Am Chem Soc 2006, 128: 3728–3736. 10.1021/ja057216r

  52. Herges T, Merlitz H, Wenzel W: Stochastic Optimization Methods for Biomolecular Structure Prediction. J Ass Lab Autom 2002, 7: 98–104. 10.1016/S1535-5535(04)00203-5

  53. Abagyan RA, Totrov M: Biased Probability Monte Carlo Conformation Searches and Electrostatic Calculations for Peptides and Proteins. J Molec Biol 1994, 235: 983–1002. 10.1006/jmbi.1994.1052

  54. Herges T, Schug A, Wenzel W: Exploration of the Free Energy Surface of a Three Helix Peptide with Stochastic Optimization Methods. Int J Quant Chem 2004, 99: 854–893. 10.1002/qua.20052

  55. Avbelj F, Moult J: Role of electrostatic screening in determining protein main chain conformational preferences. Biochemistry 1995, 34: 755–764. 10.1021/bi00003a008

  56. Eisenberg D, McLachlan AD: Solvation energy in protein folding and binding. Nature 1986, 319: 199–203. 10.1038/319199a0

  57. Sharp KA, Nicholls A, Friedman R, Honig B: Extracting Hydrophobic Free Energies from Experimental Data:Relationship to Protein Folding and Theoretical Models. Biochemistry 1991, 30: 9686–9697. 10.1021/bi00104a017

  58. Kortemme T, Morozov A, Baker D: An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003, 324: 1239–1259. 10.1016/S0022-2836(03)00021-4

  59. Pedersen JT, Moult J: Protein folding simulations with genetic algorithms and a detailed molecular description. J Molec Biol 1997, 269: 240. 10.1006/jmbi.1997.1010

  60. Jia Z, DeLuca C, Chao H, Davies P: Structural basis for the binding of a globular antifreeze protein to ice. Nature 1996, 384: 285–288. 10.1038/384285a0

  61. Verma A, Schug A, Lee KH, Wenzel W: Basin Hopping Simulations for All-Atom Protein Folding. J Chem Phys 2006, 124: 044515. 10.1063/1.2138030

  62. DeLano WL: The PyMOL Molecular Graphics System. DeLano Scientific, San Carlos, CA, USA; 2002. [http://www.pymol.org]

  63. Aihara H, Ito Y, Kurumizaka H, Terada T, Yokoyama S, Shibata T: An interaction between a specified surface of the C-terminal domain of RecA protein and double-stranded DNA for homologous pairing. J Mol Biol 1997, 274: 213–221. 10.1006/jmbi.1997.1403

  64. Assa-Munt N, Mortishire-Smith R, Aurora R, Herr W, Wright P: The solution structure of the Oct-1 POU-specific domain reveals a striking similarity to the bacteriophage lambda repressor DNA-binding domain. Cell 1993, 73: 193–205. 10.1016/0092-8674(93)90171-L

  65. Narayana N, Matthews D, Howell E, Nguyen-huu X: A plasmid-encoded dihydrofolate reductase from trimethoprim-resistant bacteria has a novel D2-symmetric active site. Nat Struct Biol 1995, 2: 1018–1025. 10.1038/nsb1195-1018

  66. Albright R, Mossing M, Matthews B: High-resolution structure of an engineered Cro monomer shows changes in conformation relative to the native dimer. Biochemistry 1996, 35: 735–742. 10.1021/bi951958n

  67. Liu J, Lynch P, Chien C, Montelione G, Krug R, Berman H: Crystal structure of the unique RNA-binding domain of the influenza virus NS1 protein. Nat Struct Biol 1997, 4: 896–899. 10.1038/nsb1197-896

  68. Schindelin H, Marahiel M, Heinemann U: Universal nucleic acid-binding domain revealed by crystal structure of the B. subtilis major cold-shock protein. Nature 1993, 364: 164–168. 10.1038/364164a0

Acknowledgements

We acknowledge the use of computational facilities at the KIST super-computational materials lab of the Korean Institute of Science and Technology. This work was supported by the German National Science Foundation (DFG WE 1863/10-1, 10-2) and the Bode Foundation. We are thankful to J. Tsai for providing us with the decoy sets.

Author information

Corresponding author

Correspondence to Wolfgang Wenzel.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

Simulations and preparation of the manuscript were jointly performed by AV and WW. Both authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Verma, A., Wenzel, W. Protein structure prediction by all-atom free-energy refinement. BMC Struct Biol 7, 12 (2007). https://doi.org/10.1186/1472-6807-7-12

  • DOI: https://doi.org/10.1186/1472-6807-7-12

Keywords