- Research article
- Open Access
Improved protein structure selection using decoy-dependent discriminatory functions
BMC Structural Biology volume 4, Article number: 8 (2004)
A key component in protein structure prediction is a scoring or discriminatory function that can distinguish near-native conformations from misfolded ones. Various types of scoring functions have been developed to accomplish this goal, but their performance is not adequate to solve the structure selection problem. In addition, there is poor correlation between the scores and the accuracy of the generated conformations.
We present a simple and nonparametric formula to estimate the accuracy of predicted conformations (or decoys). This scoring function, called the density score function, evaluates decoy conformations by performing an all-against-all Cα RMSD (Root Mean Square Deviation) calculation in a given decoy set. We tested the density score function on 83 decoy sets grouped by their generation methods (4state_reduced, fisa, fisa_casp3, lmds, lattice_ssfit, semfold and Rosetta). The density scores have correlations as high as 0.9 with the Cα RMSDs of the decoy conformations, measured relative to the experimental conformation for each decoy.
We previously developed a residue-specific all-atom probability discriminatory function (RAPDF), which compiles statistics from a database of experimentally determined conformations, to aid in structure selection. Here, we present a decoy-dependent discriminatory function called self-RAPDF, where we compiled the atom-atom contact probabilities from all the conformations in a decoy set instead of using an ensemble of native conformations, with a weighting scheme based on the density scores. The self-RAPDF has a higher correlation with Cα RMSD than RAPDF for 76/83 decoy sets, and selects better near-native conformations for 62/83 decoy sets. Self-RAPDF may be useful not only for selecting near-native conformations from decoy sets, but also for fold simulations and protein structure refinement.
Both the density score and the self-RAPDF functions are decoy-dependent scoring functions for improved protein structure selection. Their success indicates that information from the ensemble of decoy conformations can be used to derive statistical probabilities and facilitate the identification of near-native structures.
A scoring or discriminatory function that can reliably distinguish near-native conformations from misfolded ones is a necessity to solve the structure prediction problem. Various types of scoring functions have been developed to accomplish this goal, and can be grouped into two categories: physics-based functions that take into account electrostatic, Van der Waals, hydrogen bonding, solvation and covalent interactions [1–4], and knowledge-based functions that compile statistics on the preferences of amino acid residues/atoms (such as pairwise distances or solvent accessibility) from experimentally solved structures [5–10]. In particular, knowledge-based scoring functions, especially detailed all-atom ones, have been applied in all areas of structure prediction: comparative or homology modeling, fold recognition or threading, and de novo prediction.
The knowledge-based scoring functions have been very successful in discriminating the native conformation from misfolded ones [10–13]. However, even the best conformations generated by the structure prediction protocols, particularly de novo ones, are usually still quite distant from the native conformation [14–18]. Therefore, it is more important to assess how well a given scoring function can distinguish the best predicted conformations in a given decoy set generated by structure prediction methods. In this regard, none of these functions can consistently select the most near-native conformations from non-native ones, and there is poor correlation between the scores and measures of similarity between the predicted conformations (or "decoys") and the native conformation, such as the Cα root mean square deviation (RMSD) of the decoys relative to the native conformation.
There are problems with the theoretical justification of both the physics-based and knowledge-based [20–23] approaches, which in part explains the ineffectiveness of these scoring functions [19–23]. Specifically in the case of knowledge-based scoring functions, which are "trained" using experimentally determined structures, the intrinsic structural properties of native conformations may be captured, but these functions may not contain the information necessary to evaluate the quality of near-native and misfolded conformations. However, borrowing information from all the conformations in a decoy set may be helpful to evaluate the proximity of any given near-native or misfolded conformation to the corresponding native structure. This is supported by recent findings that prediction of native contacts was improved when using the frequency of occurrence of contacts in decoy conformations from a decoy set [24, 25].
A new variety of scoring functions that attempt to use all the information in the ensemble of conformations generated by a structure prediction protocol have been used as a final filtering step in de novo structure prediction. This strategy for predicting protein structure is based on the assumption that there are a greater number of low-energy conformations surrounding the correct fold than there are surrounding incorrect folds in a decoy set. These functions compute a score for a given conformation based on its distance in Cartesian space relative to its neighbors. Initially used successfully at CASP3, one approach was implemented simply by adding the number of neighbors within a particular Cα RMSD cutoff to a given conformation [26, 27]. In these cases, the conformation with the greatest number of neighbors was closer to the experimentally determined conformation than were the majority of conformations in the ensemble. The method was refined further at CASP4 to simultaneously cluster decoy conformations and pick the centers of these largest clusters.
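The neighbor-counting idea can be illustrated with a minimal sketch. It assumes the all-against-all Cα RMSD matrix has already been computed; the function name `neighbor_counts`, the toy matrix, and the choice of an inclusive cutoff are illustrative assumptions, not details taken from the cited work.

```python
def neighbor_counts(rmsd, cutoff):
    """Count, for each decoy, the other decoys within `cutoff` Å.

    `rmsd` is a symmetric matrix (list of lists) of pairwise Cα RMSDs.
    Treating the cutoff as inclusive (<=) is an assumption of this sketch.
    """
    n = len(rmsd)
    return [
        sum(1 for j in range(n) if j != i and rmsd[i][j] <= cutoff)
        for i in range(n)
    ]


# Toy pairwise Cα RMSD matrix (Å) for four hypothetical decoys.
rmsd = [
    [0.0, 2.0, 3.0, 9.0],
    [2.0, 0.0, 2.5, 8.0],
    [3.0, 2.5, 0.0, 7.5],
    [9.0, 8.0, 7.5, 0.0],
]

counts = neighbor_counts(rmsd, cutoff=2.5)
# Index of the decoy with the most neighbors (first one on ties).
most_populated = max(range(len(counts)), key=counts.__getitem__)
```

In this toy set the second decoy has the most neighbors, so it would be the one selected by a neighbor-count criterion.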
Here we describe a similar and simple formula, called the density score function, to pick the near-native conformations from a large ensemble of conformations in a decoy set. The logic underlying such a nonparametric function is that a significant fraction of conformations in the decoy set resemble the native conformation from different directions in the space. These decoy conformations form a single cluster, and the density of the cluster gradually increases from the periphery to the center of the cluster. When near-native conformations are sampled adequately, the center of the cluster is where the most near-native conformations should reside. Therefore, by calculating the density around a given conformation, we can estimate the similarity between this conformation and the corresponding native one. To calculate the densities, we first perform an all-against-all Cα RMSD calculation, and the density score for each conformation is then calculated as the sum of RMSDs between it and all other decoy conformations.
Publicly available decoy sets provide a means to evaluate performance of scoring functions, and permit comparisons between different structure discrimination methods [29, 30]. Many of these decoy sets contain a large number (>100) of decoy conformations, with varying degrees of similarity to the native conformation. The goal of any scoring function is to pick the conformations that are most similar to the native one. We tested the density score function on 83 decoy sets grouped by their generation methods (4state_reduced, fisa, fisa_casp3, lmds, lattice_ssfit, semfold and Rosetta). Since these sets contain conformations generated by different conformational search algorithms, the performance of a scoring function depends on each set, and success in one set does not guarantee success in another. Therefore, the goal of testing on a wide variety and large number of decoy sets is to provide a rigorous evaluation of how well a scoring function works. In general, the density scores have relatively high correlation with the Cα RMSD relative to the experimentally determined structure in the decoy sets we evaluated. However, because the calculation of the density score function depends on an existing decoy set, this scoring function cannot be easily used in a fold simulation.
The success of the density score function led us to believe that using information in the decoy set itself can be helpful in selecting the best conformations using knowledge-based scoring functions. These functions usually compile statistics on the preferences of amino acid residue or atomic contacts in a large ensemble of experimentally determined structures [5–10]. In previous efforts we derived a residue-specific all-atom probability discriminatory function (RAPDF) to compute the probability of a conformation being native-like, given a set of pairwise atom-atom distances. Here, we hypothesize that such a knowledge-based function may be used to derive statistics from all the decoy conformations in a large decoy set. We can use all the included decoy conformations to derive the parameters for the all-atom function and then use the parameters to select the most near-native conformations in the same set.
For a given decoy set, the Cα RMSDs relative to the experimentally determined structures usually follow a Gaussian-like distribution, which means that only a small fraction of conformations have relatively low Cα RMSD. When compiling the atom-atom contact probabilities from such a set, an appropriate weighting method is necessary to inflate the contribution of the low-RMSD conformations to the statistics. Given the strong correlation between Cα RMSDs and the density scores, the latter can be used as a parameter in the weighting scheme.
We therefore derived a statistical probability function, called self-RAPDF, from the decoy conformations using an exponential weighting scheme based on the density scores. We tested the performance of self-RAPDF on 83 publicly available decoy sets. In almost all cases, this method produced a higher correlation with Cα RMSD than the RAPDF, whose parameters were derived from a large ensemble of experimentally determined structures. It also performed better than RAPDF at selecting near-native conformations for most of the decoy sets. Unlike the density score function, self-RAPDF can also be used in a fold simulation and for structure refinement.
Performance of the density score function
The performance of the density score function on the 4state_reduced decoy sets, as evaluated by correlation coefficients between scores and Cα RMSDs of decoys relative to experimentally determined conformations, is summarized in Table 1. For comparison purposes, we also list the results generated by the self-RAPDF and other published scoring functions on the same set, including the empirical free energy function with an atomic solvation model, the atomic knowledge-based potential, and the Shell function. For all the 4state_reduced decoy sets, the density scores and self-RAPDF produce a significantly higher correlation between scores and Cα RMSDs than the other functions. The 4state_reduced sets contain decoys for seven small proteins, and were generated by exhaustively enumerating the backbone rotamer states of 10 selected residues in each protein, using an off-lattice model with four discrete dihedral angle states per residue. Compact structures were further filtered to produce these sets, and various scoring functions perform satisfactorily on them. We used the 4state_reduced sets since they allowed us to compare our function to others that have used the same sets. However, it is also important to examine the performance on other decoy sets since the performance of scoring functions may be highly set dependent.
Besides correlation coefficient (C.C.), the performance of scoring functions can be evaluated by other measures that emphasize particular features. For structure prediction applications, where near-native conformations are rarely, if ever, sampled, it is more important to know how the decoys are ranked relative to each other and whether it is possible to identify conformations that are closest to the experimentally determined conformation. Three other kinds of measurements are provided in Table 2 for the evaluation of the density score function on 83 decoy sets from seven sources: these are the log probability of selecting the best scoring conformation (log PB 1), log probability of selecting the lowest RMSD conformation among the top 10 best scoring conformations (log PB 10), and the fraction enrichment (F.E.) of the 10% lowest RMSD conformations in the top 10% best scoring conformations (see Methods). For comparison purposes, the correlation coefficients based on the original formulation by Simons et al. are also listed. In their formula, they counted the number of structural neighbors within a 7 Å threshold and used it as the score for a given conformation.
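Two of these measures can be sketched in a few lines. The sketch below computes a Pearson correlation coefficient and the overlap between the 10% lowest-RMSD decoys and the 10% best-scoring decoys. It assumes that lower scores are better and reports the raw overlap fraction; the exact normalization of F.E. is defined in Methods and may differ, and the names `pearson` and `top_fraction_overlap` and the toy data are purely illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_fraction_overlap(scores, rmsds, fraction=0.10):
    """Fraction of the lowest-RMSD decoys found among the best-scoring decoys.

    Assumption of this sketch: a lower score means a better conformation.
    """
    n = len(scores)
    k = max(1, int(n * fraction))
    best_scoring = set(sorted(range(n), key=lambda i: scores[i])[:k])
    lowest_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:k])
    return len(best_scoring & lowest_rmsd) / k


# Toy data: 20 hypothetical decoys whose scores track RMSD, except for one swap.
rmsds = [float(i) for i in range(1, 21)]
scores = list(rmsds)
scores[1], scores[5] = scores[5], scores[1]

cc = pearson(scores, rmsds)
fe = top_fraction_overlap(scores, rmsds)
```

With 20 decoys the top 10% contains two conformations; after the swap only one of the two lowest-RMSD decoys remains among the two best-scoring ones, so the overlap is one half.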
Table 2 shows that the performance of the density score function is strongly dependent on the intrinsic properties of decoy generation methods and the quality of the decoy sets, with the best performance achieved in the 4state_reduced sets and the worst in the semfold sets. Although in general the density scores have relatively high correlation with Cα RMSDs, they have negative correlations in a small number of cases, indicating a failure of the function on these decoy sets. These include one protein (1jwe) in the fisa_casp3 sets and two proteins (1pgx and 1res) in the most recent Rosetta 10-14-01 sets.
Mechanism of the density score function
The 83 decoy sets used in our study were produced using several different simulation methods, which may explain why the same scoring function performs very differently on sets generated by different methods. To further investigate how the density score function works, we plotted four pairwise RMSD matrices using the decoy conformations for the 1ctf protein (Figure 1). 1ctf represents the carboxy-terminal domain of L7/L12 50s ribosomal protein from Escherichia coli and was chosen for this analysis since it is present in four groups of decoy sets that we used (4state_reduced, lattice_ssfit, lmds and semfold). The correlation coefficients between the density scores and Cα RMSDs for 1ctf in these four decoy sets are 0.98, 0.71, 0.41 and 0.13, respectively (Table 2). Only the 1000 lowest RMSD conformations were used for lattice_ssfit and semfold sets because of their large size. For all four matrices, the upper left corner tends to be black, which means that low RMSD decoy conformations tend to be more similar to each other. The density score formula calculates the overall distance between a given conformation and all other conformations, so ideally it has a perfect negative correlation with the density around the decoy conformations. When conformations in a region tend to have lower pairwise RMSD with each other, the density of such a region will be higher, which explains the high correlation between density scores and Cα RMSDs of decoys relative to the experimentally determined structure.
Based on the above observations, we propose a theory on how the density score function works. During each step of a simulation process, a scoring function is used to judge whether a newly simulated conformation is energetically more favorable than the previous one. This step is iterated a number of times, and at the end of the simulation one or a few low scoring conformations are kept, achieving a local minimum in terms of the scoring function. We call such minima "scoring basins". The scoring basins may or may not resemble the energy basin in conformational space. Usually structure predictors repeat the simulation process many times and save all the output conformations which comprise the decoy sets. These decoys tend to accumulate around such scoring basins so that the bottom of the basin has a higher density relative to the upper part of the basin. For exhaustive methods where the conformational space is evenly sampled, conformations near each other in space are more likely to have similar structures; once the non-native conformations are filtered out, similar structures tend to cluster together and scoring basins are formed by the filtering criteria, as is the case for the 4state_reduced sets. When a scoring basin is in close proximity to the energy basin, conformations around the bottom of the basin are near-native ones. In this case, there is a strong correlation between Cα RMSDs and the density of the space around these conformations. In our formula, we use the sum of RMSDs between a given conformation and all other conformations in the decoy set to approximate the density.
For the four ensembles of decoy conformations depicted in Figure 1, the conformations represented in the upper left corner of the matrices have lower Cα RMSD, so they tend to reside near the bottom of the scoring basin. The density is higher near the bottom, so decoys in that region have lower pairwise RMSDs between each other, making the cells darker than others. Interestingly, for the lmds set, three obvious "black blocks" are seen in the corresponding matrix. By examining the Cα RMSD histogram of this protein (Figure 2), we found that there are actually three peaks, which account for the three "black blocks". This means that the pathological tendencies of simulation methods used in lmds sets may produce decoys that are far from the native conformation but tend to cluster together. In other words, three distinct scoring basins are encountered during the fold simulation process around which decoys tend to accumulate, yet only one of the basins can approximate the real energy basin. Because of that, high-density conformations in this set may not be near-native conformations if they reside in a wrong scoring basin, and the correlation coefficients between density scores and Cα RMSDs of decoys relative to native conformations cannot be very high. In the case of the 4state_reduced set, although there are two peaks in the Cα RMSD histogram, only one scoring basin is formed because conformations are sampled evenly by this simulation method. Here, the density scores have high correlation with the Cα RMSD of decoys relative to experimentally determined conformations.
Based on our theory, when near-native conformations are not sufficiently sampled, native conformations will not necessarily have the highest density. This explains why for most proteins, the density score ranking of the native conformation is not very high (Table 2, Column 10) in spite of the high correlation between Cα RMSDs and density scores. Our goal for developing these scoring functions is to select the most near-native conformations from a decoy set, when the experimental structure is unknown. The ranking of native conformation per se is not important for structure prediction since it may not be an indicator of how well a function can select near-native ones. In other words, it is relatively easy to design functions that discriminate the native conformation from a set of decoys, but hard to design functions that can discriminate near-native decoys from other decoys. The density score function (as well as self-RAPDF) is highly dependent on the search function used in the fold simulation process, and does not contain explicit information about native conformations (i.e., they are trained on decoys, not native conformations). Therefore, a complementary and good search method must be used with density scores (or self-RAPDF) at least for the initial decoy generation to minimize bias to erroneous conformations, which is the case for the methods used to generate our decoy sets. Finally, a scoring function that scores native conformation well is dependent on the particular types of native conformations that it is derived from. In our case, for 57 out of 83 decoy sets, the native conformation (or its slightly refined version; Cα RMSD < 0.2 Å) scores as the top best conformation by the original RAPDF. 
Of the remaining 26 decoy sets, the native conformations for 11 were derived by NMR spectroscopy; such structures usually do not score well with RAPDF, since the function is parameterized on structures derived from X-ray crystallography (Liu and Samudrala, manuscript in preparation).
Performance of self-RAPDF
For every decoy set, we generated a separate set of atom-atom contact probabilities using a formulation similar to the residue-specific all-atom scoring function (RAPDF). Using this function, called self-RAPDF, we scored all the decoy conformations used to compile the function, and evaluated the performance of the function with the four measures described before. Table 3 compares the performance of the RAPDF and the self-RAPDF on individual decoy sets, and Figure 3 compares the performance on the decoy sets grouped by their generation methods. The self-RAPDF has better performance than RAPDF in terms of log PB 1 (62/83 decoy sets), log PB 10 (56/83 decoy sets), F.E. (63/83 decoy sets) and C.C. (76/83 decoy sets). We noticed that the performance of self-RAPDF is highly dependent on the performance of the density score function, which specifies the weighting scheme in generating the self-RAPDF. Because of the high correlation between Cα RMSDs and the density scores, the self-RAPDF generated higher correlation with Cα RMSDs than RAPDF in all decoy groups. However, for some proteins in the fisa and semfold sets, self-RAPDF did not tend to pick lower RMSD conformations over RAPDF, as judged by the mean of log PB 1 for these groups of sets. This suggests that when performing structure selection, we can choose RAPDF or self-RAPDF based on the decoy generation methods to achieve the best results. However, since self-RAPDF almost always has better performance than RAPDF in terms of correlation with RMSD, self-RAPDF may be a better scoring function than RAPDF to use in fold simulation during the structure refinement process.
More recent decoy sets such as those generated by the semfold method or the Rosetta method provide particularly challenging tests for scoring functions, because the decoys were assembled from fragments of experimentally determined structures. These sets contain a subset of misfolded conformations with similar local interactions, but are globally distant from the native fold. As a consequence, discriminating near-native conformations from the semfold and the most recent Rosetta 10-14-01 sets is expected to be more challenging for any scoring function, and few results have been published on the performance of scoring functions using these decoy sets.
Unlike the density score function, the self-RAPDF can be used for not only structure selection, but also fold simulation. Therefore, it is especially important for self-RAPDF to have high correlation with the RMSD of decoy conformations. We compared the performance of RAPDF and self-RAPDF on the Rosetta sets in terms of log PB 1 and correlation coefficients (Figure 4). RAPDF generally performed poorly on these decoy sets, while self-RAPDF was superior at discriminating low RMSD structures for these decoy sets for 37/41 proteins, in terms of the Cα RMSD of the best scoring conformation.
Figure 5 shows the scatter plot of the self-RAPDF scores versus Cα RMSDs of decoys relative to experimentally determined conformations for all 41 proteins in the Rosetta sets. A large fraction of near-native conformations were sampled in these sets. Scores for most of the proteins have very good correlation with Cα RMSDs except 1hyp, 1mzm and 1pgx. The density score function for these 3 proteins had either negative or near-zero correlation with Cα RMSDs (Table 2), which explains the poor performance of the self-RAPDF on them.
We also observed that neither RAPDF nor self-RAPDF has satisfactory performance on the semfold decoy sets (Figure 6). From our previous experience, these sets are difficult. None of the scoring functions we used before had good correlation with Cα RMSD on these decoy sets. Some possible reasons to explain the poor performance are detailed in the Discussion section.
Current scoring functions generally try to maximize the Z score to discriminate native conformations from near-native ones, but perform poorly on the real problem that we face in structure prediction: selecting the most near-native conformations from an ensemble of decoys. Here, we introduce two decoy-dependent scoring functions, the density score function and self-RAPDF, which can be used to aid structure selection. They work better at selecting the most near-native conformations compared to previously published results.
It has been hypothesized that the behavior of the density score function represents a feature of the protein energetic surface, i.e., that the lowest energy conformation is the most populated one. A simpler explanation is that what we are observing is purely a statistical phenomenon: traditional scoring functions are not perfect, but if they are partially correct, then two conformations that are close to each other are also likely to be close to the native conformation. In effect, the conformation with the best score is the median, i.e., the one with the smallest total distance to every conformation in the entire decoy set. By taking into account the ensemble of conformations generated by the scoring function, we maximize the amount of information used. In other words, conformations that score poorly by a discriminatory function also have information content that can be used to achieve better discrimination.
We therefore argue that the resulting ensemble of conformations after a structure prediction process will not be an unbiased sampling of the real energy basin. Instead, we propose that since any scoring function used in structure simulation cannot be perfect, it will form one or more scoring basins that may or may not resemble the real energy basin. These decoys then accumulate around the scoring basins, instead of the energy basin. When a scoring basin is near the energy basin, i.e., when a lot of near-native conformations are sampled in the decoy sets, we expect good performance from density-based approaches. Otherwise, we do not expect high correlation between the density around a given decoy conformation and the spatial distance between this conformation and the bottom of the energy basin, where the native conformation resides.
The key to the success of such decoy-dependent scoring functions is that near-native conformations are adequately sampled in the conformational space, which is not true in some cases. This in part accounts for the failure of both functions on some proteins in the fisa_casp3, semfold and Rosetta decoy sets. On the other hand, the intrinsic properties of the simulation process itself may dictate whether these functions will work well or not. This explains why decoy sets generated by the same simulation methods tend to have similar performance with a particular scoring function, but the performance is divergent across those sets from different sources. The 4state_reduced sets always yield the best performance for most scoring functions, since they are generated by sampling conformational space around native conformations evenly, using knowledge of the experimental structures. In such cases, the scoring basin should largely overlap with the energy basin of the proteins. The semfold and Rosetta sets are similar in that both of them are generated by assembling small pieces (3–9 amino acid residues) of local conformations from experimentally determined structures, and thus both sets provide a challenge. The density score function and self-RAPDF perform reasonably well on the Rosetta decoy sets with a few exceptions, but perform unsatisfactorily on the semfold sets. One reason is that the semfold sets do not contain as many near-native conformations as the Rosetta sets (Table 2). It is also worth noting that RAPDF itself was a component of the scoring function used in the semfold structure simulation process. We therefore expect the scoring basin itself to be biased toward correct RAPDF atom-atom contacts. Decoy conformations in the semfold sets are thus already minimized in terms of the normal range of atom-atom contacts, and would not be easily discriminated by another atom-atom contact probability scoring function such as self-RAPDF.
It is not very surprising that self-RAPDF works better than RAPDF when near-native conformations are sampled adequately in the decoy sets. The RAPDF scoring function was compiled from an ensemble of native structures in certain structure databases, such as the Protein Data Bank (PDB), which contains very diverse conformations with bias to certain types of folds. The statistics may not work well for certain protein targets if their folds are not represented in the experimentally determined structure database. Self-RAPDF is compiled from an ensemble of decoy conformations, some of which resemble the native fold. So if a large fraction of near-native conformations are present in the decoy set, appropriate residue-specific atom-atom contacts for the particular sequence are more likely to be present in these decoys. Compiling this contact information can help in determining whether a given decoy conformation conforms to the majority of near-native conformations.
Besides RAPDF, other knowledge-based scoring functions have been developed in recent years with varying degrees of success [8–10]. These functions usually compile some statistics from databases that contain experimentally determined structures, and use such statistics to test the probability of a given conformation to be native-like. The results in this paper also have implications on the performance of other knowledge-based scoring functions.
Other structure clustering algorithms similar to our scoring functions have been applied in previous CASP experiments for structure selection. Simons et al. used the number of structural neighbors within a certain RMSD threshold as the basis of the clustering during the CASP3 experiment, and Bonneau et al. used simultaneous clustering of conformations with an iteratively reduced RMSD cutoff. The original clustering algorithm fixed the RMSD cutoff to generate clusters of different sizes, but the simultaneous clustering algorithm fixed the size of each cluster to contain ~100 conformations. They worked well for the decoy sets generated by the Rosetta method, but their performance was not reported for other decoy sets. Compared to these clustering algorithms, both the density score function and the self-RAPDF function give quantitative scores for every decoy conformation. In addition, the self-RAPDF function can be used in structure refinement and fold simulation, after an initial decoy set has been generated.
The weighting scheme in our work was chosen somewhat arbitrarily. Only a small fraction of conformations have low Cα RMSDs for any given decoy set, and these are the ones that we are most interested in. We seek to derive weights to inflate the contribution of these low-RMSD conformations to the self-RAPDF function. However, the low-RMSD conformations cannot be identified without knowledge of the experimentally determined structures. Since the density score function usually has high correlation with Cα RMSDs, we can use it as a surrogate of how similar a given decoy conformation is to the experimentally determined conformation, and derive weights based on the density scores. An exponential weighting scheme based on the density scores is shown to work quite well. Other weighting schemes parameterized on other scoring functions need to be explored.
During a fold simulation, we need a scoring method to evaluate the quality of newly simulated conformations relative to those already generated. This method should be reasonably fast, and have a relatively high correlation to the accuracy of predicted conformations. Currently we are using the RAPDF as one such component in our de novo structure prediction protocol. Based on the high correlation of self-RAPDF scores and RMSDs relative to experimentally determined conformations, it is also possible to use self-RAPDF for further refinement of predicted protein conformations. Further work is needed to test this hypothesis.
In conclusion, both the density score and the self-RAPDF functions are decoy-dependent scoring functions for improved protein structure selection. Both methods are simple to implement and fast to execute, so they can be applied to very large decoy sets. Both scoring functions compile information from the ensemble of decoy conformations, on the assumption that a large fraction of near-native conformations are sampled in the decoy set and that these decoys can provide information about the native conformation. Unlike other knowledge-based scoring functions, neither function uses any knowledge of experimentally determined structures. Besides structure selection, the self-RAPDF may also aid in fold simulation, and its effectiveness there is currently being evaluated. Based on our work, it is reasonable to assume that other knowledge-based scoring functions can also compile statistics from decoy conformations, for use in both structure selection and simulation.
Formulation of the density score function
Suppose a decoy set contains n decoys $x_1, x_2, \ldots, x_n$. For any given decoy $x_i$ ($1 \le i \le n$), the density score $S_i$ is calculated from the pairwise Cα RMSDs $r_{ij}$ between decoy $x_i$ and each decoy $x_j$ ($1 \le i, j \le n$).
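As a rough illustration of the all-against-all calculation, the sketch below assumes that the density score of a decoy is its mean pairwise Cα RMSD to every other decoy in the set. That specific form, the function name `density_scores`, and the toy RMSD matrix are assumptions made for illustration only.

```python
from typing import List, Sequence


def density_scores(rmsd: Sequence[Sequence[float]]) -> List[float]:
    """Mean pairwise RMSD of each decoy to all other decoys (assumed form)."""
    n = len(rmsd)
    if n < 2:
        raise ValueError("need at least two decoys")
    scores = []
    for i in range(n):
        # Sum r_ij over every other decoy j, skipping the self-distance r_ii.
        total = sum(rmsd[i][j] for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores


# Toy symmetric RMSD matrix (in Å) for three hypothetical decoys.
toy_rmsd = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
scores = density_scores(toy_rmsd)
```

Under this assumed form, a lower score marks a decoy that sits in a densely populated region of the set, that is, one with many similar conformations.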
Formulation of the self-RAPDF function
For a given decoy set, we first normalize the density scores to lie between -1 and 1, using a formula based on the median density score for the set, $S_{\mathrm{median}}$. Each decoy conformation $x_i$ is then assigned a weight $W_i$ according to its normalized density score, using an exponential weighting scheme with a constant k. In this paper we choose k to be 5. The contribution of each decoy conformation is multiplied by its own weight during the compilation of the self-RAPDF statistics.
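One way to realize such a scheme is sketched below. It assumes the normalized score is the difference from the median divided by the full range of density scores, which always falls between -1 and 1, and that the weight is exp(-k × normalized score). Both forms and the function names are assumptions for illustration; only the median, the [-1, 1] range, the exponential character of the weighting, and k = 5 come from the text.

```python
import math
from statistics import median
from typing import List, Sequence


def normalized_scores(scores: Sequence[float]) -> List[float]:
    """Centre on the median and divide by the range (assumed normalization)."""
    s_median = median(scores)
    spread = max(scores) - min(scores)
    if spread == 0:
        # All decoys have the same density score: treat them as equally typical.
        return [0.0] * len(scores)
    return [(s - s_median) / spread for s in scores]


def decoy_weights(scores: Sequence[float], k: float = 5.0) -> List[float]:
    """Exponential weights; lower density scores receive larger weights."""
    return [math.exp(-k * s) for s in normalized_scores(scores)]


# Hypothetical density scores for three decoys.
weights = decoy_weights([3.0, 2.5, 3.5])
```

With these assumed forms, a decoy at the median receives a weight of 1, and decoys with below-median density scores are weighted up exponentially.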
The all-atom scoring function, RAPDF, was used to calculate the probability of a conformation being native-like, given a set of inter-atomic distances. A full description can be found in the original paper. The self-RAPDF library is compiled with a modified version of RAPDF that incorporates the weighting scheme described above. Briefly, the required probabilities are compiled by counting the frequencies of distances between pairs of atom types in a decoy set. The counts for each conformation are multiplied by its weight and summed to generate an overall probability. All non-hydrogen atoms are considered, and the description of the atoms is residue specific, which results in a total of 167 atom types. We divide the observed distances into 1.0 Å bins ranging from 3.0 Å to 20.0 Å. Contacts between atom types in the 0.0–3.0 Å range are placed in a separate bin, resulting in a total of 18 distance bins.
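The binning and weighted counting can be sketched as follows. The half-open bin boundaries, the treatment of atom-type pairs as unordered, the atom-type labels such as `ALA_CA`, and the function names are all assumptions for illustration; they are not taken from the RAMP implementation.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Optional, Tuple

MAX_DISTANCE = 20.0  # distances at or beyond this value are ignored
Key = Tuple[str, str, int]


def distance_bin(d: float) -> Optional[int]:
    """Bin 0 covers 0.0-3.0 Å; bins 1 to 17 cover 1.0 Å steps from 3.0 to 20.0 Å."""
    if d < 0.0 or d >= MAX_DISTANCE:
        return None
    if d < 3.0:
        return 0
    return int(d - 3.0) + 1


def add_weighted_contacts(
    counts: DefaultDict[Key, float],
    contacts: Iterable[Tuple[str, str, float]],
    weight: float,
) -> None:
    """Add one decoy's contacts to the running counts, scaled by its weight."""
    for atom_a, atom_b, d in contacts:
        b = distance_bin(d)
        if b is None:
            continue
        first, second = sorted((atom_a, atom_b))  # unordered pair (assumption)
        counts[(first, second, b)] += weight


counts: DefaultDict[Key, float] = defaultdict(float)
# Two hypothetical decoys with weights 2.0 and 0.5.
add_weighted_contacts(counts, [("ALA_CA", "GLY_N", 4.2), ("GLY_N", "ALA_CA", 4.9)], 2.0)
add_weighted_contacts(counts, [("ALA_CA", "GLY_N", 4.5), ("ALA_CA", "GLY_N", 25.0)], 0.5)
```

In a real run, the contact list for each decoy would exclude intra-residue atom pairs before it is passed in.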
We compile tables of scores s, proportional to the negative log conditional probability $P(C \mid d_{ab})$ that we are observing a native conformation given an inter-atomic distance d, for all possible pairs of the 167 atom types, a and b, and for the 18 distance ranges:
$$s(d_{ab}) = -\ln \frac{P(d_{ab} \mid C)}{P(d_{ab})} \propto -\ln P(C \mid d_{ab})$$
where $P(d_{ab} \mid C)$ is the probability of observing a distance d between atom types a and b in a correct structure, and $P(d_{ab})$ is the probability of observing such a distance in any structure. The required ratios $P(d_{ab} \mid C)/P(d_{ab})$ are obtained from weighted counts over the decoy set: the number of observations of atom types a and b in a particular distance bin d in decoy $x_i$ is multiplied by $W_i$, the weight for decoy $x_i$ defined above, and the products are summed over all decoys. No intra-residue distances are included in the summation.
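The sketch below shows one way to turn weighted counts into scores. It assumes the normalization commonly used for distance-dependent conditional probability functions: $P(d_{ab} \mid C)$ is the weighted count for a pair in a bin divided by that pair's total over all bins, and $P(d_{ab})$ is the bin's total over all pairs divided by the grand total. That normalization, the function name, and the toy counts are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[str, str, int]  # (atom type a, atom type b, distance bin)


def rapdf_style_scores(counts: Dict[Key, float]) -> Dict[Key, float]:
    """Negative log of P(d_ab | C) / P(d_ab) from weighted counts (assumed form)."""
    pair_totals = defaultdict(float)  # per atom-type pair, summed over bins
    bin_totals = defaultdict(float)   # per bin, summed over atom-type pairs
    grand_total = 0.0
    for (a, b, d), n in counts.items():
        pair_totals[(a, b)] += n
        bin_totals[d] += n
        grand_total += n

    scores = {}
    for (a, b, d), n in counts.items():
        if n <= 0.0:
            continue  # unobserved combinations are left unscored in this sketch
        p_d_given_c = n / pair_totals[(a, b)]
        p_d = bin_totals[d] / grand_total
        scores[(a, b, d)] = -math.log(p_d_given_c / p_d)
    return scores


# Toy weighted counts for two hypothetical atom-type pairs and two bins.
toy_counts = {
    ("A", "B", 1): 3.0,
    ("A", "B", 2): 1.0,
    ("C", "D", 1): 1.0,
    ("C", "D", 2): 3.0,
}
table = rapdf_style_scores(toy_counts)
```

In this toy table the A-B pair is over-represented in bin 1 and therefore receives a negative (favorable) score there, and a positive score in bin 2.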
Source of decoy sets
The decoy sets used for the evaluation of these scoring functions were obtained from the Decoys 'R' Us database (http://dd.compbio.washington.edu) and the most recent Rosetta 10-14-01 decoy set (http://www.bakerlab.org). We used only those decoy sets that contained a reasonably large number (>100) of decoy conformations, resulting in 83 decoy sets from seven different sources (4state_reduced, fisa, fisa_casp3, lattice_ssfit, lmds, semfold and Rosetta). The 4state_reduced sets were generated by exhaustively enumerating 10 selectively chosen residues in each protein using a 4-state off-lattice model, and filtering the conformations with a variety of criteria. The fisa, fisa_casp3, semfold and Rosetta sets were generated using a fragment insertion simulated annealing procedure to assemble near-native structures from fragments of unrelated protein structures with similar local sequences [30, 34, 35]. The lattice_ssfit sets were generated by exhaustively enumerating conformations of each sequence on a tetrahedral lattice and filtering the conformations with a combination of all-atom functions. The lmds sets were generated using a scoring function based on a united and soft atom version of the "classic" ENCAD force field that ensures that local minima are chemically valid, with reasonable geometry and without clashes. More detailed descriptions of these sets are available on the corresponding websites.
Methods used to evaluate scoring functions
Four different measures were used in this study to evaluate the performance of the scoring functions, each emphasizing a different aspect:
log PB1: The log probability of selecting the best scoring conformation. If the best scoring conformation $x_i$ has a Cα RMSD rank of $R_i$ among the n decoy conformations, this probability is calculated as
$$\log P_{B1} = \log_{10}(R_i / n) \qquad (6)$$
log PB10: The log probability of selecting the lowest RMSD conformation among the 10 best scoring conformations. If $x_i$ has the lowest RMSD among the 10 best scoring conformations, with an RMSD rank of $R_i$ among all n decoy conformations, this probability is calculated using the same formula. Because the number of conformations varies widely between types of decoy sets, dividing the rank by n in both log PB1 and log PB10 ensures a fair comparison between decoy sets.
F.E.: Fraction enrichment of the top 10% lowest RMSD conformations in the top 10% best scoring conformations.
C.C.: The correlation coefficient between Cα RMSDs and the scores generated by the scoring function.
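The four measures can be sketched as below. The sketch assumes that lower scores are better, that the correlation coefficient is Pearson's, and that fraction enrichment is the share of the top 10% lowest-RMSD decoys that also appear among the top 10% best scoring decoys. Ties are broken by input order. These choices and all function names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def rmsd_ranks(rmsds: Sequence[float]) -> List[int]:
    """Return the 1-based RMSD rank of each decoy (1 = lowest RMSD)."""
    order = sorted(range(len(rmsds)), key=lambda i: rmsds[i])
    ranks = [0] * len(rmsds)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def log_pb(scores: Sequence[float], rmsds: Sequence[float], top: int = 1) -> float:
    """log10(R/n), with R the best RMSD rank among the `top` best scoring decoys."""
    n = len(scores)
    ranks = rmsd_ranks(rmsds)
    best_scoring = sorted(range(n), key=lambda i: scores[i])[:top]
    return math.log10(min(ranks[i] for i in best_scoring) / n)


def fraction_enrichment(scores: Sequence[float], rmsds: Sequence[float],
                        fraction: float = 0.1) -> float:
    """Overlap between the best scoring and lowest RMSD subsets (assumed form)."""
    n = len(scores)
    m = max(1, int(round(n * fraction)))
    by_score = set(sorted(range(n), key=lambda i: scores[i])[:m])
    by_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:m])
    return len(by_score & by_rmsd) / m


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Ten hypothetical decoys: RMSDs in Å and scores where lower is better.
rmsds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
scores = [0.3, 0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
log_pb1 = log_pb(scores, rmsds, top=1)
log_pb10 = log_pb(scores, rmsds, top=10)
```

In this toy set the best scoring decoy has the second lowest RMSD, so log PB1 is log10(2/10); because the set has only ten decoys, the top 10 covers every decoy and log PB10 reduces to log10(1/10).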
Score calculation and data analysis
The structure preparation and score calculation were performed using the RAMP program suite, available at http://software.compbio.washington.edu. Additional data analysis was done using the statistics software STATA (College Station, TX, USA).
References
Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M: CHARMM: a program for macromolecular energy minimization and dynamics calculations. J Comput Chem 1983, 4: 187–217.
Jorgensen WL, Tirado-Rives J: The OPLS potential functions for proteins, energy minimizations for crystals of cyclic peptides and crambin. J Am Chem Soc 1988, 110: 1657–1666.
Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA: A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 1995, 117: 5179–5197.
Fain B, Xia Y, Levitt M: Design of an optimal Chebyshev-expanded discrimination function for globular proteins. Protein Sci 2002, 11: 2010–2021. 10.1110/ps.0200702
Holm L, Sander C: Evaluation of protein models by atomic solvation preference. J Mol Biol 1992, 225: 93–105.
Subramaniam S, Tcheng DK, Fenton JM: A knowledge-based method for protein structure refinement and prediction. Proc Int Conf Intell Syst Mol Biol 1996, 4: 218–229.
Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 1998, 275: 895–916. 10.1006/jmbi.1997.1479
Lu H, Skolnick J: A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 2001, 44: 223–232. 10.1002/prot.1087
Berrera M, Molinari H, Fogolari F: Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinformatics 2003, 4: 8. 10.1186/1471-2105-4-8
McConkey BJ, Sobolev V, Edelman M: Discrimination of native protein structures using atom-atom contact scoring. Proc Natl Acad Sci U S A 2003, 100: 3215–3220. 10.1073/pnas.0535768100
Wang Y, Zhang H, Li W, Scott RA: Discriminating compact nonnative structures from the native structure of globular proteins. Proc Natl Acad Sci U S A 1995, 92: 709–713.
Park B, Levitt M: Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J Mol Biol 1996, 258: 367–392. 10.1006/jmbi.1996.0256
Felts AK, Gallicchio E, Wallqvist A, Levy RM: Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model. Proteins 2002, 48: 404–422. 10.1002/prot.10171
Bradley P, Chivian D, Meiler J, Misura KM, Rohl CA, Schief WR, Wedemeyer WJ, Schueler-Furman O, Murphy P, Schonbrun J, Strauss CE, Baker D: Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins 2003, 53 Suppl 6: 457–468. 10.1002/prot.10552
Skolnick J, Zhang Y, Arakaki AK, Kolinski A, Boniecki M, Szilagyi A, Kihara D: TOUCHSTONE: a unified approach to protein structure prediction. Proteins 2003, 53 Suppl 6: 469–479. 10.1002/prot.10551
Jones DT, McGuffin LJ: Assembling novel protein folds from super-secondary structural fragments. Proteins 2003, 53 Suppl 6: 480–485. 10.1002/prot.10542
Fang Q, Shortle D: Prediction of protein structure by emphasizing local side-chain/backbone interactions in ensembles of turn fragments. Proteins 2003, 53 Suppl 6: 486–490. 10.1002/prot.10541
Karplus K, Karchin R, Draper J, Casper J, Mandel-Gutfreund Y, Diekhans M, Hughey R: Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins 2003, 53 Suppl 6: 491–496. 10.1002/prot.10540
Moult J: Comparison of database potentials and molecular mechanics force fields. Curr Opin Struct Biol 1997, 7: 194–199. 10.1016/S0959-440X(97)80025-5
Kocher JP, Rooman MJ, Wodak SJ: Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J Mol Biol 1994, 235: 1598–1613. 10.1006/jmbi.1994.1109
Rooman MJ, Wodak SJ: Are database-derived potentials valid for scoring both forward and inverted protein folding? Protein Eng 1995, 8: 849–858.
Thomas PD, Dill KA: Statistical potentials extracted from protein structures: how accurate are they? J Mol Biol 1996, 257: 457–469. 10.1006/jmbi.1996.0175
Ben-Naim A: Statistical potentials extracted from protein structures: Are these meaningful potentials. J Chem Phys 1997, 107: 3698–3706. 10.1063/1.474725
Huang ES, Samudrala R, Ponder JW: Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J Mol Biol 1999, 290: 267–281. 10.1006/jmbi.1999.2861
Zhu J, Zhu Q, Shi Y, Liu H: How well can we predict native contacts in proteins based on decoy structures and their energies? Proteins 2003, 52: 598–608. 10.1002/prot.10444
Shortle D, Simons KT, Baker D: Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci U S A 1998, 95: 11158–11162. 10.1073/pnas.95.19.11158
Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999, Suppl 3: 171–176. 10.1002/(SICI)1097-0134(1999)37:3+<171::AID-PROT21>3.3.CO;2-Q
Bonneau R, Tsai J, Ruczinski I, Chivian D, Rohl C, Strauss CE, Baker D: Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 2001, Suppl 5: 119–126. 10.1002/prot.1170
Samudrala R, Levitt M: Decoys 'R' Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci 2000, 9: 1399–1401.
Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D: An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 2003, 53: 76–87. 10.1002/prot.10454
Park BH, Huang ES, Levitt M: Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J Mol Biol 1997, 266: 831–846. 10.1006/jmbi.1996.0809
Gatchell DW, Dennis S, Vajda S: Discrimination of near-native protein structures from misfolded models by empirical free energy functions. Proteins 2000, 41: 518–534. 10.1002/1097-0134(20001201)41:4<518::AID-PROT90>3.3.CO;2-Y
Huang ES, Samudrala R, Park BH: Scoring Functions for ab initio folding. Predicting Protein Structure: Methods and Protocols (Edited by: Walker J and Webster D). 2000.
Samudrala R, Levitt M: A comprehensive analysis of 40 blind protein structure predictions. BMC Struct Biol 2002, 2: 3. 10.1186/1472-6807-2-3
Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268: 209–225. 10.1006/jmbi.1997.0959
Feig M, Brooks CL 3rd: Evaluating CASP4 predictions with physical energy functions. Proteins 2002, 49: 232–245. 10.1002/prot.10217
Samudrala R, Xia Y, Levitt M, Huang ES: A combined approach for ab initio construction of low resolution protein tertiary structures from sequence. Pac Symp Biocomput 1999, 505–516.
Acknowledgements
This work was supported in part by a Searle Scholar Award to R.S., NSF grant DBI-0217241 and NIH grant GM068152-01. We thank the creators of the decoy sets used in our study for making these sets publicly available. We also thank members of the Samudrala group for helpful comments.
Authors' contributions
KW carried out the computational experiments and drafted the manuscript. BF and RS developed the idea and evaluated the results. ML and RS provided intellectual guidance and mentorship. RS coordinated the whole study.
Cite this article
Wang, K., Fain, B., Levitt, M. et al. Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct Biol 4, 8 (2004). https://doi.org/10.1186/1472-6807-4-8
- Root Mean Square Deviation
- Native Conformation
- Density Score
- Structure Selection
- Fold Simulation