Scoring predictive models using a reduced representation of proteins: model and energy definition

Background Reduced representations of proteins have been playing a keyrole in the study of protein folding. Many such models are available, with different representation detail. Although the usefulness of many such models for structural bioinformatics applications has been demonstrated in recent years, there are few intermediate resolution models endowed with an energy model capable, for instance, of detecting native or native-like structures among decoy sets. The aim of the present work is to provide a discrete empirical potential for a reduced protein model termed here PC2CA, because it employs a PseudoCovalent structure with only 2 Centers of interactions per Amino acid, suitable for protein model quality assessment. Results All protein structures in the set top500H have been converted in reduced form. The distribution of pseudobonds, pseudoangle, pseudodihedrals and distances between centers of interactions have been converted into potentials of mean force. A suitable reference distribution has been defined for non-bonded interactions which takes into account excluded volume effects and protein finite size. The correlation between adjacent main chain pseudodihedrals has been converted in an additional energetic term which is able to account for cooperative effects in secondary structure elements. Local energy surface exploration is performed in order to increase the robustness of the energy function. Conclusion The model and the energy definition proposed have been tested on all the multiple decoys' sets in the Decoys'R'us database. The energetic model is able to recognize, for almost all sets, native-like structures (RMSD less than 2.0 Å). These results and those obtained in the blind CASP7 quality assessment experiment suggest that the model compares well with scoring potentials with finer granularity and could be useful for fast exploration of conformational space. Parameters are available at the url: .


Background
Knowledge-based potential energy functions are extracted from protein structures. Most often a statistical analysis of database protein structures is performed. The potential involving a variable (e.g. a distance or an angle) is estimated from the distribution of that variable in the database, compared with that in a reference state or a null model [1][2][3][4][5][6][7][8][9][10][11]. Such potentials are often referred to as statistical effective energy functions (SEEFs).
Another class of knowledge-based potentials is based on optimization, that is the set of parameters for the potential functions are optimized, for instance, by maximizing the energy gap between the known native conformation and a set of alternative (or decoy) conformations [12][13][14][15][16][17][18][19][20][21][22]. This approach is strongly dependent on the methods used for building up decoys, and do not rely on an exact estimation of the energy gap existing between native and decoy structures.
The successful application of SEEFs to protein structure prediction tasks has been repeatedly demonstrated (see e.g. refs. [23,24]).
The statistical approach to the derivation of energy functions will be followed here. The structural representation of a protein s can be reduced to the coordinates of C α , C β , or side-chain centers which can be used to represent the location of a residue [25]. Once its amino acid sequence a is given, a function f mapping from the (s, a) space to the d-dimensional space of descriptors is needed in order to allow a proper reduced protein description. A descriptor can be, e. g. the contact map between non-bonded residues, the solvent accessible surface area, a backbone or sidechain dihedral angle, the packing density and/or any other feature of protein structure. In practice the values that a variable (e.g. a residue-residue distance or an angle) can assume are discretized. The descriptors associated with that variable describe the possible discretized values of that variable and assume value 1 if the current value is within the bin associated to the descriptor and 0 otherwise. The potential function becomes therefore a map of the d-dimensional descriptors c to a real energy value. The energy is commonly computed as a linearly weighted sum of descriptors (for the notation we refer to ref. [26]): where "·" denotes inner product of vectors and c i is the number of occurrence of the i-th type of descriptor. For statistical knowledge-based potential functions, the weight vector w for linear potential is derived by characterization of the frequency distributions of structural descriptors in a database of experimentally determined protein structures.
According to the Boltzmann principle, the distribution of protein molecules among the microscopic states at the equilibrium connects the potential function H(γ) for a microstate γ to its probability of occupancy π(γ). This probability π(γ) can be written as: where k and T are the Boltzmann constant and the absolute temperature measured in Kelvin, respectively, and Z is the partition function. Following from Eq. (2) the knowledge-based potential function H(γ) corresponding to the Boltzmann distribution π(γ) is: H(γ) = -kT ln π(γ) -kT ln Z.
In order to obtain a knowledge-based potential function, the background energetic interactions H'(γ) in the reference state must be defined. The effective potential energy is then: where π'(γ) is the probability of the descriptor γ in the reference state. Z and Z' are both constants. Following Sipple, it is usually assumed that Z ≈ Z', so that Eq. (4) becomes [3]: Due to the high dimensionality of the space of descriptors a reasonable factorization of the probability ratio is sought. Different variables are typically treated independently of each other, resulting thus in an energy function that is the sum of many independent contributions. For each descriptor c i the contribution to the energy is given by w i = -kT ln[π(c i )/π'(c i )]. Whether they are formalized in a discretized or analytical way, contibutions to the energy π γ π γ = − ′ functions derived in this way are technically potentials of mean force [35]. The effectiveness and the drawbacks of the approach have been repeatedly pointed out [32]. First, it is straightfoward that the choice of the reference state is critical for developing knowledge-based statistical potential function. The reference state problem has been clearly stated by Jernigan and Bahar [10] and discussed by Skolnick and coworkers [36,37] who derived a potential for a compact reference state with a bias for buried hydrophobic residues and compared it with previous contact potentials.
Second the choice of descriptor must catch most aspects of protein energetics, i.e. only native-like models should be assigned low energy by the potential.
Third, the assumption of independence of different potentials of mean force is clearly unrealistic. Notwithstanding all limitations, statistical effective energy functions are currently the most successful potential available for describing protein conformations (see e.g. refs. [23,24]).
Many Statistical Effective Energy Functions have been derived according to different level of representation of the protein, the features selected for defining the potential, the reference state and the actual way of derivation of the potential. Since excellent reviews on this subject are available in the literature we will not attempt an extensive coverage of all the works published so far, but rather we aim at discussing general aspects on this issue with respect to the work presented here [8,[38][39][40].
Earlier works focused on the preference of contacts between specific residues in the database of available experimental structures [1,2,[41][42][43]. These and other works pointed out that preferential interactions are one of the most relevant features that must be taken into account for describing protein energetics [38].
The demonstrated unlearnability of optimal potential on simplified models of proteins points out that most likely [17]: i) the chosen protein representation requires a higher level of detail; ii) the actual residue-residue contact definition is crucial for the accuracy of the potential [44] iii) other features, like local conformational preferences, must be taken into account for discriminating native structure among decoys; Another important result, which is worth mentioning here, obtained by Park and Levitt is that smoothing the contact energy function improves the performance of the potential in native structure discrimination among decoys [45].
The derivation of statistical potentials requires the definition of a reference state. The finite size of database proteins and the diversity of sidechains makes this task not straightforward. This problem has been repeatedly pointed out and recently novel approaches have been proposed [10,27].
Concerning other features, like local conformational preferences, it is well known that such preferences exist and actually they form the basis of the success of secondary structure prediction algorithms [46] and current single sequence algorithm are able to predict secondary structure with a 3-state accuracy of 70.3% [47].
Recently convincing evidence has been provided that, for high resolution structures, the distribution of backbone and sidechain dihedral angles may be used for assessing the quality of predictive models [48][49][50] and may be successfully used for supplementing contact potentials (see e. g. ref. [51]). Moreover it has been pointed out that dihedral angles are strongly correlated with the identity of adjacent residues [52].
The development of the potential presented here was motivated by lack of a reliable potential employing a coarse grained representation of protein structures with the following features: -only two (or one for glycine) centers of interaction per residues; -smooth interactions e.g. by binning a range interval; -off-lattice representation with a continuum range for all conformational variables; -inclusion of residue-dependent local conformational potential term favoring observed preferences; -inclusion of (at least) nearest neighbor correlation which could reproduce local conformation correlation.
This term is fundamental for obtaining the proper average length of helices, similar to nearest-neighbor coupling in one-dimensional Ising models and it is not usually considered in available potentials.
Although most of the above listed elements are found in available empirical potentials, an empirical potential that includes all these features at the same time is missing in the literature. There are a number of empirical potentials available, which are similar in spirit. For instance a similar model, but including up to three centers of interactions per sidechain, has been used successfully by Betancourt [53]. The distance-based potential developed by Zhou and coworkers achieves good performance with an even coarser grained representation [54].
Other potentials include structural features of interaction centers like orientational parameters for interactions (see e.g. the UNRES potential [55][56][57] and the review by [40]). Other potentials, used for simplified model simulations in physics, include all backbone atoms as reviewed by [38]. The statistical potential of Dehouck et al. includes all backbone torsion angles (broadly classified into seven classes) and a center of interaction for each sidechain [31].
Usage of the correlation between local amino acid conformations has been shown to improve the native structure recognition capability of scoring functions by [48].
Recently correlation between different sequence and structure descriptors (associated with potentials of mean force) has been built in a general framework by Dehouck and coworkers. The best SEEF developed according to this framework, entailing 30 energy terms, compares well or better with most popular SEEFs [31].
The reduced representation proposed here consists of two centers of interaction per residue, one for the backbone, centered on C α atom and one for the sidechain (except for glycine which entails only one center of interaction for the backbone) centered on the center of mass of the sidechain atoms.
The potential function entails energy terms for the pseudo bonds between two consecutive C α 's and between a C α and the center of mass of the relative sidechain, angular energy terms for all three pseudo-bonded centers of interaction, torsional energy terms for all four pseudo-bonded centers of interaction, a pseudo-torsional energy terms to mantain proper chirality of sidechain orientation with respect to the main chain, an energy term dependent on the torsional angles of adjacent quartets of consecutive C α 's and energy terms for all pairwise non-bonded interactions. The pairwise interactions between centers of interaction is derived here from database analysis and it employs a reference state which takes into account the finite size of proteins. The model and energy definition are detailed in the Materials and methods section.
We termed this reduced representation of proteins PC2CA, an acronym for PseudoCovalent structure with 2 Centers of interaction per Amino acid.
The performance of this potential on all multiple decoy sets in the Decoys'R'us database [58] is tested and the results obtained in the Critical Assessment of Structural Predictions (CASP7) model Quality Assessment (QA) category are summarized.

Analysis of the different energy terms
The average and standard deviation values of the different energy terms have been evaluated on the top500H database structures. For what concerns the covalent energy terms (terms i to vi in Eq. (6)), only few structures, notably with short sequences, had significant deviations in the energy terms connected with the positioning of the sidechain center of mass (terms ii to iv in equation 6).
Particular attention has been paid to the torsional term dependent on the local chain conformation and the term describing the correlation between adjacent local conformations. In order to make sure that these two terms could describe reasonably well the known preferences for secondary structure elements, a Monte Carlo simulation was run on the sequences of all the structures in the top500H datasets. The energy was just the sum of the torsional (term vii in equation 7) and correlation energy (term viii in equation 8) and the temperature factor kT was set to 1.0 in order to match the derivation of the potential.
The range 30 to 70 degrees was assigned to helical conformation, the range 170 to 240 was assigned to extended conformation and all other conformations were assigned to coil conformations. Every residue was assigned to the most populated conformational range (actually averaged on the preceding and the following bond, because all torsional angles refer to bonds involving two adjacent residues). The experimental secondary structure was obtained using the program DSSP [59] and converting the result into three states (extended, helix, coil): the 'E' state was left as the extended conformation; the 'H', 'G' and 'I' states were converted into helix, and all other states into coil. No post-processing of the results was performed.
Although this test is run on the same structures used for deriving the potential, making it invalid for any quantitative assessment, the three-state accuracy of this simple prediction procedure is 0.57 which is more or less what expected for a single sequence method using only nearest neighbor information.
It is interesting to assess the relative contribution of the correlation term to this accuracy. The same test has been repeated setting the correlation term to zero. The accuracy dropped to 0.51. This was a confirmation that the correlation term is indeed important for properly reproducing local conformational propensities.

Test on the multiple decoy datasets
The decoy sets available in the Decoys'R'us database under the category 'multiple' are sets where many alternative conformations are given for a single native structure. The models are obtained with widely different methods and offer therefore a significant challenge for free energy estimators. The sets have different features which make different measures of performance appropriate. The ten sets which are currently available are shortly described hereafter.
The set 4state_reduced contains alternative models for 7 different proteins. For each protein native-like conformations are present in the set and therefore some correlation between rmsd and energy should witness the accuracy of the energy function [45].
The two sets fisa and fisa_casp3 have been assembled by the group of Baker using fragments via a simulated annealing protocol [60]. For the protein 130 the structure with pdb code 1ck2 has been used as the native structure.
The sets ig_structal, hg_structal and ig_structal_hires are sets containing few models for many immunoglobulins (ig) or globins (hg) built by homology modeling. Most of the models have very low RMSD from native.
The set lattice_ssfit has been generated selecting and refining with an all-atom energy function coarse lattice models [61]. The RMSD from native in the set is larger than 4 Å for all the eight proteins modeled.
The lmds set was built including information on secondary structure and models have been refined using a soft core all atom model [62]. The set includes models with RMSD from native lower than 5.0 Å for 10 proteins.
The semfold test has been produced apparently by a fragment insertion method [63]. This includes a very large number (average of 12900) of decoys for each of the 6 proteins. Some models have RMSDs from native in the range 3 to 5 Å.
The vhp_mcmd decoy set has been obtained by taking snapshots of long molecular dynamics simulations starting from the native structure and from four coarse grained models obtained by Monte Carlo simulations. All conformations have been energy minimized using the molecular mechanics/generalized Born model [51].
The results obtained on the decoy sets are summarized in tables 1 to 10. Since many of the target structures have homologues in the top500H database, those which do not have a significant similarity with the top500H set are indicated by boldface characters (we used as a criterion an E-value greater than 0.01 for the best alignment with BLAST). No significant difference is apparent based on the presence or absence of homologues in the top500H dataset.
An energy versus RMSD plot for the 4state decoy sets is reported in Figure 1.
There are a number of plausible reasons for failure in native structure recognition, even for the best quality scoring functions, as discussed by Shen and Sali [64]. In the present case, in the few cases where the native structure is not recognized the covalent energy of the pseudocovalent structure is large showing that most likely the experimental model is not optimally refined. For instance in the native structure of protein with PDB code 4rxn there the distance between the C α of Glu 16 and the C α of Asp 17 is 3.19 Å, much shorter than the average distance causing the highest energy in the set for the corresponding energy term. Similarly there are distorted geometries in the proteins with PDB code 1ctf, 1bba and 1dtk.
Another likely reason for failure of native structure recognition is the presence in the crystal structure of other chains. This is the case for the short fragment of protein A (chain C of PDB structure with code 1fc2) which is bound in the crystal to an immunoglobulin domain.
In general NMR structures are more difficult to be recognized. Refinement of NMR structures is strongly dependent on the forcefield and protocol used, and this may result in minor structural features which are not typical of protein structures and give rise, in turn, to large energies in the statistical effective energy function.
The most challenging decoy set appears to be the semfold set which includes six targets and more than 10000 decoys for each target. For five of the decoys native or low RMSD decoys could be recognized, but the Z-score is rather low. For the structure with PDB code 1khm the lowest energy structure has a high RMSD from native, although there are decoys with RMSD from native as low as 3.0 or 4.0 Å. It is remarkable that the term for CA CM interactions attains the third lowest energy for the native structure but the overall energy is only the tenth lowest energy beyond structures with RMSD from native larger than 10 Å. The number of native-like structures (say with RMSD from native less than 4.0 Å) in the set is however limited, so it is difficult to assess whether the failure can be ascribed to the quality of the native structure, of the decoys or of the energy function itself. For this reason we considered other structures deposited in the PDB with the same sequence. For the structure with PDB code 1zzj the energy (computed on the same fragment modeled in the decoy set) is lower than the energy of all decoys. This result witnesses the quality of the energy function although the low energy assigned to very different conformations poses an issue on the robustness of the methodology. It is worth mentioning that the protein is associated with single stranded RNA in the crystallographic structure, and this feature is not modelled by the statistical effective potential.
The energy function performs also well on decoys which are mostly native-like as in the decoy sets hg_structal, ig_structal, ig_structal_hires where the native conformation is recognized 13 times on 110 cases.
Moreover very low RMSD decoys are mostly selected as the lowest energy conformations. For this sets it is interesting that the correlation coefficient between energy and RMSD is on average high (0.44) and for the hg_structal set it is on average equal to 0.71.
The correlation between energy and RMSD is typically found only at low RMSDs, in other words the energy for grossly misfolded structures is not correlated with the RMSD from the native structure, but should be correlated with the RMSD from the local minimum energy conformation. For sets where the whole range of RMSDs is represented the correlation coefficient should be positive and significantly different from 0.0 and similarly the fraction enrichment should be significantly larger than 10 percent. Indeed this is the case for all 4state_reduced decoy sets and for the vhp_mcmd sets, and on average for the hg_structal, ig_structal, ig_structal_hires.
The overall ability of recognizing the native structure among decoys including native-like structures is outstanding for a model entailing only two centers of interaction per amino acid, and compares well or is superior to more complex models as judged by the results obtained with many model quality assessment programs and reported by Tosatto (see Table 3 in [50]). We report in table 11 a summary of the data reported in the cited study by Tosatto including the best performing potentials together with our results, for the sake of comparison. The best Model Quality Assessment Programs (MQAPs) considered here are ProQ [65], Prosa II [66], Verify3d [67,68], AKBP [5], DFIRE [27], RAPDF [4] and FRST [50].

Test on conformations generated by Rosetta
A test was performed by generating 100 conformations for the small thermostable domain of the chicken villin head-  piece using the software Rosetta which is one of the best tools for ab-initio protein structure prediction. 100 structures have been generated and refined using the same software. The average RMSD of the conformations with respect to native is 3.6 Å with a standard deviation of 1.1 Å. The best model selected by the energy model proposed here has 3.0 Å RMSD from native. Perhaps more significant is the correlation between the rank according to the energy and the rank according to the RMSD which is 0.48.

PC2CA in CASP7
At the time this paper is being written the CASP7 experiment has just closed (for a description of the CASP experiment see ref. [69]). For 99 out of the 100 targets experimental structures have been released. The quality assessement category has been recently introduced in the CASP experiment in order to evaluate by scoring functions the quality of the predictive models obtained by servers.
The discussion reported hereafter is connected with the methods adopted for assessment in this community-wide experiment (see ref. [70] and forthcoming articles in Proteins: Structure, Function, Bioinformatics).
Unfortunately models submitted by servers and evaluated by quality assessement programs differ in the number of residues modeled and in the level of detail, ranging from only C α 's to all atoms for each amino acid. For this reason a choice must be adopted for ranking the models which takes into account these aspects.
Evaluation of predictions may be conducted using different legitimate criteria. We discuss in the following the ability of PC2CA to select native-like structures among decoys.
In order to assess the performance of PC2CA we evaluate the quality of best ranking models using the widely accepted Global Distance Test Total Score (GDT_TS) criterion [71]. In particular the "loss in GDT_TS" of the best ranked model (i.e. lowest energy model) compared to the best available model (according to GDT_TS) (see A. Tramontano's presentation at CASP7 available at [70]) gives a good idea of the performance of an energy or scoring model.
We wish to remark that no selection has been applied nor on targets nor on models: all models were scored for all  Another important issue is that residues with incomplete backbone or sidechain have been simply ignored in our quality assessment predictions and therefore a large number of models received very low score. In the deposited quality assessments we ranked models according to the energy per residue with a cutoff on the percentage of modeled residues for half of the targets and according to global energy for the remaining half. The energies for models with different level of completeness are not directly comparable and the chosen criterion had only the purpose to single out best and most complete models.
Here we will discuss predictions based on the global energy, but the results described here are however largely overlapping with those deposited.
The presence of heterogeneous (regards to completeness in length and heavy atoms) predictions made the correlation between ranking and GDT_TS insignificant and, in general, it impaired safe comparison of models.
The global energy appeared a reasonable criterion for scoring best models but it was not designed in order to maximize correlation of score rank with GDT_TS rank.
When the GDT_TS of the best models obtained from servers is compared with the GDT_TS of the best scoring model according to PC2CA the results are outstanding when one considers that the energy model employs only two centers of interaction per residue. The average loss in GDT_TS is 10.3 for all targets (10.0 and 11.9 for template modeling and template free modeling targets according to  the assessors' classification, respectively). When the best scoring model is compared to the first model submitted by most successful servers like Zhang server and ROBETTA server the difference in GDT_TS is as low as 5.4 and 1.5, respectively, and in general the average loss in GDT_TS compared to the best MQAP predictor is 4.7. Comparison with other MQAP predictors in CASP7 is not straightforward because only 7 groups (including ours) deposited predictions for all targets. When we compare the average GDT_TS of the best PC2CA scoring models with that of other predictors (with the average performed on the same predictions deposited by each group) PC2CA ranks 15th out of 26 for template free modeling and 18th out of 26 for all models. The average loss in GDT_TS computed is 10.1, smaller than the average of 10.7 of all quality assessment methods.
The best PC2CA scoring models have consistently higher GDT_TS than the average GDT_TS computed on all models for each target (Figure 2).

Discussion
The reduced representation of proteins presented in this work has two features which makes it attractive for application in biophysical areas: it is simple and it is capable of discerning native-like models among non native-like models produced using a wide variety of methods. A blind test performed in the CASP7 quality assessment category confirms this conclusion and, in spite of its granularity, our scoring potential ranks in the average of methods using finer granularity and applying consensus procedures.
The good scoring properties of the model however do not guarantee that the same model can be used for folding small proteins e. g. by Monte Carlo simulated annealing. Indeed, the potential could have lower minima for conformations that are not explored by the algorithms used for generating decoys or predictive models in CASP7. The range of values for terms involving bonds, angles and pseudodihedrals in the systems tested is limited. It is likely that it would be possible to slightly increase the energy of these terms, reaching non-physical conformations, and simultaneously decrease other, e. g. nonbonded energy terms.
An obvious correction to the potential for model generation applications is the replacement of flat high energy regions at large distances with increasing potential, in order to prevent bonds to break. A less obvious issue is that the weights of the different terms, in particular those which have less variability in the decoy test sets, could have to be changed when using the potential to generate models. It is also likely that the correlation found between torsional and bending energy terms in native structures will need proper treatment. Such correlations are somehow taken into account by the attractive potential between CA's.  Another important point is that the pseudo-minimization used in this work is a very rough approximation of a real energy minimization procedure.
The ultimate test for a scoring potential function is however recognition of native-like structures among decoys generated with the task of minimizing the same potential function.
All these issues above are being considered for using the potential in model generation applications under development in our laboratories.

Conclusion
The results obtained on decoy sets and those obtained in the blind CASP7 quality assessment experiment suggest that the energetic model proposed here is suited for scoring predictive protein models. In spite of its simplicity the potential compares well with scoring potentials with much finer granularity. Tests are underway in order to assess its application for fast exploration of conformational space.

Selection of a protein dataset
The dataset of proteins used for the extraction of the statistical potential is Top500H [49]. This is a non-redundant set of 500 hand-cured proteins resolved by X-ray cristallography to 1.8 Å or better resolution. There are few Van der Waals clashes and few significant deviations from ideal bond lenghts and angles and a maximum of 60% sequence identity is allowed for each structure pair.

Derivation of a statistical potential
All protein structures in the Top500H dataset have been read and converted in reduced form including only C α atoms (CA), representing the backbone atoms, and the center of mass of each sidechain representing the sidechain itself (CM). Adjacency between amino acids was checked by the requirement that, after sorting residue numbers and residue insertions in the PDB file the distance between consecutive C α 's was in the range 2.7 to 4.3 Å. Based on this reduced model a statistical effective energy function has been derived from the analysis of the Top500H dataset. Incomplete backbone groups or sidechains have been discarded from analysis.
For this reduced model a pseudocovalent structure may be defined where each CA is bonded to its sidechain center of mass (CM) and to the adjacent CAs [25] (Figure 3). Consequently, following the standard terminology for molecular mechanics forcefields, the energy model entails bonded and non-bonded terms. CM is absent for glycine residues.
The energy linked to the quality of the pseudocovalent structure E cov , rather than to the quality of the conformation itself, is described by the sum of six terms (each implying summation on index i): where the subscripts b, a and p refer to bond, angle and pseudodihedral energies, respectively.
The domain of each bond, angle and pseudodihedral variable considered has been divided into bins and the energy corresponding to the k th bin has been obtained as where N k is the number of counts in the k th bin and N tot is the number of total counts. Intervals with N k = 0 were assigned arbitrarily a value E(k) = -2log( ), which is equivalent to the usage of pseudocounts ( ). The width of the bins and particular cases will be discussed in the following.
The first term (i) is the energy associated with the bond between two consecutive CAs. The distribution of the CA-CA distance depends essentially on the presence or absence of a proline in the second position and therefore only two distributions (with proline or any other residue at the second position, respectively) have been considered. The bins are 0.025 Å wide. The second term (ii) is the ) ( ( ( ), ( ))) ) ) ( ( (), ( ), Energy versus RMSD plots for the 4state decoy sets Figure 1 Energy versus RMSD plots for the 4state decoy sets. energy associated with the bonds between CA and the sidechain center of mass CM. This energy term is specific for all amino acid types (other than glycine). The bins considered are 0.1 Å wide. The terms (iii) and (iv) represent the energy associated with the angles between the CA-CM vector and the preceding and following CA-CA vectors, respectively. These terms are specific for the amino acid type of the residue involving the CA-CM vector. The bins are 5 degrees wide. The factor is used because the terms (iii) to (vi) are not independent and clearly a single angular term involving CM would suffice to restrain the position of CM with respect to the trace of the protein.
The factor in terms (iii) and (iv) does not apply for terms involving the ends of the chain.
The term (v) represents the energy associated with the angle among three consecutive CAs. The last term (vi) is an energy term associated with the pseudodihedral defined by three consecutive CAs and the sidechain center of mass of the second residue. This term is needed in order to mantain the proper orientation of the sidechain center of mass with respect to the main chain plane.
Both terms (v) and (vi) are specific for the central amino acid type. The bins considered are 5 degrees wide.
The sequence local conformational propensity is taken into account with an energy which depends on the dihedral angle defined by four consecutive CAs where the subscript t refers to torsional energies. The bins considered here are 10 degree wide.
This term is specific for the two central amino acid types, i.e. there are 400 different potentials of mean force defined.
In order to take into account properly the occurrence of secondary structure elements and other frequent local conformation motifs, like turns, an additional term describes the correlation between adjacent local conformations: GDT_TS of the best PC2CA scoring models (black trian-gles)and average GDT_TS of predictive models (red circles) versus GDT_TS of the best predictive model for each target in CASP7 MQAP Figure 2 GDT_TS of the best PC2CA scoring models (black triangles)and average GDT_TS of predictive models (red circles) versus GDT_TS of the best predictive model for each target in CASP7 MQAP.
, the correlation energy for the two-dimensional bin (m, n) is defined as: where m and n refer to angles φ i and φ i + 1 , respectively and the probability p(m, n) is computed over all pairs φ i , φ i + 1 .
The energetics of non-bonded interactions among the centers of interaction in the reduced model is described by the sum of three terms: The energy of the k th bin of the distribution is computed as: where N exp (i) and N obs are the number of counts of centers of interaction expected and actually found in the k th bin of the distribution, respectively. The expected number of counts in a given bin (N exp ) requires a proper treatment, because proteins are finite systems and different types of center of interactions have different dimensions. The finite size of proteins will result in a decay of the counts at long distances. In the absence of specific interactions and neglecting excluded volume or correlation effects, the expected number of counts within any given distance range r -to r + depends on the density of the relevant centers of interaction in the dataset proteins and is proportional to the spherical shell volume around the reference center: The finite size of proteins makes larger distances less and less probable. We found that a simple exponential damping factor describes this effect fairly well. For this reason the expected number of counts in a bin at a given distance r may be expressed as: N exp (r) = a * r 2 * exp(-(r/b) c ) * Δr where a, b and c are fitting parameters. These parameters take account of both the density of the interaction centers and finite size of proteins. Assuming that the interactions among amino acids are short ranged, as it is usally assumed, the fitting parameters a, b and c may be estimated from the observed distributions at large distances (say at more than 16 Å). This was done here and the effec- Pseudocovalent structure for a five-residue fragment of a protein (PDB id 1CTF, fragment 54-58) Figure 3 Pseudocovalent structure for a five-residue fragment of a protein (PDB id 1CTF, fragment 54-58).
tiveness of this approach is shown in Figure 4 where the original counts' distribution and the computed radial distribution function are shown. This choice for the reference state corresponds to a finite size randomly collapsed protein. It has the advantage, over other possible definition of the reference state, of taking into account in a straightforward way the finite size of proteins and the different size of amino acids. Note that the definition of average amino acid interactions would be problematic in the presence of very different radial distributions.

Computation of conformational energies
In order to properly evaluate conformational energies, some energy minimization should be performed, because the model at hand might have been not refined and minor deviations from standard geometry could result in large unfavorable energy terms. We assumed that any given conformation could relax toward lower energies, this is even more true when simplified models are taken into account. A dislocation of 10 degrees was therefore allowed as far as angles and pseudoangles are concerned, as well as a variation of 0.05 Å for the distance d(CA(i), CA(i + 1)) and of 0.2 Å for the distance d(CA(i), CM(i)).
We remark that no real movement of the structure is allowed, but rather we take the minimum energy value in the allowed range for every contribution to the energy. This virtual relaxation of the structure is able to reduce the dependence of the energy on the simplification procedure and on building details.
Another problem concerns the exact weigthing of the different contributions. Energy terms connected with the covalent structure of the reduced model have been treated ad hoc, in order to avoid that a structure built with all angles and lengths at their energy minimum values could get a very low energy. All these contributions (terms i to vi in equation 6) have been reassigned as follows: where is the average energy contribution per residue and σ E is the standard deviation in the top500H dataset.
Since there are eleven different terms contributing the energy we decided to group together the covalent terms, but considered separately the dihedral term, the correlation term and the three non-bonded terms, and apply different weights to this terms. Setting the weights of the covalent term to one we tested combinatorially weights 0.5, 1, 2, 4, 8 on all other terms. The set of all multiple decoy sets in the Decoys'R'us database were tested and the performance of the weighting scheme was judged by average RMSD from native of the lowest energy model and by the average Z-score of the native structure. The final chosen weights were of 1 for the covalent, the dihedral and the correlation terms, and 8, 4 and 1 for CM-CM, CA-CA and CM-CA non-bonded interactions, respectively. The decoy sets more sensitive to the choice of weights was the semfold decoy set containing the largest number of decoys.