- Research article
- Open Access
A knowledge-based structure-discriminating function that requires only main-chain atom coordinates
BMC Structural Biologyvolume 8, Article number: 46 (2008)
The use of knowledge-based potential function is a powerful method for protein structure evaluation. A variety of formulations that evaluate single or multiple structural features of proteins have been developed and studied. The performance of functions is often evaluated by discrimination ability using decoy structures of target proteins. A function that can evaluate coarse-grained structures is advantageous from many aspects, such as relatively easy generation and manipulation of model structures; however, the reduction of structural representation is often accompanied by degradation of the structure discrimination performance.
We developed a knowledge-based pseudo-energy calculating function for protein structure discrimination. The function (Discriminating Function using Main-chain Atom Coordinates, DFMAC) consists of six pseudo-energy calculation components that deal with different structural features. Only the main-chain atom coordinates of N, C α , and C atoms for the respective amino acid residues are required as input data for structure evaluation. The 231 target structures in 12 different types of decoy sets were separated into 154 and 77 targets, and function training and the subsequent performance test were performed using the respective target sets. Fifty-nine (76.6%) native and 68 (88.3%) near-native (< 2.0 Å C α RMSD) targets in the test set were successfully identified. The average C α RMSD of the test set resulted in 1.174 with the tuned parameters. The major part of the discrimination performance was supported by the orientation-dependent component.
Despite the reduced representation of input structures, DFMAC showed considerable structure discrimination ability. The function can be applied to the identification of near-native structures in structure prediction experiments.
Protein structure evaluation is a key process in protein structure prediction, in association with comparative modeling, fold recognition, structure refinement, and de novo folding. Protein design technology also requires structure evaluation methods with sufficient capacity. Many different types of potential energy functions have been developed and examined. The formulation of the functions can be roughly grouped under physical-based and knowledge-based approaches [1, 2, 4]. Physical-based (or molecular mechanics) potential energy functions are mainly used for the simulation of protein folding and dynamics , and are also effective for protein design . The knowledge-based approach to developing such an evaluation system is also effective and widely used, especially for protein structure prediction and protein design studies . The classical approach is the extraction of "pseudo" mean potentials from the distribution of pairwise distances of known protein structures based on the Boltzmann law . A number of potential constructions and their successful applications have been reported [5–16]. Recently, improved accuracy has been facilitated, accompanied with the accumulation of high-resolution protein structure information .
The assessment of pairwise distances is implemented in many knowledge-based functions. Several variations of atom types are utilized, such as C α and/or C β atoms , the center of mass of the side chain , and heavy atom representation for a variety of atom types . The functions of other structural features, including hydrogen bonds , main-chain dihedral angles , and solvation potentials , were also reported. A number of functions have been formulated as a combination of the above functional components. The introduction of orientation-dependent components often improves the accuracy of the function. The hydrogen bond is a typical example, and the effectiveness of orientation-dependent potential was reported . Buchete et al. introduced another type of orientation-dependent potential, using the pairwise interaction of local reference states for respective amino acids [9, 10].
The structure discrimination capacity of the function is frequently estimated on the basis of the ability to correctly identify native or near-native structures from nonnative but plausible "decoy" structures. The "Decoys 'R' Us" database  is a collection of decoy sets, and is commonly used to evaluate functions. The database consists of 10 decoy sets, generated by different methods. Many other decoy sets, such as the "moulder"  or the "rosetta" , are also utilized to assess functions. It is commonly understood that the performance of structure evaluation functions tends to depend strongly on the intrinsic properties of decoy generation methods and/or other qualities of decoy sets . Thus, many reports have assessed functions using multiple decoy sets and/or effective statistical techniques.
The compatibility of the structure-discriminating function for reduced structural representations provides many beneficial effects. For example, the generation and manipulation of model structures can be performed without more complexed structure construction; however, it is difficult to reduce the required structural information without losing the accuracy of the scoring function.
In this article, we report the development of a knowledge-based protein structure-discrimination function. The complexity of the required input structure data for evaluation was limited to the main-chain trace with only three atom coordinates (N, C α , and C) per respective amino acid residue. To overcome the possible loss of accuracy of decoy discrimination, orientation-dependent potential between two C α -pseudo-C β vectors was introduced. The parameter training and the subsequent performance test were carried out using the decoy sets from the Decoys 'R' Us database, in addition to the moulder and the rosetta decoy sets. High accuracy in native or near native structure recognition was observed in the test set. The level of discrimination ability was nearly comparable to other coarse-grained or all-atom-type functions. A detailed description of the development of the function and evaluation of the discrimination ability are provided.
Before explaining the results of function development and structure evaluation, the overall design of the function is briefly described. The details of the function formulation can be found in Methods. The structure-discriminating function developed in this study consists of six pseudo-energy calculation components. Each of the components evaluates the distinctive structural feature of a target protein. The pseudo-energy is calculated based on the Boltzmann law , with knowledge-based procedures using a precompiled database from a non-redundant set of known structures. The six structural features focused on are as follows: the C α pairwise distance (the corresponding functional component is referred to as DIST), the relative orientation between two vectors of C α -pseudo-C β (DABG component, Figure 1A), hydrogen bonding between a main-chain amino hydrogen and a carbonyl oxygen (HBND component, Figure 1B), the main-chain dihedral angles of the combination between ψ at a residue and φ at the next residue (PPDA component), the main-chain ω dihedral angle (OMDA component), and the number of surrounding C α atoms around a C α atom (SURR component). Each atom coordinate is treated separately by twenty amino acid types. The overall function is formulated with the weighted linear combination of the above six pseudo-energy components. As the function was designed to require three main chain atom coordinates (amino nitrogen, C α , and carbonyl carbon) per residue as input data, we refer to the final form of the function as DFMAC (Discriminating Function using Main-chain Atom Coordinates).
Function training with decoy sets
The parameters associated with each component and the weights of respective components are not inherently clarified. In order to search for and determine the parameter values, we used decoy sets.
The parameter values and weights were determined on the basis of the discrimination ability of the native structure from its decoys. The outline of the tuning procedure is as follows: (1) The probable set of values of the parameters was scanned and determined using arbitrarily collected decoy sets. (2) The parameters were further tuned using the training decoy set by the cross-validation procedure. (3) The weights of the respective function components were finally determined using the entire training set. The performance of the tuned function was evaluated using the test decoy set, which was distinctive from the training set. Details of the procedure are in Methods.
To determine the initial values of the parameters for the following tuning, we used 7 decoy sets of 4state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, and lmds from the Decoys 'R' Us database http://dd.compbio.washington.edu/. (Note: Although some targets used in the final performance test were included in these decoy sets, some of the parameter values of the respective components and weights were changed after the subsequent parameter tuning procedure (Table 1). Thus, bias in the final performance test is considered to be limited.) Parameters which decreased the average square values of C α RMSD of the best pseudo-energy structures for respective protein targets in the 7 decoy sets were successively selected.
Using the probable parameters determined above as the initial parameter set, further tuning was then carried out. The 231 targets of the 10 decoy sets from the Decoys 'R' Us database, the moulder decoy set ftp://salilab.org/decoys/comp_models.tar.gz[15, 20], and the all-atom decoy set from Rosetta@home http://depts.washington.edu/bakerpg/decoys/rosetta_decoys_62proteins.tgz, were separated into 77 and 154 targets and used for testing and training (or parameter tuning) the function, respectively. (Note: Each set consisted of targets from a variety of decoy sets.) Tuning was performed by 10-fold cross validation using the training set. Briefly, nine of ten parts of targets were used for parameter training, and then the function was validated with the remaining part of the targets. Performance with a distinct parameter set was evaluated by the C α RMSD average of the top structures from 10 evaluations with different training and validation combinations. After successive tuning of the parameters, the new parameter set was obtained (Table 1, see Methods). Finally, the weights of the respective function components were determined on a whole training decoy set with the new parameter set. Some of the parameters and weights were changed from the initial values during the above procedure.
A summary of the performance of the tuned function on the training set is shown in Table 2. Of 154 training targets, 115 (74.7%) native and 135 (87.7%) near-native (i.e. < 2.0 Å C α RMSD) structures were correctly identified as the best energy structures. The averages of the Z-score, correlation coefficient (C.C), and fraction enrichment (F.E.) were sufficiently positive. The performance on decoy structures, without native structures, is critical, because no native or near-native structure is available prior to structure prediction experiments. Thus, analyses were also carried out on decoys without native structures. Discrimination performance of decoy structures were also positive as indicated by the average values of logPB1, logPB10, and the correlation coefficient (C.C.decoy) and fraction enrichment (F.E.decoy) among decoy structures.
Decoy discrimination performance test
The performance of the tuned function cannot be evaluated on the training set itself, because versatility is not necessarily assured because of the possibility of over-learning; therefore, the structure discriminating ability of the tuned DFMAC was tested on the above test set, containing different targets from the training set. The results are summarized in Table 2, and the details are shown in Table 3. A large number of native structures of the respective protein targets were correctly identified as the best-energy (i.e. the lowest energy value) structures (Table 2). Correct identification of the native structures was 59 out of 77 targets (76.6% success), and the identification of near-native structures (C α RMSD < 2 Å) was 68 (88.3% success). The possible interpretations of failed identification of the remaining 9 targets are discussed below. The significantly positive average values of Z-score, C.C., and F.E. indicate considerable overall performance. The averages of the respective decoy discrimination scores (logPB1, logPB10, C.C.decoy, and F.E.decoy) were also significantly positive. Although the average C α RMSD of top-energy structures was 1.174, which was a little larger than the average on the training set, the percentage of correctly identified native or near-native structures was similar to the training set. Additionally, other indexes were also similar between the training and test sets. Thus, a certain degree of versatility was confirmed.
As for the effectiveness for the individual decoy sets (Table 3), nonuniformity was observed, as mentioned in the Background. The best average Z-score was obtained for lattice_ssfit (10.499), and the worst for hg_structal (1.762). The average Z-score values were positive in all decoy sets. The best average C.C. and average F.E. were for moulder (0.824 of C.C. and 62.9% of F.E.), and the worst were for lattice_ssfit (0.045 of C.C. and 12.8% of F.E.). In Figure 2, three examples of energy distribution against C α RMSD are shown. The average C.C.decoy of 4state_reduced (0.767), hg_structal (0.800) and moulder (0.821) were relatively high. The worst average C.C.decoy (0.000) was obtained for lattice_ssfit. The average F.E.decoy of 4state_reduced (61.5%) and moulder (61.9%) were significant, and the worst was for ig_structal_hires (0.0%).
Comparison with other statistical potentials
We compared the performance of the DFMAC with 6 different state-of-the-art statistical potentials of DOPE , RAPDF , DFIRE , and PC2CA . To exclude possible training biases, the target structures for comparison were restricted to the entries listed in our test set. Comparison of the rank of the native structures is shown in Table 4. DOPE, RAPDF, and DFIRE-A use residue-specific heavy atom representations. DFIRE-B uses the main-chain and C β atoms. PC2CA uses C α atoms and the side-chain center of mass. DFMAC uses main-chain atoms (N, C α , and C) per residue, while evaluation was carried out with generated pseudo atoms of C β , main-chain amino hydrogen (H) and carbonyl oxygen (O). In regard to the total number of correct identifications out of 11 total proteins, DOPE and DFIRE-A identified 10 native structures, followed by PC2CA (9 correct), DFMAC (8), RAPDF (7), and DFIRE-B (6). From this viewpoint, the performance of DFMAC was moderate. DOPE was also significant according to the averaged rank (4.0), followed by DFMAC (8.1), DFIRE-B (26.9) and DFIRE-A (40.0). RAPDF and PC2CA were similar (~60). Although the number and the types of targets applied here were limited and biased to a certain degree, DFMAC provided at least one better performance indexes against many other functions.
Among the functions dealing with coarse-grained structure representation (DFIRE-B, PC2CA, and DFMAC), PC2CA had the largest number of correct identifications. We thus carried out additional comparison of DFMAC with PC2CA to identify the detailed relative performance of DFMAC. The performance of DFMAC on the targets of the test set, except for moulder and rosetta decoy sets, was compared with PC2CA results reported in the literature  (Table 5). Forty (78.4%) native structures were correctly identified by DFMAC from test targets, while PC2CA identified fewer native structures of 16 (31.4%). PC2CA and DFMAC had distinctive performances for the respective decoy sets. For example, PC2CA showed better performances with all of the averaged indexes (correct, C α RMSD, Z-score, C.C., and F.E.) for the lmds decoy set, while DFMAC was better for 4state_reduced, ig_structal and ig_structal_hires. All of the summarized indexes were better with DFMAC. Although the number and kinds of decoy sets used here were limited in number and compilation of a variety of characteristics, the performance of DFMAC could be at least roughly similar to one of the state-of-the-art functions, PC2CA.
Contributions of the function components to performance
DFMAC consists of six pseudo-energy calculating components. We evaluated the contribution of each component to the structure discrimination ability. The original DFMAC function was compared to functions without any of the components on the test set (Table 6). A significant increase in average C α RMSD without the DABG component, followed by the SURR component, was observed, indicating the major contributions of the two components. The deficiency of discrimination ability without these two components was similarly observed for most of the other indexes, supporting the significance of these components. The influence of any one of 4 other components was smaller, and most indexes remained similar to the original function; however, the averaged rank of the native structures increased without OMDA, indicating a certain degree of contribution. When cross validation with tuned parameters was carried out on the training set without any one of the six components, no improvement in average C α RMSD was observed (data not shown). This result also suggests the requirement of all six components. Additionally, the inclusion of HBND, PPDA, and OMDA components is expected to have discriminative ability for a possible chain modeling application.
A knowledge-based decoy discriminatory function (DFMAC) was successfully developed. The DFMAC function requires the input data of the coordinates of only three main-chain atom types (N, C α , and C) per each amino-acid residue. The function is formalized as the combination of six pseudo-energy calculating components. Each component evaluates a different feature of a protein. The native or near-native structures in various types of decoy sets were recognized with high accuracy. The discrimination ability was nearly comparable to other state-of-the-art coarse-grained or all-atom-type scoring functions.
One notable feature of the function is the simplicity of the required representation of the model structures, consisting of only three main-chain atom coordinates per residue. Such input structural data is beneficial for structure modeling. Because the side chain conformation need not be scanned, the main chain conformation scan could be facilitated. The scanning of different folds for evaluation of sequence-fold compatibility could also be facilitated. The construction of an all-atom model is possible by assigning side-chain coordinates to a reasonable main-chain model.
The considerable accuracy of the whole function was derived mainly from the DABG component. This component evaluates the relative orientation of the pseudo C β atom against the associated C α atom between two residues. The recognition of acceptable orientation conversely restricts the degree of freedom of main-chain conformation and side-chain orientation. Thus, more accurate fold recognition could be provided than a simple distant dependent function among, for example, C α atoms. The all-atom distant-dependent-type functions would implement similar or more accurate fold recognition, by judging acceptable main-chain/side-chain orientation with multiple distances per residue. Although our representation of the structure is far simpler, an alternative structure recognition mechanism would be implemented, at least partially. The effectiveness of orientation-dependent potentials was also shown by Buchete et al. [9, 10]; the interaction centers were defined for respective side chains and peptide bonds, and six parameters were used to express a single pairwise interaction. In our case, evaluation was performed with more limited conditions, using only a single point per residue and as few as 4 parameters per pairwise interaction between the points; however, the DABG component of DFMAC was able to provide considerable discrimination capacity.
Compared with the DABG and SURR function components, the contributions of 4 other components were smaller; however, the DABG component does not evaluate local main-chain conformation within a 2-residue distance. The SURR component also does not evaluate local main-chain conformation. Thus, HBND, PPDA, and OMDA components were implemented to recognize the allowed conformation for possible model building experiments, although little pullback of discrimination was observed. Additionally, since many decoy structures already had reasonable local conformations, significant contribution of these components might not be observed. Improvement of these components to more harmless and versatile ones could help refine the overall function.
Nine of the 77 test targets failed in native or near-native structure identification. The IDs were 1bl0, 1dtk, 4pti, 1eh2, 2pna, 1bq9, 1elw, 1rnb, and 2ci2. Many plausible reasons for the failure of (near-) native structure recognition can be found [15, 16]. The distorted geometries of 1dtk could harm the scoring of the native structure . The presence of other chains in the crystal structure is also a possible reason for the failure . The difficulty of 1bl0, which was bound to DNA, might also have failed because of the complex structure. The 1rnb is also a complex of protein and small molecules. Interaction with the metal ion might also make discrimination harder. The crystal structure of 1bq9 contains Fe(III) ion. NMR structures are often suggested to be difficult to identify . The 1dtk, 1eh2, and 2pna are NMR structures, and a difficulty might arise for that reason. The difficulty with smaller proteins is also frequently discussed. We tested the correlation of protein size and accuracy, and we also found a difficult tendency for smaller polypeptides (data not shown). The failure of 1dtk (57 residues), 4pti (58 residues), 1bq9 (51 residues), and 2ci2 (62 residues) might have resulted. The possible origin of failure of the remaining 1elw top structure was not apparent.
The capacity of DFMAC to recognize a correct fold among different folds, which are separated in the structure space, is not apparent. In the rosetta decoy set , each target consists of 20 refined native structures and the 100 lowest scoring models out of ~10,000 de novo predicted models among a variety of conformations. Evaluation of the test set with DFMAC resulted in correct identification of 68.4% (13/19) native and 78.9% (15/19) near-native structures; therefore, the capacity for fold recognition which could support de novo structure prediction might be expected.
The high C α RMSD of the top structures was frequently observed for some decoy sets, such as lmds or rosetta. One of our next challenges is to improve our function to cover these "difficult" decoy sets. The introduction of a high-resolution structure dataset for database construction  and the development of an additional all-atom-type evaluation system are possible solutions. Additionally, since the function was mainly implemented with pairwise interactions, a frustrated structure, which consists of locally allowed pairwise interactions, might be positively evaluated. Based on these considerations, further improvement of the function in decoy or fold discrimination ability is now in progress.
A novel knowledge-based decoy discrimination function, DFMAC, was successfully constructed. Despite the simple representation of protein structure models of input data, the discrimination ability was nearly comparable to other coarse-grained and all-atom-type functions. The orientation-dependent pseudo-energy calculating component (DABG), in addition to the component for the number of surrounding atoms (SURR), was found to be significantly effective for performance of the function. A variety of applications of the function to support activities such as structure prediction is expected.
Overview of the function formulation
The function for total energy calculation is formulated as the sum of six weighted pseudo-energy terms:
where E total is the total pseudo-energy, the "w" and "E" with subscripts on the right side of the equation are the weights and pseudo-energy calculation components, respectively. The subscripts of the terms on the right side correspond to the six respective types of pseudo-energy components, dealing with the distances between two C α atoms (referred to as DIST), the relative orientation of the vector of C α to pseudo-C β atom coordinates between two residues (DABG) (Figure 1A), hydrogen bonds between the main-chain amino hydrogen and the main-chain carbonyl oxygen atoms (HBND) (Figure 1B), the ψ-φ dihedral angles (PPDA), the ω dihedral angle (OMDA), and the number of the surrounding C α atoms (SURR). Each pseudo-energy component was calculated referring to a specifically precompiled database derived from the known protein structures. The formulation details are described below.
Preparation of database from native protein structures
Databases for pseudo-energy calculation components, which evaluate individual structural features, were derived from 3,313 nonhomologous (less than 25% homology) protein structures with a resolution of better than 2.0 Å and R-factors of better than 0.25. The list of the proteins (compiled on June 23, 2007) was provided by PISCES server http://dunbrack.fccc.edu/PISCES.php. Structural data with atomic coordinates were from the protein data bank (PDB) http://www.rcsb.org. The coordinates of the three main-chain atom types of amino nitrogen (N), alpha carbon (C α ), and carbonyl carbon (C) of all 3,313 structures were used for database construction. The database was compiled for each of the pseudo-energy calculating components. All databases, except for SURR, were built considering the combination of the subject and object amino acid types. Thus, 400 sub-databases were generated for each component. The SURR database consists of 20 sub-databases for respective amino acid types of subject residues. The domains of parameter(s) were divided into uniform-sized bins. All measurements, which met the specific criteria below, were classified and counted in corresponding bins. For the DIST database, C α pairwise distances were compiled with 64 bins in the range from 0.0 Å to 21.0 Å. Residue pairs within a certain distance in the amino acid sequence were not included. This pair inclusion range in the sequence was tuned by the procedures described below. The DABG database was derived using C α atom coordinates and pseudo-C β atom coordinates, which were generated based on the coordinates of N, C α , and C atoms of the respective residues. Four parameters of the distance between the two C α atoms, and the α, β, and γ angles were applied to represent unique relative orientation between two C α -pseudo-C β atom vectors of the residues (Figure 1A). The criteria for residue inclusion and distance parameter range were the same as DIST. The pair inclusion range in the sequence were also tuned as described below. The α and β angles (0° to 180°) were divided into 16 bins, and the γ angle (-180° to 180°) was divided into 32 bins. Data were compiled into four-dimensional sub-databases. The HBND database was derived using the coordinates of pseudo amino hydrogen atoms (H) and pseudo carbonyl oxygen atoms (O), which were generated using N, C α , and C main-chain atom coordinates. (Note: The N-terminal pseudo-H and C-terminal pseudo-O of each fragment could not be generated and compiled because of the absence of their preceding and following residues, respectively.) The three parameters of the distance between pseudo-H and pseudo-O atoms, and the η and θ angles were calculated (Figure 1B). The distance, ranging from 1.7 Å to 2.9 Å, was divided into 4 bins. Either of the η and θ angles, ranging from 0° to 180°, was divided into 16 bins. Data were compiled into three-dimensional sub-databases. The PPDA database was derived using the ψ and φ main-chain dihedral angles of the peptide bond. We constructed a database dealing with the ψ angle of one residue and the φ angle of the next residue, which is different from the standard Ramachandran-type representation (i.e. φ and ψ angles for a single residue). Thus, 400 variations of sub-databases were generated for respective permutations of amino acid types of two adjacent residues. Either of the angles, ranging from -180° to 180°, was divided in 64 bins. Data were compiled into two-dimensional sub-databases. The OMDA database was derived using the ω dihedral angles of peptide bonds. Thus, the variation of sub-databases was 400, corresponding to the respective amino acid permutations. The ω angle, ranging from -180° to 180°, was divided in 128 bins. The SURR database is related to the degree of embedding of the residue in a molecule. The number of C α atoms in the sphere with a certain radius from the central C α atom coordinate of the focusing residue was counted as the surroundings. The radius was also tuned as described below. The count was classified into the respective bins of corresponding counts. Data were compiled into 20 sub-databases, associated with the respective amino acid types.
Pseudo potential energy calculation
Each of the above six pseudo-energy calculation components was derived based on the Boltzmann law . The pseudo-energy (E s ) for a state "s" was calculated with the following equation:
where N obs (s) and N exp (s) represent the number of observed and expected counts for the state (s), respectively. Since the energy values were utilized as relative scores throughout the analyses, the factor of kT was not included in the formula with the assumption of constant temperature. Unless otherwise stated, counts in the corresponding bin of the database were used as N obs (s). The expected count is also referred to as the reference, which was from the distribution without the interaction focused on for the component. The total pseudo-energy for each component (E COMP ) was calculated as:
The criteria for inclusion in the energy summation were defined by individual components. The parameter set, bin size, and bin distribution were the same as the database construction conditions described above. The specific energy calculation conditions for each component were as described below. In the case of DIST component calculation, the measured distances between two C α atoms of the respective residue pairs were used as specific states. In consideration of the finite size of proteins, the corresponding N exp (s) was calculated as :Nexp(d) = a*d2 *exp(-(d/b)c)* Δd
where d is the central value of the corresponding distance bin, Δd is the bin size, and a, b, and c are constants. Constant a was adjusted as follows: the sum of the counts (A sum ) in the database was calculated for bins ranging from 56th to 64th (i.e. the distance ranging from 18.05 Å to 21 Å). On the other hand, the integral of N exp (d) (A integ ) for the same distance range was calculated assuming that a is 1. The value of (A sum /A integ ) was then re-assigned to the a constant. The b and c constants were scanned and tuned with the procedure described below. The pairs, which have distances ranging from 0 Å to 15.75 Å, and a certain degree of sequence separation of |i-j| between ith and jth residues in the amino acid sequence, were subjected to energy calculation. The minimum limit of sequence separation was subjected to tuning. When the count of the bin in the database was 0, a penalty energy value was alternatively assigned. This was also tuned. For DABG energy calculation, the parameters of the distance, and α, β, and γ angles were calculated with pseudo-C β , as described above in the database preparation (Figure 1A). N obs was the average count of the bins in the database with the center at the position corresponding to the measured parameters of the distance, and α, β, and γ angles. Averaging with bins extending to certain position ranges for both sides was applied for each parameter. These ranges were also tuned. N exp was calculated as follows: assuming that the C α -pseudo-C β vector orientation is random, the probability density function (P) for α or β parameters is:
The P(γ) is supposed to distribute uniformly along the γ angle; thus, N exp (d, α, β, γ) is formulated as:
where α and β are the center of the corresponding bin, Δγ is the bin size of γ. N obs (d) of the DIST component was used as a substitute for N exp (d). Residues with the same sequence separation as DIST were evaluated. The penalty value for the 0 of the average count was subjected to tuning. The method of HBND energy calculation was the simpler version employed by Kortemme et al. . The function evaluated only the main-chain hydrogen bond between pseudo-H and pseudo-O atoms using three parameters (Figure 1B). As described for the DABG component, the probability density function under the random orientation of N-H or C-O vectors is represented as:
The expected distance distribution probability P exp (d) is assumed to have a similar form to the finite ideal gas reference state  as:P(d) = c*d1.6
where d is the center of each distance bin. Constant c was adjusted to make the sum of the probability for 4 total bins of possible conditions to 1. N exp (d, η θ) for HBND component was thus expressed as:
where N total is the total observed count compiled in the corresponding pairwise sub-database. The penalty value was tuned. In the case of PPDA energy calculation, equal distribution on the ψ-φ plane was assumed for the expected probability. Thus, N exp (ψ,φ) for PPDA component is:Nexp (ψ, ϕ) = N total * Δψ/360* Δϕ/360
where N total was the total observed count compiled in the corresponding pairwise sub-database, and Δψ and Δφ were the bin sizes of the respective angles. The penalty value was tuned. OMDA energy was calculated similarly with equal distribution along the ω axis assumed for the expected probability. Thus, N exp (ω) was:Nexp(ω) = N total * Δω/360
where N total was the total observed count, and Δω was the bin size of the ω angle. The penalty value was tuned. The concept of SURR energy was similar to the solvation potential by Jones . The distribution of the observed count of surrounding C α atoms was compiled in a procedure similar to the database construction for each amino acid type. The resultant database was standardized by each of the sub-databases, and then used as the expected count of N exp (n) as:
where n is the number of surrounding C α atoms in a sphere, N aa is the sum of the counts for a specific amino acid type over the surrounding numbers, N tot is the sum of the counts of all residues of all structures over the surrounding numbers, and N(n) is the count of the specific number of surroundings (n) for all residues of all structures. The radius of the sphere and the penalty value were tuned.
The Decoys 'R' Us decoy database http://dd.compbio.washington.edu/download.shtml, which was compiled by Samudrala and Levitt , was used for parameter tuning and function evaluation. For the l30 target in the fisa_casp3 decoy set, 1ck2 from PDB was used as the native structure . C α RMSD of the l30 native structure in the original list (1.882 Å) was used as the native C α RMSD. The moulder decoy set ftp://salilab.org/decoys/comp_models.tar.gz[15, 20] and the all atom decoy set from Rosetta@home http://depts.washington.edu/bakerpg/decoys/rosetta_decoys_62proteins.tgz (or "rosetta"), which were from the homepages of the Sali lab. and Baker lab., respectively, were also used. The rosetta set contains well-scoring Rosetta protein models and their native crystal structures for 59 proteins, without 3 NMR structures. We repacked 141 structures of one native PDB structure, the 20 refined native structures, the 100 lowest scoring models out of ~10,000 total models, and 20 random models, per protein into a single target entry for use.
Determination of initial parameter values
The 7 decoy sets of 4state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, and lmds from the Decoys 'R' Us database  were used to search the initial parameter values. Several parameters were scanned at a time, and the best values of the parameters were determined successively. The procedure was repeated until all the parameters were scanned. Following are the function components and their associated parameters determined by the above procedure: the DIST component, the distance range for database construction and scoring, the sequence separation for database construction and scoring, the upper distance limit for scoring, the function form of N exp , the values of b and c, the lower distance limit for determination of the a value, and the penalty value; the DABG component, the distance range for database construction and scoring, the sequence separation for database construction and scoring, the upper distance limit for scoring, the range of neighboring bins for averaging the counts, and the penalty value; the HBND component, the distance range for database construction and scoring, and the penalty value; the PPDA and OMDA components, the penalty values; the SURR component, the radius of the sphere for database construction and scoring. Bin sizes for all of the components were determined appropriately. Each time a new parameter value was applied, the weight parameters w for the respective energy components were scanned, and the performance was evaluated with the temporary optimized weight values. Because the total pseudo-energy was used as a relative index value (not as an absolute energy), w DIST was fixed as 1 and the remaining 5 weights were scanned. The searching procedure for the weights was as follows: firstly, all of the combinations of discrete weight values, evenly spaced in a logarithmic scale (0.01 to 31.6, 15 steps), were evaluated, and the weight set with the best discrimination performance was selected. Then, another more precise cycle was carried out around the set of weight values determined by the previous scan (0.56- to 1.78-fold the previous weight value, 11 steps). The optimized weights for the function were 1, 0.316, 0.141, 0.200, 0.00562, and 0.178 for the components of DIST, DABG, HBND, PPDA, OMDA, and SURR, respectively.
Function tuning by cross validation
The temporary optimized function by the previous procedure was further tuned through the cross-validation procedure. The 231 targets from the Decoys 'R' Us database, and the decoy sets of moulder and rosetta, were split into 154 (2/3 of total) for the training set and 77 (1/3 of total) for the test set according to their temporarily-assigned serial numbers. The targets were listed in the sequence of the decoy sets, and serial numbers of multiples of 3 were selected as the test set, and the rest of the targets were the training set. Thus, each decoy set was included in both training and test sets with a roughly equal ratio, without any intentional bias. The parameters were tuned by 10-fold cross validation. The above training set was divided into 10 segments, with new identification numbers in cyclic order. The function with temporally optimized weights was evaluated on one remaining target segment. The average C α RMSD of top structures from 10 evaluations of all combinations of training and evaluation was set as the performance index. The weights of the function components, with a set of updated parameter values, were optimized for 9 target segments to the minimum C α RMSD average. Weights were optimized by the following iterative cycles of scanning: the combination of discrete values, equally spaced in the logarithmic scale for each of the six weights, was scanned in a cycle. The 1st cycle was 5 steps of 0.01 to 100. The following 4 cycles were repeated for the 8 best weight sets found by the 1st cycle, which were separated by at least a 6-step distance. Each of the 2nd to 5th cycles evaluated the 3 steps of the parameters, i.e. w-r, w, and wr, where w was the weight value selected by the previous cycle, and r was the factor of the scanning range. If any one of the selected parameter values was not the previous one (i.e. the optimum was not at the center of scanning), the same cycle was repeated again with the selected parameter values. The r values of 2nd, 3rd, 4th, and 5th cycles were 3.16, 1.78, 1.33, and 1.15, respectively. The final weight values tuned by the procedure were 1.00, 0.662, 0.765, 0.0372, 1.02, and 43.0 for the components of DIST, DABG, HBND, PPDA, OMDA, and SURR, respectively. The tuned parameters, the order of the parameters successively scanned, and the initial, scanned, and final values are listed in Table 1.
The performance measures and their definitions are as follows: C α RMSD, the root mean square deviation of the C α -C α pairs between the native structure and the model with the best energy; Z-score, the score of the native structure, which was calculated under the standard definition  (Note: the positive value corresponds to the lower (better) energy than average.); C.C., Pearson's correlation coefficient among the structures including the native and the decoys; F.E., the fraction of the top 10% lowest C α RMSD structures in the top 10% best-energy structures among the structures, including the native and the decoys; R B1 , the C α RMSD rank of the best-energy structure among the decoy structures; logP B1 , the common logarithm of the probability of selecting the best decoy structure, where P B1 = R B1 /(number of decoy structures); R B10 , the lowest C α RMSD rank in the 10 best-energy decoy structures among the decoys; logP B10 , the common logarithm of the probability of selecting the best decoy structure in the 10 best-energy decoy structures, where P B10 = R B10 /(number of decoy structures); C.C. decoy , the C.C. among the decoys; F.E. decoy , the F.E. among the decoys.
Poole AM, Ranganathan R: Knowledge-based potentials in protein design. Curr Opin Struct Biol 2006, 16: 508–513. 10.1016/j.sbi.2006.06.013
Boas FE, Harbury PB: Potential energy functions for protein design. Curr Opin Struct Biol 2007, 17: 199–204. 10.1016/j.sbi.2007.03.006
Gordon DB, Marshall SA, Mayo SL: Energy functions for protein design. Curr Opin Struct Biol 1999, 9: 509–513. 10.1016/S0959-440X(99)80072-4
Zhou Y, Zhou H, Zhang C, Liu S: What is a desirable statistical energy function for proteins and how can it be obtained? Cell Biochem Biophys 2006, 46: 165–174. 10.1385/CBB:46:2:165
Sippl MJ: Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 1990, 213: 859–883. 10.1016/S0022-2836(05)80269-4
Jones DT: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999, 287: 797–815. 10.1006/jmbi.1999.2583
Zhou H, Zhou Y: Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002, 11: 2714–2726. 10.1110/ps.0217002
Kortemme T, Morozov AV, Baker D: An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003, 326: 1239–1259. 10.1016/S0022-2836(03)00021-4
Buchete NV, Straub JE, Thirumalai D: Continuous anisotropic representation of coarse-grained potentials for proteins by spherical harmonics synthesis. J Mol Graph Model 2004, 22: 441–450. 10.1016/j.jmgm.2003.12.010
Buchete N-V, Straub JE, Thirumalai D: Orientational potentials extracted from protein structures improve native fold recognition. Protein Sci 2004, 13: 862–874. 10.1110/ps.03488704
Chen Y, Kortemme T, Robertson T, Baker D, Varani G: A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific, contacts and discriminates decoys. Nucleic Acids Res 2004, 32: 5147–5162. 10.1093/nar/gkh785
Wang K, Fain B, Levitt M, Samudrala R: Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct Biol 2004, 4: 8. 10.1186/1472-6807-4-8
Zhang C, Liu S, Zhou H, Zhou Y: An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci 2004, 13: 400–411. 10.1110/ps.03348304
Tosatto SC: The victor/FRST function for model quality estimation. J Comput Biol 2005, 12: 1316–1327. 10.1089/cmb.2005.12.1316
Shen MY, Sali A: Statistical potential for assessment and prediction of protein structures. Protein Sci 2006, 15: 2507–2524. 10.1110/ps.062416606
Fogolari F, Pieri L, Dovier A, Bortolussi L, Giugliarelli G, Corazza A, Esposito G, Viglino P: Scoring predictive models using a reduced representation of proteins: model and energy definition. BMC Struct Biol 2007, 7: 15. 10.1186/1472-6807-7-15
Liu T, Samudrala R: The effect of experimental resolution on the performance of knowledge-based discriminatory functions for protein structure selection. Protein Eng Des Sel 2006, 19: 431–437. 10.1093/protein/gzl027
Samudrala R, Levitt M: Decoys 'R' Us: A database of incorrect protein conformations to improve protein structure prediction. Protein Sci 2000, 9: 1399–1401.
Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D: An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 2003, 53: 76–87. 10.1002/prot.10454
John B, Sali A: Comparative protein structure modeling by iterative alignment, model building, and model assessment. Nucleic Acids Res 2003, 31: 3982–3992. 10.1093/nar/gkg460
Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 1998, 275: 895–916. 10.1006/jmbi.1997.1479
Wang G, Dunbrack RL Jr: PISCES: a protein sequence culling server. Bioinformatics 2003, 19: 1589–1591. 10.1093/bioinformatics/btg224
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
The source code and the compiled databases are available on request.
YM conceived the project, designed the function, carried out the computational experiments, and drafted the manuscript. NI provided intellectual guidance and mentorship. Both authors read and approved the final manuscript.