Bmc Structural Biology a Knowledge-based Structure-discriminating Function That Requires Only Main-chain Atom Coordinates

Background: The use of knowledge-based potential function is a powerful method for protein structure evaluation. A variety of formulations that evaluate single or multiple structural features of proteins have been developed and studied. The performance of functions is often evaluated by discrimination ability using decoy structures of target proteins. A function that can evaluate coarse-grained structures is advantageous from many aspects, such as relatively easy generation and manipulation of model structures; however, the reduction of structural representation is often accompanied by degradation of the structure discrimination performance.


Background
Protein structure evaluation is a key process in protein structure prediction, in association with comparative modeling, fold recognition, structure refinement, and de novo folding. Protein design technology also requires structure evaluation methods with sufficient capacity.
Many different types of potential energy functions have been developed and examined. The formulation of the functions can be roughly grouped under physical-based and knowledge-based approaches [1,2,4]. Physical-based (or molecular mechanics) potential energy functions are mainly used for the simulation of protein folding and dynamics [2], and are also effective for protein design [3]. The knowledge-based approach to developing such an evaluation system is also effective and widely used, especially for protein structure prediction and protein design studies [1]. The classical approach is the extraction of "pseudo" mean potentials from the distribution of pairwise distances of known protein structures based on the Boltzmann law [5]. A number of potential constructions and their successful applications have been reported [5][6][7][8][9][10][11][12][13][14][15][16]. Recently, improved accuracy has been facilitated, accompanied with the accumulation of high-resolution protein structure information [17].
The assessment of pairwise distances is implemented in many knowledge-based functions. Several variations of atom types are utilized, such as C α and/or C β atoms [7], the center of mass of the side chain [16], and heavy atom representation for a variety of atom types [7]. The functions of other structural features, including hydrogen bonds [8], main-chain dihedral angles [14], and solvation potentials [6], were also reported. A number of functions have been formulated as a combination of the above functional components. The introduction of orientationdependent components often improves the accuracy of the function. The hydrogen bond is a typical example, and the effectiveness of orientation-dependent potential was reported [8]. Buchete et al. introduced another type of orientation-dependent potential, using the pairwise interaction of local reference states for respective amino acids [9,10].
The structure discrimination capacity of the function is frequently estimated on the basis of the ability to correctly identify native or near-native structures from nonnative but plausible "decoy" structures. The "Decoys 'R' Us" database [18] is a collection of decoy sets, and is commonly used to evaluate functions. The database consists of 10 decoy sets, generated by different methods. Many other decoy sets, such as the "moulder" [15] or the "rosetta" [19], are also utilized to assess functions. It is commonly understood that the performance of structure evaluation functions tends to depend strongly on the intrinsic properties of decoy generation methods and/or other qualities of decoy sets [12]. Thus, many reports have assessed functions using multiple decoy sets and/or effective statistical techniques.
The compatibility of the structure-discriminating function for reduced structural representations provides many beneficial effects. For example, the generation and manipulation of model structures can be performed without more complexed structure construction; however, it is difficult to reduce the required structural information without losing the accuracy of the scoring function.
In this article, we report the development of a knowledgebased protein structure-discrimination function. The complexity of the required input structure data for evaluation was limited to the main-chain trace with only three atom coordinates (N, C α , and C) per respective amino acid residue. To overcome the possible loss of accuracy of decoy discrimination, orientation-dependent potential between two C α -pseudo-C β vectors was introduced. The parameter training and the subsequent performance test were carried out using the decoy sets from the Decoys 'R' Us database, in addition to the moulder and the rosetta decoy sets. High accuracy in native or near native structure recognition was observed in the test set. The level of discrimination ability was nearly comparable to other coarse-grained or all-atom-type functions. A detailed description of the development of the function and evaluation of the discrimination ability are provided.

Function Design
Before explaining the results of function development and structure evaluation, the overall design of the function is briefly described. The details of the function formulation can be found in Methods. The structure-discriminating function developed in this study consists of six pseudoenergy calculation components. Each of the components evaluates the distinctive structural feature of a target protein. The pseudo-energy is calculated based on the Boltzmann law [5], with knowledge-based procedures using a precompiled database from a non-redundant set of known structures. The six structural features focused on are as follows: the C α pairwise distance (the corresponding functional component is referred to as DIST), the relative orientation between two vectors of C α -pseudo-C β (DABG component, Figure 1A), hydrogen bonding between a main-chain amino hydrogen and a carbonyl oxygen (HBND component, Figure 1B), the main-chain dihedral angles of the combination between ψ at a residue and φ at the next residue (PPDA component), the mainchain ω dihedral angle (OMDA component), and the number of surrounding C α atoms around a C α atom (SURR component). Each atom coordinate is treated separately by twenty amino acid types. The overall function is formulated with the weighted linear combination of the above six pseudo-energy components. As the function was designed to require three main chain atom coordinates (amino nitrogen, C α , and carbonyl carbon) per residue as input data, we refer to the final form of the function as DFMAC (Discriminating Function using Main-chain Atom Coordinates).

Function training with decoy sets
The parameters associated with each component and the weights of respective components are not inherently clar-Schematic representation of the pairwise residue parameters for pseudo-energy components Figure 1 Schematic representation of the pairwise residue parameters for pseudo-energy components. (A) DABG component. Distance d (Å) is measured between two C α atoms. The α angle (degree) is formed with the C α -pseudo-C β vector of i th residue and the C α -C α vector. The β angle (degree) is formed similarly for j th residue. The γ is the dihedral angle (degree) formed with the four atom coordinates of the C α and the pseudo-C β for the respective i th and j th residues. (B) HBND component. Distance d (Å) is measured between pseudo-H atom of the i th residue and pseudo-O atoms of the j th residue. The η angle (degree) is formed with the pseudo-H-N vector of the i th residue and the pseudo-H-pseudo-O vector. The θ angle (degree) is formed with the pseudo-O-C vector of the j th residue and the pseudo-O-pseudo-H vector.
ified. In order to search for and determine the parameter values, we used decoy sets.
The parameter values and weights were determined on the basis of the discrimination ability of the native structure from its decoys. The outline of the tuning procedure is as follows: (1) The probable set of values of the parameters was scanned and determined using arbitrarily collected decoy sets. (2) The parameters were further tuned using the training decoy set by the cross-validation procedure.
(3) The weights of the respective function components were finally determined using the entire training set. The performance of the tuned function was evaluated using the test decoy set, which was distinctive from the training set. Details of the procedure are in Methods.
To determine the initial values of the parameters for the following tuning, we used 7 decoy sets of 4state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, and lmds from the Decoys 'R' Us database http:// dd.compbio.washington.edu/ [18]. (Note: Although some targets used in the final performance test were included in these decoy sets, some of the parameter values of the respective components and weights were changed after the subsequent parameter tuning procedure (Table 1). Thus, bias in the final performance test is considered to be limited.) Parameters which decreased the average square values of C α RMSD of the best pseudo-energy structures for respective protein targets in the 7 decoy sets were successively selected.
Using the probable parameters determined above as the initial parameter set, further tuning was then carried out. The 231 targets of the 10 decoy sets from the Decoys 'R' Us database, the moulder decoy set ftp://salilab.org/decoys/ comp_models.tar.gz [15,20], and the all-atom decoy set from Rosetta@home http://depts.washington.edu/bak erpg/decoys/rosetta_decoys_62proteins.tgz, were separated into 77 and 154 targets and used for testing and training (or parameter tuning) the function, respectively. (Note: Each set consisted of targets from a variety of decoy sets.) Tuning was performed by 10-fold cross validation using the training set. Briefly, nine of ten parts of targets were used for parameter training, and then the function was validated with the remaining part of the targets. Performance with a distinct parameter set was evaluated by the C α RMSD average of the top structures from 10 evaluations with different training and validation combinations. After successive tuning of the parameters, the new parameter set was obtained ( Table 1, see Methods). Finally, the weights of the respective function components were determined on a whole training decoy set with the new parameter set. Some of the parameters and weights were changed from the initial values during the above procedure.  A summary of the performance of the tuned function on the training set is shown in Table 2. Of 154 training targets, 115 (74.7%) native and 135 (87.7%) near-native (i.e. < 2.0 Å C α RMSD) structures were correctly identified as the best energy structures. The averages of the Z-score, correlation coefficient (C.C), and fraction enrichment (F.E.) were sufficiently positive. The performance on decoy structures, without native structures, is critical, because no native or near-native structure is available prior to structure prediction experiments. Thus, analyses were also carried out on decoys without native structures. Discrimination performance of decoy structures were also positive as indicated by the average values of logP B1 , logP B10 , and the correlation coefficient (C.C. decoy ) and fraction enrichment (F.E. decoy ) among decoy structures.

Decoy discrimination performance test
The performance of the tuned function cannot be evaluated on the training set itself, because versatility is not necessarily assured because of the possibility of over-learning; therefore, the structure discriminating ability of the tuned DFMAC was tested on the above test set, containing different targets from the training set. The results are summarized in Table 2, and the details are shown in Table 3. A large number of native structures of the respective protein targets were correctly identified as the best-energy (i.e. the lowest energy value) structures (Table 2). Correct identification of the native structures was 59 out of 77 targets (76.6% success), and the identification of near-native structures (C α RMSD < 2 Å) was 68 (88.3% success). The possible interpretations of failed identification of the remaining 9 targets are discussed below. The significantly positive average values of Z-score, C.C., and F.E. indicate considerable overall performance. The averages of the respective decoy discrimination scores (logP B1 , logP B10 , C.C. decoy , and F.E. decoy ) were also significantly positive.
Although the average C α RMSD of top-energy structures was 1.174, which was a little larger than the average on the training set, the percentage of correctly identified native or near-native structures was similar to the training set. Additionally, other indexes were also similar between the training and test sets. Thus, a certain degree of versatility was confirmed.
As for the effectiveness for the individual decoy sets (Table  3), nonuniformity was observed, as mentioned in the Background. The best average Z-score was obtained for lattice_ssfit (10.499), and the worst for hg_structal (1.762). The average Z-score values were positive in all decoy sets. The best average C.C. and average F.E. were for moulder (0.824 of C.C. and 62.9% of F.E.), and the worst were for lattice_ssfit (0.045 of C.C. and 12.8% of F.E.). In Figure 2, three examples of energy distribution against C α RMSD are shown. The average C.C. decoy of 4state_reduced (0.767), hg_structal (0.800) and moulder (0.821) were relatively high. The worst average C.C. decoy (0.000) was obtained for lattice_ssfit. The average F.E. decoy of 4state_reduced (61.5%) and moulder (61.9%) were significant, and the worst was for ig_structal_hires (0.0%).

Comparison with other statistical potentials
We compared the performance of the DFMAC with 6 different state-of-the-art statistical potentials of DOPE [15], RAPDF [21], DFIRE [7], and PC2CA [16]. To exclude possible training biases, the target structures for comparison were restricted to the entries listed in our test set. Comparison of the rank of the native structures is shown in Table  4. DOPE, RAPDF, and DFIRE-A use residue-specific heavy atom representations. DFIRE-B uses the main-chain and C β atoms. PC2CA uses C α atoms and the side-chain center of mass. DFMAC uses main-chain atoms (N, C α , and C) per residue, while evaluation was carried out with generated pseudo atoms of C β , main-chain amino hydrogen (H) and carbonyl oxygen (O). In regard to the total number of correct identifications out of 11 total proteins, DOPE and DFIRE-A identified 10 native structures, followed by PC2CA (9 correct), DFMAC (8), RAPDF (7), and DFIRE-B (6). From this viewpoint, the performance of DFMAC was moderate. DOPE was also significant according to the averaged rank (4.0), followed by DFMAC (8.1), DFIRE-B (26.9) and DFIRE-A (40.0). RAPDF and PC2CA were similar (~60). Although the number and the types of targets applied here were limited and biased to a certain degree, DFMAC provided at least one better performance indexes against many other functions.
Among the functions dealing with coarse-grained structure representation (DFIRE-B, PC2CA, and DFMAC), PC2CA had the largest number of correct identifications. We thus carried out additional comparison of DFMAC with PC2CA to identify the detailed relative performance of DFMAC. The performance of DFMAC on the targets of Summarized values are shown by the respective decoy sets. The numbers of total target proteins evaluated (N all ), and correct identification of the native (N n ) or near-native (C α RMSD < 2 Å) (N nn ) structures are shown. The C α RMSD, Z-score, C.C., F.E., R B1 , logP B1 , R B10 , logP B10 , C.C. decoy , F.E. decoy are the average of the respective scores of the target proteins evaluated. The definition of each index is described in Methods. The performance scores for respective PDB IDs of the target proteins and their average are shown by individual generation methods. The number of decoy structures and single native structure (N), the rank of the native structure relative to decoy structures based on the calculated pseudo-energy (R nat ), and the rest of the scores, described in the Methods, are shown. Examples of the distribution of total pseudo-energy against C α RMSD the test set, except for moulder and rosetta decoy sets, was compared with PC2CA results reported in the literature [16] (Table 5). Forty (78.4%) native structures were correctly identified by DFMAC from test targets, while PC2CA identified fewer native structures of 16 (31.4%). PC2CA and DFMAC had distinctive performances for the respective decoy sets. For example, PC2CA showed better performances with all of the averaged indexes (correct, C α RMSD, Z-score, C.C., and F.E.) for the lmds decoy set, while DFMAC was better for 4state_reduced, ig_structal and ig_structal_hires. All of the summarized indexes were better with DFMAC. Although the number and kinds of decoy sets used here were limited in number and compilation of a variety of characteristics, the performance of DFMAC could be at least roughly similar to one of the state-of-the-art functions, PC2CA.

Contributions of the function components to performance
DFMAC consists of six pseudo-energy calculating components. We evaluated the contribution of each component to the structure discrimination ability. The original DFMAC function was compared to functions without any of the components on the test set (Table 6). A significant increase in average C α RMSD without the DABG component, followed by the SURR component, was observed, indicating the major contributions of the two components. The deficiency of discrimination ability without these two components was similarly observed for most of the other indexes, supporting the significance of these The rank of the native structure identified by respective functions is shown for the targets listed. The results of DFIRE-A, DFIRE-B and RAPDF were from the literature [7]. The results of DOPE were from [15]. The results of PC2CA were from [16]. The average rank (average) and the number of correctly identified native structures (correct) in 11 targets are shown. Only the results for targets listed in our test set are compiled and shown. The number of correctly identified native structures (correct) out of the total targets (total) is shown by individual generation methods. The averages of C α RMSD, Z-score, C.C., and F.E. for the respective decoy sets are also shown. In the "Summary" column, the sum of total and correct counts, and the averages of C α RMSD, Z-score, C.C., and F.E. over the respective protein targets, are shown. The results of PC2CA [16] were used and the respective score averages were calculated.
components. The influence of any one of 4 other components was smaller, and most indexes remained similar to the original function; however, the averaged rank of the native structures increased without OMDA, indicating a certain degree of contribution. When cross validation with tuned parameters was carried out on the training set without any one of the six components, no improvement in average C α RMSD was observed (data not shown). This result also suggests the requirement of all six components. Additionally, the inclusion of HBND, PPDA, and OMDA components is expected to have discriminative ability for a possible chain modeling application.

Discussion
A knowledge-based decoy discriminatory function (DFMAC) was successfully developed. The DFMAC function requires the input data of the coordinates of only three main-chain atom types (N, C α , and C) per each amino-acid residue. The function is formalized as the combination of six pseudo-energy calculating components. Each component evaluates a different feature of a protein. The native or near-native structures in various types of decoy sets were recognized with high accuracy. The discrimination ability was nearly comparable to other state-of-the-art coarse-grained or all-atom-type scoring functions.
One notable feature of the function is the simplicity of the required representation of the model structures, consisting of only three main-chain atom coordinates per residue. Such input structural data is beneficial for structure modeling. Because the side chain conformation need not be scanned, the main chain conformation scan could be facilitated. The scanning of different folds for evaluation of sequence-fold compatibility could also be facilitated.
The construction of an all-atom model is possible by assigning side-chain coordinates to a reasonable mainchain model.
The considerable accuracy of the whole function was derived mainly from the DABG component. This component evaluates the relative orientation of the pseudo C β atom against the associated C α atom between two resi-dues. The recognition of acceptable orientation conversely restricts the degree of freedom of main-chain conformation and side-chain orientation. Thus, more accurate fold recognition could be provided than a simple distant dependent function among, for example, C α atoms. The all-atom distant-dependent-type functions would implement similar or more accurate fold recognition, by judging acceptable main-chain/side-chain orientation with multiple distances per residue. Although our representation of the structure is far simpler, an alternative structure recognition mechanism would be implemented, at least partially. The effectiveness of orientation-dependent potentials was also shown by Buchete et al. [9,10]; the interaction centers were defined for respective side chains and peptide bonds, and six parameters were used to express a single pairwise interaction. In our case, evaluation was performed with more limited conditions, using only a single point per residue and as few as 4 parameters per pairwise interaction between the points; however, the DABG component of DFMAC was able to provide considerable discrimination capacity.
Compared with the DABG and SURR function components, the contributions of 4 other components were smaller; however, the DABG component does not evaluate local main-chain conformation within a 2-residue distance. The SURR component also does not evaluate local main-chain conformation. Thus, HBND, PPDA, and OMDA components were implemented to recognize the allowed conformation for possible model building experiments, although little pullback of discrimination was observed. Additionally, since many decoy structures already had reasonable local conformations, significant contribution of these components might not be observed. Improvement of these components to more harmless and versatile ones could help refine the overall function.
Nine of the 77 test targets failed in native or near-native structure identification. The IDs were 1bl0, 1dtk, 4pti, 1eh2, 2pna, 1bq9, 1elw, 1rnb, and 2ci2. Many plausible reasons for the failure of (near-) native structure recognition can be found [15,16]. The distorted geometries of 1dtk could harm the scoring of the native structure [16]. The measures are as described in Table 3. The average scores over the test set are shown by the omitted component.
The presence of other chains in the crystal structure is also a possible reason for the failure [16]. The difficulty of 1bl0, which was bound to DNA, might also have failed because of the complex structure. The 1rnb is also a complex of protein and small molecules. Interaction with the metal ion might also make discrimination harder. The crystal structure of 1bq9 contains Fe(III) ion. NMR structures are often suggested to be difficult to identify [16]. The 1dtk, 1eh2, and 2pna are NMR structures, and a difficulty might arise for that reason. The difficulty with smaller proteins is also frequently discussed. We tested the correlation of protein size and accuracy, and we also found a difficult tendency for smaller polypeptides (data not shown). The failure of 1dtk (57 residues), 4pti (58 residues), 1bq9 (51 residues), and 2ci2 (62 residues) might have resulted. The possible origin of failure of the remaining 1elw top structure was not apparent.
The capacity of DFMAC to recognize a correct fold among different folds, which are separated in the structure space, is not apparent. In the rosetta decoy set [12], each target consists of 20 refined native structures and the 100 lowest scoring models out of ~10,000 de novo predicted models among a variety of conformations. Evaluation of the test set with DFMAC resulted in correct identification of 68.4% (13/19) native and 78.9% (15/19) near-native structures; therefore, the capacity for fold recognition which could support de novo structure prediction might be expected.
The high C α RMSD of the top structures was frequently observed for some decoy sets, such as lmds or rosetta. One of our next challenges is to improve our function to cover these "difficult" decoy sets. The introduction of a high-resolution structure dataset for database construction [17] and the development of an additional all-atom-type evaluation system are possible solutions. Additionally, since the function was mainly implemented with pairwise interactions, a frustrated structure, which consists of locally allowed pairwise interactions, might be positively evaluated. Based on these considerations, further improvement of the function in decoy or fold discrimination ability is now in progress.

Conclusion
A novel knowledge-based decoy discrimination function, DFMAC, was successfully constructed. Despite the simple representation of protein structure models of input data, the discrimination ability was nearly comparable to other coarse-grained and all-atom-type functions. The orientation-dependent pseudo-energy calculating component (DABG), in addition to the component for the number of surrounding atoms (SURR), was found to be significantly effective for performance of the function. A variety of applications of the function to support activities such as structure prediction is expected.

Overview of the function formulation
The function for total energy calculation is formulated as the sum of six weighted pseudo-energy terms: where E total is the total pseudo-energy, the "w" and "E" with subscripts on the right side of the equation are the weights and pseudo-energy calculation components, respectively. The subscripts of the terms on the right side correspond to the six respective types of pseudo-energy components, dealing with the distances between two C α atoms (referred to as DIST), the relative orientation of the vector of C α to pseudo-C β atom coordinates between two residues (DABG) ( Figure 1A), hydrogen bonds between the main-chain amino hydrogen and the main-chain carbonyl oxygen atoms (HBND) (Figure 1B), the ψ-φ dihedral angles (PPDA), the ω dihedral angle (OMDA), and the number of the surrounding C α atoms (SURR). Each pseudo-energy component was calculated referring to a specifically precompiled database derived from the known protein structures. The formulation details are described below.

Preparation of database from native protein structures
Databases for pseudo-energy calculation components, which evaluate individual structural features, were derived from 3,313 nonhomologous (less than 25% homology) protein structures with a resolution of better than 2.0 Å and R-factors of better than 0.25. The list of the proteins (compiled on June 23, 2007) was provided by PISCES server http://dunbrack.fccc.edu/PISCES.php [22]. Structural data with atomic coordinates were from the protein data bank (PDB) http://www.rcsb.org [23]. The coordinates of the three main-chain atom types of amino nitrogen (N), alpha carbon (C α ), and carbonyl carbon (C) of all 3,313 structures were used for database construction. The database was compiled for each of the pseudo-energy calculating components. All databases, except for SURR, were built considering the combination of the subject and object amino acid types. Thus, 400 sub-databases were generated for each component. The SURR database consists of 20 sub-databases for respective amino acid types of subject residues. The domains of parameter(s) were divided into uniform-sized bins. All measurements, which met the specific criteria below, were classified and counted in corresponding bins. For the DIST database, C α pairwise distances were compiled with 64 bins in the range from 0.0 Å to 21.0 Å. Residue pairs within a certain distance in the amino acid sequence were not included. This pair inclusion range in the sequence was tuned by the procedures described below. The DABG database was derived using C α atom coordinates and pseudo-C β atom coordinates, which were generated based on the coordinates of N, C α , and C atoms of the respective residues. Four parameters of the distance between the two C α atoms, and the α, β, and γ angles were applied to represent unique relative orientation between two C α -pseudo-C β atom vectors of the residues ( Figure 1A). The criteria for residue inclusion and distance parameter range were the same as DIST. The pair inclusion range in the sequence were also tuned as described below. The α and β angles (0° to 180°) were divided into 16 bins, and the γ angle (-180° to 180°) was divided into 32 bins. Data were compiled into four-dimensional sub-databases. The HBND database was derived using the coordinates of pseudo amino hydrogen atoms (H) and pseudo carbonyl oxygen atoms (O), which were generated using N, C α , and C main-chain atom coordinates. (Note: The N-terminal pseudo-H and C-terminal pseudo-O of each fragment could not be generated and compiled because of the absence of their preceding and following residues, respectively.) The three parameters of the distance between pseudo-H and pseudo-O atoms, and the η and θ angles were calculated ( Figure 1B). The distance, ranging from 1.7 Å to 2.9 Å, was divided into 4 bins. Either of the η and θ angles, ranging from 0° to 180°, was divided into 16 bins. Data were compiled into three-dimensional subdatabases. The PPDA database was derived using the ψ and φ main-chain dihedral angles of the peptide bond. We constructed a database dealing with the ψ angle of one residue and the φ angle of the next residue, which is different from the standard Ramachandran-type representation (i.e. φ and ψ angles for a single residue). Thus, 400 variations of sub-databases were generated for respective permutations of amino acid types of two adjacent residues. Either of the angles, ranging from -180° to 180°, was divided in 64 bins. Data were compiled into two-dimensional sub-databases. The OMDA database was derived using the ω dihedral angles of peptide bonds. Thus, the variation of sub-databases was 400, corresponding to the respective amino acid permutations. The ω angle, ranging from -180° to 180°, was divided in 128 bins. The SURR database is related to the degree of embedding of the residue in a molecule. The number of C α atoms in the sphere with a certain radius from the central C α atom coordinate of the focusing residue was counted as the surroundings. The radius was also tuned as described below. The count was classified into the respective bins of corresponding counts. Data were compiled into 20 sub-databases, associated with the respective amino acid types.

Pseudo potential energy calculation
Each of the above six pseudo-energy calculation components was derived based on the Boltzmann law [5]. The pseudo-energy (E s ) for a state "s" was calculated with the following equation: where N obs (s) and N exp (s) represent the number of observed and expected counts for the state (s), respectively. Since the energy values were utilized as relative scores throughout the analyses, the factor of kT was not included in the formula with the assumption of constant temperature. Unless otherwise stated, counts in the corresponding bin of the database were used as N obs (s). The expected count is also referred to as the reference, which was from the distribution without the interaction focused on for the component. The total pseudo-energy for each component (E COMP ) was calculated as: The criteria for inclusion in the energy summation were defined by individual components. The parameter set, bin size, and bin distribution were the same as the database construction conditions described above. The specific energy calculation conditions for each component were as described below. In the case of DIST component calculation, the measured distances between two C α atoms of the respective residue pairs were used as specific states. In consideration of the finite size of proteins, the corresponding N exp (s) was calculated as [16]: where d is the central value of the corresponding distance bin, Δd is the bin size, and a, b, and c are constants. Constant a was adjusted as follows: the sum of the counts (A sum ) in the database was calculated for bins ranging from 56 th to 64 th (i.e. the distance ranging from 18.05 Å to 21 Å). On the other hand, the integral of N exp (d) (A integ ) for the same distance range was calculated assuming that a is 1. The value of (A sum /A integ ) was then re-assigned to the a constant. The b and c constants were scanned and tuned with the procedure described below. The pairs, which have distances ranging from 0 Å to 15.75 Å, and a certain degree of sequence separation of |i-j| between i th and j th residues in the amino acid sequence, were subjected to energy calculation. The minimum limit of sequence separation was subjected to tuning. When the count of the bin in the database was 0, a penalty energy value was alternatively assigned. This was also tuned. For DABG energy calculation, the parameters of the distance, and α, β, and γ angles were calculated with pseudo-C β , as described above in the database preparation ( Figure 1A). N obs was the average count of the bins in the database with the center at the extending to certain position ranges for both sides was applied for each parameter. These ranges were also tuned. N exp was calculated as follows: assuming that the C αpseudo-C β vector orientation is random, the probability density function (P) for α or β parameters is: The P(γ) is supposed to distribute uniformly along the γ angle; thus, N exp (d, α, β, γ) is formulated as: where α and β are the center of the corresponding bin, Δγ is the bin size of γ. N obs (d) of the DIST component was used as a substitute for N exp (d). Residues with the same sequence separation as DIST were evaluated. The penalty value for the 0 of the average count was subjected to tuning. The method of HBND energy calculation was the simpler version employed by Kortemme et al. [8]. The function evaluated only the main-chain hydrogen bond between pseudo-H and pseudo-O atoms using three parameters ( Figure 1B). As described for the DABG component, the probability density function under the random orientation of N-H or C-O vectors is represented as: The expected distance distribution probability P exp (d) is assumed to have a similar form to the finite ideal gas reference state [7] as: where d is the center of each distance bin. Constant c was adjusted to make the sum of the probability for 4 total bins of possible conditions to 1. N exp (d, η θ) for HBND component was thus expressed as: where N total is the total observed count compiled in the corresponding pairwise sub-database. The penalty value was tuned. In the case of PPDA energy calculation, equal distribution on the ψ-φ plane was assumed for the expected probability. Thus, N exp (ψ,φ) for PPDA component is: N exp (ψ, ϕ) = N total * Δψ/360* Δϕ/360 where N total was the total observed count compiled in the corresponding pairwise sub-database, and Δψ and Δφ were the bin sizes of the respective angles. The penalty value was tuned. OMDA energy was calculated similarly with equal distribution along the ω axis assumed for the expected probability. Thus, N exp (ω) was: N exp (ω) = N total * Δω/360 where N total was the total observed count, and Δω was the bin size of the ω angle. The penalty value was tuned. The concept of SURR energy was similar to the solvation potential by Jones [6]. The distribution of the observed count of surrounding C α atoms was compiled in a procedure similar to the database construction for each amino acid type. The resultant database was standardized by each of the sub-databases, and then used as the expected count of N exp (n) as: where n is the number of surrounding C α atoms in a sphere, N aa is the sum of the counts for a specific amino acid type over the surrounding numbers, N tot is the sum of the counts of all residues of all structures over the surrounding numbers, and N(n) is the count of the specific number of surroundings (n) for all residues of all structures. The radius of the sphere and the penalty value were tuned.

Decoy sets
The Decoys 'R' Us decoy database http://dd.comp bio.washington.edu/download.shtml, which was compiled by Samudrala and Levitt [18], was used for parameter tuning and function evaluation. For the l30 target in the fisa_casp3 decoy set, 1ck2 from PDB was used as the native structure [16]. C α RMSD of the l30 native structure in the original list (1.882 Å) was used as the native C α RMSD. The moulder decoy set ftp://salilab.org/decoys/ comp_models.tar.gz [15,20] and the all atom decoy set from Rosetta@home http://depts.washing ton.edu/bakerpg/decoys/rosetta_decoys_62proteins.tgz (or "rosetta"), which were from the homepages of the Sali lab. and Baker lab., respectively, were also used. The rosetta set contains well-scoring Rosetta protein models and their native crystal structures for 59 proteins, without 3 NMR structures. We repacked 141 structures of one native PDB structure, the 20 refined native structures, the 100 lowest scoring

Authors' contributions
YM conceived the project, designed the function, carried out the computational experiments, and drafted the manuscript. NI provided intellectual guidance and mentorship. Both authors read and approved the final manuscript.