### Overview of the function formulation

The function for total energy calculation is formulated as the sum of six weighted pseudo-energy terms:

\begin{array}{c}{E}_{total}={W}_{DIST}*{E}_{DIST}+{W}_{DABG}*{E}_{DABG}+{W}_{HBND}*{E}_{HBND}\\ +{W}_{PPDA}*{E}_{PPDA}+{W}_{OMDA}*{E}_{OMDA}+{W}_{SURR}*{E}_{SURR}\end{array}

where *E*_{
total
}is the total pseudo-energy, the "*w*" and "*E*" with subscripts on the right side of the equation are the weights and pseudo-energy calculation components, respectively. The subscripts of the terms on the right side correspond to the six respective types of pseudo-energy components, dealing with the distances between two C_{
α
}atoms (referred to as DIST), the relative orientation of the vector of C_{
α
}to pseudo-C_{
β
}atom coordinates between two residues (DABG) (Figure 1A), hydrogen bonds between the main-chain amino hydrogen and the main-chain carbonyl oxygen atoms (HBND) (Figure 1B), the *ψ*-*φ* dihedral angles (PPDA), the *ω* dihedral angle (OMDA), and the number of the surrounding C_{
α
}atoms (SURR). Each pseudo-energy component was calculated referring to a specifically precompiled database derived from the known protein structures. The formulation details are described below.

### Preparation of database from native protein structures

Databases for pseudo-energy calculation components, which evaluate individual structural features, were derived from 3,313 nonhomologous (less than 25% homology) protein structures with a resolution of better than 2.0 Å and *R*-factors of better than 0.25. The list of the proteins (compiled on June 23, 2007) was provided by PISCES server http://dunbrack.fccc.edu/PISCES.php[22]. Structural data with atomic coordinates were from the protein data bank (PDB) http://www.rcsb.org[23]. The coordinates of the three main-chain atom types of amino nitrogen (N), alpha carbon (C_{
α
}), and carbonyl carbon (C) of all 3,313 structures were used for database construction. The database was compiled for each of the pseudo-energy calculating components. All databases, except for SURR, were built considering the combination of the subject and object amino acid types. Thus, 400 sub-databases were generated for each component. The SURR database consists of 20 sub-databases for respective amino acid types of subject residues. The domains of parameter(s) were divided into uniform-sized bins. All measurements, which met the specific criteria below, were classified and counted in corresponding bins. For the DIST database, C_{
α
}pairwise distances were compiled with 64 bins in the range from 0.0 Å to 21.0 Å. Residue pairs within a certain distance in the amino acid sequence were not included. This pair inclusion range in the sequence was tuned by the procedures described below. The DABG database was derived using C_{
α
}atom coordinates and pseudo-C_{
β
}atom coordinates, which were generated based on the coordinates of N, C_{
α
}, and C atoms of the respective residues. Four parameters of the distance between the two C_{
α
}atoms, and the *α*, *β*, and *γ* angles were applied to represent unique relative orientation between two C_{
α
}-pseudo-C_{
β
}atom vectors of the residues (Figure 1A). The criteria for residue inclusion and distance parameter range were the same as DIST. The pair inclusion range in the sequence were also tuned as described below. The *α* and *β* angles (0° to 180°) were divided into 16 bins, and the *γ* angle (-180° to 180°) was divided into 32 bins. Data were compiled into four-dimensional sub-databases. The HBND database was derived using the coordinates of pseudo amino hydrogen atoms (H) and pseudo carbonyl oxygen atoms (O), which were generated using N, C_{
α
}, and C main-chain atom coordinates. (Note: The N-terminal pseudo-H and C-terminal pseudo-O of each fragment could not be generated and compiled because of the absence of their preceding and following residues, respectively.) The three parameters of the distance between pseudo-H and pseudo-O atoms, and the *η* and *θ* angles were calculated (Figure 1B). The distance, ranging from 1.7 Å to 2.9 Å, was divided into 4 bins. Either of the *η* and *θ* angles, ranging from 0° to 180°, was divided into 16 bins. Data were compiled into three-dimensional sub-databases. The PPDA database was derived using the *ψ* and *φ* main-chain dihedral angles of the peptide bond. We constructed a database dealing with the *ψ* angle of one residue and the *φ* angle of the next residue, which is different from the standard Ramachandran-type representation (i.e. *φ* and *ψ* angles for a single residue). Thus, 400 variations of sub-databases were generated for respective permutations of amino acid types of two adjacent residues. Either of the angles, ranging from -180° to 180°, was divided in 64 bins. Data were compiled into two-dimensional sub-databases. The OMDA database was derived using the *ω* dihedral angles of peptide bonds. Thus, the variation of sub-databases was 400, corresponding to the respective amino acid permutations. The *ω* angle, ranging from -180° to 180°, was divided in 128 bins. The SURR database is related to the degree of embedding of the residue in a molecule. The number of C_{
α
}atoms in the sphere with a certain radius from the central C_{
α
}atom coordinate of the focusing residue was counted as the surroundings. The radius was also tuned as described below. The count was classified into the respective bins of corresponding counts. Data were compiled into 20 sub-databases, associated with the respective amino acid types.

### Pseudo potential energy calculation

Each of the above six pseudo-energy calculation components was derived based on the Boltzmann law [5]. The pseudo-energy (*E*_{
s
}) for a state "*s*" was calculated with the following equation:

E(s)=-\mathrm{ln}(\frac{{N}_{obs}(s)}{{N}_{\mathrm{exp}}(s)})

where *N*_{
obs
}(*s*) and *N*_{
exp
}(*s*) represent the number of observed and expected counts for the state (*s*), respectively. Since the energy values were utilized as relative scores throughout the analyses, the factor of *kT* was not included in the formula with the assumption of constant temperature. Unless otherwise stated, counts in the corresponding bin of the database were used as *N*_{
obs
}(*s*). The expected count is also referred to as the reference, which was from the distribution without the interaction focused on for the component. The total pseudo-energy for each component (*E*_{
COMP
}) was calculated as:

{E}_{COMP}={\displaystyle \sum _{s}^{all}E(s)}

The criteria for inclusion in the energy summation were defined by individual components. The parameter set, bin size, and bin distribution were the same as the database construction conditions described above. The specific energy calculation conditions for each component were as described below. In the case of DIST component calculation, the measured distances between two C_{
α
}atoms of the respective residue pairs were used as specific states. In consideration of the finite size of proteins, the corresponding *N*_{
exp
}(*s*) was calculated as [16]:*N*_{exp}(*d*) = *a***d*^{2} *exp(-(*d*/*b*)^{c})* Δ*d*

where *d* is the central value of the corresponding distance bin, Δ*d* is the bin size, and *a*, *b*, and *c* are constants. Constant *a* was adjusted as follows: the sum of the counts (*A*_{
sum
}) in the database was calculated for bins ranging from 56^{th} to 64^{th} (i.e. the distance ranging from 18.05 Å to 21 Å). On the other hand, the integral of *N*_{
exp
}(*d*) (*A*_{
integ
}) for the same distance range was calculated assuming that *a* is 1. The value of (*A*_{
sum
}/*A*_{
integ
}) was then re-assigned to the *a* constant. The *b* and *c* constants were scanned and tuned with the procedure described below. The pairs, which have distances ranging from 0 Å to 15.75 Å, and a certain degree of sequence separation of |*i*-*j*| between *i*^{th} and *j*^{th} residues in the amino acid sequence, were subjected to energy calculation. The minimum limit of sequence separation was subjected to tuning. When the count of the bin in the database was 0, a penalty energy value was alternatively assigned. This was also tuned. For DABG energy calculation, the parameters of the distance, and *α*, *β*, and *γ* angles were calculated with pseudo-C_{
β
}, as described above in the database preparation (Figure 1A). *N*_{
obs
}was the average count of the bins in the database with the center at the position corresponding to the measured parameters of the distance, and *α*, *β*, and *γ* angles. Averaging with bins extending to certain position ranges for both sides was applied for each parameter. These ranges were also tuned. *N*_{
exp
}was calculated as follows: assuming that the C_{
α
}-pseudo-C_{
β
}vector orientation is random, the probability density function (*P*) for *α* or *β* parameters is:

P(\alpha \text{or}\beta )=\mathrm{sin}(\frac{(\alpha \text{or}\beta )}{180}*\pi )/2

The *P*(*γ*) is supposed to distribute uniformly along the *γ* angle; thus, *N*_{
exp
}(*d*, *α*, *β*, *γ*) is formulated as:

\begin{array}{l}{N}_{\mathrm{exp}}(d,\alpha ,\beta ,\gamma )=\\ {N}_{obs}(d)*\frac{\mathrm{sin}(\frac{\alpha}{180}*\pi )}{2}*\frac{\mathrm{sin}(\frac{\beta}{180}*\pi )}{2}*\frac{\Delta \gamma}{360}\end{array}

where *α* and *β* are the center of the corresponding bin, Δ*γ* is the bin size of *γ*. *N*_{
obs
}(*d*) of the DIST component was used as a substitute for *N*_{
exp
}(*d*). Residues with the same sequence separation as DIST were evaluated. The penalty value for the 0 of the average count was subjected to tuning. The method of HBND energy calculation was the simpler version employed by Kortemme et al. [8]. The function evaluated only the main-chain hydrogen bond between pseudo-H and pseudo-O atoms using three parameters (Figure 1B). As described for the DABG component, the probability density function under the random orientation of N-H or C-O vectors is represented as:

P(\eta \text{or}\theta )=\mathrm{sin}(\frac{(\eta \text{or}\theta )}{180}*\pi )/2

The expected distance distribution probability *P*_{
exp
}(*d*) is assumed to have a similar form to the finite ideal gas reference state [7] as:*P*(*d*) = *c***d*^{1.6}

where *d* is the center of each distance bin. Constant *c* was adjusted to make the sum of the probability for 4 total bins of possible conditions to 1. *N*_{
exp
}(*d*, *η* *θ*) for HBND component was thus expressed as:

\begin{array}{l}{N}_{\mathrm{exp}}(d,\eta ,\theta )=\\ {N}_{total}*c*{d}^{1.6}*\frac{\mathrm{sin}(\frac{\eta}{180}*\pi )}{2}*\frac{\mathrm{sin}(\frac{\theta}{180}*\pi )}{2}\end{array}

where *N*_{
total
}is the total observed count compiled in the corresponding pairwise sub-database. The penalty value was tuned. In the case of PPDA energy calculation, equal distribution on the *ψ*-*φ* plane was assumed for the expected probability. Thus, *N*_{
exp
}(*ψ*,*φ*) for PPDA component is:*N*_{exp} (*ψ*, *ϕ*) = *N*_{
total
}* Δ*ψ*/360* Δ*ϕ*/360

where *N*_{
total
}was the total observed count compiled in the corresponding pairwise sub-database, and Δ*ψ* and Δ*φ* were the bin sizes of the respective angles. The penalty value was tuned. OMDA energy was calculated similarly with equal distribution along the *ω* axis assumed for the expected probability. Thus, *N*_{
exp
}(*ω*) was:*N*_{exp}(*ω*) = *N*_{
total
}* Δ*ω*/360

where *N*_{
total
}was the total observed count, and Δ*ω* was the bin size of the *ω* angle. The penalty value was tuned. The concept of SURR energy was similar to the solvation potential by Jones [6]. The distribution of the observed count of surrounding C_{
α
}atoms was compiled in a procedure similar to the database construction for each amino acid type. The resultant database was standardized by each of the sub-databases, and then used as the expected count of *N*_{
exp
}(*n*) as:

{N}_{\mathrm{exp}}(n)={N}_{aa}*\frac{N(n)}{{N}_{tot}}

where *n* is the number of surrounding C_{
α
}atoms in a sphere, *N*_{
aa
}is the sum of the counts for a specific amino acid type over the surrounding numbers, *N*_{
tot
}is the sum of the counts of all residues of all structures over the surrounding numbers, and *N*(*n*) is the count of the specific number of surroundings (*n*) for all residues of all structures. The radius of the sphere and the penalty value were tuned.

### Decoy sets

The Decoys 'R' Us decoy database http://dd.compbio.washington.edu/download.shtml, which was compiled by Samudrala and Levitt [18], was used for parameter tuning and function evaluation. For the l30 target in the fisa_casp3 decoy set, 1ck2 from PDB was used as the native structure [16]. C_{
α
}RMSD of the l30 native structure in the original list (1.882 Å) was used as the native C_{
α
}RMSD. The moulder decoy set ftp://salilab.org/decoys/comp_models.tar.gz[15, 20] and the all atom decoy set from Rosetta@home http://depts.washington.edu/bakerpg/decoys/rosetta_decoys_62proteins.tgz (or "rosetta"), which were from the homepages of the Sali lab. and Baker lab., respectively, were also used. The rosetta set contains well-scoring Rosetta protein models and their native crystal structures for 59 proteins, without 3 NMR structures. We repacked 141 structures of one native PDB structure, the 20 refined native structures, the 100 lowest scoring models out of ~10,000 total models, and 20 random models, per protein into a single target entry for use.

### Determination of initial parameter values

The 7 decoy sets of 4state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, and lmds from the Decoys 'R' Us database [18] were used to search the initial parameter values. Several parameters were scanned at a time, and the best values of the parameters were determined successively. The procedure was repeated until all the parameters were scanned. Following are *the function components* and their associated parameters determined by the above procedure: *the DIST component*, the distance range for database construction and scoring, the sequence separation for database construction and scoring, the upper distance limit for scoring, the function form of *N*_{
exp
}, the values of *b* and *c*, the lower distance limit for determination of the *a* value, and the penalty value; *the DABG component*, the distance range for database construction and scoring, the sequence separation for database construction and scoring, the upper distance limit for scoring, the range of neighboring bins for averaging the counts, and the penalty value; *the HBND component*, the distance range for database construction and scoring, and the penalty value; *the PPDA and OMDA components*, the penalty values; *the SURR component*, the radius of the sphere for database construction and scoring. Bin sizes for all of the components were determined appropriately. Each time a new parameter value was applied, the weight parameters *w* for the respective energy components were scanned, and the performance was evaluated with the temporary optimized weight values. Because the total pseudo-energy was used as a relative index value (not as an absolute energy), *w*_{
DIST
}was fixed as 1 and the remaining 5 weights were scanned. The searching procedure for the weights was as follows: firstly, all of the combinations of discrete weight values, evenly spaced in a logarithmic scale (0.01 to 31.6, 15 steps), were evaluated, and the weight set with the best discrimination performance was selected. Then, another more precise cycle was carried out around the set of weight values determined by the previous scan (0.56- to 1.78-fold the previous weight value, 11 steps). The optimized weights for the function were 1, 0.316, 0.141, 0.200, 0.00562, and 0.178 for the components of DIST, DABG, HBND, PPDA, OMDA, and SURR, respectively.

### Function tuning by cross validation

The temporary optimized function by the previous procedure was further tuned through the cross-validation procedure. The 231 targets from the Decoys 'R' Us database, and the decoy sets of moulder and rosetta, were split into 154 (2/3 of total) for the training set and 77 (1/3 of total) for the test set according to their temporarily-assigned serial numbers. The targets were listed in the sequence of the decoy sets, and serial numbers of multiples of 3 were selected as the test set, and the rest of the targets were the training set. Thus, each decoy set was included in both training and test sets with a roughly equal ratio, without any intentional bias. The parameters were tuned by 10-fold cross validation. The above training set was divided into 10 segments, with new identification numbers in cyclic order. The function with temporally optimized weights was evaluated on one remaining target segment. The average C_{
α
}RMSD of top structures from 10 evaluations of all combinations of training and evaluation was set as the performance index. The weights of the function components, with a set of updated parameter values, were optimized for 9 target segments to the minimum C_{
α
}RMSD average. Weights were optimized by the following iterative cycles of scanning: the combination of discrete values, equally spaced in the logarithmic scale for each of the six weights, was scanned in a cycle. The 1^{st} cycle was 5 steps of 0.01 to 100. The following 4 cycles were repeated for the 8 best weight sets found by the 1^{st} cycle, which were separated by at least a 6-step distance. Each of the 2^{nd} to 5^{th} cycles evaluated the 3 steps of the parameters, i.e. *w*^{-r}, *w*, and *w*^{r}, where *w* was the weight value selected by the previous cycle, and *r* was the factor of the scanning range. If any one of the selected parameter values was not the previous one (i.e. the optimum was not at the center of scanning), the same cycle was repeated again with the selected parameter values. The *r* values of 2^{nd}, 3^{rd}, 4^{th}, and 5^{th} cycles were 3.16, 1.78, 1.33, and 1.15, respectively. The final weight values tuned by the procedure were 1.00, 0.662, 0.765, 0.0372, 1.02, and 43.0 for the components of DIST, DABG, HBND, PPDA, OMDA, and SURR, respectively. The tuned parameters, the order of the parameters successively scanned, and the initial, scanned, and final values are listed in Table 1.

### Performance measures

The performance measures and their definitions are as follows: *C*_{
α
}*RMSD*, the root mean square deviation of the C_{
α
}-C_{
α
}pairs between the native structure and the model with the best energy; *Z-score*, the score of the native structure, which was calculated under the standard definition [11] (Note: the positive value corresponds to the lower (better) energy than average.); *C.C*., Pearson's correlation coefficient among the structures including the native and the decoys; *F.E*., the fraction of the top 10% lowest C_{
α
}RMSD structures in the top 10% best-energy structures among the structures, including the native and the decoys; *R*_{
B1
}, the C_{
α
}RMSD rank of the best-energy structure among the decoy structures; *logP*_{
B1
}, the common logarithm of the probability of selecting the best decoy structure, where *P*_{
B1
}= *R*_{
B1
}/(number of decoy structures); *R*_{
B10
}, the lowest C_{
α
}RMSD rank in the 10 best-energy decoy structures among the decoys; *logP*_{
B10
}, the common logarithm of the probability of selecting the best decoy structure in the 10 best-energy decoy structures, where *P*_{
B10
}= *R*_{
B10
}/(number of decoy structures); *C.C*._{
decoy
}, the C.C. among the decoys; *F.E*._{
decoy
}, the F.E. among the decoys.