SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs

Background Protein tertiary structure prediction is a fundamental problem in computational biology and identifying the most native-like model from a set of predicted models is a key sub-problem. Consensus methods work well when the redundant models in the set are the most native-like, but fail when the most native-like model is unique. In contrast, structure-based methods score models independently and can be applied to model sets of any size and redundancy level. Additionally, structure-based methods have a variety of important applications including analogous fold recognition, refinement of sequence-structure alignments, and de novo prediction. The purpose of this work was to develop a structure-based model selection method based on predicted structural features that could be applied successfully to any set of models. Results Here we introduce SELECTpro, a novel structure-based model selection method derived from an energy function comprising physical, statistical, and predicted structural terms. Novel and unique energy terms include predicted secondary structure, predicted solvent accessibility, predicted contact map, β-strand pairing, and side-chain hydrogen bonding. SELECTpro participated in the new model quality assessment (QA) category in CASP7, submitting predictions for all 95 targets and achieved top results. The average difference in GDT-TS between models ranked first by SELECTpro and the most native-like model was 5.07. This GDT-TS difference was less than 1% of the GDT-TS of the most native-like model for 18 targets, and less than 10% for 66 targets. SELECTpro also ranked the single most native-like first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets, more often than any other method. Because the ranking metric is skewed by model redundancy and ignores poor models with a better ranking than the most native-like model, the BLUNDER metric is introduced to overcome these limitations. 
SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein, where it outperforms the benchmarked method (I-TASSER). Conclusion SELECTpro is an effective model selection method that scores models independently and is appropriate for use on any model set. SELECTpro is available for download as a stand-alone application at: . SELECTpro is also available as a public server at the same site.


Background
Selecting the most native-like model from a set of possible models is a crucial task in protein structure prediction. A variety of Model Quality Assessment Programs (MQAPs) have been developed that assign numeric scores to models in a set, and then use the scores to rank the models and ultimately select a single model. MQAP methods can be divided roughly into three categories based on the type of information they use: evolutionary methods use sequence or profile similarity between target sequence and template, consensus methods use similarity between models, and structure-based methods use model coordinates [1]. Each category of methods has inherent strengths and weaknesses.
Evolutionary methods can provide quality scores that have been shown to correlate with structural similarity to native [2]. However, for lower confidence alignments the scores do not correlate well with structural similarity. Furthermore, identification of the best template and specific alignment can be difficult. In addition, models built from multiple templates or template-free methods cannot be scored appropriately by evolutionary methods alone.
Consensus methods take advantage of the observation that similar models produced by different predictors tend to be more accurate than those that are structural outliers. In practice, consensus methods outperform the methods they draw from, and they rarely pick a very poor model. The disadvantage, however, is that when the best model is a structural outlier it will be overlooked for lack of popularity [1]. Also, consensus methods are not appropriate for selecting from small sets of structurally diverse models, especially in the extreme case of a two-model set.
While consensus methods depend on similarity between models, structure-based methods calculate scores on each model independently. For this reason, structure-based methods can be applied to model sets of any size and diversity, and will produce the same score for a model regardless of the other models in the set. Structure-based methods can also be used for template-free modeling [3][4][5][6] and model refinement procedures [7,8]. One weakness of high resolution structure-based methods, including protein free energy approximation functions [9][10][11][12] and physics based approaches [13,14], is their sensitivity to local structural irregularities such as steric clashes and chain breaks, which can significantly bias scores on otherwise accurate models. Even slight differences in model backbones can produce significantly different scores [15]. Lower resolution structure-based methods, such as statistical potentials [6,16,17], are more robust to backbone variation, but are sensitive to extended low contact-order regions in the models.
Here we describe SELECTpro, a novel structure-based MQAP that combines high and low resolution energy terms into a model selection method that is effective on model sets of variable size, diversity, and target difficulty. Most of our assessment is calculated from the CASP7 model quality assessment category (QA) results published online [18]. The QA category provides a framework for the unbiased evaluation of MQAPs on ensembles of models produced by diverse automated prediction methods.

Results and discussion
We analyze the CASP7 quality assessment category predictions with a focus on the quality of the model ranked first by each predictor and the recovery of the most native-like model in the set. Only SetAll is used in the assessment of the quality of the model ranked first by each group (Table 1). The results are very similar when using SetComplete (data not shown) because QA groups rarely rank an incomplete model first.
The assessment of the recovery of the most native-like model is performed on both SetAll and SetComplete (Table 2) because the few cases where an incomplete model is the most native-like have a significant effect on the average recovery metrics of all QA groups. Incomplete and irregular models are especially challenging for structure-based methods. A comparison of the average Pearson Correlation on SetAll and SetComplete highlights these issues (Table 3). The frequency of recovering the most native-like model is calculated on SetComplete (Figure 1).
The utility of SELECTpro for selecting the best model from a small set is demonstrated by selecting from the five models submitted for each target by the top automated predictors. These small set selection results are calculated using SetAll (Figure 2). SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein and compared to I-TASSER (Figure 3).
To make fair comparisons to groups participating on only a subset of targets, common subset comparisons between SELECTpro and each of these groups are included in Tables 1 and 2. Only groups participating on at least half of the targets are included, and for groups with multiple submissions only the best one is shown. In the results tables any value that is better than SELECTpro is underlined.
For multiple domain targets, the sum of GDT-TS over all domains is used as the GDT-TS of the model. Since the QA predictions correspond to the entire structures, it is impossible to fairly assess the domains independently.
Table notes: In the CASP7 submission SELECTpro did not have a score for M max of target T0356 due to a processing error. We added in the score for this analysis in order to make complete common subset comparisons. SELECTpro (699_1) results appear in bold face and all results that are better than SELECTpro are underlined. Statistically significant p-values (p < .05) are also in bold.
To assess the significance of the summary statistics compared in Table 1, Table 2, and Figure 2, we performed paired t-tests between SELECTpro and each other group on common subsets of targets (or targets and models when appropriate). All p-values from the tests appear in the tables and figure, but only statistically significant p-values (p < .05) are shown in bold.
Figure 3. Large decoy set model selection with SELECTpro on the I-TASSER benchmark set. This set of 16 small proteins was used as one of the benchmark sets for evaluating the I-TASSER method [19]. The complete decoy sets can be downloaded from [20]. Each protein has from 12500 to 20000 decoy models. For each protein different symbols are used to indicate the GDT-TS of M max (□), SELECTpro's M QA1 (×), and I-TASSER's M QA1 (+).
The following notations are used throughout the results section:
• M max : The model with the highest GDT-TS among all server models.
• M QA1 : The model with the highest QA score.
• N T : The number of targets a group made valid predictions on.
• N D : The number of domains a group made valid predictions on.
The recovery of M max by a QA predictor can only be evaluated if M max was scored by the predictor. In most cases QA predictors did not provide scores for all available server models, and frequently there is no score for M max . For example, predictor 016_1 (AMBER/PB) made submissions on 86 targets, but M max is only scored for 53 of these targets, so only these targets (N T = 53) can be evaluated for this predictor.

Quality of Model Ranked First (M QA1 ) Relative to Most Native-Like Model (M max )
In this section on the assessment of the model ranked first, and the corresponding Table 1, we use the following three metrics:
• ΔGDT QA1 = GDT-TS(M max ) - GDT-TS(M QA1 ) : The GDT-TS difference between M max and M QA1 measures how much is lost by selecting M QA1 rather than M max for a single target.
• Average ΔGDT QA1 = ΣΔGDT QA1 /N D : The average ΔGDT QA1 is a simple way of assessing the quality of M QA1 over all targets.
• ΔGDT QA1% = ΔGDT QA1 /GDT-TS(M max ) : The GDT-TS difference percentage allows for comparison across targets with different numbers of domains and difficulty levels.
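As an illustration, the three metrics above can be computed as in the following minimal sketch. The function names and the dictionary inputs (model name mapped to GDT-TS or to QA score) are assumptions made for this example and are not part of SELECTpro. ΔGDT QA1% is expressed here as a percentage, and the model names and values are hypothetical.

```python
def delta_gdt_qa1(gdt_ts, qa_score):
    """Return (delta, delta_pct) for one target.

    gdt_ts:   dict, model name -> GDT-TS (all server models)
    qa_score: dict, model name -> QA score (models scored by the predictor)
    """
    m_max = max(gdt_ts, key=gdt_ts.get)       # most native-like model
    m_qa1 = max(qa_score, key=qa_score.get)   # model ranked first by the predictor
    delta = gdt_ts[m_max] - gdt_ts[m_qa1]
    return delta, 100.0 * delta / gdt_ts[m_max]


def average_delta_gdt_qa1(deltas, n_domains):
    """Sum of the per-target differences divided by N_D, as in the text."""
    return sum(deltas) / n_domains


# Hypothetical single-domain target with three server models.
gdt = {"srv1_TS1": 60.0, "srv2_TS1": 50.0, "srv3_TS1": 55.0}
qa = {"srv1_TS1": 0.40, "srv2_TS1": 0.90, "srv3_TS1": 0.50}
delta, delta_pct = delta_gdt_qa1(gdt, qa)
print(delta, round(delta_pct, 2))
```

In this toy case the predictor ranks srv2_TS1 first, so the loss relative to the best model is the GDT-TS gap between srv1_TS1 and srv2_TS1.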
The columns of Table 1 are: (1) group number; (2) number of targets the group made predictions on; (3) number of targets such that ΔGDT QA1 = 0; (4) number of targets such that ΔGDT QA1% < 1%; (5) number of targets such that ΔGDT QA1% < 10%; and (6) average ΔGDT QA1 . The common subset results section has an additional column for the p-value of the paired t-test using ΔGDT QA1 . The rows are sorted first by the number of targets and then by average ΔGDT QA1 . Of the groups participating on all 95 targets, SELECTpro has the lowest average ΔGDT QA1 , with a value of 5.07, followed closely by group 713_1 (Circle-QA), with a value of 5.44. Predictor 038_1 (GeneSilico) has an average ΔGDT QA1 of 5.75, with predictions on 85 targets. In common subset comparisons with these two groups SELECTpro is not significantly better, with p-values of .25 and .12 respectively. In common subset comparisons with all remaining groups SELECTpro is significantly better.
Another way to assess the quality of M QA1 over many targets is to count the number of targets such that M QA1 is the best model, or nearly the best, in the set. A method that performs very well on most targets, but very poorly on a few, would still be recognized by this criterion. SELECTpro recovers the best model for 12 targets, selects a model with ΔGDT QA1% < 1% for 18 targets, and selects a model with ΔGDT QA1% < 10% for 66 targets. Group 091_1 (Ma-OPUS) also performs well, with 11, 18, and 61 targets in the respective categories. Only the 60 targets with ΔGDT QA1% < 10% of predictor 038_1 (GeneSilico) on its 85 target subset are better than SELECTpro in common subset comparison (58 for SELECTpro).
The BLUNDER Measure
Recovery of M max
How well does a QA predictor recover M max ? The traditional metric to assess M max recovery is the rank of M max , and its average over many targets (average rank).
While rank captures some important information, it ignores the redundancy of models and the quality of models ranked better than M max . Consider the following hypothetical situation: group A ranks M max 10 th and all nine models ranked above it are redundant with ΔGDT of 2.0, while group B ranks M max 5 th and the four models ranked above it are diverse with a ΔGDT between 10.0 and 20.0. Which group has done a better job of recovering M max ? In this example, the rank metric favors group B, although group A effectively ranks only a single redundant model above M max . In addition, the models ranked better than M max by group A have only slightly lower GDT-TS than M max , while the models ranked better than M max by group B are significantly worse than M max . To address these weaknesses of the rank metric, we introduce the BLUNDER metric, which focuses on the worst model ranked better than M max (the most embarrassing blunder). This measure is not affected by model redundancy and measures the quality of models ranked above M max . The BLUNDER metric is defined using the following notation, and used in the assessment of the recovery of M max and the corresponding Table 2 and Figure 1:
Table 2. The results columns are (1) average rank and (2) average ΔGDT BLUNDER on SetAll and SetComplete. The common subset results section also includes a column for the p-value of a paired t-test using ΔGDT BLUNDER (p-value). Rows are sorted separately for each dataset by N T first and then by average ΔGDT BLUNDER . On SetComplete SELECTpro has an average ΔGDT BLUNDER of 10.4. In common subset comparisons one group has a lower average rank: group 091_1 (Ma-OPUS), with an average rank of 16.8 on 94 targets compared to 17.4 for SELECTpro. On SetAll SELECTpro did not submit a score for M max of target T0356 (HHpred2_TS1) due to a processing error. In order to make complete common subset comparisons when possible we added in the SELECTpro score for HHpred2_TS1. SELECTpro ranks it 86 th and ΔGDT BLUNDER = 50.0. Both results are significantly worse than the SELECTpro averages.
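The contrast between the rank metric and the BLUNDER metric in the hypothetical group A and group B example can be sketched as follows. This is only an illustration: it assumes that ΔGDT BLUNDER is the GDT-TS difference between M max and the worst model ranked better than M max (taken as zero when M max is ranked first), which is how the metric is described in words above. The function name and the data are hypothetical.

```python
def rank_and_blunder(gdt_ts, qa_score):
    """Return (rank of M_max, assumed dGDT_BLUNDER) for one target.

    Returns None when M_max was not scored by the predictor, since
    recovery cannot be evaluated in that case.
    """
    m_max = max(gdt_ts, key=gdt_ts.get)
    if m_max not in qa_score:
        return None
    # Models the predictor scored higher than M_max.
    above = [m for m in qa_score if qa_score[m] > qa_score[m_max]]
    rank = len(above) + 1
    # Worst (lowest GDT-TS) model ranked better than M_max.
    worst = min((gdt_ts[m] for m in above), default=gdt_ts[m_max])
    return rank, gdt_ts[m_max] - worst


# Group A: nine redundant models (dGDT = 2.0) ranked above M_max.
gdt_a = {"best": 70.0, **{f"dup{i}": 68.0 for i in range(9)}}
qa_a = {"best": 0.50, **{f"dup{i}": 0.60 + 0.01 * i for i in range(9)}}

# Group B: four diverse models (dGDT from 10.0 to 20.0) ranked above M_max.
gdt_b = {"best": 70.0, "w1": 60.0, "w2": 55.0, "w3": 52.0, "w4": 50.0}
qa_b = {"best": 0.50, "w1": 0.90, "w2": 0.80, "w3": 0.70, "w4": 0.60}

result_a = rank_and_blunder(gdt_a, qa_a)
result_b = rank_and_blunder(gdt_b, qa_b)
print(result_a, result_b)
```

Under this assumption group B has the better rank while group A has the smaller blunder, which is the distinction the metric is meant to capture.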

Pearson Correlation for Individual Proteins
The assessor evaluation of the quality assessment category [18] focused on the Pearson Correlation between the QA scores and GDT-TS. Here we use the Pearson Correlation only to highlight some of the difficulties for structure-based methods in dealing with incomplete models, as well as basic non-protein-like structural features. Approximately half of the models in SetAll are incomplete, with backbone coordinates missing for one or more residues.
Incomplete models present a challenge to SELECTpro and other structure-based methods because the scores for each model are only comparable when calculated on coordinates for the same set of residues. Another issue is that some complete models have severe chain-breaks, severe steric clashes, or significant portions modeled only as extended chains. These local problems can overwhelm the energy of what may otherwise be a good model. Consensus methods do not suffer from these local structure problems. Given this rationale, one would expect structure-based methods to see the most improvement in terms of average Pearson Correlation on SetComplete relative to SetAll. Table 3 shows the average Pearson Correlation of five selected groups. Predictors 713_1 (Circle-QA), 633_1 (ProQ), and SELECTpro are structure-based MQAPs, while 634_1 (Pcons) is a consensus method and 556_1 (LEE) scored structures based on the GDT-TS similarity to their human Model 1 CASP7 prediction [18]. As expected, the structure-based MQAPs improve more than the structural similarity-based methods. The even greater increase in Pearson Correlation for SELECTpro can be accounted for by the failure to generate appropriate complete models for some of the incomplete models, resulting in QA scores calculated on extended chains.
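The sensitivity of the Pearson Correlation to a few badly scored models can be illustrated with a small sketch. The data below are hypothetical and are not CASP7 values: five models whose scores track GDT-TS, plus one model with a high GDT-TS that receives a very low score, as would happen if it were scored as an extended chain.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical complete models: QA score rises with GDT-TS.
gdt_complete = [30.0, 40.0, 50.0, 60.0, 70.0]
score_complete = [0.30, 0.40, 0.50, 0.60, 0.70]

# Same set plus one good model that was scored on an extended chain.
gdt_all = gdt_complete + [80.0]
score_all = score_complete + [0.05]

r_complete = pearson(gdt_complete, score_complete)
r_all = pearson(gdt_all, score_all)
print(r_complete, r_all)
```

A single outlier of this kind is enough to pull the correlation far below the value obtained on the complete models alone.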

Reranking Top Server Group Models
Predictors in CASP may submit up to five models, but CASP evaluation focuses on the model designated as Model 1. Clearly, the selection of Model 1 is critical in the CASP setting and for protein structure prediction in general. Figure 2 contains the results when SELECTpro is used to rerank the five models submitted by each of the top ten servers from CASP7, compared to each server's results. In the following assessment M max-g is the model with the highest GDT-TS of the five models submitted by a server. Figure 2(A) shows that SELECTpro recovers M max-g more frequently than 8 of the top 10 server groups; in addition, when SELECTpro is used to select Model 1 the average GDT-TS increases for 7 of 10 server groups; however, the increase is only statistically significant for 3 groups. SELECTpro improves using both criteria for the top 3 server groups (Zhang-Server, Pmodeller6, and ROBETTA). These results highlight the utility of SELECTpro for the task of model selection. The comparisons made here are fair because structure-based methods can be applied in the server setting to any number of models.

Large Decoy Set Model Selection
Here we analyze SELECTpro's model selection capability on the large decoy sets for 16 small proteins from a recent I-TASSER benchmark set [19]. The I-TASSER prediction method generates 12500 to 20000 different backbone conformations. The complete decoy sets can be downloaded from [20]. The consensus method SPICKER [21] is used to cluster the models and a centroid model is built from the first cluster. A paired t-test of the hypothesis that SELECTpro and I-TASSER's mean performance are equal produces a p-value of .19, which is not statistically significant, but does give some evidence that SELECTpro can select a very good model from a large set of decoys at least as well as an established method that utilizes consensus methods.
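For readers unfamiliar with the test, the paired t statistic used in comparisons of this kind can be sketched as below. The function is a generic textbook formulation, not the authors' analysis code, and the GDT-TS values are hypothetical. The two-sided p-value would then be obtained from a Student t distribution with the returned degrees of freedom, for example from a statistics library.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the per-target differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1


# Hypothetical GDT-TS of the selected model on four proteins.
method_a = [61.0, 55.0, 70.0, 48.0]
method_b = [60.0, 54.0, 68.0, 46.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
print(t_stat, dof)
```

Pairing by target removes the large between-target variation in difficulty, which is why the comparisons are made on common subsets of targets.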

Conclusion
A MQAP that can select the most native-like model from a set of possibilities has a variety of applications in protein structure prediction. The new quality assessment category introduced in CASP7 allows for the unbiased assessment of MQAPs on the models produced by automated predictors. This category allows researchers to focus on the model scoring aspect of protein structure prediction.
The results presented in this work demonstrate that SELECTpro, a structure-based model selection method, consistently selects one of the best models from the large diverse sets of models produced by automated predictors, across all levels of target difficulty. On these large diverse sets of models, SELECTpro also recovers the single most native-like model well compared to other methods. On the small sets of five models submitted for each target by the top automated predictors, in most cases SELECTpro selects better models than the predictors themselves.
Since SELECTpro and other structure-based methods score models independently, they can be incorporated into the model selection pipelines of individual protein structure prediction servers. For this reason, it may help predictors if the CASP organizers distinguished methods that score models independently from those that do not.
Consensus and structure-based methods can be combined to achieve improved results. For example, the metaserver method Pmodeller [22] combines consensus (Pcons [23]) and structure-based methods (ProQ [24]) to predict protein structures more accurately than either method in isolation. The assessment of the QA category by CASP assessors recognized the consensus method Pcons (group 634_1) for the high Pearson Correlation between their scores and model GDT-TS on most targets [18]. In their own assessment the authors of Pcons recognized that while consensus methods perform well in most cases, "when most of the models are incorrect and the few correct models are outliers a consensus based approach cannot be expected to make an optimal choice." [1] For instance, they identified three particular targets in CASP7 where their consensus method failed: T0283, T0350, and T0351 [1]. The Pcons average ΔGDT QA1 on these three targets is 30.8. The same research group's structure-based method ProQ (group 633_1) has an average ΔGDT QA1 of 17.2. In contrast, on these three targets SELECTpro has an average ΔGDT QA1 of only 7.1. This example highlights the potential of combining SELECTpro with existing model selection methods.
SELECTpro has been made publicly available as a server, where users may submit from 2 to 100 models for evaluation. In addition to the global confidence scores, the scores of individual energy terms are also returned to the user by email for each model submitted. SELECTpro is one of several protein structure tools in the SCRATCH suite of predictors [25], and is available through: http://www.igb.uci.edu/~baldig/selectpro.html.

Datasets
All of the comparative analysis in this work is performed on the server models and quality assessment predictions submitted in the CASP7 [26] experiment. The CASP QA experiment is particularly relevant for the evaluation of model selection methods for several reasons: (1) the QA predictors were blind to the true structures at the time of prediction making it impossible for methods to be tuned to improve results; (2) the set of proteins is diverse: the 95 targets range in size from 68 to 530 amino acids, come from a variety of organisms, and span the full range of prediction difficulty; (3) each target has more than 200 predicted models that contain the types of errors that occur in automated structure prediction; (4) the protein set is not selected by any of the participating QA groups; (5) the models are scored by a variety of methods and the results are publicly available. We perform analysis on the set of all models (SetAll) and a subset of models (SetComplete) that are complete and free of gross structural irregularities, as described below. All of the ABIpro models and some of the 3Dpro models were optimized using the exact energy function of SELECTpro. These models are removed because of the obvious bias towards these models. In recent CASP experiments the GDT-TS [27] has been used as the primary automatic structural similarity measure. The published GDT-TS values from the CASP7 website are the only structural similarity measure used in this work.

SetAll
The SetAll dataset consists of the server models with a GDT-TS value published on the CASP7 website, a total of 23,423 models. To calculate a score on a protein model SELECTpro requires the backbone coordinates (N, C α , C) for all model residues as input. A total of 8,812 models in SetAll have only a C α trace or have no coordinates for one or more residues. Modeller8v1 [28][29][30] was used to generate complete models from the incomplete ones, and then the complete models were scored by SELECTpro. In most cases the complete models were built appropriately from the incomplete models; however, in some cases the final model was a fully extended chain due to an error in our application of Modeller. We failed to identify this problem until after the completion of the CASP7 competition. The SELECTpro scores versus GDT-TS scores for all models of target T0305 are displayed in plot A of Figure 4. The circled outliers with very low confidence scores and high GDT-TS scores are models that were incomplete and the complete models generated by Modeller were fully extended chains. The Pearson correlation on the set of all models for T0305 is .641. The SELECTpro scores versus GDT-TS scores for complete models only are displayed in plot B of Figure 4, and the Pearson correlation is .966.

SetComplete
The scores produced by SELECTpro are comparable on complete models of the same sequence. There is no standard for the handling of incomplete models and we assume that participating groups took a variety of approaches. Using only complete models ensures that the MQAP scores are calculated from the same coordinates. Thus, the models retained in SetComplete are screened first for completeness. Models missing backbone coordinates for one or more residues are removed. This leaves 14,611 models.
Structure-based MQAPs are susceptible to local structural irregularities in models, and will tend to score such models poorly. This is why methods developed to select near-native models from sets of decoys remove such models from consideration [31]. We apply additional filters (described below) for C α -C α clashes, C α -C α chain breaks, and expanded termini to remove an additional 1,217 models leaving 13,494 more plausible models in SetComplete.
The C α -C α clash model filter enforces a squared difference penalty for C α -C α distances less than 3.6 Å. The distance between the C α atoms of residue i and j is denoted by r i,Cα,j,Cα and N is the protein length. The constant 13.52 in the threshold below corresponds to two severe clashes where r i,Cα,j,Cα = 1.0 Å, since 2 × (3.6 - 1.0)^2 = 13.52. Models with a sum of squared differences greater than 13.52 per 100 residues are filtered out:

\[ \frac{100}{N} \sum_{i<j,\; r_{i,C\alpha,j,C\alpha} < 3.6} \left( 3.6 - r_{i,C\alpha,j,C\alpha} \right)^2 > 13.52 \]

The expanded termini filter removes models where a large portion of the structure is modeled as expanded chain with no non-local interactions. The screening procedure is: scan from the N-terminus until three consecutive residues have a contact number of at least 10, and repeat from the C-terminus. The contact number of a residue is defined here as the number of other C β atoms within 10 Å of the residue's C β [3]. If the sum of low contact number termini residues is at least 20% of N, the model is filtered out.
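The clash and expanded termini filters can be sketched as follows. This is an illustrative reading of the description above, not the SELECTpro implementation: the function names are hypothetical, each clashing pair is assumed to be counted once, and the "low contact number termini residues" are assumed to be the residues scanned before the first window of three consecutive residues with contact number of at least 10.

```python
import math


def ca_clash_filtered(ca_coords, min_dist=3.6, threshold=13.52):
    """True if the squared clash penalty per 100 residues exceeds the threshold."""
    n = len(ca_coords)
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):          # each pair counted once (assumption)
            r = math.dist(ca_coords[i], ca_coords[j])
            if r < min_dist:
                penalty += (min_dist - r) ** 2
    return penalty * 100.0 / n > threshold


def contact_numbers(cb_coords, cutoff=10.0):
    """Number of other C-beta atoms within `cutoff` of each residue's C-beta."""
    return [
        sum(1 for j, q in enumerate(cb_coords) if j != i and math.dist(p, q) < cutoff)
        for i, p in enumerate(cb_coords)
    ]


def low_contact_run(contacts, min_contacts=10, window=3):
    """Residues scanned before `window` consecutive well-packed residues are found."""
    for start in range(len(contacts) - window + 1):
        if all(c >= min_contacts for c in contacts[start:start + window]):
            return start
    return len(contacts)


def expanded_termini_filtered(contacts, fraction=0.20):
    """True if low-contact residues at both termini make up at least 20% of N."""
    n = len(contacts)
    low = low_contact_run(contacts) + low_contact_run(contacts[::-1])
    return min(low, n) >= fraction * n


# Hypothetical 100-residue straight chain with 3.8 A spacing: no clashes.
chain = [(3.8 * i, 0.0, 0.0) for i in range(100)]
clashed = list(chain)
clashed[50] = (0.0, 0.9, 0.0)    # 0.9 A from residue 0
clashed[60] = (38.0, 0.9, 0.0)   # 0.9 A from residue 10
print(ca_clash_filtered(chain), ca_clash_filtered(clashed))
```

With contact numbers computed by `contact_numbers`, a model whose first and last 15 residues of 100 are poorly packed would be removed by `expanded_termini_filtered`, while one with only 5 such residues at each end would be kept.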

Reduced representation
In the reduced representation the heavy backbone atoms, carbonyl oxygen, amide hydrogen (N, C α , C, O, H), and C β are represented explicitly. For glycine residues a pseudo C β is calculated. The side-chain atoms are represented by a single united point (centroid) [32,33]. The centroid is calculated as the mean of the position of the heavy side-chain atoms. For glycine and alanine the centroid (CT) is set to the C β atom. Only the heavy backbone atoms (N, C α , C) are used as input to SELECTpro and the positions of additional atoms and centroids are calculated from these.
Figure 4. SetAll versus SetComplete. Plots of SELECTpro scores versus GDT-TS scores for T0305 models from SetAll (A) and SetComplete (B). The Pearson correlation is .641 for SetAll and .996 for SetComplete. This large difference is mainly due to the extended chain models (circled in plot A) scored by SELECTpro due to an error in our use of Modeller to generate complete models from incomplete ones.

All heavy-atom representation
In the all heavy-atom representation the centroid is removed and the heavy side chain atoms are represented explicitly. The side-chains are initially placed onto the backbone of the reduced representation in their most likely conformation according to the SCWRL backbone-dependent rotamer library [34]. The side-chain placements are then optimized using the SELECTpro all-atom energy terms (described below) in conjunction with the rotamer library.

Energy Functions Overview
E ALL-ATOM consists of the energy terms that depend on the all heavy-atom representation. E ALL-ATOM is a linear combination of the physical terms listed below. E FINAL is the sum of E REDUCED and E ALL-ATOM , and is used for the final scoring of models by SELECTpro. The individual energy terms are outlined briefly below and the detailed descriptions of the novel terms follow in the remainder of this section. Underlined terms are adapted from previously described energy terms and their details are included in the Appendix.
E REDUCED
E STAT-PW-CI : context independent pair-wise potential [3,16]
E STAT-PW-CD : context dependent pair-wise potential [6]
E ROG : compactness
E ALL-ATOM
E SC-HB : side-chain hydrogen bonding
E LEN-JONES : van der Waals forces [10]
E SOLVATION : solvation effects [35]
E ELECTRO : electrostatic interactions
Throughout this work the convention of all capital letters referring to global energy and all lower case referring to local energy is used. For instance, E PRED-CM refers to the global contact map energy and E pred-cm (i,j) refers to the contact map energy between residues i and j.

Parameter Weights
The parameter weights were determined by repeatedly varying individual weights and maximizing the sum of the GDT-TS of the lowest E FINAL models on a training set built from CASP6 protein domains. For each CASP6 protein domain a set of 500 decoy models was generated using fragment assembly with the RMSD to native as the dominant term in the objective function [3].

Parameter notation used in energy equations
Model variables
r i,x,j,y : distance between atom x of residue i and atom y of residue j
r x,y : distance between atom x and atom y
v i,x,j,y : vector from atom x of residue i to atom y of residue j
u i,x,j,y : unit vector calculated from v i,x,j,y
N i : number of residues in contact with residue i, with contact defined as r i,Cβ,j,Cβ < 10 Å
acc i : predicted solvent accessibility of residue i ('e': exposed, '-': buried)
cmap i,j : predicted contact/non-contact between residues i and j, with contact defined as r i,Cα,j,Cα < 12 Å

Reduced Representation Energy Term Details
The details of how the novel reduced representation energy terms are calculated are presented in this section. The predicted structural terms E PRED-SS , E PRED-ACC , and E PRED-CM and the β-strand pairing term, E BETA , are novel and unique to SELECTpro. Additional reduced representation terms are adapted from previously published work and their details are included in the Appendix.

Predicted structural features overview
The structural feature predictions used in E PRED-SS , E PRED-ACC , and E PRED-CM come from the SCRATCH suite of predictors [25]. Each predictor is trained in a supervised fashion using curated non-redundant datasets extracted from the PDB [37]. The secondary structure (SSpro [38]) and solvent accessibility (ACCpro [39]) predictors use ensembles of 1D-RNN (one dimensional-recursive neural network) architectures [40]. The contact map predictor (CMAPpro [41]) uses ensembles of 2D-RNN architectures [40].
E PRED-SS : predicted secondary structure
The predicted secondary structure term E PRED-SS penalizes deviation of the torsion angles from the torsion angle parameters for helices and strands predicted by SSpro. There is no penalty for predicted coils. The parameter values for helix residues are: I Hφ = -65.3, σ Hφ = 11.9, I Hψ = -39.4, σ Hψ = 11.3. The parameter values for strand residues are: I Eφ = -135.0, σ Eφ = 15.0, I Eψ = 135.0, σ Eψ = 15.0. Only torsion angles that are more than two standard deviations from the ideal are penalized, with the penalty defined as follows:
The definition of E pred-strand (j) is equivalent to E pred-helix (i), but with I Eφ , σ Eφ , I Eψ and σ Eψ in place of the corresponding helical values.
E PRED-ACC : predicted solvent accessibility
The solvent accessibility predictor ACCpro predicts the percent of solvent accessibility in 5% increments for each residue. Using 25% exposure as a binary threshold the accuracy of the predictor is ~77% [39]. The binary exposure ('e')/burial ('-') prediction is used as the predicted solvent accessibility for E PRED-ACC . In the reduced representation the solvent accessibility of residue i is estimated by its contact number (N i ), where N i > 16 is considered buried [3]. If the predicted status of a residue is not realized in the model, the penalty is calculated as:
E PRED-CM : predicted contact map
The contact map predictor CMAPpro predicts the probability of contact or non-contact between C α atoms, with a contact threshold of 12 Å. The strategy utilized to infer predicted contacts from the probability matrix [41] results in maps that are sparse when compared to those of real proteins; thus, unrealized contacts are penalized while non-contacts are not. The constant 1.0 is added to the penalty to ensure that all unrealized contacts make a significant contribution to E PRED-CM . The predicted contact map can help identify the highest GDT-TS models in the set, even when they are not highly similar to native. A good example of this is CASP7 target T0304, a 122-residue α/β protein where the highest GDT-TS model in the set is Zhang-Server_TS1 (GDT-TS = 45.55). Most secondary structure predictors (including SSpro) failed to predict the first two strands, making this target especially difficult. No QA method ranked the highest GDT-TS model first; however, SELECTpro ranked it second and the model ranked first by SELECTpro (T0304.Zhang-Server_TS4) has the second highest GDT-TS. These models have the lowest E PRED-CM of any models in the set, but the native structure has an even lower E PRED-CM . Figure 5 compares the native and predicted contact maps for target T0304.
E BETA : strand pairing The formation of hydrogen bonds between the residues of β-strand partners is a major determinant of the tertiary structure of β and α/β proteins. The β hydrogen bonding treatment described here favors realistic strand pairing and sheet formation. The treatment also efficiently accommodates bulges in strands because it does not force the register between two paired strands. E BETA is the global strand pairing energy that penalizes the hydrogen bonding of β residues between strand pairs. E beta-sp is the strand pairing energy of strand b k to strand b w . E beta-sp is only commutative if the two strands have the same length. E beta-hb (i,j) is the hydrogen bonding penalty between residues i and j.
E beta-sp is calculated for all possible strand pairings, but only the two lowest energies for each strand are used in E BETA . Other strand-strand interactions are ignored. In the equations below S is the set of all strands in the protein, b m1 is the strand with the minimum pairing energy from b k , and b m2 is the strand with the next lowest pairing energy from b k . If the strand count is less than six, at least two of the strands must be edge strands. This is accounted for by considering only the single best strand partner for two strands.
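The selection of b m1 and b m2 can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the function name and the dictionary layout are hypothetical, the pairing energies are taken as precomputed inputs keyed by ordered pairs (because E beta-sp is not commutative in general), and the special handling of proteins with fewer than six strands is omitted because the text does not specify which two strands are restricted to a single partner.

```python
def two_best_partners(pair_energy, strands):
    """For each strand k, return its two lowest-energy partners.

    pair_energy -- dict mapping the ordered pair (k, w) to the pairing energy
                   of strand k to strand w (hypothetical precomputed input)
    strands     -- list of strand identifiers
    Returns a dict: k -> [(partner, energy), ...], lowest energy first.
    """
    best = {}
    for k in strands:
        # Rank every other strand by its pairing energy from k.
        ranked = sorted((pair_energy[(k, w)], w) for w in strands if w != k)
        best[k] = [(w, e) for e, w in ranked[:2]]
    return best
```

All other strand-strand interactions are simply left out of the result, matching the statement that they are ignored.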
In the equations for E beta-sp below, S k is the set of all residues in strand b k . Each time E beta-hb is calculated, the pair (i,j) is chosen with i from S k and j from S w such that E beta-hb is minimized. Residue i is then removed from S k and residue j from S w . E beta-hb is calculated once for each residue in S k . If S k has more residues than S w , each unpaired residue is given the maximum penalty of E beta-hb . If residues i and j are paired in parallel strands, either i forms hydrogen bonds with j-1 and j+1, or j forms hydrogen bonds with i-1 and i+1; no hydrogen bonds are formed between the atoms of residues i and j themselves. The hydrogen bonding energy is calculated for both possible conformations and only the minimum of the two is used in E beta-hb (i,j).
The distance and acceptor atom angle parameters are motivated by the orientation-dependent hydrogen bonding potential described in [42]. The following parameters were set based on idealized hydrogen bonding between β residues, with standard deviation values set such that two standard deviations approximate the cutoff in true hydrogen bonds. The ideal distance from the hydrogen atom to the accepting oxygen is I hb-dist = 1.9 Å, with standard deviation σ hb-dist = 0.5 Å. The ideal angle at the acceptor atom is 0°, so the ideal (u a,C,a,O · u a,O,d,H ) is I acc-dp = 1.0, with standard deviation σ acc-dp = 0.11. The ideal angle between the acceptor and donor atom vectors is 180°, so the ideal (u a,C,a,O · u d,N,d,H ) is I acc-don-dp = -1.0, with standard deviation σ acc-don-dp = 0.15. The parameters for pseudo-bonded residues are as follows: the ideal distance for r a,O,d,H is I ps-hbdist = 7.9 Å, I ps-acc-dp = -1.0, and I ps-acc-don-dp = -1.0. The standard deviations from the corresponding hydrogen bonding parameters above are used in Φ ps (a,d).
The penalty for the observed value (x) increases up to 6 standard deviations from the ideal value (μ).
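The residue pairing procedure described for E beta-sp is a greedy matching, which can be sketched as below. The function name and arguments are hypothetical, the hydrogen bonding penalty is passed in as a callable standing in for E beta-hb (i,j), and summing the individual penalties into a single strand pairing energy is an assumption, since the defining equations are not reproduced in this text.

```python
def greedy_strand_pairing(s_k, s_w, hb_penalty, max_penalty):
    """Greedily pair residues of strand k with residues of strand w.

    s_k, s_w    -- residue indices of the two strands
    hb_penalty  -- callable (i, j) -> penalty, a stand-in for E beta-hb (i,j)
    max_penalty -- penalty assigned to each residue of s_k left unpaired
    Returns (total, pairs) where pairs is a list of (i, j, penalty).
    Summation into a total is an assumption of this sketch.
    """
    remaining_k = list(s_k)
    remaining_w = list(s_w)
    pairs = []
    total = 0.0
    while remaining_k and remaining_w:
        # Choose the pair (i, j) that minimizes the hydrogen bonding penalty.
        e, i, j = min(
            (hb_penalty(i, j), i, j) for i in remaining_k for j in remaining_w
        )
        pairs.append((i, j, e))
        total += e
        remaining_k.remove(i)  # each residue is paired at most once
        remaining_w.remove(j)
    # Residues of strand k with no partner left receive the maximum penalty.
    total += max_penalty * len(remaining_k)
    return total, pairs
```

Because only unpaired residues of the first strand are penalized, swapping the two strands changes the result whenever their lengths differ, which is consistent with E beta-sp being commutative only for strands of equal length. No register is imposed on the pairing, so a bulge merely changes which residues end up matched.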

All-Atom Energy Term Details
The all-atom energy terms depend on atom-atom interactions when all heavy atoms are included in the model. In the all-atom energy equations x and y refer to atoms in the model and the residue positions are not referenced. The van der Waals radii and well-depths (ε x , used in E LEN-JONES ) come from the CHARMM19 parameter set [43]. The side-chain hydrogen bonding term, E SC-HB , is described in detail here because it is unique to SELECTpro; the remaining all-atom terms are described in the Appendix. Atoms at least 75% exposed are considered fully exposed and atoms less than 25% exposed are considered fully buried. For 25% < ΔG x slv % < 75% the penalty weight is reduced linearly from 1.0 at 25% to 0 at 75%. The ideal distance from the acceptor atom to the donor atom is I hb-da-dist = 2.9 Å. In the equations below donors is the set of all side-chain hydrogen donor atoms and acceptors is the set of all side-chain hydrogen acceptor atoms.
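The exposure-dependent penalty weight described above is a piecewise linear function, sketched here for concreteness. The function name is hypothetical, and the input is assumed to be the percent exposure of an atom expressed on a 0 to 100 scale.

```python
def exposure_weight(percent_exposed):
    """Penalty weight as a function of an atom's percent solvent exposure.

    Fully buried atoms (25% exposed or less) receive weight 1.0, fully
    exposed atoms (75% exposed or more) receive weight 0.0, and the
    weight falls linearly in between.
    """
    if percent_exposed <= 25.0:
        return 1.0
    if percent_exposed >= 75.0:
        return 0.0
    return (75.0 - percent_exposed) / 50.0
```

An atom that is 50% exposed therefore receives half of the full penalty weight.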

Appendix
In the interest of completeness and reproducibility we include the details of the energy terms that are adapted from previous work.

Reduced Representation Energy Term Details
E CT-REP : centroid repulsion A centroid-centroid repulsive term is used to reduce the overcrowding of side-chains in the reduced representation. The minimum distance between two centroids in the calculation is the minimum observed for each pair of residue types, D CT-min (aa i , aa j ), in pdb_select25. The penalty for a centroid-centroid overlap is defined as the overlap distance squared.
The motivation for the environment term is to model the hydrophobic effect. The level of burial for each residue in the model is estimated by the number of other C β atoms within 10 Å (the contact number N i ) [3]. The values in the table Ω statenv reflect the likelihood of observing a particular N i for each residue type. For model residues near either terminus the contact number is artificially increased to account for the missing neighbors along the chain.
The context independent pair-wise potential is from [3]. The potential considers the likelihood of observing the pair of centroids in a given distance bin relative to the background, with distance bins of < 5, 5-7, 7-10, 10-12, and > 12 Å. The advantage of a context independent pair-wise potential is that, because of its generality, it is less vulnerable to over-fitting by a conformational search.
E STAT-PW-CD : context dependent pair-wise potential This context specific pair-wise potential is from [6]. It depends on the local structure and relative orientation of both amino acids in the interaction. The statistics are calculated independently for each combination of local structures and relative orientations. At each position the local structure is considered either compact or open, and the relative orientation is determined by the dot product of the C α to C β unit vectors of each residue and divided into three classes: parallel, anti-parallel, and intermediate.
E ROG : radius of gyration The radius of gyration is a simple measure of the global compactness of a domain. E ROG penalizes models that are less compact than expected according to [44].
If the radius of gyration of the model (λ) is less than the expected value (2.2N^0.38 ), there is no penalty. If it is greater, the penalty is the squared difference between the observed and expected values. In the equation below r i,mean is the distance between the C α of residue i and the mean position of all C α atoms in the model.

All-Atom Energy Term Details

E LEN-JONES : van der Waals forces A fundamental characteristic of native globular protein structures is their efficient steric packing of atoms in the protein core. A Lennard-Jones 12-6 potential with damped repulsion (E LEN-JONES ) is used to measure the quality of steric packing. E LEN-JONES is the sum of local energy calculations E len-jones (x,y) performed on all pairs of non-bonded atoms. Since the repulsive portion of the standard Lennard-Jones 12-6 potential will overwhelm the entire energy function with a single significant atom-atom clash, repulsion is handled by a linear ramp from 0 to 10, as shown in the equation below [10]. Since E len-jones = 0 when (vdw x,y /r x,y ) = 2