SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs

Randall, Arlo; Baldi, Pierre

doi:10.1186/1472-6807-8-52

Research article
Open access
Published: 03 December 2008


BMC Structural Biology volume 8, Article number: 52 (2008)


Abstract

Background

Protein tertiary structure prediction is a fundamental problem in computational biology and identifying the most native-like model from a set of predicted models is a key sub-problem. Consensus methods work well when the redundant models in the set are the most native-like, but fail when the most native-like model is unique. In contrast, structure-based methods score models independently and can be applied to model sets of any size and redundancy level. Additionally, structure-based methods have a variety of important applications including analogous fold recognition, refinement of sequence-structure alignments, and de novo prediction. The purpose of this work was to develop a structure-based model selection method based on predicted structural features that could be applied successfully to any set of models.

Results

Here we introduce SELECTpro, a novel structure-based model selection method derived from an energy function comprising physical, statistical, and predicted structural terms. Novel and unique energy terms include predicted secondary structure, predicted solvent accessibility, predicted contact map, β-strand pairing, and side-chain hydrogen bonding.

SELECTpro participated in the new model quality assessment (QA) category in CASP7, submitting predictions for all 95 targets, and achieved top results. The average difference in GDT-TS between models ranked first by SELECTpro and the most native-like model was 5.07. This GDT-TS difference was less than 1% of the GDT-TS of the most native-like model for 18 targets, and less than 10% for 66 targets. SELECTpro also ranked the single most native-like model first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets, more often than any other method. Because the ranking metric is skewed by model redundancy and ignores poor models with a better ranking than the most native-like model, the BLUNDER metric is introduced to overcome these limitations. SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein, where it outperforms the benchmarked method (I-TASSER).

Conclusion

SELECTpro is an effective model selection method that scores models independently and is appropriate for use on any model set. SELECTpro is available for download as a standalone application at: http://www.igb.uci.edu/~baldig/selectpro.html. SELECTpro is also available as a public server at the same site.

Background

Selecting the most native-like model from a set of possible models is a crucial task in protein structure prediction. A variety of Model Quality Assessment Programs (MQAPs) have been developed that assign numeric scores to models in a set, and then use the scores to rank the models and ultimately select a single model. MQAP methods can be divided roughly into three categories based on the type of information they use: evolutionary methods use sequence or profile similarity between target sequence and template, consensus methods use similarity between models, and structure-based methods use model coordinates [1]. Each category of methods has inherent strengths and weaknesses.

Evolutionary methods can provide quality scores that have been shown to correlate with structural similarity to native [2]. However, for lower confidence alignments the scores do not correlate well with structural similarity. Furthermore, identification of the best template and specific alignment can be difficult. In addition, models built from multiple templates or template-free methods cannot be scored appropriately by evolutionary methods alone.

Consensus methods take advantage of the observation that similar models produced by different predictors tend to be more accurate than those that are structural outliers. In practice, consensus methods outperform the methods they draw from, and they rarely pick a very poor model. The disadvantage, however, is that when the best model is a structural outlier it will be overlooked for lack of popularity [1]. Also, consensus methods are not appropriate for selecting from small sets of structurally diverse models, especially in the extreme case of a two-model set.

While consensus methods depend on similarity between models, structure-based methods calculate scores on each model independently. For this reason, structure-based methods can be applied to model sets of any size and diversity, and will produce the same score for a model regardless of the other models in the set. Structure-based methods can also be used for template-free modeling [3–6] and model refinement procedures [7, 8]. One weakness of high resolution structure-based methods, including protein free energy approximation functions [9–12] and physics based approaches [13, 14], is their sensitivity to local structural irregularities such as steric clashes and chain breaks, which can significantly bias scores on otherwise accurate models. Even slight differences in model backbones can produce significantly different scores [15]. Lower resolution structure-based methods, such as statistical potentials [6, 16, 17], are more robust to backbone variation, but are sensitive to extended low contact-order regions in the models.

Here we describe SELECTpro, a novel structure-based MQAP that combines high and low resolution energy terms into a model selection method that is effective on model sets of variable size, diversity, and target difficulty. Most of our assessment is calculated from the CASP7 model quality assessment category (QA) results published online [18]. The QA category provides a framework for the unbiased evaluation of MQAPs on ensembles of models produced by diverse automated prediction methods.

Results and discussion

We analyze the CASP7 quality assessment category predictions with a focus on the quality of the model ranked first by each predictor and the recovery of the most native-like model in the set. Only SetAll is used in the assessment of the quality of the model ranked first by each group (Table 1). The results are very similar when using SetComplete (data not shown) because QA groups rarely rank an incomplete model first.

Table 1 Quality of Model Ranked First (M_QA1) Relative to Most Native-Like Model (M_max)


The assessment of the recovery of the most native-like model is performed on both SetAll and SetComplete (Table 2) because the few cases where an incomplete model is the most native-like have a significant effect on the average recovery metrics of all QA groups. Incomplete and irregular models are especially challenging for structure-based methods. A comparison of the average Pearson Correlation on SetAll and SetComplete highlights these issues (Table 3). The frequency of recovering the most native-like model is calculated on SetComplete (Figure 1).

Table 2 Recovery of Top GDT-TS Model (M_max)


Table 3 Correlation of Selected Groups


The utility of SELECTpro for selecting the best model from a small set is demonstrated by selecting from the five models submitted for each target by the top automated predictors. These small set selection results are calculated using SetAll (Figure 2). SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein and compared to I-TASSER (Figure 3).

To make fair comparisons to groups participating on only a subset of targets, common subset comparisons between SELECTpro and each of these groups are included in Tables 1 and 2. Only groups participating on at least half of the targets are included, and for groups with multiple submissions only the best one is shown. In the results tables any value that is better than SELECTpro is underlined.

For multiple domain targets, the sum of GDT-TS over all domains is used as the GDT-TS of the model. Since the QA predictions correspond to the entire structures, it is impossible to fairly assess the domains independently.

To assess the significance of the summary statistics compared in Table 1, Table 2, and Figure 2, we performed paired t-tests between SELECTpro and each other group on common subsets of targets (or targets and models when appropriate). All p-values from the tests appear in the tables and figure, but only statistically significant p-values (p < .05) are shown in bold.

The following notations are used throughout the results section:

M_max: The model with the highest GDT-TS among all server models.
M_QA1: The model with the highest QA score.
N_T: The number of targets a group made valid predictions on.
N_D: The number of domains a group made valid predictions on.

The recovery of M_max by a QA predictor can only be evaluated if M_max was scored by the predictor. In most cases QA predictors did not provide scores for all available server models, and frequently there is no score for M_max. For example, predictor 016_1 (AMBER/PB) made submissions on 86 targets, but M_max is only scored for 53 of these targets, so only these targets (N_T = 53) can be evaluated for this predictor.

Quality of Model Ranked First (M_QA1) Relative to Most Native-Like Model (M_max)

In this section on the assessment of the model ranked first, and the corresponding Table 1, we use the following three metrics:

ΔGDT_QA1 = GDT-TS(M_max) - GDT-TS(M_QA1): The GDT-TS difference between M_max and M_QA1 measures how much is lost by selecting M_QA1 rather than M_max for a single target.
$\overline{\Delta\mathrm{GDT}_{\mathrm{QA1}}}$ = ΣΔGDT_QA1/N_D: The average ΔGDT_QA1 is a simple way of assessing the quality of M_QA1 over all targets.
ΔGDT_QA1% = ΔGDT_QA1/GDT-TS(M_max): The GDT-TS difference percentage allows for comparison across targets with different numbers of domains and difficulty levels.
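
The three metrics above can be illustrated with a short sketch. This is not the evaluation code used for the paper; the function names, the dictionary layout, and the example values are assumptions made for illustration, and QA scores are assumed to be "higher is better".

```python
def delta_gdt_qa1(gdt_ts, qa_score):
    """Return (delta, delta_percent) for one target.

    gdt_ts   -- dict: model name -> GDT-TS (summed over domains if multi-domain)
    qa_score -- dict: model name -> QA score, higher meaning more confident
    """
    m_max = max(gdt_ts, key=gdt_ts.get)      # most native-like model
    m_qa1 = max(qa_score, key=qa_score.get)  # model ranked first by the predictor
    delta = gdt_ts[m_max] - gdt_ts[m_qa1]
    return delta, 100.0 * delta / gdt_ts[m_max]


def mean_delta_gdt_qa1(deltas, n_domains):
    """Average the per-target differences over N_D domains."""
    return sum(deltas) / n_domains


# Hypothetical target with three models (values invented for illustration).
gdt_ts = {"model_a": 60.0, "model_b": 54.0, "model_c": 30.0}
qa_score = {"model_a": 0.50, "model_b": 0.70, "model_c": 0.20}
delta, percent = delta_gdt_qa1(gdt_ts, qa_score)
```

In this invented example the predictor ranks model_b first, so selecting it loses 6.0 GDT-TS units relative to model_a, which is 10% of GDT-TS(M_max).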

The columns of Table 1 are: (1) group number; (2) number of targets the group made predictions on; (3) number of targets such that ΔGDT_QA1 = 0; (4) number of targets such that ΔGDT_QA1% < 1%; (5) number of targets such that ΔGDT_QA1% < 10%; and (6) $\overline{\Delta\mathrm{GDT}_{\mathrm{QA1}}}$. The common subset results section has an additional column for the p-value of the paired t-test using ΔGDT_QA1. The rows are sorted first by the number of targets and then by $\overline{\Delta\mathrm{GDT}_{\mathrm{QA1}}}$. Of the groups participating on all 95 targets, SELECTpro has the lowest average ΔGDT_QA1, with a value of 5.07, followed closely by group 713_1 (Circle-QA), with a value of 5.44. Predictor 038_1 (GeneSilico) has an average ΔGDT_QA1 of 5.75, with predictions on 85 targets. In common subset comparisons with these two groups SELECTpro is not significantly better, with p-values of .25 and .12 respectively. In common subset comparisons with all remaining groups SELECTpro is significantly better.

Another way to assess the quality of M_QA1 over many targets is to count the number of targets such that M_QA1 is the best model, or nearly the best, in the set. A method that performs very well on most targets, but very poorly on a few, would still be recognized by this criterion. SELECTpro recovers the best model for 12 targets, selects a model with ΔGDT_QA1% < 1% for 18 targets, and selects a model with ΔGDT_QA1% < 10% for 66 targets. Group 091_1 (Ma-OPUS) also performs well, with 11, 18, and 61 targets in the respective categories. In common subset comparisons, the only result better than SELECTpro's is that of predictor 038_1 (GeneSilico), which has 60 targets with ΔGDT_QA1% < 10% on its 85-target subset (58 for SELECTpro).

The BLUNDER Measure Recovery of M_max

How well does a QA predictor recover M_max? The traditional metric to assess M_max recovery is the rank of M_max, and the average rank over many targets ($\overline{\mathrm{rank}}$). While rank captures some important information, it ignores the redundancy of models and the quality of models ranked better than M_max. Consider the following hypothetical situation: group A ranks M_max 10th and all nine models ranked above it are redundant with ΔGDT of ~2.0; group B ranks M_max 5th and the four models ranked above it are diverse with a ΔGDT between 10.0 and 20.0. Which group has done a better job of recovering M_max? In this example, the rank metric favors group B, although group A ranks what is effectively a single structure, repeated nine times, above M_max. In addition, the models ranked better than M_max by group A have only slightly lower GDT-TS than M_max, while the models ranked better than M_max by group B are significantly worse than M_max. To address these weaknesses of the rank metric, we introduce the BLUNDER metric, which focuses on the worst model ranked better than M_max (the most embarrassing blunder). This measure is not affected by model redundancy and measures the quality of models ranked above M_max. The BLUNDER metric is defined using the following notation, and used in the assessment of the recovery of M_max and the corresponding Table 2 and Figure 1:

M_BLUNDER: The model with the minimum GDT-TS among models ranked better than M_max.
ΔGDT_BLUNDER = GDT-TS(M_max) - GDT-TS(M_BLUNDER): The GDT-TS difference between M_max and M_BLUNDER measures the size of the worst blunder.
$\overline{\Delta\mathrm{GDT}_{\mathrm{BLUNDER}}}$ = ΣΔGDT_BLUNDER/N_D: The average ΔGDT_BLUNDER measures how well a method robustly recovers M_max over many targets.
ΔGDT_BLUNDER% = ΔGDT_BLUNDER/GDT-TS(M_max): The ΔGDT_BLUNDER percentage allows for comparison across targets with different numbers of domains and difficulty levels.
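
A minimal sketch of the rank and BLUNDER calculations follows, applied to the hypothetical groups A and B described above. The function name, the data layout, and the GDT-TS values are invented for illustration. The sketch also assumes that ΔGDT_BLUNDER is 0 when M_max is ranked first, a case the definitions above do not spell out.

```python
def rank_and_blunder(gdt_ts, qa_score):
    """Return (rank of M_max, delta_gdt_blunder) for one target.

    gdt_ts   -- dict: model name -> GDT-TS
    qa_score -- dict: model name -> QA score, higher meaning ranked better
    Returns None when the predictor did not score M_max.
    """
    m_max = max(gdt_ts, key=gdt_ts.get)
    if m_max not in qa_score:
        return None  # recovery of M_max cannot be evaluated
    better = [m for m in qa_score if qa_score[m] > qa_score[m_max]]
    rank = len(better) + 1
    if not better:
        return rank, 0.0  # assumption: no blunder when M_max is ranked first
    m_blunder_gdt = min(gdt_ts[m] for m in better)  # worst model ranked above M_max
    return rank, gdt_ts[m_max] - m_blunder_gdt


# Group A: nine redundant models, each about 2 GDT-TS units below M_max, ranked above it.
gdt_a = {"best": 70.0, **{f"redundant_{i}": 68.0 for i in range(9)}}
qa_a = {"best": 0.5, **{f"redundant_{i}": 0.9 for i in range(9)}}

# Group B: four diverse models, 10 to 20 units below M_max, ranked above it.
gdt_b = {"best": 70.0, "d1": 60.0, "d2": 57.0, "d3": 53.0, "d4": 50.0}
qa_b = {"best": 0.5, "d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6}

result_a = rank_and_blunder(gdt_a, qa_a)
result_b = rank_and_blunder(gdt_b, qa_b)
```

With these invented values, group A has the worse rank (10 versus 5) but the smaller blunder (2.0 versus 20.0), which is the distinction the BLUNDER metric is meant to capture.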

Figure 1 contains graphs of the frequency of recovering M_max using the rank (A) and ΔGDT_BLUNDER% (B) measures on SetComplete. SELECTpro ranks M_max first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets. SELECTpro's ΔGDT_BLUNDER% values are less than 10% of GDT-TS(M_max) for 40 targets and less than 20% for 63 targets. These results are best among all QA participants. The average M_max recovery results are summarized in Table 2. The results columns are (1) average rank ($\overline{\mathrm{rank}}$) and (2) average ΔGDT_BLUNDER ($\overline{\Delta\mathrm{GDT}_{\mathrm{BLUNDER}}}$) on SetAll and SetComplete. The common subset results section also includes a column for the p-value of a paired t-test using ΔGDT_BLUNDER (p-value). Rows are sorted separately for each dataset by N_T first and then by $\overline{\Delta\mathrm{GDT}_{\mathrm{BLUNDER}}}$. On SetComplete SELECTpro has a $\overline{\Delta\mathrm{GDT}_{\mathrm{BLUNDER}}}$ of 10.4. In common subset comparisons one group has a lower $\overline{\mathrm{rank}}$: group 091_1 (Ma-OPUS) with $\overline{\mathrm{rank}}$ of 16.8 on 94 targets compared to 17.4 for SELECTpro. On SetAll SELECTpro did not submit a score for M_max of target T0356 (HHpred2_TS1) due to a processing error. In order to make complete common subset comparisons when possible we added in the SELECTpro score for HHpred2_TS1. SELECTpro ranks it 86th, with ΔGDT_BLUNDER = 50.0. Both results are significantly worse than the SELECTpro averages.

Pearson Correlation for Individual Proteins

The assessor evaluation of the quality assessment category [18] focused on the Pearson Correlation between the QA scores and GDT-TS. Here we use the Pearson Correlation only to highlight some of the difficulties for structure-based methods in dealing with incomplete models, as well as basic non-protein-like structural features. Approximately half of the models in SetAll are incomplete, with backbone coordinates missing for one or more residues.

Incomplete models present a challenge to SELECTpro and other structure-based methods because the scores for each model are only comparable when calculated on coordinates for the same set of residues. Another issue is that some complete models have severe chain-breaks, severe steric clashes, or significant portions modeled only as extended chains. These local problems can overwhelm the energy of what may otherwise be a good model. Consensus methods do not suffer from these local structure problems. Given this rationale, one would expect structure-based methods to see the most improvement in terms of average Pearson Correlation on SetComplete relative to SetAll. Table 3 shows the average Pearson Correlation of five selected groups. Predictors 713_1 (Circle-QA), 633_1 (ProQ), and SELECTpro are structure-based MQAPs, while 634_1 (Pcons) is a consensus method and 556_1 (LEE) scored structures based on the GDT-TS similarity to their human Model 1 CASP7 prediction [18]. As expected, the structure-based MQAPs improve more than the structural similarity-based methods. The even greater increase in Pearson Correlation for SELECTpro can be accounted for by the failure to generate appropriate complete models for some of the incomplete models resulting in QA scores calculated on extended chains.

Reranking Top Server Group Models

Predictors in CASP may submit up to five models, but CASP evaluation focuses on the model designated as Model 1. Clearly, the selection of Model 1 is critical in the CASP setting and for protein structure prediction in general. Figure 2 contains the results when SELECTpro is used to rerank the five models submitted by each of the top ten servers from CASP7, compared to each server's results. In the following assessment M_max-g is the model with the highest GDT-TS of the five models submitted by a server. Figure 2 (A) shows that SELECTpro recovers M_max-g more frequently than 8 of the top 10 server groups; in addition, when SELECTpro is used to select Model 1 the average GDT-TS increases for 7 of 10 server groups; however, the increase is only statistically significant for 3 groups. SELECTpro improves using both criteria for the top 3 server groups (Zhang-Server, Pmodeller6, and ROBETTA). These results highlight the utility of SELECTpro for the task of model selection. The comparisons made here are fair because structure-based methods can be applied in the server setting to any number of models.

Large Decoy Set Model Selection

Here we analyze SELECTpro's model selection capability on the large decoy sets for 16 small proteins from a recent I-TASSER benchmark set [19]. The I-TASSER prediction method generates 12500 to 20000 different backbone conformations. The complete decoy sets can be downloaded from [20]. The consensus method SPICKER [21] is used to cluster the models and a centroid model is built from the first cluster. A second round of simulation resolves the steric clashes in the centroid model and results in the final predicted model. The centroid model and final model are not part of the decoy set. In order to make a fair model selection comparison the decoy model closest to the centroid is used as I-TASSER's M_{QA 1}.

On the benchmark set SELECTpro has an average GDT-TS of 63.7, while I-TASSER has an average GDT-TS of 62.1. SELECTpro's average ΔGDT_QA1 is 9.2 and I-TASSER's average ΔGDT_QA1 is 10.7. Figure 3 displays the GDT-TS results for the individual proteins in the benchmark set. Different symbols are used to indicate the GDT-TS of M_max (□), the GDT-TS of SELECTpro's M_QA1 (×), and the GDT-TS of I-TASSER's M_QA1 (+) for each protein. A paired t-test of the hypothesis that the mean performance of SELECTpro and I-TASSER is equal produces a p-value of .19, which is not statistically significant, but does give some evidence that SELECTpro can select a very good model from a large set of decoys at least as well as an established method that utilizes consensus methods.

Conclusion

An MQAP that can select the most native-like model from a set of possibilities has a variety of applications in protein structure prediction. The new quality assessment category introduced in CASP7 allows for the unbiased assessment of MQAPs on the models produced by automated predictors. This category allows researchers to focus on the model scoring aspect of protein structure prediction.

The results presented in this work demonstrate that SELECTpro, a structure-based model selection method, consistently selects one of the best models from the large diverse sets of models produced by automated predictors, across all levels of target difficulty. On these large diverse sets of models, SELECTpro also recovers the single most native-like model well compared to other methods. On the small sets of five models submitted for each target by the top automated predictors, in most cases SELECTpro selects better models than the predictors themselves.

Since SELECTpro and other structure-based methods score models independently, they can be incorporated into the model selection pipelines of individual protein structure prediction servers. For this reason, it may help predictors if the CASP organizers distinguished methods that score models independently from those that do not.

Consensus and structure-based methods can be combined to achieve improved results. For example, the meta-server method Pmodeller [22] combines consensus (Pcons [23]) and structure-based methods (ProQ [24]) to predict protein structures more accurately than either method in isolation. The assessment of the QA category by CASP assessors recognized the consensus method Pcons (group 634_1) for the high Pearson Correlation between their scores and model GDT-TS on most targets [18]. In their own assessment the authors of Pcons recognized that while consensus methods perform well in most cases, "when most of the models are incorrect and the few correct models are outliers a consensus based approach cannot be expected to make an optimal choice." [1] For instance, they identified three particular targets in CASP7 where their consensus method failed: T0283, T0350, and T0351 [1]. The Pcons average ΔGDT_QA1 on these three targets is 30.8. The same research group's structure-based method ProQ (group 633_1) has an average ΔGDT_QA1 of 17.2. In contrast, on these three targets SELECTpro has an average ΔGDT_QA1 of only 7.1. This example highlights the potential of combining SELECTpro with existing model selection methods.

SELECTpro has been made publicly available as a server, where users may submit from 2 to 100 models for evaluation. In addition to the global confidence scores, the scores of individual energy terms are also returned to the user by email for each model submitted. SELECTpro is one of several protein structure tools in the SCRATCH suite of predictors [25], and is available through: http://www.igb.uci.edu/~baldig/selectpro.html.

Methods

Datasets

All of the comparative analysis in this work is performed on the server models and quality assessment predictions submitted in the CASP7 [26] experiment. The CASP QA experiment is particularly relevant for the evaluation of model selection methods for several reasons: (1) the QA predictors were blind to the true structures at the time of prediction making it impossible for methods to be tuned to improve results; (2) the set of proteins is diverse: the 95 targets range in size from 68 to 530 amino acids, come from a variety of organisms, and span the full range of prediction difficulty; (3) each target has more than 200 predicted models that contain the types of errors that occur in automated structure prediction; (4) the protein set is not selected by any of the participating QA groups; (5) the models are scored by a variety of methods and the results are publicly available. We perform analysis on the set of all models (SetAll) and a subset of models (SetComplete) that are complete and free of gross structural irregularities, as described below. All of the ABIpro models and some of the 3Dpro models were optimized using the exact energy function of SELECTpro. These models are removed because of the obvious bias towards these models. In recent CASP experiments the GDT-TS [27] has been used as the primary automatic structural similarity measure. The published GDT-TS values from the CASP7 website are the only structural similarity measure used in this work.

SetAll

The SetAll dataset consists of the server models with a GDT-TS value published on the CASP7 website, a total of 23,423 models. To calculate a score on a protein model SELECTpro requires the backbone coordinates (N, C_α, C) for all model residues as input. A total of 8,812 models in SetAll have only a C_α trace or have no coordinates for one or more residues. Modeller8v1 [28–30] was used to generate complete models from the incomplete ones, and then the complete models were scored by SELECTpro. In most cases the complete models were built appropriately from the incomplete models; however, in some cases the final model was a fully extended chain due to an error in our application of Modeller. We failed to identify this problem until after the completion of the CASP7 competition. The SELECTpro scores versus GDT-TS scores for all models of target T0305 are displayed in plot A of Figure 4. The circled outliers with very low confidence scores and high GDT-TS scores are models that were incomplete and the complete models generated by Modeller were fully extended chains. The Pearson correlation on the set of all models for T0305 is .641. The SELECTpro scores versus GDT-TS scores for complete models only are displayed in plot B of Figure 4, and the Pearson correlation is .966.
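
The effect that a few such outliers have on the Pearson correlation can be illustrated with a small sketch. The scores below are invented for illustration and are not the T0305 data; the function name is likewise an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented (QA score, GDT-TS) pairs for models that were complete as submitted.
complete = [(0.20, 20.0), (0.40, 35.0), (0.50, 50.0), (0.70, 65.0), (0.80, 80.0)]
# Invented outliers: good models that were scored as fully extended chains.
outliers = [(0.05, 75.0), (0.05, 70.0)]

r_complete = pearson(*zip(*complete))
r_all = pearson(*zip(*(complete + outliers)))
```

Two outliers with very low QA scores and high GDT-TS are enough to pull the correlation on the full set well below the correlation on the complete models alone.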

SetComplete

The scores produced by SELECTpro are comparable on complete models of the same sequence. There is no standard for the handling of incomplete models and we assume that participating groups took a variety of approaches. Using only complete models ensures that the MQAP scores are calculated from the same coordinates. Thus, the models retained in SetComplete are screened first for completeness. Models missing backbone coordinates for one or more residues are removed. This leaves 14,611 models.

Structure-based MQAPs are susceptible to local structural irregularities in models, and will tend to score such models poorly. This is why methods developed to select near-native models from sets of decoys remove such models from consideration [31]. We apply additional filters (described below) for C_α-C_α clashes, C_α-C_α chain breaks, and expanded termini to remove an additional 1,217 models leaving 13,494 more plausible models in SetComplete.

The C_α-C_α clash model filter enforces a squared difference penalty for C_α-C_α distances less than 3.6 Å. The distance between the C_α atoms of residues i and j is denoted by r_i,Cα,j,Cα, and N is the protein length. The constant 13.52 in the threshold below corresponds to two severe clashes where r_i,Cα,j,Cα = 1.0 Å, since 2 × (3.6 - 1.0)² = 13.52. Models with a sum of squared differences greater than 13.52 per 100 residues are filtered out.

$$\sum_{i > j} \max\left\{0,\ 3.6 - r_{i, C_{\alpha}, j, C_{\alpha}}\right\}^{2} > 13.52 \, (N / 100)$$
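
A minimal sketch of the clash filter, assuming the C_α coordinates are available as a list of (x, y, z) tuples in Å. The function names are hypothetical and the all-pairs loop is written for clarity rather than speed.

```python
import math


def ca_clash_penalty(ca):
    """Sum over residue pairs i > j of max(0, 3.6 - r_ij)^2, with r_ij in Angstroms."""
    total = 0.0
    for i in range(len(ca)):
        for j in range(i):
            r = math.dist(ca[i], ca[j])
            total += max(0.0, 3.6 - r) ** 2
    return total


def fails_clash_filter(ca):
    """True when the penalty exceeds 13.52 per 100 residues."""
    return ca_clash_penalty(ca) > 13.52 * len(ca) / 100.0


# Toy example: an extended chain with 3.8 Angstrom spacing has no clashes.
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]
```

A single pair of C_α atoms 1.0 Å apart contributes (3.6 - 1.0)² = 6.76, so two such clashes in a 100-residue model sit exactly at the threshold.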

The C_α-C_α chain break model filter enforces a squared difference penalty for r_i,Cα,i+1,Cα distances greater than 4.0 Å. The constant 16.0 in the threshold below corresponds to a single chain break where r_i,Cα,i+1,Cα = 8.0 Å, since (8.0 - 4.0)² = 16.0. Models with a sum of squared differences greater than 16.0 per 100 residues are filtered out.

$$\sum_{i} \max\left\{0,\ r_{i, C_{\alpha}, i + 1, C_{\alpha}} - 4.0\right\}^{2} > 16.0 \, (N / 100)$$
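
The chain break filter can be sketched in the same way, again assuming a list of C_α coordinates in Å and hypothetical function names.

```python
import math


def chain_break_penalty(ca):
    """Sum over consecutive residues of max(0, r_(i,i+1) - 4.0)^2, in Angstroms."""
    return sum(
        max(0.0, math.dist(ca[i], ca[i + 1]) - 4.0) ** 2
        for i in range(len(ca) - 1)
    )


def fails_chain_break_filter(ca):
    """True when the penalty exceeds 16.0 per 100 residues."""
    return chain_break_penalty(ca) > 16.0 * len(ca) / 100.0


# Toy example: consecutive C-alpha atoms 3.8 Angstroms apart incur no penalty.
intact = [(3.8 * i, 0.0, 0.0) for i in range(10)]
```

Because the threshold scales with N, one 8.0 Å break is tolerated in a model of 100 or more residues but causes a shorter model to be filtered out.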

The expanded termini filter removes models where a large portion of the structure is modeled as expanded chain with no non-local interactions. The screening procedure is: scan from the N-terminus until three consecutive residues have a contact number of at least 10, and repeat from the C-terminus. The contact number of a residue is defined here as the number of other C_β atoms within 10 Å of the residue's C_β [3]. If the sum of low contact number termini residues is at least 20% of N, the model is filtered out.
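
The screening procedure can be sketched as follows. The text does not state exactly which residues are counted at each terminus; this sketch assumes the count is the number of residues preceding the first run of three consecutive residues with contact number of at least 10, and that the whole chain is counted when no such run exists. All function names are hypothetical.

```python
import math


def contact_numbers(cb, cutoff=10.0):
    """Number of other C-beta atoms within `cutoff` Angstroms of each residue's C-beta."""
    n = len(cb)
    return [
        sum(1 for j in range(n) if j != i and math.dist(cb[i], cb[j]) < cutoff)
        for i in range(n)
    ]


def low_contact_terminus(contacts, min_contacts=10, run=3):
    """Residues scanned before `run` consecutive residues reach `min_contacts`."""
    for start in range(len(contacts) - run + 1):
        if all(c >= min_contacts for c in contacts[start:start + run]):
            return start
    return len(contacts)  # assumption: no compact run found, count the whole chain


def fails_expanded_termini_filter(contacts):
    """True when low-contact terminal residues make up at least 20% of N."""
    n = len(contacts)
    n_term = low_contact_terminus(contacts)
    c_term = low_contact_terminus(contacts[::-1])
    return min(n, n_term + c_term) >= 0.20 * n


# Invented contact numbers: three extended residues at each end of a 20-residue model.
expanded = [1, 1, 1] + [12] * 14 + [2, 2, 2]
compact = [1] + [12] * 18 + [1]
```

In practice `contact_numbers` would be applied to the C_β coordinates of a model and its output passed to `fails_expanded_termini_filter`.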

Model Representations

Reduced representation

In the reduced representation the heavy backbone atoms, carbonyl oxygen, amide hydrogen (N, C_α, C, O, H), and C_β are represented explicitly. For glycine residues a pseudo C_β is calculated. The side-chain atoms are represented by a single united point (centroid) [32, 33]. The centroid is calculated as the mean of the position of the heavy side-chain atoms. For glycine and alanine the centroid (CT) is set to the C_β atom. Only the heavy backbone atoms (N, C_α, C) are used as input to SELECTpro and the positions of additional atoms and centroids are calculated from these.
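
The centroid definition can be sketched as below, assuming the heavy side-chain atom coordinates are given as (x, y, z) tuples. Whether C_β itself enters the mean is not stated in the text; this sketch assumes it does, which reproduces the alanine rule automatically. The function name is hypothetical.

```python
def side_chain_centroid(side_chain_heavy_atoms, c_beta):
    """Mean position of the heavy side-chain atoms.

    side_chain_heavy_atoms -- list of (x, y, z) tuples, assumed to include C-beta
    c_beta                 -- (x, y, z) of the real or pseudo C-beta atom
    """
    if not side_chain_heavy_atoms:
        return c_beta  # glycine: the centroid is set to the pseudo C-beta
    n = len(side_chain_heavy_atoms)
    return tuple(sum(atom[k] for atom in side_chain_heavy_atoms) / n for k in range(3))


# Invented coordinates for a two-atom side chain.
centroid = side_chain_centroid([(1.0, 0.0, 0.0), (3.0, 2.0, 0.0)], (1.0, 0.0, 0.0))
```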

All heavy-atom representation

In the all heavy-atom representation the centroid is removed and the heavy side chain atoms are represented explicitly. The side-chains are initially placed onto the backbone of the reduced representation in their most likely conformation according to the SCWRL backbone-dependent rotamer library [34]. The side-chain placements are then optimized using the SELECTpro all-atom energy terms (described below) in conjunction with the rotamer library.

Energy Functions Overview

E_REDUCED is the combined energy calculated from the reduced representation. E_REDUCED is a linear combination of predicted (E_PRED-SS, E_PRED-SA, E_PRED-CM), physical (E_VDW-REP), and statistical (E_CT-REP, E_STAT-ENV, E_STAT-PW-CI, E_STAT-PW-CD, E_ROG) terms:

E_REDUCED = w₁E_PRED-SS + w₂E_PRED-SA + w₃E_PRED-CM + w₄E_BETA + w₅E_VDW-REP + w₆E_CT-REP + w₇E_STAT-ENV + w₈E_STAT-PW-CI + w₉E_STAT-PW-CD + w₁₀E_ROG

E_ALL-ATOM consists of the energy terms that depend on the all heavy-atom representation. E_ALL-ATOM is a linear combination of the following physical terms:

E_ALL-ATOM = w₁₁E_SC-HB + w₁₂E_LEN-JONES + w₁₃E_SOLVATION + w₁₄E_ELECTRO

E_FINAL is the sum of E_REDUCED and E_ALL-ATOM, and is used for the final scoring of models by SELECTpro. The individual energy terms are outlined briefly below, and detailed descriptions of the novel terms follow in the remainder of this section. Underlined terms are adapted from previously described energy terms; their details are included in the Appendix.
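
The structure of the combined energy can be sketched as a weighted sum over named terms. The term names follow the two linear combinations given above; the dictionary layout and function names are assumptions, and no actual SELECTpro weights are reproduced here.

```python
REDUCED_TERMS = ("PRED-SS", "PRED-SA", "PRED-CM", "BETA", "VDW-REP",
                 "CT-REP", "STAT-ENV", "STAT-PW-CI", "STAT-PW-CD", "ROG")
ALL_ATOM_TERMS = ("SC-HB", "LEN-JONES", "SOLVATION", "ELECTRO")


def combine(terms, weights, names):
    """Linear combination of the named energy terms."""
    return sum(weights[name] * terms[name] for name in names)


def e_final(terms, weights):
    """E_FINAL = E_REDUCED + E_ALL-ATOM."""
    return combine(terms, weights, REDUCED_TERMS) + combine(terms, weights, ALL_ATOM_TERMS)


def select_model(models, weights):
    """Return the name of the model with the lowest E_FINAL.

    models -- dict: model name -> dict of energy term values
    """
    return min(models, key=lambda name: e_final(models[name], weights))


# Placeholder weights and term values, invented for illustration only.
all_names = REDUCED_TERMS + ALL_ATOM_TERMS
unit_weights = {name: 1.0 for name in all_names}
models = {
    "model_a": {name: 1.0 for name in all_names},
    "model_b": {name: 0.5 for name in all_names},
}
```

Because each model's energy depends only on its own term values, the score of a model does not change when other models are added to or removed from the set.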

Parameter Weights

The parameter weights were determined by repeatedly varying individual weights and maximizing the sum of the GDT-TS of the lowest E_FINAL models on a training set built from CASP6 protein domains. For each CASP6 protein domain a set of 500 decoy models was generated using fragment assembly with the RMSD to native as the dominant term in the objective function [3].
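
One way to read this procedure is as a coordinate-wise search over the weights. The sketch below is only an illustration of that idea: the multiplicative step factors, the number of sweeps, the stopping rule, and the toy training set are all assumptions and are not taken from the paper.

```python
def selected_gdt_sum(weights, training_set):
    """Sum of GDT-TS of the lowest-energy decoy in each decoy set.

    training_set -- list of decoy sets; each decoy is (term_values, gdt_ts)
    """
    total = 0.0
    for decoys in training_set:
        lowest = min(decoys, key=lambda d: sum(w * t for w, t in zip(weights, d[0])))
        total += lowest[1]
    return total


def tune_weights(weights, training_set, factors=(0.5, 2.0), sweeps=5):
    """Vary one weight at a time and keep any change that raises the objective."""
    weights = list(weights)
    best = selected_gdt_sum(weights, training_set)
    for _ in range(sweeps):
        improved = False
        for k in range(len(weights)):
            for factor in factors:
                trial = weights[:]
                trial[k] = weights[k] * factor
                score = selected_gdt_sum(trial, training_set)
                if score > best:
                    weights, best, improved = trial, score, True
        if not improved:
            break
    return weights, best


# Toy training set: one domain, two decoys, two energy terms (values invented).
toy_set = [[((0.0, 3.0), 30.0), ((4.0, 0.0), 80.0)]]
tuned_weights, tuned_score = tune_weights([1.0, 1.0], toy_set)
```

In the toy example equal weights select the 30.0 GDT-TS decoy, while down-weighting the first term lets the 80.0 GDT-TS decoy become the lowest-energy model.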

E_REDUCED

E_PRED-SS: predicted secondary structure

E_PRED-ACC: predicted solvent accessibility

E_PRED-CM: predicted contact map

E_BETA: sheet formation

E_BB-REP: backbone repulsion

E_CT-REP: centroid repulsion

E_STAT-ENV: residue environment potential [3]

E_STAT-PW-CI: context independent pair-wise potential [3, 16]

E_STAT-PW-CD: context dependent pair-wise potential [6]

E_ROG: compactness

E_ALL-ATOM

E_SC-HB: side-chain hydrogen bonding

E_LEN-JONES: van der Waals forces [10]

E_SOLVATION: solvation effects [35]

E_ELECTRO: electrostatic interactions

Throughout this work the convention of all capital letters referring to global energy and all lower case referring to local energy is used. For instance, E_PRED-CM refers to the global contact map energy and E_pred-cm(i,j) refers to the contact map energy between residues i and j.

Parameter notation used in energy equations

Model variables

r_i,x,j,y: distance between atom x of residue i and atom y of residue j

r_x,y: distance between atom x and atom y

v_i,x,j,y: vector from atom x of residue i to atom y of residue j

u_i,x,j,y: unit vector calculated from v_i,x,j,y

N_i: number of residues in contact with residue i, with contact defined as r_i,Cβ,j,Cβ < 10 Å

phi_i: Phi angle of residue i

psi_i: Psi angle of residue i

Protein specific input parameters

aa_i: amino acid type of residue i

ss_i: predicted secondary structure of residue i (H,E,C)

acc_i: predicted solvent accessibility of residue i ('e': exposed, '-': buried)

cmap_i,j: predicted contact/non-contact between residues i and j, with contact defined as r_i,Cα,j,Cα < 12 Å

Protein independent parameters

I_value: ideal parameter value for a given calculation

σ_value: standard deviation value for a given calculation

vdw_x: van der Waals radius of atom x

vdw_x+y: vdw_x + vdw_y

Ω_stat-env: pre-calculated statistics for use in E_STAT-ENV

Ω_stat-pw-ci: pre-calculated statistics for use in E_STAT-PW-CI

Ω_stat-pw-od: pre-calculated statistics for use in E_STAT-PW-CD

D_min,pw-od: minimum interaction distance for centroid pairs used in E_STAT-PW-CD

D_max,pw-od: maximum interaction distance for centroid pairs used in E_STAT-PW-CD

D_CT-min: minimum distance between centroids for each pair of amino acid types observed in pdb_select25 [36]

Reduced Representation Energy Term Details

The details of how the novel reduced representation energy terms are calculated are presented in this section. The predicted structural terms E_PRED-SS, E_PRED-ACC, and E_PRED-CM and the β-strand pairing term, E_BETA, are novel and unique to SELECTpro. Additional reduced representation terms are adapted from previously published work and their details are included in the Appendix.

Predicted structural features overview

The structural feature predictions used in E_PRED-SS, E_PRED-ACC, and E_PRED-CM come from the SCRATCH suite of predictors [25]. Each predictor is trained in a supervised fashion using curated non-redundant datasets extracted from the PDB [37]. The secondary structure (SSpro [38]) and solvent accessibility (ACCpro [39]) predictors use ensembles of 1D-RNN (one dimensional-recursive neural network) architectures [40]. The contact map predictor (CMAPpro [41]) uses ensembles of 2D-RNN architectures [40].

E_PRED-SS: predicted secondary structure

The predicted secondary structure term E_PRED-SS penalizes deviation of the torsion angles from the ideal torsion angle parameters for residues that SSpro predicts as helix or strand. There is no penalty for predicted coils. The parameter values for helix residues are: I_Hφ = -65.3, σ_Hφ = 11.9, I_Hψ = -39.4, σ_Hψ = 11.3. The parameter values for strand residues are: I_Eφ = -135.0, σ_Eφ = 15.0, I_Eψ = 135.0, σ_Eψ = 15.0. Only torsion angles that are more than two standard deviations from the ideal are penalized, with the penalty defined as follows:

\begin{array}{l}
E_{\text{PRED-SS}} = \sum_{ss_{i} = H} E_{\text{pred-helix}}(i) + \sum_{ss_{j} = E} E_{\text{pred-strand}}(j) \\
E_{\text{pred-helix}}(i) = \sqrt{\Theta_{H\phi}(i) \left(|\mathit{phi}_{i} - I_{H\phi}| - 2\sigma_{H\phi}\right)^{2} + \Theta_{H\psi}(i) \left(|\mathit{psi}_{i} - I_{H\psi}| - 2\sigma_{H\psi}\right)^{2}} \\
\Theta_{H\phi}(i) = \begin{cases} 1, & \text{if } |\mathit{phi}_{i} - I_{H\phi}| > 2\sigma_{H\phi} \\ 0, & \text{otherwise} \end{cases} \\
\Theta_{H\psi}(i) = \begin{cases} 1, & \text{if } |\mathit{psi}_{i} - I_{H\psi}| > 2\sigma_{H\psi} \\ 0, & \text{otherwise} \end{cases}
\end{array}

The definition of E_pred-strand(j) is equivalent to E_pred-helix(i), but with I_Eφ, σ_Eφ, I_Eψ, and σ_Eψ in place of the corresponding helical values.
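As a rough illustration of the E_PRED-SS definition, the following sketch evaluates the penalty from per-residue torsion angles. The function names and the input layout are hypothetical, and the code assumes plain absolute differences between angles (no wrap-around at ±180), since the text does not state how periodicity is handled.

```python
import math

# Ideal torsion angles and standard deviations, taken from the text.
HELIX = {"phi": (-65.3, 11.9), "psi": (-39.4, 11.3)}
STRAND = {"phi": (-135.0, 15.0), "psi": (135.0, 15.0)}


def excess(angle, ideal, sigma):
    """Deviation beyond two standard deviations, or 0.0 inside the tolerance."""
    dev = abs(angle - ideal)  # assumption: no periodic wrap-around
    return dev - 2.0 * sigma if dev > 2.0 * sigma else 0.0


def e_pred_local(phi, psi, params):
    """Local penalty for one residue predicted as helix or strand."""
    d_phi = excess(phi, *params["phi"])
    d_psi = excess(psi, *params["psi"])
    return math.sqrt(d_phi ** 2 + d_psi ** 2)


def e_pred_ss(torsions, ss):
    """Global penalty; torsions is a list of (phi, psi), ss a string of H/E/C."""
    total = 0.0
    for (phi, psi), state in zip(torsions, ss):
        if state == "H":
            total += e_pred_local(phi, psi, HELIX)
        elif state == "E":
            total += e_pred_local(phi, psi, STRAND)
        # predicted coil residues carry no penalty
    return total
```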

E_PRED-ACC: predicted solvent accessibility

The solvent accessibility predictor ACCpro predicts the percent of solvent accessibility in 5% increments for each residue. Using 25% exposure as a binary threshold, the accuracy of the predictor is ~77% [39]. The binary exposure ('e')/burial ('-') prediction is used as the predicted solvent accessibility for E_PRED-ACC. In the reduced representation the solvent accessibility of residue i is estimated by its contact number (N_i), where N_i > 16 is considered buried [3]. If the predicted status of a residue is not realized in the model, the penalty is calculated as:

\begin{array}{l}
E_{\text{PRED-ACC}} = \sum_{i} E_{\text{pred-acc}}(i) \\
E_{\text{pred-acc}}(i) = \begin{cases} (17 - N_{i})^{2}, & \text{if } \mathit{acc}_{i} = \text{'-'} \text{ and } N_{i} \leq 16 \\ (N_{i} - 16)^{2}, & \text{if } \mathit{acc}_{i} = \text{'e'} \text{ and } N_{i} > 16 \\ 0, & \text{otherwise} \end{cases}
\end{array}
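The piecewise penalty can be written directly as a small function. This is a minimal sketch with hypothetical names; the contact numbers N_i are assumed to be computed elsewhere.

```python
def e_pred_acc_local(acc, n_contacts):
    """Penalty for one residue whose predicted burial state is not realized."""
    if acc == "-" and n_contacts <= 16:   # predicted buried, modeled exposed
        return float((17 - n_contacts) ** 2)
    if acc == "e" and n_contacts > 16:    # predicted exposed, modeled buried
        return float((n_contacts - 16) ** 2)
    return 0.0


def e_pred_acc(acc_pred, contact_numbers):
    """Sum of local penalties over all residues."""
    return sum(e_pred_acc_local(a, n) for a, n in zip(acc_pred, contact_numbers))
```

For instance, a residue predicted buried with N_i = 10 contributes (17 - 10)^2 = 49, while a residue predicted exposed with N_i = 20 contributes (20 - 16)^2 = 16.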

E_PRED-CM: predicted contact map

The contact map predictor CMAPpro predicts the probability of contact or non-contact between C_α atoms, with a contact threshold of 12 Å. The strategy utilized to infer predicted contacts from the probability matrix [41] results in maps that are sparse when compared to those of real proteins; thus, unrealized contacts are penalized while non-contacts are not. The constant 1.0 is added to the penalty to ensure that all unrealized contacts make a significant contribution to E_PRED-CM.

\begin{array}{l}
E_{\text{PRED-CM}} = \sum_{j > i} \Theta(i, j)\, E_{\text{pred-cm}}(i, j) \\
\Theta(i, j) = \begin{cases} 1, & \text{if } r_{i,C_{\alpha},j,C_{\alpha}} > I_{\text{cm-thresh}} \\ 0, & \text{otherwise} \end{cases} \\
E_{\text{pred-cm}}(i, j) = \mathit{cmap}_{i,j} \left(1.0 + \frac{r_{i,C_{\alpha},j,C_{\alpha}}^{2}}{I_{\text{cm-thresh}}^{2}}\right)
\end{array}
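A minimal sketch of this term follows, assuming cmap is a binary matrix (1 for a predicted contact, 0 otherwise) and that C_α coordinates are given as 3-tuples in Å. The names are illustrative only.

```python
import math

CM_THRESH = 12.0  # contact threshold in Å (I_cm-thresh)


def e_pred_cm(ca_coords, cmap):
    """Penalize predicted contacts that are not realized in the model."""
    total = 0.0
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(ca_coords[i], ca_coords[j])
            if r > CM_THRESH:
                # cmap[i][j] is 0 for predicted non-contacts, so they add nothing
                total += cmap[i][j] * (1.0 + (r * r) / (CM_THRESH ** 2))
    return total
```

A predicted contact realized at 24 Å, twice the threshold, contributes 1.0 + 4.0 = 5.0, while a contact just beyond 12 Å still contributes slightly more than 2.0 because of the added constant.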

The predicted contact map can help identify the highest GDT-TS models in the set, even when they are not highly similar to native. A good example is CASP7 target T0304, a 122-residue α/β protein for which the highest GDT-TS model in the set is Zhang-Server_TS1 (GDT-TS = 45.55). Most secondary structure predictors (including SSpro) failed to predict the first two strands, making this target especially difficult. No QA method ranked the highest GDT-TS model first; however, SELECTpro ranked it second and the model ranked first by SELECTpro (T0304.Zhang-Server_TS4) has the second highest GDT-TS. These models have the lowest E_PRED-CM of any models in the set, but the native structure has an even lower E_PRED-CM. Figure 5 compares the native and predicted contact maps for target T0304.

E_BETA: strand pairing

The formation of hydrogen bonds between the residues of β-strand partners is a major determinant of the tertiary structure of β and α/β proteins. The β hydrogen bonding treatment described here favors realistic strand pairing and sheet formation. The treatment also efficiently accommodates bulges in strands because it does not force the register between two paired strands. E_BETA is the global strand pairing energy that penalizes deviations from ideal hydrogen bonding between the β residues of paired strands. E_beta-sp(β_k→β_w) is the strand pairing energy of strand β_k to strand β_w. E_beta-sp is directional: E_beta-sp(β_k→β_w) equals E_beta-sp(β_w→β_k) only if the two strands have the same length. E_beta-hb(i,j) is the hydrogen bonding penalty between residues i and j.

E_beta-sp is calculated for all possible strand pairings, but only the two lowest energies from each strand are used in E_BETA. Other strand-strand interactions are ignored. In the equations below S is the set of all strands in the protein, β_m1 is the strand with the minimum pairing energy from β_k, and β_m2 is the strand with the next lowest pairing energy from β_k. If the strand count is less than six, at least two of the strands must be edge strands. This is accounted for by only considering the single best strand partner for two strands.

\begin{array}{l}
E_{\text{BETA}} = \sum_{\beta_{k} \in S} \left[ E_{\text{beta-sp}}(\beta_{k} \to \beta_{m1}) + E_{\text{beta-sp}}(\beta_{k} \to \beta_{m2}) \right] \\
\beta_{m1} = \operatorname*{arg\,min}_{\beta_{x} \in S \setminus \{\beta_{k}\}} E_{\text{beta-sp}}(\beta_{k} \to \beta_{x}) \\
\beta_{m2} = \operatorname*{arg\,min}_{\beta_{y} \in S \setminus \{\beta_{k}, \beta_{m1}\}} E_{\text{beta-sp}}(\beta_{k} \to \beta_{y})
\end{array}
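The selection of the two best partners per strand can be sketched as follows, given a precomputed matrix of directional pairing energies. The matrix layout and function name are assumptions made for illustration, and the special handling of edge strands when there are fewer than six strands is deliberately left out.

```python
def e_beta(pair_energy):
    """Sum the two lowest outgoing pairing energies for every strand.

    pair_energy[k][w] holds E_beta-sp(k -> w); the diagonal is ignored.
    Note: the edge-strand adjustment described in the text is omitted here.
    """
    total = 0.0
    n = len(pair_energy)
    for k in range(n):
        others = sorted(pair_energy[k][w] for w in range(n) if w != k)
        total += sum(others[:2])  # best and second-best partner of strand k
    return total
```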

In the equations for E_beta-sp below, S_k is the set of all residues in strand β_k. Each time E_beta-hb is calculated the pair (i,j) is chosen with i from S_k and j from S_w, such that E_beta-hb is minimized. Then residue i is removed from S_k, and residue j is removed from S_w. E_beta-hb is calculated once for each residue in S_k. If S_k has more residues than S_w, each unpaired residue is given the maximum penalty of E_beta-hb.

\begin{array}{l}
E_{\text{beta-sp}}(\beta_{k} \to \beta_{w}) = \sum_{\text{while } S_{k} \neq \emptyset} E_{\text{beta-hb}}(i, j) \\
(i, j) = \operatorname*{arg\,min}_{x \in S_{k},\, y \in S_{w}} E_{\text{beta-hb}}(x, y) \\
S_{k} \leftarrow S_{k} \setminus \{i\}, \quad S_{w} \leftarrow S_{w} \setminus \{j\}
\end{array}
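The greedy pairing procedure can be sketched as below. The residue-level penalty is passed in as a function, and both the function names and the value used for the maximum penalty are assumptions: the text does not state the maximum explicitly, so the caller supplies it (for example 108.0, if two saturated Φ terms of 54.0 are taken as the cap).

```python
def e_beta_sp(strand_k, strand_w, e_beta_hb, max_penalty):
    """Directional pairing energy of strand k to strand w.

    strand_k, strand_w: lists of residue indices.
    e_beta_hb: callable (i, j) -> residue pairing penalty (supplied by caller).
    max_penalty: assumed cap applied to residues of k left without a partner.
    """
    remaining_k = list(strand_k)
    remaining_w = list(strand_w)
    total = 0.0
    while remaining_k:
        if not remaining_w:
            # strand k is longer than strand w: unpaired residues get the cap
            total += max_penalty * len(remaining_k)
            break
        energy, i, j = min(
            ((e_beta_hb(x, y), x, y) for x in remaining_k for y in remaining_w),
            key=lambda t: t[0],
        )
        total += energy
        remaining_k.remove(i)
        remaining_w.remove(j)
    return total
```

Because no register is imposed, a bulge simply shifts which residues end up paired. The asymmetry between the two directions appears as soon as the strands differ in length.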

Between two anti-parallel strand partners, only every other pair of residues is hydrogen bonded. For the pairs that are not hydrogen bonded, a pseudo-bonding calculation is used. The hydrogen bonding energy and pseudo-bonding energy are both calculated and the minimum of the two is used in E_beta-hb(i,j).

If residues i and j are paired in parallel strands, either i forms hydrogen bonds with j-1 and j+1, or j forms hydrogen bonds with i-1 and i+1. No hydrogen bonds are formed between the atoms of residues i and j. The hydrogen bonding energy is calculated for both possible conformations and only the minimum of the two is used in E_beta-hb(i,j).

E_{\text{beta-hb}}(i, j) = \begin{cases} \min\{\Phi(i \to j) + \Phi(j \to i),\ \Phi_{ps}(i \to j) + \Phi_{ps}(j \to i)\}, & \text{if strands are anti-parallel} \\ \min\{\Phi(i \to j + 1) + \Phi(j - 1 \to i),\ \Phi(i - 1 \to j) + \Phi(j \to i + 1)\}, & \text{if strands are parallel} \end{cases}

Φ(a→d) is the directional energy calculation for a single hydrogen bond, where a is the index of the acceptor residue and d is the index of the donor residue. Three geometrical measures are used to estimate the strength of hydrogen bonds: the distance between the acceptor and the hydrogen atoms (r_a,O,d,H), the angle at the acceptor atom (u_a,C,a,O · u_a,O,d,H), and the angle between the acceptor and donor atom vectors (u_a,C,a,O · u_d,N,d,H). The distance and acceptor atom angle parameters are motivated by the orientation-dependent hydrogen bonding potential described in [42]. The following parameters were set based on idealized hydrogen bonding between β residues, with standard deviation values set such that two standard deviations approximate the cut-off in true hydrogen bonds. The ideal distance from hydrogen atom to accepting oxygen is I_hb-dist = 1.9 Å, with standard deviation σ_hb-dist = 0.5 Å. The ideal angle at the acceptor atom is 0°, so the ideal (u_a,C,a,O · u_a,O,d,H) is I_acc-dp = 1.0, with standard deviation σ_acc-dp = 0.11. The ideal angle between the acceptor and donor atom vectors is 180°, so the ideal (u_a,C,a,O · u_d,N,d,H) is I_acc-don-dp = -1.0, with standard deviation σ_acc-don-dp = 0.15. The parameters for pseudo-bonded residues are as follows: the ideal distance for r_a,O,d,H is I_ps-hb-dist = 7.9 Å, I_ps-acc-dp = -1.0, and I_ps-acc-don-dp = -1.0. The standard deviations from the corresponding hydrogen bonding parameters above are used in Φ_ps(a→d).

\begin{array}{l}
\Phi(a \to d) = \begin{cases} 54.0, & \text{if } r_{a,O,d,H} > 7.0 \\ \Psi(r_{a,O,d,H}, I_{\text{hb-dist}}, \sigma_{\text{hb-dist}}) + \Psi(u_{a,C,a,O} \cdot u_{d,N,d,H}, I_{\text{acc-don-dp}}, \sigma_{\text{acc-don-dp}}) + \Psi(u_{a,C,a,O} \cdot u_{a,O,d,H}, I_{\text{acc-dp}}, \sigma_{\text{acc-dp}}), & \text{otherwise} \end{cases} \\
\Phi_{ps}(a \to d) = \begin{cases} 54.0, & \text{if } r_{a,O,d,H} > 10.0 \\ \Psi(r_{a,O,d,H}, I_{\text{ps-hb-dist}}, \sigma_{\text{ps-hb-dist}}) + \Psi(u_{a,C,a,O} \cdot u_{d,N,d,H}, I_{\text{acc-don-dp}}, \sigma_{\text{acc-don-dp}}) + \Psi(u_{a,C,a,O} \cdot u_{a,O,d,H}, I_{\text{ps-acc-dp}}, \sigma_{\text{acc-dp}}), & \text{otherwise} \end{cases}
\end{array}

The penalty for the observed value (x) increases up to 6 standard deviations from the ideal value (μ).

\Psi(x, \mu, \sigma) = \begin{cases} \frac{(x - \mu)^{2}}{2\sigma^{2}}, & \text{if } |x - \mu| < 6\sigma \\ \frac{(6\sigma)^{2}}{2\sigma^{2}} = 18.0, & \text{otherwise} \end{cases}
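The capped quadratic penalty Ψ and the directional hydrogen bond energy Φ can be sketched as follows. The geometric inputs (the O to H distance and the two dot products) are assumed to be computed elsewhere, and the function names are illustrative.

```python
def psi_penalty(x, mu, sigma):
    """Quadratic penalty that saturates at 18.0 beyond six standard deviations."""
    dev = abs(x - mu)
    if dev < 6.0 * sigma:
        return dev ** 2 / (2.0 * sigma ** 2)
    return 18.0


def phi_hbond(r_oh, acc_don_dp, acc_dp):
    """Directional hydrogen bond energy from three geometric measures.

    r_oh: acceptor O to donor H distance in Å.
    acc_don_dp: dot product of the acceptor C->O and donor N->H unit vectors.
    acc_dp: dot product of the acceptor C->O and O->H unit vectors.
    """
    if r_oh > 7.0:
        return 54.0
    return (
        psi_penalty(r_oh, 1.9, 0.5)            # I_hb-dist, sigma_hb-dist
        + psi_penalty(acc_don_dp, -1.0, 0.15)  # I_acc-don-dp, sigma_acc-don-dp
        + psi_penalty(acc_dp, 1.0, 0.11)       # I_acc-dp, sigma_acc-dp
    )
```

The distance cut-off value of 54.0 equals three saturated Ψ terms (3 × 18.0), so the energy never exceeds the cap on either side of the cut-off.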

All-Atom Energy Term Details

The all-atom energy terms depend on atom-atom interactions when all heavy atoms are included in the model. In the all-atom energy equations x and y refer to atoms in the model and the residue positions are not referenced. The van der Waals radii and well-depths (ε_x, used in E_LEN-JONES) come from the CHARMM19 parameter set [43]. The side-chain hydrogen bonding term, E_SC-HB, is described in detail here because it is unique to SELECTpro. The details of E_LEN-JONES, E_SOLVATION, and E_ELECTRO are provided in the Appendix.

E_SC-HB: side-chain hydrogen bonding

E_SC-HB penalizes unsatisfied hydrogen bond donor and acceptor atoms that are at least partially buried. There is no penalty for fully exposed donor or acceptor atoms. Exposure percent ($\Delta G_{x}^{slv}$%) is calculated as $\Delta G_{x}^{slv} / \Delta G_{x}^{ref}$. The definitions of $\Delta G_{x}^{slv}$ and $\Delta G_{x}^{ref}$ are provided in the description of E_SOLVATION in the Appendix. Atoms at least 75% exposed are considered fully exposed and atoms less than 25% exposed are considered fully buried. For 25% < $\Delta G_{x}^{slv}$% < 75% the penalty weight is reduced linearly from 1.0 at 25% to 0 at 75%. The ideal distance from the acceptor atom to donor atom is I_hb-da-dist = 2.9 Å. In the equations below, donors is the set of all side-chain hydrogen donor atoms and acceptors is the set of all side-chain hydrogen acceptor atoms.

\begin{array}{l}
E_{\text{SC-HB}} = \sum_{x \in \mathit{acceptors}} E_{\text{hb-acc}}(x) + \sum_{x \in \mathit{donors}} E_{\text{hb-don}}(x) \\
E_{\text{hb-acc}}(x) = \lambda(x) \min_{y \in \mathit{donors}} |r_{x,y} - I_{\text{hb-da-dist}}|^{2} \\
E_{\text{hb-don}}(x) = \lambda(x) \min_{y \in \mathit{acceptors}} |r_{x,y} - I_{\text{hb-da-dist}}|^{2} \\
\lambda(x) = \begin{cases} 1, & \text{if } \Delta G_{x}^{slv}\% < 0.25 \\ 0, & \text{if } \Delta G_{x}^{slv}\% > 0.75 \\ 2\,(0.75 - \Delta G_{x}^{slv}\%), & \text{otherwise} \end{cases}
\end{array}
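A sketch of the burial weight λ(x) and the resulting penalty is given below. Each atom is represented as a (coordinates, exposure fraction) pair, which is an assumed layout, and returning 0.0 when no partner atoms exist is likewise an assumption because the text does not cover that case.

```python
import math

HB_DA_DIST = 2.9  # ideal acceptor-donor distance in Å (I_hb-da-dist)


def burial_weight(exposure):
    """Penalty weight: 1.0 below 25% exposure, 0.0 above 75%, linear between."""
    if exposure < 0.25:
        return 1.0
    if exposure > 0.75:
        return 0.0
    return 2.0 * (0.75 - exposure)


def unsatisfied_penalty(atom, partners):
    """Weighted squared deviation from the ideal distance to the best partner."""
    coords, exposure = atom
    if not partners:
        return 0.0  # assumption: undefined in the text
    best = min((math.dist(coords, p[0]) - HB_DA_DIST) ** 2 for p in partners)
    return burial_weight(exposure) * best


def e_sc_hb(acceptors, donors):
    """Sum penalties over all side-chain acceptor and donor atoms."""
    return (sum(unsatisfied_penalty(a, donors) for a in acceptors)
            + sum(unsatisfied_penalty(d, acceptors) for d in donors))
```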

Appendix

In the interest of completeness and reproducibility we include the details of the energy terms that are adapted from previous work.

Reduced Representation Energy Term Details

E_BB-REP: backbone repulsion

This term penalizes steric clashes between non-bonded atoms explicitly represented in the reduced representation. The penalty for overlapping atoms is the overlap distance squared as defined here:

\begin{array}{l}
E_{\text{BB-REP}} = \sum_{j > i} E_{\text{bb-rep}}(i, j) \\
E_{\text{bb-rep}}(i, j) = \sum_{x} \sum_{y} \Theta(i, x, j, y) \left(\mathit{vdw}_{x+y} - r_{i,x,j,y}\right)^{2} \\
\Theta(i, x, j, y) = \begin{cases} 1, & \text{if } r_{i,x,j,y} < \mathit{vdw}_{x+y} \\ 0, & \text{otherwise} \end{cases}
\end{array}

E_CT-REP: centroid repulsion

A centroid-centroid repulsive term is used to reduce the overcrowding of side-chains in the reduced representation. The minimum distance between two centroids in the calculation is the minimum observed for each pair of residue types – D_CT-min(aa_i,aa_j) – in pdb_select25. The penalty for centroid-centroid overlaps is defined as the overlap distance squared:

\begin{array}{l}
E_{\text{CT-REP}} = \sum_{j > i} \Theta(i, j) \left[D_{\text{CT-min}}(aa_{i}, aa_{j}) - r_{i,CT,j,CT}\right]^{2} \\
\Theta(i, j) = \begin{cases} 1, & \text{if } r_{i,CT,j,CT} < D_{\text{CT-min}}(aa_{i}, aa_{j}) \\ 0, & \text{otherwise} \end{cases}
\end{array}

E_STAT-ENV: residue environment potential

The motivation for this term is to model the hydrophobic effect. The level of burial for each residue in the model is estimated by the number of other C_β atoms within 10 Å (the contact number N_i) [3]. The values in the table Ω_stat-env reflect the likelihood of observing a particular N_i for each residue type. For model residues near either terminus the contact number is artificially increased to account for the missing neighbors along the chain. In the equation below, N is the number of residues in the model.

\begin{array}{l}
E_{\text{STAT-ENV}} = \sum_{i} \Omega_{\text{stat-env}}(aa_{i}, N_{i}^{*}) \\
N_{i}^{*} = \begin{cases} N_{i} + 4 - i, & \text{if } i < 4 \\ N_{i} + 4 - |i - N|, & \text{if } |i - N| < 4 \\ N_{i}, & \text{otherwise} \end{cases}
\end{array}
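The terminal correction to the contact number is a direct transcription of the piecewise definition. The sketch assumes 1-based residue indices, which the text does not state explicitly, and the function name is hypothetical.

```python
def adjusted_contact_number(n_i, i, n_res):
    """Contact number with the terminal correction applied.

    n_i: raw contact number of residue i.
    i: residue index (assumed 1-based).
    n_res: total number of residues N in the model.
    """
    if i < 4:
        return n_i + 4 - i
    if abs(i - n_res) < 4:
        return n_i + 4 - abs(i - n_res)
    return n_i
```

Under this reading the first residue gains 3 contacts and the last residue gains 4, so the two termini are treated slightly differently by the formula as written.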

E_STAT-PW-CI: context independent pair-wise interactions

This context independent pair-wise potential comes from Equation 6 of [3]. The potential considers the likelihood of observing the pair of centroids in a given distance bin relative to the background, with distance bins of < 5, 5–7, 7–10, 10–12, and > 12 Å. The advantage of a context independent pair-wise potential is that it is less vulnerable to over-fitting by a conformational search because of its generality.

\begin{array}{l}
E_{\text{STAT-PW-CI}} = \sum_{j > i} E_{\text{stat-pw-ci}}(i, j) \\
E_{\text{stat-pw-ci}}(i, j) = \Omega_{\text{stat-pw-ci}}[aa_{i}, aa_{j}, r_{bin}(i, j)] \\
r_{bin}(i, j) = \begin{cases} 0\text{-}5, & \text{if } r_{i,CT,j,CT} \leq 5.0 \\ 5\text{-}7, & \text{if } 5.0 < r_{i,CT,j,CT} \leq 7.0 \\ 7\text{-}10, & \text{if } 7.0 < r_{i,CT,j,CT} \leq 10.0 \\ 10\text{-}12, & \text{if } 10.0 < r_{i,CT,j,CT} \leq 12.0 \\ 12+, & \text{if } r_{i,CT,j,CT} > 12.0 \end{cases}
\end{array}
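The distance binning can be sketched with a sorted list of bin edges. The labels and function name are illustrative; the edges and the inclusive upper bounds follow the equation above.

```python
import bisect

BIN_EDGES = [5.0, 7.0, 10.0, 12.0]                  # upper bounds in Å
BIN_LABELS = ["0-5", "5-7", "7-10", "10-12", "12+"]


def distance_bin(r):
    """Map a centroid-centroid distance to its bin (upper bounds inclusive)."""
    # bisect_left keeps a distance equal to an edge in the lower bin
    return BIN_LABELS[bisect.bisect_left(BIN_EDGES, r)]
```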

E_STAT-PW-CD: context dependent pair-wise potential

This context specific pair-wise potential is from [6]. This pair-wise potential depends on the local structure and relative orientation of both amino acids in the interaction. The statistics are calculated independently for each combination of local structures and relative orientations. At each position the local structure is considered either compact or open and the relative orientation is determined by the dot product of the C_α to C_β unit vectors of each residue and divided into three classes: parallel, anti-parallel, and intermediate.

\begin{array}{l}
E_{\text{STAT-PW-CD}} = \sum_{j > i} E_{\text{stat-pw-cd}}(i, j) \\
E_{\text{stat-pw-cd}}(i, j) = \Theta(i, j)\, \Omega_{\text{stat-pw-od}}[aa_{i}, aa_{j}, \lambda(i), \lambda(j), \Phi(i, j)] \\
\Theta(i, j) = \begin{cases} 1, & \text{if } r_{i,CT,j,CT} > D_{\min,\text{pw-od}}[aa_{i}, aa_{j}, \lambda(i), \lambda(j), \Phi(i, j)] \text{ and } r_{i,CT,j,CT} < D_{\max,\text{pw-od}}[aa_{i}, aa_{j}, \lambda(i), \lambda(j), \Phi(i, j)] \\ 0, & \text{otherwise} \end{cases} \\
\lambda(i) = \begin{cases} \text{compact}, & \text{if } r_{i-1,C_{\alpha},i+1,C_{\alpha}} < 6.0 \\ \text{open}, & \text{otherwise} \end{cases} \\
\Phi(i, j) = \begin{cases} \text{parallel}, & \text{if } u_{i,C_{\alpha},i,C_{\beta}} \cdot u_{j,C_{\alpha},j,C_{\beta}} > 0.5 \\ \text{antiparallel}, & \text{if } u_{i,C_{\alpha},i,C_{\beta}} \cdot u_{j,C_{\alpha},j,C_{\beta}} < -0.5 \\ \text{intermediate}, & \text{otherwise} \end{cases}
\end{array}
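The two context classifications, local structure λ(i) and relative orientation Φ(i,j), can be sketched from atomic coordinates as shown below. Coordinates are assumed to be 3-tuples in Å and the helper names are hypothetical; the statistical tables themselves are not reproduced.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def local_structure(ca_prev, ca_next):
    """Classify residue i from the distance between C-alpha i-1 and i+1."""
    return "compact" if math.dist(ca_prev, ca_next) < 6.0 else "open"


def orientation(ca_i, cb_i, ca_j, cb_j):
    """Classify the relative orientation of two C-alpha to C-beta vectors."""
    u_i = unit(tuple(b - a for a, b in zip(ca_i, cb_i)))
    u_j = unit(tuple(b - a for a, b in zip(ca_j, cb_j)))
    dp = sum(a * b for a, b in zip(u_i, u_j))
    if dp > 0.5:
        return "parallel"
    if dp < -0.5:
        return "antiparallel"
    return "intermediate"
```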

E_ROG: compactness

The radius of gyration is a simple measure of the global compactness of a domain. E_ROG penalizes models that are less compact than expected according to [44]. If the radius of gyration of the model (λ) is less than the expected value (2.2N^0.38), there is no penalty. If it is greater, then the penalty is the squared difference between observed and expected. In the equation below r_i,mean is the distance between the C_α of residue i and the mean of all C_α atoms in the model.

\begin{array}{l}
E_{\text{ROG}} = \Theta \left(\lambda - 2.2 N^{0.38}\right)^{2} \\
\lambda = \sqrt{\frac{\sum_{i} r_{i,mean}^{2}}{N}} \\
\Theta = \begin{cases} 1, & \text{if } \lambda > 2.2 N^{0.38} \\ 0, & \text{otherwise} \end{cases}
\end{array}
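A minimal sketch of the compactness penalty, assuming C_α coordinates are supplied as 3-tuples in Å; the function names are illustrative.

```python
import math


def radius_of_gyration(ca_coords):
    """Root mean square distance of the C-alpha atoms from their mean position."""
    n = len(ca_coords)
    mean = tuple(sum(c[k] for c in ca_coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(c, mean) ** 2 for c in ca_coords) / n)


def e_rog(ca_coords):
    """Squared excess of the radius of gyration over the expected 2.2 * N**0.38."""
    n = len(ca_coords)
    expected = 2.2 * n ** 0.38
    rg = radius_of_gyration(ca_coords)
    return (rg - expected) ** 2 if rg > expected else 0.0
```

For a 100-residue domain the expected value is 2.2 × 100^0.38, roughly 12.7 Å, so only models more extended than that are penalized.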

All-Atom Energy Term Details

E_LEN-JONES: van der Waals forces

A fundamental characteristic of native globular protein structures is their efficient steric packing of atoms in the protein core. A Lennard-Jones 12-6 potential with damped repulsion (E_LEN-JONES) is used to measure the quality of steric packing. E_LEN-JONES is the sum of local energy calculations E_len-jones(x,y) performed on all pairs of non-bonded atoms. Since the repulsive portion of the standard Lennard-Jones 12-6 potential would overwhelm the entire energy function with a single significant atom-atom clash, repulsion is handled by a linear ramp from 0 to 10 as shown in the equation below [10]. Since E_len-jones = 0 when (vdw_x,y/r_x,y) = $\sqrt[6]{2}$ independent of atom types, the switch to a linear ramp occurs when (vdw_x,y/r_x,y) > $\sqrt[6]{2}$.

\begin{array}{l}
E_{\text{LEN-JONES}} = \sum_{y > x} E_{\text{len-jones}}(x, y) \\
E_{\text{len-jones}}(x, y) = \begin{cases} 10.0 \left(1 - \frac{\sqrt[6]{2}}{\mathit{vdw}_{x,y} / r_{x,y}}\right), & \text{if } (\mathit{vdw}_{x,y} / r_{x,y}) > \sqrt[6]{2} \\ \sqrt{\varepsilon_{x} \varepsilon_{y}} \left[\left(\frac{\mathit{vdw}_{x,y}}{r_{x,y}}\right)^{12} - 2 \left(\frac{\mathit{vdw}_{x,y}}{r_{x,y}}\right)^{6}\right], & \text{otherwise} \end{cases}
\end{array}
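The damped pair potential can be sketched as follows. The well-depths used in any call are placeholders rather than CHARMM19 values, the function name is hypothetical, and the distance r is assumed to be strictly positive.

```python
import math

SIXTH_ROOT_2 = 2.0 ** (1.0 / 6.0)


def e_len_jones_local(r, vdw_sum, eps_x, eps_y):
    """Lennard-Jones 12-6 energy with the repulsive wall replaced by a ramp.

    r: atom-atom distance (must be > 0); vdw_sum: sum of van der Waals radii.
    eps_x, eps_y: well-depths, assumed to have the same sign.
    """
    ratio = vdw_sum / r
    if ratio > SIXTH_ROOT_2:
        # linear in r: 0 at the switch point, approaching 10 as r goes to 0
        return 10.0 * (1.0 - SIXTH_ROOT_2 / ratio)
    return math.sqrt(eps_x * eps_y) * (ratio ** 12 - 2.0 * ratio ** 6)
```

Both branches evaluate to zero at the switch point, so the function is continuous there, and the minimum of depth sqrt(ε_x ε_y) sits at r equal to the sum of the radii.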

E_SOLVATION: solvation effects

Solvation energy is calculated using the implicit solvation model described in [35] with the following adjustment: for overlapping atoms, the sum of their van der Waals radii is used in the calculation in place of the observed atom-atom distance in the model. This restricts the amount a single atom can contribute to the burial of another atom. Without this adjustment overlapping atoms will bias the calculation to indicate an atom is more buried than it would be otherwise. In the solvation model $\Delta G_{x}^{slv}$ is the observed solvation free energy of atom x in the model, calculated as the free energy of the fully exposed atom ($\Delta G_{x}^{ref}$) minus the reduction in solvation caused by the surrounding atoms. $\Delta G_{x}^{free}$ was determined empirically by setting it equal to $\Delta G_{x}^{ref}$ and increasing its magnitude until $\Delta G_{x}^{slv}$ of deeply buried atoms became zero. λ_x is the correlation length of atom x. V_y is the volume of neighboring atom y. The values of these parameters come from [35], with the exception of $\Delta G_{x}^{ref}$ [45]. The equation for $\Delta G_{x}^{slv}$ below is the combination of Equations 5, 6, and 7 of [35], with the atom overlap adjustment.

\begin{array}{l}
E_{\text{SOLVATION}} = \sum_{x} \Delta G_{x}^{slv} \\
\Delta G_{x}^{slv} = \Delta G_{x}^{ref} - \frac{\Delta G_{x}^{free}}{2 \lambda_{x} \pi \sqrt{\pi}} \sum_{y \neq x} \frac{e^{-\left[\left(\frac{r_{x,y}^{*} - \mathit{vdw}_{x}}{\lambda_{x}}\right)^{2}\right]} V_{y}}{r_{x,y}^{*2}} \\
r_{x,y}^{*} = \begin{cases} \mathit{vdw}_{x+y}, & \text{if } r_{x,y} < \mathit{vdw}_{x+y} \\ r_{x,y}, & \text{otherwise} \end{cases}
\end{array}
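The per-atom solvation free energy, including the overlap adjustment, can be sketched as below. All numeric inputs in any call are hypothetical placeholders rather than the published parameters, and the neighbor layout is an assumption.

```python
import math


def delta_g_slv(dg_ref, dg_free, lam, vdw_x, neighbors):
    """Solvation free energy of atom x given its neighboring atoms.

    neighbors: list of (r_xy, vdw_y, volume_y) tuples for every other atom y.
    lam: correlation length of atom x.
    """
    burial = 0.0
    for r, vdw_y, volume_y in neighbors:
        # overlap adjustment: never use a distance below the sum of the radii
        r_star = max(r, vdw_x + vdw_y)
        burial += math.exp(-((r_star - vdw_x) / lam) ** 2) * volume_y / r_star ** 2
    return dg_ref - dg_free / (2.0 * lam * math.pi * math.sqrt(math.pi)) * burial
```

With the adjustment, a neighbor that clashes with atom x contributes exactly as much burial as one sitting at contact distance, which is the cap on single-atom contributions described above.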

E_ELECTRO: electrostatics

Electrostatic interactions between charged atoms are treated by simple repulsion and attraction according to inverse distance squared. The use of distance squared rather than linear distance encourages the formation of salt bridges in the models. There is a correction for atom-atom distances below the minimum realistic value. The ideal distance between oppositely charged atoms is I_e_dist = 2.75 Å. In the equations below pos is the set of all positively charged atoms and neg is the set of all negatively charged atoms.

\begin{array}{l}
E_{\text{ELECTRO}} = \sum_{y > x \in \mathit{pos} \cup \mathit{neg}} \Theta(x)\, \Theta(y) / r_{x,y}^{*2} \\
\Theta(x) = \begin{cases} 1, & \text{if } x \in \mathit{pos} \\ -1, & \text{if } x \in \mathit{neg} \end{cases} \\
r_{x,y}^{*} = \begin{cases} I_{\text{e\_dist}}, & \text{if } r_{x,y} < I_{\text{e\_dist}} \\ r_{x,y}, & \text{otherwise} \end{cases}
\end{array}
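A minimal sketch of the electrostatic term, with each charged atom given as a (coordinates, sign) pair. The data layout and function name are assumptions for illustration.

```python
import math

E_DIST = 2.75  # minimum distance in Å used in the calculation (I_e_dist)


def e_electro(charged_atoms):
    """Inverse-square attraction and repulsion between charged atoms.

    charged_atoms: list of (coords, sign) with sign +1 (pos) or -1 (neg).
    """
    total = 0.0
    n = len(charged_atoms)
    for a in range(n):
        for b in range(a + 1, n):
            (xa, sa), (xb, sb) = charged_atoms[a], charged_atoms[b]
            r_star = max(math.dist(xa, xb), E_DIST)  # clamp unrealistically short distances
            total += sa * sb / r_star ** 2
    return total
```

Clamping the distance means an opposite-charge pair can contribute no less than -1/2.75², so overlapping charged atoms are not rewarded beyond an ideal salt bridge.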

Availability and requirements

Project home page: http://www.igb.uci.edu/~baldig/selectpro.html
Operating system: Linux for the standalone version; the server is platform independent
Programming language: C++ and Perl
Software requirements: Perl
Disk space requirements: 1.6 GB for full version, 13 MB without feature predictors

References

Wallner B, Elofsson A: Prediction of global and local model quality in CASP7 using Pcons and ProQ. Proteins 2007, 69(Suppl 8):184–193. 10.1002/prot.21774
Cozzetto D, Tramontano A: Relationship between multiple sequence alignments and quality of protein comparative models. Proteins 2005, 58: 151–157. 10.1002/prot.20284
Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268: 209–225. 10.1006/jmbi.1997.0959
Kihara D, Lu H, Kolinski A, Skolnick J: TOUCHSTONE: An ab initio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci USA 2001, 98: 10125–10130. 10.1073/pnas.181328398
Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A: Protein fragment reconstruction using various modeling techniques. J Comput Aided Mol Des 2003, 17: 725–738. 10.1023/B:JCAM.0000017486.83645.a0
Kolinski A: Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 2004, 51: 349–371.
Sanchez R, Sali A: Comparative protein structure modeling. Introduction and practical examples with modeller. Methods Mol Biol 2000, 143: 97–129.
Qian B, Ortiz A, Baker D: Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci USA 2004, 101: 15346–15351. 10.1073/pnas.0404703101
Lazaridis T, Karplus M: Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 1999, 288: 477–487. 10.1006/jmbi.1999.2685
Kuhlman B, Baker D: Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA 2000, 97: 10383–10388. 10.1073/pnas.97.19.10383
Vorobjev Y, Hermans J: Free energies of protein decoys provide insight into determinants of protein stability. Protein Sci 2001, 10: 2498–2506. 10.1110/ps.ps.15501
Felts A, Gallicchio E, Wallqvist A, Levy R: Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model. Proteins 2002, 48: 404–422. 10.1002/prot.10171
Dominy B, Brooks C: Identifying native-like protein structures using physics-based potentials. J Comput Chem 2002, 23: 147–160. 10.1002/jcc.10018
Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Kazmierkiewicz R, Ripoll DR, Pillardy J, Saunders JA, Kang YK, Gibson KD, Scheraga HA: Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: Assessment in two blind tests. Proc Natl Acad Sci USA 2005, 102: 7547–7552. 10.1073/pnas.0502655102
Shortle D, Simons KT, Baker D: Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci USA 1998, 95: 11158–11162. 10.1073/pnas.95.19.11158
Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D: Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 1999, 34: 82–95. 10.1002/(SICI)1097-0134(19990101)34:1<82::AID-PROT7>3.0.CO;2-A
Vendruscolo M, Najmanovich R, Domany E: Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins 2000, 38: 134–148. 10.1002/(SICI)1097-0134(20000201)38:2<134::AID-PROT3>3.0.CO;2-A
Cozzetto D, Kryshtafovych A, Ceriani M, Tramontano A: Assessment of predictions in the model quality assessment category. Proteins 2007, 69: 175–183. 10.1002/prot.21669
Wu S, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 2007, 5: 17. 10.1186/1741-7007-5-17
Zhang 2007 Decoy Sets [http://zhang.bioinformatics.ku.edu/I-TASSER/decoys/]
Zhang Y, Skolnick J: SPICKER: A clustering approach to identify near-native protein folds. J Comput Chem 2004, 25: 865–871. 10.1002/jcc.20011
Wallner B, Fang H, Elofsson A: Automatic consensus-based fold recognition using Pcons, ProQ, and Pmodeller. Proteins 2003, 53(Suppl 6):534–541. 10.1002/prot.10536
Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A: Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci 2001, 10: 2354–2362. 10.1110/ps.08501
Wallner B, Elofsson A: Can correct protein models be identified? Protein Sci 2003, 12: 1073–1086. 10.1110/ps.0236803
Cheng J, Randall AZ, Sweredoski M, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 2005, 33(Web Server issue): W72–W76. 10.1093/nar/gki396
Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction-Round VII. Proteins 2007, 69(Suppl 8):3–9. 10.1002/prot.21767
Zemla A, Venclovas C, Moult J, Fidelis K: Processing and analysis of CASP3 protein structure predictions. Proteins 1999, 37(Suppl 3):22–29. 10.1002/(SICI)1097-0134(1999)37:3+<22::AID-PROT5>3.0.CO;2-W
Sali A, Blundell TL: Comparative protein modeling by satisfaction of spatial restraints. J Mol Biol 1993, 234: 779–815. 10.1006/jmbi.1993.1626
Marti-Renom MA, Stuart A, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291
Fiser A, Do RK, Sali A: Modeling of loops in protein structures. Protein Sci 2000, 9: 1753–1773.
Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D: An Improved Protein Decoy Set for Testing Energy Functions for Protein Structure Prediction. Proteins 2003, 53: 76–87. 10.1002/prot.10454
Baker D, Bystroff C, Fletterick RJ, Agard DA: PRISM: topologically constrained phased refinement for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 1993, 49: 429–39. 10.1107/S0907444993004032
Sun S: Reduced representation approach to protein tertiary structure prediction: statistical potential and simulated annealing. J Theor Biol 1995, 172: 13–32. 10.1006/jtbi.1995.0002
Canutescu AA, Shelenkov AA, Dunbrack RL: A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 2003, 12: 2001–2014. 10.1110/ps.03154503
Lazaridis T, Karplus M: Effective Energy Function for Proteins in Solution. Proteins 1999, 35: 133–152. 10.1002/(SICI)1097-0134(19990501)35:2<133::AID-PROT1>3.0.CO;2-N
Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci 1994, 3: 522–524.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 2002, 47: 228–235. 10.1002/prot.10082
Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins. Proteins 2002, 47: 142–153. 10.1002/prot.10069
Baldi PF, Pollastri G: The principled design of large-scale recursive neural network architectures–DAG-RNNs and the protein structure prediction problem. J Mach Learn Res 2003, 4: 575–602. 10.1162/153244304773936054
Pollastri G, Baldi P: Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 2002, 18: S62-S70.
Kortemme T, Morozov AV, Baker D: An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003, 326: 1239–1259. 10.1016/S0022-2836(03)00021-4
Neria E, Fischer S, Karplus M: Simulation of activation free energies in molecular systems. J Chem Phys 1996, 105: 1902–1921. 10.1063/1.472061
Skolnick J, Kolinski A, Ortiz AR: MONSSTER: A method for folding globular proteins with a small number of distance restraints. J Mol Biol 1997, 265: 217–241. 10.1006/jmbi.1996.0720
Privalov PL, Makhatadze GI: Contribution of hydration to protein folding thermodynamics II. The entropy and Gibbs energy of hydration. J Mol Biol 1993, 232: 660–679. 10.1006/jmbi.1993.1417

Acknowledgements

Work supported by NIH grant LM-07443-01, NSF grants EIA-0321390 and IIS-0513376, and a Microsoft Faculty Research Award to PFB.

Author information

Authors and Affiliations

School of Information and Computer Sciences, University of California, Irvine, CA, 92697, USA
Arlo Randall & Pierre Baldi
Institute for Genomics and Bioinformatics, University of California, Irvine, CA, 92697, USA
Arlo Randall & Pierre Baldi

Authors

Arlo Randall
Pierre Baldi

Corresponding author

Correspondence to Pierre Baldi.

Additional information

Authors' contributions

AR and PB designed the novel energy terms. AR implemented the methods and carried out the experiments. AR and PB authored the manuscript. Both authors approved the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Randall, A., Baldi, P. SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs. BMC Struct Biol 8, 52 (2008). https://doi.org/10.1186/1472-6807-8-52

Received: 26 June 2008
Accepted: 03 December 2008
Published: 03 December 2008
DOI: https://doi.org/10.1186/1472-6807-8-52
