SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs
BMC Structural Biology volume 8, Article number: 52 (2008)
Abstract
Background
Protein tertiary structure prediction is a fundamental problem in computational biology, and identifying the most native-like model from a set of predicted models is a key subproblem. Consensus methods work well when the redundant models in the set are the most native-like, but fail when the most native-like model is unique. In contrast, structure-based methods score models independently and can be applied to model sets of any size and redundancy level. Additionally, structure-based methods have a variety of important applications including analogous fold recognition, refinement of sequence-structure alignments, and de novo prediction. The purpose of this work was to develop a structure-based model selection method based on predicted structural features that could be applied successfully to any set of models.
Results
Here we introduce SELECTpro, a novel structure-based model selection method derived from an energy function comprising physical, statistical, and predicted structural terms. Novel and unique energy terms include predicted secondary structure, predicted solvent accessibility, predicted contact map, β-strand pairing, and side-chain hydrogen bonding.
SELECTpro participated in the new model quality assessment (QA) category in CASP7, submitting predictions for all 95 targets, and achieved top results. The average difference in GDT-TS between models ranked first by SELECTpro and the most native-like model was 5.07. This GDT-TS difference was less than 1% of the GDT-TS of the most native-like model for 18 targets, and less than 10% for 66 targets. SELECTpro also ranked the single most native-like model first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets, more often than any other method. Because the ranking metric is skewed by model redundancy and ignores poor models with a better ranking than the most native-like model, the BLUNDER metric is introduced to overcome these limitations. SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12,500 to 20,000 models for each protein, where it outperforms the benchmarked method (I-TASSER).
Conclusion
SELECTpro is an effective model selection method that scores models independently and is appropriate for use on any model set. SELECTpro is available for download as a stand-alone application at: http://www.igb.uci.edu/~baldig/selectpro.html. SELECTpro is also available as a public server at the same site.
Background
Selecting the most native-like model from a set of possible models is a crucial task in protein structure prediction. A variety of Model Quality Assessment Programs (MQAPs) have been developed that assign numeric scores to models in a set, and then use the scores to rank the models and ultimately select a single model. MQAP methods can be divided roughly into three categories based on the type of information they use: evolutionary methods use sequence or profile similarity between target sequence and template, consensus methods use similarity between models, and structure-based methods use model coordinates [1]. Each category of methods has inherent strengths and weaknesses.
Evolutionary methods can provide quality scores that have been shown to correlate with structural similarity to native [2]. However, for lower confidence alignments the scores do not correlate well with structural similarity. Furthermore, identification of the best template and specific alignment can be difficult. In addition, models built from multiple templates or template-free methods cannot be scored appropriately by evolutionary methods alone.
Consensus methods take advantage of the observation that similar models produced by different predictors tend to be more accurate than those that are structural outliers. In practice, consensus methods outperform the methods they draw from, and they rarely pick a very poor model. The disadvantage, however, is that when the best model is a structural outlier it will be overlooked for lack of popularity [1]. Also, consensus methods are not appropriate for selecting from small sets of structurally diverse models, especially in the extreme case of a two-model set.
While consensus methods depend on similarity between models, structure-based methods calculate scores on each model independently. For this reason, structure-based methods can be applied to model sets of any size and diversity, and will produce the same score for a model regardless of the other models in the set. Structure-based methods can also be used for template-free modeling [3–6] and model refinement procedures [7, 8]. One weakness of high resolution structure-based methods, including protein free energy approximation functions [9–12] and physics-based approaches [13, 14], is their sensitivity to local structural irregularities such as steric clashes and chain breaks, which can significantly bias scores on otherwise accurate models. Even slight differences in model backbones can produce significantly different scores [15]. Lower resolution structure-based methods, such as statistical potentials [6, 16, 17], are more robust to backbone variation, but are sensitive to extended low contact-order regions in the models.
Here we describe SELECTpro, a novel structure-based MQAP that combines high and low resolution energy terms into a model selection method that is effective on model sets of variable size, diversity, and target difficulty. Most of our assessment is calculated from the CASP7 model quality assessment category (QA) results published online [18]. The QA category provides a framework for the unbiased evaluation of MQAPs on ensembles of models produced by diverse automated prediction methods.
Results and discussion
We analyze the CASP7 quality assessment category predictions with a focus on the quality of the model ranked first by each predictor and the recovery of the most native-like model in the set. Only SetAll is used in the assessment of the quality of the model ranked first by each group (Table 1). The results are very similar when using SetComplete (data not shown) because QA groups rarely rank an incomplete model first.
The assessment of the recovery of the most native-like model is performed on both SetAll and SetComplete (Table 2) because the few cases where an incomplete model is the most native-like have a significant effect on the average recovery metrics of all QA groups. Incomplete and irregular models are especially challenging for structure-based methods. A comparison of the average Pearson Correlation on SetAll and SetComplete highlights these issues (Table 3). The frequency of recovering the most native-like model is calculated on SetComplete (Figure 1).
The utility of SELECTpro for selecting the best model from a small set is demonstrated by selecting from the five models submitted for each target by the top automated predictors. These small set selection results are calculated using SetAll (Figure 2). SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12,500 to 20,000 models for each protein and compared to I-TASSER (Figure 3).
To make fair comparisons to groups participating on only a subset of targets, common subset comparisons between SELECTpro and each of these groups are included in Tables 1 and 2. Only groups participating on at least half of the targets are included, and for groups with multiple submissions only the best one is shown. In the results tables any value that is better than SELECTpro is underlined.
For multiple domain targets, the sum of GDT-TS over all domains is used as the GDT-TS of the model. Since the QA predictions correspond to the entire structures, it is impossible to fairly assess the domains independently.
To assess the significance of the summary statistics compared in Table 1, Table 2, and Figure 2, we performed paired t-tests between SELECTpro and each other group on common subsets of targets (or targets and models when appropriate). All p-values from the tests appear in the tables and figure, but only statistically significant p-values (p < 0.05) are shown in bold.
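As an illustration of the paired comparison described above, the following minimal sketch computes the paired t statistic on a common subset of targets. The function name and the per-target values are hypothetical and are not taken from the CASP7 data; converting the statistic to a p-value additionally requires the Student t distribution with the returned degrees of freedom, which is omitted here.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on two samples.

    a[i] and b[i] must be measurements on the same target.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-target delta-GDT values for two methods on a common subset.
method_a = [2.0, 5.0, 1.0, 8.0, 4.0]
method_b = [3.0, 7.0, 2.0, 9.0, 6.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

A negative statistic here means the first method has the smaller (better) average difference on the shared targets.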
The following notations are used throughout the results section:

M_{max}: The model with the highest GDT-TS among all server models.

M_{QA1}: The model with the highest QA score.

N_{T}: The number of targets a group made valid predictions on.

N_{D}: The number of domains a group made valid predictions on.
The recovery of M_{max} by a QA predictor can only be evaluated if M_{max} was scored by the predictor. In most cases QA predictors did not provide scores for all available server models, and frequently there is no score for M_{max}. For example, predictor 016_1 (AMBER/PB) made submissions on 86 targets, but M_{max} is only scored for 53 of these targets, so only these targets (N_{T} = 53) can be evaluated for this predictor.
Quality of Model Ranked First (M_{QA1}) Relative to Most Native-Like Model (M_{max})
In this section on the assessment of the model ranked first, and the corresponding Table 1, we use the following three metrics:

ΔGDT_{QA1} = GDT-TS(M_{max}) - GDT-TS(M_{QA1}): The GDT-TS difference between M_{max} and M_{QA1} measures how much is lost by selecting M_{QA1} rather than M_{max} for a single target.

\overline{ΔGDT_{QA1}} = ΣΔGDT_{QA1}/N_{D}: The average ΔGDT_{QA1} is a simple way of assessing the quality of M_{QA1} over all targets.

ΔGDT_{QA1%} = ΔGDT_{QA1}/GDT-TS(M_{max}): The GDT-TS difference percentage allows for comparison across targets with different numbers of domains and difficulty levels.
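The three metrics can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary layout, and the example values are hypothetical, and it assumes that a higher QA score means a better predicted model.

```python
def delta_gdt_qa1(gdt_ts, qa_score):
    """Return (delta_gdt_qa1, delta_gdt_qa1_fraction) for one target.

    gdt_ts:   dict model_id -> GDT-TS for all server models of the target
    qa_score: dict model_id -> QA score for the models a predictor scored
    """
    gdt_max = max(gdt_ts.values())               # GDT-TS of M_max
    m_qa1 = max(qa_score, key=qa_score.get)      # model ranked first (M_QA1)
    delta = gdt_max - gdt_ts[m_qa1]
    return delta, delta / gdt_max


def average_delta(deltas, n_domains):
    """Average delta-GDT, normalized by the number of domains (N_D)."""
    return sum(deltas) / n_domains


# Hypothetical single-domain target with three server models.
gdt = {"A": 60.0, "B": 55.0, "C": 30.0}
qa = {"A": 0.5, "B": 0.7, "C": 0.2}
delta, fraction = delta_gdt_qa1(gdt, qa)   # model B is ranked first
```

In this toy case the predictor loses 5 GDT-TS units by choosing model B, which is about 8% of the GDT-TS of M_{max}.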
The columns of Table 1 are: (1) group number; (2) number of targets the group made predictions on; (3) number of targets such that ΔGDT_{QA1} = 0; (4) number of targets such that ΔGDT_{QA1%} < 1%; (5) number of targets such that ΔGDT_{QA1%} < 10%; and (6) \overline{ΔGDT_{QA1}}. The common subset results section has an additional column for the p-value of the paired t-test using ΔGDT_{QA1}. The rows are sorted first by the number of targets and then by \overline{ΔGDT_{QA1}}. Of the groups participating on all 95 targets, SELECTpro has the lowest average ΔGDT_{QA1}, with a value of 5.07, followed closely by group 713_1 (CircleQA), with a value of 5.44. Predictor 038_1 (GeneSilico) has an average ΔGDT_{QA1} of 5.75, with predictions on 85 targets. In common subset comparisons with these two groups SELECTpro is not significantly better, with p-values of 0.25 and 0.12 respectively. In common subset comparisons with all remaining groups SELECTpro is significantly better.
Another way to assess the quality of M_{QA1} over many targets is to count the number of targets such that M_{QA1} is the best model, or nearly the best, in the set. A method that performs very well on most targets, but very poorly on a few, would still be recognized by this criterion. SELECTpro recovers the best model for 12 targets, selects a model with ΔGDT_{QA1%} < 1% for 18 targets, and selects a model with ΔGDT_{QA1%} < 10% for 66 targets. Group 091_1 (MaOPUS) also performs well, with 11, 18, and 61 targets in the respective categories. In common subset comparisons, the only result better than SELECTpro's is that of predictor 038_1 (GeneSilico), which has 60 targets with ΔGDT_{QA1%} < 10% on its 85-target subset (58 for SELECTpro).
The BLUNDER Measure: Recovery of M_{max}
How well does a QA predictor recover M_{max}? The traditional metric to assess M_{max} recovery is the rank of M_{max}, and the average rank over many targets (\overline{rank}). While rank captures some important information, it ignores the redundancy of models and the quality of models ranked better than M_{max}. Consider the following hypothetical situation: group A ranks M_{max} 10^{th} and all nine models ranked above it are redundant with ΔGDT of ~2.0, while group B ranks M_{max} 5^{th} and the four models ranked above it are diverse with a ΔGDT between 10.0 and 20.0. Which group has done a better job of recovering M_{max}? In this example, the rank metric favors group B, although group A ranks only a single redundant model above M_{max}. In addition, the models ranked better than M_{max} by group A have only slightly lower GDT-TS than M_{max}, while the models ranked better than M_{max} by group B are significantly worse than M_{max}. To address these weaknesses of the rank metric, we introduce the BLUNDER metric, which focuses on the worst model ranked better than M_{max} (the most embarrassing blunder). This measure is not affected by model redundancy and measures the quality of models ranked above M_{max}. The BLUNDER metric is defined using the following notation, and used in the assessment of the recovery of M_{max} and the corresponding Table 2 and Figure 1:

M_{BLUNDER}: The model with the minimum GDT-TS among models ranked better than M_{max}.

ΔGDT_{BLUNDER} = GDT-TS(M_{max}) - GDT-TS(M_{BLUNDER}): The GDT-TS difference between M_{max} and M_{BLUNDER} measures the size of the worst blunder.

\overline{ΔGDT_{BLUNDER}} = ΣΔGDT_{BLUNDER}/N_{D}: The average ΔGDT_{BLUNDER} measures how well a method robustly recovers M_{max} over many targets.

ΔGDT_{BLUNDER%} = ΔGDT_{BLUNDER}/GDT-TS(M_{max}): The ΔGDT_{BLUNDER} percentage allows for comparison across targets with different numbers of domains and difficulty levels.
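The hypothetical comparison of groups A and B can be worked through with a short sketch of the rank and BLUNDER measures. The function name, the data layout, and all numbers are illustrative assumptions; in particular, the sketch assumes that higher QA scores are better, that ties keep input order, and that ΔGDT_{BLUNDER} is taken as 0 when M_{max} is ranked first.

```python
def rank_and_blunder(gdt_ts, qa_score):
    """Return (rank of M_max, delta_gdt_blunder, delta_gdt_blunder_fraction).

    gdt_ts:   dict model_id -> GDT-TS
    qa_score: dict model_id -> QA score; must include M_max
    """
    m_max = max(gdt_ts, key=gdt_ts.get)
    if m_max not in qa_score:
        raise ValueError("M_max was not scored; recovery cannot be evaluated")
    ranked = sorted(qa_score, key=qa_score.get, reverse=True)
    above = ranked[:ranked.index(m_max)]      # models ranked better than M_max
    rank = len(above) + 1
    if not above:                             # assumption: no blunder possible
        return rank, 0.0, 0.0
    worst = min(gdt_ts[m] for m in above)     # GDT-TS of M_BLUNDER
    delta = gdt_ts[m_max] - worst
    return rank, delta, delta / gdt_ts[m_max]


# Group A: nine redundant models (delta-GDT of 2.0) ranked above M_max.
gdt_a = {"best": 70.0, **{f"dup{i}": 68.0 for i in range(9)}}
qa_a = {"best": 0.10, **{f"dup{i}": 0.90 - 0.05 * i for i in range(9)}}

# Group B: four diverse, much worse models ranked above M_max.
gdt_b = {"best": 70.0, "m1": 60.0, "m2": 55.0, "m3": 52.0, "m4": 50.0}
qa_b = {"best": 0.5, "m1": 0.9, "m2": 0.8, "m3": 0.7, "m4": 0.6}

result_a = rank_and_blunder(gdt_a, qa_a)
result_b = rank_and_blunder(gdt_b, qa_b)
```

With these toy numbers, group A has the worse rank (10 versus 5) but the much smaller blunder (2.0 versus 20.0 GDT-TS units), which is the distinction the BLUNDER metric is meant to expose.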
Figure 1 contains graphs of the frequency of recovering M_{max} using the rank (A) and ΔGDT_{BLUNDER%} (B) measures on SetComplete. SELECTpro ranks M_{max} first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets. SELECTpro's ΔGDT_{BLUNDER%} values are less than 10% of GDT-TS(M_{max}) for 40 targets and less than 20% for 63 targets. These results are best among all QA participants. The average M_{max} recovery results are summarized in Table 2. The results columns are (1) average rank (\overline{rank}) and (2) average ΔGDT_{BLUNDER} (\overline{ΔGDT_{BLUNDER}}) on SetAll and SetComplete. The common subset results section also includes a column for the p-value of a paired t-test using ΔGDT_{BLUNDER} (p-value). Rows are sorted separately for each dataset by N_{T} first and then \overline{ΔGDT_{BLUNDER}}. On SetComplete SELECTpro has a \overline{ΔGDT_{BLUNDER}} of 10.4. In common subset comparisons one group has a lower \overline{rank}: group 091_1 (MaOPUS) with \overline{rank} of 16.8 on 94 targets compared to 17.4 for SELECTpro. On SetAll SELECTpro did not submit a score for M_{max} of target T0356 (HHpred2_TS1) due to a processing error. In order to make complete common subset comparisons when possible we added in the SELECTpro score for HHpred2_TS1. SELECTpro ranks it 86^{th}, with ΔGDT_{BLUNDER} = 50.0. Both results are significantly worse than the SELECTpro averages.
Pearson Correlation for Individual Proteins
The assessor evaluation of the quality assessment category [18] focused on the Pearson Correlation between the QA scores and GDT-TS. Here we use the Pearson Correlation only to highlight some of the difficulties for structure-based methods in dealing with incomplete models, as well as basic non-protein-like structural features. Approximately half of the models in SetAll are incomplete, with backbone coordinates missing for one or more residues.
Incomplete models present a challenge to SELECTpro and other structure-based methods because the scores for each model are only comparable when calculated on coordinates for the same set of residues. Another issue is that some complete models have severe chain breaks, severe steric clashes, or significant portions modeled only as extended chains. These local problems can overwhelm the energy of what may otherwise be a good model. Consensus methods do not suffer from these local structure problems. Given this rationale, one would expect structure-based methods to see the most improvement in terms of average Pearson Correlation on SetComplete relative to SetAll. Table 3 shows the average Pearson Correlation of five selected groups. Predictors 713_1 (CircleQA), 633_1 (ProQ), and SELECTpro are structure-based MQAPs, while 634_1 (Pcons) is a consensus method and 556_1 (LEE) scored structures based on the GDT-TS similarity to their human Model 1 CASP7 prediction [18]. As expected, the structure-based MQAPs improve more than the structural similarity-based methods. The even greater increase in Pearson Correlation for SELECTpro can be accounted for by the failure to generate appropriate complete models for some of the incomplete models, resulting in QA scores calculated on extended chains.
Reranking Top Server Group Models
Predictors in CASP may submit up to five models, but CASP evaluation focuses on the model designated as Model 1. Clearly, the selection of Model 1 is critical in the CASP setting and for protein structure prediction in general. Figure 2 contains the results when SELECTpro is used to rerank the five models submitted by each of the top ten servers from CASP7, compared to each server's results. In the following assessment M_{maxg} is the model with the highest GDT-TS of the five models submitted by a server. Figure 2 (A) shows that SELECTpro recovers M_{maxg} more frequently than 8 of the top 10 server groups; in addition, when SELECTpro is used to select Model 1 the average GDT-TS increases for 7 of 10 server groups; however, the increase is only statistically significant for 3 groups. SELECTpro improves using both criteria for the top 3 server groups (Zhang-Server, Pmodeller6, and ROBETTA). These results highlight the utility of SELECTpro for the task of model selection. The comparisons made here are fair because structure-based methods can be applied in the server setting to any number of models.
Large Decoy Set Model Selection
Here we analyze SELECTpro's model selection capability on the large decoy sets for 16 small proteins from a recent I-TASSER benchmark set [19]. The I-TASSER prediction method generates 12,500 to 20,000 different backbone conformations. The complete decoy sets can be downloaded from [20]. The consensus method SPICKER [21] is used to cluster the models and a centroid model is built from the first cluster. A second round of simulation resolves the steric clashes in the centroid model and results in the final predicted model. The centroid model and final model are not part of the decoy set. In order to make a fair model selection comparison, the decoy model closest to the centroid is used as I-TASSER's M_{QA1}.
On the benchmark set SELECTpro has an average GDT-TS of 63.7, while I-TASSER has an average GDT-TS of 62.1. SELECTpro's average ΔGDT_{QA1} is 9.2 and I-TASSER's average ΔGDT_{QA1} is 10.7. Figure 3 displays the GDT-TS results for the individual proteins in the benchmark set. Different symbols are used to indicate the GDT-TS of M_{max} (□), the GDT-TS of SELECTpro's M_{QA1} (×), and the GDT-TS of I-TASSER's M_{QA1} (+) for each protein. A paired t-test of the hypothesis that SELECTpro and I-TASSER's mean performance are equal produces a p-value of 0.19, which is not statistically significant, but does give some evidence that SELECTpro can select a very good model from a large set of decoys at least as well as an established method that utilizes consensus methods.
Conclusion
An MQAP that can select the most native-like model from a set of possibilities has a variety of applications in protein structure prediction. The new quality assessment category introduced in CASP7 allows for the unbiased assessment of MQAPs on the models produced by automated predictors. This category allows researchers to focus on the model scoring aspect of protein structure prediction.
The results presented in this work demonstrate that SELECTpro, a structure-based model selection method, consistently selects one of the best models from the large diverse sets of models produced by automated predictors, across all levels of target difficulty. On these large diverse sets of models, SELECTpro also recovers the single most native-like model well compared to other methods. On the small sets of five models submitted for each target by the top automated predictors, in most cases SELECTpro selects better models than the predictors themselves.
Since SELECTpro and other structure-based methods score models independently, they can be incorporated into the model selection pipelines of individual protein structure prediction servers. For this reason, it may help predictors if the CASP organizers distinguished methods that score models independently from those that do not.
Consensus and structure-based methods can be combined to achieve improved results. For example, the meta-server method Pmodeller [22] combines consensus (Pcons [23]) and structure-based methods (ProQ [24]) to predict protein structures more accurately than either method in isolation. The assessment of the QA category by CASP assessors recognized the consensus method Pcons (group 634_1) for the high Pearson Correlation between their scores and model GDT-TS on most targets [18]. In their own assessment the authors of Pcons recognized that while consensus methods perform well in most cases, "when most of the models are incorrect and the few correct models are outliers a consensus based approach cannot be expected to make an optimal choice." [1] For instance, they identified three particular targets in CASP7 where their consensus method failed: T0283, T0350, and T0351 [1]. The Pcons average ΔGDT_{QA1} on these three targets is 30.8. The same research group's structure-based method ProQ (group 633_1) has an average ΔGDT_{QA1} of 17.2. In contrast, on these three targets SELECTpro has an average ΔGDT_{QA1} of only 7.1. This example highlights the potential of combining SELECTpro with existing model selection methods.
SELECTpro has been made publicly available as a server, where users may submit from 2 to 100 models for evaluation. In addition to the global confidence scores, the scores of individual energy terms are also returned to the user by email for each model submitted. SELECTpro is one of several protein structure tools in the SCRATCH suite of predictors [25], and is available through: http://www.igb.uci.edu/~baldig/selectpro.html.
Methods
Datasets
All of the comparative analysis in this work is performed on the server models and quality assessment predictions submitted in the CASP7 [26] experiment. The CASP QA experiment is particularly relevant for the evaluation of model selection methods for several reasons: (1) the QA predictors were blind to the true structures at the time of prediction, making it impossible for methods to be tuned to improve results; (2) the set of proteins is diverse: the 95 targets range in size from 68 to 530 amino acids, come from a variety of organisms, and span the full range of prediction difficulty; (3) each target has more than 200 predicted models that contain the types of errors that occur in automated structure prediction; (4) the protein set is not selected by any of the participating QA groups; (5) the models are scored by a variety of methods and the results are publicly available. We perform analysis on the set of all models (SetAll) and a subset of models (SetComplete) that are complete and free of gross structural irregularities, as described below. All of the ABIpro models and some of the 3Dpro models were optimized using the exact energy function of SELECTpro. These models are removed because of the obvious bias towards these models. In recent CASP experiments the GDT-TS [27] has been used as the primary automatic structural similarity measure. The published GDT-TS values from the CASP7 website are the only structural similarity measure used in this work.
SetAll
The SetAll dataset consists of the server models with a GDT-TS value published on the CASP7 website, a total of 23,423 models. To calculate a score on a protein model SELECTpro requires the backbone coordinates (N, C_{α}, C) for all model residues as input. A total of 8,812 models in SetAll have only a C_{α} trace or have no coordinates for one or more residues. Modeller8v1 [28–30] was used to generate complete models from the incomplete ones, and then the complete models were scored by SELECTpro. In most cases the complete models were built appropriately from the incomplete models; however, in some cases the final model was a fully extended chain due to an error in our application of Modeller. We failed to identify this problem until after the completion of the CASP7 competition. The SELECTpro scores versus GDT-TS scores for all models of target T0305 are displayed in plot A of Figure 4. The circled outliers with very low confidence scores and high GDT-TS scores are models that were incomplete and for which the complete models generated by Modeller were fully extended chains. The Pearson correlation on the set of all models for T0305 is 0.641. The SELECTpro scores versus GDT-TS scores for complete models only are displayed in plot B of Figure 4, and the Pearson correlation is 0.966.
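The effect of such outliers on the Pearson correlation can be illustrated with a small sketch. The scores and GDT-TS values below are invented for illustration and are not the T0305 data; the point is only that a single good model with a very low score lowers the correlation noticeably.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical QA scores and GDT-TS values for complete models only.
scores_complete = [0.2, 0.4, 0.6, 0.8]
gdt_complete = [20.0, 40.0, 60.0, 80.0]

# Same set plus one model rebuilt as an extended chain:
# high GDT-TS (from the original coordinates) but a very low score.
scores_all = scores_complete + [0.05]
gdt_all = gdt_complete + [75.0]

r_complete = pearson(scores_complete, gdt_complete)
r_all = pearson(scores_all, gdt_all)
```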
SetComplete
The scores produced by SELECTpro are comparable on complete models of the same sequence. There is no standard for the handling of incomplete models and we assume that participating groups took a variety of approaches. Using only complete models ensures that the MQAP scores are calculated from the same coordinates. Thus, the models retained in SetComplete are screened first for completeness. Models missing backbone coordinates for one or more residues are removed. This leaves 14,611 models.
Structure-based MQAPs are susceptible to local structural irregularities in models, and will tend to score such models poorly. This is why methods developed to select near-native models from sets of decoys remove such models from consideration [31]. We apply additional filters (described below) for C_{α}-C_{α} clashes, C_{α}-C_{α} chain breaks, and expanded termini to remove an additional 1,217 models leaving 13,494 more plausible models in SetComplete.
The C_{α}-C_{α} clash model filter enforces a squared difference penalty for C_{α}-C_{α} distances less than 3.6 Å. The distance between the C_{α} atoms of residue i and j is denoted by r_{i,Cα,j,Cα} and N is the protein length. The constant 13.52 in the threshold below corresponds to two severe clashes where r_{i,Cα,j,Cα} = 1.0 Å. Models with a sum of squared differences greater than 13.52 per 100 residues are filtered out.
The C_{α}-C_{α} chain break model filter enforces a squared difference penalty for r_{i,Cα,i+1,Cα} distances greater than 4.0 Å. The constant 16.0 in the threshold below corresponds to a single chain break where r_{i,Cα,i+1,Cα} = 8.0 Å. Models with a sum of squared differences greater than 16.0 per 100 residues are filtered out.
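The two filters can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the clash sum is assumed to run over all residue pairs i < j, and "per 100 residues" is interpreted as scaling each threshold by N/100. The constants agree with the text: 2 × (3.6 − 1.0)² = 13.52 and (8.0 − 4.0)² = 16.0.

```python
import math


def ca_clash_penalty(ca, cutoff=3.6):
    """Sum of squared differences for C-alpha pairs closer than `cutoff` (angstroms)."""
    total = 0.0
    for i in range(len(ca)):
        for j in range(i + 1, len(ca)):   # assumption: all pairs i < j
            r = math.dist(ca[i], ca[j])
            if r < cutoff:
                total += (cutoff - r) ** 2
    return total


def ca_break_penalty(ca, cutoff=4.0):
    """Sum of squared differences for consecutive C-alpha atoms farther than `cutoff`."""
    total = 0.0
    for a, b in zip(ca, ca[1:]):
        r = math.dist(a, b)
        if r > cutoff:
            total += (r - cutoff) ** 2
    return total


def passes_ca_filters(ca):
    """True if the model survives both filters; thresholds scale with N/100."""
    scale = len(ca) / 100.0               # assumption: "per 100 residues"
    return (ca_clash_penalty(ca) <= 13.52 * scale
            and ca_break_penalty(ca) <= 16.0 * scale)


# Hypothetical 100-residue straight chain with ideal 3.8 angstrom spacing.
chain = [(3.8 * i, 0.0, 0.0) for i in range(100)]
# Same chain with a single 8.8 angstrom gap after residue 50.
broken = chain[:50] + [(x + 5.0, y, z) for x, y, z in chain[50:]]
```

`math.dist` requires Python 3.8 or later. The pairwise loop is quadratic in chain length, which is acceptable for a sketch.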
The expanded termini filter removes models where a large portion of the structure is modeled as expanded chain with no non-local interactions. The screening procedure is: scan from the N-terminus until three consecutive residues have a contact number of at least 10, and repeat from the C-terminus. The contact number of a residue is defined here as the number of other C_{β} atoms within 10 Å of the residue's C_{β} [3]. If the sum of low contact number termini residues is at least 20% of N, the model is filtered out.
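The screening procedure can be sketched as below. The function names are hypothetical, and the sketch assumes that the residues counted at each terminus are those scanned before the first window of three consecutive well-packed residues, and that the two terminal counts are capped at N when they overlap.

```python
import math


def contact_numbers(cb, radius=10.0):
    """Number of other C-beta atoms within `radius` angstroms of each C-beta."""
    counts = [0] * len(cb)
    for i in range(len(cb)):
        for j in range(i + 1, len(cb)):
            if math.dist(cb[i], cb[j]) < radius:
                counts[i] += 1
                counts[j] += 1
    return counts


def low_contact_prefix(counts, min_contacts=10, run=3):
    """Residues scanned before the first `run` consecutive well-packed residues."""
    for start in range(len(counts) - run + 1):
        if all(c >= min_contacts for c in counts[start:start + run]):
            return start
    return len(counts)                    # no well-packed window found


def has_expanded_termini(cb, fraction=0.20):
    """True if low-contact terminal residues make up at least `fraction` of N."""
    counts = contact_numbers(cb)
    n_term = low_contact_prefix(counts)
    c_term = low_contact_prefix(counts[::-1])
    return min(n_term + c_term, len(cb)) >= fraction * len(cb)


# Hypothetical fully extended 50-residue chain (3.8 angstrom spacing).
extended = [(3.8 * i, 0.0, 0.0) for i in range(50)]
# Hypothetical compact arrangement: 4 x 4 x 4 grid with 3 angstrom spacing.
compact = [(3.0 * x, 3.0 * y, 3.0 * z)
           for x in range(4) for y in range(4) for z in range(4)]
```

A fully extended chain has at most four neighbors within 10 Å of any residue, so it is filtered out, while every point in the compact grid has well over ten.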
Model Representations
Reduced representation
In the reduced representation the heavy backbone atoms, carbonyl oxygen, amide hydrogen (N, C_{α}, C, O, H), and C_{β} are represented explicitly. For glycine residues a pseudo C_{β} is calculated. The side-chain atoms are represented by a single united point (centroid) [32, 33]. The centroid is calculated as the mean of the position of the heavy side-chain atoms. For glycine and alanine the centroid (CT) is set to the C_{β} atom. Only the heavy backbone atoms (N, C_{α}, C) are used as input to SELECTpro and the positions of additional atoms and centroids are calculated from these.
All heavy-atom representation
In the all heavy-atom representation the centroid is removed and the heavy side-chain atoms are represented explicitly. The side chains are initially placed onto the backbone of the reduced representation in their most likely conformation according to the SCWRL backbone-dependent rotamer library [34]. The side-chain placements are then optimized using the SELECTpro all-atom energy terms (described below) in conjunction with the rotamer library.
Energy Functions Overview
E_{REDUCED} is the combined energy calculated from the reduced representation. E_{REDUCED} is a linear combination of predicted (E_{PREDSS}, E_{PREDACC}, E_{PREDCM}), physical (E_{VDWREP}), and statistical (E_{CTREP}, E_{STATENV}, E_{STATPWCI}, E_{STATPWCD}, E_{ROG}) terms:
E_{REDUCED} = w_{1}E_{PREDSS} + w_{2}E_{PREDACC} + w_{3}E_{PREDCM} + w_{4}E_{BETA} + w_{5}E_{VDWREP} + w_{6}E_{CTREP} + w_{7}E_{STATENV} + w_{8}E_{STATPWCI} + w_{9}E_{STATPWCD} + w_{10}E_{ROG}
E_{ALLATOM} consists of the energy terms that depend on the all heavy-atom representation. E_{ALLATOM} is a linear combination of the following physical terms:
E_{ALLATOM} = w_{11}E_{SCHB} + w_{12}E_{LENJONES} + w_{13}E_{SOLVATION} + w_{14}E_{ELECTRO}
E_{FINAL} is the sum of E_{REDUCED} and E_{ALLATOM}, and is used for the final scoring of models by SELECTpro. The individual energy terms are outlined briefly below, and the detailed descriptions of the novel terms follow in the remainder of this section. Underlined terms are adapted from previously described energy terms; their details are included in the Appendix.
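The structure of the combined energy can be sketched as a pair of weighted sums. The term names follow the equations above, but the weights and term values below are placeholders chosen for illustration; they are not SELECTpro's actual parameters.

```python
REDUCED_TERMS = ["PREDSS", "PREDACC", "PREDCM", "BETA", "VDWREP",
                 "CTREP", "STATENV", "STATPWCI", "STATPWCD", "ROG"]
ALLATOM_TERMS = ["SCHB", "LENJONES", "SOLVATION", "ELECTRO"]


def linear_energy(term_names, values, weights):
    """Weighted sum of the named energy terms."""
    return sum(weights[name] * values[name] for name in term_names)


def e_final(values, weights):
    """E_FINAL = E_REDUCED + E_ALLATOM for one model."""
    e_reduced = linear_energy(REDUCED_TERMS, values, weights)
    e_allatom = linear_energy(ALLATOM_TERMS, values, weights)
    return e_reduced + e_allatom


# Placeholder inputs: every term has value 1.0; reduced-representation
# weights are 1.0 and all-atom weights are 2.0 (illustrative only).
values = {name: 1.0 for name in REDUCED_TERMS + ALLATOM_TERMS}
weights = {name: 1.0 for name in REDUCED_TERMS}
weights.update({name: 2.0 for name in ALLATOM_TERMS})

energy = e_final(values, weights)   # 10 * 1.0 + 4 * 2.0
```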
Parameter Weights
The parameter weights were determined by repeatedly varying individual weights and maximizing the sum of the GDT-TS of the lowest E_{FINAL} models on a training set built from CASP6 protein domains. For each CASP6 protein domain a set of 500 decoy models was generated using fragment assembly with the RMSD to native as the dominant term in the objective function [3].
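One way to read this procedure is as a coordinate-wise search over the weights. The sketch below is an assumption about the general shape of such a search, not the authors' procedure: the multiplicative step factors, the number of sweeps, the function names, and the two-term toy training set are all hypothetical.

```python
def lowest_energy_gdt(decoys, weights):
    """GDT-TS of the decoy with the lowest weighted energy."""
    best = min(decoys,
               key=lambda d: sum(w * t for w, t in zip(weights, d["terms"])))
    return best["gdt_ts"]


def objective(training_set, weights):
    """Sum over domains of the GDT-TS of the lowest-energy decoy."""
    return sum(lowest_energy_gdt(decoys, weights) for decoys in training_set)


def tune_weights(training_set, weights, factors=(0.5, 2.0), sweeps=5):
    """Vary one weight at a time, keeping changes that raise the objective."""
    weights = list(weights)
    best = objective(training_set, weights)
    for _ in range(sweeps):
        improved = False
        for k in range(len(weights)):
            for f in factors:
                trial = weights.copy()
                trial[k] = weights[k] * f
                score = objective(training_set, trial)
                if score > best:
                    weights, best, improved = trial, score, True
        if not improved:
            break
    return weights, best


# Toy training set: one domain, two decoys, two energy terms per decoy.
training_set = [[
    {"terms": (1.0, 5.0), "gdt_ts": 80.0},   # good model, penalized by term 2
    {"terms": (4.0, 1.0), "gdt_ts": 40.0},   # poor model, penalized by term 1
]]
tuned, score = tune_weights(training_set, [1.0, 1.0])
```

With equal starting weights the poor decoy has the lower energy; doubling the first weight reverses the ordering, so the search settles on weights that select the better decoy.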
E_{REDUCED}
E_{PREDSS}: predicted secondary structure
E_{PREDACC}: predicted solvent accessibility
E_{PREDCM}: predicted contact map
E_{BETA}: sheet formation
E_{BBREP}: backbone repulsion
E_{CTREP}: centroid repulsion
E_{STATENV}: residue environment potential [3]
E_{STATPWCI}: context independent pairwise potential [3, 16]
E_{STATPWCD}: context dependent pairwise potential [6]
E_{ROG}: compactness
E_{ALLATOM}
E_{SCHB}: side-chain hydrogen bonding
E_{LENJONES}: van der Waals forces [10]
E_{SOLVATION}: solvation effects [35]
E_{ELECTRO}: electrostatic interactions
Throughout this work, all capital letters refer to global energy and all lower case letters refer to local energy. For instance, E_{PREDCM} refers to the global contact map energy and E_{predcm}(i,j) refers to the contact map energy between residues i and j.
Parameter notation used in energy equations
Model variables
r_{i,x,j,y}: distance between atom x of residue i and atom y of residue j
r_{x,y}: distance between atom x and atom y
v_{i,x,j,y}: vector from atom x of residue i to atom y of residue j
u_{i,x,j,y}: unit vector calculated from v_{i,x,j,y}
N_{i}: number of residues in contact with residue i, with contact defined as r_{i,Cβ,j,Cβ} < 10 Å
phi_{i}: Phi angle of residue i
psi_{i}: Psi angle of residue i
Protein specific input parameters
aa_{i}: amino acid type of residue i
ss_{i}: predicted secondary structure of residue i (H, E, C)
acc_{i}: predicted solvent accessibility of residue i ('e': exposed, '-': buried)
cmap_{i,j}: predicted contact/non-contact between residues i and j, with contact defined as r_{i,Cα,j,Cα} < 12 Å
Protein independent parameters
I_{value}: ideal parameter value for a given calculation
σ_{value}: standard deviation value for a given calculation
vdw_{ x }: van der Waals radius of atom x
vdw_{x+y}: vdw_{ x }+ vdw_{ y }
Î©_{statenv}: precalculated statistics for use in E_{STATENV}
Î©_{statpwoi}: precalculated statistics for use in E_{STATPWCI}
Î©_{statpwod}: precalculated statistics for use in E_{STATPWCD}
D_{min,pwod}: minimum interaction distance for centroid pairs used in E_{STATPWCD}
D_{max,pwod}: maximum interaction distance for centroid pairs used in E_{STATPWCD}
D_{minCT}: minimum distances between centroids of amino acid pairs observed in pdb_select25 [36].
Reduced Representation Energy Term Details
The details of how the novel reduced representation energy terms are calculated are presented in this section. The predicted structural terms E_{PREDSS}, E_{PREDACC}, and E_{PREDCM} and the β-strand pairing term, E_{BETA}, are novel and unique to SELECTpro. Additional reduced representation terms are adapted from previously published work and their details are included in the Appendix.
Predicted structural features overview
The structural feature predictions used in E_{PREDSS}, E_{PREDACC}, and E_{PREDCM} come from the SCRATCH suite of predictors [25]. Each predictor is trained in a supervised fashion using curated non-redundant datasets extracted from the PDB [37]. The secondary structure (SSpro [38]) and solvent accessibility (ACCpro [39]) predictors use ensembles of 1D-RNN (one-dimensional recursive neural network) architectures [40]. The contact map predictor (CMAPpro [41]) uses ensembles of 2D-RNN architectures [40].
E_{PREDSS}: predicted secondary structure
The predicted secondary structure term E_{PREDSS} penalizes deviation of the torsion angles from the torsion angle parameters for helices and strands predicted by SSpro. There is no penalty for predicted coils. The parameter values for helix residues are: I_{Hφ} = −65.3, σ_{Hφ} = 11.9, I_{Hψ} = −39.4, σ_{Hψ} = 11.3. The parameter values for strand residues are: I_{Eφ} = −135.0, σ_{Eφ} = 15.0, I_{Eψ} = 135.0, σ_{Eψ} = 15.0. Only torsion angles that are more than two standard deviations from the ideal are penalized, with the penalty defined as follows:
The definition of E_{predstrand}(j) is equivalent to E_{predhelix}(i), but with I_{Eφ}, σ_{Eφ}, I_{Eψ}, and σ_{Eψ} in place of the corresponding helical values.
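The two-standard-deviation tolerance can be illustrated with a minimal Python sketch. The shape of the penalty beyond the tolerance (linear in the excess deviation) is an assumption made for illustration, as are the function names and the negative signs on the ideal angles, which follow the usual convention for right-handed helices and β-strands.

```python
# Ideal torsion angles and standard deviations (degrees) quoted in the text.
# The signs follow the usual right-handed helix / beta-strand convention (assumption).
HELIX = {"phi": (-65.3, 11.9), "psi": (-39.4, 11.3)}
STRAND = {"phi": (-135.0, 15.0), "psi": (135.0, 15.0)}


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def torsion_penalty(angle, ideal, sigma, free_sd=2.0):
    """Assumed form: zero within free_sd standard deviations of the ideal,
    then linear in the excess deviation (measured in standard deviations)."""
    z = angle_diff(angle, ideal) / sigma
    return max(0.0, z - free_sd)


def e_predss_local(ss, phi, psi):
    """Penalty for one residue given its predicted class ss in {'H', 'E', 'C'}."""
    if ss == "H":
        params = HELIX
    elif ss == "E":
        params = STRAND
    else:
        return 0.0  # predicted coil: no penalty
    return (torsion_penalty(phi, *params["phi"])
            + torsion_penalty(psi, *params["psi"]))
```

A residue predicted as helix with torsion angles at the helical ideals scores zero, while the same residue with a strand-like φ is penalized.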
E_{PREDACC}: predicted solvent accessibility
The solvent accessibility predictor ACCpro predicts the percent of solvent accessibility in 5% increments for each residue. Using 25% exposure as a binary threshold, the accuracy of the predictor is ~77% [39]. The binary exposure ('e')/burial ('-') prediction is used as the predicted solvent accessibility for E_{PREDACC}. In the reduced representation, the solvent accessibility of residue i is estimated by its contact number (N_{i}), where N_{i} > 16 is considered buried [3]. If the predicted status of a residue is not realized in the model, the penalty is calculated as:
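The exact penalty expression is not shown in this text, so the sketch below only illustrates the preceding step: computing contact numbers from C_β coordinates and finding residues whose predicted status is not realized in the model. Function names and the coordinate representation are illustrative assumptions; any label other than 'e' is treated as buried.

```python
import math


def contact_numbers(cb_coords, cutoff=10.0):
    """N_i: number of other residues whose C-beta lies within cutoff angstroms."""
    n = len(cb_coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def accessibility_mismatches(pred_acc, counts, buried_threshold=16):
    """Indices of residues whose predicted status ('e' = exposed, otherwise
    buried) disagrees with the model, where N_i > 16 counts as buried."""
    mismatches = []
    for i, (label, n_i) in enumerate(zip(pred_acc, counts)):
        model_buried = n_i > buried_threshold
        predicted_buried = label != "e"
        if model_buried != predicted_buried:
            mismatches.append(i)
    return mismatches
```

Each index returned by `accessibility_mismatches` is a residue that would contribute to E_{PREDACC}.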
E_{PREDCM}: predicted contact map
The contact map predictor CMAPpro predicts the probability of contact or non-contact between C_{α} atoms, with a contact threshold of 12 Å. The strategy utilized to infer predicted contacts from the probability matrix [41] results in maps that are sparse when compared to those of real proteins; thus, unrealized contacts are penalized while non-contacts are not. The constant 1.0 is added to the penalty to ensure that all unrealized contacts make a significant contribution to E_{PREDCM}.
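A minimal sketch of this one-sided treatment follows. Only the 12 Å threshold, the constant 1.0, and the rule that non-contacts go unpenalized come from the text; the distance-dependent part of the penalty (the excess distance beyond the threshold) and all names are assumptions made for illustration.

```python
def e_predcm(pred_contacts, ca_dist, threshold=12.0):
    """Sum penalties over predicted contacts that are not realized in the model.

    pred_contacts: iterable of (i, j) residue index pairs predicted in contact.
    ca_dist: callable returning the model C-alpha distance for (i, j).
    """
    total = 0.0
    for i, j in pred_contacts:
        r = ca_dist(i, j)
        if r >= threshold:
            # Constant 1.0 from the text; the (r - threshold) term is assumed.
            total += 1.0 + (r - threshold)
    return total
```

Pairs that are not in the predicted contact list never enter the sum, which reflects the decision not to penalize predicted non-contacts.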
The predicted contact map can help identify the highest GDT-TS models in the set, even when they are not highly similar to native. A good example is CASP7 target T0304, a 122-residue α/β protein for which the highest GDT-TS model in the set is ZhangServer_TS1 (GDT-TS = 45.55). Most secondary structure predictors (including SSpro) failed to predict the first two strands, making this target especially difficult. No QA method ranked the highest GDT-TS model first; however, SELECTpro ranked it second, and the model ranked first by SELECTpro (T0304.ZhangServer_TS4) has the second highest GDT-TS. These models have the lowest E_{PREDCM} of any models in the set, but the native structure has an even lower E_{PREDCM}. Figure 5 compares the native and predicted contact maps for target T0304.
E_{BETA}: strand pairing
The formation of hydrogen bonds between the residues of β-strand partners is a major determinant of the tertiary structure of β and α/β proteins. The β hydrogen bonding treatment described here favors realistic strand pairing and sheet formation. The treatment also efficiently accommodates bulges in strands because it does not force the register between two paired strands. E_{BETA} is the global strand pairing energy that penalizes deviations from ideal hydrogen bonding of β residues between strand pairs. E_{betasp}(β_{k}→β_{w}) is the strand pairing energy of strand β_{k} to strand β_{w}. E_{betasp} is only commutative if the two strands have the same length. E_{betahb}(i,j) is the hydrogen bonding penalty between residues i and j.
E_{betasp} is calculated for all possible strand pairings, but only the two lowest energies from each strand are used in E_{BETA}. Other strand-strand interactions are ignored. In the equations below, S is the set of all strands in the protein, β_{m1} is the strand with the minimum pairing energy from β_{k}, and β_{m2} is the strand with the next lowest pairing energy from β_{k}. If the strand count is less than six, at least two of the strands must be edge strands. This is accounted for by only considering the single best strand partner for two strands.
In the equations for E_{betasp} below, S_{k} is the set of all residues in strand β_{k}. Each time E_{betahb} is calculated, the pair (i,j) is chosen with i from S_{k} and j from S_{w} such that E_{betahb} is minimized. Then residue i is removed from S_{k}, and residue j is removed from S_{w}. E_{betahb} is calculated once for each residue in S_{k}. If S_{k} has more residues than S_{w}, each unpaired residue is given the maximum penalty of E_{betahb}.
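The greedy residue pairing described above can be sketched as follows. The callable `hb_energy` stands in for E_{betahb}(i,j), and `max_penalty` for its maximum value; both, along with the choice to sum the pair energies, are assumptions made for illustration.

```python
def strand_pair_energy(strand_k, strand_w, hb_energy, max_penalty):
    """Greedy pairing energy of strand_k to strand_w.

    Repeatedly picks the (i, j) pair with the lowest hb_energy, removes both
    residues, and continues until one strand is exhausted. Residues of
    strand_k left without a partner receive max_penalty each.
    """
    remaining_k = list(strand_k)
    remaining_w = list(strand_w)
    total = 0.0
    while remaining_k and remaining_w:
        i, j = min(
            ((i, j) for i in remaining_k for j in remaining_w),
            key=lambda pair: hb_energy(*pair),
        )
        total += hb_energy(i, j)
        remaining_k.remove(i)
        remaining_w.remove(j)
    # Unpaired residues in strand_k (only possible if it is the longer strand).
    total += max_penalty * len(remaining_k)
    return total
```

Because only unpaired residues of the first strand are penalized, swapping the arguments changes the result when the strands differ in length, which is why the pairing energy is directional.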
Between two antiparallel strand partners, only every other pair of residues is hydrogen bonded. For the pairs that are not hydrogen bonded, a pseudobonding calculation is used. The hydrogen bonding energy and pseudobonding energy are both calculated and the minimum of the two is used in E_{betahb}(i,j).
If residues i and j are paired in parallel strands, either i forms hydrogen bonds with j−1 and j+1, or j forms hydrogen bonds with i−1 and i+1. No hydrogen bonds are formed between the atoms of residues i and j. The hydrogen bonding energy is calculated for both possible conformations and only the minimum of the two is used in E_{betahb}(i,j).
Φ(a→d) is the directional energy calculation for a single hydrogen bond, where a is the index of the acceptor residue and d is the index of the donor residue. Three geometrical measures are used to estimate the strength of hydrogen bonds: the distance between the acceptor and the hydrogen atoms (r_{a,O,d,H}), the angle at the acceptor atom (u_{a,C,a,O} · u_{a,O,d,H}), and the angle between the acceptor and donor atom vectors (u_{a,C,a,O} · u_{d,N,d,H}). The distance and acceptor atom angle parameters are motivated by the orientation-dependent hydrogen bonding potential described in [42]. The following parameters were set based on idealized hydrogen bonding between β residues, with standard deviation values set such that two standard deviations approximate the cutoff in true hydrogen bonds. The ideal distance from hydrogen atom to accepting oxygen is I_{hbdist} = 1.9 Å, with standard deviation σ_{hbdist} = 0.5 Å. The ideal angle at the acceptor atom is 0°, so the ideal (u_{a,C,a,O} · u_{a,O,d,H}) is I_{accdp} = 1.0, with standard deviation σ_{accdp} = 0.11. The ideal angle between the acceptor and donor atom vectors is 180°, so the ideal (u_{a,C,a,O} · u_{d,N,d,H}) is I_{accdondp} = −1.0, with standard deviation σ_{accdondp} = 0.15. The parameters for pseudobonded residues are as follows: the ideal distance for r_{a,O,d,H} is I_{pshbdist} = 7.9 Å, I_{psaccdp} = 1.0, and I_{psaccdondp} = 1.0. The standard deviations from the corresponding hydrogen bonding parameters above are used in Φ_{ps}(a→d).
The penalty for the observed value (x) increases up to 6 standard deviations from the ideal value (μ).
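The three geometric measures that enter Φ(a→d) can be computed directly from atom coordinates. The sketch below is illustrative only: the function names and the tuple-based coordinates are assumptions, and the scoring of the measures against their ideal values is not implemented.

```python
import math


def sub(a, b):
    """Vector from point b to point a."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def hbond_measures(c_a, o_a, n_d, h_d):
    """Return (r_{a,O,d,H}, u_{a,C,a,O}.u_{a,O,d,H}, u_{a,C,a,O}.u_{d,N,d,H})
    for acceptor atoms C, O and donor atoms N, H."""
    u_co = unit(sub(o_a, c_a))   # acceptor C -> O
    v_oh = sub(h_d, o_a)         # acceptor O -> donor H
    u_nh = unit(sub(h_d, n_d))   # donor N -> H
    dist = math.sqrt(dot(v_oh, v_oh))
    return dist, dot(u_co, unit(v_oh)), dot(u_co, u_nh)
```

For a perfectly linear C=O···H–N arrangement with an O···H separation of 1.9 Å, the function returns approximately (1.9, 1.0, −1.0): a 0° angle at the acceptor gives a dot product of 1.0, and a 180° angle between the C→O and N→H vectors gives −1.0.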
All-Atom Energy Term Details
The all-atom energy terms depend on atom-atom interactions when all heavy atoms are included in the model. In the all-atom energy equations, x and y refer to atoms in the model and the residue positions are not referenced. The van der Waals radii and well depths (ε_{x}, used in E_{LENJONES}) come from the CHARMM19 parameter set [43]. The sidechain hydrogen bonding term, E_{SCHB}, is described in detail here because it is unique to SELECTpro. The details of E_{LENJONES}, E_{SOLVATION}, and E_{ELECTRO} are provided in the Appendix.
E_{SCHB}: sidechain hydrogen bonding
E_{SCHB} penalizes unsatisfied hydrogen bond donor and acceptor atoms that are at least partially buried. There is no penalty for fully exposed donor or acceptor atoms. Exposure percent (ΔG_{x}^{slv}%) is calculated as ΔG_{x}^{slv}/ΔG_{x}^{ref}. The definitions of ΔG_{x}^{slv} and ΔG_{x}^{ref} are provided in the description of E_{SOLVATION} in the Appendix. Atoms at least 75% exposed are considered fully exposed and atoms less than 25% exposed are considered fully buried. For 25% < ΔG_{x}^{slv}% < 75%, the penalty weight is reduced linearly from 1.0 at 25% to 0 at 75%. The ideal distance from the acceptor atom to donor atom is I_{hbdadist} = 2.9 Å. In the equations below, donors is the set of all sidechain hydrogen donor atoms and acceptors is the set of all sidechain hydrogen acceptor atoms.
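The exposure-dependent weight described above is a simple piecewise-linear function. A minimal sketch, with a hypothetical function name and exposure expressed as a fraction rather than a percentage:

```python
def burial_weight(exposure):
    """Penalty weight for an unsatisfied donor or acceptor atom.

    exposure is the ratio of the atom's solvation free energy in the model to
    its fully exposed reference value, as a fraction between 0 and 1.
    """
    if exposure >= 0.75:
        return 0.0  # treated as fully exposed: no penalty
    if exposure <= 0.25:
        return 1.0  # treated as fully buried: full penalty
    # Linear ramp from 1.0 at 25% exposure down to 0.0 at 75% exposure.
    return (0.75 - exposure) / 0.5
```

An atom that is 50% exposed therefore receives half of the full penalty weight.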
Appendix
In the interest of completeness and reproducibility we include the details of the energy terms that are adapted from previous work.
Reduced Representation Energy Term Details
E_{BBREP}: backbone repulsion
This term penalizes steric clashes between nonbonded atoms explicitly represented in the reduced representation. The penalty for overlapping atoms is the overlap distance squared as defined here:
E_{CTREP}: centroid repulsion
A centroid-centroid repulsive term is used to reduce the overcrowding of sidechains in the reduced representation. The minimum distance between two centroids in the calculation is the minimum observed for each pair of residue types, D_{minCT}(aa_{i}, aa_{j}), in pdb_select25. The penalty for centroid-centroid overlaps is defined as the overlap distance squared:
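Both repulsion terms share the same shape: zero when two points are at least a minimum distance apart, and the squared overlap otherwise. The helper below is a hypothetical illustration of that shape, not the published equations; for backbone repulsion the minimum distance would be the sum of the two van der Waals radii, and for centroid repulsion the minimum distance observed for the residue-type pair.

```python
import math


def overlap_penalty(pos_x, pos_y, min_dist):
    """Squared overlap distance between two points, or 0.0 if they do not overlap.

    pos_x, pos_y: 3D coordinates of the two atoms or centroids.
    min_dist: smallest allowed separation for this pair.
    """
    overlap = min_dist - math.dist(pos_x, pos_y)
    return overlap * overlap if overlap > 0.0 else 0.0
```

For example, two points 3.0 Å apart with a minimum allowed separation of 3.5 Å overlap by 0.5 Å and contribute 0.25.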
E_{STATENV}: residue environment potential
The motivation for this term is to model the hydrophobic effect. The level of burial for each residue in the model is estimated by the number of other C_{β} atoms within 10 Å (the contact number N_{i}) [3]. The values in the table Ω_{statenv} reflect the likelihood of observing a particular N_{i} for each residue type. For model residues near either terminus, the contact number is artificially increased to account for the missing neighbors along the chain.
E_{STATPWCI}: context independent pairwise interactions
This context-independent pairwise potential comes from Equation 6 of [3]. The potential considers the likelihood of observing the pair of centroids in a given distance bin relative to the background, with distance bins of < 5, 5–7, 7–10, 10–12, and > 12 Å. The advantage of a context-independent pairwise potential is that it is less vulnerable to overfitting by a conformational search because of its generality.
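Looking up a pairwise statistic requires mapping a centroid-centroid distance to one of the five bins. A minimal sketch of that mapping follows; the function name is hypothetical, and assigning a distance that falls exactly on a boundary to the upper bin is an assumption, since the text does not specify it.

```python
import bisect

# Upper edges (angstroms) separating the five distance bins from the text.
BIN_EDGES = [5.0, 7.0, 10.0, 12.0]


def distance_bin(r):
    """Return the bin index (0-4) for a centroid-centroid distance r.

    0: < 5, 1: 5-7, 2: 7-10, 3: 10-12, 4: > 12 (boundary values go to the
    upper bin, which is an assumption).
    """
    return bisect.bisect_right(BIN_EDGES, r)
```

The returned index could then be used to look up the precalculated statistic for the residue-type pair.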
E_{STATPWCD}: context dependent pairwise potential
This context-dependent pairwise potential is from [6]. It depends on the local structure and relative orientation of both amino acids in the interaction. The statistics are calculated independently for each combination of local structures and relative orientations. At each position the local structure is considered either compact or open, and the relative orientation is determined by the dot product of the C_{α} to C_{β} unit vectors of each residue and divided into three classes: parallel, antiparallel, and intermediate.
E_{ROG}: compactness
The radius of gyration is a simple measure of the global compactness of a domain. E_{ROG} penalizes models that are less compact than expected according to [44]. If the radius of gyration of the model (λ) is less than the expected value (2.2N^{0.38}), there is no penalty. If it is greater, then the penalty is the squared difference between observed and expected. In the equation below, r_{i,mean} is the distance between the C_{α} of residue i and the mean of all C_{α}s in the model.
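The compactness penalty follows directly from this description. In the sketch below the radius of gyration is taken as the root mean square distance of the C_α atoms from their mean position, which is the standard definition; the function names are hypothetical.

```python
import math


def radius_of_gyration(ca_coords):
    """Root mean square distance of the C-alpha atoms from their mean position."""
    n = len(ca_coords)
    mean = tuple(sum(c[k] for c in ca_coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(c, mean) ** 2 for c in ca_coords) / n)


def e_rog(ca_coords):
    """Squared excess of the radius of gyration over 2.2 * N**0.38, else 0.0."""
    n = len(ca_coords)
    expected = 2.2 * n ** 0.38
    observed = radius_of_gyration(ca_coords)
    if observed <= expected:
        return 0.0  # at least as compact as expected: no penalty
    return (observed - expected) ** 2
```

Only models that are more extended than expected for their length are penalized; overly compact models are left to the repulsion terms.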
All-Atom Energy Term Details
E_{LENJONES}: van der Waals forces
A fundamental characteristic of native globular protein structures is the efficient steric packing of atoms in the protein core. A Lennard-Jones 12-6 potential with damped repulsion (E_{LENJONES}) is used to measure the quality of steric packing. E_{LENJONES} is the sum of local energy calculations E_{lenjones}(x,y) performed on all pairs of non-bonded atoms. Because the repulsive portion of the standard Lennard-Jones 12-6 potential would overwhelm the entire energy function with a single significant atom-atom clash, repulsion is handled by a linear ramp from 0 to 10, as shown in the equation below [10]. Since E_{lenjones} = 0 when (vdw_{x+y}/r_{x,y}) = \sqrt[6]{2} independent of atom types, the switch to a linear ramp occurs when (vdw_{x+y}/r_{x,y}) > \sqrt[6]{2}.
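The damped potential can be sketched as follows. The 12-6 form used here, ε[(s/r)^12 − 2(s/r)^6] with s the sum of the van der Waals radii, is the form that reaches zero at s/r = 2^(1/6), as the text requires. Making the ramp linear in the distance r, so that it rises from 0 at the switch point to 10 at r = 0, is an assumption, as are the function name and the use of a single pairwise well depth.

```python
SWITCH_RATIO = 2.0 ** (1.0 / 6.0)  # value of vdw_sum / r at which the 12-6 energy is zero


def e_lenjones(r, vdw_sum, well_depth, max_repulsion=10.0):
    """Lennard-Jones 12-6 energy with the repulsive wall replaced by a ramp.

    r: distance between the two atoms; vdw_sum: sum of their van der Waals
    radii; well_depth: depth of the attractive well for this pair.
    """
    r_switch = vdw_sum / SWITCH_RATIO
    if r < r_switch:
        # Assumed ramp: linear in r, from 0 at r_switch up to max_repulsion at r = 0.
        return max_repulsion * (1.0 - r / r_switch)
    x6 = (vdw_sum / r) ** 6
    return well_depth * (x6 * x6 - 2.0 * x6)
```

With this form the minimum energy, equal to minus the well depth, occurs when the atoms are separated by exactly the sum of their radii, and a complete overlap costs at most 10 rather than diverging.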
E_{SOLVATION}: solvation effects
Solvation energy is calculated using the implicit solvation model described in [35] with the following adjustment: for overlapping atoms, the sum of their van der Waals radii is used in the calculation in place of the observed atom-atom distance in the model. This restricts the amount a single atom can contribute to the burial of another atom. Without this adjustment, overlapping atoms would bias the calculation to indicate that an atom is more buried than it would be otherwise. In the solvation model, ΔG_{x}^{slv} is the observed solvation free energy of atom x in the model, calculated as the free energy of the fully exposed atom (ΔG_{x}^{ref}) minus the reduction in solvation caused by the surrounding atoms. ΔG_{x}^{free} was determined empirically by setting it equal to ΔG_{x}^{ref} and increasing its magnitude until ΔG_{x}^{slv} of deeply buried atoms became zero. λ_{x} is the correlation length of atom x. V_{y} is the volume of neighboring atom y. The values of these parameters come from [35], with the exception of ΔG_{x}^{ref} [45]. The equation for ΔG_{x}^{slv} below is the combination of Equations 5, 6, and 7 of [35], with the atom overlap adjustment.
E_{ELECTRO}: electrostatics
Electrostatic interactions between charged atoms are treated by simple repulsion and attraction according to inverse distance squared. The use of distance squared rather than linear distance encourages the formation of salt bridges in the models. There is a correction for atom-atom distances below the minimum realistic value. The ideal distance between oppositely charged atoms is I_{hbdadist} = 2.75 Å. In the equations below, pos is the set of all positively charged atoms and neg is the set of all negatively charged atoms.
Availability and requirements

Project home page: http://www.igb.uci.edu/~baldig/selectpro.html
Operating system: Linux for the stand-alone version; the server is platform independent
Programming language: C++ and Perl
Software requirements: Perl
Disk space requirements: 1.6 GB for the full version, 13 MB without feature predictors
References
1. Wallner B, Elofsson A: Prediction of global and local model quality in CASP7 using Pcons and ProQ. Proteins 2007, 69(Suppl 8):184–193. 10.1002/prot.21774
2. Cozzetto D, Tramontano A: Relationship between multiple sequence alignments and quality of protein comparative models. Proteins 2005, 58: 151–157. 10.1002/prot.20284
3. Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268: 209–225. 10.1006/jmbi.1997.0959
4. Kihara D, Lu H, Kolinski A, Skolnick J: TOUCHSTONE: An ab initio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci USA 2001, 98: 10125–10130. 10.1073/pnas.181328398
5. Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A: Protein fragment reconstruction using various modeling techniques. J Comput Aided Mol Des 2003, 17: 725–738. 10.1023/B:JCAM.0000017486.83645.a0
6. Kolinski A: Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 2004, 51: 349–371.
7. Sanchez R, Sali A: Comparative protein structure modeling. Introduction and practical examples with modeller. Methods Mol Biol 2000, 143: 97–129.
8. Qian B, Ortiz A, Baker D: Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci USA 2004, 101: 15346–15351. 10.1073/pnas.0404703101
9. Lazaridis T, Karplus M: Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 1999, 288: 477–487. 10.1006/jmbi.1999.2685
10. Kuhlman B, Baker D: Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA 2000, 97: 10383–10388. 10.1073/pnas.97.19.10383
11. Vorobjev Y, Hermans J: Free energies of protein decoys provide insight into determinants of protein stability. Protein Sci 2001, 10: 2498–2506. 10.1110/ps.ps.15501
12. Felts A, Gallicchio E, Wallqvist A, Levy R: Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model. Proteins 2002, 48: 404–422. 10.1002/prot.10171
13. Dominy B, Brooks C: Identifying native-like protein structures using physics-based potentials. J Comput Chem 2002, 23: 147–160. 10.1002/jcc.10018
14. Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Kazmierkiewicz R, Ripoll DR, Pillardy J, Saunders JA, Kang YK, Gibson KD, Scheraga HA: Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: Assessment in two blind tests. Proc Natl Acad Sci USA 2005, 102: 7547–7552. 10.1073/pnas.0502655102
15. Shortle D, Simons KT, Baker D: Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci USA 1998, 95: 11158–11162. 10.1073/pnas.95.19.11158
16. Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D: Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 1999, 34: 82–95. 10.1002/(SICI)10970134(19990101)34:1<82::AIDPROT7>3.0.CO;2A
17. Vendruscolo M, Najmanovich R, Domany E: Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins 2000, 38: 134–148. 10.1002/(SICI)10970134(20000201)38:2<134::AIDPROT3>3.0.CO;2A
18. Cozzetto D, Kryshtafovych A, Ceriani M, Tramontano A: Assessment of predictions in the model quality assessment category. Proteins 2007, 69: 175–183. 10.1002/prot.21669
19. Wu S, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 2007, 5: 17. 10.1186/17417007517
20. Zhang 2007 Decoy Sets [http://zhang.bioinformatics.ku.edu/ITASSER/decoys/]
21. Zhang Y, Skolnick J: SPICKER: A clustering approach to identify near-native protein folds. J Comput Chem 2004, 25: 865–871. 10.1002/jcc.20011
22. Wallner B, Fang H, Elofsson A: Automatic consensus-based fold recognition using Pcons, ProQ, and Pmodeller. Proteins 2003, 53(Suppl 6):534–541. 10.1002/prot.10536
23. Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A: Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci 2001, 10: 2354–2362. 10.1110/ps.08501
24. Wallner B, Elofsson A: Can correct protein models be identified? Protein Sci 2003, 12: 1073–1086. 10.1110/ps.0236803
25. Cheng J, Randall AZ, Sweredoski M, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 2005, (33 Web Server):W72–W76. 10.1093/nar/gki396
26. Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction – Round VII. Proteins 2007, 69(Suppl 8):3–9. 10.1002/prot.21767
27. Zemla A, Veclovas C, Moult J, Fidelis K: Processing and analysis of CASP3 protein structure predictions. Proteins 1999, 37(Suppl 3):22–29. 10.1002/(SICI)10970134(1999)37:3+<22::AIDPROT5>3.0.CO;2W
28. Sali A, Blundell TL: Comparative protein modeling by satisfaction of spatial restraints. J Mol Biol 1993, 234: 779–815. 10.1006/jmbi.1993.1626
29. MartinRenom MA, Stuart A, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291
30. Fiser A, Do RK, Sali A: Modeling of loops in protein structures. Protein Sci 2000, 9: 1753–1773.
31. Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D: An Improved Protein Decoy Set for Testing Energy Functions for Protein Structure Prediction. Proteins 2003, 53: 76–87. 10.1002/prot.10454
32. Baker D, Bystroff C, Fletterick RJ, Agard DA: PRISM: topologically constrained phased refinement for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 1993, 49: 429–439. 10.1107/S0907444993004032
33. Sun S: Reduced representation approach to protein tertiary structure prediction: statistical potential and simulated annealing. J Theor Biol 1995, 172: 13–32. 10.1006/jtbi.1995.0002
34. Canutescu AA, Shelenkov AA, Dunbrack RL: A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 2003, 12: 2001–2014. 10.1110/ps.03154503
35. Lazaridis T, Karplus M: Effective Energy Function for Proteins in Solution. Proteins 1999, 35: 133–152. 10.1002/(SICI)10970134(19990501)35:2<133::AIDPROT1>3.0.CO;2N
36. Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci 1994, 3: 522–524.
37. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
38. Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 2002, 47: 228–235. 10.1002/prot.10082
39. Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins. Proteins 2002, 47: 142–153. 10.1002/prot.10069
40. Baldi PF, Pollastri G: The principled design of large-scale recursive neural network architectures – DAG-RNNs and the protein structure prediction problem. J Mach Learn Res 2003, 4: 575–602. 10.1162/153244304773936054
41. Pollastri G, Baldi P: Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 2002, 18: S62–S70.
42. Kortemme T, Morozov AV, Baker D: An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003, 326: 1239–1259. 10.1016/S00222836(03)000214
43. Neria E, Fischer S, Karplus M: Simulation of activation free energies in molecular systems. J Chem Phys 1996, 105: 1902–1921. 10.1063/1.472061
44. Skolnick J, Kolinski A, Ortiz AR: MONSSTER: A method for folding globular proteins with a small number of distance restraints. J Mol Biol 1997, 265: 217–241. 10.1006/jmbi.1996.0720
45. Privalov PL, Makhatadze GI: Contribution of hydration to protein folding thermodynamics II. The entropy and Gibbs energy of hydration. J Mol Biol 1993, 232: 660–679. 10.1006/jmbi.1993.1417
Acknowledgements
Work supported by NIH grant LM0744301, NSF grants EIA0321390 and IIS0513376, and a Microsoft Faculty Research Award to PFB.
Authors' contributions
AR and PB designed the novel energy terms. AR implemented the methods and carried out the experiments. AR and PB authored the manuscript. Both authors approved the manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Randall, A., Baldi, P. SELECTpro: effective protein model selection using a structurebased energy function resistant to BLUNDERs. BMC Struct Biol 8, 52 (2008). https://doi.org/10.1186/14726807852
DOI: https://doi.org/10.1186/14726807852