Improving predicted protein loop structure ranking using a Pareto-optimality consensus method
© Li et al; licensee BioMed Central Ltd. 2010
Received: 27 October 2009
Accepted: 20 July 2010
Published: 20 July 2010
Accurate protein loop structure models are important to understand functions of many proteins. Identifying the native or near-native models by distinguishing them from the misfolded ones is a critical step in protein loop structure prediction.
We have developed a Pareto Optimal Consensus (POC) method, which is a consensus model ranking approach to integrate multiple knowledge- or physics-based scoring functions. The procedure of identifying the models of best quality in a model set includes: 1) identifying the models at the Pareto optimal front with respect to a set of scoring functions, and 2) ranking them based on the fuzzy dominance relationship to the rest of the models. We apply the POC method to a large number of decoy sets for loops of 4 to 12 residues in length using a functional space composed of several carefully selected scoring functions: Rosetta, DOPE, DDFIRE, OPLS-AA, and a triplet backbone dihedral potential developed in our lab. Our computational results show that the sets of Pareto-optimal decoys, which typically make up ~20% or less of the decoys in a set, cover the best or near-best decoys in more than 99% of the loop targets. Compared to the individual scoring function yielding the best selection accuracy in the decoy sets, the POC method yields 23%, 37%, and 64% fewer false positives in distinguishing the native conformation, identifying a near-native model (RMSD < 0.5 Å from the native) as top-ranked, and selecting at least one near-native model in the top-5-ranked models, respectively. Similar effectiveness of the POC method is also found in the decoy sets from membrane protein loops. Furthermore, the POC method outperforms other widely used consensus strategies in model ranking, such as rank-by-number, rank-by-rank, rank-by-vote, and regression-based methods.
By integrating multiple knowledge- and physics-based scoring functions based on Pareto optimality and fuzzy dominance, the POC method is effective in distinguishing the best loop models from the other ones within a loop model set.
Protein loop structure modeling is important in structural biology for its wide applications, including determining the surface loop regions in homology modeling, defining segments in NMR spectroscopy experiments, designing antibodies, and modeling ion channels [4, 5]. Typically, the protein loop structure modeling procedure involves the following steps [6, 7]. First, the structural conformation space is sampled to produce a large ensemble of backbone models satisfying certain conditions such as loop closure, absence of clashes, and low score (energy). Second, clustering algorithms are applied to select representative models from these backbone models. Third, side chains are added to the representative models to build all-atom models, and their structures are further optimized by score minimization. Finally, the models are assessed and the "best" ones are selected as the predicted conformations.
In many loop modeling methods [6–13], sample loop conformations are constructed by dihedral angle buildup or fragment library search. Recently, Mandell et al. developed a kinematic closure approach, which can construct loop conformations within 1 Å resolution. Nevertheless, the scoring functions used to guide loop modeling vary widely. Rohl et al. optimized the Rosetta score using fragment buildup. Fiser et al. used a hybrid scoring function that sums CHARMM force field terms and statistically derived terms. Xiang et al. developed a combined energy function with force-field energy and RMSD (Root Mean Square Deviation) dependent terms. They also developed the concept of "colony energy", which has been used by Fogolari and Tosatto as well, to account for the loop entropy (an important component in flexible loops) as part of the total free energy. Olson et al. used a multiscale approach based on physical potentials. An efficient grid-based force field has been employed by Cui et al. Jacobson et al., Zhu et al., Rapp and Friesner, de Bakker et al., Felts et al., and Rapp et al. employed physics-based energy schemes with various solvent models. Soto et al. found that using the statistical potential DFIRE as a filter prior to all-atom physics-based energy minimization can improve prediction accuracy and reduce computation time. DFIRE has previously proven to be successful by itself for loop selection. All these methods have led to significant recent progress in generating high-resolution loop models, and several loop prediction servers are now available.
In practice, the value of computer-generated protein loop models in biological research relies critically on their accuracy. Efficiently sampling the protein loop conformation space to produce a sufficient number of low-energy models that cover conformations with good structures remains a challenging issue. Another critical problem is the insensitivity of the existing protein scoring functions, which are developed to estimate the energy of the protein molecule. This insensitivity makes it difficult to distinguish the native or native-like conformations from the erroneous models, and thus restricts the loop structure prediction accuracy. Therefore, selecting the highest quality loop models from a number of other models is a critical step in solving the protein loop structure prediction problem.
The scoring functions play a significant role in protein structure assessment and selection. Although a number of scoring functions are currently available for protein loop model evaluation, there is no generally reliable one that can always distinguish the native or near native models. Every existing scoring function has its own pros and cons. Recently, the strategy of using multiple scoring functions to estimate the quality of models and improve selection was proposed in protein folding and protein-ligand docking [23–27]. Multiple, carefully selected scoring functions are integrated and selection improvements can be achieved by tolerating the insensitivity and deficiency of every individual scoring function. Thus, the multiple scoring functions method can usually lead to a better performance than an individual scoring function.
As in whole-protein structure prediction, the scoring functions that have been used in loop modeling can be categorized into knowledge-based [8, 21, 28–30] and physics-based [13, 31–35]. The knowledge-based scoring functions are typically derived from protein structural databases such as the PDB and thus incorporate empirical criteria to distinguish the native structure from the misfolds. By contrast, the physics-based scoring functions are developed from first-principles concepts, where electrostatic, van der Waals, hydrogen bond, solvation, and covalent interactions are taken into account.
In this paper, we present a Pareto Optimality Consensus (POC) method based on Pareto optimality and fuzzy dominance theory to take advantage of multiple scoring functions for ranking protein loop models. The rationale is to identify the models at the Pareto optimal front of the function space of a set of carefully selected scoring functions and then to rank them based on the fuzzy dominance relationship relative to the other models. For protein loop structure ranking, we employ five knowledge- or physics-based scoring (energy) functions: DFIRE, our triplet backbone dihedral potential, OPLS-AA/SGB [31, 32], all-atom Rosetta, and DOPE. All of these scoring functions have proven effective in loop modeling in the literature [6–8, 21, 28]. We apply our approach to the loop decoy sets generated by Jacobson et al. The loops in Jacobson's decoy sets are regarded as "difficult" targets [21, 35]. There are frequent Pro and Gly occurrences in these loops. Cys residues are treated separately in both reduced and oxidized forms to take the formation of disulfide bridges into account. The loop positions are random, so that a wide variety of situations can be encountered. Jacobson's decoy sets have been frequently used as a benchmark for loop prediction and effectiveness of scoring functions [20, 21, 35]. The original loop decoy sets include targets whose native protein structures have certain exceptional features such as high or low pH values when crystallized, explicit interactions between the target loops and heteroatoms, and low resolution crystal structures in target loop regions with large measured B-factors. Jacobson et al. also provide a filtered list of decoy sets that eliminates targets with the above exceptional features. Since none of the scoring functions we used makes assumptions about these exceptional features, we only consider the filtered decoy sets in this paper.
In addition to Jacobson's decoy sets, we apply our method to more recent decoy sets for 294 loops chosen from 44 chains in 38 membrane proteins. We also compare the POC method with the hydrophobic potential of mean force (HPMF) approach for loop model selection as well as with other multiple scoring function ranking strategies, including Rank-by-Number, Rank-by-Rank, Rank-by-Vote, and regression-based methods.
The Consensus Strategy
The Pareto Optimality Consensus Method
A model u is said to dominate another model v if
i) $f_i(u) \le f_i(v)$ holds for every scoring function $f_i(\cdot)$, i.e., for all i; and
ii) there is at least one scoring function $f_j(\cdot)$ for which $f_j(u) < f_j(v)$ is satisfied.
By definition, the models which are not dominated by any other models in the model set form the Pareto-optimal solution set. A Pareto-optimal model possesses certain optimality compared to the other ones in the model set.
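The dominance test and the Pareto-optimal set described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that lower scores are better, and the function names (`dominates`, `pareto_front`) and the toy score values are hypothetical.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u dominates v: no worse on every score and strictly
    better on at least one (assumption: lower score is better)."""
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Indices of models not dominated by any other model.
    Uses a plain pairwise comparison over all models."""
    return [
        i for i, u in enumerate(scores)
        if not any(dominates(v, u) for j, v in enumerate(scores) if j != i)
    ]


# Hypothetical scores of four models under two scoring functions.
toy_scores = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(toy_scores)
print(front)  # model 2 is dominated by model 1, the others are not
```

In this toy set, model 2 with scores (3.0, 3.0) is dominated by model 1 with scores (2.0, 2.0), while the remaining three models trade off one score against the other and therefore all stay on the front.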
for all normalized scoring functions $g(f_i(\cdot))$. In our current POC method, we use a linear membership function, $\min(x, y)/y$, as previously suggested, and the fuzzy scheme is not biased toward any individual scoring function.
For the example shown in Figure 3, $\mu_a(A, C) = 1.0$, $\mu_p(A, C) = 0.083$, $\mu_a(A, B) = 1.0$, and $\mu_p(A, B) = 0.167$. As a result, A shows a more significant dominance over C than over B in the fuzzy dominance scheme.
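To make the fuzzy dominance degrees concrete, the sketch below evaluates them with the linear membership function min(x, y)/y. Several things here are assumptions rather than statements from the text: the per-function memberships are combined by a product (the form we recall from the fuzzy ranking scheme of Koppen and Vicente-Garcia), the normalized scores are positive with lower values being better, and the coordinates of A, B, and C are hypothetical values chosen only so that the quoted degrees are reproduced. They are not read from Figure 3.

```python
from math import prod
from typing import Sequence


def mu_a(u: Sequence[float], v: Sequence[float]) -> float:
    """Degree to which u dominates v (assumed product form):
    product over scoring functions of min(u_i, v_i) / u_i."""
    return prod(min(a, b) / a for a, b in zip(u, v))


def mu_p(u: Sequence[float], v: Sequence[float]) -> float:
    """Degree to which u is dominated by v (assumed product form):
    product over scoring functions of min(u_i, v_i) / v_i."""
    return prod(min(a, b) / b for a, b in zip(u, v))


# Hypothetical normalized scores in (0, 1], lower is better.
A = (0.25, 0.25)
B = (0.50, 0.75)
C = (0.75, 1.00)

print(round(mu_a(A, C), 3), round(mu_p(A, C), 3))
print(round(mu_a(A, B), 3), round(mu_p(A, B), 3))
```

Under these assumptions A fully dominates both B and C, and the smaller value of the second degree for C reflects that C lies further from A in score space than B does.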
which will be used to rank the Pareto-optimal models. To rank the whole model set, we first identify the Pareto-optimal models and rank them according to the fuzzy Pareto dominance relationship. Then, we remove the Pareto-optimal models, identify the Pareto-optimal models among the remaining models, and assign ranks to them. The procedure is repeated until there are no more models left in the model set.
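The repeated identify-rank-remove procedure can be sketched as follows. This is a minimal illustration under stated assumptions: lower scores are better, and the ordering inside each front is delegated to a caller-supplied key function. The key used in the example (sum of scores) is a hypothetical placeholder, not the fuzzy dominance criterion of the POC method, and all names are illustrative.

```python
from typing import Callable, List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Pareto dominance with lower scores being better (assumption)."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))


def layered_ranking(scores: List[Sequence[float]],
                    within_front_key: Callable[[int], float]) -> List[int]:
    """Return model indices from best to worst by repeatedly peeling off
    the current Pareto-optimal front and ordering it with the given key."""
    remaining = list(range(len(scores)))
    order: List[int] = []
    while remaining:
        # Models not dominated by any other model still in the set.
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        order.extend(sorted(front, key=within_front_key))
        peeled = set(front)
        remaining = [i for i in remaining if i not in peeled]
    return order


# Hypothetical scores of five models under two scoring functions.
toy_scores = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0), (5.0, 5.0)]
# Placeholder key: sum of scores (NOT the fuzzy dominance criterion).
ranking = layered_ranking(toy_scores, lambda i: sum(toy_scores[i]))
print(ranking)
```

Each pass compares every remaining model against every other one, so a single front costs a number of comparisons quadratic in the number of remaining models.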
Effectiveness of the Pareto Optimal Models
Efficiency in Identifying Near-Native Structures
We applied the POC method to the decoy sets generated by Jacobson et al. The decoy set for each target contains very good models (MODEL 1 and MODEL 2) derived from the native structure by optimizing the OPLS-AA/SGB force field as well as other models generated by hierarchical comparative modeling.
Average ROC-AUC Comparison in Jacobson's Decoy Sets and the Membrane Protein Loop Decoy (MPD) Sets
Comparison to Regression-based Consensus Method
Another major drawback of the regression-based consensus method is its dependence on the size, composition and generality of the training set used to derive the weights. Similar to the vote-based or rank-based consensus methods, POC does not require a training procedure. The selection and ranking solely depend on evaluation of the dominance relationship among the decoys.
Comparison to Rank-by-Number, Rank-by-Rank, and Rank-by-Vote Methods
The vote-based consensus method is another strategy for selection with multiple scoring functions. It takes advantage of the observation that models voted for by more scoring functions tend to be more accurate than those having fewer votes. However, a disadvantage of vote-based consensus methods is that they are very sensitive to the artificially set vote threshold value [23, 27]. Vote-based consensus methods also have difficulties when the scoring functions strongly disagree with each other. As a result, vote-based consensus methods are usually inferior to consensus score methods and are generally not recommended.
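For orientation, the three rank-based consensus strategies compared in this section can be sketched as below. The definitions follow the common usage of these names in the consensus scoring literature and are assumptions here, not a description of the exact protocol used in this study: rank-by-number averages the (normalized) scores, rank-by-rank averages the per-function ranks, and rank-by-vote counts how many scoring functions place a model within a top fraction. The function names, the `top_fraction` parameter, and the toy scores are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank of each entry, 1 = best (assumption: lower score is better)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_by_number(scores: List[Sequence[float]]) -> List[float]:
    """Average score per model; scores are assumed to be normalized."""
    return [sum(row) / len(row) for row in scores]


def rank_by_rank(scores: List[Sequence[float]]) -> List[float]:
    """Average rank per model over all scoring functions."""
    column_ranks = [ranks(column) for column in zip(*scores)]
    return [sum(r[i] for r in column_ranks) / len(column_ranks)
            for i in range(len(scores))]


def rank_by_vote(scores: List[Sequence[float]],
                 top_fraction: float = 0.5) -> List[int]:
    """Votes per model: number of scoring functions ranking it within
    the top fraction of the set (the threshold is user-chosen)."""
    cutoff = max(1, int(len(scores) * top_fraction))
    column_ranks = [ranks(column) for column in zip(*scores)]
    return [sum(1 for r in column_ranks if r[i] <= cutoff)
            for i in range(len(scores))]


# Hypothetical normalized scores of four models, two scoring functions.
scores = [(0.1, 0.9), (0.2, 0.2), (0.8, 0.3), (0.9, 0.8)]
avg_score = rank_by_number(scores)   # lower is better
avg_rank = rank_by_rank(scores)      # lower is better
votes = rank_by_vote(scores, 0.5)    # higher is better
print(avg_score, avg_rank, votes)
```

Changing `top_fraction` in this sketch changes which models receive votes, which illustrates the threshold sensitivity noted above.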
Table: Selection Accuracy Comparison of Various Consensus Strategies and Best Individual Scoring Function in Jacobson's Decoy Sets of 502 Loop Targets. Labels: Best Individual Scoring Function; Top-ranked decoy < 0.5 Å; Best Top-5-ranked decoys < 0.5 Å.
Comparison to Another Selection Method
Selection Accuracy of the POC method compared to the HPMF Method
In this section, we analyze, from the biological perspective, the results obtained for several loop targets. These targets include 1fus(28:38), 1aac(16:20), and 1hbq(31:38).
On the other hand, Rosetta's best-scored decoy has the opposite problem: it makes some good contacts with the protein frame but has a poor choice of backbone torsion angle combinations. For example, the Thr37 residue has the backbone torsion angle combination phi = 80°, psi = -45°, which falls in a region of the threonine Ramachandran map that is disallowed due to local steric clashes. The POC method succeeds in this case by selectively relying on the other scoring functions, which perform well.
A somewhat opposite example is provided by the 1aac(16:20) target, where only the triplet scoring function selects decoys close to the native structure. All the other scoring functions select decoys with inferior torsion angle combinations. It seems that the distance-based scoring functions cannot accurately evaluate the local backbone interactions that are well described by our triplet torsion angle scoring function. Despite scoring a loop by its internal interactions only, our triplet scoring function proves itself as a valuable tool in the POC scheme. Our POC method heavily relies on the triplet scoring function to identify the near-native conformation in this case.
Limitations of the POC Method
Similar to the other consensus methods, the POC method is limited by the accuracy of the scoring functions involved in the consensus scheme. If the large majority of the scoring functions have poor accuracy, the consensus scheme is unlikely to select decoys with high resolution. The effectiveness of the POC method also depends on the quality of the decoys generated. POC is a selection and ranking scheme and thus it is unable to generate better decoys than the best one in a decoy set.
Another minor disadvantage of the POC method is the decoy selection and ranking time when the decoy set is large. For a set of N decoys, the Pareto-optimal decoy selection and ranking time scales as $O(N^2)$ because pair-wise decoy dominance relationships must be evaluated, whereas the ranking time in regression-based, rank-based, or vote-based consensus methods scales as $O(N)$. However, compared to the training time in the regression-based method and the evaluation time for the scoring functions, the decoy selection and ranking time in the POC method is still rather small for a decoy set of reasonable size.
The POC method is shown to be effective in distinguishing the best models from the other ones within Jacobson's loop decoy sets and the membrane protein loop decoy sets. It is clear that a combination of multiple, carefully-selected physics- and knowledge-based scoring functions can significantly reduce the number of false positives compared to using an individual scoring function only. Moreover, identifying the decoys at the Pareto optimal front and ranking these decoys based on the fuzzy dominance relationship against the other decoys in the set have led to higher model selection accuracy in the POC method than in the other consensus strategies including rank-by-vote, rank-by-number, rank-by-rank, and regression-based methods. In addition to protein loop structure prediction, the POC approach may also be used in applications of protein folding, protein-protein interaction, and protein-ligand docking.
Our current POC implementation is not biased toward any individual scoring function. However, there may still be room for improvement in the POC method. For example, POC may be coupled with a training algorithm that measures the efficiency of each scoring function, so that a certain bias toward some scoring functions can be incorporated when evaluating the fuzzy Pareto dominance relation. This will be one of our future research directions.
We acknowledge support from NIH grants 5PN2EY016570-06 and 5R01NS063405-02 and from NSF grants 0835718, 0829382, and 0845702.
- Bruccoleri RE: Ab initio loop modeling and its application to homology modeling. Methods in Molecular Biology 2000, 143: 247–264.
- Dmitriev OY, Fillingame RH: The rigid connecting loop stabilizes hairpin folding of the two helices of the ATP synthase subunit c. Protein Science 2007, 16(10):2118–2122. doi:10.1110/ps.072776307
- Martin AC, Cheetham JC, Rees AR: Modeling antibody hypervariable loops: a combined algorithm. PNAS 1989, 86(23):9268–9272. doi:10.1073/pnas.86.23.9268
- Tasneem A, Iyer LM, Jakobsson E, Aravind L: Identification of the prokaryotic ligand-gated ion channels and their implications for the mechanisms and origins of animal Cys-loop ion channels. Genome Biol 2005, 6(1):R4. doi:10.1186/gb-2004-6-1-r4
- Yarov-Yarovoy V, Baker D, Catterall WA: Voltage sensor conformations in the open and closed states in ROSETTA structural models of K+ channels. Proc Natl Acad Sci USA 2006, 103: 7292–7297. doi:10.1073/pnas.0602350103
- Jacobson MP, Pincus DL, Rapp CS, Day TJF, Honig B, Shaw DE, Friesner RA: A Hierarchical Approach to All-atom Protein Loop Prediction. Proteins: Structure, Function, and Bioinformatics 2004, 55: 351–367. doi:10.1002/prot.10613
- Zhu K, Pincus DL, Zhao S, Friesner RA: Long Loop Prediction Using the Protein Local Optimization Program. Proteins: Structure, Function, and Bioinformatics 2006, 65: 438–452. doi:10.1002/prot.21040
- Rohl CA, Strauss CE, Chivian D, Baker D: Modeling structurally variable regions in homologous proteins with Rosetta. Proteins 2004, 55: 656–677. doi:10.1002/prot.10629
- Fiser A, Do RKG, Sali A: Modeling of loops in protein structures. Protein Sci 2000, 9: 1753–1773. doi:10.1110/ps.9.9.1753
- Xiang ZX, Soto CS, Honig B: Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc Natl Acad Sci USA 2002, 99: 7432–7437. doi:10.1073/pnas.102179699
- Rapp CS, Friesner RA: Prediction of loop geometries using a generalized born model of solvation effects. Proteins 1999, 35: 173–183. doi:10.1002/(SICI)1097-0134(19990501)35:2<173::AID-PROT4>3.0.CO;2-2
- de Bakker PIW, DePristo MA, Burke DF, Blundell TL: Ab initio construction of polypeptide fragments: accuracy of loop decoy discrimination by an all-atom statistical potential and the AMBER force field with the Generalized Born solvation model. Proteins Struct Funct Bioinformat 2003, 51: 21–40. doi:10.1002/prot.10235
- Felts AK, Gallicchio E, Chekmarev D, Paris KA, Friesner RA, Levy RM: Prediction of Protein Loop Conformation Using the AGBNP Implicit Solvent Model and Torsion Angle Sampling. J Chem Theory Comput 2008, 4(5):855–868. doi:10.1021/ct800051k
- Fernandez-Fuentes N, Oliva B, Fiser A: A Supersecondary Structure Library and Search Algorithm for Modeling Loops in Protein Structures. Nucleic Acids Res 2006, 34(7):2085–2097. doi:10.1093/nar/gkl156
- Mandell DJ, Coutsias EA, Kortemme T: Sub-Angstrom Accuracy in Protein Loop Reconstruction by Robotics-Inspired Conformational Sampling. Nature Methods 2009, 6: 551–552. doi:10.1038/nmeth0809-551
- Fogolari F, Tosatto SCE: Application of MM/PBSA colony free energy to loop decoy discrimination: Toward correlation between energy and root mean square deviation. Protein Science 2005, 14(4):889–901. doi:10.1110/ps.041004105
- Olson MA, Feig M, Brooks CL: Prediction of protein loop conformations using multiscale modeling methods with physical energy scoring functions. Journal of Computational Chemistry 2008, 29(5):820–831. doi:10.1002/jcc.20827
- Cui M, Mezei M, Osman R: Prediction of protein loop structures using a local move Monte Carlo approach and a grid-based force field. Protein Engineering Design & Selection 2008, 21(12):729–735.
- Rapp CS, Strauss T, Nederveen A, Fuentes G: Prediction of protein loop geometries in solution. Proteins: Structure Function and Bioinformatics 2007, 69: 69–74. doi:10.1002/prot.21503
- Soto CS, Fasnacht M, Zhu J, Forrest L, Honig B: Loop Modeling: Sampling, Filtering, Scoring. Proteins 2008, 70(3):834–843. doi:10.1002/prot.21612
- Zhang C, Liu S, Zhou Y: Accurate and efficient loop selections by the DFIRE-based all-atom statistical potential. Protein Sci 2004, 13(2):391–399. doi:10.1110/ps.03411904
- Spassov VZ, Flook PK, Yan L: LOOPER: a molecular mechanics-based algorithm for protein loop prediction. Protein Engineering Design & Selection 2008, 21(2):91–100.
- Wang R, Wang S: How Does Consensus Scoring Work for Virtual Library Screening? An Idealized Computer Experiment. J Chem Inf Comput Sci 2001, 41: 1422–1426.
- Qiu J, Sheffler W, Baker D, Noble WS: Ranking predicted protein structures with support vector regression. Proteins 2008, 71: 1175–1182. doi:10.1002/prot.21809
- Clark RD, Strizhev A, Leonard JM, Blake JF, Matthew JB: Consensus scoring for ligand/protein interactions. Journal of Molecular Graphics and Modelling 2002, 20(4):281–295. doi:10.1016/S1093-3263(01)00125-5
- Gao X, Bu D, Xu J, Li M: Improving Consensus Contact Prediction via Server Correlation Reduction. BMC Structural Biology 2009, 9: 28–42. doi:10.1186/1472-6807-9-28
- Oda A, Tsuchida H, Takakura T, Yamaotsu N, Hirono S: Comparison of Consensus Scoring Strategies for Evaluating Computational Models of Protein-Ligand Complexes. J Chem Inf Model 2006, 46: 380–391. doi:10.1021/ci050283k
- Rata I, Li Y, Jakobsson E: Backbone statistical potential from local sequence-structure interactions in protein loops. J Phys Chem B 2010, 114(5):1859–1869. doi:10.1021/jp909874g
- Burke DF, Deane CM: Improved protein loop prediction from sequence alone. Protein Eng 2001, 14: 473–478. doi:10.1093/protein/14.7.473
- Fernandez-Fuentes N, Oliva B, Fiser A: A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Research 2006, 34(7):2085–2097. doi:10.1093/nar/gkl156
- Jorgensen WL, Maxwell DS, Tirado-Rives J: Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 1996, 118: 11225–11236. doi:10.1021/ja9621760
- Ghosh A, Rapp CS, Friesner RA: Generalized Born model based on a surface integral formulation. J Phys Chem B 1998, 102: 10983–10990. doi:10.1021/jp982533o
- de Bakker PIW, DePristo MA, Burke DF, Blundell TL: Ab initio construction of polypeptide fragments: Accuracy of loop decoy discrimination by an all-atom statistical potential and the AMBER force field with the Generalized Born solvation model. Proteins Struct Funct Genet 2002, 51: 21–40.
- Sellers BD, Zhu K, Zhao S, Friesner RA, Jacobson MP: Toward Better Refinement of Comparative Models: Predicting Loops in Inexact Environments. Proteins: Structure, Function, and Bioinformatics 2008, 72(3):959–971. doi:10.1002/prot.21990
- Lin MS, Head-Gordon T: Improved Energy Selection of Nativelike Protein Loops for Loop Decoys. Journal of Chemical Theory and Computation 2008, 4: 515–521. doi:10.1021/ct700292u
- Godzik A: Knowledge-based potentials for protein folding: what can we learn from known protein sequences? Structure 1996, 4: 363–366. doi:10.1016/S0969-2126(96)00041-X
- Ben-Naim A: Statistical potentials extracted from protein structures: are these meaningful potentials? J Chem Phys 1997, 107: 3698–3706. doi:10.1063/1.474725
- Thomas PD, Dill KA: Statistical potentials extracted from protein structures: how accurate are they? J Mol Biol 1996, 257: 457–469. doi:10.1006/jmbi.1996.0175
- Kocher JA, Rooman MJ, Wodak SJ: Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J Mol Biol 1994, 235: 1598–1613. doi:10.1006/jmbi.1994.1109
- Li Y, Bordner AJ, Tian Y, Tao X, Gorin A: Extensive Exploration of the Conformational Space Improves Rosetta Results for Short Protein Domains. Proceedings of 7th Annual International Conference on Computational Systems Bioinformatics (CSB08) 2008.
- Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III targets using Rosetta. Proteins 1999, 37(S3):171–176. doi:10.1002/(SICI)1097-0134(1999)37:3+<171::AID-PROT21>3.0.CO;2-Z
- Eswar N, Eramian D, Webb B, Shen M, Sali A: Protein Structure Modeling with MODELLER. Structural Proteomics: High-Throughput Methods 2008, 145–159.
- Deb K: Multi-objective optimization using evolutionary algorithms. John Wiley & Sons; 2001.
- Koppen M, Vicente-Garcia R: A Fuzzy Scheme for the Ranking of Multivariate Data and its Application. Proceedings of the IEEE Annual Meeting of the North American Fuzzy Information Processing Society 2004.
- Gao C, Stern HA: Scoring function accuracy for membrane protein structure prediction. Proteins 2007, 68(1):67–75. doi:10.1002/prot.21421
- Waegeman W: Learning to rank: a ROC-based graph-theoretic approach. 4OR: A Quarterly Journal of Operations Research 2009, 7(4):399–402. doi:10.1007/s10288-009-0095-y
- Mierswa I, Morik K: About the non-convex optimization problem induced by non-positive semidefinite kernel learning. Advances in Data Analysis and Classification 2008, 2(3):241–258. doi:10.1007/s11634-008-0033-4
- Fan RE, Chen PH, Lin CJ: Working set selection using second order information for training SVM. Journal of Machine Learning Research 2005, 6: 1889–1918.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.