A conformation ensemble approach to protein residue-residue contact
© Eickholt et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2011
Accepted: 12 October 2011
Published: 12 October 2011
Protein residue-residue contact prediction is important for protein model generation and model evaluation. Here we develop a conformation ensemble approach to improve residue-residue contact prediction. We collect a number of structural models stemming from a variety of methods and implementations. The various models capture slightly different conformations and contain complementary information which can be pooled together to capture recurrent, and therefore more likely, residue-residue contacts.
We applied our conformation ensemble approach to free modeling targets from both CASP8 and CASP9. Given a diverse ensemble of models, the method is able to achieve accuracies of 0.48 for the top L/5 medium range contacts and 0.36 for the top L/5 long range contacts for CASP8 targets (L being the target domain length). When applied to targets from CASP9, the accuracies of the top L/5 medium and long range contact predictions were 0.34 and 0.30, respectively.
When operating on a moderately diverse ensemble of models, the conformation ensemble approach is an effective means to identify medium and long range residue-residue contacts. An immediate benefit of the method is that when tied with a scoring scheme, it can be used to successfully rank models.
Even after many years of intense attention and development, de novo protein structure prediction remains a difficult and open problem. In part, this is due to the inadequacy of current de novo sampling techniques, which are incapable of guiding the folding process through such a vast conformational space [1–3]. To address this issue, several groups have proposed the use of long range contacts to reduce the size of the conformational search space. Studies have shown that with as few as L/8 long range contacts (L being the sequence length) proteins can be folded and moderate resolution models generated [4, 5]. Additional uses of protein residue-residue contacts include model evaluation, model selection and ranking [6–8], and drug design.
Given the importance and applicability of protein contacts, considerable effort has been put forth to develop methods which can predict protein residue-residue contacts. The majority of these methods can be categorized into three groups based on machine learning, templates or correlated mutations. Machine learning approaches make predictions by employing techniques such as neural networks, support vector machines or hidden Markov models trained on contacts from experimental structures [10–16]. Template based methods rely on the detection of similar structures (i.e., templates) by means of threading or homology and, once templates are identified, extract contacts from them as predictions [16–18]. Recently, more sophisticated template based approaches have been developed which attempt to combine contacts contained in differing conformations among identified templates. This is done by weighting the contacts contained within the templates based on the evolutionary distance between the templates and the target sequence. Methods based on correlated mutations identify correlated changes in residues as evidenced in multiple sequence alignments and then exploit this information to predict residue-residue contacts [20–24]. Both machine learning and correlated mutation methods are considered ab initio methods since no structural template information is used. One additional method which does not fall under any of the three categories is the extraction of contacts from 3D structural models generated for a protein. This approach has been used by the CASP assessors [25, 26], by a few CASP predictors such as SMEG-CCP (see CASP8 abstracts), and in scoring protein models.
In spite of the effort and attention that contact prediction has been given, the accuracy of long range contact predictions remains quite low for hard targets. For these targets, accuracies typically range from 20 to 35% depending on the number of contacts considered, the distance thresholds and the dataset [13, 15, 16]. Results from the eighth and ninth Critical Assessment of Techniques for Protein Structure Prediction (CASP) report that for free modeling (i.e., hard) targets, the average accuracy for long range contacts is routinely in the range of 20 to 25% [25, 27].
Here we present a conformation ensemble approach for contact prediction. The approach is partially motivated by the view that while current protein structure prediction methods infrequently capture the overall conformation of hard targets, they do often capture portions of it. By pooling together a number of models stemming from varying alignments, templates, methods and implementations, it is possible to create an ensemble of conformations which represent portions of possible conformations for the target. The various models can capture slightly different conformations and contain complementary information which can be pooled together to capture recurrent, and therefore more likely, residue-residue contacts regardless of the particular conformation. The method works by extracting contacts from a large ensemble of possible structures generated for a protein. When evaluating the method on the CASP8 and CASP9 free modeling (FM) targets, we find that it outperforms current approaches substantially and achieves long range contact accuracies of 36% on the CASP8 FM targets and 30% on the CASP9 FM targets.
Datasets and Evaluation Metrics
The prediction targets used in our study were the protein domains classified as free modeling (FM) targets for CASP8 and CASP9. These are domains which either did not have structural templates or whose templates existed but were extremely difficult to detect. For CASP8, the target domains considered were the same as those used in the official CASP8 assessment of contact predictors. These domains included T0397 [1-82], T0405 [2-282], T0416 [124-180], T0443 [31-96], T0443 [97-118,136-173], T0460 [1-49,72-102], T0465 [25-35,41-135], T0476 [2-88], T0482 [5-10,19-31,35-46,49-76,96-103], T0496 [4-123], T0510 [236-279] and T0513 [17-85]. For CASP9, we used all the domains classified as FM on the official CASP9 website (http://predictioncenter.org/casp9/domain_definitions.cgi). These domains included T0529 [7-339], T0531 [6-63], T0534 [31-80,257-384], T0534 [81-256], T0537 [65-350], T0537 [351-381], T0544 [1-135], T0547 [343-421], T0547 [554-609], T0550 [178-339], T0553 [3-65], T0553 [66-136], T0555 [12-145], T0561 [1-109,112-161], T0571 [197-331], T0578 [9-56,64-163], T0581 [27-131], T0604 [11-94], T0604 [292-496], T0608 [29-117], T0618 [6-175], T0621 [2-170], T0624 [5-73], T0629 [50-208], T0637 [1-135] and T0639 [3-126]. All the targets along with their corresponding domain definitions and experimental structures are available on the CASP websites (http://predictioncenter.org/casp8/, http://predictioncenter.org/casp9/). It should be noted that the ensemble prediction approach could be applied to hard template based modeling targets as well. In this study we limited ourselves to the free modeling targets as they are typically the type of target chosen when evaluating residue-residue contact prediction methods.
For the purposes of our investigation, two amino acid residues are said to be in contact if the distance between their Cβ atoms (Cα for glycine) in the experimental structure is less than 8 Å. Long range contacts are defined as residues in contact whose separation in the sequence is greater than or equal to 24 residues. Medium range contacts are defined by interacting residues which are 12 to 23 residues apart in the sequence. These definitions were used in accordance with previous studies [10, 15, 16] and CASP residue-residue contact assessments [25–27, 29].
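The contact definition above can be illustrated with a minimal Python sketch. It assumes each residue has already been reduced to one representative coordinate (Cβ, or Cα for glycine) keyed by its sequence position; the function names and the input layout are illustrative choices, not part of the original study.

```python
import math

CONTACT_THRESHOLD = 8.0  # Angstroms; a pair is in contact if strictly closer than this


def classify_range(i, j):
    """Return 'long', 'medium', or None based on sequence separation."""
    separation = abs(i - j)
    if separation >= 24:
        return "long"
    if separation >= 12:
        return "medium"
    return None  # shorter-range pairs are not considered here


def extract_contacts(coords, threshold=CONTACT_THRESHOLD):
    """coords: dict mapping residue number -> (x, y, z) of Cb (Ca for Gly).

    Returns a dict with two sets of (i, j) pairs, i < j.
    """
    contacts = {"medium": set(), "long": set()}
    residues = sorted(coords)
    for index, i in enumerate(residues):
        for j in residues[index + 1:]:
            kind = classify_range(i, j)
            if kind and math.dist(coords[i], coords[j]) < threshold:
                contacts[kind].add((i, j))
    return contacts
```

Storing each pair as (i, j) with i < j keeps contacts from different models directly comparable as set members.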
A common evaluation metric for residue-residue contact predictions is the accuracy of the top L/5 or L/10 predictions, where L is the length of the protein in residues. If evaluating predictions over a domain, L can also be the length of the domain. Accuracy is defined as the number of correctly predicted residue-residue contacts divided by the total number of contact predictions considered. Recall is defined as the number of correctly predicted residue-residue contacts divided by the total number of true contacts. We also calculated the number of contact predictions which were very close to a true contact. For this calculation, a prediction is considered correct if there is a true contact within ± δ residues for small values (i.e., 1 or 2) of δ.
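As a rough illustration of these metrics, the following Python sketch computes accuracy and recall for a ranked list of predicted contacts. Two points are assumptions rather than details stated in the text: L/5 is taken with integer division, and a prediction (i, j) is treated as correct within ± δ when some true contact (i', j') has both |i − i'| ≤ δ and |j − j'| ≤ δ. All function names are hypothetical.

```python
def is_correct(prediction, true_contacts, delta=0):
    """True if a true contact lies within +/- delta residues of the prediction."""
    i, j = prediction
    return any(
        abs(i - ti) <= delta and abs(j - tj) <= delta
        for ti, tj in true_contacts
    )


def top_fraction(ranked_predictions, length, divisor=5):
    """Keep the top L/divisor predictions (integer division is an assumption)."""
    return ranked_predictions[: max(1, length // divisor)]


def accuracy(predictions, true_contacts, delta=0):
    """Correct predictions divided by the number of predictions considered."""
    if not predictions:
        return 0.0
    correct = sum(is_correct(p, true_contacts, delta) for p in predictions)
    return correct / len(predictions)


def recall(predictions, true_contacts):
    """Correctly predicted contacts divided by the number of true contacts."""
    if not true_contacts:
        return 0.0
    return len(set(predictions) & set(true_contacts)) / len(true_contacts)
```

Pairs are expected in a consistent order (i < j) in both the predictions and the true contacts.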
Conformation Ensemble Contact Prediction Procedure
The primary source of input ensembles was CASP. During the most recent CASP experiments, prediction groups were allowed to submit up to 5 tertiary structure predictions per target to the prediction center. The models for the groups which participated in the server category are available on the CASP website and provided us with a rich collection of ensembles for our prediction targets. On average there were 301 models in each ensemble.
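The pooling idea can be sketched in Python as follows. This section does not spell out how contacts are scored across the ensemble, so the sketch assumes the simplest reading of "recurrent" contacts: each contact is scored by the fraction of models in which it appears. The function name and the tie-breaking rule are illustrative assumptions only.

```python
from collections import Counter


def pool_contacts(model_contacts):
    """Rank contacts by how often they recur across an ensemble of models.

    model_contacts: list with one collection of (i, j) pairs per model.
    Returns a list of ((i, j), fraction_of_models), most frequent first.
    Frequency-based scoring is an assumption, not the published scheme.
    """
    counts = Counter()
    for contacts in model_contacts:
        counts.update(set(contacts))  # count each contact once per model
    n_models = len(model_contacts)
    # Ties are broken by residue indices so the ordering is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(pair, count / n_models) for pair, count in ranked]
```

The top L/5 entries of the returned list would then serve as the contact predictions for a target of length L.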
Results and Discussion
Table 1. Precision and recall of conformation ensemble contact predictions on CASP8 FM targets (columns: medium range contacts, long range contacts; rows include top L/5 with δ = 1 and top L/5 with δ = 2).
Table 2. Precision and recall of conformation ensemble contact predictions on CASP9 FM targets (columns: medium range contacts, long range contacts; rows include top L/5 with δ = 1 and top L/5 with δ = 2).
Table 3. Comparison of contact predictors on top L/5 predictions for CASP9 FM targets (columns: medium range contacts, long range contacts).
Table 4. Precision of top L/5 contact predictions obtained from filtered ensembles on CASP9 FM targets (columns: medium range contacts, long range contacts).
One application of our conformation ensemble approach which we demonstrate here is its usability and effectiveness in ranking models. The use of predicted contacts to rank and select models has been studied previously and shown to be useful [6, 8]. Motivated by these efforts, we developed our own scoring scheme to rank models using contacts obtained by the conformation ensemble approach. To rank models, we used our conformation ensemble approach to generate contacts for each FM target. We then scored the models based on how well they satisfied the predicted top L medium range contacts and all long range contacts. More specifically, we calculated the percentage of the predicted medium range contacts satisfied exactly, the percentage of predicted medium range contacts satisfied within 1 residue (i.e., δ = 1), the percentage of predicted long range contacts satisfied exactly and the percentage of predicted long range contacts satisfied within 1 residue. The sum of these percentages was calculated and used to rank the models.
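The scoring scheme described above can be sketched in Python as the sum of four percentages. The sketch assumes that a predicted contact is "satisfied" by a model when the model contains that contact, or, for δ = 1, a contact whose two residue indices are each within one position of the predicted pair. The function names and input layout are hypothetical.

```python
def percent_satisfied(predicted, model_contacts, delta=0):
    """Percentage of predicted contacts matched by the model within +/- delta."""
    if not predicted:
        return 0.0
    hits = sum(
        any(abs(i - mi) <= delta and abs(j - mj) <= delta
            for mi, mj in model_contacts)
        for i, j in predicted
    )
    return 100.0 * hits / len(predicted)


def score_model(model_contacts, predicted_medium, predicted_long):
    """Sum of the four percentages: medium and long range, exact and delta = 1."""
    return (
        percent_satisfied(predicted_medium, model_contacts, delta=0)
        + percent_satisfied(predicted_medium, model_contacts, delta=1)
        + percent_satisfied(predicted_long, model_contacts, delta=0)
        + percent_satisfied(predicted_long, model_contacts, delta=1)
    )


def rank_models(models, predicted_medium, predicted_long):
    """models: dict of model name -> set of (i, j) contacts. Best model first."""
    return sorted(
        models,
        key=lambda name: score_model(models[name], predicted_medium, predicted_long),
        reverse=True,
    )
```

Under this sketch a model's score ranges from 0 to 400, and only the relative order of the scores matters for ranking.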
Table 5. The average loss on CASP9 FM targets (column: average loss in GDT-TS score; rows include scoring with conformation ensemble contacts and a random baseline measure).
As indicated in Table 5, the model rankings based on contacts obtained by our conformation ensemble approach are very competitive and on par with those stemming from model quality assessment programs, which performed much better than the random baseline approach. The simple scoring scheme we used to rank models rewards those models which characterize the residue-residue interactions which were most common across the ensemble. Thus, the ability to effectively rank models using contacts obtained by our conformation ensemble approach indicates that the method is able to consolidate information about the protein's overall structure across the models. Here, we also note that this ranking strategy (i.e., extracting contacts from models and using them as a means to rank the models) could be applicable to any protein structure prediction pipeline which produces a large number of structures in the course of making a 3D model.
Table 6. Representation of predicted contact clusters in an ensemble (column: cluster coverage, given as the percentage of models from the ensemble with the stated coverage).
This ability to consolidate contact information across multiple models is a concept that several protein structure predictors could use as part of their own prediction pipeline. Clustering is widely used as a means to identify more probable structures from a pool of models. However, with clustering, only similar models are capable of being clustered and contributing information. With the conformation ensemble approach, all models are able to contribute and help identify likely residue-residue interactions. One could easily envision an iterative approach in which a protein structure predictor could generate a diverse set of models, extract contact data and use it to generate more models. This would allow information about the conformation space to be passed from one round to the next via the likely contacts extracted from the models.
A disadvantage of the method is its dependency on a diverse ensemble of mildly accurate 3D models. In order for the approach to work, the models generated need to be able to capture at least some local portion of the overall topology of the protein. If all of the models in the ensemble are of poor quality then the method does not perform very well.
An additional consideration which must be taken into account is the generation of the models. In practice, one would need to generate a varied ensemble of models before using the method. This could be done using a variety of protein structure prediction methods or variants of a few approaches. The time and computing resources needed to generate the models would depend on the methods used to produce them. These decisions would affect the practicality and usefulness of the method as a general residue-residue contact predictor. Yet, as we have demonstrated, the method is applicable to ensembles of smaller sizes and still generates relatively accurate predictions. The size of the ensemble and the sources of the models are choices which must be made when implementing a conformation ensemble predictor and inevitably affect the time needed to make contact predictions, the accuracy of those predictions and the method's ability to extract varied contact information across the models.
In this work we have presented a conformation ensemble approach for predicting protein residue-residue contacts. The method draws contact data from an ensemble of models which capture slightly different conformations and contain complementary information. This information can be pooled together to capture recurrent, and therefore more likely, residue-residue contacts. We evaluated our approach on hard targets from CASP8 and CASP9 and found that it is capable of achieving state of the art performance for medium and long range residue-residue contact prediction. We have also demonstrated that the generated contact information coupled with a simple scoring scheme is capable of effectively ranking models.
The work was partially supported by an NIH grant 1R01GM093123 to JC and an NLM fellowship to JE.
- Ben-David M, Noivirt-Brik O, Paz A, Prilusky J, Sussman JL, Levy Y: Assessment of CASP8 structure predictions for template free targets. Proteins 2009, 77(Suppl 9):50–65.
- Bradley P, Misura KMS, Baker D: Toward High-Resolution de Novo Structure Prediction for Small Proteins. Science 2005, 309: 1868–1871. doi:10.1126/science.1113801
- Zhang Y: Progress and challenges in protein structure prediction. Current Opinion in Structural Biology 2008, 18: 342–348. doi:10.1016/j.sbi.2008.02.004
- Li W, Zhang Y, Skolnick J: Application of sparse NMR restraints to large-scale protein structure prediction. Biophys J 2004, 87: 1241–1248. doi:10.1529/biophysj.104.044750
- Skolnick J, Kolinski A, Ortiz AR: MONSSTER: a method for folding globular proteins with a small number of distance restraints. J Mol Biol 1997, 265: 217–241. doi:10.1006/jmbi.1996.0720
- Miller CS, Eisenberg D: Using inferred residue contacts to distinguish between correct and incorrect protein models. Bioinformatics 2008, 24: 1575–1582. doi:10.1093/bioinformatics/btn248
- Wang Z, Tegge AN, Cheng J: Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins 2009, 75: 638–647. doi:10.1002/prot.22275
- Tress ML, Valencia A: Predicted residue-residue contacts can help the scoring of 3D models. Proteins 2010, 78: 1980–1991.
- Kliger Y, Levy O, Oren A, Ashkenazy H, Tiran Z, Novik A, Rosenberg A, Amir A, Wool A, Toporik A, et al.: Peptides modulating conformational changes in secreted chaperones: from in silico design to preclinical proof of concept. Proc Natl Acad Sci USA 2009, 106: 13797–13801. doi:10.1073/pnas.0906514106
- Bjorkholm P, Daniluk P, Kryshtafovych A, Fidelis K, Andersson R, Hvidsten TR: Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts. Bioinformatics 2009, 25: 1264–1270. doi:10.1093/bioinformatics/btp149
- Pollastri G, Baldi P: Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 2002, 18(Suppl 1):S62–70. doi:10.1093/bioinformatics/18.suppl_1.S62
- Xue B, Faraggi E, Zhou Y: Predicting residue-residue contact maps by a two-layer, integrated neural-network method. Proteins 2009, 76: 176–183. doi:10.1002/prot.22329
- Tegge AN, Wang Z, Eickholt J, Cheng J: NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Res 2009, 37: W515–518. doi:10.1093/nar/gkp305
- Cheng J, Baldi P: Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 2007, 8: 113. doi:10.1186/1471-2105-8-113
- Vullo A, Walsh I, Pollastri G: A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics 2006, 7: 180. doi:10.1186/1471-2105-7-180
- Wu S, Zhang Y: A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 2008, 24: 924–931. doi:10.1093/bioinformatics/btn069
- Misura KM, Chivian D, Rohl CA, Kim DE, Baker D: Physically realistic homology models built with ROSETTA can be more accurate than their templates. Proc Natl Acad Sci USA 2006, 103: 5361–5366. doi:10.1073/pnas.0509355103
- Skolnick J, Kihara D, Zhang Y: Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 2004, 56: 502–518. doi:10.1002/prot.20106
- Ashkenazy H, Unger R, Kliger Y: Hidden conformations in protein structures. Bioinformatics 2011, 27: 1941–1947. doi:10.1093/bioinformatics/btr292
- Fodor AA, Aldrich RW: Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 2004, 56: 211–221. doi:10.1002/prot.20098
- Gobel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins 1994, 18: 309–317. doi:10.1002/prot.340180402
- Kundrotas PJ, Alexov EG: Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives. BMC Bioinformatics 2006, 7: 503. doi:10.1186/1471-2105-7-503
- Olmea O, Valencia A: Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des 1997, 2: S25–32.
- Vicatos S, Reddy BV, Kaznessis Y: Prediction of distant residue contacts with the use of evolutionary information. Proteins 2005, 58: 935–949. doi:10.1002/prot.20370
- Ezkurdia I, Grana O, Izarzugaza JM, Tress ML: Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8. Proteins 2009, 77(Suppl 9):196–209.
- Izarzugaza JM, Grana O, Tress ML, Valencia A, Clarke ND: Assessment of intramolecular contact predictions for CASP7. Proteins 2007, 69(Suppl 8):152–158.
- Monastyrskyy B, Fidelis K, Tramontano A, Kryshtafovych A: Evaluation of residue-residue contact predictions in CASP9. Proteins 2011.
- Tress ML, Ezkurdia I, Richardson JS: Target domain definition and classification in CASP8. Proteins 2009, 77(Suppl 9):10–17.
- Grana O, Baker D, MacCallum RM, Meiler J, Punta M, Rost B, Tress ML, Valencia A: CASP6 assessment of contact prediction. Proteins 2005, 61(Suppl 7):214–224.
- Zemla A, Venclovas C, Moult J, Fidelis K: Processing and evaluation of predictions in CASP4. Proteins 2001, (Suppl 5):13–21.
- Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins 2004, 57: 702–710. doi:10.1002/prot.20264
- Zemla A: LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res 2003, 31: 3370–3374. doi:10.1093/nar/gkg571
- Zemla A, Venclovas C, Moult J, Fidelis K: Processing and analysis of CASP3 protein structure predictions. Proteins 1999, (Suppl 3):22–29.
- Cozzetto D, Kryshtafovych A, Tramontano A: Evaluation of CASP8 model quality predictions. Proteins 2009, 77(Suppl 9):157–166.
- Xu D, Zhang J, Roy A, Zhang Y: Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement. Proteins 2011.