Improving consensus structure by eliminating averaging artifacts
© Dukka; licensee BioMed Central Ltd. 2009
Received: 03 October 2008
Accepted: 06 March 2009
Published: 06 March 2009
Common structural biology methods (i.e., NMR and molecular dynamics) often produce ensembles of molecular structures. Consequently, averaging of 3D coordinates of molecular structures (proteins and RNA) is a frequent approach to obtain a consensus structure that is representative of the ensemble. However, when the structures are averaged, artifacts can result in unrealistic local geometries, including unphysical bond lengths and angles.
Herein, we describe a method to derive representative structures while limiting the number of artifacts. Our approach is based on a Monte Carlo simulation technique that drives a starting structure (an extended or a 'close-by' structure) towards the 'averaged structure' using a harmonic pseudo energy function. To assess the performance of the algorithm, we applied our approach to Cα models of 1364 proteins generated by the TASSER structure prediction algorithm. The average RMSD of the refined model from the native structure for the set becomes worse by a mere 0.08 Å compared to the average RMSD of the averaged structures from the native structure (3.28 Å for refined structures and 3.36 A for the averaged structures). However, the percentage of atoms involved in clashes is greatly reduced (from 63% to 1%); in fact, the majority of the refined proteins had zero clashes. Moreover, a small number (38) of refined structures resulted in lower RMSD to the native protein versus the averaged structure. Finally, compared to PULCHRA , our approach produces representative structure of similar RMSD quality, but with much fewer clashes.
The benchmarking results demonstrate that our approach for removing averaging artifacts can be very beneficial for the structural biology community. Furthermore, the same approach can be applied to almost any problem where averaging of 3D coordinates is performed. Namely, structure averaging is also commonly performed in RNA secondary prediction , which could also benefit from our approach.
Methods for the experimental or theoretical determination of protein structures often output their results as an ensemble. In the case of experimental data like X-ray crystallography data, the ensemble represents both the conformational diversity and the inability to resolve the time and spatial aspects of the experiment, whereas in the case of computational experiments, the ensemble in part represents the uncertainty of interpreting the data. However, often, there is a requirement for a single consensus structure. One way to generate this 'consensus' or 'representative structure' is to calculate the centroid structure by averaging the Cartesian coordinates of the ensemble of superimposed structures.
A series of computational and experimental studies have been performed to rationalize the averaging methodology. Zagrovic et al.  proposed the "mean-structure hypothesis" which states that the geometry of the collapsed unfolded state of small peptides and proteins in an average sense corresponds to the geometry of the native structure at equilibrium. Huang et al have shown that finding the "averaged structure" from a set of decoys yield structures that are closer to the native structure than most individual structures. Moreover, Zagrovic et al have shown mathematically that the RMSD between the "averaged structure" and the native structure is more similar than the most individual structures to the native structure. Furthermore, it was also argued that finding average distance matrices and using distance based root mean square deviation as a metric may be one way to capture the relevant features of ensembles of structure and compare them with other reference structures.
Unlike point based averaging where each member is a point, in averaging of structures, the "averaged model" often has unrealistic local geometry, including unphysical bond lengths and angles. In this regard, several methods have been developed to remove averaging artifacts. Due to the process of protein structure prediction, methods to remove averaging artifacts are most commonly developed in this context. The 'regularize' function of REFMAC  can be used to regularize the bonds and angles. Furthermore, Betancourt and Skolnick  developed a clustering approach, called SCAR, that uses a harmonic potential to refine centroid structures. However, structure prediction results indicate that SPICKER  outperforms SCAR in terms of model selection. Furthermore, it has been shown  that 'the models generated by TASSER  have incorrect side-chain conformations and poor hydrogen bonding patterns partly because of the on-lattice modelling and the unphysical geometry of the SPICKER cluster centroid structure'. PULCHRA , which combines a conjugant gradient search with a harmonic potential, supersedes both SCAR and SPICKER. Similarly, Kolinski and Bujnicki  have introduced an elegant approach using a combination of template-based and de novo modelling followed by hierarchical clustering that employed averaging of very diverse models from threading, that results in consensus structures with improved local and global quality.
In general, averaging artifacts become more pronounced when members of the ensemble are more divergent. This artifact is exacerbated in TASSER due to the fact that it begins with a lattice model and averaging is performed across clusters of dissimilar structures. Consequently, the averaged structure is often not suitable for detailed atomic model building due to unrealistic bond lengths and angles and unphysical local geometry. In the same vein, the community wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP) also penalizes structures with unphysical bond lengths and unrealistic geometries. CASP defines two types of clashes for bond length. The first type of clash involves atoms that are less than 1.9 Å apart, and the other type of clash involves atoms that are less than 3.6 Å apart. We adopt these criteria in what follows.
Herein, we apply the proposed algorithm for removing averaging artifacts from clusters of structures generated by the TASSER (Threading/ASSembly/Refinement) algorithm . Within TASSER, generated structures are clustered using SPICKER  and the cluster with the highest structural density is selected (this is still true in current version of TASSER ). Subsequently, the centroid model (called COMBO) that is obtained by averaging all the cluster members of the most densely populated cluster is selected as the predicted structure. In various benchmarks of the TASSER algorithm , the averaged structure (aka, the COMBO model) is generally closest to the native in terms of global RMSD. It is closer to the native structure than all the individual cluster members, including the medoid (CLOSC model). Hence, TASSER outputs the cluster centroid (COMBO model) as its final model. In this regard, averaged models have also been shown to outperform minimum free energy structures in the context of RNA secondary structure prediction .
Our goal is to generate a structure that is as close as possible to the 'averaged structure' while maintaining realistic bond lengths and angles and local geometry. Unless otherwise stated, the term bond length refers to 'virtual bond length' between two Cα atoms throughout this report, and bond angle refers to 'virtual bond angle' between any three consecutive Cα atoms. To address this issue, we have developed a new algorithm, MCORE (M onte C arlO based RE finement) that is designed to generate such structures by minimizing the difference between the 'averaged' and the physically reasonable structure using a Monte Carlo minimization procedure. We show that our approach is robust and general and can overcome averaging artifacts with minimal reduction of structure quality as assessed by the RMSD of the resulting model from the native structure. Once the refined Cα model is obtained, then approaches like the one based on Backbone Building from Quadrilaterals proposed by Gront et al.  can be used to complete backbone reconstruction.
The central idea behind our approach is to start from a structure that has physically allowed bond lengths and then minimize the difference between this starting structure and the averaged structure. In this respect, our methodology consists of two basic components: (1) generation of the starting structure and (2) minimization of this starting structure in the presence of the averaged structure.
We explore two types of starting structures: (1) a fully extended structure with bond length corresponding to the average bond length obtained from the PDB and all ψ and ψ = 180°, (2) a model that is close to the 'averaged structure' but has physically reasonable bond length and angles, which we call the 'close-by model'. A typical model of this type can be the structure that is closest (based on RMSD) to the 'averaged structure' in an ensemble of proteins. In case of TASSER, CLOSC models fall in this category. In case, when two structures have the same RMSD to the averaged structure, one of them is chosen at random. Extended structures will be required when no 'close-by model' is available.
where N is the number of Cα atoms; k excl , k ang , k clos are the weights of corresponding contributions to V. r kl is the distance between the k th and l th Cα atoms, r0_exclis the cutoff parameter for excluded volume violations and is set to 4 Å if r kl < 4.0 Å, otherwise r0_exclis set to be equal to r kl . That is, this contribution to the potential is turned off. θi, i+1, i+2is the virtual bond angle formed by the i th , i+1 st , and i+2 nd Cα atom θ0_angis the cut-off angle, and is set to be 70° if θi, i+1, i+2< 70°, 150° if θi, i+1, i+2> 150°, or θ0_ang= θi, i+1, i+2otherwise; is the distance between the corresponding Cα atoms of the target structure and the current conformation, and d0_clois the maximum allowed displacement between the corresponding Cα atoms and is set to be equal to 0.001 if > 0.001 Å or is set equal to otherwise. The values for these parameters are chosen such that they are close to those by Oldfield et al.  The values of k clos , k excl , and k ang are chosen to be 1.0831, 0.56818 and 0.015, respectively, on the basis of optimization of the parameters using MINUIT  to maximize the correlation between the energy function and the RMSD to the native structure and manual adjustment based on empirical observation for a set of 726 proteins that are used for training parameters as described in the data set. The RMSD values are measured on the Cα atoms for all the cases except those where specified.
There are no straightforward convergence criteria for Monte Carlo Simulations (MCS). However, two obvious convergence criteria are: (1) allowing for a pre-specified total number of steps and (2) allowing the algorithm to proceed until it ceases to make progress. Herein, we use both types of convergency criteria. Starting from the 726 extended structures, the average final RMSD for a 2000 steps run was 0.05 Å. Hence, we choose 2000 steps as the specified steps for our simulation. Furthermore, we also devised a mechanism to stop the algorithm when it ceases to make progress. We define that the algorithm ceases to make progress after step j if following criteria is satisfied for every i, where 1 <i <n:∀ i (RMSD j - RMSDj-i) <T
where RMSD j is the RMSD of the conformation after j steps and RMDSj-iis the RMSD of the conformation after j-i steps. The value of i goes from 1 through n and T is the tolerance cutoff. Once the step j (where the algorithm ceases to make progress) is obtained, the simulation is run for extra x steps. In other words, the simulation is stopped after j+x number of steps if the value of RMSD of the last n steps is within the tolerance region compared to the value of the step j. The values of n, x and T were chosen empirically and were chosen to be 50, 10 and 0.05 respectively. Monte Carlo simulations were performed using the above move set and the standard Metropolis criteria at a temperature of 450 K .
To verify the application of MCORE algorithm, we use it to remove the averaging artifacts in the output of TASSER algorithm. The data set used for this study consists of 2090 non-homologous single domain proteins with less than 200 residues with a maximum of 35% pairwise sequence identity to each other that cover the Protein Data Bank. All of these proteins have an initial RMSD of COMBO model (averaged model) against the native protein to be less than 6.5 Å. This is from the fact that the predicted models that are about 6.0 Å to the native structures are likely to have the same fold as the native structure . In addition, from the TASSER outputs we have corresponding COMBO structure and CLOSC structures for each of these proteins. Out of the 2090 proteins, 726 proteins are used for training of model parameters, whereas the remaining 1364 proteins are used for validation. All root mean square deviation (RMSD) values refer to Cα atom comparisons unless otherwise stated.
Results and discussion
Comparison of Two Types of Starting Structures
Comparison of results for two types of starting structures1
1 min/3.38 Å/1.8%
2 min/3.35 Å/1.5%
20 min/3.34 Å/1.3%
1 min/3.36 Å/1.0%
2 min/3.35 Å/1.0%
20 min/3.34 Å/1.2%
Refinement of COMBO Models
Before comparing the results of the refined models, we also take an opportunity to analyze the RMSD to native of the 1364 COMBO and CLOSC models. Across the dataset, only 100 CLOSC models had lower RMSD values relative to the native structure compared to the COMBO models, reiterating the advantage of averaged structures in this regard. The average RMSD of COMBO model to the native structure is 3.28 Å, whereas the average RMSD of CLOSC model to the corresponding native structure is 3.55 Å.
Cα RMSD of MCORE and other Models compared to the native structure for a set of 1364 proteins in terms of RMSD, TM-SCORE and percentage of atoms in the clashes2.
Overall, it is found that MCORE produces models better in terms of RMSD and TM-score  versus the corresponding CLOSC models. Moreover, the MCORE models are only slightly worse than the averaged COMBO models, which is consistent with our initial problem statement. Note that the larger TM-score, the better the model. Moreover, if we discriminate clashes into the two CASP types, then it is more evident that our refined models are much better in terms of clashes as they do not have any atoms involved in clashes that are less than 1.9 Å, whereas COMBO models have 4.5% of the atoms involved in this regime of clashes.
We also investigated the average number of steps and time for the MCORE and MCORE-L algorithms. It was observed that for MCORE the number of average Monte Carlo Steps is less than 110 (= 109.21) and the average running time is 1.88 minutes. Moreover, for MCORE-L, the average running time is 20 minutes. Hence, MCORE can be applied to a wide variety of problems concerning the averaging of macromolecular structures due to its fast execution time.
Comparison with PULCHRA
Comparison of MCORE to COMBO and to PULCHRA in terms of RMSD and percentage of atoms involved in the clashes3.
Combo Vs MCORE
MCORE Vs PULCHRA
Furthermore, MCORE algorithm is comparable to PULCHRA in terms of efficiency also, as on average the computation time is around a minute. One of the major advantages of our approach compared to PULCHRA is that if the input structure is heavily distorted, PULCHRA might fail to converge where MCORE will always converge.
All-atom Model Reconstruction
It is essential to have a model with physical bond lengths and bond angles if further analysis is to be performed on the model. Since structure prediction methods often produce Cα-only models, all-atom models must be constructed from the Cα descriptions. In this regard, we built all-atom representations of the MCORE refined Cα atom models. The initial backbone reconstruction method applied is the backbone reconstruction method of an algorithm proposed by Milik et al . Once the backbone atoms are reconstructed, any side-chain packing methods [21, 22] can be utilized to build the side-chains. We performed the side-chain reconstruction using one of the most widely used side-chain packing algorithms SCWRL 3.0 . The MCORE refined models for the set of 1364 proteins had an average all-atom RMSD of 4.19 Å (which is, or course, higher than the value of 3.35 Å for the Cα models). The PULCHRA refined all-atom models on the same dataset had a comparable average value of 4.17 Å.
In this paper, we presented MCORE, a Monte-Carlo based algorithm for removing averaging artifacts of the averaged structure to improve the quality of the consensus structure. We verified the application of the proposed algorithm by applying the algorithm to refine the COMBO models of a set of 1364 proteins generated by TASSER algorithm, refining and correcting unphysical bond length and bond angles. On average, the RMSD to native of the refined model is 3.36 Å; where as RMSD of the COMBO model to the native is 3.28 Å, which is a mere 0.08 Å poorer than the RMSD of the COMBO model (averaged model). On the other hand, the average percentage of atoms involved in the clashes in the refined MCORE models is reduced from 63% (for the COMBO models) to only 1.0%. Moreover, slight RMSD gains were obtained by using a version of the MCORE algorithm that samples longer. However, the difference between MCORE-L (the longer version) and MCORE (Table 3) is not statistically significant, emphasizing that our convergence criterion is robust.
We have also generated a framework for producing all-atom models from the Cα only atom models by first reconstructing the backbone and then doing the side-chain reconstruction using existing methodologies. An obvious extension of the work is to refine not only Cα models, but also to apply MCORE to all-atom models. In essence, the new refinement algorithm helps in attaining structures with more physical bond lengths and bond angles by overcoming averaging artifacts produced due to averaging of structures. It has to be noted that there is always a trade-off between local geometric correctness and the deviation from the target structure. Generating averaged structures that are not heavily distorted can minimize this trade-off. These results provide a genuine model for the subsequent analysis of the respective protein structure using molecular mechanics force field. In addition, this algorithm does not have convergence problems like PULCHRA (which sometimes fails to converge if the input models are heavily distorted). Although the algorithm was tested for TASSER models only, the presented approach is general and can be applied to remove averaging artifacts arising from averaging over any ensemble of molecular conformations.
Root Mean Square Deviation
Critical Assessment of Techniques for Protein Structure Prediction
Nuclear Magnetic Resonance
MOnte Carlo based Refinement
- COMBO model:
- CLOSC model:
Model which is closest to the centroid Model.
Author would like to acknowledge Dr. Jeffrey Skolnick at Georgia Institute of Technology for guidance and providing the computational resources. Significant portion of the work was done when the author was at Georgia Institute of Technology. Furthermore, he also acknowledges Dr. Dennis R. Livesay at University of North Carolina at Charlotte for proofreading the paper and for helpful insights. Moreover, he would like to acknowledge Dr. Adrian Arakaki, Dr. Liliana Wroblewska, Dr. Hongyi Zhou and Dr. Shashi B. Pandit at Georgia Institute of Technology for stimulating discussions. Also, he would like to thank Dr. James Mottonen and Luis Carlos Gonzalez for proof reading.
- Rotkiewicz P, Skolnick J: Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of Computational Chemsitry 2008, 29: 1460–1465. 10.1002/jcc.20906View ArticleGoogle Scholar
- Ding Y, Chan CY, Lawrence CE: RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. Rna 2005, 11: 1157–1166. 10.1261/rna.2500605PubMed CentralView ArticlePubMedGoogle Scholar
- Furnham N, de Bakker PI, Gore S, Burke DF, Blundell TL: Comparative modelling by restraint-based conformational sampling. BMC structural biology 2008, 8: 7. 10.1186/1472-6807-8-7PubMed CentralView ArticlePubMedGoogle Scholar
- Zagrovic B, Snow CD, Khaliq S, Shirts MR, Pande VS: Native-like mean structure in the unfolded ensemble of small proteins. Journal of molecular biology 2002, 323: 153–164. 10.1016/S0022-2836(02)00888-4View ArticlePubMedGoogle Scholar
- Huang ES, Samudrala R, Ponder JW: Distance geometry generates native-like folds for small helical proteins using the consensus distances of predicted protein structures. Protein Sci 1998, 7: 1998–2003. 10.1002/pro.5560070916PubMed CentralView ArticlePubMedGoogle Scholar
- Zagrovic B, Pande VS: How does averaging affect protein structure comparison on the ensemble level? Biophys J 2004, 87: 2240–2246. 10.1529/biophysj.104.042184PubMed CentralView ArticlePubMedGoogle Scholar
- Murshudov GN, Vagin AA, Dodson EJ: Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr D Biol Crystallogr 1997, 53(pt3):240–255. 10.1107/S0907444996012255View ArticlePubMedGoogle Scholar
- Betancourt MR, Skolnick J: Finding the needle in a haystack: Educing native folds from ambiguous ab initio protein structure. Journal of Computational Chemistry 2001, 22: 339–353. Publisher Full Text 10.1002/1096-987X(200102)22:3<339::AID-JCC1006>3.0.CO;2-RView ArticleGoogle Scholar
- Zhang Y, Skolnick J: SPICKER: a clustering approach to identify near-native protein folds. Journal of Computational Chemistry 2004, 25: 865–871. 10.1002/jcc.20011View ArticlePubMedGoogle Scholar
- Zhou H, Pandit SB, Lee SY, Borreguero J, Chen H, Wroblewska L, Skolnick J: Analysis of TASSER-based CASP7 protein structure prediction results. Proteins 2007, 69(Suppl 8):90–97. 10.1002/prot.21649View ArticlePubMedGoogle Scholar
- Zhang Y, Arakaki AK, Skolnick J: TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins 2005, 61(Suppl 7):91–98. 10.1002/prot.20724View ArticlePubMedGoogle Scholar
- Kolinski A, Bujnicki JM: Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins 2005, 61(Suppl 7):84–90. 10.1002/prot.20723View ArticlePubMedGoogle Scholar
- Zhang Y, Devries ME, Skolnick J: Structure modeling of all identified G protein-coupled receptors in the human genome. PLoS Comput Biol 2006, 2: e13. 10.1371/journal.pcbi.0020013PubMed CentralView ArticlePubMedGoogle Scholar
- Gront D, Kmiecik S, Kolinski A: Backbone building from quadrilaterals: a fast and accurate algorithm for protein backbone reconstruction from alpha carbon coordinates. Journal of Computational Chemistry 2007, 28: 1593–1597. 10.1002/jcc.20624View ArticlePubMedGoogle Scholar
- Oldfield TJ, Hubbard RE: Analysis of C alpha geometry in protein structures. Proteins 1994, 18: 324–337. 10.1002/prot.340180404View ArticlePubMedGoogle Scholar
- James F: MINUIT Function Minimization and Error Analysis. CERN Program Library Long Writeup 1998., D506:Google Scholar
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equation-of-state calculations by fast computing machines. Journal of Chemical Physics 1953, 21: 1087–1092. 10.1063/1.1699114View ArticleGoogle Scholar
- Reva BA, Finkelstein AV, Skolnick J: What is the probability of a chance prediction of a protein structure with an rmsd of 6 A? Fold Des 1998, 3: 141–147. 10.1016/S1359-0278(98)00019-4View ArticlePubMedGoogle Scholar
- Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins 2004, 57: 702–710. 10.1002/prot.20264View ArticlePubMedGoogle Scholar
- Milik M, Kolinski A, Skolnick J: Algorithm for rapid reconstruction of protein backbone from alpha carbon coordinates. Journal of Computational Chemistry 1997, 18: 80–85. Publisher Full Text 10.1002/(SICI)1096-987X(19970115)18:1<80::AID-JCC8>3.0.CO;2-WView ArticleGoogle Scholar
- Dukka Bahadur KC, Tomita E, Suzuki J, Akutsu T: Protein side-chain packing problem: a maximum edge-weight clique algorithmic approach. J Bioinform Comput Biol 2005, 3: 103–126. 10.1142/S0219720005000904View ArticlePubMedGoogle Scholar
- Canutescu AA, Shelenkov AA, Dunbrack RL Jr: A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 2003, 12: 2001–2014. 10.1110/ps.03154503PubMed CentralView ArticlePubMedGoogle Scholar