Improving consensus structure by eliminating averaging artifacts
 Dukka B KC^{1}Email author
DOI: 10.1186/14726807912
© Dukka; licensee BioMed Central Ltd. 2009
Received: 03 October 2008
Accepted: 06 March 2009
Published: 06 March 2009
Abstract
Background
Common structural biology methods (i.e., NMR and molecular dynamics) often produce ensembles of molecular structures. Consequently, averaging of 3D coordinates of molecular structures (proteins and RNA) is a frequent approach to obtain a consensus structure that is representative of the ensemble. However, when the structures are averaged, artifacts can result in unrealistic local geometries, including unphysical bond lengths and angles.
Results
Herein, we describe a method to derive representative structures while limiting the number of artifacts. Our approach is based on a Monte Carlo simulation technique that drives a starting structure (an extended or a 'closeby' structure) towards the 'averaged structure' using a harmonic pseudo energy function. To assess the performance of the algorithm, we applied our approach to Cα models of 1364 proteins generated by the TASSER structure prediction algorithm. The average RMSD of the refined model from the native structure for the set becomes worse by a mere 0.08 Å compared to the average RMSD of the averaged structures from the native structure (3.28 Å for refined structures and 3.36 A for the averaged structures). However, the percentage of atoms involved in clashes is greatly reduced (from 63% to 1%); in fact, the majority of the refined proteins had zero clashes. Moreover, a small number (38) of refined structures resulted in lower RMSD to the native protein versus the averaged structure. Finally, compared to PULCHRA [1], our approach produces representative structure of similar RMSD quality, but with much fewer clashes.
Conclusion
The benchmarking results demonstrate that our approach for removing averaging artifacts can be very beneficial for the structural biology community. Furthermore, the same approach can be applied to almost any problem where averaging of 3D coordinates is performed. Namely, structure averaging is also commonly performed in RNA secondary prediction [2], which could also benefit from our approach.
Background
Methods for the experimental or theoretical determination of protein structures often output their results as an ensemble. In the case of experimental data like Xray crystallography data, the ensemble represents both the conformational diversity and the inability to resolve the time and spatial aspects of the experiment, whereas in the case of computational experiments, the ensemble in part represents the uncertainty of interpreting the data[3]. However, often, there is a requirement for a single consensus structure. One way to generate this 'consensus' or 'representative structure' is to calculate the centroid structure by averaging the Cartesian coordinates of the ensemble of superimposed structures.
A series of computational and experimental studies have been performed to rationalize the averaging methodology. Zagrovic et al. [4] proposed the "meanstructure hypothesis" which states that the geometry of the collapsed unfolded state of small peptides and proteins in an average sense corresponds to the geometry of the native structure at equilibrium. Huang et al[5] have shown that finding the "averaged structure" from a set of decoys yield structures that are closer to the native structure than most individual structures. Moreover, Zagrovic et al[6] have shown mathematically that the RMSD between the "averaged structure" and the native structure is more similar than the most individual structures to the native structure. Furthermore, it was also argued that finding average distance matrices and using distance based root mean square deviation as a metric may be one way to capture the relevant features of ensembles of structure and compare them with other reference structures.
Unlike point based averaging where each member is a point, in averaging of structures, the "averaged model" often has unrealistic local geometry, including unphysical bond lengths and angles. In this regard, several methods have been developed to remove averaging artifacts. Due to the process of protein structure prediction, methods to remove averaging artifacts are most commonly developed in this context. The 'regularize' function of REFMAC [7] can be used to regularize the bonds and angles. Furthermore, Betancourt and Skolnick [8] developed a clustering approach, called SCAR, that uses a harmonic potential to refine centroid structures. However, structure prediction results indicate that SPICKER [9] outperforms SCAR in terms of model selection. Furthermore, it has been shown [10] that 'the models generated by TASSER [11] have incorrect sidechain conformations and poor hydrogen bonding patterns partly because of the onlattice modelling and the unphysical geometry of the SPICKER[9] cluster centroid structure'. PULCHRA [1], which combines a conjugant gradient search with a harmonic potential, supersedes both SCAR and SPICKER. Similarly, Kolinski and Bujnicki [12] have introduced an elegant approach using a combination of templatebased and de novo modelling followed by hierarchical clustering that employed averaging of very diverse models from threading, that results in consensus structures with improved local and global quality.
In general, averaging artifacts become more pronounced when members of the ensemble are more divergent. This artifact is exacerbated in TASSER due to the fact that it begins with a lattice model and averaging is performed across clusters of dissimilar structures. Consequently, the averaged structure is often not suitable for detailed atomic model building due to unrealistic bond lengths and angles and unphysical local geometry. In the same vein, the community wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP) also penalizes structures with unphysical bond lengths and unrealistic geometries. CASP defines two types of clashes for bond length. The first type of clash involves atoms that are less than 1.9 Å apart, and the other type of clash involves atoms that are less than 3.6 Å apart. We adopt these criteria in what follows.
Herein, we apply the proposed algorithm for removing averaging artifacts from clusters of structures generated by the TASSER (Threading/ASSembly/Refinement) algorithm [11]. Within TASSER, generated structures are clustered using SPICKER [9] and the cluster with the highest structural density is selected (this is still true in current version of TASSER [10]). Subsequently, the centroid model (called COMBO) that is obtained by averaging all the cluster members of the most densely populated cluster is selected as the predicted structure. In various benchmarks of the TASSER algorithm [13], the averaged structure (aka, the COMBO model) is generally closest to the native in terms of global RMSD. It is closer to the native structure than all the individual cluster members, including the medoid (CLOSC model). Hence, TASSER outputs the cluster centroid (COMBO model) as its final model. In this regard, averaged models have also been shown to outperform minimum free energy structures in the context of RNA secondary structure prediction [2].
Our goal is to generate a structure that is as close as possible to the 'averaged structure' while maintaining realistic bond lengths and angles and local geometry. Unless otherwise stated, the term bond length refers to 'virtual bond length' between two Cα atoms throughout this report, and bond angle refers to 'virtual bond angle' between any three consecutive Cα atoms. To address this issue, we have developed a new algorithm, MCORE (M onte C arlO based RE finement) that is designed to generate such structures by minimizing the difference between the 'averaged' and the physically reasonable structure using a Monte Carlo minimization procedure. We show that our approach is robust and general and can overcome averaging artifacts with minimal reduction of structure quality as assessed by the RMSD of the resulting model from the native structure. Once the refined Cα model is obtained, then approaches like the one based on Backbone Building from Quadrilaterals proposed by Gront et al. [14] can be used to complete backbone reconstruction.
Methods
The central idea behind our approach is to start from a structure that has physically allowed bond lengths and then minimize the difference between this starting structure and the averaged structure. In this respect, our methodology consists of two basic components: (1) generation of the starting structure and (2) minimization of this starting structure in the presence of the averaged structure.
Starting Structure
We explore two types of starting structures: (1) a fully extended structure with bond length corresponding to the average bond length obtained from the PDB and all ψ and ψ = 180°, (2) a model that is close to the 'averaged structure' but has physically reasonable bond length and angles, which we call the 'closeby model'. A typical model of this type can be the structure that is closest (based on RMSD) to the 'averaged structure' in an ensemble of proteins. In case of TASSER, CLOSC models fall in this category. In case, when two structures have the same RMSD to the averaged structure, one of them is chosen at random. Extended structures will be required when no 'closeby model' is available.
Energy Function
where N is the number of Cα atoms; k_{ excl }, k_{ ang }, k_{ clos }are the weights of corresponding contributions to V. r_{ kl }is the distance between the k^{ th }and l^{ th }Cα atoms, r_{0_excl}is the cutoff parameter for excluded volume violations and is set to 4 Å if r_{ kl }< 4.0 Å, otherwise r_{0_excl}is set to be equal to r_{ kl }. That is, this contribution to the potential is turned off. θ_{i, i+1, i+2}is the virtual bond angle formed by the i^{ th }, i+1^{ st }, and i+2^{ nd }Cα atom θ_{0_ang}is the cutoff angle, and is set to be 70° if θ_{i, i+1, i+2}< 70°, 150° if θ_{i, i+1, i+2}> 150°, or θ_{0_ang}= θ_{i, i+1, i+2}otherwise; ${d}_{k{k}_{t}}$ is the distance between the corresponding Cα atoms of the target structure and the current conformation, and d_{0_clo}is the maximum allowed displacement between the corresponding Cα atoms and is set to be equal to 0.001 if ${d}_{k{k}_{t}}$ > 0.001 Å or is set equal to ${d}_{k{k}_{t}}$ otherwise. The values for these parameters are chosen such that they are close to those by Oldfield et al. [15] The values of k_{ clos }, k_{ excl }, and k_{ ang }are chosen to be 1.0831, 0.56818 and 0.015, respectively, on the basis of optimization of the parameters using MINUIT [16] to maximize the correlation between the energy function and the RMSD to the native structure and manual adjustment based on empirical observation for a set of 726 proteins that are used for training parameters as described in the data set. The RMSD values are measured on the Cα atoms for all the cases except those where specified.
Move Sets
Convergency Criteria
There are no straightforward convergence criteria for Monte Carlo Simulations (MCS). However, two obvious convergence criteria are: (1) allowing for a prespecified total number of steps and (2) allowing the algorithm to proceed until it ceases to make progress. Herein, we use both types of convergency criteria. Starting from the 726 extended structures, the average final RMSD for a 2000 steps run was 0.05 Å. Hence, we choose 2000 steps as the specified steps for our simulation. Furthermore, we also devised a mechanism to stop the algorithm when it ceases to make progress. We define that the algorithm ceases to make progress after step j if following criteria is satisfied for every i, where 1 <i <n:∀_{ i }(RMSD_{ j } RMSD_{ji}) <T
where RMSD_{ j }is the RMSD of the conformation after j steps and RMDS_{ji}is the RMSD of the conformation after ji steps. The value of i goes from 1 through n and T is the tolerance cutoff. Once the step j (where the algorithm ceases to make progress) is obtained, the simulation is run for extra x steps. In other words, the simulation is stopped after j+x number of steps if the value of RMSD of the last n steps is within the tolerance region compared to the value of the step j. The values of n, x and T were chosen empirically and were chosen to be 50, 10 and 0.05 respectively. Monte Carlo simulations were performed using the above move set and the standard Metropolis criteria at a temperature of 450 K [17].
Data Set
To verify the application of MCORE algorithm, we use it to remove the averaging artifacts in the output of TASSER algorithm. The data set used for this study consists of 2090 nonhomologous single domain proteins with less than 200 residues with a maximum of 35% pairwise sequence identity to each other that cover the Protein Data Bank. All of these proteins have an initial RMSD of COMBO model (averaged model) against the native protein to be less than 6.5 Å. This is from the fact that the predicted models that are about 6.0 Å to the native structures are likely to have the same fold as the native structure [18]. In addition, from the TASSER outputs we have corresponding COMBO structure and CLOSC structures for each of these proteins. Out of the 2090 proteins, 726 proteins are used for training of model parameters, whereas the remaining 1364 proteins are used for validation. All root mean square deviation (RMSD) values refer to Cα atom comparisons unless otherwise stated.
Results and discussion
Comparison of Two Types of Starting Structures
Comparison of results for two types of starting structures^{1}
Starting Structure  100 steps  200 steps  2000 steps 

Extended  1 min/3.38 Å/1.8%  2 min/3.35 Å/1.5%  20 min/3.34 Å/1.3% 
'Closeby model'  1 min/3.36 Å/1.0%  2 min/3.35 Å/1.0%  20 min/3.34 Å/1.2% 
Refinement of COMBO Models
Before comparing the results of the refined models, we also take an opportunity to analyze the RMSD to native of the 1364 COMBO and CLOSC models. Across the dataset, only 100 CLOSC models had lower RMSD values relative to the native structure compared to the COMBO models, reiterating the advantage of averaged structures in this regard. The average RMSD of COMBO model to the native structure is 3.28 Å, whereas the average RMSD of CLOSC model to the corresponding native structure is 3.55 Å.
Cα RMSD of MCORE and other Models compared to the native structure for a set of 1364 proteins in terms of RMSD, TMSCORE and percentage of atoms in the clashes^{2}.
Methods  Average  Std. Dev  

RMSD(Å)  COMBO  3.28  1.39 
PULCHRA  3.35  1.77  
CLOSC  3.55  1.49  
MCORE  3.36  1.42  
MCOREL  3.35  1.41  
Methods  Average  Std. Dev  
TMScore  COMBO  0.743  0.13 
PULCHRA  0.746  0.13  
CLOSC  0.719  0.13  
MCORE  0.744  0.13  
MCOREL  0.746  0.13  
Methods  Average  Std. Dev  
Clashes <1.9  COMBO  4.50  10.01 
PULCHRA  0.05  0.4  
CLOSC  0  0  
MCORE  0  0  
MCOREL  0  0  
Methods  Average  Std. Dev  
Clashes <3.6  COMBO  63.40  18.67 
PULCHRA  3.64  12.00  
CLOSC  0.00  0  
MCORE  1.09  3.69  
MCOREL  1.20  4.02 
Overall, it is found that MCORE produces models better in terms of RMSD and TMscore [19] versus the corresponding CLOSC models. Moreover, the MCORE models are only slightly worse than the averaged COMBO models, which is consistent with our initial problem statement. Note that the larger TMscore, the better the model. Moreover, if we discriminate clashes into the two CASP types, then it is more evident that our refined models are much better in terms of clashes as they do not have any atoms involved in clashes that are less than 1.9 Å, whereas COMBO models have 4.5% of the atoms involved in this regime of clashes.
We also investigated the average number of steps and time for the MCORE and MCOREL algorithms. It was observed that for MCORE the number of average Monte Carlo Steps is less than 110 (= 109.21) and the average running time is 1.88 minutes. Moreover, for MCOREL, the average running time is 20 minutes. Hence, MCORE can be applied to a wide variety of problems concerning the averaging of macromolecular structures due to its fast execution time.
Comparison with PULCHRA
Comparison of MCORE to COMBO and to PULCHRA in terms of RMSD and percentage of atoms involved in the clashes^{3}.
Combo Vs MCORE  

RMSD (Å)  Combo  3.28 
MCORE  3.36  
pvalue  0.008  
Clashes  Combo  63.00% 
MCORE  1.09%  
pvalue  < 2.22E16  
MCORE Vs PULCHRA  
RMSD (Å)  PULCHRA  3.35 
MCORE  3.36  
pvalue  0.017  
Clashes  PULCHRA  3.64% 
MCORE  1.00%  
pvalue  2.22E16 
Furthermore, MCORE algorithm is comparable to PULCHRA in terms of efficiency also, as on average the computation time is around a minute. One of the major advantages of our approach compared to PULCHRA is that if the input structure is heavily distorted, PULCHRA might fail to converge where MCORE will always converge.
Allatom Model Reconstruction
It is essential to have a model with physical bond lengths and bond angles if further analysis is to be performed on the model. Since structure prediction methods often produce Cαonly models, allatom models must be constructed from the Cα descriptions. In this regard, we built allatom representations of the MCORE refined Cα atom models. The initial backbone reconstruction method applied is the backbone reconstruction method of an algorithm proposed by Milik et al [20]. Once the backbone atoms are reconstructed, any sidechain packing methods [21, 22] can be utilized to build the sidechains. We performed the sidechain reconstruction using one of the most widely used sidechain packing algorithms SCWRL 3.0 [22]. The MCORE refined models for the set of 1364 proteins had an average allatom RMSD of 4.19 Å (which is, or course, higher than the value of 3.35 Å for the Cα models). The PULCHRA refined allatom models on the same dataset had a comparable average value of 4.17 Å.
Conclusion
In this paper, we presented MCORE, a MonteCarlo based algorithm for removing averaging artifacts of the averaged structure to improve the quality of the consensus structure. We verified the application of the proposed algorithm by applying the algorithm to refine the COMBO models of a set of 1364 proteins generated by TASSER algorithm, refining and correcting unphysical bond length and bond angles. On average, the RMSD to native of the refined model is 3.36 Å; where as RMSD of the COMBO model to the native is 3.28 Å, which is a mere 0.08 Å poorer than the RMSD of the COMBO model (averaged model). On the other hand, the average percentage of atoms involved in the clashes in the refined MCORE models is reduced from 63% (for the COMBO models) to only 1.0%. Moreover, slight RMSD gains were obtained by using a version of the MCORE algorithm that samples longer. However, the difference between MCOREL (the longer version) and MCORE (Table 3) is not statistically significant, emphasizing that our convergence criterion is robust.
We have also generated a framework for producing allatom models from the Cα only atom models by first reconstructing the backbone and then doing the sidechain reconstruction using existing methodologies. An obvious extension of the work is to refine not only Cα models, but also to apply MCORE to allatom models. In essence, the new refinement algorithm helps in attaining structures with more physical bond lengths and bond angles by overcoming averaging artifacts produced due to averaging of structures. It has to be noted that there is always a tradeoff between local geometric correctness and the deviation from the target structure. Generating averaged structures that are not heavily distorted can minimize this tradeoff. These results provide a genuine model for the subsequent analysis of the respective protein structure using molecular mechanics force field. In addition, this algorithm does not have convergence problems like PULCHRA (which sometimes fails to converge if the input models are heavily distorted). Although the algorithm was tested for TASSER models only, the presented approach is general and can be applied to remove averaging artifacts arising from averaging over any ensemble of molecular conformations.
Abbreviations
 RMSD:

Root Mean Square Deviation
 TASSER:

Threading/ASSembly/refinement
 CASP:

Critical Assessment of Techniques for Protein Structure Prediction
 NMR:

Nuclear Magnetic Resonance
 MCORE:

MOnte Carlo based Refinement
 COMBO model:

Centroid model
 CLOSC model:

Model which is closest to the centroid Model.
Declarations
Acknowledgements
Author would like to acknowledge Dr. Jeffrey Skolnick at Georgia Institute of Technology for guidance and providing the computational resources. Significant portion of the work was done when the author was at Georgia Institute of Technology. Furthermore, he also acknowledges Dr. Dennis R. Livesay at University of North Carolina at Charlotte for proofreading the paper and for helpful insights. Moreover, he would like to acknowledge Dr. Adrian Arakaki, Dr. Liliana Wroblewska, Dr. Hongyi Zhou and Dr. Shashi B. Pandit at Georgia Institute of Technology for stimulating discussions. Also, he would like to thank Dr. James Mottonen and Luis Carlos Gonzalez for proof reading.
Authors’ Affiliations
References
 Rotkiewicz P, Skolnick J: Fast procedure for reconstruction of fullatom protein models from reduced representations. Journal of Computational Chemsitry 2008, 29: 1460–1465. 10.1002/jcc.20906View ArticleGoogle Scholar
 Ding Y, Chan CY, Lawrence CE: RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. Rna 2005, 11: 1157–1166. 10.1261/rna.2500605PubMed CentralView ArticlePubMedGoogle Scholar
 Furnham N, de Bakker PI, Gore S, Burke DF, Blundell TL: Comparative modelling by restraintbased conformational sampling. BMC structural biology 2008, 8: 7. 10.1186/1472680787PubMed CentralView ArticlePubMedGoogle Scholar
 Zagrovic B, Snow CD, Khaliq S, Shirts MR, Pande VS: Nativelike mean structure in the unfolded ensemble of small proteins. Journal of molecular biology 2002, 323: 153–164. 10.1016/S00222836(02)008884View ArticlePubMedGoogle Scholar
 Huang ES, Samudrala R, Ponder JW: Distance geometry generates nativelike folds for small helical proteins using the consensus distances of predicted protein structures. Protein Sci 1998, 7: 1998–2003. 10.1002/pro.5560070916PubMed CentralView ArticlePubMedGoogle Scholar
 Zagrovic B, Pande VS: How does averaging affect protein structure comparison on the ensemble level? Biophys J 2004, 87: 2240–2246. 10.1529/biophysj.104.042184PubMed CentralView ArticlePubMedGoogle Scholar
 Murshudov GN, Vagin AA, Dodson EJ: Refinement of macromolecular structures by the maximumlikelihood method. Acta Crystallogr D Biol Crystallogr 1997, 53(pt3):240–255. 10.1107/S0907444996012255View ArticlePubMedGoogle Scholar
 Betancourt MR, Skolnick J: Finding the needle in a haystack: Educing native folds from ambiguous ab initio protein structure. Journal of Computational Chemistry 2001, 22: 339–353. Publisher Full Text 10.1002/1096987X(200102)22:3<339::AIDJCC1006>3.0.CO;2RView ArticleGoogle Scholar
 Zhang Y, Skolnick J: SPICKER: a clustering approach to identify nearnative protein folds. Journal of Computational Chemistry 2004, 25: 865–871. 10.1002/jcc.20011View ArticlePubMedGoogle Scholar
 Zhou H, Pandit SB, Lee SY, Borreguero J, Chen H, Wroblewska L, Skolnick J: Analysis of TASSERbased CASP7 protein structure prediction results. Proteins 2007, 69(Suppl 8):90–97. 10.1002/prot.21649View ArticlePubMedGoogle Scholar
 Zhang Y, Arakaki AK, Skolnick J: TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins 2005, 61(Suppl 7):91–98. 10.1002/prot.20724View ArticlePubMedGoogle Scholar
 Kolinski A, Bujnicki JM: Generalized protein structure prediction based on combination of foldrecognition with de novo folding and evaluation of models. Proteins 2005, 61(Suppl 7):84–90. 10.1002/prot.20723View ArticlePubMedGoogle Scholar
 Zhang Y, Devries ME, Skolnick J: Structure modeling of all identified G proteincoupled receptors in the human genome. PLoS Comput Biol 2006, 2: e13. 10.1371/journal.pcbi.0020013PubMed CentralView ArticlePubMedGoogle Scholar
 Gront D, Kmiecik S, Kolinski A: Backbone building from quadrilaterals: a fast and accurate algorithm for protein backbone reconstruction from alpha carbon coordinates. Journal of Computational Chemistry 2007, 28: 1593–1597. 10.1002/jcc.20624View ArticlePubMedGoogle Scholar
 Oldfield TJ, Hubbard RE: Analysis of C alpha geometry in protein structures. Proteins 1994, 18: 324–337. 10.1002/prot.340180404View ArticlePubMedGoogle Scholar
 James F: MINUIT Function Minimization and Error Analysis. CERN Program Library Long Writeup 1998., D506:Google Scholar
 Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equationofstate calculations by fast computing machines. Journal of Chemical Physics 1953, 21: 1087–1092. 10.1063/1.1699114View ArticleGoogle Scholar
 Reva BA, Finkelstein AV, Skolnick J: What is the probability of a chance prediction of a protein structure with an rmsd of 6 A? Fold Des 1998, 3: 141–147. 10.1016/S13590278(98)000194View ArticlePubMedGoogle Scholar
 Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins 2004, 57: 702–710. 10.1002/prot.20264View ArticlePubMedGoogle Scholar
 Milik M, Kolinski A, Skolnick J: Algorithm for rapid reconstruction of protein backbone from alpha carbon coordinates. Journal of Computational Chemistry 1997, 18: 80–85. Publisher Full Text 10.1002/(SICI)1096987X(19970115)18:1<80::AIDJCC8>3.0.CO;2WView ArticleGoogle Scholar
 Dukka Bahadur KC, Tomita E, Suzuki J, Akutsu T: Protein sidechain packing problem: a maximum edgeweight clique algorithmic approach. J Bioinform Comput Biol 2005, 3: 103–126. 10.1142/S0219720005000904View ArticlePubMedGoogle Scholar
 Canutescu AA, Shelenkov AA, Dunbrack RL Jr: A graphtheory algorithm for rapid protein sidechain prediction. Protein Sci 2003, 12: 2001–2014. 10.1110/ps.03154503PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.