- Research article
- Open Access
QMEANclust: estimation of protein model quality by combining a composite scoring function with structural density information
© Benkert et al; licensee BioMed Central Ltd. 2009
- Received: 21 October 2008
- Accepted: 20 May 2009
- Published: 20 May 2009
The selection of the most accurate protein model from a set of alternatives is a crucial step in protein structure prediction both in template-based and ab initio approaches. Scoring functions have been developed which can either return a quality estimate for a single model or derive a score from the information contained in the ensemble of models for a given sequence. Local structural features occurring more frequently in the ensemble have a greater probability of being correct. Within the context of the CASP experiment, these so called consensus methods have been shown to perform considerably better in selecting good candidate models, but tend to fail if the best models are far from the dominant structural cluster. In this paper we show that model selection can be improved if both approaches are combined by pre-filtering the models used during the calculation of the structural consensus.
Our recently published QMEAN composite scoring function has been improved by including an all-atom interaction potential term. The preliminary model ranking based on the new QMEAN score is used to select a subset of reliable models against which the structural consensus score is calculated. This scoring function, called QMEANclust, achieves a correlation coefficient between predicted quality score and GDT_TS of 0.9 averaged over the 98 CASP7 targets and performs significantly better in selecting good models from the ensemble of server models than any other group participating in the quality estimation category of CASP7. Both scoring functions are also benchmarked on the MOULDER test set consisting of 20 target proteins, each with 300 alternative models generated by MODELLER. QMEAN outperforms all other tested scoring functions operating on individual models, while the consensus method QMEANclust only works properly on decoy sets containing a certain fraction of near-native conformations. We also present a local version of QMEAN for the per-residue estimation of model quality (QMEANlocal) and compare it to a new local consensus-based approach.
Improved model selection is obtained by using a composite scoring function operating on single models in order to enrich higher quality models which are subsequently used to calculate the structural consensus. The performance of consensus-based methods such as QMEANclust highly depends on the composition and quality of the model ensemble to be analysed. Therefore, performance estimates for consensus methods based on large meta-datasets (e.g. CASP) might overrate their applicability in more realistic modelling situations with smaller sets of models based on individual methods.
- Model Ensemble
- Consensus Method
- Consensus Information
- CASP7 Target
- QMEAN Score
Generally, protein structure prediction consists of a conformational sampling step followed by a scoring step in which the best model is selected from the ensemble. The relative importance of the two steps depends on the modelling difficulty and the details of the specific method. In the conformational sampling step of ab initio structure prediction methods it is common practice to generate a vast number of models and to subsequently select the best candidates based on an energy function [1, 2]. Until several years ago, usually only a few, if any, alternative models were generated in comparative modelling, and the quality of the prediction was rarely better than the best template. However, in recent years there has been a clear trend in the field to generate a variety of models based on different template structures (or combinations thereof) and/or alternative alignments and to select the best candidate based on the estimated quality of the resulting models [3–10]. In order to cope with the uncertainties in modelling, early decision making, such as choosing the best template or alignment, can be postponed and performed at a later stage in the modelling pipeline based on the quality of the resulting structural model. For this last step, scoring functions for selecting the highest quality model among alternatives are of crucial importance.
These scoring functions fall into one of two categories, namely consensus or clustering methods which rely on the analysis of the structural density in the ensemble of models and approaches being able to estimate the quality of a single model without relying on consensus information. The basic idea of consensus-based methods is that conformations predicted more frequently are more likely to be correct than structural patterns occurring in only a few models [11–15]. The second category includes methods taking into account evolutionary information [16–18], stereochemical plausibility of the models [19, 20] and the environment compatibility of their residues  as well as energy-based methods which include physics-based energy functions [22, 23] and knowledge-based statistical potentials [24–29]. Composite scoring functions analysing multiple structural features have been introduced and shown to perform better than any single term [30–35].
Quality estimation can be performed on different dimensions: relative vs. absolute and global vs. local. The estimation of the relative quality of a model compared to a set of alternatives is, as mentioned above, a fundamental step in protein structure prediction and also in optimisation techniques (i.e. refinement). On the other hand, the estimation of the absolute quality of a model is of tremendous importance for the biological community since it is the quality of the model which dictates its biological applicability (e.g. for mutagenesis studies, virtual screening and molecular replacement) [36–38]. Traditionally, scoring functions have been assessed with regard to their ability to rank models by quality, while the estimation of absolute values of model quality has been only marginally addressed in the literature. Besides the global quality, local error estimation on a per residue basis has become an active field of research [17, 39]. Although the accuracy of local predictions is limited, these methods may be very valuable for biologists by helping them to discriminate between reliable and unreliable regions in the model.
Model quality assessment programs were evaluated for the first time in a community-wide experiment in 2004 as part of the Critical Assessment of Fully Automated Structure Prediction (CAFASP)  and most recently at CASP7 [41, 42]. The assessment of the predictions submitted to the quality assessment category of CASP7 clearly indicates that consensus-based methods such as Pcons  outperform current scoring functions operating on single models. On the other hand, methods relying solely on structural density information have inherent limitations. First, they are not able to provide an estimate of the absolute quality of a single model or to rank just a small set of models. Second, these methods tend to fail when the highest quality candidates are far away from the dominant structural cluster of the ensemble. Outstanding predictions which are far removed from the bulk of the remaining models are hardly recognised [43, 44], and, in the case of hard free modelling targets, the ensemble often does not contain any meaningful density information at all. The approach pursued by Lee and co-workers  for the quality assessment category of CASP7 was also quite successful. This group produced quite accurate models for the template-based modelling category  and defined the quality of all other models as the relative distance to their own models.
Based on these findings, we present in this paper a new approach to model quality estimation which combines different aspects of the approaches described above while simultaneously minimising their weaknesses. We use an optimised version of our recently published composite scoring function QMEAN  in order to define an ensemble of reference models which is used to calculate the structural consensus score. This method, called QMEANclust, combines a scoring function able to assess single models and perform an initial ranking with the strengths of using structural density information. Due to the pre-selection step, QMEANclust represents a compromise between the rigorous clustering strategy of Pcons (comparison to all models) and the strategy to define quality by comparison to a "best reference model". Based on the model ranking of QMEANclust, it is investigated whether using the ensemble of models for a given target sequence to retrieve target-specific statistical potentials  can lead to a further performance improvement (selfQMEAN).
The paper is structured as follows: First we describe the optimised QMEAN scoring function. We demonstrate that the inclusion of an all-atom interaction term in addition to the residue-level term improves the performance both with respect to correlation between predicted model score and degree of nativeness and in the task of selecting the best model. Then we compare different strategies to combine QMEAN with structural density information resulting in two versions of QMEANclust as well as in the selfQMEAN scoring function. We show that QMEANclust is indeed able to counteract the inherent limitations of purely consensus-based methods. All three scoring functions are compared to state-of-the-art methods on the basis of two comprehensive test sets. Finally, local versions of the three scoring functions for the per-residue error estimation are presented and the performance is compared to a recently published method.
QMEAN: Composite scoring functions for the evaluation of single models
Short description of the terms and their combinations used in this work.
torsion: Extended torsion potential over 3 consecutive residues. Bin sizes: 45 degrees for the centre residue, 90 degrees for the 2 adjacent residues
pair residue: Residue-level, secondary structure specific interaction potential using Cβ atoms as interaction centres. Range 3...25 Å, step size: 1 Å
solvation: Potential reflecting the propensity of a certain amino acid for a certain degree of solvent exposure based on the number of Cβ atoms within a sphere of 9 Å around the centre Cβ.
pair all-atom: All-atom, secondary structure specific interaction potential using all 167 atom types. Range 3...20 Å, step size: 0.5 Å
SSE: Agreement between the predicted secondary structure of the target sequence (using PSIPRED) and the calculated secondary structure of the model (using DSSP).
ACC: Agreement between the predicted relative solvent accessibility using ACCpro (buried/exposed) and the relative solvent accessibility derived from DSSP (> 25% accessibility => exposed)
QMEAN3: linear combination of torsion, pair residue, solvation
QMEAN4: linear combination of torsion, pair residue, solvation, pair all-atom
QMEAN5: linear combination of torsion, pair residue, solvation, SSE, ACC
QMEAN6: linear combination of torsion, pair residue, solvation, pair all-atom, SSE, ACC
Comparison between QMEAN, various QMEANclust implementations and selfQMEAN on all CASP7 server models.
QMEAN3 * fraction modelled
QMEAN4 * fraction modelled
QMEAN5 * fraction modelled
QMEAN6 * fraction modelled
QMEANclust: no preselection
Mean (~3D-jury based on GDT_TS)
QMEANclust: QMEAN Z-score > x
Median: Z-score > -1
Mean: Z-score > -1
Weighted mean: Z-score > -1
Median: Z-score > 0
Mean: Z-score > 0
Weighted mean: Z-score > 0
Median: Z-score > 0.5
Mean: Z-score > 0.5
Weighted mean: Z-score > 0.5
QMEANclust: top x percent models
Median: 20% TBM, 20% FM
Median: 10% TBM, 10% FM
Median: 5% TBM, 5% FM
Median: 10% TBM, 20% FM
Median: 20% TBM, 10% FM
QMEANclust: ΔQMEAN-score from max
Median: Δ < 0.05 Å TBM, Δ < 0.05 Å FM
Median: Δ < 0.1 Å TBM, Δ < 0.1 Å FM
Median: Δ < 0.05 Å TBM, Δ < 0.1 Å FM
Median: Δ < 0.1 Å TBM, Δ < 0.05 Å FM
Linear combination of 5 terms (w/o all-atom)
Sum of Z-scores (5 terms)
Sum of Z-scores (6 terms)
For each QMEAN version, the performance of an alternative implementation which penalises incomplete models by multiplying the score by the fraction of modelled residues is given as well. Taking into account the coverage of the models with respect to the target sequence considerably improves the correlation to the GDT_TS score  by penalising incomplete models with otherwise good stereochemistry. This performance increase in estimating the relative model quality can be attributed to the fact that the GDT_TS score, traditionally used in the assessment of CASP, is by definition dependent on model completeness. Table 2 underlines that a large increase in performance can be obtained by including predicted secondary structure and solvent accessibility agreement terms as shown previously (QMEAN3 vs. QMEAN5 and QMEAN4 vs. QMEAN6). The integration of an all-atom term (QMEAN5 vs. QMEAN6 in Table 2) further improves the correlation between predicted quality of the model and its similarity to the native structure. More importantly, the all-atom term increases the ability of the scoring function to select good models. This is reflected by the significantly higher (p-value = 0.03 in a paired t-test) total GDT_TS score of the best models selected by QMEAN6 of 56.70 compared to 55.32 for QMEAN5.
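The two operations discussed above, combining terms linearly and penalising incomplete models, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the optional intercept and any weights passed in are assumptions, since the fitted weighting factors are not given in this text.

```python
from typing import Dict


def composite_score(terms: Dict[str, float],
                    weights: Dict[str, float],
                    intercept: float = 0.0) -> float:
    """Linear combination of scoring terms (weights are hypothetical inputs)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"terms missing for weights: {sorted(missing)}")
    return intercept + sum(weights[name] * terms[name] for name in weights)


def coverage_penalised(score: float, n_modelled: int, n_target: int) -> float:
    """Multiply a score by the fraction of target residues present in the model."""
    if n_target <= 0 or not 0 <= n_modelled <= n_target:
        raise ValueError("need 0 <= n_modelled <= n_target and n_target > 0")
    return score * (n_modelled / n_target)
```

A model covering half of the target sequence thus keeps only half of its raw score, which mirrors the dependence of GDT_TS on model completeness.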
Comparison of the best QMEAN versions with other methods participating in CASP7.
Random model selection
Best model per target
A further improvement may be achieved by using more specialised QMEAN versions for different modelling situations, such as QMEAN with the all-atom term for template-based targets and without it for free modelling targets. First results suggest that the overall effect is only marginal and that the QMEAN version including the all-atom term leads to a better performance over the whole difficulty range. Using one scoring function for all modelling situations is not ideal, as highlighted recently by Kihara and co-workers . They showed that for a threading scoring function consisting of two terms, different weighting factor combinations are optimal for different protein families. Therefore, training weighting factors specifically for proteins of similar size and amino acid or secondary structure composition may improve the performance, especially in the prediction of absolute values of model quality . Optimising weighting factors in composite scoring functions based on a linear combination of terms is complicated by the fact that the different terms depend on the protein size, which influences the ability of the combined scoring function to predict the absolute quality.
QMEANclust: including structural density of the model ensemble
In this section we describe a new method, termed QMEANclust, which combines the QMEAN scoring function with structural density information derived from the ensemble of models. In the straightforward implementation of methods based on structural density information, the score for a given model is calculated as its average (or median) distance to all other models in the ensemble. Different similarity measures are used for building the distance matrix: e.g. MaxSub  in 3Djury , LGscore  in Pcons  and TMscore  in the consensus method described in MODfold . In this work, the GDT_TS score , a well established similarity measure in the CASP assessment, is used. In all the above mentioned implementations, the single models are equally weighted in the calculation of the final score, no matter how good or bad a model is. In 3Djury only model pairs above a certain distance cut-off are considered in the calculation.
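The straightforward consensus score described above can be sketched as follows, assuming a precomputed, symmetric matrix of pairwise similarities (for example GDT_TS, where higher means more similar). The function name and the matrix layout are illustrative assumptions.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def consensus_scores(sim: Sequence[Sequence[float]],
                     reduce: Callable[[List[float]], float] = mean) -> List[float]:
    """Score each model by its mean (or median) similarity to all other models.

    Every model counts equally, regardless of how good or bad it is.
    """
    n = len(sim)
    scores = []
    for i in range(n):
        others = [sim[i][j] for j in range(n) if j != i]
        scores.append(reduce(others) if others else 0.0)
    return scores
```

Passing `reduce=median` gives the median variant; the self-comparison on the diagonal is always excluded.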
Clustering methods tend to fail when the top models are far away from the most prominent structural cluster or when there is no structural redundancy present in the ensemble that can be captured. Especially for difficult, template-free modelling targets the best models are usually not the most frequent conformations in the ensemble (at least not in the CASP decoy sets). In order to cope with the limitations of current clustering approaches, we investigated two strategies for the combination of the QMEAN composite scoring function and structural density information from the ensemble. In the first approach, QMEAN is used to select a subset of higher quality models against which the subsequent distance calculations are performed. The final score for a given model is defined as the median distance of this model to all models in the subset (strategy denoted as median in Table 2). An implementation based on the mean instead of the median GDT_TS is also investigated. In the second approach, the models are weighted according to their QMEAN score (denoted weighted mean): when deriving the distance matrix, the distance of a given model to more reliable models (i.e. to models having better QMEAN scores) is weighted more strongly, which in turn reduces the influence of random models on the calculation.
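A minimal sketch of the two strategies, again assuming a precomputed pairwise similarity matrix. The exact weighting scheme is not specified in this text, so using the QMEAN scores directly as non-negative weights is an assumption, as are the function names.

```python
from statistics import median
from typing import List, Sequence


def subset_median_scores(sim: Sequence[Sequence[float]],
                         reference_idx: Sequence[int]) -> List[float]:
    """Median similarity of every model to a pre-selected reference subset."""
    scores = []
    for i in range(len(sim)):
        values = [sim[i][j] for j in reference_idx if j != i]
        scores.append(median(values) if values else 0.0)
    return scores


def weighted_mean_scores(sim: Sequence[Sequence[float]],
                         qmean: Sequence[float]) -> List[float]:
    """Mean similarity to all other models, weighted by their QMEAN scores.

    Assumption: weights are the raw QMEAN scores and are non-negative.
    """
    scores = []
    for i in range(len(sim)):
        pairs = [(qmean[j], sim[i][j]) for j in range(len(sim)) if j != i]
        total_weight = sum(w for w, _ in pairs)
        if total_weight == 0:
            scores.append(0.0)
        else:
            scores.append(sum(w * s for w, s in pairs) / total_weight)
    return scores
```

In the subset variant, models outside the reference set are still scored, but only against the references; in the weighted variant, a model with a QMEAN score of zero has no influence on the scores of the others.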
Different strategies and cut-offs for model selection have been investigated. A benchmark of several alternative implementations on the CASP7 test set can be found in Table 2. In comparison to the performance of QMEAN, considerably higher correlation coefficients are obtained for all QMEANclust versions (r = 0.752 for QMEAN vs. r = 0.892 for QMEANclust).
If the whole ensemble of models is used in the derivation of the distance matrix (no pre-selection), the weighted mean performs comparably to or better than the mean or median, both in terms of correlation between predicted and observed model quality and the ability to identify good models. If only a subset of high-quality models is used in the calculation of the distance matrix, a score based on the distance median produces the best results and is used in the final version. Three different approaches have been investigated in order to select a subset of models based on QMEAN: (1) selection based on Z-scores, which are calculated by subtracting the mean QMEAN score of the ensemble from the QMEAN score of each model and dividing the result by the standard deviation of the ensemble, (2) selection of a certain percentage of top ranking models, and (3) a strategy in which only models with a QMEAN score similar to that of the top ranked model are used, in order to cope with qualitatively outstanding predictions.
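The three pre-selection strategies can be sketched as index selectors over a list of QMEAN scores. Details such as the use of the population standard deviation and the rounding of the subset size are assumptions; the text does not specify them.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(scores: Sequence[float]) -> List[float]:
    """(score - ensemble mean) / ensemble standard deviation."""
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sd for s in scores]


def select_by_z(scores: Sequence[float], threshold: float) -> List[int]:
    """(1) Models whose Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(scores)) if z > threshold]


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """(2) A fixed fraction of top ranking models (at least one model)."""
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def select_near_max(scores: Sequence[float], delta: float) -> List[int]:
    """(3) Models scoring within delta of the top ranked model."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if best - s < delta]
```

Strategy (3) always retains the top ranked model itself, which is what allows an outstanding prediction to define its own reference set.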
A combination of both pre-selection of models based on QMEAN and weighting the distances according to QMEAN in the subsequent clustering calculations is not useful as shown for the selection based on Z-scores. Z-scores have been calculated based on the model QMEAN score and only models above a given Z-score threshold are used for the clustering process. Table 2 shows that, with increasing Z-score threshold (i.e. fewer models from the ensemble are used in distance calculations), the ability of the weighted mean strategy to select good models gradually decreases, whereas the performance of the median strategy increases (until Z-score > 0). Using the median rather than the mean reduces the influence of outliers in smaller data sets. For the other two selection strategies, only median is shown, i.e. the final QMEANclust score of a model is the median distance of this model to all other models in the subset selected by the given strategy.
Model selection based on Z-scores has several disadvantages: the number of models selected using a given Z-score cut-off is highly dependent on the modelling difficulty. For an easy template-based modelling target, the models in the ensemble tend to be very similar and there are no models with high Z-scores (e.g. for some targets there are no models with a Z-score greater than 1). On the other hand, for free modelling targets there are sometimes outstanding predictions compared to the bulk of more or less random models. Capturing these predictions in the selection step is the only way to circumvent the inherent limitations of consensus-based methods. Furthermore, different selection cut-offs may be needed for template-based modelling targets (TBM) and free modelling targets (FM), since the former contain much more structural redundancy which can be captured by clustering methods, and more models can potentially be used in the calculation of the distance matrix.
In the fourth section of Table 2, the results of a selection strategy based on a fixed percentage of top scoring models are shown. A total GDT_TS of 57.97 is achieved by using the top 20% models for TBM targets and top 10% for FM targets. Discrimination between TBM and FM targets is done based on mean QMEAN score by assigning targets with a model averaged QMEAN score above 0.4 to the template-based modelling category. This cut-off has been derived empirically by comparing the score distributions of FM and TBM targets (data not shown). The better performance of the approach, which uses a more tolerant model selection for TBM targets, may be attributed to the fact that the model ensemble of TBM targets contains more useful consensus information. In the case of FM targets, QMEAN is often able to identify some of the better models which are subsequently used in the consensus calculation.
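The category-dependent selection described above can be sketched as follows. The 0.4 cut-off and the 20%/10% fractions come from the text; the function names and the rounding of the subset size are assumptions.

```python
from statistics import mean
from typing import List, Sequence


def classify_target(qmean_scores: Sequence[float], cutoff: float = 0.4) -> str:
    """Assign a target to TBM if its mean QMEAN score exceeds the cut-off."""
    return "TBM" if mean(qmean_scores) > cutoff else "FM"


def reference_subset(qmean_scores: Sequence[float],
                     tbm_fraction: float = 0.20,
                     fm_fraction: float = 0.10) -> List[int]:
    """Indices of the top ranking models, using a category-specific fraction."""
    category = classify_target(qmean_scores)
    fraction = tbm_fraction if category == "TBM" else fm_fraction
    k = max(1, round(fraction * len(qmean_scores)))
    order = sorted(range(len(qmean_scores)),
                   key=lambda i: qmean_scores[i], reverse=True)
    return sorted(order[:k])
```

The returned indices would then serve as the reference set for the median-based consensus calculation.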
Alternatively, a simple selection strategy aiming at capturing outstanding predictions has been investigated (fifth section of Table 2). Only models with a QMEAN score similar to that of the highest scoring model are considered for the distance calculation. A selection of models within 0.05 QMEAN units from the maximum for TBM targets and 0.1 units for FM targets results in a total GDT_TS of 58.11. Since the TBM models are structurally more homogeneous, more models are selected for TBM targets than for FM targets using these thresholds. For the subsequent comparison to other methods, the best versions of QMEAN, QMEANclust and selfQMEAN (see below) are used. The corresponding values are underlined in Table 2.
At CASP7, none of the quality assessment programs (clustering and non-clustering methods) was able to select better models out of the ensemble of server models than the Zhang server  submitted for each target [35, 41, 44]. The best QMEANclust implementation shows a better model selection performance than TASSER-QA  and a naive scoring function that simply takes the Zhang server models (total GDT_TS of 58.11 vs. 57.35). The difference is statistically significant at the 95% confidence level based on a paired t-test. Figure 1 underlines that QMEANclust and the single model scoring function QMEAN show a statistically significantly better (p = 1.9 × 10⁻⁵ and p = 0.009, respectively) selection performance than Pcons, the best performing clustering-based method at CASP7. In terms of correlation between predicted model quality and degree of nativeness, QMEANclust has significantly higher Pearson's (0.892 vs. 0.828 of TASSER-QA) and Spearman's (0.841 vs. 0.785) correlation coefficients than TASSER-QA and any other tested scoring function.
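The two evaluation criteria used throughout, correlation with GDT_TS and the quality of the top ranked model, can be sketched as below. This is an illustration only: the summed GDT_TS of the selected models is returned without any rescaling, because the scaling of the reported totals is not stated in this text, and the function names are assumptions.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def evaluate(targets: List[Tuple[Sequence[float], Sequence[float]]]
             ) -> Tuple[float, float]:
    """Return (mean per-target Pearson r, summed GDT_TS of the top ranked models).

    Each target is a pair (predicted scores, observed GDT_TS) over its models.
    """
    correlations = [pearson(pred, gdt) for pred, gdt in targets]
    picked = [gdt[max(range(len(pred)), key=lambda i: pred[i])]
              for pred, gdt in targets]
    return mean(correlations), sum(picked)
```

The two numbers measure different things: a scoring function can correlate well with GDT_TS over the whole ensemble and still miss the best model at the top of its ranking.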
Although the ability of QMEANclust to pick the best model is better than that of a naive predictor that simply picks Zhang models, it can still potentially be improved. The weighting factors for the QMEAN scoring function used for model prioritisation have been optimised for regression and not for selecting the best model. Qiu et al.  recently described an approach in which a composite scoring function has been optimised for model selection using support vector machines. Most current scoring functions ignore a trivial parameter for the estimation of model quality: the presence and closeness of a structural template which can be used to build the model . Zhou and Skolnick  recently described a scoring function in which the extent to which a model is covered by fragments of templates identified by threading is used as a quality measure. QMEAN could benefit from such a term, which represents information orthogonal to the present implementation.
selfQMEAN: use of statistical potential terms derived from model ensemble
The idea of using the ensemble of models for a given target as a basis to derive target-specific statistical potential terms has previously been investigated . In their work, Wang et al. generated a decoy-dependent implementation of the RAPDF interaction potential  by deriving the distance frequencies from the models in the decoy set and weighting each count according to the RAPDF score of the model. This decoy-dependent statistical potential performed better than the original RAPDF scoring function but not as well as a simple density score based on the average RMSD of a model to all others. Here we followed a similar strategy, with two differences: a combined scoring function based on multiple statistical potentials is used, and the models contributing to the selfQMEAN score are weighted by an improved density scoring function (QMEANclust) (see Methods). As can be seen from Table 2, while selfQMEAN generates considerably higher correlation coefficients than QMEAN, the ability to select good models does not improve. The decoy-dependent scoring function does not perform better than QMEANclust, which is based on structural density information alone. Building a composite scoring function based on target-specific potentials is problematic since the weighting factors are highly dependent on the modelling difficulty: ensembles containing many very similar models, e.g. in high accuracy template-based modelling, result in much lower absolute energies in the statistical potential terms than sets of diverse models. We tried to circumvent the problem by simply adding the energy Z-scores of each term. These results suggest that the level of detail captured by target-specific scoring functions decreases compared to the direct derivation of structural differences based on consensus methods.
The structural density information seems to be captured more precisely when directly derived from the distance matrices, without taking the detour via ensemble-specific statistical potentials. These methods are also unable to overcome the limitation of purely consensus-based methods, namely that the result is determined by the most dominant structural cluster.
Comparison of QMEANclust with 3Djury-like consensus method
Target T0354 represents an example in which QMEANclust failed to improve over a purely clustering-based approach. This can be attributed to inconsistencies in the QMEAN ranking, in which a set of similar but very poor models was ranked too high. For this target the best model selection would actually have been obtained by QMEAN (as denoted by the arrow on the right).
MOULDER test set: Performance in a realistic modelling situation
As the QMEAN scoring function has been optimised on CASP6 models and tested on CASP7 models, one might raise the argument that it tends to be over-trained for this special situation and also to the GDT_TS score used there. Therefore we analysed the performance of QMEAN on the MOULDER test set which represents a more realistic modelling situation. The MOULDER test set consists of 20 different targets, each with 300 alternative models generated by MODELLER .
Performance comparison of QMEAN to other single model scoring functions based on the MOULDER test set.
Mean ΔRMSD [Å]
Std. dev. [Å]
pairwise Cbeta, SSE
pairwise all-atom, SSE
Comparison between QMEAN and QMEANclust in the task of selecting near native models on the MOULDER test set.
median RMSD [Å]
# < 5Å
QMEANlocal: local quality estimation
Comparison of consensus and non-consensus based methods in the estimation of the local model quality.
The per-residue predictions based on QMEAN, QMEANclust and selfQMEAN are compared to the recently published ProQres scoring function (non-consensus method). In ProQres a neural network is used to combine several local descriptors . Recently, Fasnacht et al.  published a local composite scoring function based on different terms combined by support vector machines resulting in a slightly better performance. The SVM approach, as well as ProQres, have been shown to outperform classical scoring functions such as Verify3D  and ProsaII . A direct comparison to these methods is therefore not necessary and a rigorous benchmark against other local quality estimation methods is beyond the scope of this work. Rather, the general performance differences of non-clustering, clustering and "self-clustering" methods should be highlighted and discussed here.
The QMEANlocal composite scoring function described here consists of a linear combination of 8 structural descriptors. The local scores are calculated over a sliding window of 9 residues which resulted in the best performance compared to alternative window sizes (data not shown). In analogy to the global QMEAN version, 4 statistical potential terms are combined with 2 terms describing the local agreement between predicted and measured secondary structure and solvent accessibility. Additionally, two trivial descriptors are used: the average solvent accessibility and the fraction of residues in the segment with no defined secondary structure. The weighting factors have been optimised on the models submitted to CASP6 with the Cα distance as target function (see Methods for details).
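The sliding-window step can be sketched as a centred moving average over per-residue values. How the window is treated at the chain termini is not stated in the text; truncating it there is an assumption, as is the function name.

```python
from typing import List, Sequence


def window_average(values: Sequence[float], window: int = 9) -> List[float]:
    """Average per-residue values over a centred window of odd size.

    Assumption: at the termini the window is truncated to the residues available.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed
```

With the default window of 9, each residue is described by itself and up to four neighbours on either side.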
QMEANlocal estimates the local quality using only the model, whereas the following two approaches consider the ensemble of models. We investigated two different approaches for local quality estimation relying on the structural density information contained in the ensemble of models (QMEANclust_local, selfQMEANlocal).
In the local consensus approach the Cα deviations among the equivalent positions in the models after a sequence-dependent superposition with the program TMscore  are analysed in order to derive a quality score. In analogy to the global QMEANclust score, either a subset of all models is used in the distance calculation and the median distance is retrieved, or a weighted mean distance according to the global model quality score is calculated. In this way, segments of more reliable models have a stronger influence on the predicted local score. The model ranking based on QMEANclust is used for model selection and weighting. A weighting according to QMEAN has been also investigated but resulted in a worse performance (data not shown). The statistical potential terms in selfQMEANlocal are trained on the best ranking models of the ensemble. The remaining terms are identical to those in QMEANlocal and the weighting factors are derived using the CASP6 data set.
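A minimal sketch of the subset-median variant of the local consensus score. It assumes that the models have already been superposed by an external tool and that Cα coordinates are indexed by target residue, with `None` marking residues absent from a model; these conventions and the function name are assumptions.

```python
import math
from statistics import median
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def local_consensus(models: Sequence[Sequence[Optional[Coord]]],
                    reference_idx: Sequence[int],
                    model_idx: int) -> List[Optional[float]]:
    """Per-residue median C-alpha distance of one model to the reference models.

    Returns None for residues missing in the model or in all references.
    """
    result: List[Optional[float]] = []
    for r, xyz in enumerate(models[model_idx]):
        if xyz is None:
            result.append(None)
            continue
        distances = [math.dist(xyz, models[j][r])
                     for j in reference_idx
                     if j != model_idx and models[j][r] is not None]
        result.append(median(distances) if distances else None)
    return result
```

Small values indicate residues that agree with the reference models; large values flag segments that deviate from the structural consensus.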
The last two columns in Table 6 show an analysis of the lowest and highest scoring 10% residues per target according to the corresponding quality score. QMEANlocal shows the best performance in recognising reliable regions as reflected by the best average Cα distance of the lowest scoring 10% residues. As is the case with possibly any other scoring function analysing single models (i.e. based on statistical potential terms), QMEANlocal is not able to distinguish regions with high and very high deviation from native. If the model ensemble contains structural redundancy which can be captured by consensus based methods, the local version of QMEANclust is very effective in identifying regions in models which deviate from the structural consensus and regions which are potentially correct. For template-based modelling, correlation coefficients between predicted and calculated local deviation from native were observed as high as 0.95 over the residues of the model ensemble of some CASP7 targets. For the analysis of single models or in the case when the ensemble does not contain useful density information, composite scoring functions such as QMEANlocal may be used. Depending on the modelling situation either one or the other approach may be used to identify incorrect regions in the model which can be subjected to local conformational resampling in a model refinement protocol.
The quality measures described so far all rely on the entire set of residues of all models per target (or over all targets for ROCall) and describe the general agreement of predicted and measured local model quality. They do not explicitly analyse whether a method is able to estimate the reliability of different regions within a model. Therefore we also analysed for each model the degree of correspondence between predicted and observed local deviation using Kendall's tau rank correlation coefficient. Table 6 reports Kendall's tau averaged over all models per target. The performance of selfQMEANlocal lies between non-clustering and clustering methods.
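The per-model rank correlation can be sketched as follows. The text does not say which tie-handling variant of Kendall's tau was used; the simple tau-a form below, in which tied pairs count as neither concordant nor discordant, is an assumption.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    balance = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                balance += 1
            elif product < 0:
                balance -= 1
    return balance / (n * (n - 1) / 2)


def mean_tau(models: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average tau over models; each entry is (predicted, observed) per residue."""
    return mean(kendall_tau(pred, obs) for pred, obs in models)
```

Because tau is computed within each model before averaging, it measures whether a method ranks the regions of a single model correctly, independent of how that model compares with others.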
A ROC curve analysis of the terms contributing to QMEANlocal suggests that the performance is strongly carried by trivial features such as solvent accessibility and secondary structure composition (data not shown). Two analogous terms are used both in ProQres and in the SVM approach of Fasnacht et al. The performance differences can therefore be partly explained by improved statistical potential terms. The QMEANlocal version presented in this work is only a starting point, and a more elaborate approach for combining the terms (e.g. SVMs or neural networks) is needed. Nevertheless, the linear combination of terms used in QMEANlocal performs considerably better than the neural network-based ProQres.
The QMEANclust scoring function described in this work combines the QMEAN composite scoring function which operates on single models with structural density information contained in a model ensemble. We showed that this approach is able to circumvent to some extent the inherent limitations of consensus methods which tend to fail if the best models are not part of the most prominent structural cluster. A statistically significant improvement over other methods relying on structural density information alone is obtained by selecting a subset of models based on the QMEAN score and calculating structural density only with respect to this subset.
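The idea of pre-filtering can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: `similarity` is assumed to be a symmetric matrix of pairwise model similarities (e.g. GDT_TS-like values), a higher QMEAN score is assumed to indicate a better model, and both the `keep_fraction` parameter and the use of the mean as aggregate are hypothetical choices.

```python
import math
from statistics import mean


def consensus_scores(qmean_scores, similarity, keep_fraction=0.5):
    """Score each model by its average similarity to a QMEAN-selected subset.

    qmean_scores  -- one single-model score per model (higher = better, assumed)
    similarity    -- n x n matrix of pairwise model similarities (assumed input)
    keep_fraction -- share of top-ranked models used as reference (hypothetical)
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n = len(qmean_scores)
    n_keep = max(1, math.ceil(n * keep_fraction))
    # Preliminary ranking by the single-model score.
    ranked = sorted(range(n), key=lambda i: qmean_scores[i], reverse=True)
    reference = ranked[:n_keep]
    scores = []
    for i in range(n):
        # Structural density is measured against the reference subset only.
        values = [similarity[i][j] for j in reference if j != i]
        scores.append(mean(values) if values else 0.0)
    return scores
```

With `keep_fraction=1.0` the sketch reduces to a plain consensus score over the whole ensemble, so a large cluster of mutually similar but poor models can dominate; restricting the reference set to the top-ranked models reduces their influence.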
The QMEAN scoring function outperforms all non-consensus methods participating in CASP7, both in terms of correlation to GDT_TS and in the task of selecting the best model. The results on the MOULDER test set show that QMEAN has not been specifically optimised for the context of CASP but represents a valuable tool for model selection on more realistic data sets. Compared to the original QMEAN version, an all-atom term has been added to the composite scoring function, increasing its ability to select good models, especially in the template-based modelling category. Combining the terms with a more advanced machine learning algorithm may further improve its performance as model selector for QMEANclust.
At CASP7, consensus-based methods have been shown to be superior to methods acting on single models. Nevertheless, none of the participating scoring functions was able at that time to select better models than those submitted by the best server (Zhang). The QMEANclust scoring function presented in this work performs significantly better than a naive scoring function always picking Zhang models. The high correlation coefficients obtained for the global and local versions make QMEANclust a good candidate for a refinement protocol. It may be used to enrich the ensemble with good models and to reliably identify deviating regions which then can be subjected to local conformational re-sampling and refinement in a similar way as recently described by the Baker group.
The outstanding performance of consensus methods over scoring functions operating on single models at CASP is not observed on the MOULDER test set. The performance of QMEANclust on the more realistic modelling test set highly depends on the composition of the ensemble of models to be analysed. For decoy sets containing many near-native conformations, the performance of the two scoring functions is similar. However, consensus methods will fail on decoy sets which include only few near-native protein conformations and do not contain useful consensus information. Performance estimates of consensus methods based on large meta-datasets (e.g. CASP) might overrate their applicability in more realistic modelling situations, and further research is required to investigate the influence of the ensemble composition and the methods used to generate these models.
QMEAN and QMEANlocal
The scoring function used in this work for the quality estimation of single models is an extension of the recently published QMEAN composite scoring function consisting of the following five terms: a secondary structure-specific distance-dependent pairwise residue-level potential, a torsion angle potential over three consecutive amino acids, a Cβ solvation potential as well as two terms describing the agreement between predicted and calculated secondary structure and solvent accessibility. See Table 1 for a short description of all terms contributing to QMEAN. Further details about the implementation of the different terms can be found in the original paper.
The new QMEAN version used in this work additionally contains an all-atom interaction potential term in order to be able to capture more details of the models being assessed. The interaction potential is based on all 167 different atom types occurring in proteins and covers distances from 3 to 20 Å (bin size 0.5 Å). It follows the same secondary structure-specific implementation as the residue-level potential. Different lower and upper distance cut-offs have been investigated, but these resulted in worse performance on the CASP6 training data set (data not shown).
Optimisation of the weighting factors for the QMEAN composite scoring was performed on the CASP6 training set by using the linear regression module of the R package with the GDT_TS score as target function.
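$$\mathrm{QMEAN} = W_{\mathrm{torsion}} E_{\mathrm{torsion}} + W_{\mathrm{solvation}} E_{\mathrm{solvation}} + W_{\mathrm{pair,\,residue}} E_{\mathrm{pair,\,residue}} + W_{\mathrm{pair,\,all\text{-}atom}} E_{\mathrm{pair,\,all\text{-}atom}} + W_{\mathrm{SSE\,agreement}} S_{\mathrm{SSE\,agreement}} + W_{\mathrm{ACC\,agreement}} S_{\mathrm{ACC\,agreement}} + \mathrm{intercept}$$
$W_{\mathrm{torsion}} = -0.00185$, $W_{\mathrm{solvation}} = -0.00054$, $W_{\mathrm{pair,\,residue}} = -0.00062$, $W_{\mathrm{pair,\,all\text{-}atom}} = -0.00108$, $W_{\mathrm{SSE\,agreement}} = 0.38072$, $W_{\mathrm{ACC\,agreement}} = 0.57997$, $\mathrm{intercept} = -0.28663$.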
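The linear combination can be written down directly from the weighting factors given above. The sketch below is only an illustration: the dictionary keys are hypothetical names for the six terms, and the scale of the input values (energies and agreement scores) is assumed to match that of the terms used in the regression.

```python
# Weighting factors as reported in the text; the key names are illustrative.
QMEAN_WEIGHTS = {
    "torsion": -0.00185,
    "solvation": -0.00054,
    "pair_residue": -0.00062,
    "pair_all_atom": -0.00108,
    "sse_agreement": 0.38072,
    "acc_agreement": 0.57997,
}
QMEAN_INTERCEPT = -0.28663


def qmean_score(terms):
    """Weighted sum of the six term values plus the intercept."""
    missing = set(QMEAN_WEIGHTS) - set(terms)
    if missing:
        raise KeyError(f"missing terms: {sorted(missing)}")
    return QMEAN_INTERCEPT + sum(
        weight * terms[name] for name, weight in QMEAN_WEIGHTS.items()
    )
```

The negative weights of the four potential terms reflect that lower (more favourable) energies should raise the predicted quality, whereas the two agreement terms contribute positively.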
The local scoring function QMEANlocal consists of 8 terms. All terms are calculated over a sliding window of 9 residues and a triangular smoothing weighting scheme has been applied as described elsewhere [16, 17]. The same Cβ solvation and residue-level interaction potentials are used as in the global QMEAN scoring function. For the torsion angle potential, a standard implementation with 10-degree angle bins works slightly better than the coarse-grained version over 3 residues used in QMEAN (data not shown). An all-atom interaction potential implementation adapted to local analysis is used covering distances from 0 to 10 Å (step size 0.5 Å). The two agreement terms are adopted and describe the percentage agreement between predicted and measured solvent accessibility and secondary structure within the sliding window. Two trivial features are also used: the average solvent accessibility (weighted by triangular smoothing) and the fraction of residues in the 9-residue window with no secondary structure assigned by DSSP.
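A sliding-window average with triangular weights can be sketched as follows. The exact weights and the treatment of chain termini are not specified here, so the sketch assumes linearly decreasing integer weights (1, 2, 3, 4, 5, 4, 3, 2, 1 for a window of 9) and renormalises over the residues actually present when the window is truncated at the termini.

```python
def triangular_weights(window=9):
    """Integer weights peaking at the central residue (assumed scheme)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    return [half + 1 - abs(k) for k in range(-half, half + 1)]


def triangular_smooth(values, window=9):
    """Weighted moving average of per-residue values.

    At the termini the window is truncated and the weights of the
    remaining positions are renormalised (an assumption of this sketch).
    """
    weights = triangular_weights(window)
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        numerator = denominator = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < len(values):
                w = weights[k + half]
                numerator += w * values[j]
                denominator += w
        smoothed.append(numerator / denominator)
    return smoothed
```

For example, a single per-residue value of 9 surrounded by zeros is spread over the window, giving 9 × 5 / 25 = 1.8 at the central position.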
The following weighting factors are used (derived using linear regression in analogy to QMEAN with the Cα distance as target function): $W_{\mathrm{torsion}} = 1.477$, $W_{\mathrm{solvation}} = 0.508$, $W_{\mathrm{pair,\,residue}} = 0.164$, $W_{\mathrm{pair,\,all\text{-}atom}} = 2.097$, $W_{\mathrm{SSE\,agreement}} = -0.742$, $W_{\mathrm{ACC\,agreement}} = -0.372$, $W_{\mathrm{solvent\_accessibility}} = 0.051$, $W_{\mathrm{fraction\_loop}} = 0.666$, intercept (with the y-axis) $= 1.701$.
QMEANclust and QMEANclust_local
In analogy to the analysis of the global deviation between models in QMEANclust, the distance between identical residues after superposition with the software TMscore is used to estimate the local model quality in QMEANclust_local. The Cα distances of all corresponding residues are extracted and stored in an n × n × m matrix (where n is the number of models and m the length of the complete target sequence).
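How a per-residue score might be derived from such a matrix is sketched below. The aggregation is an assumption of this example: for each residue of a model, the Cα distances to the corresponding residue in the reference models are averaged. The nested-list layout, the use of `None` for residues missing in one of the two models, and the optional `reference` subset are hypothetical.

```python
from statistics import mean


def local_consensus(dist, reference=None):
    """Per-residue mean Calpha distance of each model to the reference models.

    dist[i][j][r] -- distance between residue r of model i and model j after
                     superposition, or None if r is missing in either model
    reference     -- indices of models used as reference (default: all)
    Returns an n x m nested list; None where no distance is available.
    """
    n = len(dist)
    m = len(dist[0][0])
    ref = list(range(n)) if reference is None else list(reference)
    result = []
    for i in range(n):
        per_residue = []
        for r in range(m):
            values = [
                dist[i][j][r]
                for j in ref
                if j != i and dist[i][j][r] is not None
            ]
            per_residue.append(mean(values) if values else None)
        result.append(per_residue)
    return result
```

Small values indicate residues whose position agrees with most reference models, large values indicate regions deviating from the structural consensus.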
selfQMEAN and selfQMEANlocal
For the target-specific versions of QMEAN, the statistical potentials have been derived from all models of a given target with a QMEANclust Z-score above minus one. Thereby low quality outlier models carrying no information are excluded. The frequency counts (i.e. the basis for the different statistical potential terms) are weighted according to the global QMEANclust score. This ensures that structural features of more reliable models have a stronger impact on the resulting potentials. A specific weighting of each interaction according to the local QMEANclust score has also been investigated but resulted in a worse performance. Two approaches for the combination of the statistical potential terms with the agreement terms have been tested: either the terms are combined directly using the same weighting factors as for QMEAN, or Z-scores over all models are built for each term which are then summed up.
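The selection and weighting step can be illustrated with the following minimal sketch. It assumes that the Z-score is computed over all models of the target using the population standard deviation and that the global QMEANclust score is used directly as the weight of each observation; the function names and the representation of structural features as discrete bin labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def select_models(clust_scores, z_cutoff=-1.0):
    """Return {model index: score} for models with Z-score above the cutoff."""
    mu = mean(clust_scores)
    sd = pstdev(clust_scores)
    if sd == 0:
        # All scores identical: no outliers can be identified.
        return dict(enumerate(clust_scores))
    return {
        i: s for i, s in enumerate(clust_scores) if (s - mu) / sd > z_cutoff
    }


def weighted_counts(observations, weights):
    """Accumulate frequency counts, each observation weighted by model score.

    observations -- model index -> iterable of observed feature bins
    weights      -- model index -> weight (output of select_models)
    """
    counts = defaultdict(float)
    for index, weight in weights.items():
        for feature_bin in observations[index]:
            counts[feature_bin] += weight
    return dict(counts)
```

The resulting weighted counts would then replace the plain frequency counts when deriving the target-specific statistical potentials.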
CASP data sets
The training set consists of all models submitted to CASP6. In order to reduce the influence of outliers in the derivation of the weighting factors we applied the following filter: all models which have, for any of the 4 statistical potential terms, a total energy more than 3 standard deviations above or below the mean are removed from the training set. This resulted in a final set of 23,925 models.
The CASP7 test set comprises all server models submitted to CASP7. In order to be able to compare our results to those presented in Zhou & Skolnick, we only included models of the TS category and skipped AL models. The GDT_TS values for the evaluation were taken directly from the official CASP7 website http://predictioncenter.org/casp7/. All data reported in the tables related to CASP7 represent averages over the 98 targets.
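A filter of this kind can be sketched as follows. The sketch assumes that the mean and standard deviation of each term are computed over the complete training set (rather than per target) and that the population standard deviation is used; the dictionary-based model representation and the term names are hypothetical.

```python
from statistics import mean, pstdev


def filter_outliers(models, terms, n_sd=3.0):
    """Keep models whose term values all lie within n_sd standard deviations.

    models -- list of dicts mapping term name -> total energy (assumed layout)
    terms  -- names of the statistical potential terms to check
    """
    stats = {}
    for term in terms:
        values = [model[term] for model in models]
        stats[term] = (mean(values), pstdev(values))
    kept = []
    for model in models:
        within = all(
            abs(model[term] - stats[term][0]) <= n_sd * stats[term][1]
            for term in terms
        )
        if within:
            kept.append(model)
    return kept
```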
MOULDER data set
We use the MOULDER test set published in Eramian et al. in order to test QMEAN under a more realistic modelling situation. The test set was originally used to compare the support vector machine-based metapredictor SVMod with a variety of existing energy functions. The performance data of all tested scoring functions can be obtained from the Sali Lab http://salilab.org/decoys/ and the comprehensive set of models from the webpage of Marti-Renom http://sgu.bioinfo.cipf.es/datasets/Models/comp_models.tar.gz. The MOULDER test set from Eramian et al. consists of 20 target/template pairs of remotely related homologues. The 20 targets do not share significant structural similarity to each other. For each modelling case a total number of 300 alternative models were generated using MOULDER. We directly used the performance data for all the scoring functions from the publication and re-ran the benchmarking including the methods described in this paper.
The performance of a given scoring function in selecting the model closest to the native structure was benchmarked as described in the original paper. From the set of 300 models a random subset of 75 models is selected 2000 times. In each iteration, the models are ranked by the scoring function and the difference (in Ångstrom) between the selected model and the model with the lowest RMSD in the given subset is recorded. Finally, the delta RMSD is reported averaged over the 2000 iterations and 20 targets.
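The resampling procedure for one target can be sketched as below. The subset size and number of iterations follow the description above; the function name, the `lower_is_better` flag and the fixed random seed are assumptions added for the example. Averaging the returned value over the 20 targets gives the reported delta RMSD.

```python
import random


def mean_delta_rmsd(scores, rmsds, subset_size=75, iterations=2000,
                    lower_is_better=True, seed=0):
    """Average RMSD difference between the selected and the best model.

    scores -- score assigned to each model by the scoring function
    rmsds  -- RMSD of each model to the native structure (in Angstrom)
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    indices = range(len(scores))
    choose = min if lower_is_better else max
    total = 0.0
    for _ in range(iterations):
        subset = rng.sample(indices, subset_size)
        # Model ranked first by the scoring function within this subset.
        selected = choose(subset, key=lambda i: scores[i])
        # Model closest to native within the same subset.
        best_rmsd = min(rmsds[i] for i in subset)
        total += rmsds[selected] - best_rmsd
    return total / iterations
```

A scoring function that always ranks the model with the lowest RMSD first yields a delta RMSD of zero; larger values indicate poorer model selection.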
The analysis of the statistical significance on the CASP7 set is based on a paired t-test (95% confidence level) and has been carried out in R. The ROC curve analysis has been performed on all residues of all CASP7 server models using the R package ROCR.
In order to evaluate the model quality estimation performance of different local scoring functions a Kendall's tau test has been used to measure the degree of correspondence of RMSD and predicted local score. Kendall's tau has been calculated on a per model basis and compared between the different scoring functions. For this purpose, the Kendall R-Package of A.I. McLeod has been used, accessible over the CRAN website http://cran.r-project.org/.
We thank James N. Battey for proofreading. We are grateful to Andrej Sali and Marc Marti-Renom for giving access to the MOULDER test set and Hongyi Zhou and Jeffrey Skolnick for providing the data of TASSER-QA. We would like to acknowledge financial support by the Swiss National Science Foundation (SNF) and by the Swiss Institute of Bioinformatics (SIB).
- Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268(1):209–225. 10.1006/jmbi.1997.0959
- Zhang Y, Arakaki AK, Skolnick J: TASSER: An automated method for the prediction of protein tertiary structures in CASP6. Proteins: Structure, Function, and Bioinformatics 2005, 61(S7):91–98. 10.1002/prot.20724
- Sommer I, Toppo S, Sander O, Lengauer T, Tosatto SC: Improving the quality of protein structure models by selecting from alignment alternatives. BMC Bioinformatics 2006, 7: 364. 10.1186/1471-2105-7-364
- Saqi MA, Bates PA, Sternberg MJ: Towards an automatic method of predicting protein structure by homology: an evaluation of suboptimal sequence alignments. Protein Eng 1992, 5(4):305–311. 10.1093/protein/5.4.305
- Cheng J: A multi-template combination algorithm for protein comparative modeling. BMC Struct Biol 2008, 8: 18. 10.1186/1472-6807-8-18
- Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358(6381):86–89. 10.1038/358086a0
- John B, Sali A: Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res 2003, 31(14):3982–3992. 10.1093/nar/gkg460
- Petrey D, Xiang Z, Tang CL, Xie L, Gimpelev M, Mitros T, Soto CS, Goldsmith-Fischman S, Kernytsky A, Schlessinger A, et al.: Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins: Structure, Function, and Genetics 2003, 53(S6):430–435. 10.1002/prot.10550
- Kelley LA, MacCallum RM, Sternberg MJ: Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000, 299(2):499–520. 10.1006/jmbi.2000.3741
- Fernandez-Fuentes N, Madrid-Aliste CJ, Rai BK, Fajardo JE, Fiser A: M4T: a comparative protein structure modeling server. Nucleic Acids Res 2007, (35 Web Server):W363–368. 10.1093/nar/gkm341
- Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003, 19(8):1015–1018. 10.1093/bioinformatics/btg124
- Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A: Pcons: A neural-network-based consensus predictor that improves fold recognition. Protein Sci 2001, 10(11):2354–2362. 10.1110/ps.08501
- Shortle D, Simons KT, Baker D: Clustering of low-energy conformations near the native structures of small proteins. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(19):11158–11162. 10.1073/pnas.95.19.11158
- Wang K, Fain B, Levitt M, Samudrala R: Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct Biol 2004, 4: 8. 10.1186/1472-6807-4-8
- Xiang Z, Soto C, Honig B: Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc Natl Acad Sci USA 2002, 99(11):7432–7437. 10.1073/pnas.102179699
- Tress ML, Jones D, Valencia A: Predicting reliable regions in protein alignments from sequence profiles. J Mol Biol 2003, 330(4):705–718. 10.1016/S0022-2836(03)00622-3
- Wallner B, Elofsson A: Identification of correct regions in protein models using structural, alignment, and consensus information. Protein Sci 2006, 15(4):900–913. 10.1110/ps.051799606
- Chen H, Kihara D: Estimating quality of template-based protein models by alignment stability. Proteins 2008, 71(3):1255–1274. 10.1002/prot.21819
- Laskowski RA, MacArthur MW, Moss DS, Thornton JM: PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography 1993, 26(2):283–291. 10.1107/S0021889892009944
- Hooft RW, Vriend G, Sander C, Abola EE: Errors in protein structures. Nature 1996, 381(6580):272. 10.1038/381272a0
- Luthy R, Bowie JU, Eisenberg D: Assessment of protein models with three-dimensional profiles. Nature 1992, 356(6364):83–85. 10.1038/356083a0
- Dominy BN, Brooks CL III: Identifying native-like protein structures using physics-based potentials. Journal of Computational Chemistry 2002, 23(1):147–160. 10.1002/jcc.10018
- Lazaridis T, Karplus M: Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 1999, 288(3):477–487. 10.1006/jmbi.1999.2685
- Lu H, Skolnick J: A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 2001, 44(3):223–232. 10.1002/prot.1087
- Melo F, Feytmans E: Assessing protein structures with a non-local atomic interaction energy. Journal of Molecular Biology 1998, 277(5):1141–1152. 10.1006/jmbi.1998.1665
- Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11(2):430–448. 10.1110/ps.25502
- Shen M-Y, Sali A: Statistical potential for assessment and prediction of protein structures. Protein Sci 2006, 15(11):2507–2524. 10.1110/ps.062416606
- Sippl MJ: Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. Journal of Molecular Biology 1990, 213(4):859–883. 10.1016/S0022-2836(05)80269-4
- Zhou H, Zhou Y: Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002, 11(11):2714–2726. 10.1110/ps.0217002
- Wallner B, Elofsson A: Can correct protein models be identified? Protein Sci 2003, 12(5):1073–1086. 10.1110/ps.0236803
- Tosatto S: The victor/FRST function for model quality estimation. Journal of computational biology: a journal of computational molecular cell biology 2005, 12: 1316–1327.
- Eramian D, Shen M-y, Devos D, Melo F, Sali A, Marti-Renom MA: A composite score for predicting errors in protein structure models. Protein Sci 2006, 15(7):1653–1666. 10.1110/ps.062095806
- Benkert P, Tosatto SCE, Schomburg D: QMEAN: A comprehensive scoring function for model quality assessment. Proteins: Structure, Function, and Bioinformatics 2008, 71(1):261–277. 10.1002/prot.21715
- Qiu J, Sheffler W, Baker D, Noble WS: Ranking predicted protein structures with support vector regression. Proteins 2008, 71(3):1175–1182. 10.1002/prot.21809
- Zhou H, Skolnick J: Protein model quality assessment prediction by combining fragment comparisons and a consensus C(alpha) contact potential. Proteins 2008, 71(3):1211–1218. 10.1002/prot.21813
- Hillisch A, Pineda LF, Hilgenfeld R: Utility of homology models in the drug discovery process. Drug Discov Today 2004, 9(15):659–669. 10.1016/S1359-6446(04)03196-4
- Thorsteinsdottir HB, Schwede T, Zoete V, Meuwly M: How inaccuracies in protein structure models affect estimates of protein-ligand interactions: computational analysis of HIV-I protease inhibitor binding. Proteins 2006, 65(2):407–423. 10.1002/prot.21096
- Baker D, Sali A: Protein structure prediction and structural genomics. Science 2001, 294(5540):93–96. 10.1126/science.1065659
- Fasnacht M, Zhu J, Honig B: Local quality assessment in homology models using statistical potentials and support vector machines. Protein Sci 2007, 16(8):1557–1568. 10.1110/ps.072856307
- Fischer D: Servers for protein structure prediction. Curr Opin Struct Biol 2006, 16(2):178–182. 10.1016/j.sbi.2006.03.004
- Cozzetto D, Kryshtafovych A, Ceriani M, Tramontano A: Assessment of predictions in the model quality assessment category. Proteins 2007, 69(Suppl 8):175–183. 10.1002/prot.21669
- Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction – Round VII. Proteins: Structure, Function, and Bioinformatics 2007, 69(S8):3–9. 10.1002/prot.21767
- Battey JND, Kopp Jr, Bordoli L, Read RJ, Clarke ND, Schwede T: Automated server predictions in CASP7. Proteins 2007, 69(Suppl 8):68–82. 10.1002/prot.21761
- Wallner B, Elofsson A: Prediction of global and local model quality in CASP7 using Pcons and ProQ. Proteins 2007, 69(Suppl 8):184–193. 10.1002/prot.21774
- Joo K, Lee J, Lee S, Seo JH, Lee SJ: High accuracy template based modeling by global optimization. Proteins 2007, 69(Suppl 8):83–89. 10.1002/prot.21628
- Zemla A: LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Research 2003, 31(13):3370–3374. 10.1093/nar/gkg571
- Zhou H, Pandit SB, Lee SY, Borreguero J, Chen H, Wroblewska L, Skolnick J: Analysis of TASSER-based CASP7 protein structure prediction results. Proteins 2007, 69(Suppl 8):90–97. 10.1002/prot.21649
- Yang YD, Park C, Kihara D: Threading without optimizing weighting factors for scoring function. Proteins 2008, 73(3):581–596. 10.1002/prot.22082
- Eramian D, Eswar N, Shen MY, Sali A: How well can the accuracy of comparative protein structure models be predicted? Protein Sci 2008, 17(11):1881–1893. 10.1110/ps.036061.108
- Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 2000, 16(9):776–785. 10.1093/bioinformatics/16.9.776
- Cristobal S, Zemla A, Fischer D, Rychlewski L, Elofsson A: A study of quality measures for protein threading models. BMC Bioinformatics 2001, 2(1):5. 10.1186/1471-2105-2-5
- Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics 2004, 57(4):702–710. 10.1002/prot.20264
- McGuffin LJ: Benchmarking consensus model quality assessment for protein fold recognition. BMC Bioinformatics 2007, 8: 345. 10.1186/1471-2105-8-345
- Wu S, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 2007, 5: 17. 10.1186/1741-7007-5-17
- Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. Embo J 1986, 5(4):823–826.
- Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 1998, 275(5):895–916. 10.1006/jmbi.1997.1479
- Sali A: Comparative protein modeling by satisfaction of spatial restraints. Mol Med Today 1995, 1(6):270–277. 10.1016/S1357-4310(95)91170-7
- Sippl MJ: Recognition of errors in three-dimensional structures of proteins. Proteins: Structure, Function, and Genetics 1993, 17(4):355–362. 10.1002/prot.340170404
- Qian B, Raman S, Das R, Bradley P, McCoy AJ, Read RJ, Baker D: High-resolution structure prediction and the crystallographic phase problem. Nature 2007, 450(7167):259–264. 10.1038/nature06249
- Benkert P, Kunzli M, Schwede T: QMEAN server for protein model quality estimation. Nucleic Acids Res 2009, in press.
- Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 1996, 5(3):299–314. 10.2307/1390807
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. 10.1002/bip.360221211
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21(20):3940–3941. 10.1093/bioinformatics/bti623
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.