Bmc Structural Biology Improving Consensus Contact Prediction via Server Correlation Reduction

Background: Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new fold targets. However, we find out that even for new fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them.

good templates in PDB [8]. For example, Misura et al. [9] have revised the widely-used ab initio folding program, Rosetta [10], by incorporating inter-residue contact information as a component of Rosetta's energy function, and shown that the revised Rosetta exhibits not only a better computational efficiency, but also a better prediction accuracy. For some test proteins, the models built by this revised Rosetta are more accurate than their templatebased counterparts, which is rarely seen before [7]. Zhangserver [11] and TASSER [12] perform very well in both CASP7 and CASP8. One of the major advantages of these two programs over the others is that both depend on contacts and distance restraints, extracted from multiple templates, to refine the template-based models. It has been shown by Zhang et al. that ab initio prediction methods can benefit from contact predictions with an accuracy that is higher than 22% [13].
Protein inter-residue contact was first studied by [14][15][16][17] to calculate the mean force potential. Göbel et al. [18] formally proposed the problem of contact prediction, and showed that correlated mutation (CM) is useful information to predict inter-residue contacts. The fundamental assumption is that if two residues are in contact with each other, during evolution, if one residue mutates, the other one has a high chance to mutate as well. Thus, by analyzing residue mutation information from multiple sequence alignments, it can be predicted whether or not two residues are in contact. Since then, different correlated mutation statistical methods have been carefully examined [19][20][21][22][23][24].
According to whether structural templates information is taken into consideration, contact prediction methods can be classified into two categories: sequence-based methods and template-based methods. Among all the sequencebased methods, some rely solely on correlated mutation information calculated by different statistical approaches [18,[22][23][24], while others encode the correlated mutation, together with other features such as secondary structure and solvent accessibility, into machine learning models [25][26][27][28][29][30][31][32]. Although the correlated mutation performs well on local contact prediction, which is usually defined to be two residues within six amino acids from each other in the protein sequence, it usually fails for non-local contacts. Therefore, other information such as evolutionary information and secondary structure information, has been applied to improve the performance of contact prediction methods [25][26][27][28][29][30][31][32][33]. In [25], Fariselli et al. encoded four types of features into a neural network based server (COR-NET): 1) correlated mutation, 2) evolutionary information, 3) sequence conservation, and 4) predicted secondary structure.
They defined that two residues are in contact if the Euclidean distance between the coordinates of their C  atoms (C  atom for Glycine) is smaller than 8Å, and the sequence separation between the two residues is at least seven to eliminate the influence of local -helical contacts. CORNET achieves an average accuracy of 21%. Other features have been investigated since then [27,29]. PROFcon [31], one of the best three contact prediction servers in CASP6 [5], encodes more information into its neural network model, including solvent accessibility and secondary structure over the regions between the two residues, as well as the properties of the entire protein. PROFcon performs impressively on small proteins or alpha/ beta proteins with an accuracy of more than 30%. Recently, Shackelford and Karplus [33] proposed a neural network based method to calculate the correlated mutation by using the statistical significance of the mutual information between the columns of multiple sequence alignment. Their SAM-T06 server outperforms all the other contact prediction servers in CASP7, and achieves an average accuracy of 45% for all CASP7 target proteins, which is higher than that of any previously reported study.
In contrast to these sequence-based methods, which encode correlated mutation information and other sequence-derived information, there are some studies on predicting inter-residue contacts from structural templates [9,32,[34][35][36]. The underlying assumption for such methods is that contacts are usually very conserved during evolution. Consequently, templates, with structures similar to that of the target protein, usually contain common contacts, such that consensus methods work well. Bystroff et al. [34,35] have considered folding pathways, and predicted contacts by employing HMMSTR [37], a hidden Markov model for local sequence-structure correlation. LOMETS [36], a majority voting based consensus method, takes nine state-of-the-art threading programs as inputs. LOMETS predicts contacts by attempting to select the best input model.
Recently, two Support Vector Machines (SVMs) based contact prediction methods, SVM-SEQ and SVM-LOM-ETS, have been proposed by Wu et al. [32]. SVM-SEQ only takes sequence-derived information into consideration, whereas SVM-LOMETS, a consensus method, is based on structural templates. SVM-LOMETS differs from its ancestor, LOMETS, in that it carefully trains contact frequency, C  distances, and template quality by an SVM model. The inputs for SVM-LOMETS are nine state-of-the-art threading programs: FUGUE [38], HHSEARCH [39], PAINT, PPA-I, PPA-II [36], PROSPECT2 [40], SAM-T02 [41], SP3 and SPARKS2 [42]. Both SVM-SEQ and SVM-LOMETS are tested on a set of 554 proteins, on which each achieves an average accuracy of 29% and 53%, respectively. Although it is widely acknowledged that a method usually has dif-ferent performance on different data sets, one can still expect that a consensus contact prediction method will outperform the individual servers. Instead of testing on the entire CASP7 data set, these two programs are further tested on the 15 new fold (NF) targets of CASP7. The average accuracies are 26% and 11%, respectively. Through a comprehensive comparison of sequence-based and structure-based methods, including SVM-SEQ, SVMCON [43], SVM-LOMETS, LOMETS, and SAM-T06 server, Wu et al. have concluded that template-based methods are better than sequence-based methods on template-based modeling (TBM) targets, but worse on new fold targets. However, even for new fold targets, where the threading methods fail to identify good templates, the templates discovered by threading usually contain many true contacts. The major challenge is that some true contacts are not contained in a majority of the top templates, resulting in the failure of the traditional majority voting methods. In this paper, we propose a novel consensus contact prediction method to eliminate the effect of server correlation, and to discover true contacts even when they are not commonly found in the top templates. All the contacts, determined by structure prediction servers, are considered to be candidates. Our consensus method then assigns a confidence score to each contact candidate, while also taking correlated mutation information into consideration.

Data Set Server Selection
To evaluate the performance of the proposed consensus method, six threading-based protein structure prediction servers are used: FOLDpro [44], mGenThreader [45,46], RAPTOR [47,48], FUGUE3 [38], SAM-T02 [41], and SPARK3 [42]. Although there are some servers, such as Rosetta and Zhang-server, with a better performance than that of the six servers, they are not used because their models are already refined by predicted contacts.

Training and Test Data
The biennial CASP competition provides us a comprehensive and objective data set. The CASP7 targets and models generated by the six servers are adopted as the training and test data. For each server on a target, the five submitted models are considered. All server models are downloaded from the CASP7 website, except for mGenThreader, which does not participate in CASP7. We submitted the CASP7 targets to the mGenThreader web server, and downloaded models from there before August 2006. Therefore, all these models are generated before the native structures of the CASP7 targets are released. Eighty nine CASP7 target proteins are used as valid targets for the CASP7 evaluation, while 104 protein sequences are released as targets. Redundancy is removed at the 40% sequence identity level by using CD-HIT [49], which results in 88 target pro-teins. Only T0346 is removed, because it shares 71% sequence identity with T0290. Furthermore, two targets (T0334 and T0385) are removed from the data set due to some errors in the models, generated by some of the six individual servers on these two targets; for example, the models generated by some servers only cover discontinued regions of the target proteins. To conduct a cross-validation, the 86 target proteins are randomly divided into four sets of 22, 21, 21, and 22 proteins, respectively. If one target belongs to a certain set, all of its models and contacts are in the same set.

Data Set Statistics
The performance of the six individual servers are compared in terms of prediction accuracy and coverage. In evaluating the performance of a server, only the best models of each target are considered. If the number of contacts in a model is less than L/5, where L is the target size, both the accuracy and the coverage for this model are set to 0. As shown in Table 1, the average accuracy of the six servers ranges from 43% to 53%. The SAM-T02 server has the highest accuracy but the lowest coverage. The artificial server "Overall" in Table 1 means a server that generates the union set of all contacts contained in the best models. The accuracy of server "Overall" is very low (12%), compared to that of any individual server. Note that the server "Overall" consistently contains many more true contacts than any individual server does. Therefore, the low accuracy of the server "Overall" implies that the false contacts, generated by these individual servers, differ from each other in most cases, whereas the individual servers tend to generate common true contacts. This means the consensus method can probably be employed to differentiate true contacts from false ones.
As shown in Table 1, the average coverage of the six servers ranges from 37% to 52%. However, when they are combined, the coverage for server "Overall" is very high (approximately 80%). This indicates that some true con- tacts are predicted by only a small number of individual servers, and the different servers can predict a common subset of the true contacts. On the other hand, this implies that most of the native contacts are contained in the models, generated by threading programs. Thus, the challenge is how to identify them.

Server Correlation and Latent Servers
This section studies the correlation of the six individual servers and derives the latent servers. Table 2 lists the pairwise correlation of the six individual servers, which is calculated according to Eq. (3). Note that the matrix is not symmetric, because different servers predict different numbers of contacts. As shown in Table 2, the correlation between two different servers ranges from 0.25 to 0.59, which implies that some servers are more closely correlated than others in terms of contact prediction. Therefore, the majority voting based consensus methods, which simply apply majority voting and assume each server is independent, will not always work because some of the common components of these servers are over-expressed.
The relationship between the latent servers and the individual ones is then derived according to Eq. (4). Note here that the top five models for each target of each server are considered. The confidence score of a server on a contact candidate is estimated by the number of models in the top five, containing this contact, divided by the total number of models considered (five in this case). As shown in Table  3, different latent servers represent different individual servers. For example, H 1 represents the common characteristics shared by these individual servers, because the weights of H 1 , on these individual servers, are about the same; H 2 differentiates FUGUE3 from the other servers; H 3 represents FOLDpro by a large positive weight, and represents mGenThreader by a large negative weight. Based on the eigenvalues, H 6 is eliminated, since the eigenvalue for H 6 is much smaller than that of the others. Thus, H 6 is considered as random noise.
The optimal weights for the latent servers are derived by the cross-validation of the four sets. Correlated mutation is considered to be another independent latent server, because it provides a target sequence-related probability for each contact candidate. Correlated mutation is calculated as previously described in [18,22]. Each time the ILP model is trained on three of the four sets, and a set of weights is optimized by the ILP model, based on which a new prediction server is derived, named as , and , respectively. In this paper, S* refers to server on test set i (i = 1, 2, 3, 4). Since the inputs are the six individual servers, after the optimal weights * are calculated by the ILP model, each hidden server in Eq. (1) are further replaced by the linear combination of the original individual servers as calculated in Eq. (5). Table 4 shows the linear combination representation of S* on the individual servers and correlated mutation. It is clear that the four sets of weights are very similar. Note that mGenThreader has negative weights. This implies that the contribution of mGenThreader is accounted for by the other individual servers.

CASP7 Evaluation
We first assess our consensus server S* by Receiver Operating Characteristic (ROC) plots. They provide an intuitive way to examine the trade-off between the ability of a classifier to correctly identify positive cases and to incorrectly classify negative cases. Figure 1 depicts the performance of server S* and the six individual servers on the four test sets.
As shown in Figure 1, server S* performs better than any individual server on all the four test sets. For each server, the performance of this server on test set 1 is slightly better than that on the other three test sets, which means test set 1 is the easiest among those four. RAPTOR performs better than other individual servers on the first three test sets, and SPARK3 exhibits the best performance on test set 4. There are distinct performance differences between server S* and the best individual server on test set 1, 2, and 4, when the false positive rate is below 0.3. However, the difference is not obvious on those three test sets, when the false positive rate is higher than 0.3. For test set 3, the most difficult test set, the performance of S* is much better than that of any individual server all the time. It is also noticeable that the curve of S* is much smoother than that of the individual servers.
Then, the accuracy of S* is evaluated. Table 5 summarizes the average accuracy of S* and the majority voting method on the four test sets, where different numbers of top contacts are evaluated. Recall S* generates a confidence score for each contact candidate. The top contacts for each target are readily found by sorting the candidates according to their confidence scores. The majority voting method is implemented as follows: for each contact candidate, its confidence score by the majority voting method is calculated as the sum of the confidence scores assigned by the six servers. After the scores of all the contact candidates are calculated and sorted, different numbers of the top candidates are chosen.
As shown in Table 5, the average accuracy increases when the number of the top contacts decreases, except for server S* on test set 1, in which the accuracy for the top L/10 contacts is slightly lower than that for the top L/5 contacts. This occurs because L/10 is usually a small number (20-30 for most cases), and a few incorrectly predicted top contacts will influence the total accuracy significantly. The overall accuracy of S* on all the four test sets is at least 63%, and is consistently higher than the accuracy of the majority voting method. For the top L/5 contacts, the accuracy of S* is 73%, which is about 5% higher than that of the majority voting method, and much higher than the accuracy of any previously reported study. Figure 2 reflects the prediction accuracy for the top L/5 contacts of S* on each CASP7 target. It can be seen that the accuracy is higher than 80% on most targets. In fact, of the total 86 targets, S* has an accuracy of 100% on 13 targets, higher than 90% on 39 targets, higher than 80% on 58 targets, and below 40% on only 16 targets. Note that S* has an accuracy of 0 on two targets: T0309 (free modeling target) and T0335 (template based modeling target). We carefully look into these two cases. Both targets are very short. The target sequences, published by CASP7 for T0309 and T0335, have 76 and 85 residues, respectively. However, the experimentally determined size used by CASP7 to evaluate these two targets are only 62 and 36, respectively. This conveys that some parts of the targets are not experimentally determinable nor accurate enough. Thus, L/5 is only 12 and 7 for the two targets. Additionally, all the six individual servers perform poorly in terms of the contact prediction, which means there are only a few correct candidates among a large number of incorrect ones. This can explain the failure of S* on T0309 and T0335.
To evaluate more carefully how much our consensus method can improve upon individual servers and the simple majority voting method, all the targets are divided into three categories: easy (high accuracy), medium (template based modeling), and hard (new fold), according to the CASP7 assessment http://predictioncenter.org/casp7/. Table 6 shows the average accuracy and deviation on the top L/5 contacts of S*, the six individual servers, and the majority voting method. As shown in Table 6, for easy, medium, and hard targets, the accuracy of S* on the top L/5 contacts is 94%, 76%, and 37%, respectively, and much higher than the best individual server, where the improvement is at least 17% for each case. On the other hand, server S* always performs better than the majority voting method, and the improvements are about 2%, 5%, and 24%, respectively. This exactly verifies the server correlation assumption because, for easy targets, individual servers usually do well, which means for a contact candidate, the more servers that support it, the more likely it is correct. However, the majority voting rule does not always work on medium and hard targets, because it suffers from the over-expressed common components of the input servers due to the server correlation. Thus, our consensus method does much better than the majority voting method on harder targets.
Depending on the sequence separation, contacts can be classified as short-range contacts (separation 6-11), medium-range contacts (separation 12-24), and longrange contacts (separation >24) [32,43]. The performance of our method is further evaluated on different separation ranges for target protein classes with different difficulty levels. As shown in Table 7, the accuracy of a certain separation range decreases clearly when target proteins become harder. For easy targets, the accuracy of longrange contacts is higher than that of short-and mediumrange contacts. This makes sense because for an easy target, it is very likely that all the individual servers predict models that have very similar topology to the native structure. Thus, these models contain common long-range contacts, which helps to determine the overall topology. For medium targets, our method achieves similar performance on different separation ranges. Not surprisingly, when applied on hard targets, the accuracy of long-range contacts is much worse than that of short-and mediumrange contacts. This coincides with the fact that the individual servers are usually not able to generate models with correct folds, which causes most long-range contact candidates to be wrong ones. Among the three categories of the test proteins, the new fold category is much more important than the other two for fairly evaluating the performance of a contact predictor, especially for template-based consensus methods. In fact, new fold targets are adopted as the assessment data set for the contact predictors by CASPs. Table 8 shows the average accuracy on the top L/5 ROC curves for our method and the six individual servers   contacts of the six individual servers, the majority voting method, and our method on the 15 new fold targets of CASP7. Note that the classification of the new fold targets comes from the assessors of CASP7, according to the criterion that no server could find the correct templates although there might be homologs in the PDB. Our method significantly outperforms the best individual server on eight out of the 15 targets, and performs worse than the best individual server on five targets. It is noticeable that SAM-T02 outperforms our method on four of these five targets. The reason is that SAM-T02 does not generate complete models for these targets. Instead, it generates structures only for some very conserved regions of the targets. The contacts predicted by SAM-T02 thus cover only a small portion of the targets. It can also be seen from Table 8 that our method performs much better than the majority voting method on new fold targets. More specif-ically, the accuracy of our method at least doubles that of the majority voting method on 10 of the 15 targets. On the other hand, only four of these 15 new fold targets lacked any homologs during CASP7 season, i.e., T0287, T0309, T0314, and T0353 [43]. This implies that although other new fold targets have similar structures in the PDB, almost all structure prediction servers fail to detect them. Thus, by using our predicted contacts, one may be able to identify the similar structures for these target proteins, especially for the proteins on which our method achieves a high accuracy, such as T0316_D2, T0319, T0347_D2, T0350, T0356_D1, T0356_D3, and T0386.
We are not able to obtain top-notch contact predictors, such as SVM-LOMETS, the best published consensus method, SVM-SEQ, the best reported study on new fold targets, and SAM-T06 server, the best evaluated contact Prediction accuracy for the top L/5 contacts of S* on each CASP7 target Figure 2 Prediction accuracy for the top L/5 contacts of S* on each CASP7 target. predictor on CASP7. Thus, the performance of these three methods is retrieved from [32]. When SVM-LOMETS, SVM-SEQ, and SAM-T06 server are applied to the 15 new fold targets of CASP7, each achieves an accuracy of 10.8%, 25.8%, and 21.2% on the top L/5 contact predictions, respectively. On the same data set, the accuracy of our method for the top L/5 contacts is 37%, which indicates that the improvements are significant. Recall that among all three methods, SVM-LOMETS is the only templatebased consensus method. Although the input threading programs of our method are not the same as SVM-LOM-ETS, both methods contain some common input servers such as FUGUE and SAM-T02. The different input servers are within a similar range of accuracy in terms of structure prediction according to the CASP7 evaluation; three inputs for SVM-LOMETS, i.e. PAINT, PPA-I, and PPA-II, are components of Zhang-server, which is ranked the best among all the structure prediction servers on CASP7. Thus, the huge improvement of our method over SVM-LOMETS demonstrates that by revealing the server corre-lation and optimizing the gap between the true and false contacts, superior contacts can be predicted than those of other consensus methods.

Case Study on Two CASP7 New Fold Targets
As shown in the previous section, our method significantly outperforms the other methods, especially on new fold targets. Two CASP7 new fold targets, T0319 and T0350, are investigated in this section. T0319 (PDB id 2j6a) is a zinc finger protein from the ERF1 methyltransferase complex [50] with 135 residues. T0350 (PDB id 2hc5) is protein yvyC from Bacillus subtilis [51] with 117 residues. Table 9 lists the TM-score [52], contact accuracy, and contact coverage of the best model among the five models submitted by each threading server on T0319 and T0350. All the six threading servers fail to detect correct templates. Typically, a TM-score lower than 0.17 indicates a random structure, and a TM-score higher than 0.4 indicates a meaningful structure [52]. Consequently, all the models predicted by these six servers are probably not meaningful structures.
The hardness of these two targets causes all the six threading servers to fail. Thus, the templates selected by the threading servers are almost random and significantly different from each other, which consequently leads to the failure of the majority voting consensus method. As shown in Figure 3, the majority voting method is even worse than some individual servers on these two targets, whereas our method performs significantly better than any individual server. In fact, the top L/5 accuracy of our method is 40.0% and 80.2% on these two targets, while the majority voting method achieves an accuracy of only 10.8% and 21.0%, respectively. One may argue that some of the true contacts picked up by our method are strongly supported by correlated mutation. Even when we remove correlated mutation from our method, its accuracy decreases only slightly, to 39.3% on T0319 and 79.3% on T0350. The minor difference shows that correlated mutation information is not that important for these two targets. Therefore, the case study on these two new fold targets demonstrates that by removing the server correlation and optimizing the best combination of the individual servers, it is possible to select true contacts even if the majority of the individual servers does not support them. Top L/5 contacts are considered. All values are percentiles. Although the goal of this paper is to propose a novel contact prediction method, and thus the modeling of the entire proteins is beyond the scope of this paper, it is still important to demonstrate the potential applications of the proposed method in protein structure prediction methods. One of the major bottlenecks for the state-ofthe-art protein structure prediction methods is model ranking. The most widely used method for model ranking is clustering. However, although clustering based methods work well on easy and medium targets because for such targets, most of the models are high-quality ones and very similar to each other, such clustering methods usually fail on hard targets since the models usually have poor quality and are very different from each other. Thus, we test how well the contacts predicted by our method can rank the models on T0319 and T0350. We design a simple contact ranking score. Given the top L/5 contacts predicted by our method, for each contact, a model scores 1 if this model indeed contains this contact, and scores 0 otherwise. Table 10 shows the ranking of the best model, in terms of TM-score, of each individual server by their default model ranking method (according to the order of the models submitted to CASP7) and the ranking of the best model by our contact score. It is clear that our contact score has much better ranking of the best models for most cases. Additionally, for both T0319 and T0350, the best models generated by all these six individual servers, i.e., model 4 of RAPTOR for T0319 and model 4 of RAPTOR for T0350, are ranked first among all the models by our contact score. This demonstrates the potential applica- a TM-score of the best model among the five submitted models for each server. b Contact accuracy of the best model for each server. All contacts contained in this model are evaluated. c Contact coverage of the best model for each server. All contacts contained in this model are evaluated.
Performance comparison on T0319 and T0350 tions of our contact prediction method to select better submitted models or to select good models at each iteration of the refinement process, especially for hard targets.

Discussion
The experimental results demonstrate that by accounting for the correlation among different threading programs, our consensus method can successfully identify native contacts, even when these contacts are not contained in the majority of the models. It is worth noticing that the proposed method is quite different from the more direct linear combination or non-linear combination of the original individual servers. The underlying reason is that by detecting correlation among the individual servers and removing the last latent server which corresponds to the random noise, our ILP-based optimization process is able to find an optimal solution without the bias caused by the random noise. In this paper, our method is not directly compared to other consensus contact predictors on exactly the same input servers, since servers used in other studies are not all available. Instead, six threading programs are chosen in our method. They have similar or even lower accuracy levels than those used in other consensus contact studies.
One drawback of our method is that it is a selection-only consensus method. If all the individual servers generate models with very few native contacts, our method will fail simply because there is no correct contact to choose. To avoid this drawback, a server independent feature, correlated mutation, is used to introduce some contact candidates which are not predicted by any individual server. However, the signal contained in the correlated mutation is not strong enough to find native contacts. As a result, future work will be to combine more server-independent features to introduce true contact candidates, even if all the individual servers fail to do so. Another drawback is that the small data set used in this paper makes the evaluation of the method's performance less reliable compared to more extensive tests. This is mainly due to the availability of the individual servers used in our method. We did not conduct the experiments on both CASP7 and CASP8 because some of these individual servers have been improved significantly after CASP7 and some of these servers were not available at CASP8. However, the crossvalidation measurement applied in this paper can hopefully reduce the unreliability to the lowest level.
A potential application of our contact prediction method is to provide highly conserved constraints for ab initio folding or protein structure refinement. Recent research has shown that by incorporating contacts predicted from template-based methods or sequence-based methods, a structural model generated by comparative modeling can be refined [11,12,53,54]. However, if all the individual servers predict the structure for a target protein extremely well or very poorly, our consensus method will probably not help too much. In the former case, since almost all the contact candidates provided by these individual servers are correct ones, our method can only improve the accuracy slightly. In the latter case, since there are very few correct contact candidates for our method to choose from, the refinement process can hardly benefit from our results. However, in any other case, contacts provided by our method should help with the folding simulation. The reason is that our method can generate a small number of highly conserved contacts. Considering only a small number of contacts can reduce the conformational search space, and thus increase the speed and reduce the chance of generating wrong models. Moreover, experimental results demonstrate that our method can generate contacts with a higher accuracy than both sequence-based and template-based methods. This can reduce the risk of generating models with incorrect contacts, which can reduce the risk of selecting incorrect models from the final decoy set, and thus, greatly increase the overall ab initio folding accuracy.

Conclusion
We have described a consensus contact prediction method, that is able to reduce the server correlation. Experiments on CASP7 data set show that our method significantly outperforms any previously reported study, especially on new fold targets. Therefore, the proposed method can hopefully provide a powerful tool for protein structure refinement and prediction use. The program of the proposed method is available upon request.

Methods
Recent CASP results have indicated that correlation exists in different protein structure prediction servers, because of the common information used by the servers such as PSI-BLAST [55] sequence profile and PSIPRED-predicted secondary structure [56]. Thus, it is very likely that a true contact is supported less than some false ones due to the server correlation. Our consensus method is capable of reducing the impact caused by the server correlation. The outline of the newly developed consensus method is as follows: • A maximum likelihood (ML) method is applied to measure the correlation coefficient between two servers.
• Principal component analysis (PCA) technique is employed to extract a few independent latent servers from a set of correlated servers.
• An integer linear programming (ILP) method is then used to assign a weight to each latent server, by maximizing the difference between the true contacts and the false ones. Also, correlated mutation is treated as a latent server which assigns a probability value to each contact candidate. This results in a consensus contact predictor that can accurately assign confidence scores to all the contact candidates.

Notations
In this paper, a model refers to a protein conformation, generated by a protein structure prediction server. In contrast to human experts, a server refers to an automated system which predicts a set of models for a given protein (also called a target), whereas a contact predictor/server refers to an automated system which predicts a set of contacts. Following the contact definition used by CASPs, two residues are in contact, if the distance between their C  atoms (C  atom for Glycine), is smaller than 8Å, and they are at least six residues apart in the sequence. We call a contact native/true contact, if the two residues are indeed in contact in the native structure of the target.
The prediction accuracy is the number of correctly predicted contacts divided by the total number of contacts predicted by a predictor, while the coverage is defined as the number of correctly predicted contacts divided by the total number of native contacts. If a contact predictor is a tertiary structure prediction server, all the contacts, contained in the structural models of this server, are considered to be the contact prediction results of this server.
Let ᐍ denote the number of target proteins, and u denote the number of input contact prediction servers. Given a target t l (1  l  ᐍ), a server S i (1  i  u) outputs a set of models. The contacts, determined by these models, are extracted and considered as contact candidates, denoted as C i, l = {c i, l, q |1  q  n i, l }, where n i, l is the number of contacts, predicted by server S i for target t l . The set of all contact candidates for target t l is denoted as C l =  i C i, l . A consensus server aims to assign a confidence score to each candidate in C l .
This paper is based on the following two assumptions: • Server S i generates its predictions based on a confidence measure; that is, for each contact c  C l , S i has a confidence, s i, c, l , on how likely it is for c to appear in the native structure. Since the initial confidence score is unavailable, it is approximated as follows: the number of models containing this contact divided by the total number of models generated by the server for this target.
• There are v implicitly independent latent servers H j (1  j  v) dominating the explicit servers S i . Given a target t l , H j assigns a value h j, c, l (c  C l ) as the confidence score on how likely c is a native contact.
Identifying independent latent servers is essential to reduce the negative effects of the server correlation and to reduce the dimensionality of the search space, as the number of latent servers is expected to be smaller than the number of original servers. After deriving the latent servers, a new and more accurate prediction server S* can be designed, by an optimal linear combination of the latent servers, which for each target t l , assigns a confidence score to each contact candidate c  C l as follows: where is the weight of latent server H j in S*.

Maximum Likelihood Estimation of Server Correlation
Let O i, j, l denote the overlap set of C i, l and C j, l ; that is, O i, j, l = C i, l  C j, l , and let o i, j, l = |O i, j, l |. For a given target, let p i, j be the probability that a contact, returned by server S i , is the same as that returned by S j . Under a reasonable assumption that targets t l (1  l  ᐍ) are mutually independent, the likelihood that server S i (1  i  u) generates contacts c i, l, q (1  q  n i, l ) is λ j * Therefore, the maximum likelihood estimation of p i, j can be calculated as follows: In the rest of this paper, P denotes the matrix [p i, j ] u × u .

Uncovering Independent Latent Servers
Recall that on a target t l , s i, c, l and h j, c, l are the confidence scores assigned by server S i and H j , respectively. Since the latent servers are mutually independent, it is reasonable to assume that s i, c, l is a linear combination of h j, c, l (1  j  v) such that where , 1  i  u, and , 1  j  v. Here,  i, j is the weight, and a larger  i, j implies there is a higher chance that server S i will adopt the contacts reported by H j .
From the correlation matrix of prediction servers S i , the factor analysis technique is employed to derive  i, j and ; that is, can be represented as a linear combination of as follows: where < j,1 ,  j,2 , ʜ,  j, u > is an eigenvector of P T P.

ILP Model to Optimally Combine Latent Servers
After deriving the latent servers H j (1  j  v), a new server S* can be constructed as an optimal linear combination of the latent servers. For each target t l , S* assigns a score to each contact candidate c  C l as in Eq. (1).
To determine a reasonable setting of coefficient , a training process is conducted on a data set , containing |D| training proteins, where t l is a training protein,  C l denotes the set of native contacts, and  C l denotes the set of false contacts. The learning process attempts to maximize the number of contacts that can be correctly identified by S*.
More specifically, for each target t l in the training data set, a score is assigned to each contact candidate by S*. A good contact predictor should assign native contacts with higher scores than those with false ones. The larger the gap between the scores of the native contacts and those of the false ones, the more robust this new prediction server is. In practice, a "soft margin" idea is adopted to take the outliers into account; that is, by allowing errors on some samples, we maximize the number of native contacts with a score that is higher than that of all the false ones, by at least a gap.
This optimization problem is formulated as an integer linear programming model. Let x p, q be an integer variable such that x p, q = 1 if and only if the native contact p is assigned a score higher than that of the false contact q by at least  by the new server; x p, q = 0, otherwise. Here,  is a parameter used as the lower bound of the gap between the score of a native contact and that of the false ones. Similarly, y p, l = 1 if and only if the native contact p has a score higher than that of all the false contacts in ; y p, l = 0, otherwise. The goal is to maximize the number of native contacts that have higher score than that of all the false contacts.
Consequently, the consensus contact prediction problem is formulated by the following ILP model:  (7) forces x p, q to be 0 if the gap between the scores, assigned to the native contact p and the false contact q , is smaller than . If a native contact p has a score not higher than all the false contacts, constraint (8) forces y p, l to be 0. Thus, there is no contribution to the objective function. Constraint (9) normalizes the weights, and constraint (10) restricts x p, q and y p, l to be either 0 or 1. The objective function is the number of native contacts that have higher scores than all the false contacts.

New Prediction Server
After the independent latent servers are derived and the optimal weights are trained, a new contact predictor is formed. Given a query target t l , each server S i produces a set of contact candidates, C i, l . The set of all the candidates is denoted as C l =  i C i, l . For each contact candidate c  C l , the confidence assigned by the latent server H j is calculated by Eq. (5). Then, the new consensus server S* assigns a confidence score to contact candidate c according to Eq.
(1). S* assigns a confidence score to each contact candidate, and picks up the top scored ones as the final predictions.

Authors' contributions
XG and DB carried out the implementation and computational analysis. All authors participated in designing the study and preparing the manuscript. All authors read and approved the final manuscript.