Systematic analysis of the effect of multiple templates on the accuracy of comparative models of protein structure
© Chakravarty et al; licensee BioMed Central Ltd. 2008
Received: 06 March 2007
Accepted: 16 July 2008
Published: 16 July 2008
Although multiple templates are frequently used in comparative modeling, the effect of inclusion of additional template(s) on model accuracy (when compared to that of corresponding single-template based models) is not clear. To address this, we systematically analyze two-template models, the simplest case of multiple-template modeling. For an existing target-template pair (single-template modeling), a two-template based model of the target sequence is constructed by including an additional template without changing the original alignment to measure the effect of the second template on model accuracy.
Even though in a large number of cases a two-template model showed higher accuracy than the corresponding one-template model, over the entire dataset only a marginal improvement was observed on average, as there were many cases where no change or the reverse change was observed. The increase in accuracy due to the structural complementarity of the templates increases at higher alignment accuracies. The combination of templates showing the highest potential for improvement is that where both templates share similar and low (less than 30%) sequence identity with the target, as well as low sequence identity with each other. The structural similarity between the templates also helps in identifying template combinations having a higher chance of resulting in an improved model.
Inclusion of additional template(s) does not necessarily improve model quality, but there are distinct combinations of the two templates, which can be selected a priori, that tend to show improvement in model quality over the single-template model. The benefit derived from structural complementarity depends on the accuracy of the modeling alignment. The study helps to explain the observation that careful selection of templates together with an accurate target:template alignment is necessary to benefit from using multiple templates in comparative modeling, and it provides guidelines to maximize that benefit. This enables formulation of simple template selection rules to rank targets of a protein family in the context of structural genomics.
Comparative modeling uses experimentally determined protein structures (templates) to predict the 3D conformation of another protein with a similar amino acid sequence (target). With the progress of structural genomics initiatives, comparative (or homology) modeling has become an increasingly important method for building protein structure models [1–3]. Not only is comparative modeling the most accurate method of structure prediction, but it also allows a priori estimation of the approximate quality of the models. Due to their added value, models are particularly suitable for comparative studies over complete protein families [7–9]. However, predicted structures in general contain errors and seldom reach the accuracy of experimental structures. Hence, improving the quality of comparative models, especially for models where the Target:Template sequence identity is less than 30%, still remains a challenge.
Three elements that influence the accuracy of comparative models are: (i) the structural similarity between target and template; (ii) the Target:Template alignment accuracy; and (iii) our ability to refine the model (i.e. loop modeling and general refinement). Hence, the quality (measured as errors) of a model in terms of these factors can be described as:
Total Error = Structural Difference + Alignment Error - Refinement
Results and discussion
To facilitate the interpretation of the results, the research design is described in this section.
Structure-based Target:Template alignments are the most accurate ones for structure modeling. While these structure-based alignments do not represent an alignment that is achievable in real modeling cases (because the target structure is by definition not known in these cases), they are a useful benchmark representing error-free alignments. At the other end of the spectrum we find pairwise sequence alignments, which rely only on knowledge of the sequences of the target and the template. Any difference in the quality of models based on these alignment types is solely due to differences in the quality of the modeling alignment and has been the subject of our earlier study of single-template models. The STR and SEQ alignments used here correspond to these baseline alignments and are used to study the influence of alignment accuracy on multiple-template modeling accuracy.
Description of model types (ALN.X.Y nomenclature)
SEQ.2.2
Alignment: Template1 and Template2 sequences are structurally aligned first. The target sequence is then aligned to both template sequences using a pairwise alignment algorithm, without altering the structural alignment between the templates.
Templates used in model building: T1 & T2
Description: A two-template model based on the simplest (least accurate) alignment. This model is influenced by the sequences and structures of both templates.
SEQ.2.1
Alignment: Same as SEQ.2.2
Templates used in model building: T1
Description: A one-template model based on the simplest (least accurate) alignment between the target and both template sequences. This model is influenced by the sequences of both templates but only by the structure of T1.
STR.2.2
Alignment: Target, Template1 and Template2 sequences are structurally aligned.
Templates used in model building: T1 & T2
Description: A two-template model based on an error-free alignment derived from the structural superposition of the target, T1, and T2.
STR.2.1
Alignment: Same as STR.2.2
Templates used in model building: T1
Description: A one-template model based on an error-free alignment derived from the structural superposition of the target, T1, and T2.
The relationship between alignment accuracy and model quality improvement due to structural complementarity is also examined (i) indirectly, by comparing structural complementarity in SEQ and STR models, and (ii) directly, by evaluating the alignment accuracy of SEQ alignments. Though the latter is the more rigorous comparison, we have deliberately carried out the analysis in both ways, as the alignment accuracy is not a directly observable quantity in real modeling cases.
Throughout the study model accuracy is measured by root mean squared deviation (RMSD) of the equivalent Cα atoms between the modeled and experimental structure of the target sequence. Since the data set has been designed such that coverage of all targets by the models is always 100%, there is no need to include coverage into the accuracy assessment (see methods for details).
Two-template vs. one-template model accuracy
Structural complementarity vs. alignment accuracy
To determine whether the apparent lack of structural complementarity observed in SEQ models is caused by errors in the alignment, two tests were carried out. First, the sets of one-template and two-template models based on structural alignments (STR.2.1 and STR.2.2) were analyzed to measure structural complementarity in the absence of alignment errors. Second, a direct comparison of alignment accuracy and structural complementarity in SEQ models was carried out. The comparison of STR models shows that two-template models are more accurate than one-template models (Figure 3B). The comparison of the average accuracy of STR.2.2 models with models based on ideal template chimeras, which represent a perfect two-template model (see methods), showed no difference (data not shown), indicating that the small size of the increase in model accuracy is not a consequence of limitations in the modeling approach. The distribution of ΔRMSD between STR.2.1 and STR.2.2 models (Figure 3C) shows that in ~80% of the cases there is no model improvement upon addition of the second template, in a small fraction of cases (~6%) there is minimal deterioration of the models, and in ~14% of cases an improvement of the model accuracy is observed. In the most favorable cases this improvement can reach up to 6 Å RMSD (Figure 3C), which is relatively large for changes that are not related to the alignment. As previously mentioned, this improvement is a consequence of structural complementarity, which suggests that structural complementarity can be observed more readily in the context of highly accurate alignments.
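The three outcome classes of the ΔRMSD distribution can be tallied with a few lines of code. The sketch below is illustrative only: the sign convention (ΔRMSD taken as the one-template RMSD minus the two-template RMSD, so that positive values mean improvement), the 0.1 Å tolerance for "no change", and the function names are assumptions rather than values or procedures taken from the study.

```python
from collections import Counter


def classify_delta_rmsd(rmsd_one, rmsd_two, tol=0.1):
    """Classify the effect of adding Template2 for a single target.

    rmsd_one: RMSD (in Å) of the one-template model (e.g. STR.2.1)
    rmsd_two: RMSD (in Å) of the two-template model (e.g. STR.2.2)
    tol: assumed tolerance (in Å) below which a change counts as 'unchanged'
    """
    delta = rmsd_one - rmsd_two  # assumed convention: positive = improvement
    if delta > tol:
        return "improved"
    if delta < -tol:
        return "deteriorated"
    return "unchanged"


def outcome_fractions(pairs, tol=0.1):
    """Return the fraction of cases in each outcome class."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (rmsd_one, rmsd_two) pair is required")
    counts = Counter(classify_delta_rmsd(a, b, tol) for a, b in pairs)
    return {k: counts[k] / len(pairs)
            for k in ("improved", "unchanged", "deteriorated")}


# Toy values, not data from the study.
toy_pairs = [(4.0, 2.5), (3.0, 3.0), (2.0, 2.05), (2.5, 2.9)]
fractions = outcome_fractions(toy_pairs)
```

With real data, `pairs` would hold the RMSD values of matching 2.1 and 2.2 models built from the same alignment.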
These results, together with those in the previous section, indicate that the positive effect of structural complementarity on the average accuracy of multiple-template models can only be obtained when the modeling alignment is highly accurate. The fact that an accurate alignment is necessary to obtain structural complementarity is not surprising. The regions of the templates that are more likely to complement each other are the less conserved regions, which will also contain the most alignment errors. If the complementary regions are not correctly aligned, the benefits of the structural information are lost. This same interplay between alignment errors and structural information also affects loop modeling, where a good model building protocol may be limited by anchor residues that are inaccurate due to alignment errors. Since insertions tend to occur more frequently in less conserved regions, the anchor residues for loop modeling will tend to be aligned less accurately than other regions of the protein. These results once again stress how crucial the alignment quality is in comparative modeling and show that the benefits of a more accurate alignment are amplified in the case of multiple-template modeling by the additional accuracy gains from structural complementarity. Thus, these results suggest that iterative approaches that combine alignment improvement or selection with explicit model building and evaluation may particularly benefit from the use of multiple templates [19, 22–24]. The alignment improvement signal would only be strengthened by the additional increase in accuracy due to structural complementarity, once the alignment accuracy reaches a certain level.
Template combinations resulting in improved model accuracy
Model improvement vs. model deterioration
Selection of optimal template combinations
The results of this large-scale (~30,000 models) comprehensive analysis of multiple-template models explain the previous contradictory examples of improvement and deterioration of model quality on inclusion of additional templates [17–19]. Both situations are possible. Combinations of templates with S1 ≅ S2, S1 < 30%, S3 < 30%, and Template RMSD 3.5–5.5 Å show a high probability of improved model accuracy over the single-template model, while most remaining combinations tend to deteriorate the model. Since most modeling cases fall in the sequence identity range below 30%, our results enable judicious choice of additional templates (based on S2, S3 and RMSD between templates) to improve model accuracy. While structural complementarity does not contribute significantly to the average accuracy of simple SEQ models, its role increases as the accuracy of the modeling alignment increases, as illustrated by the high-accuracy SEQ alignments and the STR alignments. Since template selection is a fundamental step in comparative modeling, and the selection criteria described here are independent of the model building strategy used, the results of our analysis are relevant to any multiple-template modeling case irrespective of the software used. Furthermore, the pre-screening of templates with increased potential for complementarity could prove beneficial in the context of modeling methods that attempt to identify good template combinations through model evaluation, by decreasing the size of the search space. The potential improvements obtained from a judicious template selection are also complementary to other approaches for improving model accuracy such as loop modeling and general refinement [12, 13]. Because our study is limited to two-template models and fixed alignments, it is not representative of the expected model accuracy improvements that could be obtained by using larger numbers of templates and applying simultaneous alignment optimization. However, our results provide a clean description of the underlying relationships between alignment accuracy, template similarity, and model accuracy.
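The selection criteria stated above (S1 ≅ S2, S1 < 30%, S3 < 30%, Template RMSD 3.5–5.5 Å) can be written as a simple pre-screening filter. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, and the 5 percentage-point tolerance used to decide whether S1 and S2 are "similar" is an assumption, since the text gives no numerical definition of ≅.

```python
def is_promising_combination(s1, s2, s3, template_rmsd, similar_tol=5.0):
    """Return True if a template pair matches the favorable profile.

    s1, s2: Target:Template1 and Target:Template2 sequence identity (%)
    s3: Template1:Template2 sequence identity (%)
    template_rmsd: RMSD between the two templates (Å)
    similar_tol: ASSUMED maximum S1 - S2 difference (percentage points)
                 for the two identities to count as similar
    """
    if s2 > s1:
        raise ValueError("by definition S1 >= S2; swap the templates")
    similar = (s1 - s2) <= similar_tol
    low_target_identity = s1 < 30.0
    low_template_identity = s3 < 30.0
    rmsd_in_range = 3.5 <= template_rmsd <= 5.5
    return (similar and low_target_identity
            and low_template_identity and rmsd_in_range)


# Toy candidates, not data from the study.
candidates = {
    "pair_a": (25.0, 22.0, 20.0, 4.0),  # matches every criterion
    "pair_b": (45.0, 22.0, 20.0, 4.0),  # S1 too high
    "pair_c": (25.0, 10.0, 20.0, 4.0),  # S1 and S2 too different
    "pair_d": (25.0, 22.0, 20.0, 2.0),  # templates too similar structurally
}
selected = [name for name, args in candidates.items()
            if is_promising_combination(*args)]
```

A filter of this kind only ranks candidate pairs before model building; it does not replace evaluation of the resulting models.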
Construction of the data set
Single-domain chains (size: 100–200 residues) of high-resolution (2.5 Å or better) X-ray structures were selected from the Protein Data Bank (PDB) using domain definitions from CATH. Chains were grouped according to structural classes (i.e. all-α, all-β and α/β). Only the all-β and α/β fold-class proteins were used for the complete analysis. All-α proteins showed the same trends and differences between one-template and two-template models as all-β and α/β proteins, but with different average accuracies. For simplicity, they were eliminated from the rest of the analysis, although the same conclusions apply to them. Redundancy in the set was eliminated by grouping together chains from the same homologous superfamily (same value of the first four CATH levels) that shared at least 95% sequence identity over more than 85% of the sequence length. Only the highest-resolution member of each of these groups was retained as a representative in the final set. Homologous superfamilies with at least three representative chains were considered for the following steps. The representative chains within the same homologous superfamily were structurally aligned with each other using the program CE. A combination of three chains (a triplet) was selected if at least two out of the three inter-chain structural alignments had a CE Z-score higher than 4.5. A total of 145 homologous superfamilies satisfied these criteria. A total of 10,641 chain triplets were chosen from these families such that bias from larger families was below a predefined cutoff. Entropy of the dataset was used to set the cutoff. Within each triplet only the common region of the structures (based on the CE alignments) was selected; hence any Target:Template combination within the triplet produces 100% coverage of the target.
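The triplet selection rule (at least two of the three pairwise CE alignments with a Z-score above 4.5) can be sketched as follows. The chain identifiers, the dictionary layout for the Z-scores, and the function name are hypothetical, and the sketch omits the entropy-based step that limits bias from larger families.

```python
from itertools import combinations

Z_CUTOFF = 4.5  # CE Z-score threshold stated in the text


def select_triplets(chains, zscore):
    """Return chain triplets in which at least two pairs exceed Z_CUTOFF.

    chains: identifiers of representative chains from ONE homologous
            superfamily
    zscore: dict mapping frozenset({chain_a, chain_b}) -> CE Z-score
            (hypothetical input layout)
    """
    triplets = []
    for trio in combinations(sorted(chains), 3):
        n_good = sum(
            1 for a, b in combinations(trio, 2)
            if zscore[frozenset((a, b))] > Z_CUTOFF
        )
        if n_good >= 2:
            triplets.append(trio)
    return triplets


# Toy Z-scores, not data from the study.
toy_z = {
    frozenset(("A", "B")): 6.0,
    frozenset(("A", "C")): 5.0,
    frozenset(("B", "C")): 3.0,
    frozenset(("A", "D")): 4.0,
    frozenset(("B", "D")): 4.8,
    frozenset(("C", "D")): 2.0,
}
toy_triplets = select_triplets(["A", "B", "C", "D"], toy_z)
```

Keying the scores by `frozenset` makes the lookup independent of the order in which the two chains are given.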
Since each chain of a triplet can be the target with the other two as templates, the total number of models for the dataset is 31,923 (3 × 10,641). The sequence identity assigned to a particular Target:Template1:Template2 triplet was that of the Target:Template pair with the higher sequence identity. Template1 always refers to the template with the higher sequence identity.
Sequence identity measure
S1 = Target:Template1 sequence identity
S2 = Target:Template2 sequence identity
S3 = Template1:Template2 sequence identity
By definition S1 ≥ S2. Thus, the corresponding single-template model of the target is based on Template1 and is referred to as Target:Template1. In the results, sequence identity S1 is used as the reference sequence identity for both the two-template model as well as the corresponding single-template model.
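The assignment of roles within a triplet, and the resulting S1, S2 and S3 values, can be illustrated with a short sketch. It assumes that pairwise sequence identities are already available in a dictionary; the function name, the record layout, and the handling of ties (the first-listed template becomes Template1) are assumptions, not details from the study.

```python
def enumerate_models(triplet, identity):
    """Build one record per choice of target within a chain triplet.

    triplet: tuple of three chain identifiers
    identity: dict mapping frozenset({a, b}) -> percent sequence identity
              (hypothetical input layout)
    """
    records = []
    for target in triplet:
        t_a, t_b = [c for c in triplet if c != target]
        # Template1 is the template with the higher identity to the target.
        if identity[frozenset((target, t_a))] >= identity[frozenset((target, t_b))]:
            template1, template2 = t_a, t_b
        else:
            template1, template2 = t_b, t_a
        records.append({
            "target": target,
            "template1": template1,
            "template2": template2,
            "S1": identity[frozenset((target, template1))],
            "S2": identity[frozenset((target, template2))],
            "S3": identity[frozenset((template1, template2))],
        })
    return records


# Toy identities, not data from the study.
toy_identity = {
    frozenset(("A", "B")): 40.0,
    frozenset(("A", "C")): 25.0,
    frozenset(("B", "C")): 30.0,
}
toy_records = enumerate_models(("A", "B", "C"), toy_identity)

# Each triplet yields three models, hence 3 x 10,641 models in total.
total_models = 3 * 10641
```

Each record satisfies S1 ≥ S2 by construction, matching the definition above.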
A three-part (ALN.X.Y) notation is used to describe the models. ALN (SEQ or STR) refers to the method used to build the Target:Template alignment: pairwise SEQuence alignment or STRucture alignment. X refers to the number of templates used to obtain the alignment and Y refers to the number of templates used in the model building process (Figure 2). For example, SEQ.2.1 refers to a model based on a pairwise sequence alignment where the alignment is obtained using two templates but the model is built using the structure of Template1 only. The various models studied here are SEQ.2.2, SEQ.2.1, STR.2.2, and STR.2.1. The 2.2 models correspond to typical two-template models, and the 2.1 models correspond to one-template models with alignments identical to those of the 2.2 models. These 2.1 models are used to study the contribution of structural complementarity to the final accuracy of the two-template models in the absence of any alignment effect (see below and Figure 2).
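For bookkeeping across many models, the ALN.X.Y labels can be parsed into their components. The sketch below is only an illustration: the `ModelType` tuple and `parse_model_type` function are hypothetical names, and the check that Y cannot exceed X is an assumption that follows from the definitions rather than a rule stated in the text.

```python
from typing import NamedTuple


class ModelType(NamedTuple):
    alignment: str              # "SEQ" or "STR"
    n_alignment_templates: int  # X: templates used to obtain the alignment
    n_building_templates: int   # Y: templates used in model building


def parse_model_type(label):
    """Parse a label such as 'SEQ.2.1' into its three components."""
    parts = label.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected ALN.X.Y, got {label!r}")
    aln, x_text, y_text = parts
    if aln not in ("SEQ", "STR"):
        raise ValueError(f"unknown alignment type {aln!r}")
    x, y = int(x_text), int(y_text)  # int() raises ValueError on bad input
    if not 1 <= y <= x:
        # Assumption: a model cannot be built from more templates
        # than were present in the alignment.
        raise ValueError(f"inconsistent template counts in {label!r}")
    return ModelType(aln, x, y)


studied = [parse_model_type(s)
           for s in ("SEQ.2.2", "SEQ.2.1", "STR.2.2", "STR.2.1")]
one_template_models = [m for m in studied if m.n_building_templates == 1]
```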
Target:Template alignment and model building
Models were calculated using the alignments described below and the template structures as input to the default 'model' routine of the program MODELLER version 6v2.
SEQ.2.2: The structural alignment between the two templates (Template1, Template2) was first generated using the ALIGN3D command of MODELLER. The target sequence was aligned to this structural alignment of the templates, using the ALIGN command of MODELLER, without modifying the structural alignment (Figure 2).
SEQ.2.1: The sequence of Template2 is eliminated from SEQ.2.2 (see Figure 2).
STR.2.2: The structural alignment between the three structures (Target, Template1, and Template2) was generated using the ALIGN3D command of MODELLER.
STR.2.1: The sequence of Template2 is eliminated from STR.2.2.
Alignment Accuracy Measurement
Alignment accuracy was measured as defined by Sauder et al., namely, the ratio between the number of correctly aligned residue pairs and the total number of aligned residue pairs in a given alignment. A residue pair is defined as correctly aligned if it is the same in the reference ("error-free") alignment. STR.2.2 alignments were used as the error-free reference.
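The ratio described above can be computed directly from two gapped alignments of the same pair of sequences. This is a minimal sketch under stated assumptions: alignments are given as two equal-length strings with "-" as the gap character, residue pairs are identified by zero-based sequence positions, and the function names are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (index_in_a, index_in_b) pairs aligned to each other."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def alignment_accuracy(test, reference):
    """Fraction of aligned pairs in `test` that also occur in `reference`.

    test, reference: (target_row, template_row) tuples of gapped strings
    """
    test_pairs = aligned_pairs(*test)
    ref_pairs = aligned_pairs(*reference)
    if not test_pairs:
        raise ValueError("test alignment contains no aligned residue pairs")
    return len(test_pairs & ref_pairs) / len(test_pairs)


# Toy alignments of a 6-residue target and a 5-residue template.
reference_aln = ("ACDEFG", "AC-EFG")  # stands in for the STR reference
test_aln = ("ACDEFG", "ACE-FG")       # stands in for a SEQ alignment
accuracy = alignment_accuracy(test_aln, reference_aln)
```

In the toy case, four of the five aligned pairs in the test alignment agree with the reference, so the accuracy is 0.8.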
Construction of Template Chimera
Idealized template chimeras for particular Target:Template1:Template2 combinations were constructed by selecting, for each non-overlapping target segment, the structurally closest equivalent segment from either of the two templates. The combination of these "best" segments from each of the templates results in an ideal template chimera that can be used to evaluate the efficiency with which the modeling program (i.e. MODELLER) combines the information from both templates. Closeness among the equivalent residues is determined by measuring the distance between the target residue backbone atoms and those of the equivalent template residue after optimal pairwise structural superposition of the target with each of the templates.
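The segment-wise choice behind an ideal chimera can be sketched as below. Several details are assumptions because the text does not specify them: segments are taken as fixed-length windows, closeness within a segment is summarized as the mean per-residue deviation, ties go to Template1, and the per-residue deviations are assumed to have been computed beforehand from the superposed structures.

```python
def build_chimera(dev_t1, dev_t2, segment_length=5):
    """Pick, for each target segment, the structurally closer template.

    dev_t1, dev_t2: per-residue distances (Å) between the target and the
                    equivalent residue of Template1 / Template2, measured
                    after superposing the target on each template
    segment_length: ASSUMED fixed window size for non-overlapping segments

    Returns a list of (start, end, source) with end exclusive.
    """
    if len(dev_t1) != len(dev_t2):
        raise ValueError("deviation lists must cover the same residues")
    choices = []
    for start in range(0, len(dev_t1), segment_length):
        end = min(start + segment_length, len(dev_t1))
        mean1 = sum(dev_t1[start:end]) / (end - start)
        mean2 = sum(dev_t2[start:end]) / (end - start)
        source = "T1" if mean1 <= mean2 else "T2"  # ties go to Template1
        choices.append((start, end, source))
    return choices


# Toy deviations, not data from the study.
toy_dev_t1 = [0.5, 0.6, 0.4, 3.0, 2.8, 3.2, 1.0]
toy_dev_t2 = [1.0, 1.2, 0.9, 0.8, 0.7, 0.9, 1.0]
toy_chimera = build_chimera(toy_dev_t1, toy_dev_t2, segment_length=3)
```

Here the first segment is taken from Template1, the second from Template2, and the final single-residue segment from Template1 by the tie rule.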
Overall Accuracy Measurement
Overall accuracy was measured by computing the root mean square deviation (RMSD) between the equivalent Cα atoms in the optimal superposition of target and model structures, as it is the most common evaluation performed systematically for comparative models [5, 31, 32]. Since the sequences of target and model are identical, a sequence-based alignment was used to guide the initial structural superposition. Equivalent atoms are defined as those that are within 3.5 Å of their corresponding atom in the target after superposition of the structures. Superposition of structures is carried out by minimizing the RMSD between the equivalent Cα atoms. However, all accuracy measurements refer to the RMSD of all Cα atoms irrespective of cutoff. All calculations are implemented in the SUPERPOSE command of the program MODELLER. As the structural differences between main-chains of models obtained from various comparative modeling programs are very small, results of the current analysis are based only on a single modeling program, MODELLER. In addition, as there are differences in the quality of side-chain modeling in different comparative modeling programs, the present accuracy analysis is restricted to comparison of backbone structures, on which the template structure has a larger influence than on the side-chains.
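The distinction between the atoms used to drive the superposition (those within 3.5 Å) and the atoms used to report accuracy (all Cα atoms) can be made concrete with a small sketch. It assumes the two coordinate sets are already superposed and listed in the same residue order; the superposition step itself, which MODELLER's SUPERPOSE command performs, is not reproduced here. The function names and the inclusive treatment of the cutoff are assumptions.

```python
import math


def equivalent_indices(model, target, cutoff=3.5):
    """Indices of Cα atoms within `cutoff` Å of their target counterpart."""
    if len(model) != len(target):
        raise ValueError("model and target must have the same number of atoms")
    return [i for i, (p, q) in enumerate(zip(model, target))
            if math.dist(p, q) <= cutoff]  # inclusive cutoff is an assumption


def rmsd(model, target, indices=None):
    """RMSD (Å) over the given atom indices, or over all atoms if None."""
    if len(model) != len(target):
        raise ValueError("model and target must have the same number of atoms")
    indices = list(range(len(model))) if indices is None else list(indices)
    if not indices:
        raise ValueError("no atoms selected")
    squared = sum(math.dist(model[i], target[i]) ** 2 for i in indices)
    return math.sqrt(squared / len(indices))


# Toy, pre-superposed Cα coordinates (Å); not data from the study.
toy_target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
toy_model = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 5.0)]

core = equivalent_indices(toy_model, toy_target)  # atoms within 3.5 Å
rmsd_core = rmsd(toy_model, toy_target, core)     # would guide superposition
rmsd_all = rmsd(toy_model, toy_target)            # value reported as accuracy
```

The toy case shows why the two numbers differ: the third atom deviates by 5 Å, so it is excluded from the equivalent set but still counts toward the reported all-atom Cα RMSD.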
This study was supported by funds from NIH/NIGMS 1P01GM066531 and 1R01GM081713, and an Irma T. Hirschl Career Scientist Award.
- Sanchez R, Sali A: Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci U S A 1998, 95(23):13597–13602. 10.1073/pnas.95.23.13597
- Sanchez R, Pieper U, Melo F, Eswar N, Marti-Renom MA, Madhusudhan MS, Mirkovic N, Sali A: Protein structure modeling for structural genomics. Nat Struct Biol 2000, 7 Suppl: 986–990. 10.1038/80776
- Stevens RC, Yokoyama S, Wilson IA: Global efforts in structural genomics. Science 2001, 294(5540):89–92. 10.1126/science.1066011
- Tramontano A, Morea V: Assessment of homology-based predictions in CASP5. Proteins 2003, 53 Suppl 6: 352–368. 10.1002/prot.10543
- Chakravarty S, Wang L, Sanchez R: Accuracy of structure-derived properties in simple comparative models of protein structures. Nucleic Acids Res 2005, 33(1):244–259. 10.1093/nar/gki162
- Chakravarty S, Sanchez R: Systematic analysis of added-value in simple comparative models of protein structure. Structure (Camb) 2004, 12(8):1461–1470. 10.1016/j.str.2004.05.018
- Kiel C, Wohlgemuth S, Rousseau F, Schymkowitz J, Ferkinghoff-Borg J, Wittinghofer F, Serrano L: Recognizing and defining true Ras binding domains II: in silico prediction based on homology modelling and energy calculations. J Mol Biol 2005, 348(3):759–775. 10.1016/j.jmb.2005.02.046
- Liu T, Rojas A, Ye Y, Godzik A: Homology modeling provides insights into the binding mode of the PAAD/DAPIN/pyrin domain, a fourth member of the CARD/DD/DED domain family. Protein Sci 2003, 12(9):1872–1881. 10.1110/ps.0359603
- Murray PS, Li Z, Wang J, Tang CL, Honig B, Murray D: Retroviral matrix domains share electrostatic homology: models for membrane binding function throughout the viral life cycle. Structure 2005, 13(10):1521–1531. 10.1016/j.str.2005.07.010
- Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291
- Moult J: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 2005, 15(3):285–289. 10.1016/j.sbi.2005.05.011
- Fan H, Mark AE: Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Sci 2004, 13(1):211–220. 10.1110/ps.03381404
- Qian B, Ortiz AR, Baker D: Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci U S A 2004, 101(43):15346–15351. 10.1073/pnas.0404703101
- Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000, 9(2):232–241.
- Marti-Renom MA, Madhusudhan MS, Sali A: Alignment of protein sequences by their profiles. Protein Sci 2004, 13(4):1071–1087. 10.1110/ps.03379804
- Yona G, Levitt M: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol 2002, 315(5):1257–1275. 10.1006/jmbi.2001.5293
- Winn PJ, Battey JN, Schleinkofer K, Banerjee A, Wade RC: Issues in high-throughput comparative modelling: a case study using the ubiquitin E2 conjugating enzymes. Proteins 2005, 58(2):367–375. 10.1002/prot.20318
- Sanchez R, Sali A: Evaluation of comparative protein structure modeling by MODELLER-3. Proteins 1997, Suppl 1: 50–58. 10.1002/(SICI)1097-0134(1997)1+<50::AID-PROT8>3.0.CO;2-S
- Venclovas C, Margelevicius M: Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins 2005, 61 Suppl 7: 99–105. 10.1002/prot.20725
- Contreras-Moreira B, Fitzjohn PW, Bates PA: In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling. J Mol Biol 2003, 328(3):593–608. 10.1016/S0022-2836(03)00309-7
- Fiser A, Do RK, Sali A: Modeling of loops in protein structures. Protein Sci 2000, 9(9):1753–1773.
- Ginalski K, Rychlewski L: Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment. Proteins 2003, 53 Suppl 6: 410–417. 10.1002/prot.10548
- Kosinski J, Cymerman IA, Feder M, Kurowski MA, Sasin JM, Bujnicki JM: A "FRankenstein's monster" approach to comparative modeling: merging the finest fragments of Fold-Recognition models and iterative model refinement aided by 3D structure evaluation. Proteins 2003, 53 Suppl 6: 369–379. 10.1002/prot.10545
- John B, Sali A: Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res 2003, 31(14):3982–3992. 10.1093/nar/gkg460
- Sali A: Target practice. Nat Struct Biol 2001, 8(6):482–484. 10.1038/88529
- Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 2002, 58(Pt 6 No 1):899–907. 10.1107/S0907444902003451
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. 10.1016/S0969-2126(97)00260-8
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11(9):739–747. 10.1093/protein/11.9.739
- Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234(3):779–815. 10.1006/jmbi.1993.1626
- Sauder JM, Arthur JW, Dunbrack RL Jr.: Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 2000, 40(1):6–22. 10.1002/(SICI)1097-0134(20000701)40:1<6::AID-PROT30>3.0.CO;2-7
- Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Fiser A, Pazos F, Valencia A, Sali A, Rost B: EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 2001, 17(12):1242–1243. 10.1093/bioinformatics/17.12.1242
- Marti-Renom MA, Madhusudhan MS, Fiser A, Rost B, Sali A: Reliability of assessment of protein structure prediction methods. Structure (Camb) 2002, 10(3):435–440. 10.1016/S0969-2126(02)00731-1
- Wallner B, Elofsson A: All are not equal: a benchmark of different homology modeling programs. Protein Sci 2005, 14(5):1315–1327. 10.1110/ps.041253405