Modeling of loops in proteins: a multi-method approach

Background Template-target sequence alignment and loop modeling are key components of protein comparative modeling. Short loops can be predicted with high accuracy using structural fragments from other, not necessairly homologous proteins, or by various minimization methods. For longer loops multiscale approaches employing coarse-grained de novo modeling techniques should be more effective. Results For a representative set of protein structures of various structural classes test predictions of loop regions have been performed using MODELLER, ROSETTA, and a CABS coarse-grained de novo modeling tool. Loops of various length, from 4 to 25 residues, were modeled assuming an ideal target-template alignment of the remaining portions of the protein. It has been shown that classical modeling with MODELLER is usually better for short loops, while coarse-grained de novo modeling is more effective for longer loops. Even very long missing fragments in protein structures could be effectively modeled. Resolution of such models is usually on the level 2-6 Å, which could be sufficient for guiding protein engineering. Further improvement of modeling accuracy could be achieved by the combination of different methods. In particular, we used 10 top ranked models from sets of 500 models generated by MODELLER as multiple templates for CABS modeling. On average, the resulting molecular models were better than the models from individual methods. Conclusions Accuracy of protein modeling, as demonstrated for the problem of loop modeling, could be improved by the combinations of different modeling techniques.


Background
Comparative modeling remains the most dependable and routinely used method for protein structure prediction [1,2]. The alternative term of homology modeling is frequently used. That is because the identification of a structural template (or templates) is typically based (although not always) on the homology relation between the target protein and the templates, which is usually reflected by a certain level of sequence similarity. When a template is being identified by some advanced Fold Recognition (FR) techniques, it is sometimes possible to identify templates that are structurally similar to the target without any obvious homology relations. This could be a genuine case of convergent evolution or (more frequently) the case when remote homology just can not be detected. Template free, de novo structure prediction is much more difficult and less dependable, although a steady progress is observed in this area of computational biology [3,4]. Most contemporary methods for de novo structure predictions heavily depend on certain aspects of evolutionary relationships between protein sequences and structures. The evolutionary methods are essential for the derivation of statistical potentials for de novo modeling and/or are employed in various strategies for extracting structure building blocks from known protein structures [5,6].
Classical homology modeling consists of three steps. First, a template for modeling needs to be identified and sequence alignment between the template and target sequences has to be generated. Usually, template identification is performed by certain standard tools, such as PSI-BLAST, and the resulting alignment is subsequently rectified by other tools and eventually by manual expert corrections. Remote templates can also be identified by FR procedures [7]. With the decreasing level of sequence similarity, which implies increasing evolutionary distance and thereby increasing structural differences between the template and the target, alignments become more and more ambiguous. Accuracy of classical comparative modeling heavily relies on the fidelity of the template-target alignment.
In the second stage the aligned fragments of templates are used to generate the corresponding fragments of the target structure. In the simplest case of a single template only, this step reduces to mere copying the template coordinates according to the alignment. In the case of multiple templates a consensus scaffold could be built, for instance via the distribution of the spatial restrains read from the templates, as it is implemented in the MODELLER method [8]. The key component of this stage of modeling is construction of loop regions that are frequently missing in the template scaffold. In certain newer approaches to comparative modeling the entire structure of the target is built using templates as sources of restraints of various types [3,9]. The main aim and challenge of such approaches is to be able to build a model of the target structure which is more similar to the true structure of the target than to any of the templates used, especially for distant homology based modeling.
The third, and final, stage of modeling is structure refinement which involves repacking the side chains and energy minimization of the entire structure [10].
The above scheme, or its variants, of comparative modeling remains the best choice when significant fragments of the alignment are error-free, which is usually the case in the range of high level sequence similarity (e.g. 40%, or more, of identical residues in the alignment). In the "twilight zone" of low sequence similarity the alignments contain significant errors. These could be sometimes corrected by building a multi-template consensus modeling scaffold [11]. Alternatively, it is possible to design a completely different modeling schemes, in which the alignment is built simultaneously to the actual modeling process [12].
In this paper we address the issue of loop modeling, separating it from the alignment problem. The test set of proteins with missing loops consists of two sub-sets. The first subset, containing missing loops of 4-12 residues, has been taken from a recent work by Rossi, et al., excluding the cases of incomplete chains in the corresponding PDB entries [13]. The work provides a comprehensive comparison of loop modeling performance of the most popular comparative modeling software. The loop database employed in the work of Rossi was adapted from a compiled loop database assembled by Jacobson et al [13,14]. Additionally, the database used in this work was expanded with cases of much longer loops, up to 25 residues (the second sub-set). The database covers all the structural classes of proteins, with 186 internal loops of various length. The expanded range of the modeled loop lenghts addresses the possibility of the extension of the range of applicability and accuracy of challenging instances of comparative modeling. Four methods of loop modeling are compared in this work: MODELLER, ROSETTA, CABS and a combination of MODELLER with CABS. Since MODELLER is a commonly accepted reference standard in comparative modeling, the results are qualitatively (although indirectly) comparable with other approaches [13,[15][16][17][18][19][20]. It should be noted that MODELLER is representative software for distance geometry and energy minimizations, while ROSETTA and CABS employ knowledge-based free search of a discretized conformational space. Thus, the comparison given in this paper should provide additional insights into the range of applicability of these qualitatively different approaches to protein molecular modeling. Previous computational experiments with the reconstruction of missing fragments of protein structures indicated that the coarse grained models (an early version of CABS and two other modeling tools based on similar principles) performed relatively well in the range of large fragments [21]. At this point we would like to present a comprehensive evaluation in a wide range of loop modeling instances.

Results
For a representative test set of protein structures with missing loop fragments the loop reconstruction procedure was executed using MODELLER, ROSETTA, CABS and the MODELLER-CABS hybrid modeling pipeline. The test set is summarized in Table 1. Modeling procedures are described in the Methods section. The test proteins represent various structural classes, including mainly helical, beta and alpha/beta structures. All test structures are of high quality with resolution of at least 2 Å and the average temperature factor lower than 35. The missing loops are representative, and they are exposed to the solvent or partially buried, connecting various elements of the secondary structure. The modeled loops span a wide range of lengths, from 4 to 25 residues. This is a range that is relevant for standard comparative modeling. In several proteins more than one loop is modeled. In some cases the modeled loops can interact with one another, which can have some influence on the performance of respective methods.
Using MODELLER, we generated 500 examples of individual loop regions, which were subsequently ranked by the DOPE statistical potential [22]. Top ranked means the highest rank, while the "best" result means a structure that is closest to the actual experimental, structure of the loop. Similarly, ROSETTA models were ranked with ROSETTA potentials. CABS modeling provides a trajectory containing several hundred instances. These were subject to the clustering procedure.
Interestingly, in most cases the medoid from the entire simulation was closer to the true structure than the largest clusters' medoids. This suggest very good convergence of CABS simulations. Consequently, the medoid structures were reported as the top-ranked models.
The statistics of the results is shown in Figure 1, in which the loops of a given length are described by average values of cRMSD of the loop fragments (coordinate Root-Mean-Square Deviation) from the corresponding crystallographic structures. To extract the loop cRMSD values protein structures without the modeled loop fragments were superimposed and the deviation was computed only for the loops. In the entire text the values of cRMSD are reported for Ca traces only. Corresponding data for all atom structures are essentially the same. The plots in Figure 1 clearly show two major trends. The first is obvious: with the increasing size of loops the average accuracy of modeling decreases. The second trend indicates, as expected, that the distribution of the quality of models, as measured by the difference between the best model and the top ranked models is much larger for MODELLER and ROSETTA as compared with CABS. For very long loops (20 and more residues) CABS results are on average better than for MODELLER and ROSETTA. The hybrid-CABS modeling takes advantages of different methods. Using top10 models generated by MODELLER the new method leads to results as good as MODELLER for short loops and noticeably better models in the range of long loops. When comparing with original CABS simulations the hybrid-CABS is much more accurate for short loops. This is illustrated in Figure 2. The results with hybrid-CABS show that there is always an added value in combining different modeling methods. Figure 3 and Figure 4 show the distributions of cRMSD for 186 cases studied. The distribution is quite broad, especially for longer loops. Use of distinct modeling techniques increases chances for obtaining good quality models. Unfortunately, all methods produce results of scattered quality. The problem how to identify the cases for which the models are of good accuracy remains unsolved.

Discussion
The loop modeling exercise described in this paper separates the two fundamental problems of comparative    modeling: target-template alignment and modeling of missing fragments. Ideal alignments have been assumed and the excised loops reconstructed and compared with the native structures (cRMSD of the reconstructed loops after the superposition of the fixed parts of templates and models). As expected, MODELLER and ROSETTA proved to be more accurate for short loops, while CABS models were better for longer loops (see the compilation of cRMSD values for different ranges of loop sizes, shown in Table 2(two-sample paired t-test, data in Additional file 1)), although the difference is small. In spite of the coarse-grained character of the method, the models from CABS allow for the meaningful reconstruction of the side chain details for shorter, and therefore more accurately predicted, loops (see Figure 5). The predicted side-chains conformations, shown in Figure 5, are of crystallographic accuracy, except for the tail portion of one side-chain. For longer loops the side chains are less accurate and their native-like conformations and interaction patterns are observed only for the best models. Figure 6 shows a typical situation for the loops from the range of accuracy of 4-6 Å. In such cases the side chains are approximately at proper positions, although their conformations on the atomic level are not reproduced. Finally, it should be noted that the simulation results from CABS could be used for the analysis of loop dynamics. In recent publications we have shown that isothermal trajectories from CABS, executed at the folding transition temperature, reproduce folding mechanisms of small proteins very well [23,24]. Thus loop mobility could also be modeled. In order to obtain the best possible model of the lowest energy structures, in the present study we used Replica Exchange Monte Carlo. Thus, the dynamics of the system is artificial. Obviously, isothermal simulations could be performed for the models obtained, leading to meaningful description of loop mobility. This was, however, beyond the scope of the present work.

Conclusions
In this work we have shown that de novo protein loops modeling using ROSETTA and CABS-based software is complementary to the classical modeling with MODEL-LER, the golden standard of comparative modeling. The proposed hybrid modeling pipeline, where ten top ranked (according to DOPE statistical potential) MOD-ELLER models are used as templates for CABS, allows for meaningful loop modeling for a broad range of loop length. The hybrid MODELLER-CABS method takes advantage of the local accuracy of MODELLER structures and the efficient sampling of local free-energy minima by CABS. The hybrid-CABS method described in this work extends the applicability range of protein comparative modeling. Further increase of accuracy for large loops will require better ranking of resulting models. Model-ranking in the range of moderate-and lowresolution computational structures remains a challenging problem for the entire structure-prediction field. In this case, a small step in this direction was performed by a combination of different modeling techniques.

Methods
The dataset employed in this work is summarized in Table 1. The cases of shorter loops, up to 12 residues, are taken from the work of Rossi et al. who used a loop database developed by Jacobson et al. [13,14]. The longer loops were selected from the same protein structures as continuous fragments of coil structures, according to the DSSP definition of secondary structure. Dangling ends are excluded from our test, similarly as it was done by others. Dangling ends are frequently structurally poorly defined, and therefore the results of their simulations are difficult to interpret. The dataset is available for download (Additional file 2).

Loop modeling with MODELLER and ROSETTA
All loops were first modeled using MODELLER, version 9v5, and the model-loop procedure [1]. The 500 resulting models were ranked using DOPE statistical potentials. Subsequently, loop modeling was repeated using ROSETTA software, leading to 500 independent models, ranked by the ROSETTA force field [25]. The description of the CABS modeling tool and the procedure employed in present study is provided below.

CABS modeling software
CABS is a versatile modeling tool, based on the coarse graining of polypeptide conformational space and knowledge-based force field. Applications of CABS include protein structure prediction (from comparative to template-free modeling), prediction of protein folding mechanisms and flexible modeling of macromolecular assemblies [3,23,24,26]. Technical details of CABS design and software are provided elsewhere [27]. At this point, for the reader's convenience, we provide only an outline of the most essential features. The CABS (Calpha, C-beta, and Side chain) representation of protein conformational space employs a united residue approach. A single amino acid is represented by four pseudo-atoms: centered on the alpha carbon, on the beta carbon, in the center of mass of the side chain (where applicable) and an additional pseudo-atom Conformational updates include various local transformations, controlled by a pseudo random mechanism. There are single Ca moves, two, three and four Ca fragment transitions and small displacements of larger (4-22 residue) fragments. Update of a single Ca position involves side chain updates of the central and two neighboring residues. The sampling scheme could be executed within a classical Metropolis Monte Carlo scheme (when isothermal dynamics is required) or using a Replica Exchange (REMC) protocol when equilibrium data are required only, as in the case of the present work.
The force field of CABS consist of several types of potentials, including the hard-core excluded volume of the main chain and Cb atoms, generic (sequence independent) short-range protein-like biases, making the model chain behaving like a generic polypeptide chain, sequence-dependent short-range statistical potential, context-specific pairwise interactions of the side chain united atoms, with repulsive and attractive square-well potential, and finally, a model of main-chain cooperative hydrogen bond networks. The details of the force field could be found in earlier publications and the numerical data for the histogram-type potentials are available from the authors' homepage http://biocomp.chem.uw.edu.pl.
CABS allows for very straightforward implementation of restraints of various types. These may include soft biases towards predicted secondary structures and theoretically predicted side chain contacts, distance restraints read from templates for comparative modeling, restraints derived from sparse NMR data, etc.
Loop modeling procedure with CABS First, the template proteins (with excised loop fragments) are projected onto the CABS lattice, and the loop fragments are added in a random fashion. Then, the non-loop fragments of the original structures are used to read several hundreds of distance restraints, similarly to the procedures used in comparative modeling with CABS [3]. Subsequently, the starting structure is copied to 20 identical replicas for REMC simulations. During the REMC simulations temperatures of all replicas were gradually lowered, with a constant temperature distance between the replicas. Only the snapshots from the lowest temperature replica were stored in a pseudotrajectory. Each simulation was repeated three times (with different streams of pseudo-random numbers), and the collated results were subject to final analysis. Trajectories were clustered using the K-means method. Also the medoids and the best observed structures from each trajectory were stored. It was observed that the centroids of the largest clusters were very close to the centroids from the entire trajectories. Thus, the trajectory medoid structures were reported as the top ranking models. For the top CABS structures the full atom molecular models were built using BBQ and SCWRL software [28,29]. Such a multiscale modeling strategy (from coarse-grained to all atom structures) proved very efficient in earlier applications of CABS software. Modeling of a single protein from the test set employed in this work using CABS protocol requires 10-15 hours of single LINUX box, which is similar to the cost of generating 500 structured by the ROSETTA method. Generation of 500 examples using MODELLER is 2-3 times faster.

Hierarchical modeling with MODELLER and CABS
Analysis of preliminary modeling results led to an interesting observation: The distribution of the accuracy of the models generated by MODELLER and ROSETTA was significantly broader than the distribution of the quality of models generated by CABS.
The reason is that the models generated by MODEL-LER and ROSETTA are independent of one another, while the models from CABS are highly correlated along the simulation pseudo-trajectory. Consequently, the best models (among the 500 generated by MODELLER or ROSETTA) are usually considerably better than the topranked models. Unfortunately, the selection of the best models from a large set of decoys remains an unsolved problem for each of these methods. Taking the above into consideration, we designed a hybrid modeling pipeline that should take advantages of these methods. Namely, top ranked models from MODELLER (top 10) were used as structural templates for the derivation of distance restrains (including loop fragments) for modeling with CABS. It was expected that better local geometry of MODELLER structures and their diversity should improve sampling with CABS. The result of such an approach are reported as the CABS-hybrid simulations. Medoids (structures closest to the structural centroid from a pseudo-trajectory) were reported as the top ranked models. A similar modeling strategy was designed for a combination of ROSETTA and CABS. The accuracy of such an approach is similar to the accuracy of the aformentioned MODELLER-CABS hybrid. Since MODELLER is computationally less expensive than ROSETTA we present a benchmark only for the later combination.