The dataset employed in this work is summarized in Table 1. The cases of shorter loops, up to 12 residues, are taken from the work of Rossi et al. who used a loop database developed by Jacobson et al. [13, 14]. The longer loops were selected from the same protein structures as continuous fragments of coil structures, according to the DSSP definition of secondary structure. Dangling ends are excluded from our test, similarly as it was done by others. Dangling ends are frequently structurally poorly defined, and therefore the results of their simulations are difficult to interpret. The dataset is available for download (Additional file 2).
Loop modeling with MODELLER and ROSETTA
All loops were first modeled using MODELLER, version 9v5, and the model-loop procedure [1]. The 500 resulting models were ranked using DOPE statistical potentials. Subsequently, loop modeling was repeated using ROSETTA software, leading to 500 independent models, ranked by the ROSETTA force field [25]. The description of the CABS modeling tool and the procedure employed in present study is provided below.
CABS modeling software
CABS is a versatile modeling tool, based on the coarse graining of polypeptide conformational space and knowledge-based force field. Applications of CABS include protein structure prediction (from comparative to template-free modeling), prediction of protein folding mechanisms and flexible modeling of macromolecular assemblies [3, 23, 24, 26]. Technical details of CABS design and software are provided elsewhere [27]. At this point, for the reader's convenience, we provide only an outline of the most essential features. The CABS (C-alpha, C-beta, and Side chain) representation of protein conformational space employs a united residue approach. A single amino acid is represented by four pseudo-atoms: centered on the alpha carbon, on the beta carbon, in the center of mass of the side chain (where applicable) and an additional pseudo-atom located in the center of the virtual Cα-Cα bond. The Cα pseudo atoms are restricted to vertices of regular cubic lattice with the lattice spacing equal to 0.61 Å. Due to allowed fluctuations of the Cα-Cα distance around the canonical value of 3.78 Å the set of possible representations of this virtual bond consists of 800 lattice vectors. Thus, serious lattice artifacts could be safely ignored. The accuracy of the Cα-trace projection onto this lattice is in the order of 0.35 Å. On the other hand, lattice representation smoothens the model energy landscape and speeds -up computation by using pre-computed local conformational transitions which require simple references to hashing tables instead of computing trigonometric transformations, as would be necessary in an otherwise equivalent continuous space model. Coordinates of other pseudo-atoms are off-lattice and are defined in the reference frame provided by the Cα trace. Again, these coordinates are pre-computed and stored in simple reference tables, in which the two indices (range of 1-800, each) encode the conformation of three consecutive alpha carbons. It is assumed that coordinates of such fragments define positions of the side chain for the central residue.
Conformational updates include various local transformations, controlled by a pseudo random mechanism. There are single Cα moves, two, three and four Cα fragment transitions and small displacements of larger (4-22 residue) fragments. Update of a single Cα position involves side chain updates of the central and two neighboring residues. The sampling scheme could be executed within a classical Metropolis Monte Carlo scheme (when isothermal dynamics is required) or using a Replica Exchange (REMC) protocol when equilibrium data are required only, as in the case of the present work.
The force field of CABS consist of several types of potentials, including the hard-core excluded volume of the main chain and Cβ atoms, generic (sequence independent) short-range protein-like biases, making the model chain behaving like a generic polypeptide chain, sequence-dependent short-range statistical potential, context-specific pairwise interactions of the side chain united atoms, with repulsive and attractive square-well potential, and finally, a model of main-chain cooperative hydrogen bond networks. The details of the force field could be found in earlier publications and the numerical data for the histogram-type potentials are available from the authors' homepage http://biocomp.chem.uw.edu.pl.
CABS allows for very straightforward implementation of restraints of various types. These may include soft biases towards predicted secondary structures and theoretically predicted side chain contacts, distance restraints read from templates for comparative modeling, restraints derived from sparse NMR data, etc.
Loop modeling procedure with CABS
First, the template proteins (with excised loop fragments) are projected onto the CABS lattice, and the loop fragments are added in a random fashion. Then, the non-loop fragments of the original structures are used to read several hundreds of distance restraints, similarly to the procedures used in comparative modeling with CABS [3]. Subsequently, the starting structure is copied to 20 identical replicas for REMC simulations. During the REMC simulations temperatures of all replicas were gradually lowered, with a constant temperature distance between the replicas. Only the snapshots from the lowest temperature replica were stored in a pseudo-trajectory. Each simulation was repeated three times (with different streams of pseudo-random numbers), and the collated results were subject to final analysis. Trajectories were clustered using the K-means method. Also the medoids and the best observed structures from each trajectory were stored. It was observed that the centroids of the largest clusters were very close to the centroids from the entire trajectories. Thus, the trajectory medoid structures were reported as the top ranking models. For the top CABS structures the full atom molecular models were built using BBQ and SCWRL software [28, 29]. Such a multiscale modeling strategy (from coarse-grained to all atom structures) proved very efficient in earlier applications of CABS software. Modeling of a single protein from the test set employed in this work using CABS protocol requires 10-15 hours of single LINUX box, which is similar to the cost of generating 500 structured by the ROSETTA method. Generation of 500 examples using MODELLER is 2-3 times faster.
Hierarchical modeling with MODELLER and CABS
Analysis of preliminary modeling results led to an interesting observation: The distribution of the accuracy of the models generated by MODELLER and ROSETTA was significantly broader than the distribution of the quality of models generated by CABS.
The reason is that the models generated by MODELLER and ROSETTA are independent of one another, while the models from CABS are highly correlated along the simulation pseudo-trajectory. Consequently, the best models (among the 500 generated by MODELLER or ROSETTA) are usually considerably better than the top-ranked models. Unfortunately, the selection of the best models from a large set of decoys remains an unsolved problem for each of these methods. Taking the above into consideration, we designed a hybrid modeling pipeline that should take advantages of these methods. Namely, top ranked models from MODELLER (top 10) were used as structural templates for the derivation of distance restrains (including loop fragments) for modeling with CABS. It was expected that better local geometry of MODELLER structures and their diversity should improve sampling with CABS. The result of such an approach are reported as the CABS-hybrid simulations. Medoids (structures closest to the structural centroid from a pseudo-trajectory) were reported as the top ranked models. A similar modeling strategy was designed for a combination of ROSETTA and CABS. The accuracy of such an approach is similar to the accuracy of the aformentioned MODELLER-CABS hybrid. Since MODELLER is computationally less expensive than ROSETTA we present a benchmark only for the later combination.