Accurate flexible refinement of atomic models against medium-resolution cryo-EM maps using damped dynamics

Background Dramatic progress has recently been made in cryo-electron microscopy technologies, which now make possible the reconstruction of a growing number of biomolecular structures to near-atomic resolution. However, the need persists for fitting and refinement approaches that address those cases that require modeling assistance. Methods In this paper, we describe algorithms to optimize the performance of such medium-resolution refinement methods. These algorithms aim to automatically optimize the parameters that define the density shape of the flexibly fitted model, as well as the time-dependent damper cutoff distance. Atomic distance constraints can be prescribed for cases where extra containment of parts of the structure is helpful, such as in regions where the density map is poorly defined. Also, we propose a simple stopping criterion that estimates the probable onset of overfitting during the simulation. Results The new set of algorithms produce more accurate fitting and refinement results, and yield a faster rate of convergence of the trajectory toward the fitted conformation. The latter is also more reliable due to the overfitting warning provided to the user. Conclusions The algorithms described here were implemented in the new Damped-Dynamics Flexible Fitting simulation tool “DDforge” in the Situs package. Electronic supplementary material The online version of this article (10.1186/s12900-018-0089-0) contains supplementary material, which is available to authorized users.


Background
Over the last few years, experimental techniques for cryoelectron microscopy (cryo-EM) have evolved dramatically, making it possible for some structures to be solved at near-atomic resolution (See e.g. refs. [1][2][3][4] for reviews.). However, a significant number of structures still defy EM reconstructions at that resolution level. Also, during a typical cryo-EM workflow, resolution normally increases over time, from a rather low value, as the quality and quantity of the images improves. These considerations indicate that there is still a need to be able to obtain atomic models from cryo-EM maps having medium resolutions, in the range of 5-10 Å.
-Force field heuristically defined in terms of the "overlap" between the EM map and the model [18]; -Monte-Carlo optimization of fragments combined with all-atom refinement [19]; -Bayesian-based refinement [20].
Despite the undeniable successes of these methods, they still suffer from one or more weaknesses, including: high complexity (time and algorithmic), potentially insufficient sampling of the conformational space, restrictions in the degrees of freedom (DOFs) (such as using rigid domains), requirement of significant user intervention, and even unnaturalness of the resulting deformations.
DDFF attempts to address these issues by using an idea quite different from those of the other proposals: damped dynamics [5]. Damped dynamics operates in generalized coordinates (positional and internal coordinates) and avoids harmonic potentials by using dampers (shock absorbers) between pairs of atoms.
One of the key improvements proposed in the present paper over the original DDFF is particularly beneficial to the medium-resolution range (5 to 10 Å): the optimization of parameters used in the computation of the model-induced map. These parameters are the width of the Gaussian kernel with which the atomic model is convolved, and the density threshold value to be used after convolution. Together, the new methods yield an accurate force field that attracts the atomic model to unoccupied regions of the EM map.
A second new feature that is unique to the current "DDforge" implementation described below is the ability for the user to impose distance constraints between specified pairs of pseudo-atoms. We show an example of this to help preserve the overall shape of β-sheets.
A third new feature is a streamlined scheme to continuously adjust the cutoff distance d cut between pseudoatoms to be connected by dampers. The new scheme is the result of a rationale that yields an updated d cut at each time step, in terms of the previous one and the ratio of the previous and current RMS velocities. As a result, the convergence rate of the DDforge trajectories is significantly improved.
Lastly, we propose a simple mechanism to guard against overfitting. This consists in estimating characteristic saturation times of the overlap function via a continually updated exponential regression.

Methods
Our force field relies on a simulated model-induced map that is matched with the experimental EM map. After a preliminary rigid-body fit, the force field is applied to the damped dynamics of coarse-grained side chains to guide the atomic model towards unoccupied regions of the experimental target map [5]. The matching of the model-induced map with the experimental map is therefore an important aspect in the accurate refinement of the structure.

Optimization of parameters for the model-induced map
The original DDFF approach used a fixed standard deviation σ of the Gaussian convolution kernel based on the Situs convention: σ = R/(2 √ 3) (section 4 in ref. [21]), where R is the nominal resolution of an EM map as determined by the experimentalists. This fixed σ convention was reasonable for the interpretation of low-resolution maps with the older Situs tools. However, for intermediate resolution maps using the new DDforge, we found it beneficial to allow σ to be a free parameter that can be optimized from an initial guess such as σ = R/(2 √ 3). The optimization of σ can be shown in some cases to provide more accurate fitting results (Fig. 1), and it adds only one additional free parameter, which is negligible compared to the many conformational DOFs in the damped dynamics model.
A second parameter which we optimize in the new DDforge is the threshold T to be applied after Gaussian kernel convolution. In the original DDFF engine, we implemented two options to determine the threshold T: (a) so that the volume within the isosurface of mean value matches the corresponding volume in the EM map; or (b) so that, after rescaling, the integral and maximum coincide with the respective values for the EM map. In the new approach, we merge these two alternatives into one. In this way, by considering σ as an extra unknown, we can 3) and T determined by equating volumes at mean values, with those obtained with DDforge (b) whereby σ and T are simultaneously optimized by following a procedure depicted in Fig. 2 have all three quantities match: volume, integral, and maximum. Actually, the maximum density (which is sensitive to outliers) was replaced by a more stable value: the 95% quantile. Likewise, the mean value for the isosurface was replaced by the threshold. Figure 2 provides an overview of the density matching and optimization steps. In summary, we have the following system of equations in the new DDforge: where V denotes the volume, I the integral, and Q the 95% quantile, subindex 1 corresponding to the EM map and subindex 2 to the model-induced map. The quantities on Fig. 2 Schematic depiction of the scheme used for finding the optimal values of kernel width σ and threshold T. a EM map (after it has been thresholded) from which its volume V 1 , integral I 1 , and 95% quantile value Q 1 are computed. b Model-induced map (before thresholding), with the corresponding quantities V 2 , I 2 and Q 2 , in addition to the parameters σ and T. These parameters are determined as those for which V, I and Q of both maps match, after proper scaling between the two maps the right side of the equations are computed directly from the EM map. The scaling factor a is to be solved for as well, along with σ and T. Note that a does not enter into the first equation since the volume does not depend on the scaling. We solve these equations as follows. Let g denote the model-induced map, computed as the convolution of the atomic model with a Gaussian kernel of standard deviation σ . This map, like the EM map, is defined on a set R of N voxels. In what follows, we denote a generic voxel (or its coordinates) by x. Then the integral I 2 is given by so from Eq. 2 we obtain: The volumes in Eq. 1 are conveniently expressed as number of voxels. Thus, If we letḡ denote the 1D array of N values of g sorted in ascending order, i.e.,ḡ 1 ≤ḡ 2 ≤ · · · ≤ḡ N , then Eq. 6 implies that T =ḡ N−V 2 , which together with Eq. 1 gives: On the other hand, from Eqs. 3 and 5 we obtain: where Q 2 depends on σ and T, while g depends only on σ . We now have two equations, 7 and 8, in the two unknowns σ and T. The algorithm to solve them proceeds iteratively between these two equations: for each value of σ (starting, for instance, with σ 0 = R/(2 √ 3)), build the model-induced map g, and compute T using Eq. 7. With this T, go to Eq. 8 and make one Newton-Raphson step on σ . With this updated σ , go back to Eq. 7 to get a new value of T, and continue in this way till convergence. In our examples, the number of iterations needed was very small (5 or 6), and the compute time was about a second, even for the largest structures that we considered.

Distance constraints
Constraints are given by general relations among the generalized coordinates of the form: The way to insert these conditions into the equations of motion is by taking their time derivative: whence the equations of motion become: where the h α are Lagrange multipliers and Q (m) j is the force field in generalized coordinates.
In the particular case of distance constraints, the f α take the form: where a and b depend on α, r a and r b denoting the positions of two specific atoms whose distance we want to require to be constant throughout the trajectory. The derivatives are: where the derivatives ∂r k ∂q i constitute the entries of the socalled Wilson's matrix [22], and can be computed directly from the geometry of the current conformation of the model.

Scheme to optimize the dampers' cutoff distance
This parameter d cut is the maximum distance between atoms that will be connected by a damper. The rationale for its optimization is based on the RMS velocity of the structure, defined by where A is the number of pseudo-atoms in the structure. The idea is to adjust d cut in such a way as to try to keep v constant. As the trajectory progresses, this will make d cut gradually decrease. This decrease is allowed until d cut reaches max{7Å, 2σ } (which ensures a minimum of connectivity to preserve the structural integrity of the model), after which point it is kept constant, with the ensuing decrease of v.
To keep v nearly constant as much as possible, we need an estimate of the dependence of v on d cut . For this, we can imagine a toy model involving a collective map force F (m) and a collective dampers' constant C (which depends on d cut ), so we have, roughly: .
The number of connections of length up to d cut is of the order of d 3 cut . They can be partitioned into those of length between r and r + dr, for r ranging from to 0 to d cut , of which the number is of the order of r 2 dr. A damper of length r has a strength ∼ 1/ √ r (Eq. 8 in ref. [5]). Therefore, the aggregate strength of all dampers up to length d cut will be: When v decreases due to a decrease in F (m) , we want to decrease d cut to try to restore v to its former value. Thus, if we have, for the previous and current time steps: Hence, if we put from where we get This is the optimal way to update d cut based on the ratio ρ of the current and previous velocities.

Stopping criterion
Overfitting is always a concern in flexible-fitting methods. In general, it is difficult to give an objective criterion in this regard. However, the fact that our method furnishes a whole trajectory of conformations-rather than a single one-provides us with a way to determine a "safe" final conformation. We do this by resorting to the plot of the overlap evolution, shown in the figures for each of our examples. The idea is that the start of the "saturation" of the graph can be considered as a warning time, after which any additional refinement is likely to lead to overfitting.
A concrete rule of thumb to estimate such a critical point in the trajectory is as follows. An exponential function, of the form is least-squares fitted to the graph of the overlap function.
(Note that in this paper the variable t denotes the time step number rather than a physical time.) This is a non-linear exponential regression problem on b, c, k, which is solved by first eliminating the linear parameters b and c from the regression equations, and then numerically solving the resulting non-linear equation for k. With these parameters, we can define the warning time as the time step t 1 such that the exponential function above reaches a fraction α 1 of the interval between y(0) = b−c and y(∞) = b. This gives Since the trajectory is being obtained dynamically, the above process needs to be performed at each time step, using the points of the overlap function that have been computed until the current step. Thus, as the simulation proceeds, updated values of the regression parameters are obtained, and so we need to carry on the simulation sufficiently long, to be certain that we have enough points to yield an accurate fit. Updated estimates of this stopping time t 2 are obtained in a manner similar to t 1 : by using a value α 2 instead of α 1 : The simulation then proceeds as long as the estimate t 2 is bigger than the current time step. When the point is reached at which this condition is no longer satisfied, the simulation is stopped and the current estimate of t 1 is defined as the warning time, after which point overfitting is likely to begin. This process is described graphically in Fig. 3 (which is the actual plot for the lactoferrin test case considered below). Adequate values that worked well in our examples are α 1 = 0.9 and α 2 = 0.99.

Side-chain optimization
DDforge includes an option to optimize the side-chain conformations along the trajectory. This step can help to escape from wrong side-chain geometries that could occur if they were simply evolved from their initial conformations, and it also compensates for the inaccuracy introduced by the reduced-residue model. The side-chain optimization (which in our examples was done for each conformation written out to disk) is performed by the SCWRL4 method [23], which uses an efficient treedecomposition algorithm that furnishes the best sidechain conformations for each given backbone geometry by minimizing a simplified atomic force field on a rotamer library.

Results
We describe several simulated and experimental cases with the main purpose of illustrating the advantage of the new features introduced in the present work. Side-chain optimization was used in all cases except thermosome, since in this case the focus was on modeling the considerable large-scale deformations between the atomic model and the EM map.

Validation test: lactoferrin
To demonstrate the level of accuracy achievable by means of the new features described in the previous section, we consider the simulated case of lactoferrin. This is an iron-binding protein that undergoes a large conformational change of about 8Å RMSD where three rigid domains rotate about hinge axes. Due to the piecewise-rigid motion, the system is a standard for testing rigid-body modeling techniques [24], but the highly localized flexibility-present only in the hinge regions between rigid domains-creates a formidable challenge for many flexible refinement methods.
The simulated maps were generated from the apolactoferrin structure [25,PDB code 1lfh], and the atomic model that was fitted into those maps was the iron-bound structure [26, PDB code 1lfg]. These structures have 691 residues. See Fig. 4a and b. This experiment consisted of four cases: • Simulated map to 7Å resolution; • Simulated map to 15Å resolution; each combined with: • Atomic model fully flexible; • Atomic model with helices and β-strands rigid.
In the fully flexible case, the number of free variables was 1888, while the number was 1266 when keeping helices and strands rigid. Table 1 summarizes the results of this experiment and compares them with those from other methods. We see that, for all the DDforge cases, the RMSD of the final conformation from the target is more than an order of magnitude better than the resolution of the simulated map. Moreover, since the rigid domains (when superimposed) exhibit RMSD values between only 0.42 and 0.65Å [27], our flexible refinement accuracy of 0.58 and 0.65Å (in the 7Å case) is actually at the level of the residual discrepancy between the rigid domains. (The RMSDs for DDforge are measured on all backbone atoms.) Even though we are mostly interested in the 5-10 Å resolution range, we wanted to compare the performance of our method with the lactoferrin results reported in ref. [28], with RMSDs of 1.89Å and 2.72Å reported for two flexible fitting methods. The rigid-body Iterative Modular Fitting (IMF) method was found to surpass the flexible methods, with an RMSD of 0.98Å. Volkmann [28] concludes that 5Å resolution is needed for a flexible-fitting approach (namely MDFF [12]) to yield better results than IMF. We note from Table 1, however, that DDforge matches the accuracy of IMF at 15Å resolution. One should consider that IMF is a piecewise-rigid method for which lactoferrin's piecewise-rigid motion is obviously  well suited by design. DDforge, by contrast, is a truly flexible refinement that can handle longer-range deformations (see below), but it handles even the special case of lactoferrin's piecewise-rigid motion well.
The runtime for this case, on a desktop computer with a 4 GHz Intel Core i7 processor and 32 GB of RAM, running on a single core, was about 9 min or 11 min, with or without side-chain optimization, respectively. These timings will vary depending on parameters used and on the number of conformations saved (since the side-chain optimization is performed for each saved conformation), so they should be taken only as a guideline. Figure 4c and d show the evolution of the overlap and RMSD with respect to the target, for the all-flexible fitting into the 7Å map: the overlap increases from 78% to 95%, while the RMSD decreases from 8.1Å to 0.65Å. These RMSD values are caused by rigid motions of domains and by their internal flexible refinement.

Tests with experimental EM maps Thermosome
Thermosome is an archaeal chaperonin from T. acidophilum. The asymmetric unit of the crystal structure [29] is a dimer (PDB code 1a6d); we used the hexadecameric biological assembly with D 4 symmetry as the atomic model for fitting (Fig. 5). This structure has a total of 8040 residues, making it the largest structure on which we tested our method. The EM map we used was a reconstruction to 10Å resolution of the open conformation of the hexadecamer [30] (EMDB code 1396).
For this case, we kept the helices and β-strands rigid, with which the number of free variables was 12,664.  Figure 5c shows the evolution of the overlap with the EM map: it increases from 29% to 64%. A movie of this trajectory is available as Additional file 1. The runtime for this case was about 5 h (with no side-chain optimization).

Actin filament with MyBP-C bound
Immunoglobulin domains (Ig-domains) of myosin binding protein-C (MyBP-C) bind to fibrillary actin (F-actin) in a highly polymorphic fashion [31]. To evaluate how DDforge works on multiprotein complexes, we applied it to a 3D reconstruction of one of the multiple modes of binding of C0 and C1 Ig-domains of MyPB-C to Factin complexed with tropomyosin protein (Fig. 6). The resolution of the map was determined, using the Fourier Shell Correlation (FSC) criterion of 0.143, to be 11Å. The initial atomic model-comprised of 10 actin monomers, 10 units of each C0 and C1 Ig-domains, and two pairs of tropomyosin helices-was obtained by rigid body fitting of the corresponding high-resolution structures (PDB codes 2k1m, 2v6h and 4a7f) into the EM density map.
The total number of residues in this structure was 6284. As with thermosome, helices and β-strands were kept rigid, making the number of free variables 12,198. Also, for this case distance constraints were imposed between adjacent endpoints of β-strands in the C0 and C1 domains, to help preserve their structure. The total number of such constraints was 280. Figure 6c shows the evolution of the overlap along the trajectory, going from 66% to 78%. A movie of this trajectory is available as Additional file 2. The runtime for this case was about 10 h, including the side-chain optimization at each saved conformation. Figure 1 compares the fitting results described above (using the simultaneous optimization of kernel width σ and density threshold T in DDforge) with those obtained with the original DDFF which used σ = R/(2 √ 3) and equated only the volumes of both maps to determine T. (R is the user-specified resolution of EM map.) The improvement of the fitting in the regions indicated by arrows is noteworthy.

GroEL: stability test
GroEL is another ring-like chaperonin whose tetradecameric crystal structure exhibits D 7 symmetry (PDB code 1xck). As target EM map we used the 6Å resolution variant (EMDB code 1081). Since, for this case, the atomic structure and the map are expected to represent the same conformation, the main purpose of this test was to ascertain the stability of the DDFF approach: the fitted conformation should have, at most, a small deviation from the starting atomic structure due to any Fig. 6 Flexible fitting of an atomic model comprised of F-actin, tropomyosin and Ig-domains of MyBP-C (blue ribbons) into an EM map (yellow surface) filtered to 11Å resolution. Helices and β-strands were kept rigid. In addition, distance constraints were imposed between adjacent endpoints of β-strands in the C0 and C1 domains. a The initial atomic model, docked into the electron density map using rigid body fitting, is shown in blue. b The resulting atomic model (final conformation of the trajectory) yielded by DDforge. c Evolution of the overlap values along the trajectory. The indicated value of t 1 = 53 is the warning time, where overfitting is likely to begin. (Coincidentally, its value was equal to that for lactoferrin. See Fig. 4.) The stopping time for this case was t 2 = 106 residual small differences between the crystallographic and ice-embedded EM specimens.
The total number of residues in the atomic structure was 7336. As before, helices and β-strands were kept rigid, with which the number of free variables was 9876. No distance constraints were imposed. The starting conformation was obtained by rigid-body fitting. The runtime for this case was 11 h (including the side-chain optimization step).
The final conformation of the trajectory was determined according to the stopping criterion described in the "Methods" Section. That final conformation had a C α RMSD of 1.8Å relative to the starting conformation. This is entirely consistent with previous results obtained by other methods [32], providing further support to the suggestion that such an RMSD value corresponds mostly to the intrinsic discrepancy between the structures, and only marginally to any potential instability of the fitting method.

Comparison with the original DDFF
Here we provide a brief comparison between results obtained with the new DDforge and the original DDFF. The refinement runs for the thermosome and GroEL cases actually required the new DDforge approach whose improved simulated maps (Fig. 2) enable the convergence of ring-like structures. DDFF data was available, however, for comparing the lactoferrin and actin systems mentioned above.
The original DDFF run on lactoferrin took 22 min, as compared with 9.4 min of DDforge (both done on the 7Å map, without the side-chain optimization, and with fully flexible backbone). This translates into a factor 2.34 speedup. The RMS deviation between the resulting conformations (at the same overlap level) was 0.74Å. It is also interesting to compare the RMS deviations between each of the resulting conformations (original and DDforge) and the atomic structure used to make the simulated target map: the DDforge one was 0.65Å (as indicated in Table 1), whereas the original one was 0.73Å. Naturally, it only makes sense to pay attention to such small differences when considering simulated cases such as this.
In real cases such as the actin complex, the difference is much more significant (Fig. 1).
The actin timings were 9 h for DDforge versus 22 h for DDFF (without performing the side-chain optimization), which means a factor 2.4 speedup. The difference in the resulting conformations is shown in Fig. 1; the corresponding RMS deviation was 1.9Å.

Discussion
It may be useful to recall here that DDforge works in generalized coordinates (internal coordinates-torsion angles-and global position coordinates of each chain), and thus preserves the covalent bond geometry of the structure (except for the torsions). In addition, the side chains can optionally be energy-minimized by means of the SCWRL4 method. This is generally useful in all cases (not only for high-resolution maps) mainly because it helps to resolve possible "locks" in the trajectory, and not only because we may want to look at the detailed conformation of the side chains.
The use of dampers to maintain the overall assembly of the molecule allows it to model arbitrarily large conformational changes in a natural way. A demonstration of the sensibility of the deformations generated by DDforge is provided by the lactoferrin test case above, for which previous flexible-fitting approaches would degrade the RMSD from the target conformation, relative to a rigidbody fitting of subdomains. The fact that we can obtain results comparable to the residual crystallographic discrepancy suggests that DDforge can handle both rigid and flexible refinement with high accuracy.
Another example of the ability of DDforge to handle large conformational changes in a sensible way is the thermosome. This structure has a high degree of symmetry (imposed during the map reconstruction [30]), but this symmetry was not utilized during our simulations: all the chains were treated independently of one another, and no distance constraints were imposed.
Regarding efficiency, a comparison of runtimes with the original version of the code yielded a factor 2.3 speedup for the trajectory to reach the same overlap values. This can be attributed to the new scheme to update the dampers' cutoff distance along the trajectory. The runtimes for the new version of the code, reported earlier, vary from a few minutes to 11 h, and depend not only on the size of the structures, but also on their specific geometries and parameters that affect the speed of the trajectory. Thus, even though thermosome is the largest system, it took less compute time than GroEL and the actin complex. These timings are considerably less than benchmarks reported for the NAMD molecular dynamics program used in MDFF [33]. A comparably sized F1-ATPase system (92,224 atoms) at 100 ns simulation time would take 14-100 days of compute time even on a much more powerful 24-core machine, depending on its GPU configuration.

Conclusions
We have developed a flexible refinement strategy termed DDforge that is well suited to handle the larger, higherresolution maps that have become common in recent years.
Significant new features in our implementation include the simultaneous optimization of both the convolution kernel width σ and the density threshold T, by equating three quantities of the EM and synthetic maps: integral, 95% quantile, and volume; the ability for the user to define distance constraints on the model, which are useful to preserve shapes in regions where the density is not well defined; a streamlined approach to update, at every time step, the cutoff length of dampers, ensuring that the speed of the trajectory is optimal and thereby shortening the compute time; and a practical scheme to signal a point in the trajectory where overfitting is likely to start.
The application of DDforge to the recent actin data demonstrates the effect of optimizing the kernel width and density threshold to generate appropriate model maps at each step. Figure 1 contrasts the results obtained by using the original approach (fixed σ ) with those using the optimized values. As with thermosome, the helical symmetry of this complex was not imposed in our calculations.
The stability of the DDforge approach was verified on the GroEL test case. This used a 6Å resolution EM map, and the drift from the initial to the final conformations was of only 1.8Å RMSD, which previous results suggest is mostly due to the intrinsic discrepancy between the atomic model and the EM map [32].
We should point out that our approach can be applied in other settings as well, such as transition pathways and homology/loop modeling. These involve straightforward modifications in the definition of the force field: instead of being generated by a map, the forces are defined simply to be proportional to the distance between each atom in the origin structure and the corresponding atom in the target structure.
The DDforge method will be made available in version 3.0 of the Situs EM fitting and interpretation package (http://situs.biomachina.org), where it fills a void for a refinement tool applicable to medium-resolution maps with discernible internal (secondary structure) features. But DDforge does not attempt to deal with atomicresolution maps, for which other existing tools coming from crystallography are undoubtedly more suitable. Plans for future work include extending the new capability of kernel-width optimization to allow for inhomogeneous [34] and anisotropic [35,36] resolution across the map, and to allow for inhomogeneous convolution that characterizes stability and disorder of particular side chains [37]. Likewise, we are envisioning possible ways to make our approach immune to regions of density not accounted for by the atomic model. Currently, these extra densities need to be removed prior to the simulation.