Our force field relies on a simulated, model-induced map that is matched against the experimental EM map. After a preliminary rigid-body fit, the force field drives the damped dynamics of the coarse-grained side chains, guiding the atomic model towards still unoccupied regions of the experimental target map [5]. The matching of the model-induced map with the experimental map is therefore essential for accurate refinement of the structure.
Optimization of parameters for the model-induced map
The original DDFF approach used a fixed standard deviation σ of the Gaussian convolution kernel based on the Situs convention, \(\sigma =R/(2\sqrt {3})\) (section 4 in ref. [21]), where R is the nominal resolution of the EM map as determined by the experimentalists. This fixed-σ convention was reasonable for the interpretation of low-resolution maps with the older Situs tools. For intermediate-resolution maps processed with the new DDforge, however, we found it beneficial to treat σ as a free parameter that is optimized from an initial guess such as \(\sigma =R/(2\sqrt {3})\). In some cases, optimizing σ yields measurably more accurate fitting results (Fig. 1), and it adds only a single free parameter, which is negligible compared to the many conformational DOFs of the damped-dynamics model.
A second parameter that we optimize in the new DDforge is the threshold T applied after the Gaussian kernel convolution. In the original DDFF engine, we implemented two options to determine T: (a) such that the volume within the isosurface at the mean value matches the corresponding volume in the EM map; or (b) such that, after rescaling, the integral and the maximum coincide with the respective values for the EM map. In the new approach, we merge these two alternatives into one: by treating σ as an extra unknown, we can have all three quantities match simultaneously: volume, integral, and maximum. In practice, the maximum density, which is sensitive to outliers, is replaced by a more robust statistic, the 95% quantile, and the mean isosurface value is replaced by the threshold T.
Figure 2 provides an overview of the density matching and optimization steps. In summary, we have the following system of equations in the new DDforge:
$$ V_{2} = V_{1}, $$
(1)
$$ a I_{2} = I_{1}, $$
(2)
$$ a Q_{2} = Q_{1}, $$
(3)
where V denotes the volume, I the integral, and Q the 95% quantile, with subscript 1 referring to the EM map and subscript 2 to the model-induced map. The quantities on the right-hand sides of the equations are computed directly from the EM map. The scaling factor a is solved for as well, along with σ and T. Note that a does not enter the first equation, since the volume does not depend on the scaling.
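For concreteness, the right-hand-side targets can be computed once from the EM map. Below is a minimal sketch in Python, assuming the map is available as a NumPy array together with its own threshold T1 (e.g., the contour level recommended by the experimentalists); the paper does not spell out over which set Q1 is taken, so here it is computed over the above-threshold voxels, analogously to the model-induced map:

```python
import numpy as np

def em_map_targets(emmap, T1, q=0.95):
    """Target quantities V1, I1, Q1 from the experimental map.

    emmap : 3D numpy array of density values.
    T1    : contour threshold of the EM map (assumed given, e.g. the
            level recommended by the experimentalists).
    Returns (V1, I1, Q1): voxel count above T1, thresholded integral,
    and the 95% quantile of the above-threshold values.
    """
    above = emmap[emmap > T1]
    V1 = above.size                    # volume as a voxel count
    I1 = float(np.sum(above - T1))     # thresholded integral
    Q1 = float(np.quantile(above, q))  # robust stand-in for the maximum
    return V1, I1, Q1
```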
We solve these equations as follows. Let g denote the model-induced map, computed as the convolution of the atomic model with a Gaussian kernel of standard deviation σ. This map, like the EM map, is defined on a set \(\mathcal {R}\) of N voxels. In what follows, we denote a generic voxel (or its coordinates) by x. Then the integral I2 is given by
$$ I_{2} = \sum_{g(x)>T}(g(x)-T), $$
(4)
so from Eq. 2 we obtain:
$$ a \sum_{g(x)>T}(g(x)-T) = I_{1}. $$
(5)
The volumes in Eq. 1 are conveniently expressed as number of voxels. Thus,
$$ V_{2} = \#\{ x\in\mathcal{R} \:\vert\: g(x)>T \}. $$
(6)
If we let \(\bar {g}\) denote the 1D array of N values of g sorted in ascending order, i.e., \(\bar {g}_{1} \leq \bar {g}_{2} \leq \dots \leq \bar {g}_{N}\), then Eq. 6 implies that \(T = \bar {g}_{N-V_{2}}\), which together with Eq. 1 gives:
$$ T = \bar{g}_{N-V_{1}}. $$
(7)
On the other hand, from Eqs. 3 and 5 we obtain:
$$ D(\sigma, T) \equiv \frac{\sum_{g(x)>T}(g(x)-T)}{Q_{2}}-\frac{I_{1}}{Q_{1}} = 0, $$
(8)
where Q2 depends on σ and T, while g depends only on σ. We now have two equations, 7 and 8, in the two unknowns σ and T. The algorithm alternates between them: for each value of σ (starting, for instance, with \(\sigma _{0}=R/(2\sqrt {3})\)), build the model-induced map g and compute T using Eq. 7. With this T, go to Eq. 8 and take one Newton-Raphson step on σ. With the updated σ, return to Eq. 7 for a new value of T, and continue in this way until convergence. Once σ and T have converged, the scaling factor a follows directly from Eq. 2 (or, equivalently, Eq. 3). In our examples, the number of iterations needed was very small (5 or 6), and the compute time was about one second, even for the largest structures we considered.
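The following sketch illustrates this alternation in Python. It is not DDforge's actual implementation: a hypothetical build_map(model, sigma) stands in for the Gaussian-convolution map generator, and the Newton-Raphson step on Eq. 8 uses a finite-difference derivative of D with respect to σ:

```python
import numpy as np

def solve_sigma_T(model, build_map, V1, I1, Q1, sigma0,
                  tol=1e-4, max_iter=20, q=0.95):
    """Alternating solver for sigma and T (Eqs. 7 and 8).

    build_map(model, sigma) -> 3D array g: hypothetical stand-in for
    the Gaussian-convolution map generator.
    V1, I1, Q1: target quantities from the EM map.
    """
    def threshold(g):
        # Eq. 7: T = gbar_{N-V1} (1-based), i.e. index N-V1-1 here.
        gbar = np.sort(g, axis=None)
        return gbar[gbar.size - V1 - 1]

    def D(sigma):
        # Eq. 8: D(sigma, T) = I2/Q2 - I1/Q1, with T from Eq. 7.
        # Q2 is taken over the above-threshold voxels (assumption,
        # analogous to the EM-map targets).
        g = build_map(model, sigma)
        T = threshold(g)
        above = g[g > T]
        return np.sum(above - T) / np.quantile(above, q) - I1 / Q1

    sigma = sigma0
    for _ in range(max_iter):
        d0 = D(sigma)
        if abs(d0) < tol:
            break
        # One Newton-Raphson step with a finite-difference slope.
        h = 1e-3 * sigma
        sigma -= d0 / ((D(sigma + h) - d0) / h)
    # Scaling factor from Eq. 2: a = I1 / I2.
    g = build_map(model, sigma)
    T = threshold(g)
    a = I1 / np.sum(g[g > T] - T)
    return sigma, T, a
```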
Distance constraints
Constraints are given by general relations among the generalized coordinates of the form:
$$ f_{\alpha}(q_{1},\dots,q_{M}) = 0 \qquad (\alpha=1,\dots,K). $$
(9)
The way to insert these conditions into the equations of motion is by taking their time derivative:
$$ \sum\limits_{i=1}^{M} \frac{\partial f_{\alpha}}{\partial q_{i}} \dot{q}_{i} = 0 \qquad \forall\alpha, $$
(10)
whence the equations of motion become:
$$ \left. \begin{array}{lll} \sum\limits_{i} (B_{ij}+V_{ij}) \dot{q}_{i} - \sum\limits_{\alpha} \frac{\partial f_{\alpha}}{\partial q_{j}} h_{\alpha} &= Q^{(m)}_{j} \qquad \qquad &\forall j \\ \quad \quad \sum\limits_{i} \frac{\partial f_{\alpha}}{\partial q_{i}} \dot{q}_{i} &= 0 & \forall \alpha \end{array} \right\} $$
(11)
where the hα are Lagrange multipliers and \(Q^{(m)}_{j}\) is the force field in generalized coordinates.
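Equation 11 is a linear saddle-point system in the generalized velocities and the multipliers hα, and can be solved directly once its blocks are assembled. The following is a minimal sketch, not DDforge's internals, assuming the damping matrix B+V, the constraint Jacobian J (with Jαi = ∂fα/∂qi), and the generalized force Q(m) are available as NumPy arrays:

```python
import numpy as np

def constrained_velocities(BV, J, Qm):
    """Solve the saddle-point system of Eq. 11.

    BV : (M, M) damping matrix B + V.
    J  : (K, M) constraint Jacobian, J[alpha, i] = d f_alpha / d q_i.
    Qm : (M,) generalized map force Q^(m).
    Returns (qdot, h): generalized velocities and Lagrange multipliers.
    """
    M, K = BV.shape[0], J.shape[0]
    # Block system:  [ BV  -J^T ] [qdot]   [Qm]
    #                [ J     0  ] [ h  ] = [ 0]
    A = np.zeros((M + K, M + K))
    A[:M, :M] = BV
    A[:M, M:] = -J.T
    A[M:, :M] = J
    rhs = np.concatenate([Qm, np.zeros(K)])
    sol = np.linalg.solve(A, rhs)
    return sol[:M], sol[M:]
```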
In the particular case of distance constraints, the fα take the form:
$$ f_{\alpha} = \frac{1}{2} \left( \left\| {\mathbf{{r}}_{a} - \mathbf{{r}}_{b}} \right\|^{2} - d_{ab}^{2} \right), $$
(12)
where a and b depend on α, ra and rb denote the positions of two specific atoms whose distance we require to remain constant throughout the trajectory, and dab is the prescribed (initial) value of that distance. The constant term does not affect the dynamics, since only the derivatives of the fα enter Eq. 10; these derivatives are:
$$ \frac{\partial f_{\alpha}}{\partial q_{i}} = \left\langle \mathbf{{r}}_{a}-\mathbf{{r}}_{b}, \frac{\partial\mathbf{{r}}_{a}}{\partial q_{i}} - \frac{\partial\mathbf{{r}}_{b}}{\partial q_{i}} \right\rangle, $$
(13)
where the derivatives \(\frac {\partial \mathbf {{r}}_{k}}{\partial q_{i}}\) constitute the entries of the Wilson matrix [22] and can be computed directly from the geometry of the current conformation of the model.
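For illustration, the rows of the constraint Jacobian used above can be filled from Eq. 13, assuming a hypothetical precomputed array W of Wilson-matrix entries W[k, i] = ∂rk/∂qi:

```python
import numpy as np

def distance_constraint_jacobian(r, W, pairs):
    """Constraint Jacobian rows from Eq. 13.

    r     : (A, 3) current atom positions.
    W     : (A, M, 3) Wilson-matrix entries, W[k, i] = d r_k / d q_i
            (assumed precomputed from the current conformation).
    pairs : list of (a, b) atom-index pairs, one per constraint alpha.
    Returns J of shape (K, M) with J[alpha, i] = d f_alpha / d q_i.
    """
    K, M = len(pairs), W.shape[1]
    J = np.zeros((K, M))
    for alpha, (a, b) in enumerate(pairs):
        d = r[a] - r[b]      # r_a - r_b
        dW = W[a] - W[b]     # d r_a/d q_i - d r_b/d q_i, shape (M, 3)
        J[alpha] = dW @ d    # inner products of Eq. 13
    return J
```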
Scheme to optimize the dampers’ cutoff distance
The cutoff distance dcut is the maximum distance between atoms that will be connected by a damper. The rationale for its optimization is based on the RMS velocity of the structure, defined by
$$ v^{2} = \frac{1}{A} \sum\limits_{i=1}^{A} \| {\dot{\mathbf{{r}}}_{i}} \|^{2}, $$
(14)
where A is the number of pseudo-atoms in the structure.
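For reference, a one-line computation of this RMS velocity, assuming the pseudo-atom velocities are available as an (A, 3) NumPy array (the name rdot is illustrative):

```python
import numpy as np

def rms_velocity(rdot):
    """RMS velocity v of Eq. 14; rdot is the (A, 3) array of pseudo-atom velocities."""
    return float(np.sqrt(np.mean(np.sum(rdot ** 2, axis=1))))
```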
The idea is to adjust dcut so as to keep v approximately constant. As the trajectory progresses, this makes dcut gradually decrease. The decrease is allowed until dcut reaches max{7Å,2σ} (which ensures a minimum of connectivity to preserve the structural integrity of the model), after which dcut is kept constant, with the ensuing decrease of v.
To keep v as nearly constant as possible, we need an estimate of the dependence of v on dcut. For this, consider a toy model involving a collective map force F(m) and a collective damper constant C (which depends on dcut), so that, roughly:
$$ v \approx \frac{F^{(m)}}{C\left(d_{\text{cut}}\right)}. $$
(15)
The number of connections of length up to dcut is of the order of \(d_{\text {cut}}^{3}\). They can be partitioned into those of length between r and r+dr, for r ranging from 0 to dcut, whose number is of the order of r2 dr. A damper of length r has a strength \(\sim 1/\sqrt {r}\) (Eq. 8 in ref. [5]). Therefore, the aggregate strength of all dampers up to length dcut is:
$$ C\left(d_{\text{cut}}\right) \sim \int_{0}^{d_{\text{cut}}} \frac{r^{2}}{\sqrt{r}}\,dr \sim d_{\text{cut}}^{5/2}. $$
(16)
When v decreases due to a decrease in F(m), we want to decrease dcut to try to restore v to its former value. Thus, if we have, for the previous and current time steps:
$$ {} v(t-1) = \frac{F^{(m)}(t-1)}{d_{\text{cut}}(t-1)^{5/2}} \qquad \text{and} \qquad v(t) = \frac{F^{(m)}(t)}{d_{\text{cut}}(t)^{5/2}}, $$
(17)
the goal is to get v(t+1)=v(t−1), assuming F(m)(t+1)=F(m)(t). This gives
$$ v(t-1) = v(t+1) = \frac{F^{(m)}(t+1)}{d_{\text{cut}}(t+1)^{5/2}} = \frac{F^{(m)}(t)}{d_{\text{cut}}(t+1)^{5/2}}. $$
(18)
Hence, if we put
$$ \rho = \frac{v(t)}{v(t-1)}, $$
(19)
then
$$ \rho = \frac{F^{(m)}(t)}{d_{\text{cut}}(t)^{5/2}} \cdot \frac{d_{\text{cut}}(t+1)^{5/2}}{F^{(m)}(t)} = \left(\frac{d_{\text{cut}}(t+1)}{d_{\text{cut}}(t)} \right)^{5/2}, $$
(20)
from where we get
$$ d_{\text{cut}}(t+1) = d_{\text{cut}}(t) \cdot \rho^{2/5}. $$
(21)
Within this toy model, Eq. 21 is the optimal update of dcut based on the ratio ρ of the current and previous RMS velocities.
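A minimal sketch of this update rule (Eq. 21), including the connectivity floor max{7Å, 2σ} mentioned above; the function and variable names are illustrative:

```python
def update_dcut(dcut, v_prev, v_curr, sigma):
    """Update the dampers' cutoff distance via Eq. 21.

    dcut   : current cutoff distance (Angstroms).
    v_prev : RMS velocity at the previous time step, v(t-1).
    v_curr : RMS velocity at the current time step, v(t).
    sigma  : kernel standard deviation (Angstroms).
    """
    rho = v_curr / v_prev        # Eq. 19
    d_new = dcut * rho ** 0.4    # Eq. 21: rho^(2/5)
    # Connectivity floor preserving the structural integrity of the model.
    return max(d_new, max(7.0, 2.0 * sigma))
```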
Stopping criterion
Overfitting is always a concern in flexible-fitting methods, and it is difficult in general to give an objective criterion in this regard. However, the fact that our method furnishes a whole trajectory of conformations, rather than a single one, provides a way to determine a “safe” final conformation. We do this by resorting to the plot of the overlap evolution, shown in the figures for each of our examples. The idea is that the onset of “saturation” of the graph can be considered a warning time, after which any additional refinement is likely to lead to overfitting.
A concrete rule of thumb to estimate such a critical point in the trajectory is as follows. An exponential function of the form
$$ y(t) = b-c \exp(-kt) $$
(22)
is least-squares fitted to the graph of the overlap function. (Note that in this paper the variable t denotes the time step number rather than a physical time.) This is a non-linear exponential regression problem on b,c,k, which is solved by first eliminating the linear parameters b and c from the regression equations, and then numerically solving the resulting non-linear equation for k. With these parameters, we can define the warning time as the time step t1 such that the exponential function above reaches a fraction α1 of the interval between y(0)=b−c and y(∞)=b. This gives
$$ t_{1} = \frac{1}{k} \ln{\frac{1}{1-\alpha_{1}}}. $$
(23)
Since the trajectory is obtained dynamically, the above process is performed at each time step, using the points of the overlap function computed up to the current step. As the simulation proceeds, updated values of the regression parameters are obtained, and the simulation must be carried on long enough to ensure that there are enough points for an accurate fit. To this end, updated estimates of a stopping time t2 are obtained in the same manner as t1, by using a value α2 > α1 instead of α1:
$$ t_{2} = \frac{1}{k} \ln{\frac{1}{1-\alpha_{2}}}. $$
(24)
The simulation then proceeds as long as the estimate t2 exceeds the current time step. Once this condition is no longer satisfied, the simulation is stopped, and the current estimate of t1 is taken as the warning time, after which overfitting is likely to begin. This process is illustrated graphically in Fig. 3 (the actual plot for the lactoferrin test case considered below). Values that worked well in our examples are α1=0.9 and α2=0.99.
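The sketch below illustrates the fit and the stopping test in Python, under stated assumptions: for a fixed k, the model y(t) = b − c exp(−kt) is linear in b and c, which are eliminated by ordinary least squares, and the remaining one-dimensional problem in k is solved numerically (here by bounded scalar minimization over an illustrative bracket, rather than a root solve of the regression equation):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_saturation(y):
    """Least-squares fit of y(t) = b - c*exp(-k*t) (Eq. 22) to y[0..n-1]."""
    t = np.arange(len(y), dtype=float)

    def sse_and_coef(k):
        # For fixed k the model is linear in (b, c): eliminate them by OLS.
        X = np.column_stack([np.ones_like(t), -np.exp(-k * t)])
        coef = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.sum((y - X @ coef) ** 2), coef

    k = minimize_scalar(lambda k: sse_and_coef(k)[0],
                        bounds=(1e-6, 1.0), method="bounded").x
    b, c = sse_and_coef(k)[1]
    return b, c, k

def warning_and_stop_times(y, alpha1=0.9, alpha2=0.99):
    """t1 (Eq. 23) and t2 (Eq. 24) from the overlap trace so far."""
    _, _, k = fit_saturation(np.asarray(y, dtype=float))
    t1 = np.log(1.0 / (1.0 - alpha1)) / k
    t2 = np.log(1.0 / (1.0 - alpha2)) / k
    return t1, t2

# Stopping rule: continue while the estimated t2 lies ahead of the
# current step; stop as soon as it does not, e.g.:
# t1, t2 = warning_and_stop_times(overlap_so_far)
# stop = (current_step >= t2)
```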
Side-chain optimization
DDforge includes an option to optimize the side-chain conformations along the trajectory. This step helps escape wrong side-chain geometries that could arise if the side chains were simply evolved from their initial conformations, and it also compensates for the inaccuracy introduced by the reduced-residue model. The side-chain optimization (which in our examples was performed for each conformation written out to disk) is carried out by the SCWRL4 method [23], which uses an efficient tree-decomposition algorithm that furnishes the best side-chain conformations for a given backbone geometry by minimizing a simplified atomic force field over a rotamer library.
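Since SCWRL4 is distributed as a standalone program, a post-processing step of this kind could be scripted as follows; this is a hypothetical wrapper, assuming the SCWRL4 executable is on the PATH and using its documented -i/-o options for the input and output PDB files:

```python
import subprocess

def optimize_side_chains(in_pdb, out_pdb, scwrl_exe="Scwrl4"):
    """Re-pack the side chains of a written-out conformation with SCWRL4.

    Assumes the SCWRL4 executable is on the PATH; -i and -o are its
    input/output PDB options.
    """
    subprocess.run([scwrl_exe, "-i", in_pdb, "-o", out_pdb], check=True)

# e.g., for each frame written to disk:
# optimize_side_chains("frame_0042.pdb", "frame_0042_scwrl.pdb")
```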