Unbiased, scalable sampling of protein loop conformations from probabilistic priors

Background Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in adhoc fashion. This paper presents Sub-Loop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences. Results Our simulation experiments demonstrate that the method computes high-scoring conformations of large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques. Two new developments contribute to the scalability of the new method. First, structural preferences are specified via a probabilistic graphical model (PGM) that links conformation variables, spatial variables (e.g., atom positions), constraints and prior information in a unified framework. The method uses a sparse PGM that exploits locality of interactions between atoms and residues. Second, a novel method for sampling sub-loops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints. Conclusion Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ion-binding demonstrate its potential as a tool for loop ensemble generation and missing structure completion.


Background
Sampling conformations of kinematic chainsrigid objects connected by articulated jointsis a fundamental problem in protein structure prediction, the geometry of folding linkages, and robot motion planning. Sampling poses a challenging computational problem when chains are large and must satisfy a variety of constraints and statistical preferences. Conformations may be required to satisfy hard feasibility constraints, such as loop closure and collision avoidance, while also obeying soft preference constraints, such as low energy and high structural likelihood. Particularly around folded protein structures, the subset of feasible and favorable conformations comprises a miniscule fraction of the conformation space, and due to the "curse of dimensionality" this fraction shrinks dramatically with the dimensionality of the state space. Because interesting biological macromolecules have large numbers of degrees of freedom, ranging up to hundreds or thousands, new techniques are needed to sample severely constrained conformations efficiently.
Protein loops are flexible structures that often deform during binding, and are extremely important for understanding protein functioning [1]. Loop sampling has been used in missing fragment reconstruction, generating fluctuations in equilibrium conformations, and generating decoy sets for function prediction. Such applications typically require methods for sampling energetically-likely and diverse configuration ensembles rather than optimizing a single point estimate. The loop closure constraint, which requires the terminal atoms of a loop to lie at fixed positions dictated by the surrounding structured regions, poses a major challenge in sampling. Existing loop sampling methods include discrete search [2], optimization [1], and inverse kinematics (IK) methods [3][4][5][6]. Fiser et al's [1] approach takes an energy function that encodes spatial restraints and preferences on dihedral angles, and then runs a numerical optimization to minimize that energy. Optimization is relatively computationally expensive, which is usually worthwhile for single structure prediction but less so for generating conformation ensembles. Discrete search methods are able to explore a wider space of conformations by incrementally building a tree of clashfree subchain conformations starting from one end of the loop and progressing toward the terminal end [2,7,8].
Although effective for small loops, these methods do not scale well to large loops due to the problem of combinatorial explosion. As a result, discrete search is intractable with chains containing 7-8 or more residues. They also introduce discretization artifacts, and are not able to close the gap between a terminal atom and its desired position. Inverse kinematics (IK) techniques from the robotics field have been adopted to sample conformations with exact loop closure [3][4][5]9]. However, these methods prioritize the loop closure constraint and do not take energies into account during sampling. Hence, some authors employ a secondary energy optimization step to generate more plausible conformations [3,6,10].
For each of these methods, the sampling distribution is throughly entangled with the sampling procedure. To achieve a desired distribution, the sampling procedure must be tuned in an opaque, nontrivial manner, and it is unclear that a desired distribution can even be achieved. Instead, extensive post-hoc empirical testing to assess the quality of the resulting distribution and to argue that a method samples well. Monte Carlo (MC) techniques represent a more principled class of approaches that sample directly from a desired destribution. They have a long history of use in computational biology because they can quickly explore multiple energy minima and transition pathways, while molecular dynamics and optimization techniques often get stuck in single local minima [11][12][13][14]. They are also well-suited for generating conformation ensembles. The general Metropolis-Hastings approach generates a sequence of incrementally perturbed configurations via a random walk, with a carefully-designed acceptance criterion (the detailed balance condition) that ensures that the sampling distribution approaches the desired one as more samples are drawn. However, there is a tradeoff in choosing the perturbation size: small perturbations raise the fraction of accepted moves but lower the speed of conformation space exploration. Moreover, standard MC techniques cannot be directly applied to protein loops due to the loop closure constraint, which causes each step to be accepted with probability 0.
Our new method overcomes many of the weaknesses of prior methods (see Table 1); it simultaneously scales to long loops (e.g., >10 residues) and produces unbiased ensembles of conformations. It uses a unified probabilistic graphical modeling (PGMs) framework for modeling a desired distribution from experimentally available statistical priors. PGMs such as Bayesian networks and Markov random fields are powerful tools for inference in large domains with heterogeneous sources of information, and have been applied in a limited sense to protein structure prediction. They have been used to predict side-chain rotamer conformations [15] and conformations of macromolecular assemblies from electron density maps [16]. Their use is reasonably well understood in the discrete case, but continuous variables often prove challenging. Our work derives a new Gibbs sampling inference method for continuous PGMs with loop-closure constraints, which restrict the feasible domain to a nonlinear implicit manifold. In particular we derive the mathematical relationship between an inverse kinematic sampling distribution and the manifold's metric tensor, which is necessary to compute the detailed balance condition in the Metropolis-Hastings algorithm. The resulting sampling sequence is unbiased in the sense that its distribution approaches the target distribution in the large-sample limit.

Methods
SLIKMC is a Markov chain Monte Carlo (MCMC) method that takes as input an experimental conformation scoring function Φ, a protein structure from the Protein Data Bank (PDB), the beginning and ending residues of the loop, and outputs a sequence of perturbed loop conformations such that the sequence asymptotically approaches a probability distribution proportional to Φ. If the structure is missing, a rough initial structure is sampled using existing inverse kinematics loop closure techniques. To generate a subsequent conformation, it performs the following operations: For each 4-residue subloop, repeat the following steps: The method terminates when a fixed number of conformations are generated or until a desired time cutoff is reached. The novel contributions of this paper include an exact derivation of the importance ratio a for the inverse kinematics sampler of step 1 and the use of sparse PGMs to evaluate the importance ratio quickly per-subloop. We also describe extensions that handle flexibility in side-chains and molecules with multiple branches or loops (e.g., polycyclic compounds).
As a MCMC method, SLIKMC samples from a complex joint probability distribution by constructing a Markov chain whose equilibrium distribution is equal to the desired distribution. It is a hybrid MCMC algorithm that combines blocked Gibbs sampling and Metropolis-Hastings (M-H) sampling. M-H permits the use of nonnormalized probability distributions, which is important because it is relatively simple to define a useful scoring function but virtually impossible to ensure that it integrates to one. The blocked Gibbs sampling method samples a small subloop at each step, which helps SLIKMC scale better to large chains, because acceptance rates decrease roughly exponentially in the number of variables sampled at once. This section will first review classical MCMC methods and then describe the new approach.

Markov chain Monte Carlo framework
Let x = (x 1 , ..., x n ) denote the state variables of the system of interest. Experimental conditions including hard constraints and soft preferences are encoded into a nonnegative scoring function Φ(x 1 , ..., x n ). The score is required to have finite integral, is zero at states that violate hard constraints, and higher values indicate more desirable states. Φ is considered as an unnormalized probability density, and our goal is to generate samples with probability proportional to Φ. In other words, the goal is to sample from the normalized density P defined as: where the proportionality constant Z that ensures that P integrates to 1. Φ is closely related to energy functions E(x) through the Gibbs measure where T is the system temperature.
The Metropolis-Hastings (M-H) algorithm addresses the problem that it is hard to sample directly from an unnormalized distribution Φ in part due to the difficulty of evaluating the normalization term Z [17]. On step k, M-H first samples a candidate move from x (k) to x′ from a proposal distribution Q(x′; x (k) ), and accepts the move x (k+1) x′ with probability This is the so-called detailed balance condition. The Z terms in the numerator and denominator cancel out, so we use Φ directly instead of P. If the move is rejected, then the current state is maintained: is called the importance ratio. If the ratio is greater than 1, then the new sample is accepted; otherwise it is accepted with probability equal to the ratio. With the detailed balance construction, P (x) is indeed the stationary distribution of the Markov chain generated by successive samples.
The key question for M-H is how to choose a proposal distribution that we can sample from and evaluate. The acceptance strategy must evaluate the terms in (3) exactly so that the M-H algorithm respects the detailed balance. One of our key contributions is a technique for evaluating Q exactly when sampling from closed chain submanifolds, which enables our method to generate an unbiased sampling sequence.
Note that it is challenging to choose Q to approximate P closely, and hence in practice the probability of accepting a sample drops sharply in the dimensionality of the space. Gibbs sampling is commonly used to address this issue. Moreover it is convenient when sampling from a conditional density is easier than sampling from the entire joint density, which is the case in the sparse Bayesian models that we employ in this work. Given the current sample and keeping the remaining variables fixed. The variable is updated and the index i is incremented in looping fashion. If the dependencies between the variables are sparse (e.g., every variable x i only depends on a handful of variables rather than the remaining n -1 variables), then Gibbs sampling can be efficient even for very large problems. This is exploited in sparse PGMs. Blocked Gibbs sampling is a variation of Gibbs sampling that groups multiple variables as a block and samples the block from the joint distribution conditioned on all other variables.
Our method combines Gibbs sampling with M-H sampling to generate a new sample from (5). To do so, simply consider all other variables fixed, sample x i from a condi- and then apply the importance ratio test as usual to determine whether to accept the step x

Sparse factored models
Due to the locality of interactions in most scoring functions of interest, it is possible to represent Φ in a factored form: where each j i is known as a factor and each S i is a subset of {x 1 , ..., x n } known as the domain of the factor j i . For example, in protein structure prediction factors may include Ramachandran plots relating each pair of dihedral angles (, ψ), steric clashes, energy functions defined over atom positions, and prior knowledge from B-factors or electron density maps.
Probabilistic graphical models like Bayesian networks and Markov random fields are inherently factored: the domain of each factor consists only of a vertex and its neighbors in the graph. A graphical model is sparse if each variable x i is involved in only a handful of factors (i.e., bounded by a constant unrelated to n), and hence only interacts directly with a few other variables. An important consequence in the discrete case is that probabilistic inference is computationally tractable in sparse models (polynomial in n), whereas inference is intractable in dense models (in general, exponential in n). A key step in our method converts the representation of a kinematic chain from dense to sparse form, as described below. The implementation described in this paper currently supports: • Ramachandran plots j RP(r) (, ψ) which vary by residue r.
• Steric clashes j SC(j, k) (p j , p k ) which are 0 if atom j collides with atom k and 1 otherwise.
where µ j is the predicted atom position and B j is the B-factor value in the protein's PDB file. A constant of proportionality c can be set by the user according to his/her confidence in the quality of the B-factor estimates.
• Side-chain rotamer distributions, as described the Side Chain Sampling section.
Each factor can be evaluated quickly, but over thousands or millions of evaluations they accumulate significant computational cost. Significant savings can be achieved in sparse models, because when a few variables are changed, the change in Φ can be calculated quickly by only evaluating those factors involved, rather than recomputing Φ from scratch. Although steric clashes are theoretically considered as O(n 2 ) pairwise factors, in practice we use a grid-based hashing data structure that only checks nearby atoms for collision. As a result, each Gibbs sampling step can be performed in O(1) time.
In future work we are interested in including additional statistical potentials and/or all-atom energy function terms in scoring. With a naive implementation, each atom is involved in O(n) pairwise interactions, but we expect to exploit the weakness of distant interactions to reduce the number of factors included in the computation.

Kinematic chain modeling
Consider a jointed kinematic chain with reference frames T 0 , T 1 , ..., T N , connected with relative rotational angles q 1 , ..., q N . For a protein backbone, there is a one-to-one correspondence between frames and backbone atom positions p 1 , ..., p M , and the rotational variables are simply the backbone dihedral angles 1 , ψ 1 , ..., N/2 , ψ N/2 .
Although it is standard practice and beneficial for certain algorithms to define the system state with a minimal set of coordinates, e.g., x = (T 0 , q 1 , ..., q N ), a key step of our method is to consider an expanded state. Minimal coordinates use the fact that each subsequent frame T 1 , ..., T N can be determined from x through straightforward forward kinematics, leading to a lower dimensional representation. However, this approach eliminates sparsity in the probabilistic model because a factor defined on T N will depend on all variables, a factor defined over T N -1 will depend on all variables except q n , and so on. Moreover, if a sampler is asked to generate certain variables from a density defined over T 1 , ..., T N (for example, atom positions), the generated distribution may be biased unless it computes the determinant of an N × N metric tensor for each evaluation of Φ. As described below, this is a consequence of nonlinear transformations of distributions (see Appendix).
On the other hand, computing determinants takes O (N 3 ) time, which scales poorly with large N.
Our method represents an expanded state that incorporates all spatial variables along with the conformation variables: x = (q 1 , ..., q N , T 0 , ..., T N ). The joint probability density is then defined over angles and reference frames of all links along the chain (see Figure 1): where each S i is now a subset of {q 1 , ..., q n , T 0 , ..., T n }, and where j kinematic is the forward kinematic transform that defines the frame T j in terms of the prior frame T j-1 and the relative angle q j . Because the transform is deterministic, j kinematic should be thought of as an indicator function. Taking the convention that each frame's origin lies on its joint's axis: where T rel j is the relative transformation of frame j relative to frame j -1 and R(a, q) is the rotation of angle q about axis a. Fixed-endpoint constraints can also be encoded with indicator factors j closure (T 0 ) and j closure (T n ) that are zero everywhere except at the fixed frames.
With (7) encoded so that factors contain few variables in their domain, the model becomes sparse. However, we have added the complication of maintaining a valid kinematic structure, because the set of x for which Φ is nonzero lies on a lower-dimensional manifold. Technically speaking, the probability density must be considered with respect to a base measure that assigns finite, nonzero density to the manifold. For 3D chains, the state space has dimensionality 7N but the manifold has dimensionality 6 + N for free-endpoint chains or N -6 for fixed-endpoint chains. The next section will describe how we handle these submanifolds in detail.

Block sampling and selection
A block is a subset of variables that are simultaneously sampled. The number of variables in a block must be sufficiently large to give at least one continuous degree of freedom of movement. The Metropolis-Hastings criterion is used to accept or reject a move because it is unrealistic to sample directly from the block's conditional density. This key subroutine, Sample-Block-MH, takes as input the previous sample x (k) and a block B of b consecutive joint angles and their intervening frames. It then samples a candidate move, and accepts it according to the M-H criterion. Pseudocode is as follows: Sample-Block-MH(x (k) , B): 1. Using Sample-Block as described below, sample a candidate conformation x' B of B at random, keeping the rest of the chain x .

Accept the move x
Here the subscript B denotes the subset of variables in the block, while the subscript C denotes the complement of the block. The score Φ B calculates the product of factors j i whose domains S i overlap with B, which is more efficient than recomputing Φ from scratch. The remaining details of the methodthe block size, the block sampling procedure, and calculating the sampling probability Q Bare described in detail in the remainder of this section. To generate a new conformation x (k+1) of the entire chain, Sample-Block-MH is called several times with overlapping blocks incremented sequentially down the chain. Block ordering (e.g. forward, backward, or random order) has no effect on the asymptotic distribution and experiments suggest virtually no noticeable effect apart from the first handful of samples. Thanks to sparsity, each pass is performed in O(N) time, which takes a fraction of a second for chains with hundreds of variables.
How many variables should be included in a block? Standard Gibbs sampling (i.e., b = 1) does not work because loop closure constraints constrain the conditional density of any variable given the rest (5) to a Dirac. Hence, the state would never change. In fact, no mixing occurs for b ≤ 5, except possibly at singular conformations, which occupy a set of measure zero in conformation space and are therefore unlikely to occur naturally. For 6 angles, analytical inverse kinematics (IK) techniques are available to compute solutions for a pair of fixed end frames [9]. In fact, any number from 0 to 16 solutions may exist for a given 6-angle problem. Nevertheless, b = 6 is not suitable because it restricts the random walk to only a finite set of conformations.
Setting b = 7 angles allows sufficient freedom to sample from a 1-dimensional manifold of solutions. In general, a block of b ≥ 6 angles admits a b -6 dimensional solution manifold. Denote the block B = {q i , ..., q i+b-1 , T i , ..., T i+b-2 }, and let us call the first b -6 angles of the block q i , ..., q i+b-7 the independent subchain. Call the remaining 6 angles the dependent subchain. This is illustrated for a planar chain in Figure 2. A sampling procedure is as follows: Sample-Block 1. Sample values for the independent subchain at random. 2. Attempt to close the chain by calculating an analytical IK solution for the dependent subchain. We use the method of [9]. 3. If more than one IK solution exists, one is picked at random, and if no solution exists, the process terminates with failure.
It is recommended that b ≥ 7 be chosen as low as possible, because as b grows, the probability of sampling an independent subchain that admits closure drops off dramatically as b grows, particularly for "stretched out" conformations. In our implementation, 4 consecutive residues are considered as a block that contains b = 8 angles since even numbers align better with the (, ψ) angles priors of each residue (see Figure 3). (Throughout this discussion we have assumed a 3D chain but the method works equally well in 2D. For planar chains, at least 4 angles are needed, and the manifold of solutions is (b -3)-dimensional)

Calculation of sub-loop sampling densities
To calculate the M-H importance ratio, we must calculate sampling density for the sampling procedure Sample-Block. Several concepts from differential geometry are required in order to derive this density Q B (x' B |x (k) C ) . Fix the endpoints of the block, and let M denote the (b -6)-dimensional manifold of loop-closing conformations. Let us call the (b -6) angles of the independent subchain y, which are sampled w.r.t. the density P (y). Observe that the candidate sample x' B is distributed according to a nonlinear transformation of P(y) onto M. In fact, at non-singular conformations the independent subchain forms a local chart of M, which is a local bijection between R b−6 to M centered at x' B (see Figure 4). Since there is a local bijection f between y and the point on the manifold x B , the sampling density over x B is given by: where s is the number of IK solutions at y and G is the metric tensor of the chart x B = f(y) (see Appendix). The inclusion of the metric tensor is a natural consequence of transformation of variables. For example, for the case b = 7, the metric tensor is the squared arc length of the 1-dimensional parametrization of M (Figure 4, bottom). In general, G is given by where ∂f ∂y (y) is the Jacobian of the function f. Here we have also introduced a positive semidefinite weighting matrix W for the purpose of weighting the relative importance of matching the prior along certain axes. In the standard case, W is an identity matrix, but it can also be useful to choose a nonuniform diagonal matrix to account for heterogeneous units (e.g., angle vs. position variables).
A remaining issue is that it is often difficult to explicitly compute the Jacobian of the IK function involved in f. In other words, with z ≡ z(y) denoting the 6 angles in the dependent chain, it is difficult to evaluate ∂z/∂y. So, we compute an implicit chart Jacobian by considering the implicit form of the constraints C(x B ) = 0. These vector-valued constraints state that the difference between the terminal frame of the subchain and the desired frame is zero.
We have the constraint equation: Taking the derivative of both sides of (11) with respect to y we get: and hence holds as long as ∂C ∂z is invertible, which is true everywhere except at singular conformations. Each derivative of C in the above expression is a submatrix of the Jacobian and can be computed using standard techniques.
Finally, since we obtain the Jacobian Beyond computing the proper sampling density, it is also important to design the algorithm to efficiently compute the M-H acceptance probability. Since clash detection takes 60 times more computation time than calculating the rest of the terms in Φ, we check collisions after determining whether a move will be accepted. Compared to the naive method, this method achieves an order of magnitude speedup.

Extension to other topologies
Although the core method applies to linear closed kinematic chains, it can be extended to handle other molecular topologies, such as free-endpoint chains and side-chains. In theory, polycyclic compounds may also be handled as well. Each new topological structure requires specialized block selection and sampling routines. For example, freeend-point chains need separate sampling subroutines for the start and end blocks. Standard MC methods are used to do so.
Side-chain deformations are important for shaping binding cavities, and SLIKMC can be adapted to generate side-chain conformations in the same graphical modeling framework. It is known that the side-chain conformation depends on the backbone dihedral angle of the corresponding residue [18]. This requires sampling side-chains after the backbone conformation is sampled. Furthermore, since the distribution of sidechain torsional angles are limited to small number of typical conformations (rotamers) for most residues [19], we sample side-chains according to experimentallydetermined distributions.

Side-chain sampling
For side-chain conformation priors we use the 2010 Backbone-dependent Rotamer Library [20]. In this library, each rotameric residue is associated with a list of rotamers which representing the high probability regions for side-chain torsion angles. The probability of a rotamer conformation c is modeled as a continuous distribution given the backbone dihedral angle pairs. The dihedral angle (, ψ) space of each rotameric backbone residue r is discretized into a grid and each cell [a, b] × [c, d] contains its experimentally observed probabilities P (c | r, a ≤ ≤ b, c ≤ ψ ≤ d). Each distribution over c is specified as a Gaussian mixture model. For non-rotameric residues the terminal c angles are handled specially due to the asymmetry in their distributions.
Treating the remainder of the protein as fixed, we model the target distribution of a side-chain x s , conditional on backbone dihedral angles x b , as follows: where j SC indicates steric-clashes and j R(r) indicates the side-chain conformation prior for residue r. Sidechain B-factors are typically not included since we want to give enough freedom to explore the conformation space, and our experiments indicate that the flexibility of the protein chain will reduce greatly when we specify B-factors as prior to both backbone and side-chain atoms.
Extending block sampling to include side-chains requires justifying the importance ratio carefully to ensure unbiased sampling. An efficient sampling procedure is as follows: first compute a closed-loop backbone subchain from the blocked Gibbs sampling step and compute its acceptance probability as usual. If accepted, sample each side chain along the block according to its backbone-dependent rotameric distribution. Because it is a Gaussian mixture, we can sample from j R(r) directly: pick a Gaussian from the mixture according to its weight and then sample from the Gaussian. Finally, reject the sample if the side chains collide.
To justify this procedure, we show that its acceptance probability is equal to the M-H acceptance probability for the entire block including side-chains. Let the block be x B = (x b , x s ) and a candidate block sample x' B = (x' b , x' s ), with b, s denoting backbone and side-chain variables respectively.

The M-H importance ratio is
Since we sample the side-chain according to Q(x s |x b ) = j R(r) (x s | x b ) and the prior sample is clash-free, we cancel terms in the numerator and denominator to get: Since the first term is simply the importance ratio of the backbone and j SC is binary, we conclude that the block acceptance probability is either the backbone importance ratio if clash-free or zero if clashing. Hence the side-chain sampling procedure is sound.

Multiply-closed kinematic loops
It may be possible to extend SLIKMC to handle multiplyclosed loops such as those that occur in polycyclic compounds. This requires special care to divide the structure into blocks that can be split into dependent and independent subchains, such that a conformation of the independent subset completely determines the dependent subset, up to some finite multiplicity. In other words, the independent subchains form a chart of the space of closed-chain conformations of the whole block. The union of all blocks must also cover all state variables.
We illustrate the principle on planar kinematic chains, which require blocks of size at least 4. Assume each cycle contains at least 3 joints. We define a topological ordering by selecting a linear main chain and considering branches off of the main chain. Non-branching linear blocks, free-endpoint blocks, and side-chains (openended branches) are handled as described above. Each 3joint branch off of a branching block is then considered as part of a dependent subchain (see Figure 5). Branches off the dependent subchain are also added to the block in a recursive manner, leading to a block with a tree topology.
To sample a branching block, we first sample values for the independent subchain at random and then close the loops for each branch according to their topological order. To ensure unbiased sampling, we must also calculate the metric tensor in (9) for the entire branched block. This in turn requires computing the Jacobian of the chart, which requires computing the Jacobian of the implicit form for the multiple loop-closure constraints (11). Due to the tree structure the Jacobian is sparse, and the matrix inversion in the implicit chart Jacobian (13) can also be computed efficiently. We have implemented this approach on 2D chains with closed rings (see Figure 6), and extending it to 3D chains remains a problem for future work.

Mixing and autocorrelation
In any MCMC method it is important to empirically examine the mixing rate of the Markov Chain. Firstly, it can potentially take many iterations to "forget" the effects of a poor initialization. For protein sampling, this is not a significant problem because we initialize the chain with the native structure in PDB, which is typically quite good.
Secondly, subsequent samples are highly autocorrelated, and many conformations must be skipped to obtain a sequence with low autocorrelation. This is a serious concern because autocorrelation grows stronger as more variables are included in the conformation (see Figure 7). In practice, one must determine the skip length empirically in order to obtain a quasi-independent sampling sequence, which is defined as a sequence with autocorrelation below some given threshold (0.2 is used in our experiments).

Result and discussion
The SLIKMC algorithm implements a scalable framework for Monte Carlo sampling of kinematic chains. The technique uses a blocked Gibbs sampler that proposes movements of small subchains of conformation angles at once, along with a Metropolis-Hastings technique that guarantees an unbiased sampling of the loopclosure submanifold for that block. Due to the small block size, each energy function is local and adjustments are fast, ranging from microseconds to milliseconds. The method is mathematically proven to generate a statistically unbiased sample in the large sample limit. It is particularly well-suited for closed loops (see Figure 8) but can also be advantageous for chains with free endpoints as well (see Figure 9). SLIKMC is implemented as an add-on to the software package LoopTK [21] [22] for protein loop sampling and is available at http://www.iu.edu/~motion/slikmc/. All experiments are run on a Intel i7 2.7 GHz computer with 4 GB RAM. The library currently supports sampling with prior information from Ramachandran plots, steric clashes, and B-factors, and supports integration with the Backbone-Dependent Rotamer Library for side-chain sampling. Numerical experiments suggest that SLIKMC generates higher quality samples for large loops with lower computational cost than standard Monte Carlo techniques for open-ended chains and the RAMP loop completion package [23].

Loop sampling with prior distributions
We consider the 10-residue closed loop 1AMP181-190, which is a representative segment for testing loop reconstruction algorithms [24]. SLIKMC is applied to sample 2000 conformations from a joint probability that includes steric clashes, Ramachandran plots, and B-factors. The Ramachandran plot (see Figure 10) shows that the distribution of dihedral angles is contained within high probability regions but explores relatively widely, with an average angular deviation of 37°.
We compared our method with the discrete-search loop construction software RAMP [8,23,25]. The latest version 0.7b was used in these experiments. We test the methods on loops of 1AMP with different lengths starting from residue 181. RAMP is tested by calling loop closure function with 0.5 Å distance tolerance. SLIKMC samples from perturbed segments using Ramachandran plots as prior with clash-free constraint. Figure 11 plots the average time for both methods to obtain one closed conformation. Due to combinational explosion, the time required for RAMP increases exponentially and is impractical for >6 residues.
We also compare SLIKMC with a sample-then-select inverse kinematics method that first samples a set of Figure 7 Mixing of SLIKMC samples. Sampling conformations of a planar 20-link chain, anchored at the endpoints, with a uniform prior. Left: starting from a deliberately bad initial conformation. Middle: the sequence mixes relatively quickly, but the first 40 samples are biased by the initial conformation and autocorrelate strongly. Right: a sequence that takes every 40'th sample does not significantly autocorrelate.

Figure 10
Ramachandran plot of SLIKMC samples. Left: the Ramachandran plot of generic residues from a database that includes 500 highresolution proteins [26] used as a prior. Right: the Ramachandran plot for the generic residues in our 10-residue test protein (1AMP 181-190) generated from 2,000 consecutive samples. Each color represents one residue. (This figure is best viewed in color.) Figure 11 Running time comparison between RAMP and SLIKMC. Time required for the discrete search method RAMP and SLIKMC to obtain one sample for loops of varying size. The time required for RAMP increases exponentially while our method runs in approximately constant time.
Zhang and Hauser BMC Structural Biology 2013, 13(Suppl 1):S9 http://www.biomedcentral.com/1472-6807/13/S1/S9 clash-free, loop-closing conformations and then extracts the top scoring ones. The LoopTK configuration sampling method [22] was used here. Given 300 s cutoff time, LoopTK generates 888 conformations, while our method generates 705. Figure 12 shows how the top 20 samples of LoopTK compare with every 100th sample of our method. The upper figures are generated using the original B-factors in the PDB file. SLIKMC matches the prior information more closely and obviates the need for postprocessing using numerical optimization. The bottom figures correspond to enlarged B-factors, which indicate less confidence in the values prescribed by the PDB file. The distribution of SLIKMC samples has approximately three times the variance of the original, which closely matches theoretical predictions. Figure 13 shows the distribution of RMSDs of C a atoms compared to the native structure for both LoopTK and SLIKMC with varying scales in B-factors. This suggests that SLIKMC better supports variations in the experimenter's relative confidence in heterogeneous sources of information.

Missing loop completion
We now consider an application to completion of missing loops. Given the starting position and ending position of a missing segment, we first generate an arbitrary loopclosing configuration, then run SLIKMC to perturb it to a high-probability conformation. As a test case, we select a helix structure (residue from 40-51) from an APO protein 1B8C. We generate an arbitrary loop-closing configuration by running the LoopTK configuration sampling method [22] to perturb the original conformation. Starting from the highly disordered conformation, SLIKMC is run for 2 minutes using priors including enlarged B-factors (scaled by a factor of 10) from original chain segment and Ramachandran plots with steric clash-free constraints. SLIKMC generates approximately 300 samples within the time limit, and the closest sample to the original structure is with RMSD 0.2704 calculated from backbone atoms (see Figure 14). In contrast, the LoopTK configuration sampling method did poorly in constructing a favorable missing loop.

Scalability tests on free-endpoint chains
To further study scalability, we apply SLIKMC to subchains of chain A in a calcium-binding protein 1B8C. Samples for a 30-residue subchain are generated in 1 s ( Figure 15) and samples for the entire 108-residue 1B8C protein are generated in approximately 4 s ( Figure 16).    We compare SLIKMC against a standard Metropolis-Hastings algorithm that samples backbone angles according to a Gaussian proposal distribution with 1°standard deviation. The target distribution for both methods includes steric clashes, Ramachandran plots, and B-factors. Note that standard M-H has probability zero of sampling a conformation that satisfies terminal endpoint constraints exactly, and is not applicable to closed loops. So, these tests ignore the loop closure constraint altogether. Figure 17 displays the average time needed to obtain one quasi-independent sample over ten 30-minute runs for different chain lengths. The skip lengths are determined empirically for each run. This data suggests that our method achieves a cost per quasi-independent sample that is nearly linear to the length of the chain. In contrast, the likelihood that standard M-H accepts a sample drops dramatically as the number of residues increases, leading to exponentially growing cost per sample.

Simultaneous backbone and side-chain sampling
We demonstrate backbone and side-chain sampling using a 15-residue helix structure 1AMP 120-134. As priors we use backbone-dependent rotamer distributions, Ramachandran plot priors, B-factors for the backbone, and testing self-collision and collision against the non-loop portion of the chain. Given 20 min cutoff time, 1,623 samples are generated. Figure 18 illustrates that in residue 130 (arginine), the distributions of torsional angles c 3 and c 4 are limited due to steric clashes, while c 1 and c 2 match well with the priors.

Conclusion
We propose SLIKMC -a Markov chain Monte Carlo method for sampling closed chains according to specified probability distribution. A probabilistic graphical model (PGM) is proposed to specify the structure preferences. A novel method for sampling sub-loops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints and mathematical conditions necessary for unbiased sampling is derived. Simulation experiments show that SLIKMC completes large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques.
SLIKMC is demonstrated to be applicable to various tasks such as conformation ensemble generation, missing structure construction. For future work we intend to integrate SLIKMC with more complex energy functions, statistical potentials, and machine-learning-based structural function predictors. Another limitation of the technique is that due to the locality of each block adjustment, large-magnitude global motions may take a huge number of iterations to sample, particularly when the motion must cross low-scoring chasms in conformation space. We intend to investigate annealing-like or random restart techniques for overcoming these difficulties, as well as different block choices that allow the algorithm to take larger steps. Finally, we are interested in extending our method to study simultaneous backbone and side-chain flexibility in protein-ligand and protein-protein binding.