Superfamily space of flexibility
In order to get results from a varied and representative number of superfamilies, we looked for structural diversity, non-redundancy, and good distribution of domain size. Additionally, enough number of structures and a good percentage of the reference domain sequence length forming the core of the alignment was another selection criterion. In total, we finally selected 55 superfamilies in CATH version v3.0.0 containing at least 20 non-redundant members (redundancy defined as 95% of sequence identity or higher), belonging to all possible structural classes (α, β, α+β), and with a good span in sequence size (30–459 aa). The decomposition of the conformational space defined by a given superfamily was done following the same approach developed for flexible fitting in tridimensional electron microscopy (3D-EM) in the presence of incomplete data [21]. All the domains of the superfamily were structurally aligned using MAMMOTH [28] against the reference domain, that was studied with MD (Figure 1). The domains with a statistical significance score of -ln(E) > 5 as provided by MAMMOTH where used to build the core of the structural alignment for the superfamily (red box, Figure 9), being the rest excluded (purple discontinuous domain, Figure 9). The condition for an aminoacid of the reference domain to be part of the core is to be aligned at least once with the rest of the superfamily members (example in blue box, Figure 9). The 55 superfamilies selected for this study had at least 10 domains and 68% of the reference domain sequence length belonging to the core, with most of them showing even a higher value (90%), thus providing data with as least missing values as possible.
Once the domains were aligned, the coordinates of the aminoacids in the core were used to build the coordinate displacement vectors (cdv's):
(1)
where x, y, z stand for the coordinates of the same backbone atom n (Cα, O, N and C) in two structurally aligned aminoacids, each one belonging to one domain (i for the reference, j for the aligned). A CDV vector was created by using all the cdv's obtained for the atoms of a given aligned domain, placing x, y, z coordinates in consecutive indexes. Then a CDV matrix was built with all the CDVs as its columns (one per aligned domain). The CDV matrix was decomposed with the incremental singular value decomposition (ISVD) algorithm [29] to capture the main axes of variation (Figure 1). The use of ISVD, a variant of the single value decomposition (SVD) method [30], allows us to manage superfamilies with incomplete information in the core due to gaps in the alignment, since it can handle matrices for which some of the values of their elements are unknown. In any case, aminoacids in the reference domain that cannot be aligned in any of the pairwise alignments using MAMMOTH (black box, Figure 9) were excluded of further analysis. When ISVD is applied to the CDV matrix it produces:CDV = U·S·VT
U - 4*3*m × n-1 matrix containing an orthogonal basis for the multi-dimensional space defined by the CDVs, were m is the number of aminoacids in the core and n is the number of superfamily members used in the procedure. 4 comes for the 4 backbone atoms employed and 3 comes from the x, y, z coordinates.
S - n-1 × n-1 diagonal matrix containing the n-1 singular values of the decomposition.
V - n-1 × n-1 matrix containing an orthogonal basis for the space of the rows of CDV.
The elements of the columns of U define a new basis for CDV which, ranked by the relative value of the singular values in S, best explains the structural variation among the aligned domains. The ISVD algorithm estimates the incomplete columns of the original CDV matrix during the decomposition procedure in an incremental fashion, starting with the columns with less missing values. If the next CDV vector c has missing values, denoted as c0, they are estimated by:
(3)
where Z is the set of values that minimize the sum of squared errors for the known values, denoted as c•, when solving:
(4)
In eqs. (3) and (4), and are the corresponding rows of U' for the missing and known data, respectively. U' and S' are the decomposition matrices calculated in intermediate steps of the ISVD procedure. The interested reader is referred to [29] for the theory behind the ISVD, and to [21] for a complete explanation of the adaptation of ISVD to structural alignments of superfamilies. As in Principal Component Analysis (PCA), the result of both SVD and ISVD calculations is a transformation of the initial variation matrix into a set of orthogonal movements characterized by a set of singular vectors (which indicates the nature of the essential movement) and a set of singular values which, after transformation by eq. 5, are equal to the PCA eigenvalues.
(5)
where n is the number of snapshots used for the decomposition, l
i
is the PCA eigenvalue and s
i
is the [I]SVD singular value. Note that the original protein Cartesian coordinates appear now as projections onto the space defined by the singular vectors without any loss of structural information.
Molecular-dynamics space of flexibility
The range of conformations accessible for a protein under normal physiological conditions can be well explored by molecular dynamics (MD) simulations. The technique samples the movements of macromolecules by integration of Newton equations of motion, with the forces being obtained from an accurate potential functional (the force field) fitted to reproduce high accurate quantum mechanical data in small model systems [31, 32]. In opposition to Normal Mode Analysis, atomistic MD does not assume that the protein should be confined in a harmonic well around the experimental structure, allowing then, if required by the physics of the system, large conformational transitions. It is the best technique to explore the physical deformation space for proteins.
The reference protein domains were simulated in the context of the whole native protein. All protein structures were titrated, neutralized by ions, minimized, hydrated, heated and equilibrated (for at least 0.5 ns) using a well established protocol [20]. Trajectories were collected using AMBER parm99 force field [33] in conjunction with Jorgensen's TIP3P model [34, 35] for representing water molecules. Particle Mesh Ewald approach was used to deal with long-range effects [36]. Integration of motion equations was performed every 1 fs, the vibrations of bonds involving hydrogen atoms being removed by SHAKE algorithm [37]. Production runs were obtained with the program AMBER8 [38] and were extended for 10 ns. Computational effort performed here corresponds to more than 20 CPU years and were done thanks to access to large supercomputer resources.
Statistical descriptors for comparison
The MD and SF-spaces were subjected, for comparison purposes, to a modified version of the essential dynamics procedure [39] using SVD (with MD-space) and ISVD (with SF-space) decompositions. Many comparisons can be easily made using the singular vectors and values provided by the decomposition algorithms:
1) The size of deformability space was measured by the variance in MD or superfamily ensembles, summing the square of the singular values obtained after the decomposition. To avoid bias related to the limited number of structures in most superfamilies, the analysis of MD variance was repeated also using as many equally spaced MD snapshots as superfamily members (partial-MD space; MDp). The average values for 100 windows were computed.
2) The complexity of the deformability space was determined by the number of singular vectors needed to explain 90% of the variance.
3) The overlap between the SF- and MD-spaces was determined using the Hess metric [40] and associated Z-score (eqs. 6 and 7; [41]).
(6)
where X and Y stand for the two methods, the indexes i and j stand for the orders of the eigenvectors (ranked according to their contribution to the structural variance), and n stands for the number of superfamily members.
(7)
Pure random models were obtained by decomposition of a pseudo-covariance matrix obtained by random permutation of the backbone atoms for each snapshot in a trajectory, and the standard deviation (std) was obtained by considering 500 different pseudo-covariance matrices.
Additional Z-scores* (labeled with * to avoid confusion with previous Z-scores derived from purely random models) showing the relevance of the values for H in a more chemically sound environment were computed from models where the chemical connectivity was maintained and steric collapses were avoided. For this purpose, we performed several 10 ns discrete dynamics simulations for each protein with a simplified force-field defined by covalent bonds plus a hard sphere potential for each atom [42]. Essential dynamics from these trajectories provided sets of singular vectors being representative from random movements but still consistent with the basic physics of the protein. The standard deviations needed for Z-score calculations were evaluated from independent discrete dynamics simulations.
4) The coverage of MD-space achieved by the SF-space was measured by analyzing the distribution of the projections of the superfamily members on the essential subspace defined by the two first singular vectors of the MD-space (essential MD-space). The essential MD-space was divided into 9 equivalent portions were the maximum X and Y values were determined by the smallest and largest projection values achieved during the 10 ns trajectories. The coverage was evaluated as the number of portions of the MD-essential space that were visited by at least one superfamily member (example in Figure 10). Similar results were obtained changing the number of portions. Note that a low coverage can obey to the intrinsic differences between MD and superfamily-derived samplings, but also to the limited number of superfamily members available. In order to distinguish between both sources of deviation we also computed the coverage for the partial MD-space.
5) Individual mobility of residues was determined by the residue B-factors:
(8)
where ⟨Δr2⟩ stands for the oscillations of atoms around equilibrium positions.
Due to the fact that the structural alignment of the superfamilies yields incomplete sets of coordinates, we applied a Metropolis Monte Carlo algorithm with a Hamiltonian method [41] which allowed us to obtain energetically permitted projections along each singular vector within the SF-space (see eq. 9). The displacements obtained can then be projected to generate Cartesian "pseudo-trajectories" which have complete coordinates and are representative of the superfamily ensemble. The B-factors can be easily obtained from this pseudo-trajectory.
(9)
where n is the number of superfamily members and stands for a displacement along a given mode (i) in the space X. is the stiffness constant associated with a deformation mode, computed as k
b
T/(2l
i
), with kb being Boltzmann's constant, l
i
the corresponding PCA eigenvalue and T the absolute temperature.