- Methodology article
- Open Access
Defining structural and evolutionary modules in proteins: a community detection approach to explore sub-domain architecture
- Jose Sergio Hleap^{1}Email author,
- Edward Susko^{2} and
- Christian Blouin^{1, 3}
https://doi.org/10.1186/1472-6807-13-20
© Hleap et al.; licensee BioMed Central Ltd. 2013
- Received: 20 March 2013
- Accepted: 11 October 2013
- Published: 16 October 2013
Abstract
Background
Assessing protein modularity is important to understand protein evolution. Still the question of the existence of a sub-domain modular architecture remains. We propose a graph-theory approach with significance and power testing to identify modules in protein structures. In the first step, clusters are determined by optimizing the partition that maximizes the modularity score. Second, each cluster is tested for significance. Significant clusters are referred to as modules. Evolutionary modules are identified by analyzing homologous structures. Dynamic modules are inferred from sets of snapshots of molecular simulations. We present here a methodology to identify sub-domain architecture robustly, biologically meaningful, and statistically supported.
Results
The robustness of this new method is tested using simulated data with known modularity. Modules are correctly identified even when there is a low correlation between landmarks within a module. We also analyzed the evolutionary modularity of a data set of α-amylase catalytic domain homologs, and the dynamic modularity of the Niemann-Pick C1 (NPC1) protein N-terminal domain.
The α-amylase contains an (α/β)_{8} barrel (TIM barrel) with the polysaccharides cleavage site and a calcium-binding domain. In this data set we identified four robust evolutionary modules, one of which forms the minimal functional TIM barrel topology.
The NPC1 protein is involved in the intracellular lipid metabolism coordinating sterol trafficking. NPC1 N-terminus is the first luminal domain which binds to cholesterol and its oxygenated derivatives. Our inferred dynamic modules in the protein NPC1 are also shown to match functional components of the protein related to the NPC1 disease.
Conclusions
A domain compartmentalization can be found and described in correlation space. To our knowledge, there is no other method attempting to identify sub-domain architecture from the correlation among residues. Most attempts made focus on sequence motifs of protein-protein interactions, binding sites, or sequence conservancy. We were able to describe functional/structural sub-domain architecture related to key residues for starch cleavage, calcium, and chloride binding sites in the α-amylase, and sterol opening-defining modules and disease-related residues in the NPC1. We also described the evolutionary sub-domain architecture of the α-amylase catalytic domain, identifying the already reported minimum functional TIM barrel.
Keywords
- Community Detection Algorithm
- Cholesterol Binding
- Membership Vector
- Multiple Structure Alignment
- NPC1 Disease
Background
Jacob [1] stated that "Nature is a tinkerer and not an inventor". We still use a similar thought relating protein evolution, since domains are accepted as the protein evolutionary modules, and its modular reuse has been demonstrated in all domains of life [2]. This modularity gives protein structures enhanced flexibility [3] and might influence its ability to respond to selection. This ability is of main concern for evolutionary biology and is related to the robustness of a system [4, 5]. Robustness is the ability of a system to maintain its function under perturbations. A robust system neither increases nor decreases heritable phenotypic variation [5, 6]. In the protein world, the phenotype is the structure and the phenotypic variance is given by slight variations in protein structure shape.
In organismal biology, the coordination of subunits within a whole (e.g. mammalian limb bones, floral and leaf traits, parts of wings, individual organs, etc.) has long been known as morphological integration [7], which was renamed by evolutionary developmental biologists as modularity ( [8], and references therein). The modularity of a system is a property that is closely related with both evolvability and robustness [6, 9]. Such a property allows a system to increase its evolvability by diminishing adaptative constraints as well as giving the system the possibility for plasticity and the emergence of novel functions by rearranging the modules [9]. As stated in [8], integration and modularity concern the degree of covariation between parts of a structure. It is important, from an evolutionary viewpoint, to determine whether a structure is a single unit or consists of several modules. In molecular biology, the modularity of systems has been used to an extent, but more work has been done in systems biology [10–13] including analyses of metabolic networks [14–17], cell signaling networks [18–20], and protein interaction networks [21–25]. In the context of protein architecture, modularity has been used to refer to modules of exon shuffling [26, 27], and complexes of enzymatic machineries [11]. Some approaches to protein structure modularity have also been explored [6, 9, 28] showing modules as domains [29, 30] and also as sub-domain components [31–34]. However, the criterion to define protein modules depend on the definition of a proper quantitative treatment, which is not a trivial problem [9].
There have been different attempts to identify modules in protein structures [9] such as highly conserved close loops [35], foldons, and autonomous folding units [36]. Some of the aforementioned modules can only be identified experimentally and/or in single proteins. Another way, particularly robust, to perform modular decomposition is by using community detection algorithms [37] which have been applied extensively in system biology [10–25] as well as to the protein structure modularity identification problem [3, 38]. However, most of these attempts only consider the contact matrix [3, 38]. This approach bears no evolutionary information and depends exclusively in the definition of contact between residues [9]. We postulate that correlation information across a group of homologous structures (or a group of snapshots from a molecular dynamic simulation) is more relevant than molecular contact alone.
The analysis of graphs has become crucial to understand the features of different systems [39] such as community structure [40]. Several clustering algorithms have been developed (for a review on such algorithms see [39]) and applied successfully to different kinds of networks, such as networks of email messages [41], biological, and social networks [3, 37, 38, 42]. However, all clustering techniques including the graph-based ones, lack a statistical framework to determine the significance of the inferred clusters. This may lead to results that may not be biologically meaningful. In this paper we present a graph theory-based clustering method that includes a test of statistical significance, a power test, and a test for the accuracy the estimates giving the sample size (i.e. bootstrap). To do this, we propose a permutation-based t-test to assess statistical significance, and power test based on [43] to assess the reliability of the estimates. We also propose a bootstrap test and a power analysis to infer cluster robustness. These tests are applied to coordinate data, but can be generalized to other applications. Here a module is defined as any group of residues that has significant correlation within the group (i.e. among residues within group) and such correlation within is significantly higher than the one obtained when correlating this residues with residues of other groups in the dataset. To assess performance and illustrate the method, a simulation with one correlating module against background noise and one with two modules were performed. We also analyzed two different kinds of protein structure data sets: Dynamic modularity from a molecular dynamics simulation of the Niemann-Pick C1 (NPC1) protein, and evolutionary modularity from the α-amylase homologs. The former protein is part of a complex of two proteins (NPC 1 and 2) required for the cholesterol to exit the lysosomes [44]. The NPC1 N-terminal domain binds to the cholesterol in an orientation opposite to NPC2. Mutations of NPC1 N-terminal domain are involved in the development of the NPC1 disease, an inherited disorder associated with lipid metabolism [44]. Therefore, it is important to know its dynamics in solution and identify probable sub-domain architecture that can be related with function. The latter dataset (α-amylase) is a digestive enzyme that, by acting at random locations along the starch chain, hydrolyses the α-1,4 bonds of larger polysaccharides yielding glucose and maltose [45]. It is a phylogenetically widespread type of hydrolases with multiple industrial uses [46]. These enzymes are multidomain proteins, but share a common catalytic domain in the form of a (β/α)_{8}-barrel [47] which might give insight into fold evolution of the TIM-barrel, and the rise of its catalytic activity.
Methods
- 1.
Create test datasets: The input data (Section Input data) are cartesian coordinates of simulated data (Section Multivariate normal simulation) or coordinates of residues in the protein datasets (Sections Molecular dynamic simulation of NPC1 and α-amylase catalytic domain homologs).
- 2.
Align the structures when needed: In the case of the simulations (both Multivariate normal simulation and molecular dynamics simulation) there is no need to align, since they all share the same plane, and absolute rotation (Section Structural alignment).
- 3.
Extract the information as landmarks (Section Landmark definition) and residues contact matrix (Section Inter-residue contact definition).
- 4.
Create a graph where each landmark is a node and they are connected if significant correlation is found (Section Graph construction).
- 5.
Test if the partition of the data (grouping of residues) is statistically significant (Section Statistical significance test of clusters: controlling the false positives).
- 6.
Test for statistical power of each partition (Section Statistical power test of clusters: acknowledging the false negatives probability).
- 7.
Test for the stability of the partition to sample size: Bootstrapping (Section Bootstrapping: measuring the accuracy of sample estimates).
Input data
Multivariate normal simulation
The result is a matrix with a set of correlated variables (cluster), surrounded by random (uncorrelated) variables. Cluster intracorrelations ranged from 0 to 1 in increments of 0.05. The first 60 entries (accounting for a cluster with 30 elements with X and Y coordinates) had a given correlation, while 140 entries (accounting for 70 landmarks) where uncorrelated.
The resulting matrix contains two cluster which intracorrelations ranged from 0 to 1 in increments of 0.05.
The output of the simulation is a set of coordinates for a given number of samples to which the method (explained in subsequent sections) will be applied.
Molecular dynamic simulation of NPC1
The Niemann-Pick, type C1 (NPC1; PDB code: 3GKH) N-terminal domain was simulated in solution with the ligand (cholesterol), using the software GROMACS 4 [48]. The force fields modes used for the simulations were OPLS-AA/L for the protein, and the TIP3P for the water molecules. The data was collected every two picoseconds for 10 nanoseconds discarding the first 2 nanoseconds of simulation. All other parameters where left as default. This process was performed using 100 CPU cores on a computer cluster, in triplicates. Four thousand samples where gathered and analyzed. The data includes the coordinates for each atom of each residue for the NPC1 protein simulated across snapshots of the simulation. This dataset was ensemble to test the method (explained in the subsequent sections) in dynamic data. That is, to find correlating residues for a protein in solution.
α-amylase catalytic domain homologs
PDB codes of the α -amylase homologs and the species from it was crystallized
PDB code | Species | PDB code | Species |
---|---|---|---|
1A47.A | T. thermosulfurigenes EM1 | 1PIF.A | S. scrofa |
1AMY.A | H. vulgare | 1PJ9.A | B. circulans |
1AQH.A | P. haloplanctis | 1PPI.A | S. scrofa |
1AQM.A | P. haloplanctis | 1QHO.A | B. stearothermophilus |
1AVA.A | H. vulgare | 1QHP.A | stearothermophilus |
1B0I.A | P. haloplanctis | 1S46.A | N. polysaccharea |
1BAG.A | B. subtilis | 1SMD.A | H. sapiens |
1BF2.A | P. amyloderamosa | 1TMQ.A | T. molitor |
1BLI.A | B. licheniformis | 1U33.A | H. sapiens |
1BSI.A | H. sapiens | 1UA3.A | S. scrofa |
1BVZ.A | T. vulgaris R47 | 1UA7.A | B. subtilis |
1CDG.A | B. circulans | 1UOK.A | B. cereus |
1CGT.A | B. circulans,s8 | 1VB9.A | T. vulgaris R47 |
1CGU.A | B. circulans,s8 | 1VFM.A | T. vulgaris R47 |
1CIU.A | T. thermosulfurigenes EM1 | 1VFO.A | T. vulgaris R47 |
1CLV.A | T. molitor | 1VIW.A | T. molitor |
1CXE.A | B. circulans | 1VJS.A | B. licheniformis |
1CXF.A | B. circulans | 1WZK.A | T. vulgaris R47 |
1CXK.A | B. circulans | 1WZL.A | T. vulgaris R47 |
1CXL.A | B. circulans | 1WZM.A | T. vulgaris R47 |
1CYG.A | B. stearothermophilus | 1ZS2.A | N. polysaccharea |
1D3C.A | B. circulans | 2AAA.A | A.niger |
1DHK.A | S. scrofa | 2QMK.A | H. sapiens |
1EH9.A | S. solfataricus KM1 | 2QV4.A | H. sapiens |
1EHA.A | S. solfataricus KM1 | 2TAA.A | A. oryzae |
1G5A.A | N. polysaccharea | 3BAI.A | H. sapiens |
1G94.A | P. haloplanctis | 3BAJ.A | H. sapiens |
1GCY.A | P. stutzeri | 3BAW.A | H. sapiens |
1GJU.A | T. maritima | 3BMV.A | T. thermosulfurigenes EM1 |
1GJW.A | T. maritima | 3BMW.A | T. thermosulfurigenes EM1 |
1HNY.A | H. sapiens | 3CGT.A | B. circulans,s8 |
1HVX.A | B. stearothermophilus B | 3L2L.A | S. scrofa |
1HX0.A | S. scrofa | 3L2M.A | S. scrofa |
1JAE.A | T. molitor | 3UEQ.A | N. polysaccharea |
1JFH.A | S. scrofa | 4CGT.A | B. circulans,s8 |
1JG9.A | N. polysaccharea | 5CGT.A | B. circulans,s8 |
1JGI.A | N. polysaccharea | 6CGT.A | B. circulans,s8 |
1JIB.A | T. vulgaris R47 | 6TAA.A | A. oryzae |
1KCL.A | B. circulans | 7CGT.A | B. circulans,s8 |
1MVY.A | N. polysaccharea | 7TAA.A | A. oryzae |
1OB0.A | B. licheniformis | 8CGT.A | B. circulans,s8 |
1OSE.A | S. scrofa | 9CGT.A | B. circulans,s8 |
1OT1.A | B. circulans |
Structural alignment
The flexible structure aligner MATT with default parameters was used to align the structures and therefore deals with rotation, translation and natural deformations. This method allows local geometric flexibility for protein structures producing alignments with low root-mean-square deviations (RMSD), and estimating a p-value expressing the likelihood that a given alignment score can be generated by the alignment of unrelated proteins [50]. The multiple structure alignment outputted by MATT is then processed and analyzed as explain in the subsequent sections.
Landmark definition
A landmark is a point, vertex, site or control point in a shape object (protein or simulation object in our case) that can be found repeatedly (and consistently) in a group of such objects [51]. Here we define a landmark as the centroid of homologous residues in a multiple structure alignment. The residue centroid is used to include both sequence (residue side chain) and geometry, as opposed to only the geometry of the backbone. To do this the coordinates of all heavy atoms (A) are taken into account.
Inter-residue contact definition
Inter-residue contact maps are a widely used approach to analyze protein structures [52]. They are also important to understand protein folding and stability [53], and to identify residues playing structural and/or functional roles [52]. Despite this and the advances in the contact definition ( [52] and references therein), accurate contact map predictions are still mainly unsolved. There are some proposed tests [52] and software [54] but they are mainly using C_{ α }-C_{ α } or C_{ β }-C_{ β } distances with a threshold of about 7 to 8 Å [52, 54]. However, these types of contacts are a mere approximation to true contacts. Here we defined a contact between any two residues if the distance between them is equal or less than 4.5 Å in an all-atom (all side chain atoms) contact analysis. The all-atom approach is more accurate since it takes into account the distance between each possible pair of atoms in two side chains.
Graph construction
where the left-hand side of the equation 1 is the Fisher transformation of the estimated correlation r. The right-hand side of the equation 1 is the critical value for an alpha-level test of the null hypothesis that the correlation is 0. There, the Z _{ α } is the standard score which allows us to calculate the probability of a value occurring within our normal distribution and compare scores from different distributions.
This step is done to simplify the graph building process such that insignificant correlations are ignored.
where the value for the i th dimension, P _{ i }, is either r or 0 depending on the result in (1). The Ξ are obtained for each pair of landmarks and assign the edges of an undirected graph S, using the python-igraph library [55].
The summation in equation 2 is performed to agglomerate the dimensions (dimension reduction minimizing information loss). Since r is not additive and r^{2} is, the sum of r^{2} is the appropriate way to add the correlations without violating non-additivity. Also, Ξ is the correlation vector magnitude that guarantees that if there is any correlation in any of the dimensions, Ξ will include it, regardless of the vector direction. Let us assume that a given residue is highly and significantly correlated in the X axis, but poorly and/or not significantly correlated in Y or Z axes. Ξ will reflect such correlation since the residues must behave completely independent for Ξ to be zero or close to zero.
Graph abstraction
Let S = (N, f) be an undirected graph, where N is a list of nodes (landmarks), and f is a function $f:N\times N\to \mathbb{k}$ that assigns an edge weight to each landmark pair. An edge E _{ ij } is assigned only if Ξ_{ ij } > 0, and if the residues are in contact. The edge weight value is set to Ξ_{ ij }.
Community structure or clustering optimization
where m is the number of edges in the graph, A _{ vw } represents the weight of the edge between vertices v and w, ${\sum}_{w}{A}_{\mathit{\text{vw}}}$ and ${\sum}_{v}{A}_{\mathit{\text{vw}}}$ are the weighted degree of a vertex (v or w), defined to be the sum of the edge weights of the adjacent edges for each vertex. C _{ v } and C _{ w } are communities to which the vectors v and w belong to, and the δ is a binary function where δ(C _{ v }, C _{ w }) is 1 if C _{ v } = C _{ w } and 0 otherwise.
The modularity index (Q) is then the proportion of edges shared within groups minus the expected proportion if edges were distributed at random. For a given partition, Q indicates the density of nodes within groups when compared against a random distribution of connections regardless the partition. Q ranges between -1 and 1. If positive, there are more connections inside the module than expected by chance and therefore a possible community structure [56, 57] (i.e. partition or clustering of the data). In our case, a partition made by the optimization of Q is a group of residues that correlate in space (i.e. move together) given the sample. If the sample is across homologous proteins, the cluster or partition represents a concerted movement in the evolution of the protein. Sampling across molecular dynamic simulation snapshots represents parts of the protein that are moving together in solution.
The output is a membership vector that corresponds to the community structure (partition or clustering) in the graph of landmarks. It is interpreted as a set of clusters which number is given by the optimization procedure and therefore there is no need for an a priori determination of the number of clusters to be obtained. Each cluster is assumed to be a putative module but this membership vector provides no support or information about its statistical robustness and significance.
Statistical significance test of clusters: controlling the false positives
Despite the usefulness and ubiquity of tests using similar algorithms, the question of significance of clusters is critical since there is no support for the obtained clusters, and therefore its validity is questionable. To test if each cluster is significant, a permutation t-test [58] (as implemented in R [59, 60]) is applied.
The rationale for the test is based on the definition of cluster as an entity where the distribution of correlation of the elements inside the cluster (intracorrelation) is significantly distinct from the distribution of correlation with elements from other clusters (intercorrelation). This test is applied for each possible pair of clusters defined by a membership vector. For a given pair of clusters, we compare the distribution of the intracorrelation for that cluster with the distribution of intercorrelations for this pair. If one cluster is artificially broken down by the clustering algorithm, there should be no significant differences between the distribution of intra and inter-correlations.
Because the test is performed for a number of pairs, multiple comparisons are made. Let M(A) and M(B) be the mean intracorrelations for two clusters A and B, found by the community detection algorithm. Let M(A B) be the mean intercorrelation. The null hypothesis we test is H _{0} : M(A) = M(A B). With more than two clusters the number of comparisons (K _{ C }) will be K(K - 1), K being the number of clusters. If a single-inference procedure is used, this can result in an false increased significance which we correct for using the Benjamini-Hochberg False Discovery Rate correction (FDRc) procedure [61].
For example, a given set of homologous proteins is analyzed with our method and a possible partition is obtained. This will give us different pieces of the protein that correspond to groups of residues that are correlating (moving together) more within each cluster than among clusters. We use the correlations inside a given group and test against the correlation that exist between that group and other groups. If there is no significant difference, both entities are moving together and therefore should be merged.
Refinement of the membership vector
The results of the significance testing are summarized into a new graph. Let graph S = (C, E) be a directed graph, where C is a list of inferred clusters, and E a list of assigned edges. There will be a directed edge from cluster C _{ u } to cluster C _{ v } if the hypothesis that M(u) is distinct from M(u v) cannot be rejected. If C _{ u } and C _{ v } are connected by a bi-directional edge, they are merged into a single cluster. The process is iterated until no clusters can be merged.
Following the example in the previous section, let’s assume that the protein dataset analyzed was partitioned into 4 groups of residues (A, B, C, and D). Each of those groups will be the vertices (nodes) in a new graph. We will draw an arrow if there is no significant differences between a given group and other (e.g. correlations within A are not significantly different than the correlations between residues in A and residues in B). If this is reciprocal (e.g. correlations within B are not significantly different than the correlations between residues in B and residues in A), both groups of residues are merged.
Statistical power test of clusters: acknowledging the false negatives probability
where p is the number of pairs i, j in cluster C.
For instance, take a cluster (group of residues from the previous example) A that contains 10 elements, and 45 entries in the upper triangle of its correlation matrix. Assume that A was inferred with 100 observations (protein structures from the example). With that sample size, ρ _{ res } will be approximately 0.28 with a power of 0.8 and a significance level of 0.05. If two thirds of the entries in the upper triangle of the correlation matrix of A are below ρ _{ res }, PVP _{ A } will be equal to 0.66. In other words, for 30 entries of the correlation matrix we estimate that there was a power of 0.8 or greater. If there are clusters created by optimizing the modularity score using weakly correlated landmarks, this cluster’s PVPs will tend to be close to 0. This test is post-hoc, and is only to inform about the robustness of the partition created.
Bootstrapping: measuring the accuracy of sample estimates
All methods described here were performed with original python scripts (otherwise stated), available upon request to the authors and licensed under a GPL agreement.
Results and discussion
In this section we present the outcome of two simulations to validate the method. The parameters for the simulations can be found in the Methods section. We also present two real datasets. First a set of snapshots from a molecular dynamics simulation (MDS) of the NPC1 N-terminal domain are analyzed to provide insights into the modular architecture of dynamic data. That is, group of residues that move together in solution.
Then a set of homologous structures of the alpha-amylase catalytic domain are use to test the sub-domain architecture at the evolutionary level. A module here (diferent than the MDS modules) refers to a group of residues that are moving together in the evolution of the structure.
- 1.
A graph is built where the vertices are the centroids of residues. The correlation between coordinates is set as edge weight if it is significant (see equations 1 and 2), and if the two residues are in contact (See Contact definition in Methods).
- 2.
The community structure in the graph is inferred by fast-greedy (evaluating and selecting the best result at each step, as opposed to maximizing at the end of the scoring process) optimization of a modularity score (Q; see equation 3).
- 3.
The statistical support for each cluster is obtained.
- 4.
The solution is refined based on this statistical support.
- 5.
The statistical power to resolve each partition with respect to the size of the dataset is estimated (equations 4 and 5).
- 6.
The stability of the estimates with respect to the sampling error is measured using bootstrapping (Figure 2).
Finally, we present the results of the analysis of two protein data sets: a molecular dynamics simulation of the NPC1 protein, and a multiple structure alignment of the α-amylase catalytic domain homologs.
Simulations
Correlated landmark data was simulated using a multivariate normal simulation. Intracorrelations ranged from 0 to 1 in 0.05 increments. The parameters to make the simulations are explained in the Methods section.
Estimated correlations: precision of the simulations
Precision of the simulations with one module in background noise
Sample | Intra-cluster | Intracorrelation quantiles | Background correlation quantiles | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Size | correlation | 0% | 25% | 50% | 75% | 100% | 0% | 25% | 50% | 75% | 100% |
100 | 0.2 | -0.087 | 0.145 | 0.206 | 0.274 | 0.545 | -0.348 | -0.058 | 0.009 | 0.077 | 0.304 |
0.4 | 0.061 | 0.288 | 0.341 | 0.395 | 0.589 | -0.314 | -0.075 | -0.0064 | 0.059 | 0.331 | |
0.6 | 0.334 | 0.542 | 0.578 | 0.619 | 0.765 | -0.361 | -0.072 | -0.002 | 0.066 | 0.296 | |
0.8 | 0.651 | 0.771 | 0.791 | 0.811 | 0.875 | -0.365 | -0.073 | 0.002 | 0.067 | 0.291 | |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.209 | -0.069 | -0.019 | 0.051 | 0.232 | |
0.2 | 0.061 | 0.168 | 0.193 | 0.220 | 0.354 | -0.169 | -0.029 | 0.001 | 0.029 | 0.139 | |
0.4 | 0.261 | 0.365 | 0.389 | 0.411 | 0.494 | -0.172 | -0.029 | 0.002 | 0.0332 | 0.204 | |
500 | 0.6 | 0.485 | 0.562 | 0.580 | 0.597 | 0.656 | -0.152 | -0.031 | 2 × 10^{-5} | 0.031 | 0.149 |
0.8 | 0.743 | 0.776 | 0.786 | 0.795 | 0.824 | -0.161 | -0.041 | -0.011 | 0.019 | 0.144 | |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.142 | -0.0350 | 1.7 × 10^{-4} | 0.029 | 0.087 | |
0.2 | 0.087 | 0.170 | 0.190 | 0.209 | 0.308 | -0.131 | -0.021 | 7.2 × 10^{-4} | 0.023 | 0.116 | |
0.4 | 0.338 | 0.391 | 0.407 | 0.422 | 0.4745 | -0.108 | -0.016 | 0.005 | 0.0267 | 0.126 | |
1000 | 0.6 | 0.535 | 0.580 | 0.591 | 0.603 | 0.647 | -0.103 | -0.018 | 0.002 | 0.023 | 0.129 |
0.8 | 0.759 | 0.791 | 0.798 | 0.804 | 0.827 | -0.131 | -0.024 | -0.005 | 0.012 | 0.083 | |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.038 | -0.014 | 0.009 | 0.024 | 0.068 |
Precision of the simulations with two modules
Sample | Intra-cluster | Intracorrelation quantiles | Intercorrelation quantiles | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Size | correlation | 0% | 25% | 50% | 75% | 100% | 0% | 25% | 50% | 75% | 100% |
0.2 | -0.12 | 0.112 | 0.178 | 0.247 | 0.532 | -0.349 | -0.061 | 0.009 | 0.081 | 0.357 | |
0.4 | 0.135 | 0.347 | 0.401 | 0.451 | 0.627 | -0.372 | -0.075 | -0.014 | 0.049 | 0.314 | |
100 | 0.6 | 0.405 | 0.581 | 0.617 | 0.653 | 0.758 | -0.468 | -0.225 | -0.165 | -0.104 | 0.116 |
0.8 | 0.716 | 0.776 | 0.795 | 0.813 | 0.879 | -0.256 | -0.087 | -0.049 | -0.011 | 0.147 | |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.018 | -0.018 | -0.018 | -0.018 | -0.018 | |
0.2 | 0.059 | 0.172 | 0.2 | 0.229 | 0.337 | -0.177 | -0.036 | -0.008 | 0.023 | 0.151 | |
0.4 | 0.286 | 0.371 | 0.393 | 0.414 | 0.495 | -0.099 | 0.013 | 0.04 | 0.066 | 0.183 | |
500 | 0.6 | 0.516 | 0.587 | 0.602 | 0.618 | 0.67 | -0.099 | 0.005 | 0.029 | 0.051 | 0.153 |
0.8 | 0.748 | 0.793 | 0.801 | 0.808 | 0.838 | -0.069 | 0.014 | 0.031 | 0.05 | 0.131 | |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.012 | -0.012 | -0.012 | -0.012 | -0.012 | |
0.2 | 0.109 | 0.182 | 0.201 | 0.221 | 0.301 | -0.116 | -0.023 | -0.001 | 0.021 | 0.128 | |
0.4 | 0.321 | 0.388 | 0.405 | 0.421 | 0.488 | -0.133 | -0.03 | -0.011 | 0.008 | 0.089 | |
1000 | 0.6 | 0.531 | 0.582 | 0.593 | 0.605 | 0.647 | -0.077 | -0.008 | 0.008 | 0.023 | 0.091 |
0.8 | 0.764 | 0.793 | 0.799 | 0.805 | 0.825 | -0.052 | 0.007 | 0.021 | 0.034 | 0.082 | |
1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.026 | 0.026 | 0.026 | 0.026 | 0.026 |
Performance of the the method
Clusters, significance, PVP and Bootstrap support for the simulated data
Clusters | Significance | PVP | Bootstrap |
---|---|---|---|
One module on background noise | |||
A | < 0.0375 | 0.617 | 93% |
Singletons | 0.134 | 0.006 | 9% |
Two modules | |||
A | < 0.0125 | 0.631 | 100% |
B | < 0.0125 | 0.640 | 100% |
In simulations with correlations of 0.35, the method was able to identify the "true" cluster (Table 4) with a relatively low number of observations (85 in this case), and the power analysis gives an estimation of robustness. In the simulation, only cluster A has enough power (0.617 estimated PVP) to resolve almost two thirds of the components of the cluster. The other "module" is a collection of singletons which has an estimated PVP close to zero. Similarly, the bootstrap value highly supports the "true" cluster, while the group of singletons is ruled out. Here we show that the significance test efficiently deals with false positives, the PVP gives information about the strength of clustering, and the bootstrap gives information about the repeatability of the clusters when re-sampling.
Protein data sets
The coordinate data was collected using two different strategies. The structure of a protein was simulated using molecular dynamics (MD) to produce molecular motion over time. From these simulations, snapshots of atomic coordinates were captured. Sites that move together over time are expected to have correlating centroid coordinates. Therefore any module inferred from this kind of data indicates the mechanistic component of the protein structure in solution. The Niemann-Pick, type C1 (NPC1; PDB code: 3GKH) N-terminal domain was simulated in solution, using the software GROMACS 4 [48]. All the parameters for the MD simulation are explained in the Methods section.
The second type of data is based on homology. In this case, homologous structures are aligned. The centroid coordinates of sites that are packed together and interacting across the evolutionary samples are expected to correlate. A module inferred from this type of data indicates that a defined subset of the protein structure is evolving as a unit. The α-amylase catalytic domain dataset from the Homstrad database [49] was used to assess clustering in an evolutionary perspective. The structures used, the sampling strategy, and the alignment method used are explained further in the Methods section.
Dynamic modules of the Niemann Pick C1 protein N-terminal domain
Figure 4 shows the modules gathered when the module identification is applied to the molecular dynamics simulation of the NPC1 N-terminal domain snapshots. All these modules showed a bootstrap above 66.7% and a PVP over 0.96. Interestingly, all modules are related with the binding pocket, surrounding the cholesterol molecule.
The first module (Figure 4A) encloses three cholesterol binding residues, and another binding residue to the N-Acetyl-D-Glucosamine (NAG). It also spans a residue associated with the development of the NPC1 disease in adulthood [72]. All other residues correlating with these seem to give structural support to the back of the cholesterol binding pocket, as well as serving as receptacles for both ligands. This region also encompasses four residues containing single nucleotide polymorphisms (SNPs) for the human gene [73].
In Figure 4B, a module that comprises more than half of the residues that make the sterol pocket is shown. From these residues, this module is the only one that includes the non-hydrophobic ones, being of importance in the direct protein - 3 β-hydroxyl interactions, as well as the water-mediated interaction with such groups. This helps in the stabilization of the bounded sterols and giving a particular stereo-specificity [44]. This module is located in the bottom part of the binding pocket and can be seen as a "glue" for the secondary structures in contact with such pockets, and therefore maintaining the shape of the structure in its less movable part. This also supports the model in [44], where the sterol opening needs to move in order to uptake the cholesterol from NPC2. This module also contains three SNPs found in the human gene [73].
The module shown in Figure 4C, shows the residues responsible for the water (black arrow) and sterol (gray arrow) openings described in [44] as being of functional importance to the cholesterol uptake and the retention of it in the binding site. If some residues within this module are mutated, the cholesterol might not be taken by the protein and the Niemann-Pick disease is expressed [44]. This module also includes two cholesterol binding sites, a residue shown to be related to the development of the disease in infantile stages [74], and a SNP [73].
The module in Figure 4D shows a small module that coincides with functionally important residues involved in the affinity for cholesterol binding [44]. These thus may be related to the expression of the NPC disease. Giving that these modules are analyzed in the light of dynamics, the module in Figure 4D shows that the affinity for the cholesterol mediated by these residues is given by geometric constraints induced by cholesterol binding.
Figure 4E shows a module that encloses two binding residues to NAG. It has also been shown that two residues are important in the development of a late infantile NPC1 disease [72, 74], and one SNP is also enclosed. It seems to be also of structural support for the cholesterol binding pocket in the top(E), creating a pocket that receives the ligand.
The module shown in Figure 4F encloses the α-helices 3, 7 and 8, that have been shown to play an important role in the access and release of cholesterol, since its movement controls the enlargement of the sterol opening [44]. This module also contains some of the residues that decrease the cholesterol transfer to the liposomes if mutated [44], as well as four SNPs [73]. The module shown in Figure 4F is therefore of functional importance for the intake and outtake of cholesterol.
Since there are disease-related mutations in all of the modules, it would be important to further study the relationship between modules and protein function. The correlation within modules is large enough to think of them as units, and therefore it is probable that the residues exposed in [44, 73, 74] are not the only major contributors to the disease. Further confirmation of the effects of mutations within these modules is needed.
Evolutionary modules in the α-amylase catalytic domain
Starch is the main storage of carbohydrates in plants. Processing it and discovering novel poly and oligosaccharides is important for biotechnological and chemo industrial applications [75]. Most starch-related enzymes are classified within the α-amylase family. This family catalyzes the hydrolysis of α-(1,4) glycosidic bonds of polysaccharides, and therefore is classified as glycoside hydrolases [76]. This a multi-reaction catalytic family, since its members can catalyze different reactions (hydrolysis, transglycosylation, condensation and cyclization) [77]. Industrially, some α-amylases are used in the production of ethanol [78], high-fructose corn syrup [79], and other oligosaccharides. It is therefore of industrial and biological importance. It has a highly symmetrical TIM-barrel ((β/α)_{8}) catalytic domain [75]. This fold is highly versatile and widespread among the structurally characterized enzymes, being present in almost 10% of them [80–83]. There has been a debate about the type of evolution that this fold has been through: convergent, divergent, or both [80]. However, there is evidence supporting the divergent evolution hypothesis [81]. The catalytic activity and substrate binding residues occur at the C-termini of β-strands and in loops that extend from these strands [75].
Clusters, significance, PVP and Bootstrap support for the α -amylase data set
Modules | Significance | PVP | Bootstrap |
---|---|---|---|
B | < 0.001 | 0.479 | 31.3% |
C | 0.005 | 0.440 | 42.9% |
D | < 0.001 | 0.503 | 39.9% |
E | < 0.001 | 0.580 | 51.7% |
Putative meaning of the sub-domain architecture
So far we have shown the significant partitions of a domain. But what is the probable meaning of such modules? One might think that this modules can represent autonomous folding units (AFU), however our data show discontinuous amino acid sequences (in one dimension, since they are in contact in 3D space) per module. Also, comparative analysis with the dataset analyzed by [86] showed no relationship with our grouping (Data not shown). Another plausible hypothesis could be assign modules to close loops, but the same continuity argument can be brought upon. Furthermore, the α-amylase subdomains identified by our method span several of the TIM-barrel close loops exposed by [87] with no particular pattern. This discrepancies are expected, since the definition of foldons, AUFs and close loops have little or no meaning in an evolutionary perspective. These concepts are derived from the analysis of single structures and their internal interactions (i.e. contact matrix, physical interactions, length, distance) and therefore the non-evolutionary approaches for sub-domain determination will identify a different kind of module than an evolutionary approach.
On a more related framework, [88] developed a method to test co-evolving sites. When tested on the α-amylase dataset used in this article, no pattern correlating the two methods were found (Data not shown). Moreover, the largest significant grouping of co-evolving residues with [88] method span only 10 residues of the protein. This discrepancy can be attributed to the fact that [88] are testing co-evolution in a sequence based perspective. That is, giving a phylogenetic tree and its source alignment, which residues have significant mutual information. This method disregard completely the geometry of protein structures, therefore answering a different question than our approach.
So what is the possible meaning of our sub-domains? Despite more work (both bioinformatic and experimental) is needed to clearly address this question, the sub-domain architecture here represented is probably co-evolving geometric units (in the case of homologous sampling) and semi-rigid components (in the dynamic perspective) of proteins.
Conclusions
Protein structures have a modular architecture. Such modularity can be seen as hierarchical because there are different degrees of integration among its modules. The domain architecture has been shown to harbor evolutionary and structural coherence [9, 29, 38, 89–92]. However, there is also evidence of a sub-domain architecture of protein structures that can drive protein structure evolution. Here we introduced a robust and significant way to identify such sub-domain architecture, giving information about the result’s power with a finite sample size number, providing ways of assessing the significance for clustering and the strength of correlations within clusters. With enough sampling the method correctly and confidently identifies modules with an intracorrelation as low as 0.05 (nearly random) for simulated data. In real datasets our method is able to capture functional, structural, and evolutionary information, returning sensible results.
The NPC1 N-terminal domain depicted a sub-domain architecture when tested in dynamic data, showing a correlation between its modularity and its proposed function. Further analysis of these modules, and experimental tests (e.g. directed mutagenesis) in these modules might provide important insights in the protein function and evolution, as well as important information for possible treatments of the NPC1 related diseases.
Evolutionarily, the α-amylase family displayed a clear sub-domain architecture. All its modules were tightly connected with its catalytic capabilities. These results give some insights into the evolution of a common fold, the TIM barrel, that have been of wide interest [77, 80–83, 85]. It can also provide guidance for new improvements of thermal stability, substrate plasticity, and in bioengineering of the amylase’s function.
A drawback of the sub-domain modularity identification for homologous aligned proteins presented here, is the relative low power and support for them. However, the support values can be improved by using a bigger dataset (if possible; see Methods). We are exploring the use of homology modeling to enrich the available data for a given protein system, and the mixture of MD simulations and evolutionary sampling.
Declarations
Acknowledgements
The authors thank Professor N. Zeh, Professor R. Beiko, Conor Mehan, and the members of Dr. Beiko’s Lab in Dalhousie University for some helpful suggestions and comments on the manuscript. To Wilson Chan for some of the molecular dynamic simulations, Alex Safatli and Kyle Nguyen for the collaboration in some of the code. We also thank Liz Mackay and Johanna Zaglauer for editing the manuscript. This study was funded by NSERC through the grant No. 120504858. This work was partially supported by The Departamento Administrativo de Ciencia y Tecnología - Colciencias (Colombia) through the CALDAS scholarship.
Authors’ Affiliations
References
- Jacob F: Evolution and tinkering. Science 1977, 196(4295):1161–1166. 10.1126/science.860134View ArticlePubMedGoogle Scholar
- Voigt CA, Martinez C, Wang ZG, Mayo SL, Arnold FH: Protein building blocks preserved by recombination. Nat Struct Mol Biol 2002, 9(7):553–558.Google Scholar
- Del Sol A, Araúzo-Bravo MJ, Amoros D, Nussinov R: Modular architecture of protein structures and allosteric communications: potential implications for signaling proteins and regulatory linkages. Genome Biol 2007, 8(5):R92. 10.1186/gb-2007-8-5-r92PubMed CentralView ArticlePubMedGoogle Scholar
- Pigliucci M: Is evolvability evolvable? Nat Rev Genet 2008, 9: 75–82. 10.1038/nrg2278View ArticlePubMedGoogle Scholar
- Wagner A: Robustness and evolvability: a paradox resolved. Proc R Society B Biol Sci 2008, 275(1630):91–100. 10.1098/rspb.2007.1137View ArticleGoogle Scholar
- Rorick MM, Wagner GP: Protein structural modularity and robustness are associated with evolvability. Genome Biol Evol 2011, 3: 456. 10.1093/gbe/evr046PubMed CentralView ArticlePubMedGoogle Scholar
- Cheverud JM: Developmental integration and the evolution of pleiotropy. Am Zool 1996, 36: 44–50.Google Scholar
- Klingenberg CP: Morphometric integration and modularity in configurations of landmarks: tools for evaluating a priori hypotheses. Evol Dev 2009, 11(4):405–21. 10.1111/j.1525-142X.2009.00347.xPubMed CentralView ArticlePubMedGoogle Scholar
- Rorick M: Quantifying protein modularity and evolvability: a comparison of different techniques. BioSystems 2012, 110: 22–33. 10.1016/j.biosystems.2012.06.006View ArticlePubMedGoogle Scholar
- Kitano H: Systems biology: a brief overview. Science 2002, 295(5560):1662–1664. 10.1126/science.1069492View ArticlePubMedGoogle Scholar
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631–636. 10.1038/nature04532View ArticlePubMedGoogle Scholar
- Popescu GV, Popescu SC: Complexity and modularity of MAPK signaling networks. In Handbook of Research on Computational and Systems Biology: Interdisciplinary Applications. Volume 1. Edited by: Liu LA, Wei D, Li Y. Hershey, PA: IGI Global; 2011:355–368.View ArticleGoogle Scholar
- Fraser JS, Gross JD, Krogan NJ: From systems to structure: bridging networks and mechanism. Mol Cell 2013, 49(2):222–231. 10.1016/j.molcel.2013.01.003PubMed CentralView ArticlePubMedGoogle Scholar
- Holme P: Metabolic robustness and network modularity: a model study. Plos One 2011, 6(2):e16605. 10.1371/journal.pone.0016605PubMed CentralView ArticlePubMedGoogle Scholar
- Yamada T, Bork P: Evolution of biomolecular networks–lessons from metabolic and protein interactions. Nat Rev Mol Cell Biol 2009, 10(11):791–803. 10.1038/nrm2787View ArticlePubMedGoogle Scholar
- Takemoto K, Borjigin S: Metabolic network modularity in archaea depends on growth conditions. PloS one 2011, 6(10):e25874. 10.1371/journal.pone.0025874PubMed CentralView ArticlePubMedGoogle Scholar
- Zhou W, Nakhleh L: Convergent evolution of modularity in metabolic networks through different community structures. BMC Evol Biol 2012, 12: 181. 10.1186/1471-2148-12-181PubMed CentralView ArticlePubMedGoogle Scholar
- Sudol M, Harvey KF: Modularity in the Hippo signaling pathway. Trends Biochem Sci 2010, 35(11):627–633. 10.1016/j.tibs.2010.05.010View ArticlePubMedGoogle Scholar
- Pan CQ, Sudol M, Sheetz M, Low BC: Modularity and functional plasticity of scaffold proteins as p(l)acemakers in cell signaling. Cell Signal 2012, 24(11):2143–2165. 10.1016/j.cellsig.2012.06.002View ArticlePubMedGoogle Scholar
- Tran PV, Lachke SA, Stottmann RW: Toward a systems-level understanding of the Hedgehog signaling pathway: defining the complex, robust, and fragile. Wiley Interdiscip Rev Syst Biol Med 2013, 5: 83–100. 10.1002/wsbm.1193View ArticlePubMedGoogle Scholar
- Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 2009, 27(2):199–204. 10.1038/nbt.1522View ArticlePubMedGoogle Scholar
- Kim J, Tan K: Discover protein complexes in protein-protein interaction networks using parametric local modularity. BMC Bioinformatics 2010, 11: 521. 10.1186/1471-2105-11-521PubMed CentralView ArticlePubMedGoogle Scholar
- Seebacher J, Gavin AC: SnapShot: Protein-protein interaction networks. Cell 2011, 144(6):1000. 10.1016/j.cell.2011.02.025View ArticlePubMedGoogle Scholar
- Taylor IW, Wrana JL: Protein interaction networks in medicine and disease. Proteomics 2012, 12(10):1706–1716. 10.1002/pmic.201100594View ArticlePubMedGoogle Scholar
- Di Paola L, De Ruvo M, Paci P, Santoni D, Giuliani A: Protein contact networks: an emerging paradigm in chemistry. Chem Rev 2013, 113(3):1598–1613. 10.1021/cr3002356View ArticlePubMedGoogle Scholar
- Patthy L: Genome evolution and the evolution of exon-shuffling – a review. Gene 1999, 238: 103–114. 10.1016/S0378-1119(99)00228-0View ArticlePubMedGoogle Scholar
- Xing Y, Lee CJ: Protein modularity of alternatively spliced exons is associated with tissue-specific regulation of alternative splicing. PLoS Genet 2005, 1(3):e34. 10.1371/journal.pgen.0010034PubMed CentralView ArticlePubMedGoogle Scholar
- Gherardini PF, Ausiello G, Russell RB, Helmer-Citterich M: Modular architecture of nucleotide-binding pockets. Nucleic Acids Res 2010, 38(11):3809–16. 10.1093/nar/gkq090PubMed CentralView ArticlePubMedGoogle Scholar
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.PubMedGoogle Scholar
- Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJP, Chothia C, Murzinm AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 2008, 36(Database issue):D419-D425.PubMed CentralPubMedGoogle Scholar
- Berezovsky IN, Trifonov EN: Loop fold nature of globular proteins. Protein Eng 2001, 14(6):403–407. 10.1093/protein/14.6.403View ArticlePubMedGoogle Scholar
- Fedorov A, Cao X, Saxonov S, De Souza SJ, Roy SW, Gilbert W: Intron distribution difference for 276 ancient and 131 modern genes suggests the existence of ancient introns. Proc Nat Acad Sci 2001, 98(23):13177–13182. 10.1073/pnas.231491498PubMed CentralView ArticlePubMedGoogle Scholar
- Gelly JC, De Brevern AG, Hazout S: 'Protein Peeling’: an approach for splitting a 3D protein structure into compact fragments. Bioinformatics 2006, 22(2):129–133. 10.1093/bioinformatics/bti773View ArticlePubMedGoogle Scholar
- Ahnert SE, Johnston IG, Fink TMA, Doye JPK, Louis AA: Self-assembly, modularity, and physical complexity. Phys Rev E 2010, 82(2):026117.View ArticleGoogle Scholar
- Sobolevsky Y, Frenkel ZM, Trifonov EN: Combinations of ancestral modules in proteins. J Mol Evol 2007, 65(6):640–650. 10.1007/s00239-007-9032-xView ArticlePubMedGoogle Scholar
- Haglund E, Danielsson J, Kadhirvel S, Lindberg MO, Logan DT, Oliveberg M: Trimming down a protein structure to its bare foldons. J Biol Chem 2012, 287(4):2731–2738. 10.1074/jbc.M111.312447PubMed CentralView ArticlePubMedGoogle Scholar
- Girvan M, Newman MEJ: Community structure in social and biological networks. Proc Natl Acad Sci USA 2002, 99(12):7821–7826. 10.1073/pnas.122653799PubMed CentralView ArticlePubMedGoogle Scholar
- Feldman HJ: Identifying structural domains of proteins using clustering. BMC Bioinformatics 2012, 13: 286. 10.1186/1471-2105-13-286PubMed CentralView ArticlePubMedGoogle Scholar
- Fortunato S: Community detection in graphs. Phys Rep 2010, 486(3–5):75–174. 10.1016/j.physrep.2009.11.002View ArticleGoogle Scholar
- Clauset A, Newman MEJ, Moore C: Finding community structure in very large networks. Phys Rev E 2004, 70: 066111.View ArticleGoogle Scholar
- Tyler JR, Wilkinson DM, Huberman BA: Email as Spectroscopy: Automated Discovery of Community Structure within Organizations. Technical report: Hewlett-Packard Labs; 2003:81–96.Google Scholar
- Novák P, Neumann P, Macas J: Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 2010, 11: 378. 10.1186/1471-2105-11-378PubMed CentralView ArticlePubMedGoogle Scholar
- Cohen J: Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: L. Erlbaum Associates; 1988.Google Scholar
- Kwon HJ, Abi-Mosleh L, Wang ML, Deisenhofer J, Goldstein JL, Brown MS, Infante RE: Structure of N-terminal domain of NPC1 reveals distinct subdomains for binding and transfer of cholesterol. Cell 2009, 137(7):1213–1224. 10.1016/j.cell.2009.03.049PubMed CentralView ArticlePubMedGoogle Scholar
- Schenkels LC, Veerman EC, Amerongen AVN: Biochemical composition of human saliva in relation to other mucosal fluids. Crit Rev Oral Biol Med 1995, 6(2):161–175. 10.1177/10454411950060020501View ArticlePubMedGoogle Scholar
- Kirk O, Borchert TV, Fuglsang CC: Industrial enzyme applications. Curr Opin Biotechnol 2002, 13(4):345–351. 10.1016/S0958-1669(02)00328-2View ArticlePubMedGoogle Scholar
- MacGregor EA, Janeček Š, Svensson B: Relationship of sequence and structure to specificity in the α-amylase family of enzymes. Biochim Biophys Acta (BBA) Protein Struct Mol Enzymol 2001, 1546: 1–20. 10.1016/S0167-4838(00)00302-2View ArticleGoogle Scholar
- Hess B, Kutzner C, van der Spoel D, Lindahl E: GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J Chem Theor Comput 2008, 4(3):435–447. 10.1021/ct700301qView ArticleGoogle Scholar
- Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 1998, 7(11):2469–71. 10.1002/pro.5560071126PubMed CentralView ArticlePubMedGoogle Scholar
- Menke M, Berger B, Cowen L: Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol 2008, 4: e10. 10.1371/journal.pcbi.0040010PubMed CentralView ArticlePubMedGoogle Scholar
- Dryden IL, Mardia KV: Statistical Shape Analysis. 1 edition. Chichester: Wiley; 1998.Google Scholar
- Faure G, Bornot A, de Brevern AG: Protein contacts, inter-residue interactions and side-chain modelling. Biochimie 2008, 90(4):626–39. 10.1016/j.biochi.2007.11.007View ArticlePubMedGoogle Scholar
- Punta M, Rost B: Protein folding rates estimated from contact predictions. J Mol Biol 2005, 348(3):507–512. 10.1016/j.jmb.2005.02.068View ArticlePubMedGoogle Scholar
- Yuan C, Chen H, Kihara D: Effective inter-residue contact definitions for accurate protein fold recognition. BMC Bioinformatics 2012, 13: 292. 10.1186/1471-2105-13-292PubMed CentralView ArticlePubMedGoogle Scholar
- Csardi G, Nepusz T: The igraph software package for complex network research. Inter J 2006, Complex Systems: 1695. http://igraph.sf.net Google Scholar
- Newman MEJ: Analysis of weighted networks. Phys Rev E Stat Nonlin Soft Matter Phys 2004, 70(5 Pt 2):056131.View ArticlePubMedGoogle Scholar
- Newman ME: Modularity and community structure in networks. Proc Nat Acad Sci 2006, 103(23):8577–8582. 10.1073/pnas.0601602103PubMed CentralView ArticlePubMedGoogle Scholar
- Good P: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. 2nd edition. New York: Springer; 2000.View ArticleGoogle Scholar
- DCT R: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. . [ISBN 3–900051–07–0]. http://www.R-project.org/ Google Scholar
- Maindonald J, Braun WJ: DAAG: Data Analysis and Graphics Data and Functions. Vienna, Austria: R Foundation for Statistical Computing; 2011. . [R package version 1.08]. http://CRAN.R-project.org/package=DAAG Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 1995, 57: 289–300.Google Scholar
- Champely S: pwr: Basic Functions for Power Analysis. 2009. . [R package version 1.1.1]. http://CRAN.R-project.org/package=pwr Google Scholar
- Muirhead RJ: Aspects of Multivariate Statistical Theory. Hoboken: John Wiley & Sons Inc.; 2008.Google Scholar
- Berge C, Froloff N, Kalathur RKR, Maumy M, Poch O, Raffelsberger W, Wicker N: Multidimensional fitting for multivariate data analysis. J Comput Biol 2010, 17(5):723–32. 10.1089/cmb.2009.0126View ArticlePubMedGoogle Scholar
- Mishra P, Pandey PN: A graph-based clustering method applied to protein sequences. Bioinformation 2011, 6(10):372–374. 10.6026/97320630006372PubMed CentralView ArticlePubMedGoogle Scholar
- Patterson MC, Vanier MT, Suzuki K, Morris JA, Carstea E, Neufeld EB, Blanchette-Mackie JE, Pentchev PG, Chapter 145: Niemann-pick disease type C: a lipid trafficking disorder. In Scriver’s OMMBID: The Online Metabolic and Molecular Bases of Inherited Disease. Edited by: Valle D, Beaudet AL, Vogelstein B, Kinzler KW, Antonarakis SE, Ballabio A, Scriver CR, Sly WS, Childs B. New York: Mc-Graw HIll Inc; 2006.Google Scholar
- Garver WS, Krishnan K, Gallagos JR, Michikawa M, Francis GA, Heidenreich RA: Niemann-Pick C1 protein regulates cholesterol transport to the trans-Golgi network and plasma membrane caveolae. J Lipid Res 2002, 43(4):579–589.PubMedGoogle Scholar
- Davies JP, Ioannou YA: Topological analysis of Niemann-Pick C1 protein reveals that the membrane orientation of the putative sterol-sensing domain is identical to those of 3-hydroxy-3-methylglutaryl-CoA reductase and sterol regulatory element binding protein cleavage-activating protein. J Biol Chem 2000, 275(32):24367–24374. 10.1074/jbc.M002184200View ArticlePubMedGoogle Scholar
- Infante RE, Radhakrishnan A, Abi-Mosleh L, Kinch LN, Wang ML, Grishin NV, Goldstein JL, Brown MS: Purified NPC1 protein: II. Localization of sterol binding to a 240-amino acid soluble luminal loop. J Biol Chem 2008, 283(2):1064–1075. 10.1074/jbc.M707944200View ArticlePubMedGoogle Scholar
- Watari H, Blanchette-Mackie EJ, Dwyer NK, Glick JM, Patel S, Neufeld EB, Brady RO, Pentchev PG, Strauss JF: Niemann-Pick C1 protein: obligatory roles for N-terminal domains and lysosomal targeting in cholesterol mobilization. Proc Nat Acad Sci 1999, 96(3):805–810. 10.1073/pnas.96.3.805PubMed CentralView ArticlePubMedGoogle Scholar
- Humphrey W, Dalke A, Schulten K: VMD – Visual molecular dynamics. J Mol Graph 1996, 14: 33–38. 10.1016/0263-7855(96)00018-5View ArticlePubMedGoogle Scholar
- Fancello T, Dardis A, Rosano C, Tarugi P, Tappino B, Zampieri S, Pinotti E, Corsolini F, Fecarotta S, D’Amico A, et al.: Molecular analysis of NPC1 and NPC2 gene in 34 Niemann–Pick C Italian patients: identification and structural modeling of novel mutations. Neurogenetics 2009, 10(3):229–239. 10.1007/s10048-009-0175-3View ArticlePubMedGoogle Scholar
- Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A: LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 2005, 21(12):2814–2820. 10.1093/bioinformatics/bti442View ArticlePubMedGoogle Scholar
- Millat G, Baïlo N, Molinero S, Rodriguez C, Chikh K, Vanier MT: Niemann-Pick C disease: use of denaturing high performance liquid chromatography for the detection of NPC1 and NPC2 genetic variations and impact on management of patients and families. Mol Genet Metab 2005, 86(1–2):220–232.View ArticlePubMedGoogle Scholar
- Svensson B: Protein engineering in the α-amylase family: catalytic mechanism, substrate specificity, and stability. Plant Mol Biol 1994, 25(2):141–157. 10.1007/BF00023233View ArticlePubMedGoogle Scholar
- Davies G, Henrissat B: Structures and mechanisms of glycosyl hydrolases. Structure 1995, 3(9):853–859. 10.1016/S0969-2126(01)00220-9View ArticlePubMedGoogle Scholar
- Ben Ali M, Khemakhem B, Robert X, Haser R, Bejar S: Thermostability enhancement and change in starch hydrolysis profile of the maltohexaose-forming amylase of Bacillus stearothermophilus US100 strain. Biochem J 2006, 394(Pt 1):51–56.PubMed CentralView ArticlePubMedGoogle Scholar
- Bothast RJ, Schlicher MA: Biotechnological processes for conversion of corn into ethanol. Appl Microbiol Biotechnol 2005, 67: 19–25. 10.1007/s00253-004-1819-8View ArticlePubMedGoogle Scholar
- Visuri K, Klibanov AM: Enzymatic production of high fructose corn syrup (HFCS) containing 55% fructose in aqueous ethanol. Biotechnol Bioeng 1987, 30(7):917–920. 10.1002/bit.260300715View ArticlePubMedGoogle Scholar
- Farber GK: An α/β-barrel full of evolutionary trouble. Curr Opin Struct Biol 1993, 3(3):409–412. 10.1016/S0959-440X(05)80114-9View ArticleGoogle Scholar
- Höcker B, Jürgens C, Wilmanns M, Sterner R: Stability, catalytic versatility and evolution of the (β/α)8-barrel fold. Curr Opin Biotechnol 2001, 12(4):376–381. 10.1016/S0958-1669(00)00230-5View ArticlePubMedGoogle Scholar
- Wierenga RK: The TIM-barrel fold: a versatile framework for efficient enzymes. FEBS Lett 2001, 492(3):193–198. 10.1016/S0014-5793(01)02236-0View ArticlePubMedGoogle Scholar
- Gerlt JA, Raushel FM: Evolution of function in (β/α)8-barrel enzymes. Curr Opin Chem Biol 2003, 7(2):252–264. 10.1016/S1367-5931(03)00019-XView ArticlePubMedGoogle Scholar
- Machius M, Declerck N, Huber R, Wiegand G: Activation of Bacillus licheniformis α-amylase through a disorder→order transition of the substrate-binding site mediated by a calcium–sodium–calcium metal triad. Structure 1998, 6(3):281–292. 10.1016/S0969-2126(98)00032-XView ArticlePubMedGoogle Scholar
- Ben Ali M, Ghram M, Hmani H, Khemakhem B, Haser R, Bejar S: Toward the smallest active subdomain of a TIM-barrel fold: Insights from a truncated α-amylase. Biochem Biophys Res Commun 2011, 411(2):265–270. 10.1016/j.bbrc.2011.06.114View ArticlePubMedGoogle Scholar
- Fischer KF, Marqusee S: A rapid test for identification of autonomous folding units in proteins. J Mol Biol 2000, 302(3):701–712. 10.1006/jmbi.2000.4049View ArticlePubMedGoogle Scholar
- Frenkel ZM, Trifonov EN: Closed loops of TIM barrel protein fold. J Biomol Struct Dyn 2005, 22(6):643–655. 10.1080/07391102.2005.10507032View ArticlePubMedGoogle Scholar
- Dutheil J, Galtier N: Detecting groups of coevolving positions in a molecule: a clustering approach. BMC Evol Biol 2007, 7: 242. 10.1186/1471-2148-7-242PubMed CentralView ArticlePubMedGoogle Scholar
- Tiana G, Shakhnovich BE, Dokholyan NV, Shakhnovich EI: Imprint of evolution on protein structures. Proc Nat Acad Sci USA 2004, 101(9):2846–2851. 10.1073/pnas.0306638101PubMed CentralView ArticlePubMedGoogle Scholar
- Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA: The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res 2007, 35(Database issue):D291-D297.PubMed CentralView ArticlePubMedGoogle Scholar
- Forslund K, Henricson A, Hollich V, Sonnhammer ELL: Domain tree-based analysis of protein architecture evolution. Mol Biol Evol 2008, 25(2):254–64. 10.1093/molbev/msm254View ArticlePubMedGoogle Scholar
- Buljan M, Bateman A: The evolution of protein domain families. Biochem Soc Trans 2009, 37(Pt 4):751–755.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.