- Research article
- Open Access
Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces
BMC Structural Biology volume 8, Article number: 37 (2008)
Many studies have examined rules governing two aspects of protein structures: short segments and proteins' structural domains. Nevertheless, the organization and nature of the conformational space of segments with intermediate length between short segments and domains remain unclear. Conformational spaces of intermediate length segments probably differ from those of short segments. We investigated the identification and characterization of the boundary(s) between peptide-like (short segment) and protein-like (long segment) distributions. We generated ensembles embedded in globular proteins comprising segments 10–50 residues long. We explored the relationships between the conformational distribution of segments and their lengths, and also protein structural classes using principal component analysis based on the intra-segment Cα-Cα atomic distances.
Our statistical analyses of segment conformations and length revealed critical dual transitions in their conformational distribution with segments derived from all four structural classes. Dual transitions were identified with the intermediate phase between the short segments and domains. Consequently, protein segment universes were categorized. i) Short segments (10–22 residues) showed a distribution with a high frequency of secondary structure clusters. ii) Medium segments (23–26 residues) showed a distribution corresponding to an intermediate state of transitions. iii) Long segments (27–50 residues) showed a distribution converging on one huge cluster containing compact conformations with a smaller radius of gyration. This distribution reflects the protein structures' organization and protein domains' origin. Three major conformational components (radius of gyration, structural symmetry with respect to the N-terminal and C-terminal halves, and single-turn/two-turn structure) well define most of the segment universes. Furthermore, we identified several conformational components that were unique to each structural class. Those characteristics suggest that protein segment conformation is described by compositions of the three common structural variables with large contributions and specific structural variables with small contributions.
The present results of the analyses of four protein structural classes show the universal role of three major components as segment conformational descriptors. The obtained perspectives of distribution changes related to the segment lengths using the three key components suggest both the adequacy and the possibility of further progress on the prediction strategies used in the recent de novo structure-prediction methods.
Vast amounts of three-dimensional (3D) protein data from structural genomic studies and other individual efforts have been added to our knowledge, thereby enhancing our understanding of protein structures. To date, only two extremes of protein structural data have been studied. One extreme includes local features of proteins: those of short protein segments, typically of 10 residues long or less. The other extreme includes global features of proteins: protein folds or structural domains.
Short protein segments have been studied extensively, partly because many methods are available for analyzing the local features of proteins. Various measures, such as RMSDs after structural superposition [1–3], Cα-Cα atomic distances coupled with the torsion angles [4, 5], dihedral angles, and so on have been used to define the conformational similarity of protein segments. Different clustering techniques, such as k-means clustering [7, 8], hierarchical methods, competitive learning [6, 10], and other methods, have been used to describe the organization of the segments' conformational space. The abundance of research results in this area is also partly attributable to various applications of the clustering results of the short segments. A set of representatives from the resulting clusters is often called structural building blocks (SBBs). Even when using different procedures, clustering resolutions of SBBs can be categorized into only a few levels depending mainly on their respective applications, such as structural modeling, verification, comparison, and prediction [6, 12]. The most dominant cluster of the short segments, which is common in all studies, corresponds to α-helices, whereas the variability of β-strands is observed at high-resolution clustering. Regarding global features of proteins, understanding of their organization and analysis of the protein-fold (or structural domain) space are progressing well.
As reviewed recently, both hierarchical and continuous aspects of fold space have been realized. Regarding hierarchical classification, widely used databases such as CATH and SCOP have been constructed. Other databases such as FSSP and VAST have been developed. They are based on continuous measurements of protein structural similarity. Several studies have provided insights into the nature of fold space. Holm and Sander first described the conformational distribution of protein folds in a fold universe with multi-dimensional scaling methods based on an all-on-all comparison using the Dali program. Using the same measurement, Hou et al. showed visual representations of the protein fold universe and identified three major components which characterize the fold space: secondary structure compositions, chain topologies, and the protein domain size.
Compared to these two extremes, limited surveys have been done on the conformational space of medium size segments between protein short segments and folds. Specifically, supersecondary structures such as α-hairpin, βαβ-unit, and β-hairpin are typical structural motifs of medium size; those motifs have been analyzed. For example, Salem et al. reported that most superfolds contain a higher proportion of their α-helical or β-strand residues in one such supersecondary structure. Szustakowski et al. built a dictionary of supersecondary structures. Kurgan and Kedarisetti studied regularity among twilight zone protein structures at the level of the sequence segments that correspond to the secondary structure fragments of varying length. However, the organization and statistical properties of the whole conformational space of medium-to-long segments remain unclear. Statistical and systematic analyses should be done on the 'segment universe' from short to long lengths to bridge this gap.
Our previous study identified structural clusters and visualized the uneven distribution of short segments in the conformational spaces of 6–22 residues, where known and novel secondary-structure motifs are distributed as isolated clusters. The general features of the segment distribution were consistent for these lengths. However, the question we sought to answer is: Do spaces of long segments differ from those of short segments? In this study, we explore the relationships between the conformational distribution of segments and their length: 10–50 residues, thereby providing a global view of a 'segment universe' and showing critical dual changes (i.e. dual transitions) of the distribution shape in the conformational space of short to long segments. The critical changes might reflect changes of the protein structures' organization. Therefore, the present results suggest the adequacy and the possibility of further progress of the hierarchical treatment used in the recent de novo structure prediction methods. Furthermore, by comparing conformational components among structural classes (i.e., all-α, all-β, α/β, and α+β), we demonstrate the specificity and generality of protein fold classes.
Transitions of segment distribution: short, medium, and long segments
The coverage of segments in cluster(s) was calculated as described below. A densely populated region in the 3D principal component analysis (PCA) space was defined as a cluster. Given a density threshold, the segments are classifiable into two groups: those in regions of a density larger than the threshold and those outside the regions. The coverage of segments in clusters is defined as the ratio of the segments in the regions to all the segments.
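The coverage definition above can be illustrated with a minimal sketch. The density estimate used here is an assumption made only for illustration: points in the 3D PCA space are binned into cubic grid cells, and the density of a cell is its count divided by the count of the most populated cell. The function name `coverage` and the `cell` parameter are hypothetical; the density definition actually used in the study is given in the cited earlier work and may differ.

```python
import math
from collections import Counter


def coverage(points, threshold, cell=1.0):
    """Fraction of points lying in grid cells whose relative density
    is larger than `threshold`.

    Assumption (not from the paper): density of a cell is its point
    count divided by the count of the most populated cell.
    """
    # Assign every 3D point to a cubic grid cell of edge length `cell`.
    cells = [tuple(math.floor(c / cell) for c in p) for p in points]
    counts = Counter(cells)
    peak = max(counts.values())
    # Count points that fall inside sufficiently dense cells.
    inside = sum(1 for c in cells if counts[c] / peak > threshold)
    return inside / len(points)
```

Evaluating `coverage` over a range of thresholds for each segment length would give curves of the kind plotted in Fig. 1, under the stated density assumption.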
Figure 1a portrays the coverage of segments versus the density threshold for the conformational spaces of 10, 20, 30, 40, and 50 residue lengths. The coverage curves exhibited a transition from concave shapes for short lengths (10 and 20 residues long) to convex ones for long lengths (30, 40, and 50 residues long). Notably, the differences of coverage at a density of 0.2 or less show a transition between the short and long segments. For instance, at a density of 0.1, the coverage is only 16.3% for 10 residues, although the coverage is greater than 50% for 30 residues. In addition, at a density of 0.01, the coverage for 10 residues is 45.6%, although coverage for 30 residues is 91.9%. These quantitatively indicate that the density gradient of the conformational space changes markedly with segment elongation.
Further analyses of the coverage curves between the short and long segment lengths revealed the boundaries of the distribution changes. Figure 1b shows coverage curves for lengths of 21–30 residues. The dual and critical transitions, with an intermediate phase for segment lengths of 23–26 residues, can be recognized clearly (Fig. 1b). The transitions at intermediate length are also characterized by the distributional alteration of the radius of gyration of segments in the populated region with density of 0.10–0.35 (Fig. 2). To adjust for the effect of different segment lengths, we defined here the relative score (F_Rg) of the radius of gyration for a segment as $F_{Rg}(i,j) = \frac{Rg(i,j) - \mathrm{Min}\,Rg(j)}{\mathrm{Max}\,Rg(j) - \mathrm{Min}\,Rg(j)}$, where Rg(i,j) denotes the radius of gyration of a segment i with length j, and where Max Rg(j) and Min Rg(j) represent the maximal and minimal radius of gyration of the entire segment dataset with length j. Based on these observations, the segment length is categorized into the following three groups: short (10–22 residues), medium (23–26 residues), and long (27–50 residues). Subsequent analyses that visualize the 3D PCA space show that changes in the density gradient are associated with distributional alterations in the segment universe. In fact, the difference in the coverage between lengths of 10 and 30 residues was attributable to the increase in the volume for the most populated region, as discussed below. The typical global images of segment universes from the three categories are depicted in Fig. 3d. The segment universes here were generated by the first three principal components derived from the entire segment dataset: PCall1, PCall2, and PCall3 (see Methods).
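As a rough illustration of the relative score F_Rg, the following sketch computes the radius of gyration from Cα coordinates and rescales it by the minimum and maximum over a set of segments of the same length. Treating every Cα atom with equal weight is an assumption, and the function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Radius of gyration of a list of (x, y, z) Cα coordinates,
    assuming equal weight for every atom."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)


def relative_rg(rg_values):
    """F_Rg for each segment: (Rg - Min Rg) / (Max Rg - Min Rg),
    where min and max are taken over segments of one length."""
    lo, hi = min(rg_values), max(rg_values)
    if hi == lo:
        raise ValueError("all Rg values are identical; F_Rg is undefined")
    return [(rg - lo) / (hi - lo) for rg in rg_values]
```

Because the score is rescaled to the range 0 to 1 within each length, distributions for different segment lengths can be compared on one axis.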
Short length (10–22 residues long)
The conformational space of short segments showed a distribution with an extreme density gradient that originated from secondary structure clusters: α-helix and β-strand clusters were discriminated using a density of 0.01 (shown in orange in Fig. 3a). Between the lengths of 10 and 20 residues, spatial arrangements of the segment distribution, especially for α-helical, β-strand, and β-hairpin clusters, were conserved in short conformational spaces. The highly populated core of the α-helix cluster exhibited a density of 0.1 (shown as magenta in Fig. 3a), consisting of completed α-helical segments. The surrounding area of the central region consisted of various types of helical conformations including helix-capping motifs . The central region of the β-strand cluster consisted of fully extended segments that originated mainly from β-sheets and loop regions. The β-hairpin conformations were separated into several clusters at a density of 0.005. Then they were discriminated using the coordinate c2 along PCall2 (see Methods for the definition of c2). The β-hairpin clusters showed a symmetrical relationship related to the N-terminal and C-terminal halves. They were arranged symmetrically around an edge of segment universes of short length.
Medium length (23–26 residues long)
The segment distribution for medium lengths differed from that for short lengths. The distributional change from short to medium lengths is characterized using a diminishing β-strand cluster and a growing α-helix cluster. The overall distribution was shortened in the direction of PCall1, and enlarged in the direction of PCall2 and PCall3. In the segment universe of 26 residues, the α-helix cluster was discriminated using a density of 0.1 (magenta in Fig. 3b). Interestingly, the shape of the α-helix cluster was a ring (designated as a helix ring cluster). The helix ring cluster that is specific to the medium-length universe consisted not only of the extended α-helices but also of various α-helical conformations, as presented in the inset of Fig. 3b. This cluster included conformations that had originated mainly from all-α, α/β, and α +β proteins (Fig. 4a). The average content of the α-helical residues per segment in the helix ring cluster was about 50% (Fig. 4b); 24.9% of all segments were included within the helix ring cluster. The long-α-helical segments, whose conformation was not compact, were located near the origin of the conformational space (red in Fig. 3b). In contrast, the α-hairpin conformations with a small radius of gyration were located on the opposite side of the position on PCall1. The various α-hairpin conformations with the different turn positions were located symmetrically along PCall2. For medium lengths, the β-strand clusters were diminished because long extended β-strands are rarely found in proteins. The β-hairpin conformations were located symmetrically along PCall2, although the cluster separation of β-hairpins was not clear in medium lengths.
Long length (27–50 residues long)
Conformational spaces for the long lengths were further shortened in the direction of PCall1 and enlarged in that of PCall3. The segment distribution converged on a large populated region that exhibited a density of 0.1 (magenta in Fig. 3c) in the conformational space. With a length of 30 residues, there were two clusters consisting of compact segments and long α-helical segments, respectively, with densities of 0.35 (red in Fig. 3c) in the populated region. The emergence of the compact-segment cluster was attributable to an increase in various types of segments with a small radius of gyration (see inset of Fig. 3c). Various types of conformations are mixed up in the compact-segment cluster. The α-hairpins are derived mainly from all-α proteins. The compact β-sheet structures are derived mainly from all-β proteins. Compact conformations of other types are derived from α/β and α +β proteins (Fig. 4c). About 2% of all segments were included in the compact-segment cluster for 27-residue length. In contrast, long α-helical segments with a large radius of gyration were located on the opposite side of the cluster of the compact segments along the PCall1 axis. For lengths greater than 30 residues, the proportion of the conformations with a small radius of gyration in the compact-segment cluster increased rapidly to around 14% for 50-residue lengths. Those conformations were derived from various folds (Fig. 4c). The supersecondary structures, such as βαβ units and β-sheets, were included in the compact-segment cluster (Fig. 4d).
Contribution ratios of principal axes
Distributional alterations were observed associated with the changes of segment length. For principal component analyses, the contribution ratios (see Methods for the contribution ratios) of the principal components (i.e. PC axes) to the entire distribution indicate how well the PC axes can cover the variation in the original data. Figure 5 portrays contribution ratios of the first five PC axes (PCall1 – PCall5) for segment lengths of 10–50 residues. Even with a length of 43 residues, the cumulative contribution ratio of the first three PC axes, Q123 (= Q1 + Q2 + Q3), was greater than 60%, although Q123 decreased constantly with increased segment length. Each of Q4 and Q5 was always less than 8%. The contribution ratios for higher-order PC axes than PCall5 did not exceed 5% for the examined segment lengths. Therefore, it is sufficient to use only the first three PC axes (or the first five PC axes occasionally) to explain the original structural variation.
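For orientation, the contribution ratio of a PC axis is conventionally the share of total variance carried by that axis, that is, its eigenvalue divided by the sum of all eigenvalues. The sketch below assumes this standard definition (the exact definition used here is given in Methods, which is not reproduced in this section); the function names are hypothetical.

```python
def contribution_ratios(eigenvalues):
    """Contribution ratio Q_i of each PC axis, assuming the standard
    definition Q_i = lambda_i / sum of all lambda."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def cumulative_ratio(ratios, k=3):
    """Cumulative contribution of the first k axes, e.g. Q123 for k=3."""
    return sum(ratios[:k])


# Toy eigenvalues, for illustration only (not data from the study).
q = contribution_ratios([5.0, 3.0, 1.0, 1.0])
q123 = cumulative_ratio(q, k=3)
```

With the toy eigenvalues above, the first axis carries half of the variance and the first three axes together carry nine tenths of it.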
With respect to the individual contribution ratios (Q1-Q3) of the first three PC axes, Q1 was overwhelmingly higher than those of the other PC axes up to 50-residue length (Fig. 5), which indicates that PCall1 is a meaningful and fundamental descriptor for segment conformation. Actually, Q1 decreased rapidly, and Q2 increased in the short segment lengths (i.e. 10–22 residues). Thereafter, both Q1 and Q2 decreased slowly. In addition, Q3 increased gradually with lengths up to 33 residues, with a maximum value of 11.5%.
Investigation of structural properties of conformational axes
An eigenvector was analyzed for each PC axis with a triangle map to elucidate the physical and conformational meaning of the PC axes of the conformational space of the short to long segments. The eigenvector can be regarded as a collective variable to describe the segment conformation. Figure 6 shows triangle maps of the first five PC axes (PCall1 – PCall5) for short (10 residues), medium (26 residues), and long segments (30 residues). The triangle map clearly portrays residue pairs, with large or small deviations of Cα-Cα distances along each PC axis from the average distance <q i >. In the triangle map, positive (red) and negative (blue) areas correspond to residue pairs with mutually inverse deviations. The patterns of red and blue areas are conserved in the universes of short to long segments, indicating that conformational deviations related to the PC axes are conserved among the universes. Figure 7 depicts the conformational changes along the PC axis using colored arrows.
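The relation between an eigenvector and its triangle map can be sketched as follows. A segment of n residues is described by its n(n-1)/2 intra-segment Cα-Cα distances, so each eigenvector component corresponds to one residue pair (i, j) and can be placed back into an upper-triangular matrix. The row-by-row ordering of residue pairs and the function names are assumptions made for this illustration.

```python
import math


def distance_features(coords):
    """Flat vector of Cα-Cα distances for all pairs i < j,
    ordered row by row: (0,1), (0,2), ..., (n-2,n-1)."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]


def triangle_map(eigvec, n):
    """Place eigenvector components back onto residue pairs (i, j).
    Positive and negative entries correspond to the red and blue
    areas of a triangle map."""
    if len(eigvec) != n * (n - 1) // 2:
        raise ValueError("eigenvector length must be n*(n-1)/2")
    grid = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            grid[i][j] = eigvec[k]
            k += 1
    return grid
```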
Actually, PCall1 corresponds to the change of the radius of gyration (Rg). The triangle map for PCall1 has only one positive area, shown as red in Fig. 6, which is located near the residue pairs at the N-terminal and C-terminal sides. This single area indicates that the distant residue pairs in the sequence have a larger conformational deviation along PCall1. The correlation coefficient of the conformational deviation along PCall1 with Rg was greater than 0.9 in segment lengths of 10–50 residues (Fig. 8). The arrows in Fig. 7 point to the center of the segment, which indicates clearly that the conformational changes along PCall1 are involved with expansions or compressions of the conformation. For short lengths, PCall1 also shows a strong correlation with the changes of the segment end-to-end distance (D end ), which is defined as the Cα-Cα distance between the first and last residues of segments. Correlation between PCall1 and D end slowly weakened with increased segment length: 0.91 for 10 residues, 0.79 for 26 residues, and 0.77 for 30 residues.
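The correlations reported above compare a coordinate along a PC axis with a structural quantity computed per segment. A minimal sketch of the end-to-end distance and of a Pearson correlation coefficient follows; using Pearson's coefficient is an assumption, since the text says only "correlation coefficient", and the function names are hypothetical.

```python
import math


def end_to_end(coords):
    """D_end: Cα-Cα distance between the first and last residues."""
    return math.dist(coords[0], coords[-1])


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Given the PCall1 coordinate and D_end (or Rg) for every segment of one length, `pearson` would yield one point of a curve such as those in Fig. 8.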
The PCall2 correlates with a degree of structural symmetry ($D_{sym}$) of a segment with respect to the N-terminal and C-terminal halves. The $D_{sym}$ is defined as follows. Given a distance matrix for a segment, in which element (i,j) is the distance $r_{ij}$ between the Cα atoms of residues i and j, the degree of structural symmetry is defined as the sum of the squared differences of symmetric elements in the distance matrix: $D_{sym} = \sum_{1 \le i < j \le n} \left( r_{ij} - r_{n-(j-1),\,n-(i-1)} \right)^2$, where n is the segment length. The triangle map for PCall2 was separated into one positive area (red) and one negative area (blue). The correlation coefficient of the conformational deviation along PCall2 with structural symmetry, $D_{sym}$, was greater than 0.90 in the segment lengths of 10–50 residues (Fig. 8). Both conformations displayed mirrored symmetry about a plane constructed by PCall1 and PCall3 when two conformations were picked from opposite positions along PCall2. The segment conformations picked up along PCall2 are shown in Figs. 3a–3c.
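The degree of structural symmetry defined above translates directly into a double loop over residue pairs. In the sketch below, indices are 0-based, so the mirror of residue i is n-1-i, which corresponds to n-(i-1) in the 1-based notation of the text. The function name is hypothetical.

```python
import math


def d_sym(coords):
    """Sum over pairs i < j of the squared difference between the
    Cα-Cα distance r_ij and the distance of the mirrored pair
    (n-1-j, n-1-i) in 0-based indexing."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r_ij = math.dist(coords[i], coords[j])
            r_mirror = math.dist(coords[n - 1 - j], coords[n - 1 - i])
            total += (r_ij - r_mirror) ** 2
    return total
```

A segment whose N-terminal and C-terminal halves have the same internal distances gives a value of zero; unequal halves give a positive value.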
The PCall3 correlated with a physical indicator that describes a conformational transition between structures with one turn and ones with two turns (PCall3 in Fig. 6). The picked conformations along PCall3 indicate that segregation of a β-hairpin structure exists along with conformational changes by PCall3. We defined the physical indicator ($D_{mn+mc}$) of the β-hairpin formation as the sum of the norms of two vectors drawn from the middle point of the segment to the N-terminal and C-terminal residues: $D_{mn+mc} = \lVert \mathbf{v}_{mn} \rVert + \lVert \mathbf{v}_{mc} \rVert$, where $\mathbf{v}_{mn}$ and $\mathbf{v}_{mc}$ respectively denote the vectors from the midpoint to the N-terminal and C-terminal residues of the segment. Good correlation was found between PCall3 and $D_{mn+mc}$ (Fig. 8). The correlation coefficient was greater than 0.7 for the 10–50 residues. The triangle map of PCall3 indicated a separation of one positive area (red) and two negative areas (blue). It is noteworthy that the triangle map of PCall3 for short segments differed slightly from those of medium and long segments. A positive area is visible near the residue pair of the N-terminal and C-terminal in the short map, suggesting that PCall3 has a (negative) correlation with D end. For medium and long lengths, the positive area was close to the center of the triangle map. Therefore, the correlation between PCall3 and $D_{mn+mc}$/D end was necessarily smaller in medium and long lengths.
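The indicator described above, the sum of the norms of the two vectors from the segment midpoint to its terminal residues, can be sketched as below. How the midpoint is chosen is an assumption: here it is the central Cα atom for odd lengths and the average of the two central Cα atoms for even lengths. The function name is hypothetical.

```python
import math


def d_mn_mc(coords):
    """Sum of the distances from the segment midpoint to the
    N-terminal and C-terminal Cα atoms.

    Assumption: the midpoint is the central Cα (odd length) or the
    average of the two central Cα positions (even length).
    """
    n = len(coords)
    if n % 2:
        mid = coords[n // 2]
    else:
        a, b = coords[n // 2 - 1], coords[n // 2]
        mid = tuple((p + q) / 2 for p, q in zip(a, b))
    return math.dist(mid, coords[0]) + math.dist(mid, coords[-1])
```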
The triangle map of PCall4 had one negative area and one positive area. The positive area, located at the map center, suggests that PCall4 is correlated with the radius of gyration ( mid Rg) of the middle region of the segment – except for both the N-terminal and C-terminal quarter portions – in the medium and long segments. The respective correlation coefficients for the 26 and 30 residue lengths were 0.73 and 0.72. The PCall4 also has a weak (negative) correlation with D end . The respective correlation coefficients between PCall4 and D end for the 26 and 30 residue lengths were -0.45 and -0.42.
We identified no simple physical indicator for conformational changes along PCall5. However, visual inspection from conformations picked along PCall5 suggests that PCall5 is a conformational axis that represents segregated β-sheet structures. Conformations picked up from both ends on PCall5 are depicted in Fig. 6. In the triangle map for PCall5, two positive and two negative areas exist along the diagonal line, which might indicate that PCall5 segregates segment conformations with double turns. The PCβ5 contribution ratio, which was derived from all-β proteins, was higher than that derived from other structural classes, which suggests that PC5 is important for describing the structural variation of β-structures.
Segment universes derived from different structural classes
The segment universes described above are those derived from proteins of the four structural classes. Therefore, decomposition of the universe into four classes is helpful to evaluate the influence of each structural class on the segment universe. To this end, a segment universe was constructed for each structural class separately, and the PC axes derived from each universe were compared with those of all segments (i.e., PCall1-PCall3). The first three largest eigenvectors of each structural class were also compared respectively with PCall1, PCall2, and PCall3 to elucidate the structural properties of PC axes derived from each universe.
Figure 9 depicts the contribution ratios of the first three PC axes, PCx1 -PCx3 (x = α, β, α +β, or α/β), in each structural class. The marks on the curves in Fig. 9 indicate that the correlation coefficient (vx i ·vall i ) between PCx1 -PCx3 and PCall1 -PCall3 (i.e., i = 1, 2, 3) is greater than 0.7, which was used here as a threshold of conservation of structural properties. The properties of the first two PC axes corresponding to the PCall1 and PCall2 were highly conserved in all four structural classes. The characteristics of PCall3 were also conserved in all four structural classes, although exceptions were apparent for the 20-residue-long and 10–16-residue-long all-α and all-β classes. Therefore, it is confirmed that the first three PC axes (Rg, symmetry, and one/two turn(s)) are important in almost all cases to describe the conformation of segments embedded in globular proteins.
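The conservation criterion above compares an eigenvector of a class-specific universe with the corresponding eigenvector of the whole universe through their inner product. A minimal sketch follows. Taking the absolute value of the normalized inner product is an assumption, introduced because the sign of a PCA eigenvector is arbitrary; the function names are hypothetical.

```python
import math


def axis_similarity(u, v):
    """Absolute normalized inner product of two eigenvectors.
    The absolute value is an assumption: the sign of a PCA
    eigenvector is arbitrary."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


def is_conserved(v_class, v_all, threshold=0.7):
    """True when the class axis matches the all-segment axis above
    the 0.7 threshold used in the text."""
    return axis_similarity(v_class, v_all) > threshold
```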
However, the curves for the contribution ratios of both all-α and all-β classes (see two panels of Fig. 9) differ clearly from those of PCall1 – PCall3 (i.e. Q1 – Q3 in Fig. 5). The Qα1 contribution ratio was always higher than 40%, which indicates that the distribution of the all-α segments has a large deviation with respect to Rg. In contrast, the Qβ1 contribution ratio decreased rapidly with increasing segment length. The value of Qα2 increased moderately with increasing segment length. In contrast, the Qβ2 had a maximum value greater than 20% at a length of 22 residues. This rapid increase of Qβ2 might reflect a typical feature for β-sheet conformations. For PC3, the curves for the contribution ratios of the all-α and all-β classes also mutually differed. Although Qβ3 peaked at a length of 35 residues, Qα3 peaked with a short length, which indicates that the structural variable based on PCall3 is important for β-segments longer than 30 residues. In contrast, the behaviors of the contribution ratios for both α+β and α/β classes along with the segment length resembled each other. They were also similar to Q1-Q3 in Fig. 5 because those structural classes are mixtures of α-helices and β-sheets.
Subsequently, PC axes that were specific for each structural class were examined. For this analysis, the PC axis was defined as a "class-specific" one when a PC axis from a structural class showed no similarity with the first 20 PC axes from the other three structural classes (see Methods). The first 10 PC axes of each class were investigated for the short (10 residues), medium (26 residues), and long (30 residues) segments. Ten class-specific conformational axes were identified and consisted of one (PCβ10) for the short length, eight for the medium, and one (PCα8) for the long. The eight class-specific axes for the medium-length segments are PCα5, PCα8, and PCα10 for all-α, PCβ10 for all-β, PCα+β9 and PCα+β10 for α+β, and PCα/β8 and PCα/β10 for α/β. Four examples out of eight are depicted in Fig. 10. A clear correlation of these PC axes is difficult to discern according to simple physical or structural quantities. Figure 10a shows that the PCα8 describes a structural change of three (both ends and the middle portion) parts of α-segments. The PCα/β8 is related to βαβ motifs, which is the most fundamental structural unit for α/β proteins.
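The class-specific criterion above can be sketched as a search over the first 20 PC axes of the other three classes. The similarity measure and its cutoff are assumptions: the absolute normalized inner product and a cutoff of 0.7 are borrowed from the conservation test for illustration, whereas the actual criterion is given in Methods. The function names are hypothetical.

```python
import math


def axis_similarity(u, v):
    """Absolute normalized inner product of two eigenvectors
    (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return abs(dot) / norm


def is_class_specific(axis, other_classes, top=20, threshold=0.7):
    """True when `axis` resembles none of the first `top` PC axes of
    any other structural class. `other_classes` is a list with one
    list of eigenvectors per class; `threshold` is an assumed cutoff."""
    return all(
        axis_similarity(axis, other) < threshold
        for axes in other_classes
        for other in axes[:top]
    )
```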
Investigation of the protein segment universe is an important subject for bioinformatics. Results of this study show that the segment universe can be categorized naturally into three regimes: short, medium, and long. A main finding of this study is that the three regimes are clearly demarcated by critical changes in the shape of the segment distribution in the conformational space. Preceding studies demonstrated that the average length of α-helix is 14 residues and that for β-strand is five residues. Results of the present study show that transitional segment lengths (22 and 26 residues long) do not coincide with these average lengths. Therefore, a single secondary structure element does not characterize the shape of the segment distribution. The appearance of the medium length regime segregates the segment fold universe into three. The combination of secondary-structure elements is important to characterize not only the medium-length segment universe but also the entire segment fold universe.
Meanwhile, loops, which make up 30% of the protein structures, are also expected to take a larger role in forming some unique conformations by connecting secondary-structure elements in the medium to long-length segment universe than in the short one. The segments in the cluster of the medium to long-length universe tend to contain more loop regions than those of the short segment universe, as shown in Figs. 4b and 4d, and have a wider variety of origins (Figs. 4a and 4c). For example, the segments in the cluster with density of 0.35–1.0 of the universe of 30 residues length are derived from 461 proteins out of all 600 representatives used for this study (see Additional File 1). Longer loops that possess extended conformations are located on the opposite side of the compact-segment cluster along PCall1 in the medium to long segment universe (Figs. 3b and 3c). Instead of discrete clusters, they appear to constitute a rather continuous distribution. Some analyses examine short loops with respect to their completeness [27, 28] and elaborate classification [26, 29]. In the analysis of short segments, our method also captured some loop conformation classes, such as joint loops connecting two helices, and exposed and extended loops that participate in protein-protein interactions.
A natural boundary was identified in this study between the peptide-like and protein-like distributions between the lengths of 23 and 26 residues using actual conformations of protein segments. This observation with respect to the boundary is consistent with the results described by Shen et al., even though they used a sphere-packing model to estimate a minimal domain size of about 20 residues. A recent study by Sawada and Honda also identified a boundary at 10–20 residue length by calculating the structural diversity of segments. They discretized the conformational space using a single-pass clustering method. In contrast, we observed the density distribution to uncover differences of conformational space between short and long segments. The segment conformational space for lengths of 10–22 residues provided a distribution with an extreme density gradient towards the secondary structure, such as the α-helix, β-strand, and β-hairpin clusters, which are expected to belong to the peptide-like conformational regime. This conformational variation reflects that short segments embedded in globular proteins are mainly stabilized by the physicochemical property of the peptide. On the other hand, the segment conformational spaces for lengths of 27 residues or more have a distribution that is dominated by compact segments, which suggests a protein-like distribution (protein-like conformational regime). This distribution arises from the hydrophobic effect imparted by the solvent molecules, which is of great importance for structural stability in long segments derived from globular proteins. If this is the case, our observations support the de novo structure prediction methods, so-called fragment assembling methods, that have been developed recently [32–35].
These approaches are usually based on the prediction of local segment conformations followed by assembly of segments, and they generally use separate criteria at each step: sequence similarity or secondary structural propensity for the prediction of segment conformations, and non-local energy terms for the assembling step. These strategies used in the de novo prediction methods seem to be consistent with the results shown here. Results of our analyses clearly show such a hierarchical organization of protein structures, and indicate that preparing segment libraries up to around 20 residues long would be helpful for such methods.
These results indicate that the structural meanings for the conformational axes (i.e., the radius of gyration for PCall1, structural symmetry related to the N-terminal and C-terminal halves for PCall2, and a single-turn/two-turn structure for PCall3) are conserved in the different lengths and structural classes. This fact suggests that these conformational components are key structural variables for protein segments. On the other hand, when conformational axes among the four structural classes were compared, we were able to identify several conformational axes that were specific to each structural class, especially in the medium length range. In fact, a distribution change for medium lengths was observed, involving an increase in compact segments. Those segments included supersecondary structures such as α-hairpins, parts of the β-sheets, and βαβ units. These results might be related to the specificity of the structural class or fold of the contents of supersecondary structures . Typical supersecondary structural motifs, α-hairpin, β-hairpin, and βαβ are, respectively, the basic structural units for the all-α, all-β, and α/β proteins. These motifs are often shared within the structural classes. Therefore, the contribution ratios observed for the class-specific conformational axes were high. Class-specific conformational axes were rarely observed in short and long lengths, probably because short segments are too nonspecific and are often shared over different structural classes; long segments are too specific and have very low contribution ratios for conformational axes that are specific for each structural class.
The class-specific conformational axes found here provide a hint for resolving a difficulty in classifying diverse sets of protein structures. The α/β and α+β classes are known to overlap substantially. In the CATH classification, the α/β and α+β classes are treated as a single α-β class. Distinguishing α/β from α+β proteins is sometimes difficult, although several classification [19, 36, 37] and prediction [38, 39] schemes have been proposed. The present study showed that the segment universes of the α/β and α+β classes share similar characteristics while each also has unique ones. For example, our results show that PCα/β8, whose contribution ratio was 1.4%, was associated only with the βαβ motif. In the α+β class, no axis was strongly correlated with PCα/β8 (see Additional File 2), which is a clear example of the difference in structural variables between the α+β and α/β classes originating from class-specific supersecondary structures. Consequently, projecting segments onto a conformational subspace using the axis PCα/β8 could be useful for objectively dividing protein domains of the α-β class into α/β and α+β classes. A considerable localization of segments derived from α/β proteins in a PCA subspace is observed (see Additional Files 3 and 4).
An effective conformational sampling method must be developed for de novo prediction. The structural variables analyzed in this study would be helpful for further progress in de novo structure prediction. For example, testing the distribution of segments or models in terms of the degree of symmetry using the descriptor ($D_{\mathrm{sym}}$) might be useful for verifying the completeness of sampling of the conformational space. Using a filtering threshold or function (generally used in fragment assembly methods for selecting proper models) that is tolerant of the radius of gyration might be useful for improving the prediction of all-α proteins, because the contribution ratio, Qα1, of PCα1, which corresponds to the radius of gyration (Rg), is larger than those of the other structural classes in the medium and long segments. Consequently, projecting segments of models onto a conformational subspace constructed from PCx axes (where x = α, β, α/β, α+β, or all) might be helpful for filtering out models and assigning a protein to a structural class.
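As a rough illustration of an Rg-tolerant filter, the sketch below computes an unweighted radius of gyration from Cα coordinates and applies a cutoff that can be loosened. The function names, the `rg_max` cutoff, and the `tolerance` parameter are hypothetical choices made for this example and are not taken from the study.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid.

    Assumption: every C-alpha atom carries equal weight.
    """
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    sq_sum = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    )
    return math.sqrt(sq_sum / n)


def passes_rg_filter(coords, rg_max, tolerance=0.0):
    """Hypothetical compactness filter; `tolerance` loosens the cutoff."""
    return radius_of_gyration(coords) <= rg_max * (1.0 + tolerance)


# Four points at unit distance from the origin (toy data, not a real segment).
toy = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
rg = radius_of_gyration(toy)
strict = passes_rg_filter(toy, rg_max=0.9)
tolerant = passes_rg_filter(toy, rg_max=0.9, tolerance=0.2)
```

In this toy case the strict cutoff rejects the model while the tolerant one accepts it, which is the kind of relaxation the paragraph suggests might help for all-α proteins.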
In this study, we have shown dual critical transitions in the protein segment universe from short to long lengths. Our observations are related to the significance of the two-stage treatment in de novo structure prediction. Considering the hierarchical organization of the protein segment universe that we have shown, we suggest that secondary-structure-directed evaluation functions are effective for sampling local structures of fewer than 23 residues. We also suggest that another function (e.g. Rg) is suitable for evaluating protein-like features of models for longer segments. Changing the filtering criteria for each structural class will enhance the effectiveness of the conformation sampling process. Through these analyses, we have demonstrated that our clustering methodology is useful for identifying a distinctive distribution shift of the conformational space between short and long segments, and that the distribution changes depend on structural classes.
Preparing the segment libraries
One representative from each fold group of the SCOP database (ver. 1.63) was chosen to obtain a segment library without bias in fold usage. The representatives cover the four major structural classes (all-α, all-β, α+β, and α/β), because we specifically aim to characterize the nature of segments embedded in globular proteins of usual size. Small proteins of fewer than 50 residues and non-single-chain proteins of fewer than 100 residues were excluded, as were membrane proteins, because those proteins are expected to possess structural properties different from those of usual-size globular proteins and would bias the results. In all, 600 representatives were used for this study (all-α, 150; all-β, 116; α+β, 219; α/β, 115; see Additional File 5). A segment library of arbitrary length was generated by dividing the protein structures into segments with a sliding window shifted by one residue along the sequence. We prepared a segment library for each length of 10–50 residues to generate conformational spaces of short-to-long segments. Segments with incomplete coordinate data (e.g., having an unusual covalent-bond length or lacking main-chain atoms) were excluded. Furthermore, to elucidate differences among the conformational spaces derived from the four major structural classes, we generated a segment library for each class.
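The sliding-window construction can be sketched as follows. This is a minimal illustration, assuming each residue is a dictionary with a `"ca"` coordinate entry; that data layout and the `is_complete` check (which only tests for missing Cα coordinates, not for unusual bond lengths) are hypothetical simplifications.

```python
def sliding_segments(residues, length):
    """Return every contiguous window of `length` residues, shifted by one."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [residues[i:i + length] for i in range(len(residues) - length + 1)]


def is_complete(segment):
    """Hypothetical completeness check: every residue has CA coordinates."""
    return all(res.get("ca") is not None for res in segment)


def build_library(chains, length):
    """Collect all complete segments of one length from a list of chains."""
    library = []
    for chain in chains:
        for segment in sliding_segments(chain, length):
            if is_complete(segment):
                library.append(segment)
    return library


# Toy chain of 12 residues in which residue 5 lacks coordinates.
chain = [{"index": i, "ca": (float(i), 0.0, 0.0)} for i in range(12)]
chain[5]["ca"] = None

windows = sliding_segments(list(range(5)), 3)
library_4 = build_library([chain], 4)

# One library per segment length, mirroring the 10-50 residue range.
libraries = {n: build_library([chain], n) for n in range(10, 51)}
```

With the toy chain, every window of 10 or more residues contains the incomplete residue, so those libraries are empty; real chains of 100 or more residues would populate them.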
Construction and visualization of conformational space
We previously reported a method for constructing and visualizing the conformational space of protein segments using principal component analysis based on intra-segment Cα-Cα atomic distances. Briefly, the atomic distances of all Cα-Cα pairs were first calculated for each segment in a segment library of a given length. A distance is designated as $q_i$, where $i$ is the index of the Cα-Cα pair, $i = 1, \ldots, n(n-1)/2$, and $n$ is the segment length expressed as the number of residues in the segment. Subsequently, a set of eigenvectors and eigenvalues was obtained by diagonalizing a variance-covariance matrix, $C$, calculated as $C_{ij} = \langle (q_i - \langle q_i \rangle)(q_j - \langle q_j \rangle) \rangle = \langle q_i q_j - q_i \langle q_j \rangle - \langle q_i \rangle q_j + \langle q_i \rangle \langle q_j \rangle \rangle = \langle q_i q_j \rangle - \langle q_i \rangle \langle q_j \rangle - \langle q_i \rangle \langle q_j \rangle + \langle q_i \rangle \langle q_j \rangle = \langle q_i q_j \rangle - \langle q_i \rangle \langle q_j \rangle$, where the average $\langle \ldots \rangle$ is taken over the segments. Two equations, $C \mathbf{v}_i = \lambda_i \mathbf{v}_i$ and $\mathbf{v}_i \cdot \mathbf{v}_j = \delta_{ij}$, are satisfied. Eigenvectors with larger eigenvalues are more important in the study of the conformational varieties of the segments. Eigenvalues are arranged in descending order: $\lambda_i > \lambda_j$ if $i < j$. The contribution ratio of the $i$-th PCA element (i.e. the $i$-th eigenvector) to the whole conformational distribution is given as $Q_i = \lambda_i / \sum_k^{\mathrm{all}} \lambda_k$. The eigenvectors, which are called PCx1, PCx2, PCx3, etc., were used as conformational axes to construct a segment conformational space, a PCA space, in which x indicates a segment dataset (x = α, β, α/β, α+β, or all). The indicator "x = all" is given when the conformational axes are generated from the whole segment dataset. The origin of the PCA space is set at the average Cα-Cα atomic distances: $\langle \mathbf{q} \rangle = [\langle q_1 \rangle, \langle q_2 \rangle, \langle q_3 \rangle, \ldots, \langle q_{n(n-1)/2} \rangle]$. This enables ready comparison of conformational distributions between constructed universes. Any position (i.e. any segment structure) in the PCA space can be expressed using a linear combination of eigenvectors as $c_k = \sum_n^{\mathrm{all}} (\mathbf{q} - \langle \mathbf{q} \rangle) \cdot \mathbf{v}_k \lambda_k^{1/2}$, where $c_k$ is a coordinate (i.e. the projection of $\mathbf{q}$) on PC axis $k$. Using the first three eigenvectors (PCx1, PCx2, PCx3), a three-dimensional (3D) PCA space can be constructed.
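The construction described above can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the input is random toy coordinates, and the returned coordinates are the plain projections $(\mathbf{q} - \langle \mathbf{q} \rangle) \cdot \mathbf{v}_k$ without the $\lambda_k^{1/2}$ factor that appears in the expression above.

```python
import numpy as np


def distance_vector(ca):
    """Flatten the upper triangle of the CA-CA distance matrix into q."""
    ca = np.asarray(ca, dtype=float)
    diff = ca[:, None, :] - ca[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(ca), k=1)
    return dist[upper]                      # length n(n-1)/2


def segment_pca(segments):
    """Diagonalize C_ij = <q_i q_j> - <q_i><q_j> over a set of segments."""
    q = np.array([distance_vector(s) for s in segments])
    q_mean = q.mean(axis=0)                 # origin of the PCA space
    centered = q - q_mean
    cov = centered.T @ centered / len(q)    # average taken over segments
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # rearrange to descending order
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]             # columns are the PC axes
    contribution = eigvals / eigvals.sum()  # contribution ratios Q_i
    coords = centered @ eigvecs             # unscaled projections
    return q_mean, eigvals, eigvecs, contribution, coords


# Toy input: 200 random 10-residue "segments" (not real protein data).
rng = np.random.default_rng(0)
segments = rng.normal(size=(200, 10, 3))
q_mean, eigvals, eigvecs, contribution, coords = segment_pca(segments)
pc3 = coords[:, :3]                         # positions in the 3D PCA space
```

Because the covariance matrix is symmetric, `numpy.linalg.eigh` is used instead of a general eigen-solver; it returns real eigenvalues and orthonormal eigenvectors.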
We defined a vector, $\mathbf{r}$, to express the position of each segment in the 3D PCA space: $\mathbf{r} = [c_1, c_2, c_3]$. After projection of the segments onto the 3D PCA space, the distribution of segments was visualized using the following procedure. The 3D space was divided into $N$ bins along each axis ($N^3$ cubes in total). The bin size was defined as $(\max[c_1] - \min[c_1])/N$, where $N = 36$, and $\max[c_1]$ and $\min[c_1]$ respectively signify the maximum and minimum of the coordinates of the segments along the first principal component axis. The number (i.e. frequency) of segments detected in a cube represents the density (i.e. probability) of finding segments in the cube. The density of each cube, $\rho$, was normalized by the maximum density among the cubes, $\rho_{\max}$, so that the maximal value of the normalized density (simply called density in the text) is 1 (refer to eq. (3) in ). Four levels of contour surfaces (i.e. iso-density surfaces) were depicted to visualize the 3D PCA space. The density values for those surfaces were set to 0.005, 0.01, 0.1, and 0.35, respectively.
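The binning and normalization can be sketched in plain Python as below. Two details are assumptions made for this example and are not stated in the text: the grid origin is placed at the minimum coordinate along each axis, and a point lying exactly on the upper PC1 edge is assigned to the last PC1 bin. The function names are hypothetical.

```python
from collections import Counter

CONTOUR_LEVELS = (0.005, 0.01, 0.1, 0.35)   # iso-density values from the text


def cube_index(point, mins, size, n_bins):
    """Map a 3D point to the integer index of the cube containing it."""
    idx = [int((point[k] - mins[k]) // size) for k in range(3)]
    idx[0] = min(idx[0], n_bins - 1)        # keep max[c1] in the last PC1 bin
    return tuple(idx)


def density_grid(points, n_bins=36):
    """Count segments per cube and normalize by the most populated cube."""
    c1 = [p[0] for p in points]
    size = (max(c1) - min(c1)) / n_bins     # cube edge from the PC1 range
    if size == 0:
        raise ValueError("all points share the same c1 coordinate")
    mins = [min(p[k] for p in points) for k in range(3)]  # assumed origin
    counts = Counter(cube_index(p, mins, size, n_bins) for p in points)
    rho_max = max(counts.values())
    return {cube: n / rho_max for cube, n in counts.items()}


def cubes_above(density, level):
    """Cubes whose normalized density reaches a given contour level."""
    return [cube for cube, rho in density.items() if rho >= level]


# Toy points (not real projections); the PC1 range is 36, so the edge is 1.
points = [(0.0, 0.0, 0.0)] * 4 + [(36.0, 0.5, 0.5)] + [(18.2, 3.0, 0.0)] * 2
density = density_grid(points)
dense_cubes = cubes_above(density, CONTOUR_LEVELS[-1])
```

A dictionary keyed by cube index keeps only occupied cubes, which is convenient when most of the $N^3$ cubes are empty.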
We also constructed the universe separately for each of the four structural classes to assess differences among their conformational spaces. For this study, we specifically examined the first 10 PC axes of each structural class because these axes are more important than the others for capturing differences in the conformational axes. Although the eigenvectors from the same structural class are mutually uncorrelated (i.e., $\mathbf{v}^x_i \cdot \mathbf{v}^x_j = 0$, where $i \neq j$ and x = α, β, α/β, or α+β), the eigenvectors from different structural classes might have some correlation (i.e., $\mathbf{v}^x_i \cdot \mathbf{v}^y_j \neq 0$, where $x \neq y$). A PC axis from one structural class is defined as a conformational component specific to that class when it shows no similarity, at a correlation coefficient > 0.8 (i.e. $\mathbf{v}^x_i \cdot \mathbf{v}^y_j > 0.8$), to any of the first 20 PC axes from the other structural classes.
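The class-specificity criterion can be sketched as a comparison of unit eigenvectors by dot product. The sketch below is illustrative only: the function names and the toy three-dimensional axes are hypothetical, and the absolute value of the dot product is used because the sign of an eigenvector is arbitrary, which is an assumption beyond the criterion as written.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def class_specific_axes(axes_by_class, n_query=10, n_ref=20, threshold=0.8):
    """Per class, list the (0-based) axes with no similar axis elsewhere.

    `axes_by_class` maps a class name to its unit eigenvectors, ordered by
    descending eigenvalue. Only the first `n_query` axes of each class are
    tested, each against the first `n_ref` axes of every other class.
    """
    specific = {}
    for cls, axes in axes_by_class.items():
        found = []
        for i, v in enumerate(axes[:n_query]):
            similar = any(
                abs(dot(v, w)) > threshold   # assumption: sign is ignored
                for other, other_axes in axes_by_class.items()
                if other != cls
                for w in other_axes[:n_ref]
            )
            if not similar:
                found.append(i)
        specific[cls] = found
    return specific


# Toy unit vectors standing in for PC axes of two classes.
axes_by_class = {
    "alpha": [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "beta": [(-1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
}
specific = class_specific_axes(axes_by_class)
```

In the toy data the first axes of the two classes are the same direction with opposite sign and are therefore treated as shared, whereas the second axis of each class has no counterpart and is reported as class-specific.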
References
Matsuo Y, Kanehisa M: An approach to systematic detection of protein structural motifs. Comput Appl Biosci 1993, 9(2):153–159.
Unger R, Sussman JL: The importance of short structural motifs in protein structure analysis. J Comput Aided Mol Des 1993, 7(4):457–472. 10.1007/BF02337561
Micheletti C, Seno F, Maritan A: Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins 2000, 40(4):662–674. 10.1002/1097-0134(20000901)40:4<662::AID-PROT90>3.0.CO;2-F
Prestrelski SJ, Byler DM, Liebman MN: Generation of a substructure library for the description and classification of protein secondary structure. II. Application to spectra-structure correlations in Fourier transform infrared spectroscopy. Proteins 1992, 14(4):440–450. 10.1002/prot.340140405
Rackovsky S: Quantitative organization of the known protein x-ray structures. I. Methods and short-length-scale results. Proteins 1990, 7(4):378–402. 10.1002/prot.340070409
de Brevern AG, Etchebest C, Hazout S: Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 2000, 41(3):271–287. 10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z
Fetrow JS, Palumbo MJ, Berg G: Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 1997, 27(2):249–271. 10.1002/(SICI)1097-0134(199702)27:2<249::AID-PROT11>3.0.CO;2-M
Sander O, Sommer I, Lengauer T: Local protein structure prediction using discriminative models. BMC Bioinformatics 2006, 7: 14. 10.1186/1471-2105-7-14
Rooman MJ, Rodriguez J, Wodak SJ: Automatic definition of recurrent local structure motifs in proteins. J Mol Biol 1990, 213(2):327–336. 10.1016/S0022-2836(05)80194-9
Schuchhardt J, Schneider G, Reichelt J, Schomburg D, Wrede P: Local structural motifs of protein backbones are classified by self-organizing neural networks. Protein Eng 1996, 9(10):833–842. 10.1093/protein/9.10.833
Hunter CG, Subramaniam S: Protein fragment clustering and canonical local shapes. Proteins 2003, 50(4):580–588. 10.1002/prot.10309
Tomii K, Kanehisa M: Systematic detection of protein structural motifs. In Pattern discovery in biomolecular data. Edited by: Wang JTL, Shapiro BA, Shasha D. New York: Oxford University Press; 1999:97–110.
Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol 2006, 16(3):393–398. 10.1016/j.sbi.2006.04.007
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. 10.1016/S0969-2126(97)00260-8
Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.
Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G: A database of protein structure families with common folding motifs. Protein Sci 1992, 1(12):1691–1698.
Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385. 10.1016/S0959-440X(96)80058-3
Holm L, Sander C: Mapping the protein universe. Science 1996, 273(5275):595–603. 10.1126/science.273.5275.595
Hou J, Sims GE, Zhang C, Kim SH: A global representation of the protein fold space. Proc Natl Acad Sci USA 2003, 100(5):2386–2390. 10.1073/pnas.2628030100
Salem GM, Hutchinson EG, Orengo CA, Thornton JM: Correlation of observed fold frequency with the occurrence of local structural motifs. J Mol Biol 1999, 287(5):969–981. 10.1006/jmbi.1999.2642
Szustakowski JD, Kasif S, Weng Z: Less is more: towards an optimal universal description of protein folds. Bioinformatics 2005, 21(Suppl 2):ii66–71. 10.1093/bioinformatics/bti1111
Kurgan L, Kedarisetti KD: Sequence representation and prediction of protein secondary structure for structural motifs in twilight zone proteins. Protein J 2006, 25(7–8):463–474. 10.1007/s10930-006-9029-0
Ikeda K, Tomii K, Yokomizo T, Mitomo D, Maruyama K, Suzuki S, Higo J: Visualization of conformational distribution of short to medium size segments in globular proteins and identification of local structural motifs. Protein Sci 2005, 14(5):1253–1265. 10.1110/ps.04956305
Kumar S, Bansal M: Structural and sequence characteristics of long alpha helices in globular proteins. Biophys J 1996, 71(3):1574–1586.
Penel S, Morrison RG, Dobson PD, Mortishire-Smith RJ, Doig AJ: Length preferences and periodicity in beta-strands. Antiparallel edge beta-sheets are more likely to finish in non-hydrogen bonded rings. Protein Eng 2003, 16(12):957–961. 10.1093/protein/gzg147
Donate LE, Rufino SD, Canard LH, Blundell TL: Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci 1996, 5(12):2600–2616.
Fidelis K, Stern PS, Bacon D, Moult J: Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng 1994, 7(8):953–960. 10.1093/protein/7.8.953
Lessel U, Schomburg D: Creation and characterization of a new, non-redundant fragment data bank. Protein Eng 1997, 10(6):659–664. 10.1093/protein/10.6.659
Wojcik J, Mornon JP, Chomilier J: New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification. J Mol Biol 1999, 289(5):1469–1490. 10.1006/jmbi.1999.2826
Shen M-y, Davis FP, Sali A: The optimal size of a globular protein domain: A simple sphere-packing model. Chemical Physics Letters 2005, 405(1–3):224–228. 10.1016/j.cplett.2005.02.029
Sawada Y, Honda S: Structural diversity of protein segments follows a power-law distribution. Biophys J 2006, 91(4):1213–1223. 10.1529/biophysj.105.076661
Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D: De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002, 322(1):65–78. 10.1016/S0022-2836(02)00698-8
Chikenji G, Fujitsuka Y, Takada S: A reversible fragment assembly method for de novo protein structure prediction. The Journal of Chemical Physics 2003, 119(13):6895–6903. 10.1063/1.1597474
Lee J, Kim S-Y, Lee J: Protein structure prediction based on fragment assembly and parameter optimization. Biophysical Chemistry 2005, 115(2–3):209–214. 10.1016/j.bpc.2004.12.046
Bujnicki JM: Protein-structure prediction by recombination of fragments. Chembiochem 2006, 7(1):19–27. 10.1002/cbic.200500235
Michie AD, Orengo CA, Thornton JM: Analysis of domain structural class using an automated class assignment protocol. J Mol Biol 1996, 262(2):168–185. 10.1006/jmbi.1996.0506
Kurgan LA, Zhang T, Zhang H, Shen S, Ruan J: Secondary structure-based assignment of the protein structural classes. Amino Acids 2008.
Chou KC: Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr Protein Pept Sci 2005, 6(5):423–436. 10.2174/138920305774329368
Kurgan L, Cios K, Chen K: SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008, 9: 226. 10.1186/1471-2105-9-226
Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. 10.1002/bip.360221211
KI and JH were partly supported by BIRD of Japan Science and Technology Agency (JST).
This study was conceived and carried out by KI, who also analyzed the results and drafted the manuscript. HT approved the study and participated in the discussion. JH participated in the design and coordination of the study. He also helped to write the manuscript. KT participated in the design and discussions of the study and wrote the manuscript. KI and JH developed the methodology. All authors read and approved the final manuscript.
Cite this article
Ikeda, K., Hirokawa, T., Higo, J. et al. Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces. BMC Struct Biol 8, 37 (2008). https://doi.org/10.1186/1472-6807-8-37
- Structural Class
- Conformational Space
- Contribution Ratio
- Segment Distribution
- Protein Segment