Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces
© Ikeda et al; licensee BioMed Central Ltd. 2008
Received: 25 April 2008
Accepted: 13 August 2008
Published: 13 August 2008
Many studies have examined rules governing two aspects of protein structures: short segments and proteins' structural domains. Nevertheless, the organization and nature of the conformational space of segments with intermediate length between short segments and domains remain unclear. Conformational spaces of intermediate length segments probably differ from those of short segments. We sought to identify and characterize the boundary or boundaries between peptide-like (short segment) and protein-like (long segment) distributions. We generated ensembles of segments 10–50 residues long that are embedded in globular proteins. Using principal component analysis based on the intra-segment Cα-Cα atomic distances, we explored how the conformational distribution of segments relates to segment length and to protein structural class.
Our statistical analyses of segment conformations and length revealed critical dual transitions in their conformational distribution with segments derived from all four structural classes. The dual transitions delimit an intermediate phase between the short segments and domains. Consequently, protein segment universes were categorized into three regimes. i) Short segments (10–22 residues) showed a distribution with a high frequency of secondary structure clusters. ii) Medium segments (23–26 residues) showed a distribution corresponding to an intermediate state of transitions. iii) Long segments (27–50 residues) showed a distribution converging on one huge cluster containing compact conformations with a smaller radius of gyration. This distribution reflects the protein structures' organization and protein domains' origin. Three major conformational components (radius of gyration, structural symmetry with respect to the N-terminal and C-terminal halves, and single-turn/two-turn structure) define most of the segment universes well. Furthermore, we identified several conformational components that were unique to each structural class. Those characteristics suggest that protein segment conformation is described by a composition of the three common structural variables with large contributions and specific structural variables with small contributions.
The present analyses of four protein structural classes show the universal role of three major components as segment conformational descriptors. The changes in distribution with segment length, as described by the three key components, suggest both the adequacy of the prediction strategies used in recent de novo structure-prediction methods and the possibility of further progress on them.
Vast amounts of three-dimensional (3D) protein data from structural genomic studies and other individual efforts have been added to our knowledge, thereby enhancing our understanding of protein structures. To date, studies of protein structural data have concentrated on two extremes. One extreme comprises local features of proteins: those of short protein segments, typically 10 residues long or less. The other extreme comprises global features of proteins: protein folds or structural domains.
Regarding the short protein segments, abundant research examples exist, partly because many methods are available for analyzing the local features of proteins. Various measures, such as RMSDs after structural superposition [1–3], Cα-Cα atomic distances coupled with the torsion angles [4, 5], dihedral angles , and so on have been used to define the conformational similarity of protein segments. Different clustering techniques, such as k-means clustering [7, 8], hierarchical methods , competitive learning [6, 10], and other methods , have been used to describe the organization of the segments' conformational space. The abundance of research results in this area is also partly attributable to the various applications of the clustering results for short segments. A set of representatives from the resulting clusters is often called structural building blocks (SBBs). Even when different procedures are used, clustering resolutions of SBBs can be categorized into only a few levels depending mainly on their respective applications, such as structural modeling, verification, comparison, and prediction [6, 12]. The most dominant cluster of the short segments, which is common to all studies, corresponds to α-helices, whereas the variability of β-strands is observed at high clustering resolution. Regarding global features of proteins, understanding of the organization of protein-fold (or structural domain) space is progressing well.
As reviewed recently , both hierarchical and continuous aspects of fold space have been realized. Regarding hierarchical classification, widely used databases such as CATH  and SCOP  have been constructed. Other databases such as FSSP  and VAST  have been developed. They are based on continuous measurements of protein structural similarity. Several studies have provided insights into the nature of fold space. Holm and Sander first described the conformational distribution of protein folds in a fold universe with multi-dimensional scaling methods based on an all-on-all comparison using the Dali program . Using the same measurement, Hou et al.  showed visual representations of the protein fold universe and identified three major components which characterize the fold space: secondary structure compositions, chain topologies, and the protein domain size.
Compared to these two extremes, limited surveys have been done on the conformational space of medium size segments between protein short segments and folds. Specifically, supersecondary structures such as α-hairpin, βαβ-unit, and β-hairpin are typical structural motifs of medium size; those motifs have been analyzed. For example, Salem et al. reported that most superfolds contain a higher proportion of their α-helical or β-strand residues in one such supersecondary structure . Szustakowski et al. built a dictionary of supersecondary structures . Kurgan and Kedarisetti studied regularity among twilight zone protein structures at the level of the sequence segments that correspond to the secondary structure fragments of varying length . However, the organization and statistical properties of the whole conformational space of medium-to-long segments remain unclear. Statistical and systematic analyses should be done on the 'segment universe' from short to long lengths to bridge this gap.
Our previous study identified structural clusters and visualized the uneven distribution of short segments in the conformational spaces of 6–22 residues, where known and novel secondary-structure motifs are distributed as isolated clusters . The general features of the segment distribution were consistent for these lengths. However, the question we sought to answer is: Do spaces of long segments differ from those of short segments? In this study, we explore the relationships between the conformational distribution of segments and their length: 10–50 residues, thereby providing a global view of a 'segment universe' and showing critical dual changes (i.e. dual transitions) of the distribution shape in the conformational space of short to long segments. The critical changes might reflect changes of the protein structures' organization. Therefore, the present results suggest the adequacy and the possibility of further progress of the hierarchical treatment used in the recent de novo structure prediction methods. Furthermore, by comparing conformational components among structural classes (i.e., all-α, all-β, α/β, and α+β), we demonstrate the specificity and generality of protein fold classes.
Transitions of segment distribution: short, medium, and long segments
The coverage of segments in cluster(s) was calculated as described below. A densely populated region in the 3D principal component analysis (PCA) space was defined as a cluster . Given a density threshold, the segments are classifiable into two groups: those in regions of a density larger than the threshold and those outside the regions. The coverage of segments in clusters is defined as a ratio of the segments in the regions to all the segments.
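The coverage definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `coverage` and the input format (one normalized density value per segment, namely the density of the region in which that segment lies) are assumptions made for the example.

```python
def coverage(segment_densities, threshold):
    """Fraction of segments lying in regions denser than `threshold`.

    segment_densities: for each segment, the normalized density (0..1) of the
    region of PCA space in which it falls (hypothetical input format).
    """
    if not segment_densities:
        raise ValueError("need at least one segment")
    inside = sum(1 for d in segment_densities if d > threshold)
    return inside / len(segment_densities)


# Toy data: 3 of the 5 segments sit in regions with density above 0.1.
densities = [0.02, 0.5, 0.35, 0.08, 1.0]
print(coverage(densities, 0.1))
```

Raising the threshold can only shrink the dense regions, so coverage computed this way never increases as the threshold grows.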
Short length (10–22 residues long)
The conformational space of short segments showed a distribution with an extreme density gradient that originated from secondary structure clusters: α-helix and β-strand clusters were discriminated using a density of 0.01 (shown in orange in Fig. 3a). Between the lengths of 10 and 20 residues, spatial arrangements of the segment distribution, especially for α-helical, β-strand, and β-hairpin clusters, were conserved in short conformational spaces. The highly populated core of the α-helix cluster exhibited a density of 0.1 (shown as magenta in Fig. 3a), consisting of completed α-helical segments. The surrounding area of the central region consisted of various types of helical conformations including helix-capping motifs . The central region of the β-strand cluster consisted of fully extended segments that originated mainly from β-sheets and loop regions. The β-hairpin conformations were separated into several clusters at a density of 0.005. Then they were discriminated using the coordinate c2 along PC all 2 (see Methods for the definition of c2). The β-hairpin clusters showed a symmetrical relationship related to the N-terminal and C-terminal halves. They were arranged symmetrically around an edge of segment universes of short length.
Medium length (23–26 residues long)
Long length (27–50 residues long)
Conformational spaces for the long lengths were further shortened in the direction of PC all 1 and enlarged in that of PC all 3. The segment distribution converged on a large populated region that exhibited a density of 0.1 (magenta in Fig. 3c) in the conformational space. With a length of 30 residues, there were two clusters, consisting of compact segments and long α-helical segments, respectively, with densities of 0.35 (red in Fig. 3c) in the populated region. The emergence of the compact-segment cluster was attributable to an increase in various types of segments with a small radius of gyration (see inset of Fig. 3c). Various types of conformations are mixed in the compact-segment cluster. The α-hairpins are derived mainly from all-α proteins. The compact β-sheet structures are derived mainly from all-β proteins. Compact conformations of other types are derived from α/β and α+β proteins (Fig. 4c). About 2% of all segments were included in the compact-segment cluster at a length of 27 residues. In contrast, long α-helical segments with a large radius of gyration were located on the opposite side of the cluster of the compact segments along the PC all 1 axis. For lengths greater than 30 residues, the proportion of the conformations with a small radius of gyration in the compact-segment cluster increased rapidly, reaching around 14% at a length of 50 residues. Those conformations were derived from various folds (Fig. 4c). The supersecondary structures, such as βαβ units and β-sheets, were included in the compact-segment cluster (Fig. 4d).
Contribution ratios of principal axes
With respect to the individual contribution ratios (Q1-Q3) of the first three PC axes, Q1 was overwhelmingly higher than those of the other PC axes up to 50-residue length (Fig. 5), which indicates that PC all 1 is a meaningful and fundamental descriptor for segment conformation. In more detail, Q1 decreased rapidly and Q2 increased over the short segment lengths (i.e. 10–22 residues). Thereafter, both Q1 and Q2 decreased slowly. In addition, Q3 increased gradually with lengths up to 33 residues, with a maximum value of 11.5%.
Investigation of structural properties of conformational axes
The PC all 2 correlates with the degree of structural symmetry (D sym) of a segment with respect to the N-terminal and C-terminal halves. D sym is defined as follows. Given a distance matrix for a segment, let element (i, j) be the distance r ij between the Cα atoms of residues i and j. The degree of structural symmetry is then the sum of the squared differences of symmetric elements in the distance matrix: $D_{sym} = \sum_{1 \le i < j \le n} \left( r_{ij} - r_{n-(j-1),\,n-(i-1)} \right)^2$, where n is the segment length. The triangle map for PC all 2 was separated into one positive area (red) and one negative area (blue). The correlation coefficient of the conformational deviation along PC all 2 with structural symmetry, D sym, was greater than 0.90 in the segment lengths of 10–50 residues (Fig. 8). When two conformations were picked from opposite positions along PC all 2, they displayed mirror symmetry about the plane spanned by PC all 1 and PC all 3. The segment conformations picked up along PC all 2 are shown in Figs. 3a–3c.
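As an illustration of the D sym definition, the following minimal Python sketch evaluates the sum directly from Cα coordinates. The function name `d_sym` and the toy coordinates are assumptions made for this example; only the formula itself comes from the text, rewritten with 0-based indices.

```python
import math


def d_sym(ca_coords):
    """Degree of structural symmetry of a segment from its C-alpha coordinates.

    Sums, over all residue pairs i < j, the squared difference between the
    distance r_ij and the distance of the mirrored pair (n-(j-1), n-(i-1)).
    Indices here are 0-based, so the mirrored pair is (n-1-j, n-1-i).
    """
    n = len(ca_coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r_ij = math.dist(ca_coords[i], ca_coords[j])
            r_mirror = math.dist(ca_coords[n - 1 - j], ca_coords[n - 1 - i])
            total += (r_ij - r_mirror) ** 2
    return total


# A straight, evenly spaced chain looks the same from either terminus.
straight = [(float(k), 0.0, 0.0) for k in range(6)]
# Bending only the C-terminal end breaks the N/C symmetry.
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
print(d_sym(straight), d_sym(bent))
```

The straight chain gives zero because every distance equals that of its mirrored pair, whereas the bent chain gives a positive value.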
The PC all 3 correlated with a physical indicator that describes a conformational transition between structures with one turn and ones with two turns (PC all 3 in Fig. 6). The conformations picked along PC all 3 indicate that segregation of a β-hairpin structure accompanies the conformational changes along PC all 3. We defined a physical indicator (Dmn+mc) of the β-hairpin formation: Dmn+mc is the sum of the norms of two vectors drawn from the middle point of the segment to the N-terminal and C-terminal residues: $D_{mn+mc} = |\mathbf{r}_{mn}| + |\mathbf{r}_{mc}|$, where $\mathbf{r}_{mn}$ and $\mathbf{r}_{mc}$ respectively denote the vectors from the midpoint to the N-terminal and C-terminal residues of the segment. Good correlation was found between PC all 3 and Dmn+mc (Fig. 8). The correlation coefficient was greater than 0.7 for lengths of 10–50 residues. The triangle map of PC all 3 indicated a separation of one positive area (red) and two negative areas (blue). It is noteworthy that the triangle map of PC all 3 for short segments differed slightly from those of medium and long segments. A positive area is visible near the residue pair of the N-terminal and C-terminal in the short map, suggesting that PC all 3 has a (negative) correlation with D end. For medium and long lengths, the positive area was close to the center of the triangle map. Therefore, the correlation between PC all 3 and Dmn+mc/D end was necessarily smaller in medium and long lengths.
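The indicator Dmn+mc can be sketched as follows. This is an illustrative Python snippet under a stated assumption: the "middle point" is taken to be the Cα atom of the middle residue (index n // 2), which the text does not specify. The function name `d_mn_mc` and the toy coordinates are likewise hypothetical.

```python
import math


def d_mn_mc(ca_coords):
    """Sum of the distances from the segment midpoint to both termini.

    ASSUMPTION: the "middle point" is the C-alpha atom of the middle residue
    (index n // 2); the text does not spell this out.
    """
    n = len(ca_coords)
    if n < 3:
        raise ValueError("segment too short")
    mid = ca_coords[n // 2]
    return math.dist(mid, ca_coords[0]) + math.dist(mid, ca_coords[-1])


# Fully extended 5-residue chain: both termini are far from the middle.
extended = [(float(k), 0.0, 0.0) for k in range(5)]
# Compact toy chain: both termini fold back toward the middle residue.
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
           (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(d_mn_mc(extended), d_mn_mc(compact))
```

In this toy case the extended chain yields the larger value, since each half of the chain stretches away from the midpoint.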
The triangle map of PC all 4 had one negative area and one positive area. The positive area, located at the map center, suggests that PC all 4 is correlated with the radius of gyration (mid Rg) of the middle region of the segment, excluding the N-terminal and C-terminal quarter portions, in the medium and long segments. The respective correlation coefficients for the 26 and 30 residue lengths were 0.73 and 0.72. The PC all 4 also has a weak (negative) correlation with D end. The respective correlation coefficients between PC all 4 and D end for the 26 and 30 residue lengths were -0.45 and -0.42.
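For concreteness, mid Rg can be computed as in the sketch below. It is a hedged illustration: the radius of gyration is taken as the unweighted root-mean-square distance of the Cα atoms from their centroid, and the quarter length is assumed to be n // 4 because the rounding rule is not stated. The names `radius_of_gyration` and `mid_rg` are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a set of 3D points."""
    n = len(coords)
    centroid = [sum(p[a] for p in coords) / n for a in range(3)]
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)


def mid_rg(ca_coords):
    """Rg of the middle region, dropping a quarter of the residues at each end.

    ASSUMPTION: the quarter length is n // 4 (integer division); the rounding
    rule is not given in the text.
    """
    n = len(ca_coords)
    q = n // 4
    return radius_of_gyration(ca_coords[q:n - q])


# Toy 8-residue straight chain: the middle region is residues 2..5.
chain = [(float(k), 0.0, 0.0) for k in range(8)]
print(radius_of_gyration(chain), mid_rg(chain))
```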
We identified no simple physical indicator for conformational changes along PC all 5. However, visual inspection of conformations picked along PC all 5 suggests that PC all 5 is a conformational axis that represents segregated β-sheet structures. Conformations picked up from both ends of PC all 5 are depicted in Fig. 6. In the triangle map for PC all 5, two positive and two negative areas exist along the diagonal line, which might indicate that PC all 5 segregates segment conformations with double turns. The contribution ratio of PC β 5, which was derived from all-β proteins, was higher than that derived from other structural classes, which suggests that PC5 is important for describing the structural variation of β-structures.
Segment universes derived from different structural classes
The segment universes described above are those derived from proteins of the four structural classes. Therefore, decomposition of the universe into four classes is helpful to evaluate the influence of each structural class on the segment universe. To this end, a segment universe was constructed for each structural class separately, and the PC axes derived from each universe were compared with those of all segments (i.e., PC all 1-PC all 3). The first three largest eigenvectors of each structural class were also compared respectively with PC all 1, PC all 2, and PC all 3 to elucidate the structural properties of PC axes derived from each universe.
However, the curves for the contribution ratios of both all-α and all-β classes (see the two panels of Fig. 9) differ clearly from those of PC all 1 – PC all 3 (i.e. Q1 – Q3 in Fig. 5). The Q α 1 contribution ratio was always higher than 40%, which indicates that the distribution of the all-α segments has a large deviation with respect to Rg. In contrast, the Q β 1 contribution ratio decreased rapidly with increasing segment length. The value of Q α 2 increased moderately with increasing segment length. In contrast, Q β 2 had a maximum value greater than 20% at a length of 22 residues. This rapid increase of Q β 2 might reflect a typical feature of β-sheet conformations. For PC3, the curves for the contribution ratios of the all-α and all-β classes also mutually differed. Although Q β 3 peaked at a length of 35 residues, Q α 3 peaked at a short length, which indicates that the structural variable based on PC all 3 is important for β-segments longer than 30 residues. In contrast, the behaviors of the contribution ratios for both α+β and α/β classes along with the segment length resembled each other. They were also similar to Q1-Q3 in Fig. 5 because those structural classes are mixtures of α-helices and β-sheets.
Investigation of the protein segment universe is an important subject for bioinformatics. Results of this study show that the segment universe can be categorized naturally into three regimes: short, medium, and long. A main finding of this study is that the three regimes are clearly demarcated by critical changes in the shape of the segment distribution in the conformational space. Preceding studies demonstrated that the average length of α-helix is 14 residues  and that for β-strand is five residues . Results of the present study show that transitional segment lengths (22 and 26 residues long) do not coincide with these average lengths. Therefore, a single secondary structure element does not characterize the shape of the segment distribution. The appearance of the medium length regime segregates the segment fold universe into three. The combination of secondary-structure elements is important to characterize not only the medium-length segment universe but also the entire segment fold universe.
Meanwhile, loops, which make up 30% of the protein structures , are also expected to play a larger role in forming unique conformations by connecting secondary-structure elements in the medium- to long-length segment universe than in the short one. The segments in the cluster of the medium- to long-length universe tend to contain more loop regions than those of the short segment universe, as shown in Figs. 4b and 4d, and have a wider variety of origins (Figs. 4a and 4c). For example, the segments in the cluster with density of 0.35–1.0 of the universe of 30-residue length are derived from 461 of the 600 representative proteins used for this study (see Additional File 1). Longer loops that possess extended conformations are located on the opposite side of the compact-segment cluster along PC all 1 in the medium to long segment universe (Figs. 3b and 3c). Instead of discrete clusters, they appear to constitute a rather continuous distribution. Some analyses examine short loops with respect to their completeness [27, 28] and elaborate classification [26, 29]. In the analysis of short segments, our method also captured some loop conformation classes, such as joint loops connecting two helices, and exposed and extended loops that participate in protein-protein interactions .
A natural boundary was identified, in this study, between the peptide-like and protein-like distributions between the lengths of 23 and 26 residues using actual conformations of protein segments. This observation with respect to the boundary is consistent with the results described by Shen et al. , even though they used a sphere-packing model to estimate a minimal domain size of about 20 residues. A recent study by Sawada and Honda  also identified a boundary at 10–20 residue length by calculating the structural diversity of segments. They discretized the conformational space using a single-pass clustering method. In contrast, we observed the density distribution to uncover differences of conformational space between short and long segments. The segment conformational space for lengths of 10–22 residues provided a distribution with an extreme density gradient towards the secondary structure, such as the α-helix, β-strand, and β-hairpin clusters, which are expected to belong to the peptide-like conformational regime. This conformational variation reflects that short segments embedded in globular proteins are mainly stabilized by the physicochemical property of the peptide. On the other hand, the segment conformational spaces for lengths of 27 residues or more have a distribution that is dominated by compact segments, which suggests a protein-like distribution (protein-like conformational regime). This distribution arises from the hydrophobic effect imparted by the solvent molecules, which is of great importance for structural stability in long segments derived from globular proteins. If this is the case, our observations support the de novo structure prediction methods, so-called fragment assembling methods, that have been developed recently [32–35]. 
These approaches are usually based on the prediction of local segment conformations followed by assembly of segments, and generally use separate criteria at each step: sequence similarity or secondary structural propensity for the prediction of segment conformations, and non-local energy terms for the assembly step. These strategies used in the de novo prediction methods seem to be consistent with the results shown here. Results of our analyses clearly show such a hierarchical organization of protein structures, and indicate that preparing segment libraries up to around 20 residues long would be helpful for such methods.
These results indicate that the structural meanings for the conformational axes (i.e., the radius of gyration for PC all 1, structural symmetry related to the N-terminal and C-terminal halves for PC all 2, and a single-turn/two-turn structure for PC all 3) are conserved in the different lengths and structural classes. This fact suggests that these conformational components are key structural variables for protein segments. On the other hand, when conformational axes among the four structural classes were compared, we were able to identify several conformational axes that were specific to each structural class, especially in the medium length range. In fact, a distribution change for medium lengths was observed, involving an increase in compact segments. Those segments included supersecondary structures such as α-hairpins, parts of the β-sheets, and βαβ units. These results might be related to the specificity of the structural class or fold of the contents of supersecondary structures . Typical supersecondary structural motifs, α-hairpin, β-hairpin, and βαβ are, respectively, the basic structural units for the all-α, all-β, and α/β proteins. These motifs are often shared within the structural classes. Therefore, the contribution ratios observed for the class-specific conformational axes were high. Class-specific conformational axes were rarely observed in short and long lengths, probably because short segments are too nonspecific and are often shared over different structural classes; long segments are too specific and have very low contribution ratios for conformational axes that are specific for each structural class.
The class-specific conformational axes found here provide a hint for resolving a difficulty in classifying diverse sets of protein structures. Both α/β and α+β classes are known to show a substantial overlap. In the CATH classification, α/β and α+β classes are treated as a single structural class, the α-β class. Distinguishing α/β from α+β proteins is sometimes a difficult problem, although several classification [19, 36, 37] and prediction [38, 39] schemes have been proposed. The present study showed that the universes of the α/β and α+β classes have similar characteristics, and also have unique ones at the same time. For example, our results show that PCα/β8, whose contribution ratio was 1.4%, was associated only with the βαβ motif. In the α+β class, no axis was strongly correlated with PCα/β8 (see Additional File 2), which is a clear example of the difference in structural variables between α+β and α/β classes originating from class-specific supersecondary structures. Consequently, projecting segments onto a conformational subspace using the axis PCα/β8 could be useful for objectively dividing protein domains of the α-β class into α/β and α+β classes. A considerable localization of segments derived from α/β proteins in a PCA subspace is observed (see Additional Files 3 and 4).
An effective method must be developed for conformational sampling for de novo prediction methods. The resulting structural variables analyzed in this study would be helpful for additional progress in de novo structure prediction. For example, testing the distribution of segments or models in terms of the degree of symmetry using the descriptor (D sym ) might be useful to verify the completeness of sampling of the conformational space. Using a filtering threshold or function (generally used in fragment assembling methods for selecting proper models) that is tolerant of the radius of gyration might be useful for improving the prediction of all-α proteins because the contribution ratio, Qα1, of PCα 1 corresponding to the radius of gyration (Rg) is larger than those of the other structural classes in the medium and long segments. Consequently, projecting segments of models onto a conformational subspace constructed by PC x (where x = α, β, α/β, α+β, or all) axes might be helpful for filtering out models and assigning a protein to a structural class.
In this study, we have shown the dual critical transitions in the protein segment universe from short to long lengths. Our observations relate these transitions to the significance of the two-stage treatment in de novo structure prediction. Considering the hierarchical organization of the protein segment universe that we have shown, we suggest the efficacy of using secondary-structure-directed evaluation functions for sampling local structures less than 23 residues long. We also suggest the suitability of evaluating protein-like features of models using another function (e.g. Rg) for longer segments. Changing the filtering criteria for each structural class will enhance the effectiveness of the conformation sampling process. Through these analyses, we have demonstrated that our clustering methodology is useful to identify a distinctive distribution shift of conformational space between short and long segments and that distribution changes depend on structural classes.
Preparing the segment libraries
One representative from each fold group of the SCOP database (ver. 1.63)  was chosen to obtain a segment library without a bias of usage of the folds. The representatives cover the four major structural classes (all-α, all-β, α+β, and α/β), because we specifically examine the nature of segments embedded in globular proteins of usual size. Small proteins of fewer than 50 residues and non-single chain proteins with fewer than 100 residues were excluded, as were membrane proteins. Those proteins are expected to possess structural properties different from those of globular proteins of usual size and to induce biased results. In all, 600 representatives were used for this study (all-α, 150; all-β, 116; α+β, 219; α/β, 115; see Additional File 5). Dividing the protein structures into segments with a window sliding one residue at a time along the sequence generated a segment library of arbitrary length. We prepared a segment library for each length of 10–50 residues to generate conformational spaces of short-to-long segments. In doing so, segments with incomplete coordinate data (e.g., having an unusual covalent-bond length or lacking main-chain atoms) were excluded. Furthermore, to elucidate differences among the conformational spaces derived from the four major structural classes, we generated a segment library for each class.
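The sliding-window step can be illustrated with a short Python sketch. It is not the authors' pipeline: the function name `sliding_segments` is hypothetical, and marking a residue with incomplete coordinates as `None` is a convention chosen only for this example (the check for unusual covalent-bond lengths is not modeled).

```python
def sliding_segments(ca_coords, length):
    """Return every contiguous window of `length` residues, shifted by one.

    Residues with missing coordinates are marked as None (a convention chosen
    for this example); any window containing one is skipped.
    """
    segments = []
    for start in range(len(ca_coords) - length + 1):
        window = ca_coords[start:start + length]
        if all(c is not None for c in window):
            segments.append(window)
    return segments


# Toy chain of 12 residues with one unresolved residue at index 5.
chain = [(float(k), 0.0, 0.0) for k in range(12)]
chain[5] = None
library = {n: sliding_segments(chain, n) for n in (4, 6, 10)}
print({n: len(segs) for n, segs in library.items()})
```

A chain of L complete residues yields L - n + 1 segments of length n, so a single unresolved residue removes up to n windows from the library.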
Construction and visualization of conformational space
We previously reported a method for constructing and visualizing the conformational space of protein segments using principal component analysis based on intra-segment Cα-Cα atomic distances . Briefly, atomic distances of all Cα-Cα pairs for each segment in a segment library of an arbitrary length were calculated first. A distance is designated as $q_i$, where i is the index for the Cα-Cα pair, $i = 1, \ldots, n(n-1)/2$, and n is the segment length, as expressed by the number of residues in the segment. Subsequently, a set of eigenvectors and eigenvalues was obtained by diagonalizing a variance-covariance matrix, C, that was calculated as $C_{ij} = \langle (q_i - \langle q_i \rangle)(q_j - \langle q_j \rangle) \rangle = \langle q_i q_j - q_i \langle q_j \rangle - \langle q_i \rangle q_j + \langle q_i \rangle \langle q_j \rangle \rangle = \langle q_i q_j \rangle - \langle q_i \rangle \langle q_j \rangle - \langle q_i \rangle \langle q_j \rangle + \langle q_i \rangle \langle q_j \rangle = \langle q_i q_j \rangle - \langle q_i \rangle \langle q_j \rangle$, where the average $\langle \cdots \rangle$ is taken over the segments. Two equations, $C \mathbf{v}_i = \lambda_i \mathbf{v}_i$ and $\mathbf{v}_i \cdot \mathbf{v}_j = \delta_{ij}$, are satisfied. Eigenvectors with larger eigenvalues are more important in the study of the conformational varieties of the segments. Eigenvalues are arranged in descending order: $\lambda_i > \lambda_j$ if $i < j$. The contribution ratio of the i-th PCA element (i.e. the i-th eigenvector) to the whole conformational distribution is given as $Q_i = \lambda_i / \sum_{k}^{\mathrm{all}} \lambda_k$. The eigenvectors, which are called PC x 1, PC x 2, PC x 3, etc., were used as conformational axes to construct a segment conformational space, a PCA space, in which x indicates a segment dataset (x = α, β, α/β, α+β, or all). The indicator "x = all" is given when conformational axes are generated by the whole segment dataset. The origin of the PCA space is set on the average Cα-Cα atomic distances: $\langle \mathbf{q} \rangle = [\langle q_1 \rangle, \langle q_2 \rangle, \langle q_3 \rangle, \ldots, \langle q_n \rangle]$. This enables ready comparison of conformational distributions between constructed universes. Any position (i.e. any segment structure) in the PCA space can be expressed using a linear combination of eigenvectors as $c_k = \sum_{n}^{\mathrm{all}} (\mathbf{q} - \langle \mathbf{q} \rangle) \cdot \mathbf{v}_k \lambda_k^{1/2}$, where $c_k$ is a coordinate (i.e. projection of q) on the PC axis k. Using the first three eigenvectors (PC x 1, PC x 2, PC x 3), a three-dimensional (3D) PCA space can be constructed.
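The first steps of this construction can be sketched in plain Python as shown below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the population (divide-by-M) average is used for the covariance, and the diagonalization itself is omitted because it would normally be delegated to a linear-algebra library. `contribution_ratios` therefore takes eigenvalues that are assumed to have been computed elsewhere.

```python
import math
from itertools import combinations


def distance_vector(ca_coords):
    """All n(n-1)/2 C-alpha/C-alpha distances of one segment, in fixed order."""
    return [math.dist(a, b) for a, b in combinations(ca_coords, 2)]


def covariance_matrix(vectors):
    """C_ij = <q_i q_j> - <q_i><q_j>, averaging over the segments."""
    m = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / m for i in range(dim)]
    return [[sum(v[i] * v[j] for v in vectors) / m - mean[i] * mean[j]
             for j in range(dim)] for i in range(dim)]


def contribution_ratios(eigenvalues):
    """Q_i = lambda_i / sum_k lambda_k, with eigenvalues sorted descending."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [lam / total for lam in ordered]


# Two toy 3-residue segments give distance vectors of length 3.
seg_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
seg_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
q = [distance_vector(s) for s in (seg_a, seg_b)]
C = covariance_matrix(q)
print(len(q[0]), len(C), contribution_ratios([1.0, 6.0, 3.0]))
```

Because every segment of a given length maps to a vector of the same dimension, segments from different proteins can be pooled into one covariance matrix without structural superposition.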
We defined a vector, r, to express the position of each segment in the 3D PCA space: r = [c1, c2, c3]. After projection of the segments on the 3D PCA space, the distribution of segments in the 3D PCA space was visualized using the following procedure. The 3D space was divided into N bins along each axis ($N^3$ cubes in total). The bin size was defined as (max[c1] - min[c1])/N, where N = 36, and max[c1] and min[c1] respectively signify the maximum and minimum of the coordinates of the segments along the first principal component axis. The number (i.e. frequency) of segments detected in a cube represents the density (i.e. probability) of segments to be found in the cube. The density of each cube, ρ, was normalized by the maximum density, ρmax, among the cubes so that the maximal value of normalized density (we call this density in the text) is set to 1 (refer to eq. (3) in ). Four levels of contour surfaces (i.e. iso-density surfaces) were depicted to visualize the 3D PCA space. The density values for those surfaces were set respectively as 0.005, 0.01, 0.1, and 0.35.
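The binning and normalization procedure can be sketched as follows. This is an illustrative Python snippet, not the authors' code: the function name `normalized_density` is hypothetical, and it assumes that each axis is binned from its own minimum using the single cube edge derived from c1. Occupied cubes are stored sparsely in a dictionary rather than in a fixed N×N×N array.

```python
import math
from collections import Counter


def normalized_density(points, n_bins=36):
    """Map each occupied cube index (ix, iy, iz) to its normalized density.

    The cube edge is (max c1 - min c1) / n_bins, as in the text.
    ASSUMPTION: every axis is binned from its own minimum with that same edge.
    """
    c1 = [p[0] for p in points]
    edge = (max(c1) - min(c1)) / n_bins
    if edge == 0:
        raise ValueError("points have no spread along the first axis")
    mins = [min(p[a] for p in points) for a in range(3)]
    counts = Counter(
        tuple(math.floor((p[a] - mins[a]) / edge) for a in range(3))
        for p in points
    )
    rho_max = max(counts.values())
    return {cube: count / rho_max for cube, count in counts.items()}


# Toy projection: three segments share one cube, one lies far along c1.
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.0, 0.0), (36.0, 0.0, 0.0)]
density = normalized_density(points)
dense_cubes = [cube for cube, rho in density.items() if rho >= 0.35]
print(density, dense_cubes)
```

Thresholding the returned values at 0.005, 0.01, 0.1, and 0.35 would select the cubes enclosed by the four iso-density surfaces described above.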
We also constructed the universe separately for each of the four structural classes to assess differences among their conformational spaces. For this study, we specifically examined the first 10 PC axes of each structural class because these axes are more important than the remaining axes for capturing differences in the conformational axes. Although eigenvectors from the same structural class are mutually uncorrelated (i.e., $v^x_i \cdot v^x_j = 0$, where $i \neq j$ and $x$ = α, β, α/β, or α+β), eigenvectors from different structural classes might be correlated (i.e., $v^x_i \cdot v^y_j \neq 0$, where $x \neq y$). A PC axis from one structural class is defined as a conformational component specific to that class when it has no similarity to any of the first 20 PC axes from the other structural classes, where similarity means a correlation coefficient > 0.8 (i.e. $v^x_i \cdot v^y_j > 0.8$).
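The class-specificity criterion can be sketched in plain Python as follows. The function name `class_specific_axes`, the input layout, and the toy three-dimensional unit vectors are hypothetical. The sketch uses the signed dot product exactly as written in the text; because the sign of an eigenvector is arbitrary, comparing absolute values instead would be a reasonable variant, but that is an assumption and not something the text states.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def class_specific_axes(axes_by_class, n_query=10, n_ref=20, threshold=0.8):
    """Return {class: indices of PC axes with no similar axis in other classes}.

    axes_by_class: dict mapping a class label to its unit eigenvectors,
    ordered by descending eigenvalue (hypothetical input layout).
    Indices are 0-based, so index 0 corresponds to the first PC axis.
    """
    specific = {}
    for x, axes_x in axes_by_class.items():
        specific[x] = []
        for i, v in enumerate(axes_x[:n_query]):
            similar = any(
                dot(v, w) > threshold
                for y, axes_y in axes_by_class.items() if y != x
                for w in axes_y[:n_ref]
            )
            if not similar:
                specific[x].append(i)
    return specific

# Toy unit vectors (not real eigenvectors): the first axes of the two classes
# overlap with a dot product of 0.9, the second axes have no counterpart.
axes = {
    "alpha": [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "beta": [(0.9, math.sqrt(0.19), 0.0), (0.0, 0.0, 1.0)],
}
specific = class_specific_axes(axes)
```

Here the second axis of each toy class is reported as class-specific, while the first axes are treated as shared because their dot product exceeds 0.8.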
KI and JH were partly supported by BIRD of Japan Science and Technology Agency (JST).
- Matsuo Y, Kanehisa M: An approach to systematic detection of protein structural motifs. Comput Appl Biosci 1993, 9(2):153–159.
- Unger R, Sussman JL: The importance of short structural motifs in protein structure analysis. J Comput Aided Mol Des 1993, 7(4):457–472. doi:10.1007/BF02337561
- Micheletti C, Seno F, Maritan A: Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins 2000, 40(4):662–674. doi:10.1002/1097-0134(20000901)40:4<662::AID-PROT90>3.0.CO;2-F
- Prestrelski SJ, Byler DM, Liebman MN: Generation of a substructure library for the description and classification of protein secondary structure. II. Application to spectra-structure correlations in Fourier transform infrared spectroscopy. Proteins 1992, 14(4):440–450. doi:10.1002/prot.340140405
- Rackovsky S: Quantitative organization of the known protein x-ray structures. I. Methods and short-length-scale results. Proteins 1990, 7(4):378–402. doi:10.1002/prot.340070409
- de Brevern AG, Etchebest C, Hazout S: Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 2000, 41(3):271–287. doi:10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z
- Fetrow JS, Palumbo MJ, Berg G: Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 1997, 27(2):249–271. doi:10.1002/(SICI)1097-0134(199702)27:2<249::AID-PROT11>3.0.CO;2-M
- Sander O, Sommer I, Lengauer T: Local protein structure prediction using discriminative models. BMC Bioinformatics 2006, 7: 14. doi:10.1186/1471-2105-7-14
- Rooman MJ, Rodriguez J, Wodak SJ: Automatic definition of recurrent local structure motifs in proteins. J Mol Biol 1990, 213(2):327–336. doi:10.1016/S0022-2836(05)80194-9
- Schuchhardt J, Schneider G, Reichelt J, Schomburg D, Wrede P: Local structural motifs of protein backbones are classified by self-organizing neural networks. Protein Eng 1996, 9(10):833–842. doi:10.1093/protein/9.10.833
- Hunter CG, Subramaniam S: Protein fragment clustering and canonical local shapes. Proteins 2003, 50(4):580–588. doi:10.1002/prot.10309
- Tomii K, Kanehisa M: Systematic detection of protein structural motifs. In Pattern discovery in biomolecular data. Edited by: Wang JTL, Shapiro BA, Shasha D. New York: Oxford University Press; 1999:97–110.
- Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol 2006, 16(3):393–398. doi:10.1016/j.sbi.2006.04.007
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. doi:10.1016/S0969-2126(97)00260-8
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.
- Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G: A database of protein structure families with common folding motifs. Protein Sci 1992, 1(12):1691–1698.
- Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385. doi:10.1016/S0959-440X(96)80058-3
- Holm L, Sander C: Mapping the protein universe. Science 1996, 273(5275):595–603. doi:10.1126/science.273.5275.595
- Hou J, Sims GE, Zhang C, Kim SH: A global representation of the protein fold space. Proc Natl Acad Sci USA 2003, 100(5):2386–2390. doi:10.1073/pnas.2628030100
- Salem GM, Hutchinson EG, Orengo CA, Thornton JM: Correlation of observed fold frequency with the occurrence of local structural motifs. J Mol Biol 1999, 287(5):969–981. doi:10.1006/jmbi.1999.2642
- Szustakowski JD, Kasif S, Weng Z: Less is more: towards an optimal universal description of protein folds. Bioinformatics 2005, 21(Suppl 2):ii66–71. doi:10.1093/bioinformatics/bti1111
- Kurgan L, Kedarisetti KD: Sequence representation and prediction of protein secondary structure for structural motifs in twilight zone proteins. Protein J 2006, 25(7–8):463–474. doi:10.1007/s10930-006-9029-0
- Ikeda K, Tomii K, Yokomizo T, Mitomo D, Maruyama K, Suzuki S, Higo J: Visualization of conformational distribution of short to medium size segments in globular proteins and identification of local structural motifs. Protein Sci 2005, 14(5):1253–1265. doi:10.1110/ps.04956305
- Kumar S, Bansal M: Structural and sequence characteristics of long alpha helices in globular proteins. Biophys J 1996, 71(3):1574–1586.
- Penel S, Morrison RG, Dobson PD, Mortishire-Smith RJ, Doig AJ: Length preferences and periodicity in beta-strands. Antiparallel edge beta-sheets are more likely to finish in non-hydrogen bonded rings. Protein Eng 2003, 16(12):957–961. doi:10.1093/protein/gzg147
- Donate LE, Rufino SD, Canard LH, Blundell TL: Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci 1996, 5(12):2600–2616.
- Fidelis K, Stern PS, Bacon D, Moult J: Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng 1994, 7(8):953–960. doi:10.1093/protein/7.8.953
- Lessel U, Schomburg D: Creation and characterization of a new, non-redundant fragment data bank. Protein Eng 1997, 10(6):659–664. doi:10.1093/protein/10.6.659
- Wojcik J, Mornon JP, Chomilier J: New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification. J Mol Biol 1999, 289(5):1469–1490. doi:10.1006/jmbi.1999.2826
- Shen M-y, Davis FP, Sali A: The optimal size of a globular protein domain: A simple sphere-packing model. Chemical Physics Letters 2005, 405(1–3):224–228. doi:10.1016/j.cplett.2005.02.029
- Sawada Y, Honda S: Structural diversity of protein segments follows a power-law distribution. Biophys J 2006, 91(4):1213–1223. doi:10.1529/biophysj.105.076661
- Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D: De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002, 322(1):65–78. doi:10.1016/S0022-2836(02)00698-8
- Chikenji G, Fujitsuka Y, Takada S: A reversible fragment assembly method for de novo protein structure prediction. The Journal of Chemical Physics 2003, 119(13):6895–6903. doi:10.1063/1.1597474
- Lee J, Kim S-Y, Lee J: Protein structure prediction based on fragment assembly and parameter optimization. Biophysical Chemistry 2005, 115(2–3):209–214. doi:10.1016/j.bpc.2004.12.046
- Bujnicki JM: Protein-structure prediction by recombination of fragments. Chembiochem 2006, 7(1):19–27. doi:10.1002/cbic.200500235
- Michie AD, Orengo CA, Thornton JM: Analysis of domain structural class using an automated class assignment protocol. J Mol Biol 1996, 262(2):168–185. doi:10.1006/jmbi.1996.0506
- Kurgan LA, Zhang T, Zhang H, Shen S, Ruan J: Secondary structure-based assignment of the protein structural classes. Amino Acids 2008.
- Chou KC: Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr Protein Pept Sci 2005, 6(5):423–436. doi:10.2174/138920305774329368
- Kurgan L, Cios K, Chen K: SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008, 9: 226. doi:10.1186/1471-2105-9-226
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. doi:10.1002/bip.360221211