 Research Article
 Open Access
Clustering and percolation in protein loop structures
BMC Structural Biology volume 15, Article number: 22 (2015)
Abstract
Background
High precision protein loop modelling remains a challenge, both in template based and template independent approaches to protein structure prediction.
Method
We introduce the concepts of protein loop clustering and percolation, to develop a quantitative approach to systematically classify the modular building blocks of loops in crystallographic folded proteins. These fragments are all different parameterisations of a unique kink solution to a generalised discrete nonlinear Schrödinger (DNLS) equation. Accordingly, the fragments are also local energy minima of the ensuing energy function.
Results
We show how the loop fragments cover practically all ultrahigh resolution crystallographic protein structures in the Protein Data Bank (PDB), with a 0.2 Ångström root-mean-square (RMS) precision. We find that no more than 12 different loop fragments are needed to describe around 38 % of ultrahigh resolution loops in PDB. But there is also a large number of loop fragments that are either unique or very rare, and examples of unique fragments are found even in the structure of a myoglobin.
Conclusions
Protein loops are built in a modular fashion. The loops are composed of fragments that can be modelled by the kink of the DNLS equation. The majority of loop fragments are common, in the sense that they are shared by many proteins. These common fragments are probably important for supporting the overall protein conformation. But there are also several fragments that are either unique to a given protein, or very rare. Such fragments are probably related to the function of the protein. Furthermore, we have found that the amino acid sequence does not determine the structure in a unique fashion. There are many examples of loop fragments with an identical amino acid sequence, but with a very different structure.
Background
Protein taxonomy [1–5] reveals that crystallographic protein structures have surprisingly little conformational diversity. It might be that the majority of different conformations have already been found [6, 7]. This apparent convergence in protein structure provides the rationale for the development of comparative modelling or threading techniques [8–12]. These approaches try to predict the tertiary structure of a folded protein using libraries of known protein structures as templates. According to the community-wide Critical Assessment for Structural Prediction (CASP) tests [13], at the moment methods of this kind have the best predictive power to determine a folded conformation.
In the loop regions, comparative modelling approaches still lack precision [14, 15]. It is not uncommon that there are gaps in the loop regions that need to be filled by various insertion techniques. The success in loop modelling is also often limited to supersecondary structures where α-helices and β-strands are connected to each other by relatively short twists and turns [16, 17]. In the case of a very short loop, with no more than three residues, the shape can be determined by a combination of geometrical considerations and stereochemical constraints [18]. In the case of longer loops, both template based and template independent methods are being developed to predict their shapes [19–21]. The underlying assumption is that the number of loop conformations which can be accommodated by a given sequence should be limited. The different fragments which are already available in the Protein Data Bank (PDB) [22] database could then be used like Lego bricks, as structural building blocks in constructing the loops. A given amino acid sequence is simply divided into short fragments, and the shape of the ensuing loop is deduced using homologically related fragments that have known structures. The entire protein is then assembled by joining these fragments together. For the process of joining the fragments, both all-atom energy functions and comparisons with closely homologous template structures in the Protein Data Bank can be utilised [8, 9, 12, 14].
In the present article we propose a new systematic, purely quantitative method to identify and classify the modular building blocks of PDB loops; we identify a loop following the DSSP [23] convention. Our approach is based on a first-principles energy function [24–29]. It is built on the concept of universality [30–36] to model the fragments of even long protein loops in terms of different parameterisations of a unique kink that solves a variant [37, 38] of the discrete nonlinear Schrödinger (DNLS) equation [39, 40]. Our starting point is the observation made in [41] that over 92 % of loops in those PDB structures that have been measured with better than 2.0 Å resolution can be composed from 200 different parameterisations of the kink profile, with better than 0.65 Ångström RMSD (root-mean-square distance) accuracy. Here we refine this observation, with the aim to develop it into a systematic loop fragment classification scheme. For this we consider only those ultrahigh precision PDB structures that have been measured with better than 1.0 Å resolution. This ensures that the B-factors in the loop regions are small, and in particular that the structures have not been subjected to extensive refinement procedures. Indeed, two loop fragments should be considered different only when the average interatomic distance is larger than the average Debye-Waller B-factor fluctuation distance. If the B-factors are large, any systematic attempt to identify and/or distinguish two fragments becomes ambiguous. In the case of these ultrahigh resolution structures we can aim for an RMSD precision of 0.2 Å. We estimate this to be the extent of zero-point fluctuations, i.e. a distance around 0.2 Å corresponds to the intrinsic uncertainty in the determination of heavy atom positions along the protein backbone. Thus any difference less than 0.2 Å between average atomic coordinates is essentially undetectable.
By explicit constructions, we show how in the case of this subset of ultrahigh resolution PDB protein structures, the loops can be systematically modelled using combinations of the unique kink of the generalised DNLS equation. As such, our approach provides a foundation for a general scheme to classify loops in high precision crystallographic PDB structures, in terms of a first-principles mathematical concept based on an energy function.
Method
Cα based Frenet frames
Let \(\mathbf{r}_{i}\) (i = 1, …, N) be the coordinates of the protein backbone α-carbon (Cα) atoms. The indexing starts from the N terminus. At each \(\mathbf{r}_{i}\) we introduce the discrete Frenet frame \((\mathbf{t}_{i},\mathbf{n}_{i},\mathbf{b}_{i})\) shown in Fig. 1, following the method in reference [42].
From the Frenet frames, we define the virtual Cα backbone bond (κ) and torsion (τ) angles shown in Fig. 2 as follows,
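A minimal sketch of how the virtual bond and torsion angles can be computed from a Cα trace. It assumes the usual discrete Frenet construction (unit tangents along the virtual bonds, binormals from cross products of consecutive tangents); the function names, the index bookkeeping and the sign convention for τ are illustrative choices and may differ in detail from [42]. The helper `ideal_helix` uses rough textbook α-helix values (radius 2.3 Å, rise 1.5 Å, 100° per residue) purely as test input.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def clamp(x):
    # Guard acos against rounding slightly outside [-1, 1].
    return max(-1.0, min(1.0, x))

def bond_and_torsion_angles(coords):
    """Return (kappa, tau) lists for a C-alpha trace given as (x, y, z) tuples."""
    # Unit tangents along consecutive virtual bonds.
    t = [unit(sub(coords[i + 1], coords[i])) for i in range(len(coords) - 1)]
    # Unit binormals from consecutive tangents (undefined for collinear bonds).
    b = [unit(cross(t[i - 1], t[i])) for i in range(1, len(t))]
    # Bond angle between consecutive tangents, in [0, pi].
    kappa = [math.acos(clamp(dot(t[i], t[i + 1]))) for i in range(len(t) - 1)]
    tau = []
    for j in range(len(b) - 1):
        angle = math.acos(clamp(dot(b[j], b[j + 1])))
        # b[j] and b[j+1] are both orthogonal to t[j+1]; their cross product
        # is parallel to that tangent and fixes the sign of the torsion.
        sign = 1.0 if dot(cross(b[j], b[j + 1]), t[j + 1]) >= 0.0 else -1.0
        tau.append(sign * angle)
    return kappa, tau

def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Right-handed helix with rough alpha-helix geometry (assumed test data)."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(i * step), radius * math.sin(i * step), rise * i)
            for i in range(n)]
```

For a trace with N residues this yields N−2 bond angles and N−3 torsion angles. On the ideal helix the bond angles come out close to π/2 and the torsion angles are positive and slightly below one radian; mirroring the helix flips the sign of τ.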
We identify the bond angle κ ∈ [0, π] with the latitude angle of a two-sphere which is centered at the Cα carbon; the tangent vector t points towards the north pole where κ = 0. The torsion angle τ ∈ [−π, π) is the longitudinal angle on the sphere. We have τ = 0 on the great circle that passes both through the north pole and through the tip of the normal vector n, and the longitude increases in the counterclockwise direction around the tangent vector. We stereographically project the sphere onto the complex (x,y) plane from the south pole
as shown in Fig. 3; the north pole where κ = 0 becomes mapped to the origin (x,y) = (0,0) while the south pole κ = π is sent to infinity.
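A minimal sketch of the stereographic map, assuming a unit sphere projected from the south pole onto its equatorial plane, so that a point with latitude κ and longitude τ lands at tan(κ/2)·e^{iτ}. The overall scale (unit radius, equatorial plane) and the function names are assumptions; only the limiting behaviour at the two poles is fixed by the description above.

```python
import cmath
import math

def project(kappa, tau):
    """Map (kappa, tau) on the unit two-sphere to the complex plane.

    The projection is taken from the south pole (kappa = pi), so the
    north pole (kappa = 0) maps to the origin.
    """
    return math.tan(kappa / 2.0) * cmath.exp(1j * tau)

def unproject(z):
    """Inverse map: recover (kappa, tau) from a point z of the plane."""
    return 2.0 * math.atan(abs(z)), cmath.phase(z)
```

With this normalisation the equator κ = π/2 maps to the unit circle, and points approach infinity as κ approaches π.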
We often find it convenient to extend the range of the latitude κ from positive to arbitrary real values. We compensate for this double covering of the sphere by introducing the following discrete \(\mathbb Z_{2}\) gauge transformation
This transformation has no effect on the backbone coordinates \(\mathbf{r}_{i}\), and it leaves the Cα backbone intact.
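The invariance of the backbone can be checked numerically. The sketch below assumes the form of the gauge transformation commonly used in the cited literature, κ_k → −κ_k for all k ≥ i together with τ_i → τ_i − π, and rebuilds the trace with a discrete Frenet rotation in which the new tangent has polar angle κ and azimuth τ in the previous frame. Both the explicit transformation and the helper names are assumptions made for illustration.

```python
import math

def step_frame(frame, kappa, tau):
    """Advance the frame (n, b, t) by one site with a discrete Frenet rotation."""
    n, b, t = frame
    ck, sk = math.cos(kappa), math.sin(kappa)
    ct, st = math.cos(tau), math.sin(tau)

    def comb(x, y, z):
        return tuple(x * n[k] + y * b[k] + z * t[k] for k in range(3))

    return (comb(ck * ct, ck * st, -sk),  # new normal
            comb(-st, ct, 0.0),           # new binormal
            comb(sk * ct, sk * st, ck))   # new tangent

def rebuild_trace(kappas, taus, bond=3.8):
    """Rebuild C-alpha positions from one (kappa, tau) pair per step."""
    frame = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    pos = (0.0, 0.0, 0.0)
    trace = [pos]
    # The first virtual bond points along the initial tangent.
    pos = tuple(pos[k] + bond * frame[2][k] for k in range(3))
    trace.append(pos)
    for kappa, tau in zip(kappas, taus):
        frame = step_frame(frame, kappa, tau)
        pos = tuple(pos[k] + bond * frame[2][k] for k in range(3))
        trace.append(pos)
    return trace

def gauge_transform(kappas, taus, site):
    """Flip the sign of kappa from `site` onwards and shift tau at `site` by -pi."""
    new_kappas = [(-k if idx >= site else k) for idx, k in enumerate(kappas)]
    new_taus = list(taus)
    new_taus[site] -= math.pi
    return new_kappas, new_taus
```

Rebuilding the trace before and after `gauge_transform` gives the same coordinates up to rounding: the normal and binormal vectors change sign from the chosen site onwards, but the tangents, and hence the positions, do not.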
The Cα trace visualization, loops and kinks
The Cα map
We visualise the backbone Cα trace of a given protein in terms of a trajectory on the stereographically projected two-sphere, as follows [43–45]: At the location of each Cα we introduce the corresponding discrete Frenet frame \((\mathbf{t}_{i},\mathbf{n}_{i},\mathbf{b}_{i})\). The base of the \(i\)th tangent vector \(\mathbf{t}_{i}\) is located at the position \(\mathbf{r}_{i}\) of the \(i\)th Cα carbon; it coincides with the centre of the two-sphere, and the vector \(\mathbf{t}_{i}\) points towards the north pole. We translate the sphere from the location of the \(i\)th Cα to the location of the \((i+1)\)th Cα, without introducing any rotation of the sphere with respect to the \(i\)th Frenet frame. We identify the direction of \(\mathbf{t}_{i+1}\), i.e. the direction towards the Cα carbon at site \(\mathbf{r}_{i+2}\) from the site \(\mathbf{r}_{i+1}\), on the surface of the sphere in terms of the ensuing spherical coordinates \((\kappa_{i},\tau_{i})\). We repeat the procedure for all the backbones in PDB. To enhance statistics, for visualisation purposes we use here those protein structures that have been measured with better than 2.0 Å resolution, which gives us the map shown in Fig. 4 a; see also Figure S1 in Additional file 1. The colour intensity correlates directly with the statistical distribution of the \((\kappa_{i},\tau_{i})\): red is large, blue is small and white is none. The map describes the direction of the Cα carbon at \(\mathbf{r}_{i+2}\) as it is seen at the vertex \(\mathbf{r}_{i+1}\), in terms of the Frenet frame at \(\mathbf{r}_{i}\).
Note how the statistical distribution in Fig. 4 concentrates within an annulus, roughly between the latitude angle values (in radians) κ∼1 and κ∼π/2. The exterior of the annulus is a sterically excluded region. The entire interior is in principle sterically allowed, but it is very rarely occupied in the case of folded proteins. The four major secondary structure regions, α-helices, β-strands, left-handed α-helices and loops, are identified according to their PDB classification. For example, (κ,τ) values (in radians) for which
describe a right-handed α-helix, and values for which
describe a β-strand. We note that Fig. 4 a is akin to the Newman projection of stereochemistry: The vector \(\mathbf{t}_{i}\), which is denoted by the red dot at the centre of the figure, points along the backbone from the proximal Cα at \(\mathbf{r}_{i}\) towards the distal Cα at \(\mathbf{r}_{i+1}\), and the colour intensity displays the statistical distribution of the \(\mathbf{r}_{i+2}\) direction. We also note that Fig. 4 provides non-local information on the backbone geometry; the information content extends over several peptide units. This is unlike the Ramachandran map, which can only provide localised information in the immediate vicinity of a single Cα carbon. As shown in [46], the Cα backbone bond and torsion angles \((\kappa_{i},\tau_{i})\) are sufficient to reconstruct the entire backbone, while the Ramachandran angles are not.
In Fig. 4 b we visualise as an example a path made by a generic protein loop that connects two right-handed α-helical structures. A notable property of the trajectory drawn in Fig. 4 b is that it encircles the north pole of the two-sphere. It turns out that this kind of encircling is quite generic for loops, and even for entire folded proteins [47]. Consequently, we assign to each loop a winding number, which we term the folding index and denote \(Ind_{f}\) [47], defined as follows,
\(Ind_{f} = \left[\Gamma/\pi\right]\) (8)
Here [x] denotes the integer part of x, and Γ is the total rotation angle (in radians) that the projections of the Cα atoms of the consecutive loop residues make around the north pole. The folding index is a positive integer when the rotation is counterclockwise, and a negative integer when the rotation is clockwise. The folding index can be used to detect and classify loop structures and entire folded proteins, in terms of its values. The value is equal to twice the number of times the ensuing pathway encircles the north pole in the map of Fig. 4; for the trajectory shown in Fig. 4 b the folding index is +2.
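A minimal sketch of the folding index, assuming it is the integer part of Γ/π, which is what the statement about twice the number of encirclings implies. Since the longitude of a projected point is its torsion angle τ, Γ is accumulated here from successive differences of τ, each wrapped into [−π, π); this shortest-arc rule between consecutive residues is an assumption, as are the function names.

```python
import math

def wrap_angle(delta):
    """Wrap an angle difference into the interval [-pi, pi)."""
    return (delta + math.pi) % (2.0 * math.pi) - math.pi

def folding_index(taus):
    """Integer part of the total rotation around the north pole, in units of pi."""
    gamma = sum(wrap_angle(b - a) for a, b in zip(taus, taus[1:]))
    # int() truncates towards zero, so clockwise paths give negative integers.
    return int(gamma / math.pi)
```

A path whose longitude advances counterclockwise by a little more than one full turn returns +2, and the same path traversed in the opposite order returns −2. Values of Γ that sit exactly on a multiple of π are sensitive to floating-point rounding, so a small tolerance may be needed in practice.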
The discrete nonlinear Schrödinger equation
The virtual bond length between two neighboring Cα atoms is essentially constant, with the value 3.8 Å. Accordingly the Helmholtz free energy for the Cα trace backbone can be expressed in terms of the virtual bond angles \(\kappa_{i}\) and dihedral angles \(\tau_{i}\) only. To the leading order in the infrared limit the result coincides with
This is essentially the Hamiltonian of the discrete nonlinear Schrödinger equation [39, 40]; for a detailed derivation we refer to [24–29]. Remarkably, the free energy (9) supports a kink (topological soliton) as a classical solution [37, 38]. An excellent approximation of a kink can be obtained by naively discretising the kink solution of the continuum nonlinear Schrödinger equation [37, 38, 48]
The torsion angles τ are then expressed as functions of the bond angles κ
From (11) we conclude that the overall scale of the parameters (d,q) and (e,b) cancels in the expression for the torsion angles. This leaves us with only three independent parameters. In (10) there are four parameters when we use translation invariance to remove s. Thus the profile of a single kink becomes fully determined in terms of seven independent parameters. This coincides exactly with the number of independent coordinates along a Cα backbone segment with six residues. To see this, we may always place the first residue to coincide with the origin of a Cartesian (xyz) coordinate system. We can always place the second residue along the z-axis, and we can always place the third residue on the x=0 plane. Thus, there is only one independent coordinate for the first three residues. Since the remaining three residues can each be placed in arbitrary angular directions, there are six additional independent coordinates. Accordingly, a segment with six residues indeed engages seven independent parameters.
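The counting argument can be written down for a segment of any length. The sketch below is illustrative bookkeeping with hypothetical function names: it counts Cartesian coordinates minus rigid-body motions minus fixed virtual bond lengths, and compares the result with the number of internal bond and torsion angles.

```python
def angle_counts(n_residues):
    """Number of (bond angles, torsion angles) along a C-alpha segment."""
    if n_residues < 4:
        raise ValueError("need at least four residues to define a torsion angle")
    return n_residues - 2, n_residues - 3

def independent_coordinates(n_residues):
    """3 coordinates per residue, minus 6 rigid-body degrees of freedom,
    minus one constraint for each fixed virtual bond length."""
    return 3 * n_residues - 6 - (n_residues - 1)
```

For six residues both counts give seven: four bond angles and three torsion angles, matching the seven parameters of a single kink.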
Clustering and percolation
We shall classify the loop structures in PDB in terms of the following clustering algorithm:

We define a cluster to be a set of loop fragments such that for each fragment in a given cluster there is at least one other fragment within a prescribed RMS cutoff distance.
Two clusters are disjoint, when the RMSD between any fragment in the first cluster and any fragment in the second cluster exceeds this prescribed RMS cutoff distance.

We define the initiator of a cluster to be an a priori random loop fragment which defines the cluster by completion, as follows: We start with the initiator. We identify all those fragments in our entire data set which deviate from the initiator by less than the given RMS cutoff distance. We continue the process by identifying all those fragments, that deviate from the fragments that we have identified in the previous step, by less than the RMS cutoff distance. We repeat the procedure until we find no additional fragments in PDB, within the RMS cutoff distance from any of those fragments that have been identified in the previous steps.
The cluster is clearly independent of its initiator; any element of the cluster could be used as the initiator. But the cluster depends on the RMS cutoff distance. Moreover, if the RMS cutoff distance is too large, no clear clustering is observed.
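The completion procedure amounts to growing a connected component under a distance threshold. A minimal sketch follows, in which `distance` stands for any pairwise measure supplied by the caller (the RMSD between superimposed fragments in the present setting); the function names and the one-dimensional toy data in the usage note are assumptions made for illustration.

```python
from collections import deque

def grow_cluster(initiator, fragments, distance, cutoff):
    """Indices of all fragments reachable from `initiator` through a chain
    of neighbours that are each closer than `cutoff` to the previous one."""
    cluster = {initiator}
    queue = deque([initiator])
    while queue:
        current = queue.popleft()
        for idx in range(len(fragments)):
            if idx not in cluster and distance(fragments[current], fragments[idx]) < cutoff:
                cluster.add(idx)
                queue.append(idx)
    return cluster

def all_clusters(fragments, distance, cutoff):
    """Partition the data set by repeatedly completing a cluster from an
    arbitrary unassigned initiator."""
    remaining = set(range(len(fragments)))
    clusters = []
    while remaining:
        seed = min(remaining)
        cluster = grow_cluster(seed, fragments, distance, cutoff)
        clusters.append(sorted(cluster))
        remaining -= cluster
    return clusters
```

With the toy values `[0.0, 0.15, 0.3, 1.0, 1.1]`, the absolute difference as distance and a cutoff of 0.2, starting from index 0 or index 2 gives the same cluster `{0, 1, 2}`, which illustrates the independence of the initiator. Raising the cutoff to 0.8 merges everything into one cluster, the toy analogue of a cutoff that is too large to reveal clustering.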
According to [49] for a PDB protein structure which is measured with resolution 2.0 Å or better, the characteristic values of the thermal B-factors are mostly less than around
From the Debye-Waller relation we then obtain the following estimate for the one standard deviation error in the atomic coordinates
\(\sqrt{\langle x^{2}\rangle} = \sqrt{B/(8\pi^{2})} \approx 0.65\) Å (13)
Thus, two loop fragments that have been measured with 2.0 Å resolution should (on average) be considered different only when their RMS distance exceeds 0.65 Å.
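A minimal sketch of the conversion between B-factors and fluctuation distances, assuming the standard isotropic Debye-Waller relation B = 8π²⟨x²⟩ with B in Å². The function names are illustrative.

```python
import math

def fluctuation_distance(b_factor):
    """One-standard-deviation positional fluctuation (in Å) for an
    isotropic B-factor given in Å^2."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))

def b_factor_for_distance(distance):
    """B-factor (in Å^2) that corresponds to a given fluctuation distance in Å."""
    return 8.0 * math.pi ** 2 * distance ** 2
```

Under this relation a fluctuation distance of 0.65 Å corresponds to a B-factor of roughly 33 Å², while the 0.2 Å cutoff adopted for the ultrahigh resolution structures corresponds to roughly 3.2 Å².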
The construction of PDB loop fragments in terms of the kink profile (10), (11) in those crystallographic protein structures which have been measured with resolution 2.0 Å or better has been addressed in [41]. There, it was found that over 92 percent of loops can be covered in a modular fashion by 200 explicit kink profiles (10), (11), with RMSD accuracy that matches (13), i.e. with less than 0.65 Å RMSD deviation from the crystallographic structure. Thus 0.65 Å RMS distance is the appropriate RMS cutoff value to search for the more refined clustering patterns in those crystallographic structures which have been measured with resolution 2.0 Å. However, we find that the value 0.65 Å is too large to observe clear clustering patterns. Accordingly, we shall search for clustering by considering only those PDB structures that have been determined with the ultrahigh resolution 1.0 Å or better. For these ultrahigh resolution structures, a precision better than the value (13) can be expected. To determine an appropriate value, we display in Fig. 5 the number of all Cα atoms in all currently available PDB structures that have been measured with resolution 1.0 Å or better, as a function of their Debye-Waller fluctuation distance. For most of the structures, the fluctuation distance is clearly below the upper bound (13); the maximum of the curve is located at around 0.3 Å. We also observe the (essential) absence of structures with a fluctuation distance less than 0.1 Å; historically this distance is considered as the boundary wavelength between X-rays and γ-rays.
Using a combination of Fig. 5 with various tests that we have performed, we have arrived at the conclusion that 0.2 Å in RMS distance can be currently adopted as a reasonable estimate for the minimal zero-point fluctuation distance in ultrahigh resolution structures, those that have been measured with better than 1.0 Å resolution. Thus we shall investigate to what extent loops in these protein structures can be classified in terms of elemental fragments, such that two fragments are considered different when their RMS distance exceeds 0.2 Å. According to Fig. 5, over 99 % of individual Cα carbons that have been measured with better than 1.0 Å resolution have a B-factor fluctuation distance which is larger than 0.2 Å; our choice of cutoff distance is close to the 3σ level.
We note that other cutoff values can be introduced; the ultimate appears to be 0.1 Å. But our qualitative conclusions are fairly independent of the value chosen, provided it is small enough to provide a clustering pattern. In this article our goal is to present a proof-of-concept. To our knowledge, no related analysis has been previously attempted, to systematically classify the loop structures in ultrahigh resolution crystallographic protein conformations, in a quantitative fashion using an energy function. In particular, no commonly accepted experimental standard exists that we could rely on to infer the “most preferred” cutoff value. We hope that such a value can be eventually inferred from careful experimental measurements. Thus, at the moment we have no criterion to prefer any other particular value; 0.2 Å, i.e. around 3σ, appears to be a reasonable choice at this point.
We start the identification of loop fragments using the set of 200 fragments constructed in [41]. But our results are independent of the starting point; quite similar results can be obtained using a fairly generic set of loop fragments as a starting point. We note that the fragments in [41] have between five and nine residues, and most of them (116 out of 200) have six residues. We have already argued that six is the optimal number of residues in a loop fragment, as it matches the number of independent parameters in the kink profile (10), (11). Thus, we shall consider only fragments that have six residues in the clustering algorithm. In this manner, we find that we can classify all PDB fragments into clusters, each determined by its initiator.
We have found that there are clusters that have a very large number of fragments. But we also find that there are clusters with only a single, or very few fragments. It is natural to expect that those clusters which are large, contain mostly fragments that are structurally important. On the other hand, those clusters which are small should include mainly fragments that are functionally important. Furthermore, we find several examples of amino acid sequences that are included in different clusters: The sequence does not define the structure, in a unique fashion. This leads us to address the concept of cluster percolation: Given the sequence of a loop fragment in a cluster, percolation means that there are other, possibly new clusters where the same sequence appears but with a different structure.
Results
Clustering
We have constructed our clusters by starting with the 200 loop fragments that were introduced in [41]. Around 92 % of all loops in those PDB structures that have been measured with resolution better than 2.0 Å, are within a 0.65 Å RMS distance from some of the 200 loop fragments. However, when we decrease the RMSD cutoff distance to 0.2 Å, which is the cutoff distance used in the present article, the coverage drops to below 2 % [41].
We remark that the authors of reference [41] did not investigate clustering, as the concept is defined here. In [41] all the RMS distances were evaluated from the fixed set of 200 loop fragments, and the coverage of PDB loop structures was determined in terms of these fixed loop fragments.
When we restrict to the subset of PDB structures in [41] that have been measured with better than 1.0 Å resolution, we find that a total of 102 out of the 200 fragments in [41] have been measured with this resolution. We use these 102 loop fragments as the initiators to start our clustering construction.
clusters
The 102 loop fragments in [41] that have been measured with better than 1.0 Å resolution have between five and nine residues. Here we have argued that a loop fragment modelled by (10), (11) has six residues. There are 70 such clusters among the 200, but only 14 of them contain more than 30 fragments. Moreover, two of these merge together into an α-helical structure when we subject them to our clustering algorithm; we call them bends instead of kinks. The remaining 12 loop fragments determine clusters which cover around 38 % of the 1.0 Å protein loop structures, when we use our 0.2 Å RMSD cutoff. These loop fragments are our final initiators. In Table 1 we list the PDB entry codes and residue numbers of these initiators.
We proceed to describe some of the major features of the ensuing 12 clusters. Additional details, including a breakdown according to amino acid constituents in each cluster, are presented in Figure S2 of Additional file 1.
Figures 6 and 7 show the (κ,τ) distribution in each of the 12 clusters on the stereographically projected two-sphere of Fig. 4; note that the definition of a bond angle takes three residues while the definition of a torsion angle takes four. Thus for a six-residue loop fragment there are three (κ,τ) pairs. The fourth κ value could be used to refine the loop classification, but here this possibility is not considered.
In Figs. 8 and 9 we show the three-dimensional pictures of the initiators of the twelve clusters.
A detailed inspection reveals that except for IV, all the initiators have the canonical structure of a single kink, in terms of the folding index (8). Moreover, the initiator I is part of a short loop connecting an α-helix and a β-strand. However, the bond and torsion angle spectrum which we display in Fig. 10 a shows that this loop is actually a pair of two kinks which are very close to each other, and the initiator I is the second kink along the backbone.
On the other hand, a comparison with (8) suggests that the initiator IV exhibits a somewhat small variation in the values of the torsion angles, for a kink. This can be seen in Fig. 6. The torsion angle values suggest that the initiator IV resembles more a bent α-helix than a kink. In Fig. 10 b, c we show the spectrum of the bond and torsion angles of the initiator IV, both before and after we have implemented the \(\mathbb Z_{2}\) gauge transformation. Since this bent structure determines an isolated cluster according to our 0.2 Å cutoff criterion, it is included among our loop fragments.
In Figs. 11 and 12 we show the three-dimensional figures of each of the twelve clusters, including all the entries.
Finally, we have also investigated how the coverage of the 12 clusters increases, when we increase the cutoff distance. The results are shown in Table 2.
Cluster elongation and completion
In addition to the 12 initiators listed in Table 1, among the 102 loop fragments of [41] that we have considered, there is also one initiator that has only five residues. The PDB code is 1p1x_A (80–84). The ensuing cluster with five-residue-long elements is very large: There are a total of 42618 entries. The reason for the occurrence of such a large cluster is that the RMSD clustering criterion 0.2 Å is too large for revealing clustering patterns in five-residue-long loop segments: The five-residue-long loop fragment covers all the five-residue-long loops, within the chosen cutoff criterion. In Fig. 13 we show the distribution of (κ,τ) values in this cluster.
There is also an overlap with each of the 12 clusters that we obtained previously. Together the 13 clusters cover around 96.1 % of all PDB loop structures.
It is apparent that an initiator with only five residues is too short to identify a clustering pattern of PDB loops, even with 0.2 Å precision. Consequently we have elongated this initiator. For this, we have systematically added residues at the beginning and at the end of the individual elements in its cluster, to search for clustering patterns. For example, we may take the element 1p1x_A (80–84), elongate it to 1p1x_A (80–85) and 1p1x_A (79–84), and then use these two elongated fragments as initiators for the clustering. We denote by H, S and L a residue which is located in a helix, strand and loop respectively, according to the PDB classification. The five-residue-long cluster which is generated by 1p1x_A (80–84) consists of several different elements, such as for example LLLLL, HLLLL, LLLLS etc.
As an example, we have selected the pattern LLLLL which has the largest number of elements; there are a total of 7901 elements. We have elongated each of these 7901 elements into a protein loop fragment with six residues, by incorporating the corresponding PDB residue which is either right before the first L residue, or immediately after the last L residue. In this manner we find 15802 different loop fragments with six residues each. We have investigated the corresponding clustering patterns: There are 30 new clusters with more than 30 elements, bringing the total number of the clusters with more than 30 elements to 42. We list these 30 additional clusters in Table 3. In Figs. 14, 15 and 16 we display the (κ,τ) distributions of these 30 clusters. A visual inspection of these clusters reveals that at the level of the (κ,τ) distribution the cluster 26 appears to display additional subclustering. But the present cutoff value 0.2 Å is not sufficiently refined to detect this subclustering, at the level of RMS distance. Furthermore, the clusters 29 and 30 both appear to merge with the regular β-strand. In Fig. 17 we show the corresponding initiators: The cluster 29 is clearly a loop, while the cluster 30 consists of the regular β-strand and thus we exclude it from our set of loop fragments. This leaves us with a total of 41 clusters, with 30 or more loop fragments. These clusters cover around 52 % of all loop structures in PDB.
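The elongation step can be sketched as a small helper that extends a residue window by one position at either end. The function name and the chain bounds used in the usage note are hypothetical; only the window 80–84 of 1p1x_A and the counts 7901 and 15802 come from the text.

```python
def elongate(start, end, chain_start, chain_end):
    """Return the windows obtained by adding one residue before `start`
    or one residue after `end`, as long as they stay inside the chain."""
    windows = []
    if start - 1 >= chain_start:
        windows.append((start - 1, end))
    if end + 1 <= chain_end:
        windows.append((start, end + 1))
    return windows

def count_elongated(n_elements, per_element=2):
    """Number of elongated fragments when every element can be extended
    at both ends."""
    return n_elements * per_element
```

For the window (80, 84) in a chain assumed to span residues 1 to 250, `elongate` returns `[(79, 84), (80, 85)]`, and doubling the 7901 LLLLL elements reproduces the 15802 six-residue fragments quoted above.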
By completing the elongation process we have identified 3240 different clusters with the 0.2 Å cutoff. These clusters cover around 85 % of all those PDB loop sites in our set of proteins with resolution better than 1.0 Å. Among these clusters there are 1677 unique ones, in the sense that the cluster has only a single element. Thus, around 14 % of all loop structures in PDB appear to be unique to the given protein. In addition, there are 1531 rare clusters with two or more, but fewer than 32 elements. Thus, there are 32 clusters with 32 or more elements.
The remaining ∼15 % of loop fragments that are not covered by the 3240 clusters can be constructed by completion. For example, we can search for novel clusters by using the patterns other than LLLLL in the five residue cluster generated by 1p1x_A (80–84). But when the four patterns HLLLL, LLLLH, SLLLL and LLLLS are included, the coverage increases by no more than around one per cent.
Cluster percolation
We have also investigated the relation between the sequence and the structure, using the 42 clusters listed in Tables 1 and 3. Here we only describe some of the major features; more details can be found in Figure S3 in Additional file 1.
There are several examples of identical sequences that correspond to different structures in different proteins. Accordingly a sequence clearly does not determine a unique structure. When a given sequence gives rise to multiple structures, we have a phenomenon we call cluster percolation. These sequences with multiplet structures may be utilised to try and introduce novel clusters.
For example, Table 4 lists those sequences that are found both in Cluster VIII and outside of it, together with their PDB identifications and RMSD to the initiator of Cluster VIII.
As an example, in Fig. 18 a we compare the four PDB structures that have the identical sequence SDGNGM in Table 4. The difference between the two mutually similar structures 2vb1 A (100–105) and 4lzt A (100–105) and the two equally mutually similar structures 1iee A (100–105) and 4b4e A (100–105) is visually apparent. A visual comparison with the Cluster VIII in Fig. 12 also reveals that both 1iee A (100–105) and 4b4e A (100–105) are clearly outside of this cluster.
Figure 18 b shows the comparison of the sequence ADGKPV to the initiator. The difference between the structures of 4hen A (54–59) and the initiator is again clear. The structure of 4hen A (54–59) is also quite different from the structures in Fig. 18 a, and from the Cluster VIII shown in Fig. 12.
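Detecting percolation amounts to grouping fragments by sequence and keeping the sequences that occur in more than one cluster. A minimal sketch follows; the function name and the record layout are assumptions, the cluster labels in the usage note reflect our reading of the SDGNGM example above, and any additional record would be hypothetical.

```python
from collections import defaultdict

def percolating_sequences(assignments):
    """Given (sequence, fragment_id, cluster_label) records, return the
    sequences that appear in more than one cluster, with their fragments
    grouped by cluster."""
    by_sequence = defaultdict(lambda: defaultdict(list))
    for sequence, fragment_id, cluster in assignments:
        by_sequence[sequence][cluster].append(fragment_id)
    return {sequence: dict(clusters)
            for sequence, clusters in by_sequence.items()
            if len(clusters) > 1}
```

Feeding in the four SDGNGM fragments, with 2vb1 and 4lzt labelled as Cluster VIII and 1iee and 4b4e labelled as lying outside it, returns SDGNGM as a percolating sequence with two fragments under each label.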
In Table S1 of Additional file 1 we list those sequences that appear both in the 12 clusters of Table 1 and in protein structures which are not contained in any of the clusters. We have investigated these structures, and found 454 new clusters. But most of them have very few elements; only two of them have more than 30 elements. With these new clusters the coverage increases to 88 %. In Fig. 19 we show the (κ,τ) distributions on the stereographically projected two-sphere of the two clusters with more than 30 elements; the initiators are 1ix9_A (133–138) and 3aj4_B (73–78), respectively. These two clusters are found by considering the sequences LKGDKL in cluster III and KDCMLQ in cluster XI, respectively.
Example: Myoglobin
Myoglobin is a widely studied protein, thus we have analysed its loop structure from the present perspective. We have chosen the crystallographic oxymyoglobin structure 1A6M [50] which is one of the few myoglobin structures that have been measured with resolution better than 1.0 Å, for our comparative study.
We have located in 1A6M four putative kink segments with six residues each, that are either unique or very rare in PDB, with our 0.2 Å RMSD cutoff. These kinks are located between helices C and D, and between helices E and F. The two putative kinks between helices C and D correspond to the residue sites (41–46) and (48–53). The two putative kinks between helices E and F correspond to residue sites (77–82), and the practically overlapping (78–83). In Fig. 20 we show how in our PDB set, the number of matches for each of these four kinks depends on the RMS cutoff distance.
1A6M is closely related to the PDB entries 1A6G, 1A6K and 1A6N; they represent four different ligation states of the same protein. Each of the three structures 1A6G, 1A6K and 1A6N has been measured with resolution above 1.0 Å, thus they do not appear in our data set. In Table 5 the RMS distances of the four rare kinks of 1A6M are compared to the corresponding kinks in 1A6G, 1A6K and 1A6N. All the RMSD values are below the cutoff 0.2 Å.
We conclude that the four kinks are stable, in the sense that they do not change their conformation when the ligation state changes.
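The two comparisons used in this example, counting matches as a function of the RMS cutoff and comparing a kink with its counterpart in a related structure, both reduce to an RMSD between two six-residue Cα segments after optimal superposition. The following is a minimal sketch of such a calculation. It assumes the standard Kabsch superposition, which may differ in detail from the procedure actually used here, and the function names and the sliding-window search are illustrative choices of ours.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between two (n, 3) coordinate sets after optimal superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p - p.mean(axis=0)          # remove the translation
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)
    # Force a proper rotation (det = +1) so that mirror images are not superposed.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = p @ rot.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))

def count_matches(query, chain, cutoffs, window=6):
    """Count the windows of `chain` that lie within each RMSD cutoff of `query`."""
    chain = np.asarray(chain, dtype=float)
    rmsds = np.array([
        kabsch_rmsd(query, chain[i:i + window])
        for i in range(len(chain) - window + 1)
    ])
    return {c: int((rmsds <= c).sum()) for c in cutoffs}
```

Calling `count_matches` with a list of increasing cutoffs on every chain of a data set gives a curve of the kind shown in Fig. 20, while a single call to `kabsch_rmsd` on two corresponding segments gives an entry of the kind listed in Table 5.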
Chain inversion
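Finally, the operation of local chain inversion along a protein segment is defined as a mapping that sends a sequence with Cα coordinates
$(\mathbf{r}_i, \mathbf{r}_{i+1}, \ldots, \mathbf{r}_{i+k})$
into a sequence with Cα coordinates
$(\mathbf{r}_{i+k}, \mathbf{r}_{i+k-1}, \ldots, \mathbf{r}_i)$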
We note that a regular secondary structure such as an α-helix is mapped onto itself, i.e. it remains invariant under chain inversion. However, we have found that the 12 clusters that we have constructed are not inversion invariant; the inversion does not map a cluster onto itself. Thus one might expect that new clusters could be found by inverting these clusters. Surprisingly, we have found only a single example of a PDB segment that is obtained by inversion: the segment (1115–1120) in the PDB structure 1MC2. Local chain inversion is thus apparently a broken symmetry in the case of protein loops. This sets the loops apart from regular structures such as α-helices and β-strands.
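The invariance of a regular helix under chain inversion can be checked numerically with a short sketch. It is an illustration under stated assumptions: inversion is taken to mean reversing the order of the Cα coordinates, the virtual bond lengths are assumed fixed so that the bond angles κ and torsion angles τ determine the shape, and τ is computed as a standard signed dihedral angle, whose convention may differ from the discrete Frenet frame convention used for the (κ,τ) maps. The helix parameters (radius 2.3 Å, 100° turn and 1.5 Å rise per residue) describe an idealised α-helix, and all function names are hypothetical.

```python
import numpy as np

def bond_and_torsion_angles(r):
    """Bond angles (kappa) and torsion angles (tau), in radians, of a C-alpha trace."""
    r = np.asarray(r, dtype=float)
    t = np.diff(r, axis=0)
    t /= np.linalg.norm(t, axis=1, keepdims=True)      # unit tangent vectors
    kappa = np.arccos(np.clip((t[:-1] * t[1:]).sum(axis=1), -1.0, 1.0))
    b = np.cross(t[:-1], t[1:])                        # binormal directions
    # Signed dihedral between consecutive binormals about the middle tangent.
    y = (np.cross(b[:-1], b[1:]) * t[1:-1]).sum(axis=1)
    x = (b[:-1] * b[1:]).sum(axis=1)
    return kappa, np.arctan2(y, x)

def inversion_deviation(r):
    """Largest change in (kappa, tau), in radians, when the chain order is reversed."""
    kappa, tau = bond_and_torsion_angles(r)
    kappa_inv, tau_inv = bond_and_torsion_angles(np.asarray(r, dtype=float)[::-1])
    # Wrap torsion differences into (-pi, pi] before comparing.
    dtau = np.angle(np.exp(1j * (tau - tau_inv)))
    return float(max(np.abs(kappa - kappa_inv).max(), np.abs(dtau).max()))

def ideal_helix(n, radius=2.3, turn_deg=100.0, rise=1.5):
    """C-alpha positions of an idealised alpha-helix (assumed geometry)."""
    i = np.arange(n)
    phi = np.deg2rad(turn_deg) * i
    return np.column_stack([radius * np.cos(phi), radius * np.sin(phi), rise * i])

# For a regular helix the angle profile is constant, so the deviation vanishes
# up to rounding; a generic loop segment gives a clearly non-zero value.
helix_deviation = inversion_deviation(ideal_helix(8))
```

Since both κ and τ are unchanged when the order of the defining points is reversed, a segment is inversion invariant in this description exactly when its (κ,τ) profile reads the same from both ends.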
Discussion
We have introduced the concept of loop clustering to analyse those ultrahigh resolution crystallographic protein structures in PDB that have been measured with a resolution of 1.0 Å or better. We have chosen these structures since we expect that, in the case of an ultrahigh resolution measurement, there should be less need to introduce structure validation. Thus there should also be less bias towards a priori chemical knowledge and stereochemical paradigms in this subset of PDB proteins. Moreover, our investigation of the 2.0 Å subset shows that high resolution is necessary to reveal the clustering structure in the case of protein crystals.
We have inquired to what extent the protein structures can be constructed in a modular fashion. For the modular building blocks we have chosen different parameterisations of the unique kink solution to a generalised discrete nonlinear Schrödinger equation. The precision we have used as a criterion to distinguish between two structures is 0.2 Å in RMSD. We have concluded that this should be the shortest meaningful RMS distance that can be introduced, at the moment, to classify different modular protein components.
We have identified a set of 12 different kink parameterisations, which cover around 38 % of all PDB loop structures. Accordingly, these are loop patterns that are abundantly present in folded proteins. It appears to us that these kinks are often located in protein segments that are structurally important, as opposed to those that are functionally important. We have introduced various techniques to extend the initial set of 12 kinks, and we have found that around 52 % of loop regions become covered when we introduce a set of 29 additional kinks. But in order to cover the remaining ∼48 % of protein loops, we need to substantially increase the number of kinks. For example, we need to introduce over 1000 kinks to cover over 88 % of loops. In particular, we have concluded that there are several kinks that are very rare, even unique, in PDB when we use the present cutoff value. We propose that a rare or even unique kink should have an important functional role in a protein. This can be exemplified by the myoglobin 1A6M segments (41–46), (48–53) and (78–83), which are all rare. These segments also constitute the CD corner and EF corner in myoglobin, which have been argued to be closely related to the ligand migration process [51, 52].
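The coverage figures quoted above are cumulative fractions of loop segments that fall into the largest clusters. A minimal sketch of this kind of bookkeeping is given below. It assumes that the clusters are disjoint, so that each segment is counted once, and it uses made-up toy cluster sizes rather than the actual PDB statistics; the function names are hypothetical.

```python
def cumulative_coverage(cluster_sizes, total_segments):
    """Fraction of segments covered by the k largest clusters, for k = 1, 2, ..."""
    covered = 0
    curve = []
    for size in sorted(cluster_sizes, reverse=True):
        covered += size
        curve.append(covered / total_segments)
    return curve

def clusters_needed(cluster_sizes, total_segments, target):
    """Smallest number of clusters that reaches `target` coverage, or None."""
    curve = cumulative_coverage(cluster_sizes, total_segments)
    for k, fraction in enumerate(curve, start=1):
        if fraction >= target:
            return k
    return None

# Toy example (assumed numbers): a few large clusters and a tail of rare ones.
toy_sizes = [5, 40, 1, 25, 10, 1, 1]
toy_curve = cumulative_coverage(toy_sizes, total_segments=100)
```

In such a curve a handful of large clusters accounts for most of the coverage, while each additional rare cluster adds very little, which is the qualitative pattern described in the text.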
Conclusions
Protein loops are built in a modular fashion, in terms of various parameterisations of the kink solution to a generalised version of the discrete nonlinear Schrödinger equation. Most loops can be built from a very small number of modular components; these loops are most likely important for the overall structure of the protein. However, there are also several unique or very rare loops, which are most likely related to the function. The amino acid sequence does not define the structure uniquely; instead, a given sequence can give rise to several different conformations.
Availability of supporting data
The datasets supporting the results of this article are available in the Protein Data Bank (PDB, http://www.rcsb.org), by restricting the search to structures with resolution better than 1.0 Å.
Abbreviations
DNLS: Discrete nonlinear Schrödinger
PDB: Protein Data Bank
RMS: Root-mean-square
CASP: Critical Assessment for Structural Prediction
References
 1
Sillitoe I, Cuff A, Dessailly B, Dawson N, Furnham N, Lee D, et al. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res. 2013; 41(Database issue):D490.
 2
Sillitoe I, Lewis TE, Cuff A, Das S, Ashford P, Dawson NL, et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 2015; 43(D1):D376–81.
 3
Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995; 247:536–40.
 4
Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008; 36(suppl 1):D419–25.
 5
Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 2014; 42(D1):D310–4.
 6
Rackovsky S. Quantitative organization of the known protein X-ray structures. I. Methods and short-length-scale results. Proteins. 1990; 7:378–402.
 7
Skolnick J, Arakaki AK, Seung YL, Brylinski M. The continuity of protein structure space is an intrinsic property of proteins. Proc Natl Acad Sci USA. 2009; 106:15690–5.
 8
Schwede T, Kopp J, Guex N, Peitsch MC. SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res. 2003; 31(13):3381–5.
 9
Chivian D, Baker D. Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection. Nucleic Acids Res. 2006; 34(17):e112.
 10
Song Y, DiMaio F, Wang RYR, Kim D, Miles C, Brunette T, et al. High-resolution comparative modeling with RosettaCM. Structure. 2013; 21(10):1735–42.
 11
Zhang Y. Protein structure prediction: when is it useful? Curr Opin Struc Biol. 2009; 19(2):145–55.
 12
Roy A, Kucukural A, Zhang Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc. 2010; 5(4):725–38.
 13
Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struc Biol. 2005; 15(3):285–9.
 14
Olson MA, Feig M, Brooks CL. Prediction of protein loop conformations using multiscale modeling methods with physical energy scoring functions. J Comput Chem. 2008; 29(5):820–31.
 15
Jamroz M, Kolinski A. Modeling of loops in proteins: a multi-method approach. BMC Struct Biol. 2010; 10(1):5.
 16
Fidelis K, Stern PS, Bacon D, Moult J. Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng. 1994; 7(8):953–60.
 17
van Vlijmen HW, Karplus M. PDB-based protein loop prediction: parameters for selection and methods for optimization. J Mol Biol. 1997; 267(4):975–1001.
 18
Nekouzadeh A, Rudy Y. Three-residue loop closure in proteins: A new kinematic method reveals a locus of connected loop conformations. J Comput Chem. 2011; 32(12):2515–25.
 19
Fiser A, Do RKG, Šali A. Modeling of loops in protein structures. Protein Sci. 2000; 9(9):1753–73.
 20
Jacobson MP, Pincus DL, Rapp CS, Day TJ, Honig B, Shaw DE, et al. A hierarchical approach to all-atom protein loop prediction. Proteins. 2004; 55(2):351–67.
 21
Eswar N, Eramian D, Webb B, Shen MY, Sali A. Protein structure modeling with MODELLER. In: Structural Proteomics. New York: Springer; 2008, pp. 145–159.
 22
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000; 28:235–42.
 23
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22(12):2577–637.
 24
Niemi AJ. Phases of bosonic strings and two dimensional gauge theories. Phys Rev D. 2003; 67:106004.
 25
Danielsson UH, Lundgren M, Niemi AJ. Gauge field theory of chirally folded homopolymers with applications to folded proteins. Phys Rev E. 2010; 82:021910.
 26
Hu S, Jiang Y, Niemi AJ. Energy functions for string-like continuous curves, discrete chains, and space-filling one dimensional structures. Phys Rev D. 2013; 87:105011.
 27
Ioannidou T, Jiang Y, Niemi AJ. Spinors, strings, integrable models, and decomposed Yang-Mills theory. Phys Rev D. 2014; 90(2):025012.
 28
Niemi AJ. Gauge fields, strings, solitons, anomalies, and the speed of life. Theor Math Phys. 2014; 181(1):1235–62.
 29
Niemi AJ. WHAT IS LIFE - Sub-cellular Physics of Live Matter. 2014. arXiv preprint arXiv:1412.8321.
 30
Widom B. Surface Tension and Molecular Correlations near the Critical Point. J Chem Phys. 1965; 43:3892–7.
 31
Kadanoff LP. Scaling laws for Ising models near T(c). Physics. 1966; 2:263–72.
 32
Wilson KG. Renormalization Group and Critical Phenomena. I. Renormalization Group and the Kadanoff Scaling Picture. Phys Rev B. 1971; 4:3174–83.
 33
Wilson KG, Kogut J. The renormalization group and the ε expansion. Phys Rep. 1974; 12(2):75–199.
 34
Fisher ME. The renormalization group in the theory of critical behavior. Rev Mod Phys. 1974; 46:597–616.
 35
De Gennes PG. Scaling concepts in polymer physics. New York: Cornell University Press; 1979.
 36
Schafer L. Excluded volume effects in polymer solutions, as explained by the renormalization group. Berlin: Springer; 1999.
 37
Chernodub M, Hu S, Niemi AJ. Topological solitons and folded proteins. Phys Rev E. 2010; 82(1):011916.
 38
Molkenthin N, Hu S, Niemi AJ. Discrete Nonlinear Schrödinger Equation and Polygonal Solitons with Applications to Collapsed Proteins. Phys Rev Lett. 2011; 106:078102.
 39
Faddeev LD, Takhtadzhyan LA. Hamiltonian Methods in the Theory of Solitons. Berlin: Springer; 1987.
 40
Ablowitz MJ, Prinari B, Trubatch AD. Discrete and continuous nonlinear Schrödinger systems. Vol. 302. London: Cambridge University Press; 2004.
 41
Krokhotin A, Niemi AJ, Peng X. Soliton concepts and protein structure. Phys Rev E. 2012; 85(3):031906.
 42
Hu S, Lundgren M, Niemi AJ. Discrete Frenet frame, inflection point solitons, and curve visualization with applications to folded proteins. Phys Rev E. 2011; 83:061908.
 43
Lundgren M, Niemi AJ, Sha F. Protein loops, solitons, and side-chain visualization with applications to the left-handed helix region. Phys Rev E. 2012; 85:061909.
 44
Lundgren M, Niemi AJ. Correlation between protein secondary structure, backbone bond angles, and side-chain orientations. Phys Rev E. 2012; 86(2):021904.
 45
Peng X, Chenani A, Hu S, Zhou Y, Niemi AJ. A three dimensional visualisation approach to protein heavy-atom structure reconstruction. BMC Struct Biol. 2014; 14(1):27.
 46
Hinsen K, Hu S, Kneller GR, Niemi AJ. A comparison of reduced coordinate sets for describing protein structure. J Chem Phys. 2013; 139:124115.
 47
Lundgren M, Krokhotin A, Niemi AJ. Topology and structural self-organization in folded proteins. Phys Rev E. 2013; 88(4):042709.
 48
Hu S, Krokhotin A, Niemi AJ, Peng X. Towards quantitative classification of folded proteins in terms of elementary functions. Phys Rev E. 2011; 83(4):041907.
 49
Petsko GA, Ringe D. Fluctuations in protein structure from X-ray diffraction. Ann Rev Biophys Bioeng. 1984; 13:331–71.
 50
Vojtěchovský J, Chu K, Berendzen J, Sweet RM, Schlichting I. Crystal structures of myoglobin-ligand complexes at near-atomic resolution. Biophys J. 1999; 77(4):2153–74.
 51
Lucas MF, Guallar V. An atomistic view on human hemoglobin carbon monoxide migration processes. Biophys J. 2012; 102(4):887–96.
 52
Cottone G, Lattanzi G, Ciccotti G, Elber R. Multiphoton Absorption of Myoglobin–Nitric Oxide Complex: Relaxation by DNEMD of a Stationary State. J Phys Chem B. 2012; 116(10):3397–410.
Acknowledgements
AJN acknowledges support from Vetenskapsrådet, Carl Trygger’s Stiftelse för vetenskaplig forskning, and Qian Ren Grant at BIT.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
XP and AN conceived and designed the study. XP, JH and AN developed the analysis method. XP performed the analysis. XP, JH and AN interpreted the results. XP, JH and AN wrote the article. All authors have read and approved the final manuscript.
Additional file
Additional file 1
Description of Supplemental Material. Figure S1. The stereographic distribution map of Cα atoms in the PDB subset with resolution better than 1.0 Å, which is the same as that of resolution better than 2.0 Å (see Fig. 4). Figures S2 and S3. The distributions of the amino acids on each site of the six-site-long segments of the clusters listed in Tables 1 and 3. Table S1. Sequences that appear both in the 12 clusters and in protein structures which are not contained in the clusters before percolation. (PDF 1178 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Keywords
 Loop modeling
 Protein backbone
 Cα trace problem