Clustering and percolation in protein loop structures

Background High precision protein loop modelling remains a challenge, both in template based and template independent approaches to protein structure prediction. Method We introduce the concepts of protein loop clustering and percolation, to develop a quantitative approach to systematically classify the modular building blocks of loops in crystallographic folded proteins. These fragments are all different parameterisations of a unique kink solution to a generalised discrete nonlinear Schrödinger (DNLS) equation. Accordingly, the fragments are also local energy minima of the ensuing energy function. Results We show how the loop fragments cover practically all ultrahigh resolution crystallographic protein structures in Protein Data Bank (PDB), with a 0.2 Ångström root-mean-square (RMS) precision. We find that no more than 12 different loop fragments are needed, to describe around 38 % of ultrahigh resolution loops in PDB. But there is also a large number of loop fragments that are either unique, or very rare, and examples of unique fragments are found even in the structure of a myoglobin. Conclusions Protein loops are built in a modular fashion. The loops are composed of fragments that can be modelled by the kink of the DNLS equation. The majority of loop fragments are also common, which are shared by many proteins. These common fragments are probably important for supporting the overall protein conformation. But there are also several fragments that are either unique to a given protein, or very rare. Such fragments are probably related to the function of the protein. Furthermore, we have found that the amino acid sequence does not determine the structure in a unique fashion. There are many examples of loop fragments with an identical amino acid sequence, but with a very different structure. Electronic supplementary material The online version of this article (doi:10.1186/s12900-015-0049-x) contains supplementary material, which is available to authorized users.


Background
Protein taxonomy [1][2][3][4][5] reveals that crystallographic protein structures have surprisingly little conformational diversity. It might be that the majority of different conformations have already been found [6,7]. This apparent convergence in protein structure provides the rationale for the development of comparative modelling or threading techniques [8][9][10][11][12]. These approaches try to predict the tertiary structure of a folded protein using libraries of known protein structures as templates. According to the community-wide Critical Assessment for Structural Prediction (CASP) tests [13], at the moment this kind of *Correspondence: hjf@bit.edu.cn 2 School of Physics, Beijing Institute of Technology, 100081 Beijing, People's Republic of China Full list of author information is available at the end of the article methods have the best predictive power to determine a folded conformation.
In the loop regions, comparative modelling approaches still continue lacking in their precision [14,15]. It is not uncommon that there are gaps in the loop regions that need to be filled by various insertion techniques. The success in loop modelling is also often limited to supersecondary structures where α-helices and β-strands are connected to each other by relatively short twists and turns [16,17]. In the case of a very short loop, with no more than three residues, the shape can be determined by a combination of geometrical considerations and stereochemical constraints [18]. In the case of longer loops, both template based and template independent methods are being developed to predict their shapes [19][20][21]. The underlying assumption is that the number of loop conformations which can be accommodated by a given sequence should be limited. The different fragments which are already available in the Protein Data Bank (PDB) [22] database could then be used like Lego bricks, as structural building blocks in constructing the loops. A given amino acid sequence is simply divided into short fragments, and the shape of the ensuing loop is deduced using homologically related fragments that have known structures. The entire protein is then assembled by joining these fragments together. For the process of joining the fragments, both all-atom energy functions and comparisons with closely homologous template structures in the Protein Data Bank can be utilised [8,9,12,14].
In the present article we propose a new systematic, purely quantitative method to identify and classify the modular building blocks of PDB loops; we identify a loop following the DSSP [23] convention. Our approach is based on a first-principles energy function [24][25][26][27][28][29]. It is built on the concept of universality [30][31][32][33][34][35][36] to model the fragments of even long protein loops in terms of different parameterisations of a unique kink that solves a variant [37,38] of the discrete nonlinear Schrödinger (DNLS) equation [39,40]. Our starting point is the observation made in [41] that over 92 % of loops in those PDB structures that have been measured with better than 2.0 Å resolution, can be composed from 200 different parameterisations of the kink profile, with better than 0.65 Ångström RMSD (root-mean-square-distance) accuracy. Here we refine this observation, with the aim to develop it into a systematic loop fragment classification scheme. For this we consider only those ultrahigh precision PDB structures that have been measured with better than 1.0 Å resolution. This ensures that the B-factors in the loop regions are small, and in particular that the structures have not been subjected to extensive refinement procedures. Indeed, two loop fragments should be considered different only, when the average interatomic distance is larger than the average Debye-Waller B-factor fluctuation distance. If the B-factors are large, any systematic attempt to identify and/or distinguish two fragments becomes ambiguous. In the case of these intra-high resolution structures we can aim for the RMSD precision of 0.2 Å. We estimate this to be the extent of zero point fluctuations i.e. a distance around 0.2 Å corresponds to the intrinsic uncertainty in the determination of heavy atom positions along the protein backbone. Thus any difference less than 0.2 Å between average atomic coordinates is essentially undetectable. By explicit constructions, we show how in the case of this subset of ultrahigh resolution PDB protein structures, the loops can be systematically modeled using combinations of the unique kink of the generalised DNLS equation. As such, our approach provides a foundation for a general approach to classify loops in high precision crystallographic PDB structures, in terms of an energy function based first-principles mathematical concept.

Cα based Frenet frames
Let r i (i = 1, . . . , N) be the coordinates of the protein backbone α-carbon (Cα) atoms. The indexing starts from the N terminus. At each r i we introduce the discrete Frenet frame (t i , n i , b i ) shown in Fig. 1 following the method in reference [42].
From the Frenet frames, we define the virtual Cα backbone bond (κ) and torsion (τ ) angles shown in Fig. 2 as follows, We identify the bond angle κ ∈ [0, π] with the latitude angle of a two-sphere which is centered at the Cα carbon; the tangent vector t points towards the north-pole where κ = 0. The torsion angle τ ∈ [−π, π) is the longitudinal angle on the sphere. We have τ = 0 on the great circle that passes both through the north pole and through the tip of the normal vector n, and the longitude increases in the counterclockwise direction around the tangent vector. We stereographically project the sphere onto the complex (x, y) plane from the south-pole as shown in Fig. 3; the north-pole where κ = 0 becomes mapped to the origin (x, y)=(0,0) while the south-pole κ = π is sent to infinity. We often find it convenient to extend the range of the latitude κ from positive to arbitrary real values. We compensate for this double covering of the   1) and (2) are noted in the figure sphere by introducing the following discrete Z 2 gauge transformation This transformation has no effect on the backbone coordinates r i , and it leaves the Cα backbone intact.

The Cα trace visualization, loops and kinks The Cα map
We visualise the backbone Cα trace of a given protein in terms of a trajectory on the stereographically projected two-sphere, as follows [43][44][45]: At the location of each Cα we introduce the corresponding discrete Frenet frames (t i , n i , b i ). The base of the i th tangent vector t i is located at the position r i of the i th Cα carbon, it coincides with the centre of the two-sphere and the vector t i points towards the north-pole. We translate the sphere from the location of the i th Cα to the location of the (i + 1) th Cα, without introducing any rotation of the sphere with respect to the i th Frenet frames. We identify the direction of t i+1 , i.e. the direction towards the Cα carbon at site r i+2 from the site r i+1 , on the surface of the sphere in terms of the ensuing spherical coordinates (κ i , τ i ). We repeat the procedure for all the backbones in PDB. To enhance statistics, for visualisation purposes we use here those protein structures that have been measured with better than 2.0 Å resolution, which gives us the map shown in Fig. 4a; see also Figure S1 in Additional file 1. The color intensity correlates directly with the statistical distribution of the (κ i , τ i ): red is large, blue is small and white is none. The map describes the direction of the Cα carbon at r i+2 as it is seen at the vertex r i+1 , in terms of the Frenet frames at r i .
Note how the statistical distribution in Fig. 4 concentrates within an annulus, roughly between the latitude angle values (in radians) κ ∼ 1 and κ ∼ π/2. The exterior of the annulus is a sterically excluded region. The entire interior is in principle sterically allowed, but it is very rarely occupied in the case of folded proteins. The four major secondary structure regions, α-helices, β-strands, left-handed α-helices and loops, are identified according to their PDB classification. For example, (κ, τ ) values (in radians) for which describes a right-handed α-helix, and values for which describes a β-strand. We note that the Fig. 4a is akin the Newman projection of stereochemistry: The vector t i which is denoted by the red dot at the center of the figure, points along the backbone from the proximal Cα at r i towards the distal Cα at r i+1 , and the colour intensity displays the statistical distribution of the r i+2 direction. We also note that the Fig. 4 provides non-local information on the backbone geometry; the information content extends over several peptide units. This is unlike the Ramachandran map, which can only provide localised information in the immediate vicinity of a single Cα carbon. As shown in [46], the Cα backbone bond and torsion angles (κ i , τ i ) are sufficient to reconstruct the entire backbone, while the Ramachandran angles are not. In Fig. 4b we visualise as an example a path made by a generic protein loop that connects two right-handed α-helical structures. A notable property of the trajectory drawn in Fig. 4b is that it encircles the north-pole of the two-sphere. It turns out that this kind of encircling is quite generic for loops, even entire folded proteins [47]. Consequently, we assign to each loop a winding number which we term folding index that we denote Ind f [47] and define as follows, Here [ x] denotes the integer part of x, and is the total rotation angle (in radians) that the projections of the Cα atoms of the consecutive loop residues make around the north pole. The folding index is a positive integer when the rotation is counterclockwise, and a negative integer when the rotation is clockwise. The folding index can be used to detect and classify loop structures and entire folded proteins, in terms of its values. The value is equal to twice the number of times the ensuing pathway encircles the northpole in the map of Fig. 4; for the trajectory shown in Fig. 4b the folding index is +2.

The discrete nonlinear Schrödinger equation
The virtual bond length between two neighboring Cα atoms is essentially constant, with the value 3.8 Å. Accordingly the Helmholtz free energy for the Cα trace backbone can be expressed in terms of the virtual bond angles κ i and dihedral angles τ i only. To the leading order in the infrared limit the result coincides with This is essentially the Hamiltonian of the discrete nonlinear Schrödinger equation [39,40]; for a detailed derivation we refer to [24][25][26][27][28][29]. Remarkably, the free energy (9) supports a kink (topological soliton) as a classical solution [37,38]. An excellent approximation of a kink can be obtained by naively discretising the kink solution of the continuum nonlinear Schrödinger equation [37,38,48] The torsion angles τ are then expressed as functions of the bond angles κ For the torsion angles, from (11) we conclude that the overall scale of the parameters (d, q) and (e, b) cancel in the expression (11). This leaves us with only three independent parameters. In (10) there are four parameters when we use translation invariance to remove s. Thus the profile of a single kink becomes fully determined in terms of seven independent parameters. This coincides exactly with the number of independent coordinates along a Cα backbone segment, with six residues. For this, we may always place the first residue to coincide with the origin of a Cartesian (xyz) coordinate system. We can always place the second residue along the z-axis, and we can always place the third residue on the x = 0 plane. Thus, there is only one independent coordinate for the three first residues. Since the remaining three residues can each be placed to arbitrary angular directions, there are six additional independent coordinates. Accordingly, a segment with six residues indeed engages seven independent parameters.

Clustering and percolation
We shall classify the loop structures in PDB in terms of the following clustering algorithm: • We define a cluster to be a set of loop fragments such that for each fragment in a given cluster there is at least one other fragment within a prescribed RMS cut-off distance. Two clusters are disjoint, when the RMSD between any fragment in the first cluster and any fragment in the second cluster exceeds this prescribed RMS cut-off distance. • We define the initiator of a cluster to be an a priori random loop fragment which defines the cluster by completion, as follows: We start with the initiator. We identify all those fragments in our entire data set which deviate from the initiator by less than the given RMS cut-off distance. We continue the process by identifying all those fragments, that deviate from the fragments that we have identified in the previous step, by less than the RMS cut-off distance. We repeat the procedure until we find no additional fragments in PDB, within the RMS cut-off distance from any of those fragments that have been identified in the previous steps.
The cluster is clearly independent of its initiator, any element of the cluster could be used as the initiator. But the cluster depends on the RMS cut-off distance. Moreover, if the RMS cut-off distance is too large, no clear clustering is observed.
According to [49] for a PDB protein structure which is measured with resolution 2.0 Å or better, the characteristic values of the thermal B-factors are mostly less than around From the Debye-Waller relation we then obtain the following estimate for the one standard deviation error in the atomic coordinates Thus, two loop fragments that have been measured with 2.0 Å resolution should be (in average) considered different only, when their RMS distance exceeds 0.65 Å.
The construction of PDB loop fragments in terms of the kink profile (10), (11) in those crystallographic protein structures which have been measured with resolution 2.0 Å or better, has been addressed in [41]. There, it was found that over 92 percent of loops can be covered in a modular fashion by 200 explicit kink profiles (10), (11), with RMSD accuracy that matches (13) i.e. with less than 0.65 Å RMSD deviation from the crystallographic structure. Thus 0.65 Å RMS distance is the appropriate RMS cut-off value, to search for for the more refined clustering patterns in those crystallographic structures which have been measured with resolution 2.0 Å. However, we find that the value 0.65 Å is too large, to observe clear clustering patterns. Accordingly, we shall search for clustering by considering only those PDB structures that have been determined with the ultrahigh resolution 1.0 Å or better. For these ultrahigh resolution structures, a precision better than the value (13) can be expected. To determine an appropriate value, we display in Fig. 5 the number of all Cα atoms in all currently available PDB structures, that have been measured with resolution 1.0 Å or better, as a function of their Debye-Waller fluctuation distance. For most of the structures, the fluctuation distance is clearly below the upper bound (13); the maximum of the curve is located at around 0.3 Å. We also observe the (essential) absence of structures with a fluctuation distance less than 0.1 Å; historically this distance is considered as the boundary wavelength between x-rays and γ -rays.
Using a combination of Fig. 5 with various tests that we have performed, we have arrived at the conclusion that 0.2 Å in RMS distance can be currently adopted as a reasonable estimate for the minimal zero-point fluctuation distance in ultra-high resolution structures, those that have been measured with better than 1.0 Å resolution. Thus we shall try and see, to what extent loops in these protein structures can be classified in terms of elemental fragments, such that two fragments are considered different when their RMS distance exceeds 0.2 Å. According to Fig. 5, over 99 % of individual Cα carbons that have been measured with below 1.0 Å resolution, have a B-factor fluctuation distance which is larger than 0.2 Å; our choice of cut-off distance is close to the 3-σ level.
We note that other cut-off values can be introduced; the ultimate appears to be 0.1 Å. But our qualitative conclusions are fairly independent of the value chosen, provided it is small enough to provide a clustering pattern. In this article our goal is to present a proof-ofconcept. To our knowledge, no related analysis has been previously attempted, to systematically classify the loop structures in ultra-high resolution crystallographic protein conformations, in a quantitative fashion using an energy function. In particular, no commonly accepted experimental standard exist, that we could rely on, to infer the "most preferred" cut-off value. We hope that such a value can be eventually inferred, from careful experimental measurements. Thus, at the moment we have no criterion to prefer any other particular value, 0.2 Å i.e. around 3-σ appears to be a reasonable choice at this point.
We start the identification of loop fragments, using the set of 200 fragments constructed in [41]. But our results are independent of the starting point, quite similar results can be obtained using a fairly generic set of loop fragments as a starting point. We note that the fragments in [41] have between five and nine residues, and most of them (116 out of 200) have six residues. We have already argued that six is the optimal number of residues in a loop fragment, as it matches the number of independent parameters in the kink profile (10), (11). Thus, we shall consider only fragments that have six residues, in the clustering algorithm. In this manner, we find that we can classify all PDB fragments into clusters, each determined by their initiator.
We have found that there are clusters that have a very large number of fragments. But we also find that there

Clustering
We have constructed our clusters by starting with the 200 loop fragments that were introduced in [41]. Around 92 % of all loops in those PDB structures that have been measured with resolution better than 2.0 Å, are within a 0.65 Å RMS distance from some of the 200 loop fragments. However, when we decrease the RMSD cut-off distance to 0.2 Å, which is the cut-off distance used in the present article, the coverage drops to below 2 % [41]. We remark that the authors of reference [41] did not investigate clustering, as the concept is defined here. In [41] all the RMS distances were evaluated from the fixed set of 200 loop fragments, and the coverage of PDB loop structures was determined in terms of these fixed loop fragments.
When we specify to the present subset of PDB structures in [41] that have been measured with better than 1.0 Å resolution, we find that a total of 102 out of the 200 fragments in [41] have been measured with this resolution. We use these 102 loop fragments as the initiators, to start our clustering construction.

clusters
The 102 loop fragments in [41] that have been measured with better than 1.0 Å resolution, have between five and nine residues. Here we have argued that a loop fragment modelled by (10), (11) has six residues. There are 70 such clusters among the 200, but only 14 of them contain more than 30 fragments. Moreover, two of these merge together into an α-helical structure, when we subject them to our clustering algorithm; we call them bends instead of kinks. The remaining 12 loop fragments determine clusters which cover around 38 % of the 1.0 Å protein loop structures, when we use our 0.2 Å RMSD cut-off. These loop fragments are our final initiators. In Table 1 we list the PDB entry codes and residue numbers of these initiators. We proceeded to describe some of the major features of the ensuing 12 clusters. Additional details including a breakdown according to amino acid constituents in each cluster, are presented in Figure S2 of Additional file 1.
The Figs. 6 and 7 show the (κ, τ ) distribution in each of the 12 clusters on the stereographically projected twosphere of Fig. 4; note that the definition of bond angle takes three residues while the definition of torsion angle takes four. Thus for a 6 residue loop fragment there are three (κ, τ ) pairs. The fourth κ-value could be used to refine the loop classification, but here this possibility is not considered.
In Figs. 8 and 9 we show the three dimensional pictures of the initiators of the twelve clusters.
A detailed inspection reveals that except for IV, all the initiators have the canonical structure of a single kink, in terms of the folding index (8). Moreover, the initiator I is part of a short loop connecting an α-helix and a β-strand. However, the bond and torsion angle spectrum which we display in Fig. 10a shows that this loop is actually a pair of two kinks which are very close to each other, and the initiator I is the second kink along the backbone.
On the other hand, a comparison with (8) suggests that the initiator IV exhibits a somewhat small variation in the values of the torsion angles, for a kink. This can be seen in Fig. 6. The torsion angle values suggest that the initiator IV resembles more a bent α-helix than a kink. In Fig. 10b, c we show the spectrum of the bond and torsion angles of the initiator IV, both before and after we have implemented the Z 2 gauge transformation. Since this bent structure determines an isolated cluster according to our 0.2 Å cut-off criteria, it is included among our loop fragments.
In Figs. 11 and 12 we show the three dimensional figures of each of the twelve clusters, including all the entries.
Finally, we have also investigated how the coverage of the 12 clusters increases, when we increase the cut-off distance. The results are shown in Table 2.  Table 1 are shown on the stereographic map like Fig. 4a; In each panel the order along the Cα backbone is red → blue → yellow Fig. 7 The stereographic maps of 12 clusters VII-XII. The clusters VII-XII in Table 1 are shown on the stereographic map like Fig. 4a; The ordering along the Cα backbone is red → blue → yellow

Cluster elongation and completion
In addition of the 12 initiators listed in Table 1, among the 102 loop fragments of [41] that we have considered, there is also one initiator that has only five residues. The PDB code is 1p1x_A (80-84). The ensuing cluster with five residue long elements is very large: There are a total of 42618 entries. The reason for the occurrence of such a large cluster is that the RMSD clustering criteria 0.2 Å is too large for revealing clustering patterns in five-residuelong loop segment: The five-residue-long loop fragment covers all the five-residue-long loops, within the chosen cut-off criterion. In Fig. 13 we show the distribution of (κ, τ ) values in this cluster.
There is also an overlap with each of the 12 clusters that we obtained previously. Together the 13 clusters cover around 96.1 % of all PDB loop structures.
It is apparent that an initiator with only five residues is too short to identify a clustering pattern of PDB loops, even with 0.2 Å precision. Consequently we have elongated this initiator. For this, we have systematically added residues at the beginning and at the end of the individual elements in its cluster, to search for clustering patterns. For example, we may take the element 1p1x_A (80-84), elongate it to 1p1x_A (80-85) and 1p1x_A (79-84), and then use these two elongated ones as initiators to do the clusterings: We denote by H, S and L a residue which is located in a helix, strand and loop respectively, according to the PDB classification. The five residue long cluster which is generated by 1p1x_A (80-84) consists of several different elements, such as for example LLLLL, HLLLL, LLLLS etc.
As an example, we have selected the pattern LLLLL which has the largest number of elements; there are a total of 7901 elements. We have elongated each of these 7901 elements into a protein loop fragment with six residues, by incorporating the corresponding PDB residue which is  Table 1 are shown in their three dimensional backbone environment. The (dark) red color identifies the initiator, and the (light) yellow color identifies the immediate backbone environment Fig. 9 The initiators of the 12 clusters VII-XII. The initiators VII-XII listed in Table 1 are shown in their three dimensional backbone environment. The (dark) red color identifies the initiator, and the (light) yellow color identifies the immediate backbone environment   Table 1 are superimposed in three dimensions. The colour ranges from red (initiator) to blue (the entry with largest RMSD distance from initiator) Fig. 12 The 3D superimposed structures for 12 clusters VII-XII. The clusters VII-XII in Table 1 are superimposed in three dimensions. The colour ranges from red (initiator) to blue (the entry with largest RMSD distance from initiator) either right before the first L residue, or immediate after the last L residue. In this manner we find 15802 different loop fragments with six residues each. We have investigated the corresponding clustering patterns: There are 30 new clusters with more than 30 elements, bringing the total number of the clusters with more than 30 elements, to 42. We list these 30 additional clusters in Table 3. In Figs. 14, 15 and 16 we display the (κ, τ ) distributions of these 30 clusters. A visual inspection of these clusters reveals, that at the level of the (κ, τ ) distribution the cluster 26 appears to display additional sub-clustering. But the present cut-off value 0.2 Å is not sufficiently refined to detect this sub-clustering, at the level of RMS distance. Furthermore, the clusters 29 and 30 both appear to merge with the regular β-strand. In Fig. 17 we show the corresponding initiators: The cluster 29 is clearly a loop, while the cluster 30 consist of the regular β-strand and thus we exclude it from our set of loop fragments. This leaves us with a total of 41 clusters, with 30 or more loop fragments.  . 13 The stereographic map generated by cluster 1p1x_A (80-84).
In a the distribution of the first (κ, τ ) and in b the distribution of the second (κ, τ ). Note the widely spreaded distributions of this cluster These clusters cover around 52 % of all loop structures in PDB.
By completing the elongation process we have identified 3240 different clusters with 0.2 Å cut-off. These clusters cover around ∼85 % of all those PDB loop sites in our set of resolution better than 1.0 Å proteins. Among these clusters there are 1677 unique ones, in the sense that the cluster has only single element. Thus, around 14 % of all loop structures in PDB appear to be unique, to the given protein. In addition, there are 1531 rare clusters with two or more, but less than 32 elements. Thus, there are 32 clusters with 32 or more elements.
The remaining ∼15 % of loop fragments that are not covered by the 3240 clusters, can be constructed by completion. For example, we can search for novel clusters by using the patterns other than LLLLL in the five residue cluster generated by 1p1x_A (80-84). But when the four patterns HLLLL, LLLLH, SLLLL and LLLLS are included the coverage increases no more than around one per cent.

Cluster percolation
We have also investigated the relation between the sequence and the structure, using the 42 clusters listed in Tables 1 and 3. Here we only describe some of the major features, more details can be found in Figure S3 in Additional file 1.
There are several examples of identical sequences that correspond to different structures in different proteins. Accordingly a sequence clearly does not determine a unique structure. When a given sequence gives rise to multiple structures, we have a phenomenon we call cluster percolation. These sequences with multiplet structures may be utilised to try and introduce novel clusters.
For example, in Table 4 those sequences that are found both in Cluster VIII and outside of it, are listed, together with their PDB identifications and RMSD to the initiator of Cluster VIII.
As an example, in Fig. 18a we compare the four PDB structures that have the identical sequence SDGNGM in  Fig. 12 also reveals that both 1iee A (100-105) and 4b4e A (100-105) are clearly outside of this cluster. Figure 18b shows the comparison of the sequence ADGKPV to the initiator. The difference between the structures of 4hen A (54-59) and the initiator is again clear. The structure of 4hen A (54-59) is also quite different from the structures in Fig. 18a, and from the Cluster VIII shown in Fig. 12.
In Table S1 of Additional file 1 we list those sequences that appear both in the 12 clusters of Table 1 and in protein structures which are not contained in any of the clusters. We have investigated these structures, and found 454 new clusters. But most of them have very few elements, only two of them have more than 30 elements. With these new clusters the coverage becomes increased to 88 %. In Fig. 19 we show the (κ, τ ) distributions on the stereographically projected two-sphere of the two clusters with more than 30 elements; the initiators are 1ix9_A (133-138) and 3aj4_B (73-78) correspondingly. These two clusters are found by considering the sequences LKGDKL in cluster III and KDCMLQ in cluster XI, respectively.

Example: Myoglobin
Myoglobin is a widely studied protein, thus we have analysed its loop structure from the present perspective. We  Table 3. The ordering along the Cα backbone is red → blue → yellow Fig. 15 The stereographic map of the clusters 11-20 in Table 3. The ordering along the Cα backbone is red → blue → yellow have chosen the crystallographic oxymyoglobin structure 1A6M [50] which is one of the few myoglobin structures that have been measured with resolution better than 1.0 Å, for our comparative study.
We have located in 1A6M four putative kink segments with six residues each, that are either unique or very rare in PDB, with our 0.2 Å RMSD cut-off. These kinks are located between helices C and D, and between helices E and F. The two putative kinks between helices C and D correspond to the residue sites (41)(42)(43)(44)(45)(46) and (48)(49)(50)(51)(52)(53). The two putative kinks between helices E and F correspond to residue sites (77-82), and the practically overlapping (78-83). In Fig. 20 we show how in our PDB set, the number of matches for each of these four kinks depends on the RMS cut-off distance.
The 1A6M is closely related to the PDB entries 1A6G, 1A6K and 1A6N; they represent four different ligation states of the same protein. Each of the three 1A6G, 1A6K and 1A6N have been measured with resolution above 1.0 Å, thus they do not appear in our data set. In Table 5 the RMS distance of the four rare kinks of 1A6M are compared to the corresponding kinks in 1A6G, 1A6K and 1A6N. All the RMSD values are below the cut-off 0.2 Å.
We conclude that the four kinks are stable, in the sense that they do not change their conformation when the ligation state changes.

Chain inversion
Finally, the operation of local chain inversion along a protein segment is defined as a mapping, that sends a sequence with Cα coordinates { r(i), r(i + 1), . . . , r(i + k − 1), r(i + k) } into a sequence with Cα coordinates { r(i + k), r(i + k − 1), . . . , r(i + 1), r(i) } We note that a regular secondary structure such as an α-helix becomes mapped onto itself i.e. remains invariant under chain inversion. But we have found that the 12 clusters that we have constructed are not inversion Fig. 16 The stereographic map of the clusters 21-30 in Table 3. The ordering along the Cα backbone is red → blue → yellow  Table 3. The cluster 29 consists of loops, while the cluster 30 consist of regular β-strands invariant; the inversion does not map a cluster onto itself. Thus one might expect that new clusters could be found by inversion of these clusters. However, surprisingly we have found only one single example of a PDB segment by inversion. This is the segment (1115-1120) in the PDB structure 1MC2. Thus local chain inversion is apparently a broken symmetry, in the case of protein loops. This sets the loops apart from the regular structures like α-helices and β-strands.

Discussion
We have introduced the concept of loop clustering to analyse those ultrahigh resolution crystallographic protein structures in PDB, that have been measured with resolution 1.0 Å or less. We have chosen these structures since we expect, that in the case of a ultrahigh resolution measurement there should be less need to introduce structure validation. Thus there should also be less bias towards a priori chemical knowledge and stereochemical paradigms, in this subset of PDB proteins. Moreover, our investigation of 2.0 Å subset shows that high resolution is necessary to reveal the clustering structure in the case of protein crystals.
We have inquired to what extent the protein structures can be constructed in a modular fashion. For the modular building blocks we have chosen different parameterisations of the unique kink solution to a generalised discrete nonlinear Schrödinger equation. The precision we have used as a criterion in making a difference between two structures is 0.2 Å in RMSD. We have concluded that this should be the shortest meaningful RMS distance that can be introduced, at the moment, to classify different modular protein components.
We have identified a set of 12 different kink parameterisations, which cover around 38 % of all PDB loop structures. Accordingly, these are loop patterns that are abundantly present in the folded proteins. It appears to us, that these kinks are often located in such protein segments that are structurally important, as opposed to those that are functionally important. We have introduced various techniques to extent the initial set of 12 kinks, and we have found that around 52 % of loop regions become covered when we introduce a set of 29 additional kinks. But in order to cover the remaining ∼48 % of protein loops, we need to substantially increase the number of kinks. For example, we need to introduce over 1000 kinks to cover over 88 % of loops. In particular, we have concluded that there are several kinks that are very rare, even unique, in PDB when we use the present cut-off value. We propose Fig. 19 The (κ, τ ) distributions of the two clusters with more than 30 entries obtained by percolation. Clusters with initiators a 1ix9_A (133-138) and b 3aj4_B (73-78) that a rare or even unique kink should have a an important functional rôle, in a protein. This can be exemplified by the myoglobin 1A6M segments (41)(42)(43)(44)(45)(46), (48)(49)(50)(51)(52)(53) and (78-83) which are all rare. These segments also constitute the CD corner and EF corner in myoglobin, which have been argued to be closely related to the ligand migration process [51,52].

Conclusions
Protein loops are built in a modular fashion, in terms of various parametrisations of the kink solution to a generalised version of the discrete nonlinear Schrödinger equation. Most loops can be built from a very small number of modular components, these loops are most likely important for the overall structure of the protein.  However, there are also several unique, or very rare loops, which are most likely related to the function. The amino acid sequence does not define the structure uniquely, instead a given sequence can give rise to several different conformations.

Availability of supporting data
The datasets supporting the result of this article are available in Protein Data Bank (PDB) by confining the resolution better than 1.0 Å (http://www.rcsb.org).