- Research article
- Open Access
A comprehensive analysis of non-sequential alignments between all protein structures
BMC Structural Biology volume 7, Article number: 78 (2007)
The majority of relations between proteins can be represented as a conventional sequential alignment. Nevertheless, unusual non-sequential alignments, in which the aligned fragments have different connectivity in the compared proteins, have been reported by many researchers. It is interesting to understand whether these non-sequential alignments are unique, sporadic cases or occur frequently, and whether they belong to a few specific folds or are spread among many different folds as a common feature of protein structure. We present here a comprehensive large-scale study of non-sequential alignments between available protein structures in the Protein Data Bank.
The study was conducted on a non-redundant set of 8,865 protein structures aligned with the aid of the TOPOFIT method. Depending on variations in the parameters, between 17.4% and 35.2% of all alignments are estimated to be non-sequential. Analysis of the data revealed that non-sequential relations between proteins do occur systematically and in large quantities. Various sizes and numbers of non-sequential fragments have been observed, with all possible complexities of fragment rearrangements found for alignments consisting of up to 12 fragments. Non-sequential alignments are not limited to proteins of any particular fold and are present in more than two hundred folds. Moreover, many of them are found between proteins with different fold assignments. It has been shown that protein structure symmetry does not explain non-sequential alignments. Therefore, compelling evidence has been provided that non-sequential alignments between proteins are systematic and widespread across the protein universe.
The phenomenon of the widespread occurrence of non-sequential alignments between proteins might represent a missing rule of protein structure organization. More detailed study of this phenomenon will enhance our understanding of protein stability, folding, and evolution.
Protein structure comparison is a fundamental approach in many areas of biomedical studies. Its applications range from protein classification and establishing evolutionary relationships between proteins to functional prediction, molecular modeling and protein engineering. While structure comparison can be done in a number of ways, protein structure alignment is one of the major techniques used, with more than 40 methods available today, the most complete list of which can be found on Wikipedia. These methods rely on a wide variety of statistical, geometrical, physical, and other structure properties in order to produce an alignment. But most of them follow a simple sequential rule: two proteins are aligned in sequential order, by placing their chains adjacent to each other from N-terminal to C-terminal and introducing gaps.
The key representation of such a sequential alignment was introduced as a matrix approach by Needleman and Wunsch, which states that, given a scoring function, the optimal alignment is the best path through the matrix. This approach has given rise to a large number of methods for sequence and structure alignment and has led to many advances in our understanding of protein similarities, their evolutionary relationships, functionality and so on. However, a number of cases reported in the literature are unusual from the sequential point of view: structurally equivalent parts have different connectivity in the sequences of the compared proteins. These alignments cannot be represented as a diagonal path through the matrix. Figure 1 shows an example of such an alignment. The alignment consists of four segments; only three of them can be included in a sequential alignment. Since the remaining segment is a part of the alignment, but is not in sequential order, it is called non-sequential (NS); accordingly, the alignment is called non-sequential. A non-sequential alignment is an alignment where structurally similar parts are not in the same order in the protein sequences.
It is interesting to understand more about these types of alignments: whether they are unique, sporadic cases or occur frequently, and whether they belong to a few specific folds or are spread among many different folds as a common feature of protein structure. Such a large-scale study is also important for the theoretical understanding of protein organization and the evolution of proteins, and the non-sequential approach has a practical application as a design tool in protein engineering.
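The definition above can be illustrated with a minimal sketch. It assumes, purely for illustration, that an alignment is given as a one-to-one list of residue index pairs (i, j), where residue i of the first protein is structurally equivalent to residue j of the second; the name `is_sequential` and the example data are hypothetical and are not part of TOPOFIT.

```python
from typing import List, Tuple

# Hypothetical representation: (i, j) means residue i of protein A is
# structurally equivalent to residue j of protein B.
Alignment = List[Tuple[int, int]]


def is_sequential(pairs: Alignment) -> bool:
    """Return True if the pairs, ordered by position in A, also increase in B.

    Such an alignment can be drawn as a single monotone path through the
    alignment matrix; anything else is non-sequential.
    """
    ordered = sorted(pairs)
    return all(b1 < b2 for (_, b1), (_, b2) in zip(ordered, ordered[1:]))


# Made-up examples: the second one places the N-terminal part of A
# against the C-terminal part of B.
sequential = [(1, 5), (2, 6), (3, 7), (10, 20), (11, 21)]
non_sequential = [(1, 30), (2, 31), (10, 5), (11, 6)]
```

With these made-up inputs, `is_sequential(sequential)` holds while `is_sequential(non_sequential)` does not, because the second pair of fragments appears in the opposite order in protein B.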
Many researchers have reported cases of non-sequential alignments such as circular permutations, domain or region swaps [3–15], and β-hairpin flips [6, 10]. The most studied case of non-sequential alignment is a circular permutation, in which the N-terminal of each aligned protein is aligned with the C-terminal of the other protein. Circular permutations have been analyzed by both sequence and structure related computational methods [16, 17]. A suggested evolutionary mechanism for circular permutation in proteins states that first a gene duplication of the precursor gene occurs in such a way that both genes become fused in frame, leading to a tandem protein. After generation(s) of a new start codon within the 5' part of the tandem gene and a stop codon at an equivalent position in the 3' part of the gene, a protein is encoded that represents a circular permutation of the precursor gene product. Later the mechanism was shown to be valid for a protein family of adenine-N6 DNA methyltransferases. Many naturally occurring proteins have been experimentally redesigned to have a circular permutation, and it was shown that they preserve their structure and function [20–30], thus providing evidence that circular reordering of protein structural elements does not affect protein folding and functionality.
The appearance of similar domains/regions in different orders in sequence, as a domain/region swap, has been analyzed by Fliess and coworkers. Their study was based on sequence alignments of proteins in the Swiss-Prot database, where they found 140 swap cases and concluded that the swapping of regions is a relatively rare evolutionary event. A comparatively large (at that time) structure-based analysis of non-sequential cases was reported about a decade ago, in which 426 representative structures from the PDB were analyzed by the SARF2 method. Along with other results, that work presented several cases of non-sequential alignments and estimated that they are found in 11% of cases.
Since then, several methods for protein structure alignment have been developed that can produce non-sequential alignments [15, 33–38], including TOPOFIT, developed in our group. The MASS method was developed to produce multiple structure alignments; GANGSTA and SCALI were suggested for structure classification; SSM and KENOBI appear to be computationally efficient; and OPAAS was applied to the analysis of alternative structure alignments. TOPOFIT compares topologies of Delaunay tessellation patterns calculated using positions of Cα-atoms in protein structures and does not assume any sequential order of residues in an alignment. Its distinctive feature is that the method does not balance a lower RMSD against a higher number of aligned positions (Ne) but rather identifies the largest group of residues that have the same neighbors in the same locations in both compared structures, defined mathematically as a topological invariant and detected by a saturation point (topomax point) in the spatial tessellation graph. Such an objective methodology provides unambiguous identification and separation of the structurally invariant parts from the variable parts by identifying a precise border between the two. Unlike all other methods that can produce non-sequential alignments, which compose alignments from fragments or secondary structure elements, TOPOFIT extends an alignment residue pair by residue pair and thus is not biased by fragment choice or secondary structure element definition. The method is also computationally efficient, so that all proteins in the PDB (as of July 2005) have already been calculated, grouped into clusters and stored in the TOPOFIT-DB database. We have used TOPOFIT in our comprehensive large-scale analysis of non-sequential relations between proteins. To the best of our knowledge this is the first comprehensive large-scale analysis of non-sequential alignments between all available protein structures.
Non-sequential alignments between proteins do occur systematically and in large quantities
A comprehensive large-scale analysis of 8,865 non-redundant representatives, one from each protein cluster in TOPOFIT-DB, has been performed. TOPOFIT-DB is a collection of alignments for all significant values of Z-score, i.e. Z-score > 3. From our experience of using TOPOFIT-DB, the range of Z-score values from 3 to 5 is a "twilight zone" where, together with structurally significant alignments, there are also trivial cases containing just one or two secondary structure elements, while alignments at Z-score > 5 typically represent high structural similarity between proteins. To ensure the validity of this study we used an even tighter criterion: only the alignments with very high structural similarity, Z-score > 7, have been collected, resulting in a total of 82,263 structurally similar protein pairs. These alignments are referred to as dataset D1. The alignments collected in dataset D1 are considerably large in size (with an average of 120 aligned amino acid residues) and represent a high structural match (RMSD < 2 Å), as shown in Figure 2. Thus, there is no doubt of their structural similarity.
Another dataset has been collected by compiling alignments between protein families as defined by SCOP (release 1.69). For each family, the first structure in the list of proteins for the corresponding family has been used as a representative, resulting in 2,845 representatives. By comparing the representatives, 4,045,590 structural alignments have been produced and stored in the TOPOFIT-DB database. As for dataset D1, only alignments with Z-score > 7 have been used, resulting in a total of 4,648 alignments. The distributions of their alignment sizes and RMSD are similar to the ones for dataset D1. These alignments will be referred to below as dataset D2.
The most striking and surprising result from the analysis performed here is that non-sequential (NS) alignments have been found in large quantities in structurally similar proteins. In other words, there are many alignments between highly structurally similar proteins for which the alignment matrix is not diagonal. The overall proportion of non-sequential alignments was estimated to be as high as 35.2%, and not lower than 17.4% when tightened thresholds were applied (see details later in Table 1). The detected non-sequential alignments show a large variety of alignment patterns, with various orders of alignment fragments in structurally similar proteins as well as various sizes and numbers of non-sequential fragments. They can be as simple as an almost sequential alignment with the rearrangement of a single fragment, or so complex that it is hard to define what the sequential part of the alignment is. Even more interesting, many cases of reverse alignments have been detected, i.e. alignments where fragments structurally match each other but the polypeptide chains go in opposite directions.
Types of observed non-sequential alignments
The simplest and also the most studied case of non-sequential alignment is a circular permutation, which is defined as a case where the structurally equivalent part of a protein has been rearranged from the N- to the C-terminal (or vice versa) in the protein sequence. An example of a circular permutation alignment for phosphoinositide-specific phospholipase C delta (PDB-code 2isd:A) and the C2-domain of synaptotagmin I (PDB-code 1rsy) is shown in Figure 3 (both proteins are from Rattus norvegicus). The structures are aligned at Ne = 108 and RMSD = 1.2 Å, where Ne is the number of equivalent residues in the alignment and RMSD is the root mean square deviation between Cα-atoms of the equivalent residues; the alignment consists of two parallel layers of 4 β-strands. In synaptotagmin one of the β-strands is located at the N-terminal end, while in phospholipase its structural equivalent is at the C-terminal end. This β-strand is the non-sequential part of the alignment and can be seen on the alignment plot as a small fragment (in green) parallel to the long sequential alignment (Figure 3d).
Similar to the circular permutations, there are also alignments with just one structurally equivalent part rearranged in the sequence, but not necessarily from the N- to the C-terminal. An example has already been shown in Figure 1, where there is a long sequential alignment, while the non-sequential part (NS) is located in the middle of the alignment. Another example of an alignment of this type is shown in Figure 4, where the structures of 2-dehydro-3-deoxygluconokinase from Thermus thermophilus (PDB-code 1v1b) and ADP-dependent glucokinase from Thermococcus litoralis (PDB-code 1gc5:A) are aligned at Ne = 234 residues and an RMSD of 1.7 Å. In this example, two structurally equivalent regions, 1) an α-helix and 2) an α-helix and a β-strand, are located one after another but in a different order in the sequences of the compared proteins. Most of the alignment is sequential; namely, one can produce a long sequential alignment out of the aligned residues with only a small part of it being non-sequential, either magenta or orange in the picture. It is evident that if those parts were swapped in either of the sequences then one would get a perfect sequential alignment. Based on this observation, we will call such alignments "swaps". Interestingly, the functionality of these proteins is similar and involves ATP/ADP binding. Moreover, the binding site residues come from the non-sequential parts.
Another type of simple non-sequential alignment is similar to the above examples, but differs in the direction of the polypeptide chain. Such an alignment is observed when all the structurally aligned fragments have the same order in the sequences, but the direction of the chains in one fragment is opposite, i.e. in one protein the residues in this fragment go from the N- to the C-terminal, while in the other protein they go from the C- to the N-terminal. An example of such an alignment is shown in Figure 5 for AdoMet-dependent methyltransferase from Mycobacterium tuberculosis (PDB-code 1i9g:A) and zeta-crystallin from Homo sapiens (PDB-code 1yb5:A). These two structures are very similar (RMSD is 1.7 Å), with the non-sequential region found at the place where an antiparallel β-strand of methyltransferase is aligned to a parallel β-strand of zeta-crystallin. There is no permutation of fragment order in these proteins; most of the alignment is sequential, while the reverse part, just 10 residues, is small but noticeable. To separate such cases (with opposite direction in the aligned chains) from the previous alignments, we will call aligned fragments with the same direction of the polypeptide chain 'forward' and those with the opposite direction 'reverse'.
More complex examples consist of alignments with several non-sequential fragments, which can be forward and/or reverse. As shown in Figure 6, an alignment of UDP-galactose 4-epimerase from Escherichia coli (PDB-code 1kvu) and catechol O-methyltransferase from Rattus norvegicus (PDB-code 1vid) has four non-sequential fragments, one of which is reverse. The two proteins share a large common structural part, consisting of 137 residues superimposed at an RMSD of 1.7 Å. The major part of it is the long sequential alignment, while the non-sequential fragments are three secondary structural elements (an α-helix and two β-strands) and an irregular fragment of four residues. Even though the number of residues in the non-sequential fragments (24 residues) is not that large, the permutation of fragments in the sequences of the proteins is complex, as shown in the schematic diagram (Figure 6d).
The above examples share a common feature: one can clearly identify a long sequential segment in the alignment, with the non-sequential part(s) being substantially smaller than the sequential one. While alignments with such a feature occur frequently, we have also observed many cases without a dominant sequential part. An example of such a case is shown in Figure 7, displaying an alignment of the alpha subunit of 2-oxoisovalerate dehydrogenase from Homo sapiens (PDB-code 1v16:A) and molybdenum cofactor biosynthetic enzyme from Escherichia coli (PDB-code 1di6:A). Both proteins belong to the α/β class, but to different folds: the THDP-fold and the molybdenum cofactor biosynthetic enzymes fold, respectively. The core of the domains consists of five β-strands surrounded by six α-helices. In the dehydrogenase all strands are parallel, while in the biosynthetic enzyme one of the strands (namely β 5) is antiparallel. The structures are aligned with Ne = 95 residues and an RMSD of 1.6 Å. The structural alignment consists of six fragments (Figure 7); one of the fragments contains an α-helix and a β-strand (22 residues), while the others are single secondary structure elements: α-helices or β-strands. Four parallel β-strands are well aligned, but their order in the polypeptide chains is completely different (see Figure 7b and 7c), i.e. β 2 is aligned to β 4, β 3 to β 3, β 4 to β 2, and β 5 to β 1. The order of the α-helices is also different in the two polypeptides (α 1 is aligned to α 3, α 3 to α 6, and α 6 to α 2). Interestingly, the sizes of the aligned β-strands are almost the same, while the sizes of the α-helices are different, e.g. helix α 6 in the dehydrogenase has an extra turn compared to the corresponding helix α 2 in the biosynthetic enzyme. The longest possible sequential alignment is just 25 residues long, which is less than one third of the entire structural alignment.
Another interesting type of alignment is a completely reverse alignment. In this type two proteins share significant structural similarity, while their sequences align in opposite directions in all the aligned fragments. To the best of our knowledge, only one case of reverse alignment is well known: the α-helix bundle with several helices, where one or many of the helices can be aligned in the opposite direction. In the present study many cases of reverse alignments have been found. A complex reverse alignment of adenylate kinase from Methanococcus thermolithotrophicus (PDB-code 1ki9:A) and glucose/galactose-binding protein from Salmonella typhimurium (PDB-code 1gca) is shown in Figure 8. The alignment consists of four segments. The longest segment consists of four consecutive fragments: α-helix, β-strand, β-strand, and α-helix. In both proteins the segments have long insertions: in the adenylate kinase three helices are inserted between the two aligned β-strands, while in the glucose/galactose-binding protein another domain is inserted between the second aligned β-strand and the last aligned α-helix. The fourth segment represents an alignment of a consecutive α-helix, β-strand, and α-helix. The remaining two segments each represent an alignment of a single β-strand. This is a remarkable example of how the same structure can be formed by the polypeptide chain going in opposite directions; moreover, the order of the segments forming the structure is different in the two sequences.
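The RMSD quoted for each example can be sketched as follows. This is a minimal illustration that assumes the two structures have already been superimposed and that the Cα coordinates of the Ne equivalent residues are supplied in matching order; the function name `rmsd` and the coordinate representation are assumptions made here, and the superposition step itself is not shown.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root mean square deviation between paired C-alpha positions.

    coords_a[k] and coords_b[k] are the (already superimposed) positions
    of the k-th pair of equivalent residues, so len(coords_a) equals Ne.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Because the sum runs only over equivalent residues, a method can always lower the RMSD by aligning fewer residues, which is why Ne and RMSD are reported together.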
General statistics on all the different alignment types are shown in Table 1 and described in the following sections.
Non-sequential alignments can be trivial if they occur as a result of symmetry or a shift in protein structure, but such cases are easily detected, because an alternative sequential alignment should then exist. It is known that proteins with symmetries and repeats have many alternative alignments; thus, for each protein pair we have evaluated all possible alternative alignments of similar length (ΔNe < 20). Once an alternative sequential alignment had been found, the protein pair was considered to be sequential. Only those non-sequential alignments without any alternative sequential alignment have been considered true non-sequential cases and are included in the following analysis.
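The filtering rule described above can be sketched as a small predicate. This is an illustration under stated assumptions: the `Candidate` record and the function name are hypothetical, and ΔNe is read here as the absolute difference in the number of equivalent residues between the best alignment and an alternative one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical summary of one alignment of a protein pair."""
    n_equiv: int      # Ne, number of equivalent residues
    sequential: bool  # True if the alignment is sequential


def is_true_non_sequential(best: Candidate,
                           alternatives: List[Candidate],
                           max_delta_ne: int = 20) -> bool:
    """Keep a non-sequential alignment only if no sequential alternative
    of similar length (|delta Ne| < max_delta_ne) exists for the pair."""
    if best.sequential:
        return False
    for alt in alternatives:
        if alt.sequential and abs(best.n_equiv - alt.n_equiv) < max_delta_ne:
            return False  # trivial case, e.g. caused by symmetry or a shift
    return True
```

For instance, a non-sequential alignment with Ne = 120 would be discarded if a sequential alternative with Ne = 110 exists, but kept if the only sequential alternative has Ne = 90.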
General classification of non-sequential alignments
We have classified non-sequential alignments between proteins into three classes based on the types of fragments in the alignment: forward (all fragments are of the forward type), reverse (all fragments are of the reverse type), and mixed (different fragment types). Furthermore, each class has been subdivided into subclasses based on the pattern of fragment permutation: simple (the order of fragments is not permuted), circular (cases fitting the definition of circular permutation), swaps (two fragments are swapped, but the case is not a circular permutation), and complex (all other cases). Statistics on the number of non-sequential cases using different thresholds (see Methods) and considering alternative alignments are summarized in Table 1.
As seen from Table 1, the majority of non-sequential alignments (13.2–22.7%) are of the forward class; the number of mixed alignments is smaller but still considerable (3.9–10.7%), while the reverse alignments are much less populated (0.3–1.8%), with only several hundred such cases found. The forward circular alignments form the most populated subclass, with more than 50% of all non-sequential alignments belonging to it.
There is a clear tendency for more complicated alignments to be less prevalent in the forward and reverse classes, i.e. there are fewer complex than swap alignments, and fewer swap than circular alignments. Contrary to this tendency, more complicated alignments in the mixed class are more abundant, i.e. there are more complex than swap alignments, and more swap than circular alignments. Interestingly, the number of simple alignments in this class is of the same order as the number of complex ones, i.e. if an alignment has two types of fragments (reverse and forward) then it tends to be either very simple (no permutations) or very complex (many permutations). Table 1 also demonstrates that variation in parameters (using different thresholds and considering alternative alignments) does change the proportion of non-sequential alignments; nevertheless, the proportion remains significant, of the order of 20%. Table 1 also shows that the use of different datasets results in comparable numbers, thus cross-checking the obtained numbers.
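The two-level classification can be sketched as below. The sketch assumes an input representation that is not taken from the paper: one direction label per fragment, and the fragment order given as a permutation, where `order[k]` is the position in the second protein of the fragment that comes k-th in the first. Treating a "swap" as an exchange of exactly two fragments is also an assumption of this sketch; the exact criteria are given in the Methods.

```python
from typing import Sequence


def fragment_class(directions: Sequence[str]) -> str:
    """Class by fragment type: 'forward', 'reverse' or 'mixed'."""
    kinds = set(directions)
    if not kinds or not kinds <= {"forward", "reverse"}:
        raise ValueError("directions must be 'forward' or 'reverse'")
    if kinds == {"forward"}:
        return "forward"
    if kinds == {"reverse"}:
        return "reverse"
    return "mixed"


def permutation_subclass(order: Sequence[int]) -> str:
    """Subclass by fragment order: simple, circular, swap or complex."""
    n = len(order)
    identity = list(range(n))
    perm = list(order)
    if perm == identity:
        return "simple"    # order of fragments is not permuted
    if any(perm == identity[k:] + identity[:k] for k in range(1, n)):
        return "circular"  # a cyclic shift of the original order
    displaced = [i for i in range(n) if perm[i] != i]
    if len(displaced) == 2:
        return "swap"      # exactly two fragments exchanged
    return "complex"
```

Under these assumptions, `permutation_subclass([2, 0, 1])` gives "circular", `permutation_subclass([1, 0, 2])` gives "swap", and `permutation_subclass([2, 0, 3, 1])` gives "complex". Circularity is tested before the swap test, so a two-fragment exchange such as `[1, 0]` counts as circular, in line with the wording "swapped, but not a circular permutation".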
NS alignments occur across many folds, as well as between different folds
Since all structures in SCOP are split into domains and classified, the D2 dataset is better suited for analysis of the distribution of alignments among protein folds. All alignments can be clearly separated into three groups by the dominant type of secondary structure elements of the aligned residues: all-α, all-β, and a mixture of α and β (see statistics in Table 2). The majority of non-sequential alignments (48%) are found for proteins with a mixture of helices and sheets, while for the all-α and all-β groups the proportions are 24% and 28% respectively. Remarkably, these proportions are not very different from the proportions for all alignments, showing an even distribution of non-sequential alignments among protein classes. Another interesting fact is that consideration of alternative alignments eliminates a large number of symmetry and/or shift related cases (23% of total alignments), with the majority of all-α alignments being α-helical bundles.
The following observations have been made using true non-sequential alignments: 17,428 in dataset D1 and 1,130 in dataset D2 (first row in Table 2). Non-sequentially related proteins have been found in 272 folds, and the folds in which non-sequential alignments are most frequently found are presented in Table 3. While one can see that many non-sequential cases are found for proteins with symmetrical structure, the frequency of non-sequential alignments has to be normalized to the occurrence of proteins in a particular fold to allow for proper comparison of numbers. In other words, one has to compare the fraction of non-sequential alignments in each fold. The table shows that a typical fraction of non-sequential alignments within a particular fold, regardless of its symmetry, is of the order of 20–30% (bold columns). Moreover, the fraction of non-sequential alignments for proteins with different folds (30–40%) is of the same order of magnitude as for proteins with the same fold. Interestingly, up to 50% of non-sequential alignments are found for proteins with different folds, which signifies that non-sequential alignments are not limited to a particular fold or set of folds.
The table also shows that the numbers obtained using the two datasets agree, with cases of large discrepancy (e.g. the 'FAD/NAD(P)-binding domain' fold) being exceptional. The reason for these discrepancies is the outdated version of SCOP (dataset D2) compared to TOPOFIT-DB (dataset D1), together with ambiguity in assigning SCOP folds to TOPOFIT-DB centroids, which are not split into domains and can represent multi-domain proteins. Thus, the discrepancies in numbers are explained by purely technical rather than biological or methodological reasons, and the results obtained using the two datasets are consistent.
Protein structure symmetry does not explain non-sequential alignments
Although trivial non-sequential alignments (occurring as a result of symmetry or a shift in protein structure) were eliminated, non-sequential alignments in symmetrical structures have still been found. This points to the fact that a non-sequential alignment in a symmetrical structure is not always a trivial case. Consider as an example the structure alignment of transaldolase B from Escherichia coli (PDB-code 1onr:A) and class I aldolase from Drosophila melanogaster (PDB-code 1fba:A) shown in Figure 9. Both structures are TIM barrels and can be aligned sequentially, preserving the order of α/β-units (i.e. the first α/β-unit is aligned to the first, the second to the second, etc.), over 170 residues with an RMSD of 3.6 Å (CE alignment). Most alignment methods will agree that such an alignment is statistically significant. However, as has been discussed, the correct "biological" alignment must be a circular permutation, where the first α/β-unit of transaldolase is aligned to the third unit of aldolase, i.e. there must be a shift by 2 units in the alignment. The best structure alignment for this protein pair produced by TOPOFIT reflects such a circular permutation, with 142 aligned residues and an RMSD of 1.8 Å. Therefore, this example shows that a non-sequential alignment for symmetric protein structures is not necessarily a trivial consequence of symmetry and can in fact represent the true biological relation between proteins.
Another interesting case of alignment in proteins with symmetrical structures can be found for proteins of the 6- and 7-bladed β-propeller folds. Proteins in these folds are characterized by 6 or 7 blade-shaped β-sheets arranged toroidally around a central axis. Each blade typically has four antiparallel β-strands twisted so that the first and fourth strands are almost perpendicular to each other. The majority of non-sequential alignments for proteins of these folds are circular permutations. An important aspect of these alignments is that they cannot be explained by a simple symmetrical shift by a whole number of blades, because there is always a non-sequential region inside a blade consisting of 1, 2 or 3 β-strands (see schematic diagram in Figure 10a and 10b). Besides circular permutations, more complex cases of non-sequential alignments can be found when aligning β-propeller structures. The complexity of the alignment arises from a different topology of β-strands, referred to as the β-pinwheel, in some structures (see Figure 10c). Again, for these cases a symmetrical shift by a whole number of blades does not explain the non-sequential alignments. Thus, the unusually high (see Table 3) fraction of non-sequential alignments in β-propeller folds is not surprising. Overall, these examples show that one can indeed find true-positive non-sequential alignments in symmetrical structures.
To show that non-sequential cases are found not only in symmetrical structures we have made an additional test. Knowing that 48.9% of non-sequential alignments are found when the aligned structures belong to different folds (using dataset D2), we have excluded from the analysis all folds in which at least two proteins have a non-sequential alignment. Thus, all potentially symmetrical folds have been excluded, resulting in a new dataset (the reduced dataset) in which all non-sequential alignments occur only between proteins of different folds. Non-sequential alignments were found in 7.7% of cases in the reduced dataset, which is smaller than the 21.2% found for the whole dataset but is still very significant. In other words, at least one third of non-sequential alignments (7.7/21.2 ≈ 36%) are found in non-symmetrical structures.
The previously observed results can be briefly summarized: 1) Non-sequential alignments are found in many non-symmetrical folds; 2) Non-sequential alignments are spread more or less evenly across folds, i.e. there is no specific fold(s) preferable for non-sequential alignments; 3) Up to 50% of non-sequential alignments are found for proteins with different folds; 4) The proportion of non-sequential alignments for proteins with different folds is comparable with proportions for proteins with the same fold; 5) At least one third of non-sequential alignments are found in non-symmetrical structures. Thus, the conclusion is that non-sequential alignments do occur in any class and type of protein structures and a protein structure symmetry/shift does not explain non-sequential alignments. In other words, the occurrence of non-sequential alignments is a general feature of protein structure.
All possible complexities of fragment rearrangements have been observed
Non-sequential alignments can be so simple that only one fragment is non-sequential, or so complex that only one fragment can be put in sequential order in both sequences. In other words, we have observed both very simple and very complex rearrangements of structurally equivalent elements in proteins. In order to address rearrangement complexity we introduce the term "rank" of an alignment, which is the number of rearrangements of structurally equivalent parts of proteins needed to put them in sequential order in the sequences of both proteins. According to this definition, sequential alignments are represented as a single structural equivalent and thus have rank zero, circular permutations and cases similar to the one shown in Figure 1 have rank one, and more complex alignments have rank two or higher. Technically, we have calculated rank as the number of segment rearrangements rather than fragment rearrangements (see Methods). This was done to ensure that rank is not overestimated due to the presence of several fragments in one segment. Using this definition, it is easy to see that any alignment with n fragments can have a rank of at most n - 1, because at least one structural element is not rearranged relative to the others (we do not consider reverse alignments here).
Figure 11 shows a scatter plot of alignment rank vs number of fragments. As seen from Figure 11, for alignments consisting of up to 14 fragments almost any complexity, i.e. any possible rank value (with rare exceptions), has been observed. For alignments with a larger number of fragments this is not the case, but this can be explained by the limited statistics (see bar charts at the top and the left of the picture). Thus, we hypothesize that there is no restriction on how elements of protein structure can be permuted in a sequence and that any rearrangement of fragments can be found in nature. An illustrative example of an alignment with many rearrangements has already been described in Figure 7.
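One way to make the notion of rank concrete is sketched below. It is not the procedure from the Methods, which is not reproduced in this section. The sketch assumes that forward fragments are given as a permutation `order` (the position in the second protein of the k-th fragment of the first), that runs of consecutive fragments are merged into segments, and that rank equals the number of segments minus the length of the longest increasing subsequence, i.e. the minimum number of segments that must be moved to restore sequential order.

```python
from bisect import bisect_left
from typing import List, Sequence


def merge_into_segments(order: Sequence[int]) -> List[int]:
    """Collapse runs k, k+1, k+2, ... into one segment and relabel 0..m-1."""
    if not order:
        return []
    heads = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if cur != prev + 1:  # a new segment starts here
            heads.append(cur)
    relabel = {value: idx for idx, value in enumerate(sorted(heads))}
    return [relabel[h] for h in heads]


def lis_length(seq: Sequence[int]) -> int:
    """Length of the longest strictly increasing subsequence."""
    tails: List[int] = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


def alignment_rank(order: Sequence[int]) -> int:
    """Assumed rank: segments that must be moved to get sequential order."""
    segments = merge_into_segments(order)
    return len(segments) - lis_length(segments)
```

Under these assumptions a sequential alignment such as `[0, 1, 2, 3]` has rank 0, a circular permutation such as `[2, 3, 0, 1]` collapses to two segments and has rank 1, and a fully inverted order of n segments reaches the maximum of n - 1.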
Analysis of the redundant data set
It is interesting to understand whether there are any non-sequential cases among highly similar proteins, both in structure and in sequence, i.e. those that have been grouped into clusters in TOPOFIT-DB. Thus, alignments between the structures within each of the 8,865 clusters have been collected, for a total of 2,509,599 alignments. The analysis reveals that, with few exceptions, the vast majority of detected non-sequential cases are circular permutations. Statistically, 31,358 out of 2,509,599 alignments were non-sequential, of which 95.5% (29,938 cases) were circular permutations, 3.5% represented alignments of different conformations of the same protein, and the remaining 1% were non-sequential alignments in only 7 protein families: fructose-1,6-bisphosphatase (1fpk:A and 1d9q:B), arrestin (1cf1:A and 1ayr:B), annexin (1hm6:A and 1hvg), aspartate/ornithine carbamoyltransferase (2atc:B and 1rac:B), 3-isopropylmalate dehydrogenase (1iso and 1hqs:A), NADH peroxidase (1f3p:A and 1nhs), α-β tubulin (1jff:B and 1tub:B). Thus, we can state that for the vast majority of proteins with high sequence similarity, the only non-sequential alignments are circular permutations.
In the presented study a comprehensive large-scale analysis of non-sequential alignments between all PDB structures (as of July 2005) has been performed. We have found that up to 35.2% of all significant alignments are non-sequential. Different thresholds and alternative alignments have been considered to ensure robust detection of non-sequential cases. These variations in methodology revealed that non-sequential alignments are found in at least 17.4% of cases. Thus, the estimated proportion of non-sequential alignments ranges from 17.4% to 35.2%, which is a significant proportion of structural relations not detected by most of the current methods.
It was found that the majority (more than 50%) of the non-sequential alignments fit the formal definition of circular permutation. It is important to stress here how this number should be understood. Often, proteins aligned in a circular way are assumed to be evolutionarily related, and this assumption is often encoded into an alignment method to detect such cases. There is no such assumption (of evolutionary origin) in the methodology used in this study, and thus a large number of circular alignments alone does not necessarily mean an evolutionary relationship between the compared proteins. Likewise, the origin of more complex non-sequential alignments is not clear.
Besides circular permutations, non-sequential alignments with a large variety of alignment patterns have been found. All possible complexities of rearrangements and various sizes and numbers of non-sequential fragments have been observed. It has been found that non-sequential alignments are not limited to proteins of any particular fold and are present in more than two hundred different folds. Moreover, up to 50% of non-sequential alignments are found for proteins with different fold assignments. While many of the non-sequential alignments were found for proteins with symmetrical structures, it has been shown that protein structure symmetry does not explain non-sequential alignments. Therefore, several independent lines of compelling evidence have been provided, confirming that non-sequential alignments between proteins are diverse and widespread across the protein universe.
Many cases of reverse alignments in various folds have been found in this study. To the best of our knowledge, only one case of reverse alignment is well known, the α-helix bundle with several helices, where one or many of the helices can be aligned in the opposite direction. The α-helix bundles have been studied experimentally, and successful attempts at redesigning the four-helix bundle to have inverted helices have been reported [45, 46]. Such successful redesign of the α-helix bundle can, in theory, be extended to the other protein folds for which reverse alignments were observed in this study. Thus, the existence of reverse alignments for proteins of other folds can serve as the basis for new approaches to protein redesign in protein engineering.
The discovery of the existence of all theoretically possible complexities of fragment rearrangement in proteins is intriguing (see Results and Figure 11). The plot is not complete due to limited statistics, which we attribute to the lack of data for large proteins. We believe there is strong support for the statement that any possible combination of fragments can be found in any protein structure. Currently, one can introduce a hypothesis to test (with strong support from all the presented results), which can be formulated as follows: the three-dimensional shape of tertiary structure does not depend on the order of protein fragments in the polypeptide chain; the protein core just has to be organized in a complementary manner and internal fragments have to fit each other, while the external loops might reconnect the internal fragments in any reasonable way. The protein core here is the structural invariant, which was introduced earlier in our TOPOFIT method, while the external loops are the fragments outside of the structural invariant.
Such a hypothesis can be tested experimentally and would provide a strong empirical basis for protein redesign as a recombination of different fragments, with many practical applications in creating new proteins. The validation of the hypothesis will broaden our understanding of protein structure organization and folding, and can be directly applied in fragment-based methods for protein structure and function prediction. It is encouraging that the hypothesis is supported by experimental studies on circularly permuting protein structure [20–30] and on redesigning four-helix bundle proteins to have several different topologies of helices [45, 46]. Therefore, a similar reengineering by rearranging fragments may be applied to other protein folds.
The discovery of the widespread occurrence of non-sequential alignments among many different protein folds presents an interesting phenomenon. Based on this phenomenon, one may suggest that there is some unknown common rule that governs the relations between proteins detected by non-sequential alignments, a missing rule (or rules) in our understanding of protein structure organization. Finding such a rule can be a challenge for future research, but, apparently, the existence of non-sequential alignments is not a rare effect but rather a systematic feature of all proteins. More detailed studies of these alignments will bring new insight into our understanding of protein evolution, protein stability, and protein folding and functionality. As a first step toward understanding non-sequential alignments, a testable hypothesis has been suggested, stating that the three-dimensional shape of protein structure does not depend on the order of protein fragments in the polypeptide chain.
Selecting representative data sets
For this study, the structural relations between the representative proteins (centroids) from the TOPOFIT-DB database have been analyzed. The data set from TOPOFIT-DB contains all 33,315 proteins from PDB (as of July 12, 2005). All structures in the database are divided into clusters of high similarity, both in structure and in sequence, with a centroid assigned to represent each cluster. The 8,865 protein clusters in TOPOFIT-DB can be considered analogous to the structural families in CATH and SCOP. For each cluster, the centroid is the structure with the maximum sum of Z-scores to all other proteins in the cluster. Comparison of the centroids and of the proteins inside each cluster resulted in 39,276,862 structural alignments stored in the database. For this study, only centroid-centroid alignments from TOPOFIT-DB with Z-score > 7 have been used, leading to a total of 82,263 alignments.
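The centroid rule and the significance filter described above can be sketched as follows. This is a minimal illustration, not the TOPOFIT-DB implementation: the function names, the dictionary layout, and the structure labels "A", "B", "C" with their Z-scores are hypothetical, and tie-breaking between equal sums is an assumption (the first structure encountered wins).

```python
from collections import defaultdict


def choose_centroid(pair_z):
    """Return the cluster member with the largest sum of Z-scores to all others.

    pair_z maps an unordered pair of structure labels (a, b) to the Z-score
    of their alignment (hypothetical layout, chosen for this sketch).
    """
    totals = defaultdict(float)
    for (a, b), z in pair_z.items():
        totals[a] += z
        totals[b] += z
    # max() keeps the first label seen in case of a tie (an assumption).
    return max(totals, key=totals.get)


def significant(alignments, z_cutoff=7.0):
    """Keep only alignments whose Z-score exceeds the cutoff (Z-score > 7 in the text)."""
    return [a for a in alignments if a["z"] > z_cutoff]


# Made-up cluster of three structures: sums are A=35, B=50, C=45.
cluster = {("A", "B"): 20.0, ("A", "C"): 15.0, ("B", "C"): 30.0}
centroid = choose_centroid(cluster)

kept = significant([{"pair": ("A", "B"), "z": 7.0}, {"pair": ("A", "C"), "z": 9.3}])
```

Note that the cutoff is strict, so an alignment with a Z-score of exactly 7 is dropped.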
A second data set has been collected by comparing alignments between protein families as defined by SCOP (release 1.69). For each family, the first structure in the list of proteins assigned to the family has been used as a representative, resulting in 2,845 representatives. By comparing the representatives, 4,045,590 structural alignments have been produced and stored in the TOPOFIT-DB database. For this study, only alignments with Z-score > 7 have been used, leading to a total of 4,648 alignments.
Identifying sequential parts (segments) and noise filtering procedure
Since TOPOFIT alignments can be fragmented, we define an alignment fragment as a sequential part of an alignment without "long gaps", i.e. gaps longer than 2 residues. The cutoff has been chosen based on the analysis of the gap distribution in all alignments. We then define an alignment segment as a sequential (reverse or forward) part of a structural alignment (see Figure 1). An alignment segment differs from an alignment fragment in that the segment can have long gaps (longer than 2 residues) and, consequently, may consist of one or more fragments. Thus, a fragment is a particular case of a segment. In Figure 1 segments are highlighted in different colors. For simplicity, only the term "segment" is used in the following description of the procedure. During the procedure some aligned residue pairs were considered noise and removed (circled in the figure). Let us define an interfering segment z, for a pair of segments x and y, as a segment located between the two segments in either of the sequences (see the example in Figure 1). The input parameter of the algorithm is the value of F min, which controls the minimal size of a segment, i.e. all segments smaller than F min are eventually removed from the alignment or combined with other segments.
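The fragment definition can be illustrated with a short sketch. It assumes a representation not given in the text: an alignment is a list of (i, j) residue-index pairs sorted by position in the first protein, and a gap is the number of skipped residues between consecutive aligned pairs in either protein. Only forward fragments are handled here; reverse fragments would need the mirrored test. The function name and the example indices are hypothetical.

```python
def split_into_fragments(pairs, max_gap=2):
    """Split an alignment into forward fragments with no gap longer than max_gap.

    pairs: list of (i, j) aligned residue indices, sorted by i (assumed layout).
    A new fragment starts when either protein skips more than max_gap residues
    or when the order in the second protein is broken.
    """
    if not pairs:
        return []
    fragments = []
    current = [pairs[0]]
    for (i_prev, j_prev), (i, j) in zip(pairs, pairs[1:]):
        gap_i = i - i_prev - 1  # residues skipped in protein 1
        gap_j = j - j_prev - 1  # residues skipped in protein 2 (negative if order breaks)
        if 0 <= gap_i <= max_gap and 0 <= gap_j <= max_gap:
            current.append((i, j))
        else:
            fragments.append(current)
            current = [(i, j)]
    fragments.append(current)
    return fragments


# Made-up alignment: a 2-residue gap is tolerated, a 3-residue gap is not,
# and a jump back to the start of protein 2 opens a non-sequential fragment.
example = [(1, 10), (2, 11), (3, 12), (6, 15), (10, 19), (11, 20), (12, 1), (13, 2)]
fragments = split_into_fragments(example)
```

In this example the pair (6, 15) stays in the first fragment because both gaps are exactly 2 residues, while (10, 19) opens a new fragment after a 3-residue gap.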
Alignment segments have been combined in a pairwise manner as follows. At each step all pairs of segments have been evaluated by the following three values (the preferred direction is given in parentheses):
number of segments interfering with the pair (smaller is preferred);
number of aligned residues in the interfering segments (smaller is preferred);
cumulative number of residues in the tested pair of segments (larger is preferred).
The best pair is found by comparing those values, where each subsequent value is used only if the preceding values are equal. Segments in the best pair are combined only if the pair has no interfering segments. Otherwise, the interfering segment with the minimal number of aligned residues is removed from the structural alignment. So, at each step, the number of segments decreases by one. Steps are repeated until all segments are combined into one or the segment to be removed has a length greater than or equal to the value of F min.
The procedure considers forward and reverse segments simultaneously; however, only segments of the same type (both forward or both reverse) are combined. Special care is taken with segments of length one; they are evaluated in pairs with both forward and reverse segments. Here it is important to stress that the minimal fragment parameter F min is not a conventional threshold, because short fragments are not simply removed from the alignment: they are first tested for the possibility of being combined with longer fragments and are removed only upon failure.
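One step of the pair selection can be sketched as below, using lexicographic comparison of the three values. This is an illustration under stated assumptions, not the authors' code: the `Segment` class and function names are hypothetical, segments are assumed not to overlap (so start positions suffice to decide what lies between two segments), only forward segments are handled, and the extra requirement that a candidate pair appears in the same order in both proteins is an assumption that the text does not spell out.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    start1: int     # first aligned residue in protein 1
    start2: int     # first aligned residue in protein 2
    n_aligned: int  # number of aligned residue pairs


def colinear(x, y):
    # Assumption: only pairs in the same relative order in both proteins can merge.
    return (x.start1 < y.start1) == (x.start2 < y.start2)


def interfering(x, y, segments):
    """Segments located between x and y in either of the two sequences."""
    def between(value, a, b):
        return min(a, b) < value < max(a, b)
    return [z for z in segments
            if z is not x and z is not y
            and (between(z.start1, x.start1, y.start1)
                 or between(z.start2, x.start2, y.start2))]


def best_pair(segments):
    """Return (x, y, interfering segments) for the best pair, or None."""
    best = None
    for x, y in combinations(segments, 2):
        if not colinear(x, y):
            continue
        inter = interfering(x, y, segments)
        # Tuples compare element by element, so a later value matters
        # only when the earlier ones are equal.
        key = (len(inter), sum(z.n_aligned for z in inter),
               -(x.n_aligned + y.n_aligned))
        if best is None or key < best[0]:
            best = (key, x, y, inter)
    return None if best is None else best[1:]


# Made-up example: a single aligned pair ("noise") sits between two long segments.
A = Segment(start1=1, start2=1, n_aligned=20)
noise = Segment(start1=22, start2=50, n_aligned=1)
B = Segment(start1=25, start2=22, n_aligned=16)
result = best_pair([A, noise, B])
```

Here the best pair is A and B with the one-residue segment interfering. Following the procedure in the text with F min = 4, that segment would be removed because it is shorter than F min, after which A and B can be combined.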
Robustness of non-sequential alignment detection, signal/noise discrimination, optimal values of F min
The TOPOFIT method has no limitations on fragment size, and some fragments can be as small as a single pair of aligned residues, shown as single dots in the alignment. Such aligned pairs of residues can be signal or noise (see Figure 1). Therefore, while finding and analyzing alignments, care must be taken to discriminate between the two. Signal-to-noise discrimination has been achieved by applying the procedure of combining alignment fragments into continuous alignment segments (described above). The frequency distributions of residues in the segments for a range of F min values have been calculated in order to evaluate the discrimination of noise caused by small fragments (see Figure 12). The blue line shows the original distribution, when F min = 1. Distributions with a gradually increasing minimal fragment size have also been produced for values of F min equal to 2, 3, 4, 5, 6, 7, 8 and 9 residues.
The major change in the distribution occurs when F min changes from 2 to 3. Not only has the area under the distribution changed dramatically (i.e. the number of non-sequential cases is reduced), but the spike in the distribution at lower values has disappeared. Thus, it is evident that the noise is mostly represented by short fragments of 1 and 2 residues. The distributions for F min values from 3 to 6 do not differ much, while larger F min values lead to significant disruptions in the shape of the distributions in the region from 75 to 110. Consequently, non-sequential alignments mostly consist of aligned segments of 6 or more aligned residues. Therefore, the best signal-to-noise discrimination can be achieved when the value of the F min parameter is 3–6 residues. This is where the majority of the noise is filtered out while the signal (the quantity of non-sequential alignments) is not cut. In the overall analysis presented here, the value F min = 4 has been used, while a tightened criterion, F min = 6, has additionally been applied for cross-checking.
Applying the tightened criterion resulted in an 11% decrease (25,849 compared with 28,949) in the number of non-sequential cases detected. Thus, we concluded that at the selected values of the F min parameter, detection of non-sequential cases is robust.
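The quoted decrease can be checked with a few lines of arithmetic. The second calculation rests on an assumption that the text does not state explicitly: that the 28,949 non-sequential cases are counted among the 82,263 centroid-centroid alignments with Z-score > 7. Under that assumption the share matches the upper estimate of 35.2% reported in the paper.

```python
# Counts quoted in the text.
n_fmin4 = 28_949   # non-sequential cases with the value used in the overall analysis
n_fmin6 = 25_849   # non-sequential cases with the tightened criterion
n_significant = 82_263  # centroid-centroid alignments with Z-score > 7

# Relative decrease when tightening the criterion.
decrease_percent = 100 * (n_fmin4 - n_fmin6) / n_fmin4

# Assumption: both counts refer to the same set of 82,263 alignments.
share_percent = 100 * n_fmin4 / n_significant

print(f"decrease: {decrease_percent:.1f}%")
print(f"non-sequential share: {share_percent:.1f}%")
```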
The rank of an alignment is defined as the number of rearrangements of structurally equivalent parts of proteins needed to put them in sequential order in the sequences of both proteins. Technically, the rank was calculated as the number of segment permutations. To calculate the number of permutations in an alignment, the corresponding alignment segments have been ordered by sequence order in the first aligned protein and numbered incrementally starting from one. Then, the segments have been ordered by sequence order in the second aligned protein. If the considered alignment is non-sequential, this reordering permutes the assigned numbers. For example, the order of numbers for the alignment shown in Figure 1 will be (1,3,2,4). A simple bubble sort algorithm has been used to calculate the number of permutations needed to sort the numbers in ascending order. For the alignment shown in Figure 1 only one permutation is needed. For reverse alignments, a reverse order of amino acids for the second sequence has been considered while calculating permutations, and for mixed alignments, a reverse order of amino acids for the second sequence has been considered only if the cumulative N e of reverse segments is higher than the cumulative N e of forward segments.
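The numbering and counting steps can be sketched as follows. The representation of a segment as a pair of start positions and all function names are assumptions made for illustration. One caveat is worth stating openly: a plain bubble sort counts exchanges of adjacent elements, which can exceed n - 1 for n segments (for example, the order (3,2,1) needs three adjacent exchanges), whereas the Results state a maximum rank of n - 1. The minimal number of arbitrary exchanges, also shown below, never exceeds n - 1. Which convention the authors used for complex cases is not specified in the text, so both counts are given; they agree on the Figure 1 example.

```python
def segment_order(segments):
    """Number segments by position in protein 1, then list them by position in protein 2.

    segments: list of (start_in_protein1, start_in_protein2) tuples (assumed layout).
    """
    by_first = sorted(segments, key=lambda s: s[0])
    label = {seg: k for k, seg in enumerate(by_first, start=1)}
    by_second = sorted(segments, key=lambda s: s[1])
    return [label[seg] for seg in by_second]


def bubble_sort_swaps(order):
    """Count the adjacent exchanges a bubble sort performs to sort the numbers."""
    order = list(order)
    swaps = 0
    for end in range(len(order) - 1, 0, -1):
        for k in range(end):
            if order[k] > order[k + 1]:
                order[k], order[k + 1] = order[k + 1], order[k]
                swaps += 1
    return swaps


def min_exchanges(order):
    """Minimal number of arbitrary exchanges: length minus number of cycles."""
    seen = [False] * len(order)
    cycles = 0
    for start in range(len(order)):
        if not seen[start]:
            cycles += 1
            k = start
            while not seen[k]:
                seen[k] = True
                k = order[k] - 1
    return len(order) - cycles


# Made-up start positions reproducing the order (1,3,2,4) quoted for Figure 1.
figure1_like = [(1, 1), (25, 60), (45, 30), (65, 80)]
order = segment_order(figure1_like)
rank = bubble_sort_swaps(order)
```

For this example `order` is `[1, 3, 2, 4]` and both counting functions give a rank of one, in agreement with the text.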
The non-sequential alignments were visualized and analyzed in the integrated software package Friend with the integrated TOPOFIT method. The final views (shown in the figures) of protein structures were produced with Chimera. Data analysis has been performed with the aid of the ROOT software package. All data are publicly available in TOPOFIT-DB and can be accessed at our web site.
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453. 10.1016/0022-2836(70)90057-4
Einspahr H, Parks EH, Suguna K, Subramanian E, Suddath FL: The crystal structure of pea lectin at 3.0-A resolution. J Biol Chem 1986, 261(35):16518–16527.
Alexandrov NN, Fischer D: Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures. Proteins 1996, 25(3):354–365. 10.1002/(SICI)1097-0134(199607)25:3<354::AID-PROT7>3.3.CO;2-W
Essen LO, Perisic O, Lynch DE, Katan M, Williams RL: A ternary metal binding site in the C2 domain of phosphoinositide-specific phospholipase C-delta1. Biochemistry 1997, 36(10):2753–2762. 10.1021/bi962466t
Fuentes-Prior P, Noeske-Jungblut C, Donner P, Schleuning WD, Huber R, Bode W: Structure of the thrombin complex with triabin, a lipocalin-like exosite-binding inhibitor derived from a triatomine bug. Proc Natl Acad Sci USA 1997, 94(22):11845–11850. 10.1073/pnas.94.22.11845
Gong W, O'Gara M, Blumenthal RM, Cheng X: Structure of pvu II DNA-(cytosine N4) methyltransferase, an example of domain permutation and protein fold assignment. Nucleic Acids Res 1997, 25(14):2702–2715. 10.1093/nar/25.14.2702
Polekhina G, Board PG, Gali RR, Rossjohn J, Parker MW: Molecular basis of glutathione synthetase deficiency and a rare gene permutation event. Embo J 1999, 18(12):3204–3213. 10.1093/emboj/18.12.3204
Gooptu B, Hazes B, Chang WS, Dafforn TR, Carrell RW, Read RJ, Lomas DA: Inactive conformation of the serpin alpha(1)-antichymotrypsin indicates two-stage insertion of the reactive loop: implications for inhibitory function and conformational disease. Proc Natl Acad Sci USA 2000, 97(1):67–72. 10.1073/pnas.97.1.67
Grishin NV, Osterman AL, Brooks HB, Phillips MA, Goldsmith EJ: X-ray structure of ornithine decarboxylase from Trypanosoma brucei: the native structure and the structure in complex with alpha-difluoromethylornithine. Biochemistry 1999, 38(46):15174–15184. 10.1021/bi9915115
Grishin NV: Fold change in evolution of protein structures. J Struct Biol 2001, 134(2–3):167–185. 10.1006/jsbi.2001.4335
Tsai LC, Shyur LF, Lee SH, Lin SS, Yuan HS: Crystal structure of a natural circularly permuted jellyroll protein: 1,3–1,4-beta-D-glucanase from Fibrobacter succinogenes. J Mol Biol 2003, 330(3):607–620. 10.1016/S0022-2836(03)00630-2
Levdikov VM, Blagova EV, Brannigan JA, Cladiere L, Antson AA, Isupov MN, Seror SJ, Wilkinson AJ: The crystal structure of YloQ, a circularly permuted GTPase essential for Bacillus subtilis viability. J Mol Biol 2004, 340(4):767–782. 10.1016/j.jmb.2004.05.029
Shin DH, Lou Y, Jancarik J, Yokota H, Kim R, Kim SH: Crystal structure of YjeQ from Thermotoga maritima contains a circularly permuted GTPase domain. Proc Natl Acad Sci USA 2004, 101(36):13198–13203. 10.1073/pnas.0405202101
Yuan X, Bystroff C: Non-sequential structure-based alignments reveal topology-independent core packing arrangements in proteins. Bioinformatics 2005, 21(7):1010–1019. 10.1093/bioinformatics/bti128
Uliel S, Fliess A, Unger R: Naturally occurring circular permutations in proteins. Protein Eng 2001, 14(8):533–542. 10.1093/protein/14.8.533
Jung J, Lee B: Circularly permuted proteins in the protein structure database. Protein Sci 2001, 10(9):1881–1886.
Ponting CP, Russell RB: Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci 1995, 20(5):179–180. 10.1016/S0968-0004(00)89003-9
Jeltsch A: Circular permutations in the molecular evolution of DNA methyltransferases. J Mol Evol 1999, 49(1):161–164. 10.1007/PL00006529
Viguera AR, Blanco FJ, Serrano L: The order of secondary structure elements does not determine the structure of a protein but does affect its folding kinetics. J Mol Biol 1995, 247(4):670–681. 10.1006/jmbi.1994.0171
Ay J, Gotz F, Borriss R, Heinemann U: Structure and function of the Bacillus hybrid enzyme GluXyn-1: native-like jellyroll fold preserved after insertion of autonomous globular domain. Proc Natl Acad Sci USA 1998, 95(12):6613–6618. 10.1073/pnas.95.12.6613
Ay J, Hahn M, Decanniere K, Piotukh K, Borriss R, Heinemann U: Crystal structures and properties of de novo circularly permuted 1,3–1,4-beta-glucanases. Proteins 1998, 30(2):155–167. 10.1002/(SICI)1097-0134(19980201)30:2<155::AID-PROT5>3.0.CO;2-M
Keitel T, Simon O, Borriss R, Heinemann U: Molecular and active-site structure of a Bacillus 1,3–1,4-beta-glucanase. Proc Natl Acad Sci USA 1993, 90(11):5287–5291. 10.1073/pnas.90.11.5287
Pieper U, Hayakawa K, Li Z, Herzberg O: Circularly permuted beta-lactamase from Staphylococcus aureus PC1. Biochemistry 1997, 36(29):8767–8774. 10.1021/bi9705117
Wright G, Basak AK, Wieligmann K, Mayr EM, Slingsby C: Circular permutation of betaB2-crystallin changes the hierarchy of domain assembly. Protein Sci 1998, 7(6):1280–1285.
Tougard P, Bizebard T, Ritco-Vonsovici M, Minard P, Desmadril M: Structure of a circularly permuted phosphoglycerate kinase. Acta Crystallogr D Biol Crystallogr 2002, 58(Pt 12):2018–2023. 10.1107/S0907444902015548
Barrientos LG, Louis JM, Ratner DM, Seeberger PH, Gronenborn AM: Solution structure of a circular-permuted variant of the potent HIV-inactivating protein cyanovirin-N: structural basis for protein stability and oligosaccharide interaction. J Mol Biol 2003, 325(1):211–223. 10.1016/S0022-2836(02)01205-6
Chu V, Freitag S, Le Trong I, Stenkamp RE, Stayton PS: Thermodynamic and structural consequences of flexible loop deletion by circular permutation in the streptavidin-biotin system. Protein Sci 1998, 7(4):848–859.
Horne WS, Yadav MK, Stout CD, Ghadiri MR: Heterocyclic peptide backbone modifications in an alpha-helical coiled coil. J Am Chem Soc 2004, 126(47):15366–15367. 10.1021/ja0450408
Manjasetty BA, Hennecke J, Glockshuber R, Heinemann U: Structure of circularly permuted DsbA(Q100T99): preserved global fold and local structural adjustments. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 2):304–309. 10.1107/S0907444903028695
Fliess A, Motro B, Unger R: Swaps in protein sequences. Proteins 2002, 48(2):377–387. 10.1002/prot.10156
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31(1):365–370. 10.1093/nar/gkg095
Szustakowski JD, Weng Z: Protein structure alignment using a genetic algorithm. Proteins 2000, 38(4):428–440. 10.1002/(SICI)1097-0134(20000301)38:4<428::AID-PROT8>3.0.CO;2-N
Dror O, Benyamini H, Nussinov R, Wolfson H: MASS: multiple structural alignment by secondary structures. Bioinformatics 2003, 19(Suppl 1):i95–104. 10.1093/bioinformatics/btg1012
Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 12 Pt 1):2256–2268. 10.1107/S0907444904026460
Kolbeck B, May P, Schmidt-Goenner T, Steinke T, Knapp EW: Connectivity independent protein-structure alignment: a hierarchical approach. BMC Bioinformatics 2006, 7: 510. 10.1186/1471-2105-7-510
Shih ES, Hwang MJ: Alternative alignments from comparison of protein structures. Proteins 2004, 56(3):519–527. 10.1002/prot.20124
Shih ES, Gan RC, Hwang MJ: OPAAS: a web server for optimal, permuted, and other alternative alignments of protein structures. Nucleic Acids Res 2006, 34(Web Server):W95–98. 10.1093/nar/gkl264
Ilyin VA, Abyzov A, Leslin CM: Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci 2004, 13(7):1865–1874. 10.1110/ps.04672604
Leslin CM, Abyzov A, Ilyin VA: TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method. Nucleic Acids Res 2007, (35 Database):D317–321. [http://mozart.bio.neu.edu/topofit] 10.1093/nar/gkl809
Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540. 10.1006/jmbi.1995.0159
Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11(9):739–747. 10.1093/protein/11.9.739
Nagano N, Orengo CA, Thornton JM: One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol 2002, 321(5):741–765. 10.1016/S0022-2836(02)00649-6
Corbett KD, Shultzaberger RK, Berger JM: The C-terminal domain of DNA gyrase A adopts a DNA-bending beta-pinwheel fold. Proc Natl Acad Sci USA 2004, 101(19):7293–7298. 10.1073/pnas.0401595101
Kresse HP, Czubayko M, Nyakatura G, Vriend G, Sander C, Bloecker H: Four-helix bundle topology re-engineered: monomeric Rop protein variants with different loop arrangements. Protein Eng 2001, 14(11):897–901. 10.1093/protein/14.11.897
Micklatcher C, Chmielewski J: Helical peptide and protein design. Curr Opin Chem Biol 1999, 3(6):724–729. 10.1016/S1367-5931(99)00031-9
Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol 2006, 16(3):393–398. 10.1016/j.sbi.2006.04.007
Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, et al.: The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005, (33 Database):D247–251.
Abyzov A, Errami M, Leslin CM, Ilyin VA: Friend, an integrated analytical front-end application for bioinformatics. Bioinformatics 2005, 21(18):3677–3678. 10.1093/bioinformatics/bti602
Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE: UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem 2004, 25(13):1605–1612. 10.1002/jcc.20084
We are grateful to Chesley Leslin for his outstanding help in collecting data for TOPOFIT-DB and for the maintenance of the database and reading the manuscript. We also thank the members of our laboratory and the Biology department at Northeastern University for useful discussions and comments.
AA performed the data collection, calculations, and analysis and prepared the manuscript. VAI designed the project, analyzed the data, and prepared the manuscript. All authors have read and approved the final manuscript.