Comparative Analysis of Protein Structure Alignments
© Mayr et al; licensee BioMed Central Ltd. 2007
Received: 30 January 2007
Accepted: 26 July 2007
Published: 26 July 2007
Several methods are currently available for the comparison of protein structures. These methods have been analysed regarding their performance in the identification of structurally or evolutionarily related proteins, but so far there has been less focus on the objective comparison between the alignments produced by different methods.
We analysed and compared the structural alignments obtained by different methods using three sets of pairs of structurally related proteins. The first set corresponds to 355 pairs of remote homologous proteins according to the SCOP database (ASTRAL40 set). The second set was derived from the SISYPHUS database and includes 69 protein pairs (SISY set). The third set consists of 40 pairs that are challenging to align (RIPC set). Aligning the pairs in this set requires indels of considerable number and size, and some of the proteins are related by circular permutations, show extensive conformational variability or include repetitions. Two standard methods (CE and DALI) were applied to align the proteins in the ASTRAL40 set. The extent of structural similarity identified by both methods is highly correlated and the alignments from the two methods agree on average in more than half of the aligned positions. CE, DALI, as well as four additional methods (FATCAT, MATRAS, Cα-match and SHEBA) were then compared using the SISY and RIPC sets. The accuracy of the alignments was assessed by comparison to reference alignments. The alignments generated by the different methods on average match more than half of the reference alignments in the SISY set. The alignments obtained in the more challenging RIPC set tend to differ considerably and match reference alignments less successfully than the SISY set alignments.
The alignments produced by different methods tend to agree to a considerable extent, but the agreement is lower for the more challenging pairs. The results for the comparison to reference alignments are encouraging, but also indicate that there is still room for improvement.
Structural biology relies heavily on structure comparison methods. These methods are routinely applied in order to establish structural, evolutionary and functional relationships between proteins. In general these methods provide a measure of structural similarity between proteins, which is used to identify similar folds and evolutionarily related proteins. Most of the methods also generate an alignment that defines the residues that have a structurally equivalent role in the proteins compared. When the aligned proteins are assumed to share a common ancestor, a structure alignment supports the identification of evolutionarily equivalent residues. Since protein structure is more conserved in evolution than sequence, structure alignments of remote homologous proteins are considered more reliable than sequence based alignments to identify the equivalent residues. The structure alignment of functionally related proteins provides insights into the functional mechanisms, and has been successfully applied in the functional annotation of proteins whose structures have been determined.
When aligning structures the nature of the structural models should also be taken into account. Experimental structural models are usually determined by X-ray crystallography or by Nuclear Magnetic Resonance spectroscopy. The atomic coordinates obtained from these experiments are always associated with some degree of uncertainty resulting from experimental errors and from the intrinsic flexibility of the proteins or from atom vibrations. These uncertainties become problematic especially for some comparison methods that assume that the protein backbone is formed by regular secondary structure elements, and correct assignment of these elements might not be possible for models with poor resolution. Additional difficulties originate from the nature of the protein structural relationships. Similar structures might display considerable structural variability and are often related by several insertions and deletions (indels) of considerable size. Structural variation is noticeable in the comparison of alternative conformations of a single protein, and reflects the intrinsic protein flexibility.
Structural similarity between different proteins is the result of evolution from a common ancestor if the proteins to be compared are homologous, or it is the result of convergent or parallel evolution. The evolution of proteins involves mutations of single residues, insertions and deletions, gene duplication or fusion and exon duplication, deletion or shuffling. Such changes accumulate over time and result in structural differences between the two proteins. These changes preferably affect the surface regions of the proteins, except for the functional sites which tend to be conserved if the protein retains the same molecular function. The hydrophobic core, essential to maintain structural integrity, in general remains relatively conserved [6, 7]. Homologous proteins might also be related by circular permutation or shuffling of the protein sequence, which results in a non-sequential sequence or structure alignment between the two structures. Circular permutations are the result of gene duplication, exon shuffling or post-translational modifications.
Repetition is a common feature of protein structures, and is observed at different structural levels. These repetitions occur at the level of the secondary structure elements, at the level of supersecondary elements, at the subdomain level or at the domain level. Recurring substructures imply that protein structures can be aligned in alternative ways with comparable structural similarity scores. The existence of alternative alignments has been investigated before [9, 10].
Currently there are a considerable number of structural comparison tools available to the structural biologist [1, 11]. In general, these methods compare the geometry of the Cα backbone atoms, but they are based on different algorithms and have been designed for various applications. CE and DALI are two popular methods for searching similarities in a structural database and for pairwise comparison of two structures. Both methods search for compatible pairs of fragments with similar intramolecular Cα distances. Then they use different strategies to combine these fragments into a final alignment. Methods like FATCAT are able to align subdomains in different relative orientations, resulting from protein flexibility or from evolutionary divergence. Another strategy is to consider not only the backbone geometry but also the physicochemical environment of each residue in order to align the two structures [15–17]. This strategy is followed in the SHEBA implementation. Some tools match secondary structure elements to obtain in an efficient way a first alignment that is later refined. MATRAS in particular matches secondary structure elements in the first stage of alignment. Environmental properties and Cα distances are then applied to obtain the final solution. MATRAS applies a Markov transition model of evolution to derive different types of scoring functions. Some methods, like Cα-match, do not take the protein sequence order into account and allow for non-sequential alignments. Cα-match in particular is based on geometric hashing and ignores connectivity between aligned residues, which is desirable for the comparison of folds and architectures and for the comparison of proteins related by circular permutation. Other methods are able to align multiple structures [19–22].
Finally, some tools perform very fast comparisons between a given query protein and a structural database, and provide structural similarity scores for each comparison but no alignment [23, 24].
So far structure comparison methods have been primarily evaluated in terms of their ability to identify proteins with similar folds or to identify homologous proteins [11, 25, 26]. They have also been assessed relative to the extent of structural similarity that is identified, where better performance corresponds to longer alignments and to better rigid body superpositions, or to a better score according to other geometric measures [26–29]. Several methods have been the focus of these analyses, in particular SSAP, STRUCTAL, DALI, LSQMAN, CE, SSM, ASH and TM-align. Less attention, however, has been given to the objective analysis of the extent of agreement between alignments produced by different pairwise structure comparison methods. There is also a need to assess the accuracy of these structure based alignments regarding the correct identification of equivalent residues in terms of structure, evolution or function.
We analysed and compared pairwise structure alignments produced by six methods based on different algorithms: CE, DALI, FATCAT, MATRAS, SHEBA and Cα-match. First, CE and DALI were applied to a representative set of remote homologous proteins comprising 355 structure pairs derived from the ASTRAL database (ASTRAL40 set). Then we applied CE, DALI, FATCAT, MATRAS, SHEBA and Cα-match to 69 related protein pairs obtained from the SISYPHUS database (SISY set). Finally, these six methods were applied to a third set comprising 40 pairs that are challenging to align. These pairs include repetitions, indels, permutation and conformational variability (RIPC set). The methods were compared in terms of the extent of structural similarity detected according to the resulting alignments and in terms of alignment consistency. The methods were also compared relative to the extent of agreement to reference alignments. Finally, to illustrate the different types of structure comparison challenges, the results of selected pairs were analysed in more detail.
In this section we present the results for the comparison of CE and DALI structural alignments using the ASTRAL40 set. Then we provide the alignment comparison results obtained for the SISY and RIPC sets using six different methods. The alignments from different methods were compared regarding identification of structure similarity and alignment consistency (all sets) and agreement with reference alignments (for SISY and RIPC sets). Finally we describe in more detail the different alignments obtained for seven pairs of proteins from the RIPC set which illustrate the different types of challenges currently faced by structure alignment methods.
The standard structure comparison methods CE and DALI were applied to each pair of structures in the ASTRAL40 set. Proteins of the ASTRAL40 collection are remote homologous in the sense that they have less than 40% sequence identity and belong to the same SCOP superfamily but different families [see Additional file 1]. The structure based alignments obtained by CE and DALI were compared with regard to the identification of structural similarity and the consistency or agreement of the residues aligned.
Two standard measures of protein structure similarity can be derived from structure-based alignments: the alignment length, expressed as the number of equivalent residues (EQR), and the root-mean-square distance (RMSD) of the superimposed structures. The number of equivalent residues provides a measure of how large the region of structural similarity is, and the RMSD provides a measure of the degree of structure similarity in the aligned region. The RMSD depends on the number of equivalent residues, therefore the RMSD values associated with alignments of different lengths cannot be compared directly. Different normalised measures have been proposed. In particular the RMSD100 corresponds to the RMSD value expected if the two protein structures were 100 residues long. A simpler alternative is to divide the RMSD by EQR: RMSDN = RMSD/EQR.
In the ASTRAL40 set, RMSD100 values are highly correlated to the RMSDN (Pearson correlation 0.96). RMSD100 was selected for the analysis of the results. The differences in lengths and RMSD100 values for the alignments produced by CE and DALI for each pair are small in general. The CE alignments tend to be longer (median difference 3.0), while DALI alignments tend to have better RMSD100 values (median difference 0.1). Although the differences are small, the distributions of EQR and RMSD100 of the alignments obtained with CE and DALI are significantly different in the ASTRAL40 set according to the Wilcoxon signed-rank test with paired observations. The p-values are 2.0·10⁻⁸ for EQR and 3.0·10⁻⁵ for RMSD100.
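The two normalisations mentioned above can be sketched in a few lines of Python. The RMSDN formula is the one given in the text. The RMSD100 expression used here, RMSD / (1 + ln sqrt(EQR/100)), is an assumption based on the commonly cited definition of this measure and is not stated in this article; the function names are illustrative only.

```python
import math


def rmsd_n(rmsd: float, eqr: int) -> float:
    """RMSDN = RMSD / EQR, as defined in the text."""
    if eqr <= 0:
        raise ValueError("EQR must be a positive number of residues")
    return rmsd / eqr


def rmsd_100(rmsd: float, eqr: int) -> float:
    """RMSD rescaled to an alignment length of 100 residues.

    ASSUMPTION: RMSD100 = RMSD / (1 + ln(sqrt(EQR / 100))).
    This form is not given in the article itself.
    """
    if eqr <= 0:
        raise ValueError("EQR must be a positive number of residues")
    denominator = 1.0 + math.log(math.sqrt(eqr / 100.0))
    if denominator <= 0.0:
        # The assumed formula is only meaningful for EQR above roughly 14.
        raise ValueError("alignment too short for the assumed RMSD100 formula")
    return rmsd / denominator


# An alignment of exactly 100 residues is left unchanged by the rescaling.
same = rmsd_100(2.5, 100)
# Longer alignments are rescaled to smaller values, shorter ones to larger values.
longer = rmsd_100(3.0, 200)
shorter = rmsd_100(3.0, 50)
```

Under the assumed formula, the same raw RMSD counts as a better result when it is reached over a longer alignment, which is the behaviour the normalisation is meant to provide.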
To summarise, the lengths of alignments produced by CE and DALI are highly correlated, but the RMSD100 values are less correlated. The differences between alignment lengths and the RMSD100 are small but significant, where DALI tends to generate shorter alignments but with better RMSD100 than CE.
The A0 distribution has mean of 0.59 and median 0.68. The more tolerant measure A4 has higher mean 0.80 and median 0.90, indicating that CE and DALI tend to align proteins in the same region. The spread is much larger for the A0 distribution. There is a maximum at [0.00, 0.05], with 14% of alignments having an A0 ≤ 0.05, and there is another peak around 0.8. The A4 shows a pronounced bimodal distribution with a peak at [0.90, 0.95], and a smaller peak at [0.00, 0.05]. For 8% of proteins CE and DALI alignments are completely different, with A4 ≤ 0.05.
The SISY set is based on the SISYPHUS database, which contains structural alignments for proteins with non-trivial relationships. Most pairs of the SISY set are categorised in SISYPHUS as homologous (52 out of 69) while the remaining are structurally related through a common fold or a fragment definition [see Additional file 2]. Alignments were calculated by CE, DALI, FATCAT, MATRAS, Cα-match and SHEBA for each pair in the SISY set. These alignments were compared with regard to the extent of structural similarity detected and the consistency between alignments. The alignments obtained by the six different methods were also compared to reference alignments obtained from the SISYPHUS database.
We compared EQR of the alignments obtained with the six methods. There is a considerable correlation between all methods regarding the EQR [see Additional file 3, Figure S1]. In particular the correlation is high between CE and DALI, as observed previously with the ASTRAL40 set. MATRAS tends to show lower correlation with FATCAT and SHEBA. The correlation regarding RMSD100 is much lower. For example, between CE and DALI the correlation is 0.34 (Pearson) and 0.76 (Spearman).
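The gap between the Pearson and Spearman values reported above is what one would expect when a few pairs have very large RMSD100 values in one method only. The following standard-library sketch illustrates this with made-up numbers (the data are hypothetical and are not taken from the study); the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical RMSD100 values for five pairs; the last pair is an outlier in method B.
method_a = [1.0, 1.2, 1.5, 1.9, 2.3]
method_b = [1.1, 1.3, 1.4, 2.0, 9.0]
r_pearson = pearson(method_a, method_b)
r_spearman = spearman(method_a, method_b)
```

Because the two hypothetical series are ordered identically, the rank-based coefficient is 1, while the single outlier pulls the Pearson coefficient below it.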
The distributions of the differences in the length of the alignments generated by two methods for each pair of structures are given in Figure S2 in Additional file 3. In general SHEBA and FATCAT generate longer alignments than the other methods, while Cα-match generates the shortest alignments. Similar analysis of RMSD100 differences indicates that Cα-match has the smallest RMSD100, while SHEBA and to a lesser extent MATRAS alignments tend to have larger RMSD100.
In order to group the methods according to the alignment consistency we used the mean values of A0 and A4 as well as the median values to compute several dissimilarity measures. Two dissimilarity measures were computed of the type d = 1 - M, where M is the mean of A0 or A4. Hierarchical clustering was applied using the four alternative dissimilarity measures. The silhouette width value is a measure of cluster quality and was applied to select the best number of clusters. The best average silhouette width values were obtained with two clusters using either of the two dissimilarity measures. One of the clusters included only the Cα-match method and the remaining cluster included the other five methods. The silhouette width values were usually below 0.5. This indicates that the cluster quality is low, and that the methods are uniformly distinct regarding alignment consistency.
The alignment agreement between all six methods is in general much lower than between any pair of methods. When A0 is computed based on the number of aligned residues in common by all six methods, then the alignment consistency is 0.0 for 42% of the pairs, and 64% have A0 ≤ 0.20.
To summarise, the alignment agreement between two methods shows a large spread, and the observed range of medians is 0.3 - 0.8. If alignment shifts of up to four residues are tolerated, then the range of medians increases to 0.5 - 0.9. Cα-match alignments tend to be less consistent with the alignments from other methods. The alignment agreement over all six methods is much lower, with 42% of the pairs sharing no aligned residues over all methods.
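As an illustration of how a dissimilarity of the type d = 1 - M and the average silhouette width fit together, the sketch below scores two candidate partitions of four placeholder methods. The method labels and mean consistency values are hypothetical, the clustering step itself is not reproduced, and singleton clusters are given a silhouette width of 0 by the usual convention.

```python
def dissimilarity(mean_consistency):
    """Convert mean consistency values M into dissimilarities d = 1 - M."""
    return {pair: 1.0 - m for pair, m in mean_consistency.items()}


def distance(d, a, b):
    """Look up the dissimilarity between two methods (0 for identical methods)."""
    return 0.0 if a == b else d[frozenset((a, b))]


def average_silhouette(d, clusters):
    """Average silhouette width of a partition given pairwise dissimilarities."""
    widths = []
    for index, cluster in enumerate(clusters):
        others = [c for k, c in enumerate(clusters) if k != index]
        for item in cluster:
            if len(cluster) == 1 or not others:
                widths.append(0.0)  # convention for singleton clusters
                continue
            # a: mean dissimilarity to the other members of the same cluster
            a = sum(distance(d, item, o) for o in cluster if o != item) / (len(cluster) - 1)
            # b: smallest mean dissimilarity to any other cluster
            b = min(sum(distance(d, item, o) for o in other) / len(other) for other in others)
            widest = max(a, b)
            widths.append((b - a) / widest if widest > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical mean A0 values for four placeholder methods A, B, C and D.
mean_a0 = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "D")): 0.4,
    frozenset(("B", "D")): 0.4,
    frozenset(("C", "D")): 0.4,
}
d = dissimilarity(mean_a0)
outlier_alone = average_silhouette(d, [["A", "B", "C"], ["D"]])
mixed = average_silhouette(d, [["A", "B"], ["C", "D"]])
```

In this toy case the partition that isolates the least consistent method scores higher than the mixed one, which mirrors the kind of two-cluster solution described in the text.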
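A minimal sketch of the all-methods comparison, assuming that each alignment is represented as a list of (residue in protein 1, residue in protein 2) index pairs and that the shared residue pairs are normalised by the longest alignment. Both the representation and the normalisation are assumptions made for illustration.

```python
def common_pairs(alignments):
    """Residue pairs that are aligned identically in every alignment."""
    sets = [set(alignment) for alignment in alignments]
    return set.intersection(*sets) if sets else set()


def a0_all(alignments):
    """Fraction of identically aligned pairs shared by all alignments.

    ASSUMPTION: normalised by the length of the longest alignment.
    """
    longest = max((len(alignment) for alignment in alignments), default=0)
    if longest == 0:
        return 0.0
    return len(common_pairs(alignments)) / longest


# Three hypothetical alignments of the same protein pair.
aln_1 = [(1, 1), (2, 2), (3, 3), (4, 4)]
aln_2 = [(1, 1), (2, 2), (3, 4)]
aln_3 = [(1, 1), (2, 3), (3, 3), (4, 4), (5, 5)]
shared = common_pairs([aln_1, aln_2, aln_3])
score = a0_all([aln_1, aln_2, aln_3])
```

Every additional method can only shrink the intersection, which is why agreement over all six methods is necessarily no higher than the agreement between any two of them.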
Table: Wilcoxon test for alignment accuracy in SISY set.
Table 1: List of pairs in RIPC set.
The results for the comparison of the alignment EQR and RMSD100 obtained in the RIPC set are similar to the results obtained in the SISY set. In particular there is considerable correlation of EQR values between alignments from different methods. The correlation for RMSD100 is also much lower.
So far we have investigated the consistency between alignments from different methods. There are two possible reasons for the differences between alignments. First, only one of the two alignments is optimal in the sense that it identifies the regions with most extensive structural similarity, or it identifies the evolutionary equivalent residues. Second, different alignments are equally optimal which is possible if there is not one but several alternative solutions for aligning two proteins. Such different alignments result usually from the existence of repetitions in the structures compared. They correspond to the same degree of structural similarity or to alternative ways to define evolutionary equivalent residues.
Some of the structure comparison methods produce alternative solutions, but in the previous analysis we only considered one single alignment from each method (the best scoring alignment). This might result in low consistency between alignments for pairs of structures that have alternative alignments. To investigate the role of alternative optimal alignments, one can consider all the alternative alignments from the different methods. Another, simpler approach is to remove from the sets the pairs for which a method gives alternative alignments, and then investigate whether the consistency improves for the remaining pairs with unique alignments. We chose the second approach, and removed from the ASTRAL40, SISY and RIPC sets all pairs with alternative solutions according to DALI. In total 124 pairs were excluded from the ASTRAL40 set (231 pairs remaining), 21 pairs were excluded from the SISY set (48 remaining), and 14 pairs were removed from the RIPC set (26 remaining). The consistency scores A0 and A4 were recomputed for these subsets. The new results show an improvement (better agreement) in some cases, but in general they are still similar to the ones obtained with the original sets. This indicates that the methods actually produce different solutions for some pairs, independent of the existence of alternative alignments.
In this section we investigate seven pairs of SCOP domains selected from the RIPC set. These examples illustrate how repetitions, indels, circular permutations and local conformational changes affect structural alignment results. For each example alignment path plots are provided for the visualisation of the alignments from the different methods.
P-loop containing NTP hydrolases are difficult cases for alignment because they may vary in the number of β-strands in the central sheet. Aligning these proteins therefore requires several indels. Figure 8C shows extensive differences in the alignments by the different methods of the two NTP hydrolases. The N-terminal P-loop, the region associated with ADP binding, is in general correctly aligned by the different methods.
The homology between NK-lysin and swaposin with a circular permutation was revealed by sequence prior to knowledge of the crystal structures. The common fold consists of five helices forming a folded leaf. A previous curated alignment of the two proteins was employed as reference alignment. Figure 10B shows that the methods align the helices in sequence order, which is incorrect regarding the evolutionarily equivalent residues, and results in poor RMSD100 values (around 4). The exception is the alignment from Cα-match, which aligns some residues in the permuted regions correctly, although most of the equivalences are sequentially unconnected. No aligned residues are shared by all six methods.
The UDP-glucose-6-dehydrogenase middle domain d1dlia1 and the GDP-mannose-6-dehydrogenase middle domain d1mv8a1 are structurally related and of similar size. The Catalytic Site Atlas identifies four equivalent catalytic residues that are used to define the reference alignment. The two proteins consist of two common substructures that are conserved but in considerably different relative orientations. Most methods align only the N-terminal region, but CE matches only the C-terminal fragment. We observe a high alignment consistency in the first substructure located in the N-terminal region (see Figure 10C). FATCAT and SHEBA succeed in aligning the two substructures and the catalytic amino acids correctly.
We have presented a comparative analysis of pairwise structural alignments using three different datasets. The aim of this work is not to rank or benchmark the different methods, but instead to reveal the differences in the results and the challenges these methods face. The results indicate that the alignments of homologous proteins generated by two standard methods (DALI and CE) tend to be similar. Nevertheless for some of these homologous pairs the alignments are still completely different. The alignment agreement is lower in the more challenging datasets (SISY and RIPC sets), in particular if alignments from other methods (FATCAT, MATRAS, Cα-match and SHEBA) are also compared. For these two datasets, reference alignments were compiled based on curated alignments and on the identification of equivalent functional residues. We find that the different methods tend to match the reference alignments to some extent, but there is still large room for improvement, especially for the more challenging protein pairs.
The analysis of the results obtained for seven protein pairs that are challenging to align illustrated the strengths and limitations of the different methods. These examples revealed how repetition and extensive indels result in low alignment consistency. They also revealed how proteins related by circular permutations are still difficult to align correctly by most methods, and that some methods can successfully align proteins with considerable conformational variability.
These results raise several issues of relevance for the users of structure alignment methods. In particular the results indicate that different alignments can be obtained when comparing the structures of remote homologous proteins with different methods. In addition, the resulting alignments do not always match equivalent functional residues or curated alignments. These findings should also encourage developers to further improve their methods. In particular they should focus on testing and improving the results for challenging cases, as provided in the RIPC set.
The current study focused on the analysis of pairwise structure alignments. It would be of interest to perform in the future a similar comparative analysis of multiple structure alignments. In this respect one should take into account the procedures that have been successfully established to test multiple sequence alignment tools [49–51].
SCOP domains with less than 40% sequence identity were derived from the ASTRAL compendium. From every superfamily, two different families were chosen at random, and one representative was randomly selected from each of them. Multichain domains were excluded. The ASTRAL40 set contains 355 structure pairs, and is available in the Additional file 1.
From each SISYPHUS multiple structure alignment the pair of proteins with the lowest identity was chosen. Pairs with more than 40% sequence identity or including structures composed of multiple chains were excluded. The full protein chains were used to generate the alignments. Reference alignments were obtained from the SISYPHUS database as well. The SISY set comprises 69 structure pairs. 52 pairs are grouped in the SISYPHUS homologous category. The dataset is available in the Additional file 2.
The RIPC set was collected by consulting the SCOP classification of proteins for remote homologous structure pairs and the Molecular Movements Database for proteins with alternate conformations. We ensured that all-α, all-β and α/β-containing domains are represented. The resulting set comprises 40 protein pairs. Each pair is associated with at least one of the structure comparison challenges: Repetitions, Indels, circular Permutation and extensive Conformational differences. The dataset is available in Table 1 and in the Additional file 5.
We applied CE and DALI to align the protein pairs in the ASTRAL40 set. CE, DALI, FATCAT, MATRAS, Cα-match and SHEBA were applied to align the protein pairs in the SISY and RIPC sets. The standalone implementations of CE, DALI and SHEBA were used, whereas FATCAT, MATRAS and Cα-match results were obtained by accessing the corresponding online services [55–57]. Python scripts were implemented to parse the output files.
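The sampling step described above could look roughly like the following sketch. The nested dictionary layout (superfamily, then family, then a list of domain identifiers) and all identifiers in the toy data are hypothetical; the original scripts and their input format are not described in the article.

```python
import random


def sample_pairs(domains, rng):
    """Pick one domain from each of two randomly chosen families per superfamily.

    `domains` maps superfamily -> family -> list of domain identifiers
    (a hypothetical layout chosen for this illustration).
    """
    pairs = []
    for superfamily in sorted(domains):
        families = domains[superfamily]
        if len(families) < 2:
            continue  # a pair needs two different families
        family_a, family_b = rng.sample(sorted(families), 2)
        pairs.append((rng.choice(families[family_a]), rng.choice(families[family_b])))
    return pairs


# Toy input with made-up identifiers.
toy = {
    "superfamily_1": {
        "family_a": ["dom_a1", "dom_a2"],
        "family_b": ["dom_b1"],
        "family_c": ["dom_c1"],
    },
    "superfamily_2": {"family_d": ["dom_d1"]},
}
pairs = sample_pairs(toy, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the selection reproducible, which helps when the resulting set has to be published alongside the analysis.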
The consistency between two alignments of the same pair of structures is expressed as A_s = I_s / L_max, where I_s is the number of aligned residues that are consistent in the two alignments within a tolerance shift of s positions. Therefore I0 is the number of identically aligned residues in the two alignments, I1 is I0 plus the number of aligned residues that are shifted by one position, I2 is the number of aligned residues within a shift of two, and so on. L_max is the length of the longer structure alignment: L_max = max(L1, L2).
In the current work A0 is used to measure the extent of identity between alignments, and A4 is used to measure the extent of similarity between alignments. A value of s = 4 tolerates shifts of four aligned positions, corresponding to consecutive turns in an α-helix. A_s values range between 0, corresponding to no similarity, and 1, where all aligned residues in the two alignments are consistent within shift s.
The agreement to reference alignments was computed as the percentage of residues aligned identically to the reference alignment (I_s) relative to the length of the reference alignment (L_ref): 100·I_s/L_ref.
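The consistency measure can be sketched as follows, assuming A_s = I_s / L_max as implied by the definitions of I_s and L_max. Alignments are represented as lists of (residue in protein 1, residue in protein 2) integer index pairs, and a shift is measured as the difference between the partners assigned to the same residue of protein 1. Both choices are assumptions made for this illustration.

```python
def consistent_count(aln_1, aln_2, s=0):
    """I_s: residues of protein 1 whose partners in the two alignments differ by at most s."""
    partner_2 = dict(aln_2)
    return sum(
        1
        for i, j in aln_1
        if i in partner_2 and abs(partner_2[i] - j) <= s
    )


def consistency(aln_1, aln_2, s=0):
    """A_s = I_s / L_max, with L_max the length of the longer alignment."""
    l_max = max(len(aln_1), len(aln_2))
    if l_max == 0:
        return 0.0
    return consistent_count(aln_1, aln_2, s) / l_max


# Two hypothetical alignments of the same protein pair.
aln_1 = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
aln_2 = [(1, 1), (2, 2), (3, 5), (4, 6), (5, 9)]
a0 = consistency(aln_1, aln_2, s=0)  # only the first two pairs are identical
a4 = consistency(aln_1, aln_2, s=4)  # all shifts are at most four positions
```

In this toy case A0 is low while A4 is 1, the situation described in the text where two methods align the same region but disagree on the exact register.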
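The percentage agreement with a reference alignment, 100·I_s/L_ref, can be computed in the same spirit. The sketch assumes alignments are lists of (residue in protein 1, residue in protein 2) integer index pairs; the residue numbers below are hypothetical.

```python
def reference_agreement(alignment, reference, s=0):
    """Percentage of reference pairs reproduced within a shift of s: 100 * I_s / L_ref."""
    if not reference:
        raise ValueError("the reference alignment must not be empty")
    ref_partner = dict(reference)
    hits = sum(
        1
        for i, j in alignment
        if i in ref_partner and abs(ref_partner[i] - j) <= s
    )
    return 100.0 * hits / len(reference)


# A short hypothetical reference, e.g. a handful of functionally equivalent residues.
reference = [(10, 12), (11, 13), (12, 14), (13, 15)]
alignment = [(10, 12), (11, 13), (12, 15), (20, 30)]
exact = reference_agreement(alignment, reference, s=0)
tolerant = reference_agreement(alignment, reference, s=1)
```

When the reference consists of only a few functional residues, each missed residue changes the percentage by a large step, so such scores are best read together with the reference length.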
Reference alignments were derived for 23 pairs from the RIPC set. Two of these are based on curated alignments of homologous proteins. Three additional reference alignments result from mapping the residue numbers in the PDB structures that correspond to alternative conformations of the same protein. The remaining 18 reference alignments are the result of a search for functionally equivalent residues. Reference alignments are available in the Additional file 7.
Several strategies were applied in the search for functionally equivalent residues. If the two proteins bind the same or a similar ligand, then SiteEngine was applied to confirm that these binding sites share similar physicochemical environments and similar structures and to obtain the equivalent residues. Equivalent catalytic residues or metal binding sites were obtained from the Catalytic Site Atlas [59, 60], from the PDBSum or from the literature. SPASM was then applied to verify that the geometry is conserved in these sites.
We would like to thank Andreas Prlic and Antonina Andreeva for providing SISYPHUS data. This work was supported by the Austrian Science Fund (FWF) Grant P15909-N04. This research was performed in the context of the BioSapiens Network of Excellence, which is funded by the European Commission, contract number LSHG-CT-2003-503265.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.