Structural footprinting in protein structure comparison: the impact of structural fragments
© Zotenko et al; licensee BioMed Central Ltd. 2007
Received: 26 January 2007
Accepted: 09 August 2007
Published: 09 August 2007
One approach for speeding-up protein structure comparison is the projection approach, where a protein structure is mapped to a high-dimensional vector and structural similarity is approximated by distance between the corresponding vectors. Structural footprinting methods are projection methods that employ the same general technique to produce the mapping: first select a representative set of structural fragments as models and then map a protein structure to a vector in which each dimension corresponds to a particular model and "counts" the number of times the model appears in the structure. The main difference between any two structural footprinting methods is in the set of models they use; in fact a large number of methods can be generated by varying the type of structural fragments used and the amount of detail in their representation. How do these choices affect the ability of the method to detect various types of structural similarity?
To answer this question we benchmarked three structural footprinting methods that vary significantly in their selection of models against the CATH database. In the first set of experiments we compared the methods' ability to detect structural similarity characteristic of evolutionarily related structures, i.e., structures within the same CATH superfamily. In the second set of experiments we tested the methods' agreement with the boundaries imposed by classification groups at the Class, Architecture, and Fold levels of the CATH hierarchy.
In both experiments we found that the method which uses secondary structure information has the best performance on average, but no single method consistently performs best across all groups at a given classification level. We also found that combining the methods' outputs significantly improves the performance. Moreover, our new techniques to measure and visualize the methods' agreement with the CATH hierarchy, including the thresholded affinity graph, are useful beyond this work. In particular, they can be used to expose a similar composition of different classification groups in terms of structural fragments used by the method and thus provide an alternative demonstration of the continuous nature of the protein structure universe.
Protein structure comparison is an important tool that helps biologists understand various aspects of protein function and evolution. Unfortunately, highly accurate protein structure comparison methods are computationally expensive and therefore are not suitable for large-scale analysis, such as when all pairwise comparisons have to be performed for a large number of protein structures. One approach for speeding up protein structure comparison is the projection approach, where a protein structure is mapped to a vector in a high-dimensional space. Once the mapping is done, protein structure comparison is reduced to a distance computation between the corresponding vectors and is therefore very efficient. For example, it was shown that once vector representations are computed it takes on average 500 seconds for a projection method to perform all pairwise comparisons among 5,024 domains. (Compare this to the estimated four months it would take DALI, a highly accurate protein structure comparison method, to perform the same number of pairwise comparisons.) However, the advantage of the projection approach is also one of its main limitations; namely, in the process of mapping, some structural information is lost. Furthermore, there is no agreement on what constitutes a good projection technique, and currently known projection methods [1, 3–7] utilize very different approaches to the mapping construction, both in terms of which structural information is included and how this information is integrated to produce a vector representation.
Recently, Zotenko et al. performed a comprehensive comparison of projection methods in the context of two typical applications for such methods, high-throughput protein structure comparison and classification. The authors found that the SSEF method performed the best in their tests, followed closely by the LFF method. Both methods use the same general approach, which we call structural footprinting, to construct the mapping: (i) select a representative set of structural fragments as models, (ii) map a structure to a structural footprint, a vector in which each dimension corresponds to a particular model and "counts" the number of times the model appears in the structure. (Since structural fragments are not discrete objects, a count of one is distributed among one or several most similar models weighted by the precision with which the model is reproduced in the structure.) While both methods use the same strategy to integrate the structural information, they differ substantially in the type of structural fragments used as models and their representation. The LFF method uses a pair of backbone segments (each ten residues long) as a structural fragment whose conformation is described by a set of 100 inter-atomic distances between the corresponding Cα atoms. The SSEF method uses a triplet of secondary structure elements (throughout the paper we use SSE to refer to a secondary structure element) as a structural fragment whose conformation is captured by a set of pairwise angles and distances between the corresponding SSE vectors. Even though the comparison results of Zotenko et al. show that structural footprinting is an adequate approach to the mapping construction, we are not aware of any systematic study that evaluates the effect of the choice of structural fragments on the ability of a structural footprinting method to detect different types of structural similarity.
The main objective of this work is to explore in detail the dependence of a structural footprinting method on the set of structural fragments it selects to model the structure and their representation. Towards this end we focus our attention on three structural footprinting methods that vary significantly in their selection of models. To complement the LFF and SSEF methods described above, we have designed a structural footprinting method that uses contiguous segments (thirty-two residues long) of protein backbone as structural fragments; we call this method the SEGF method. The conformation of a backbone segment is captured by a set of fourteen shape descriptors introduced by Rogen et al. [3, 8]. These descriptors build upon a geometric invariant inspired by the writhing number of a closed space curve, a concept from knot theory. As opposed to common geometric invariants such as angles and distances, the fourteen shape descriptors lack an intuitive interpretation, with each descriptor being a function of many factors (see Methods).
We benchmarked the methods' performance against the CATH database. CATH is a hierarchical classification of protein structures, where protein domains are classified into groups at the Class, Architecture, Topology (Fold), and Homologous Superfamily levels. Members of the same homologous superfamily group share a clear common evolutionary origin supported either by significant sequence similarity or significant structural and functional similarity, and several superfamilies are grouped into topology (fold) groups based on significant structural similarity. The architecture level further groups proteins based on coarse topological organization of secondary structure elements. Finally, the class level groups proteins according to secondary structure element content: mainly α, mainly β, mixed α and β, or small structures.
In the first set of experiments we compared the methods' ability to detect structural similarity characteristic of evolutionarily related structures, i.e., structures within the same CATH superfamily. In a recent study, Reeves and colleagues analyzed the extent and nature of structural diversity across different superfamilies in the CATH database. In particular, it was shown that some superfamilies, especially those from layered architectures such as mainly β or α-β sandwiches, are much more structurally diverse than others. Moreover, the repertoire of structural changes is very rich, ranging from changes in conformation of the loop regions, to changes in the orientation of secondary structure elements, to insertion/deletion of secondary structures or even whole super-secondary structure motifs. Thus, studying the relative performance of the methods across different superfamilies allowed us to observe the relative strengths and weaknesses of the methods in a variety of settings and to propose two strategies to combine the methods to achieve a better performance. We showed that combination methods provide a significant improvement in performance. Even though the method that uses the SSE information has the best performance on average, combining the methods allows better handling of the whole spectrum of structural variability exhibited by various CATH classification groups.
Recently several groups [12–14] demonstrated the existence of meaningful structural relationships between protein domains classified in different folds. Harrison and colleagues, for example, introduced a measure of gregariousness, where the gregariousness of a fold quantifies how many other folds have a significant structural overlap with it. The presence of common structural motifs, often on a level of super-secondary structure elements, emerged as one of the main reasons for these inter-fold similarities. As a structural footprinting method measures structural similarity based on the presence/absence of common structural fragments, inter-fold similarities of this kind should be prominent in the method's view of the protein structure universe. Therefore, in the second set of experiments we tested how the method's definition of structural similarity extends beyond the Superfamily level and whether it agrees with the boundaries imposed by the classification groups at the Class, Architecture, and Fold levels of the hierarchy. To study these similarities in a systematic way we defined an affinity score of one superfamily towards the other, which measures how well the method retrieves members of the second superfamily using the members of the first superfamily as queries. We developed a set of techniques that allowed us to summarize the affinity scores for a particular method to expose the agreement between the method and the CATH classification hierarchy.
Results and discussion
The evaluation procedure
In this work we used the CATH database (version 2.6, released in April 2005) for benchmarking purposes. We used a set of non-redundant domains (a total of 5,588 domains) as our set of database domains. From these we selected a set of 133 well-populated superfamilies that span 55 folds, 17 architectures and 3 classes (see Methods). Given a structural footprinting method and a well-populated superfamily, every member of the superfamily was used by the method as a query to rank the remaining database domains. We then used the ROC300 scores, which measure to what extent the positives (the remaining members of the superfamily) precede the negatives (domains in folds different from that of the query), to quantify the method's ability to retrieve the other members of a superfamily given one member as a query.
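The ROC300 evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `roc_n`, the label strings, and the handling of rankings with fewer than n negatives (normalizing by the number of negatives actually seen) are assumptions made here. Domains that are neither positives nor negatives, such as domains in the query's fold but in another superfamily, are passed as `None` and skipped.

```python
def roc_n(ranked_labels, n=300):
    """ROC_n score for one ranked list (best hit first).

    Each label is "pos" (same superfamily as the query), "neg" (different
    fold), or None (ignored). The score is the normalized count of positives
    ranked ahead of each of the first n negatives.
    """
    total_pos = sum(1 for lab in ranked_labels if lab == "pos")
    if total_pos == 0:
        raise ValueError("ranking contains no positives")
    pos_seen = 0
    neg_seen = 0
    area = 0
    for lab in ranked_labels:
        if lab == "pos":
            pos_seen += 1
        elif lab == "neg":
            neg_seen += 1
            area += pos_seen  # positives ranked ahead of this negative
            if neg_seen == n:
                break
    if neg_seen == 0:
        return 1.0  # assumption: no negatives means a perfect ranking
    return area / (neg_seen * total_pos)
```

A score of 1.0 means every positive precedes the first n negatives; a score of 0.0 means none does.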
In what follows we use the CATH numbering system to refer to individual superfamilies, folds, architectures, and classes. The CATH number of a classification group encodes its position within the hierarchy. Thus, for example, a superfamily whose CATH number begins with 3.30.70 is in the 3.30.70 fold group, in the 3.30 architecture group, and in the 3 class group.
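As a small illustration of how a CATH number encodes the hierarchy, the following sketch splits a four-part superfamily number into its enclosing groups. The function name `cath_groups` is hypothetical; the example number is the Signaling Protein superfamily mentioned in this paper.

```python
def cath_groups(cath_number):
    """Return the class, architecture, fold, and superfamily of a CATH number."""
    parts = cath_number.split(".")
    if len(parts) != 4:
        raise ValueError("expected a four-part superfamily number, e.g. 3.30.450.20")
    return {
        "class": parts[0],
        "architecture": ".".join(parts[:2]),
        "fold": ".".join(parts[:3]),
        "superfamily": cath_number,
    }


groups = cath_groups("3.30.450.20")
print(groups["fold"], groups["architecture"], groups["class"])
```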
Structural similarity at the CATH Superfamily level
Table 1: Average ROC300 scores for combined methods. The average ROC300 scores obtained over a range of combination strategies: the original methods, voting with all three methods, and linear combination of similarity scores. (Individual ROC300 scores are given in the supplementary material [see Additional file 1].)
To illustrate this point let us consider three outliers in Figure 1, superfamilies for which the performance of one method is quite different from that of another, the 184.108.40.206 (Cytoskeleton), 3.30.300.20 (Rna Binding Protein), and 3.30.450.20 (Signaling Protein) superfamilies.
In contrast, the structural variability exhibited by the members of the 3.30.300.20 superfamily does not affect the performance of the SSEF method, but it does affect the other two methods. As shown in Figure 2(b), the members of this superfamily have approximately the same number of SSEs, and they are oriented in roughly the same way.
The 3.30.450.20 (Signaling Protein) superfamily contains structural representatives of the PAS domain, a family of sensor protein domains involved in signal transduction. The common fold shared by the PAS domains is flexible to accommodate binding of a large variety of co-factors, which allows PAS domains to serve as input modules in proteins that sense light, redox potential, and other stimuli. As shown in Figure 2(c), in this case the structural variability characteristic of the members of this superfamily confuses the LFF method more than the other two methods.
It is reasonable to assume that for structurally conserved superfamilies all three methods would perform well. To check this hypothesis we color coded the points in the scatter plots of Figure 1 according to the structural diversity of the corresponding superfamilies, where red color denotes the most structurally conserved superfamilies and blue the least structurally conserved superfamilies. Even though the concentration of the red points in the upper-right corner is clearly visible on all three plots, there are structurally conserved superfamilies for which one or more methods do not perform well. This can happen when a small structural change triggers a big change in the structural footprint produced by the method; consider for example performance of the SSEF method on the 220.127.116.11 superfamily discussed above. Another reason for a poor performance of a method on a structurally conserved superfamily is its inability to distinguish between the members of the superfamily and members of other superfamilies that are composed of similar structural fragments but have different overall structure.
Combining the methods
Can we take advantage of variation in performance of the methods across different superfamilies, i.e., can the output of the methods be combined in such a way as to leverage their relative strengths? To answer this question we have studied two combination strategies: voting and linear combination of similarity scores. Given a query domain, both strategies use the original similarity scores to produce a new ranking of database domains. In voting, each method's similarity scores are first used to rank the database domains. The new score of a database domain is determined by averaging the domain's positions in the three original rankings, with ties being resolved arbitrarily. In linear combination, a new structural similarity score between the query and a database domain is defined as a linear combination of the original similarity scores, where the optimal coefficients are learned with a Support Vector Machine (SVM) from a set of positive and negative examples (see Methods). The new similarity score is then used to rank the database domains.
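The voting strategy can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy scores are invented here, and ties are broken by insertion order, which is one arbitrary choice among many.

```python
def ranks(scores):
    """Map each domain to its rank position (1 = most similar to the query)."""
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return {domain: pos + 1 for pos, domain in enumerate(ordered)}


def vote(score_tables):
    """Re-rank domains by their average position across the methods' rankings."""
    rank_tables = [ranks(table) for table in score_tables]
    domains = list(rank_tables[0])
    avg_rank = {d: sum(r[d] for r in rank_tables) / len(rank_tables) for d in domains}
    return sorted(domains, key=lambda d: avg_rank[d])


# Toy similarity scores of three database domains to one query (made-up values).
ssef = {"a": 0.9, "b": 0.5, "c": 0.1}
segf = {"a": 0.2, "b": 0.8, "c": 0.1}
lff = {"a": 0.7, "b": 0.6, "c": 0.65}
print(vote([ssef, segf, lff]))
```

Because voting uses only rank positions, it needs no training data, unlike the linear combination, whose coefficients must be learned.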
As shown in Table 1, the average ROC300 scores increase from 0.750 (the SSEF method), to 0.774 (the voting combination strategy), to 0.814 (the linear combination strategy). Even with the simple voting strategy we obtain an improvement of 0.024 over the best (on average) method; the introduction of weights (in the linear combination strategy) further improves the performance by 0.040. We used the binomial sign test for two dependent samples to evaluate the statistical significance of improvements due to combination. This test evaluates whether the number of superfamilies on which one method outperforms the other differs significantly from what would be expected by chance. We found that both combination strategies significantly improve over the SSEF method: the improvement due to voting has a p-value of 3.35e-02 and the improvement due to linear combination has a p-value of 1.43e-15.
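A binomial sign test on paired per-superfamily scores can be sketched as below. This is a generic textbook version, not necessarily the exact variant used in the paper: it assumes a two-sided test, drops ties, and uses the hypothetical function name `sign_test`.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided binomial sign test p-value for paired samples.

    Under the null hypothesis each method wins a given pair with
    probability 1/2; tied pairs are discarded.
    """
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of an outcome at least as lopsided as k out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The inputs would be the ROC300 scores of two methods over the same list of superfamilies, one pair per superfamily.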
The success of a combination strategy largely depends on how consistent the methods are in their ranking of false positives. The combination is most effective when the methods disagree on their ranking of false positive domains, i.e., false positive domains ranked near the top by one method are ranked near the bottom by other methods. Thus the success of a combination strategy is a function of the methods being combined. To find out which pair of methods is the most complementary, i.e., which combination gives the best results, we repeated the linear combination experiments for all pairs of methods. The outcomes of these experiments (see Table 1 under SSEF+SEGF, SSEF+LFF, and SEGF+LFF) indicate that the combination of the SSEF and SEGF methods gives the best results. This outcome demonstrates that stand-alone performance is of lesser importance for combination purposes. Indeed, while the SEGF method is the weakest among the three methods, its performance is the least correlated with that of the SSEF method (see the performance correlation values in Figure 1).
The success of the combination strategies supports our hypothesis that the methods are indeed complementary, i.e., no single approach is able to deal with the full spectrum of structural variability exhibited by different superfamilies. Moreover, we observe that the combination of the two least correlated methods (SSEF+SEGF) is better than SSEF+LFF or LFF+SEGF.
Structural similarity at the CATH Class, Architecture, and Fold levels
So far we have evaluated whether the methods' definition of structural similarity agrees with the CATH hierarchy at the Superfamily level. Here we extend this evaluation to the higher (Class, Architecture, and Fold) levels of the hierarchy by studying the similarities, as seen by a particular method, among different superfamilies. To quantify these similarities we introduce an affinity score between a pair of superfamilies which measures how well the method retrieves the members of the second superfamily using members of the first superfamily as queries. More formally, the affinity score of superfamily A towards superfamily B is an average ROC score over all rankings of database domains produced by the method with the members of A as queries, where the positives are the remaining members of A plus members of B and the negatives are all other domains in the database. It should be noted that affinity scores are not symmetric, i.e., affinity of A towards B is not necessarily the same as affinity of B towards A.
We can also use affinity scores to quantify the agreement between the method and the hierarchy for individual classification groups. Given a classification group, we say that there is a perfect agreement between the method and the hierarchy if, for every member superfamily, within-the-group affinity scores (affinity scores of the superfamily towards other members of the group) are higher than outside-the-group affinity scores (affinity scores of the superfamily towards superfamilies in other groups). The agreement is rarely perfect; thus we measure the amount of agreement (see Methods) for every superfamily within the group and set the degree of agreement for the group to be an average of these values. The agreement values range from 0.0 to 1.0, where 0.0 corresponds to the lowest agreement. In general, an agreement value close to one means that, from the method's perspective, the corresponding classification group is structurally isolated from other groups, i.e., the composition of its member protein domains in terms of the structural fragments used by the method to model the structure is quite different from that of domains in other groups.
Average agreement with CATH. Average agreement with CATH classification groups at a given classification level for the three methods.
In this work we evaluated the effect of the model selection process on the ability of structural footprinting methods to detect various types of structural similarity. Towards this end we evaluated three methods (the SSEF, SEGF, and LFF methods) that vary greatly in terms of what structural fragments are chosen to model the structure and their representation. In our first set of experiments we studied the effect of the structural diversity exhibited by members of well-populated superfamilies in the CATH database on the methods' ability to retrieve other members of the group given one member as a query. We found that there is a large variation in performance both across the superfamilies and across the methods. This is consistent with the findings of Reeves and colleagues that some superfamilies are more structurally diverse than others. Poor correlation in performance among different methods supported the hypothesis that methods are indeed complementary in the following sense: there are types of structural variation that impact some methods to a greater extent than others. We were able to demonstrate the interplay between the nature of the structural variation and the performance of individual methods by looking in depth at several outliers, superfamilies with the most pronounced difference in performance across the methods.
To exploit the complementarity of the methods we tested two strategies, voting and linear combination of similarity scores, to combine the methods' output. We found that both strategies result in significant improvement in average performance over the best method, the SSEF method, with the linear combination strategy yielding the biggest improvement. Thus, by using a linear combination of the three similarity scores to rank database proteins we were able to improve the average ROC300 scores from 0.750 (the best average score achieved by a stand-alone method) to 0.814. We next tested which pair of methods is best suited for combination and found that combining the SSEF and SEGF methods gives the best results. This is interesting since the LFF method has a significantly better performance on average than the SEGF method and therefore one might expect that the pair of SSEF and LFF would be the winner. Thus, we conclude that the ability to reverse each other's bad decisions is more important than stand-alone performance for combination purposes.
Next we studied whether the methods' definition of structural similarity agrees with the higher (Class, Architecture, and Fold) levels of the CATH hierarchy. Towards this end, we introduced an affinity score as a measure of structural similarity, as seen by a particular method, between a pair of superfamilies. We visualized the affinity scores among well-populated superfamilies by a thresholded affinity graph, where there is a vertex for each superfamily and there is an edge between a pair of superfamilies if both affinity scores are above a certain threshold. By comparing the position of a particular superfamily in the graph to its known CATH classification we were able to recover several interesting structural similarities between superfamilies in different folds and even different classes. We also used the affinity scores to quantify the agreement between a particular method and the CATH hierarchy for individual classification groups. The agreement values allowed us to identify classification groups that are structurally isolated from other groups and to compare the agreement with the CATH hierarchy across different methods. Once again we observed that no single method has the highest agreement values across all groups at a given classification level, but on average the SSEF method agrees the most with the hierarchy.
Since a structural footprinting method measures structural similarity based on the presence/absence of common structural fragments, we believe that the affinity scores produced by the method and the analysis techniques employed in our work can be useful beyond understanding the specifics of the method. In particular, the techniques can easily expose a similar composition of different groups in terms of structural fragments used by the method and thus provide an alternative view of the continuum of the protein structure universe.
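The thresholded affinity graph described in the conclusions can be sketched as follows. The function name, the dictionary layout, and the toy affinity values are assumptions made for illustration; "above a certain threshold" is read here as strictly greater.

```python
def thresholded_affinity_graph(affinity, threshold):
    """Build the undirected edge set of a thresholded affinity graph.

    affinity[(a, b)] is the affinity of superfamily a towards b, which need
    not equal affinity[(b, a)]. An edge joins a and b only if both directed
    scores exceed the threshold.
    """
    edges = set()
    for (a, b), score in affinity.items():
        reverse = affinity.get((b, a))
        if a != b and reverse is not None and score > threshold and reverse > threshold:
            edges.add(frozenset((a, b)))
    return edges


# Toy affinity scores among three hypothetical superfamilies A, B, and C.
toy = {
    ("A", "B"): 0.9, ("B", "A"): 0.8,
    ("A", "C"): 0.9, ("C", "A"): 0.3,
    ("B", "C"): 0.2, ("C", "B"): 0.1,
}
print(thresholded_affinity_graph(toy, threshold=0.7))
```

Requiring both directions to pass the threshold keeps a one-sided similarity, such as A towards C in the toy data, from creating an edge.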
We used the CATH classification database version 2.6 (released in April 2005). To create a set of database domains, we downloaded a list of non-redundant domains filtered at 35% sequence identity from the CATH classification database web-site. We excluded from the list domains for which a valid footprint could not be produced by one or more methods, which resulted in a dataset with 5,588 domains. Specifically, as the SSEF method uses triplets of SSEs and the SEGF method uses backbone segments thirty-two residues long, we excluded from the original dataset of 6,003 domains 383 domains with fewer than three SSEs or shorter than thirty-two residues. We further removed 32 domains that do not contain a single valid structural fragment for either SSEF or SEGF.
The set of database domains contains members from 1,416 superfamilies. From these we selected a set of well-populated superfamilies, superfamilies that satisfy the following constraints: (i) the superfamily has at least five members in the set of database domains and (ii) the superfamily is not the only superfamily in its fold. There are 133 superfamilies that satisfy the above constraints. These superfamilies contain 2,348 domains and span 55 folds, 17 architectures, and 3 classes.
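The two selection constraints can be expressed as a short filter. This sketch assumes that constraint (ii) is evaluated over the superfamilies present in the set of database domains; the function name and the toy CATH-style numbers below are hypothetical.

```python
from collections import Counter, defaultdict


def well_populated(domain_to_superfamily, min_members=5):
    """Return superfamilies with >= min_members domains that share their fold
    with at least one other superfamily in the dataset."""
    members = Counter(domain_to_superfamily.values())
    superfamilies_in_fold = defaultdict(set)
    for sf in members:
        fold = sf.rsplit(".", 1)[0]  # drop the superfamily part of the number
        superfamilies_in_fold[fold].add(sf)
    return sorted(
        sf
        for sf, count in members.items()
        if count >= min_members and len(superfamilies_in_fold[sf.rsplit(".", 1)[0]]) > 1
    )


# Toy dataset: domain identifier -> superfamily number (made-up values).
toy = {f"d{i}": "1.1.1.1" for i in range(5)}
toy["d5"] = "1.1.1.2"  # second superfamily in fold 1.1.1, too small itself
toy.update({f"e{i}": "2.2.2.1" for i in range(6)})  # only superfamily in its fold
print(well_populated(toy))
```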
The SEGF Method
Structural fragments and their representation
We use a contiguous segment (thirty-two residues long) of protein backbone as a structural fragment. The protein backbone is viewed as a polygonal line passing through the Cα atoms, and the conformation of a segment is captured by a set of fourteen shape descriptors, a subset of the thirty shape descriptors originally used by Rogen et al. [3, 8]. The shape descriptors are various combinations of an average crossing number, a geometric invariant that captures the relative orientation of two oriented line segments. In what follows we first describe the average crossing number invariant and then show how the fourteen shape descriptors are constructed using this invariant as a building block.
Selecting the models
We use a procedure very similar to that of the SSEF method to obtain a representative set of fragments. The SSEF method derives its representative set of structural fragments from the SCOP classification database. Similarly to CATH, SCOP is a hierarchical classification database that organizes protein domains into four classification levels: Class, Fold, Superfamily and Family. For the SEGF method, we first extract all backbone segments from protein domains in the SCOP fold dataset, a set of structures that represent every fold in the SCOP database version 1.65. The extracted segments are clustered with a k-means clustering algorithm to obtain a total of p (in our case p = 300) clusters. From each cluster we select one backbone segment, the one closest to the cluster center, for our set of models. For the purpose of clustering, each backbone segment is represented by a point in R^14. Before the clustering is carried out we normalize the data points by applying a standard normalization procedure to each of the fourteen dimensions, where values in dimension i are normalized by subtracting their mean and dividing by their standard deviation.
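The normalization and the choice of one model per cluster can be sketched as below. The k-means step itself is omitted: the sketch assumes cluster labels are already available from any k-means implementation. The function names, the use of the population standard deviation, and the guard for constant dimensions are assumptions made here.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(points):
    """Normalize each dimension to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard constant dimensions
    return [[(x - m) / s for x, m, s in zip(p, mus, sds)] for p in points]


def representatives(points, labels):
    """For each cluster label, return the index of the point closest to the
    cluster center (the mean of the cluster's points)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    reps = {}
    for lab, idxs in clusters.items():
        center = [mean(points[i][d] for i in idxs) for d in range(len(points[0]))]
        reps[lab] = min(idxs, key=lambda i: dist(points[i], center))
    return reps
```

In the setting described above, `points` would hold the fourteen shape descriptors of each extracted backbone segment and `labels` the assignment of segments to the p = 300 clusters.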
s is a structural fragment of Q
c(s, m_i) is the contribution of s to model m_i
d(s, m_i) is the distance between s and model m_i (Euclidean distance between the corresponding points in R^14)
a is a scale factor
γ is a threshold
A structural fragment s contributes to a model m only if they are similar enough, i.e., the distance d(s, m) is below a certain threshold γ. The value of this threshold and the scale factor a are determined from the distribution of distances of a structural fragment to the closest model [see Additional file 3].
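To make the thresholded contribution concrete, here is a sketch of a footprint computation. The exact functional form of c(s, m_i) is not reproduced in this text, so the Gaussian-style weight exp(-(d/a)^2) and the normalization that spreads a count of one over the models within reach are assumptions chosen for illustration, as are the function and parameter names.

```python
from math import dist, exp


def footprint(fragments, models, gamma, a):
    """Soft-count footprint: one entry per model.

    Each fragment distributes a total count of one among the models whose
    distance to it is below gamma, with closer models receiving more weight.
    Fragments with no model within gamma contribute nothing.
    """
    fp = [0.0] * len(models)
    for s in fragments:
        weights = {}
        for i, m in enumerate(models):
            d = dist(s, m)
            if d < gamma:
                weights[i] = exp(-((d / a) ** 2))  # assumed weighting kernel
        total = sum(weights.values())
        for i, w in weights.items():
            fp[i] += w / total
    return fp


# Toy two-dimensional example (the SEGF method itself works in R^14).
models = [[0.0, 0.0], [10.0, 0.0]]
fragments = [[0.0, 0.0], [5.0, 0.0], [100.0, 0.0]]
print(footprint(fragments, models, gamma=6.0, a=1.0))
```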
Computing structural similarity
where μ_Q and μ_P are the means of f_Q and f_P, respectively.
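The similarity formula itself is not reproduced in this text. The reference to the means of the two footprints is consistent with a mean-centered correlation, so the sketch below assumes a Pearson correlation between footprints f_Q and f_P; both that choice and the function name are assumptions, not the paper's stated definition.

```python
from math import sqrt


def footprint_similarity(f_q, f_p):
    """Pearson correlation between two footprints of equal length."""
    if len(f_q) != len(f_p):
        raise ValueError("footprints must have the same number of models")
    mu_q = sum(f_q) / len(f_q)
    mu_p = sum(f_p) / len(f_p)
    cov = sum((x - mu_q) * (y - mu_p) for x, y in zip(f_q, f_p))
    var_q = sum((x - mu_q) ** 2 for x in f_q)
    var_p = sum((y - mu_p) ** 2 for y in f_p)
    if var_q == 0 or var_p == 0:
        return 0.0  # a constant footprint carries no correlation signal
    return cov / sqrt(var_q * var_p)
```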
Learning the linear combination coefficients with an SVM
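A new structural similarity score between the query and a database domain can be defined as a linear combination of the original similarity scores:

$$\mathrm{sim}_{\mathrm{COMB}} = w_{\mathrm{SSEF}}\,\mathrm{sim}_{\mathrm{SSEF}} + w_{\mathrm{SEGF}}\,\mathrm{sim}_{\mathrm{SEGF}} + w_{\mathrm{LFF}}\,\mathrm{sim}_{\mathrm{LFF}} - w_0$$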
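Applying the combined score is a one-line computation once the coefficients are known. In the sketch below the coefficient values are placeholders invented for illustration; they are not the SVM-learned values reported in the paper.

```python
def combined_similarity(sims, weights, w0):
    """Linear combination of per-method similarity scores minus the offset w0.

    sims and weights are keyed by method name; only methods listed in
    weights take part, so pairwise combinations use the same function.
    """
    return sum(weights[m] * sims[m] for m in weights) - w0


# Placeholder scores and coefficients, for illustration only.
sims = {"SSEF": 0.8, "SEGF": 0.5, "LFF": 0.6}
weights = {"SSEF": 2.0, "SEGF": 1.0, "LFF": 1.0}
print(combined_similarity(sims, weights, w0=0.5))
```

Since the offset w0 is the same for every database domain, it shifts all scores equally and does not change the ranking for a given query.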
Coefficient values used in linear combination strategy. Coefficient values learned with SVM for the four combinations: SSEF+SEGF+LFF, SSEF+SEGF, SSEF+LFF, and SEGF+LFF.
Computing agreement values for superfamilies
Given a classification group, the method's agreement with the CATH hierarchy for this group is equal to an average agreement value taken over all well-populated superfamilies that are members of the group. For a member superfamily the agreement value is measured by a ROC score taken over the list of well-populated superfamilies ranked by their affinity to the superfamily, where the positives are superfamilies in the same classification group and the negatives are superfamilies outside the group. To separate classification groups at different levels of the hierarchy, we further restrict the set of positives in our ROC score computation: at the Class level the positives are superfamilies within the same class but different architecture groups than the query superfamily, at the Architecture level the positives are within the same architecture but different fold groups, and at the Fold level the positives are within the same fold group. Since not every architecture group contains at least two superfamilies from different fold groups and not every fold group contains at least two superfamilies, the agreement values are not available for all the architectures and folds spanned by the set of well-populated superfamilies.
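The agreement value of a single superfamily can be sketched as a ROC score over the other superfamilies ranked by affinity. The function names and toy CATH-style numbers are hypothetical, and one detail is an assumption: superfamilies that belong to the query's group but fall outside the restricted set of positives (for example, the same architecture at the Class level) are skipped rather than counted as negatives.

```python
PREFIX_LEN = {"class": 1, "architecture": 2, "fold": 3}


def prefix(cath_number, k):
    return ".".join(cath_number.split(".")[:k])


def roc_score(labels):
    """Area under the ROC curve for ranked True/False labels (None skipped)."""
    pos = neg = area = 0
    for lab in labels:
        if lab is None:
            continue
        if lab:
            pos += 1
        else:
            neg += 1
            area += pos  # positives ranked ahead of this negative
    if pos == 0 or neg == 0:
        return None  # agreement value not available
    return area / (pos * neg)


def agreement(query, affinity_towards, level):
    """Agreement of one superfamily with the hierarchy at the given level.

    affinity_towards maps every other superfamily to the affinity of the
    query towards it.
    """
    k = PREFIX_LEN[level]
    ranked = sorted(affinity_towards, key=affinity_towards.get, reverse=True)
    labels = []
    for sf in ranked:
        if prefix(sf, k) != prefix(query, k):
            labels.append(False)  # outside the group
        elif level == "fold" or prefix(sf, k + 1) != prefix(query, k + 1):
            labels.append(True)   # same group, different subgroup
        else:
            labels.append(None)   # same subgroup: excluded at this level
    return roc_score(labels)
```

The group-level agreement is then the average of these values over the well-populated superfamilies in the group for which the value is available.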
For the LFF method, we obtained the set of models from the authors of the LFF method. We computed footprints and distances as described in the original LFF publication. We implemented the prototype of the SEGF method in Python using the BioPython suite of packages. The Python code and auxiliary files necessary to compute (i) the SEGF footprint from a PDB file of a structure, (ii) the SSEF footprint from a PDB file of a structure, and (iii) the structural similarity score given PDB files of structures using either the SSEF, SEGF or SSEF+SEGF methods, are given as supplementary material [see Additional file 3].
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. The authors would like to thank John Spouge for helpful discussion on evaluation of statistical significance and In-Geol Choi and Sung-Hou Kim for the models used by the LFF method. The authors are grateful to the anonymous reviewers for their constructive comments.
- Zotenko E, O'Leary D, Przytycka T: Secondary structure spatial conformation footprint: a novel method for fast protein structure comparison and classification. BMC Struct Biol 2006, 6: 12. 10.1186/1472-6807-6-12
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233: 123–138. 10.1006/jmbi.1993.1489
- Rogen P, Fain B: Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci USA 2003, 100: 119–124. 10.1073/pnas.2636460100
- Bostick D, Shen M, Vaisman I: A simple topological representation of protein structure: implications for new, fast, and robust structural classification. Proteins 2004, 56(3):487–501. 10.1002/prot.20146
- Choi I, Kwon J, Kim S: Local feature frequency profile: a method to measure structural similarity in proteins. Proc Natl Acad Sci USA 2004, 101: 3797–3802. 10.1073/pnas.0308656100
- Carugo O, Pongor S: Protein fold similarity estimated by a probabilistic approach based on C[alpha]-C[alpha] distance comparison. J Mol Biol 2002, 315: 887–898. 10.1006/jmbi.2001.5250
- Gáspári Z, Vlahovicek K, Pongor S: Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm. Bioinformatics 2005, 21(15):3322–3323. 10.1093/bioinformatics/bti513
- Rogen P, Bohr H: A new family of protein shape descriptors. Mathematical Biosciences 2003, 182: 167–181. 10.1016/S0025-5564(02)00216-X
- Fuller F: The writhing number of a space curve. Proceedings of the National Academy of Sciences USA 1971, 68: 815–819. 10.1073/pnas.68.4.815
- Orengo C, Michie A, Jones S, Jones D, Swindells M, Thornton J: CATH – A hierarchic classification of protein domain structures. Structure 1997, 5: 1093–1108. 10.1016/S0969-2126(97)00260-8
- Reeves G, Dallman T, Redfern O, Akpor A, Orengo C: Structural diversity of domain superfamilies in the CATH database. Journal of Molecular Biology 2006, 360(3):725–741. 10.1016/j.jmb.2006.05.035
- Harrison A, Pearl F, Mott R, Thornton J, Orengo C: Quantifying the similarities within fold space. Journal of Molecular Biology 2002, 323(5):909–926. 10.1016/S0022-2836(02)00992-0
- Friedberg I, Godzik A: Connecting the protein structure universe by using sparse recurring fragments. Structure 2005, 13(8):1213–1224. 10.1016/j.str.2005.05.009
- Sam V, Tai C, Garnier J, Gibrat J, Lee B, Munson P: ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification. BMC Bioinformatics 2006, 7: 206. 10.1186/1471-2105-7-206
- Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem 1996, 20(1):25–33. 10.1016/S0097-8485(96)80004-0
- Taylor BL, Zhulin IB: PAS domains: internal sensors of oxygen, redox potential, and light. Microbiol Mol Biol Rev 1999, 63(2):479–506.
- Pandini A, Bonati L: Conservation and specialization in PAS domain dynamics. Protein Eng Des Sel 2005, 18(3):127–137. 10.1093/protein/gzi017
- Cristianini N, Shawe-Taylor J: An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press; 2000.
- Sheskin D: Handbook of parametric and nonparametric statistical procedures. Fourth edition. Chapman and Hall CRC; 2007.
- Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
- Joachims T: Advances in Kernel Methods: Support Vector Learning, chap. Making large-scale SVM learning practical. Edited by: Bernhard Scholkopf, Christopher JC Burges, Alexander J Smola. The MIT Press; 1998:169–184.
- The BioPython Project [http://www.biopython.org]
- Bray JE, Todd AE, Pearl FM, Thornton JM, Orengo CA: The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Engineering 2000, 13(3):153–165. 10.1093/protein/13.3.153
- Orengo CA, Taylor WR: SSAP: sequential structure alignment program for protein structure comparison. Methods in Enzymology 1996, 266: 617–635.
- Kabsch W, Sander C: Secondary structure definition by the program DSSP. Biopolymers 1983, 22: 2577–2637. 10.1002/bip.360221211
- DeLano WL: The PyMOL Molecular Graphics System (2002). [http://pymol.sourceforge.net]
- The Cytoscape Website [http://www.cytoscape.org/]
- Fenn R: Geometry. Springer Undergraduate Mathematics Series, Springer-Verlag; 2001.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.