Evolutionarily consistent families in SCOP: sequence, structure and function
© Pethica et al.; licensee BioMed Central Ltd. 2012
Received: 1 April 2012
Accepted: 3 October 2012
Published: 18 October 2012
SCOP is a hierarchical domain classification system for proteins of known structure. The superfamily level has a clear definition: protein domains belong to the same superfamily if there is structural, functional and sequence evidence for a common evolutionary ancestor. Superfamilies are sub-classified into families; however, there is no such clear basis for the family-level groupings. Do SCOP families group together domains with sequence similarity, with similar structure, or with a common function? We address these questions and, most importantly, ask whether each family represents a distinct phylogenetic group within a superfamily.
Several phylogenetic trees were generated for each superfamily: one derived from a multiple sequence alignment, one based on structural distances, and the final two from presence/absence of GO terms or EC numbers assigned to domains. The topologies of the resulting trees and confidence values were compared to the SCOP family classification.
We show that SCOP family groupings are evolutionarily consistent to a very high degree with respect to classical sequence phylogenetics. The trees built from (automatically generated) structural distances correlate well, but are not always consistent with SCOP (hand annotated) groupings. Trees derived from functional data are less consistent with the family level than those from structure or sequence, though the majority still agree. Much of GO and EC annotation applies directly to one family or subset of the family; relatively few terms apply at the superfamily level. Maximum sequence diversity within a family is on average 22% but close to zero for superfamilies.
Proteins are made up of domains. Protein domains in this context can be regarded as the building blocks of proteins, and the smallest units of protein evolution. A small protein may consist of a single domain, whereas larger proteins may contain multiple domains. A domain can be defined as a protein unit which is seen in nature either on its own or in combination with other, different domains.
Detecting the evolutionary relationship between two or more domains using sequence information alone is often not possible, as sequences frequently diverge beyond the point of detection by comparison methods. A lack of detectable sequence similarity does not necessarily show that there is no relationship between domains. If the three-dimensional structure of the domains is known, evolutionary relationships can usually be recognised. The Structural Classification of Proteins (SCOP) [1–3] is a hierarchical classification system of proteins for which atomic-resolution three-dimensional structures are known; units in SCOP are protein domains. The SCOP classification takes protein structures published in the Protein Data Bank (PDB) as the primary data source from which the domain classification is derived. The classification of domains is based on both manual curation and automatic methods, the balance of which has resulted in a classification system which is regarded as the ‘gold standard’ and is an essential bioinformatics resource.
Levels of classification in SCOP from the top down are: class, fold, superfamily, family. A class is just a convenient grouping, e.g. domains containing only alpha-helices. Folds and superfamilies have a clear and precise definition of what they are supposed to represent: a fold groups together domains which have the same topological arrangement of secondary structure; a superfamily groups together domains which share a common evolutionary ancestor. The family level sub-groups domains within a superfamily but, unlike the other levels, lacks a precise definition. The first SCOP paper states 30% sequence identity between members of a superfamily as significant support for a family grouping. However, in the first release of SCOP there were far fewer protein structures available (a total of 13073 domains), and selecting an arbitrary sequence identity cutoff was possible. There are now nearly ten times the number of domains (110800 as of SCOP 1.75). The family level of the classification further draws on structural and functional information in the absence of strong sequence similarity, but the meaning and the properties of the family object in SCOP remain unclear.
Many projects have been based on the SCOP classification leading to several thousand citations[5–8]. Most of these projects make use of the clear evolutionary definition of a domain, and of a superfamily, so a better understanding of the family level will add value to future work which makes use of SCOP, and enable new research questions to be addressed. The research presented in this paper was carried out in order to elucidate the meaning and significance of the SCOP family level, in particular with regard to sequence, structure and function and their relationships to family classification.
We also draw on protein functional information taken from Gene Ontology (GO) terms. GO is a standardised vocabulary for describing gene products under three biological concepts: Biological Process, Molecular Function and Cellular Component. Since many proteins are enzymes, Enzyme Commission (EC) numbers can also aid in the understanding of protein function.
To understand the meaning of a family, we compared the groupings of domains in SCOP with automatically generated groupings based independently on each of the three aspects we wished to investigate: sequence, structure and function. Since we begin without a pre-conceived idea of the granularity or size/depth of the groupings, it is necessary to generate the automatic groupings at every possible level. This is represented by a tree which is the result of hierarchical clustering of the domains based on one of the three sources of information: sequence similarity, structural similarity, or functional labels (in the form of Gene Ontology terms and Enzyme Commission numbers). The level of agreement between one type of information and the grouping of a SCOP family can be assessed by asking whether each edge in the tree divides domains into family groups, or splits a family, grouping together domains from different families.
Within the literature there is variation in suggested levels for the minimum informative bootstrap confidence [11, 12], with most suggesting that about 70-80% is required for confidence. We found that from 2046 families across 428 superfamilies, 99.6% of the phylogenetic trees agree with the SCOP groupings for bootstrap values above 80%. We also found that, although less reliable, there is useful information which can be acquired from the trees for bootstrap values down to 60%. These results show that, to the extent to which sequence information can reliably determine evolutionary relationships, SCOP family groupings are evolutionarily consistent. Classical sequence phylogenetics are quite reliable for high bootstrap values, but are limited in the evolutionary distance over which they can resolve relationships. There are plenty of SCOP family groupings which sequence-based phylogenetics alone is unable to determine with high confidence; these correspond to the low-confidence parts of the tree. Although the classical phylogenetic analysis cannot inform us directly about the evolutionary consistency of many family groupings, the strong agreement for those it can resolve strongly suggests that the others (classified independently of this information) are also likely to be evolutionarily consistent.
A potential factor which contributes to the disagreements seen in trees calculated from sequence data compared to those from the other data sources is also worth noting. Diverse superfamilies with very low sequence identity between member domains may yield an unreliable multiple sequence alignment, thereby producing a resulting tree with limited accuracy. Anomalies introduced by this effect are more likely to be seen in very large superfamilies with a great deal of structural variation.
The trees built from automatically generated structural distances largely agree, but are not always consistent with SCOP’s hand annotated groupings. The hand classification of structures in SCOP at the superfamily and fold levels is often referred to as the gold standard in the field, and clearly surpasses any fully automatic method. Since detectable structural similarity remains long after sequences have diverged beyond the point of recognition, the structurally-derived trees are able to resolve deeper edges of the tree with higher confidence than the sequence-based ones (the intersection of the red and blue lines in Figure 1). That the trees are largely in agreement with the family classification indicates that SCOP is also evolutionarily consistent at greater divergence distances. The differences we see could either be cases where SCOP has grouped domains based on some criterion other than evolution (e.g. common function), or may be due to geometric structural distance being in some cases a poor measure of divergence. For some proteins, changes to the structure of a binding site may be the best indication of evolutionary divergence, but these changes make a relatively small contribution to the automatic superposition of the whole body. Conversely, movements of secondary structures relative to each other, e.g. a change of angle between beta-sheets, can cause dramatic changes in superposable structural distance which mask the true relationships. In this way structural geometric distance does not always equate to evolutionary distance.
High-ranking disagreements between the SCOP family classification and the structural trees can mostly be explained by the above; however, one exception is shown in Example 2 of Figure 2. This example shows a sequence tree, but we see the same disagreement in the structural tree, which suggests a possible mis-classification in this case.
The lines for EC numbers and GO terms shown in Figure 1 are smaller and less smooth than the others. This is because confidence values are generated using the total number of independent features that support a particular edge of the tree. There are not very many GO features per tree and barely any for EC numbers. This is partly due to a lack of richness in the ontological hierarchy but also due to the incompleteness of the annotation of the domains with terms. Trees derived from both GO and EC functional data are less consistent with the family level than trees derived from structure or sequence, though the majority still agree with the classification. This may be due to the low quality of the derived functional dataset, most commonly the lack of functional annotation for a particular domain. Functions are also assigned to the protein chain rather than to individual domains, so terms may be uninformative for two domains found within the same protein. The fact that the correlation with function is so much weaker than with sequence and structure suggests that although function may guide the choice of granularity or level of grouping of families in SCOP (see section on Distribution of GO terms), it is not a primary source of information for determining relationships.
In SCOP all domains must belong to a family, so a superfamily with a single member must also have a single family. As more structures are added to a superfamily over time, there may be new additions that have enough in common to group them apart from the rest, and a second family is created to hold them. If this happens successively, the result is that some families contain domains with something in common, but any leftovers lacking common features with each other may remain in the original family that contained the first member of the superfamily. These non-specific families are referred to here as 'dustbin families'. The 'dustbin families' line in Figure 1 is derived from the same trees as the standard domain sequences line, but the rules by which edges are defined as conflicting are adjusted so as not to penalise the presence of a single dustbin family in each superfamily. Remarkably, despite expectations, the results show that dustbin families are not a major feature of the SCOP classification.
It is clear from the distribution in the graph in Figure 3 that SCOP families are not selected by simply choosing a random sequence identity cutoff, and that the process of curation is much more elaborate.
Despite the weak link between SCOP family classification and the edges of trees representing functional data, we see a very large proportion of functional terms corresponding to exactly one family, and almost none close to the superfamily level. This suggests that the relationships between members of a superfamily, and their distances apart, are evolutionary, having been based on evidence from structure and sequence (not function), but that the granularity at which to divide the members of a superfamily is decided by function. That is, domains are not grouped based on their function, but the number of groups relates to the number of functions.
Sequence information contributes to the classification of domains into families, but alone is not enough. For a family to be classified evolutionarily, it must be consistent with sequence phylogenetics, will likely draw on structural distance, and will often coincide with a particular function. Sequence diversity between families (within a superfamily) is considerably greater than within a family. Sequence phylogenetics do not give a strong enough signal at the superfamily level to classify families, but where there is a signal it is consistent with the SCOP classification. Structural information is necessary for identifying evolutionary relationships of families in a superfamily where sequence identity is low. We see that although function does not determine the relationships, i.e. the edges, it is used to guide the level at which the tree is cut to make a family, i.e. the choice of node from which to derive a clade (granularity).
The families in SCOP represent a level at which sequence, structure, function plus other information on a shared peculiarity must all be taken into account. A balance of the strengths of signals available is used to establish the evolutionary relationships and resolve the groupings.
The data for all trees used to generate Figure 2 are available as a web resource at http://supfam2.cs.bris.ac.uk/pethica/scopresults. The data may be ranked on each of the confidence scores separately or together. For every superfamily there are tree images for sequence, structure and function annotated with the PDB domain and SCOP family ID as shown in Figure 2. The tree data can additionally be downloaded in Newick format. Also available are all the matrices of Structal data used to generate the structural distance trees.
Domain sequences for SCOP version 1.73, filtered to 95% sequence identity, were obtained from ASTRAL. The complete set of sequences was filtered to remove superfamilies for which SCOP's family level classification could not be contested. These cases included superfamilies containing a single family, those where each family contained only one member, and any superfamily made up of three or fewer domains. A detailed breakdown of the number of domains, families, and superfamilies used in the analysis can be found in Additional file 2: Table S2.
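The three exclusion rules above can be illustrated with a minimal sketch. It assumes the classification is available as (domain, superfamily, family) tuples; the function name and all identifiers in the toy input are hypothetical and are not taken from ASTRAL or SCOP.

```python
from collections import defaultdict


def contestable_superfamilies(domains):
    """Keep superfamilies whose family-level grouping can be tested.

    `domains` is an iterable of (domain_id, superfamily_id, family_id) tuples.
    Returns {superfamily_id: {family_id: [domain_id, ...]}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for domain_id, superfamily_id, family_id in domains:
        grouped[superfamily_id][family_id].append(domain_id)

    kept = {}
    for superfamily_id, families in grouped.items():
        sizes = [len(members) for members in families.values()]
        if len(families) == 1:                 # a single family
            continue
        if all(size == 1 for size in sizes):   # every family has one member
            continue
        if sum(sizes) <= 3:                    # three or fewer domains in total
            continue
        kept[superfamily_id] = {fam: sorted(m) for fam, m in families.items()}
    return kept


# Toy input with made-up identifiers (one superfamily per exclusion rule).
toy_domains = [
    ("d1", "sfA", "fam1"), ("d2", "sfA", "fam1"),
    ("d3", "sfA", "fam1"), ("d4", "sfA", "fam1"),      # single family
    ("d5", "sfB", "fam2"), ("d6", "sfB", "fam2"),
    ("d7", "sfB", "fam3"), ("d8", "sfB", "fam3"),      # passes all rules
    ("d9", "sfC", "fam4"), ("d10", "sfC", "fam5"),
    ("d11", "sfC", "fam6"), ("d12", "sfC", "fam7"),    # all singleton families
    ("d13", "sfD", "fam8"), ("d14", "sfD", "fam8"),
    ("d15", "sfD", "fam9"),                            # only three domains
]
kept = contestable_superfamilies(toy_domains)
print(sorted(kept))
```

In this toy input only the second superfamily survives, since each of the others triggers one of the three rules.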
For each superfamily in the classification, the sequences of assigned domains were used to produce an alignment using MUSCLE. Alignments were converted to Stockholm format using sreformat, which is part of the HMMER package. Quicktree, a fast implementation of the neighbour joining algorithm, was used to produce runs of both 300 and 600 bootstrap replicate trees from the sequence alignments. Phylip Consense was used to create a single consensus tree from the sets of replicate trees. In this process, the number of occurrences of a particular edge in the replicate trees was converted to a single confidence score, giving the final tree confidence values for each edge.
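The conversion of edge occurrences into confidence scores can be sketched as follows. This is an illustration of the counting idea only, not a reimplementation of Phylip Consense: it assumes each replicate tree has already been reduced to the set of leaf bipartitions (splits) induced by its internal edges, and the function names and toy leaf labels are hypothetical.

```python
from collections import Counter


def canonical_split(side, all_leaves):
    """Represent a bipartition by the side that excludes the smallest leaf label,
    so that the two ways of writing the same edge compare equal."""
    side = frozenset(side)
    anchor = min(all_leaves)
    return side if anchor not in side else frozenset(all_leaves) - side


def split_support(replicates, all_leaves):
    """Percentage of replicate trees in which each split (edge) occurs."""
    counts = Counter()
    for splits in replicates:
        # Count each distinct edge at most once per replicate.
        counts.update({canonical_split(s, all_leaves) for s in splits})
    n = len(replicates)
    return {split: 100.0 * count / n for split, count in counts.items()}


# Toy example: four replicate trees over five made-up leaves.
leaves = {"a", "b", "c", "d", "e"}
replicates = [
    [{"a", "b"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d"}],
    [{"c", "d", "e"}, {"d", "e"}],   # {c, d, e} is the same edge as {a, b}
    [{"a", "c"}, {"d", "e"}],
]
support = split_support(replicates, leaves)
for split, value in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(split), value)
```

Storing each split by a canonical side is a design choice of this sketch; it avoids counting the same edge twice when replicate trees list opposite sides of it.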
A second set was also produced where domain sequences were padded with homologue sequences from the SUPERFAMILY database. These were aligned, and trees created as with the original set. A script was used to remove the homologues from the trees, leaving only the original domain sequences but preserving all phylogenetic relationships. The dataset calculated without homologues and with 300 replicates was chosen, as very little difference was seen between the two replicate sets, and the addition of homologue sequences created larger alignments which were handled badly by the phylogenetic algorithms.
PDB-style three-dimensional protein structures for the same filtered set of SCOP 1.73 domains as used for the sequence analysis were taken from ASTRAL. Structal was used to compare the 3D structures of every domain against every other in a superfamily, for all superfamilies in the set, a computationally expensive process of around 1.5 million structural comparisons. The Structal software was chosen from the large number of other structural comparison methods due to its balance of speed and accuracy for a computation of this kind. The Structal SAS scores (100 × RMS / number of positions matched) for each pair of domains were used to create a matrix of structural distances for each superfamily. The neighbour joining algorithm in the PAUP package was used to compute phylogenetic trees from the distance matrices.
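As a minimal sketch of the distance matrix construction, the SAS formula quoted above can be applied to pairwise comparison results. The input layout (a dictionary of RMS and matched-position counts per domain pair) and the symmetric filling of the matrix are assumptions of this sketch, not a description of Structal's output format; the domain labels and values are made up.

```python
def sas_score(rms, n_matched):
    """SAS = 100 * RMS / number of matched positions (lower means more similar)."""
    if n_matched <= 0:
        raise ValueError("at least one matched position is required")
    return 100.0 * rms / n_matched


def sas_distance_matrix(domain_ids, comparisons):
    """Build a square distance matrix for one superfamily.

    `comparisons` maps (domain_a, domain_b) -> (rms, n_matched).
    Assumption: each pair is compared once and the score is used symmetrically.
    """
    index = {domain_id: i for i, domain_id in enumerate(domain_ids)}
    size = len(domain_ids)
    matrix = [[0.0] * size for _ in range(size)]
    for (a, b), (rms, n_matched) in comparisons.items():
        distance = sas_score(rms, n_matched)
        matrix[index[a]][index[b]] = distance
        matrix[index[b]][index[a]] = distance
    return matrix


# Toy comparison results with made-up values.
toy_ids = ["d1", "d2", "d3"]
toy_comparisons = {
    ("d1", "d2"): (2.0, 100),
    ("d1", "d3"): (3.0, 60),
    ("d2", "d3"): (1.5, 50),
}
sas_matrix = sas_distance_matrix(toy_ids, toy_comparisons)
for row in sas_matrix:
    print(row)
```

A matrix of this shape is the kind of input a neighbour joining implementation expects; dividing by the number of matched positions penalises superpositions that achieve a low RMS only over a short stretch.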
Gene ontology (GO) data from EBI GOA was used to annotate domains with functional terms using the same set of domains that was used for the sequence and structure trees. For each superfamily, a binary presence/absence matrix was generated of all GO terms versus all domains in the superfamily. The terms were treated independently of the hierarchy, but uninformative terms (present in all or present in only one domain) were ignored. For each superfamily the presence/absence matrix was used to generate a phylogenetic tree using PAUP neighbour joining. An additional set of functional trees was also generated using the same technique, but with functional data from Enzyme Commission (EC) numbers.
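The construction of the presence/absence matrix, including the removal of uninformative terms, can be sketched as below. The input layout (a dictionary from domain to a set of term identifiers), the function name, and the placeholder term labels are assumptions made for illustration; real GO or EC identifiers would be used in practice.

```python
def presence_absence_matrix(annotations):
    """Binary matrix of informative terms versus domains.

    `annotations` maps domain_id -> set of term identifiers.
    Terms present in every domain or in only one domain are dropped,
    and the term hierarchy is ignored.
    """
    domains = sorted(annotations)
    all_terms = sorted(set().union(*annotations.values()))
    informative = []
    for term in all_terms:
        n_present = sum(term in annotations[d] for d in domains)
        if 1 < n_present < len(domains):
            informative.append(term)
    matrix = [
        [1 if term in annotations[d] else 0 for term in informative]
        for d in domains
    ]
    return domains, informative, matrix


# Toy annotations with placeholder term labels.
toy_annotations = {
    "d1": {"T_all", "T_pair"},
    "d2": {"T_all", "T_pair", "T_three"},
    "d3": {"T_all", "T_single", "T_three"},
    "d4": {"T_all", "T_three"},
}
pa_domains, pa_terms, pa_matrix = presence_absence_matrix(toy_annotations)
print(pa_terms)
for domain, row in zip(pa_domains, pa_matrix):
    print(domain, row)
```

Here the term shared by all four domains and the term found in a single domain are discarded, leaving two informative columns.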
Phylogenetic trees of domains in each superfamily produced by each method could then be compared with the groupings at the SCOP family level. An algorithm was produced to traverse the trees and identify if a particular edge agreed, disagreed or was uninformative with regard to SCOP families:
An edge of the tree is said to agree with SCOP if one side contains the full set of domains for a certain family and no members of another family.
An edge disagrees with SCOP when domains from a certain family are found on both sides along with domains from a different family.
A neutral or uninformative edge is one where one side contains only members of a certain family, but not the complete set, i.e. more members are found on the other side of the edge.
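The three rules above can be sketched as a function that classifies a single edge, given the sets of domains on its two sides. This is one possible reading of the rules, not the algorithm used in the study; the function name and labels are hypothetical. The rules as stated do not cover an edge where both sides hold several complete families, so the sketch returns a separate "unclassified" label for that case, which is an assumption.

```python
def classify_edge(side_a, side_b, family_of):
    """Classify one tree edge against the family assignment.

    `side_a` and `side_b` are the sets of domains on either side of the edge;
    `family_of` maps domain_id -> family_id.
    """
    families_a = {family_of[d] for d in side_a}
    families_b = {family_of[d] for d in side_b}
    split_families = families_a & families_b     # families found on both sides
    single_family_sides = [f for f in (families_a, families_b) if len(f) == 1]

    # Agree: one side holds exactly one family, and that family is complete.
    if any(not (f & split_families) for f in single_family_sides):
        return "agree"
    # Neutral: one side holds only one family, but part of it lies elsewhere.
    if single_family_sides:
        return "neutral"
    # Disagree: a family is split and mixed with other families on both sides.
    if split_families:
        return "disagree"
    # Assumption: not covered by the three stated rules.
    return "unclassified"


# Toy family assignment with made-up identifiers.
toy_family_of = {"d1": "F1", "d2": "F1", "d3": "F2", "d4": "F2", "d5": "F2"}
edge_results = [
    classify_edge({"d1", "d2"}, {"d3", "d4", "d5"}, toy_family_of),
    classify_edge({"d3"}, {"d1", "d2", "d4", "d5"}, toy_family_of),
    classify_edge({"d1", "d3"}, {"d2", "d4", "d5"}, toy_family_of),
]
print(edge_results)
```

Applying such a function to every internal edge of a tree, together with the edge's confidence value, gives the per-edge agreement data that can then be compared across sequence, structure and function trees.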
Sequences for domains in SCOP 1.73 superfamilies were acquired from ASTRAL. Superfamilies containing a single domain only were removed. For each superfamily grouping, sequence identities were sequentially calculated with Washington University BLAST, the highest sequence identity members being removed until only the two most distant sequences remained. This process was repeated for domains grouped in families to give sequence distance scores for all relevant families and superfamilies in SCOP.
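The iterative reduction to the two most distant sequences can be sketched as below, assuming pairwise percent identities have already been computed (for example from BLAST output). The text does not state which member of a high-identity pair is removed at each step, so the tie-breaking rule used here (drop the member with the higher mean identity to the remaining sequences) is an assumption, as are the function name and the toy values.

```python
from itertools import combinations


def most_distant_pair(ids, identity):
    """Reduce a group to its two most distant members.

    `identity` maps frozenset({a, b}) -> percent sequence identity.
    Returns (id_a, id_b, identity_between_them).
    """
    remaining = list(ids)
    if len(remaining) < 2:
        raise ValueError("need at least two sequences")

    def mean_identity(x):
        others = [y for y in remaining if y != x]
        return sum(identity[frozenset((x, y))] for y in others) / len(others)

    while len(remaining) > 2:
        # Find the pair with the highest identity among the remaining sequences.
        a, b = max(combinations(remaining, 2),
                   key=lambda pair: identity[frozenset(pair)])
        # Assumption: remove the member that is, on average, closer to the rest.
        remaining.remove(a if mean_identity(a) >= mean_identity(b) else b)

    a, b = remaining
    return a, b, identity[frozenset((a, b))]


# Toy identities (percent) between four made-up sequences.
toy_identity = {
    frozenset(("s1", "s2")): 90, frozenset(("s1", "s3")): 40,
    frozenset(("s1", "s4")): 20, frozenset(("s2", "s3")): 45,
    frozenset(("s2", "s4")): 25, frozenset(("s3", "s4")): 60,
}
distant = most_distant_pair(["s1", "s2", "s3", "s4"], toy_identity)
print(distant)
```

A greedy removal of this kind is not guaranteed to end on the globally lowest-identity pair for every input; taking the minimum over all pairwise identities directly would be an alternative when the full matrix is available.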
For each GO term in the EBI GOA dataset, a list of single domain proteins with the particular annotation was generated. The sequence identity of the two most distant sequences in the set was determined. The distribution of domains across the SCOP classification, and the level in the hierarchy for a specific functional annotation, was also calculated, e.g. whether all domains are contained within a specific family or superfamily.
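Determining the level in the hierarchy at which an annotation sits can be sketched as a simple containment check. The three labels returned, the function name and the toy classification are assumptions made for illustration; the study may have used a finer-grained breakdown.

```python
def annotation_level(domains_with_term, classification):
    """Lowest SCOP level that contains every domain carrying a term.

    `classification` maps domain_id -> (superfamily_id, family_id).
    """
    if not domains_with_term:
        raise ValueError("term has no annotated domains")
    superfamilies = {classification[d][0] for d in domains_with_term}
    families = {classification[d] for d in domains_with_term}
    if len(families) == 1:
        return "single family"
    if len(superfamilies) == 1:
        return "single superfamily"
    return "multiple superfamilies"


# Toy classification and term assignments with made-up identifiers.
toy_classification = {
    "d1": ("sfA", "fam1"), "d2": ("sfA", "fam1"),
    "d3": ("sfA", "fam2"), "d4": ("sfB", "fam3"),
}
toy_terms = {
    "term_x": {"d1", "d2"},
    "term_y": {"d1", "d3"},
    "term_z": {"d1", "d4"},
}
levels = {term: annotation_level(doms, toy_classification)
          for term, doms in toy_terms.items()}
print(levels)
```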
We express our gratitude to Cyrus Chothia for his help and input on this work, to Martin Madera for technical assistance and to Alexey Murzin for his responses to questions about the SCOP classification.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.