Improving protein structure similarity searches using domain boundaries based on conserved sequence information
© Thompson et al; licensee BioMed Central Ltd. 2009
Received: 07 July 2008
Accepted: 19 May 2009
Published: 19 May 2009
The identification of protein domains plays an important role in protein structure comparison. Domain query size and composition are critical to structure similarity search algorithms such as the Vector Alignment Search Tool (VAST), the method employed for computing related protein structures in the NCBI Entrez system. Currently, domains identified on the basis of structural compactness are used for VAST computations. In this study, we have investigated how alternative definitions of domains derived from conserved sequence alignments in the Conserved Domain Database (CDD) would affect the domain comparisons and structure similarity search performance of VAST.
Alternative domains, which have significantly different secondary structure composition from those based on structurally compact units, were identified based on the alignment footprints of curated protein sequence domain families. Our analysis indicates that domain boundaries disagree on roughly 8% of protein chains in the medium redundancy subset of the Molecular Modeling Database (MMDB). These conflicting sequence based domain boundaries perform slightly better than structure domains in structure similarity searches, and there are interesting cases when structure similarity search performance is markedly improved.
Structure similarity searches using domain boundaries based on conserved sequence information can provide an additional method for investigators to identify interesting similarities between proteins with known structures. Because of the improvement in performance of structure similarity searches using sequence domain boundaries, we are in the process of implementing their inclusion into the VAST search and MMDB resources in the NCBI Entrez system.
Background
As the amount of diverse biological data continues to grow, it is important for new methods of analysis to be devised and current methods to be improved. The ability to detect that two proteins have diverged from a common ancestor allows one to infer functional similarity between the two. A common method for identifying similarity between proteins is the use of sequence alignment tools such as FASTA [1] and BLAST [2], which provide an alignment of two sequences and a score indicating whether the alignment is significant or could be attributed to chance. The comparison of protein structures allows one to peer back farther into evolutionary time, based on the concept that a form or structure remains similar long after sequence similarity has become undetectable [3–6]. There are many methods [7–15] and databases [16–19] currently available for protein structure comparisons. While the performance of the available methods and databases is for the most part satisfactory, it is not unusual for such methods to miss certain biologically related protein structures that may be identified by human inspection. One may consider two directions when attempting to improve the ability to detect structural similarity. The first is to improve the similarity search method itself, either by using a novel approach for constructing an alignment or by optimizing an existing method. The second is to improve the definition of the objects to be compared by the methods. Although initial reflection may suggest that the first direction is the more fruitful, there is a great deal that may be done with the data itself.
It has long been understood that there is an intermediate organization in proteins, typically called a domain, that is greater than secondary structure and less than the full-length chain of amino acids [20–22]. This fact considerably complicates the problems of sequence and structural alignment, because it is possible that two long proteins may contain a similar common domain, which is much smaller than either of the entire proteins. Ideally we want to recognize this situation, but it is difficult to detect true similarity of small subregions while at the same time excluding the small similarities that may occur due to chance. One part of the solution lies in testing for statistical significance of alignment scores or various similarity measures; but even so, it is possible for small but important similarities to be missed. Another part of the solution, which is possible in the case of structure comparison, is to identify the smaller subregions of potential similarity (the domains) and to directly compare them.
Thus, it becomes critical to identify the domains appropriately before performing structure similarity searches. Structurally compact domains are currently being used for computing related structures in MMDB. Recent studies comparing several structure-based domain parsers against expert-curated structure domain boundaries have indicated the limitations of different methods and potential improvements [23, 24]. Here we ask the questions, "How often do structurally identified domain boundaries disagree with those determined by sequence conservation?" and "Does either domain type perform better in structure similarity searches when disagreement occurs?"
In this work, we first systematically compared the domain boundaries of the sequence-based domains in the Conserved Domain Database to the structure-based domains in the medium redundancy subset of MMDB. We have identified a noticeable fraction of sequence based domains that differ significantly from those derived based on structural compactness. The new domains were then used as queries in identifying related structures using VAST and changes in structure similarity search results were analyzed. Using SCOP as a standard of truth, interesting cases were observed where the new domain boundaries perform better than the original domains in terms of homologous structure recognition. We have also found that the overall performance of sequence domains is comparable to that of whole chain and structure domain based queries.
Results and Discussion
Comparison of structure similarity search results
Having tested a series of thresholds for identifying differences between sequence-based and structure-based domains, as described in the methods section, we focused on the structure similarity search results using sequence-based domains for which at least 90% of the domain consensus sequence was aligned to a structure. A sequence-based domain was considered different from existing structure-based domains when its secondary structure composition differed by at least four secondary structure elements (SSEs) from both the most similar structure domain and the entire protein chain. These results are derived from applying the methods to the medium redundancy subset of MMDB, in which the structure database is reduced by clustering similar structures based on sequence similarity and then selecting a single representative from each cluster based on structure quality. We report our analysis results based on this non-redundant data set, though application of the methods and analysis to the larger non-identical subset of MMDB yielded similar performance of structure similarity searches. In addition, although we present the full analysis on older versions of the databases, recalculations using more recent versions of the databases revealed similar ratios of domain differences. The search for domain differences on the 6231 chains in the medium redundancy set identified 635 sequence domains on 495 protein chains (about 8% of the chains). Of the differences found, one-third of the sequence domains fall within a single structure-based domain, whereas the remainder join regions of multiple structure domains.
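The difference criterion described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the "difference" between two regions is the absolute difference in their SSE counts; the text does not state how SSE composition was compared, and the names Region and is_conflicting are hypothetical rather than part of VAST, CDD, or MMDB.

```python
from dataclasses import dataclass
from typing import Sequence

MIN_CONSENSUS_COVERAGE = 0.90  # fraction of the consensus aligned (from the text)
MIN_SSE_DIFFERENCE = 4         # SSE difference threshold (from the text)


@dataclass(frozen=True)
class Region:
    """A chain region described only by its SSE count (hypothetical model)."""
    name: str
    sse_count: int


def is_conflicting(seq_domain: Region,
                   consensus_coverage: float,
                   structure_domains: Sequence[Region],
                   whole_chain: Region) -> bool:
    """Return True if the sequence domain counts as 'different'.

    Assumption: SSE difference is measured as |count_a - count_b|.
    Differing by the threshold from the most similar structure domain
    is equivalent to differing by the threshold from every structure
    domain, so all regions are checked in a single pass.
    """
    if consensus_coverage < MIN_CONSENSUS_COVERAGE:
        return False
    references = [*structure_domains, whole_chain]
    return all(
        abs(ref.sse_count - seq_domain.sse_count) >= MIN_SSE_DIFFERENCE
        for ref in references
    )
```

Under this simplified model, a sequence domain spanning 10 SSEs on a 15-SSE chain whose structure domains contain 4 and 5 SSEs would be flagged, whereas the same domain would not be flagged if one structure domain contained 8 SSEs.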
[Table: VAST Search results for domains with boundary conflict from the medium redundancy subset of MMDB. Headings: Structure Domains Total; Sequence Domains Total; Structure Domains (≥ 20% Id); Sequence Domains (≥ 20% Id); Structure Domains (< 20% Id); Sequence Domains (< 20% Id)]
A primary goal of this study was to determine whether it would be beneficial to provide investigators with an additional automated resource that allows structure searches using alternate domain boundaries when endpoint conflicts occur. Our interest is not to replace existing domain definitions, but rather to see whether additional biologically relevant insight can be gained by using additional automatically generated domain sets. The analyses do reveal interesting new similarities that justify the inclusion of the sequence domain search results as an additional domain resource within MMDB. In the following two examples, we explore how structure domains and sequence domains can differ and the effects of the new boundaries on structure similarity search results.
DNA topoisomerase I
[Table: VAST Search results for Topoisomerase-I from human. Columns: PDB Id and Chain | Structure Domain Hit | Sequence Domain Hit | Fold Group Member]
[Table: VAST Search results for gelatin binding region of human fibronectin. Columns: PDB Id and Chain | Structure Domain Hit | Sequence Domain Hit | Fold Group Member]
Conclusion
Our investigation shows that although conflicting domain boundaries occur relatively infrequently, when disagreement occurs there is a slight gain in overall structure similarity search performance from using sequence-based domain boundaries. While performance does not improve for every difference identified, more structure neighbors are identified in general, and there are noticeable instances where there is a marked increase in the ability to distinguish homologs from non-homologs in search results. As the number and quality of curated sequence conservation based protein domain families improve over time, the impact of sequence-based domains on the recognition of biologically related structures could become more significant, and it is clearly beneficial to add sequence-based domains automatically when computing related structures in MMDB. We are in the process of implementing the inclusion of sequence domains into the protein structure resources in the Entrez system at NCBI. MMDB protein structure pages will soon allow for inspection of similar structures detected using sequence domains, and the VAST search service will allow such sequence-based domains to be identified automatically in user-submitted structures, permitting these subregions to be used as queries for structure similarity searches. The addition of sequence domain boundaries to these services will allow investigators to potentially identify interesting new relationships between protein structures that were previously undetected, and similar screening methods could easily be applied to other search systems.
Methods
Identifying sequence domains disagreeing with structure domains
The April 2005 Conserved Domain Database (CDD) and medium redundancy Molecular Modeling Database (MMDB) were used for the sequence to structure domain comparisons; the current versions are available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml and http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml. CDD entries derived from the Clusters of Orthologous Groups [31, 32] database were excluded, as they tend to be full gene products containing multiple sequence and structure domains. Each domain profile from the CDD was compared to the sequences of protein structures in our subset of MMDB. These comparisons involved using the sequences of MMDB entries as queries in RPS-BLAST against our subset of the CDD, using a 'hit' expectation value threshold of 0.01 and a requirement that at least 90% of a Conserved Domain (CD) sequence be aligned to a query to be considered for comparison. We also tested the effect of reducing the percentage of CD sequence alignment required for boundary difference comparisons. Since these tests resulted in few additional domain identifications at the cost of reduced sequence alignment length, we focused on our most stringent 90% alignment coverage requirement. Because the CDD is collected from several database sources, some domains in the database are very similar; sequence domains are therefore curated into domain families. When collecting the set of sequence domains, if multiple sequence domains from the same family aligned to a protein structure, a family representative was chosen based on the following criteria: 1) the domain family member with the greatest percentage alignment was chosen, and 2) if more than one domain family member had the same percentage alignment, the member with the shorter overall length was chosen.
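The hit filtering and family representative selection described above can be sketched in Python as follows. This is an illustrative sketch only: the Hit record and its field names are hypothetical, it does not parse real RPS-BLAST output, and treating the expectation value threshold as inclusive is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, Iterable

E_VALUE_THRESHOLD = 0.01            # 'hit' expectation value threshold (from the text)
MIN_CD_COVERAGE = Fraction(9, 10)   # at least 90% of the CD sequence aligned


@dataclass(frozen=True)
class Hit:
    """One conserved domain aligned to a chain (hypothetical record layout)."""
    cd_id: str
    family: str
    cd_length: int       # overall length of the CD sequence
    aligned_length: int  # number of CD positions aligned to the query
    e_value: float

    @property
    def coverage(self) -> Fraction:
        # Exact fraction so that equal percentages compare as true ties.
        return Fraction(self.aligned_length, self.cd_length)


def select_representatives(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Pick one representative per domain family for a single chain."""
    by_family = defaultdict(list)
    for hit in hits:
        # Assumption: a hit at exactly the threshold is accepted.
        if hit.e_value <= E_VALUE_THRESHOLD and hit.coverage >= MIN_CD_COVERAGE:
            by_family[hit.family].append(hit)
    # Criterion 1: greatest percentage alignment; criterion 2: shorter length.
    return {
        family: min(members, key=lambda h: (-h.coverage, h.cd_length))
        for family, members in by_family.items()
    }
```

Sorting on the pair (negated coverage, length) encodes both criteria in one key, so the length tie-break only takes effect when the coverage values are exactly equal.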
Structure similarity search assessment
The domain entries from MMDB and sequence domains identified as different were used as queries for structure similarity searches against the medium redundancy set of MMDB using the Vector Alignment Search Tool (VAST), available on the web at http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml. VAST is essentially a two-phase process, the first phase being the alignment of vectors of secondary structure and preliminary scoring. Those initial alignments whose scores exceed an empirically derived threshold are then refined in the second phase of structural alignment using the Cα coordinates. Only those refined alignments with a statistical significance of P < 10⁻⁵ are reported as structurally similar. Although VAST is available to the public on the web, our study used an in-house version of the VAST executable to allow the submission of multiple queries and more efficient use of computational resources. To evaluate the change in structure similarity search results when using the new domains based on sequence, we considered structurally similar domains classified within the same superfamily as the query domain in SCOP 1.69, available at http://scop.mrc-lmb.cam.ac.uk/scop/, to be homologs. Since the study explicitly looked for differences in domain boundaries, it was not possible to directly map both structure and sequence domains to corresponding entries in the SCOP database. For example, if a structure domain from MMDB has domain boundaries very similar to those of a SCOP domain, then a sequence domain found to be different from the MMDB domain would also be different from the SCOP domain definition. Thus, in order to measure the ability to identify similar domains, a homolog set for a query domain was identified as the SCOP superfamily members for all SCOP domains identified on the query chain.
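The construction of the homolog set just described can be sketched as a union over the superfamilies found on the query chain. The two lookup tables below are hypothetical stand-ins for information that would be derived from SCOP 1.69; no real SCOP file format or API is assumed.

```python
from typing import Dict, Iterable, Set


def homolog_set(query_chain: str,
                chain_superfamilies: Dict[str, Iterable[str]],
                superfamily_members: Dict[str, Set[str]]) -> Set[str]:
    """Union of superfamily members over all SCOP domains on the query chain.

    chain_superfamilies: chain id -> superfamilies of the SCOP domains
        found on that chain (hypothetical table).
    superfamily_members: superfamily id -> identifiers of its member
        domains (hypothetical table).
    """
    homologs: Set[str] = set()
    for superfamily in chain_superfamilies.get(query_chain, ()):
        homologs |= superfamily_members.get(superfamily, set())
    return homologs
```

Because the union is taken per chain rather than per domain, every query domain on a multi-domain chain receives the same homolog set in this sketch.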
Although this 'collapsing' of superfamilies on a chain could introduce the possibility of some false homolog mapping or unrealistically large homolog sets, it allowed for sensitivity and specificity analysis of individual domains in the test set as well as overall assessment of the domain-based structure similarity search result sets. In addition, to avoid missing data issues due to the smaller size of the SCOP database, all domains used as VAST queries and the resulting similar structures were reduced to only those structures included in the 1.69 release of SCOP. Individual search results were also evaluated using SCOP fold classification members to test the possibility that previously identified non-homologs were distant homologous structures that were not included in the superfamily classification. The structure similarity search results for each domain query and domain type set were then compared based on the homologous and non-homologous structures found, as well as search result overlap, i.e. hits common to both sequence and structure domain similarity search results, regardless of alignment significance scores beyond the statistical significance of P < 10⁻⁵ required for being reported as similar by the VAST algorithm. Individual search results of the new domains were then compared to results of the original structure domains and visualized using PyMOL [33] and Cn3D [34].
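The per-query comparison described above can be sketched as set arithmetic over hit identifiers. This is a hedged illustration using the standard definitions of sensitivity and specificity; the function names and the dictionary of P-values are hypothetical, and the exact bookkeeping used in the study is not stated in the text.

```python
from typing import Dict, Set, Tuple

P_VALUE_CUTOFF = 1e-5  # significance required for a reported hit (from the text)


def significant_hits(raw_hits: Dict[str, float]) -> Set[str]:
    """Keep identifiers whose P-value passes the reporting cutoff."""
    return {hit for hit, p_value in raw_hits.items() if p_value < P_VALUE_CUTOFF}


def sensitivity_specificity(hits: Set[str],
                            homologs: Set[str],
                            database: Set[str]) -> Tuple[float, float]:
    """Standard definitions over a fixed search database.

    Assumes hits and homologs are subsets of database.
    """
    true_pos = len(hits & homologs)
    false_neg = len(homologs - hits)
    false_pos = len(hits - homologs)
    true_neg = len(database - hits - homologs)
    sensitivity = true_pos / (true_pos + false_neg) if homologs else 0.0
    negatives = true_neg + false_pos
    specificity = true_neg / negatives if negatives else 0.0
    return sensitivity, specificity


def hit_overlap(structure_hits: Set[str], sequence_hits: Set[str]) -> Set[str]:
    """Hits common to both the structure and sequence domain searches."""
    return structure_hits & sequence_hits
```

Running both domain types of one chain through these functions against the same homolog set gives directly comparable sensitivity and specificity values, along with the set of hits shared by both searches.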
This research was supported by the Intramural Research Program of the NIH.
References
1. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85(8):2444–2448. doi:10.1073/pnas.85.8.2444
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403–410.
3. Chothia C, Lesk A: The relation between the divergence of sequence and structure in proteins. EMBO J 1986, 5(4):823–826.
4. Doolittle R: Similar amino acid sequences: chance or common ancestry? Science 1981, 214(4517):149–159. doi:10.1126/science.7280687
5. Sierk M, Pearson W: Sensitivity and selectivity in protein structure comparison. Protein Sci 2004, 13(3):773–785. doi:10.1110/ps.03328504
6. Wood T, Pearson W: Evolution of protein sequences and structures. J Mol Biol 1999, 291(4):977–995. doi:10.1006/jmbi.1999.2972
7. Gibrat J, Madej T, Bryant S: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385. doi:10.1016/S0959-440X(96)80058-3
8. Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233(1):123–138. doi:10.1006/jmbi.1993.1489
9. Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Eng 2000, 13(8):535–543. doi:10.1093/protein/13.8.535
10. Levitt M, Gerstein M: A unified statistical framework for sequence comparison and structure comparison. Proc Natl Acad Sci USA 1998, 95(11):5913–5920. doi:10.1073/pnas.95.11.5913
11. Mooney S, Liang M, DeConde R, Altman R: Structural characterization of proteins using residue environments. Proteins 2005, 61(4):741–747. doi:10.1002/prot.20661
12. Shindyalov I, Bourne P: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11(9):739–747. doi:10.1093/protein/11.9.739
13. Szustakowski J, Weng Z: Protein structure alignment using a genetic algorithm. Proteins 2000, 38(4):428–440. doi:10.1002/(SICI)1097-0134(20000301)38:4<428::AID-PROT8>3.0.CO;2-N
14. Taylor W: Protein structure comparison using iterated double dynamic programming. Protein Sci 1999, 8(3):654–665.
15. Zhi D, Krishna S, Cao H, Pevzner P, Godzik A: Representing and comparing protein structures as paths in three-dimensional space. BMC Bioinformatics 2006, 7:460. doi:10.1186/1471-2105-7-460
16. Chen J, Anderson J, DeWeese-Scott C, Fedorova N, Geer L, He S, Hurwitz D, Jackson J, Jacobs A, Lanczycki C, et al.: MMDB: Entrez's 3D-structure database. Nucleic Acids Res 2003, 31(1):474–477. doi:10.1093/nar/gkg086
17. Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.
18. Orengo C, Michie A, Jones S, Jones D, Swindells M, Thornton J: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. doi:10.1016/S0969-2126(97)00260-8
19. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, et al.: The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005, (33 Database):D247–251.
20. Phillips DC: The three-dimensional structure of an enzyme molecule. Sci Am 1966, 215(5):78–90.
21. Edelman G, Cunningham B, Gall W, Gottlieb P, Rutishauser U, Waxdal M: The covalent structure of an entire gammaG immunoglobulin molecule. Proc Natl Acad Sci USA 1969, 63(1):78–85. doi:10.1073/pnas.63.1.78
22. Edelman G: The covalent structure of a human gamma G-immunoglobulin. XI. Functional implications. Biochemistry 1970, 9(16):3197–3205. doi:10.1021/bi00818a012
23. Holland T, Veretnik S, Shindyalov I, Bourne P: Partitioning protein structures into domains: why is it so difficult? J Mol Biol 2006, 361(3):562–590. doi:10.1016/j.jmb.2006.05.060
24. Veretnik S, Bourne P, Alexandrov N, Shindyalov I: Toward consistent assignment of structural domains in proteins. J Mol Biol 2004, 339(3):647–678. doi:10.1016/j.jmb.2004.03.053
25. Finn R, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al.: Pfam: clans, web tools and services. Nucleic Acids Res 2006, (34 Database):D247–251. doi:10.1093/nar/gkj149
26. Schultz J, Milpetz F, Bork P, Ponting C: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci USA 1998, 95(11):5857–5864. doi:10.1073/pnas.95.11.5857
27. Letunic I, Copley R, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res 2005, 34(Database issue):D257–260.
28. Marchler-Bauer A, Panchenko A, Shoemaker B, Thiessen P, Geer L, Bryant S: CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 2001, 30(1):281–283. doi:10.1093/nar/30.1.281
29. Marchler-Bauer A, Anderson J, Cherukuri P, DeWeese-Scott C, Geer L, Gwadz M, He S, Hurwitz D, Jackson J, Ke Z, et al.: CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 2005, (33 Database):D192–196.
30. VAST Help. 2007. [http://www.ncbi.nlm.nih.gov/Structure/VAST/vasthelp.html]
31. Tatusov R, Koonin E, Lipman D: A genomic perspective on protein families. Science 1997, 278(5338):631–637. doi:10.1126/science.278.5338.631
32. Tatusov R, Fedorova N, Jackson J, Jacobs A, Kiryutin B, Koonin E, Krylov D, Mazumder R, Mekhedov S, Nikolskaya A, et al.: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41. doi:10.1186/1471-2105-4-41
33. DeLano M: The PyMol Molecular Graphics System. Palo Alto, CA, USA: DeLano Scientific; 2002.
34. Wang Y, Geer L, Chappey C, Kans J, Bryant S: Cn3D: sequence and structure views for Entrez. Trends Biochem Sci 2000, 25(6):300–302. doi:10.1016/S0968-0004(00)01561-9
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.