1,000 structures and more from the MCSG
© Lee et al; licensee BioMed Central Ltd. 2011
Received: 16 September 2010
Accepted: 10 January 2011
Published: 10 January 2011
The Midwest Center for Structural Genomics (MCSG) is one of the large-scale centres of the Protein Structure Initiative (PSI). During the first two phases of the PSI the MCSG has solved over a thousand protein structures. A criticism of structural genomics is that target selection strategies mean that some structures are solved without having a known function and thus are of little biomedical significance. Structures of unknown function have stimulated the development of methods for function prediction from structure.
We show that the MCSG has met the stated goals of the PSI and use online resources and readily available function prediction methods to provide functional annotations for more than 90% of the MCSG structures. The structure-to-function prediction method ProFunc provides likely functions for many of the MCSG structures that cannot be annotated by sequence-based methods.
Although the focus of the PSI was structural coverage, many of the structures solved by the MCSG can also be associated with functional classes and biological roles of possible biomedical value.
Sequencing of complete genomes has become commonplace in the last decade but traditional methods of protein structure determination cannot keep pace. Structural Genomics (SG) has emerged with the goal of providing a structure for every protein encoded by a genome using a high-throughput combination of experimental structure determination and homology modelling approaches. The Protein Structure Initiative (PSI) is a USA federal government, university, and industry effort that has now completed two phases. The first phase of the PSI (PSI-1), from 2000 to 2005, was dedicated to demonstrating the feasibility of high-throughput structure determination, solving unique protein structures, and developing the methodology and technology for a subsequent production phase. The second phase, PSI-2, focused on implementing the high-throughput structure determination methods developed in PSI-1, as well as homology modelling and addressing bottlenecks like modelling membrane proteins.
There are four large-scale PSI-2 centres and a number of smaller specialist centres. The four large-scale centres are the Joint Center for Structural Genomics (JCSG), the Midwest Center for Structural Genomics (MCSG), the Northeast Structural Genomics (NESG) Consortium, and the New York SGX Research Center (NYSGXRC) for Structural Genomics.
When PSI-1 began at the end of the 20th century there was much optimism that SG could provide structures to cover the whole of protein sequence space and that, together with improvements in homology modelling technology, it would be only a matter of time before a reasonably accurate structure could be predicted for any protein in nature as soon as its sequence was known. It has since become apparent that protein sequence space is much larger than was imagined at the time. Advancements in homology modelling have also not been as great as was hoped. As a consequence SG has failed to deliver the hoped-for level of coverage of sequence space and has been left with a collection of structures that, much of the time, were not targeted on the basis of their biological function. This has led to a criticism of the PSI that many of its structures are of proteins of unknown function, that many do not have corresponding publications, and that they therefore give little biological insight. Indeed, the third phase of the PSI is called PSI:Biology, which is intended to reflect a new emphasis on the biological relevance of the work. As PSI-2 draws to a close, we attempt to partially address this problem by exploiting a wide range of bioinformatics tools to provide functional annotations for as many as possible of the protein structures experimentally determined so far by the MCSG.
An early analysis of structures solved worldwide by SG consortia assessed the new coverage of sequence and fold space using the CATH and SCOP domain structure classifications. While SG was judged to be succeeding in structurally characterising new superfamilies, an early observation was that many of the sequences selected as likely to represent new folds were actually found to have existing folds. In 2005 Xie and Bourne adopted a different approach to measuring the impact of SG. They assessed functional coverage of the human genome by existing structures, structural genomics targets, and homology models. Using functional categories in the Enzyme Commission (EC) and Gene Ontology (GO) classifications they showed that, at the time, SG structures provided at least one domain that covered about a third of all the functional classes in the genome, and whole structure coverage for about a quarter of the genome. Even if all the registered SG targets were solved, which was a substantial number of targets even in 2005, then only about two thirds of the functional classes would be covered and there would be whole structure coverage for less than half of the genome. They therefore argued in favour of target selection aimed at functional coverage, especially with a view to understanding human genetic disease.
Also in 2005 Chandonia and Brenner suggested the "Pfam5000" strategy for target selection, which involved selecting the 5000 most important Pfam families. This comprehensive collection of protein families is used extensively throughout the biological sciences, often for functional annotation of complete genomes, and it was felt that this strategy would be medically and biologically relevant, of good value, and tractable. Pfam5000 was felt to be better value than the complete solution of several small to moderately sized bacterial proteomes or partial coverage of the human proteome since these would have limited impact on structural knowledge of other proteomes. The JCSG, on the other hand, focused on selecting targets from Thermotoga maritima leading to the generation of a three-dimensional reconstruction of the central metabolic network of this bacterium. The JCSG also championed the selection of sequences that were over-represented in the human gut metagenome and this strategy became part of the overall PSI effort.
Chandonia and Brenner went on to analyze the novelty, cost, and impact of SG structures compared to non-SG structural biology (SB) structures. Determination of the first structure in a Pfam family was established as a measure of success. It should be noted, however, that a Pfam family often contains proteins with a range of related functions and finer grained target selection would be necessary to truly achieve complete functional coverage. In 2004 about half of all first structures for Pfam families were from SG rather than SB.
Marsden et al. analyzed 203 complete genomes in the Gene3D resource to provide new insights into protein family space. The number of protein families was found to be continually expanding with time but a significant proportion of the proteomes could be assigned to relatively few large, well-characterized domain families while the vast majority of domain families were relatively rare and often species specific. It was suggested that SG could provide structures for fewer than a thousand Pfam families to achieve reasonable structural coverage of genomes. Within these large families it was suggested that multiple structures could be determined to reveal more about the evolution of the family and enable greater understanding of how function evolves. Similarly, the selection of targets from within large and diverse CATH superfamilies was proposed to maximize modelling leverage. Since many of the targets previously selected as likely to have a novel fold proved to have a known fold this could be seen as a more rational use of resources.
There was a coordinated strategy of target selection by the four large-scale centres during PSI-2, described by Dessailly et al., involving four main approaches: targeting representatives from large, structurally uncharacterized protein domain superfamilies; targeting structurally uncharacterized subfamilies in very large and diverse superfamilies with incomplete structural coverage; community nominated targets; and biomedically important targets. The first two approaches were primarily aimed at increasing the structural coverage of protein sequence space and many targets were selected regardless of their function. Metrics have been calculated in this and several other publications [16–18] to show that this strategy has had the result that PSI's per structure contribution to novel structural leverage has been much higher than that for SB, as would be expected since SB is not aiming to increase sequence coverage but rather provide biological insights.
The Structural Biology Knowledgebase [19, 20] offers an easy way of keeping abreast of developments by the PSI and by SG and SB in general. It is a continually updated portal to research data and other resources from the PSI. Metrics are regularly updated giving a summary of the total number of structures, the number of distinct and novel structures determined by the PSI, and the modelling leverage that PSI structures provide. Models are made available through the Protein Model Portal [21, 22]. The JCSG has made a notable effort towards the annotation of PSI structures with The Open Protein Structure Annotation Network (TOPSAN), a wiki-based portal for the dissemination of information for the broad biomedical community.
The increased appearance of structures of proteins of unknown function in the PDB due to SG has stimulated the development of computational methods of function prediction from structure. ProFunc, developed for the MCSG, combines a number of sequence-based and structure-based methods to gain clues about the function of a protein. The MCSG PSI-1 structures were used to test and develop ProFunc. When a sequence match is weak and/or multiple functions are suggested, the structure-based methods within ProFunc can help select from the options that are presented and increase confidence in a particular putative function. Another recent and notable method that has been used for function prediction from PSI structures is FLORA. This uses structural motifs associated with different functional sub-groups within functionally diverse CATH domain superfamilies.
The MCSG selected targets from a broad range of pathogenic bacteria according to selection criteria that have evolved throughout the first two stages of the PSI. Here we attempt to extend the functional annotations associated with MCSG structures by employing a diverse set of bioinformatics tools and resources. We also analyse the structural novelty of protein structures solved by the MCSG. We use sequence-based methods to assess the proportion of MCSG proteins that have a known function, a putative function, a possible function, and no known function. This is followed up with ProFunc analysis to support some of the possible functional annotations and in some cases to suggest possible functions for the unknowns.
A list of all protein structures released by the PDB, their release date, and the source SG centre where applicable was downloaded from the PDB on the 3rd March 2010. In all subsequent analyses only data that were available on this date are used. At the time 1,165 MCSG protein structures for 1,118 targets had been released by the PDB.
The source of most of the annotations used in this work is Gene3D 9.1.0. This database contains nearly 10 million protein sequences including all UniProt sequences and most complete genomes. Annotations are imported from multiple sources including EC numbers from UniProt, GO terms, KEGG genes and protein family assignments from Pfam, TIGRFAMs, SMART, and PANTHER. KEGG currently has 357 reference pathway maps. These are represented in a general way to be applicable to all organisms and thus are useful in the analysis of SG structural coverage. Each node in a reference pathway is represented by one or more manually curated KEGG orthologs. Where there are multiple orthologs these may represent different subunits or different versions of the same enzyme. Also a node may belong to more than one pathway when different pathways interact. KEGG genes and orthologs are mapped to genomes in Gene3D and PDB chains are mapped to UniProt entries using http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/data/pdb_chain_sp_ec, the data for which come from the PDBe database. In addition to collecting these annotations from Gene3D for MCSG structures, annotations are considered for inheritance by sequence-sequence comparison, sequence-profile comparison, and profile-profile comparison, such that increasingly remote sequence similarities are considered.
Not all assignments to a protein family lead to a functional annotation. Some assignments are filtered out as being uninformative, e.g. assignments to a Pfam family described as a DUF (domain of unknown function). The sequence of each MCSG structure is compared to all sequences in Gene3D using BLAST (sequence-sequence comparison) and five iterations of PSI-BLAST (sequence-profile comparison), both with an E-value cut-off of 0.01. EC numbers and molecular function GO terms associated with matching sequences are considered for inheritance. Both of these types of annotation are hierarchical. EC numbers all have four levels: the first level refers to the enzyme class ('1', for example, refers to oxidoreductases), the second level refers to the type of bond or group that is acted on ('4' within the hydrolases, for example, denotes a peptide bond), and the next two levels give progressively more specific details of the catalyzed reaction and its substrates. GO molecular function terms are organized as a directed acyclic graph where there may be many nested levels with each child having an 'is a' relationship to its parent. Annotations associated with a sequence or sequence match are divided into deep annotations, comprising level 3 or 4 EC numbers and GO terms at level 6 or above, and the remainder, which are generally less specific annotations. Hidden Markov models (HMMs) are built using the sequence of each MCSG structure as a seed for SAM-T and then each of these HMMs is compared to all Pfam 24.0 HMMs (profile-profile comparison) using PRC.
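The split between deep and less specific annotations described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, unspecified EC levels are assumed to be written as '-', and the GO level is assumed to have been computed beforehand, since a term in a directed acyclic graph can be reached by paths of different lengths.

```python
def ec_level(ec_number: str) -> int:
    """Count the specified levels of an EC number such as '3.4.21.-'."""
    level = 0
    for part in ec_number.split("."):
        if part in ("-", ""):
            break  # the remaining levels are unspecified
        level += 1
    return level


def is_deep_ec(ec_number: str) -> bool:
    """Deep EC annotation: level 3 or level 4."""
    return ec_level(ec_number) >= 3


def is_deep_go(go_level: int) -> bool:
    """Deep GO annotation: level 6 or above (level supplied by the caller)."""
    return go_level >= 6


if __name__ == "__main__":
    for ec in ("1.-.-.-", "3.4.-.-", "3.4.21.-", "3.4.21.4"):
        print(ec, ec_level(ec), is_deep_ec(ec))
```

As the text notes for GO:0005506, depth alone is only a rough guide, so a flag like this would feed manual assessment rather than replace it.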
Database annotations, inherited annotations, and annotations from the headers of the PDB files are presented together in a table at http://www.biochem.ucl.ac.uk/cgi-bin/dlee/MCSG_annotations, which is used by the authors to manually assign an annotation status to each MCSG structure. Annotations are organized from left to right to reflect decreasing levels of both depth and reliability. Four levels of status are assigned: known, putative, possible, and unknown. GO provides evidence codes for functional annotations where the most reliable codes are the experimental evidence and author statement codes. 62% of MCSG solved targets have a molecular function GO term associated with them but only 2% have such a high quality functional annotation; the remainder have annotations that are electronically inferred. There are currently very few other annotation resources where evidence codes are available, and their sequence coverage is low. We consider possession of one of the most reliable GO evidence codes as too strict a criterion to define the 'known' status. Rather, possession of a deep database annotation and preferably agreement with at least one set of protein families is generally taken as evidence that the function is regarded as being known. Note that a GO term level of 6 or higher is only a rough guide to the depth of an annotation and the association of a term such as GO:0005506 for iron ion binding (level 7) is not considered to be sufficient in itself for a 'known' status. This is one reason why manual assessment of status is so important. The 'putative' status is generally assigned where there is no deep database annotation but there is a protein family assignment and the family description includes specific molecular functions. The 'possible' status is generally assigned where only inherited annotations are available but without the added confidence provided by assignment to a family in a curated resource.
There may be a choice between a variety of deep annotations or perhaps only a very general, non-specific annotation is available presenting a broad range of possible deep annotations. The 'unknown' status is applied where sequence methods are unable to provide any clue to the function of the protein. ProFunc analysis (see below) is linked to the online table for structures assigned a 'possible' or 'unknown' status.
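The general rules for the four status levels can be written as a first-pass triage. The sketch below is not the authors' procedure, which is manual and allows exceptions (for example, a deep but uninformative GO term); the function name and its three boolean inputs are assumptions made for illustration.

```python
def suggest_status(has_deep_db_annotation: bool,
                   has_informative_family: bool,
                   has_inherited_annotation: bool) -> str:
    """Suggest a starting status for later manual review.

    has_deep_db_annotation:   a level 3/4 EC number or a deep GO term is
                              recorded in the database for the protein itself
    has_informative_family:   a family assignment whose description names
                              specific molecular functions (not a DUF)
    has_inherited_annotation: an annotation inherited only through
                              sequence or profile matches
    """
    if has_deep_db_annotation:
        return "known"
    if has_informative_family:
        return "putative"
    if has_inherited_annotation:
        return "possible"
    return "unknown"


if __name__ == "__main__":
    print(suggest_status(False, True, True))    # putative
    print(suggest_status(False, False, False))  # unknown
```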
GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO. The GO provides a generic GO slim that gives a broad overview of all function categories without the detail of the specific fine grained terms. This is useful for visualization of the broad functional coverage of MCSG structures and for comparison to the functional coverage of the PDB as a whole. A sequence unique subset that approximately represents all PDB protein entries of known function is generated by selecting representatives annotated in Gene3D with a molecular function GO term at level six or higher. GO terms associated with these representatives are then mapped to the generic GO slim using the GO Slimmer tool at http://amigo.geneontology.org/cgi-bin/amigo/slimmer. Similarly the UniProt entries for all MCSG solved targets with known function manually assigned as above are mapped to the generic GO slim.
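Mapping a specific term to a GO slim amounts to walking up the 'is a' graph until a slim term is reached. The sketch below shows that idea on a toy graph; it is not the GO Slimmer implementation, the function name is hypothetical, and the parent links are a simplified illustration rather than the real ontology.

```python
def map_to_slim(term, parents, slim):
    """Return the nearest slim terms found by walking up from `term`.

    parents: dict mapping a term to a list of its 'is a' parents
    slim:    set of terms that belong to the slim
    """
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing along this path
            continue
        stack.extend(parents.get(current, ()))
    return found


# Toy 'is a' graph (simplified, for illustration only).
PARENTS = {
    "peptidase activity": ["hydrolase activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "iron ion binding": ["metal ion binding"],
    "metal ion binding": ["binding"],
    "binding": ["molecular_function"],
}
SLIM = {"hydrolase activity", "catalytic activity", "binding", "molecular_function"}

if __name__ == "__main__":
    print(map_to_slim("peptidase activity", PARENTS, SLIM))
    print(map_to_slim("iron ion binding", PARENTS, SLIM))
```

Counting how many proteins map to each slim term then gives the broad functional profile that is compared between MCSG structures and the PDB as a whole.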
The ProFunc web server at http://www.ebi.ac.uk/profunc employs a number of complementary function-prediction methods, focusing, in particular, on methods based on 3D structural information. This accepts a PDB file and runs a number of sequence- and structure-based analyses on it, listing any hits it finds to existing sequences and structures in descending order of significance. These can help researchers identify any strong similarities which may be indicative of the protein's function.
Of ProFunc's structure-based methods, the two that have been shown most accurately to suggest function are the SSM (Secondary Structure Matching) fold-matching algorithm and the "reverse template" search. Both these methods have a similar success rate and are able to find distant homologues when simple sequence-based searches fail. The majority of their hits overlap, but occasionally one method picks up a match that the other fails to identify. The advantage of the reverse template method over the fold-match is that its matches are local and the top-scoring hits tend to pick up sites of functional relevance.
To demonstrate the added value of 3D structure for proteins of uncertain or unknown function, we run ProFunc on the full data set of 1,165 MCSG structures. ProFunc results for targets assigned a 'possible' or 'unknown' status following sequence analysis are linked to the online table. ProFunc results for targets of 'unknown' status are manually assessed to determine the improvement in functional coverage afforded by ProFunc for MCSG solved targets.
To further illustrate the value of ProFunc analysis we then focus on MCSG structures where no significant match to a known 3D structure in the PDB can be found from a search based on sequence alone, but is detectable using the structure-based reverse template method. We investigate which parts of the matched structures the method identifies as the most significantly similar.
The reverse template method takes the query protein structure and breaks it up into a large set (typically several hundred) of 3-residue templates. Each template consists of three neighbouring residues chosen such that the closest atom-atom distance between any two of the residues is no larger than 5.0Å. Templates containing more than one hydrophobic residue are rejected in order to bias the templates towards surface residues. A scan of each template against a representative set of the structures in the PDB is then made using the Jess algorithm to identify structures containing similar constellations of the three residues. The representative set is a non-redundant list of protein chains in the PDB downloaded weekly from the Pisces ftp server. Hits are scored and assigned E-values as described in Laskowski et al., the scoring being based on the local similarity of the equivalent regions when the structures are superposed on the matched residues.
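The template-generation step can be illustrated with a brute-force sketch. It is not ProFunc's code: the function names are hypothetical, the set of residues treated as hydrophobic is an assumption (the source does not list it), and the distance rule is read as applying to every pair within the triplet.

```python
from itertools import combinations
import math

# Assumed hydrophobic set; the source does not say which residues are counted.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"}


def min_distance(atoms_a, atoms_b):
    """Closest atom-atom distance between two residues (coordinates in Å)."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def three_residue_templates(residues, cutoff=5.0, max_hydrophobic=1):
    """Return index triplets of mutually neighbouring residues.

    residues: list of (residue_name, [(x, y, z), ...]) tuples
    """
    templates = []
    for trio in combinations(range(len(residues)), 3):
        names = [residues[i][0] for i in trio]
        if sum(name in HYDROPHOBIC for name in names) > max_hydrophobic:
            continue  # reject templates with more than one hydrophobic residue
        if all(min_distance(residues[i][1], residues[j][1]) <= cutoff
               for i, j in combinations(trio, 2)):
            templates.append(trio)
    return templates


if __name__ == "__main__":
    toy = [
        ("HIS", [(0.0, 0.0, 0.0)]),
        ("ASP", [(3.0, 0.0, 0.0)]),
        ("SER", [(0.0, 3.0, 0.0)]),
        ("LEU", [(3.0, 3.0, 0.0)]),
        ("VAL", [(1.5, 1.5, 1.0)]),
        ("LYS", [(20.0, 0.0, 0.0)]),
    ]
    print(three_residue_templates(toy))
```

Enumerating all triplets is cubic in the number of residues, so a real implementation would presumably restrict the search to spatial neighbours first; the sketch favours clarity over speed.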
The aim of the method is to pick up local structural similarity between two proteins and, in particular, to identify regions that have been conserved by evolution and hence are more likely to be functionally important. Thus, even though the sequences of two homologous proteins might have diverged considerably over evolutionary time, it is possible that, to maintain their function, the proteins' active sites have undergone less change. The detection of such local conservation can provide strong support for the proteins having retained the same, or a similar, function. A template size of three residues and a 5.0Å cut-off were found through empirical testing to give a good balance between the duration and success of a search (unpublished data).
Identification of functionally important regions
To demonstrate this tendency to pick up functionally important regions, we select a subset of the MCSG data where a simple FASTA search against the sequences in the PDB fails to return a significant result. We use an E-value cut-off of 0.01, which gives us 336 structures whose best match is no more significant than this threshold.
We then consider the top four hits returned by the reverse template search for each of these 336 structures to see if the template residues have any association with the protein's function. We use the functional annotations given in PDBsum, namely: residues belonging to a PROSITE pattern; catalytic residues, as determined from the Catalytic Site Atlas (CSA); residues defined in the SITE records of the original PDB file; or residues in contact with a ligand, metal or DNA/RNA in the structure.
However, as the search data set used by the reverse template method is a representative one, it is possible that the PDB entry matched by the search has no functional annotation, yet a closely related PDB entry, excluded from the data set, does have some annotation. There were nearly 28,000 protein chains in the representative data set, compared with over 64,000 structures in the PDB. Thus, to pick up such 'lost' annotations, we use the Sequence Annotated by Structure (SAS) server to find additional functional annotations for each of the matched structures. SAS identifies closely related sequences from the PDB using FASTA. Functional annotations are then imported from the resultant alignment(s) where the sequence identity is at least 30%, the alignment overlap at least 80 residues, and the FASTA E-value < 0.001.
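The three thresholds for importing annotations can be expressed as a simple filter. This is a minimal sketch: the record layout and names are assumptions for illustration, not the SAS server's output format, and the hit identifiers are placeholders.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One FASTA alignment to a related PDB sequence (hypothetical record)."""
    pdb_id: str
    identity_pct: float   # percentage sequence identity
    overlap: int          # alignment overlap in residues
    evalue: float         # FASTA E-value


def passes_import_filter(aln: Alignment) -> bool:
    """Identity at least 30%, overlap at least 80 residues, E-value < 0.001."""
    return aln.identity_pct >= 30.0 and aln.overlap >= 80 and aln.evalue < 0.001


if __name__ == "__main__":
    hits = [
        Alignment("hit1", 45.0, 210, 1e-30),
        Alignment("hit2", 28.0, 150, 1e-8),    # identity too low
        Alignment("hit3", 60.0, 40, 1e-5),     # overlap too short
    ]
    print([h.pdb_id for h in hits if passes_import_filter(h)])
```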
The release dates of all PDB entries that are associated with a GO term, are assigned to a Pfam family, or map to KEGG orthologs are compared to find the earliest example released by the PDB for each annotation. Each structure that represents a first example is then categorized as being solved by a PSI centre or as 'non-PSI' in order to rate the performance of PSI compared to the combined effort of all other laboratories, and to compare the performance of individual PSI centres. The dates that the first structures were solved are also divided into the two PSI periods PSI-1 (before 1st July 2005) and PSI-2. A measure of novelty per PDB release is calculated and for comparison the novelty per non-redundant structure at 95% sequence identity following clustering using cd-hit is also given.
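Finding the first structure for each annotation is a minimum-by-date lookup followed by a tally. The sketch below assumes a hypothetical record layout; the entry and annotation identifiers are placeholders, and it applies only the single date boundary given in the text, so it does not separately handle releases that predate the PSI.

```python
from collections import Counter
from datetime import date

PSI2_START = date(2005, 7, 1)  # PSI-1 is taken as "before 1st July 2005"


def first_structures(entries):
    """Map each annotation to its earliest released entry.

    entries: iterable of (entry_id, release_date, centre_or_None, annotations)
    """
    first = {}
    for entry_id, released, centre, annotations in entries:
        for annotation in annotations:
            if annotation not in first or released < first[annotation][1]:
                first[annotation] = (entry_id, released, centre)
    return first


def period(released):
    return "PSI-1" if released < PSI2_START else "PSI-2"


def tally(first):
    """Count first structures per (solver, period); no centre means 'non-PSI'."""
    return Counter((centre or "non-PSI", period(released))
                   for _, released, centre in first.values())


if __name__ == "__main__":
    entries = [
        ("e1", date(2003, 5, 1), None, ["PF_A", "GO_X"]),
        ("e2", date(2004, 2, 1), "MCSG", ["PF_A", "PF_B"]),
        ("e3", date(2006, 1, 10), "JCSG", ["PF_C", "GO_X"]),
    ]
    print(tally(first_structures(entries)))
```

Dividing each count by the number of structures released by the same group in the same period would give the per-release novelty measure mentioned above.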
where RMSD is the root mean square deviation of the superposition, max(L1, L2) is the length in residues of the longest domain in the superposition, and Nmat is the number of aligned residue pairs (Kolodny et al. 2005).
Novel modelling leverage
Novel modelling leverage is calculated according to the method of Nair et al. The sequences of all PDB protein chains, non-redundant at 100% sequence identity, are compared to UniRef100 sequences downloaded on the 3rd of March 2010 using PSI-BLAST with three iterations and an E-value cut-off of 1e-10. Novel modelling leverage in residues is determined for each PDB entry (including redundant entries) on the date of its release by the PDB. The novel modelling leverage of the four large-scale PSI centres is compared to each other and to the leverage of the combined non-PSI laboratories.
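The idea of crediting each entry only with residues that no earlier entry could model can be sketched as follows. This is a simplified reading, not the method of Nair et al.: the function name and input layout are assumptions, the PSI-BLAST matches are taken as already computed, and ties in release date are resolved by input order.

```python
def novel_leverage(entries):
    """Residues newly covered by each entry, in order of release.

    entries: list of (entry_id, release_date, matches), where matches maps a
             target sequence ID to the set of its residue positions aligned
             to the entry.
    """
    covered = {}   # target ID -> positions already covered by earlier entries
    leverage = {}
    for entry_id, _, matches in sorted(entries, key=lambda e: e[1]):
        novel = 0
        for target, positions in matches.items():
            seen = covered.setdefault(target, set())
            new_positions = positions - seen
            novel += len(new_positions)
            seen |= new_positions
        leverage[entry_id] = novel
    return leverage


if __name__ == "__main__":
    entries = [
        ("e2", "2006-03-01", {"A": set(range(51, 151))}),
        ("e1", "2004-06-15", {"A": set(range(1, 101))}),
    ]
    print(novel_leverage(entries))
```

Release dates are given here as ISO strings, which sort chronologically; summing the values by centre and dividing by the number of structures would give totals and per-structure averages of the kind compared between centres.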
Human non-synonymous single nucleotide polymorphisms
The MCSG focused on solving structures of proteins from pathogenic bacteria and has only solved structures for two human proteins. However, a number of MCSG structures may be used as templates for comparative modelling of human proteins. All such structures identified in the modelling leverage analysis described above are used to analyze human non-synonymous single nucleotide polymorphisms (nsSNPs). MCSG structures are included regardless of whether a closer template is available from another laboratory.
For each human modelling target identified above, variants are retrieved from UniProt, Ensembl and OMIM and filtered for uniqueness. UniProt variants and Ensembl gene IDs are listed for each protein in the UniProt text file available from UniProt. The Ensembl API is used to query their MySQL database and retrieve all Ensembl variants. OMIM variants are linked to UniProt IDs through the OMIM Missense server at http://www.bioinf.org.uk/omim/. Each variant is inspected to see if it falls within a region having a template match and is thus modellable.
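Deciding whether a variant is modellable reduces to testing whether its sequence position lies inside a template-matched region, after removing duplicates reported by more than one source. The sketch below assumes hypothetical inputs: variants as (position, substitution) pairs and regions as inclusive (start, end) ranges on the human protein.

```python
def is_modellable(position, regions):
    """True if `position` falls within any inclusive (start, end) region."""
    return any(start <= position <= end for start, end in regions)


def modellable_variants(variants, regions):
    """Deduplicate variants and keep those inside a template-matched region.

    variants: iterable of (position, substitution) pairs, possibly repeated
              across sources
    """
    unique = sorted(set(variants))
    return [v for v in unique if is_modellable(v[0], regions)]


if __name__ == "__main__":
    regions = [(5, 120), (200, 260)]
    variants = [(10, "A>S"), (10, "A>S"), (150, "G>R"), (230, "L>P")]
    print(modellable_variants(variants, regions))
```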
Results and Discussion
The apparent discrepancy between the results obtained by looking at EC numbers in Figure 2 and GO terms in Figure 3 needs explaining. EC numbers are very high quality annotations but their coverage of the PDB is less than that of GO and some catalytic activity is not represented by EC numbers. Within the GO catalytic activity category an over-representation of transferase activity and an under-representation of hydrolase activity are seen, which broadly agrees with the EC analysis. Within the GO binding category MCSG structures also show particularly significant under-representation of protein binding and calcium binding while nucleotide binding is actually a little over-represented.
ProFunc's template searches are rated as certain, probable, possible, or long shots to indicate the likelihood that a prediction is correct. ProFunc analysis is linked to all structures assigned a "possible" or "unknown" status in the online table at http://www.biochem.ucl.ac.uk/cgi-bin/dlee/MCSG_annotations. ProFunc suggests functions for all of the 78 MCSG targets that are of unknown function following sequence analysis. The overall likelihood of the best scoring ProFunc prediction being correct is shown in Figure 1b. One target has a prediction rated as being certain while another 17/78 (22%) targets have predictions rated as being probable. These predictions are all made by the reverse templates method while two of the probable predictions are also made by ligand-binding templates. ProFunc makes a valuable contribution to functional annotation coverage of MCSG structures.
A second ProFunc example is presented [Additional file 1] to show how ProFunc analysis can add weight and specific detail to a possible annotation suggested by sequence analysis.
Table: Functional novelty of structures, by annotation and period (GO terms and Pfam families during PSI, PSI-1, and PSI-2).
Pfam families, on the other hand, were targeted by PSI centres and, as would be expected, the proportion of first structures solved out of all structures solved is significantly higher for PSI than for non-PSI laboratories. As would also be expected, the proportion of first structures for Pfams is considerably higher than that for GO terms, since GO terms are a much more general functional category compared to the evolutionarily related members of Pfams. The further back in time we go, the fewer GO terms and Pfams had been structurally characterized and thus the higher the probability of solving the first structure for a term or family. As structural coverage becomes more complete the probability of solving the first structure for a term or family diminishes. This is one likely reason that the rate of solving first structures during PSI-1 is higher than that during PSI-2 for both PSI and non-PSI laboratories. Another possible explanation is that structures for the easier functions have already been solved and the remaining functions are more difficult to obtain a structure for. Perhaps we are seeing a combination of both factors. By virtue of solving more structures than any other individual PSI centre the MCSG also solved the highest number of first structures for both GO terms and Pfams. In both cases the proportion of first structures is close to the average for all PSI centres although the MCSG excelled at solving the first structures for Pfams during PSI-1.
39/45 of the MCSG first structures for GO molecular function terms correspond to a level 4 EC number and 3 of the remaining 6 also have catalytic activity e.g. GO:0015424 amino acid-transporting ATPase activity, again showing a bias towards enzymes. The EC and GO annotation systems have different coverage from each other. Analysis of EC numbers in Gene3D shows that the MCSG solved the first structure for 32 rather than 39 level 4 EC numbers. 5 of these EC numbers were not revealed by the GO analysis so in total the MCSG solved the first structure for 44 level 4 EC numbers. The novelty of these structures may also be considered in terms of providing structural coverage for metabolic pathways.
There are currently 357 KEGG reference pathways. These pathways have nodes, most but not all of which correspond to EC numbers. KEGG orthologs representing evolutionarily related groups of proteins map to many but not all of these nodes and many nodes have multiple orthologs. Currently, 7,035 orthologs are assigned to the reference pathways and more are likely to be added in the future. 1,849 out of a total of 4,977 currently possible level 4 EC numbers map through KEGG orthologs to KEGG reference pathway nodes. 1,147/1,849 (62%) have at least one structural representative compared to 1,670/4,977 (34%) for the EC classification as a whole. 9/357 (3%) pathways have complete structural coverage of their orthologs but these are all small pathways, the largest having only 8 orthologs belonging to 6 nodes, ko00072 synthesis and degradation of ketone bodies. This pathway is also the largest pathway composed entirely of nodes corresponding to level 4 EC numbers with complete structural coverage.
The 44 nodes with EC numbers that the MCSG solved the first structure for belong to 44 KEGG reference pathways but this number is a coincidence, with some of the nodes appearing in multiple pathways while multiple nodes also appear in single pathways. The largest number of nodes within a single KEGG reference pathway to gain a first structure from the MCSG is 6 for pathway ko00330, arginine and proline metabolism, but this is a large pathway with more than 100 nodes. PSI centres did not select targets with the aim of achieving complete coverage of metabolic pathways but this is a potential goal for the future.
Overall the MCSG solved the first structure for 73 orthologs in 59 pathways. 43 of these orthologs belong to multiple pathways and thus join pathways together. Membership of multiple pathways is, however, not rare with about a half of all KEGG orthologs (3,514/7,053) belonging to more than one reference pathway.
Table: Structural novelty of MCSG structures. Columns: level of structural novelty (including structural sub-group, SSG); number of domains (cumulative total).
By definition, all new folds are also new superfamilies and all new superfamilies are also new structural subgroups (SSGs). However, in this analysis not all new SSGs are classified in CATH or necessarily belong to existing superfamilies or folds. Of the 52 domains that belong to new SSGs but are not classified in CATH as new folds or superfamilies, 17 are classified into pre-existing CATH superfamilies while 35 are not classified in CATH v3.3. Of the 35 unclassified domains, 11 achieve a score >70 from the structural comparison program SSAP following the CATHEDRAL scan, suggesting that they probably do belong to pre-existing superfamilies. The remaining 24 domains could potentially represent new superfamilies or folds.
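The three-way split of the 52 domains can be read as a simple decision rule. The sketch below is an illustration only: the `Domain` record, its field names and the example entries are hypothetical, and the only values taken from the text are the SSAP threshold of 70 and the counts 17, 11, 24 and 52.

```python
from dataclasses import dataclass
from typing import Optional

SSAP_THRESHOLD = 70.0  # from the text: an SSAP score > 70 suggests an existing superfamily


@dataclass
class Domain:
    name: str                         # hypothetical identifier
    cath_superfamily: Optional[str]   # None if not classified in CATH
    best_ssap_score: Optional[float]  # None if no SSAP score is available


def triage(domain: Domain) -> str:
    """Assign a new-SSG domain to one of the three groups described in the text."""
    if domain.cath_superfamily is not None:
        return "pre-existing superfamily (classified in CATH)"
    if domain.best_ssap_score is not None and domain.best_ssap_score > SSAP_THRESHOLD:
        return "probably pre-existing superfamily (SSAP > 70)"
    return "potentially new superfamily or fold"


# Hypothetical examples, one per group.
examples = [
    Domain("toy_dom_1", "toy.superfamily", 85.0),
    Domain("toy_dom_2", None, 74.2),
    Domain("toy_dom_3", None, 61.0),
]
labels = [triage(d) for d in examples]

# The counts reported in the text add up: 17 classified + 11 probable + 24 remaining.
total_domains = 17 + 11 + 24
```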
Novel modelling leverage
Table: Novel modelling leverage. Columns: novel modelling leverage (residues); average novel modelling leverage per structure (residues).
Human non-synonymous single nucleotide polymorphisms
In the modelling leverage analysis, 152 MCSG structures are identified as templates for 867 human proteins in UniProt. In this analysis of modellable nsSNPs, all modelling leverage of MCSG structures is considered, not just novel leverage, and the MCSG structure is not necessarily the closest available template. Within these human protein sequences, 8,982 unique Ensembl nsSNPs, 1,580 unique UniProt nsSNPs, and 191 unique OMIM nsSNPs are identified. Of these nsSNPs, 2,252 are within a modellable region when using an MCSG structure as a template.
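Deciding whether an nsSNP is modellable reduces to checking whether its residue position falls inside a region of the human protein that aligns to a template. The following sketch assumes inclusive, 1-based residue ranges and uses hypothetical protein identifiers and positions; it is not the procedure used in the analysis.

```python
from typing import Dict, Iterable, List, Tuple

Region = Tuple[int, int]  # assumed inclusive, 1-based residue range on the human protein


def in_modellable_region(position: int, regions: Iterable[Region]) -> bool:
    """True if the residue position lies inside any template-aligned region."""
    return any(start <= position <= end for start, end in regions)


def count_modellable(snps: List[Tuple[str, int]],
                     regions_by_protein: Dict[str, List[Region]]) -> int:
    """Count nsSNPs, given as (protein, position), that fall in a modellable region."""
    return sum(
        1 for protein, position in snps
        if in_modellable_region(position, regions_by_protein.get(protein, []))
    )


# Hypothetical toy data.
toy_regions = {"P_TOY1": [(5, 120)], "P_TOY2": [(30, 80), (150, 210)]}
toy_snps = [("P_TOY1", 10), ("P_TOY1", 300), ("P_TOY2", 80),
            ("P_TOY2", 100), ("P_TOY3", 7)]

modellable = count_modellable(toy_snps, toy_regions)
```

Proteins without any template-aligned region, such as `P_TOY3` here, contribute no modellable nsSNPs.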
A good example of an MCSG structure helping to explain human disease is PDB entry 2hma. Human mitochondrial tRNA-specific 2-thiouridylase 1 (UniProt ID O75648) matches 2hma with 42% sequence identity in a BLAST alignment. The enzyme has been implicated in aggravating aminoglycoside-induced and non-syndromic deafness associated with mitochondrial 12S ribosomal RNA mutations. It catalyzes the 2-thiolation of uridine at the wobble position (U34) of mitochondrial tRNA(Lys), tRNA(Glu) and tRNA(Gln). A few natural variants have been identified and one (Ala10Ser) has been linked to a decrease in enzymatic activity.
The MCSG has performed well during the first two phases of the PSI in terms of the goals that were established during this period. Whilst there has been concern that the PSI produced too many structures of unknown function, our analysis reveals that, by using a range of bioinformatics tools and resources, we are able to provide functional annotations for more than 90% of the structures solved by the MCSG. Structures of unknown function have helped stimulate the development of methods such as ProFunc that can predict function from structure. ProFunc analysis complements sequence analysis both by adding weight and specific detail to predicted function and by suggesting function where sequence methods have failed to do so.
This work was funded by the NIH as part of the Protein Structure Initiative.