Exploring functionally related enzymes using radially distributed properties of active sites around the reacting points of bound ligands
© Ueno et al.; licensee BioMed Central Ltd. 2012
Received: 14 December 2011
Accepted: 26 April 2012
Published: 26 April 2012
Structural genomics approaches, particularly those solving the 3D structures of many proteins with unknown functions, have increased the desire for structure-based function predictions. However, prediction of enzyme function is difficult because one member of a superfamily may catalyze a different reaction than other members, whereas members of different superfamilies can catalyze the same reaction. In addition, conformational changes, mutations or the absence of a particular catalytic residue can prevent inference of the mechanism by which catalytic residues stabilize and promote the elementary reaction. A major hurdle for alignment-based methods for prediction of function is the absence (despite its importance) of a measure of similarity of the physicochemical properties of catalytic sites. To solve this problem, the physicochemical features radially distributed around catalytic sites should be considered in addition to structural and sequence similarities.
We showed that radial distribution functions (RDFs), which are associated with the local structural and physicochemical properties of catalytic active sites, are capable of clustering oxidoreductases and transferases by function. The catalytic sites of these enzymes were also characterized using the RDFs. The RDFs provided a measure of the similarity among the catalytic sites, detecting conformational changes caused by mutation of catalytic residues. Furthermore, the RDFs reinforced the classification of enzyme functions based on conventional sequence and structural alignments.
Our results demonstrate that the application of RDFs provides advantages in the functional classification of enzymes by providing information about catalytic sites.
High-throughput methods for structural genomics have led to an increasing number of protein structures being solved by X-ray crystallography. The abundance of protein structure information in the Protein Data Bank (PDB) has increased the need and desire for structure-based function prediction and has contributed to structure-based drug design. However, two problems remain regarding the prediction of enzyme function. First, proteins within a superfamily, which are usually expected to share the same catalytic properties, can catalyze different reactions. There are reports that enzymes with 98% sequence identity, such as melamine deaminase and atrazine chlorohydrolase, may catalyze different reactions. Second, two enzymes belonging to different superfamilies or fold classes can catalyze almost identical reactions.
The function of a protein can be affected by a small number of residues in a localized region of its three-dimensional structure. Moreover, the specific arrangement and conformation of these residues can be crucial to a protein’s function and may be strongly conserved during its evolution, even when the protein sequence and structure change significantly. For example, it was reported that the positioning of the reactive region of a substrate with respect to a cofactor is generally conserved in flavoenzymes.
Two methods for the description of local structures have been developed for predicting enzymatic functions. First, in the element-based description of catalytic residues, the catalytic roles in an enzymatic reaction are defined as acid–base, stabilizer or modulator roles. Some insight into enzymatic reactions can be gained using this method, but manual annotation is inherently required. In addition, it is often difficult to differentiate between the acid–base and stabilizer roles because most structures solved by X-ray crystallography provide no information about hydrogen atoms. The second method is based on descriptions of substructures within the local structures of enzymes [8–23]. Many approaches to analyze and compare local structures have been proposed. One group of algorithms, which includes the PINTS, ETA [9–11] and FLORA algorithms, scans protein structural databases using pre-calculated or automatically generated templates. Another group includes algorithms that compare the substructural epitopes of proteins using geometric hashing [13–15]. Similarly, SiteEngine uses the concept of pseudocenters to define the properties of the corresponding surface. None of these approaches can characterize catalytic sites and create feature vectors, even though they assess the similarity between catalytic sites.
In this study, we examine the structures of oxidoreductases and transferases using radial distribution functions (RDFs) that encode radially distributed properties of active sites centered around the reacting points of bound ligands. Thus, element-based and substructure descriptions are integrated into the RDF, assuming that catalytic roles are restricted by distances and that different catalytic residues can play identical roles. Although the topological correlation vector method of Stahl et al. and WaveGeoMap, developed by Kupas et al., provide feature vectors related to enzyme cavities, these descriptions use patches of active sites, regardless of the orientation of the catalytic residues. Therefore, it is still unclear whether the orientation of active sites around a reacting point is related to enzymatic function and how much of the orientation is conserved. Our method provides a different view of enzymatic function by focusing on the physicochemical properties surrounding a reacting point found in enzyme cofactors.
Characteristic physicochemical pattern of active sites
Effect of mutations on the physicochemical properties of active sites
[Table: dissimilarity measures "1 – cosine" and "100 – match score"; values not recovered]
Active site properties as the critical determinants of enzyme function
Similarly, the CCP residues were mainly localized in the area around node [33, 10], including the two different catalytic sites (Figure 2). Within the CCP distribution, 1sog and 1dso from Saccharomyces cerevisiae (PDB) were positioned at nodes [36, 8] and [34, 13], respectively. In the active site of 1dso, histidine 175 is replaced by glycine (Figure 3B). Thus, the results show that the obtained clusters of enzymes consist of clusters of their catalytic sites, suggesting that the RDFs of active sites account for a major part of the enzyme function.
Prediction of enzyme functions based on the physicochemical properties of active sites
[Table: SOM assignment of RDFs of oxidoreductases; row "Occupied by one class"; values not recovered]
[Table: SOM assignment of RDFs of transferases; row "Occupied by one class"; values not recovered]
Then, to evaluate how many of the active sites are associated with enzyme functions, we performed a statistical analysis of the results of the SOM clustering. The averaged F-measure of all of the assigned EC numbers of oxidoreductases was 0.87, ranging from 0.22 to 1.00. Over 88% of the active sites of oxidoreductases were assigned to an EC number (see Additional file 5: Table S3). Similarly, the averaged F-measure of all of the assigned EC numbers of transferases was 0.88, ranging from 0.33 to 1.00. Over 88% of the active sites of transferases were assigned to an EC number (see Additional file 6: Table S4).
Prediction performance in comparison with sequence and structural alignment-based annotation
Partial correlation between the different measures of oxidoreductases
Partial correlation between the different measures of transferases
Evaluation of the SOM distance with the RDFs for the prediction of enzyme function of oxidoreductases
Evaluation of the SOM distance with the RDFs for the prediction of enzyme function of transferases
Identification of remote orthologs assigned to the same nodes in the SOM
Structural genomics prediction
SOM predictions for the proteins with unknown function in structural genomics
2.3.3, 5.4.4, 6.3.2
1.14.14, 2.3.2, 2.7.7, 3.5.4, 3.6.1, 4.2.99, 5.1.3
2.1.1, 3.1.3, 3.5.3, 5.1.3
1.13.11, 2.3.2, 2.7.10, 2.8.1, 3.6.1, 3.6.3, 4.1.2, 4.3.1, 6.3.2
1.3.1, 1.6.8, 2.7.4, 3.4.21, 3.7.1
1.1.1, 2.3.1, 3.4.11, 3.4.22, 4.2.99
1.1.1, 1.18.6, 1.3.3, 1.7.1, 2.7.1, 2.7.7, 3.2.1, 3.3.2, 3.4.21, 4.1.1, 6.3.3, 6.3.5
1.5.1, 1.7.1, 3.1.4
Understanding the orientation of catalytic sites is important for drug design. For a given G protein-coupled receptor, there are several types of ligands, classified as conformational change inducers, agonists, antagonists and inverse agonists. The RDFs describe the orientation of catalytic sites, detecting conformational changes as well as enzyme function (Table 1). In addition, the description of the microenvironment produced by the RDF is better than simple superposition of catalytic sites when a particular functional group is not present (Figure 3).
In structural genomics, the RDFs would be advantageous for finding remote orthologs, especially when evolutionary pressure has enhanced sequence/structural divergence. Although sequence-based methods are the first choice for functional annotation, proteins with sequence identities below 20–35% are problematic. Measuring structural similarity is more informative for enzyme functions exhibiting distant relationships and/or convergent evolution. However, proteins within well-known superfamilies sharing the same structural topology, such as TIM barrels, do not always have the same functions. In these cases, the measure of structural similarity alone does not correspond to functional similarity. Therefore, a specific measure representing functionality is desirable. We focused specifically on the local features around the catalytic site. Compared to the structural alignment, the functional annotation was reinforced by focusing on the reaction center (Tables 6 and 7). It is also likely that convergent evolution of an enzyme function depends less on the evolutionary process than on the physicochemical properties of active sites (Tables 8 and 9). For proteins with unknown function, 41% of query structures were newly classified into EC numbers (Table 9). However, the true performance of our method can only be evaluated once the actual functions of those proteins are revealed. Combining the results obtained using different approaches should also improve the accuracy of function predictions.
We propose a novel classification method for the prediction of enzymatic function based on the physicochemical properties of catalytic sites. The RDFs for predicting enzymatic functions are thus far limited to enzymes with bound ligands. For ligand-unbound structures, either homology modeling or superposition based on ligand-bound structures can be applied to our method. Our results suggest that the RDF provides a different perspective compared to structural and sequence alignments by focusing on a local feature because catalytic sites are thought to be more highly conserved than the overall sequences or structures of enzymes.
Dataset of active sites
Two sets of 1,880 oxidoreductase (EC1) and 789 transferase (EC2) protein structures were initially obtained from the PDB. In the case of NMR data, we used the first model in the PDB file. To simplify the filtering of the candidate active sites, structures including at least one cofactor or analogous compound were manually selected based on the annotation of PDBsum. In this study, we used the substructures within 10 Å of the reaction centers of these cofactors as active site data. The reaction centers of the cofactors are extensionally defined as follows: (1) atoms associated with bond formation and cleavage; (2) atoms exhibiting a change in charge; and (3) corresponding atoms in analogous compounds (see Additional file 1: Table S1 and Additional file 2: Table S2). In oxidoreductases, a cofactor generally forms a part of the reaction center, acting as a donor and acceptor. Finally, based on this definition, 4,092 oxidoreductase and 1,444 transferase active sites corresponding to reaction centers were obtained. The subsequent encoding for comparison of active sites also used the Cartesian coordinates of these reaction centers as a starting point. In addition, a set of 102 protein structures with the keywords “structural genomics” and “unknown function” in the PDB was used for a blind validation of function prediction.
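The 10 Å selection around a reaction center can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `atoms_within_cutoff`, the tuple layout and the coordinates are all hypothetical, and the cutoff is treated as inclusive by assumption.

```python
import math

def atoms_within_cutoff(atoms, center, cutoff=10.0):
    """Return the atoms whose distance to `center` is at most `cutoff` (in Å).

    `atoms` is a list of (label, (x, y, z)) tuples; `center` is an (x, y, z) tuple.
    The inclusive comparison (<=) is an assumption about how "within 10 Å" is read.
    """
    return [(label, xyz) for label, xyz in atoms if math.dist(xyz, center) <= cutoff]

# Hypothetical coordinates, for illustration only (not taken from any PDB entry).
reaction_center = (0.0, 0.0, 0.0)
atoms = [
    ("N_near", (3.0, 4.0, 0.0)),    # 5 Å from the reaction center
    ("O_mid", (0.0, 9.0, 0.0)),     # 9 Å from the reaction center
    ("C_far", (10.0, 10.0, 10.0)),  # about 17.3 Å away, so excluded
]
active_site = atoms_within_cutoff(atoms, reaction_center)
print([label for label, _ in active_site])
```

Because only distances to the reaction center are used, the selection does not depend on how the structure is oriented in the coordinate frame.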
Characterization of physicochemical properties of active sites
The values of the physicochemical properties of the atoms, including those in the main chain of the amino acid residues, were empirically calculated by the PETRA server [35, 36]. The atomic properties included were the total charge for electrostatic interactions and σ-electronegativity, π-electronegativity and effective atom polarizability for van der Waals interactions. These properties are based on the Partial Equalization of Orbital Electronegativities (PEOE), which is independent of 3D structures. Because the side chains of proteins show various conformations, PEOE is suitable for describing their properties.
Physicochemical encoding of active sites for the RDFs
In the RDF definition, N is the number of atoms in the active site residues; r_i is a constant for the inter-atomic distance between atom i and the reaction center atom (see Additional file 1: Table S1); σ² is the fluctuation of the atoms around their averaged positions; and p is an atomic property (see Additional file 7: Figure S3). Thus, the RDFs naturally combine active site structures and their physicochemical properties, and they are isotropic and rotationally invariant. In addition, we tested the effect of large σ² in the RDFs to investigate their robustness to conformational change; the results suggested that the RDFs were robust over a large range of B-factors (B = 8π²σ²/3) in the PDB (see Additional file 8: Figure S4).
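The RDF equation itself is not reproduced in this text, so the sketch below assumes one plausible form: a Gaussian-smeared sum over atoms, g(r) = Σ_i p_i exp(−(r − r_i)² / (2σ²)). This form, the function names and all numerical values are assumptions made for illustration and may differ from the published definition. The B-factor conversion follows the relation B = 8π²σ²/3 quoted above.

```python
import math

def rdf(distances, properties, sigma2, r_grid):
    """Gaussian-smeared radial distribution of an atomic property.

    ASSUMED form (not confirmed by the text):
        g(r) = sum_i p_i * exp(-(r - r_i)^2 / (2 * sigma2))
    `distances` holds r_i (Å) and `properties` holds p_i, one entry per atom.
    """
    if len(distances) != len(properties):
        raise ValueError("one property value is needed per atom")
    return [
        sum(p * math.exp(-((r - r_i) ** 2) / (2.0 * sigma2))
            for r_i, p in zip(distances, properties))
        for r in r_grid
    ]

def sigma2_from_b_factor(b_factor):
    """Invert B = 8 * pi^2 * sigma2 / 3 to obtain sigma2 from a B-factor."""
    return 3.0 * b_factor / (8.0 * math.pi ** 2)

# Hypothetical active site: three atoms with made-up distances (Å) and charges.
distances = [2.0, 3.5, 7.0]
charges = [-0.4, 0.3, 0.1]
r_grid = [0.5 * k for k in range(21)]  # 0.0 to 10.0 Å in 0.5 Å steps
sigma2 = sigma2_from_b_factor(20.0)    # B = 20 Å^2 is an illustrative value
vector = rdf(distances, charges, sigma2, r_grid)
print(len(vector), round(sigma2, 3))
```

The resulting list is a fixed-length feature vector. Since it depends only on the distances r_i, rotating the structure about the reaction center leaves it unchanged, which matches the rotational invariance described above.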
SOM clustering and SOM distance
SOMs provide a topology-preserving map using a nonlinear projection of high-dimensional data onto a low-dimensional grid. The low-dimensional grid is composed of nodes that represent data clusters. The neighboring nodes are connected to each other in the sense that they receive similar updates. Hence, SOMs provide information on the similarity between nodes. The SOM was run using a batch algorithm with an Epanechnikov or cut-Gaussian neighborhood function and an initial update radius of 5 or 10 nodes via implementation in the SOM Toolbox for Matlab (Mathworks, Inc.), which was developed in the Laboratory of Computer and Information Science of the Helsinki University of Technology.
In addition to the clustering, we also defined the SOM distance, which is the Euclidean distance between the SOM locations of the nodes on the grid, to obtain the distance measure between the active sites encoded by the RDFs.
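As a small worked example of the SOM distance, the sketch below computes the Euclidean distance between two grid positions. It assumes the bracketed node indices can be used directly as coordinates on a rectangular grid; if the map uses a hexagonal lattice, the indices would first need converting to lattice coordinates. The function name `som_distance` is hypothetical.

```python
import math

def som_distance(node_a, node_b):
    """Euclidean distance between two SOM nodes given as grid positions."""
    return math.dist(node_a, node_b)

# Node positions quoted in the text for 1sog and 1dso: [36, 8] and [34, 13].
d = som_distance((36, 8), (34, 13))
print(round(d, 3))  # sqrt(2^2 + 5^2) = sqrt(29)
```

Under this reading, two active sites mapped to the same node have a SOM distance of 0, and the distance grows as their nodes lie further apart on the map.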
Software for the alignment of sequences, structures and active sites for comparative experiments
The sequences and structures were aligned using the Smith-Waterman algorithm or the Needleman-Wunsch algorithm, both of which are implemented in the EMBOSS program package, or the structure-based alignment algorithms in the MAMMOTH program package. All of the pairwise alignments were performed with the default parameters. The active sites were compared using a geometric hashing algorithm implemented in SiteEngine.
Evaluation of SOM clustering
In the F-measure definition, N is the number of enzymes in the EC class. The averaged F-measure for the validation of the classification performance was obtained by calculating the average of all of the EC classes, with 1 being the best value and 0 being the worst value.
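The F-measure formula is not reproduced in this text. The sketch below therefore uses the standard definition, the harmonic mean of precision and recall per class, averaged over classes; this may differ in detail from the definition used in the study. The function name and the example labels are hypothetical.

```python
from collections import Counter

def averaged_f_measure(true_labels, predicted_labels):
    """Average the per-class F-measure (harmonic mean of precision and recall)."""
    true_count = Counter(true_labels)
    pred_count = Counter(predicted_labels)
    tp = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    scores = []
    for c in sorted(true_count):
        precision = tp[c] / pred_count[c] if pred_count[c] else 0.0
        recall = tp[c] / true_count[c]
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

# Hypothetical EC labels, for illustration only.
truth = ["1.1.1", "1.1.1", "1.1.1", "1.3.1", "1.3.1"]
assigned = ["1.1.1", "1.1.1", "1.3.1", "1.3.1", "1.3.1"]
score = averaged_f_measure(truth, assigned)
print(round(score, 3))
```

A perfect assignment gives 1 for every class and hence an average of 1, which agrees with the best value stated above.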
Evaluation of the measures for predicting enzyme functions
The ROC curve is a graphical plot of TPR versus FPR, showing the fidelity of discrimination at varying thresholds. The AUC is defined as the area under the ROC curve, representing the overall performance of discrimination. In this study, the SOM distances represented the dissimilarities among the RDFs. For the alignments, similarity was the number of aligned residues expressed as a percentage of the length of the shorter protein.
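The following sketch shows how TPR, FPR and AUC could be computed when the score is a dissimilarity such as the SOM distance, so that smaller values should indicate the same function. The AUC is computed through its rank interpretation (the probability that a same-function pair has a smaller distance than a different-function pair), which equals the area under the ROC curve. Function names and data are hypothetical.

```python
def roc_auc_from_distances(distances, same_function):
    """AUC for a dissimilarity score: smaller distance should mean same function.

    Equals the probability that a randomly chosen same-function pair has a smaller
    distance than a randomly chosen different-function pair (ties count as 0.5).
    """
    pos = [d for d, y in zip(distances, same_function) if y]
    neg = [d for d, y in zip(distances, same_function) if not y]
    if not pos or not neg:
        raise ValueError("both same-function and different-function pairs are required")
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_fpr(distances, same_function, threshold):
    """TPR and FPR when pairs with distance <= threshold are predicted 'same function'."""
    tp = sum(1 for d, y in zip(distances, same_function) if y and d <= threshold)
    fp = sum(1 for d, y in zip(distances, same_function) if not y and d <= threshold)
    n_pos = sum(1 for y in same_function if y)
    n_neg = len(same_function) - n_pos
    return tp / n_pos, fp / n_neg

# Hypothetical SOM distances for six enzyme pairs and whether each pair shares a function.
dists = [0.0, 1.0, 2.0, 5.0, 1.5, 6.0]
same = [True, True, True, False, False, False]
auc = roc_auc_from_distances(dists, same)
tpr, fpr = tpr_fpr(dists, same, threshold=1.5)
print(round(auc, 3), round(tpr, 3), round(fpr, 3))
```

For a similarity score such as an alignment percentage, the comparisons would be reversed so that larger values predict the same function.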
Partial correlation coefficients between the measures
In this study, we used the pseudo-inverse of the correlation matrix in the first step.
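The formula for the partial correlation coefficients is not reproduced in this text. A common route, sketched below, inverts the correlation matrix to obtain W and takes ρ_ij = −w_ij / √(w_ii w_jj). Note the substitution: this sketch uses an ordinary Gauss-Jordan inverse, which coincides with the pseudo-inverse only when the correlation matrix is full rank; a singular matrix would need a true pseudo-inverse. All names and values are hypothetical.

```python
import math

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; a pseudo-inverse would be needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlations: rho_ij = -w_ij / sqrt(w_ii * w_jj), W = inverse of corr."""
    w = invert(corr)
    n = len(corr)
    return [[1.0 if i == j else -w[i][j] / math.sqrt(w[i][i] * w[j][j])
             for j in range(n)] for i in range(n)]

# Hypothetical correlations among three measures (e.g. sequence, structure, SOM distance).
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
pcor = partial_correlations(corr)
print(round(pcor[0][1], 3))
```

For three variables the result can be checked against the closed form (r12 − r13·r23) / √((1 − r13²)(1 − r23²)), which gives the same value for the first two measures.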
This work was partly supported by the Institute for Bioinformatics Research and Development (BIRD) of the Japan Science and Technology Agency (JST) and the Japan Initiative for Global Research Network on Infectious Diseases (J-GRID). KU was supported by a Grant-in-Aid for Young Scientists B (KAKENHI 20700263).
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235–242. 10.1093/nar/28.1.235
- Greer J, Erickson JW, Baldwin JJ, Varney MD: Application of the three-dimensional structures of protein target molecules in structure-based drug design. J Med Chem 1994, 37(8):1035–1054. 10.1021/jm00034a001
- Seffernick JL, de Souza ML, Sadowsky MJ, Wackett LP: Melamine deaminase and atrazine chlorohydrolase: 98 percent identical but functionally different. J Bacteriol 2001, 183(8):2405–2410. 10.1128/JB.183.8.2405-2410.2001
- Babbitt PC: Definitions of enzyme function for the structural genomics era. Curr Opin Chem Biol 2003, 7(2):230–237. 10.1016/S1367-5931(03)00028-0
- Watson JD, Laskowski RA, Thornton JM: Predicting protein function from sequence and structural data. Curr Opin Struct Biol 2005, 15(3):275–284. 10.1016/j.sbi.2005.04.003
- Fraaije MW, Mattevi A: Flavoenzymes: diverse catalysts with recurrent features. Trends in biochemical sciences 2000, 25(3):126–132. 10.1016/S0968-0004(99)01533-9
- Bartlett GJ, Porter CT, Borkakoti N, Thornton JM: Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324(1):105–121. 10.1016/S0022-2836(02)01036-7
- Stark A, Russell RB: Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Res 2003, 31(13):3341–3344. 10.1093/nar/gkg506
- Kristensen DM, Ward RM, Lisewski AM, Erdin S, Chen BY, Fofanov VY, Kimmel M, Kavraki LE, Lichtarge O: Prediction of enzyme function based on 3D templates of evolutionarily important amino acids. BMC bioinformatics 2008, 9: 17. 10.1186/1471-2105-9-17
- Ward RM, Venner E, Daines B, Murray S, Erdin S, Kristensen DM, Lichtarge O: Evolutionary Trace Annotation Server: automated enzyme function prediction in protein structures using 3D templates. Bioinformatics 2009, 25(11):1426–1427. 10.1093/bioinformatics/btp160
- Erdin S, Ward RM, Venner E, Lichtarge O: Evolutionary trace annotation of protein function in the structural proteome. J Mol Biol 2010, 396(5):1451–1473. 10.1016/j.jmb.2009.12.037
- Redfern OC, Dessailly BH, Dallman TJ, Sillitoe I, Orengo CA: FLORA: a novel method to predict protein function from structure in diverse superfamilies. PLoS computational biology 2009, 5(8):e1000485. 10.1371/journal.pcbi.1000485
- Wallace AC, Borkakoti N, Thornton JM: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997, 6(11):2308–2323.
- Rosen M, Lin SL, Wolfson H, Nussinov R: Molecular shape comparisons in searches for active sites and functional similarity. Protein Eng 1998, 11(4):263–277. 10.1093/protein/11.4.263
- Weskamp N, Kuhn D, Hullermeier E, Klebe G: Efficient similarity search in protein structure databases by k-clique hashing. Bioinformatics 2004, 20(10):1522–1526. 10.1093/bioinformatics/bth113
- Shulman-Peleg A, Nussinov R, Wolfson HJ: Recognition of functional sites in protein structures. J Mol Biol 2004, 339(3):607–633. 10.1016/j.jmb.2004.04.012
- Schmitt S, Kuhn D, Klebe G: A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol 2002, 323(2):387–406. 10.1016/S0022-2836(02)00811-2
- Stahl M, Taroni C, Schneider G: Mapping of protein surface cavities and prediction of enzyme class by a self-organizing neural network. Protein Engineering 2000, 13(2):83–88. 10.1093/protein/13.2.83
- Kupas K, Ultsch A, Klebe G: Large scale analysis of protein-binding cavities using self-organizing maps and wavelet-based surface patches to describe functional properties, selectivity discrimination, and putative cross-reactivity. Proteins 2007, 71(3):1288–1306. 10.1002/prot.21823
- Jambon M, Imberty A, Deleage G, Geourjon C: A new bioinformatic approach to detect common 3D sites in protein structures. Proteins 2003, 52(2):137–145. 10.1002/prot.10339
- Schalon C, Surgand JS, Kellenberger E, Rognan D: A simple and fuzzy method to align and compare druggable ligand-binding sites. Proteins 2008, 71(4):1755–1778. 10.1002/prot.21858
- Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA: Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS computational biology 2009, 5(12):e1000585. 10.1371/journal.pcbi.1000585
- Sonavane S, Chakrabarti P: Prediction of active site cleft using support vector machines. J Chem Inf Model 2010, 50(12):2266–2273. 10.1021/ci1002922
- Bell JK, Yennawar HP, Wright SK, Thompson JR, Viola RE, Banaszak LJ: Structural analyses of a malate dehydrogenase with a variable active site. J Biol Chem 2001, 276(33):31156–31162. 10.1074/jbc.M100902200
- Wang JM, Mauro M, Edwards SL, Oatley SJ, Fishel LA, Ashford VA, Xuong NH, Kraut J: X-ray structures of recombinant yeast cytochrome c peroxidase and three heme-cleft mutants prepared by site-directed mutagenesis. Biochemistry 1990, 29(31):7160–7173. 10.1021/bi00483a003
- Didierjean C, Corbier C, Fatih M, Favier F, Boschi-Muller S, Branlant G, Aubry A: Crystal structure of two ternary complexes of phosphorylating glyceraldehyde-3-phosphate dehydrogenase from Bacillus stearothermophilus with NAD and D-glyceraldehyde 3-phosphate. J Biol Chem 2003, 278(15):12968–12976. 10.1074/jbc.M211040200
- Wilson KP, Black JA, Thomson JA, Kim EE, Griffith JP, Navia MA, Murcko MA, Chambers SP, Aldape RA, Raybuck SA, et al.: Structure and mechanism of interleukin-1 beta converting enzyme. Nature 1994, 370(6487):270–275. 10.1038/370270a0
- Nagradova NK: Study of the properties of phosphorylating D-glyceraldehyde-3-phosphate dehydrogenase. Biochemistry (Mosc) 2001, 66(10):1067–1076. 10.1023/A:1012472627801
- Kotera M, Okuno Y, Hattori M, Goto S, Kanehisa M: Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Chem Soc 2004, 126(50):16487–16498. 10.1021/ja0466457
- Rosenbaum DM, Rasmussen SG, Kobilka BK: The structure and function of G-protein-coupled receptors. Nature 2009, 459(7245):356–363. 10.1038/nature08144
- Rost B: Twilight zone of protein sequence alignments. Protein Eng 1999, 12(2):85–94. 10.1093/protein/12.2.85
- Nagano N, Orengo CA, Thornton JM: One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol 2002, 321(5):741–765. 10.1016/S0022-2836(02)00649-6
- Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, Jones ML, Thornton JM: PDBsum: a Web-based database of summaries and analyses of all PDB structures. Trends in biochemical sciences 1997, 22(12):488–490. 10.1016/S0968-0004(97)01140-7
- Chen LR, Gasteiger J: Knowledge discovery in reaction databases: Landscaping organic reactions by a self-organizing neural network. J Am Chem Soc 1997, 119(17):4033–4042. 10.1021/ja960027b
- Gasteiger J: Empirical Methods for the Calculation of Physicochemical Data of Organic Compounds. In Physical Property Prediction in Organic Chemistry. Edited by: Jochum C, Hicks MG, Sunkel J. Springer, Heidelberg, Germany; 1988:119–138.
- PETRA server [http://www2.ccc.uni-erlangen.de/services/petra/]
- Aires-de-Sousa J, Hemmer MC, Gasteiger J: Prediction of H-1 NMR chemical shifts using neural networks. Anal Chem 2002, 74(1):80–90. 10.1021/ac010737m
- Kohonen T: Self-organizing maps. 3rd edition. Springer, Berlin; 2001.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195–197. 10.1016/0022-2836(81)90087-5
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453. 10.1016/0022-2836(70)90057-4
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16(6):276–277. 10.1016/S0168-9525(00)02024-2
- Ortiz AR, Strauss CE, Olmea O: MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 2002, 11(11):2606–2621.
- Schafer J, Strimmer K: An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 2005, 21(6):754–764. 10.1093/bioinformatics/bti062
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.