Surface-Based Retrieval of Binding Sites for the Same Protein: HIV-1 Protease
The utility of surface shape comparisons was assessed by retrieving functionally homologous human immunodeficiency virus (HIV-1) protease-ligand complexes from the GPSS. HIV-1 protease is an essential aspartyl protease that cleaves nascent polypeptides enabling maturation of viral proteins. Inactivation of the protease blocks production of infectious viral particles[36]. Therefore, HIV-1 has been an active target and one of the early success stories of rational drug design[37]. We identified 151 HIV-1 protein-inhibitor complexes deposited in the PDB with the following criteria: proteins are in the dimer conformation, inhibitors are not compound fragments (molecular weight >100), and inhibitors are unique in our dataset. The proteins in our dataset share at least 48% sequence similarity and secondary structure similarity Z-scores greater 9.0, as measured using the secondary structure matching (SSM) algorithm[38].
The binding surface of human HIV-1 (PDB:1eby, E.C. 3.4.23.16, CATH[39] 20.40.70.10, Figure 6ab) with bound inhibitor BEB (MW 652.7, Figure 6c) was selected as a query. First, the query was searched against the GPSS library using the SSS comparisons. The sorted KS distance scores between the query surface and all members from the library are plotted in Figure 6d. Points highlighted in red indicate known HIV-1 inhibitor binding surfaces. The results behave expectedly as 124 of 151 have KS distance scores less than 0.1. Plotting the search results in a receiver operator characteristic (ROC) curve (see Methods) we measure the retrieval rate using SSS at 84.7% from the area under the curve (AUC) (Figure 6e). The poorest ranking HIV-1 protease surfaces are associated with aggressive mutation studies in the binding pocket or correlated to decisively small (<200) or large (>900) molecular weight inhibitors.
Next, we performed the same search using only the spatial alignment scores to evaluate similarity. We observe that all three alignment-based scoring measures provide better specificity than SSS distance score. The AUC for cRMSD P-value, oRMSD P-value, and SVOT are: 97.5%, 96.8% and 93.0%, respectively. The ROC plots are shown for each measure in Figure 6e. The improved specificity of the spatial alignment scores comes at a significant runtime disadvantage. The SSS shape retrieval method took 24 minutes to compare the query to the GPSSS library, while the spatial alignment took 1,657 minutes. When using the SSS scores to pre-filter the search library, as described in the SurfaceScreen methodology, we can achieve an AUC of 95.3% for the combined SurfaceScreen (Figure 6e) score with an overall runtime of 148 minutes. The shape signature filter reduced the library 86%, to just over 4,000 surfaces, yet did not eliminate any true positives from the library.
Using the SurfaceScreen score, the most similar (rank 132) non HIV-1 surface was from plasmepsin II from P. falciparum (PDB:1lf3, E.C. 3.4.23.39, CATH 2.40.70.10), another aspartic protease and a major virulence factor[40].P. falciparm is a species of Plasmodiums that causes one of the major malaria infections in humans. Plasmepsin II plays an essential role in P. falciparm in the degradation of hemoglobin as a source of amino acids for growth and maturation. The binding surface of inhibitor EH5 exhibited strong similarity to the HIV-1 inhibitor binding site, with a SurfaceScreen score of 1.89. Plasmepsin II shares only 12% sequence identity to HIV-1 protease and an SSM alignment produces a non-significant Z-score of 3.3. The surfaces are both formed at the intersection of loops and β-sheets, although the plasmepsin II binding surface is formed from a single chain, unlike HIV-1 protease, which is occurs at a dimer interface. There are 15 residues that are conserved between the two surfaces. While it is not surprising that proteases share a similar binding site, the low level of sequence and secondary structure similarity highlights that localized functional conservation can be found in surfaces. Our observation is in agreement with recent reports where HIV-1 inhibitors have been shown to be effective antimalarial agents[41].
The highest-ranking non-protease surface was from aclacinomycin methylesterase (RdmC) from S. purpurascens (PDB:1q0r, Figure 6fg). RdmC modifies the aklavinone skeleton in the biosynthesis of anthracyclines in Streptomyces species[42]. Anthracyclines are a class of aromatic polyketide antibiotics used as chemotherapuetic agents to treat a wide range of cancers, including leukemia, lymphoma, and breast, uterine, ovarian and lung cancers. Despite sharing only 7% sequence identity and being built present in different scaffolds (CATH 3.40.50.1820), the RdmC binding surface of DcmaT (Figure 6h), was found to be similar with a SurfaceScreen score of 1.81 (ranked at position 134). The superposition of the surfaces is shown in Figure 6ij and with their corresponding inhibitors in Figure 6k. The surprising similarity of these surfaces has significant medicinal impact as it supports the recent reports of the inhibitory effects of anthracycline agents on protease activity[43]. This result shows how identification of similar binding surfaces (and their corresponding ligands) can provide guidance for structure based drug discovery, not only in scaffold design but also to screen against potential undesirable binding site similarities that could result in undesired side effects.
Retrieval and Prediction of Heme Binding Surfaces
Heme is a versatile prosthetic group that plays an important role across many biological systems. Hemoproteins have diverse functions including oxygen binding and transport, electron transfer and redox, and catalysis. Their functional diversity is accomplished through an equally diverse range of protein topologies[9, 44]. A comprehensive analysis of 68 b-type heme binding interactions by Schneider et. al. identified over 20 different folds that bind heme in both solvent accessible cavities and buried voids[45]. Even functional homologues show diversity in binding orientation as observed in HasA and HemS[45]. Surfaces from myoglobin (CATH code = 1.10.490.10, PDB:1mbn), nitrophorin (CATH code = 1.40.128.20, PDB:1np4), and inducible nitric oxide synthase (iNOS) (CATH code = 3.90.1230.10, PDB:4nos)[46], representing extrema of heme binding, are shown in Figure 7.
The variability of heme binding proteins presents considerable challenges for automated identification and retrieval of hemoproteins from sequence and structure databases. Using myoglobin from P. catodon (PDB:1mbn, Figure 7ab) as a query protein, we compared it to a non-redundant (<95% sequence identity) PDB set using BLAST[47]. Using the 690 heme binding proteins in the PDB as true-positive hits, a retrieval rate of 68.7% is calculated from the rank order of sorted (by E-value) search results. Comparing the structure of myoglobin against the same set of proteins using SSM search results provides a retrieval rate of 64.4%. The retrieval rate was calculated from the rank order of sorted Z-scores. The ROC plots are shown for both methods in Figure 7h.
The myoglobin query was then searched against the GPSS library. The retrieval rate, using the SurfaceScreen score, is 94.8%. All surface scoring measures had superior performance over sequence and structure methods (Figure 7h). The most selective was the SSS KS distance with retrieval at 95.8%. Despite the variability of topologies forming binding surfaces, the binding surface shape appears to be the most conserved feature of hemoproteins. This can be seen in the shape signature plots for myoglobin, nitrophorin and iNOS (Figure 7h). The iNOS heme binding pocket is the lowest ranking true-positive surface against our query, as observed by the stark difference in shape signatures. It appears that despite evolutionary pressure imposed for functional specification, surface must maintain geometry necessary to accommodate the canonical heme shape. Surface shape is better conserved for heme binding than the amino acid environment. These observations agree with that of Schneider in which heme binding interactions were found to be generally diverse, with the exception of only three amino acids at "hot spots".
While true heme binding surfaces dominate the top scoring surfaces we find that other binding surfaces have surprising similarity to our query surface. The highest ranking, at position 42, was the Ampcpr binding surface from ADP-ribose pyrophosphatase (ADPRase) from E. coli (PDB:1khz). A Nudix hydrolase enzyme, ADPRase catalyzes the Mg2+-dependent hydrolysis of ADP-ribose to AMP and ribose 5-phosphate[48]. The surface, formed at the intersection of β-sheets and loops, is shown in Figure 7ij. The SSS plot confirms that strong visual shape similarity between the two surfaces (Figure 7g, red). A shape-based superposition (ROCS, OpenEye Scientific, Inc.) of the ligands shows that (Figure 7j) the similarity of these functionally unrelated proteins may lie strictly in their ability to accommodate similarly sized molecules.
Detection of a Convergent Heme Binding Surface
Convergent evolution presents a far more difficult challenge for annotation of proteins of unknown function. Structural genomics target, IsdG from S. aureus[49] (PDB:1xbw), shows no significant sequence similarity and does not contain the conserved N-terminal histidine or the GXXXG motif characteristic present in the heme-monooxygenases family, yet this enzyme displays classical heme-monooxygenase activity[49]. Also, while all known members of heme-oxygenase superfamily are of all α-helical fold, IsdG adopts α+β sandwich with an anti-parallel β-sheet and ferredoxin-like fold and a β-barrel at the dimer interface[49].
The structure of IsdG has a prominent pocket formed between the α-helices and beta sheets (Figure 8a). This is the largest surface pocket identified by the CASTp webserver[50]. Querying this surface against the GPSS library reveals a striking similarity to the heme binding pocket in heme oxygenase (HmuO) from C. diphtheriae (PDB:1iw0, Figure 8b). The SSS distributions have distance of 0.06 (Figure 8d). There are 19 conserved residues between the surfaces that come from diverse regions of the primary sequence (Figure 8c). The surfaces align with cRMSD P-value of 9.84 × 10-3 (Figure 8ef) and oRMSD P-value of 5.32 × 10-4 (Figure 8gh). Superposition of the surfaces results in gSVOT of 0.78, lSVOT 0.84, and rSVOT of 0.93 (Figure 8i). The gSVOT overlap is highlighted in Figure 8j. The SurfaceScreen score for the comparison is 1.98, which ranked fourth overall against the search library.
Several other structural homologues to IsdG have subsequently been solved in the structural genomics effort. A clustering of the putative binding sites for four additional enzymes is shown in Figure 9. Surface analysis reveals that the heme binding pocket is well conserved in protein TT1390 from T. thermopilus (PDB:1iuj) and protein BC2969 from B. cereus (PDB:1tz0) suggesting that both T. thermopilus and B. cereus can acquire iron through heme degradation. Despite overall structural similarity, the binding surface is not well conserved in ActVA-Orf66 from S. coelicolor (PDB:1lq9), a protein involved in antibiotic synthesis that is known to bind 6-deoxydihydrokalafungin (6-DHHK)[51]. Expectedly, this surface forms the most distant branch of the clustering. Although all proteins appear to function as monooxygenases they operate on very different substrates suggesting that convergent evolution may be an important driving force to evolve new functions from existing protein scaffolds. In this manner, surface analysis could be used to define a chemical structure space by interpolating between known substrates clustered on each node.
Binding Site Retrieval of Functionally Diverse and Conformationally Variable Nucleotides
Specific Nucleotide Binding Site Retrieval
ATP is a multifunctional nucleotide associated that has been classified to catalyze 58 different reactions by the Enzyme Commission (EC). In over 300 structural complexes, ATP binding is associated with domains from 45 homologous superfamilies, some sharing less than 8% sequence identity[52]. The nucleotide is quite flexible and adopts a wide range of conformations, some in less than energetically favorable states[52]. To determine the extent that conformational variability exerts on similarity searching, we conducted retrieval experiments with query surfaces binding ATP in diverse conformations: cAMP- dependent kinase (PDB:1atp) protein kinase CK2 from Z. mays (PDB:1a6o)[53], ATP:corrinoid adenosyltransferase from S. typhimurium (PDB:1g5t)[54], PurT-encoded glycinamide ribonucleotide transformylase from E. coli (PDB:1kj8)[55]. The conformations were selected by clustering all ATP molecules by their three-dimensional shape similarity (see Methods) and are shown in Figure 10.
The AUCs calculated using the SurfaceScreen score were 79.1%, 80.1%, 83.0%, and 85.4% for ATP:corrinoid adenosyltransferase, PurT-encoded glycinamide ribonucleotide transformylase, protein kinase CK2 and cAMP-dependent kinase, respectively (Figure 10e). The extended ATP form, which is the most prominent form in the PDB, had the best retrieval rate, while the bent form ATP had the poorest. Overall, the rates underperform compared to the HIV-1 inhibitor and heme binding surface retrievals. Despite the influence of ligand conformation of surface conformations, our method appears rather tolerant to flexible ligands (and their corresponding binding surfaces) albeit at the expense of the specificity seen in more rigid molecules.
It should be noted that the retrieval rates for ATP are especially conservative, as a disproportional number of ATP binding surfaces complexed with other molecules are in the PDB. For example, there are 11 structures of Protein Kinase A from Bos taurus (PDB:1xha,1xh8,1xh7,1xh6,1xh5,1xh4,1veb,1svg,1sve,1svh) which have ATP competitive inhibitors bound. By correcting for protein kinase inhibitor complexes, the AUCs improve approximately 4% across all query surfaces. Unfortunately, there is no automated method to associate these types of complexes with their native cofactors except through literature analysis, which can be uninformative, as every structure deposit does not have a corresponding publication. The authors are developing an automated database that catalogs such natural cofactor/inhibitor relationships between structures in the PDB.
Non-specific Nucleotide Retrieval
In some proteins, ligand binding is not an exclusive event, as some proteins are able to utilize different cofactors to catalyze the same reaction. Casein kinase 2 (CK2) is a highly conserved eukaryotic serine/threonine kinase that plays a key role in various cellular processes and possesses dual-cosubstrate specificity for guanosine-5'-triphosphate (GTP) or ATP[53]. This feature, whose biological significance is not well understood, is exceptional among eukaryotic protein kinases. Querying the CK2 ATP binding surface (Figure 10b), we can retrieve GTP binding surfaces from the GPSS library with AUC of 83.7%, slightly better than ATP. Given that the two molecules differ only in their nucleoside, this result is not surprising, as we have shown ligand shape complimentarity is a strong precursor of overall surface similarity. In a comparison of the surface retrieval rates for all nucleotides binding surfaces in the PDB against the CK2 ATP binding surface, we observe trends which mirror ligand shape similarity: purine derivatives retrieval is better than the pyrimidines and tri-phosphate molecules retrieval is better than di-phosphates which are retrieved better than mono-phosphates. These results suggest that our method may be useful to identify a diverse set of molecular shapes that could potentially bind to a given surface.
Prediction and Validation of a GDP Binding Site
The F420 coenzyme plays important roles in archaea and eubacteria in a variety of biochemical reactions (e.g. methanogenesis, the formation of secondary metabolites, the degradation of nitroaromatic compounds, DNA repair)[56]. CofE, a F420-0:gamma-glutamyl ligase, is responsible for the last two enzymatic steps in coenzyme F420-2 biosynthesis. Belonging to a structurally uncharacterized family of enzymes, CofE from A. fulgidus was solved as a structural genomics target by the Midwest Center for Structural Genomics and found to be of novel fold (PDB:2g9i) (Figure 4a). Solvent accessible cavities were calculated using the CASTp webserver [27–29, 34], and the largest pocket, presumably the F420, GTP and L-glutamate binding pocket, was selected to query against the GPSS ligand surface library. The top-ranking surface was from GDP-binding protein from B. taurus (PDB:1tad, red, Figure 4c). A GDP molecule is posed into the surface based on the superposition from the alignment (Figure 11a, red GDP molecule). Based on this prediction, the protein was co-crystallized with GDP and a model of the complex was determined (Figure 11a, green GDP molecule)[56]. The GDP position had RMSD of 1.0Ǻ from the predicted pose (Figure 11b). The addition of the ligand also improved the resolution of the structure from 2.50Ǻ to 1.35Ǻ and allowed two loop regions to be modeled where no electron density was previously seen (Figure 11a, magenta).
ATP Binding Surface Landscape
The contributions of molecular flexibility only partially account for reduced retrieval rates observed for ATP binding surfaces. It is surprising that binding surfaces for this essential nucleotide do not exhibit a greater level of conservation. To explore the global relationship between surfaces, enzymatic functions and ligand conformation, we carried out an all-against-all comparison for a limited homology dataset (<50% whole sequence identity) of ATP binding surfaces. This cutoff was selected to encourage surface diversity between functionally homologous proteins yet eliminate redundant analysis. A distance matrix for the 116 surface set was constructed using the SurfaceScreen score and complete linkage clustering was applied. A dendrogram of the clustering is shown in Figure 12.
The cluster results show that there is minimal functional exclusivity between binding surfaces and ATP conformation. The same enzymatic functions can be accomplished using a variety of binding surfaces and, within each surface, multiple ligand conformations can be bound. In the most well represented functional families, hyrdolases, ligases, and transferases, we observe different degrees of binding mode conservation. A breakdown of surface clustering and ligand conformations is shown as a balloon plot in Figure 13. Hydrolases have two conformation preferences and favor, deep, encapsulating binding surfaces. Bent form ATP is disfavored in hydrolases. Ligases are the most conserved, heavily favoring the bent form of ATP that requires a wide-mouth surface shape. Transferases are the most adept of the ATP binding proteins, sampling the most surface/conformation combinations. They do not discriminate between ATP conformations but have a preference for encapsulating binding sites. Several combinations occur with higher frequencies, including an exclusive combination (4-◆), which is the most observed in this family.
Analysis of a broad collection of ATP binding proteins suggest that some functional families may have conserved binding surfaces while others are more divergent. Binding surfaces themselves also deviate on their level of ligand conformation tolerance. It is likely that altering protein surfaces may be the most cost effective evolutionary mechanism for exploiting functional niches, even within functional families.
Automated Protein Kinase Classification by ATP Binding Site Comparisons
Within the transferase family, the most well conserved ATP binding surfaces belong to protein kinases. Protein kinases play vital roles in regulating cellular pathways by phosphorylating other proteins. Malfunctioning kinases have been linked to a variety of diseases such as immunodeficiency, endocrine disorders and cancer, making them the target of drug discovery efforts. At their highest level, kinases are divided by the amino acid residues they target (serine/threonine or tyrosine) and further classified by more specific biochemical activity. Functional classification of kinase families has been undertaken by many methodologies utilizing primary sequence, structure, binding sites, pharmacophore profiles and expert manual analysis [57–60]
We apply our comparison methodology to protein kinases to assess our ability to automatically classify them into their functional subfamilies using only ATP binding surfaces. A dataset of 297 protein kinases with bound ligands, including both natural and synthetic molecules, were annotated using a combination of the PDB web query system, EC numbers, KinBase[61] and the Protein Kinase Resource[62]. For consistency, family nomenclature is applied from KinBase.
An all-against-all comparison was performed with results used to populate a distance matrix of SurfaceScreen scores. A dendrogram showing the complete-linkage clustering is shown in Figure 14 where each surface node is color coded by kinase subfamily. Overall, the method shares strong agreement with the annotated classification, as seen by the color banding. CDK2 kinases are the most ordered; with all members perfectly clustered together and distinct sub-grouping separating nucleotide ligands from small compound inhibitors. The CK2 and CAMP families also show divergence between natural ligands and inhibitors. The CAMP groupings are further clustered by the molecular weight of their bound ligands. The mitogen activated protein kinases (MAP) are successfully classified into their sub-families, but on distant nodes in the graph. In all families, we observe differentiation based on the activation state of the kinase.
Surface based classification conveys many similarities to other methods, but has advantages of additional functional insight that could not be automated in other methods. The ability to distinguish between different activation states and to organized surfaces based on ligand types and molecular weight differences could prove useful in developing binding profiles for enhanced specificity in kinase drug discovery.
Binding Surface Similarity of c-Abl Kinase Inhibitors
The surreptitious fusion of the cellular form of Abelson leukemia virus tyrosine kinase (c-Abl) with the breakpoint cluster region (BCR) gene disrupts the internal control mechanism causing increased tyrosine kinase activity[63]. The fusion protein BCR-Abl results in the disease chronic myelogenous leukemia (CML). Five structures of c-Abl proteins can be found in the PDB with two classes of small molecule inhibitors: pyrido [2,3.d]pyrimidine-type (PDB:1m52, 1opk) and 2-phenylaminopyrimidine-type (PDB:1iep, 1fpu, 1opj). Both inhibitor classes bind in the ATP binding, but 2-phenylaminopyrimidine-type bind exclusively in the inactive conformation of the activation loop. The smaller pyrido [2,3.d]pyrimidine-type class are indifferent to activation state, making them more potent but less specific inhibitors[64]. Our clustering accounts for this behavior and groups them in distinct nodes.
The 2-phenylaminopyrimidine-type inhibitor STI-571 (Figure 15a) is an effective inhibitor of c-Abl activity for treatment against CML [64–66]. It has been shown to be specific for tyrosine kinases and also inhibits stem-cell factor receptor kinase c-Kit (PDB:1t46). Results from querying c-Abl (PDB:1opj, Figure 15a) against the GPSS library show cKit is the best scoring non-ABL kinase. This cross reactivity is detected in our cluster (Figure 14b).
A surprising member of this cluster node is serine/threonine kinase p38 mitogen-activated protein (MAP) kianse (PDB:1kv2, Figure 14b). p38 MAP kinases play critical roles in regulation of proinflammatory cytokines such as tumor necrosis factor and interleukin-1 and are a target for many inflammatory diseases[67]. STI-571 is not currently known to inhibit, by design or mechanism, any serine/threonine kinase[68]. The structure of 1kv2, complexed with inhibitor B96, occupies a unique conformation for p38 MAPs, where the highly conserved DFG motif is turned out[67] (Figure 15b). This is the first observation of this state in a serine/threonine kinase, where it is a hallmark for tyrosine kinases (Figure 15a).
A superposition, based on the alignment of the binding surfaces of c-Abl and p38 MAP, shows that the inhibitors bind in similar orientation (Figure 15c). STI-571 can be posed into the p38 MAP surface, based on the alignment of the surfaces, with no steric clashing and preserves the orientation of several polar atoms (Figure 15d). While the conformation of this p38 MAP is unique and presumed to occur infrequently, its existence presents opportunity to explore the use of STI-571 and analogs for additional therapeutic uses. Automated surface classification can also provides important cross-reactivity analysis; where unexpected binding sites similarities could result in undesired side effects.