Protein Functional Surfaces: Global Shape Matching and Local Spatial Alignments of Ligand Binding Sites
© Binkowski and Joachimiak; licensee BioMed Central Ltd. 2008
Received: 25 June 2008
Accepted: 27 October 2008
Published: 27 October 2008
Protein surfaces comprise only a fraction of the total residues but are the most conserved functional features of proteins. Surfaces performing identical functions are found in proteins absent of any sequence or fold similarity. While biochemical activity can be attributed to a few key residues, the broader surrounding environment plays an equally important role.
We describe a methodology that attempts to optimize two components, global shape and local physicochemical texture, for evaluating the similarity between a pair of surfaces. Surface shape similarity is assessed using a three-dimensional object recognition algorithm and physicochemical texture similarity is assessed through a spatial alignment of conserved residues between the surfaces. The comparisons are used in tandem to efficiently search the Global Protein Surface Survey (GPSS), a library of annotated surfaces derived from structures in the PDB, for studying evolutionary relationships and uncovering novel similarities between proteins.
We provide an assessment of our method using library retrieval experiments for identifying functionally homologous surfaces binding different ligands, functionally diverse surfaces binding the same ligand, and binding surfaces of ubiquitous and conformationally flexible ligands. Results using surface similarity to predict function for proteins of unknown function are reported. Additionally, an automated analysis of the ATP binding surface landscape is presented to provide insight into the correlation between surface similarity and function for structures in the PDB and for the subset of protein kinases.
It has become apparent that surfaces, comprised of a fraction of the total residues, are the most conserved functional features of proteins. Proteins utilize common surface motifs to create precise chemical environments designed to perform specific functions. These motifs are not restricted to a single protein scaffold but can be found within different protein folds or at domain/domain and subunits interfaces. While biochemical activity can be attributed to a few key residues (e.g catalytic triads), the broader surrounding environment (i.e. auxiliary residues in spatial proximity) often plays an equally import role in fine-tuning molecular recognition and/or catalysis.
Powerful evolutionary forces have allowed proteins to govern ligand binding through seemingly subtle local surface variability. These changes, which are not easily detectable by sequence analysis, may provide competitive advantage for optimization of co-factor specificity. In some circumstances, surface diversity adversely affects normal cell process by providing environments for undesired binding events (e.g. drug side effects) or mutations directly correlated to disease. The conservation of functional surfaces presents an opportunity to compare and analyze proteins independent of sequence or fold. These comparisons can be used to classify protein functions or to infer biochemical activity for proteins with unknown function, such as those targeted by structural genomics programs.
Several methods have been developed detecting localized, spatial protein similarities with applications for evolutionary analysis, function prediction and drug discovery. The use of graph theory has been widely applied to the comparison of three-dimensional patterns. Artymiuk et al. developed an algorithm based on subgraph isomorphism detection to search residue patterns against the PDB. Kinoshita et al. used clique detection algorithms to assign protein biochemical functions using the similarity information of molecular surface geometries and electrostatic potentials. Using a clique-detection algorithm, Schmitt et al., compared generic pseudo-centers that code for possible ligand-protein interactions in protein cavities. Query cavities are searched against Cavbase, a pre-computed database of cavities extracted from the PDB. The method has been applied to identify surfaces in non-homologous proteins as well as for the classification of protein families. Kleywegt searched for motifs of residue pseudo-centers in a library of protein structures using a depth-first search algorithm. Russell also developed an algorithm based on depth-first search that detects atomic geometric patterns common in between side-chains in proteins and presented new examples of convergent evolution. Parametric statistical evaluations of Russell's atomic superposition method were extended by Stark et al. .
Another widely used approach is geometric hashing, which is an efficient method for matching features against a database. Jackson and Gold used geometric hashing to perform an all-against-all comparison of protein-ligand binding sites in the SitesBase database [9–11]. Their method was also applied for functional annotation and building pharmacophore models for drug discovery. Fischer et. al. developed an algorithm based on geometric hashing that detects surface similarities of proteins using spatial patterns of atoms[12, 13]. A similar method, TESS, has been applied for the derivation and matching of annotated spatial templates. JESS, a successor to TESS, searches small groups of atoms under arbitrary constraints on geometry and chemistry and utilized statistics to evaluate matches. It is used to query the Catalytic Site Atlas (CSA) a collection of annotated residues patterns extracted from manual literature searches. JESS is also used in the PROFUNC suite of annotation tools in the reverse template search, where a radius defined perimeter extends a local residue pattern search for improved search specificity.
A protein evolution based method, pvSOAR, was developed that used the unique approach of aligning sub-sequences of surface residues to establish a residue correspondence between surfaces[18, 19]. The residues were then superimposed on each other and statistical significance was evaluated for the resulting RMSD. This method was used to detect similar functional surfaces in non-homologous proteins. Furthermore, in a recent study of shape variation of ligand binding pockets, Kahraman et. al., used a shape-only comparison metric based on spherical harmonics. It was shown that shape descriptors could be used to classify ligand into their binding sites.
In this study, we describe a new method for the sequence order independent comparison and alignment of protein functional surfaces. Our method, SurfaceScreen, attempts to optimize two components, global surface shape and local physicochemical texture, for evaluating the similarity between a pair of surfaces. Surface shape similarity is assessed using a three-dimensional object recognition algorithm and is used to rapidly pre-classify surfaces from a large library of surfaces. Surfaces with sufficient shape complimentarity are then aligned by combinatorially identifying the best superimposition of common residues between the two surfaces. We introduce several metrics for scoring different properties of a surface alignment and an overall scoring function used in library searches. Furthermore, we introduce the Global Protein Surface Survey (GPSS), a library of annotated protein surfaces calculated from all structures in the PDB. Querying surfaces from proteins of unknown function against the GPSS library allows SurfaceScreen to be utilized as predictive tool.
We describe three types of analysis to assess surface shape comparisons and spatial alignments. First, we describe the retrieval of surfaces from the GPSS library for surfaces, from the same protein, that bind ligands of various size, shape and pharmacophore properties. For this we use the example of HIV-1 protease. Second, we use the example of heme (iron-protoporphyrin IX) binding sites to describe the retrieval of a functionally diverse binding surface that binds the same ligand. We provide the example of using our method as an annotation tool, identifying a new member of the heme binding monooxygenase family. Third, we describe how conformational diversity of bound ligands impacts retrieval rate for ubiquitous nucleotide binding sites. We also present the example of a nucleotide binding surface prediction and crystallographic validation for a structural genomics target with a new fold. We conclude with an analysis of the ATP binding surface landscape to provide insight on the correlation between surface similarity and function for structures in the PDB and for the subset of protein kinases.
Theory and Algorithms
While key conserved residues are localized within a surface for function, additional residues contribute to the overall size and shape of the functional surface environment. In some instances, these non-key residues play an important, but non-obvious, functional role. This has been shown for aminopeptidases, where the mouth opening diameter filters peptide access into the active sites[21, 22], or for N-acetyltransferases, where the acetyl-CoA and substrate binding surfaces are sub-pockets within a larger cleft. In other instances, these auxiliary residues have no obvious relevance, being positioned simply as a result of folding and structural requirements, codon pressure or unrelated functional specifications. Therefore, a detachment exists between global and local surface properties of functional sites, which can distort comparison measures and mask similarities.
Our approach to identifying meaningful similarities between two surfaces is accomplished by decomposing surface comparisons into two components, global shape and local physicochemical texture. Global shape comparisons can be a very powerful precursor of overall surface similarity. They are computed very efficiently and can help rapidly reject grossly dissimilar surfaces. This is followed by a non-heuristic spatial residue alignment, which assures the optimal superposition of conserved residues between two surfaces.
Surface Shape Signatures for Rapid Global Surface Shape Comparison
Unlike protein folds, a common lexicon has not emerged to precisely describe the shapes formed by surfaces yet quantitative measures can be computed to empirically describe shape. Here, we introduce a metric, the SurfaceShapeSignature (SSS), which describes the global shape of a protein surface that can be used for comparison. Adapted from three-dimensional database object retrieval, the method represents the signature of an object as a probability distribution sampled from a shape function measuring global geometric properties. The complexity of shape matching is then reduced to the comparison of two probability distributions. This approach was used to quickly and successfully retrieve and classify complex shapes from three-dimensional databases.
Advantages and Limitations of Surface Shape Signatures
The primary advantages of this approach are computational robustness and efficiency. In the original work of Osada, the robustness of shape signatures was verified against a variety of transformations including scale, rotation and mirroring and were insensitive to model simplification. These properties are ideal for applications in structural biology as minor conformational changes or small surface perturbations should not mask overall shape similarities. For example, we should be able to detect the similarity between the apo and bound states of a binding site, yet still discern between the two states. Insensitivity to model simplification allows for meaningful comparisons between surfaces comprised of non-trivial atoms count differences. The implementation and execution of this algorithm is computationally straightforward and allows a query surface to be searched against the GPSS library in minutes.
Biochemical functions rely on a combination of shape and chemical compatibility and therefore one should not expect that shape alone could describe this complexity. Figure 1 shows that NAD, ATP, and heme binding surfaces share similar and, in some cases, overlapping distributions. The SSS comparison can convey gross shape similarity but is void of chemical typing information and cannot be used to infer specific functional information. Instead, it provides a fast and robust comparison metric utilized as the entry point into more comprehensive surface shape matching methodology.
Surface Shape Signature Similarity Threshold
Local Spatial Surface Residue Alignments for Physicochemical Texture Similarity
The spatial arrangement of localized surface patterns and orientation of side chains are used to assess our evaluations of biochemical function complimentarity. An exhaustive enumeration and search algorithm, SurfaceAlign, is applied to detect groups of spatially conserved amino acid residue sets. In this application, the term "conserved" refers to the identical residues common to two protein surfaces. Our three-dimensional representation of a surface residue is represented by a single point located at the center of mass of all atoms identified as contributing to a particular surface.
Statistical Significance of Aligned Surface Distances
RMSD calculations are sensitive to the number of data points compared, making it necessary to assess the statistical significance of raw distance measures. This is accomplished by converting calculated cRMSD and oRMSD values to a probability value (P-value) measuring the likelihood of obtaining a specific RMSD value for a solution with a given number of residues. This allows for the meaningful comparisons between alignments with differing number of common residues. Following the method described by Binkowski et al, random surface alignments were performed for alignment solutions of N res residues, where 3 ≤ N res ≤ 100. Calculations numbering 1010 were computed to construct lookup tables associating cRMSD and oRMSD scores to P-values. To minimize the inherent bias in the PDB toward particular protein families, special consideration was employed to utilize a non-redundant surface library consisting of proteins sharing less than 90% whole-protein sequence similarity.
Surface Volumes Overlap of Aligned Surfaces
While the alignment compares key residues important for shared biochemical function, the volume overlap of an alignment provides a comparison of all atoms belonging to a surface. The overlap volume is defined as the volume difference between the superimposed surfaces and is calculated from the formula:V AB = V A + V B - VA∪B
Where V A , V B , and VA∪Bare the volumes of surface A, B and the superimposed construct AB, respectively. Volumes are calculated using the weighted Delauney triangulation and alpha shape methods [27–29].
A SVOT score is bounded between 0 (representing non-overlapping surfaces) and 1.0 (representing identical surfaces). The SVOT values for the spatial alignment series show in Figure 3d are 0.50, 0.52, 0.73, 0.62, 0.91, and 0.89 when viewed from left (black) to right (white). The best spatial alignment, as judged by RMSD values, does not guarantee the best SVOT score (i.e. SVOT scores are not correlated to RMSD).
Where V a and V b are the volumes of the surfaces formed only by the conserved residues used in the alignment solution. The lSVOT for the alignment in Figure 4g is 0.71, conveying stronger similarity.
The rSVOT is bounded between 0 and 1 and represents the fraction of a surface utilized in the alignment solution. In Figure 4, the rSVOT value of 0.52 confirms that it is a sub-pocket match. The rSVOT score can be used to automatically detect surfaces that do not have the desired properties for a particular search (e.g. excluding sub-surface matches during a library search).
Automated Identification of Ligand Binding Surfaces
Functional surfaces of proteins can be derived from structural information extracted from three-dimensional coordinates. This task can be automated for PDB files deposited with heteroatom records (HETATM) describing atoms belonging to small molecule cofactors rather than to part of a biopolymer chain. To extract the functional surfaces that surround structural features of a protein, an exclusion contact surface is generated by measuring a difference in solvent accessibility between a structure with and without a molecule in proximity. This is illustrated for the NADP binding surface from human pathogen S. pyogenes (PDB:2ahr) in Figure 1a. Atoms with a change in solvent accessibility between the bound and apo structure are identified as the contact surface (Figure 1b). We utilize the Delauney triangulation and alpha shape method for measuring solvent accessibility [27–29].
A Library of Protein Functional Surfaces
An exclusion contact surface has been calculated for every heteroatom molecule associated with a protein's PDB file and organized into the Global Protein Surface Survey (GPSS). The GPSS contains three-dimensional libraries of functionally annotated surfaces from ligand, DNA, metal and peptide binding surfaces and is updated weekly to correspond to PDB deposits. Libraries are publicly accessible through a web browser or via a PyMol plugin. In this study, we utilize only the ligand binding surfaces of the GPSS: 113,921 members representing 5,575 unique ligands (PDB version: November 2006). For this subset of the GPSS, the average number of residues forming a surface is 12 and the average molecular weight of the bound ligands is 305. To reduce redundancy and improve search efficiency, we further limit our search library to a single ligand of each type from each protein deposit. The first ligand of each type, as described in the PDB file, was selected.
Similarity Searching Surface Libraries
The SurfaceScreen score is bounded between 0 and 3, and represents contributions from global shape similarity, local spatial residue alignment, and global alignment volume overlap. The denominator in the cRMSD P-value component reflects the maximum probability value estimated by our statistical significance evaluations, 10-9. The oRMSD, lSVOT, and rSVOT scores are omitted because they are too highly correlated to the cRMSD and gSVOT scores, respectively. They are, however, still useful for post-processing results.
Data Analysis and Classification
Receiver Operator Characteristic Curves
Surface retrieval benchmarking experiments are summarized in a Receiver Operator Characteristic (ROC) curve, where the sensitivity is plotted against its specificity at various significance levels of summed probabilities. In the ROC curve, the x-axis represents the false positive rate, or 1-specificity, which is calculated by as 1-TN/(TN+FP), where TN is the number of true negatives and FP is the number of false positives. The y- axis represents the true positive rate, or sensitivity, and is calculated as TP/(TP+FN), where FN is the number of false negatives. An overall performance measure of a classification test can be calculated by the area under the ROC curve (AUC). Bound between 0 and 1, an AUC of 1 is indicative of a perfectly accurate classification test, in which all true positives are distinguished from false positive. An AUC of 0.5 corresponds to a random classification test (e.g. a coin flip). The AUC is a combined measure of sensitivity and specificity.
For our SurfaceScreen methodology, the ROC curves measure our ability to accurately identify similar surfaces from a large database. A true-positive data point in our database retrieval experiments is defined when the ligand from the query surface matches the ligand from the corresponding library surface (e.g. retrieving heme binding sites when a heme binding site is used as the query). This definition should be considered conservative, as protein surfaces have the ability to bind multiple ligands and, in some cases, a false-positive prediction may indeed be a biologically relevant hit.
ATP Conformation Classification
The coordinates for all ATP molecules in the PDB were identified and extracted. Multiple occurrences of the ligand in structure deposit were included and treated as unique molecules. A pairwise, least-squares superposition of all remaining molecules was performed and RMSD values recorded in a distance matrix. Complete linkage clustering was applied to the data matrix. For clarity, we chose to discover the minimum number of conformation families that would accurately represent all ATP molecules. A range of cut values was tested and we chose to set the number of clusters to four based on manual visualization and analysis. The four conformations represent the bent, extended, and two intermediary forms of ATP.
Results and discussion
Surface-Based Retrieval of Binding Sites for the Same Protein: HIV-1 Protease
The utility of surface shape comparisons was assessed by retrieving functionally homologous human immunodeficiency virus (HIV-1) protease-ligand complexes from the GPSS. HIV-1 protease is an essential aspartyl protease that cleaves nascent polypeptides enabling maturation of viral proteins. Inactivation of the protease blocks production of infectious viral particles. Therefore, HIV-1 has been an active target and one of the early success stories of rational drug design. We identified 151 HIV-1 protein-inhibitor complexes deposited in the PDB with the following criteria: proteins are in the dimer conformation, inhibitors are not compound fragments (molecular weight >100), and inhibitors are unique in our dataset. The proteins in our dataset share at least 48% sequence similarity and secondary structure similarity Z-scores greater 9.0, as measured using the secondary structure matching (SSM) algorithm.
Next, we performed the same search using only the spatial alignment scores to evaluate similarity. We observe that all three alignment-based scoring measures provide better specificity than SSS distance score. The AUC for cRMSD P-value, oRMSD P-value, and SVOT are: 97.5%, 96.8% and 93.0%, respectively. The ROC plots are shown for each measure in Figure 6e. The improved specificity of the spatial alignment scores comes at a significant runtime disadvantage. The SSS shape retrieval method took 24 minutes to compare the query to the GPSSS library, while the spatial alignment took 1,657 minutes. When using the SSS scores to pre-filter the search library, as described in the SurfaceScreen methodology, we can achieve an AUC of 95.3% for the combined SurfaceScreen (Figure 6e) score with an overall runtime of 148 minutes. The shape signature filter reduced the library 86%, to just over 4,000 surfaces, yet did not eliminate any true positives from the library.
Using the SurfaceScreen score, the most similar (rank 132) non HIV-1 surface was from plasmepsin II from P. falciparum (PDB:1lf3, E.C. 188.8.131.52, CATH 184.108.40.206), another aspartic protease and a major virulence factor.P. falciparm is a species of Plasmodiums that causes one of the major malaria infections in humans. Plasmepsin II plays an essential role in P. falciparm in the degradation of hemoglobin as a source of amino acids for growth and maturation. The binding surface of inhibitor EH5 exhibited strong similarity to the HIV-1 inhibitor binding site, with a SurfaceScreen score of 1.89. Plasmepsin II shares only 12% sequence identity to HIV-1 protease and an SSM alignment produces a non-significant Z-score of 3.3. The surfaces are both formed at the intersection of loops and β-sheets, although the plasmepsin II binding surface is formed from a single chain, unlike HIV-1 protease, which is occurs at a dimer interface. There are 15 residues that are conserved between the two surfaces. While it is not surprising that proteases share a similar binding site, the low level of sequence and secondary structure similarity highlights that localized functional conservation can be found in surfaces. Our observation is in agreement with recent reports where HIV-1 inhibitors have been shown to be effective antimalarial agents.
The highest-ranking non-protease surface was from aclacinomycin methylesterase (RdmC) from S. purpurascens (PDB:1q0r, Figure 6fg). RdmC modifies the aklavinone skeleton in the biosynthesis of anthracyclines in Streptomyces species. Anthracyclines are a class of aromatic polyketide antibiotics used as chemotherapuetic agents to treat a wide range of cancers, including leukemia, lymphoma, and breast, uterine, ovarian and lung cancers. Despite sharing only 7% sequence identity and being built present in different scaffolds (CATH 220.127.116.110), the RdmC binding surface of DcmaT (Figure 6h), was found to be similar with a SurfaceScreen score of 1.81 (ranked at position 134). The superposition of the surfaces is shown in Figure 6ij and with their corresponding inhibitors in Figure 6k. The surprising similarity of these surfaces has significant medicinal impact as it supports the recent reports of the inhibitory effects of anthracycline agents on protease activity. This result shows how identification of similar binding surfaces (and their corresponding ligands) can provide guidance for structure based drug discovery, not only in scaffold design but also to screen against potential undesirable binding site similarities that could result in undesired side effects.
Retrieval and Prediction of Heme Binding Surfaces
The variability of heme binding proteins presents considerable challenges for automated identification and retrieval of hemoproteins from sequence and structure databases. Using myoglobin from P. catodon (PDB:1mbn, Figure 7ab) as a query protein, we compared it to a non-redundant (<95% sequence identity) PDB set using BLAST. Using the 690 heme binding proteins in the PDB as true-positive hits, a retrieval rate of 68.7% is calculated from the rank order of sorted (by E-value) search results. Comparing the structure of myoglobin against the same set of proteins using SSM search results provides a retrieval rate of 64.4%. The retrieval rate was calculated from the rank order of sorted Z-scores. The ROC plots are shown for both methods in Figure 7h.
The myoglobin query was then searched against the GPSS library. The retrieval rate, using the SurfaceScreen score, is 94.8%. All surface scoring measures had superior performance over sequence and structure methods (Figure 7h). The most selective was the SSS KS distance with retrieval at 95.8%. Despite the variability of topologies forming binding surfaces, the binding surface shape appears to be the most conserved feature of hemoproteins. This can be seen in the shape signature plots for myoglobin, nitrophorin and iNOS (Figure 7h). The iNOS heme binding pocket is the lowest ranking true-positive surface against our query, as observed by the stark difference in shape signatures. It appears that despite evolutionary pressure imposed for functional specification, surface must maintain geometry necessary to accommodate the canonical heme shape. Surface shape is better conserved for heme binding than the amino acid environment. These observations agree with that of Schneider in which heme binding interactions were found to be generally diverse, with the exception of only three amino acids at "hot spots".
While true heme binding surfaces dominate the top scoring surfaces we find that other binding surfaces have surprising similarity to our query surface. The highest ranking, at position 42, was the Ampcpr binding surface from ADP-ribose pyrophosphatase (ADPRase) from E. coli (PDB:1khz). A Nudix hydrolase enzyme, ADPRase catalyzes the Mg2+-dependent hydrolysis of ADP-ribose to AMP and ribose 5-phosphate. The surface, formed at the intersection of β-sheets and loops, is shown in Figure 7ij. The SSS plot confirms that strong visual shape similarity between the two surfaces (Figure 7g, red). A shape-based superposition (ROCS, OpenEye Scientific, Inc.) of the ligands shows that (Figure 7j) the similarity of these functionally unrelated proteins may lie strictly in their ability to accommodate similarly sized molecules.
Detection of a Convergent Heme Binding Surface
Convergent evolution presents a far more difficult challenge for annotation of proteins of unknown function. Structural genomics target, IsdG from S. aureus (PDB:1xbw), shows no significant sequence similarity and does not contain the conserved N-terminal histidine or the GXXXG motif characteristic present in the heme-monooxygenases family, yet this enzyme displays classical heme-monooxygenase activity. Also, while all known members of heme-oxygenase superfamily are of all α-helical fold, IsdG adopts α+β sandwich with an anti-parallel β-sheet and ferredoxin-like fold and a β-barrel at the dimer interface.
Binding Site Retrieval of Functionally Diverse and Conformationally Variable Nucleotides
Specific Nucleotide Binding Site Retrieval
The AUCs calculated using the SurfaceScreen score were 79.1%, 80.1%, 83.0%, and 85.4% for ATP:corrinoid adenosyltransferase, PurT-encoded glycinamide ribonucleotide transformylase, protein kinase CK2 and cAMP-dependent kinase, respectively (Figure 10e). The extended ATP form, which is the most prominent form in the PDB, had the best retrieval rate, while the bent form ATP had the poorest. Overall, the rates underperform compared to the HIV-1 inhibitor and heme binding surface retrievals. Despite the influence of ligand conformation of surface conformations, our method appears rather tolerant to flexible ligands (and their corresponding binding surfaces) albeit at the expense of the specificity seen in more rigid molecules.
It should be noted that the retrieval rates for ATP are especially conservative, as a disproportional number of ATP binding surfaces complexed with other molecules are in the PDB. For example, there are 11 structures of Protein Kinase A from Bos taurus (PDB:1xha,1xh8,1xh7,1xh6,1xh5,1xh4,1veb,1svg,1sve,1svh) which have ATP competitive inhibitors bound. By correcting for protein kinase inhibitor complexes, the AUCs improve approximately 4% across all query surfaces. Unfortunately, there is no automated method to associate these types of complexes with their native cofactors except through literature analysis, which can be uninformative, as every structure deposit does not have a corresponding publication. The authors are developing an automated database that catalogs such natural cofactor/inhibitor relationships between structures in the PDB.
Non-specific Nucleotide Retrieval
In some proteins, ligand binding is not an exclusive event, as some proteins are able to utilize different cofactors to catalyze the same reaction. Casein kinase 2 (CK2) is a highly conserved eukaryotic serine/threonine kinase that plays a key role in various cellular processes and possesses dual-cosubstrate specificity for guanosine-5'-triphosphate (GTP) or ATP. This feature, whose biological significance is not well understood, is exceptional among eukaryotic protein kinases. Querying the CK2 ATP binding surface (Figure 10b), we can retrieve GTP binding surfaces from the GPSS library with AUC of 83.7%, slightly better than ATP. Given that the two molecules differ only in their nucleoside, this result is not surprising, as we have shown ligand shape complimentarity is a strong precursor of overall surface similarity. In a comparison of the surface retrieval rates for all nucleotides binding surfaces in the PDB against the CK2 ATP binding surface, we observe trends which mirror ligand shape similarity: purine derivatives retrieval is better than the pyrimidines and tri-phosphate molecules retrieval is better than di-phosphates which are retrieved better than mono-phosphates. These results suggest that our method may be useful to identify a diverse set of molecular shapes that could potentially bind to a given surface.
Prediction and Validation of a GDP Binding Site
ATP Binding Surface Landscape
Analysis of a broad collection of ATP binding proteins suggest that some functional families may have conserved binding surfaces while others are more divergent. Binding surfaces themselves also deviate on their level of ligand conformation tolerance. It is likely that altering protein surfaces may be the most cost effective evolutionary mechanism for exploiting functional niches, even within functional families.
Automated Protein Kinase Classification by ATP Binding Site Comparisons
Within the transferase family, the most well conserved ATP binding surfaces belong to protein kinases. Protein kinases play vital roles in regulating cellular pathways by phosphorylating other proteins. Malfunctioning kinases have been linked to a variety of diseases such as immunodeficiency, endocrine disorders and cancer, making them the target of drug discovery efforts. At their highest level, kinases are divided by the amino acid residues they target (serine/threonine or tyrosine) and further classified by more specific biochemical activity. Functional classification of kinase families has been undertaken by many methodologies utilizing primary sequence, structure, binding sites, pharmacophore profiles and expert manual analysis [57–60]
We apply our comparison methodology to protein kinases to assess our ability to automatically classify them into their functional subfamilies using only ATP binding surfaces. A dataset of 297 protein kinases with bound ligands, including both natural and synthetic molecules, were annotated using a combination of the PDB web query system, EC numbers, KinBase and the Protein Kinase Resource. For consistency, family nomenclature is applied from KinBase.
Surface based classification conveys many similarities to other methods, but has advantages of additional functional insight that could not be automated in other methods. The ability to distinguish between different activation states and to organized surfaces based on ligand types and molecular weight differences could prove useful in developing binding profiles for enhanced specificity in kinase drug discovery.
Binding Surface Similarity of c-Abl Kinase Inhibitors
The surreptitious fusion of the cellular form of Abelson leukemia virus tyrosine kinase (c-Abl) with the breakpoint cluster region (BCR) gene disrupts the internal control mechanism causing increased tyrosine kinase activity. The fusion protein BCR-Abl results in the disease chronic myelogenous leukemia (CML). Five structures of c-Abl proteins can be found in the PDB with two classes of small molecule inhibitors: pyrido [2,3.d]pyrimidine-type (PDB:1m52, 1opk) and 2-phenylaminopyrimidine-type (PDB:1iep, 1fpu, 1opj). Both inhibitor classes bind in the ATP binding, but 2-phenylaminopyrimidine-type bind exclusively in the inactive conformation of the activation loop. The smaller pyrido [2,3.d]pyrimidine-type class are indifferent to activation state, making them more potent but less specific inhibitors. Our clustering accounts for this behavior and groups them in distinct nodes.
A surprising member of this cluster node is serine/threonine kinase p38 mitogen-activated protein (MAP) kianse (PDB:1kv2, Figure 14b). p38 MAP kinases play critical roles in regulation of proinflammatory cytokines such as tumor necrosis factor and interleukin-1 and are a target for many inflammatory diseases. STI-571 is not currently known to inhibit, by design or mechanism, any serine/threonine kinase. The structure of 1kv2, complexed with inhibitor B96, occupies a unique conformation for p38 MAPs, where the highly conserved DFG motif is turned out (Figure 15b). This is the first observation of this state in a serine/threonine kinase, where it is a hallmark for tyrosine kinases (Figure 15a).
A superposition, based on the alignment of the binding surfaces of c-Abl and p38 MAP, shows that the inhibitors bind in similar orientation (Figure 15c). STI-571 can be posed into the p38 MAP surface, based on the alignment of the surfaces, with no steric clashing and preserves the orientation of several polar atoms (Figure 15d). While the conformation of this p38 MAP is unique and presumed to occur infrequently, its existence presents opportunity to explore the use of STI-571 and analogs for additional therapeutic uses. Automated surface classification can also provides important cross-reactivity analysis; where unexpected binding sites similarities could result in undesired side effects.
Proteins maintain the surprising ability to preserve local, sequentially unordered, surface residue patterns capable of performing explicit biochemical functions in proteins showing negligible evolutionary relationships. Even in homologous proteins, subtle amino acid mutations, which can be underappreciated by sequence analysis, can alter the properties of a surface and protein function. In this study, we describe a novel method for the comparison and analysis of protein functional surfaces. We observe that conservation of surface shape and physicochemical texture provides sufficient discriminative features for accurate retrieval of functionally homologous binding sites. The method serves as a predictive tool allowing for the identification of cross-binding ligands or binding sites on proteins of unknown function and for the comparative analysis of proteins, such as the classification of functionally diverse families. We introduced the Global Protein Surface Survey, a searchable library of functionally annotated protein sites. The SurfaceScreen methodology was benchmarked against binding pockets from HIV-1 protease, heme, and ATP and used to further analyze the relationship between surface similarity, biochemical function and ligand conformation.
Limitations and Outlook of SurfaceScreen
Our results have shown that two components, shape and physicochemical texture, define well the functional competence of surfaces. Surface geometry allows accessibility and proximity for interaction and accurate residue positioning make available specific functional groups for biochemical function. A notable limitation to our current method is our spatial residue model, which does not afford for amino acid substitutions during spatial alignments. Previous studies have shown that the substitution rates for localized surfaces differ from those of the whole sequences[69, 70] and that these differences can provide better discrimination in surface sequence comparisons. One option would be to exclude residues less likely to be involved in function. For example, in the study of general enzyme function, Ysteng et al studied 3,275 functional surfaces to discover His, Asp, Glu, Ser, and Cys residues account for more than 80% of active site residues in functional pockets. This is similar to previously published reports [71–73]. Defining a minimum residue set describing different functions is plausible, but would reintroduce difficulties arising from global versus local surfaces characteristics and properties. This would also naively assume that some of the residues excluded in a scheme are not contributing in some way, either mechanistically or structurally, to the function. Many proteins also show high promiscuity and can bind several different ligands into this same functional site, solvent mediated interactions add complexity to the surface comparisons and electrostatic potential also plays important role in conformational changes and attracting or rejecting ligands. It is clear that in the future analysis of protein surfaces properties will need to include contribution from electrostatic potential, side chain dynamics, and chemical propensities to better describe functional sites.
We wish to thank Dr. Jie Liang for discussion of ideas and access to CASTp database, all members of the Midwest Center for Structural Genomics for making their data available and discussions. Molecular graphics were created using PyMOL. This work was supported by National Institutes of Health Grants GM62414 and GM074942 and by the U.S. Department of Energy, Office of Biological and Environmental Research, under contract DE-AC02-06CH11357.
- Ye Y, Li Z, Godzik A: Modeling and analyzing three-dimensional structures of human disease proteins. Pac Symp Biocomput 2006, 439–450.
- Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P: A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J Mol Biol 1994, 243(2):327–344.View Article
- Kinoshita K, Nakamura H: Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci 2003, 12(8):1589–1595.View Article
- Schmitt S, Kuhn D, Klebe G: A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol 2002, 323(2):387–406.View Article
- Kuhn D, Weskamp N, Schmitt S, Hullermeier E, Klebe G: From the similarity analysis of protein cavities to the functional classification of protein families using cavbase. J Mol Biol 2006, 359(4):1023–1044.View Article
- Kleywegt GJ: Recognition of spatial motifs in protein structures. J Mol Biol 1999, 285(4):1887–1897.View Article
- Russell RB: Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J Mol Biol 1998, 279(5):1211–1227.View Article
- Stark A, Sunyaev S, Russell RB: A model for statistical significance of local similarities in structure. J Mol Biol 2003, 326(5):1307–1316.View Article
- Gold ND, Jackson RM: Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J Mol Biol 2006, 355(5):1112–1124.View Article
- Gold ND, Jackson RM: A searchable database for comparing protein-ligand binding sites for the analysis of structure-function relationships. J Chem Inf Model 2006, 46(2):736–742.View Article
- Gold ND, Jackson RM: SitesBase: a database for structure-based protein-ligand binding site comparisons. Nucleic Acids Res 2006, (34 Database):D231–234.
- Fischer D, Norel R, Wolfson H, Nussinov R: Surface motifs by a computer vision technique: searches, detection, and implications for protein-ligand recognition. Proteins 1993, 16(3):278–292.View Article
- Norel R, Fischer D, Wolfson HJ, Nussinov R: Molecular surface recognition by a computer vision-based technique. Protein Eng 1994, 7(1):39–46.View Article
- Wallace AC, Borkakoti N, Thornton JM: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997, 6(11):2308–2323.View Article
- Barker JA, Thornton JM: An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics 2003, 19(13):1644–1649.View Article
- Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, (32 Database):D129–133.
- Laskowski RA, Watson JD, Thornton JM: ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res 2005, (33 Web Server):W89–93.
- Binkowski TA, Adamian L, Liang J: Inferring functional relationships of proteins from local sequence and spatial surface patterns. J Mol Biol 2003, 332(2):505–526.View Article
- Binkowski TA, Freeman P, Liang J: pvSOAR: detecting similar surface patterns of pocket and void surfaces of amino acid residues on proteins. Nucleic Acids Res 2004, (32 Web Server):W555–558.
- Kahraman A, Morris RJ, Laskowski RA, Thornton JM: Shape variation in protein binding pockets and their ligands. J Mol Biol 2007, 368(1):283–301.View Article
- Taylor A: Aminopeptidases: structure and function. Faseb J 1993, 7(2):290–298.
- Russo S, Baumann U: Crystal structure of a dodecameric tetrahedral-shaped aminopeptidase. J Biol Chem 2004, 279(49):51275–51281.View Article
- Dyda F, Klein DC, Hickman AB: GCN5-related N-acetyltransferases: a structural overview. Annu Rev Biophys Biomol Struct 2000, 29: 81–103.View Article
- Osada R, Funkhouser T, Chazelle B, Dobkin D: Shape Distributions. ACM Transactions on Graphics 2002, 21(4):807–832.View Article
- Chakravarti IM, Laha RG, Roy J: Handbook of Methods of Applied Statistics. Volume I. John Wiley and Sons; 1967.
- Umeyama S: Least-squares estimation of transformation paremeters between two point patterns. IEEE Trans Pattern Anal Mach Intell 1991, 13: 376–380.View Article
- Liang J, Edelsbrunner H, Fu P, Sudhakar PV, Subramaniam S: Analytical shape computation of macromolecules: I. Molecular area and volume through alpha shape. Proteins 1998, 33(1):1–17.View Article
- Liang J, Edelsbrunner H, Fu P, Sudhakar PV, Subramaniam S: Analytical shape computation of macromolecules: II. Inaccessible cavities in proteins. Proteins 1998, 33(1):18–29.View Article
- Liang J, Edelsbrunner H, Woodward C: Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 1998, 7(9):1884–1897.View Article
- Willett P, Barnard J, Downs G: Chemical Similarity Searching. J Chem Inf Comput Sci 1998, 38: 983–996.View Article
- Rush TS 3rd, Grant JA, Mosyak L, Nicholls A: A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem 2005, 48(5):1489–1495.View Article
- Global Protein Surface Survey[http://gpss.mcsg.anl.gov]
- DeLano WL: The PyMOL Molecular Graphics System. Palo Alto, CA, USA: DeLano Scientific; 2002.
- Binkowski TA, Naghibzadeh S, Liang J: CASTp: Computed Atlas of Surface Topography of proteins. Nucleic Acids Res 2003, 31(13):3352–3355.View Article
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21(20):3940–3941.View Article
- Wlodawer A, Erickson JW: Structure-based inhibitors of HIV-1 protease. Annu Rev Biochem 1993, 62: 543–585.View Article
- Wlodawer A, Vondrasek J: Inhibitors of HIV-1 protease: a major success of structure-assisted drug design. Annu Rev Biophys Biomol Struct 1998, 27: 249–284.View Article
- Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 12 Pt 1):2256–2268.View Article
- Orengo CA, Pearl FM, Thornton JM: The CATH domain structure database. Methods Biochem Anal 2003, 44: 249–271.
- Asojo OA, Gulnik SV, Afonina E, Yu B, Ellman JA, Haque TS, Silva AM: Novel uncomplexed and complexed structures of plasmepsin II, an aspartic protease from Plasmodium falciparum. J Mol Biol 2003, 327(1):173–181.View Article
- Parikh S, Gut J, Istvan E, Goldberg DE, Havlir DV, Rosenthal PJ: Antimalarial activity of human immunodeficiency virus type 1 protease inhibitors. Antimicrob Agents Chemother 2005, 49(7):2983–2985.View Article
- Jansson A, Niemi J, Mantsala P, Schneider G: Crystal structure of aclacinomycin methylesterase with bound product analogues: implications for anthracycline recognition and mechanism. J Biol Chem 2003, 278(40):39006–39013.View Article
- Fekete MR, McBride WH, Pajonk F: Anthracyclines, proteasome activity and multi-drug-resistance. BMC Cancer 2005, 5: 114.View Article
- Lupas AN, Ponting CP, Russell RB: On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 2001, 134(2–3):191–203.View Article
- Schneider S, Marles-Wright J, Sharp KH, Paoli M: Diversity and conservation of interactions for binding heme in b-type heme proteins. Nat Prod Rep 2007, 24(3):621–630.View Article
- Fischmann TO, Hruza A, Niu XD, Fossetta JD, Lunn CA, Dolphin E, Prongay AJ, Reichert P, Lundell DJ, Narula SK, et al.: Structural characterization of nitric oxide synthase isoforms reveals striking active-site conservation. Nat Struct Biol 1999, 6(3):233–242.View Article
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402.View Article
- Gabelli SB, Bianchet MA, Ohnishi Y, Ichikawa Y, Bessman MJ, Amzel LM: Mechanism of the Escherichia coli ADP-ribose pyrophosphatase, a Nudix hydrolase. Biochemistry 2002, 41(30):9279–9285.View Article
- Wu R, Skaar EP, Zhang R, Joachimiak G, Gornicki P, Schneewind O, Joachimiak A: Staphylococcus aureus IsdG and IsdI, heme-degrading enzymes with structural similarity to monooxygenases. J Biol Chem 2005, 280(4):2840–2846.View Article
- Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y, Liang J: CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res 2006, (34 Web Server):W116–118.
- Sciara G, Kendrew SG, Miele AE, Marsh NG, Federici L, Malatesta F, Schimperna G, Savino C, Vallone B: The structure of ActVA-Orf6, a novel type of monooxygenase involved in actinorhodin biosynthesis. Embo J 2003, 22(2):205–215.View Article
- Stockwell GR, Thornton JM: Conformational diversity of ligands bound to proteins. J Mol Biol 2006, 356(4):928–944.View Article
- Yde CW, Ermakova I, Issinger OG, Niefind K: Inclining the purine base binding plane in protein kinase CK2 by exchanging the flanking side-chains generates a preference for ATP as a cosubstrate. J Mol Biol 2005, 347(2):399–414.View Article
- Bauer CB, Fonseca MV, Holden HM, Thoden JB, Thompson TB, Escalante-Semerena JC, Rayment I: Three-dimensional structure of ATP:corrinoid adenosyltransferase from Salmonella typhimurium in its free state, complexed with MgATP, or complexed with hydroxycobalamin and MgATP. Biochemistry 2001, 40(2):361–374.View Article
- Thoden JB, Firestine SM, Benkovic SJ, Holden HM: PurT-encoded glycinamide ribonucleotide transformylase. Accommodation of adenosine nucleotide analogs within the active site. J Biol Chem 2002, 277(26):23898–23908.View Article
- Nocek B, Evdokimova E, Proudfoot M, Kudritska M, Grochowski LL, White RH, Savchenko A, Yakunin AF, Edwards A, Joachimiak A: Structure of an amide bond forming F(420):gamma-glutamyl ligase from Archaeoglobus fulgidus – a member of a new family of non-ribosomal peptide synthases. J Mol Biol 2007, 372(2):456–469.View Article
- Scheeff ED, Bourne PE: Structural evolution of the protein kinase-like superfamily. PLoS Comput Biol 2005, 1(5):e49.View Article
- Kuhn D, Weskamp N, Hullermeier E, Klebe G: Functional Classification of Protein Kinase Binding Sites Using Cavbase. ChemMedChem 2007, 2(10):1432–1447.View Article
- Kelly MD, Mancera RL: Comparative analysis of the surface interaction properties of the binding sites of CDK2, CDK4, and ERK2. ChemMedChem 2006, 1(3):366–375.View Article
- McGregor MJ: A Pharmacophore Map of Small Molecule Protein Kinase Inhibitors. J Chem Inf Model 2007, 47(6):2374–2382.View Article
- Smith CM, Shindyalov IN, Veretnik S, Gribskov M, Taylor SS, Ten Eyck LF, Bourne PE: The protein kinase resource. Trends Biochem Sci 1997, 22(11):444–446.View Article
- Roskoski RJ: STI-571: an anticancer protein-tyrosine kinase inhibitor. Biochem Biophys Res Commun 2003, 309(4):709–717.View Article
- Nagar B, Bornmann WG, Pellicena P, Schindler T, Veach DR, Miller WT, Clarkson B, Kuriyan J: Crystal structures of the kinase domain of c-Abl in complex with the small molecule inhibitors PD173955 and imatinib (STI-571). Cancer Res 2002, 62(15):4236–4243.
- Mol CD, Fabbro D, Hosfield DJ: Structural insights into the conformational selectivity of STI-571 and related kinase inhibitors. Curr Opin Drug Discov Devel 2004, 7(5):639–648.
- Mol CD, Dougan DR, Schneider TR, Skene RJ, Kraus ML, Scheibe DN, Snell GP, Zou H, Sang BC, Wilson KP: Structural basis for the autoinhibition and STI-571 inhibition of c-Kit tyrosine kinase. J Biol Chem 2004, 279(30):31655–31663.View Article
- Pargellis C, Tong L, Churchill L, Cirillo PF, Gilmore T, Graham AG, Grob PM, Hickey ER, Moss N, Pav S, et al.: Inhibition of p38 MAP kinase by utilizing a novel allosteric binding site. Nat Struct Biol 2002, 9(4):268–272.View Article
- Schindler T, Bornmann W, Pellicena P, Miller WT, Clarkson B, Kuriyan J: Structural mechanism for STI-571 inhibition of abelson tyrosine kinase. Science 2000, 289(5486):1938–1942.View Article
- Tseng YY, Liang J: Predicting Enzyme Functional Surfaces and Locating Key Residues Automatically from Structures. Ann Biomed Eng 2007.
- Tseng YY, Liang J: Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol Biol Evol 2006, 23(2):421–436.View Article
- Laskowski RA, Watson JD, Thornton JM: Protein function prediction using local 3D templates. J Mol Biol 2005, 351(3):614–626.View Article
- Bartlett GJ, Porter CT, Borkakoti N, Thornton JM: Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324(1):105–121.View Article
- Kim J, Mao J, Gunner MR: Are acidic and basic groups in buried proteins predicted to be ionized? J Mol Biol 2005, 348(5):1283–1298.View Article
- Srivastava S, Watowich SJ, Petrash JM, Srivastava SK, Bhatnagar A: Structural and kinetic determinants of aldehyde reduction by aldose reductase. Biochemistry 1999, 38(1):42–54.View Article
- Joachimiak A, Haran TE, Sigler PB: Mutagenesis supports water mediated recognition in the trp repressor-operator system. Embo J 1994, 13(2):367–372.
- Murray D, Honig B: Electrostatic control of the membrane targeting of C2 domains. Mol Cell 2002, 9(1):145–154.View Article
- Sayle RA, Milner-White EJ: RASMOL: biomolecular graphics for all. Trends Biochem Sci 1995, 20(9):374.View Article