- Research article
- Open access
A library of protein surface patches discriminates between native structures and decoys generated by structure prediction servers
BMC Structural Biology volume 11, Article number: 20 (2011)
Abstract
Background
Protein surfaces serve as an interface with the molecular environment and are thus tightly bound to protein function. On the surface, geometric and chemical complementarity to other molecules provides interaction specificity for ligand binding, docking of bio-macromolecules, and enzymatic catalysis.
As of today, there is no accepted general scheme to represent protein surfaces. Furthermore, most of the research on protein surface focuses on regions of specific interest such as interaction, ligand binding, and docking sites. We present a first step toward a general purpose representation of protein surfaces: a novel surface patch library that represents most surface patches (~98%) in a data set regardless of their functional roles.
Results
Surface patches, in this work, are small fractions of the protein surface. Using a measure of inter-patch distance, we clustered patches extracted from a data set of high quality, non-redundant, proteins. The surface patch library is the collection of all the cluster centroids; thus, each of the data set patches is close to one of the elements in the library.
We demonstrate the biological significance of our method through the ability of the library to capture surface characteristics of native protein structures as opposed to those of decoy sets generated by state-of-the-art protein structure prediction methods. The patches of the decoys are significantly less compatible with the library than their corresponding native structures, allowing us to reliably distinguish native models from models generated by servers. This trend, however, does not extend to the decoys themselves, as their similarity to the native structures does not correlate with compatibility with the library.
Conclusions
We expect that this high-quality, generic surface patch library will add a new perspective to the description of protein structures and improve our ability to predict them. In particular, we expect that it will help improve the prediction of surface features that are apparently neglected by current techniques.
The surface patch libraries are publicly available at http://www.cs.bgu.ac.il/~keasar/patchLibrary.
Background
Protein surfaces attract numerous studies as they are the site of molecular binding and enzymatic reactivity. To date these studies use three levels of protein surface representations. The oldest represents surfaces as sets of exposed atoms [1]. A common alternative is to represent surfaces by sets of mesh points [2–4] that smooth the exposed atom surfaces. Finally, sets of mesh points may be coarse grained by descriptor-based methods [5–7] that allow rapid comparisons of surfaces and surface patches. These representations have served as an infrastructure for numerous studies that analyze surface electrostatics [8, 9], predict catalytic residues and active sites [10], and characterize binding sites for small ligands as well as other proteins (for recent reviews see [7, 11, 12]). While these studies mark a major trend in the annotation and prediction of protein function, surfaces are practically ignored in protein structure prediction. Specifically, we are not aware of any study that tried to assess the surfaces of models generated by prediction methods. This is somewhat surprising as one of the ultimate goals of structure prediction is to allow functional annotation of the target proteins and to support structure-based design of ligands and mutations [13]. The current study suggests a plausible approach to the assessment of model surfaces and compares surface accuracy with standard backbone-based measures such as Root Mean Square Deviation (RMSD) [14] or Global Distance Test - Total Score (GDT_TS) [15].
Notwithstanding the importance of the fine-grained representations of protein surfaces, their complexity calls for coarse graining, or abstraction; a coarser perspective can reveal new insights about the surface architecture that are otherwise masked by the plethora of fine details. Two previous lines of study, [16–18] and [19–21], suggested coarse-grained representations of protein surfaces using the notion of surface patches. Their approaches to the problem were remarkably different, reflecting the different aims of these studies. Jones and Thornton [16, 17] and later Albou et al. [18] defined surface patches as overlapping sets of proximate surface residues, and compared binding site patches with non-binding ones to characterize and predict protein-protein interaction sites [22]. Baldacci et al. [19, 20] defined surface patches as non-overlapping sets of homogeneous and connected surface points and classified them into twelve predefined types. They employed data mining techniques on these patches to identify structural similarity and plausible evolutionary connections between proteins. Since both applications of the surface patch concept are so tightly tailored to their specific aims, it is hard to see how they can be used in a different context.
Here we present a more general representation of surface patches, which is inspired by the central role of clustering in the study of protein fragments (i.e., contiguous structural segments along the protein chain) [23]. Representative fragments, extracted by clustering large data sets of protein structure fragments, have been used for a wide range of applications including: studies of sequence/structure relationships [24, 25], sequence alignment [26], structural comparison and classification [27], large scale mapping of the fold space of proteins [28], and for protein structure prediction [26, 29]. Here, we use the K-means++ [30] clustering algorithm to generate a library of representative protein surface-patches that commonly occur in the Protein Data Bank (PDB). To demonstrate the utility of our approach, we quantify the differences between the surfaces of native protein structures and those of decoys generated by state-of-the-art structure prediction methods. We also suggest a variety of other applications for future research.
Briefly, a surface patch in this study is a set of surface atoms within a certain radius around a surface β-carbon, denoted the pivot (Figure 1). The distance between two patches is the Root Mean Square Deviation (RMSD) between their atoms under a mapping that preserves chemical identity. Pairs of patches of different chemical compositions are considered infinitely distant. The K-means++ algorithm uses this distance to break a large data set of patches into k = 350 structurally homogeneous clusters. The centroids of these clusters constitute our library (Figure 2), which captures genuine features of native structure surfaces (Figure 3).
Results
We extracted 15,288 surface patches from the training set domains, calculated all vs. all distances, and weeded out 200 outlier patches that were too far from most other patches. Then, using the K-means++ algorithm [30], we divided the patches into k = 350 clusters. The algorithm associates each cluster with a representative centroid. The set of 350 centroids constitutes a library of surface patches (Figure 2). Given this library, any surface patch may be associated with the closest library element, and the surface of any protein structure may be described by a list of the associated library elements.
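The seeding step that distinguishes K-means++ [30] from plain K-means can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a precomputed, symmetric matrix `dist` of finite patch-to-patch distances (the infinite distances between incompatible patches would have to be capped first), and the function name `kmeanspp_seeds` is hypothetical.

```python
import random


def kmeanspp_seeds(dist, k, rng):
    """Choose k seed indices by k-means++ (squared-distance weighted) sampling.

    dist is a symmetric matrix (list of lists) of finite pairwise distances.
    """
    n = len(dist)
    seeds = [rng.randrange(n)]  # first seed: uniform at random
    while len(seeds) < k:
        # Squared distance from each item to its nearest already-chosen seed.
        weights = [min(dist[i][s] for s in seeds) ** 2 for i in range(n)]
        if sum(weights) == 0:
            break  # every remaining item coincides with a chosen seed
        # Items far from all current seeds are more likely to be picked;
        # chosen seeds have weight 0 and cannot be picked again.
        seeds.append(rng.choices(range(n), weights=weights, k=1)[0])
    return seeds


# Toy example: four items forming two tight pairs.
toy = [
    [0.0, 1.0, 10.0, 11.0],
    [1.0, 0.0, 9.0, 10.0],
    [10.0, 9.0, 0.0, 1.0],
    [11.0, 10.0, 1.0, 0.0],
]
seeds = kmeanspp_seeds(toy, 2, random.Random(0))
```

After seeding, the usual K-means iterations (assign each patch to its nearest centroid, then update the centroids) would follow; they are omitted here.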
Below, we compare the library-compatibility of the training-set proteins to the compatibilities of the test-set native structures and their decoys. We further compare the compatibilities of the decoys themselves, attempting to correlate them with decoy quality.
Distribution of native and decoy patch distances from cluster centroids
Given a library of surface patches, any surface patch may be marked with its distance to the closest library element (DCLE). The essence of the K-means algorithm is optimization of the average DCLE within the clusters. Thus, one may expect a low average of DCLE values for training set patches and higher values for unrelated patches. Figure 3 compares the distribution of training set DCLE values with six test set distributions: that of the native structures and those of the first, most confident, models submitted by five state-of-the-art CASP8 structure prediction servers. The DCLE distribution of patches extracted from native test set structures is almost indistinguishable from the training set distribution, which indicates that the library is not over-fitted. On the other hand, the DCLE distributions of the decoy patches are significantly wider (Wilcoxon rank-sum test, $p < 10^{-30}$), with larger averages. This difference is large enough to distinguish native structures from a set of five decoy structures in 68% of the test set proteins (Table 1). The random expectation is 1/6, i.e., 16.6 (± 7.7)% (where the standard deviation of 7.7 was estimated by 10,000 bootstrap re-sampling iterations).
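The DCLE-based comparison can be illustrated with a short sketch. It assumes that each structure is summarized by the mean DCLE of its patches and that the structure with the lowest mean is taken as the native one; the text does not spell out the exact scoring rule behind Table 1, so this rule, the function names, and the toy numbers are all assumptions.

```python
def dcle(distances_to_library):
    """Distance to the closest library element for one patch.

    distances_to_library holds the patch's distance to every centroid.
    """
    return min(distances_to_library)


def mean_dcle(patch_rows):
    """Average DCLE over all patches of one structure."""
    values = [dcle(row) for row in patch_rows]
    return sum(values) / len(values)


def most_library_compatible(structures):
    """Return the name of the structure with the lowest mean DCLE.

    structures maps a name to a list of per-patch distance rows.
    ASSUMPTION: lowest mean DCLE is used as the selection rule.
    """
    return min(structures, key=lambda name: mean_dcle(structures[name]))


# Toy data: two patches per structure, two library elements.
toy_structures = {
    "native": [[0.5, 2.0], [1.0, 0.8]],
    "decoy_1": [[1.5, 2.0], [3.0, 2.5]],
}
best = most_library_compatible(toy_structures)
```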
While compatibility with the surface patch library discriminates between native structures and decoys, it provides a weaker clue regarding the quality of the decoys themselves. The best decoys (by RMSD) are only slightly enriched within the most compatible decoys (Tables 2 and 3), probably because on average the decoys are more similar to one another than to the native structure. Decoy quality assessment by GDT_TS gave similar results (data not shown).
The relative size of clusters
Cluster preference is another property that distinguishes between the patches of native and decoy structures. Formally, for a set of patches Q (e.g., patches extracted from some decoy set) this preference is a vector $F(Q) = \{f(Q,C_1), \ldots, f(Q,C_k)\}$, where $f(Q,C)$ is the fraction of Q elements that are closest to the centroid of cluster C, and k is the number of clusters. Figure 4 presents a cumulative distribution of $\Delta_i = |f(Q_0,C_i) - f(Q,C_i)|$ for each data set Q, where $Q_0$ is the set of training patches. The Δ values of the test set native structures are significantly lower than those of the decoys ($p < 10^{-4}$ by Wilcoxon rank-sum test), indicating that the native structure preferences are far more similar to those of the training set than the preferences of the decoys. Curiously, not only do the native structures differ significantly from the decoys, the server structures differ considerably among themselves.
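The preference vector and the per-cluster differences defined above can be computed as in the following sketch. It assumes that each patch has already been assigned the index of its closest centroid; the function names and the toy assignments are illustrative only.

```python
def cluster_preference(closest_cluster, k):
    """Return F(Q): the fraction of patches closest to each of k centroids.

    closest_cluster lists, per patch, the index of its nearest centroid.
    """
    counts = [0] * k
    for c in closest_cluster:
        counts[c] += 1
    total = len(closest_cluster)
    return [count / total for count in counts]


def preference_deltas(f_train, f_query):
    """Return Delta_i = |f(Q0, Ci) - f(Q, Ci)| for every cluster i."""
    return [abs(a - b) for a, b in zip(f_train, f_query)]


# Toy example with k = 3 clusters and four patches per set.
f_train = cluster_preference([0, 0, 1, 2], k=3)
f_decoy = cluster_preference([0, 1, 2, 2], k=3)
deltas = preference_deltas(f_train, f_decoy)
```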
Discussion and Conclusions
This work presents a new library of surface patches analogous to the fragment libraries that had a considerable impact on computational structural biology over the last twenty years [23]. Here, to demonstrate the significance of our library, we use it to compare patches taken from native structures and from decoys generated by state-of-the-art protein structure prediction servers. Our results show that the clusters are meaningful, and capture genuine aspects of native protein surfaces. Specifically, patches of decoys generated by servers are significantly different from patches of native proteins. Furthermore, this difference has a predictive power allowing us to identify native protein structures within a set of server models.
This phenomenon can be only partially attributed to the qualities of the models as measured by the standard RMSD and GDT_TS scores. Patch-derived measures (e.g., DCLE) are not correlated with RMSD or GDT_TS (data not shown). Good models (e.g., of low RMSD) are as prone to non-native surface patches as bad ones. Thus, we cannot use the library to reliably rank decoys. On the other hand, we hope that our library will shed light on inherent limitations of the current modeling techniques. Such limitations in the representation of surfaces may be overlooked by the current model assessment procedures. However, they may drastically reduce the applicability of models for real life problems that often involve surface interactions. The characterization of these discrepancies between model surfaces and the surfaces of native structures is an obvious direction to continue this study. We hope that it will lead to some insight about the limitations of current modeling procedures and eventually to better model building techniques. A few other future applications are listed below.
Our approach to surface patch sampling requires quite a few parameters, such as the patch radius and the number of clusters. Due to the exploratory nature of the current study, we have decided to avoid a time consuming systematic search for the optimal values of these parameters. Some of them were assigned arbitrary values, and for others we sparsely sampled a wide range of values (data not shown). Although some values generated better results than others, the results were qualitatively similar, suggesting that the approach presented here is stable and viable.
Protein structures are extremely complex entities and no single perspective exposes all their properties. In the past, new protein representations (e.g., fragments [23] and rotamers) opened the way to diverse lines of study. One may speculate that a similar trend will follow here. Possible directions include functional inference from patch content, evolutionary conservation and diversification of patch content, and graph representation of protein surfaces with patches as nodes and patch overlap as edges. The latter suggests new directions for structure-based comparison, search, and classification.
Methods
Data Sets
The training set, which is available online at http://www.cs.bgu.ac.il/~keasar/patchLibrary/domain_names.html, is the one previously used by Kolodny et al. [25] and includes 200 unique domains from SCOP version 1.57. These domains were solved using X-ray crystallography at high resolution [31] and each of them has the highest SPACI score [32] in its SCOP category.
The test set includes both native structures and their server-predicted models (decoys). These structures correspond to 55 CASP8 [33] single domain targets that were solved by X-ray crystallography and are non-homologous to the training set proteins. Specifically, the test set proteins have a BLAST [34] E-value of at least $10^{-3}$ when run against the training set. The decoys were generated by five top CASP8 servers (Table 4), and are available through the CASP8 web site. Following the CASP regulations, each server submitted five models per target, ordered by confidence.
Identification of surface atoms
We consider an atom of type t (e.g., Alanine-Cα) to reside on the surface if its accessible surface area, calculated by PROGEOM [35], is at least $\alpha \cdot \mathrm{access\_surf}_t$ (Figure 1a). Here, $\mathrm{access\_surf}_t$ is the 99th percentile of the cumulative distribution of accessible surface area within all the atoms of type t, and $\alpha = 0.9$. The empirical adjustment of these two parameters reduces the effect of errors in the crystallographic data (e.g., missing side chains that superficially expose backbone atoms), and ensures continuous coverage of protein surfaces.
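The per-type threshold can be sketched as follows. The accessible surface areas are assumed to have been computed elsewhere (the text uses PROGEOM [35]); the nearest-rank percentile rule, the function names, and the toy values are assumptions made for illustration only.

```python
from collections import defaultdict

ALPHA = 0.9  # value given in the text


def percentile_99(values):
    """99th percentile by the nearest-rank rule (an assumed convention)."""
    ordered = sorted(values)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def surface_thresholds(reference_atoms):
    """Map each atom type t to ALPHA * access_surf_t.

    reference_atoms is an iterable of (atom_type, accessible_area) pairs.
    """
    areas_by_type = defaultdict(list)
    for atom_type, area in reference_atoms:
        areas_by_type[atom_type].append(area)
    return {t: ALPHA * percentile_99(v) for t, v in areas_by_type.items()}


def is_surface(atom_type, area, thresholds):
    """An atom is on the surface if its area reaches its type's threshold."""
    return area >= thresholds[atom_type]


# Toy reference set: 100 atoms of one type with areas 0, 1, ..., 99.
reference = [("ALA-CA", float(i)) for i in range(100)]
thresholds = surface_thresholds(reference)
```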
Patch definition
We define surface patches as sets of surface atoms centered about all solvent exposed β-carbons, which we denote pivots (Figure 1). Each patch includes the central pivot and all surface atoms within a given radius around it. This radius is a critical parameter as the number of atoms within a patch is strongly dependent on it. Thus, a large radius results in large numbers of atoms and long evaluation times for the combinatorial distance measure (see below). On the other hand, too small a radius may leave surface regions uncovered. A preliminary study suggested 7Å as a reasonable compromise that keeps a manageable number of atoms in a patch (around 25 on average) and provides a continuous coverage of proteins' surfaces by overlapping patches.
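A minimal sketch of this patch definition follows. It assumes the surface atoms have already been selected and are given as (atom name, coordinates) pairs, with β-carbons named "CB" as in PDB files; the function name and the toy coordinates are hypothetical.

```python
import math

PATCH_RADIUS = 7.0  # in Angstrom, the value chosen in the text


def extract_patches(surface_atoms, radius=PATCH_RADIUS):
    """Return one patch per surface beta-carbon (pivot).

    surface_atoms is a list of (atom_name, (x, y, z)) tuples.
    Each patch lists every surface atom within radius of its pivot,
    including the pivot itself.
    """
    patches = []
    for name, pivot_xyz in surface_atoms:
        if name != "CB":
            continue  # only beta-carbons serve as pivots
        patch = [
            (other_name, xyz)
            for other_name, xyz in surface_atoms
            if math.dist(pivot_xyz, xyz) <= radius
        ]
        patches.append(patch)
    return patches


# Toy surface: two pivots 9 Angstrom apart, each with one close neighbour.
toy_atoms = [
    ("CB", (0.0, 0.0, 0.0)),
    ("CA", (1.0, 0.0, 0.0)),
    ("O", (8.0, 0.0, 0.0)),
    ("CB", (9.0, 0.0, 0.0)),
]
toy_patches = extract_patches(toy_atoms)
```

Because every exposed β-carbon yields its own patch, neighbouring patches overlap, which is what provides the continuous coverage described above.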
Measuring the distance between two patches
Given two patches A and B, we look for an optimal superposition in terms of structure and chemical properties, and define the distance between A and B as the minimal RMSD under a set of chemical constraints. If the compositions (see below) of the patches are too remote to allow meaningful superposition, we set the distance to infinity.
More formally: let the patches be the respective sets of atoms $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_m\}$. Let $T_i^A$ be the number of atoms of type $T_i$ in patch A and $rg(A)$ the radius of gyration of A (symmetrically for B).
Notice that $\sum_i T_i^A = n$ and $\sum_i T_i^B = m$.
The patches A and B are compatible if three differences are within their respective thresholds $\Phi_1$, $\Phi_2$, and $\Phi_3$: the size difference (between n and m), the chemical difference (between the type counts $T_i^A$ and $T_i^B$), and the radius of gyration difference (between $rg(A)$ and $rg(B)$).
The threshold values for size difference, chemical difference, and radius of gyration difference were arbitrarily set to $\Phi_1 = \Phi_2 = 0.2$, and $\Phi_3 = 5$Å. The distance between incompatible patches is infinite.
Let t: {set of all atoms} → T be a mapping so that for an atom a, t(a) is the atom's type. A mapping f, from A to B, is proper if it satisfies f(a) = b if and only if f(b) = a and t(a) = t(b).
Let $F = \{f_1, \ldots, f_k\}$ be the set of all proper mappings of A and B.
Then, the distance between A and B is:
$$d(A,B) = \min_{f \in F} \mathrm{RMSD}(A,B,f)$$
where RMSD(A,B,f) is the RMSD, after optimal superposition [14], of the atoms of A and B that are mapped by f.
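The compatibility filter can be sketched as below. Only the three thresholds are taken from the text; the way each difference is normalized is NOT: the relative size difference and the count difference divided by the total atom number are assumptions chosen so that a threshold of 0.2 is meaningful. The function names are hypothetical.

```python
import math
from collections import Counter

PHI_SIZE, PHI_CHEM, PHI_RG = 0.2, 0.2, 5.0  # thresholds from the text


def radius_of_gyration(coords):
    """Root mean square distance of the atoms from their centroid."""
    n = len(coords)
    center = [sum(c[d] for c in coords) / n for d in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)


def compatible(types_a, coords_a, types_b, coords_b):
    """Return True if two patches pass all three compatibility checks."""
    n, m = len(types_a), len(types_b)
    counts_a, counts_b = Counter(types_a), Counter(types_b)
    # ASSUMPTION: size difference measured relative to the larger patch.
    size_diff = abs(n - m) / max(n, m)
    # ASSUMPTION: per-type count differences, normalized by total atoms.
    all_types = set(counts_a) | set(counts_b)
    chem_diff = sum(abs(counts_a[t] - counts_b[t]) for t in all_types) / (n + m)
    rg_diff = abs(radius_of_gyration(coords_a) - radius_of_gyration(coords_b))
    return size_diff <= PHI_SIZE and chem_diff <= PHI_CHEM and rg_diff <= PHI_RG


# Toy patches: a four-atom patch, and the same patch repeated three times.
toy_types = ["C", "C", "N", "O"]
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
same = compatible(toy_types, toy_coords, toy_types, toy_coords)
different_size = compatible(toy_types, toy_coords, toy_types * 3, toy_coords * 3)
```

A full distance function would return infinity when `compatible` is false and otherwise minimize the RMSD over proper mappings.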
In practice, finding the optimal mapping is a hard combinatorial optimization problem, although the requirement for compatibility provides a filter that reduces the number of these calculations considerably. Thus, the use of the exact distance definition above might have rendered the calculation of numerous distances infeasible. Instead, we use a heuristic approximation that reduces the number of tested mappings. To this end, we define the inner sphere of a patch to be a sphere, centered at the pivot, of radius r < 7Å, which is adjusted so that the number of surface atoms in the inner sphere is between 4 and 9 (see Figure 5a). We then exhaustively enumerate all possible chemically valid mappings between the inner sphere of one patch and the inner sphere of the other patch (Figure 5b). The RMSD between these inner spheres is measured after optimal least-squares superposition. If this RMSD is less than 2Å, the transformation it implies serves as a seed for matching the full patches A and B. If no seed was found, the distance between the patches is taken to be infinity. Once the transformation of a seed match was applied to the full patches, we match the atoms of A and B: each atom of A is matched according to proximity and chemical attributes to the best fitting atom in B (Figure 5c). Now we have a mapping between A and B for each seed. For each such mapping we compute the RMSD between A and B and pick the matching with the lowest RMSD.
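The exhaustive enumeration over inner spheres can be illustrated by the sketch below, which lists all type-preserving one-to-one mappings between two small atom sets. How the authors treat atom types that are more numerous in one sphere than in the other is not stated; here the scarcer side is matched in full, which is an assumption. The superposition and RMSD steps [14] are omitted, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import permutations, product


def proper_mappings(types_a, types_b):
    """Yield type-preserving one-to-one mappings between two atom lists.

    Each mapping is a list of (index_in_a, index_in_b) pairs.
    ASSUMPTION: for every shared type, all atoms on the scarcer side
    are matched to distinct atoms of the same type on the other side.
    """
    by_type_a, by_type_b = defaultdict(list), defaultdict(list)
    for i, t in enumerate(types_a):
        by_type_a[t].append(i)
    for j, t in enumerate(types_b):
        by_type_b[t].append(j)

    options_per_type = []
    for t in sorted(set(by_type_a) & set(by_type_b)):
        a_idx, b_idx = by_type_a[t], by_type_b[t]
        if len(a_idx) <= len(b_idx):
            options = [list(zip(a_idx, p)) for p in permutations(b_idx, len(a_idx))]
        else:
            options = [list(zip(p, b_idx)) for p in permutations(a_idx, len(b_idx))]
        options_per_type.append(options)

    # Combine one choice per atom type into a full mapping.
    for combo in product(*options_per_type):
        yield [pair for block in combo for pair in block]


# Toy inner spheres: two carbons and one nitrogen each, in different order.
toy_mappings = list(proper_mappings(["C", "C", "N"], ["C", "N", "C"]))
```

The number of mappings grows factorially with the number of atoms of each type, which is why the enumeration is restricted to inner spheres of 4 to 9 atoms.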
Outlier weeding
Patches that are distant from the majority of other patches are outliers; we weed them out in a pre-processing step to avoid numerous non-informative singleton clusters. Here, we define an outlier as a patch that has a distance greater than 2.5Å to more than 90% of the other patches; this filters out 1.51% of the surface patches. A closer look at some of the outliers reveals a diverse population. Some of them are unique (within our dataset) functional elements like metal binding sites; for example, the small protein 1VFY contributes four outliers due to its two metal binding sites and a large fraction of unstructured chain. Others are artifacts of using domains instead of whole proteins; for example, 1JHG, which is a homodimer, contributes five outliers, three of which are actually buried by the other subunit. Finally, some of the outliers do not show any peculiarity that we could identify. Their uniqueness may be simply an artifact of the relatively small size of our dataset.
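The outlier criterion translates directly into code. The sketch below assumes a precomputed all-vs-all distance matrix (infinite entries for incompatible patches are handled naturally); the function name and the toy matrix are illustrative.

```python
def find_outliers(dist, cutoff=2.5, fraction=0.9):
    """Return indices of patches farther than cutoff from more than
    the given fraction of the other patches.

    dist is a symmetric all-vs-all distance matrix (list of lists).
    """
    n = len(dist)
    outliers = []
    for i in range(n):
        far = sum(1 for j in range(n) if j != i and dist[i][j] > cutoff)
        if far > fraction * (n - 1):
            outliers.append(i)
    return outliers


# Toy matrix: ten mutually close patches plus one distant patch (index 10).
size = 11
toy_dist = [[0.0 if i == j else 1.0 for j in range(size)] for i in range(size)]
for j in range(size - 1):
    toy_dist[10][j] = toy_dist[j][10] = 5.0
toy_outliers = find_outliers(toy_dist)
```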
References
Lee B, Richards FM: The interpretation of protein structures: estimation of static accessibility. J Mol Biol 1971, 55: 379–400. 10.1016/0022-2836(71)90324-X
Connolly ML: Solvent-accessible surfaces of proteins and nucleic acids. Science 1983, 221: 709–713. 10.1126/science.6879170
Liang J, Edelsbrunner H, Fu P, Sudhakar PV, Subramaniam S: Analytical shape computation of macromolecules: II. Inaccessible cavities in proteins. Proteins 1998, 33: 18–29. 10.1002/(SICI)1097-0134(19981001)33:1<18::AID-PROT2>3.0.CO;2-H
von Freyberg B, Richmond TJ, Braun W: Surface area included in energy refinement of proteins. A comparative study on atomic solvation parameters. J Mol Biol 1993, 233: 275–292. 10.1006/jmbi.1993.1506
Bock ME, Cortelazzo GM, Ferrari C, Guerra C: Identifying similar surface patches on proteins using a spin-image surface representation. Lect Notes Comput Sci 2005, 3537: 417–428. 10.1007/11496656_36
Ankerst M, Kastenmüller G, Kriegel H-P, Seidl T: 3D shape histograms for similarity search and classification in spatial databases. Lect Notes Comput Sci 1999, 1651: 207–226. 10.1007/3-540-48482-5_14
Venkatraman V, Sael L, Kihara D: Potential for Protein Surface Shape Analysis Using Spherical Harmonics and 3D Zernike Descriptors. Cell Biochem Biophys 2009, 54: 23–32. 10.1007/s12013-009-9051-x
Kuo SH, Tidor B, White J: A meshless, spectrally accurate, integral equation solver for molecular surface electrostatics. J Emerg Technol Comput Syst 2008, 4: 1–30.
Klapper I, Hagstrom R, Fine R, Sharp K, Honig B: Focusing of electric fields in the active site of Cu-Zn superoxide dismutase: effects of ionic strength and amino-acid modification. Proteins 1986, 1: 47–59.
Ben-Shimon A, Eisenstein M: Looking at enzymes from the inside out: the proximity of catalytic residues to the molecular centroid can be used for detection of active sites and enzyme-ligand interfaces. J Mol Biol 2005, 351: 309–326. 10.1016/j.jmb.2005.06.047
Via A, Ferrè F, Brannetti B, Helmer-Citterich M: Protein surface similarities: a survey of methods to describe and compare protein surfaces. Cell Mol Life Sci 2000, 57: 1970–1977. 10.1007/PL00000677
Gherardini PF, Helmer-Citterich M: Structure-based function prediction: approaches and applications. Brief Funct Genomic Proteomic 2008, 7: 291–302. 10.1093/bfgp/eln030
Kopp J, Schwede T: Automated protein structure homology modeling: a progress report. Pharmacogenomics 2004, 5: 405–416. 10.1517/14622416.5.4.405
Kabsch W: A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 1978, 34: 827–828. 10.1107/S0567739478001680
Zemla A, Venclovas Č, Moult J, Fidelis K: Processing and evaluation of predictions in CASP4. Proteins 2001, 45(Suppl 5):13–21.
Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272: 121–132. 10.1006/jmbi.1997.1234
Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997, 272: 133–143. 10.1006/jmbi.1997.1233
Albou LP, Schwarz B, Poch O, Wurtz JM: Defining and characterizing protein surface using alpha shapes. Proteins 2009, 76: 1–12. 10.1002/prot.22301
Baldacci L, Golfarelli M, Lumini A, Rizzi S: A Template-Matching Approach for Protein Surface Clustering. 18th International Conference on Pattern Recognition 2006, 3: 340–343. [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?reload=true&arnumber=1699535]
Baldacci L, Golfarelli M, Lumini A, Rizzi S: Clustering techniques for protein surfaces. Pattern Recogn 2006, 39: 2370–2382. 10.1016/j.patcog.2006.02.024
Baldacci L, Golfarelli M: Mining Complex Patterns from Protein Surfaces. In Proceedings of the 16th International Workshop on Database and Expert Systems Applications. Edited by: Matteo G. Copenhagen, Denmark; 2005:590–594.
Murakami Y, Jones S: SHARP2: protein-protein interaction predictions using patch analysis. Bioinformatics 2006, 22: 1794–1795. 10.1093/bioinformatics/btl171
Offmann B, Tyagi M, de Brevern AG: Local Protein Structures. Curr Bioinform 2007, 2: 165–202. 10.2174/157489307781662105
Han KF, Baker D: Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc Natl Acad Sci USA 1996, 93: 5814–5818. 10.1073/pnas.93.12.5814
Kolodny R, Koehl P, Guibas L, Levitt M: Small libraries of protein fragments model native protein structures accurately. J Mol Biol 2002, 323: 297–307. 10.1016/S0022-2836(02)00942-7
Levitt M: Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 1992, 226: 507–533. 10.1016/0022-2836(92)90964-L
Le Q, Pollastri G, Koehl P: Structural alphabets for protein structure classification: a comparison study. J Mol Biol 2009, 387: 431–450. 10.1016/j.jmb.2008.12.044
Friedberg I, Godzik A: Connecting the Protein Structure Universe by Using Sparse Recurring Fragments. Structure 2005, 13: 1213–1224. 10.1016/j.str.2005.05.009
Bystroff C, Baker D: Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 1998, 281: 565–577. 10.1006/jmbi.1998.1943
Arthur D, Vassilvitskii S: k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms: 7–9 January 2007. Society for Industrial and Applied Mathematics, New Orleans, Louisiana; 2007:1027–1035.
Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.
Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000, 28: 254–256. 10.1093/nar/28.1.254
Moult J, Fidelis K, Zemla A, Hubbard T: Critical assessment of methods of protein structure prediction - Round VIII. Proteins 2009, 77(Suppl 9):1–4.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
Koehl P: PROGEOM.[http://nook.cs.ucdavis.edu/~koehl/ProShape/overview.html]
Acknowledgements
Funding: RG and CK were partially supported by the Israeli Science Foundation (grant no. 289/06), and the National Institutes of Health (award no. 1R01 GM081712-01). RK was partially supported by Marie Curie IRG grant 224774.
Authors' contributions
RG participated in the project design, wrote software, generated the data and analyzed it, and drafted the manuscript. RK participated in the project design and supervised its clustering aspects. KK participated in the project design and supervised its computational geometry aspects. CK conceived the project and coordinated it. All authors took part in manuscript writing, read the final version and approved it.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Gamliel, R., Kedem, K., Kolodny, R. et al. A library of protein surface patches discriminates between native structures and decoys generated by structure prediction servers. BMC Struct Biol 11, 20 (2011). https://doi.org/10.1186/1472-6807-11-20
DOI: https://doi.org/10.1186/1472-6807-11-20