A tool for calculating binding-site residues on proteins from PDB structures
© Hu and Yan; licensee BioMed Central Ltd. 2009
Received: 1 March 2009
Accepted: 3 August 2009
Published: 3 August 2009
In the research on protein functional sites, researchers often need to identify binding-site residues on a protein. A commonly used strategy is to find a complex structure from the Protein Data Bank (PDB) that consists of the protein of interest and its interacting partner(s) and calculate binding-site residues based on the complex structure. However, since a protein may participate in multiple interactions, the binding-site residues calculated based on one complex structure usually do not reveal all binding sites on a protein. Thus, this requires researchers to find all PDB complexes that contain the protein of interest and combine the binding-site information gleaned from them. This process is very time-consuming. Especially, combing binding-site information obtained from different PDB structures requires tedious work to align protein sequences. The process becomes overwhelmingly difficult when researchers have a large set of proteins to analyze, which is usually the case in practice.
In this study, we have developed a t ool for c alculating b inding-site r esidues on p roteins, TCBRP http://yanbioinformatics.cs.usu.edu:8080/ppbindingsubmit. For an input protein, TCBRP can quickly find all binding-site residues on the protein by automatically combining the information obtained from all PDB structures that consist of the protein of interest. Additionally, TCBRP presents the binding-site residues in different categories according to the interaction type. TCBRP also allows researchers to set the definition of binding-site residues.
The developed tool is very useful for the research on protein binding site analysis and prediction.
Proteins perform various functions through interactions with other molecules, such as DNA, RNA, proteins, carbohydrates, and ligands. To understand the mechanisms of these interactions, many studies have been performed to analyze the properties of binding sites on proteins, such as residue composition, secondary structure, geometric shape, electrostatic potentials, etc [1–10]. To perform such an analysis, researchers must first identify the amino acid residues (referred to as binding-site residues) that are involved in the interactions. In other studies where the goal is to build computational predictors for predicting functional sites on proteins (e.g. DNA-binding sites, RNA-binding sites, protein-protein binding sites), researchers also need to identify binding-site residues on the training and test sets to train and evaluate their predictors [11–17].
Thus, for the same protein, different sets of binding-site residues might be obtained depending on the PDB structure that is considered, and a residue of a protein may be defined as binding-site residue in one PDB structure but as non-binding-site residue in another. This inconsistency can cause serious problems in research. Thus, for a given protein, researchers need to identify all PDB structures that contain the protein, and calculate binding-site residues on the protein using all of them.
After users have found all the PDB structures that contain a given protein, the protein sequences shown in different PDB structures must be aligned properly to combine the binding-site information obtained from different structures. This step is not as simple as it may first appear. It cannot be done by matching the sequence indexes of residues in the PDB structures, because the same protein chain may have different sequence indexing in different PDB structures. For example, 1qqi_A and 1gxp_A are the same protein chain in different PDB structures. In PDB 1gxp, the first residue in chain A is ALA with sequence index of 127. However, the same residue in PDB 1qqi has an index of 2. It can neither be done by performing a simple one-to-one mapping between the two sequences from head to tail, because residue missing occurs frequently in PDB structures. Thus, researchers need a tool that can efficiently combine binding-site information from different PDB structures.
The abovementioned needs become overwhelmingly impressive when users have a large set of proteins to analyze. Against these needs, we have developed TCBRP, a t ool for c alculating b inding-site r esidues on p roteins. For an input protein, TCBRP can quickly find all binding-site residues on it by integrating binding-site information obtained from all PDB structures that contain the protein of interest. Additionally, the TCBRP presents the binding-site residues by categories based on the type of the molecule that they contact, e.g. DNA, RNA, protein, carbohydrates, and ligands. An extra benefit of TCBRP is that it allows users to choose the definition of binding-site residues.
Upon the input, TCBRP searches the entire PDB database to find all the complex structures that contain a protein that shares at least 95% sequence similarity with the input protein. Then, the biological units derived from these structures are used to calculate binding-site residues on the protein of interest. We use biological units instead of the raw PDB structure because the biological units show the functional state of the protein in life systems. Additionally, using biological units can avoid the artificial interactions that due to artificial packing in the raw PDB structures. TCBRP focuses on the binding sites involved in inter-molecule interactions, because they correspond to functional sites on proteins. Intra-molecule interactions that involve residues from the same chain or from different chains of the same molecule are discarded.
To reveal all binding sites on the input protein, the binding-site residues obtained from different biological units must be mapped to the input protein. To do this, the sequences of the different copies of the protein in different PDB structures must be aligned properly. In TCBRP, this step is achieved by aligning the protein sequences in PDB structures with the protein sequence found in the Uniprot  using global alignment.
Using TCBRP, users can input one protein chain in a time, or input a file of protein chains in a batch. For a protein that consists of multiple chains, users can submit a file consisting of all the chains, then the TCBRP will show the binding-site residues on each chain that are involved in inter-molecule interactions.
Results and discussion
Figures 4 and 5 show an example of input and output for TCBRP. Assume that 1ARO is one of the PDB structures from which a user wants to find binding sites. Note that 1ARO is a complex of T7 RNA Polymerase (chain P) with T7 Lysozyme (Chain I). Without TCBRP, the user may use 1ARO to calculate binding-site residues on T7 RNA polymerase and only find 26 binding-site residues that correspond to the interaction between T7 RNA Polymerase and T7 Lysozyme. However, RNA polymerase interacts with multiple molecules including RNA, DNA, and proteins. In the research on functional site prediction and analysis, the user will need to find all the functional sites on the T7 RNA polymerase. To obtain these results without TCBRP, the user would need to go through a long and painful process of finding all complexes that contain T7 RNA polymerase, calculating binding-site residues using each of the complexes, and combining the information given by different structures. Using TCBRP, the user can obtain the results easily. Figure 4 shows the input page. Here, the input is the P chain of 1ARO, which is T7 RNA polymerase. Upon the input, the TCBRP automatically finds all PDB structures that contain T7 RNA polymerase, i.e. 1ARO, 1QLN, 2PI4, 1S76, 1MSW, 1CEZ, 1S77, and 2PI5. Binding-sites on T7 RNA polymerase are then calculated based on each of these structures and mapped to the sequence of the input chain 1ARO_P (upper box in figure 5). In the end, the TCBRP combines all the results and shows the binding-site residues involved in each type of interaction separately (lower box in figure 5), which include 26 residues in protein-protein binding sites, 112 in DNA-binding, 28 RNA-binding sites, and 54 ligand-binding residues.
Many studies have been conducted on protein functional site prediction and analysis. Calculating binding-site residues on proteins based on the PDB structures has been a necessary and yet painful and time-costly step for these studies. TCBRP has been developed to address this problem with ease. Using TCBRP, users will be able to collect all binding-site residues on proteins of interest very quickly. The developed web server will be useful for the studies on protein interaction and protein functional sites.
Availability and requirements
Project name: A tool for calculating binding-site residues on proteins from PDB structures
Project home page: http://yanbioinformatics.cs.usu.edu:8080/ppbindingsubmit
Operating system(s): Platform independent
- Reichmann D, Rahat O, Albeck S, Meged R, Dym O, Schreiber G: From The Cover: The modular architecture of protein-protein binding interfaces. Proc Natl Acad Sci USA 2005, 102(1):57–62. 10.1073/pnas.0407280102PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Z, Palzkill T: Dissecting the protein-protein interface between beta-lactamase inhibitory protein and class A beta-lactamases. J Biol Chem 2004, 279(41):42860–42866. 10.1074/jbc.M406157200View ArticlePubMedGoogle Scholar
- Caffrey D, Somaroo S, Hughes J, Mintseris J, Huang ES: Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 2004, 13(1):190–202. 10.1110/ps.03323604PubMed CentralView ArticlePubMedGoogle Scholar
- Halperin I, Wolfson H, Nussinov R: Protein-protein interactions; coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking. Structure 2004, 12(6):1027–1038. 10.1016/j.str.2004.04.009View ArticlePubMedGoogle Scholar
- Ofran Y, Rost B: Analysing six types of protein-protein interfaces. J Mol Biol 2003, 325(2):377–387. 10.1016/S0022-2836(02)01223-8View ArticlePubMedGoogle Scholar
- Lo Conte L, Chothia C, Janin J: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285: 2177–2198. 10.1006/jmbi.1998.2439View ArticlePubMedGoogle Scholar
- Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272(1):121–132. 10.1006/jmbi.1997.1234View ArticlePubMedGoogle Scholar
- Chothia C, Janin J: Principles of protein-protein recognition. Nature 1975, 256(5520):705–708. 10.1038/256705a0View ArticlePubMedGoogle Scholar
- Nooren IM, Thornton JM: Structural characterisation and functional significance of transient protein-protein interactions. J Mol Biol 2003, 325(5):991–1016. 10.1016/S0022-2836(02)01281-0View ArticlePubMedGoogle Scholar
- Keskin O, Mab B, Nussinov R: Hot regions in protein-protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 2005, 345(5):1281–1294. 10.1016/j.jmb.2004.10.077View ArticlePubMedGoogle Scholar
- Yan C, Dobbs D, Honavar V: A two-stage classifier for identification of protein-protein interface residues. Bioinformatics 2004, 20(Suppl 1):i371-i378. 10.1093/bioinformatics/bth920View ArticlePubMedGoogle Scholar
- Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V: Identifying amino acid residues involved in protein-DNA interactions from sequence. BMC Bioinformatics 2006, 7: 262. 10.1186/1471-2105-7-262PubMed CentralView ArticlePubMedGoogle Scholar
- Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997, 272(1):133–143. 10.1006/jmbi.1997.1233View ArticlePubMedGoogle Scholar
- Wang L, Brown SJ: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucl Acids Res 2006, 34: W243-W248. 10.1093/nar/gkl298PubMed CentralView ArticlePubMedGoogle Scholar
- Hwang S, Gou Z, Kuznetsov I: DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 2007, 23(5):634–636. 10.1093/bioinformatics/btl672View ArticlePubMedGoogle Scholar
- Tjong H, Zhou H-X: DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucl Acids Res 2007, 35(5):1465–1477. 10.1093/nar/gkm008PubMed CentralView ArticlePubMedGoogle Scholar
- Ofran Y, Rost B: ISIS: interaction sites identified from sequence. Bioinformatics 2007, 23(2):e13–16. 10.1093/bioinformatics/btl303View ArticlePubMedGoogle Scholar
- Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data Bank. Nat Struct Bio 2003, 10(12):980. 10.1038/nsb1203-980View ArticleGoogle Scholar
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucl Acids Res 2006, 34(suppl_1):D187–191. 10.1093/nar/gkj161PubMed CentralView ArticlePubMedGoogle Scholar