Molecular processes in the living cell are coordinated and executed under tight regulation. Proteins play a fundamental role in almost all biological processes, and their overall activity is regulated at different levels. At a first level, the concentration of a particular protein in the cell is regulated through its synthesis rate (gene expression) and its degradation rate. At another level, mechanisms act on the protein molecule itself through covalent modifications or non-covalent binding of small ligands or other molecules. These regulatory mechanisms are not only essential for the proper functioning of the molecular processes that maintain life, but are also responsible for cross-signaling and regulation processes between an organism and its environment.
Many metabolic enzymes, signalling proteins and transcription factors, among others, are regulated allosterically. Allosteric regulation has been studied for more than 50 years and is considered the most powerful and common way to regulate protein activity. However, for most known cases of allosterism, the atomic details that explain the functional relationship between distant sites on the same protein molecule have not been elucidated [3, 4].
Many pharmaceutical compounds act through allosteric regulation, as seen in the case of paclitaxel (Taxol), an anticancer drug that regulates tubulin polymerization allosterically [5, 6]. Even though active sites represent the classic drug-target pocket (e.g. Aspirin and cyclooxygenase), allosteric sites present advantages over active sites in the context of drug design. Enzymatic activity usually involves charged transition states, and the substrates are not always drug-like. Thus, orally active inhibitors that complement active sites can be very difficult to obtain. Moreover, allosteric sites may allow the discovery of not only novel drug-like inhibitors, but activators as well [2, 3].
In this context, predicting allosteric sites computationally is of great interest. Allosteric sites have been predicted using structural information and phylogeny. Recently, methods have been developed to model or predict the relationship between allosteric and active sites [9–11]. These methods represent an important step forward in the understanding of allosterism. However, these studies are limited by the small amount of readily available data on allosteric sites. As stated by Thornton and collaborators in their recent review, this is due in part to the lack of a formal database that organizes and stores knowledge on allosteric proteins and the corresponding mechanisms.
To unveil common patterns underlying allosterism, provided that these exist, a large-scale study using structural and sequence data would be necessary. However, given the current scarcity of allosteric-site data, we decided to perform a large-scale analysis of protein ligand-binding pockets, as these represent potential locations of functional and allosteric or regulatory sites. Our approach is supported by the concept that, besides naturally occurring allosteric sites, serendipitous sites (those having no natural ligand but effectively acting as an allosteric site given an appropriate ligand) may be of great pharmacological interest. Examples of previously unknown allosteric sites discovered on already solved protein structures [12, 13] support the idea that orphan or serendipitous allosteric sites exist, which may lack a known natural effector but provide an excellent opportunity for drug discovery approaches such as virtual screening. Hardy and Wells also suggest that the large number of 'crystallization artifacts' present in the Protein Data Bank (PDB), such as ligands co-crystallized in unexpected binding sites, could hint at the presence of previously unknown allosteric sites.
A large database of protein structures and associated small-molecule ligands is available and has been used to predict ligand-binding sites by homology. However, small-molecule ligands are not always easy to co-crystallize, and we did not want to limit our study to such cases. In this context, ligand-binding sites can be computationally predicted from structure alone with reasonable accuracy [17–20]. To our knowledge, ligand-binding pockets predicted directly from structure have not yet been studied or characterized at large scale, even though they represent the potential location of as yet unknown effectors.
Functional pockets in proteins have been previously characterized in terms of their flexibility [21, 22], evolutionary conservation [21, 23] and electrostatic potential, and these characteristics have been used to predict their presence and location in the protein structure. Evolutionary conservation is a common characteristic of biologically functional sites. However, until now it has been exploited solely at the sequence level. Although sequence and structural conservation correlate, structure is closer to function and may be conserved even in the absence of a sequence-level signal. Despite this, to our knowledge, an approach based on the structural conservation of protein pockets has not been previously used. Here, we introduce a simple methodology to study pockets at the protein family level, consisting of the identification of pockets present in equivalent positions across different structures of the same protein family. To parameterize the method, we used protein pockets that matched known active sites, as these are well annotated [26, 27]. Once parameterized, we applied the method to all protein structures available in the PDB, leading to the identification of protein pockets for thousands of different protein families. Next, we compared the levels of structural conservation with other pocket characteristics estimated on the same protein families, such as sequence-level conservation, backbone flexibility and electrostatic potential.
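To make the notion of "pockets present in equivalent positions across different structures" more concrete, the following minimal Python sketch scores each pocket by the fraction of other family members that have an overlapping pocket. It is an illustration under stated assumptions, not the parameterized method of this work: pockets are assumed to be already mapped to columns of a family-wide alignment, equivalence is assumed to be measured by Jaccard overlap, and the names `jaccard`, `pocket_conservation` and the threshold `min_overlap` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap between two sets of alignment columns."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pocket_conservation(
    family: Dict[str, List[Set[int]]],
    min_overlap: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> Dict[Tuple[str, int], float]:
    """Score each pocket by the fraction of other structures in the family
    that contain at least one pocket in an equivalent position.

    `family` maps a structure identifier to its pockets; each pocket is a
    set of alignment column indices (assumed precomputed).
    """
    scores: Dict[Tuple[str, int], float] = {}
    for sid, pockets in family.items():
        others = [s for s in family if s != sid]
        for i, pocket in enumerate(pockets):
            # Count the other structures with an equivalent pocket.
            hits = sum(
                any(jaccard(pocket, q) >= min_overlap for q in family[o])
                for o in others
            )
            scores[(sid, i)] = hits / len(others) if others else 0.0
    return scores


# Toy family of three structures (made-up data for illustration only).
toy_family = {
    "A": [{1, 2, 3, 4}, {10, 11}],
    "B": [{2, 3, 4, 5}],
    "C": [{1, 2, 3, 4}, {20, 21}],
}
toy_scores = pocket_conservation(toy_family)
print(toy_scores[("A", 0)])  # 1.0: an equivalent pocket exists in B and in C
print(toy_scores[("A", 1)])  # 0.0: no equivalent pocket in B or C
```

In this toy example the first pocket of structure A is fully conserved across the family, whereas its second pocket is unique to A. In practice the overlap threshold would need to be calibrated, for instance against pockets matching annotated active sites as described above.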
In the following sections we also discuss the results of this analysis for a small set of biological examples that illustrate the relevance of structural conservation in studying protein functional and regulatory sites. Finally, we estimate the number of potentially paired regulatory and functional sites that may exist in the entire data set.