Preservation of protein clefts in comparative models
© Piedra et al; licensee BioMed Central Ltd. 2008
Received: 29 May 2007
Accepted: 16 January 2008
Published: 16 January 2008
Comparative, or homology, modelling of protein structures is the most widely used prediction method when the target protein has homologues of known structure. Given that the quality of a model may vary greatly, several studies have been devoted to identifying the factors that influence modelling results. These studies usually consider the protein as a whole, and only a few provide a separate discussion of the behaviour of biologically relevant features of the protein. Given the value of the latter for many applications, here we extended previous work by analysing the preservation of native protein clefts in homology models. We chose to examine clefts because of their role in protein function/structure, as they are usually the locus of protein-protein interactions, host the enzymes' active site, or, in the case of protein domains, can also be the locus of domain-domain interactions that lead to the structure of the whole protein.
We studied how the largest cleft of a protein varies in comparative models. To this end, we analysed a set of 53507 homology models that cover the whole sequence identity range, with a special emphasis on medium and low similarities. More precisely we examined how cleft quality – measured using six complementary parameters related to both global shape and local atomic environment, depends on the sequence identity between target and template proteins. In addition to this general analysis, we also explored the impact of a number of factors on cleft quality, and found that the relationship between quality and sequence identity varies depending on cleft rank amongst the set of protein clefts (when ordered according to size), and number of aligned residues.
We have examined cleft quality in homology models at a range of seq.id. levels. Our results provide a detailed view of how quality is affected by distinct parameters and thus may help the user of comparative modelling to determine the final quality and applicability of his/her cleft models. In addition, the large variability in model quality that we observed within each sequence bin, with good models present even at low sequence identities (between 20% and 30%), indicates that properly developed identification methods could be used to recover good cleft models in this sequence range.
In order to make full use of the growing amount of sequence information, in terms of increasing our knowledge of protein function, engineering new variants of known proteins, developing biomedical applications, etc, structural information is clearly required [1–6]. Indeed, one of the most important challenges in the post-genomics era is to fill the gap between the large number of known protein sequences and the still relatively small number of known structures [6–9]. Structural genomics projects have addressed this challenge and have led to the design and development of high-throughput production pipelines for structure determination [2, 10–15]. This considerable research effort is starting to give results and recent reports show a clear increase in the number of known structures, and particularly of structures showing new folds, solved in structural genomics projects [16–20].
Providing experimental structures for all possible proteins clearly exceeds our present capacity. Therefore, the yield of structural genomics projects is increased by the use of comparative/homology modelling tools [1, 2, 6, 9, 10, 13, 16, 21]. Indeed, the latter are of great importance as they allow the extension of the knowledge provided by structural genomics projects by at least one order of magnitude [6, 7, 22]. However, the usefulness of homology models varies and is determined by their quality [1, 23, 24]. Drug design (probably the most demanding application of homology models) requires high quality models that are usually obtained for sequence identity (seq.id.) levels above 70% between the target and template [23, 24]. Useful designs of pseudo-molecules fitting the active site of an enzyme, which can be employed for screening small-compound databases, can be obtained at seq.id. of around 30% ; medium to high-accuracy models can be applied to interpret the damaging effect of point mutations [2, 24], etc. A series of independent studies [24, 26–29], as well as the results of CASP experiments [30–41], give the user of comparative modelling a good idea of the model's overall performance, and how the latter can be estimated from the seq.id. between the target and template sequences.
Most of these studies address quality issues regarding the model as a whole; however, because many applications of homology models depend on the quality of the biologically crucial parts of the protein [1, 21, 23, 24], more recent work either includes specific analyses of these sub-structures[26, 27, 30, 40] or is completely devoted to the same . Among the points addressed is the hypothesis that some functional regions are better modelled than others because of their higher sequence and structure conservation . Analyses of CASP experiments provide contradictory evidence either supporting  or rejecting  this hypothesis. Along another line, De-Weese and Moult  used CASP data to explore how ligand binding information can be obtained from comparative models. These authors analyzed the errors in protein-ligand contacts as well as the source of these errors (e.g. alignment problems, incorrect side-chain rotamers, etc). They found that when there are no alignment errors, comparative models provide a useful understanding of the interaction between the protein and its ligand, even at seq. id. levels of around 30%. Complementary to these CASP-based studies, recent large-scale studies of comparative models have also considered the quality of protein functional regions [26, 27]. In these two studies, the authors describe the behaviour of several global, structure-dependent properties, such as accessible surface area and electrostatic potential, in comparative models [26, 27]. In addition to examining these global properties, the authors also analysed the degree of conservation of protein clefts in terms of location and boundary residues. They reported that: (i) spurious clefts appear as seq.id. decreases; (ii) the more similar the target and template sequences, the more conserved the clefts; and (iii) clefts in models have a more rugged surface than in the experimental structure.
The work by De-Weese and Moult  and by Sanchez's group [26, 27] provides a valuable, but still incomplete, picture of how the quality of functional cavities is preserved in comparative models. In the case of De-Weese and Moult's work , the reach of their results is limited by the following: the reduced number of proteins and models studied, 10 and 207, respectively; the consideration of only small molecule binding; and the fact that the analysis is based on the use of essentially one variable, distance root-mean-square deviation. Sanchez's group [26, 27] studied a series of structure-based properties, including clefts. More precisely, in the case of clefts, their work was restricted mainly to the issue of the degree of their preservation between the experimental structure and the model. However, apart from the ruggedness study, no shape descriptors were used to specifically define cleft quality in protein models. In summary, and to the extent of our knowledge, there is no exhaustive study entirely devoted to assess how cleft structure varies in comparative models. Here we address this issue and examine the quality of clefts in protein models obtained at a range of target-template seq.id. levels, using six variables that cover various features of cleft structure. Although we provide data for the entire seq.id. range, we focused on the behaviour of comparative models in the medium (30% – 60%) and low (< 30%) ranges for the following reasons: (i) the quality of homology models above 60% seq.id. is usually high [1, 23]; (ii) biochemical function above 60% seq.id. is usually conserved [43–45]; (iii) target selection protocols in structural genomics projects usually rely on a 30% seq.id. threshold to obtain a maximal coverage [6, 46]; and (iv) comparative modelling is possible below 30% seq.id. because the protein structure is preserved below this threshold [43, 47, 48]. The study was carried out using 53507 comparative models (built with the standard modelling software MODELLER ) for 3802 protein CATH domains . Our results provide a detailed and quantitative view of how cleft quality varies in comparative models and constitute a valuable guide for users of this structure prediction technique. More precisely, we (i) quantitatively show the dependence between several descriptors of cleft quality and seq.id. between target and template sequences; (ii) demonstrate that a certain number of good quality models up to 20% seq.id. can be found; and (iii) indicate that above 30% seq.id. cleft quality approaches that obtained when using the best possible alignments (structural alignments).
Results and discussion
Sequence identity distribution for the target-template pairs.
The domains chosen were distributed over the four CATH  classes (mainly-alpha: 24%; mainly-beta: 29%; alpha-beta: 45%; low secondary structure content: 2%), sampling 33 architectures and 390 topologies, thus giving a good coverage of the structure space of protein domains.
Clefts were computed for each experimental protein structure using SURFNET , which provides a list of clefts. We chose the largest cleft from this list because it is the one that is most likely to play a relevant functional/structural role. Furthermore, in whole proteins this cleft is usually associated with the biochemical function of the protein, by either participating in protein-protein interactions, or hosting the enzymes' active site [53, 54]. In our case, in addition, because we considered protein domains, the largest cleft may also correspond to the locus of domain-domain interactions that determine the structure of the whole protein, thus playing an equally important structural role. However, given that smaller clefts may have a functional role in some cases, we also provide results for the top-five clefts.
To explore how well clefts were reproduced in the models, we used six variables (see Materials and Methods): root-mean-square deviation (rmsd), normalized root-mean-square deviation (rmsd100), global distance test (GDT), protrusion index (cx), variation in accessible surface area (ΔASA) and contact number (ΔCN). Rmsd is widely applied in many areas of structural analysis, and in particular has been successfully used in the characterization of shape variations in binding sites [55, 56], a problem formally analogous to that addressed in the present study. Rmsd100  is a transformation of rmsd that eliminates the size dependence present in the latter and its use allowed us to exclude size biases from our results. GDT, developed within the context of CASP experiments , is a quality measure that helps to detect the presence of well preserved sub-structures in otherwise bad models, thereby helping to prevent the sensitivity of rmsd to outliers. Cx is a simple measure of the protrusion degree of protein atoms, related to the atomic environment, that can be used to characterise binding sites, cleavage sites, etc . ASA  is a shape descriptor that has been extensively employed in protein structural analysis to describe, amongst others, energetic and functional features, such as atom-atom interactions [61, 62], protein solvation [63, 64], protein-protein interactions , etc. Finally, ΔCN, which is directly derived from ΔASA , provides an approximate idea of how comparative models preserve the capacity of cleft atoms to establish functional interactions.
Rmsd between the observed structure of the protein and the homology model was computed considering only the set of atoms defining the cleft in the former (see Materials and Methods). As a control, and to assess the limits introduced by the model building procedure itself, we employed the results of the auto-modeling process in which a model for the target protein was produced using its own experimental structure as template. Our results provide a basal line that corresponds to the limits of the modelling software – MODELLER  in our case- and includes the impact of the distinct approximations implicit in the different steps of the structure building process – e.g. the force field employed, the minimization protocol, the internal protein representation, etc.
cx is a volume ratio (see Materials and Methods) that gives a local measure of the atomic environment that can be related to function . We computed the percentage of cleft atoms for which the cx value varied between the observed and the model structures and examined how this number varied with target-template seq.id.. To exclude noise corresponding to small experimental fluctuations, we followed a simple protocol (see Materials and Methods). We first obtained a set of 223 structure pairs with each pair member corresponding to a different experimental version of the same structure. We then computed the difference in cx for all pairs of equivalent atoms and plotted the resulting distribution (data not shown). For over 99% of the cases, the difference in cx was between -1 and 1. On this basis we considered that: for any given atom cx had varied between the experimental and the model structures when the difference in cx was larger than 1 in absolute value.
ΔASA and ΔCN
As mentioned previously, a number of applications of comparative modelling, like drug design or study of enzyme-substrate interactions, require accurate modelling of native atomic interactions between the target protein and another molecule (either a small substrate or a macromolecule). To provide an estimate of how modelling of these interactions may vary in comparative models, we used an additional parameter, ΔCN (changes in contact number), which is computed from ΔASA using an approximated relationship proposed by Colonna-Cesari and Sander : ΔCN ~ 0.31ΔASA. ΔCN gives a rough idea of how changes in solvent accessibility can modify the ability of cleft atoms to establish interactions with other molecules.
We found (Figure 9) that even for high-quality models almost 25% of cleft atoms had ΔCN values around three. This indicates that these atoms had either gained (ΔCN E 3) the ability to establish three non-native interactions or lost (ΔCN E-3) their ability to establish three native interactions, on average. Furthermore, while this situation was comparable for medium-quality models, for low-quality models the figure rose to over 40% of the cases.
Factors affecting cleft quality
Finally, we studied the effect of several factors contributing to cleft quality, focusing on two related issues: (i) the effect of non-aligned cleft contouring residues; and (ii) the maximal improvement we could obtain when optimal target-template alignments were available. We also examined whether cleft quality was affected by differences in protein fold, or cleft rank (using the five largest clefts of a protein, instead of the largest one), although these results are provided separately as additional files [see Additional files 3 and 4]. To take into account the size effect, we used rmsd100 in all cases.
Here we provide a quantitative view of how the quality of protein clefts varies in comparative models, depending on the seq.id. between the target and template sequences. Our results show how cleft quality – measured using rmsd, rmsd100, GDT, cx, ASA and contact number- is related to target-template seq.id.. When considered together, these analyses consistently show that below 20% seq.id. cleft quality undergoes a clear decrease, both from a global (Figure 1) as well as from a local point of view (Figures 6, 8 and 9). This finding suggests that even between 20% and 30% seq.id., useful models of protein clefts can be obtained, although the use of quality assessment tools is strongly advised, due to the important proportion of poor models within this seq.id. range. Once identified, the cleft model may be subject to subsequent refinement steps aimed at improving quality, e.g. using global model refinement (taking advantage of the better backbone quality, Figure 7), although the greatest improvement is likely to result from the use of good alignments (Figure 11). Above 30% – 40% seq.id., the main restriction to model quality is determined by the template selected (Figure 1). Within this seq.id. range, overall backbone structure tends to deteriorate more than cleft structure, probably because of functional restraints on the latter. Therefore, further model refinement should probably freeze, at least partly, the structure of the cleft, to prevent degradation. Overall, our work goes beyond previous studies [26, 27, 34] presenting a complete view of how the structure of protein clefts varies in comparative models, which constitutes a useful guide for researchers interested in the study of protein function using comparative modelling methods.
Here we explored the level of preservation of protein clefts in comparative models. While the latter can be obtained using templates with varying degrees of sequence similarity from the target, our interest focused mainly on complete coverage of the seq.id. range below 60%, for the reasons mentioned above. This decision determined our selection strategy of targets and templates, which was designed to provide a large number of models within this seq.id. range. Nonetheless, models were also obtained at higher seq.id. to confirm the consistency of the trends observed.
The target-template pairs
Homology, or comparative, modelling methods have the capacity to produce a 3D structure for any target protein provided that structural information is already available for at least one of its family members [1, 23, 24], usually called the template(s). Thus the starting point of our study was to build a list of target-template protein pairs that covered the desired seq.id. range. The difference, in our case, was that the structure of the target protein had to be known in order to allow the assessment of protein clefts variation among comparative models of distinct quality. To this end, we used the CATH database , version 3.0.0.. This database is a domain database in which whole protein structures have been previously separated into their constituting domains. While the use of protein domains did not affect the results of our analysis, it slightly affected the functional/structural meaning of the cavities considered. As explained above, the largest cleft was selected for our analysis because it usually coincides with the protein functional locus [53, 54]. However, because we are dealing with domains, some of these clefts may appear only after separating interacting domains from the same protein. Their value, for the purpose of our research, is similar to that of other functional clefts, as they play a vital role when docking independently modelled domains to build the structure of multi-domain proteins .
CATH provides a hierarchical classification of protein domains  with a range of levels that go from very broad – like the Class level- to very fine, sequence family levels. Among the latter is the O-level, in which domains with seq.id. higher than 60% are grouped, and which was used as a starting point in our study. Indeed, our set of target-template pairs corresponded to all possible pairs of O-level representatives belonging to a given H-level (Homologous Superfamily level), for all H-levels. The list of O-level representatives was obtained from the CATH server , and was subsequently filtered: we excluded all discontinuous CATH domains, structures not considered as true folds in SCOP  (version 1.67), and all protein domains with missing atoms, or main-chain discontinuities. The resulting number of structures was 3802 and the list of target-template pairs had 90948 pairs. The latter were used as input to the program MODELLER and resulted in 88410 models, as there was a small fraction of target-template pairs for which no alignment could be produced. A final filter was implemented to leave only those cases for which all residues from the target's largest cleft (see below) were aligned to template residues. The final number of target-template pairs was 53507 (A list with the pdb codes for all these pairs is provided target-template pairs is provided as additional file [see Additional file 5].
The homology modelling protocol
The homology models were obtained with the standard program MODELLER , using the sequence alignment between the target and template sequences as input. The latter were extracted directly from the CATH  file CathDomainDescriptionFile.v3.0.0, and aligned using the ALIGN option from MODELLER  which implements a global dynamic programming algorithm with afine gap penalties . The models were built using MODELLER's default parameters.
To assess the quality limits imposed by the comparative modelling software, we used a set of models obtained using the experimental structure of the target protein as template, which amounted to a total of 3797 models (five less than the 3802 due to small discrepancies between the PDB sequence and that given in the file CathDomainDescriptionFile.v3.0.0). This auto-model control gives a good idea of how the bias introduced in the distinct structural features (e.g. side-chain torsional angles, atom-atom contacts, etc) by the explicit and implicit approximations in the modelling package affect the final quality of the model.
Alignment accuracy is one of the most important issues in homology modelling [1, 21, 23, 24], as it has a strong effect on the final quality of the model (e.g. alignment errors are essentially unrecoverable). In our study, where the goal was to assess the impact of conventional modelling on cleft quality, we generated sequence alignments with the ALIGN option from the MODELLER package , as done by Sanchez's group  in their work on the impact of comparative models on structure-derived properties. It must be noted, however, that the performance of dynamic programming algorithms, such as that implemented in ALIGN, decreases for seq.id. below 20% – 30% . For this reason, the results shown in our study for seq.id. below 30% constitute a lower bound estimate of cleft quality. There are currently several alternatives to conventional dynamic programming  to obtain sequence alignments within this seq.id. range. However, it is unclear which is the best , and proper assessment of these alternatives is a difficult task to which much research effort is devoted and beyond the scope of the present study. Instead, following Sanchez's approach , we used structure-based alignments to show how an increase in alignment accuracy may improve cleft models. In our case the structure alignments were obtained using the MAMMOTH software suite [74, 75]. The final number of models derived from these alignments, 89563, was slightly higher because MAMMOTH aligned some of the target-template pairs that were too difficult to align using sequence information alone. Again, for our analyses we only considered those models for which all cleft residues from the target were aligned to template residues.
As a reference for the user of homology modelling methods, we related all our results to the similarity between the target and template sequences, as provided by the MODELLER package, which is equal to: (number of identical residues)/(number of residues of the shortest sequence).
Clefts were obtained using the standard software SURFNET  which, for a given a protein, gives a list of clefts, each defined by a set of contouring atoms. We used the number of contouring atoms of a cleft as a measure of its size. For all our analyses, we focused on the largest cleft, except in one case [see Additional file 4] where we examined the five largest clefts.
Changes in protein clefts
We used 6 parameters to characterize changes in protein clefts: root-mean-square deviation (rmsd), normalized root-mean-square deviation (rmsd100), global distance test (GDT), protrusion index (cx), variation in accessible surface area (ΔASA) and contact number (ΔCN). Unless otherwise stated, these parameters were computed using only the subset of protein atoms defining the chosen cleft in the target's experimental structure. For example, if for a given protein this cleft was defined by atoms a12, a23, a34, ..., a332, the rmsd computation between the experimental and the modelled structure was restricted to these.
We used the coordinates rmsd  as a quality measure of the clefts resulting from the modelling process. This rmsd is usually computed using all protein atoms, or main-chain atoms, etc. However, in our case we used the list of contouring atoms of the target's largest cleft. In some cases (Figure 4) we also obtained the rmsd using all the protein Cα atoms, to relate cleft and backbone qualities.
The normalized rmsd, rmsd100, is obtained from conventional rmsd using the following formula : rmsd/[1 + 0.5ln(N/100)], where rmsd is the non-normalized value (obtained as explained in the previous section), and N is the number of aligned residue pairs. rmsd100 is independent of size  and therefore allows comparison of changes observed for clefts of different sizes on the same scale.
The global distance test is a measure used when comparing two structures. It allows the identification of common sub-structures between them. This measure corresponds to the percentage of aligned atoms that are at a distance lower than a given threshold. Here we used four thresholds, 1 Å, 2 Å, 4 Å and 8 Å, which are typically applied to assess structure predictions. GDT was computed for each threshold following an iterative procedure :
a- compute the optimal superimposition  between the cleft atoms in the experimental and model structures, as explained in the rmsd section
b- find all aligned atom pairs at a distance lower than the threshold
c- obtain the optimal superimposition using only the atom pairs obtained in step b.
d- repeat steps b and c until no changes are observed in the pairs list during two iteration cycles.
cx  is a parameter that provides a fine-grained, local view of atomic environment. It is equal to the ratio between free and occupied volume within a sphere (10 Å radius) centered in each heavy atom. cx varies between 0 and 15, with large values corresponding to protruding atoms that may either be involved in protein-protein interactions, or correspond to proteolysis sites. cx was computed with the program developed by Pintar and colleagues .
For a given atom, when comparing cx values between structures, we may find small fluctuations that are probably meaningless. To establish a threshold beyond which variation in cx values may be relevant, we compared a set of pairs of replicas of the same structure, obtained under different experimental conditions. This pairs list was obtained by clustering all PDB  structures from the version of May 25th, 2007. We implemented a series of filters to exclude: theoretical models, modified residues, incomplete residues, missing and/or unknown residues, extreme experimental conditions (e.g. high or low pressure, etc), and mutants. After applying these filters, we then clustered the accepted proteins with Cd-hit  and eliminated those cases for which there were length differences between cluster members. The final list comprised 223 pairs of equivalent protein structures. For each protein atom we then computed, Δcx, the difference in cx between replicas. We found (results not shown) that over 99% of Δcx values clustered between -1 and 1. On this basis we imposed that only atoms with variations in cx larger than 1 in absolute value between the experimental and model structure would be taken into account.
The atom ASA was computed using the program NACCESS  with probe radius equal to 1.4 Å. It is a shape descriptor related to the capacity of atoms and residues to interact with their environment.
Change in contact number is derived from ΔASA following the study of Colonna-Cesari and Sander : ΔCN ~ 0.31ΔASA. ΔCN provides a coarse-grained view of how the interaction capacity of a given atom varies as a result of changes in the model.
To plot the large number of data resulting from our analyses we mostly used boxplots instead of dotplots to avoid the overplotting problem that affects the latter . Boxplots are usually employed to represent continuous variables and facilitate comparison between distributions . Apart from the median, which is represented as an independent point with a special symbol (a triangle in our case), there are three main features in the boxplots: the central box, the "whiskers" and the outliers. The central box goes from the first (25th percentile) to the third quartile (75th percentile). One "whisker" starts at the first quartile and goes down the graph; the other "whisker" starts at the third quartile and goes to the top of the graph. The length of these "whiskers" is equal to the minimum between the respective extreme values and 1.5 times the interquartile range (the difference between the 75th and the 25th percentile values). Outliers are plotted as separate points. For clarity, we omitted outliers, but no conclusion was affected by their absence.
List of Abbreviations used
accessible surface area
global distance test
GDT_2, GDT_3 and GDT_4: global distance tests computed for 1 Å, 2 Å, 4 Å and 8 Å distance thresholds, respectively
normalized root-mean-square deviation
ratio between number of cleft side-chain atoms contributing to a given GDT (e.g. GDT_1, GDT_2, etc) and total number of cleft side-chain atoms.
The authors thank the CATH team for their support, Maria Kontoyianni for critical reading of the manuscript, Modesto Orozco and Carles Ferrer for helpful suggestions, and the reviewers for their constructive comments. XdC acknowledges funding from the Spanish government (Grants BIO2003-09327, BIO2006-15557) and the Wellcome Trust (Research Collaboration Grant 069878/Z/02/Z). DP acknowledges economical support from the Government of Catalonia and SL from the Consejo Superior de Investigaciones Científicas.
- Baker D, Sali A: Protein structure prediction and structural genomics. Science 2001, 294(5540):93–96. 10.1126/science.1065659View ArticleGoogle Scholar
- Burley SK, Bonanno JB: Structuring the universe of proteins. Annu Rev Genomics Hum Genet 2002, 3: 243–262. 10.1146/annurev.genom.3.022502.103227View ArticleGoogle Scholar
- Erlandsen H, Abola EE, Stevens RC: Combining structural genomics and enzymology: completing the picture in metabolic pathways and enzyme active sites. Curr Opin Struct Biol 2000, 10(6):719–730. 10.1016/S0959-440X(00)00154-8View ArticleGoogle Scholar
- Jung JW, Lee W: Structure-based functional discovery of proteins: structural proteomics. J Biochem Mol Biol 2004, 37(1):28–34.View ArticleGoogle Scholar
- Schmid MB: Seeing is believing: the impact of structural genomics on antimicrobial drug discovery. Nat Rev Microbiol 2004, 2(9):739–746. 10.1038/nrmicro978View ArticleGoogle Scholar
- Vitkup D, Melamud E, Moult J, Sander C: Completeness in structural genomics. Nat Struct Biol 2001, 8(6):559–566. 10.1038/88640View ArticleGoogle Scholar
- Marsden RL, Lewis TA, Orengo CA: Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007, 8: 86. 10.1186/1471-2105-8-86View ArticleGoogle Scholar
- Sadreyev RI, Grishin NV: Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds. BMC Struct Biol 2006, 6: 6. 10.1186/1472-6807-6-6View ArticleGoogle Scholar
- O'Toole N, Raymond S, Cygler M: Coverage of protein sequence space by current structural genomics targets. J Struct Funct Genomics 2003, 4(2–3):47–55. 10.1023/A:1026156025612View ArticleGoogle Scholar
- Chance MR, Bresnick AR, Burley SK, Jiang JS, Lima CD, Sali A, Almo SC, Bonanno JB, Buglino JA, Boulton S, et al.: Structural genomics: a pipeline for providing structures for the biologist. Protein Sci 2002, 11(4):723–738. 10.1110/ps.4570102View ArticleGoogle Scholar
- Goh CS, Lan N, Douglas SM, Wu B, Echols N, Smith A, Milburn D, Montelione GT, Zhao H, Gerstein M: Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis. J Mol Biol 2004, 336(1):115–130. 10.1016/j.jmb.2003.11.053View ArticleGoogle Scholar
- Lesley SA, Kuhn P, Godzik A, Deacon AM, Mathews I, Kreusch A, Spraggon G, Klock HE, McMullan D, Shin T, et al.: Structural genomics of the Thermotoga maritima proteome implemented in a high-throughput structure determination pipeline. Proc Natl Acad Sci USA 2002, 99(18):11664–11669. 10.1073/pnas.142413399View ArticleGoogle Scholar
- O'Toole N, Grabowski M, Otwinowski Z, Minor W, Cygler M: The structural genomics experimental pipeline: insights from global target lists. Proteins 2004, 56(2):201–210. 10.1002/prot.20060View ArticleGoogle Scholar
- Page R, Peti W, Wilson IA, Stevens RC, Wuthrich K: NMR screening and crystal quality of bacterially expressed prokaryotic and eukaryotic proteins in a structural genomics pipeline. Proc Natl Acad Sci USA 2005, 102(6):1901–1905. 10.1073/pnas.0408490102View ArticleGoogle Scholar
- Peti W, Page R, Moy K, O'Neil-Johnson M, Wilson IA, Stevens RC, Wuthrich K: Towards miniaturization of a structural genomics pipeline using micro-expression and microcoil NMR. J Struct Funct Genomics 2005, 6(4):259–267. 10.1007/s10969-005-9000-xView ArticleGoogle Scholar
- Chandonia JM, Brenner SE: The impact of structural genomics: expectations and outcomes. Science 2006, 311(5759):347–351. 10.1126/science.1121018View ArticleGoogle Scholar
- Levitt M: Growth of novel protein structural data. Proc Natl Acad Sci USA 2007, 104(9):3183–3188. 10.1073/pnas.0611678104View ArticleGoogle Scholar
- Spraggon G, Pantazatos D, Klock HE, Wilson IA, Woods VL Jr, Lesley SA: On the use of DXMS to produce more crystallizable proteins: structures of the T. maritima proteins TM0160 and TM1171. Protein Sci 2004, 13(12):3187–3199. 10.1110/ps.04939904View ArticleGoogle Scholar
- Symersky J, Zhang Y, Schormann N, Li S, Bunzel R, Pruett P, Luan CH, Luo M: Structural genomics of Caenorhabditis elegans: structure of the BAG domain. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 9):1606–1610. 10.1107/S0907444904017603View ArticleGoogle Scholar
- Todd AE, Marsden RL, Thornton JM, Orengo CA: Progress of structural genomics initiatives: an analysis of solved target structures. J Mol Biol 2005, 348(5):1235–1260. 10.1016/j.jmb.2005.03.037View ArticleGoogle Scholar
- Ginalski K: Comparative modeling for protein structure prediction. Curr Opin Struct Biol 2006, 16(2):172–177. 10.1016/j.sbi.2006.02.003View ArticleGoogle Scholar
- Yura K, Yamaguchi A, Go M: Coverage of whole proteome by structural genomics observed through protein homology modeling database. J Struct Funct Genomics 2006, 7(2):65–76. 10.1007/s10969-006-9010-3View ArticleGoogle Scholar
- Sanchez R, Pieper U, Melo F, Eswar N, Marti-Renom MA, Madhusudhan MS, Mirkovic N, Sali A: Protein structure modeling for structural genomics. Nat Struct Biol 2000, 7(Suppl):986–990. 10.1038/80776View ArticleGoogle Scholar
- Peitsch MC: About the use of protein models. Bioinformatics 2002, 18(7):934–938. 10.1093/bioinformatics/18.7.934View ArticleGoogle Scholar
- Marti L, Abella A, De La Cruz X, Garcia-Vicente S, Unzeta M, Carpene C, Palacin M, Testar X, Orozco M, Zorzano A: Exploring the binding mode of semicarbazide-sensitive amine oxidase/VAP-1: identification of novel substrates with insulin-like activity. J Med Chem 2004, 47(20):4865–4874. 10.1021/jm0499211View ArticleGoogle Scholar
- Chakravarty S, Sanchez R: Systematic analysis of added-value in simple comparative models of protein structure. Structure 2004, 12(8):1461–1470. 10.1016/j.str.2004.05.018View ArticleGoogle Scholar
- Chakravarty S, Wang L, Sanchez R: Accuracy of structure-derived properties in simple comparative models of protein structures. Nucleic Acids Res 2005, 33(1):244–259. 10.1093/nar/gki162View ArticleGoogle Scholar
- Sanchez R, Sali A: Advances in comparative protein-structure modelling. Curr Opin Struct Biol 1997, 7(2):206–214. 10.1016/S0959-440X(97)80027-9View ArticleGoogle Scholar
- Tondel K: Prediction of homology model quality with multivariate regression. J Chem Inf Comput Sci 2004, 44(5):1540–1551. 10.1021/ci049924mView ArticleGoogle Scholar
- Tramontano A, Leplae R, Morea V: Analysis and assessment of comparative modeling predictions in CASP4. Proteins 2001, (Suppl 5):22–38. 10.1002/prot.10015
- Venclovas C: Comparative modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins 2001, (Suppl 5):47–54. 10.1002/prot.10008
- Ginalski K, Rychlewski L: Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment. Proteins 2003, 53(Suppl 6):410–417. 10.1002/prot.10548View ArticleGoogle Scholar
- Venclovas C: Comparative modeling in CASP5: progress is evident, but alignment errors remain a significant hindrance. Proteins 2003, 53(Suppl 6):380–388. 10.1002/prot.10591View ArticleGoogle Scholar
- DeWeese-Scott C, Moult J: Molecular modeling of protein function regions. Proteins 2004, 55(4):942–961. 10.1002/prot.10519View ArticleGoogle Scholar
- Bates PA, Sternberg MJ: Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins 1999, (Suppl 3):47–54. Publisher Full Text 10.1002/(SICI)1097-0134(1999)37:3+%3C47::AID-PROT7%3E3.0.CO;2-F
- Venclovas C, Ginalski K, Fidelis K: Addressing the issue of sequence-to-structure alignments in comparative modeling of CASP3 target proteins. Proteins 1999, (Suppl 3):73–80. Publisher Full Text 10.1002/(SICI)1097-0134(1999)37:3+%3C73::AID-PROT11%3E3.0.CO;2-P
- Kryshtafovych A, Venclovas C, Fidelis K, Moult J: Progress over the first decade of CASP experiments. Proteins 2005, 61(Suppl 7):225–236. 10.1002/prot.20740View ArticleGoogle Scholar
- Martin AC, MacArthur MW, Thornton JM: Assessment of comparative modeling in CASP2. Proteins 1997, (Suppl 1):14–28. Publisher Full Text 10.1002/(SICI)1097-0134(1997)1+%3C14::AID-PROT4%3E3.0.CO;2-O
- Contreras-Moreira B, Ezkurdia I, Tress ML, Valencia A: Empirical limits for template-based protein structure prediction: the CASP5 example. FEBS Lett 2005, 579(5):1203–1207. 10.1016/j.febslet.2005.01.014View ArticleGoogle Scholar
- Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A: Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins 2005, 61(Suppl 7):27–45. 10.1002/prot.20720View ArticleGoogle Scholar
- Moult J: Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos Trans R Soc Lond B Biol Sci 2006, 361(1467):453–458. 10.1098/rstb.2005.1810View ArticleGoogle Scholar
- Sanchez R, Sali A: Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci USA 1998, 95(23):13597–13602. 10.1073/pnas.95.23.13597View ArticleGoogle Scholar
- Orengo CA, Thornton JM: Protein families and their evolution-a structural perspective. Annu Rev Biochem 2005, 74: 867–900. 10.1146/annurev.biochem.74.082803.133029View ArticleGoogle Scholar
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 2003, 333(4):863–882. 10.1016/j.jmb.2003.08.057View ArticleGoogle Scholar
- Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307(4):1113–1143. 10.1006/jmbi.2001.4513View ArticleGoogle Scholar
- Bray JE, Marsden RL, Rison SC, Savchenko A, Edwards AM, Thornton JM, Orengo CA: A practical and robust sequence search strategy for structural genomics target selection. Bioinformatics 2004, 20(14):2288–2295. 10.1093/bioinformatics/bth240View ArticleGoogle Scholar
- Rost B: Protein structures sustain evolutionary drift. Fold Des 1997, 2(3):S19–24. 10.1016/S1359-0278(97)00059-XView ArticleGoogle Scholar
- Lesk AM, Levitt M, Chothia C: Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Eng 1986, 1(1):77–78. 10.1093/protein/1.1.77View ArticleGoogle Scholar
- Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291View ArticleGoogle Scholar
- Pearl FM, Bennett CF, Bray JE, Harrison AP, Martin N, Shepherd A, Sillitoe I, Thornton J, Orengo CA: The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res 2003, 31(1):452–455. 10.1093/nar/gkg062View ArticleGoogle Scholar
- Rohl CA, Strauss CE, Chivian D, Baker D: Modeling structurally variable regions in homologous proteins with rosetta. Proteins 2004, 55(3):656–677. 10.1002/prot.10629View ArticleGoogle Scholar
- Laskowski RA: SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 1995, 13(5):323–330. 10.1016/0263-7855(95)00073-9View ArticleGoogle Scholar
- Laskowski RA, Luscombe NM, Swindells MB, Thornton JM: Protein clefts in molecular recognition and function. Protein Sci 1996, 5(12):2438–2452.Google Scholar
- Liang J, Edelsbrunner H, Woodward C: Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 1998, 7(9):1884–1897.View ArticleGoogle Scholar
- de la Cruz X, Mark AE, Tormo J, Fita I, van Gunsteren WF: Investigation of shape variations in the antibody binding site by molecular dynamics computer simulation. J Mol Biol 1994, 236(4):1186–1195. 10.1016/0022-2836(94)90020-5View ArticleGoogle Scholar
- Fradera X, De La Cruz X, Silva CH, Gelpi JL, Luque FJ, Orozco M: Ligand-induced changes in the binding sites of proteins. Bioinformatics 2002, 18(7):939–948. 10.1093/bioinformatics/18.7.939View ArticleGoogle Scholar
- Carugo O, Pongor S: A normalized root-mean-square distance for comparing protein three-dimensional structures. Protein Sci 2001, 10(7):1470–1473. 10.1110/ps.690101View ArticleGoogle Scholar
- Zemla A: LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res 2003, 31(13):3370–3374. 10.1093/nar/gkg571View ArticleGoogle Scholar
- Pintar A, Carugo O, Pongor S: CX, an algorithm that identifies protruding atoms in proteins. Bioinformatics 2002, 18(7):980–984. 10.1093/bioinformatics/18.7.980View ArticleGoogle Scholar
- Richards FM: Areas, volumes, packing and protein structure. Annu Rev Biophys Bioeng 1977, 6: 151–176. 10.1146/annurev.bb.06.060177.001055View ArticleGoogle Scholar
- de La Cruz X, Calvo M: Use of surface area computations to describe atom-atom interactions. J Comput Aided Mol Des 2001, 15(6):521–532. 10.1023/A:1011133332333View ArticleGoogle Scholar
- de la Cruz X, Reverter J, Fita I: Representation of noncovalent interactions in protein structures. J Mol Graph 1992, 10(2):96–100. 10.1016/0263-7855(92)80062-IView ArticleGoogle Scholar
- Eisenberg D, McLachlan AD: Solvation energy in protein folding and binding. Nature 1986, 319(6050):199–203. 10.1038/319199a0View ArticleGoogle Scholar
- Eisenhaber F: Hydrophobic regions on protein surfaces. Derivation of the solvation energy from their area distribution in crystallographic protein structures. Protein Sci 1996, 5(8):1676–1686.View ArticleGoogle Scholar
- Young L, Jernigan RL, Covell DG: A role for surface hydrophobicity in protein-protein recognition. Protein Sci 1994, 3(5):717–729.View ArticleGoogle Scholar
- Colonna-Cesari F, Sander C: Excluded volume approximation to protein-solvent interaction. The solvent contact model. Biophys J 1990, 57(5):1103–1107.View ArticleGoogle Scholar
- Fan H, Mark AE: Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Sci 2004, 13(1):211–220. 10.1110/ps.03381404View ArticleGoogle Scholar
- Moult J: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 2005, 15(3):285–289. 10.1016/j.sbi.2005.05.011View ArticleGoogle Scholar
- Stark A, Sunyaev S, Russell RB: A model for statistical significance of local similarities in structure. J Mol Biol 2003, 326(5):1307–1316. 10.1016/S0022-2836(03)00045-7View ArticleGoogle Scholar
- Maiorov VN, Crippen GM: Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins. J Mol Biol 1994, 235(2):625–634. 10.1006/jmbi.1994.1017View ArticleGoogle Scholar
- Maiorov VN, Crippen GM: Size-independent comparison of protein three-dimensional structures. Proteins 1995, 22(3):273–283. 10.1002/prot.340220308View ArticleGoogle Scholar
- Canutescu AA, Shelenkov AA, Dunbrack RL Jr: A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 2003, 12(9):2001–2014. 10.1110/ps.03154503View ArticleGoogle Scholar
- Manocha D, Zhu Y, Wright W: Conformational analysis of molecular chains using nano-kinematics. Comput Appl Biosci 1995, 11(1):71–86.Google Scholar
- Lupyan D, Leo-Macias A, Ortiz AR: A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005, 21(15):3255–3263. 10.1093/bioinformatics/bti527View ArticleGoogle Scholar
- Ortiz AR, Strauss CE, Olmea O: MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 2002, 11(11):2606–2621. 10.1110/ps.0215902View ArticleGoogle Scholar
- Lise S, Walker-Taylor A, Jones DT: Docking protein domains in contact space. BMC Bioinformatics 2006, 7: 310. 10.1186/1471-2105-7-310View ArticleGoogle Scholar
- Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 2004, (32 Database):D226–229. 10.1093/nar/gkh039
- Marti-Renom MA, Madhusudhan MS, Sali A: Alignment of protein sequences by their profiles. Protein Sci 2004, 13(4):1071–1087. 10.1110/ps.03379804View ArticleGoogle Scholar
- Vogt G, Etzold T, Argos P: An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol 1995, 249(4):816–831. 10.1006/jmbi.1995.0340View ArticleGoogle Scholar
- Dunbrack RL Jr: Sequence comparison and protein structure prediction. Curr Opin Struct Biol 2006, 16(3):374–384. 10.1016/j.sbi.2006.05.006View ArticleGoogle Scholar
- Kabsch WA: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 1976, A32: 922–923. 10.1107/S0567739476001873View ArticleGoogle Scholar
- Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, et al.: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 2002, 58(Pt 6 No 1):899–907. 10.1107/S0907444902003451View ArticleGoogle Scholar
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658–1659. 10.1093/bioinformatics/btl158View ArticleGoogle Scholar
- Hubbard SJ, Thornton JM: NACCESS. Department of Biochemistry and Molecular Biology, University College London; 1993.Google Scholar
- Unwin A, Theus M, Hofman H: Graphics of Large Datasets. New York: Springer; 2006.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.