- Research article
- Open Access
Type II restriction endonuclease R.Hpy188I belongs to the GIY-YIG nuclease superfamily, but exhibits an unusual active site
BMC Structural Biology, volume 8, Article number: 48 (2008)
Catalytic domains of Type II restriction endonucleases (REases) belong to a few unrelated three-dimensional folds. While the PD-(D/E)XK fold is most common among these enzymes, crystal structures have been also determined for single representatives of two other folds: PLD (R.BfiI) and half-pipe (R.PabI). Bioinformatics analyses supported by mutagenesis experiments suggested that some REases belong to the HNH fold (e.g. R.KpnI), and that a small group represented by R.Eco29kI belongs to the GIY-YIG fold. However, for a large fraction of REases with known sequences, the three-dimensional fold and the architecture of the active site remain unknown, mostly due to extreme sequence divergence that hampers detection of homology to enzymes with known folds.
R.Hpy188I is a Type II REase with unknown structure. PSI-BLAST searches of the non-redundant protein sequence database reveal only 1 homolog (R.HpyF17I, with nearly identical amino acid sequence and the same DNA sequence specificity). Standard application of state-of-the-art protein fold-recognition methods failed to predict the relationship of R.Hpy188I to proteins with known structure or to other protein families. In order to increase the amount of evolutionary information in the multiple sequence alignment, we have expanded our sequence database searches to include sequences from metagenomics projects. This search resulted in identification of 23 further members of R.Hpy188I family, both from metagenomics and the non-redundant database. Moreover, fold-recognition analysis of the extended R.Hpy188I family revealed its relationship to the GIY-YIG domain and allowed for computational modeling of the R.Hpy188I structure. Analysis of the R.Hpy188I model in the light of sequence conservation among its homologs revealed an unusual variant of the active site, in which the typical Tyr residue of the YIG half-motif had been substituted by a Lys residue. Moreover, some of its homologs have the otherwise invariant Arg residue in a non-homologous position in sequence that nonetheless allows for spatial conservation of the guanidino group potentially involved in phosphate binding.
The present study eliminates a significant "white spot" on the structural map of REases. It also provides important insight into sequence-structure-function relationships in the GIY-YIG nuclease superfamily. Our results reveal that in the case of proteins with no or few detectable homologs in the standard "non-redundant" database, it is useful to expand this database by adding the metagenomic sequences, which may provide evolutionary linkage to detect more remote homologs.
Type II restriction endonucleases (REases) form one of the largest groups of biochemically characterized enzymes (reviews: [1, 2]). They usually recognize a short (4–8 bp) palindromic sequence of double-stranded DNA and catalyze the hydrolysis of phosphodiester bonds at precise positions within or close to this sequence, leaving "blunt" ends or "sticky" (5' or 3') overhangs. They form restriction-modification (RM) systems together with DNA methyltransferases (MTases) of the same or a similar sequence specificity, whose enzymatic activity leads to methylation of the target sequence and, consequently, its protection against cleavage by the REase [3]. Type II RM systems behave as selfish "toxin-antitoxin" genetic modules; they undergo rampant horizontal transfer and parasitize the cells of prokaryotic hosts to ensure the maintenance of their DNA [4–6]. The activity of the RM systems manifests itself by destruction of DNA molecules without the required methylation patterns, e.g. DNA molecules of invading phages or plasmids, or the genomic DNA of their host cells that once had the RM genes but have lost them.
The activity of REases is the target of selection pressure involving various agents: their host, the invading DNA molecules, and their competitors including other RM systems [7–10]. Presumably because of the absence of simple constant selection pressure on the REase activity, they undergo rapid divergence, and as a consequence, different REase families exhibit very little sequence similarity (review: [11]). Besides, there is strong evidence, mainly from crystallographic analyses, that these enzymes have originated independently on at least several occasions in the course of evolution.
Thus far, REases have been found to belong to at least five unrelated structural folds. Most REases belong to the PD-(D/E)XK superfamily of Mg2+-dependent nucleases, which also includes various proteins involved in DNA recombination and repair [12, 13]. Two REases with different folds have been found to be Mg2+-independent: R.BfiI belongs to the phospholipase D (PLD) superfamily of phosphodiesterases [14, 15], while R.PabI exhibits a novel "half-pipe" fold [16, 17]. A number of REases have been predicted to be related to the HNH superfamily of metal-dependent nucleases, which groups together enzymes with various activities, such as recombinases, DNA repair enzymes, and homing endonucleases [12, 18]. For some of these REases from the HNH superfamily, bioinformatics predictions of the active site have been substantiated by mutagenesis; examples include R.KpnI, R.MnlI, and R.Eco31I. Finally, R.Eco29kI and its two close homologs have been predicted to belong to the GIY-YIG superfamily of nucleases that includes e.g. DNA repair enzymes and homing nucleases; this prediction has been recently supported by mutagenesis of the R.Eco29kI active site. Among all REase folds, the mechanisms of action of GIY-YIG and half-pipe nucleases are the least well understood, and no co-crystal structures are available for any member of these superfamilies.
A recent large-scale bioinformatics survey of Type II REase sequences  indicated that for about 81% of experimentally characterized (i.e. not putative) enzymes, the three-dimensional fold can be predicted based on advanced bioinformatics analyses, mainly protein fold-recognition and analysis of amino acid conservation patterns and secondary structure prediction (review of methodology: ). However, the other REases remain unassigned to known folds and the architectures of their active sites and potential mechanisms of action remain obscure.
R.Hpy188I is one of the REases for which no fold prediction has been made thus far. R.Hpy188I recognizes the unique sequence TCNGA and cleaves the DNA between nucleotides N and G in its recognition sequence to generate a one-base 3' overhang. Its orthologs are found among many, but not all, strains of Helicobacter pylori that have been tested with respect to the REase activity. In this work, we present the results of a bioinformatics analysis that has detected a remote relationship between R.Hpy188I and known GIY-YIG nucleases thanks to the use of metagenomics sequences to generate a multiple sequence alignment with enhanced evolutionary information. We suggest that this approach could be applied to predict the structure of other proteins for which fold-recognition analyses done with standard alignments have failed.
Initial bioinformatics analysis of R.Hpy188I and its homologs
The lack of overall sequence conservation among REases, the absence of invariable residues even in the active site and the presence of several alternative folds make structure prediction and generation of multiple sequence alignments for these enzymes a non-trivial task. In order to predict the structure of R.Hpy188I, we used the GeneSilico meta-server, which is a gateway to a number of third-party algorithms (see Methods). In particular, we predicted the secondary structure of this enzyme and carried out the fold recognition analysis to identify the structures of potentially homologous proteins in the Protein Data Bank that could serve as modeling templates. Unfortunately, querying the meta-server with the R.Hpy188I sequence alone did not reveal any significant matches to proteins of known structure (for a discussion of significance thresholds of individual FR methods, see the Methods section). Of all methods used, only HHSEARCH revealed a match to GIY-YIG nucleases, albeit at the 9th position of the ranking, with a score that did not indicate statistical significance (probability 0.113, e-value 68).
Most fold recognition methods make their predictions not for a single sequence, but for a multiple sequence alignment generated by PSI-BLAST searches of the non-redundant (nr) NCBI database (or of a subset of sequences culled from this database). Analysis of an independently carried out PSI-BLAST run against that database (with an e-value threshold of 1e-3) revealed only one nearly identical sequence, that of REase R.HpyF17I, which exhibits only 1 amino acid difference and 4 additional amino acids at the N terminus (Sapranauskas, R., Lubys, A. and Janulaitis, A. unpublished reference "Cloning and analysis of the TCNGA-specific restriction-modification system from Helicobacter pylori strain A17-2"). The results of fold recognition analysis starting from R.HpyF17I, or from an alignment of R.Hpy188I and R.HpyF17I, were the same as those starting from R.Hpy188I alone. Thus, R.Hpy188I can be considered an "ORFan", at least with respect to the nr database.
Previously, in the course of bioinformatics analysis of the R.NlaIV enzyme, we found that inclusion of sequences from metagenomics projects can increase the information content of a multiple sequence alignment and improve detection of remote homologies, in particular for proteins with very few homologs in the nr database. Thus, we carried out a new PSI-BLAST search for R.Hpy188I (also with an e-value threshold of 1e-3) against the env_nr database (protein sequences deduced from environmental DNA samples), obtained from the NCBI server. This search revealed 9 sequences, with e-values ranging from 3e-10 to 3e-4. Again, running FR analyses for these sequences gave no significant matches to any structure. Nonetheless, a PSI-BLAST search of a database comprising both nr and env_nr revealed an increased number of sequences. In this search, 25 sequences, including 18 from the marine metagenome, were found to exhibit significant scores (e-values < 1e-4) and a conserved pattern of residues (I/V)-Y-X9-(K)-I-G (where X indicates any amino acid residue) associated with a predicted β-hairpin structure that remotely resembled the genuine bipartite GIY-YIG motif. FR analysis of a multiple sequence alignment calculated for the sequences returned by the PSI-BLAST search revealed the relationship of these sequences to the GIY-YIG superfamily, according to the following servers: HHSEARCH (probability 0.946), FFAS (score -12.6), mGenTHREADER (probability 0.422), FUGUE (score 10.3), INUB (score 44.1). According to the Livebench evaluation, all these scores indicate higher reliability than the threshold of approximately 5% false positives (see Methods) and in our experience can be taken as a reasonably confident 3D fold prediction. Further, the consensus predictor PCONS selected 1yd0 as the preferred template with a score of 0.665, a value almost exactly at the threshold. Thus, we estimate that the probability of an incorrect fold prediction for the R.Hpy188I family is around 5%.
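As an illustration of how the conserved pattern (I/V)-Y-X9-(K)-I-G could be scanned for in a protein sequence, the following minimal Python sketch translates it into a regular expression. It assumes a strict reading of the pattern (I or V, then Y, then any nine residues, then K, I, G); the function name and the toy sequence are hypothetical and are not taken from the alignment in Figure 1. Under this reading the Tyr and Lys residues are ten positions apart, which agrees with the numbering of Y63 and K73 in R.Hpy188I.

```python
import re

# Assumption: the motif is read strictly as [I or V]-Y-(any 9 residues)-K-I-G.
# Real family members may deviate at individual positions.
MOTIF = re.compile(r"[IV]Y.{9}KIG")


def find_motif(sequence):
    """Return (1-based start, matched text) for each non-overlapping motif hit."""
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(sequence.upper())]


# Hypothetical toy sequence built for illustration only, NOT a real
# R.Hpy188I fragment: a 3-residue prefix, V-Y, nine spacer residues, K-I-G.
toy = "MST" + "VY" + "A" * 9 + "KIG" + "DE"
hits = find_motif(toy)
print(hits)
```

A pattern this short will also match unrelated proteins by chance, so a hit is only meaningful in combination with the profile-based evidence described above.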
We conclude that utilization of evolutionary information from metagenomics sequences can greatly increase the information content of a multiple sequence alignment, to the point where a reasonably sized family can be detected for a sequence, which appears as an "ORFan" when only the nr database is considered. An extended multiple sequence alignment that includes metagenomics sequences together with proteins from the nr database can then be used as a sensitive probe in protein fold-recognition, for detection of remote homologies to proteins of known structure.
Molecular modeling of R.Hpy188I
It is well known that fold recognition methods can produce artifacts. For instance, sequence alignments to wrong templates can reveal misleading local similarity of amino acid residues, and generate structures that are completely misfolded. Thus, in order to substantiate the sequence-based prediction of membership of R.Hpy188I in the GIY-YIG superfamily (with the confidence of FR predictions estimated to be around 95%), we decided to build a model of its structure and evaluate its quality on the three-dimensional level. Although the GIY-YIG domain of UvrC  has been identified as the preferred structure, fold recognition alignments reported by different methods exhibited differences. Thus, we used the "FRankenstein's Monster" approach to simultaneously generate a model of the protein core and optimize the target-template alignment by generation, evaluation, and recombination of alternative models [32, 33]. This approach has been evaluated as one of the best template-modeling methods in CASP5 and CASP6; we have also used it to generate accurate models of REases R.SfiI  and R.MvaI , which were confirmed by independent crystallographic analyses [36, 37]. The final alignment (Figure 1) indicated that regions 1–59, 89–103, and 113–121 of R.Hpy188I lack the counterpart in GIY-YIG domains of known structure and cannot be modeled "by homology".
Initially, we attempted to fold regions 1–59 (N-terminal extension), 89–103, and 128–143 (two insertions and a structure of low sequence similarity to the template) using ROSETTA (see Methods), while keeping the rest of the model 'frozen'. However, the resulting models (low-energy representatives of the 5 largest clusters of decoys) exhibited relatively poor packing (data not shown). Thus, we subjected these models to refinement with the REFINER method , using additional restraints on secondary structure, according to the consensus prediction reported by the meta-server. Recently, we have used this approach to correctly predict the structures of MiaA, MiaB, and MiaE enzymes . Among all the refined R.Hpy188I models, the one with the lowest predicted deviation to the native structure (root mean square deviation from the native structure of about 4.25 Å according to the MetaMQAP method, and LGscore of 3.536 i.e. 'very good model' according to PROQ) has been selected as the final model (Figure 2) and subjected to further analyses.
Analysis of the R.Hpy188I model
Comparison of the R.Hpy188I model with the much smaller template structures of GIY-YIG domains of UvrC and I-TevI homing endonuclease (Figure 2) illustrates the challenge of modeling, in particular with respect to regions that have no counterpart in the templates and have been added de novo. Nonetheless, our model obtained very good scores, which suggests that it is likely to be well-folded and that potential errors are unlikely to occur in the structurally most important regions. Parts modeled de novo do not form an autonomously folded (sub)domain. Instead, they pack against the homology-modeled GIY-YIG core. The secondary structure in the model fulfills the restraints used during model building; interestingly, a part of the N-terminal loop (residues 6–8) has formed a small β-sheet with a part of the insertion (residues 101–103). The model reveals the predicted configuration of the putative active site of R.Hpy188I, comprising amino acid residues Y63, K73, R84, Y88, E149, and Q169 (Figure 3). In comparison with the GIY-YIG domains analyzed so far , R.Hpy188I and some of its homologs are the first to exhibit K (K73 for R.Hpy188I) at position corresponding to Y29 of UvrC and Y17 of I-TevI (Y of the YIG half motif) and Q (Q169 for R.Hpy188I) at the position corresponding to N88 of UvrC and N90 of I-TevI (Figure 3 and 4).
The mechanism of phosphodiester bond hydrolysis has not been elucidated experimentally for any protein from the GIY-YIG superfamily; however, a tentative mechanism has been proposed based on analysis of the crystal structure of a GIY-YIG domain from the UvrC enzyme. In analogy to that tentative mechanism, the divalent metal ion may function as a Lewis acid, while E149 of R.Hpy188I may be responsible for metal coordination, K73 (alternatively Y63 or Y88) may function as a general base, and R84 may stabilize the negative charge of the free 5' phosphate after DNA cleavage. The hydroxyl group of Y29 of UvrC has been proposed to accept a proton from a nucleophilic water molecule while simultaneously transferring its proton to the metal-bound hydroxide. The amino group of K73 might act in a similar way.
Interestingly, among the afore-mentioned residues of the putative active site, R84 (indicated by an asterisk in Figure 1) is not absolutely conserved in the R.Hpy188I family alignment. However, in a number of R.Hpy188I homologs, a corresponding Arg residue (indicated in Figure 1) is found not in the α-helix, but in another loop, on the opposite side of the active site (positions 104 or 105 in R.Hpy188I). The distributions of R84 and R104/105 are exactly complementary. Modeling of the active site variants with the Arg residue in these alternative locations (Figure 4) revealed that the positively charged guanidino group at the tip of its side chain can assume spatially similar location as in the "orthodox" position. This finding suggests that R104/105 may fulfill the same role of phosphate binding as R84 despite being attached to a non-homologous position in the protein backbone. Such a spatial "migration" of a catalytic residue has not yet been observed in enzymes from the GIY-YIG superfamily; however, it has been reported for two different residues (Glu/Asp or Lys/Arg) in a number of nucleases from the PD-(D/E)XK superfamily [40–42]. Thus, it will be interesting to test experimentally the functional significance of the swapped Arg residue in the newly discovered GIY-YIG enzymes described in this work.
In addition to potential catalytic residues, the model of R.Hpy188I (Figure 4) revealed a pair of semi-conserved cysteines (C90 and C101) in the vicinity of the alternative positions of the afore-mentioned Arg residue (84 and 104/105). The presence of these two Cys residues is strongly correlated: they co-occur in 11 sequences, and a single member of this pair is present in only 2 sequences. Both Cys residues are absent from all 12 members of the R.Hpy188I family that possess a shorter variant of the intervening loop (Figure 1). It is tempting to speculate that this pair of Cys residues might have a functional role, e.g. to somehow stabilize the longer variant of the loop that may be involved in protein-DNA interactions. In the model they are sufficiently close to each other to form a disulfide, which is however unlikely to happen in nature due to the generally reducing environment of the bacterial cytoplasm, which prevents oxidation of sulfhydryl groups. Alternatively, if R.Hpy188I forms a dimer like most Type II restriction endonucleases, they could form an intermolecular Zn-binding site. Unfortunately, our model cannot provide detailed clues as to the function of C90 and C101, hence we propose them as interesting targets for experimental characterization.
Analysis of the protein surface with respect to the distribution of sequence conservation and the electrostatic potential (Figure 5) reveals that the surface of R.Hpy188I is mostly positively charged. The predicted catalytic residues line the bottom of a pocket with an overall neutral charge that is surrounded by a charged rim. Most of that rim exhibits positive charge (complementary to the negative charge of the DNA backbone), suggesting its possible role in DNA binding. However, one side of the rim exhibits a local concentration of negative charge, suggesting potential involvement in interactions with the positively charged metal ion.
Phylogenetics and genomic context analysis of the R.Hpy188I family
In order to interpret the structural and genomic features of different R.Hpy188I homologs in the evolutionary context, we have calculated a phylogenetic tree for the entire family (Figure 6). It reveals that the R.Hpy188I family comprises two subfamilies with different characteristic features (hereafter dubbed R.Hpy188I branch and R.HpyAORF481P branch after the representative members from REBASE). All the members of the R.Hpy188I branch contain the phosphate-binding Arg residue at the "orthodox" 84 position. All of them, except for two sequences from environmental samples, contain the aforementioned pair of cysteines. On the other hand, members of the R.HpyAORF481P branch possess the phosphate-binding Arg in the "alternative" location (104 or 105) and lack the aforementioned pair of Cys residues (at positions corresponding to 90 and 101 in R.Hpy188I).
R.Hpy188I and its close homolog R.HpyF17I are the only members of this protein family for which some function has been determined. Like virtually all Type II restriction enzymes, their genes are closely associated with genes encoding a DNA modification methyltransferase with cognate specificity. Thus, checking whether functionally uncharacterized members of the R.Hpy188I family are also genetically linked with DNA methyltransferase homologs is a convenient way to predict whether they could constitute a restriction-modification system. Unfortunately, most members of the R.Hpy188I family have been identified in metagenomic sequences, which typically contain only short fragments of genomic DNA and may not necessarily include the associated MTase gene together with the REase gene. Nonetheless, we carried out an analysis of the DNA sequence context for R.Hpy188I homologs to identify their neighbors and attempted to predict their cellular function beyond the putative generic nuclease activity.
It turned out that 19 members of the R.Hpy188I family are flanked by DNA MTase homologs (Table 1). For the remaining 6 homologs, the existence of a flanking MTase gene could not be verified because of incomplete nucleotide sequences that did not extend beyond the REase-like gene. We identified flanking MTase genes for 9 out of 13 members of the R.Hpy188I branch and 10 out of 12 of the R.HpyAORF481P branch (Table 1). The conserved association of members of the R.Hpy188I family with DNA MTases suggests that all of them are, or used to be, parts of functional restriction-modification systems.
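The per-branch counts quoted above can be checked against the family totals with a few lines of Python. This is only a bookkeeping sketch using the numbers stated in the text; the dictionary layout and variable names are illustrative assumptions.

```python
# Counts as reported in the text (see Table 1 of the article for the details).
branches = {
    "R.Hpy188I": {"members": 13, "with_flanking_mtase": 9},
    "R.HpyAORF481P": {"members": 12, "with_flanking_mtase": 10},
}

total_members = sum(b["members"] for b in branches.values())
total_linked = sum(b["with_flanking_mtase"] for b in branches.values())
# Homologs whose sequence context was too short to verify an MTase neighbor.
unverified = total_members - total_linked

print(total_members, total_linked, unverified)
```

The totals reproduce the 25 family members, the 19 MTase-flanked homologs and the 6 homologs with incomplete sequence context mentioned in the paragraph.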
All but two (7 out of 9) MTases of the R.Hpy188I branch are closely related to M.Hpy188I (BLAST E-value < 1e-9). The remaining two members (GIs 136020097 and 140872195) are accompanied by truncated homologs of M.Hpy99ORF1012P and M.EsaSS1928P. In the case of 136020097 we cannot exclude that a second MTase closely related to M.Hpy188I is present on the unsequenced side of the REase gene. On the other hand, 140872195 appears to lack an M.Hpy188I homolog in its immediate neighborhood. The MTases accompanying R.Hpy188I homologs number 136020097 and 140872195 have mutually homologous sections and therefore exhibit similarity to each other. The catalytic domains of their full-length homologs M.Hpy99ORF1012P and M.EsaSS1928P are closely related to each other (33% identity), while they do not show close similarity to M.Hpy188I. This suggests that the MTase neighbors of 136020097 and 140872195 belong to a subfamily of MTases (together with M.Hpy99ORF1012P and M.EsaSS1928P) that is distinct from a subfamily of M.Hpy188I, although all these proteins belong to the same gamma class of N-MTases.
In the R.HpyAORF481P branch, 9 out of 10 detected MTase homologs are members of one family of closely related sequences, exemplified by M.HpyAORF481P. Interestingly, these proteins are members of the alpha class of N-MTases, which is topologically different from the gamma class represented by M.Hpy188I (see refs. [44, 45] for reviews of classes and permutations in DNA MTases). Finally, one member of the R.HpyAORF481P branch (GI 144033223) is associated with a MTase related to M.MunI, a member of beta class of N-MTases . The lack of evident sequence similarity between members of the three classes of MTases and their different topology suggests that their ancestors have diverged long before the divergence of R.Hpy188I and R.HpyAORF481P. This indicates that REases have exchanged their MTase partners at least twice in the evolution of the Hpy188I family of RM systems.
Several systems from the HpyAORF481P branch appear to be associated with another RM system (Table 1). The most interesting case is observed in the genome of Campylobacter upsaliensis RM3195, where another putative RM system has been inserted into the CupORF237P system comprising homologs of R.HpyAORF481P and M.HpyAORF481P. Insertion of a restriction-modification gene complex into another restriction-modification gene complex has been already suggested to have occurred in Helicobacter pylori .
Our results reveal that R.Hpy188I and its homologs are new members of the GIY-YIG superfamily, despite the fact that they exhibit two deviations from the consensus catalytic motif of the superfamily. First, R.Hpy188I exhibits Lys instead of Tyr of the "YIG" half-motif. Second, in one branch of R.Hpy188I family, a presumably catalytic Arg residue is missing at its typical position in sequence, but instead is found in a non-homologous position that nonetheless allows for spatial conservation of the guanidino group potentially involved in phosphate binding. Our discovery provides important insight into sequence diversity of GIY-YIG nucleases and suggests that other members with unusual active sites might await discovery. In this context, the theoretical model of R.Hpy188I structure developed in this work will serve as a convenient guide for experimental analyses aimed at understanding of the cleavage mechanism of GIY-YIG nucleases.
Our phylogenetic analysis shows that the R.Hpy188I family can be subdivided into two branches, one comprising close homologs of R.Hpy188I itself, and the other comprising close homologs of R.HpyAORF481P. Members of either branch are characterized by a different set of features, including localization of residues predicted to participate in the enzymatic activity and possibly in structural stability. They are also found associated with MTases from different classes. Last, but not least, sequence context analyses revealed that in the family of R.Hpy188I homologs, comprising mostly sequences detected in metagenomics data, all genes that have appropriate flanking sequences present in the database, are accompanied by a putative DNA MTase gene or its fragment, suggesting that they all are or used to be functional restriction-modification systems.
Sequence database searches, phylogenetic analyses and genomic context analyses
Searches of the non-redundant version of the current sequence database (nr) and the database of environmental protein sequences (env_nr) were carried out using PSI-BLAST, initially separately for nr and env_nr via the websites of NCBI and MPI-Tuebingen, and finally using the local version (against the combined nr+env_nr database). The final search was carried out with the e-value threshold of 1e-3. The multiple sequence alignment of R.Hpy188I and proteins identified in the nr+env_nr database was calculated using MAFFT with default parameters and refined by hand to ensure that no unwarranted gaps had been introduced within α-helices and β-strands. Finally, based on the alignment, the phylogenetic tree was calculated using MEGA 4.0, employing the Minimum Evolution method with the JTT model of substitutions. The stability of individual nodes was calculated using the bootstrap test (1000 replicates) and confirmed by the interior branch test. Genomic context analyses were carried out using hmmpfam from the HMMer package against the PFAM and TIGRFAMs databases with an e-value threshold of 1e-3.
Protein fold prediction
Preliminary structure predictions were carried out via the HHPRED server. As soon as we identified (by eye) sub-optimal alignments of the R.Hpy188I sequence to GIY-YIG nucleases of known structure, we resubmitted it to the GeneSilico metaserver gateway for secondary structure prediction and fold recognition. Structural predictions were carried out for the R.Hpy188I sequence alone (without success), for the alignments of R.Hpy188I with sequences from the env_nr database (with somewhat better, but still statistically insignificant results), and finally for the alignment of the sequences found by searching the combined nr and env_nr database. It is important to note that different FR servers use completely different scoring systems, with different scales (e.g. Z-scores, e-values, percent values etc.). Moreover, the meaning of scores changes over time and may not be the same as reported in the original publications describing the methods, as servers are modified and databases grow continuously. Comparable reliability thresholds for a number of servers are estimated e.g. by the Livebench benchmark [56, 57] conducted by Leszek Rychlewski and co-workers. For the servers whose results are discussed in this work, the Receiver Operating Characteristic (ROC) analysis gives the following rough estimates of the score below which the servers' predictions become less reliable: HHSEARCH probability: 0.629, FFAS score: -8.9 (here the scale is inverted, i.e. lower scores are better), mGenTHREADER probability: 0.351, FUGUE score: 7.00, INUB score: 25.35, PCONS [63, 64] score: 0.6657. These thresholds correspond to the average servers' scores for their 8th incorrect predictions, which amount to 5% of incorrect predictions for all targets in the Livebench test set. The scores were taken from the last Livebench run (Livebench-2008.2), with the sole exception of PCONS, which was not included in Livebench-2008.2; its score was taken from the CASP7 evaluation.
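Because the servers use different scales and FFAS is inverted, comparing each reported score with its threshold is easy to get wrong by eye. The minimal Python sketch below performs that comparison using the thresholds listed above and the scores reported for the extended R.Hpy188I family in the Results section; the data structure and the function name are illustrative assumptions, not part of any server's interface.

```python
# server: (score reported in Results, reliability threshold, higher_is_better)
# All numbers are copied from the text; FFAS uses an inverted scale.
results = {
    "HHSEARCH": (0.946, 0.629, True),
    "FFAS": (-12.6, -8.9, False),
    "mGenTHREADER": (0.422, 0.351, True),
    "FUGUE": (10.3, 7.00, True),
    "INUB": (44.1, 25.35, True),
    "PCONS": (0.665, 0.6657, True),
}


def passes_threshold(score, threshold, higher_is_better):
    """Return True if the score is at least as good as the threshold."""
    if higher_is_better:
        return score >= threshold
    return score <= threshold


verdicts = {name: passes_threshold(*values) for name, values in results.items()}
for name, ok in verdicts.items():
    print(f"{name}: {'above' if ok else 'below'} threshold")
```

With a strict comparison, the five individual servers clear their thresholds, while the PCONS score of 0.665 falls marginally short of 0.6657, which is consistent with its description as almost exactly at the threshold.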
Protein structure modeling
Homology modeling of the catalytic core was carried out using the "FRankenstein's monster" approach (see [32, 33] for a detailed description). Briefly, preliminary models were built with MODELLER based on alternative sequence alignments between R.Hpy188I and template structures obtained from various fold-recognition servers with significant scores (all templates used for modeling were members of the GIY-YIG superfamily). The preliminary models were scored by MetaMQAP, and a "hybrid" model was generated by merging fragments with consensus alignment with those non-consensus fragments that exhibited the best MetaMQAP scores. Additional evaluations of protein structure quality were carried out with PROQ.
The "hybrid" model obtained was used as a starting point for folding simulations of the complete sequence using ROSETTA. The homology-modeled core of R.Hpy188I (residues 60–88, and 104–112) was completely "frozen" and the search of conformational space for the variable regions (residues 122–170) was restricted by the choice of fragments from known crystal structures that were compatible with the sequence and predicted secondary structure of R.Hpy188I. However, the models obtained by this protocol exhibited relatively poor packing and unsatisfactory MetaMQAP and PROQ scores (data not shown). Therefore, the final simulation of R.Hpy188I folding was conducted by the REFINER method, which uses a reduced representation of the protein chain and a statistical potential of mean force to describe intramolecular interactions. REFINER is a real-space version of the lattice-based algorithm CABS, which we have earlier successfully combined with the "FRankenstein's Monster" method in CASP6 and for modeling of the R.Eco29kI enzyme, another member of the GIY-YIG superfamily. The folding was carried out with restraints on predicted secondary structure. Models generated during the simulation had their full-atom representation rebuilt and were scored using PROQ and MetaMQAP. The best-scoring structure (in terms of predicted root mean square deviation with respect to the unknown true structure) was selected as the final model. Mapping of sequence conservation onto the model was done for the multiple sequence alignment of the Hpy188I family, initially with COLORADO3D, and ultimately with the ConSurf server.
Pingoud A, Fuxreiter M, Pingoud V, Wende W: Type II restriction endonucleases: structure and mechanism. Cell Mol Life Sci 2005, 62(6):685–707. 10.1007/s00018-004-4513-1
Pingoud AM: Restriction endonucleases. Volume 14. Berlin, Heidelberg: Springer-Verlag; 2004.
Roberts RJ, Belfort M, Bestor T, Bhagwat AS, Bickle TA, Bitinaite J, Blumenthal RM, Degtyarev S, Dryden DT, Dybvig K, et al.: A nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases and their genes. Nucleic Acids Res 2003, 31(7):1805–1812. 10.1093/nar/gkg274
Naito T, Kusano K, Kobayashi I: Selfish behavior of restriction-modification systems. Science 1995, 267(5199):897–899. 10.1126/science.7846533
Kobayashi I: Behavior of restriction-modification systems as selfish mobile elements and their impact on genome evolution. Nucleic Acids Res 2001, 29(18):3742–3756. 10.1093/nar/29.18.3742
Mochizuki A, Yahara K, Kobayashi I, Iwasa Y: Genetic addiction: selfish gene's strategy for symbiosis in the genome. Genetics 2006, 172(2):1309–1323. 10.1534/genetics.105.042895
Kusano K, Naito T, Handa N, Kobayashi I: Restriction-modification systems as genomic parasites in competition for specific sequences. Proc Natl Acad Sci USA 1995, 92(24):11095–11099. 10.1073/pnas.92.24.11095
Chinen A, Naito Y, Handa N, Kobayashi I: Evolution of sequence recognition by restriction-modification enzymes: selective pressure for specificity decrease. Mol Biol Evol 2000, 17(11):1610–1619.
Takahashi N, Naito Y, Handa N, Kobayashi I: A DNA methyltransferase can protect the genome from postdisturbance attack by a restriction-modification gene complex. J Bacteriol 2002, 184(22):6100–6108. 10.1128/JB.184.22.6100-6108.2002
Kobayashi I: Restriction-modification systems as minimal forms of life. In Restriction Endonucleases. Volume 14. Edited by: Pingoud A. Berlin: Springer-Verlag; 2004:19–62.
Bujnicki JM: Understanding the evolution of restriction-modification systems: clues from sequence and structure comparisons. Acta Biochim Pol 2001, 48(4):935–967.
Aravind L, Makarova KS, Koonin EV: SURVEY AND SUMMARY: Holliday junction resolvases and related nucleases: identification of new families, phyletic distribution and evolutionary trajectories. Nucleic Acids Res 2000, 28(18):3417–3432. 10.1093/nar/28.18.3417
Kosinski J, Feder M, Bujnicki JM: The PD-(D/E)XK superfamily revisited: identification of new members among proteins involved in DNA metabolism and functional predictions for domains of (hitherto) unknown function. BMC Bioinformatics 2005, 6(1):172. 10.1186/1471-2105-6-172
Sapranauskas R, Sasnauskas G, Lagunavicius A, Vilkaitis G, Lubys A, Siksnys V: Novel subtype of type IIs restriction enzymes. BfiI endonuclease exhibits similarities to the EDTA-resistant nuclease Nuc of Salmonella typhimurium. J Biol Chem 2000, 275(40):30878–30885. 10.1074/jbc.M003350200
Grazulis S, Manakova E, Roessle M, Bochtler M, Tamulaitiene G, Huber R, Siksnys V: Structure of the metal-independent restriction enzyme BfiI reveals fusion of a specific DNA-binding domain with a nonspecific nuclease. Proc Natl Acad Sci USA 2005, 102(44):15797–15802. 10.1073/pnas.0507949102
Ishikawa K, Watanabe M, Kuroita T, Uchiyama I, Bujnicki JM, Kawakami B, Tanokura M, Kobayashi I: Discovery of a novel restriction endonuclease by genome comparison and application of a wheat-germ-based cell-free translation assay: PabI (5'-GTA/C) from the hyperthermophilic archaeon Pyrococcus abyssi. Nucleic Acids Res 2005, 33(13):e112. 10.1093/nar/gni113
Miyazono K, Watanabe M, Kosinski J, Ishikawa K, Kamo M, Sawasaki T, Nagata K, Bujnicki JM, Endo Y, Tanokura M, et al.: Novel protein fold discovered in the PabI family of restriction enzymes. Nucleic Acids Res 2007, 35(6):1908–1918. 10.1093/nar/gkm091
Bujnicki JM, Radlinska M, Rychlewski L: Polyphyletic evolution of type II restriction enzymes revisited: two independent sources of second-hand folds revealed. Trends Biochem Sci 2001, 26(1):9–11. 10.1016/S0968-0004(00)01690-X
Saravanan M, Bujnicki JM, Cymerman IA, Rao DN, Nagaraja V: Type II restriction endonuclease R.KpnI is a member of the HNH nuclease superfamily. Nucleic Acids Res 2004, 32(20):6129–6135. 10.1093/nar/gkh951
Kriukiene E, Lubiene J, Lagunavicius A, Lubys A: MnlI – The member of H-N-H subtype of Type IIS restriction endonucleases. Biochim Biophys Acta 2005, 1751(2):194–204.
Jakubauskas A, Giedriene J, Bujnicki JM, Janulaitis A: Identification of a single HNH active site in Type IIS restriction endonuclease Eco31I. J Mol Biol 2007, 370(1):157–169. 10.1016/j.jmb.2007.04.049
Dunin-Horkawicz S, Feder M, Bujnicki JM: Phylogenomic analysis of the GIY-YIG nuclease superfamily. BMC Genomics 2006, 7(1):98. 10.1186/1471-2164-7-98
Ibryashkina EM, Zakharova MV, Baskunov VB, Bogdanova ES, Nagornykh MO, Den'mukhamedov MM, Melnik BS, Kolinski A, Gront D, Feder M, et al.: Type II restriction endonuclease R.Eco29kI is a member of the GIY-YIG nuclease superfamily. BMC Struct Biol 2007, 7(1):48. 10.1186/1472-6807-7-48
Orlowski J, Bujnicki JM: Structural and evolutionary classification of Type II restriction enzymes based on theoretical and experimental analyses. Nucleic Acids Res 2008, 36(11):3552–3569. 10.1093/nar/gkn175
Bujnicki JM: Crystallographic and bioinformatic studies on restriction endonucleases: inference of evolutionary relationships in the "midnight zone" of homology. Curr Protein Pept Sci 2003, 4(5):327–337. 10.2174/1389203033487072
Xu Q, Stickel S, Roberts RJ, Blaser MJ, Morgan RD: Purification of the novel endonuclease, Hpy188I, and cloning of its restriction-modification genes reveal evidence of its horizontal transfer to the Helicobacter pylori genome. J Biol Chem 2000, 275(22):17086–17093. 10.1074/jbc.M910303199
Xu Q, Morgan RD, Roberts RJ, Blaser MJ: Identification of type II restriction and modification systems in Helicobacter pylori reveals their substantial diversity among strains. Proc Natl Acad Sci USA 2000, 97(17):9671–9676. 10.1073/pnas.97.17.9671
Siew N, Fischer D: Twenty thousand ORFan microbial protein families for the biologist? Structure (Camb) 2003, 11(1):7–9. 10.1016/S0969-2126(02)00938-3
Chmiel AA, Radlinska M, Pawlak SD, Krowarsch D, Bujnicki JM, Skowronek KJ: A theoretical model of restriction endonuclease NlaIV in complex with DNA, predicted by fold recognition and validated by site-directed mutagenesis and circular dichroism spectroscopy. Protein Eng Des Sel 2005, 18(4):181–189. 10.1093/protein/gzi019
Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, et al.: The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 2007, 5(3):e16. 10.1371/journal.pbio.0050016
Truglio JJ, Rhau B, Croteau DL, Wang L, Skorvaga M, Karakas E, Dellavecchia MJ, Wang H, Van Houten B, Kisker C: Structural insights into the first incision reaction during nucleotide excision repair. EMBO J 2005, 24(5):885–894. 10.1038/sj.emboj.7600568
Kosinski J, Cymerman IA, Feder M, Kurowski MA, Sasin JM, Bujnicki JM: A "FRankenstein's monster" approach to comparative modeling: merging the finest fragments of Fold-Recognition models and iterative model refinement aided by 3D structure evaluation. Proteins 2003, 53(Suppl 6):369–379. 10.1002/prot.10545
Kosinski J, Gajda MJ, Cymerman IA, Kurowski MA, Pawlowski M, Boniecki M, Obarska A, Papaj G, Sroczynska-Obuchowicz P, Tkaczuk KL, et al.: FRankenstein becomes a cyborg: the automatic recombination and realignment of fold recognition models in CASP6. Proteins 2005, 61(Suppl 7):106–113. 10.1002/prot.20726
Chmiel AA, Bujnicki JM, Skowronek KJ: A homology model of restriction endonuclease SfiI in complex with DNA. BMC Struct Biol 2005, 5(1):2. 10.1186/1472-6807-5-2
Kosinski J, Kubareva E, Bujnicki JM: A model of restriction endonuclease MvaI in complex with DNA: a template for interpretation of experimental data and a guide for specificity engineering. Proteins 2007, 68(1):324–336. 10.1002/prot.21460
Vanamee ES, Viadiu H, Kucera R, Dorner L, Picone S, Schildkraut I, Aggarwal AK: A view of consecutive binding events from structures of tetrameric endonuclease SfiI bound to DNA. EMBO J 2005, 24(23):4198–4208. 10.1038/sj.emboj.7600880
Kaus-Drobek M, Czapinska H, Sokolowska M, Tamulaitis G, Szczepanowski RH, Urbanke C, Siksnys V, Bochtler M: Restriction endonuclease MvaI is a monomer that recognizes its target sequence asymmetrically. Nucleic Acids Res 2007, 35(6):2035–2046. 10.1093/nar/gkm064
Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A: Protein fragment reconstruction using various modeling techniques. J Comput Aided Mol Des 2003, 17(11):725–738. 10.1023/B:JCAM.0000017486.83645.a0
Kaminska KH, Baraniak U, Boniecki M, Nowaczyk K, Czerwoniec A, Bujnicki JM: Structural bioinformatics analysis of enzymes involved in the biosynthesis pathway of the hypermodified nucleoside ms(2)io(6)A37 in tRNA. Proteins 2008, 70(1):1–18. 10.1002/prot.21640
Skirgaila R, Grazulis S, Bozic D, Huber R, Siksnys V: Structure-based redesign of the catalytic/metal binding site of Cfr10I restriction endonuclease reveals importance of spatial rather than sequence conservation of active centre residues. J Mol Biol 1998, 279(2):473–481. 10.1006/jmbi.1998.1803
Pingoud V, Sudina A, Geyer H, Bujnicki JM, Lurz R, Luder G, Morgan R, Kubareva E, Pingoud A: Specificity changes in the evolution of Type II restriction endonucleases: a biochemical and bioinformatic analysis of restriction enzymes that recognize unrelated sequences. J Biol Chem 2005, 280(6):4289–4298. 10.1074/jbc.M409020200
Feder M, Bujnicki JM: Identification of a new family of putative PD-(D/E)XK nucleases with unusual phylogenomic distribution and a new type of the active site. BMC Genomics 2005, 6(1):21. 10.1186/1471-2164-6-21
Wulfing C, Pluckthun A: Protein folding in the periplasm of Escherichia coli. Mol Microbiol 1994, 12(5):685–692. 10.1111/j.1365-2958.1994.tb01056.x
Malone T, Blumenthal RM, Cheng X: Structure-guided analysis reveals nine sequence motifs conserved among DNA amino-methyltransferases, and suggests a catalytic mechanism for these enzymes. J Mol Biol 1995, 253(4):618–632. 10.1006/jmbi.1995.0577
Bujnicki JM: Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol Biol 2002, 2(1):3. 10.1186/1471-2148-2-3
Bujnicki JM, Feder M, Radlinska M, Blumenthal RM: Structure prediction and phylogenetic analysis of a functionally diverse family of proteins homologous to the MT-A70 subunit of the human mRNA:m6A methyltransferase. J Mol Evol 2002, 55(4):431–444. 10.1007/s00239-002-2339-8
Nobusato A, Uchiyama I, Ohashi S, Kobayashi I: Insertion with long target duplication: a mechanism for gene mobility suggested from comparison of two related bacterial genomes. Gene 2000, 259(1–2):99–108. 10.1016/S0378-1119(00)00456-X
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33(2):511–518. 10.1093/nar/gki198
Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0. Mol Biol Evol 2007, 24(8):1596–1599. 10.1093/molbev/msm092
Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755–763. 10.1093/bioinformatics/14.9.755
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al.: The Pfam protein families database. Nucleic Acids Res 2008, (36 Database):D281–288.
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O: TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 2007, (35 Database):D260–264. 10.1093/nar/gkl1043
Soding J, Biegert A, Lupas AN: The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005, (33 Web Server):W244–248. 10.1093/nar/gki408
Kurowski MA, Bujnicki JM: GeneSilico protein structure prediction meta-server. Nucleic Acids Res 2003, 31(13):3305–3307. 10.1093/nar/gkg557
Bujnicki JM, Elofsson A, Fischer D, Rychlewski L: LiveBench-1: continuous benchmarking of protein structure prediction servers. Protein Sci 2001, 10(2):352–361. 10.1110/ps.40501
Bujnicki JM, Elofsson A, Fischer D, Rychlewski L: LiveBench-2: large-scale automated evaluation of protein structure prediction servers. Proteins 2001, (Suppl 5):184–191. 10.1002/prot.10039
Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21(7):951–960. 10.1093/bioinformatics/bti125
Jaroszewski L, Rychlewski L, Li Z, Li W, Godzik A: FFAS03: a server for profile-profile sequence alignments. Nucleic Acids Res 2005, (33 Web Server):W284–288. 10.1093/nar/gki418
McGuffin LJ, Jones DT: Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 2003, 19(7):874–881. 10.1093/bioinformatics/btg097
Shi J, Blundell TL, Mizuguchi K: FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 2001, 310(1):243–257. 10.1006/jmbi.2001.4762
Fischer D: Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pacific Symp Biocomp 2000, 119–130.
Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A: Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci 2001, 10(11):2354–2362. 10.1110/ps.08501
Wallner B, Elofsson A: Pcons5: combining consensus, structural evaluation and fold recognition scores. Bioinformatics 2005, 21(23):4248–4254. 10.1093/bioinformatics/bti702
Rychlewski L, Fischer D: LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005, 14(1):240–245. 10.1110/ps.04888805
Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234(3):779–815. 10.1006/jmbi.1993.1626
Pawlowski M, Gajda MJ, Matlak R, Bujnicki JM: MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics 2008, 9(1):403. 10.1186/1471-2105-9-403
Wallner B, Elofsson A: Can correct protein models be identified? Protein Sci 2003, 12(5):1073–1086. 10.1110/ps.0236803
Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268(1):209–225. 10.1006/jmbi.1997.0959
Kolinski A: Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 2004, 51(2):349–371.
Kolinski A, Bujnicki JM: Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins 2005, 61(Suppl 7):84–90. 10.1002/prot.20723
Sasin JM, Bujnicki JM: COLORADO3D, a web server for the visual analysis of protein structures. Nucleic Acids Res 2004, (32 Web Server):W586–589. 10.1093/nar/gkh440
Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 2005, (33 Web Server):W299–302. 10.1093/nar/gki370
The work at the laboratory of IK was supported by the 21st Century COE (Center of Excellence) project 'Elucidation of Language Structure and Semantic behind Genome and Life System', funded by JSPS (Japan Society for the Promotion of Science). MK was supported by post-doctoral fellowships from this COE project and from the Medical Genome Science Program in the Support Program for Improving Graduate School Education of JSPS. KHK was supported by a short-term fellowship from EMBO and by an exchange grant from JSPS and the Polish Academy of Sciences. JMB was supported by the University of Tokyo (visiting professorship). JMB and KHK were also supported by a 6FP grant from the European Union ('DNA ENZYMES' MRTN-CT-2005-019566). Computing resources were provided by the supercomputer system at the Human Genome Center, Institute of Medical Science, University of Tokyo.
MK carried out initial sequence database searches and was the first to identify the potential GIY-YIG motif in R.Hpy188I by eye. He also carried out genomic context analyses and classified homologs. KHK carried out sequence database searches and structure predictions with ROSETTA, calculated the phylogenetic tree, provided a detailed description of the model, and prepared the alignment and the figures. MB carried out model refinement with REFINER. IK participated in the design and coordination of this study. JMB conducted fold-recognition analysis, built the template-based model, and drafted the manuscript. All authors contributed to analysis of the data and to writing of the manuscript. All authors have read and approved the final manuscript.
Katarzyna H Kaminska and Mikihiko Kawai contributed equally to this work.