Conservation of the C-type lectin fold for accommodating massive sequence variation in archaeal diversity-generating retroelements

Background Diversity-generating retroelements (DGRs) provide organisms with a unique means for adaptation to a dynamic environment through massive protein sequence variation. The potential scope of this variation exceeds that of the vertebrate adaptive immune system. DGRs were known to exist only in viruses and bacteria until their recent discovery in archaea belonging to the ‘microbial dark matter’, specifically in organisms closely related to Nanoarchaeota. However, Nanoarchaeota DGR variable proteins were unassignable to known protein folds and apparently unrelated to characterized DGR variable proteins. Results To address the issue of how Nanoarchaeota DGR variable proteins accommodate massive sequence variation, we determined the 2.52 Å resolution limit crystal structure of one such protein, AvpA, which revealed a C-type lectin (CLec)-fold that organizes a putative ligand-binding site that is capable of accommodating 1013 sequences. This fold is surprisingly reminiscent of the CLec-folds of viral and bacterial DGR variable protein, but differs sufficiently to define a new CLec-fold subclass, which is consistent with early divergence between bacterial and archaeal DGRs. The structure also enabled identification of a group of AvpA-like proteins in multiple putative DGRs from uncultivated archaea. These variable proteins may aid Nanoarchaeota and these uncultivated archaea in symbiotic relationships. Conclusions Our results have uncovered the widespread conservation of the CLec-fold in viruses, bacteria, and archaea for accommodating massive sequence variation. In addition, to our knowledge, this is the first report of an archaeal CLec-fold protein.


Background
Diversity-generating retroelements (DGRs) create massive sequence variation (10 12-20 ) in select proteins. The only parallel for this scale of variation occurs in the vertebrate immune system [1]. Massive sequence variation enables adaptation to a dynamic environment, as seen for the prototypical Bordetella bacteriophage DGR [2], just as it does in the vertebrate immune system. DGRs have been identified in ecologically diverse bacteria, including members of the human microbiome, and in numerous viruses of bacteria [3][4][5][6][7]. Recently, DGRs were also identified in the third domain of life, archaea, from single-cell sequencing data of organisms that were uncultivated and harvested from a subterranean environment [8]. These organisms are related to Nanoarchaeota, which are nanosized, hyperthermophilic organisms that exist in symbiotic relationship with larger archaea [9,10]. Although the single-cell sequenced organisms were not directly visualized, their genomic sequences support the hypothesis that these archaeal DGRs belong to nanosized, symbiotic organisms. Along with the archaeal DGRs, a putative virus of methanotrophic archaea, ANMV-1, was also identified to encode a DGR [8].
The archaeal DGRs have in common the genetic elements identified in bacterial DGRs (Fig. 1). This includes a variable region (VR) that is located within the coding region of a variable protein, a template region (TR) that is similar (~90 % typically) but not identical to the VR and located in a proximal noncoding region, and a reverse transcriptase (RT) [3]. Genetic information is transferred from the TR to the VR through an RNA intermediate, a process termed retrohoming. In DGRs, retrohoming is accompanied by adenine-specific mutagenesis of sequence information. Thus, a hallmark of DGRs is the substitution of adenines in the TR by other bases in the VR, resulting in protein coding variation. Archaeal elements display this hallmark pattern of adenine substitution. Along with these core DGR components, the archaeal DGRs contain initiation of mutagenic homing sequences in the VR (i.e., IMH) and TR (i.e., IMH*) (Fig. 1). These elements differ slightly in sequence, and have been documented in the Bordetella bacteriophage (Bb) DGR to specify the directionality of information transfer [3]. That is, the region containing the IMH* (i.e., TR) serves as the invariant source of sequence information, and the region containing the IMH (i.e., VR) serves as the recipient of that (mutagenized) sequence information. In addition, a hairpin/cruciform structure downstream of the VR in evident in the archaeal DGR, and in the Bb DGR this element was seen to increase the efficiency of homing [11]. Proteins were also identified with similar physical properties to the accessory variability determinant (Avd), which in the Bb DGR binds RT and is required for retrohoming [12].
The variable protein encoded by ANMV-1 was classifiable by sequence using Phyre [8,13]. This variable protein was predicted to be structurally similar to the Bb DGR variable protein Mtd, which serves as the Bordetella bacteriophage's receptor-binding protein. Sequence variation in Mtd enables Bordetella bacteriophage to keep pace with genetically programmed changes in its host Bordetella. A similar scenario is likely the case for ANMV-1 and its putative methanotrophic archaeal host. Mtd has a C-type lectin (CLec)-fold, and in particular belongs to the formylglycinegenerating enzyme (FGE) subclass of the CLec-fold [14]. The CLec-fold is a general ligand-binding motif [15], but can also have enzymatic functionality as seen in FGE [16] Fig. 1 Schematic of DGR. Genetic information is transferred from an invariant TR to the VR of a variable protein (in this case AvpA), requiring the action of a reverse transcriptase (RT) and an accessory variability determinant (Avd) protein. This process involves reverse transcription of an RNA encoding the TR, and is accompanied by adenine-specific mutagenesis of the TR sequence. The mutagenized cDNA homes to the variable protein locus, and replaces the sequence information in the VR, resulting in a variant of the variable protein and in sulfoxide synthase [17]. The only other structurally characterized DGR variable protein, TvpA from the spirochete Treponema denticola, also has an FGE-type CLec-fold [14]. Many bacterial and bacterial virus DGR variable proteins are predicted to have CLec-folds [18], while some others are predicted to have immunoglobulin (Ig)-folds [5,7]. In contrast to these DGR variable proteins and the ANMV-1 variable protein, the archaeal DGR variable proteins [8] were unclassifiable based on sequence [19] or predicted structure [13].
We previously reported initial characterization of one of the archaeal DGR variable proteins, which we call here AvpA (Archaeal variable protein A; OTU1, Contig 3 DGR2) [8]. AvpA has only 15 and 8 % sequence identity to Mtd and TvpA, respectively, and its structure could not be predicted through in silico methods [13]. To determine how AvpA accommodates massive sequence variation, we determined its crystallographic structure. We find that AvpA has a CLec-fold, but one that is distinct from those of Mtd and TvpA. Capitalizing on the new structural information, we also identified AvpA-like proteins in metagenomes of marine and groundwater organisms. Significantly, most of the AvpA-like proteins from groundwater organisms belonged to putative DGRs. These results reveal that the CLec-fold is utilized to accommodate massive sequence variation widely, being conserved not only in viruses and bacteria but also in archaea.

Overall structure
AvpA was expressed in Escherichia coli, purified, and crystallized. The structure of AvpA was determined by single-wavelength anomalous dispersion (SAD) from selenomethionine-labeled AvpA and refined to 2.52 Å resolution limit ( Table 1). The electron density calculated from SAD phases enabled residues 2-210 of AvpA to be traced, while electron density for residues 211-256 was absent, most likely due to the flexibility of this region. AvpA was a monomer in solution (data not shown) and in the crystal (Fig. 2).
The CLec-fold in AvpA begins at residue 8 and continues to residue 209. The N-and C-terminal segments In between these strands are other characteristic features of DGR CLec-fold proteins, such as two α-helices (α1 and α2) that are roughly perpendicular to each other, and a four-stranded, anti-parallel β-sheet (β2β3β4β4'), part of which forms the ligand-binding site [20]. Lastly, as in Mtd and TvpA, these secondary structure elements in AvpA are interrupted by inserts (Figs. 2d, e, and see below).

Variable region
The variable region of AvpA (residues 181-203) is located close to but not at the very C-terminus of the protein, as it is for Mtd and TvpA. This internal location is common for the other identified Nanoarchaeota DGR variable proteins [8]. Forty-six amino acids follow the VR in AvpA. Electron density for this 46-residue extension, which is predicted by in silico methods to form two α-helices [13], was absent, most likely due to disorder or flexibility of this region. The DNA coding sequence for this 46-residue extension contains the putative hairpin/cruciform structure (Fig. 1), which in bacterial DGRs is typically located in the noncoding region following the VR. The hairpin/cruciform structure in AvpA is predicted by in silico methods [13] to encode five disordered amino acids [13], and thus its DNA sequence is unlikely to be constrained by the need to encode specific amino acids that are required for structural or functional reasons. The variable regions of Mtd and TvpA were closely superimposable, despite their weak sequence identity of 16 % [14] (Table 2). In contrast, the variable region of AvpA differs substantially in conformation from those of Mtd and TvpA (Figs. 3a-c and Table 2). A major difference is that the variable residues of AvpA do not occur until the end of the β4' strand, whereas variable residues are found as early as the β3 strand or just after the β3 strand for TvpA and Mtd, respectively. As expected from this difference, the 27 residue-length of the AvpA VR is about half that of Mtd and TvpA. Nevertheless, AvpA has 12 variable residuesthe same number as in Mtd. These residues have the potential of generating 10 13 variants, as 10 of the 12 have AAY codons, which as previously noted capture the gamut of chemistry and permit no stop codons [18]. These 12 variable residues were organized by the CLec-fold into a potential ligandbinding site (Fig. 3d), with a nonvariable aromatic amino acid (Phe 185) positioned centrally at the base of the binding site. A nonvariable aromatic amino acid also occurs centrally at the base of the ligand-binding sites in Mtd and TvpA, and in Mtd was seen to be involved in ligand binding [20]. This amino acid presumably provides a constant element of binding energy through hydrophobic contacts. The last portion of the VR is

Inserts
TvpA has three inserts within the core CLec-fold. We number the inserts with reference to Mtd and TvpA, and thus the first insert in AvpA is 2, found in the same topological location between α2 and β2 as in Mtd and TvpA. The equivalents of insert 1 and 1' are missing in AvpA. Insert 2 is short (residues 41-57) and is composed of a 3 10 -helix and β-strand. AvpA has two inserts not seen in Mtd and TvpA: Insert 3 (residues 89-114) between β3 and β4, which is composed of loops and two α-helices; and insert 4 (residues 120-179) between β4 and β4' , which is composed of a more complicated arrangement of α-helices and two antiparallel β-strands. CLEC5A also has an equivalent of insert 4, but the CLEC5A and TvpA inserts are not structurally related. Indeed, the inserts in AvpA have no structural relationship to other known structures. As in Mtd and TvpA, the inserts serve in part to bolster the VR. In the case of AvpA, both inserts 2 and 4 make hydrogen bonds to the main chain of the VR, with the majority of contacts coming from insert 2 (Fig. 3e).

Conservation of CLec-fold in DGRs of groundwater organisms
To determine whether proteins having similarity to AvpA exist in other genomes, a comprehensive search was conducted against public databases. Striking sequence conservation was observed between AvpA and representatives derived from both marine and groundwater metagenomes (Fig. 4). Our search revealed only a single homolog from marine metagenomes, but 22 non-redundant homologues from uncultivated, groundwater-associated organisms (Paul et al., in prep.). Among the groundwater matches, 19 sequences were derived from putative DGRs (Table 3), as they were proximal to a recognizable RT gene and a template region. Genes encoding the remaining three AvpA homologues do not appear to be parts of DGRs. We extended this inquiry by examining which sequences were likely to have folds related to that of AvpA using BackPhyre [13]. This analysis revealed additional sequences from groundwater metagenomes and archaeal genomes, which appear distantly related to AvpA (17 to 30 % pairwise similarity; Table 3). With this approach, AvpA relatives were identified in sequences of Archaeon GW2011 AR17 and Archaeon GW2011 AR3, both from uncultivated members of the phylum Woesearchaeota [21]. A sequence alignment with the least similar relative of AvpA revealed three conserved sequence motifs. The first was GXXVVVYAH (residues 67-75 in AvpA), which occupies the β3 strand with one side of the strand packing against insert 3 and the other side against insert 4. The second was HPXXXPFXG (residues 139-147 in AvpA), which resides in insert 4 as a short α-helix and packs against the β3 strand. The third was RFXGV (residues 205-209 in AvpA), which occupies the β5 strand and is encoded by the IMH element.

Discussion
DGR variable proteins have evolved to accommodate massive sequence variation. This task is fulfilled in the adaptive immune system of jawed vertebrates by the Ig fold of antibodies and T cell receptors, and in the adaptive immune system of jawless vertebrates by the leucine-rich repeat fold of variable lymphocyte receptors. The first DGR variable protein to be structurally characterized was Bordetella bacteriophage Mtd. The crystal structure of Mtd revealed that its VR was organized into a ligand-binding site by a CLec-fold [18]. While sequence similarity among DGR variable proteins is strikingly low, an argument was made based on the structure of Mtd that several other DGR variable proteins were likely to have CLec-folds as well [18]. This prediction was confirmed by the crystal structure of one of these, T. denticola TvpA, which is capable of accommodating an astonishing 10 20 sequences [14]. Although Mtd and TvpA share only~16 % sequence identity, these proteins were both found to belong to the FGE subclass of the CLec-fold and have VRs that are remarkably similar in conformation. DGR variable proteins can apparently also adopt the Ig-fold [7], although direct structural verification of this prediction is not yet in hand. These putative Ig-fold proteins are predicted to have variable residues located on β-strand framework regions and in segments connecting Ig-fold domains, which is different from antibodies and T cell receptors, for which variable residues are sequestered to loops between β-strands.
A set of nine unique DGR variable proteins were identified in subterranean archaea related to Nanoarchaeota [8]. The sequence similarity among these proteins was low, and their folds were not predictable by in silico methods [13]. The results presented here on one of Fig. 4 Conservation of CLec fold. Sequence alignment, including gaps, of AvpA and homologues from metagenomic studies. The sequence of AvpA is underlined in green and the variable region (VR) is shown in purple. Residues that are conserved in at least 75 % of sequences are highlighted; each amino acid is shown in a different color. The consensus sequence for selected conserved motifs is displayed below the AvpA sequence and highlighted with a solid black line these, AvpA, revealed a remarkable conservation in archaea of the CLec-fold for accommodation of massive sequence variation. The AvpA CLec-fold was found to be divergent from those in Mtd and TvpA, with AvpA having a VR that differed considerably in conformation from the Mtd and TvpA VRs. These results are consistent with early divergence between bacterial and archaeal DGRs. AvpA-like proteins were also identified in metagenomes of uncultivated marine and groundwater organisms, with the majority of AvpA-like proteins in groundwater organisms belonging to putative DGRs. These groundwater metagenomes are rich in organisms representing archaeal phyla known to include ultra-small cells [9,10,21], raising the possibility that these DGRs also belong to nanosized organisms. In addition, AvpA-like proteins were identified in uncultivated members of Woesearchaeota, which have small genomes (~1000 protein coding genes) and limited metabolic capacities [21]. Thus, AvpA and AvpA-like proteins appear to occur in the DGRs of nanosized organisms, and while the function of AvpA and AvpA-like proteins is unknown, one likely possibility is to enhance symbiotic relationships between these minimal organisms and their hosts.

Conclusions
These results have made apparent the widespread conservation of the CLec-fold in viruses, bacteria, and archaea for accommodating massive sequence variation. The fact that the CLec-fold in AvpA was not predictable by in silico methods points to the remarkable sequence space available to this fold. The great proportion of CLec-fold proteins occurs in metazoans, but this fold has also been observed in some viral and bacterial proteins other than DGR variable proteins [15]. To our knowledge, this is the first report of a CLec-fold protein occurring in archaea. The structure of AvpA did not provide further illumination on the protein folds of the other eight identified archaeal DGR variable proteins [8]. This indicates that there may yet be other folds by which DGR variable proteins accommodate massive sequence variation, or more likely given the resilience of the CLec-fold to primary sequence variation, these proteins may represent further cases of the CLec-fold occurring in archaea.

Crystallization and structure determination
Selenomethionine (SeMet)-substituted AvpA was expressed and purified as described [8], except Escherichia coli was cultured in synthetic minimal media supplemented with 200 mg/L L(+)-Selenomethionine (Sigma) [22]. Crystals of SeMet-labeled AvpA were grown by the hanging drop method at 20 o C by mixing 1 μL of AvpA (50 mg/mL) and 1 μL of 30 % (v/v) PEG monomethyl ether 550, 50 mM MgCl 2 , 100 mM HEPES, pH 7.5. Crystals were cryoprotected by soaking in the precipitant solution supplemented with 10 % glycerol and 2 mM TCEP. Single-wavelength anomalous dispersion (SAD) data were collected at Advanced Photon Source (Argonne, IL) beamline 24-ID-E. Diffraction data were indexed, integrated, and scaled with MOSFLM [23][24][25]. Se sites were located from SAD data of SeMet-labeled AvpA, and initial phases were determined using SOLVE [26]. Out of the four methionines (M1, M16, M32 and M98), all but the first were located. The asymmetric unit was found to contain two molecules of AvpA. A partial model of AvpA (residues 10-28, 45-120, 146-157, 178-187 and 193-202) was built by automatic means using Autobuild (within Phenix) into SAD phased electron density. Further model building was carried out manually with COOT [27], as guided by σ A -weighted 2mF o -DF c and mF o -DF c difference maps. A total of sixty-three iterative rounds of manual model building and maximum likelihood refinement were carried out with Refine (within Phenix) using default parameters [28,29], with each refinement step consisting of 3-5 cycles. One round of TLS parameterization with default settings was then used, followed by the addition of water and magnesium ions into ≥3σ mF o -DF c density. Structure validation was carried out with Molprobity [30], and molecular figures were generated with PyMOL (http://www.pymol.org/).

Structural alignment of VR and equivalent regions
The structure of the VR of AvpA (residues 181-210) was compared to that of the VR of Mtd (residues 337-381) and TvpA (residues 285-329) using FATCAT [31]. For hFGE and CLEC5A, residues 322-369 and residues 158-187, respectively, were used for comparison. These regions of hFGE and CLEC5A are spatially equivalent to the VRs of the DGR variable proteins.

Homologue search and DGR analysis
The amino acid sequence of AvpA was compared with representatives from public databases, including NCBI nr, env, and UniprotKB, using blastp, tblastn [19], and pHMMER [32], respectively. Multiple sequence alignment of homologues and AvpA was performed using ClustalW [33] and conserved motifs were visualized using Geneious v8.1 (Biomatters Ltd). Analysis of structural homology was performed using BackPhyre [13] with the structure of AvpA as a query, and proteins from nanoarchaeal genomes, Woesearchaeota genomes, and previously identified groundwater metagenome DGRs (Paul et al., in preparation).
Putative DGR sequences were detected in three steps. First, RT-containing sequences were identified using blastp versus known DGRs and relatives with an evalue cutoff of 1×10 −10 . Next, using a custom python script, near repeats were identified within 10 kb of the putative RT gene (i.e., VR and TR) with at least five adenine-specific mismatches and no more than one non-adenine mismatch. This step involved fragmenting the~10 kb (+ RT) sequences using a sliding window of 200 bp and an overlapping step of 50 bp. Fragments were compared using blastall and near-identical repeats, whose mismatches exclusively consisted five or more adenine-variable sites, were recorded as putative VR/TR pairs.