Deciphering structure and topology of conserved COG2042 orphan proteins

Background The cluster of orthologous group COG2042 has members in all sequenced Eukaryota as well as in many Archaea. The cellular function of these proteins of ancient origin remains unknown. PSI-BLAST analysis does not indicate a possible link with even remotely-related proteins that have been functionally or structurally characterized. As a prototype among COG2042 orthologs, SSO0551 protein from the hyperthermophilic archaeon Sulfolobus solfataricus was purified to homogeneity for biophysical characterization. Results The untagged protein is thermostable and behaves as a monomeric protein in gel filtration experiments. Several mass spectrometry-based strategies were combined to obtain a set of low-resolution structural information. Kinetic data from limited proteolysis with various endoproteases concur in indicating that the region Glu73-Arg78 is hyper-sensitive, and thus accessible and flexible. Lysine labeling with NHS-biotin and cross-linking with DTSSP revealed that the 35 amino acid RLI motif at the N terminus is solvent exposed. Cross-links between Lys10-Lys14 and Lys23-Lys25 indicate that these residues are spatially close and in a conformation suitable for cross-linking. These experimental data were used to rank multiple three-dimensional models generated by a de novo procedure. Conclusion Our data indicate that COG2042 proteins may share a novel fold. Combining biophysical data, mass spectrometry data and molecular modeling is a useful strategy to obtain structural information and to help prioritize targets in structural genomics programs.


Background
Genomic comparative studies on entirely sequenced genomes from the three domains of life, i.e. Bacteria, Archaea and Eukaryota [1], evidenced that proteins involved in the organization or processing of genetic information (structures of ribosome and chromatin, translation, transcription, replication and DNA repair) display a closer relationship between Archaea and Eukaryota than between Bacteria and Eukaryota [2][3][4]. To identify new proteins involved in such important cellular mechanisms, an exhaustive inventory of proteins of unknown function common to only Eukaryota and Archaea but not in Bacteria has been devised [5][6][7]. Among such proteins, the Cluster of Orthologous Group COG2042 comprises proteins ubiquitously present in Eukaryota and present in many, but not all, Archaea; a hallmark of their ancient origin. The corresponding ancestral protein should have been present in the common ancestor of these two domains of life. Some partial experimental data are known from the Saccharomyces cerevisiae COG2042 homolog. Deletion of the Yor006c gene was shown to result in a viable phenotype but some apparent moderate growth defects were noticed on a fermentable carbon source [8,9]. Two putative protein partners for Yor006c were identified through a high-throughput two-hybrid study [10]: Ydl017w, a serine/threonine kinase also known as the cell division control protein 7 (Cdc7), and Yil025c, a hypothetical ORF. However, the cellular function of COG2042 proteins remains unknown.
A polar region, named RLI, is conserved at the N terminus of COG2042 proteins as well as at the N terminus of another cluster of orthologous proteins, namely COG1245. The latter, exemplified by SSO0287 in Sulfolobus solfataricus [11], are large proteins (about 600 residues) that encompass four different domains: an RLI domain, a [4Fe-4S] ferredoxin domain, and two ATPase domains of the kind usually found in ABC transporters. Their putative function is currently under discussion [12,13] but could be related to rRNA metabolism. Indeed, four of the eleven proteins shown to interact with the yeast COG1245 homolog (Ydr091c) were identified as involved in rRNA metabolism (Ymr047c, Ydl213c, Ylr340w, Ylr192c). Experimental data on the human homolog of Ydr091c indicated that this protein reversibly associates with RNase L, and thus COG1245 proteins were named RNase L inhibitor [14].
Because knowledge of protein structure is of high importance to understand protein function, huge efforts have been recently invested in high-throughput protein structure determination programs [15]. Recent reports indicate that only a relatively small percentage of expressed and purified proteins are amenable to full 3D structure by NMR or crystallography and X-ray diffraction [16,17]. In silico modeling (homology modeling, fold recognition, ab initio and de novo modeling) is the alternative to quickly gain the fold of a protein. However, such approach sometimes remains ambiguous in reliably identifying correct structures for protein sequences remotely-related to those found in PDB database. A promising strategy is the use of experimental data (if possible easily obtained) for model discrimination or refinement [18][19][20]. For example, the tertiary structure of the bovine basic fibroblast growth factor (FGF)-2 was probed with a lysine-specific cross-linking agent and subjected to tryptic peptide mapping by mass spectrometry to identify the sites of cross-linking [21]. The low resolution interatomic distance information obtained experimentally allowed the authors to distinguish among threading models in spite of a relatively low sequence similarity (13 % of identical residues). Interestingly, the constant development of novel cross-linking reagents suitable for mass spectrometry [22] enables enrichment of crosslinked peptides facilitating such strategy. A chemical modification approach [23][24][25][26], in combination with limited proteolysis procedures [27,28], can also provide useful structural constraints [29] for model refinement.
A step further is to attempt such approaches with proteins having no detectable homologs. In order to get insight into the topology of COG2042 members and, if possible, to use these experimental data to discriminate among structural protein templates, we combined limited proteolysis, lysine labeling and cross-linking strategies. The protein SSO0551 from the hyperthermophilic archaeon Sulfolobus solfataricus was chosen as a prototype because of its thermostability and the probable absence of posttranslational modifications when produced in recombinant form in Escherichia coli. The SSO0551 protein is monomeric with a low molecular mass (19 kDa). This size is easily amenable to characterization by mass spectrometry. Our results reveal that the polar RLI motif at the N terminus is probably structured and solvent exposed, pointing at a common trait between COG2042 and COG1245 proteins, the latter group also being conserved in Eukaryota and Archaea but absent in Bacteria. The accessible and flexible regions defined by limited proteolysis, combined with lysine accessibility assessed by NHS-biotin labeling and DTSSP cross-linking, allowed us to discriminate among the ten top-ranking de novo three-dimensional (3D) models.

COG2042 comprises members exclusively from Eukaryota and Archaea
The sequence of SSO0551 from S. solfataricus was used as query in a PSI-BLAST database search to identify homologous proteins. A constant cutoff expectation value of 10⁻¹⁵ resulted, after three iterations, in the selection of 40 sequences (15 from Archaea and 25 from Eukaryota) that were all aligned over their full length. No close homologs (E-value below 10⁻¹⁰ in the third iteration) with full-length sequence matching to SSO0551 were found among Bacteria. Remarkably, all completely sequenced eukaryal organisms were found to have one SSO0551 homolog. Fig. 1 shows an unrooted phylogram of the updated COG2042 family (Fig. 1A) and an alignment of six representative sequences (Fig. 1B), selected on the basis of their phylogenetic distribution. When experimental evidence about a protein is lacking, genome annotators usually choose the most upstream initiation codon to describe the ORF. For this reason, the most probable start codons of several open reading frames should be reconsidered after exhaustive alignment (Fig. 1). For example, the ATG codon starting at nucleotide 484790 on the Crick strand for SSO0551 from S. solfataricus (NC_002754) should be a more appropriate start codon than the ATG starting at nucleotide 484916, which is erroneously given in the current database. From the unrooted phylogram (Fig. 1A), two main lineages (archaeal and eukaryal) can be defined based on organism origin. This suggests that the occurrence of these proteins is at least as ancient as the divergence of these two lineages. No paralogs, which would be a sign of possible evolution of a new derived function, have been found in the entirely sequenced organisms currently available. Although these proteins are of ancient origin, the core sequence appears well conserved, as observed in Fig. 1B.
Thirty-three residues (38 %) are identical in the central core segment (out of 88 amino acids) between the most distant COG2042 orthologs, namely gi48852409 from Ferroplasma acidarmanus and gi6324579 from Saccharomyces cerevisiae (Fig. 1B). From the alignment, several conserved motifs that may be functionally crucial (cofactor or substrate binding, catalysis, or partner interactions) were detected. A conserved hexapeptide sequence, Val-Val/Ile/Leu-Asp/Glu-Cys-Ser-Trp (motif I in Fig. 1B), lies 14-17 amino acids away from another conserved motif of 25 amino acids containing 4 polar, 18 hydrophobic and 3 aromatic residues (motif II). Database searches with these motifs as queries did not allow identification of remotely-related proteins.
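The identity figure quoted above is a simple ratio over the aligned core segment. The following is a minimal Python sketch of that calculation, assuming two pre-aligned strings of equal length with "-" marking gaps; the function name is hypothetical and the sketch is not the procedure used to build Fig. 1B.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns that carry identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Figure quoted in the text: 33 identical residues out of 88 aligned positions
core_identity = 100.0 * 33 / 88  # 37.5, reported as 38 %
```

Excluding gapped columns from the denominator is one common convention; other conventions (for example dividing by the shorter sequence length) give slightly different percentages.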
All sequences from COG2042 encompass a stretch of 35 conserved amino acids upstream of the core common sequence. This motif, called RLI, is extremely polar (11 basic and 4 acidic residues) and is also found at the N terminus of another group of orthologous sequences, namely COG1245.

Expression in E. coli of two engineered SSO0551 constructs
From multiple sequence alignments, SSO0551 should encode a 166 amino acid polypeptide. An N-terminal 6His-tagged recombinant construct (pSBTN-AB31) was engineered. As we could not exclude that the 42 amino acid extension at the N terminus was genuine rather than an annotation artifact, we checked experimentally whether this putative extension could have some influence on SSO0551. A second construct (pSBTN-AB30) was simultaneously engineered, supposedly allowing production of a 26 kDa N-terminal 6His variant. Unexpectedly, no major difference in expression was detected between the two cellular extracts when they were resolved on SDS-PAGE. Two overexpressed products, both with an apparent molecular weight of approximately 20 kDa, were obtained upon addition of IPTG (data not shown). Fingerprint identification of these two products was carried out by trypsin proteolysis and mass spectrometry. Table 1 shows the MALDI-TOF mass measurements recorded for the two samples. The tryptic peptides that were detected revealed that both products correspond to the native SSO0551 sequence. From the 6His-SSO0551 product (pSBTN-AB31 construct), thirteen peptides map to the theoretical sequence (57 % sequence coverage). Notably, a peptide (1590.64 amu) was attributed to part of the 6His-modified N terminus (Table 1). The twelve peptides recorded from the extended 6His-SSO0551 version (pSBTN-AB30 construct) fit only the C terminus of the theoretical construct (43 % sequence coverage). These results, along with the low molecular weight observed on SDS-PAGE, indicate that a truncated protein was probably obtained during expression of the ORF comprising the 126 nt 5'-extension (42 additional amino acids at the N terminus). This product, which in fact corresponds to untagged SSO0551 as confirmed hereafter with purified product, showed no binding on Ni-NTA chromatography. This observation is in agreement with the absence of a 6His tag at the N terminus.

Recombinant SSO0551 is structured, thermostable and monomeric
Crude extract containing native untagged SSO0551 polypeptide from E. coli Rosetta(DE3)(pLysS)(pSBTN-AB30) cells was heated at different temperatures. Proteins that remained soluble were analyzed on SDS-PAGE. Most E. coli contaminants were removed by this treatment. The SSO0551 polypeptide remained soluble even when the cell extract was heated to 80°C and this protein was therefore considered thermostable. It was purified to homogeneity by a three-step purification protocol. A 20 min heat treatment at 70°C ( Fig
Content of secondary structure elements in SSO0551 was estimated by far-UV circular dichroism. Fig. 3 shows the spectrum recorded at 20°C. The purified protein presents negative ellipticity in the far-UV with minima at 208 nm (-14.7 × 10³ deg cm² dmol⁻¹) and 222 nm (-12.7 × 10³ deg cm² dmol⁻¹). Deconvolution of the CD spectrum leads to an estimated secondary structure content of about 28-29 % α-helices and 14-16 % β-sheets using the K2D neural network software. Predictions of SSO0551 secondary structure by the PSIPRED and Jpred web servers gave values of 10-11 % β-sheets, in relative agreement with the circular dichroism data, but overestimated the average α-helix content (54 %). PSIPRED and Jpred predictions are based on neural networks trained on known folds. The overestimation of the α-helix content may be due to the novel fold of these COG2042 proteins, as discussed below.
Native molecular mass of SSO0551 was determined by size-exclusion chromatography on a calibrated Superdex 200 HR10/30 column. Pure protein eluted as a peak centered at 39.1 mL in the assay conditions, corresponding to an apparent molecular mass lower than 20 kDa. This elution profile indicates that this structured protein behaves as a compact monomer.

Limited proteolysis defines Glu73-Arg78 as a hyper-sensitive region
Purified SSO0551 protein was subjected to limited proteolysis with various endopeptidases (trypsin, chymotrypsin, ArgC and GluC). MALDI-TOF mass spectrometry was used to determine cleavage sites by following the time course of peptide generation. Several protease/substrate ratios were assessed to confirm which preferential sites on the entire protein were attacked first (earliest cleavage), and thus correspond to a native state of the protein. The two fragments generated by such cleavage may be more vulnerable to subsequent attacks than the native protein, and late proteolytic sites are therefore considered less informative. Both small and large peptides generated during proteolysis were evaluated. Partially proteolyzed products obtained with trypsin were first resolved by reverse-phase chromatography and analyzed by MALDI-TOF mass spectrometry. Results recorded from direct analysis of the digestions without prior separation were almost identical to those obtained with separation. Therefore, the cost-effective direct analysis was used for the numerous conditions tested. Figure 4 shows the MALDI mass spectrum of the main large products obtained from a tryptic digest of SSO0551 (enzyme/protein ratio of 1:20) after 60 sec of reaction. In these conditions, the signal of intact protein was still visible at m/z 19198.4, but mixed with signals corresponding to 8 different large fragments. Among these, 7 peptides arose from N-terminal proteolysis (Fig. 4). Such a peptide profile indicates that the SSO0551 N terminus is rather solvent exposed in comparison to the C terminus.
During the earliest events of the trypsin proteolysis analyzed in various conditions for detection of large products but also smaller peptides, monocharged cations with following m/z: 8614. 6
For all these analyses, smaller peptide fragments that accumulated over time could be attributed to further proteolysis of the products arising from initial attacks (data not shown). All these results concur in indicating that Glu73-Arg78 and Glu28-Arg31 are two accessible solvent-exposed regions of the protein, as they can be proteolyzed by several endopeptidases, the former being definitely hyper-sensitive. Local unfolding, not just surface exposure, is necessary for efficient in vitro proteolysis because the polypeptide segment being cleaved must form a specific structure with the associated protease [30]. For this reason, the Glu73-Arg78 region should also correspond to a flexible region, i.e. a protruding loop.
Figure caption: Analysis of untagged SSO0551 purification.

Lysine labeling with NHS-biotin and DTSSP cross-linking confirm that the N terminus is rather solvent-exposed
The SSO0551 protein contains 21 lysine residues (12 %) distributed along the whole polypeptide sequence. Under mild conditions that should preserve the native conformation of the protein, specific labeling of these residues with NHS-biotin may give further details about their respective surface accessibility and/or their interactions with other residues [31]. After reaction with various amounts of chemical reagent (molar ratio NHS-biotin/total lysines of 1:40, 1:20, 1:10, 1:2, 1:1, 2:1), protein labeling was monitored by determining the mass of undigested samples. Figure 5 shows the signals measured by MALDI-TOF mass spectrometry for four of these ratios. The presence of some unmodified protein at ratios below 1:20 attests to mild conditions, under which the protein should be modified while still in a native state. As expected with NHS-biotin, each peak exhibits the predictable mass increment (average mass of 226.3 amu per label). Figure 5 shows that at a molar ratio of 1:40 a single modification is obtained, while a more heterogeneous population was detected at higher ratios. For example, 1 to 3 modifications are detected at ratio 1:20, and 2 to 5 modifications at ratio 1:2. However, only a limited number of modifications (8-10) are recorded at higher ratios, indicating that only a fraction of the 21 lysine residues is accessible to the chemical.
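The mass ladder expected for successive labels follows directly from the 226.3 amu increment. The sketch below illustrates the arithmetic in Python. It assumes the intact-protein signal at m/z 19198.4 reported for the tryptic digest spectrum as the unmodified reference, and the 20 amu matching window is a hypothetical value chosen for illustration, not one taken from this study.

```python
BIOTIN_INCREMENT = 226.3  # average mass added per NHS-biotin label (amu), from the text
INTACT_MZ = 19198.4       # m/z of unmodified SSO0551 observed by MALDI-TOF, from the text


def labeled_mz(n_labels: int, base_mz: float = INTACT_MZ) -> float:
    """Expected m/z of the singly charged protein carrying n_labels biotin groups."""
    return base_mz + n_labels * BIOTIN_INCREMENT


def count_labels(observed_mz: float, base_mz: float = INTACT_MZ, tol: float = 20.0):
    """Return the number of labels matching observed_mz, or None if no match.

    tol is a HYPOTHETICAL matching window in amu, used only for illustration.
    """
    n = round((observed_mz - base_mz) / BIOTIN_INCREMENT)
    if n < 0 or abs(labeled_mz(n, base_mz) - observed_mz) > tol:
        return None
    return n


# Expected positions of the first few peaks of the ladder
ladder = [labeled_mz(n) for n in range(0, 4)]
```

With 21 lysines, a fully labeled protein would sit 21 × 226.3 amu above the unmodified signal, so a ladder that stops at 8-10 labels is readily distinguished from complete modification.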
Figure 3 Evaluation of SSO0551 secondary structure elements by circular dichroism.
Figure 4 MALDI-TOF mass spectrum of large peptide fragments produced by SSO0551 partial trypsin proteolysis. A partial enzymatic proteolysis was carried out for 60 sec at 20°C with a trypsin/SSO0551 protein ratio of 1/20 (w/w). The products were then resolved on a C8 reverse-phase chromatographic column and the different UV-absorbing fractions were analyzed by MALDI-TOF. The m/z spectrum obtained with the fraction eluting at 40-50 % acetonitrile is shown. The asterisk labels a peak arising from trypsin autolysis.
To localize all labeled residues, NHS-biotin treated samples were subsequently subjected to proteolysis with various endoproteases (trypsin, Arg-C, or Glu-C) and compared to untreated samples. SSO0551 sequence coverage was estimated to be 92 %, with all 21 lysine residues included in this coverage. Peptides (∆mass below 120 ppm) detected with NHS-biotin treated samples but not with untreated samples are listed in Table 2. Using limiting amounts of NHS-biotin (molar ratios of 1:10, 1:20 or 1:40), nine reactive residues were unequivocally identified: Lys10, Lys14, Lys20, Lys23, Lys25, Lys51, Lys75, Lys128, and Lys154, assuming that proteases do not cleave after a modified residue. Other residues, such as Lys34 and Lys49, might also be labeled (Table 2). The number of labeled lysines agrees with the limited number of modifications recorded at higher ratios. Remarkably, spectra of the whole peptide mixture were informative enough to assign all modified peptides without the need for a purification step. Therefore, other amine-reactive reagents that create a mass shift could have been used.
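Peptide assignments above rely on a relative mass tolerance expressed in parts per million. As a minimal sketch of how such a criterion is evaluated, the Python functions below compute the ppm deviation and apply the 120 ppm cutoff quoted in the text; the function names are hypothetical and the m/z values in the comment are arbitrary illustrations, not measurements from Table 2.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed deviation of an observed m/z from its theoretical value, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 120.0) -> bool:
    """True when the absolute deviation is below the tolerance (120 ppm in the text)."""
    return abs(ppm_error(observed_mz, theoretical_mz)) < tol_ppm


# Illustration with arbitrary numbers: a 0.1 amu deviation at m/z 1000 is 100 ppm,
# so it would pass a 120 ppm filter, whereas a 0.2 amu deviation would not.
```

Because the tolerance is relative, the absolute window widens with mass: 120 ppm corresponds to about 0.12 amu at m/z 1000 and about 0.30 amu at m/z 2500.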
Using a lysine cross-linking reagent, DTSSP, it is possible to assess intra- or inter-molecular protein contacts [21,32]. DTSSP enables cross-linking of amino groups up to 12 Å apart. As SSO0551 was shown to be monomeric and its concentration in the assays was low (2.5 pM), intramolecular cross-links should be favored over intermolecular cross-links. In addition, the low reagent concentration used should avoid unwanted conformational changes that may be induced by multiple intramolecular cross-linking. After reaction with DTSSP, products were subjected to trypsin proteolysis and peptides were identified by MALDI-TOF. As the protein is relatively small, mass signals could be attributed with good confidence (tolerance < 120 ppm). In addition, peak attribution was always confirmed upon reduction of the products and sometimes through redundancy due to missed cleavages.
[3-15] (+3 ppm), respectively. These peptides contain two proximal lysine residues (Lys23-Lys25 and Lys10-Lys14). As shown in Fig. 6, these two peaks were absent from the mass spectrum following DTT reduction, but new peaks at m/z 1493.69 and 1837.81 appear at the expected increment (+2.016 amu theoretically). An additional peak at m/z 2502.22 could be attributed to peptide (+37 ppm) with an intrapeptide cross-link between Lys49 and Lys51. However, the corresponding reduced peak was not detected. Strikingly, every lysines that were reactive with DTSSP were

Discussion
Although COG2042 proteins are distributed among a large number of organisms, no experimental evidence has yet been reported concerning their biochemical characterization and function. As they are not related, even remotely, to any other family of proteins, COG2042 members can be phylogenetically considered as orphans. Figure 7 (Panel A) summarizes the structural information obtained with the chemical modification approach in combination with limited proteolysis procedures. Using MALDI-TOF mass spectrometry to identify protease-accessible sites, we have shown that the most exposed regions are located in the first half of the protein, the Glu73-Arg78 region being hyper-sensitive to various proteases (Fig. 7A). This probably indicates a loop protruding from the globular protein. This charged region is relatively conserved among COG2042 orthologs and lies between two highly conserved segments of COG2042 (motifs I and II as shown in Fig. 7). Chemical modification agrees with limited proteolysis in that the RLI motif is solvent exposed while the C terminus appears rather inaccessible (Fig. 7A). The length of the RLI motif, first defined by conserved domain search [33], matches perfectly with two sensitive proteolytic sites (Arg31 and Lys34). The RLI domain is also present at the N terminus of another group of orthologous proteins, namely COG1245. Remarkably, COG1245 proteins only occur in two domains of life (Archaea and Eukarya), similarly to COG2042 proteins. Although the co-occurrence of protein members is not strictly identical (for example, pyrococci encode COG1245 but not COG2042 polypeptides), such an occurrence pattern may reflect a functional link between the two protein families.
Our initial objective was to obtain as much low-resolution structural information about SSO0551 as possible in order to discriminate among putative three-dimensional models representing the COG2042 protein structure. However, currently available threading tools applied to SSO0551 failed to detect any structurally related proteins. Alternatively, we obtained ten different ab initio models of SSO0551 using the fully-automated ROBETTA server based on ROSETTA procedures [34]. To these ten models, we applied all the low-resolution structural information gathered in this work. For every model, we predicted the location of preferential proteolytic sites using the Nick-Pred software [35]. Models M1, M2 and M6 on one hand, and M9 and M10 on the other, show hypersensitive regions in the RLI motif or C terminus, respectively. These features do not correspond to our experimental data. Only models M4, M7 and M8 predict that the loop Glu73-Arg78 is solvent exposed (data not shown). Among these three models, M4 and M8 respect the ranking of preferential nick-sites for trypsin, chymotrypsin, ArgC and GluC proteases. Solvent accessibility of lysine side chains was evaluated for models M4, M7 and M8 and compared with experimental data (data not shown). All the lysine residues labeled with NHS-biotin are found solvent-exposed in model M8. Manual inspection of cross-linked lysines (Lys10-Lys14 and Lys23-Lys25) revealed that model M4 is
Figure 6 DTSSP cross-linked peptides identified by MALDI-TOF mass spectrometry.
Figure 7 Experimental topology information obtained on SSO0551 and theoretical 3D model M8 shown to be compatible. A - Schematic representation of experimental low-resolution structural information. SSO0551 sequence is displayed with its predicted secondary structural elements (α-helices and β-strands are represented by helices and arrows, respectively). The two conserved COG2042 motifs described in this study (I and II) and the 35 amino acid RLI motif are specified. Vertical arrows indicate sites of limited proteolysis that have been experimentally determined. Lysines labeled with NHS-biotin are indicated with asterisks and lysines involved in DTSSP cross-links are schematically bridged. Sequence numbering refers to the native (untagged) version of SSO0551 experimentally characterized. B - Schematic drawing of ab initio model M8. Secondary structure elements are shown as coils for α-helices, arrows for β-strands, and lines for loops. The color code follows a rainbow style where the N terminus is in blue and the C terminus in red. C - Molecular surface of ab initio model M8. Solvent-exposed lysines, as determined by NHS-biotin labeling experiments, are colored in red whereas other lysines are colored in blue. This view shows the protein in the same orientation as in B.
In conclusion, we have presented a strategy consisting in obtaining low-resolution structural information (determination of nick-sites, solvent-exposed residues, and residue-residue distances) that can be used to distinguish among a large set of theoretical molecular models. The lack of remotely-related structural templates, or the lack of agreement between experimental data and most theoretical models, indicates that such a family of proteins should become a priority in structural genomics projects.
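The residue-residue distance criterion can be expressed as a simple filter on candidate models. The following Python sketch, which is an illustration and not the manual inspection performed in this work, counts how many observed cross-links are compatible with the 12 Å span of DTSSP quoted in the text. The coordinates are hypothetical placeholders standing in for lysine side-chain amino group positions extracted from a model, and the function names are assumptions.

```python
import math

MAX_DTSSP_SPAN = 12.0  # maximal distance bridged by DTSSP (Å), from the text


def satisfied_crosslinks(lys_coords, crosslinks, max_span=MAX_DTSSP_SPAN):
    """Return the cross-linked pairs whose amino groups lie within max_span.

    lys_coords: dict mapping lysine residue number -> (x, y, z) in Å
    crosslinks: iterable of (residue_i, residue_j) pairs observed experimentally
    """
    return [
        (i, j)
        for i, j in crosslinks
        if math.dist(lys_coords[i], lys_coords[j]) <= max_span
    ]


def rank_models(models, crosslinks):
    """Sort model names by decreasing number of satisfied cross-links."""
    scores = {
        name: len(satisfied_crosslinks(coords, crosslinks))
        for name, coords in models.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Cross-links reported in the text
observed = [(10, 14), (23, 25)]

# HYPOTHETICAL coordinates for two toy models (not taken from the ROBETTA models)
toy_models = {
    "model_a": {10: (0, 0, 0), 14: (6, 0, 0), 23: (0, 20, 0), 25: (0, 28, 0)},
    "model_b": {10: (0, 0, 0), 14: (15, 0, 0), 23: (0, 20, 0), 25: (0, 40, 0)},
}
ranking = rank_models(toy_models, observed)
```

A distance filter of this kind only excludes models; pairs that are close in sequence, such as those observed here, are satisfied by many folds, so the criterion is most useful in combination with the nick-site and labeling data.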

Chemical and biological reagents
Most chemicals used in this study were obtained from Sigma and were of analytical grade. Oligonucleotide primers were purchased from Genset. N-hydroxysuccinimide-biotin (NHS-biotin) and 3,3'-dithio-bis [sulfosuccinimidyl-propionate] (DTSSP) were obtained from Pierce. Matrices for Matrix-assisted Laser Desorption Ionization-Time of Flight (MALDI-TOF) mass spectrometry and calibration standards were purchased from Bruker Daltonics. Sequencing grade proteolytic enzymes were from Roche Applied Science.

Cloning and overexpression of SSO0551
Two constructs were designed in order to overexpress the SSO0551 ORF (starting with an ATG codon at nucleotide 484790 on the Crick strand of the S. solfataricus P2 genome (NC_002754)) and an N-terminal extended version of SSO0551 (starting with an ATG codon at nucleotide 484916). For both proteins, an N-terminal 6His tag was added to facilitate purification of the recombinant products. For this purpose, the synthetic oligonucleotide primers were oAB22 (5'-gctagcATGAAGCCCAAACCC-3') and oAB49 (5'-gctagcATGAAGGTATATATTATAGAC-3'), which both contain an engineered NheI site, and oAC34 (5'-cggatcctacTCATTTTTCAAGTATTTTC-3') and oAE62 (5'-ggatcctcaTCATTTTTCAAGTATTTTCTC-3'), which both contain an engineered BamHI site (nucleotides not present in the original sequence, including the restriction sites, are shown in lower case). Oligonucleotide pairs oAB22/oAC34 and oAB49/oAC34 were used for two distinct PCR amplifications of SSO0551 with S. solfataricus total DNA as template. A 643-bp fragment (N-ter 6His-tag extended version of SSO0551) and a 517-bp fragment (N-ter 6His-tag SSO0551) were obtained, respectively. They were cloned into pCRScript-cam (Stratagene), resulting in plasmids pSBTN-AB36 and pSBTN-AB37, respectively. The two inserts were removed by digestion with NheI and BamHI and ligated with T4 DNA ligase into plasmid pSBTN-AB23 (Armengaud J. & Chaumont V., unpublished data), a derivative of pCR T7/NTtopo (Invitrogen) containing a T7 promoter and a 6His tag, previously digested with the same endonucleases. The resulting plasmids, pSBTN-AB30 and pSBTN-AB31, respectively, were verified by DNA sequencing in order to ascertain the integrity of the nucleotide sequence. Overexpression of the recombinant SSO0551 constructs was achieved with the E. coli Rosetta(DE3)pLysS strain (Novagen), freshly transformed with the plasmids described above. Cultures were carried out at 30°C as described earlier [6].

Purification of recombinant SSO0551 protein
The purification of recombinant SSO0551 was performed from 44 g (wet material) of packed cells. Buffer A consisted of 50 mM K2HPO4/KH2PO4 buffer (pH 7.2) containing 400 mM K-glutamate. The pellet was thawed on ice and resuspended in 120 mL of buffer A. The cells were disrupted by sonication with a total delivered energy of 71 kJ. The cell extract was then centrifuged at 30,000 g for 20 min at 4°C to remove cellular debris and aggregated proteins. The supernatant was subjected to a 20 min heat treatment using a water bath maintained at 70°C, and immediately centrifuged a second time at 30,000 g for 20 min at 4°C. Chromatographic steps were performed at room temperature using an Äkta Purifier FPLC system (Amersham Biosciences). The 135 mL supernatant was applied at a flow rate of 2.8 mL/min onto an XK 26 × 20 column (Amersham Biosciences) containing 50 mL of Chelating Sepharose Fast Flow (Amersham Biosciences), previously loaded with 200 mM NiSO4, washed with milliQ water and equilibrated with buffer A containing 50 mM imidazole. The fraction collected during the IMAC loading was shown to contain the SSO0551 protein. This 222 mL fraction was concentrated to a volume of 56 mL by means of Centricon Plus-20 filtration units (Millipore) and then dialyzed overnight at 4°C against 20 mM K2HPO4/KH2PO4 buffer (pH 7.2) containing 20 mM NaCl (buffer B). The 78 mL supernatant obtained after centrifugation at 30,000 g for 10 min at 4°C was divided and applied in two separate runs onto a 6 mL Resource-S ion-exchange column (30 mm × 16 mm, 15 µm) from Amersham Biosciences, previously equilibrated with buffer B and operated at a flow rate of 3 mL/min. After a 10 column volume wash with buffer B, proteins were resolved with a 25 column volume linear gradient from 20 to 500 mM NaCl in buffer B. Recombinant SSO0551 was eluted at approximately 250 mM NaCl and desalted by overnight dialysis against buffer B.
The resulting 20 mL protein solution was concentrated to a volume of 8 mL by means of Centricon Plus-20 filtration units (Millipore). The sample was again divided and applied in two separate runs onto a Superdex 75 gel filtration medium packed into a HR 16/50 column at a flow rate of 1.5 mL/min in 20 mM K2HPO4/KH2PO4 buffer (pH 7.2) containing 100 mM NaCl. The fractions obtained from the two runs were pooled and dialyzed overnight at 4°C against 10 mM HEPES buffer (pH 7.2). After dialysis, the fraction was centrifuged at 26,000 g for 20 min at 4°C and the protein concentration was measured by spectrophotometry using a molar absorption coefficient of 19060 M⁻¹ cm⁻¹ at 280 nm. The purified protein was flash frozen in liquid nitrogen and stored at -80°C at a concentration of 0.48 mg/mL.
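The concentration measurement above rests on the Beer-Lambert law. As a brief sketch of the conversion in Python, the functions below derive a molar concentration from an absorbance reading and convert the 0.48 mg/mL stock to micromolar. The molar mass is an assumption here: it is approximated by the intact-protein m/z of 19198.4 reported by MALDI-TOF, and the function names are hypothetical.

```python
EPSILON_280 = 19060.0  # molar absorption coefficient at 280 nm (M^-1 cm^-1), from the text
MOLAR_MASS = 19198.4   # g/mol; ASSUMPTION: approximated by the intact-protein m/z from MALDI-TOF


def molar_from_a280(a280: float, path_cm: float = 1.0) -> float:
    """Beer-Lambert law: c = A / (epsilon * l), in mol/L."""
    return a280 / (EPSILON_280 * path_cm)


def mg_per_ml_to_micromolar(conc_mg_ml: float, molar_mass: float = MOLAR_MASS) -> float:
    """mg/mL equals g/L, so dividing by g/mol gives mol/L; scaled here to µM."""
    return conc_mg_ml / molar_mass * 1e6


# The 0.48 mg/mL stock corresponds to roughly 25 µM under the molar mass assumption
stock_uM = mg_per_ml_to_micromolar(0.48)
```

Under this assumption the stock concentration is close to the 25.2 µM sample injected for the gel filtration assay.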

Circular dichroism
Far- and near-UV circular dichroism spectra were recorded at 20°C between 200 and 300 nm on a J-810 Jasco spectropolarimeter equipped with a PTC-424S Jasco Peltier, using a quartz cuvette of 1 mm path length, with a 20 nm/min scanning speed and a bandwidth of 1 nm. Three spectra of purified SSO0551 at 1.92 µM in 10 mM HEPES buffer (pH 7.2) were averaged and baseline-corrected for the buffer contribution. Experimental data were analyzed using the program K2D [36] described by Andrade et al. [37].

Determination of native molecular mass by gel filtration
The native molecular mass of SSO0551 was estimated by gel filtration chromatography on a Superdex 200 gel packed into an HR 10/30 column (Amersham Biosciences) with a final bed volume of 24 mL. The column was equilibrated at room temperature at a flow rate of 0.5 mL/min with 50 mM Tris/HCl buffer, pH 8.3, containing 50 mM NaCl and eluted with the same buffer. Protein standards used to calibrate the column were ribonuclease A (15.8 kDa), chymotrypsinogen A (21.2 kDa), ovalbumin (49.4 kDa), albumin (69.8 kDa), aldolase (191 kDa) and catalase (215 kDa), all from Amersham Biosciences. The exclusion limit was evaluated with blue dextran 2000 (Amersham Biosciences). A 90 µL sample of SSO0551 at 25.2 µM was injected and the absorbance at 280 and 266 nm was monitored.
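A common way to turn such a calibration into a mass estimate is to fit the partition coefficient Kav = (Ve - V0)/(Vt - V0) against log10 of the standard masses and then invert the line for the sample. The sketch below assumes this Kav treatment; the text does not state which fitting procedure was used. The standard masses are those listed above, while the function names are hypothetical and the elution and void volumes must come from the actual chromatograms.

```python
import math

# Masses (kDa) of the calibration standards listed in the text.
STANDARDS_KDA = {
    "ribonuclease A": 15.8,
    "chymotrypsinogen A": 21.2,
    "ovalbumin": 49.4,
    "albumin": 69.8,
    "aldolase": 191.0,
    "catalase": 215.0,
}


def kav(elution_ml, void_ml, total_ml):
    """Partition coefficient Kav = (Ve - V0) / (Vt - V0)."""
    return (elution_ml - void_ml) / (total_ml - void_ml)


def fit_calibration(points, void_ml, total_ml):
    """Least-squares fit of Kav = slope * log10(M) + intercept.

    points: iterable of (mass_kda, elution_volume_ml) for the standards.
    """
    xs = [math.log10(m) for m, _ in points]
    ys = [kav(ve, void_ml, total_ml) for _, ve in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_mass(elution_ml, slope, intercept, void_ml, total_ml):
    """Invert the calibration line to get an apparent mass in kDa."""
    return 10 ** ((kav(elution_ml, void_ml, total_ml) - intercept) / slope)
```

With the 24 mL bed volume as `total_ml` and the blue dextran elution volume as `void_ml`, `estimate_mass` returns the apparent native mass for the measured SSO0551 elution volume.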

Mass spectrometry
Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight (MALDI-TOF) mass measurements were performed using a Biflex IV instrument (Bruker Daltonics) in positive ionization mode. Protein samples and large peptide fragments (>3500 Da) were applied to the target using, as matrix, sinapinic acid prepared as a saturated solution in 30% acetonitrile, 70% milli-Q water and 0.1% TFA. Samples were prepared using the dried droplet method and measured in linear mode. Small peptide samples were measured in reflectron mode using α-cyano-4-hydroxycinnamic acid in 30% acetonitrile containing 0.1% TFA as matrix. Mass spectra were obtained by summation of 100-210 laser shots. For the determination of entire protein masses, the instrument was calibrated using a mixture of either chymotrypsin and bovine serum albumin, or apomyoglobin and aldolase. For peptides, the instrument was calibrated using a pepmix calibration kit (Bruker Daltonics). When necessary, the mass spectrometer was also internally calibrated using some of the theoretical peptide masses.

Limited protease digestion
For in-solution partial digestion, 0.2 nmol of pure SSO0551 was diluted into buffer D1 (20 mM Tris/HCl, pH 7.8), buffer D2 (20 mM NH4HCO3, pH 7.8) or buffer D3 (20 mM Tris/HCl, pH 7.8, containing 10 mM CaCl2 and 5 mM DTT). Trypsin or chymotrypsin was added to SSO0551 diluted into buffer D1, whereas Glu-C or Arg-C was added to the protein diluted into buffer D2 or D3, respectively. Several enzyme/protein ratios (1:50, 1:20 and 1:2, w/w) were tested for each endoprotease. The digestions were performed at room temperature and aliquots taken from 30 s up to 10-240 min were analyzed. Digested samples were desalted using ZipTip C18 or ZipTip C4 pipette tips (Millipore) according to the manufacturer's protocol and their masses were measured directly by MALDI-TOF. When needed, larger partially proteolyzed mixtures (10 nmol of SSO0551) were fractionated by reverse-phase HPLC using an Aquapore RP-300 column (PerkinElmer; 100 × 1.0 mm, 7 µm, 300 Å pore size) run at 200 µL/min with a linear gradient from 5 to 90% acetonitrile in 0.1% TFA over 45 min. The elution was monitored at 220 nm with an Agilent 1100 Series HPLC system equipped with a G1315 diode array detector. Individual fractions were concentrated by evaporation in a SpeedVac (Savant) and directly analyzed by MALDI-TOF.

Lysine labeling by NHS-biotin
N-hydroxysuccinimide-biotin (NHS-biotin) was used to label ε-amino groups of SSO0551 lysines. After reaction, each biotin label is coupled to its lysine through a stable amide bond. The increase in mass for each label (C10H14N2O2S) should be 226.293 amu if the average mass is considered, or 226.078 amu in monoisotopic mode. Modification of lysine residues was carried out by incubating 1.25 nmol of SSO0551 in 20 mM HEPES, pH 7.2, with various amounts of freshly prepared NHS-biotin reagent dissolved in anhydrous dimethylsulfoxide. After 30 min of incubation at room temperature, the excess reagent was removed by a 30 min micro-dialysis against 20 mM HEPES, pH 7.2. Samples were directly desalted using ZipTip C4 pipette tips (Millipore) prior to MALDI-TOF analysis. When needed, they were digested overnight with an endoprotease (trypsin, Glu-C or Arg-C) and desalted using ZipTip C18 pipette tips (Millipore) prior to mass analysis.
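The mass increment per biotin label can be checked directly from the elemental formula quoted above. The following is a minimal sketch: the atomic masses are standard tabulated values chosen here and are not necessarily those used by the calculator cited in the text, so the average mass may differ from the quoted figure in the third decimal; the helper names are hypothetical.

```python
import re

# Standard atomic masses (assumed tables, not taken from the source).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
                "O": 15.99491462, "S": 31.97207117}
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

BIOTIN_LABEL = "C10H14N2O2S"  # net addition per labeled lysine, from the text


def formula_mass(formula, table):
    """Sum atomic masses for a simple formula such as 'C10H14N2O2S'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += table[element] * (int(count) if count else 1)
    return total


def labeled_mass(unmodified_mass, n_labels, table=AVERAGE):
    """Expected mass of a protein or peptide carrying n_labels biotin labels."""
    return unmodified_mass + n_labels * formula_mass(BIOTIN_LABEL, table)


if __name__ == "__main__":
    print(round(formula_mass(BIOTIN_LABEL, MONOISOTOPIC), 3))
    print(round(formula_mass(BIOTIN_LABEL, AVERAGE), 3))
```

Comparing `labeled_mass` for n = 0, 1, 2, ... with the measured spectrum gives the number of lysines modified in each species.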

Lysine cross-linking with DTSSP
3,3'-Dithio-bis[sulfosuccinimidyl-propionate] (DTSSP) was used to cross-link two ε-amino groups of SSO0551 lysines, essentially as described in [32]. The mass increase (in monoisotopic mode) for each label should be 191.991 amu (C6H8O3S2), or 87.998 amu (C3H4OS) when DTT treated. The increase in mass for an intramolecular cross-link between two lysines should be 173.981 amu (C6H6O2S2), or 175.997 amu (2 × C3H4OS) when DTT treated. Therefore, after reduction of the disulfide bridge by DTT, an additional increase of 2.016 amu should be measured. The reaction was carried out by incubating 0.25 nmol of SSO0551 in 20 mM NaH2PO4/Na2HPO4, pH 7.5, containing 150 mM NaCl, with various amounts of DTSSP reagent (20, 35 or 50 mol of DTSSP per mol of polypeptide). After 30 min of incubation at room temperature, the excess reagent was removed by a 30 min micro-dialysis against 20 mM NaH2PO4/Na2HPO4, pH 7.5, containing 150 mM NaCl. Prior to overnight trypsin proteolysis, urea (330 mM final concentration) was added to each sample. Before being desalted using ZipTip C18 pipette tips (Millipore), the digested peptide mixture was, when required, treated with 50 mM DTT for 30 min at 37°C to reduce the disulfide linker.
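The DTSSP mass shifts quoted above follow from the elemental compositions, and the 2.016 amu gain on reduction corresponds to the two hydrogens added when the disulfide is cleaved. The sketch below recomputes them and shows how they would be applied to peptide masses. The atomic masses are standard monoisotopic values assumed here, and the function names are hypothetical.

```python
# Monoisotopic atomic masses (standard values, assumed here).
MONO = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "S": 31.97207117}


def mass(c, h, o, s):
    """Monoisotopic mass of a C/H/O/S composition."""
    return c * MONO["C"] + h * MONO["H"] + o * MONO["O"] + s * MONO["S"]


LABEL = mass(6, 8, 3, 2)              # label on a single lysine, C6H8O3S2
LABEL_REDUCED = mass(3, 4, 1, 1)      # same label after DTT, C3H4OS
CROSSLINK = mass(6, 6, 2, 2)          # bridge between two lysines, C6H6O2S2
CROSSLINK_REDUCED = 2 * LABEL_REDUCED  # both halves after DTT


def crosslinked_pair_mass(peptide_1, peptide_2):
    """Mass of two peptides joined by one intact DTSSP bridge."""
    return peptide_1 + peptide_2 + CROSSLINK


def reduced_half_masses(peptide_1, peptide_2):
    """Masses of the two separated peptides after DTT treatment."""
    return peptide_1 + LABEL_REDUCED, peptide_2 + LABEL_REDUCED


if __name__ == "__main__":
    for name, value in [("label", LABEL), ("label, reduced", LABEL_REDUCED),
                        ("cross-link", CROSSLINK),
                        ("cross-link, reduced", CROSSLINK_REDUCED)]:
        print(f"{name}: {value:.3f}")
```

A signal that disappears on DTT treatment and is replaced by two signals at the values given by `reduced_half_masses` is consistent with a bridge between two different peptides, whereas a 2.016 amu shift of a single signal points to a bridge within one peptide.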

In silico analysis
Sequence searching was performed using PSI-BLAST with default parameters. Multiple sequence alignments were performed using the VectorNTI software package (Informax Inc). Secondary structure predictions were obtained through the PSIPRED v2.4 web-interfaced facilities [38] described by McGuffin et al. [39]. The molar absorption coefficient at 280 nm for SSO0551 was calculated from the amino acid composition of the recombinant protein [40,41]. Monoisotopic and average masses of both the DTSSP cross-linker and NHS-biotin were calculated using a web-interfaced molecular weight calculator [42]. The peptide assignment and the first attempt at identifying the labeled and cross-linked products were performed using the FindMod package at ExPASy [43]. If no match was found, a more detailed search for multiple labels or combinatorial cross-linkable peptide pairs was carried out. Partially proteolyzed products were assigned using the FindPept tool [44]. Tertiary structure predictions were carried out using publicly available online services, including 3D-PSSM [45], FUGUE [46] and PSIPRED [39]. Ab initio modeling was performed using the ROBETTA server [34,47]. Each model was analyzed in terms of proteolytic sensitivity using the NICKPRED software [35,48,49]. Residue accessibility was calculated using a modified version of Connolly's MS program ([50]; Pellequer JL, unpublished results). Structural homologs were searched using the DALI web server from the European Bioinformatics Institute [51]. Model views were obtained with the MOLSCRIPT program [52] and rendered using RASTER3D [53].
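The combinatorial search for cross-linkable peptide pairs mentioned above can be pictured as follows. This is a simplified sketch and not the procedure actually used by the authors: it assumes neutral monoisotopic masses, a fixed tolerance in Da and a single intact DTSSP bridge (173.981 amu), and the function name and input layout are hypothetical.

```python
from itertools import combinations

CROSSLINK_SHIFT = 173.981  # intact DTSSP bridge, monoisotopic (from the text)


def find_crosslink_candidates(observed_mass, peptides, tol_da=0.1):
    """List peptide combinations compatible with one DTSSP bridge.

    observed_mass: neutral monoisotopic mass of the unassigned signal.
    peptides: dict mapping a peptide name to its neutral monoisotopic mass;
              only lysine-containing peptides should be supplied.
    Returns (name_1, name_2, expected_mass) tuples; name_2 is None when the
    bridge would lie within a single peptide.
    """
    hits = []
    items = sorted(peptides.items())
    # Bridge between two different peptides.
    for (n1, m1), (n2, m2) in combinations(items, 2):
        expected = m1 + m2 + CROSSLINK_SHIFT
        if abs(expected - observed_mass) <= tol_da:
            hits.append((n1, n2, expected))
    # Bridge between two lysines of the same peptide.
    for name, m in items:
        expected = m + CROSSLINK_SHIFT
        if abs(expected - observed_mass) <= tol_da:
            hits.append((name, None, expected))
    return hits
```

Each candidate still needs independent confirmation, for example by checking that the signal shifts or splits as expected after DTT reduction.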