Deciphering structure and topology of conserved COG2042 orphan proteins
© Armengaud et al; licensee BioMed Central Ltd. 2005
Received: 13 October 2004
Accepted: 08 February 2005
Published: 08 February 2005
The cluster of orthologous group COG2042 has members in all sequenced Eukaryota as well as in many Archaea. The cellular function of these proteins of ancient origin remains unknown. PSI-BLAST analysis does not indicate a possible link with even remotely-related proteins that have been functionally or structurally characterized. As a prototype among COG2042 orthologs, SSO0551 protein from the hyperthermophilic archaeon Sulfolobus solfataricus was purified to homogeneity for biophysical characterization.
The untagged protein is thermostable and behaves as a monomer in gel filtration experiments. Several mass spectrometry-based strategies were combined to obtain a set of low-resolution structural information. Kinetic data from limited proteolysis with various endoproteases concur in showing that the region Glu73-Arg78 is hypersensitive, and thus accessible and flexible. Lysine labeling with NHS-biotin and cross-linking with DTSSP revealed that the 35 amino acid RLI motif at the N terminus is solvent exposed. Cross-links between Lys10-Lys14 and Lys23-Lys25 indicate that these residues are spatially close and in a conformation suitable for cross-linking. These experimental data were used to rank multiple three-dimensional models generated by a de novo procedure.
Our data indicate that COG2042 proteins may share a novel fold. Combining biophysical data, mass spectrometry data and molecular modeling is a useful strategy to obtain structural information and to help prioritize targets in structural genomics programs.
Comparative studies of completely sequenced genomes from the three domains of life, i.e. Bacteria, Archaea and Eukaryota, have shown that proteins involved in the organization or processing of genetic information (structures of ribosome and chromatin, translation, transcription, replication and DNA repair) display a closer relationship between Archaea and Eukaryota than between Bacteria and Eukaryota [2–4]. To identify new proteins involved in such important cellular mechanisms, an exhaustive inventory of proteins of unknown function common to Eukaryota and Archaea but absent from Bacteria has been compiled [5–7]. Among such proteins, the Cluster of Orthologous Group COG2042 comprises proteins ubiquitously present in Eukaryota and present in many, but not all, Archaea, a hallmark of their ancient origin. The corresponding ancestral protein should have been present in the common ancestor of these two domains of life. Some partial experimental data are available for the Saccharomyces cerevisiae COG2042 homolog. Deletion of the Yor006c gene was shown to result in a viable phenotype, but moderate growth defects were noticed on a fermentable carbon source [8, 9]. Two putative protein partners for Yor006c were identified through a high-throughput two-hybrid study: Ydl017w, a serine/threonine kinase also known as cell division control protein 7 (Cdc7), and Yil025c, a hypothetical ORF. However, the cellular function of COG2042 proteins remains unknown.
A polar region, named RLI, is conserved at the N terminus of COG2042 proteins as well as at the N terminus of another cluster of orthologous proteins, namely COG1245. The latter, exemplified by SSO0287 in Sulfolobus solfataricus, are large proteins (about 600 residues) that encompass four different domains: an RLI domain, a [4Fe-4S] ferredoxin domain, and two ATPase domains of the type usually found in ABC transporters. Their putative function is currently under discussion [12, 13] but could be related to rRNA metabolism. Indeed, four of the eleven proteins shown to interact with the yeast COG1245 homolog (Ydr091c) were identified as involved in rRNA metabolism (Ymr047c, Ydl213c, Ylr340w, Ylr192c). Experimental data on the human homolog of Ydr091c indicated that this protein reversibly associates with RNase L, and thus COG1245 proteins were named RNase L inhibitor.
Because knowledge of protein structure is of high importance for understanding protein function, considerable effort has recently been invested in high-throughput protein structure determination programs. Recent reports indicate that only a relatively small percentage of expressed and purified proteins are amenable to full 3D structure determination by NMR or X-ray crystallography [16, 17]. In silico modeling (homology modeling, fold recognition, ab initio and de novo modeling) is the alternative for quickly obtaining the fold of a protein. However, such an approach is sometimes ambiguous in reliably identifying correct structures for protein sequences only remotely related to those found in the PDB database. A promising strategy is the use of experimental data (if possible easily obtained) for model discrimination or refinement [18–20]. For example, the tertiary structure of bovine basic fibroblast growth factor (FGF-2) was probed with a lysine-specific cross-linking agent and subjected to tryptic peptide mapping by mass spectrometry to identify the sites of cross-linking. The low-resolution interatomic distance information obtained experimentally allowed the authors to distinguish among threading models in spite of a relatively low sequence similarity (13 % of identical residues). Interestingly, the constant development of novel cross-linking reagents suitable for mass spectrometry enables enrichment of cross-linked peptides, which facilitates such a strategy. A chemical modification approach [23–26], in combination with limited proteolysis procedures [27, 28], can also provide useful structural constraints for model refinement.
A step further is to attempt such approaches with proteins having no detectable homologs. In order to gain insight into the topology of COG2042 members and, if possible, to use these experimental data to discriminate among structural protein templates, we combined limited proteolysis, lysine labeling and cross-linking strategies. The protein SSO0551 from the hyperthermophilic archaeon Sulfolobus solfataricus was chosen as a prototype because of its thermostability and the probable absence of post-translational modifications when produced in recombinant form in Escherichia coli. The SSO0551 protein is monomeric with a low molecular mass (19 kDa), a size readily amenable to characterization by mass spectrometry. Our results reveal that the polar RLI motif at the N terminus is probably structured and solvent exposed, pointing to a common trait between COG2042 and COG1245 proteins, the latter group also being conserved in Eukaryota and Archaea but absent in Bacteria. The accessible and flexible regions defined by limited proteolysis, combined with lysine accessibility assessed by NHS-biotin labeling and DTSSP cross-linking, allowed us to discriminate among the ten top-ranking de novo three-dimensional (3D) models.
Table: Fingerprint identification of recombinant products from pSBTN-AB31 and pSBTN-AB30 constructs. Columns: [MH]+ observed (in amu) and Δmass (in ppm) for each construct; [MH]+ expected (in amu). Table body not recovered.
The native molecular mass of SSO0551 was determined by size-exclusion chromatography on a calibrated Superdex 200 HR10/30 column. The pure protein eluted as a peak centered at 39.1 mL under the assay conditions, corresponding to an apparent molecular mass lower than 20 kDa. This elution profile indicates that this structured protein behaves as a compact monomer.
During the earliest events of trypsin proteolysis, analyzed under various conditions to detect both large products and smaller peptides, singly charged cations with m/z of 8614.6 amu, 10603.4 amu, 6489.8 amu, and 12724.1 amu were attributed to fragments [1–75] (Δmass: -178 ppm), [76–166] (+89 ppm), [1–56] (+145 ppm), and [57–166] (+200 ppm), respectively (data not shown). These data clearly indicate that Lys75 and Arg56 are two sites of early cleavage by trypsin. Identification of peptides Val32-Lys166 (15478.7 amu, -53 ppm) and Gly35-Lys166 (15191.2 amu, +153 ppm) also indicates that Arg31 and Lys34 could be two other initial nick-sites.
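The Δmass values quoted for each fragment are relative deviations between observed and expected masses. A minimal sketch of that arithmetic follows, assuming the usual definition Δ = (observed − expected) / expected × 10^6. The function names are illustrative, and the expected mass is back-calculated from the reported deviation rather than taken from the source.

```python
def ppm_error(observed: float, expected: float) -> float:
    """Relative mass deviation in parts per million."""
    return (observed - expected) / expected * 1e6


def expected_from_ppm(observed: float, ppm: float) -> float:
    """Invert ppm_error: recover the expected mass from an observed mass and its deviation."""
    return observed / (1.0 + ppm * 1e-6)


# Fragment [1-75] as reported in the text: observed 8614.6 amu, deviation -178 ppm.
expected_1_75 = expected_from_ppm(8614.6, -178.0)
print(round(expected_1_75, 2))
print(round(ppm_error(8614.6, expected_1_75), 1))
```

At these masses a deviation of about 100 ppm is close to 1 amu, which gives a feel for the tolerance used when assigning large fragments.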
Similar experiments with endoproteinase Arg-C resulted in the observation of two pairs of complementary peptides with m/z of 1920.9 amu ([1–15], +70 ppm) and 17296.8 amu ([16–166], -95 ppm) on the one hand, and 8998.1 amu ([1–78], -176 ppm) and 10217.5 amu ([79–166], +332 ppm) on the other hand. These data indicate that Arg78 and Arg15 are the main sites proteolyzed by Arg-C. Chymotrypsin attacks native SSO0551 mainly at Phe74, because two complementary peptides with m/z of 8487.0 amu ([1–74], -249 ppm) and 10734.2 amu ([75–166], -157 ppm) were clearly detected. Glu73 is the main site proteolyzed by Glu-C, as peptides with m/z of 8338.7 amu ([1–73], -118 ppm) and 10880.0 amu ([74–166], +28 ppm) were detected. In all these analyses, smaller peptide fragments that accumulated over time could be attributed to further proteolysis of the products arising from the initial attacks (data not shown). All these results concur in showing that Glu73-Arg78 and Glu28-Arg31 are two accessible, solvent-exposed regions of the protein, as they can be proteolyzed by several endopeptidases, the former being clearly hypersensitive. Local unfolding, not just surface exposure, is necessary for efficient in vitro proteolysis, because the polypeptide segment being cleaved must form a specific structure with the associated protease. For this reason, the Glu73-Arg78 region should also correspond to a flexible region, i.e. a protruding loop.
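The reasoning that several proteases converge on one short stretch can be illustrated by grouping the early cleavage sites by position. The sketch below uses only the main sites named in the text. The clustering rule (a gap of at most 3 residues, at least 3 distinct proteases) is an assumption chosen for illustration, not the authors' procedure.

```python
# Main early cleavage sites named in the text (residue after which the chain is cut).
PRIMARY_SITES = {
    "trypsin": [31, 34, 56, 75],
    "Arg-C": [15, 78],
    "chymotrypsin": [74],
    "Glu-C": [73],
}


def cluster_sites(sites_by_protease, max_gap=3):
    """Group cleavage positions separated by at most max_gap residues."""
    tagged = sorted(
        (pos, protease)
        for protease, positions in sites_by_protease.items()
        for pos in positions
    )
    clusters = []
    for pos, protease in tagged:
        if clusters and pos - clusters[-1]["end"] <= max_gap:
            # Extend the current cluster and record the protease that cut here.
            clusters[-1]["end"] = pos
            clusters[-1]["proteases"].add(protease)
        else:
            clusters.append({"start": pos, "end": pos, "proteases": {protease}})
    return clusters


clusters = cluster_sites(PRIMARY_SITES)
# Keep only stretches cut by at least three different proteases (illustrative threshold).
hotspots = [(c["start"], c["end"]) for c in clusters if len(c["proteases"]) >= 3]
print(hotspots)
```

With these inputs the only stretch cut by three or more proteases spans residues 73 to 78. The Glu28-Arg31 region does not stand out here because only its trypsin sites are listed explicitly in the text.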
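Table: Monoisotopic [M+H]+ peptides generated by various proteases after NHS-biotin labeling of SSO0551. Table body only partially recovered; the surviving rows give a peptide sequence and the lysines concerned:
VYIIDYHKDDPKR: K10 and K14
KLVKLK: K20 and K23
LVKLKIAEFTR: K23 and K25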
Our initial objective was to obtain as much low-resolution structural information on SSO0551 as possible in order to discriminate among putative three-dimensional models representing the COG2042 protein structure. However, currently available threading tools applied to SSO0551 failed to detect any structurally related proteins. As an alternative, we obtained ten different ab initio models of SSO0551 using the fully automated ROBETTA server, based on ROSETTA procedures. To these ten models, we applied all the low-resolution structural information gathered in this work. For every model, we predicted the location of preferential proteolytic sites using the NickPred software. Models M1, M2 and M6 on the one hand, and M9 and M10 on the other, show hypersensitive regions in the RLI motif or the C terminus, respectively. These features do not correspond to our experimental data. Only models M4, M7 and M8 predict that the loop Glu73-Arg78 is solvent exposed (data not shown). Among these three models, M4 and M8 respect the ranking of preferential nick-sites for trypsin, chymotrypsin, ArgC and GluC proteases. Solvent accessibility of lysine side chains was evaluated for models M4, M7 and M8 and compared with experimental data (data not shown). All the lysine residues labeled with NHS-biotin are solvent exposed in model M8. Manual inspection of cross-linked lysines (Lys10-Lys14 and Lys23-Lys25) revealed that model M4 is not valid because of the opposite orientation of Lys10 and Lys14. Figure 7 (Panels B & C) shows cartoon views of the M8 model, which fulfills all our experimental constraints. For this model, the distances between the two reactive amine groups of the Lys10-Lys14 and Lys23-Lys25 pairs are 12.7 Å and 13.3 Å, respectively. A DALI search for structural homologs using model M8 did not yield significant scores with any known PDB structure. This is consistent with the PSI-BLAST results and may indicate that COG2042 proteins share a novel fold.
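The successive elimination of models described in this paragraph can be written as a chain of filters. The sketch below only transcribes the outcomes reported in the text into sets. The helper name `apply_filters` and the data layout are illustrative assumptions, and no structural calculation is performed.

```python
MODELS = [f"M{i}" for i in range(1, 11)]

# Outcomes transcribed from the text.
LOOP_73_78_EXPOSED = {"M4", "M7", "M8"}        # predicted by the nick-site analysis
NICK_SITE_RANKING_RESPECTED = {"M4", "M8"}     # among the three models above
CROSSLINK_INCOMPATIBLE = {"M4"}                # Lys10 and Lys14 point in opposite directions

FILTERS = [
    ("Glu73-Arg78 loop exposed", lambda m: m in LOOP_73_78_EXPOSED),
    ("nick-site ranking respected", lambda m: m in NICK_SITE_RANKING_RESPECTED),
    ("cross-links geometrically possible", lambda m: m not in CROSSLINK_INCOMPATIBLE),
]


def apply_filters(models, filters):
    """Apply each experimental constraint in turn, recording the survivors after each step."""
    surviving = list(models)
    history = []
    for name, keep in filters:
        surviving = [m for m in surviving if keep(m)]
        history.append((name, list(surviving)))
    return surviving, history


surviving, history = apply_filters(MODELS, FILTERS)
for name, left in history:
    print(f"{name}: {left}")
```

Ordering the filters from cheapest to most demanding mirrors the text, where manual inspection of the cross-linked lysines is reserved for the last candidates.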
COG2042 proteins are thus a target of choice for genomic structural studies.
In conclusion, we have presented a strategy that consists of obtaining low-resolution structural information (determination of nick-sites, solvent-exposed residues, and residue-residue distances) that can be used to distinguish among a large set of theoretical molecular models. A lack of remotely related structural templates, or poor agreement between experimental data and most theoretical models, indicates that such a family of proteins should become a priority in structural genomics projects.
Most chemicals used in this study were obtained from Sigma and were of analytical grade. Oligonucleotide primers were purchased from Genset. N-hydroxysuccinimide-biotin (NHS-biotin) and 3,3'-dithio-bis [sulfosuccinimidyl-propionate] (DTSSP) were obtained from Pierce. Matrices for Matrix-assisted Laser Desorption Ionization-Time of Flight (MALDI-TOF) mass spectrometry and calibration standards were purchased from Bruker Daltonics. Sequencing grade proteolytic enzymes were from Roche Applied Science.
Two constructs were designed to overexpress the SSO0551 ORF (starting with an ATG codon at nucleotide 484790 on the Crick strand of the S. solfataricus P2 genome (NC_002754)) and an N-terminally extended version of SSO0551 (starting with an ATG codon at nucleotide 484916). For both proteins, an N-terminal 6His tag was added to facilitate purification of the recombinant products. For this purpose, the synthetic oligonucleotide primers were oAB22 (5'-gctagc ATGAAGCCCAAACCC-3') and oAB49 (5'-gctagc ATGAAGGTATATATTATAGAC-3'), which both contain an engineered Nhe I site, and oAC34 (5'-cggatcct acTCATTTTTCAAGTATTTTC-3') and oAE62 (5'-ggatcc tcaTCATTTTTCAAGTATTTTCTC-3'), which both contain an engineered Bam HI site (restriction sites underlined in the primer sequences and nucleotides not present in the original sequence shown in lower case). Oligonucleotide pairs oAB22/oAC34 and oAB49/oAC34 were used for two distinct PCR amplifications of SSO0551 with S. solfataricus total DNA as template. A 643-bp fragment (N-ter 6His-tag extended version of SSO0551) and a 517-bp fragment (N-ter 6His-tag SSO0551) were obtained, respectively. They were cloned into pCRScript-cam (Stratagene), resulting in plasmids pSBTN-AB36 and pSBTN-AB37, respectively. The two inserts were excised by digestion with Nhe I and Bam HI and ligated with T4 DNA ligase into plasmid pSBTN-AB23 (Armengaud J. & Chaumont V., unpublished data), a derivative of pCR T7/NT-topo (Invitrogen) containing a T7 promoter and a 6His tag, previously digested with the same endonucleases. The resulting plasmids, pSBTN-AB30 and pSBTN-AB31, respectively, were verified by DNA sequencing to confirm the integrity of the nucleotide sequence. Overexpression of the recombinant SSO0551 constructs was achieved in the E. coli Rosetta(DE3)pLysS strain (Novagen), freshly transformed with the plasmids described above. Cultures were grown at 30°C as described earlier.
The purification of recombinant SSO0551 was performed from 44 g (wet weight) of packed cells. Buffer A consisted of 50 mM K2HPO4/KH2PO4 buffer (pH 7.2) containing 400 mM K-glutamate. The pellet was thawed on ice and resuspended in 120 mL of buffer A. The cells were disrupted by sonication with a total delivered energy of 71 kJ. The cell extract was then centrifuged at 30,000 g for 20 min at 4°C to remove cellular debris and aggregated proteins. The supernatant was subjected to a 20 min heat treatment in a water bath maintained at 70°C, and immediately centrifuged a second time at 30,000 g for 20 min at 4°C. Chromatographic steps were performed at room temperature using an Äkta Purifier FPLC system (Amersham Biosciences). The 135 mL supernatant was applied at a flow rate of 2.8 mL/min onto an XK 26 × 20 column (Amersham Biosciences) containing 50 mL of Chelating Sepharose Fast Flow (Amersham Biosciences) that had previously been loaded with 200 mM NiSO4, washed with milliQ water and equilibrated with buffer A containing 50 mM imidazole. The fraction collected during the IMAC loading was shown to contain the SSO0551 protein. This 222 mL fraction was concentrated to a volume of 56 mL by means of Centricon Plus-20 filtration units (Millipore) and then dialyzed overnight at 4°C against 20 mM K2HPO4/KH2PO4 buffer (pH 7.2) containing 20 mM NaCl (buffer B). The 78 mL supernatant obtained after centrifugation at 30,000 g for 10 min at 4°C was divided and applied in two separate runs onto a 6 mL Resource-S ion-exchange column (30 mm × 16 mm, 15 μm) from Amersham Biosciences, previously equilibrated with buffer B and operated at a flow rate of 3 mL/min. After a 10 column volume wash with buffer B, proteins were resolved with a 25 column volume linear gradient from 20 to 500 mM NaCl in buffer B. Recombinant SSO0551 eluted at approximately 250 mM NaCl and was desalted by overnight dialysis against buffer B.
The resulting 20 mL protein solution was concentrated to a volume of 8 mL by means of Centricon Plus-20 filtration units (Millipore). The sample was again divided and applied in two separate runs onto Superdex 75 gel filtration medium packed into an HR 16/50 column, at a flow rate of 1.5 mL/min in 20 mM K2HPO4/KH2PO4 buffer (pH 7.2) containing 100 mM NaCl. The fractions obtained from the two runs were pooled and dialyzed overnight at 4°C against 10 mM HEPES buffer (pH 7.2). After dialysis, the pooled fraction was centrifuged at 26,000 g for 20 min at 4°C and the protein concentration was measured by spectrophotometry using a molar absorption coefficient of 19060 M-1 cm-1 at 280 nm. The purified protein was flash frozen in liquid nitrogen and stored at -80°C at a concentration of 0.48 mg/mL.
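The spectrophotometric concentration measurement relies on the Beer-Lambert law, A = ε·c·l. A minimal sketch of the conversions follows, using the molar absorption coefficient given in the text. The molecular mass of 19 kDa is the approximate value quoted elsewhere in the article, so the molar figures are approximate as well. The function names and the 1 cm default path length are assumptions.

```python
EPSILON_280 = 19060.0      # M^-1 cm^-1 at 280 nm, from the text
APPROX_MASS_DA = 19000.0   # about 19 kDa; approximate value, not an exact sequence mass


def molar_concentration(a280: float, path_cm: float = 1.0, epsilon: float = EPSILON_280) -> float:
    """Beer-Lambert law solved for concentration (mol/L)."""
    return a280 / (epsilon * path_cm)


def to_mg_per_ml(molar: float, mass_da: float = APPROX_MASS_DA) -> float:
    """mol/L times g/mol gives g/L, which equals mg/mL."""
    return molar * mass_da


def to_molar(mg_per_ml: float, mass_da: float = APPROX_MASS_DA) -> float:
    """Inverse of to_mg_per_ml."""
    return mg_per_ml / mass_da


# Stock stored at 0.48 mg/mL according to the text.
stock_molar = to_molar(0.48)
print(f"{stock_molar * 1e6:.1f} uM")
print(f"A280 expected for a 1 cm path: {EPSILON_280 * stock_molar:.3f}")
```

Under the 19 kDa assumption the stock comes out at roughly 25 μM, in line with the sample concentration reported for the gel filtration run.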
Far- and near-UV circular dichroism spectra were recorded at 20°C between 200 and 300 nm on a Jasco J-810 spectropolarimeter equipped with a Jasco PTC-424S Peltier unit, using a quartz cuvette of 1 mm path length, a scanning speed of 20 nm/min and a bandwidth of 1 nm. Three spectra of purified SSO0551 at 1.92 μM in 10 mM HEPES buffer (pH 7.2) were averaged and corrected for the buffer baseline. Experimental data were analyzed using the program K2D described by Andrade et al.
The native molecular mass of SSO0551 was estimated by gel filtration chromatography on a Superdex 200 gel packed into a HR10/30 column (Amersham Biosciences) with a final bed volume of 24 mL. The column was equilibrated at room temperature at a flow rate of 0.5 mL/min with 50 mM Tris/HCl buffer, pH 8.3, containing 50 mM NaCl and eluted with the same buffer. Protein standards used to calibrate the column were ribonuclease A (15.8 kDa), chymotrypsinogen A (21.2 kDa), ovalbumin (49.4 kDa), albumin (69.8 kDa), aldolase (191 kDa) and catalase (215 kDa), all from Amersham Biosciences. Exclusion limit was evaluated with dextran blue 2000 (Amersham Biosciences). A sample consisting of 90 μL of SSO0551 at 25.2 μM was injected and specific absorptions at 280 and 266 nm were followed.
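Calibrating a gel filtration column usually amounts to fitting log10(molecular mass) against elution volume for the standards and reading the unknown off that line. The sketch below assumes this common linear treatment. The standard masses are those listed in the text, but the elution volumes are HYPOTHETICAL placeholders, since the source does not report them, and all function names are illustrative.

```python
import math

# (name, mass in kDa from the text, elution volume in mL: HYPOTHETICAL placeholder)
STANDARDS = [
    ("ribonuclease A", 15.8, 17.9),
    ("chymotrypsinogen A", 21.2, 17.3),
    ("ovalbumin", 49.4, 15.6),
    ("albumin", 69.8, 14.9),
    ("aldolase", 191.0, 12.9),
    ("catalase", 215.0, 12.6),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(standards):
    """Fit log10(mass in kDa) as a linear function of elution volume."""
    volumes = [volume for _, _, volume in standards]
    log_masses = [math.log10(mass) for _, mass, _ in standards]
    return fit_line(volumes, log_masses)


def estimate_mass_kda(slope, intercept, elution_volume_ml):
    """Apparent mass of a sample eluting at the given volume."""
    return 10 ** (slope * elution_volume_ml + intercept)


slope, intercept = calibrate(STANDARDS)
print(f"log10(M/kDa) = {slope:.3f} * Ve + {intercept:.3f}")
```

With real elution volumes for the standards, `estimate_mass_kda(slope, intercept, Ve)` would return the apparent mass of a sample eluting at volume `Ve`.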
Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight (MALDI-TOF) mass measurements were performed using a Biflex IV instrument (Bruker Daltonics) in positive ionization mode. Protein samples and large peptide fragments (>3500 Da) were applied to the target using, as matrix, sinapinic acid prepared as a saturated solution in 30 % acetonitrile, 70 % milli-Q water and 0.1 % TFA. Samples were prepared using the dried droplet method and measured in linear mode. Small peptide samples were measured in reflectron mode using α-cyano-4-hydroxycinnamic acid in 30 % acetonitrile containing 0.1 % trifluoroacetic acid as matrix. Mass spectra were obtained by summation of 100–210 laser shots. For determination of entire protein masses, the instrument was calibrated using either a mixture of chymotrypsin and bovine serum albumin, or apomyoglobin and aldolase. For peptides, the instrument was calibrated using a pepmix calibration kit (Bruker Daltonics). When necessary, the mass spectrometer was also internally calibrated using some of the theoretical peptide masses.
For in-solution partial digestion, 0.2 nmol of pure SSO0551 was diluted into buffer D1 (20 mM TRIS/HCl, pH 7.8), buffer D2 (20 mM NH4HCO3, pH 7.8) or buffer D3 (20 mM TRIS/HCl, pH 7.8, containing 10 mM CaCl2 and 5 mM DTT). Trypsin or chymotrypsin was added to SSO0551 diluted into buffer D1, whereas Glu-C or Arg-C was added to the protein diluted into buffer D2 or D3, respectively. Several enzyme/protein ratios (1:50, 1:20 and 1:2 (w/w)) were tested for each endoprotease. The digestions were performed at room temperature and aliquots were analyzed at time points ranging from 30 sec to between 10 and 240 min. Digested samples were desalted using ZipTipC18 or ZipTipC4 pipette tips (Millipore) according to the manufacturer's protocol and their masses were directly measured by MALDI-TOF. Where needed, partially proteolyzed mixtures of larger quantities (10 nmol of SSO0551) were fractionated by reverse-phase HPLC using an Aquapore RP-300 column (PerkinElmer; 100 × 1.0 mm, 7 μm, 300 Å pore size) developed at 200 μL/min with a linear gradient from 5 to 90 % acetonitrile in 0.1 % TFA over 45 min. The elution was monitored at 220 nm with an Agilent 1100 Series HPLC system equipped with a G1315 diode array detector. Individual fractions were concentrated by evaporation in a SpeedVac (Savant) and directly analyzed by MALDI-TOF.
N-hydroxysuccinimide-biotin (NHS-biotin) was used to label ε-amino groups of SSO0551 lysines. After reaction, the biotin label is coupled to the lysine through a stable amide bond. The increase in mass for each label (C10H14N2O2S1) should be 226.293 amu if average mass is considered or 226.078 amu in monoisotopic mode. Modification of lysine residues was carried out by incubating 1.25 nmol of SSO0551 in 20 mM HEPES, pH 7.2, with various amounts of freshly prepared NHS-biotin reagent dissolved in anhydrous dimethylsulfoxide. After 30 min of incubation at room temperature, excess reagent was removed by a 30 min micro-dialysis against 20 mM HEPES, pH 7.2. Samples were directly desalted using ZipTipC4 (Millipore) prior to MALDI-TOF analysis. Where needed, they were digested overnight with an endoprotease (trypsin, GluC or ArgC) and desalted using ZipTipC18 pipette tips (Millipore) prior to mass analysis.
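Because each biotin label adds a fixed mass, the number of labeled lysines on the intact protein can be estimated from the mass shift. The sketch below uses the average-mass increment given in the text. The tolerance and the example masses are HYPOTHETICAL values chosen for illustration, and `count_labels` is not a function from any of the tools cited in the article.

```python
BIOTIN_AVG = 226.293  # amu added per label (average mass), from the text


def count_labels(labeled_mass, unlabeled_mass, label_mass=BIOTIN_AVG, tol=2.0):
    """Return the number of labels explaining the mass shift, or None if none fits.

    tol is the accepted residual in amu (an assumed value, to be adapted to
    the accuracy of the measurement).
    """
    shift = labeled_mass - unlabeled_mass
    n = round(shift / label_mass)
    residual = shift - n * label_mass
    if n < 0 or abs(residual) > tol:
        return None
    return n


# Hypothetical example: an unlabeled species at 19000.0 amu and a labeled one at 19679.4 amu.
print(count_labels(19679.4, 19000.0))
print(count_labels(19100.0, 19000.0))
```

For digested samples, the same test applies to individual peptides with the monoisotopic increment of 226.078 amu and a correspondingly tighter tolerance.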
3,3'-Dithio-bis [sulfosuccinimidyl-propionate] (DTSSP) was used to cross-link two ε-amino groups of SSO0551 lysines, essentially as described previously. The mass increase (in monoisotopic mode) for each label should be 191.991 amu (C6H8O3S2) or 87.998 amu (C3H4O1S1) when DTT treated. The increase in mass for an intramolecular cross-link between two lysines should be 173.981 amu (C6H6O2S2) or 175.997 amu (2 × C3H4O1S1) when DTT treated. Therefore, after reduction of the disulfide bridge of a cross-link by DTT, an additional increase of 2.016 amu should be measured. The reaction was carried out by incubating 0.25 nmol of SSO0551 in 20 mM NaH2PO4/Na2HPO4, pH 7.5, containing 150 mM NaCl, with various amounts of DTSSP reagent (molar ratios of 20, 35, and 50 mol of DTSSP per mol of polypeptide). After 30 min of incubation at room temperature, excess reagent was removed by a 30 min micro-dialysis against 20 mM NaH2PO4/Na2HPO4, pH 7.5, containing 150 mM NaCl. Prior to overnight trypsin proteolysis, urea (330 mM final concentration) was added to each sample. Where needed, the digested peptide mixture was reduced with 50 mM DTT for 30 minutes at 37°C to cleave the disulfide bond of the linker before being desalted using ZipTipC18 pipette tips (Millipore).
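The mass increments quoted for DTSSP follow directly from the elemental formulas. As a check, the sketch below recomputes them from standard monoisotopic atomic masses. The small formula parser is an illustrative helper that handles only simple formulas without parentheses, and it is not the web calculator used by the authors.

```python
import re

# Standard monoisotopic masses of the most abundant isotopes (amu).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207117,
}


def monoisotopic_mass(formula: str) -> float:
    """Sum atomic masses for a simple formula such as 'C6H8O3S2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


single_label = monoisotopic_mass("C6H8O3S2")        # one lysine modified, linker intact
single_label_reduced = monoisotopic_mass("C3H4OS")  # same label after DTT treatment
cross_link = monoisotopic_mass("C6H6O2S2")          # intramolecular Lys-Lys cross-link
cross_link_reduced = 2 * monoisotopic_mass("C3H4OS")

for name, value in [
    ("single label", single_label),
    ("single label, reduced", single_label_reduced),
    ("cross-link", cross_link),
    ("cross-link, reduced", cross_link_reduced),
    ("gain on reducing a cross-link", cross_link_reduced - cross_link),
]:
    print(f"{name}: {value:.3f}")
```

The gain on reduction corresponds to the two hydrogen atoms added when the disulfide bridge is opened, which is why it serves as a signature of a genuine cross-link.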
Sequence searching was performed using PSI-BLAST with default parameters. Multiple sequence alignments were performed using the VectorNTI software package (Informax Inc). Secondary structure predictions were obtained through the PSIPRED v2.4 web interface described by McGuffin et al. The molar absorption coefficient at 280 nm for SSO0551 was calculated from the amino acid composition of the recombinant protein [40, 41]. Isotopic and average masses of both the DTSSP cross-linker and NHS-biotin were calculated using a web-based molecular weight calculator. Peptide assignment and a first attempt at identifying the labeled and cross-linked products were performed using the FindMod package at ExPASy. If no match was found, a more detailed search for multiple labels or combinatorial cross-linkable peptide pairs was carried out. Partially proteolyzed products were assigned using the FindPept tool. Tertiary structure predictions were carried out using publicly available online services, including 3D-PSSM, FUGUE and PSIPRED. Ab initio modeling was performed using the ROBETTA server [34, 47]. Each model was analyzed in terms of proteolytic sensitivity using the NICKPRED software [35, 48, 49]. Residue accessibility was calculated using a modified version of Connolly's MS program (Pellequer JL, unpublished results). Structural homologs were searched for using the DALI web server from the European Bioinformatics Institute. Model views were obtained with the MOLSCRIPT program and rendered using RASTER3D.
amu: atomic mass unit
COG: Cluster of Orthologous Group
HPLC: high performance liquid chromatography
IMAC: immobilized metal ion adsorption chromatography
MALDI-TOF: Matrix-assisted Laser Desorption/Ionization Time-of-Flight
PSI-BLAST: Position-Specific Iterated Blast
We gratefully acknowledge Yvan Zivanovic (CNRS-IGM, Orsay, France) for the kind gift of S. solfataricus total genomic DNA and Patrick Forterre (Université d'Orsay, Orsay, France) for initial discussions on the interest of characterizing the SSO0551 protein. We thank our enthusiastic technical assistants (CEA-VALRHO): Valérie Chaumont for performing the cloning and overexpression experiments, Charles Marchetti for operating the fermenter facilities, Bernard Fernandez for assistance with chromatography and recording circular dichroism signals, Isabelle Dany for initial fingerprint mass characterization of overproduced SSO0551, and Pascale Richard for technical support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.