Structural analysis of heme proteins: implications for design and prediction
© Li et al; licensee BioMed Central Ltd. 2011
Received: 26 October 2010
Accepted: 3 March 2011
Published: 3 March 2011
Heme is an essential molecule and plays vital roles in many biological processes. The structural determination of a large number of heme proteins has made it possible to study the detailed chemical and structural properties of the heme binding environment. Knowledge of these characteristics can provide valuable guidelines for the design of novel heme proteins and help us predict unknown heme binding proteins.
In this paper, we constructed a non-redundant dataset of 125 heme-binding protein chains and found that these heme proteins encompass at least 31 different structural folds with all-α class as the dominating scaffold. Heme binding pockets are enriched in aromatic and non-polar amino acids with fewer charged residues. The differences between apo and holo forms of heme proteins in terms of the structure and the binding pockets have been investigated. In most cases the proteins undergo small conformational changes upon heme binding. We also examined the CP (cysteine-proline) heme regulatory motifs and demonstrated that the conserved dipeptide has structural implications in protein-heme interactions.
Our analysis revealed that heme binding pockets show special features and that most of the heme proteins undergo small conformational changes after heme binding, suggesting the apo structures can be used for structure-based heme protein prediction and as scaffolds for future heme protein design.
This year marks the 50th anniversary of the publication of the very first two protein structures, myoglobin and hemoglobin, two prototype heme proteins involved in oxygen storage and transport [1, 2]. Heme proteins, or hemoproteins, are a group of proteins carrying heme as the prosthetic group. Heme proteins are ubiquitous in biological systems and exhibit diverse biological activities. These include the classical functions of diatomic gas transportation/storage and electron transfer as exemplified by myoglobin, hemoglobin and cytochrome c [3, 4]. More recent studies continue to reveal more pleiotropic roles of heme proteins in transcriptional regulation [5, 6], ion channel chemosensing, circadian clock control, and microRNA processing.
The identification of human Rev-erb nuclear receptors as heme sensing transcription factors represents an important addition to the heme protein family [10, 11]. Rev-erbα (NR1D1) and Rev-erbβ (NR1D2) have been implicated in the regulation of circadian rhythms, lipid and glucose metabolism, and diseases [12–15]. They were previously categorized as orphan receptors with no known physiological ligand. Computational modeling and X-ray crystallography of the ligand binding domain (LBD) of Rev-erbs provided incentives for proposing heme as the bona fide ligand. However, the proposal was largely based on the homology between the Rev-erb LBD and that of a known heme sensing protein, E75, a Drosophila nuclear receptor, and the authenticity of heme as a ligand remained elusive at the time due to the lack of unified information on heme binding sites and heme-protein interaction. Detailed analysis and prediction were therefore not possible. Yet the Rev-erb story prompted us to ask: can we predict heme proteins? The worldwide structural genomics projects have produced a large number of new structures with unknown functions or annotated as hypothetical proteins [16, 17]. Owing to the ubiquitous and essential nature of heme in life, we hypothesize that some "orphan" structures in the Protein Data Bank (PDB) are heme proteins.
As a first step towards a long term goal to develop methodologies for predicting and designing novel heme proteins, a field of interest with great potential in medicine and green energy [27, 30], we set out to investigate the common characteristics of heme binding sites and the conformational differences between apo (without heme) and holo heme proteins, aiming at consolidating and synthesizing a large body of experimental data and extracting useful information and novel integrative insights.
We consider two key questions crucial to the structure-function paradigm of heme proteins. The first concerns the structural implications of the heme-interactive sequence motifs. CXXCH represents the classic type-c heme binding motif in which the two vinyl groups of heme form covalent bonds with two cysteine residues in proteins [27, 28]. Recently, a heme regulatory motif CP (for cysteine-proline dipeptide) has received increasing attention [31–35]. To date, however, the functional importance of this CP heme sensing or regulatory motif has been studied only through mutational experiments on a limited number of proteins. It is still not clear from a structural point of view how the CP motif is involved in the regulation of heme binding, in contrast to what has been established for the CXXCH heme c motif.
The second question concerns the structural environment or the physicochemical features of the heme binding pockets. Of particular importance is the conformational difference between the apo and holo forms of heme proteins since, in most cases, only apo structures will be available for prediction. Even though the global and local conformational changes induced by ligand binding in general have been surveyed by a number of studies [36–39], such systematic studies on heme proteins have not been reported. In this study, we compiled a non-redundant dataset of apo-holo pairs to examine the conformational and pocket changes in heme proteins after heme binding.
The diversity and conservation of interactions between heme and proteins have been analyzed previously by Schneider et al. However, they used a redundant dataset with 68 type-b heme proteins (based on 60% sequence identity cutoff) due largely to the limited availability of heme protein structures [27, 40]. A very recent study performed analysis on a smaller dataset of 34 heme proteins, each of which represents one CATH homologous family or a SCOP family. There are seven different heme groups in the 34 heme proteins with heme b and heme c as the dominant forms. Here we performed structural analysis on a larger, non-redundant dataset of heme proteins containing heme b and/or c types. Heme proteins are found in at least 31 different structural folds in all the four major classes based on SCOP classifications, attesting to the diversity and complexity of heme-protein interactions. The heme binding pockets are enriched in aromatic amino acids and relatively depleted with respect to the charged residues, glutamic acid, aspartic acid, and lysine. We also found that the CP motif has structural implications in heme-protein interactions.
Two non-redundant datasets were generated in this study. The first dataset, containing 125 heme-binding protein chains, was used for analysis of the heme binding environment. This set was culled from protein structures in the Protein Data Bank (PDB, November 24, 2009) with HEM (for heme b) or HEC (for heme c) as ligands with the following criteria: experimental method = X-ray crystallography, maximum resolution = 3 Å, and maximum R-value = 0.3. The protein chains that interact with heme molecules (described in the next section, "Analysis of heme interacting residues") were selected, and a non-redundant set of 125 heme-binding protein chains was generated using PISCES with a sequence identity cutoff of 25% (Additional file 1, Table S1). The second dataset has 5596 protein chains in which each pair of protein chains has less than 25% sequence identity and each structure has a resolution of 2.5 Å or better and an R-factor of 0.3 or better. This set was used for calculating background frequencies of amino acids, secondary structure types, and relative solvent accessibility. The sequences for the protein chains derived from the PDB "SEQRES" records may have cloning and expression artifacts such as His-tags at the N- or C-terminus, and some of the protein chains have missing residues [44, 45]. To avoid such artifacts and incomplete sequences, the amino acid frequencies were calculated using the full-length protein sequences through mapping PDB chains to Uniprot entries with PDBSWS.
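The selection criteria for the first dataset can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Entry` record, its field names, and the method label string are hypothetical stand-ins for whatever metadata a PDB query returns, and the PISCES redundancy-removal step at 25% sequence identity is not reproduced here.

```python
from dataclasses import dataclass

# Ligand codes named in the text: HEM (heme b) and HEC (heme c).
HEME_LIGANDS = frozenset({"HEM", "HEC"})


@dataclass(frozen=True)
class Entry:
    """Hypothetical per-structure metadata record (not a real PDB API type)."""
    pdb_id: str
    method: str            # experimental method label (assumed spelling)
    resolution: float      # in Å; smaller is better
    r_value: float
    ligands: frozenset     # three-letter ligand codes present in the entry


def passes_heme_filters(entry, max_resolution=3.0, max_r_value=0.3):
    """Apply the stated criteria: X-ray structure, resolution and R-value
    within the cutoffs, and at least one HEM or HEC ligand."""
    return (
        entry.method == "X-RAY DIFFRACTION"
        and entry.resolution <= max_resolution
        and entry.r_value <= max_r_value
        and bool(HEME_LIGANDS & entry.ligands)
    )
```

The same function could be reused for the background set by passing `max_resolution=2.5` and dropping the ligand condition, which is left out here to keep the sketch short.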
A residue is considered a heme axial ligand if the distance between the nitrogen, sulfur or oxygen of the residue and the heme iron is within 3 Å. Residues having heavy atoms within 4.5 Å of any non-hydrogen atoms of the heme molecule are identified as heme interacting amino acids. A protein chain is considered heme binding if it has residue(s) as axial ligand(s) to the heme iron or has at least ten residue interactions with the heme molecule. DSSP was used to assign each residue to one of three secondary structure states: helix, strand, and coil. Following the widely used convention, H (α-helix), G (3₁₀-helix) and I (π-helix) from DSSP are classified as helix type, while E (extended strand) and B (residue in isolated bridge) states are classified as strand type. All the other states from DSSP are considered coils. The relative solvent accessibility was calculated by dividing the absolute exposed area from DSSP by the maximum accessibility of each residue. We employ a three-state classification for relative solvent accessibility: buried (≤7%), intermediate (>7% and ≤37%), and exposed (>37%), as described previously.
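The distance criteria and the three-state classifications described above can be written down as a minimal sketch. The data layout is an assumption made for illustration: each residue is a list of `(element, (x, y, z))` heavy atoms, the heme is given as its iron position plus a list of heavy-atom positions, and the cutoffs are treated as inclusive. Maximum accessibility values are not listed in the text, so they are passed in by the caller.

```python
import math

AXIAL_CUTOFF = 3.0        # Å, ligand atom (N, O, S) to heme iron
CONTACT_CUTOFF = 4.5      # Å, any residue heavy atom to any heme heavy atom
MIN_CONTACT_RESIDUES = 10
LIGAND_ELEMENTS = {"N", "O", "S"}
HELIX_CODES = {"H", "G", "I"}
STRAND_CODES = {"E", "B"}


def is_axial_ligand(residue_atoms, iron_xyz):
    """True if any N, O or S atom of the residue lies within 3 Å of the iron."""
    return any(
        element in LIGAND_ELEMENTS and math.dist(xyz, iron_xyz) <= AXIAL_CUTOFF
        for element, xyz in residue_atoms
    )


def interacts_with_heme(residue_atoms, heme_xyzs):
    """True if any residue heavy atom is within 4.5 Å of any heme heavy atom."""
    return any(
        math.dist(xyz, heme_xyz) <= CONTACT_CUTOFF
        for _, xyz in residue_atoms
        for heme_xyz in heme_xyzs
    )


def is_heme_binding_chain(residues, iron_xyz, heme_xyzs):
    """A chain binds heme if it supplies an axial ligand or at least ten
    interacting residues. `residues` maps a residue id to its atom list."""
    has_axial = any(is_axial_ligand(a, iron_xyz) for a in residues.values())
    n_contacts = sum(interacts_with_heme(a, heme_xyzs) for a in residues.values())
    return has_axial or n_contacts >= MIN_CONTACT_RESIDUES


def three_state(dssp_code):
    """Collapse a DSSP state into helix, strand or coil."""
    if dssp_code in HELIX_CODES:
        return "helix"
    if dssp_code in STRAND_CODES:
        return "strand"
    return "coil"


def relative_accessibility(exposed_area, max_area):
    """Relative solvent accessibility in percent."""
    return 100.0 * exposed_area / max_area


def rsa_class(rsa_percent):
    """Buried (<=7%), intermediate (>7% and <=37%), exposed (>37%)."""
    if rsa_percent <= 7:
        return "buried"
    if rsa_percent <= 37:
        return "intermediate"
    return "exposed"
```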
To maximize the number of possible apo-holo heme protein pairs, each of the heme protein chains was first compared with all the non-heme protein chains derived from the PISCES pdbaaent file using BLAST. There are a number of ligands in the PDB that are similar to heme b or c, so structures with these heme-like ligands are not considered apo proteins for our apo-holo comparisons. Based on a HIC-Up keyword search using heme and porphyrin and a SuperLigands ligand structure similarity search, we identified 55 heme-like ligands in the PDB (Additional file 1, Table S2). The highly similar apo-holo heme protein pairs (cutoffs set at 90% sequence identity and 95% sequence alignment overlap) were then culled to generate a list of 15 non-redundant apo-holo pairs using PISCES with a sequence identity cutoff of 25%. Five of the 15 apo proteins that contain other non-heme ligands in the heme-binding pockets were removed from the list as they are not truly "apo" forms with respect to the heme binding sites. The structural differences were evaluated with two structure alignment programs, FAST and CE. The similarity/difference between two structures is measured by the RMSD (root mean square deviation) of the Cα atoms of aligned residues. The pocket/cavity was predicted using the CASTp server (Computed Atlas of Surface Topography of proteins). To compare the shapes of the pockets, we used Rvs, the ratio between the pocket volume and the pocket surface area.
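The two comparison measures, Cα RMSD and Rvs, reduce to short formulas. The sketch below assumes the two coordinate lists are already superposed and paired residue by residue (the alignment and superposition done by FAST or CE is not reproduced), and the function names are illustrative.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD over paired Cα coordinates that are already superposed.

    coords_a and coords_b are equal-length sequences of (x, y, z) tuples,
    one pair per aligned residue.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty, equal-length coordinate lists")
    squared_sum = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared_sum / len(coords_a))


def rvs(pocket_volume, pocket_surface_area):
    """Ratio of pocket volume (Å^3) to pocket surface area (Å^2), in Å."""
    return pocket_volume / pocket_surface_area
```

Because Rvs divides a volume by an area, it has units of length, so pockets of very different absolute size can still be compared on one scale.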
Table 1. SCOP fold classes of the 125 heme binding protein chains in the non-redundant dataset
Columns: # of chains, SCOP fold name
SCOP fold names:
Nuclear receptor ligand-binding domain
Four-helical up-and-down bundle
Indolic compounds 2,3-dioxygenase-like
GST C-terminal domain-like
Common fold of diphtheria toxin/transcription factors...
Tryptophan synthase β subunit-like PLP-dependent enzymes
Cytochrome b5-like heme/steroid binding domain
Nitric oxide (NO) synthase oxygenase domain
Ligand-binding domain in NO signalling & Golgi transport
Heme iron utilization protein-like
Heme-binding four-helical bundle
Cytochrome c oxidase subunit I-like
The conserved interactions between protein residues and heme were previously studied by calculating either the frequencies of residues that are in van der Waals contact with heme for each fold class of b-type heme proteins or the mean number per binding site. Smith et al. also applied normalized amino acid profiles to assess the composition and conservation of heme binding sites. Here we explored the residue preferences in the heme binding pockets by calculating the relative frequencies of heme binding residues in our non-redundant dataset. The relative frequency of each amino acid is normalized to its background frequency.
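One way to express this normalization is as a simple ratio of the pocket frequency to the background frequency. The text does not give the exact formula, so the ratio form below is an assumption, and the function names are illustrative.

```python
from collections import Counter


def frequencies(residues):
    """Fraction of each one-letter amino acid code in a sequence or list."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def relative_frequencies(pocket_residues, background_residues):
    """Pocket frequency divided by background frequency for every amino acid
    present in the background (assumed normalization).

    Values above 1 indicate enrichment in the heme pocket, values below 1
    indicate depletion. Amino acids absent from the background are skipped
    because the ratio is undefined for them.
    """
    pocket = frequencies(pocket_residues)
    background = frequencies(background_residues)
    return {aa: pocket.get(aa, 0.0) / f_bg for aa, f_bg in background.items()}
```

For instance, a residue type making up half of the pocket contacts but only one eighth of the background sequences would score 4.0, while one never seen in a pocket scores 0.0.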
Normally, the background frequencies used for comparisons are calculated from a non-redundant protein dataset. However, due to the dominant presence of all-α folds, it is not clear whether the residue distribution in heme proteins is different from that in other proteins. Therefore we first compared the residue distributions between non-redundant heme proteins and non-redundant all proteins. To avoid issues with missing residues and cloning artifacts (His-tags etc.) associated with PDB sequences, we used native full-length protein sequences to calculate residue compositions by mapping the PDB chains to Uniprot entries with PDBSWS. The relative residue frequencies between heme proteins and all proteins show that heme proteins tend to contain more alanine, phenylalanine, histidine, methionine, and tryptophan residues and fewer cysteine, aspartic acid, isoleucine, lysine, asparagine, and serine residues (Additional file 2, Figure S1). Statistical analysis (χ²) revealed a significant difference between these two frequency profiles (data not shown). In order to have a meaningful description of the enrichment or deficiency of residues in the heme interacting environment, we used the background frequencies from the non-redundant set of heme proteins as references.
Consistent with earlier reports, the aromatic residues (phenylalanine, tyrosine, and tryptophan) play important roles in protein-heme interactions through stacking interactions with the porphyrin [27, 41]. One exception is tryptophan in heme c proteins, which showed a similar level of occurrences compared to the background (Figure 4A). Leucine, isoleucine, and valine, which make hydrophobic interactions with the heme ring structure, are slightly increased over the background frequencies. The residues with the fewest occurrences, aspartic acid, glutamic acid, and lysine, are charged residues, suggesting the heme binding pocket is mainly a hydrophobic environment. In contrast, arginine, a positively charged residue that has been considered a major player in anchoring the heme propionates, has a much higher occurrence than other charged amino acids and shows a similar (HEM) or slightly higher (HEC) level of frequency to the background (Figure 4A).
Another motif worthy of note, GX[HR]XC[PLAV]G, comes from the heme b proteins with cysteine as axial ligands (Figure 6B). The motif represents the classic CYP signature heme binding motif FXXGXXCXG in bacterial, plant, and mammalian cytochrome P450s [59–61]. At the -4 and +2 positions (with the ligand cysteine as reference position) are small amino acids (glycine), while the -2 position prefers a positively charged amino acid such as histidine or arginine. These positively charged residues interact electrostatically with the negatively charged heme propionates (Figure 6C and 6D). The small glycine residue at the -4 position may provide the flexibility needed for positioning the positively charged residues close to the heme propionate groups. The +1 position is dominated by proline and hydrophobic amino acids: leucine, alanine, valine and isoleucine. Six of the eighteen cases have proline right after the axial ligand cysteine, reminiscent of the dipeptide CP motif being implicated in heme sensing and regulation [31–35, 62]. While the importance of the CP motif has been studied through deletion or site-directed mutation experiments in several important proteins, including transcription repressor Bach1, iron regulatory protein 2 (IRP2), circadian factor period 2 (Per2) and δ-aminolevulinic acid synthase (ALAS), the possible role of the CP motif in heme interaction from a structural point of view remains unclear as the structures for most of these proteins with such CP motifs are unknown.
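The motif can be scanned for in a sequence with a regular expression. This is a minimal sketch: the pattern is a direct transcription of the motif as written, with X read as any residue, and the function name and the example sequence used to exercise it are made up for illustration.

```python
import re

# G at -4, [HR] at -2, ligand C at 0, [PLAV] at +1, G at +2.
# The lookahead lets overlapping occurrences be reported as well.
CYS_MOTIF = re.compile(r"(?=G.[HR].C[PLAV]G)")


def find_cys_motif(sequence):
    """Return 0-based indices of the candidate axial cysteine for every
    occurrence of the motif in a one-letter amino acid sequence."""
    # The cysteine sits four positions after the leading glycine.
    return [m.start() + 4 for m in CYS_MOTIF.finditer(sequence)]
```

A sequence match only flags a candidate; whether the cysteine actually coordinates the heme iron still has to be checked against the structure.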
CP dipeptides have also been implicated in indirect interaction with heme. Ragsdale and colleagues reported a novel role for CP motifs in heme oxygenase 2 (HMOX-2) as a thiol/disulfide redox switch located outside the heme-binding pocket [62, 64, 65], which regulates heme-protein interaction by sensing the redox status of the environment. There are a total of twenty-nine CP dipeptides in our dataset. Fewer than a quarter of them (in 7 protein chains including 2PBJA) show physical interactions with heme molecules. It would be impractical at this point to predict the functional role of the remaining CP dipeptides in heme-protein interaction, mainly due to the limited sample size and the lack of structural details on heme pocket-CP interaction. Here we made use of statistical analysis to indirectly assess the functional relevance of CP dipeptides in heme interaction. The rationale behind the analysis is that, if CP dipeptides are important signatures for heme interaction, their occurrence in hemoproteins should be higher than in a control population. We found no statistically significant difference between the presence of CP dipeptides in heme proteins and non-heme proteins (data not shown), suggesting that other, yet to be identified factors may help determine the role CP dipeptides play in heme binding. It should be noted that we do not exclude the possibility that the control sample contains unknown hemoproteins; however, for them to significantly affect the frequency of CP signals, a considerably large fraction of the control proteins would have to be heme-interacting, which we consider unlikely.
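A comparison of this kind can be sketched by counting chains with and without a CP dipeptide in each group and computing a chi-square statistic for the resulting 2×2 table. The text does not name the test that was used, so the Pearson chi-square (without continuity correction) here is an assumption, as are the function names.

```python
def has_cp(sequence):
    """True if the one-letter sequence contains a Cys-Pro dipeptide."""
    return "CP" in sequence


def cp_table(heme_seqs, control_seqs):
    """2x2 counts: (heme with CP, heme without, control with, control without)."""
    a = sum(has_cp(s) for s in heme_seqs)
    c = sum(has_cp(s) for s in control_seqs)
    return a, len(heme_seqs) - a, c, len(control_seqs) - c


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    With one degree of freedom, values above about 3.84 correspond to
    p < 0.05. Raises ValueError if any row or column total is zero.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator
```

When expected counts are small, an exact test would be the safer choice; the chi-square form is shown only because it fits in a few lines.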
Table 2. Comparisons between apo and holo heme protein structures (columns include pocket surface area, Å²)
It should be noted that the above comparisons are based on heme proteins that have stable apo structures solved through X-ray crystallography. For some proteins, as in the case of hemoglobin, the absence of ligand(s) can increase the flexibility and cause partial unfolding of the protein structure, making structure determination difficult [70, 71]. Furthermore, intrinsically disordered or unstructured regions are considered to be responsible for many important cellular functions such as ligand binding [72, 73]. However, the existence of such flexible apo structures would not interfere with our goal in structure-based heme protein prediction as we aim to take the existing apo structures in the PDB as inputs.
Other features useful for comparing apo-holo heme proteins are the pocket size and shape. Due to different heme binding modes (partially exposed or fully embedded, Additional file 2, Figure S3) and the difficulty in identifying the exact heme binding pocket with existing automatic programs, the sizes of heme binding pockets vary from small (~400 Å³) to very large (over 2000 Å³) (Table 2). In addition, the changes in absolute pocket volumes after heme binding are variable. Small changes are seen in 2ITFA-2ITEB, 2R7AA-2RG7D, and 2ZDOA-1XBWD. Other pairs exhibited significant changes in volume despite the minimal conformational change (Table 2). To take the shape into consideration we calculated the Rvs value (the ratio of the pocket volume to the pocket surface area) of each pocket. Most of the apo or holo proteins have Rvs values around 1.4. To further investigate whether the binding pocket can be used as one of the characteristics for heme protein prediction, we compared the Rvs distributions between heme binding pockets and pockets in non-heme proteins (proteins that do not have heme ligand(s) and are not homologous to heme proteins) with similar sizes ranging from 350 to 2000 Å³. The Rvs of heme binding pockets has a narrow distribution, whereas the Rvs of similarly sized pockets in non-heme proteins has a wide spread with a long right tail (Additional file 2, Figure S4-A). We also investigated the distribution of Rvs normalized to a sphere shape as introduced by Sonavane and Chakrabarti. A similar trend was found (Additional file 2, Figure S4-B). It should be pointed out that, even though unknown heme proteins may be included in the non-heme dataset, many non-heme proteins share similar pocket characteristics.
In this study, we surveyed the known heme protein structures for the purpose of structure-based heme protein prediction and novel heme protein design. We first compiled a non-redundant dataset of 125 heme (type b and c) binding protein chains that encompass a large number of protein structural folds, reflecting the diversified roles of heme proteins. Structural analysis revealed that the residues interacting with heme are mainly non-polar, especially aromatic amino acids, providing a hydrophobic environment for the heme ring structure. We also investigated the possible structural roles of CP motifs that are implicated in the regulation of heme binding and have received much attention recently. While the CP dipeptide is not as strong a signature for heme binding as the classic CXXCH heme c binding motif, the proline in the heme-interacting CP dipeptides assumes important structural roles when CP is conserved and the cysteine functions as an axial ligand to the heme iron. Indirect interaction between CP motifs and heme binding has also been reported in the HMOX-2 protein, in which CP dipeptides form a thiol/disulfide redox switch away from the heme binding pocket [62, 64], suggesting the heterogeneity of CP-heme interactions.
Comparisons between the apo and holo heme proteins indicate that most of the heme proteins undergo small conformational changes after heme binding, suggesting the apo structure can be used for structure-based heme protein prediction and as a scaffold for heme protein design. In addition, our analysis of the heme binding pockets showed that, despite the different sizes, the Rvs values of heme binding pockets are confined to a narrow range, whereas the values from non-heme binding proteins spread over a large range. We will apply the results from this study to investigate whether any of the hypothetical proteins in the PDB are potential heme proteins through computational prediction and experimental validation in the near future.
LBD: ligand binding domain
SCOP: structural classification of proteins
PDB: protein data bank
RMSD: root mean square deviation
Rvs: ratio of volume over area.
The authors thank Dr. Dennis Livesay and Dr. Laura Schrum for comments on this manuscript. This research was partly supported by the NSF CAREER grant (DBI#0844749) to JTG, the NIH 5R01DK038825 to HLB, and the CMC-UNC Charlotte Collaborative Grants Program (09-002) to TL and JTG.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.