Structural analysis of heme proteins: implications for design and prediction

Background Heme is an essential molecule and plays vital roles in many biological processes. The structural determination of a large number of heme proteins has made it possible to study the detailed chemical and structural properties of heme binding environment. Knowledge of these characteristics can provide valuable guidelines in the design of novel heme proteins and help us predict unknown heme binding proteins. Results In this paper, we constructed a non-redundant dataset of 125 heme-binding protein chains and found that these heme proteins encompass at least 31 different structural folds with all-α class as the dominating scaffold. Heme binding pockets are enriched in aromatic and non-polar amino acids with fewer charged residues. The differences between apo and holo forms of heme proteins in terms of the structure and the binding pockets have been investigated. In most cases the proteins undergo small conformational changes upon heme binding. We also examined the CP (cysteine-proline) heme regulatory motifs and demonstrated that the conserved dipeptide has structural implications in protein-heme interactions. Conclusions Our analysis revealed that heme binding pockets show special features and that most of the heme proteins undergo small conformational changes after heme binding, suggesting the apo structures can be used for structure-based heme protein prediction and as scaffolds for future heme protein design.


Background
This year marks the 50 th anniversary of the publication of the very first two protein structures, myoglobin and hemoglobin, two prototype heme proteins involved in oxygen storage and transport [1,2]. Heme proteins, or hemoproteins, are a group of proteins carrying heme as the prosthetic group. Heme proteins are ubiquitous in biological systems and exhibit diverse biological activities. These include the classical functions of diatomic gas transportation/storage and electron transfer as exemplified by myoglobin, hemoglobin and cytochrome c [3,4]. More recent studies continue to reveal more pleiotropic roles of heme proteins in transcriptional regulation [5,6], ion channel chemosensing [7], circadian clock control [8], and microRNA processing [9].
The identification of human Rev-erb nuclear receptors as heme sensing transcription factors represents an important addition to the heme protein family [10,11].
Rev-erbα (NR1D1) and Rev-erbβ (NR1D2) have been implicated in the regulation of circadian rhythms, lipid and glucose metabolism, and diseases [12][13][14][15]. They were previously categorized as orphan receptors with no known physiological ligand. Computational modeling and X-ray crystallization of the ligand binding domain (LBD) of Rev-erbs provided incentives for proposing heme as the bona fide ligand. However, the proposal was largely based on the homology between Rev-erb LBD and that of a known heme sensing protein E75, a Drosophila nuclear receptor; and the authenticity of heme as a ligand remained elusive at the time due to the lack of unified information on heme binding sites and heme-protein interaction. Therefore detailed analysis and prediction were not possible. Yet the Rev-erb story prompted us to ask: can we predict heme proteins? The worldwide structural genomics projects have produced a large number of new structures with unknown functions or annotated as hypothetical proteins [16,17]. Owing to the ubiquitous and essential nature of heme in life, we hypothesize that some "orphan" structures in Protein Data Bank (PDB) [18] are heme proteins.
To date, structure-based protein function prediction remains a major challenge in structural bioinformatics. Rational design of heme proteins represents another attractive research front for its potential in the development of advanced biocatalysts and therapeutics [19][20][21][22][23][24][25]. Regardless of the purposes, a thorough understanding of protein-ligand interaction is essential. The interactions between heme and its host proteins are complicated. Heme as a prosthetic group can exist in different forms. Among the known forms, heme b and heme c represent the most common types of heme groups associated with proteins [26]. Heme b binds to proteins noncovalently while heme c forms covalent bonds between the heme vinyl groups and two cysteine residues of proteins ( Figure 1). Previous studies suggested that the functional versatility of heme proteins is delivered not only by the variability of the heme molecules but also the diverse micro-environment of the proteins, the nature of the axial ligands to iron, and the relative solvent accessibility of heme [27,28]. Heme proteins encompass diverse protein fold structures, among which is the well-known globin fold. However as probably the result of convergent evolution, analogous fold structures do not always warrant successful functional inference. For example, the N-terminal domain of RsbR, a protein involved in environmental stress signaling, assumes a globin-fold structure but does not bind to heme [29], highlighting the complexity of heme proteins and the need for detailed analysis of the heme binding surroundings.
As a first step towards a long term goal to develop methodologies for predicting and designing novel heme proteins, a field of interest with great potential in medicine and green energy [27,30], we set out to investigate the common characteristics of heme binding sites and the conformational differences between apo (without heme) and holo heme proteins, aiming at consolidating and synthesizing a large body of experimental data and extracting useful information and novel integrative insights.
We take into consideration two key questions crucial to the structure-function paradigm of heme proteins. The first concerns the structural implications of the heme-interactive sequence motifs. CXXCH represents the classic type-c heme binding motif in which the two vinyl groups of heme form covalent bonds with two cysteine residues in proteins [27,28]. Recently, a heme regulatory motif CP (for cysteine-proline dipeptide) has received increasing attention [31][32][33][34][35]. But up to the present the functional importance of this CP heme sensing or regulatory motif has been studied only through mutational experiments on a limited number of proteins. It is still not clear from a structural point of view how the CP motif is involved in regulation of heme binding as has been established for the CXXCH heme c motif.
The second question concerns the structural environment or the physiochemical features of the heme binding pockets. Of particular importance is the conformational difference between the apo and holo forms of heme proteins since, in most cases, only apo structures will be available for prediction. Even though the global and local conformational changes induced by ligand binding in general have been surveyed by a number of studies [36][37][38][39], such systematic studies on heme proteins have not been reported. In this study, we compiled a non-redundant dataset of apo-holo pairs to examine the conformational and pocket changes in heme proteins after heme binding.
The diversity and conservation of interactions between heme and proteins have been analyzed previously by Schneider et al. [27]. However they used a redundant dataset with 68 type-b heme proteins (based on 60% sequence identity cutoff) due largely to the limited availability of heme protein structures [27,40]. A very recent study performed analysis on a smaller dataset of 34 heme proteins, each of which represents one CATH homologous family or a SCOP family [41]. There are seven different heme groups in the 34 heme proteins with heme b and heme c as the dominant forms [41]. Here we performed structural analysis on a larger, non-redundant dataset of heme proteins containing heme b and/or c types. Heme proteins are found in at least 31 different structural folds in all the four major classes based on SCOP classifications [42], attesting to the diversity and complexity of hemeprotein interactions. The heme binding pockets are enriched in aromatic amino acids and relatively depleted with respect to the charged residues, glutamic acid, aspartic acid, and lysine. We also found that the CP motif has structural implications in heme-protein interactions.

Datasets
Two non-redundant datasets were generated in this study. The first dataset, containing 125 heme-binding protein chains, was used for analysis of heme binding environment. This set was culled from protein structures in the Protein Data Bank (PDB, November 24, 2009) [18] with HEM (for heme b) or HEC (for heme c) as ligands with the following criteria: experimental method = X-ray crystallography, maximum resolution = 3 Å, and maximum R-value = 0.3. The protein chains that interact with heme molecules (described in next section "Analysis of heme interacting residues") were selected, and a non-redundant set of 125 heme-binding protein chains was generated using PISCES [43] with a sequence identity cutoff of 25% (Additional file 1, Table  S1). The second dataset has 5596 protein chains in which each pair of protein chains has less than 25% sequence identity and each structure has a resolution of 2.5 Å or better and an R-factor of 0.3 or better. This set was used for calculating background frequencies of amino acids, secondary structure types, and relative solvent accessibility. The sequences for the protein chains derived from the PDB "SEQRES" records may have cloning and expression artifacts such as His-tags at the N-or C-terminus and some of the protein chains have missing residues [44,45]. To avoid such artifacts and incomplete sequences, the amino acid frequencies were calculated using the full-length protein sequences through mapping PDB chains to Uniprot entries with PDBSWS [46].

Analysis of heme interacting residues
A residue is considered as a heme axial ligand if the distance between the nitrogen, sulfur or oxygen of the residue and the heme iron is within 3 Å. Residues having heavy atoms within 4.5 Å of any non-hydrogen atoms of the heme molecule are identified as heme interacting amino acids. A protein chain is considered as heme binding if it has residue(s) as axial ligand(s) to the heme iron or has at least ten residue interactions with the heme molecule. DSSP was used to assign each residue to one of three secondary structure states, helix, strand, and coil [47]. Following the widely used convention, H (α-helix), G (3 10 -helix) and I (π-helix) from DSSP are classified as helix type while E (extended strand) and B (residue in isolated-bridge) states are classified as strand type. All the other states from DSSP are considered as coils. The relative solvent accessibility was calculated by dividing the absolute value of exposed area from DSSP over the maximum accessibility of each residue [48]. We employ a threestate classification for relative solvent accessibility: buried (≤7%), intermediate (>7% and ≤37%), and exposed (>37%), as described previously [49].

Structural comparisons between apo and holo heme proteins
To maximize the number of possible apo-holo heme protein pairs, each of the heme protein chains was first compared with all the non-heme protein chains derived from PISCES pdbaaent file using BLAST [50]. There are a number of ligands that are similar to heme b or c in PDB, so structures with these heme-like ligands are not considered as apo proteins for our apo-holo comparisons. Based on HIC-Up keyword search using heme and porphyrin [51] and SuperLigands ligand structure similarity search [52], we identified 55 heme-like ligands in PDB (Additional file 1, Table S2). The highly similar apo-holo heme protein pairs (cutoffs set at 90% sequence identity and 95% sequence alignment overlap) were then culled to generate a list of 15 non-redundant apo-holo pairs using PISCES with a sequence identity cutoff of 25% [43]. Five of the 15 apo proteins that contain other non-heme ligands in the heme-binding pockets were removed from the list as they are not truly "apo" forms with respect to the heme binding sites. The structural differences were evaluated with two structure alignment programs, FAST [53] and CE [54] for structure comparisons. The similarity/difference between two structures is measured by the RMSD (root mean square deviation) of the Cα atoms of aligned residues. The pocket/cavity was predicted using the CASTp server (Computed Atlas of Surface Topography of proteins) [55]. To compare the shape of the pockets, Rvs, the ratio between the volume and the surface area is used [56].

Results and Discussion
Non-redundant dataset of heme binding proteins There are 1998 and 113 PDB entries containing ligand HEM (heme type-b) and HEC (heme type-c) respectively with resolutions of 3Å or better as of November 24, 2009 [18]. Among these entries, 10 (1BE3, 1BGY, 1FGJ, 1GWS, 1PP9, 1PPJ, 1S56, 1S61, 2A06, and 3H1J) contain both heme type b and c. In toto 4272 protein chains were identified as heme interacting protein chains as described in Methods. A non-redundant dataset of 125 protein chains (114 heme-b and 11 heme-c, Additional file 1, Table S1) were generated using PISCES with a sequence identity cutoff of 25% [43]. Eighty-two percent of these protein chains contain only one heme molecule while the number of heme molecules in the remaining protein chains ranges from 2 to 8 (Additional file 1, Table S1). Two examples of multi-heme protein chains, 1FS7A with 5 type b and 3F29A with 8 type c heme molecules, are shown in (Figure 2A & 2B).
The dataset of heme binding proteins includes a wide variety of protein folds. A total of 86 protein chains (~69% of the dataset) have SCOP annotations (based on release 1.75 and Pre-SCOP) and belong to 31 distinct structural folds in all four major classes (Table 1) [42]. The dataset is dominated by proteins in the all-α class, making up 64% (55 of 86) of the total. The top 4 folds, Globin-like (a.1), Cytochrome P450 (a.104), Cytochrome c (a.3), and Multi-heme cytochromes (a.138) represent the well-known heme binding proteins that have been investigated extensively (Table 1).

Structural environment of the heme binding pockets
To investigate the structural environment of heme binding pockets, we identified both residues that make coordinate bonds with the heme iron and the ones that interact with the heme porphyrin structure ( Figure 2C and Methods). Out of the 125 heme binding protein chains, only 2PBJA and 3HCNA do not have residues identified as axial ligands to heme iron though both have extensive interactions with heme; instead other small molecules, such as glutathione (GSH) in 2PBJA (microsomal prostaglandin E synthase) [57] and imidazole (IMD) in 3HCNA (human ferrochelatase) [58] form coordinate bonds with heme iron. Five different amino acids (H, M, C, Y, K) are found to serve as axial ligands to the heme iron with histidine as the dominant residue (~80%) in both heme b and heme c types ( Figure 3). Heme b utilizes more cysteine residues while heme c has slightly more methionine residues as axial ligands. It should be pointed out that there are only 41 residues as heme c ligands. Therefore the percentages of nonhistidine ligands may have a relatively large change with a slight increase or decrease of ligand numbers due to the small dataset.
The conserved interactions between protein residues and heme were previously studied by calculating either the frequencies of residues that are in van der Waals contact with heme for each fold class of b-type heme proteins [27] or by calculating the mean number per binding site [41]. Smith et al also applied normalized amino acid profiles to assess the composition and conservation of heme binding sites [41]. Here we explored the residue preferences in the heme binding pockets through calculating the relative frequencies of heme binding residues in our non-redundant dataset. The relative frequency of each amino acid is normalized to its background frequency.
Normally, the background frequencies used for comparisons are calculated from a non-redundant protein dataset. However, due to the dominant presence of all-α folds, it is not clear whether the residue distribution in heme proteins is different from that in other proteins. Therefore we first compared the residue distributions between nonredundant heme proteins and non-redundant all proteins. To avoid issues with missing residues and cloning artifacts (His-tags etc.) associated with PDB sequences, we used native full-length protein sequences to calculate residue compositions by mapping the PDB chains to Uniprot entries with PDBSWS [46]. The relative residue frequencies between heme proteins and all proteins show that heme proteins tend to contain more alanine, phenylalanine, histidine, methionine, and tryptophan residues and fewer cysteine, aspartic acid, isoleucine, lysine, asparagine, and serine residues (Additional file 2, Figure S1). Statistical analysis (χ 2 ) revealed a significant difference between these two frequency profiles (data not shown). In order to have a meaningful description of the enrichment or deficiency of   residues in the heme interacting environment, we used the background frequencies from the non-redundant set of heme proteins as references.
The top five residues with high relative frequencies are cysteine (C), histidine (H), phenylalanine (F), methionine (M), and tyrosine (Y) ( Figure 4A). Because four of the top five (C, H, M, and Y) can serve as axial ligands to heme iron (Figure 3), we removed axial ligands from the dataset and recalculated the relative frequencies. Figure  4B shows that the level of histidine decreases to the background level, suggesting the enrichment of histidine is essentially due to the large number of heme histidine ligands. The other four residues, on the other hand, have almost the same relative frequencies with or without ligand residues ( Figure 4B). In heme c proteins, the occurrence of cysteine residues is extremely high with an eight fold enrichment compared to the background distribution. This is not surprising as the classic CXXCH binding motif, in which the histidine serve as ligand and the cysteine residues form covalent thioether bonds with the heme vinyl groups, has dominant presence in heme c proteins [28].
Consistent with earlier reports, the aromatic residues (phenylalanine, tyrosine, and tryptophan) play important roles in protein-heme interactions through stacking interactions with the porphyrin [27,41]. One exception is tryptophan in heme c proteins, which showed a similar level of occurrences compared to the background (Figure 4A). Leucine, isoleucine, and valine, which make hydrophobic interactions with the heme ring structure, are slightly increased over the background frequencies.
The residues with the fewest occurrences, aspartic acid, glutamic acid, and lysine are charged residues, suggesting the heme binding pocket is mainly a hydrophobic environment. In contrast, arginine, a positively charged residue that has been considered a major player in anchoring the heme propionates, has a much higher occurrence than other charged amino acids and shows a similar (HEM) or slightly higher (HEC) level of frequency to the background ( Figure 4A) [27]. The secondary structure types for heme interacting residues are shown in Figure 5. There are more helical and less coil types in proteins with heme b no matter what dataset (heme proteins or all proteins) is used as a reference. Therefore the difference is not due to the large number of all-α proteins in the dataset. As for heme interacting residues in heme c, they have similar distribution to the background ( Figure 5). Based on our 3-category classification of relative solvent accessibility [49], the heme interacting residues are less likely to be exposed. The buried residues are comparable to the background distribution. About 20% increase is observed in the intermediate category (Additional file 2, Figure  S2).

Heme binding sequence motifs
To investigate possible sequence motifs involved in heme binding, the flanking sequences with four residues on each side of heme axial ligands were collected and aligned. The non-redundant dataset has 34 heme c ligands, 32 of which have histidine as axial ligands. The alignment of these sequences shows the classic CXXCH heme c binding motif [4,28] (Figure 6A).
Another motif worthy of note, GX[HR]XC[PLAV]G, comes from the heme b proteins with cysteine as axial ligands ( Figure 6B). The motif represents the classic CYP signature heme binding motif FXXGXXCXG in bacteria, plant, and mammalian cytochrome P450 s [59][60][61]. At the -4 and +2 positions (with ligand cysteine as reference position) are small amino acids (glycine) while the -2 position prefers a positively charged amino acid such as histidine or arginine. These positively charged residues interact electronically with the negatively charged heme propionates ( Figure 6C and 6D). The small glycine residue at the -4 position may provide the flexibility needed for positioning the positively charged residues close to heme propionate groups. The +1 position is dominated by proline and hydrophobic amino acids, leucine, alanine, valine and isoleucine. Six of the eighteen cases have proline right after the axial ligand cysteine, reminiscent of the dipeptide CP motif being implicated in heme sensing and regulation [31][32][33][34][35]62]. While the importance of CP motif has been studied through deletion or site-directed mutation experiments in several important proteins, including transcription repressor Bach1 [63], iron regulatory protein 2 (IRP2) [31], circadian factor period 2 (Per2) [34] and δ-aminolevulinic acid synthase (ALAS) [33], the possible role of the CP motif in heme interaction from a structural point of view remains unclear as the structures for most of these proteins with such CP motifs are unknown.
All the six CP dipeptides that have direct physical interactions with heme exhibit similar structural roles with the cysteines serving as ligands to the heme iron and the proline residue introducing a bend for the downstream structures, mainly α-helices, to steer them away from the heme face ( Figure 7B and 7C). A seventh protein chain, 2PBJA, contains a CP where the proline shows highly similar structural implication, whereas the cysteine residue interacts with heme but not as a ligand. Instead, the presence of a glutathione molecule (GSH), which forms a coordination bond with the heme iron, seems to push the cysteine slightly away from the axial ligand position (5.25 Å from heme iron) [57]. Considering the conformation in the proline-bend structure and the small distance between cysteine and heme iron, it is likely that the cysteine could serve as a heme ligand if GSH is not present in the structure. Interestingly, a closer examination of the structural conformation downstream of the proline residue in 2CIWA (cloroperoxidase), 3CQVA (Reverb), and 2PBJA (microsomal prostaglandin E synthase), which have the CP heme motifs with conserved proline, indicates nearly perpendicular orientation to the heme plane ( Figure 7A, 7B and 7D). In contrast, in the P450 family where the proline residue is less conserved, with leucine, isoleucine, and methionine also found at the position of proline as shown in the motif logo ( Figure 6B), the α-helices following the proline residue are in parallel with the heme plane ( Figure 7C). The difference suggests a different structural role for the proline in conserved CP dipeptides from that in the less-conserved CP dipeptides, more specifically at the proline position.
CP dipeptides have also been implicated in indirect interaction with heme. Ragsdale and colleagues reported a novel role for CP motifs in heme oxygenase 2 (HMOX-2) as a thiol/disulfide redox switch that localizes outside the heme-binding pocket [62,64,65], therefore regulating heme-protein interaction via sensing redox status in the environment. There are a total of twenty-nine CP dipeptides in our dataset. Less than a quarter of them (in 7 protein chains including 2PBJA) show physical interactions with heme molecules. It would be impractical at this point to predict the functional role of the remaining CP dipeptides in heme-protein interaction, mainly due to the limited sample size and the lack of structural details on heme pocket-CP interaction. Here we made use of statistical analysis to indirectly assess the functional relevance of CP dipeptides in heme interaction. The rationale behind the assay is that, if CP dipeptides are important heme signatures for heme interaction, the expected occurrences of CP dipeptides in hemoproteins should be higher compared to control population. We found no statistically significant difference between the presence of CP dipeptides in heme proteins and non-heme proteins (data not shown), suggesting other yet to be identified factors may exist to help determine the role CP dipeptides play in heme binding [31]. It should be noted that we do not exclude the possibility that in the control sample there exist unknown hemoproteins; however for them to significantly affect the frequency of CP signals there would have to be a considerably large fraction of the control proteins being analyzed to be heme-interacting, which we anticipate as less likely.

Structure comparison between apo and holo heme proteins
An interesting question related to structure-based heme binding protein design and prediction is the degree of global conformational transition and the local changes of the heme-binding pocket upon heme binding. We collected 446 heme protein chains (after removing heme protein chains with at least 90% sequence identity) and compared their sequences with the protein chains without heme or heme-like ligands (Additional file 1, Table  S2). One hundred seventy-nine heme protein chains are found to have apo structures with high sequence similarity and coverage. After removing redundant apo/holo pairs with a 25% sequence identity cutoff and proteins with non-heme or non-heme-like ligands occupying the heme binding pocket, the final dataset consists of 10 apo-holo protein pairs. Table 2 shows that 9 out of 10 proteins undergo very small global conformational changes after heme binding with RMSDs of 1.03Å or less. For example the 2ZDOA-1XBWD pair (iron-regulated surface determinant IsdG from Staphylococcus aureus) has an RMSD of 0.59 Å. In the absence of heme, the protein assumes the same conformation as the holo protein with heme ( Figure 8A, B). Even the side chain positions of the histidine ligand are similar. The one with relatively large conformational changes is Rev-erb (3CQVA-2V7CA). Without heme the C-terminal helix (residues 568-576) moves towards the heme pocket with His568 (heme-binding ligand) facing away from the binding pocket ( Figure 8C, D) [66].
Three of the ten heme proteins in Table 2 have multiple known apo structures. 1KBIA (flavin-binding domain of Baker's yeast flavocytochrome b 2 ), 1N45A (human heme oxygenase-1), and 1N5UA (human serum albumin) have 9, 3, and 28 apo structures respectively (with at least 99% sequence identity, Additional file 1, Table  S3). Because proteins are inherently dynamic and conformational selection has been considered as a major mechanism for biomolecular recognition [67][68][69], we checked the conformational differences between each of the apo structures and the holo structures. Figure 9A shows the RMSD (Cα atoms of aligned residues) values of the apo-holo structural differences. The RMSDs are generally less than 1Å for 1KBIA and 1N45A. On the contrary, apo structures of 1N5UA form two clusters. Members of one cluster with 12 apo structures have RMSDs around 0.8Å while the other contains 15 apo structures with RMSDs ranging from 4 to 5Å. Through manual inspection, we found that the differences are caused by the numbers of non-heme ligands in structures. In addition to heme, 1N5UA also has 5 myristic acid (MYR) molecules ( Figure 9B). The apo structures with higher RMSDs either do not have ligands ( Figure  9C) or have only one or two non-MYR ligands. For example, 1E7AA and 2BX8B have 2 PFL and 1 AZQ respectively. On the other hand, apo structures with  MYR ligands in similar positions as those in 1N5UA generally have smaller RMSDs ( Figure 9D). Therefore, under similar environment, there are relatively small structural differences between holo and apo heme protein structures.
It should be noted that the above comparisons are based on heme proteins that have stable apo structures solved through X-ray crystallography. For some proteins, as in the case of hemoglobin, the absence of ligand(s) can increase the flexibility and cause partial unfolding of the protein structure, making it difficult for structure determination [70,71]. Furthermore, intrinsically disordered or unstructured regions are considered to be responsible for many important cellular functions such as ligand binding [72,73]. However the existence of such flexible apo structures would not interfere with our goal in structure-based heme protein prediction as we aim to take the existing apo structures in PDB as inputs [18].
Other features useful for comparing apo-holo heme proteins are the pocket size and shape. Due to different heme binding modes (partially exposed or fully embedded, Additional file 2 Figure S3) and the difficulty in identifying the exact heme binding pocket from existing automatic programs, the sizes of heme binding pockets vary from small (~400 Å 3 ) to very large (over 2000 Å 3 ) ( Table 2). In addition, the changes in absolute pocket volumes after heme binding are variable. Small changes  are seen in 2ITFA-2ITEB, 2R7AA-2RG7 D, and 2ZDOA-1XBWD. Other pairs exhibited significant changes in volume despite the minimal conformational change (Table 2). To take the shape into consideration we calculated the Rvs value (the ratio of pocket volume over the pocket surface area) of each pocket. Most of the apo or holo proteins have Rvs values around 1.4. To further investigate whether the binding pocket can be used as one of the characteristics for heme protein prediction, we compared the Rvs distributions between heme binding pockets and pockets in non-heme proteins (proteins that don't have heme ligand(s) and are not homologous to heme proteins) with similar sizes ranging from 350 to 2000Å 3 . The Rvs of heme binding pockets has a narrow distribution whereas the Rvs from similar pocket sizes of non-heme proteins has a wide spread with a long right tail (Additional file 2, Figure S4-A). We also investigated the distribution of Rvs normalized to a sphere shape as introduced by Sonavane and Chakrabarti [56]. A similar trend was found (Additional file 2, Figure S4-B). It should be pointed out that, even though unknown heme proteins may be included in the non-heme dataset, many non-heme proteins share similar pocket characteristics.

Conclusion
In this study, we surveyed the known heme protein structures for the purpose of structure-based heme protein prediction and novel heme protein design. We first compiled a non-redundant dataset of 125 heme (type b and c) binding protein chains that encompass a large number of protein structural folds, reflecting the diversified roles of heme proteins. Structural analysis revealed that the residues interacting with heme are mainly nonpolar, especially aromatic amino acids, providing a hydrophobic environment for the heme ring structure. We also investigated the possible structural roles of CP motifs that are implicated in the regulation of heme binding and have received much attention recently. While the CP dipeptide is not as strong a signature for heme binding as the classic CXXCH heme c binding motif, the proline in the heme-interacting CP dipeptides assume important structural roles when CP is conserved and the cysteine functions as an axial ligand with heme iron. Indirect interaction between CP motifs and heme binding has also been reported in HMOX-2 protein, in which CP dipeptides form thiol/disulfide redox switch away from the heme binding pocket [62,64], suggesting the heterogeneity of CP-heme interactions.
Comparisons between the apo and holo heme proteins indicate that most of the heme proteins undergo small conformational changes after heme binding, suggesting the apo structure can be used for structure-based heme protein prediction and as a scaffold for heme protein design. In addition our analysis on the heme binding pockets showed that despite the different sizes, the Rvs values of heme binding pockets are confined in a small range, whereas the data from non-heme binding proteins spread over a large range. We will apply the results from this study to investigate if any of the hypothetical proteins in PDB are potential heme proteins through computational prediction and experimental validations in the near future.

Additional material
Additional file 1: Datasets used in structural analysis of heme proteins. Table S1: a list of 125 non-redundant heme-binding protein chains; Table S2: heme and heme-like ligands in PDB; Table S3: heme proteins with multiple apo structures Additional file 2: Comparative analysis of heme-binding proteins and non-heme proteins in terms of amino acid frequency, relative solvent accessibility, and Rvs (ratio between volume and area). Figure S1: relative frequencies of amino acids in non-redundant heme proteins; Figure S2: frequencies of relative solvent accessibility for heme interacting residues; Figure S3: heme binding pockets in EMMA (A) and 3CQVA (B); Figure S4: distribution of (A) Rvs (ratio between volume and area) and (B) nRvs (normalized Rvs) between heme binding pockets and non-heme binding pockets.
List of abbreviations used Å: angstrom; CP: cysteine-proline; LBD: lipid binding domain; SCOP: structural classification of proteins; PDB: protein data bank; RMSD: root mean square deviation; Rvs: ratio of volume over area.