Characterization of conserved properties of hemagglutinin of H5N1 and human influenza viruses: possible consequences for therapy and infection control

Background Epidemics caused by highly pathogenic avian influenza virus (HPAIV) are a continuing threat to human health and to the world's economy. The development of approaches that help to understand the significance of structural changes resulting from the alarming mutational propensity for human-to-human transmission of HPAIV is of particular interest. Here we compare informational and structural properties of the hemagglutinin (HA) of H5N1 virus and human influenza virus subtypes, which are important for the receptor/virus interaction. Results The presented results revealed that HA proteins encode highly conserved information that differs between influenza virus subtypes H5N1, H1N1, H3N2 and H7N7, and defined an HA domain which may modulate interaction with the receptor. We also found that about one third of the H5N1 viruses isolated during the 2006/07 influenza outbreak in Egypt are possibly evolving towards receptor usage similar to that of seasonal H1N1. Conclusion The presented results may help to better understand the interaction of influenza virus with its receptor(s) and to identify new therapeutic targets for drug development.


Background
Influenza is currently considered one of the most severe threats to human health and animal welfare. Highly pathogenic avian influenza (HPAIV) H5N1 viruses have been isolated from avian species in more than 50 countries. As of January 2008, 349 human H5N1 infections have been reported to the World Health Organization (WHO) [1]. Of these 349 cases, 216 patients have died (62%) and there has been no decline in mortality rate. Because the virus has evolving antigenicity for which humans may not have preexisting immunity, the conditions for a possible pandemic exist.
The entry of influenza virus into susceptible cells is mediated by the viral hemagglutinin (HA) membrane glycoprotein, which binds sialic acids of cell-surface glycoproteins and glycolipids. The binding preference of a given HA for different receptors correlates to some extent with the species specificity for infection. Human isolates preferentially bind to receptors with α2,6 linkages to galactose (SAα2,6Gal), whereas avian isolates prefer α2,3 linkages (SAα2,3Gal) [2][3][4][5][6]. A change in receptor preference is, however, not necessary, since the lower respiratory tract also expresses α2,3 receptors [7]. It has also been reported that influenza virus can infect host cells via a sialic acid-independent pathway, either directly or in a multistage process [8]. It has been speculated that sialic acid enhances virus binding to secondary receptors that mediate entry [8].
Several approaches, such as structural analyses, model protein evolution, and mathematical modeling have been taken to study the antigenic drift and shift of influenza A viruses (for review see Ref. [23] and references therein). All of these approaches trace changes in HA but they do not allow precise assessment of biological consequences.
Here we applied the informational spectrum method (ISM), which is a theoretical approach to investigate the periodicity of structural motifs with defined physicochemical characteristics that determine biological properties of proteins [9]. The protein sequence is encoded numerically by assigning to each amino acid its electron-ion interaction potential (EIIP), which describes the average energy states of valence electrons in an amino acid. By using the discrete Fourier transform (DFT), the numerical sequence is transformed into the frequency domain to create an ISM spectrum. It has been pointed out that the Fourier spectra of protein sequences involved in mutual interaction are similar and that this similarity is represented by a common frequency component [9]. The ISM spectrum receives a contribution from every amino acid in the sequence. Therefore, once the characteristic frequency has been identified, it is possible to use ISM to determine how the substitution of an amino acid changes the spectrum at that frequency and influences the biological activity of the protein. Using this bioinformatics approach we have previously characterized the conserved information responsible for interaction between envelope glycoprotein gp120 of human immunodeficiency virus type 1 (HIV-1) and its CD4, CCR5 and CXCR4 receptors [9][10][11]. By analogy with HIV-1 gp120, it can be assumed that highly variable HA molecules of influenza viruses also encode conserved information, which may determine receptor-binding preferences. Identification and characterization of this information could contribute to a better understanding of HPAIV/host interaction.
Here we show that the HA subunit 1 (HA1) of H5N1 viruses encodes specific and highly conserved information which may determine the recognition and targeting of these HPAI viruses to their receptor. The comparison with seasonal strains suggests that a subset of H5N1 in Egypt may be evolving towards an H1N1-like receptor usage.

Sequences
The HA1 sequences were retrieved from the GenBank database with the following accession numbers and were used for the results of Figure 1.

Informational spectrum method
The surface complementarity between interacting biomolecules, originally proposed by Emil Fischer in 1894, together with collision theory, which assumes that the first contact between interacting molecules is achieved accidentally through the thermal motions that cause molecular wander, represents the fundamental basis for our current understanding of intermolecular interaction in biological systems. According to this concept, the diffusion-limited association rate constant calculated by Smoluchowski's equation is $10^{6}$ M$^{-1}$ s$^{-1}$ for a protein-ligand interaction and approximately $10^{3}$ M$^{-1}$ s$^{-1}$ for a protein-protein interaction. On the other hand, real protein-protein association generally occurs at rates that are $10^{3}$ to $10^{4}$ times faster than would be predicted from a simple 3D "random diffusion" model [12].
Figure 3
Consensus IS of HA1 from three Spanish flu H1N1 viruses.

In order to overcome the discrepancy between theoretically estimated and real values of the association rate constant for a protein-protein interaction, a model for interaction between biological molecules has been proposed that is based on frequency-selective long-range attractive forces, which are efficient at distances longer than one linear dimension of the interacting macromolecules ($10^{2}$-$10^{3}$ Å) [13,14]. It has been shown that the number of valence electrons and the EIIP, representing the main energy term of the valence electrons, are essential physical parameters determining the long-range properties of biological molecules. The EIIP can be determined for organic molecules by the following simple equation derived from the "general model pseudopotential" [15,16]:

$$W = 0.25\, Z^{*}\, \frac{\sin(1.04\, \pi Z^{*})}{2\pi} \qquad (1)$$

where $Z^{*}$ is the average quasivalence number (AQVN) determined by

$$Z^{*} = \frac{1}{N} \sum_{i=1}^{m} n_{i} Z_{i} \qquad (2)$$

where $Z_{i}$ is the valence number of the i-th atomic component, $n_{i}$ is the number of atoms of the i-th component, m is the number of atomic components in the molecule, and N is the total number of atoms. The EIIP values calculated according to equations (1) and (2) are in Rydbergs (Ry).
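The calculation of the AQVN and the EIIP can be illustrated with a minimal Python sketch. The closed form used for W, namely W = 0.25 Z* sin(1.04 π Z*) / (2π), is the expression usually quoted for this pseudopotential model and is treated here as an assumption. The `VALENCE` table, the function names and the choice of side-chain formulas as inputs are hypothetical and serve illustration only; reporting the magnitude of W is likewise an assumption.

```python
import math

# Valence numbers of elements occurring in amino acids (hypothetical helper table).
VALENCE = {"H": 1, "C": 4, "N": 5, "O": 6, "S": 6}


def aqvn(composition, valence=VALENCE):
    """Average quasivalence number: Z* = (sum_i n_i * Z_i) / N."""
    total_atoms = sum(composition.values())
    total_valence = sum(n * valence[element] for element, n in composition.items())
    return total_valence / total_atoms


def eiip(z_star):
    """EIIP in Rydbergs, assuming W = 0.25 * Z* * sin(1.04 * pi * Z*) / (2 * pi)."""
    return 0.25 * z_star * math.sin(1.04 * math.pi * z_star) / (2 * math.pi)


# Side-chain formulas used purely as illustrations (an assumption, not from the text).
side_chains = {
    "Gly": {"H": 1},
    "Ala": {"C": 1, "H": 3},
    "Leu": {"C": 4, "H": 9},
}

for name, formula in side_chains.items():
    z = aqvn(formula)
    # The magnitude of W is printed; the sign convention is left open here.
    print(f"{name}: Z* = {z:.3f}, |W| = {abs(eiip(z)):.4f} Ry")
```

Under these assumptions the three side chains give magnitudes of about 0.0050, 0.0373 and 0.0000 Ry, which can be compared with the values listed for the corresponding residues in Table 1.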
Using the concept of long-range forces, which increase the number of productive collisions between interacting biomolecules, together with the EIIP values of amino acids, the informational spectrum method (ISM) was developed for the analysis of protein-protein interactions and of the relationship between structure and function of proteins. This virtual spectroscopy method comprises three basic steps:
1. Transformation of the alphabetic code of the primary structure into a sequence of numbers by assigning to each amino acid or nucleotide a corresponding numerical value representing the electron-ion interaction potential.
2. Conversion of the obtained numerical sequence by Fourier transformation into the informational spectrum (IS).
3. Cross-spectral analysis, which allows identification of frequency components in the informational spectrum of molecules which are important for their biological function or interaction with other molecules.
The physical and mathematical basis of ISM has been described in detail elsewhere [17][18][19][20], and here we present this bioinformatics method only in brief. A sequence of N residues is represented as a linear array of N terms, with each term given a weight. The weight assigned to a residue is its EIIP (Table 1). In this way the alphabetic code is transformed into a sequence of numbers. The obtained numerical sequence, representing the primary structure of the protein, is then subjected to a DFT, which is defined as follows:

$$X(n) = \sum_{m=1}^{N} x(m)\, e^{-i 2\pi n m / N}, \qquad n = 1, 2, \ldots, N/2$$

where x(m) is the m-th member of a given numerical series, N is the total number of points in this series, and X(n) are the DFT coefficients. These coefficients describe the amplitude, phase and frequency of the sinusoids which comprise the original signal. The absolute value of the complex DFT defines the amplitude spectrum, and its argument defines the phase spectrum. The complete information about the original sequence is contained in both spectral functions. However, in the case of protein analysis, relevant information is presented in an energy density spectrum [17,18], which is defined as follows:

$$S(n) = X(n)\, X^{*}(n) = |X(n)|^{2}, \qquad n = 1, 2, \ldots, N/2$$

In this way, sequences are analyzed as discrete signals. It is assumed that their points are equidistant with the distance d = 1. The maximal frequency in a spectrum defined in this way is F = 1/2d = 0.5. The frequency range is independent of the total number of points in the sequence. The total number of points in a sequence influences only the resolution of the spectrum. The resolution of the N-point sequence is f = 1/N. The n-th point in the spectral function corresponds to a frequency f(n) = nf = n/N. Thus, the initial information defined by the sequence of amino acids can now be presented in the form of an IS, representing a series of frequencies and their amplitudes.
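The encoding and transformation steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the software used in the study. The `EIIP` dictionary holds the values commonly tabulated in the ISM literature and should be checked against Table 1, which is not reproduced here; the function names are hypothetical. The sum runs over m = 0 … N-1 rather than 1 … N, which only changes the phase and leaves the energy density spectrum unchanged.

```python
import cmath

# EIIP values in Ry as commonly tabulated in the ISM literature
# (assumption: verify against Table 1 of the paper).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def informational_spectrum(sequence, scale=EIIP):
    """Return (frequencies, energy density values) for n = 1 .. N/2."""
    x = [scale[aa] for aa in sequence]          # step 1: numerical encoding
    n_points = len(x)
    freqs, power = [], []
    for n in range(1, n_points // 2 + 1):       # step 2: DFT coefficient X(n)
        coeff = sum(
            x[m] * cmath.exp(-2j * cmath.pi * n * m / n_points)
            for m in range(n_points)
        )
        freqs.append(n / n_points)              # f(n) = n / N, at most 0.5
        power.append(abs(coeff) ** 2)           # S(n) = |X(n)|^2
    return freqs, power


def peak_frequency(sequence):
    """Frequency of the largest component of the spectrum."""
    freqs, power = informational_spectrum(sequence)
    best = max(range(len(power)), key=lambda k: power[k])
    return freqs[best]


# Toy input: a strictly alternating sequence has period 2, so its peak is at 0.5.
print(peak_frequency("DL" * 20))
```

A direct O(N²) sum is used to keep the sketch dependency-free; for sequences of HA1 length this is fast enough, and an FFT routine would give the same spectrum.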
The IS frequencies correspond to the distribution of structural motifs with defined physicochemical properties determining a biological function of a protein. When comparing proteins which share the same biological or biochemical function, the ISM technique allows detection of code/frequency pairs which are specific for their common biological properties, or which correlate with their specific interaction. This common informational characteristic of sequences is determined by a cross-spectrum or consensus informational spectrum (consensus IS). A consensus IS of N spectra is obtained by the following equation:

$$C(j) = \prod_{i=1}^{N} \Pi(i, j)$$

where Π(i, j) is the j-th element of the i-th power spectrum and C(j) is the j-th element of the consensus IS. Thus, the consensus IS is the Fourier transform of the correlation function for the spectrum. In this way, any spectral component (frequency) not present in all compared IS is eliminated. Peak frequencies in the consensus IS are common frequency components for the analyzed sequences. A measure of similarity for each peak is the signal-to-noise ratio (S/N), which represents the ratio between the signal intensity at one particular IS frequency and the mean value of the whole spectrum. If one calculates a consensus IS for a group of proteins which have different primary structures and finds strictly defined peak frequencies, it means that the analyzed proteins participate in mutual interaction or have a common biological function.
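The cross-spectral step and the S/N measure described above can be sketched as follows. The element-wise product over spectra and the ratio to the mean follow the text; zero-padding all sequences to the length of the longest one, so that their spectra share one frequency grid, is an assumption of this sketch, as are the function names and the `EIIP` values (commonly tabulated values, to be checked against Table 1).

```python
import cmath
import math

# EIIP values in Ry as commonly tabulated (assumption: verify against Table 1).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def power_spectrum(sequence, n_points):
    """Energy density spectrum of a sequence zero-padded to n_points."""
    values = [EIIP[aa] for aa in sequence] + [0.0] * (n_points - len(sequence))
    spectrum = []
    for n in range(1, n_points // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * n * m / n_points)
            for m, v in enumerate(values)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def consensus_spectrum(sequences):
    """C(j) = product over all spectra of their j-th elements."""
    n_points = max(len(s) for s in sequences)   # common grid (assumption)
    spectra = [power_spectrum(s, n_points) for s in sequences]
    freqs = [n / n_points for n in range(1, n_points // 2 + 1)]
    consensus = [math.prod(column) for column in zip(*spectra)]
    return freqs, consensus


def signal_to_noise(consensus):
    """Ratio of each component to the mean value of the whole spectrum."""
    mean_value = sum(consensus) / len(consensus)
    return [c / mean_value for c in consensus]


# Toy input: three period-2 sequences of different composition and length.
freqs, cons = consensus_spectrum(["DL" * 20, "RI" * 20, "DL" * 18])
sn = signal_to_noise(cons)
peak = max(range(len(cons)), key=lambda k: cons[k])
print(freqs[peak], round(sn[peak], 2))
```

Because the product suppresses any component that is weak in at least one spectrum, only the frequency shared by all three toy sequences survives in the consensus.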
The ISM was successfully applied in structure-function analysis of different protein sequences and de novo design of biologically active peptides (for review see Refs. [10] and [20]), assessment of biological effects of mutations [21] and prediction of new protein interactors [22].

Results
To identify conserved information encoded by HA1 proteins, we performed a cross-spectral analysis of all H5N1 HA1 amino acid sequences in GenBank (1407 entries). Figure 1a shows that the consensus IS of these sequences contains only one peak, at the frequency F(0.076). According to the ISM concept, this information represents the long-range component of the protein-protein interaction between HA1 and a putative partner, such as a receptor. Figures 1b and 1c show the IS of HA1 of the H5N3 virus A/swan/Hokkaido/51/96, the putative progenitor of the HPAI H5N1 subtype, and of the first H5N1 virus isolated in China in 1996 (A/Goose/Guangdong/1/96) [24]. Both of these IS have a dominant peak at the same characteristic frequency F(0.076), demonstrating that the HA of these two viruses encodes the same information as the H5N1 HA1 shown in Figure 1a. The computer scanning survey of the primary structure of H5N1 HA1 showed that the main contribution to the information represented by the frequency F(0.076) comes from the domain (denoted VIN1) located in the N-terminus of the protein, which encompasses residues 42-75 of the mature protein (Table 2, Figure 4). Interestingly, this domain of H5N1 HA1 is highly conserved in all H5N1 viruses.
Next, we performed the ISM analysis of HA1 molecules of seasonal viruses H1N1 (n = 29) and H3N2 (n = 30), as well as H7N7 viruses (n = 30), from different years and geographic regions. Their consensus IS show characteristic peaks at the frequencies F(0.236), F(0.363) and F(0.285), respectively (Figures 1d, 1e and 1f), distinct from the F(0.076) of H5N1 HA. This may suggest that HA1 sequences encode information which is specific for each of these subtypes. The domains of HA1 of H1N1, H3N2 and H7N7 influenza viruses derived from the above frequencies are shown in Table 2 and highlighted in the HA structural model (Figure 4).
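The text does not spell out how the computer scanning survey of the primary structure was carried out. One plausible reading, offered here only as a hedged sketch, is a sliding-window scan that scores each 34-residue stretch by the magnitude of its partial contribution to the DFT sum at the frequency of interest. The scoring rule, the function names and the `EIIP` values (commonly tabulated values, to be checked against Table 1) are all assumptions.

```python
import cmath

# EIIP values in Ry as commonly tabulated (assumption: verify against Table 1).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def window_contributions(sequence, frequency, window=34):
    """Score every window by |sum of x(m) * exp(-2*pi*i*f*m)| over its residues."""
    phasors = [
        EIIP[aa] * cmath.exp(-2j * cmath.pi * frequency * m)
        for m, aa in enumerate(sequence)
    ]
    return [
        abs(sum(phasors[start:start + window]))
        for start in range(len(sequence) - window + 1)
    ]


def best_window(sequence, frequency, window=34):
    """Return the 1-based, inclusive residue range of the top-scoring window."""
    scores = window_contributions(sequence, frequency, window)
    start = max(range(len(scores)), key=lambda k: scores[k])
    return start + 1, start + window


# Toy input: a period-2 motif of 33 residues embedded in zero-EIIP flanks.
toy = "L" * 30 + "DL" * 16 + "D" + "L" * 30
print(best_window(toy, frequency=0.5, window=33))
```

With a real HA1 sequence the call would be `best_window(ha1_sequence, 0.076)`, where `ha1_sequence` is a placeholder for a sequence retrieved from GenBank. An alternative design would delete or mask each window and record the drop in amplitude at the target frequency.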
Despite the low infectivity of H5N1 for humans, there has been evidence in Egypt of several clusters of human-to-human transmission with a very high mortality rate. ISM analysis of 95 HA sequences from Egypt from 2006 and 2007 showed that these viruses can be divided into two groups. The consensus IS of the first group (Egypt-1) of 55 strains contains a dominant peak at the frequency F(0.076), which is characteristic for H5N1 HA1, and a less prominent peak at the frequency F(0.236), which is characteristic for H1N1 HA1 (Figure 2a). In contrast, the consensus IS of the second group (Egypt-2) (Figure 2b), which includes 40 H5N1 HA1, contains only one significant peak, at the frequency F(0.236), corresponding to the consensus IS of H1N1 HA1 in Figure 1d. Figures 2c and 2d show representative IS of individual strains of both groups. Of the H5N1 viruses which were isolated in Egypt during 2006, 76% belong to the group Egypt-1 and 24% to the group Egypt-2. In contrast, in 2007, 48% belong to Egypt-1 and 52% to Egypt-2. Figure 4 shows the IS of peptide VIN1 and of the domains identified by consensus IS of H1N1, H3N2, H5N1 and H7N7 viruses (Table 2) and the position of these domains in the molecule. As can be seen, the receptor targeting site of H5N1 virus from the group Egypt-1 (A/Egypt/0636-NAMRU3/2007) is closer to the receptor binding site than in the other viruses of
Finally, we compared informational properties of H1N1 pandemic strains from 1918 from GenBank and seasonal H1N1 strains. The consensus IS of these pandemic isolates (Figure 3) is characterized by a dominant peak at the frequency F(0.258), which is different from the frequency F(0.236) characteristic of other seasonal flu H1N1 isolates (Figure 1d). Table 2 shows the domain corresponding to the frequency F(0.258). In the model of A/South Carolina/1/18 (Figure 4i) the position of this domain does not overlap with the corresponding domain of other seasonal H1N1 strains, but overlaps with the corresponding domain of Egypt-2 H5N1 viruses.
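The criterion used to separate the Egyptian isolates into two groups is not stated explicitly. As a hedged illustration, the sketch below compares the energy density of a single sequence at the two characteristic frequencies named in the text, F(0.076) and F(0.236), and reports which one dominates. The decision rule, the function names, the synthetic test sequences and the `EIIP` values (commonly tabulated values, to be checked against Table 1) are assumptions and do not reproduce the grouping in Figure 2.

```python
import cmath
import math

# EIIP values in Ry as commonly tabulated (assumption: verify against Table 1).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def amplitude_at(sequence, frequency):
    """Energy density at the DFT point nearest to the requested frequency."""
    n_points = len(sequence)
    n = round(frequency * n_points)
    coeff = sum(
        EIIP[aa] * cmath.exp(-2j * cmath.pi * n * m / n_points)
        for m, aa in enumerate(sequence)
    )
    return abs(coeff) ** 2


def dominant_signature(sequence, f_h5n1=0.076, f_h1n1=0.236):
    """Label a sequence by the stronger of the two characteristic frequencies."""
    if amplitude_at(sequence, f_h5n1) >= amplitude_at(sequence, f_h1n1):
        return "H5N1-like"
    return "H1N1-like"


def synthetic_sequence(frequency, length=250):
    """Toy sequence: D where a cosine of the given frequency is positive, L elsewhere."""
    return "".join(
        "D" if math.cos(2 * math.pi * frequency * m) > 0 else "L"
        for m in range(length)
    )


# Toy inputs with a built-in periodicity at one of the two frequencies.
print(dominant_signature(synthetic_sequence(0.076)))
print(dominant_signature(synthetic_sequence(0.236)))
```

A single-sequence rule of this kind cannot express the mixed profile reported for Egypt-1, where both peaks are present; reporting the two amplitudes side by side, or their ratio, would be a more informative variant.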

Discussion
The differentiation of H5N1 into an increasing number of clades and subclades is alarming, but the fundamental changes associated with efficient human-to-human transmission are poorly understood. The development of approaches which allow the tracing and the understanding of such changes is of the highest priority.
To identify specific information which determines long-range components of protein-protein interactions between H5N1 and, putatively, its receptor(s), we performed the ISM analysis of the HA1 protein. This analysis revealed that this protein, although highly variable, encodes conserved information, which is represented by the IS frequency component F(0.076).
The main information corresponding to the IS frequency F(0.076) is contributed by the VIN1 domain located in the N-terminus of the HA1 molecule (Figure 4). This domain is highly conserved in all H5N1 viruses. The peptide VIN1, between residues 42 and 75, is located within site E, one of the five major antigenic domains of the HA molecule. In the 3D structure of HA1, site E is located below the globular head involved in receptor binding [5]. It was previously shown that protein domains which are essential for a particular IS frequency are directly involved in protein-protein interaction [9,22]. Therefore, we postulate that the VIN1 domain plays an important role in the recognition and targeting between virus and receptor. For this reason, VIN1 may represent a potential target for therapy of H5N1 infection.
It is of note that site E, encompassing the VIN1 domain, is placed below the globular head of HA1 which is involved in receptor binding [5]. Most mutations which determine receptor tropism [6,7] and are involved in immune avoidance occur in this globular part of the HA1 molecule. On the other hand, mutations within site E are rare. This indicates that the variable antigenic sites A and B located in the globular head of HA1 could represent an immune decoy which protects the important functional site E, determining the conserved long-range properties of the molecule. A similar structural organization was previously reported for HIV-1 gp120 [11,25], and it was pointed out as an important obstacle in the development of an AIDS vaccine [26][27][28].
H5N1 already replicates efficiently in humans and causes case fatality rates that are ten times higher than those seen in the 1918 pandemic. Thus, an infectivity of H5N1 similar to that of seasonal flu would cause a catastrophic pandemic. The main obstacle to this worst-case scenario is the poor human-to-human transmission of H5N1 viruses, which is attributed to the paucity of sialic acid α2,3 receptors in the epithelium of the human upper respiratory tract and the inability of the virus to replicate efficiently at this site. Interestingly, the ISM approach identifies important differences between H5N1 viruses from Egypt. Some have the characteristics of most H5N1 strains, whereas about one third of the viruses display characteristics that are also found in human H1N1 seasonal virus. Interestingly, the proportion of the latter viruses increased from about 25% to about 50% between 2006 and 2007.
Similarly, the results for H5N1 strains from Egypt (Figure 2) may be indicative of a possible viral evolution towards receptor usage similar to that of H1N1 viruses, which efficiently replicate in the upper respiratory tract. The protein domain which seems to be involved in this subtle change corresponds to amino acid domain 99-132 (Figure 4g) [29]. Unexpectedly, this analysis revealed six amino acid substitutions (K35R, D45N, D94N, K35R/D45N, K35R/D45N/D94N, A247T) outside the receptor-binding domain of HA, which could enhance interaction between H5 HA and the human-type SAα2,6Gal receptor. As can be seen, three of these substitutions include mutation D45N, which is located within peptide VIN1, and two other mutations (K35R and D94N) are located in its vicinity. This is the first report that naturally occurring mutations in the region of H5 HA which encompasses peptide VIN1 play an important role in virus transmission from avian to human hosts. It is of note that Egyptian strains contain all of these mutations, except the mutation at position K35. These results point to the need for future testing of the evolution of Egyptian strains using hemadsorption assays for HA receptor-binding activity, in order to identify possible new mutations in this domain of HA which could increase the affinity of H5N1 viruses for the human-type receptor.
Du and co-workers discovered the monoclonal antibody (MAb) 4G6, which efficiently and selectively recognizes and neutralizes recently emerged Asian H5N1 viruses [30]. The epitope-mapping analysis revealed that the epitope of the neutralizing 4G6 MAb is located within peptide VIN1, pointing to this domain of HA as a therapeutic and diagnostic target for H5N1 viruses. The 4G6 MAb recognizes residue D43 within peptide VIN1, which characterizes Asian H5N1 viruses, but not N43, which characterizes H5N2 and H5N1 viruses. It was also shown that this MAb recognizes Egyptian H5N1 strains derived from clade 2.2 containing D43. Based on these results, Du and co-workers suggested that the 4G6 MAb could be useful for rapid diagnosis of infection with the H5N1 viruses currently circulating in Asia, Europe and Africa, as well as for the development of an antibody-based therapy. It is of note that recent Egypt group-2 strains are characterized by N43, in contrast to Egypt group-1 strains, which contain D43. This means that the 4G6 MAb cannot be used for detection and neutralization of H5N1 viruses belonging to Egypt group-2.

Conclusion
In summary, the presented results showed that: (i) H5N1 HA1 encodes specific information represented by an IS frequency different from that encoded by other subtypes; (ii) this characteristic frequency is largely determined by a highly conserved N-terminal domain of HA1; (iii) other subtypes encode information that corresponds to other domains, including residues 262-295 for H1N1, residues 57-90 for H3N2, residues 28-61 for H7N7 and residues 87-120 for Spanish flu; and (iv) at least in Egypt, H5N1 viruses have acquired features that may adapt them for H1N1-like receptor usage, possibly allowing more efficient human-to-human transmission. Our results suggest subtle but so far elusive differences in the interactions of these different viral subtypes with their receptors. Collectively, these results may help to better understand the interaction of influenza virus with its receptor(s) and to identify new targets for drug development.

Authors' contributions
VV conceived of the study, participated in its design and coordination and preparation of the manuscript. NV carried out the ISM analysis of viral sequences. CPM performed 3D structural analysis of viral proteins and participated in preparation of the manuscript. SM contributed with immunological interpretation of results. SG collected sequences from databases and carried out structure/function analysis of viral proteins. VP developed the ISM software for bioinformatics analysis of viral proteins. HK participated in design of study, interpretation of data and preparation of the manuscript.