Molecular models of NS3 protease variants of the Hepatitis C virus

Background: Hepatitis C virus (HCV) currently infects approximately three percent of the world population. In view of the lack of vaccines against HCV, there is an urgent need for an efficient treatment of the disease by an effective antiviral drug. Rational drug design has not been the primary way of discovering major therapeutics. Nevertheless, there are reports of success in the development of inhibitors using a structure-based approach. One of the possible targets for drug development against HCV is the NS3 protease and its variants. Based on the three-dimensional structure of these variants we expect to identify new NS3 protease inhibitors. In order to speed up the modeling process, all NS3 protease variant models were generated on a Beowulf cluster. The potential of structural bioinformatics for the development of new antiviral drugs is discussed.

Results: The atomic coordinates of the crystallographic structures 1CU1 and 1DY9 were used as starting models for modeling of the NS3 protease variant structures. The NS3 protease variant structures are composed of six subdomains, which occur in sequence along the polypeptide chain. The protease domain exhibits the dual beta-barrel fold that is common among members of the chymotrypsin serine protease family. The helicase domain contains two structurally related beta-alpha-beta subdomains and a third subdomain of seven helices and three short beta strands. The latter subdomain is usually referred to as the helicase alpha-helical subdomain. The rmsd values of bond lengths and bond angles, the average G-factor and the Verify 3D values are presented for the NS3 protease variant structures.

Conclusions: This project increases the certainty that homology modeling is a useful tool in structural biology and that it can be very valuable in annotating genome sequence information and contributing to structural and functional genomics of viruses. The structural models will be used to guide future efforts in the structure-based design of a new generation of inhibitors of NS3 protease variants. All models in the database are publicly accessible via our interactive website, providing us with a large number of structural models for use in protein-ligand docking analysis.


Background
After the development of serological tests for hepatitis A and B viruses in the 1970s it became clear that an additional agent accounted for approximately 90% of transfusion-associated hepatitis (non-A non-B hepatitis, NANBH) [1].
The novel agent, subsequently termed hepatitis C virus (HCV), currently infects approximately 3% of the world's population and was classified within the Flaviviridae family. Diagnostic tests for anti-HCV antibodies developed thereafter proved that HCV was indeed the predominant cause of NANBH [2]. In view of the lack of vaccines against HCV, there is an urgent need for a treatment of the disease by an effective antiviral drug. This necessity has boosted research on the structural biology of HCV, with the primary focus being to identify possible targets for pharmaceutical intervention [3].
Rational drug design has not been the primary way for discovering major therapeutics. However, recent successes in the area give reason to expect that drug discovery projects will increasingly be structure based. One of the possible targets for drug development against HCV is the NS3 protease variants. HCV RNA is translated into a polyprotein that during maturation is cleaved into functional components. One component, nonstructural protein 3 (NS3), is a 631-residue bifunctional enzyme with protease and helicase activities.
The N-terminal portion of the NS3 protein was predicted to contain a serine protease domain, as judged from conserved sequence patterns and by homology to Flavi- and Pestiviruses [4][5][6]. The NS3 serine protease processes the HCV polyprotein by both cis and trans mechanisms. The iterative refinement and optimization of drug leads is an effective strategy for generating potent preclinical candidates [7,8]. Ongoing genome sequencing efforts have led to the identification of hundreds of potential therapeutic targets, many of which represent possible sources of crossover pharmacology. Homology or comparative modeling is a key feature of an integrated drug discovery effort because it allows this genomics information to be utilized early in the development of target ligands or in the engineering of ligand specificity [9].
Genome sequencing efforts are providing us with complete genetic blueprints for hundreds of organisms, including humans. We are now faced with assigning, understanding and modifying the functions of proteins encoded by these genomes. This task is generally facilitated by 3D structures [10], which are best determined by experimental methods such as X-ray crystallography and NMR spectroscopy. The theoretical approaches [11] can be divided into physical and empirical methods. The physical prediction methods are based on interactions between atoms and include molecular dynamics and energy minimization [12], whereas the empirical methods depend on the protein structures that have been already determined by experiment. They include combinatorial [13] and comparative modeling [14,15].
Comparative modeling uses experimentally determined protein structures to predict the conformation of other proteins with similar amino acid sequences. To model the proteins we used restraint-based modeling as implemented in the program MODELLER [16]. The models consist of coordinates for all non-hydrogen atoms in the modeled part of a protein. Models are generated entirely automatically in a four-step procedure [17]: (i) fold assignment, (ii) sequence-structure alignment, (iii) model building, and (iv) model evaluation. This procedure was applied to variants of NS3 protease using Perl-CGI, C and MPI programming.
We modeled the structures of the NS3 protease variants available in the National Center for Biotechnology Information database (GenBank), using structural bioinformatics tools. Knowledge of the three-dimensional structures of these variants should aid the design of useful inhibitors that may be used as drugs against hepatitis C virus. In order to speed up the modeling process, all NS3 models were generated on a Beowulf cluster (BioComp, S.J. Rio Preto, Brazil). The potential of structural bioinformatics for the development of new antiviral drugs is discussed.

Primary sequence comparison
The identity between the sequences of the bifunctional protease structures (PDB access codes: 1CU1, 1DY9) [31,38] (templates) and the NS3 protease variants (targets) is shown in Table 1. The secondary structural elements are indicated in Figure 2 (without inhibitor) and in Figure 3 (with inhibitor). The sequence from the crystallographic structure 1CU1 shows more than 79.1% identity with the sequences of the NS3 protease variants, which supports high accuracy for the models (Table 1).
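As an illustration of the identity values reported in Table 1, the following minimal sketch computes percent identity from a pairwise alignment. The function name and the convention of counting only columns without gaps are assumptions made for this example; the text does not state which convention was used for Table 1.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: gaps are written as '-' and gapped columns are excluded
    from the denominator. Other conventions give slightly different values.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped aligned positions to compare")
    return 100.0 * matches / compared


# Hypothetical six-residue fragments, not taken from the NS3 sequences.
identity = percent_identity("ACDEFG", "ACDQFG")
print(f"{identity:.1f}% identity")
```

A target would meet the threshold quoted in the text when the returned value exceeds 79.1 for its alignment with 1CU1.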

Quality of the models
The atomic coordinates of the crystallographic structure 1CU1, solved to a resolution of 2.5 Å, were used as the starting model for modeling of the NS3 protease variant structures, and the structure of NS3 complexed with an inhibitor (PDB access code: 1DY9) was used to generate homology models for docking studies. Binding of an inhibitor to the active site of an enzyme is typically connected with local and possibly also global structural rearrangement of the enzyme (induced-fit mechanism). Therefore, structure-based drug design preferentially relies on crystal structures of enzyme-inhibitor complexes containing bound inhibitors chemically similar to the compounds being designed. Such complexes offer a more detailed and accurate picture of the inhibitor-enzyme interactions and of the structural complementarity between the inhibitor and the active site. The homology models of the NS3 protease variants built from the inhibitor-bound NS3 structure are therefore more suitable for docking simulations. The atomic coordinates of all water molecules were removed from the templates.
Ramachandran φ-ψ plots were used to compare the overall stereochemical quality of the NS3 protease variant structures with that of the template (1CU1) solved by crystallography (Table 1). The models present over 94.0% of the residues in the most favorable regions. The same analysis for the crystallographic structure (1CU1) shows 88.9% of residues in the most favorable regions, 10.5% in additional allowed regions, 0.6% in generously allowed regions, and 0.0% in disallowed regions, which strongly indicates that the molecular models have good overall stereochemical quality.

Overall description
The NS3 protease variant structures are composed of six subdomains, which occur in sequence along the polypeptide chain (Figures 2 and 3). The protease domain exhibits the dual β-barrel fold that is common among members of the chymotrypsin serine protease family. The helicase domain contains two structurally related β-α-β subdomains and a third subdomain of seven helices and three short β strands. The latter subdomain is usually referred to as the helicase α-helical subdomain. The 13-residue protease activation domain of NS4A contributes one strand to the N-terminal protease β-barrel and is considered to be the sixth subdomain [31].
Differences in subdomain structure in the NS3 protease variant molecule and in the structures of the isolated protease and helicase domains were assessed in several ways.
Figure 2: The structure of NS3 protease without inhibitor. The structure of an NS3 protease variant, an engineered molecule that consists of 631 NS3 residues. It has six subdomains: two β barrels in the protease domain (bottom); two β-α-β subdomains (top left) and one α-helical subdomain in the helicase (top right). The figure was generated by Molmol [37].
Inspection of the molecule revealed that the subdomain folds are similar. Overall preservation of structure is also apparent when the subdomains from the various structures are superposed [31].
The rmsd values of bond lengths and bond angles, the average G-factor and the Verify 3D values are shown in Table 2 for the NS3 protease variant structures. The same analysis for crystallographic structure (1CU1) present rmsd values

Table 1: Analysis of the Ramachandran plot and identity between template and models of NS3 protease variants.

The accuracy of comparative modeling is related to the percentage of sequence identity on which the model is based, reflecting the relationship between the structural and sequence similarities of two proteins. High-accuracy comparative models are based on >50% sequence identity to their templates. They tend to have ~1 Å r.m.s. error for the main-chain atoms, which is comparable to the accuracy of a medium-resolution NMR structure or a low-resolution X-ray structure. All structure models in the database were generated using alignments with more than 79% sequence identity, which yields models of high accuracy.

Database design, access, and interface
A MySQL database based on a relational database management system (RDBMS) was developed to archive protein structures identified in infectious agents, such as the NS3 protease variants from hepatitis C virus. All supporting data related to the 3D structure modeling, such as protein codes, atomic coordinates of the modeled proteins in PDB format, FASTA sequences, links to other databases and various information about the protein, were arranged in the MySQL [32] database under a master table. The aim of this database is to provide access to a collection of annotated models generated by automated homology modeling of NS3 protease variants from hepatitis C virus. All models in the database are publicly accessible via our interactive website (Figure 1) [33]. The database user interface provides user-friendly menus, so that all information can be printed in one step from any standard web browser. A small ribbon representation is included to give a first impression of the model structure (Figures 2 and 3). Atomic coordinates for the homology models can be downloaded in PDB format and their primary sequences in FASTA format. The fields are defined with links to the target sequence, the template structure entries in PDB [34], structural information and analysis. There are two homology models for each sequence in the database, one obtained using 1CU1 as template and the other using 1DY9 as template. The second model is suitable for docking simulations, since its template is a structure complexed with an inhibitor (PDB access code: 1DY9).

Table footnotes. *Total Score: the sum of the 3D-1D scores (statistical preferences) of each residue present in the protein. Ideal Score: S_ideal = exp(-0.83 + 1.008 × ln(L)), where L is the number of amino acids. S_ideal Score: the compatibility of the sequence with its 3D structure, obtained as Total Score / Ideal Score. S_ideal Score above 0.45 S_ideal. **Ideally, scores should be above -0.5. Values below -1.0 may need investigation.
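The Ideal Score formula quoted in the table footnote, S_ideal = exp(-0.83 + 1.008 × ln(L)), and the ratio Total Score / Ideal Score can be evaluated with a few lines of code. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and the total score in the usage line is an invented value, not a result from Table 2.

```python
import math


def ideal_score(length: int) -> float:
    """Ideal 3D-1D score for a protein of `length` residues.

    Implements S_ideal = exp(-0.83 + 1.008 * ln(L)) as given in the footnote.
    """
    if length <= 0:
        raise ValueError("length must be a positive number of residues")
    return math.exp(-0.83 + 1.008 * math.log(length))


def compatibility_ratio(total_score: float, length: int) -> float:
    """Total Score divided by Ideal Score, as described in the footnote."""
    return total_score / ideal_score(length)


# L = 631 is the NS3 length stated in the text; 250.0 is a made-up total score.
print(f"S_ideal for 631 residues: {ideal_score(631):.1f}")
print(f"ratio for a hypothetical total score of 250.0: "
      f"{compatibility_ratio(250.0, 631):.2f}")
```

Because the exponent is almost linear in ln(L), the ideal score grows roughly in proportion to chain length, so the ratio allows models of different sizes to be compared on one scale.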

Conclusions
Large-scale protein homology modeling, in which whole sequence databases or whole genomes are used as input to automated modeling algorithms, has been reported by several groups [35]. By utilizing powerful computer systems with multiple processors, these efforts have allowed the creation of large databases of homology models of proteins. This project increases the certainty that homology modeling is a useful tool in structural biology and that it can be very valuable in annotating genome sequence information and contributing to structural and functional genomics of viruses, bacteria and other organisms. Inhibition studies have shown that NS3 is only modestly inactivated by classic serine protease inhibitors such as chloromethylketones or phenylmethylsulfonyl fluoride [36]. The structural models will be used to guide future efforts in the structure-based design of a new generation of inhibitors of NS3 protease variants. This database is freely available to all users on the Web, providing us with a large number of structural models for use in protein-ligand docking analysis.

Molecular modeling
Homology modeling is usually the method of choice when there is a clear relationship of homology between the sequence of a target protein and at least one known structure. This computational technique is based on the assumption that the tertiary structures of two proteins will be similar if their sequences are related, and it is the approach most likely to give accurate results [18]. There are two main approaches to homology modeling: (1) fragment-based comparative modeling [14,19] and (2) restraint-based modeling [16]. For modeling of the NS3 protease variants from hepatitis C virus we used the second approach. Model building of the NS3 protease variants was carried out using the program MODELLER [16]. MODELLER is an implementation of an automated approach to comparative modeling by satisfaction of spatial restraints [20][21][22]. The modeling procedure begins with an alignment of the sequence to be modeled (target) with related known three-dimensional structures (templates). This alignment is usually the input to the program. The output is a three-dimensional model for the target sequence containing all main-chain and side-chain non-hydrogen atoms.
Next, the spatial restraints and CHARMM energy terms enforcing proper stereochemistry [23] are combined into an objective function. Finally, the model is obtained by optimizing the objective function in Cartesian space.
The optimization is carried out using the variable target function method [24], employing conjugate gradients and molecular dynamics with simulated annealing. Several slightly different models can be calculated by varying the initial structure. A total of 1000 models were generated for each enzyme and the final models were selected based on stereochemical quality. The entire optimization process was performed on a Beowulf cluster with 16 nodes (BioComp, AMD Athlon XP 2100+).
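The text reports 1000 models per enzyme and a 16-node cluster but does not describe how the work was divided. One simple possibility, shown here purely as an assumption, is to give each node a contiguous, near-equal block of model indices. The function name is hypothetical and the sketch involves neither MPI nor MODELLER.

```python
from typing import List


def partition_models(n_models: int, n_nodes: int) -> List[range]:
    """Split model indices 1..n_models into n_nodes contiguous blocks.

    Block sizes differ by at most one; the first (n_models % n_nodes)
    nodes receive the extra model.
    """
    if n_models < 0 or n_nodes <= 0:
        raise ValueError("need n_models >= 0 and n_nodes > 0")

    base, extra = divmod(n_models, n_nodes)
    blocks = []
    start = 1
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Numbers taken from the text: 1000 models, 16 nodes.
for node, block in enumerate(partition_models(1000, 16)):
    print(f"node {node:2d}: models {block.start}-{block.stop - 1} ({len(block)} models)")
```

Since the models differ only in their initial structure, the blocks are independent and each node could process its range without communicating with the others until the final selection step.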

Analysis of the models
The overall stereochemical quality of the final model for each NS3 protease variant from hepatitis C virus was assessed with the program PROCHECK [25]. The root mean square deviations (rmsd) from ideal geometries for bond lengths and bond angles were calculated with X-PLOR [26,27]. The G-factor is essentially a log-odds score based on the observed distributions of the stereochemical parameters. It was computed for the following properties: torsion angles (the analyses provided the observed distributions of φ-ψ, χ1-χ2, χ1, χ3, χ4 and ω values for each of the 20 amino acid types) and covalent geometry (main-chain bond lengths and bond angles). These average values were calculated using PROCHECK [25]. Verify-3D measures the compatibility of a protein model with its sequence; these values were calculated using the 3D profile method [28][29][30].
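The rmsd from ideal geometry used above is the square root of the mean squared difference between each observed value and its ideal value. The following sketch shows that calculation on its own. It is not the X-PLOR implementation, the function name is hypothetical, and the bond lengths in the usage example are invented for illustration.

```python
import math
from typing import Sequence


def rmsd_from_ideal(observed: Sequence[float], ideal: Sequence[float]) -> float:
    """Root mean square deviation of observed values from their ideal values.

    Works for any matched pair of lists, e.g. bond lengths in angstroms
    or bond angles in degrees.
    """
    if len(observed) != len(ideal) or len(observed) == 0:
        raise ValueError("observed and ideal must be non-empty and of equal length")
    squared = sum((o - i) ** 2 for o, i in zip(observed, ideal))
    return math.sqrt(squared / len(observed))


# Invented bond lengths (angstroms) and reference values, for illustration only.
observed_lengths = [1.34, 1.52, 1.46, 1.24]
ideal_lengths = [1.33, 1.53, 1.46, 1.23]
print(f"rmsd of bond lengths: {rmsd_from_ideal(observed_lengths, ideal_lengths):.4f} Å")
```

Bond lengths and bond angles are evaluated separately because they carry different units, which is why the text reports one rmsd for each.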