QRNAS: software tool for refinement of nucleic acid structures

Background Computational models of RNA 3D structure often present various inaccuracies caused by simplifications used in structure prediction methods, such as template-based modeling or coarse-grained simulations. To obtain a high-quality model, the preliminary RNA structural model needs to be refined, taking into account atomic interactions. The goal of the refinement is not only to improve the local quality of the model but to bring it globally closer to the true structure. Results We present QRNAS, a software tool for fine-grained refinement of nucleic acid structures, which is an extension of the AMBER simulation method with additional restraints. QRNAS is capable of handling RNA, DNA, chimeras, and hybrids thereof, and enables modeling of nucleic acids containing modified residues. Conclusions We demonstrate the ability of QRNAS to improve the quality of models generated with different methods. QRNAS was able to improve MolProbity scores of NMR structures, as well as of computational models generated in the course of the RNA-Puzzles experiment. The overall geometry improvement may be associated with increased model accuracy, especially on the level of correctly modeled base-pairs, but the systematic improvement of root mean square deviation to the reference structure should not be expected. The method has been integrated into a computational modeling workflow, enabling improved RNA 3D structure prediction.


Background
Ribonucleic acid (RNA) molecules play pivotal roles in living organisms. RNAs are involved in a variety of biological processes: they transmit genetic information, they sense and communicate responses to cellular signals, and even catalyze chemical reactions [1]. With the very rapid discovery of new classes of RNA molecules, new functions beyond storing genetic information are also being discovered. The functions of RNA molecules and interactions of proteins, RNAs, and their complexes, often depend on their structure, which in turn is encoded in the linear sequence of ribonucleotide residues. Thus, the understanding of the molecular basis of RNA function requires the knowledge of RNA structure.
The experimental determination of RNA 3D structures is expensive and difficult [2,3]. However, the ribonucleotide sequence determines RNA structure (in a similar manner as amino acid sequence determined protein structure), it is theoretically possible to infer the RNA structures from sequences. Since the historically first prediction of tRNA 3D structure in 1969 [4], throughout the decades, numerous computational methods were developed to generate RNA 3D structure from sequence. Currently, the field of research on RNA structure prediction is quite advanced, and the advantages and limitations of different methods are known, in particular from the assessment within the RNA-Puzzles community-wide experiment [5][6][7], which has been inspired by the CASP experiment for protein structure prediction [8].
Because of the very high costs of all-atom simulations, RNA 3D structures are usually not predicted by simulating all the details of the physical process of macromolecular folding, starting from sequence alone. The most successful general strategy for RNA 3D structure prediction that emerged from the RNA-Puzzles experiment involves the following approaches or their combination: 1) identification of pre-existing information in databases of molecular structure and e.g., using known structures as templates to develop a comparative model for the whole structure or its part; 2) running a simulation, often using a coarse-grained strategy, with restraints to represent all possible knowledge about the target structure, to generate ensembles of structurally similar conformations with possibly best scores. In this strategy, a typical approach is to derive potentials (scoring functions) based on a statistical analysis of experimentally determined structures. Statistical potentials can be used to replace or supplement the calculation of the physical free energy by evaluating the relative frequencies of features, such as pairwise distances of atoms (bonded and non-bonded) and mutual orientations of chemical groups (e.g., torsion angles). In this methodological framework, the most frequently observed structural features are also the most probable ones.
Simplifications applied in the process of RNA 3D structure prediction come with a cost of the loss of fine structural details. Computational models often present imperfect stereochemistry, unnatural bond lengths or steric conflicts. These deficiencies are clearly visible when using quality assessment tools, such as MolProbity [9,10]. To obtain a high-quality model, a structure obtained from template-based modeling or from coarse-grained simulations needs to be further refined. However, even models perceived as correct by validation tools can still be far from their native structures. The most challenging task faced by the refinement is not only to improve the visible quality of the model but to bring it closer to the 'true' structure (which in case of real predictions is unknown at the time of the modeling). According to RNA-Puzzles, the best models of medium-sized RNA molecules exhibit root mean square deviation (RMSD) of 5-10 Å from the reference structure. It is tempting to ask whether a dedicated software tool could improve these results.
In this article, we present QRNAS, a new software tool for fine-grained refinement of nucleic acid structures, dedicated to improving the quality of models generated by low-to medium-resolution methods commonly used, e.g., for RNA 3D structure modeling. QRNAS is capable of handling RNA, DNA or chimeras and hybrids thereof, and enables modeling of nucleic acids containing modified residues. We demonstrate the ability of QRNAS to improve the quality of models generated in the course of RNA-Puzzles, often with improvement in the model accuracy, as compared to the reference structure. QRNAS is also able to improve MolProbity scores of NMR structures from Protein Data Bank.

Force field
The force field used by QRNAS is a modified version of AMBER [11,12] adopted to represent 107 modified nucleotides currently known to be present in RNA [13]. Currently, 130 residues are parametrized, including four canonical ribonucleotides (A, G, C, U) and deoxyribonucleotides (dA, dC, dG, dT) as well as naturally occurring modifications thereof (e.g., m7G, m1A, dU, wybutosine, queuosine, etc.). The key novel feature of QRNAS is an extension of the AMBER force field with energy terms that allow for modeling of restrained structures and enforce the backbone regularization. Imposition of secondary structure is also possible due to interaction types that go beyond the original AMBER force field, namely: explicit hydrogen bonds and enforcement of base pair co-planarity. These two interaction types are often poorly modeled in structures generated by computational structure prediction methods, and in our experience, their enforcement is a critical element of high-resolution refinement. Application of custom distance restraints required the introduction of pairwise harmonic interactions. Regularization of backbone torsions was realized by introduction of a knowledge-based energy term. All these add-ons carry along a certain degree of arbitrariness, and for this reason, we made them optional. In particular, our program falls back to plain AMBER [13] when all four additional terms are disabled. Similarly, electrostatics and van der Waals interactions can be disabled by the user (e.g., to speed up the calculation). With electrostatics enabled, the user can choose between generalized Born solvent and vacuum environment. In either case, the system is assumed to be non-periodic.
The new energy terms associated with hydrogen bonds, base pairs, backbone irregularities, and custom restraints are given, respectively, by Eqs. (1)-(4) (see below).

Explicit hydrogen bonds
Although hydrogen bonds in AMBER are currently handled by means of electrostatic and van der Waals interactions, we decided to reintroduce an additional explicit description. Our goal was to gain finer control over the strength of this interaction. This was prompted in part by our observation, e.g., in the context of the RNA-Puzzles experiment, that in computational models of RNA structure obtained by low-to medium-resolution computational methods, interactions based on hydrogen bonding are often poorly modeled [5][6][7]. Computationally modeled structures often present an "almost correct" orientation of hydrogen bond donors and acceptors, which nonetheless deviates from the values typically observed in high-resolution structures.
In these computational models, a relatively small adjustment of geometry often leads not only to an interaction that can be detected as a "proper" hydrogen bond by software for structure analysis but to an improved overall orientation of base moieties involved in pairing via these hydrogen bonds. Thus, with high force constant, explicit hydrogen bonds can be used as restraints when imposing secondary structure on the modeled nucleic acid molecule. Another benefit of enforcing strong hydrogen bonds in the structure optimization procedure is that geometrically correct contacts are preserved throughout the computational simulation once they are formed.
According to Lu et al., the statistical analysis of the hydrogen-bonds obtained from simulations shows that the strengths of hydrogen bonds in liquid water conform to a Gaussian distribution [14]. Therefore, the energy term associated with hydrogen bond (E H-bond ) was chosen to be Gaussian in its length with an exponential dependence on the cosine of its angle: Where k 1 denotes the force constant, r ij is the hydrogen bond length between donor hydrogen i and acceptor j, and θijk is the bond angle between donor-hydrogen-acceptor. The parameters k 1 , i, θ 0 were iteratively tuned to reproduce experimental hydrogen bond lengths. The multiplier was arbitrarily set at a value of − 1 kcal/mol, which proved to provide good persistence of contacts in the course of energy minimization.

Base pair co-planarity
Models of RNA structure obtained by computational methods (in particular by coarse-grained methods and in the process of comparative modeling) often present various deviations of base-pair geometry. In particular, canonical Watson-Crick base pairs often deviate from co-planarity. Therefore, QRNAS was equipped with an optional feature that performs the idealization of base pair planarity. When enabled, Watson-Crick base pairs are not only restrained by explicit hydrogen bonds but also additionally flattened. The flattening is implemented by application of force to the atoms of each base according to Eq. (2): where k 2 denotes the force constant; r i0 is the distance from the i-th atom of the base to the plane that best matches the base pair. The plane is least-squares fitted to the atoms of both bases. The magnitude of the force acting on each atom is proportional to its distance from the plane of the base, while the direction of the force is perpendicular to this plane. Base pair restraints are introduced only at startup. For two Watson-Crick bases to be considered as a pair, the energy resulting from term (2) must be below − 2 kcal/mol. A user can also override this behavior by providing secondary structure in Vienna format (for a single chain) or as a list of contacts (in general case). In such case automatic detection of base pairs is disabled.

Backbone regularization
The feature of backbone regularization is intended to correct outlying conformers reported by MolProbity. Upon energy minimization, it drags the backbone atoms of each residue to a known conformation, stored in an internal database. The database of preferred conformations was populated with data from all crystal structures of RNA stored in Protein Data Bank (PDB) [15] with a resolution below 1.4 Å as of June 2013. QRNAS identifies a local backbone conformation in a fragment stored in the database that is closest to the one in the input model according to a minimal Root Mean Square Deviation (RMSD) value. The forces acting on atoms are harmonic, as given by Eq. (3).
The parameter k 3 denotes the force constant; bi is the position of i-th backbone atom in a reference backbone. Coordinates b i are transformed by translations and rotations to minimize the RMSD between the optimized backbone and the reference one. A similar library-based approach has been used in RNAfitme web-server for remodeling of nucleic-acid residue conformations of RNA structures [16].
Noteworthy, the original force field parameters were subject to minor tuning, to generate structures with better MolProbity scores. We changed the rest values of OP1-P-OP2 and N9-C1'-O4' angles to 119.62°and 109.00°respectively, thereby allowing for the elimination of most 'bad angles' reported by MolProbity.

Custom restraints
Distance restraints are implemented as simple harmonic forces, as given by Eq. (4).
k 4 denotes the force constant which can be set by the user. The spring forces can be used as positional or distance restraints since their anchor points c i can be constituted by both atoms and arbitrary points in space.

Minimization
After setting up the model, QRNAS starts to minimize the energy of the system. All force field terms in our model are analytically differentiable, enabling us to use minimization schemes with explicit gradient information. We implemented two algorithms: steepest descent with golden section search and Polak-Ribiere conjugate gradients [17].

Performance optimization
Calculation of electrostatics was parallelized for machines with symmetric multiprocessing (SMP) capability, i.e., multicore workstations. Parallelism was achieved by processing of the 'electrostatic interaction matrix' in blocks that share no common atoms. Consequently, the proposed algorithm is nearly lock-free and has much-improved cache hit rate compared to a version which processes pairwise interactions in a random order. We tuned the parameters of the algorithm (block size and pointer hashing function) to achieve good performance on workstations with up to 8 cores. As a proof of concept, we successfully conducted minimization of ribosomal RNA taken from the 60S subunit of the eukaryotic ribosome (PDB code: 4A18) achieving the performance of 0.2 golden-section search steps per hour.
Example run-times for representative models of RNA structure analyzed in this paper, minimized for 1000 steps on a single core of 2.
We used QRNAS with restraints on explicit hydrogen bonds, restraints on base pair co-planarity, and backbone regularization. No custom restraints were used at this stage. QRNAS was able to resolve all clashes in the studied set, outperforming both the RNAfitme web server (which uses NAMD with CHARMM force-field for optimizing RNA structures) and sander from the AMBER package (Table 1). The mean amount of bad angles was reduced from 3.46 to 1.31%. The average fraction of wrong backbone conformations was reduced from 27.43 to 14.83%. On the contrary, RNAfitme and sander increased the percentages of bad angle and wrong backbone conformations upon refinement. None of the methods has shown consistent improvement of the fraction of bad bonds. This analysis demonstrates the ability of QRNAS to regularize structures and improve their MolProbity scores, and also shows the limitations of current methods. For practical application of QRNAS to optimize NMR-derived RNA models it will be worthwhile to use NMR-derived data as additional custom restraints in the optimization process and to validate the optimized structures against the NMR data that were not used in the optimization.

Assessment of model accuracy
In molecular modeling, one of the essential steps is the selection of the potentially best models. Once the different conformations are generated, a scoring function can be applied to assess the global and local features of the model, aiming at discriminating models that are closer to the 'true' structure (usually represented as a model obtained in the course of X-ray crystallography or NMR experiments and used as a reference) from those that are less accurate. While the selection of models was not the primary goal of QRNAS, we tested its ability to score models. In general, in our various analyses, we did not observe the correlation of QRNAS single point energy values (combined with additional scoring from our custom terms) with the model quality (data not shown) [6,7,[29][30][31]. We suspected that this might be caused by the fine-grained character of the scoring function and its extreme sensitivity to the ruggedness of the RNA energy landscape. In other words, we expected that QRNAS might be able to discriminate 'good' and 'bad' models only very close to the global energy minimum corresponding to the reference structure. On the other hand, in typical modeling exercises, models generated computationally are relatively far from the reference structure, and their RMSD values rarely fall below 5 Å.

5.69
The first models from the NMR were used in this analysis. The PDBs that contains DNA/hybrid were not analyzed using RNAfitme and represented by '-'in the table II self-splicing intron (PDB code: 1KXK [34]), viral RNA pseudoknot (PDB code: 1L2X [35]), G-riboswitch aptamer (PDB code: 1Y27 [36]), and fluoride riboswitch (PDB code: 4ENC [37]); and we generated models by introducing minor random perturbations to positions of all atoms. From the pool of generated models, we selected 1000 structures with RMSD to the starting/reference structure ranging from near 0.00 to 5.00 Å. Scoring these models with QRNAS revealed a funnel-like shape, indicative of an energy/score minimum near the native structure (Fig. 1). Alas, the funnel was very narrow, less than 2 Å, which indicated that QRNAS could discriminate only between models that were extremely close to the reference and all the others, but it was incapable of discriminating between models that are very good (RMSD, e.g., around 2 Å) and those that are much worse. This also suggested that the optimization of QRNAS score (e.g., in the course of model refinement) is unlikely to improve the global accuracy of models unless the starting models are already extremely close to the 'true' structure. For models of lower accuracy, statistical potentials can be used, such as RASP [38] or the energy functions used in 3D structure prediction methods such as SimRNA [31,39] or ROSETTA/ FARNA/FARFAR [40,41]. It is worth emphasizing that computational improvement of model accuracy remains a difficult problem, for which no perfect solution exists. QRNAS addresses one of the aspects of this problem, at the level of local geometry.

Refinement of models in RNA-puzzles experiment
We analyzed the performance of QRNAS on models for two targets of the RNA-Puzzles experiment (Puzzle #1relatively easy [5], Puzzle #6very difficult [6]), and the resulting broad range of model accuracy. We analyzed up to five top first structures submitted by various participants, generated with different modeling methods, and hence presenting different types of errors and inaccuracies. The modeling methods used by different groups for Puzzles #1 and #6 include ModeRNA [42] and SimRNA [31,39]  To assess the capabilities of QRNAS, we conducted a full refinement with default parameters for 10,000 steps. For comparison, we performed refinement with RNAfitme and minimization with sander from the Amber 14 package [47]. RNAfitme was run with the default settings on the web server. Minimization with sander was performed in a truncated octahedral box of 10 Å with TIP3P water model [48] and leaprc.ff14SB variant of the forcefield [49,50]. The following parameters were used while running sander: imin 1, maxcyc 10,000, cut 300, igb 2, saltcon 0.2, gbsa 1, ntpr 10, ntx 1, ntb 0. For the resulting models, we calculated the value of global RMSD to assess the overall accuracy, and the Interaction Network Fidelity (INF) to compare the accuracy of residue-residue contacts identified in the original and optimized structures [51]. INF values are calculated for all types of contacts including canonical and non-canonical base-pairs and stacking. For the detection of base pairs, we have used our in-house method ClaRNA [52].
In all cases, QRNAS improved MolProbity scores, in particular, it resolved nearly all steric clashes (Tables 2  and 3). For Puzzle #1 (Table 2), the average change of RMSD was − 0.01 for QRNAS vs. 0.26 for sander (i.e., essentially no change vs. minimal deterioration). However, the average INF value decreases from 0.802 to 0.768, 0.759, and 0.482, calculated from the optimized models using QRNAS, sander and RNAfitme web server, respectively. For Puzzle #6 (Table 3) the average change of RMSD was 0.53 for QRNAS vs. 0.51 for sander and 0.52 for RNAfitme (negligible deterioration), and the average improvement of INF was 0.001 (for QRNAS) compare to 0.00 (for sander) and − 0.04 (for RNAfitme) in respect to the starting models. To evaluate the performance of QRNAS to see how it can optimize the non-canonical contacts, we have calculated INF considering only the non-Watson-Crick contacts (INF_nWC) for the models of RNA-Puzzles #1 and #6. In both the rounds, QRNAS improved the INF_nWC values with respect to the starting models. Though QRNAS and RNAfitme have comparable (very minor) improvement of non-canonical contacts, sander does not improve such contacts. Summarizing, in terms of RMSD, the structures changed very little; sometimes the models improved slightly, sometimes they deteriorated slightly. This was expectable because in all cases the models were so far from the reference structure that the local refinement was not expected to drive them towards the global energy minimum, but rather towards a local minimum, which could be further away from the reference structure. On the other hand, we could observe a small increase in the INF values, indicating a small improvement of predicted contacts. We attribute this small change to the ability of QRNAS to improve the local geometry, in particular in the case of base pairs. In models that are reasonably close to the 'true' structure and exhibit residues that are 'almost' in proper contact with each other (as in many models for Puzzle #1), the optimization by QRNAS can refine these contacts and enable the formation of proper base pairs. The smaller improvement of contacts in models of Puzzle #6 can be explained by the low quality of the starting structures, and the lower fraction of 'nearly correct' contacts that could be optimized.

Previously published examples of QRNAS application
Following the development and initial tests of QRNAS, we applied it in various modeling studies. In the course of collaborative work on models generated by all groups for Puzzles #5, #6, and #10, we found that models submitted by the Das group had poor clash scores, despite their overall relative accuracy, as measured in terms of RMSD to the reference structure. We have therefore run QRNAS on all Das models submitted for Puzzles #5, #6, and #10 (17 models total). In all cases, a dramatic reduction of clash scores was obtained; in 10 models even down to zero. Only in three cases, the clash scores remained larger than 4; however, these models had initial Clash Scores of nearly 30. Details of this analysis were reported in an article describing RNA-Puzzles Round II [6].
In order to evaluate the performance of QRNAS for blind predictions (at the time when the experimentally determined structure was not available), we calculated the MolProbity scores of RNA-Puzzles #6 models generated in our group before the refinement. The MolProbity scores show improvement in the quality of the models as the average Clashscores reduced from 8.99 to 1.99 ( Table 4). The current version of QRNAS has also reduced the bad conformations, bad angles, and bad bonds in the models submitted for RNA-Puzzles #6 (Table 3).
In the case of group I intron modeling study [29], QRNAS was used as the final step of a workflow to improve a model generated with ModeRNA [42] and SimRNA [31]. It reduced the clash-score from 184.69 to 0.37, bad bonds from 4.12 to 0.00%, bad angles from 6.53 to 0.88%, without major changes of the deviation from the reference structure (10.9 Å to 11.0 Å).

Conclusions
QRNAS is a software tool for fine-grained refinement of nucleic acid structures, based on the AMBER force field with additional restraints. QRNAS is capable of handling RNA, DNA, chimeras, and hybrids thereof, and enables modeling of nucleic acids containing modified residues. We demonstrate the ability of QRNAS to improve the quality of RNA 3D structure models generated with different methods. QRNAS was able to improve MolProbity scores of NMR structures, as well as of computational models generated in the course of the RNA-Puzzles experiment. The overall geometry improvement may be associated with the improvement of local contacts, but the systematic improvement of root mean square deviation to the reference structure should not be expected. QRNAS can be integrated into a computational modeling workflow with other tools, enabling improved RNA 3D structure prediction. Our group systematically uses QRNAS at the final stage of model refinement in the context of the RNA-Puzzles experiment.
Programming language: C++ License: GNU GPLv3+ Any restrictions to use by non-academics: None For the compilation of QRNAS, a C++ compiler, such as GNU g++ is required. A Makefile is provided for the compilation of the package. Download the software from http://genesilico.pl/software/stand-alone/qrnas or clone it from https://github.com/sunandanmukherjee/QRNAS. git. Unzip the archive, and compile it with the command make to create an executable version of QRNAS. To execute the program use the command …/path/to/ QRNAS/QRNA -i input.pdb -o output.pdb where input.pdb is the file to be optimized and output.pdb is the optimized structure. For more advanced usage of QRNAS, users should consult the user manual and the README.txt file in the QRNAS package.