Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce
© Zhang et al; licensee BioMed Central Ltd. 2013
Published: 8 November 2013
Ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Their secondary structures are crucial for the RNA functionality, and the prediction of the secondary structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs that apply to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, as well as a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment.
On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted by using the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family outlines how the coarse-grained mapping of chunking and predictions in the MapReduce framework exhibits shorter turnaround times for short RNA sequences. However, as the lengths of the RNA sequences increase, the fine-grained mapping can surpass the coarse-grained mapping in performance.
By using our MapReduce framework together with statistical analysis on the accuracy retention results, we observe how the inversion-based chunking methods can outperform predictions using the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
Ribonucleic acid (RNA) is made up of four types of nucleotide bases: adenine (A), cytosine (C), guanine (G), and uracil (U). A sequence of these bases is strung together to form a single-stranded RNA molecule. RNA plays important roles in many biological processes including gene expression and regulation. RNA molecules vary greatly in size, ranging from nineteen nucleotide bases in microRNAs  to long polymers of over 30,000 bases in complete viral genomes . Although an RNA molecule is a linear polymer, it tends to fold back on itself to form a 3-dimensional (3D) functional structure, mostly by pairing complementary bases. Among the four nucleotide bases, C and G form complementary base pairs by hydrogen bonding, as do A and U; in RNA (but not DNA), G can also base pair with U residues. The overall stability of an RNA structure element is determined by the "minimal free energy" defined as the amount of energy it would take to completely unpair all of the base pairs that hold it together (e.g., by denaturing it with heat).
RNA secondary structure predictions
Secondary structures are crucial for the RNA functionality and therefore the prediction of the secondary structures is widely studied. Development of mathematical models and computational prediction algorithms for stem-loop structures began in the early 1980's [5–7]. Pseudoknots, because of the extra base-pairings involved, must be represented by more complex models and data structures that require large amounts of memory and computing time to obtain the optimal and suboptimal structures with minimal free energies. As a result, development of pseudoknot prediction algorithms began in the 1990's [8, 9].
Most existing secondary structure prediction algorithms are based on the minimization of a free energy (MFE) function and the search for the most thermodynamically stable structure for the whole RNA sequence. Searching for a structure with global minimal free energy may be memory and time intensive, especially for long sequences with pseudoknots. To overcome the tremendous demand on computing resources, various alternative algorithms have been proposed that restrict the types of pseudoknots for possible prediction in order to keep computation time and storage size under control. Yet, most programs available to date for pseudoknot structure prediction can only process sequences of limited lengths on the order of several hundred nucleotides. These programs, therefore, cannot be applied directly to larger RNA molecules such as the genomic RNA in viruses, which may be thousands of bases in length. At the same time, minimal energy configurations may not be the most favorable structures for carrying out the biological functions of RNA, which often require the RNA to react and bind with other molecules (e.g., RNA binding proteins). Our current work suggests that local structures formed by pairings among nucleotides in close proximity and based on local minimal free energies rather than the global minimal free energy, may better correlate with the real molecular structure of long RNA sequences. This hypothesis has yet to be supported by more detailed experimental evidence. If proven correct, our approach will open the door to a new generation of programs based on segmenting long RNA sequences into shorter chunks, predicting the secondary structures of each chunk individually, and then assembling the prediction results to give the structure of the original sequence.
In our previous work, we had proposed to predict secondary structures for long RNA sequences using three steps: (1) cut the long sequence into shorter, fixed-size chunks; (2) predict the secondary structures of the chunks individually by distributing them to different processors on a Condor grid; and (3) assemble the prediction results to give the structure of the original sequence . We used this approach on the genome sequences of the virus family Nodaviridae, leading to the discovery of secondary structures essential for RNA replication of the Nodamura virus . However, the study also identified the necessity of having a more effective segmentation strategy for cutting the sequence so that the predicted results of the chunks can be assembled to generate a reasonably accurate structure for the original molecule. Indeed, the selection of cutting points in the original RNA sequence is a crucial component of the segmenting step. In this paper, we propose to approach the problem by identifying inversion excursions in the RNA sequence and cutting around them. We consider two alternative inversion-based segmentation strategies: the centered and optimized chunking methods. Both methods identify regions in the sequence with high concentrations of inversions and avoid cutting into these regions. In the centered method, the longest spanning inversion clusters are centered in the chunks, while in the optimized method, the number of bases covered by inversions is maximized. Preliminary results have been presented in the authors' work [12, 13].
MapReduce and Hadoop
The prediction of RNA secondary structures for long RNA sequences based on sequence segmentation can be performed in parallel, thus benefiting from parallel computing systems and paradigms. We use the well-known MapReduce framework Hadoop for our parallel predictions. The MapReduce paradigm is a parallel programming model that facilitates the processing of large distributed datasets, and it was originally proposed by Google to index and annotate data on the Internet . In this paradigm, the programmer specifies two functions: map and reduce. The map function takes as input a key k1 and value v2 pair, performs the map function, and outputs a list of intermediate key and value pairs which may be different from the input list 〈k2, v2〈 - i.e., Map 〈k1, v1〉 → list 〈k2, v2〉. The runtime system automatically groups all the values associated with the same key and forms the input to the reduce function. The reduce function takes as input a key and values pair 〈k2, list(v2)〉, performs the reduce function, and outputs a list of values - i.e., Reduce 〈k2, list(v2)〉 → list 〈v3〉. Note that the input values to reduce is the list of all the values associated with the same key.
MapReduce is appealing to scientific problems, including the one addressed in this paper, because of the simplicity of programming, the automatic load balancing and failure recovery, as well as the scalability. It has been widely adapted for many bioinformatics applications. For example, Hong et al. designed an RNA-Seq analysis tool for the estimation of gene expression levels and genomic variant calling , and Langmead et al. designed a next-generation sequencing tool based on MapReduce Hadoop . To the best of our knowledge, our work is the first one to adapt MapReduce into secondary structure predictions of long
RNA sequences. Preliminary work on the reasoning behind adapting RNA secondary structure predictions to the MapReduce paradigm can be found at .
Workflow for parallel chunk-based predictions
Rather than predicting the RNA sequence as a whole, we cut each sequence into chunks and predict each chunk independently before merging the predictions into the whole secondary structure. As the cutting process can be performed in different ways, the search for effective ways to cut sequences can require a large search space and generate a large number of independent prediction jobs that can potentially be performed in parallel. The workflow for a parallel chunk-based RNA secondary structure prediction and accuracy assessment consists of the following four steps: (1) chunking: each RNA sequence is cut into multiple chunks (or segments) according to various chunking algorithms and parameters; (2) prediction: the secondary structure for each chunk is predicted independently by using one or more prediction programs; (3) reconstruction: the whole secondary structure of a sequence is reconstructed from predicted structures, one for each chunk; and (4) analysis: reconstructed structures are compared against known structures to assess prediction accuracies.
Chunking process based on inversions
Given a long RNA sequence, we identify regions with high concentrations of inversions by using an adapted version of the "Palindrome" program in the EMBOSS package , which is a free open source software analysis package. Two main reasons for adapting the EMBOSS Palindrome program are as follows: the original program works correctly on DNA but not RNA sequences and does not support G-U pairing that we plan to include in our adaptation. Our adapted program, InversFinder, is written in Java and is available for download at http://rnavlab.utep.edu. InversFinder requires a text file containing the RNA sequence in FASTA format as input. The minimum stem length L and maximum gap size G of the inversion are parameters specified by the user.
The chunking step relies on a general excursion approach first formulated in , which has already been applied to a variety of sequence analysis problems but not to RNA secondary structure predictions. In many bioinformatics applications, the problem calls for identifying high concentration regions of a certain property in the nucleotide bases of biomolecular sequences. For example, replication origins in viral genomes have been predicted by looking for regions that are unusually rich in the nucleotides A and T in DNA sequences . In this paper, we follow the same approach for RNA sequences, but our focus is whether or not the nucleotide base is found inside an inversion. We refer to the excursions generated by this property as "inversion excursions." The excursion method requires assigning a positive score to each nucleotide if it is a part of an inversion (including the two stems and the gap between them), and a negative score if it does not. We go through the entire nucleotide sequence accumulating the scores to form inversion excursions.
After generating the excursion plot, we identify the positions, called peaks, where the excursion scores are local maxima. Then, the bottom of each peak, which is the last position with a zero excursion score right before the peak, is located. After that, the length of the peak (the location difference between a peak and its peak bottom) is calculated. Note that since we require chunk lengths to be smaller than a prescribed maximum c, peak lengths greater than c have to be flagged and analyzed separately. Figure 5 also shows examples of peaks, peak bottoms, and peak lengths. Peaks are sorted in decreasing order based on their excursion scores. The sorted peaks are then used to cut sequences in chunks by the centered and optimized chunking methods.
Centered chunking method
The centered method cuts the sequence by identifying inversions and building the chunks around them. The objective is to segment the RNA sequence in such a way as to avoid losing structural information as much as possible by centering the longest spanning inversion clusters in the chunks. After peaks are identified, they are sorted in decreasing order of their excursion values. The peak with the highest excursion value is considered first, then the second highest peak is considered, and so on. The algorithm stops either when all the peaks are exhausted or when all the inversion regions of the sequence (i.e., all "1"s in the binary sequence) have been included in the chunks, whichever occurs first. Overlapping chunks are adjusted so that any nucleotide base is captured by only one chunk, with priority given to the peak with a higher excursion score.
Optimized chunking method
Regular chunking method
The regular chunking method is the simplest method of segmentation and is used as a reference method in this paper. This method cuts the nucleotide sequence regularly into chunks of a specified maximum chunk-length c until the sequence is exhausted.
Prediction based on well-known algorithms
After the RNA sequence is cut into chunks, the structure of each chunk is predicted independently using well-known algorithms and their programs. We use the same prediction algorithms to predict the entire sequence without chunking. We employ seven commonly used prediction programs to test the chunking methods. The programs that predict structures only for non-pseudoknotted sequences are UNAFOLD (2008) and RNAfold (1994). The programs that predict both pseudoknotted and non-pseudoknotted sequences are IPknot (2011), pknotsRG (2007), HotKnots (2005), NUPACK (2004), and PKNOTS(1998). These prediction programs, which typically involve some form of minimization of free energy, maximization of expected accuracy, or dynamic programming models in their algorithms, are all publicly available.
Reconstruction based on concatenation
The results of the chunk predictions are assembled to build a whole secondary structure. Currently, our framework simply concatenates all these predicted secondary structures to give the secondary structure for the whole sequence. This is possible because the cutting does not allow any overlap between two consecutive chunks. More sophisticated reconstruction methods that include partial chunk overlaps can be used with minor changes to our framework.
Accuracy analysis based on comparisons with known structures
Various statistical tests are applied to the accuracy analysis for the different chunking methods including t-tests, Pearson correlation analysis, and the non-parametric Friedman tests. We use the statistical functions provided by MATLAB . Metrics of interests include: (1) accuracy chunking (AC), which is the accuracy of the predicted structure assembled from the chunks when compared with the known secondary structure; (2) accuracy whole (AW), which is the accuracy of the predicted structure obtained from the whole sequence when compared with the known secondary structure; and (3) accuracy retention (AR), which is the ratio between AC and AW. While AC and AW reflect accuracies of the particular prediction in use with and without chunking, AR tells us how well a particular chunking method retains the accuracy of the original prediction program.
where a and b represent respectively the number of unpaired bases and the number of base pairs in common between the two structures, and n is the length of the RNA sequence. Large AC and AW values (close to 100%) for a predicted structure mean that it is highly similar to the real structure.
AR provides a comparison of the prediction accuracies with chunking versus without chunking. Intuitively, we expect that a good chunking method would cause only a minimal loss of prediction accuracy after cutting the sequence and would have AR values somewhat less than but close to 1. However, we will see in the result section that in many cases the AR values turn out to be greater than 1, meaning that secondary structure predicted using chunking is more similar to the real structure than it is the secondary structure predicted by using the whole sequence. Several standard statistical tests, including t-tests, Pearson correlation analysis, and the non-parametric Friedman tests , are applied to analyze the AR values for the different chunking methods.
Adapting multiple searching paths to MapReduce
Given an RNA sequence, the search for the best set of chunking parameters (i.e., maximum chunk length c, chunking method, minimum stem length L, and maximum gap length G) requires us to traverse or search a multi-level tree (i.e. the chunking tree in Figure 3.b). In the chunking tree, each path from the root (RNA sequence) to the leaves (RNA chunks) represents a set of parameter values of the chunking method (i.e., c, L, and G). The overall workflow (including the chunking, prediction, reconstruction, and analysis steps) naturally adapts to fit into the MapReduce (MR) paradigm and can be easily implemented with Hadoop for which the chunking and predictions can be solved by multiple mappers while the reconstruction and the analysis are done by a single reducer. In our framework, each MR job is designed to partially traverse the multi-level tree. Multiple MR jobs can be executed in parallel to explore the whole tree. The multiple searching paths combine attributes of both breadth-first search (performed by multiple MR jobs in parallel) and depth-first search (performed by a single MR job). While traversing the tree with multiple MR jobs, we can explore the impact of different chunking methods as well as different c, L and G values for a given sequence. An example of an MR job is shown in the circled part of Figure 3.b, for which we assume the centered chunking method, with fixed c = 60 bases, and we vary L and G between 3 and 8 and between 0 and 8 respectively. As previously outlined, for a sequence and a combination of parameters, the mappers perform the chunking and predictions. The input to each mapper is a 〈k1, v1〉 value pair, in which k1 is the ID of the sequence, and v1 is the chunking parameters' values (including the chunking method). Each mapper cuts the sequence according to the chunking parameters values in the chunking step by identifying a variable number of chunks meeting the parameter requirements. Note that each combination of parameters (each branch of the tree) can result in a variable number of chunks. Each mapper performs the prediction on one or more chunks using a certain prediction program. Here we use five secondary structure prediction programs capable of predicting pseudoknots (IPknot , pknotsRG , HotKnots , NUPACK , and PKNOTS ) and two programs that do not include this capability (i.e., UNAFOLD  and RNAfold ). Other programs can be easily used in our framework as a plug-and-play software module. After the prediction, each mapper outputs the list of 〈k2, v2〉 pairs as the intermediate output to reduce. The k2 is the ID of the whole secondary structure to which the predicted chunk belongs and v2 is the predicted secondary structure of the chunk. After the Hadoop runtime system groups all the values associated with the same key and passes the 〈k2, list(v2)〉 to the reducer, the reducer reconstructs the whole secondary structure of the sequence using all the v2 (predicted chunk structures) associated with the same k2. If required, the reducer analyzes the results in terms of their accuracy. After the accuracy has been computed, the reducer outputs the final results as a list(v3), in which v3 is the AR for reconstructed structures.
Granularity of mappers
Results and discussion
Datasets and hardware platform
For the study of both accuracy and performance, we plug seven RNA secondary structure prediction programs into our framework for both the chunk-based predictions and the predictions of the same sequences without chunking (the whole sequence is taken). Five of the programs, IPknot , pknotsRG , HotKnots , NUPACK , PKNOTS  can predict both stem-loops and pseudoknots. The remaining two programs, UNAFOLD  and RNAfold , can predict stem-loops only. We consider both the centered (C) and optimized (O) chunking methods and compare them against the naïve regular method (R) as a reference. We also consider a wide range of parameter settings with maximum chunk length c from 60 to 150 bases, minimum stem length L from 3 to 8, and maximum gap length G from 0 to 8.
To study the framework accuracy, we use two datasets of sequences which have previously established secondary structures. The first dataset, compiled from the RFAM database, consists of 50 non-pseudoknotted sequences and the lengths of sequences range from 127 to 568 bases. The second dataset, compiled from the RFAM and Pseudobase++ [21, 29] databases, consists of 23 pseudoknotted sequences, and the lengths of the sequences in this dataset range from 77 to 451 bases. Note that there are no large datasets of experimentally determined RNA secondary structures including pseudoknots, and to the best of our knowledge the one used in this paper is one of the few available to the public for free.
To study the framework performance, we use a smaller dataset of longer sequences (i.e., 14 RNA sequences from the virus family Nodaviridae) for which the secondary structures are not known. We assume pseudo-knots may be present and use the above-mentioned five prediction programs that are capable of capturing pseudoknots and we report only performance values but not accuracy. Because these RNA sequences are long (each has about 1300 to 3200 bases) and contain possible pseudoknots, none of the available programs can predict the secondary structures for the entire sequences. The use of the MapReduce framework is vital for the exhaustive, efficient exploration of the tree branches.
We ran the MapReduce framework on a cluster composed of 8 dual quad-core compute nodes (64 cores), each with two Intel Xeon 2.50 GHz quad-core processors. A front-end node is connected to the compute nodes and is used for compilation and job submissions. A high-speed DDR Infiniband interconnect for application and I/O traffic and a Gigabit Ethernet interconnect for management traffic connects the compute and front-end nodes. Our implementation is based on Hadoop 0.20.2.
There are three main questions that we want to answer in regard to the effects of our chunk-based approaches on the accuracy of various established secondary structure prediction programs. First, we want to evaluate to what extent chunk-based predictions retain the prediction accuracy. Second, we want to identify whether the capability of a chunking method to retain the prediction accuracy might decline with increasing sequence lengths. Third, we want to assess the extent to which the inversion based chunking methods (C and O) outperform the naïve chunking method (R), and whether there is any difference in accuracy between the C and O chunking methods.
As described in the Method section, the AC value for a predicted RNA structure is the percentage of agreement between the known structure and the structure obtained by concatenating the predicted structures of the chunks. Likewise, the AW value is the percentage of agreement between the known structure and the predicted structure when the whole sequence is used. These values indicate how closely the predicted structure resembles the real structure. A larger AC value means that the chunk-based predicted structure is more similar to the real structure. For a given dataset, prediction program, and chunking method, our MR framework collects multiple predicted structures associated with different c, L, and G parameters. The MAC value for a sequence is the maximum AC value, which gives the highest accuracy that can be attained for that sequence by the chunking method and the specific prediction program employed. In Figures 13.d and 14.d, the AW of the sequences in the two datasets are presented respectively. From these figures, it appears that most of the prediction methods have similar accuracy ranges regardless of the chunking method used and whether the prediction was obtained with the whole sequence or with the chunks; however, the PKNOTS program produces somewhat lower accuracies. This lower accuracy is quite expected because PKNOTS is actually the earliest algorithm allowing for pseudoknot prediction. The other prediction programs with pseudoknot prediction capability that have developed afterwards have incorporated improvements over the original PKNOTS.
MAR statistics for 50 non-pseudoknotted sequences.
MAR statistics for 23 pseudoknotted sequences
For non-pseudoknotted sequences, the mean MAR is significantly greater than 1 for all three chunking methods, whereas the mean MAR values for the pseudoknotted sequences are greater than 1 for the C and O chunking methods. With the R chunking method, one of the mean MAR values (with NUPACK) falls below 1 to 0.93. Looking at all the p-values, one can conclude that the average prediction accuracy attained with segmentation is not significantly less than that without. With the inversion based C and O chunking methods, we can conclude that the average prediction accuracies attained with segmentation are at least as good as, and often even better than, those without segmentation.
While the above results show that sequence segmentation will not reduce prediction accuracy on average, we still need to examine whether the MAR values would decline as the whole sequence length grows, because a declining trend would imply that the accuracy retention will deteriorate when the segmentation approaches are applied to longer RNA sequences. To this end, for each dataset, chunking method, and prediction program, we perform the Pearson correlation analysis on the MAR values of the sequences . For each dataset, we report both the correlation coefficient r and corresponding p-value between MAR and sequence length. If the r value is close to -1, it means that MAR and sequence lengths are negatively correlated, implying a decline in accuracy retention of the chunking method. If the associated p value is less than 0.05, we consider the correlation statistically significant; otherwise the correlation is not significant.
MAR correlation coefficients (r) and p-values (p)
Count and rank sum of sequences
P-values from the Friedman test
Because the Friedman test does not reveal whether any one method is significantly better than another, we also perform the post-hoc pairwise comparison test on each pair of the three chunking methods in order to confirm that the inversion based centered and optimized chunking methods are indeed superior to the naïve regular method. The p-values, shown in the "R-C," "R-O," and "C-O" columns, indicate that both the centered and optimized methods are better than the regular method. Furthermore, there are no significant differences between the centered and optimized chunking methods except when PKNOTS is applied to the pseudoknotted sequences.
The results above demonstrate that, for a variety of secondary structure prediction programs, our segmentation approach for handling the long RNA sequences can retain and even enhance the average prediction accuracy. Furthermore, using the inversion based C and O methods to cut the sequence will produce better prediction accuracy than the naïve R method. More questions remain to be answered and are part of our current research.
Our current investigations focus on the following two questions. First, we want to study how we should choose the parameters c, L, and G to maximize the accuracy retention. We have been conducting studies to identify how the prediction accuracy correlates with these parameters. Some of the results have been reported in preliminary work of the group [12, 13]. So far we have not found any definitive criteria that work for all sequences in general. Rather, the nucleotide base composition and length of the individual sequence, as well as the sequence length limitations imposed by the particular prediction program, need to be taken into account. Second, the fact that segmentation can in many cases improve the prediction accuracy for an RNA sequence is somewhat counter-intuitive. One possible explanation is that secondary structure prediction algorithms are generally based on global minimal free energy, resulting in the most thermodynamically stable isoforms. However, these structures may not be most favorable for biological functions, which often require RNAs to interact with other molecules or unfold during replication. Our results suggest that local structures formed by pairings of bases in close proximity, rather than the global energies, may better correlate with the real structures of large RNA molecules. This hypothesis is being tested in coauthor Johnson's molecular virology lab using the virus family Nodaviridae.
For the performance analysis, we use a smaller dataset of longer sequences from the virus family Nodaviridae [31, 32] and we explore a wider range of parameter values. The virus family Nodaviridae is divided into two genera: alphanodaviruses that primarily infect insects and betanodaviruses that infect only fish. These viruses share a common genome organization, namely a bipartite positive strand RNA genome (i.e., mRNA sense). The longer genome segment RNA1 (ranging in size from 3011 to 3204 nucleotide bases) encodes the RNA-dependent RNA polymerase that catalyzes replication of both genome segments, while the shorter RNA 2 (ranging in size from 1305 to 1433 nucleotide bases) encodes the precursor of the viral capsid protein that encapsidates the RNA genome. The 14 sequences we analyze in this paper are identified as follows: Boolarra virus (BoV) RNA2 (1305 nucleotide bases), Pariacoto virus (PaV) RNA2 (1311), Nodamura virus (NoV) RNA2 (1336), Black beetle virus (BBV) RNA2 (1393), Flock House virus (FHV) RNA2 (1400), Striped jack nervous necrosis virus (SJNNV) RNA2 (1421), Epinephelus tauvina nervous necrosis virus (ETNNV) RNA2 (1433), BoV RNA1 (3096), PaV RNA1 (3011), BBV RNA1 (3099), ETNNV RNA1 (3103), FHV RNA1 (3107), SJNNV RNA1 (3107), NoV RNA1 (3204). These sequences are sorted based on their increasing lengths, and this order is preserved in all the figures and tables presented below. There are three important questions that we want to answer when measuring performance. First, we want to quantify the time spent for exploring the several branches of the search trees for these 14 sequences using each of the two chunking methods (centered or optimized) and for the granularity of the mapping (coarse-or fine-grained). Second, we want to identify how the time is spent for each search in terms of map, reduce, and data shuffling among processors. Third, we want to measure the efficiency of the search and look for those aspects of the search that can impact performance.
We measure the total time needed to explore the chunking tree of each sequence using either the centered or optimized methods and with either coarse-grained or fine-grained mapping. The total time includes the time needed for chunking and prediction (map time), reconstruction (reduce time), exchange of predictions among nodes (shuffling time), and any overhead due to load imbalance and synchronizations. Note that the total time does not include the time needed for analysis since the secondary structures of the sequences considered here are not known experimentally; thus an analysis in terms of accuracy is not feasible. We use IPknot for our predictions since it is the most recently implemented program and its accuracy values are very high in the previous section.
Total times for RNA2 and RNA1 with coarse-grained mapping
Mean Total Time (sec)
RNA2(∼ 1300 bases)
RNA1(∼ 3100 bases)
Total times for RNA2 and RNA1 with fine-grained mapping
Mean Total Time (sec)
RNA2(∼ 1300 bases)
RNA1(∼ 3100 bases)
Average number of chunks for RNA2 and RNA1 with fine-grained mapping
Mean Total Time (sec)
RNA2(∼ 1300 bases)
RNA1(∼ 3100 bases)
When comparing coarse-grained mapping vs. fine-grained mapping, we observe that coarse-grained mapping results in shorter execution time compared to fine-grained mapping, independent of the chunking method used. Also we observe the trend that when the maximum chunk length grows from 60 to 300, the time gain of coarse-grained mapping over fine-grained mapping decreases. The speedup of coarse-grained mapping over fine-grained mapping using the centered chunking method for RNA2 subgroup of sequences decreases from 3.75 to 1.38, and for RNA1 it decreases from 7.86 to 2.37. A similar behavior is observed for the optimized chunking method: speedup of coarse-grained mapping over fine-grained mapping for RNA2 subgroup of sequences decreases from 3.4 to 1.25, and for RNA1 it decreases from 6.82 to 1.93.
Independent of the maximum chunk length, Figure 19 shows how fine-grained mapping reaches better efficiency compared to coarse-grained mapping. In other words, with fine-grained mapping, the mappers spend more time doing real chunking and predictions. We observe in Figure 17 (left y-axes) how fine-grained mapping has a larger number of map tasks and each map task is shorter (it predicts only one chunk) making easier for the Hadoop scheduler to allocate the several tasks efficiently by using a first-in-first-out (FIFO) policy. On the other hand, coarse-grained mapping has a smaller number of map tasks and each map task is longer (all the sequence chunks of a given L and G combination are predicted by a single mapper). In this case, once the scheduler assigns a longer task to a mapper, it has to wait for its completion, even if the other mappers have generated their chunk predictions, before proceeding to the reduce phase. We also observe that as the maximum chunk length increases from 60 to 300 bases, the map efficiency tends to drop. More specifically, the average map efficiency for coarse-grained mapping decreases from 36% to 25% on RNA2 and from 18% to 15% on RNA1 when using centered or optimized chunking methods. The average map efficiency for fine-grained mapping decreases from 91% to 79% on RNA2 and from 97% to 93% on RNA1. This is due to the fact that the centered and optimized chunking methods tend to produce more chunks with shorter chunk lengths when using a maximum chunk length of 60. On the other hand, when using a maximum chunk 300, the same methods tend to produce fewer chunks each with longer lengths.
The overall results suggest that the best set of parameter values to achieve higher accuracy, performance, and efficiency depend on multiple aspects including the input sequence and the available resources. Driven by these two aspects, in future work we will integrate an automatic selection of these values into our MR framework.
In this paper, we propose a MapReduce-based, modularized framework that allows scientists to systematically and efficiently explore the parametric space associated with chunk-based secondary structure predictions of long RNA sequences. By using our framework we can observe how sequence segmentation strategies, directed by inversion distributions enable us to predict the secondary structures of large RNA molecules. Furthermore, the chunk-based predictions can, on average, attain accuracies even higher than those obtained from predictions using the whole sequence. The observations in this study have led to our hypothesis that local structures formed by pairings of bases in close proximity, rather than the global free energies, may better correlate with the real structures of large RNA molecules. This hypothesis will be tested by further computational and experimental investigations.
BZ and DY equally contributed to this work. This work is supported in part by grants DMS 0800272/0800266 and EIA 0080940 from the NSF, RCMI 5G12RR008124-18 and NIMHD 8G12MD007592 from the NIH.
The publication costs for this article were funded by the corresponding authors.
This article has been published as part of BMC Structural Biology Volume 13 Supplement 1, 2013: Selected articles from the Computational Structural Bioinformatics Workshop 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcstructbiol/supplements/13/S1.
- Finnegan E, Matzke M: The small RNA world. Journal of Cell Science 2003, 116(23):4689–4693. 10.1242/jcs.00838View ArticlePubMedGoogle Scholar
- Thiel V, Ivanov KA, Putics A, Hertziq T, Schelle B, Bayer S, Weissbrich B, Snijder EJ, Rabenau H, Doerr HW, Gorbalenya AE, Ziebuhr J: Mechanisms and enzymes involved in SARS coronavirus genome expression. Journal of General Virology 2003, 84(Pt 9):2305–2315.View ArticlePubMedGoogle Scholar
- Ren J, Rastegari B, Condon A, Hoos HH: HotKnots: Heuristic prediction of RNA secondary structures including pseudoknots. RNA 2005, 11(10):1494–1504. 10.1261/rna.7284905PubMed CentralView ArticlePubMedGoogle Scholar
- Brierley I, Pennell S, Gilbert RJ: Viral RNA pseudoknots: Versatile motifs in gene expression and replication. Nature Reviews Microbiology 2007, 5(8):598–610. 10.1038/nrmicro1704View ArticlePubMedGoogle Scholar
- Nussinov R, Jacobson A: Fast algorithm for predicting the secondary structure of single stranded RNA. Proceedings of the National Academy of Sciences of the United States of America 1980, 77(11):6309–6313. 10.1073/pnas.77.11.6309PubMed CentralView ArticlePubMedGoogle Scholar
- Sankoff D: Simultaneous solution of the RNA folding, alignment, and protosequence problems. SIAM Journal on Applied Mathematics 1985, 45(5):810–825. 10.1137/0145048View ArticleGoogle Scholar
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research 2003, 31(13):3406–3415. 10.1093/nar/gkg595PubMed CentralView ArticlePubMedGoogle Scholar
- Rivas E, Eddy SR: A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology 1999, 285(5):2053–2568. 10.1006/jmbi.1998.2436View ArticlePubMedGoogle Scholar
- Dirks R, Pierce N: An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. Journal of Computational Chemistry 2004, 25(10):1295–1304. 10.1002/jcc.20057View ArticlePubMedGoogle Scholar
- Taufer M, Leung MY, Solorio T, Licon A, Mireles D, Araiza R, Johnson K: RNAVLab: a virtual laboratory for studying RNA secondary structures based on Grid computing technology. Parallel Computing 2008, 34(11):661–680. 10.1016/j.parco.2008.08.002PubMed CentralView ArticlePubMedGoogle Scholar
- Rosskopf JJ, III JHU, Rodarte L, Romero TA, Leung MY, Taufer M, Johnson KL: A 3' terminal stem-loop structure in Nodamura virus RNA2 forms an essential cis-acting signal for RNA replication. Virus Research 2010, 150(1–2):12–21. 10.1016/j.virusres.2010.02.006PubMed CentralView ArticlePubMedGoogle Scholar
- Yehdego D, Kodimala V, Viswakula S, Zhang B, Vegesna R, Johnson K, Taufer M, Leung MY: Poster: Secondary structure predictions for long RNA sequences based on inversion excursions. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB) 2012.Google Scholar
- Yehdego D, Zhang B, Kodimala VKR, Johnson K, Taufer M, Leung MY: Secondary structure predictions for long RNA sequences based on inversion excursions and MapReduce. Proceedings of 12th IEEE International Workshop on High Performance Computational Biology (HiCOMB) 2013.Google Scholar
- Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Proceedings of the 6th conference on Symposium on Opearting Systems Design and Implementation 2004.Google Scholar
- Hong D, Rhie A, Park SS, Lee J, Ju YS, Kim S, Yu SB, Bleazard T, Park HS, Rhee H, Chong H, Yang KS, Lee YS, Kim IH, Lee JS, Kim JI, Seo JS: FX: an RNA-Seq analysis tool on the Cloud. Bioinformatics 2012, 28(5):721–723. 10.1093/bioinformatics/bts023View ArticlePubMedGoogle Scholar
- Langmead B, Hansen KD, Leek JT: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol 2010, 11: R83. 10.1186/gb-2010-11-8-r83PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang B, Yehdego D, Johnson K, Leung MY, Taufer M: A modularized MapReduce framework to support RNA secondary structure prediction and analysis workflows. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4–7 October 2012 2012, 86–93. 10.1109/BIBMW.2012.6470251View ArticleGoogle Scholar
- Karlin S, Dembo A, Kawabata T: Statistical composition of high-scoring segments from molecular sequences. Annals of Statistics 1990, 18(2):571–581. 10.1214/aos/1176347616View ArticleGoogle Scholar
- Chew DS, Leung MY, Choil KP: AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions. BMC Bioinformatics 2007, 8: 163. 10.1186/1471-2105-8-163PubMed CentralView ArticlePubMedGoogle Scholar
- RFAM database[http://rfam.sanger.ac.uk/]
- MATLAB: Version 188.8.131.529 (R2010a). Natick, Massachusetts: The MathWorks Inc.; 2010.Google Scholar
- Friedman M: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 1937, 32(200):675–701. 10.1080/01621459.1937.10503522View ArticleGoogle Scholar
- Sato K, Kato Y, Hamada M, Akutsu T, Asai K: IPknot: Fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 2011, 27(13):i85-i93. 10.1093/bioinformatics/btr215PubMed CentralView ArticlePubMedGoogle Scholar
- Reeder J, Steffen P, Giegerich R: pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows. Nucleic Acids Res 2007, 35: W320-W324. 10.1093/nar/gkm258PubMed CentralView ArticlePubMedGoogle Scholar
- Dirks R, Pierce NA: A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry 2003, 24(13):1664–1677. 10.1002/jcc.10296View ArticlePubMedGoogle Scholar
- Markham NR, Zuker M: UNAFold: Software for nucleic acid folding and hybridization. Methods in Molecular Biology 2008, 453: 3–31. 10.1007/978-1-60327-429-6_1View ArticlePubMedGoogle Scholar
- Hofacker I, Fontana W, Stadler P, Bonhoeffer S, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie 1994, 125: 167–188. 10.1007/BF00818163View ArticleGoogle Scholar
- Taufer M, Licon A, Araiza R, Mireles D, van Batenburg FH, Gultyaev AP, Leung MY: PseudoBase++: an extension of PseudoBase for easy searching, formatting, and visualization of pseudoknots. Nucleic Acids Res 2009, 37(Database issue):D127-D135. 10.1093/nar/gkn806PubMed CentralView ArticlePubMedGoogle Scholar
- Snedecor GW, Cochran WG: The sample correlation coefficient r. In Statistical Methods. 7th edition. Ames, IA: Iowa State Press; 1980:175–178.Google Scholar
- Johnson KN, Johnson KL, Dasgupta R, Gratsch T, Ball LA: Comparisons among the larger genome segments of six Nodaviruses and their encoded RNA replicases. Journal of General Virology 2001, 82(Pt 8):1855–1866.View ArticlePubMedGoogle Scholar
- Thiery R, Johnson KL, Nakai T, Schneemann A, Bonami JR, Lightner DV: Family Nodaviridae. In Virus Taxonomy Ninth Report of the International Committee on Taxonomy of Viruses. Edited by: King AM, Lefkowitz E, Adams MJ, Carstens EB, Waltham, MA. Elsevier Academic Press; 2011:1061–1067.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.