# gEMpicker: a highly parallel GPU-accelerated particle picking tool for cryo-electron microscopy

- Thai V Hoang^{1} (corresponding author)
- Xavier Cavin^{1}
- Patrick Schultz^{2}
- David W Ritchie^{1}

**13**:25

https://doi.org/10.1186/1472-6807-13-25

© Hoang et al.; licensee BioMed Central Ltd. 2013

**Received: **13 June 2013

**Accepted: **14 October 2013

**Published: **21 October 2013

## Abstract

### Background

Picking images of particles in cryo-electron micrographs is an important step in solving the 3D structures of large macromolecular assemblies. However, in order to achieve sub-nanometre resolution it is often necessary to capture and process many thousands or even several millions of 2D particle images. Thus, a computational bottleneck in reaching high resolution is the accurate and automatic picking of particles from raw cryo-electron micrographs.

### Results

We have developed “gEMpicker”, a highly parallel correlation-based particle picking tool. To our knowledge, gEMpicker is the first particle picking program to use multiple graphics processor units (GPUs) to accelerate the calculation. When tested on the publicly available keyhole limpet hemocyanin dataset, we find that gEMpicker gives similar results to the FindEM program. However, compared to calculating correlations on one core of a contemporary central processor unit (CPU), running gEMpicker on a modern GPU gives a speed-up of about 27 ×. To achieve even higher processing speeds, the basic correlation calculations are accelerated considerably by using a hierarchy of parallel programming techniques to distribute the calculation over multiple GPUs and CPU cores attached to multiple nodes of a computer cluster. By using a theoretically optimal reduction algorithm to collect and combine the cluster calculation results, the speed of the overall calculation scales almost linearly with the number of cluster nodes available.

### Conclusions

The very high picking throughput that is now possible using GPU-powered workstations or computer clusters will help experimentalists to achieve higher resolution 3D reconstructions more rapidly than before.

## Background

Despite recent advances in the use of computational techniques, solving the structures of large macromolecular complexes by cryo-electron microscopy (EM) is still a painstaking and labour-intensive task [1]. It is also a very computationally intensive task. In single-particle cryo-EM, large numbers of micrographs containing low-resolution and noisy two-dimensional (2D) images of the particle of interest are recorded. Because each micrograph usually contains multiple particles in multiple random orientations, and possibly also in various conformations, the particles are then picked and classified into groups having similar orientations. Fast Fourier transform (FFT) deconvolution and averaging techniques may then be applied to reduce both systematic deformations of the 2D images due to the instrument’s contrast transfer function and the random noise which arises from using the low electron intensities necessary to preserve the structural integrity of the samples. Once a good set of 2D images has been obtained, a three-dimensional (3D) electron density map of the particle may be constructed using 3D back-projection or Radon transform techniques [2], for example. However, the resolution of such maps, which are often calculated from only *O*(10^{4}) molecular images, is low compared to density maps obtained by X-ray crystallography which are typically derived from *O*(10^{15}) molecules. Therefore, in cryo-EM, the main way to increase the resolution of the final density map is to capture and process many thousands or even several millions of 2D particle images. In the past, the particles in EM micrographs were picked manually, but this is not a practical way to reach sub-nanometre resolution or to resolve conformational changes within molecules. Modern digital imaging technology combined with automated high-throughput data collection techniques now allows both higher resolution images and 2D datasets of essentially unlimited size to be captured. Hence, a major bottleneck in reaching atomic resolution in 3D reconstruction by cryo-EM is now the accurate and automated picking of particles from the raw EM micrographs.

Many methods have been proposed for automatic cryo-EM particle picking [3, 4]. Amongst the most popular are those that use particle templates to facilitate particle recognition. A template is usually a noise-free representation of the particle in a particular orientation. It can be obtained either by projecting a known 3D structure onto a 2D plane or by calculating the average of some representative particles selected from micrographs. Some picking methods use mathematical functions for templates such as the difference of Gaussians method [5, 6]. In general, template-based methods recognise particles by computing similarity scores between the template and similar sized regions of each micrograph. For example, a widely used template-based method employs the normalised cross-correlation technique [7] which calculates an array of matching scores in the form of a 2D correlation map. This approach has been implemented in FindEM [8], SPIDER LFCPick [9], and SIGNATURE [10], for example.

Some picking methods use machine learning techniques to discriminate between real particles and non-particles such as those due to contaminants and noise. Example techniques are cascades of classifiers [11, 12], a pyramid of neural networks [13], and support vector machines [14]. Other methods are based on the observation that 2D images of particles often have rather limited geometric complexity. For example, Zhu et al. [15] use a Hough transform for particle edge detection. A related approach uses image processing techniques to segment particles directly from micrographs [16]. However, methods which do not use templates often require human intervention during the picking process.

Because it is difficult to surpass the accuracy of automatic template-based methods when the templates match the particles well, template-based approaches are often preferred although their computational cost is often higher than that of other methods [4]. However, in single particle cryo-EM, large and diverse sets of both micrographs and templates are usually needed to represent and identify different orientations of particles in micrographs in order to achieve a high resolution 3D reconstruction. There is therefore a need to be able to pick multiple images from multiple micrographs using multiple templates as rapidly as possible. In order to help satisfy this need, we have developed a highly parallel correlation-based particle picking tool called gEMpicker, which exploits recent advances in high performance computing technology in order to distribute particle picking calculations over multiple nodes of a computer cluster.

Nowadays, most research institutions have at least one computer cluster for scientific calculations. Each node of the cluster usually consists of several CPU cores, and an increasing number of clusters are configured with a certain number of GPUs in order to accelerate arithmetically intense calculations. Indeed, in the last few years, GPUs have been used to accelerate many scientific calculations [17] in fields ranging from molecular dynamics simulations [18] and quantum chemistry [19] to protein and DNA sequence alignment [20] and protein docking [21]. Recently, GPUs have also been used to accelerate single particle reconstruction [22], tomographic reconstruction [23], and subtomogram averaging [24]. With these observations in mind, we designed gEMpicker to be able to adapt easily to different hardware configurations, ranging from a modest workstation with one or two attached GPUs to large CPU-based or GPU-based clusters with tens or even hundreds of processors, and we have endeavoured to ensure that its performance increases linearly with the computational resources available. Here, we present particle picking speed-up results obtained on four different computational platforms, and we demonstrate the practical utility of the approach using the publicly available keyhole limpet hemocyanin (KLH) dataset. To our knowledge, gEMpicker is the first particle picking program to use multiple modern graphics processor units (GPUs) to accelerate FFT-based NCC calculations.

## Implementation

### NCC-based automatic particle picking

Given a target image and a set of search images *S*_{k} (*k*=1,2,…,*N*), each of which contains a candidate particle to be picked, automatic particle picking based on normalised cross-correlation (NCC) involves determining the highest peaks in the correlation maps calculated between these search images and the target image. The overall calculation involves essentially three main steps. The first step calculates the correlation, NCC_{k}, between each *S*_{k} and the target image. NCC_{k} can be efficiently calculated using FFTs by exploiting the formulation in [7] (Additional file 1 Section 1). The second step combines all of the NCC_{k} correlation maps into a global correlation map NCC, together with an index map IND, using

$$\mathrm{NCC}(\mathbf{v}) = \max_{k} \mathrm{NCC}_{k}(\mathbf{v}), \qquad \mathrm{IND}(\mathbf{v}) = \arg\max_{k} \mathrm{NCC}_{k}(\mathbf{v}),$$

for all relative displacements **v** of the search images from the origin of the target image. In the parallel processing community, the process of gathering results in this way is often called a “reduction” because it reduces multiple result arrays into a single global result array. In large-scale distributed calculations, the efficiency of the reduction step can have a significant impact on the overall speed of the calculation. We return to this point below.
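To make the first step concrete, the following is a minimal sketch of a plain normalised cross-correlation evaluated directly in real space. It is only an illustration under stated assumptions: it omits the mask used in the formulation of [7], it does not use FFTs, and the `Image` type and `ncc_direct` function are hypothetical names that do not come from gEMpicker. Its cost grows with the product of the target and search image sizes, which is precisely why an FFT-based formulation is preferred in practice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major greyscale image (a hypothetical helper type, not gEMpicker's).
struct Image {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> px;
    double at(std::size_t r, std::size_t c) const { return px[r * cols + c]; }
};

// Unmasked NCC of search image s against every fully overlapping window of
// target t, computed directly in real space. Scores lie in [-1, 1].
Image ncc_direct(const Image& t, const Image& s) {
    assert(t.rows >= s.rows && t.cols >= s.cols);
    const double n = static_cast<double>(s.rows * s.cols);

    // Mean and sum of squared deviations of the search image (computed once).
    double s_mean = 0.0;
    for (double v : s.px) s_mean += v;
    s_mean /= n;
    double s_ss = 0.0;
    for (double v : s.px) s_ss += (v - s_mean) * (v - s_mean);

    Image out{t.rows - s.rows + 1, t.cols - s.cols + 1, {}};
    out.px.assign(out.rows * out.cols, 0.0);

    for (std::size_t y = 0; y < out.rows; ++y) {
        for (std::size_t x = 0; x < out.cols; ++x) {
            // Local mean of the target window at offset v = (y, x).
            double w_mean = 0.0;
            for (std::size_t r = 0; r < s.rows; ++r) {
                for (std::size_t c = 0; c < s.cols; ++c) {
                    w_mean += t.at(y + r, x + c);
                }
            }
            w_mean /= n;

            // Cross term and local sum of squared deviations.
            double cross = 0.0;
            double w_ss = 0.0;
            for (std::size_t r = 0; r < s.rows; ++r) {
                for (std::size_t c = 0; c < s.cols; ++c) {
                    const double d = t.at(y + r, x + c) - w_mean;
                    cross += d * (s.at(r, c) - s_mean);
                    w_ss += d * d;
                }
            }
            const double denom = std::sqrt(w_ss * s_ss);
            // A flat window or template has no defined correlation; score 0.
            out.px[y * out.cols + x] = denom > 0.0 ? cross / denom : 0.0;
        }
    }
    return out;
}
```

Where the target contains an exact copy of the search image, the score at that offset is 1; noise and mismatched orientations lower it.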

Assuming that **v** is the location of a local maximum in NCC, the search image that corresponds to that local maximum is given by *k*=IND(**v**). In other words, the calculation has associated the search image *S*_{*k*=IND(**v**)} with location **v** of the target image. Lastly, the third step locates the coordinates of local maxima in NCC in order to produce a final list of picked particles. The above procedure is then repeated for each target image in the dataset.
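The second and third steps can be sketched as follows, assuming that the global map keeps the highest score seen at each position together with the index of the search image that produced it. The names `GlobalMap`, `merge_map`, `Pick` and `find_peaks` are hypothetical, and the peak test (a strict maximum over the eight neighbouring pixels, above a user-chosen threshold) is a deliberate simplification; a practical picker would normally also suppress peaks that lie closer together than one particle diameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Global correlation map NCC and index map IND for one target image.
struct GlobalMap {
    std::vector<float> ncc;  // best score seen so far at each position v
    std::vector<int> ind;    // 1-based index k of the search image giving it
    // Start below the smallest possible NCC value (-1) so any map replaces it.
    explicit GlobalMap(std::size_t n) : ncc(n, -2.0f), ind(n, 0) {}
};

// Step two: fold the map of search image k into the running global map.
void merge_map(GlobalMap& g, const std::vector<float>& ncc_k, int k) {
    assert(ncc_k.size() == g.ncc.size());
    for (std::size_t v = 0; v < ncc_k.size(); ++v) {
        if (ncc_k[v] > g.ncc[v]) {
            g.ncc[v] = ncc_k[v];
            g.ind[v] = k;
        }
    }
}

struct Pick {
    std::size_t row;
    std::size_t col;
    int k;        // search image associated with this location
    float score;  // NCC value at the peak
};

// Step three: report strict local maxima that reach the threshold.
std::vector<Pick> find_peaks(const GlobalMap& g, std::size_t rows,
                             std::size_t cols, float threshold) {
    assert(g.ncc.size() == rows * cols);
    std::vector<Pick> picks;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float score = g.ncc[r * cols + c];
            if (score < threshold) continue;
            bool is_peak = true;
            for (int dr = -1; dr <= 1 && is_peak; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const long rr = static_cast<long>(r) + dr;
                    const long cc = static_cast<long>(c) + dc;
                    if (rr < 0 || cc < 0 || rr >= static_cast<long>(rows) ||
                        cc >= static_cast<long>(cols)) {
                        continue;  // neighbour lies outside the map
                    }
                    const std::size_t idx =
                        static_cast<std::size_t>(rr) * cols +
                        static_cast<std::size_t>(cc);
                    if (g.ncc[idx] >= score) {
                        is_peak = false;
                        break;
                    }
                }
            }
            if (is_peak) picks.push_back({r, c, g.ind[r * cols + c], score});
        }
    }
    return picks;
}
```

Because `merge_map` only ever keeps the larger value, the maps may be folded in any order, which is what makes this step suitable for a parallel reduction.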

### FFT size and zero-padding

Because almost all of the computational cost in gEMpicker arises from FFT-based NCC calculations, the choice of FFT library can significantly affect overall performance. We therefore tested gEMpicker using the proprietary MKL (Math Kernel Library) [25], CUFFT (CUDA Fast Fourier Transform) [26], and the open source FFTW (Fastest Fourier Transform in the West) [27] libraries. Although the theoretical advantage of the FFT is that it can perform a calculation that apparently requires *O*(*N*^{2}) operations in just *O*(*N* log*N*) time, the actual speed-up that might be achieved can be quite sensitive to the dimension *N*.

Current FFT libraries use the Cooley–Tukey algorithm [28] to recursively decompose a transform of size *N* into transforms of smaller dimensions which are normally implemented as small “kernels” of dimension 2, 3, 5, or 7. If the dimension cannot be factored into small prime numbers, a slower general purpose algorithm is used (e.g. [29, 30]). Therefore, if the image dimension is not a natural product of small primes, it is often worthwhile to pad the image with zeros up to a suitable larger dimension. Additionally, on current GPUs, global GPU memory can be accessed most efficiently if memory requests can be factored into similar dimensions, because this can allow the GPU to coalesce multiple memory accesses into a single transaction (the precise conditions necessary for coalesced memory access are described in the CUDA C Programming Guide [31]). Consequently, gEMpicker automatically zero-pads images when it detects an opportunity to improve performance due to the above considerations. This simple trick proves effective when the data size does not conform to the library’s recommendations.
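As an illustration of the padding rule described above, the sketch below finds the smallest dimension not less than *N* whose only prime factors are 2, 3, 5 and 7. The function names are hypothetical and the exact criterion used by gEMpicker is not stated in the text, so this should be read as one plausible way to choose a padded size rather than as the program's actual logic.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// True if n has no prime factor other than 2, 3, 5 and 7.
bool is_fft_friendly(std::size_t n) {
    if (n == 0) return false;
    for (std::size_t p : {2u, 3u, 5u, 7u}) {
        while (n % p == 0) {
            n /= p;
        }
    }
    return n == 1;
}

// Smallest FFT-friendly dimension that is >= n; the image would be
// zero-padded from n up to this size along the corresponding axis.
std::size_t next_fft_friendly_size(std::size_t n) {
    assert(n > 0);
    while (!is_fft_friendly(n)) {
        ++n;
    }
    return n;
}
```

For example, a dimension of 1021 (a prime) would be padded to 1024, whereas 2048 is already a power of two and is left unchanged.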

### Parallel processing framework

In parallel processing, it is usual to use the notion of a “thread” to mean one instance of a calculation that will run essentially independently on one CPU core. Often, multiple threads are launched from a single parent program, or “process”, on each CPU node. Although different threads may run independently, they often still communicate with each other in a controlled way using one or more message passing techniques to send and receive data and results. Here, we consider the basic unit of calculation to be the correlation of one template with one micrograph because this operation is relatively expensive yet it does not depend on either the number of micrographs or the number of templates to be processed. With this level of granularity, the particle picking problem can be parallelised quite naturally by distributing the correlation calculations over several threads running in parallel. When GPUs are available, it is legitimate for a CPU thread to pass a part or even all of a calculation to an attached GPU.

When running in multi-threaded mode, each thread will calculate the correlation between the micrograph and multiple templates. However, concurrent reading of data by multiple threads could cause contention in the disc storage device and consequently lead to sub-optimal performance. Therefore, to avoid this problem, gEMpicker adopts a producer–consumer pattern [32]. The producer’s job is simply to read data from disc, and copy it into a queue. If the number of producers is one, which is the case in gEMpicker, there is only one stream of data from the storage device, and hence the possibility of contention is completely avoided. gEMpicker normally uses multiple consumer threads according to a simple thread pool pattern [33]. Each consumer removes one template at a time from the queue and processes it independently of any other template calculations. In order to avoid race or deadlock conditions amongst the threads, access to the queue is controlled by locks within the “Boost.Thread” library [34]. Additionally, if the queue becomes empty, any idle consumer threads will sleep until more data is made available by the producer. On the other hand, if the queue grows beyond a certain size, the producer will sleep in order to avoid exhausting physical memory. The number of consumer threads in the pool can be adapted according to the available resources. Typically, the number of threads would be set to the number of CPU cores or the number of GPUs per node. Thus, the producer-consumer model provides a way to read data smoothly from disc and to process it as quickly as possible.
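The producer-consumer arrangement described above can be sketched with a bounded, lock-protected queue. The text states that gEMpicker uses locks from the Boost.Thread library; this sketch substitutes the equivalent C++11 standard library primitives so that it is self-contained. The class and function names are hypothetical, integer identifiers stand in for template images, and a simple addition stands in for one correlation calculation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Bounded queue shared by one producer and several consumers.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Producer side: sleeps while the queue is full, to bound memory use.
    void push(int item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push(item);
        not_empty_.notify_one();
    }

    // Producer calls this once after the last item has been queued.
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Consumer side: sleeps while the queue is empty. Returns false once
    // the queue has been closed and fully drained.
    bool pop(int& item) {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        item = q_.front();
        q_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::queue<int> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// One producer "reads" template ids 1..n_templates; each consumer in the
// pool processes whatever it removes from the queue. Returns the sum of
// all processed ids so that the result is easy to check.
long run_pipeline(int n_templates, int n_consumers, std::size_t capacity) {
    BoundedQueue queue(capacity);
    std::vector<long> partial(static_cast<std::size_t>(n_consumers), 0);
    std::vector<std::thread> consumers;
    for (int c = 0; c < n_consumers; ++c) {
        consumers.emplace_back([&queue, &partial, c] {
            int id = 0;
            while (queue.pop(id)) {
                partial[static_cast<std::size_t>(c)] += id;  // stand-in work
            }
        });
    }
    std::thread producer([&queue, n_templates] {
        for (int id = 1; id <= n_templates; ++id) queue.push(id);
        queue.close();
    });
    producer.join();
    for (std::thread& t : consumers) t.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

With a single producer there is only one reader of the storage device, and the capacity limit plays the role of the queue size threshold beyond which the producer sleeps.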

In order to calculate the global NCC map for a micrograph with a set of templates, gEMpicker distributes the calculation over a given number of threads, which might ultimately be executed on multiple CPUs, GPUs, or a mixture of the two. Thus each thread *t* calculates NCC_{k} for a subset of the templates and it maintains NCC^{t} and IND^{t} as its individual correlation map and the corresponding index map. When the queue of templates becomes exhausted, each thread combines its NCC^{t} with NCC^{p} so that NCC^{p} and IND^{p} will contain the candidate picks calculated by the threads belonging to process *p*. When running on a single workstation, NCC^{p} and IND^{p} will immediately describe all of the picked particles, and all that remains is to identify the local maxima to obtain the final picked list.
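The thread-level combination described above can be sketched as follows. Each thread accumulates a private correlation map and index map without any locking, and takes a lock only once, when it merges its result into the process-level maps. Two simplifications are assumed here: precomputed score arrays stand in for the FFT-based correlation of each template, and templates are assigned to threads in round-robin fashion rather than being taken from a shared queue. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// A correlation map with its index map (per thread or per process).
struct ThreadMap {
    std::vector<float> ncc;
    std::vector<int> ind;
    explicit ThreadMap(std::size_t n) : ncc(n, -2.0f), ind(n, 0) {}
};

// Keep the larger score at every position, together with its template index.
void merge(ThreadMap& into, const ThreadMap& from) {
    for (std::size_t v = 0; v < into.ncc.size(); ++v) {
        if (from.ncc[v] > into.ncc[v]) {
            into.ncc[v] = from.ncc[v];
            into.ind[v] = from.ind[v];
        }
    }
}

// maps[k - 1] stands in for the correlation map of template k.
ThreadMap process_level_map(const std::vector<std::vector<float>>& maps,
                            std::size_t n_threads) {
    assert(n_threads > 0);
    const std::size_t n_pix = maps.empty() ? 0 : maps[0].size();
    ThreadMap ncc_p(n_pix);  // shared by all threads of this process
    std::mutex merge_lock;
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n_threads; ++t) {
        pool.emplace_back([&, t] {
            ThreadMap ncc_t(n_pix);  // private to thread t: no lock needed
            for (std::size_t k = t; k < maps.size(); k += n_threads) {
                for (std::size_t v = 0; v < n_pix; ++v) {
                    if (maps[k][v] > ncc_t.ncc[v]) {
                        ncc_t.ncc[v] = maps[k][v];
                        ncc_t.ind[v] = static_cast<int>(k) + 1;
                    }
                }
            }
            // Single synchronised merge once this thread's work is done.
            std::lock_guard<std::mutex> guard(merge_lock);
            merge(ncc_p, ncc_t);
        });
    }
    for (std::thread& th : pool) th.join();
    return ncc_p;
}
```

Keeping the per-thread maps private means that the threads contend for the lock only once each, rather than once per template or per pixel.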

### Cluster implementation

We have implemented both direct and tree-based reduction algorithms in gEMpicker. The direct reduction algorithm uses the MPI_Send and MPI_Recv functions to send and receive data between the node and master processes. For a cluster of 2^{n} nodes, this approach requires 2^{n}-1 data transfers and 2^{n}-1 reduce operations, all of which pass through the master process one after another. The tree-structured reduction uses the MPI_Reduce function to propagate results towards the master process at the root of the tree and requires only *n* successive data transfer and reduce steps in a cluster of 2^{n} nodes, because the transfers within each level of the tree can proceed in parallel. Such a tree-based approach is theoretically optimal, since the total elapsed time should scale only logarithmically in the number of cluster nodes. It is worth noting that the cluster reduction step is performed only when all node-level processes have finished their correlation tasks. This means that the reduction calculation itself may be accelerated using multi-threading on each node’s main process. Because this step mainly involves element-wise processing of large arrays it is easily parallelised using a few fine-grained OpenMP [36] compiler directives.
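The difference between the two reduction schemes can be illustrated with a small single-process simulation that counts sequential steps. This sketch does not use MPI: each element of `nodes` stands in for the map held by one cluster node, and the function names are hypothetical. In the tree version, all merges within one round involve disjoint pairs of nodes, so on a real cluster they could run concurrently and only the number of rounds contributes to the elapsed time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using NodeMap = std::vector<float>;

// Element-wise maximum: the reduce operation applied to two node maps.
void reduce_into(NodeMap& dst, const NodeMap& src) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] = std::max(dst[i], src[i]);
    }
}

// Direct scheme: the master (node 0) receives and merges every other
// node's map in turn. Returns the number of sequential transfer steps.
std::size_t direct_reduce(std::vector<NodeMap>& nodes) {
    std::size_t steps = 0;
    for (std::size_t p = 1; p < nodes.size(); ++p) {
        reduce_into(nodes[0], nodes[p]);
        ++steps;
    }
    return steps;
}

// Tree scheme: in each round, node p merges the map of node p + stride.
// Returns the number of rounds, i.e. the length of the critical path.
std::size_t tree_reduce(std::vector<NodeMap>& nodes) {
    std::size_t rounds = 0;
    for (std::size_t stride = 1; stride < nodes.size(); stride *= 2) {
        for (std::size_t p = 0; p + stride < nodes.size(); p += 2 * stride) {
            reduce_into(nodes[p], nodes[p + stride]);
        }
        ++rounds;
    }
    return rounds;
}
```

For 8 = 2^{3} nodes the direct scheme takes 7 sequential steps whereas the tree takes 3 rounds, and both leave the same result on node 0. In an actual MPI program, one possible way to carry the index map along with the scores would be to reduce value and index pairs with the standard MPI_MAXLOC operator; whether gEMpicker does this is not stated in the text.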

### The computational platforms

**The characteristics of the four computer platforms used in the current study**

Machine name | CPU cores | CPU type | Memory (per node) | GPUs (total) | GPU type | Infiniband connection |
---|---|---|---|---|---|---|
Dirac | 8 | i7-965 (3.2 GHz) | 12 GB | 1 | C2075 (575 MHz, 448 cores) | – |
Mbiserv | 12 | X5690 (3.5 GHz) | 64 GB | 4 | C2075 (575 MHz, 448 cores) | – |
Adonis | 8×8 | E5520 (2.3 GHz) | 24 GB | 16 | C1060 (602 MHz, 240 cores) | 40 Gb/s |
Griffon | 64×8 | L5420 (2.5 GHz) | 16 GB | 0 | – | 20 Gb/s |

## Results and discussion

### FFT-based NCC performance comparison

### Multi-node cluster performance

In these experiments, only the process-level correlation maps NCC^{p}, not the cluster’s global correlation map NCC, were calculated because the latter involves the reduction step, which is considered separately below. Here, the number of consumer threads in each process is equal to the number of CPU cores per node (Figure 4a) or the number of GPUs per node (Figure 4b). Figure 4 shows that the gain increases almost linearly with the number of nodes when clusters have a relatively small size, such as in Adonis. The sub-theoretical gain in Griffon may be due to the use of a network file system to store all template images in a single storage device. Since each node has a producer thread to read template images for its consumer threads, this could lead to contention on the disc device, as discussed above. Nevertheless, the results in Figure 4b also show that the performance of gEMpicker scales linearly with the number of GPUs and the number of nodes in a GPU cluster.

### Case study: keyhole limpet hemocyanin

This section demonstrates the practical utility of gEMpicker using the publicly available keyhole limpet hemocyanin (KLH) dataset^{a}. This annotated dataset was used previously to assess the performance of several automatic particle picking algorithms in a particle picking “bake-off” experiment [4]. This relatively small dataset consists of 82 defocus pairs of high-magnification images of size 2048×2048 of KLH particles, the locations of 1042 side-view particles picked manually by a human expert (Mouche’s picks), and a preliminary 3D reconstruction. Each defocus pair contains an image acquired at near-to-focus conditions and an image acquired at far-from-focus conditions.

We then used gEMpicker to pick particles from the 82 far-from-focus micrographs in the KLH dataset. This gave 1249 side-view particles, which contain 979 (i.e. ∼94%) of Mouche’s 1042 manually picked particles. In comparison, FindEM, which uses Roseman’s NCC algorithm, picked 1282 side-view particles containing 1011 (i.e. ∼97%) of the manually picked particles. Thus, gEMpicker picked approximately 3% fewer particles than FindEM from 3% fewer attempts. The small difference in the results between gEMpicker and FindEM is due to the different templates and masks used here and the slightly different parameter settings in the final peak extraction procedure. As noted by Zhu et al. [4], different human experts can pick different sets of particles, and so it is rather difficult to define a “gold standard” for particle picking. Therefore, although FindEM gave amongst the best results in the bake-off comparison, we would not wish to claim that gEMpicker is superior to FindEM. In addition, since the dataset does not provide the coordinates of manually picked top-view particles, we cannot apply a similar performance comparison for the top-view picking results of gEMpicker. To obtain an independent validation of our results, we uploaded the picks obtained by gEMpicker to the 3D Electron Microscopy Benchmark (http://i2pc.cnb.csic.es/3dembenchmark/) for 50 KLH micrographs. This generated the following statistics: Precision: 78.8%; Recall: 93.6%; False Discovery Rate: 21.2%; F-measure: 85.6%; Average distance from manual pick: 4.7 pixels.
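The reported statistics are related by standard definitions, which the following sketch makes explicit. It assumes the usual formulas (false discovery rate as one minus precision, and F-measure as the harmonic mean of precision and recall); the benchmark's own implementation is not described in the text, and the function names are hypothetical.

```cpp
#include <cassert>

// Harmonic mean of precision and recall (both given as fractions in [0, 1]).
double f_measure(double precision, double recall) {
    assert(precision + recall > 0.0);
    return 2.0 * precision * recall / (precision + recall);
}

// Fraction of reported picks that do not match a reference pick.
double false_discovery_rate(double precision) {
    return 1.0 - precision;
}

// Percentage of the reference picks that were recovered.
double percent_recovered(int matched, int reference) {
    assert(reference > 0);
    return 100.0 * matched / reference;
}
```

With precision 0.788 and recall 0.936, `f_measure` gives approximately 0.856 and `false_discovery_rate` gives 0.212, in agreement with the values quoted above. Likewise, `percent_recovered(979, 1042)` and `percent_recovered(1011, 1042)` give approximately 94% and 97%.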

Regarding timing, the total time to compute the correlation maps for the 82 micrographs in this dataset was 5,972s when using one CPU core on Dirac compared to 223s using one C2075 GPU. This corresponds to a GPU/CPU speed-up factor of ∼27. However, in this case it is probably fairer to compare one GPU with one quad-core CPU, which reduces the speed-up factor to ∼9. Effectively, the 82 micrographs in this small dataset may be processed in less than 4 minutes using a single GPU or in just over 28 minutes using all 8 cores of a modern workstation. A higher speed-up is expected using a greater number of templates. In contrast, the FindEM program requires 9,430s to compute the 82 correlation maps using one CPU core on Dirac. Thus, the speed-up obtained by using one C2075 GPU in gEMpicker when compared to FindEM is ∼42×.
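For reference, the two speed-up factors quoted above follow directly from the reported timings:

$$\frac{5972\ \mathrm{s}}{223\ \mathrm{s}} \approx 26.8, \qquad \frac{9430\ \mathrm{s}}{223\ \mathrm{s}} \approx 42.3,$$

and the single-GPU time of 223 s corresponds to about 3.7 minutes, consistent with the statement that the dataset can be processed in less than 4 minutes.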

Assuming the almost linear speed-up demonstrated by our cluster calculations (Figure 5), we estimated that the entire KLH picking exercise could be completed in about 1 minute on our 4-GPU Mbiserv machine. However, the actual observed time is almost 3 minutes. This is because for micrographs of size 2048×2048, the time required to process four templates in four GPUs is less than the time required to read four templates from the storage device. Hence the consumer threads often have to wait for data to become available. In addition, using multi-threading leads to the additional overhead of combining results at the final step. Similar phenomena are also observed on the Adonis and Griffon clusters. Thus, by exploiting GPUs for the particle picking problem, the rate-limiting factor is no longer raw computing power but the bandwidth of the hard disk drives.

## Conclusions

We have presented gEMpicker, a highly parallel multi-threaded cryo-EM particle picking tool which implements Roseman’s NCC matching algorithm on multi-CPU and multi-GPU computer systems. Our results on picking particles in the KLH dataset indicate that gEMpicker performs at least as well as Roseman’s FindEM algorithm. Our computational experiments show that gEMpicker’s automatic particle picking calculation is approximately 30–40 times faster on a contemporary GPU than on a single CPU core. Compared to a quad-core CPU running four gEMpicker threads in parallel, the speed-up from using one contemporary GPU is a factor of ∼9×. We have shown that increasing the number of GPUs speeds up the calculation linearly with almost no additional overhead. We have also demonstrated how the picking task may be distributed over multiple nodes in a computer cluster. On a cluster with a fast Infiniband connection, our tree-based reduction algorithm for combining node-level picks almost eliminates the overhead of distributing the calculation over multiple nodes, and allows the overall calculation speed to increase almost linearly with the available hardware. Thus, the very high picking throughput that is now possible with gEMpicker will help experimentalists to achieve higher resolution 3D reconstructions more rapidly than before.

## Availability and requirements

- **Project name:** gEMpicker
- **Project homepage:** http://gem.loria.fr/gEMpicker.html
- **Operating system(s):** Linux
- **Programming language:** C++, CUDA
- **Other requirements:** Boost 1.49 or higher, FFTW 3.3 or higher, CUDA Toolkit 4.2 or higher
- **License:** Unlimited for academic use
- **Any restrictions to use by non-academics:** license needed

## Endnote

^{a} Available at http://ami.scripps.edu/redmine/projects/ami/wiki/KLH_dataset_I.

## Declarations

### Acknowledgements

This work was funded in part by CNRS and by Agence Nationale de la Recherche, grant reference ANR-MNU-006-02. We thank the French Grid5000 network (https://www.grid5000.fr) for access to the Griffon and Adonis clusters.

## References

1. Orlova EV, Saibil HR: **Structural analysis of macromolecular assemblies by electron microscopy.** *Chem Rev* 2011, **111**(12):7710–7748. doi:10.1021/cr100353t
2. Lanzavecchia S, Bellon PL, Radermacher M: **Fast and accurate three-dimensional reconstruction from projections with random orientations via Radon transforms.** *J Struct Biol* 1999, **128**(2):152–164. doi:10.1006/jsbi.1999.4185
3. Nicholson WV, Glaeser RM: **Review: automatic particle detection in electron microscopy.** *J Struct Biol* 2001, **133**(2–3):90–101.
4. Zhu Y, Carragher B, Glaeser RM, Fellmann D, Bajaj C, Bern M, Mouche F, de Haas F, Hall RJ, Kriegman DJ, Ludtke SJ, Mallick SP, Penczek PA, Roseman AM, Sigworth FJ, Volkmann N, Potter CS: **Automatic particle selection: results of a comparative study.** *J Struct Biol* 2004, **145**(1–2):3–14.
5. Voss NR, Yoshioka CK, Radermacher M, Potter CS, Carragher B: **DoG Picker and TiltPicker: software tools to facilitate particle selection in single particle electron microscopy.** *J Struct Biol* 2009, **166**(2):205–213. doi:10.1016/j.jsb.2009.01.004
6. Langlois R, Pallesen J, Frank J: **Reference-free particle selection enhanced with semi-supervised machine learning for cryo-electron microscopy.** *J Struct Biol* 2011, **175**(3):353–361. doi:10.1016/j.jsb.2011.06.004
7. Roseman AM: **Particle finding in electron micrographs using a fast local correlation algorithm.** *Ultramicroscopy* 2003, **94**(3–4):225–236.
8. Roseman AM: **FindEM – A fast, efficient program for automatic selection of particles from electron micrographs.** *J Struct Biol* 2004, **145**(1–2):91–99.
9. Rath BK, Frank J: **Fast automatic particle picking from cryo-electron micrographs using a locally normalized cross-correlation function: a case study.** *J Struct Biol* 2004, **145**(1–2):84–90.
10. Chen JZ, Grigorieff N: **SIGNATURE: A single-particle selection system for molecular electron microscopy.** *J Struct Biol* 2007, **157**:168–173. doi:10.1016/j.jsb.2006.06.001
11. Mallick SP, Zhu Y, Kriegman D: **Detecting particles in cryo-EM micrographs using learned features.** *J Struct Biol* 2004, **145**(1–2):52–62.
12. Sorzano C, Recarte E, Alcorlo M, Bilbao-Castro J, San-Martín C, Marabini R, Carazo J: **Automatic particle selection from electron micrographs using machine learning techniques.** *J Struct Biol* 2009, **167**(3):252–260. doi:10.1016/j.jsb.2009.06.011
13. Ogura T, Sato C: **Automatic particle pickup method using a neural network has high accuracy by applying an initial weight derived from eigenimages: a new reference free method for single-particle analysis.** *J Struct Biol* 2004, **145**(1–2):63–75.
14. Arbeláez P, Han BG, Typke D, Lim J, Glaeser RM, Malik J: **Experimental evaluation of support vector machine-based and correlation-based approaches to automatic particle selection.** *J Struct Biol* 2011, **175**(3):319–328. doi:10.1016/j.jsb.2011.05.017
15. Zhu Y, Carragher B, Mouche F, Potter CS: **Automatic particle detection through efficient Hough transforms.** *IEEE Trans Med Imaging* 2003, **22**(9):1053–1062. doi:10.1109/TMI.2003.816947
16. Adiga U, Baxter WT, Hall RJ, Rockel B, Rath BK, Frank J, Glaeser R: **Particle picking by segmentation: A comparative study with SPIDER-based manual particle picking.** *J Struct Biol* 2005, **152**(3):211–220. doi:10.1016/j.jsb.2005.09.007
17. Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn AE, Purcell TJ: **A survey of general-purpose computation on graphics hardware.** *Comput Graph Forum* 2007, **26**:80–113. doi:10.1111/j.1467-8659.2007.01012.x
18. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG, Schulten K: **Accelerating molecular modeling applications with graphics processors.** *J Comput Chem* 2007, **28**:2618–2640. doi:10.1002/jcc.20829
19. Ufimtsev IS, Martínez TJ: **Quantum chemistry on graphical processor units. 1. Strategies for two-electron integral evaluation.** *J Chem Theory Comput* 2008, **4**:222–231. doi:10.1021/ct700268q
20. Manavski SA, Valle G: **CUDA-compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment.** *BMC Bioinformatics* 2008, **9**(2):S10.
21. Ritchie DW, Venkatraman V: **Ultra-fast FFT protein docking on graphics processors.** *Bioinformatics* 2010, **26**(19):2398–2405. doi:10.1093/bioinformatics/btq444
22. Li X, Grigorieff N, Cheng Y: **GPU-enabled FREALIGN: Accelerating single particle 3D reconstruction and refinement in Fourier space on graphics processors.** *J Struct Biol* 2010, **172**(3):407–412. doi:10.1016/j.jsb.2010.06.010
23. Zheng SQ, Branlund E, Kesthelyi B, Braunfeld MB, Cheng Y, Sedat JW, Agard DA: **A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction.** *Ultramicroscopy* 2011, **111**(8):1137–1143. doi:10.1016/j.ultramic.2011.03.015
24. Castaño-Díez D, Kudryashev M, Arheit M, Stahlberg H: **Dynamo: A flexible, user-friendly development tool for subtomogram averaging of cryo-EM data in high-performance computing environments.** *J Struct Biol* 2012, **178**(2):139–151. doi:10.1016/j.jsb.2011.12.017
25. Intel Corporation: **The Intel Math Kernel Library.** 2012.
26. Nvidia Corporation: **The Nvidia CUDA Fast Fourier Transform library (CUFFT).** 2012.
27. Frigo M, Johnson S: **The design and implementation of FFTW3.** *Proc IEEE* 2005, **93**(2):216–231. [http://www.fftw.org/]
28. Cooley JW, Tukey JW: **An algorithm for the machine calculation of complex Fourier series.** *Math Comput* 1965, **19**:297–301. doi:10.1090/S0025-5718-1965-0178586-1
29. Rader CM: **Discrete Fourier transforms when the number of data samples is prime.** *Proc IEEE* 1968, **56**(6):1107–1108.
30. Bluestein LI: **A linear filtering approach to the computation of discrete Fourier transform.** *IEEE Trans Audio Electroacoustics* 1970, **18**(4):451–455. doi:10.1109/TAU.1970.1162132
31. Nvidia Corporation: **The CUDA C Programming Guide.** 2012.
32. Eckel B: *Thinking in C++: Practical Programming, 2nd edition*. Upper Saddle River, New Jersey 07458, USA: Prentice Hall; 2000.
33. Garg RP, Sharapov IA: *Techniques for Optimizing Applications: High Performance Computing*. Upper Saddle River, New Jersey 07458, USA: Prentice Hall; 2001.
34. Schäling B: *The Boost C++ Libraries*. Suite O-175 Laguna Hills, CA 92637, USA: XML Press; 2011. [http://www.boost.org/]
35. Gropp W, Lusk E, Skjellum A: *Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd edition*. Cambridge, MA 02142–1493, USA: The MIT Press; 1999.
36. Chapman B, Jost G, van der Pas R: *Using OpenMP: Portable Shared Memory Parallel Programming*. Cambridge, MA 02142–1493, USA: The MIT Press; 2007.
37. Orlova EV, Dube P, Harris J, Beckman E, Zemlin F, Markl J, van Heel M: **Structure of keyhole limpet hemocyanin type 1 (KLH1) at 15 Å resolution by electron cryomicroscopy and angular reconstitution.** *J Mol Biol* 1997, **271**(3):417–437. doi:10.1006/jmbi.1997.1182
38. Tang G, Peng L, Baldwin PR, Mann DS, Jiang W, Rees I, Ludtke SJ: **EMAN2: An extensible image processing suite for electron microscopy.** *J Struct Biol* 2007, **157**:38–46. doi:10.1016/j.jsb.2006.05.009

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.